Description
When running the Chroma-4B model on Apple Silicon (MPS backend), we observe significant audio stuttering caused by high Inter-Token Latency (ITL) and inconsistent generation speeds. Even with greedy search and optimized pre-loaded prompts, the inference speed frequently drops to 0.2x - 0.3x real-time, making live conversation impossible.
💻 Environment
Hardware: MacBook Pro (Mac14,5) - Apple M2 Max
OS: macOS 15.2 (26.2)
Backend: MPS (Metal Performance Shaders)
Versions:
transformers: 5.0.0
torch: 2.10.0
torchaudio: 2.10.0
torchcodec: 0.10.0
☁️ Cloud Context (Modal)
We also attempted to run the model in a cloud environment via Modal to rule out platform-specific bottlenecks.
Hardware: NVIDIA A100 (40GB/80GB)
Memory: 32GB
Runtime: CUDA 12.6
Results: While the throughput was higher than MPS, we still observed inconsistent inter-token latency (ITL) that causes audible jitter in a real-time speech-to-speech loop.
🔴 The Issue
When streaming Mimi tokens (80ms audio frames) for real-time speech-to-speech interaction, the model exhibits significant inter-token latency (ITL) spikes. Even with greedy search, cached speaker prompts, and transformers==5.0.0, throughput on MPS peaks at ~0.8x real-time but fluctuates frequently, falling behind the required 80ms/frame cadence.
```
[Chroma Trace] JITTER DETECTED: Frame 2 delay=157.3ms (Tokens arriving late)
[Chroma Trace] JITTER DETECTED: Frame 3 delay=133.3ms (Tokens arriving late)
[Chroma Streamer] Frame 10: ITL= 88.8ms, Decode= 23.1ms, Speed=0.33x (SLOW)
[Chroma Streamer] Frame 40: ITL= 75.1ms, Decode= 18.4ms, Speed=0.63x (SLOW)
[Chroma Streamer] Frame 80: ITL= 79.7ms, Decode= 21.0ms, Speed=0.76x (SLOW)
[Chroma Streamer] Frame 100: ITL= 72.1ms, Decode= 17.9ms, Speed=0.80x (SLOW)
[Chroma Trace] JITTER DETECTED: Frame 113 delay=111.3ms (Tokens arriving late)
```
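For reference, the kind of instrumentation behind the trace above can be reproduced with a small frame-budget tracker. This is a hypothetical, pure-Python sketch (`FrameJitterTracker` and its API are illustrative, not part of the actual Chroma streamer); it flags a frame as jitter whenever its ITL exceeds the 80ms Mimi cadence:

```python
import time
from typing import Optional, Tuple

FRAME_BUDGET_MS = 80.0  # Mimi cadence: one audio frame every 80 ms


class FrameJitterTracker:
    """Tracks inter-token latency (ITL) against the 80 ms frame budget
    and flags frames that arrive late enough to cause audible stutter."""

    def __init__(self, budget_ms: float = FRAME_BUDGET_MS):
        self.budget_ms = budget_ms
        self.last_ts: Optional[float] = None
        self.itls = []  # per-frame ITL in milliseconds

    def on_frame(self, now: Optional[float] = None) -> Tuple[Optional[float], bool]:
        """Record a frame arrival; returns (itl_ms, is_jitter).
        `now` defaults to a high-resolution wall-clock timestamp."""
        now = time.perf_counter() if now is None else now
        if self.last_ts is None:  # first frame has no ITL yet
            self.last_ts = now
            return None, False
        itl_ms = (now - self.last_ts) * 1000.0
        self.last_ts = now
        self.itls.append(itl_ms)
        # Jitter = generation fell behind the playback cadence for this frame.
        return itl_ms, itl_ms > self.budget_ms

    def realtime_factor(self) -> float:
        """Audio produced vs. wall-clock time spent producing it."""
        total = sum(self.itls)
        return (len(self.itls) * self.budget_ms) / total if total else 0.0
```

Feeding the tracker the timestamps from the trace reproduces the same classification: a 157.3ms gap is flagged as jitter, a 72ms gap is not.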
🔍 Key Findings & Audit Results
Inference Latency Spikes: High-precision streamers show that token generation time is not consistent, with spikes reaching 150ms+ between Mimi frames.
MPS Synchronization: There appears to be significant overhead when synchronizing between the backbone and the audio decoder layers on the Metal backend.
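One plausible source of this overhead (an assumption on our part, not confirmed by profiling) is a host-side read on every generated token, e.g. calling `.item()` to check for EOS or appending `.cpu()` tensors each step; on MPS each such read can force a Metal command-buffer flush. A minimal sketch of amortizing those reads by transferring tokens once per chunk instead of once per frame (`decode_chunked` and `fake_step` are hypothetical stand-ins for the real per-token forward pass):

```python
import torch

# Falls back to CPU so the sketch runs anywhere; the sync cost being
# amortized is specific to the MPS backend.
device = "mps" if torch.backends.mps.is_available() else "cpu"


def decode_chunked(step_fn, n_frames: int, chunk: int = 8):
    """Call `step_fn` once per frame (returns a 0-d tensor on `device`)
    and read results back to the host once per `chunk` frames instead of
    once per frame, reducing device->host synchronization points."""
    out, pending = [], []
    for i in range(n_frames):
        pending.append(step_fn(i))        # stays on device: no sync here
        if len(pending) == chunk or i == n_frames - 1:
            batch = torch.stack(pending)  # single device-side op
            out.extend(batch.tolist())    # one host sync per chunk
            pending = []
    return out


def fake_step(i):
    """Toy stand-in for one backbone forward pass producing a token id."""
    return torch.tensor(i, device=device)
```

The trade-off is buffering latency: with `chunk=8`, tokens reach the host up to 7 frames (~560ms at the 80ms cadence) later, so for a live loop the chunk size has to stay small or adaptive.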
❓ Requested Guidance
Are there specific MPS-optimized kernels recommended for the interleaved text-audio attention mechanism?
Is there a way to reduce synchronization points in the `generate()` loop for transformers==5.0.0 to achieve a consistent <80ms ITL?