
Bug Report: Significant Audio Jitter and Low Throughput on Apple Silicon (MPS) #21

@psharrma

Description


When running the Chroma-4B model on Apple Silicon (MPS backend), we observe significant audio stuttering caused by high Inter-Token Latency (ITL) and inconsistent generation speeds. Even with greedy search and optimized pre-loaded prompts, the inference speed frequently drops to 0.2x - 0.3x real-time, making live conversation impossible.

💻 Environment

Hardware: MacBook Pro (Mac14,5) - Apple M2 Max
OS: macOS 15.2 (26.2)
Backend: MPS (Metal Performance Shaders)
Versions:
  transformers: 5.0.0
  torch: 2.10.0
  torchaudio: 2.10.0
  torchcodec: 0.10.0

☁️ Cloud Context (Modal)

We also attempted to run the model in a cloud environment via Modal to rule out platform-specific bottlenecks.

Hardware: NVIDIA A100 (40GB/80GB)
Memory: 32GB
Runtime: CUDA 12.6

Results: While throughput was higher than on MPS, we still observed inconsistent ITL that causes audible jitter in a real-time speech-to-speech loop.

🔴 The Issue

When streaming Mimi tokens (80ms audio frames) for real-time speech-to-speech interaction, the model exhibits significant ITL spikes. Even with optimizations such as greedy search, cached speaker prompts, and transformers==5.0.0, throughput on MPS peaks at ~0.8x real-time but fluctuates frequently, falling behind the required 80ms/frame cadence.

[Chroma Trace] JITTER DETECTED: Frame 2 delay=157.3ms (Tokens arriving late)
[Chroma Trace] JITTER DETECTED: Frame 3 delay=133.3ms (Tokens arriving late)
[Chroma Streamer] Frame   10: ITL= 88.8ms, Decode= 23.1ms, Speed=0.33x (SLOW)
[Chroma Streamer] Frame   40: ITL= 75.1ms, Decode= 18.4ms, Speed=0.63x (SLOW)
[Chroma Streamer] Frame   80: ITL= 79.7ms, Decode= 21.0ms, Speed=0.76x (SLOW)
[Chroma Streamer] Frame  100: ITL= 72.1ms, Decode= 17.9ms, Speed=0.80x (SLOW)
[Chroma Trace] JITTER DETECTED: Frame 113 delay=111.3ms (Tokens arriving late)
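For reference, the jitter detection above boils down to timestamping each emitted frame and flagging any inter-token gap that exceeds the 80ms frame budget. A minimal, self-contained sketch (the `JitterTracker` class and the simulated per-frame delays are illustrative, not part of Chroma):

```python
import time

FRAME_BUDGET_MS = 80.0  # one Mimi frame covers 80 ms of audio


class JitterTracker:
    """Flags frames whose inter-token latency (ITL) exceeds the real-time budget."""

    def __init__(self, budget_ms: float = FRAME_BUDGET_MS):
        self.budget_ms = budget_ms
        self.last_ts = None
        self.events = []  # (frame_index, itl_ms, late)

    def on_frame(self, frame_index: int) -> float:
        now = time.perf_counter()
        # First frame has no predecessor, so its ITL is 0 by convention.
        itl_ms = 0.0 if self.last_ts is None else (now - self.last_ts) * 1000.0
        self.last_ts = now
        late = itl_ms > self.budget_ms
        self.events.append((frame_index, itl_ms, late))
        if late:
            print(f"[Chroma Trace] JITTER DETECTED: Frame {frame_index} "
                  f"delay={itl_ms:.1f}ms (Tokens arriving late)")
        return itl_ms


# Simulated generation loop: frame 2 is deliberately slow to trigger the trace.
tracker = JitterTracker()
for i in range(4):
    time.sleep(0.12 if i == 2 else 0.01)  # stand-in for one model.generate() step
    tracker.on_frame(i)
```

Dropping a tracker like this around the real token callback is how we isolated the spikes to generation time rather than the audio output path.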

🔍 Key Findings & Audit Results

Inference Latency Spikes: High-precision streamers show that token generation time is not consistent, with spikes reaching 150ms+ between Mimi frames.
MPS Synchronization: There appears to be significant overhead when synchronizing between the backbone and the audio decoder layers on the Metal backend.
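A back-of-the-envelope model of the second finding: if every frame forces a host-device synchronization, the fixed sync cost is paid once per frame; syncing only every few frames amortizes it. The overhead numbers below are assumed for illustration (only `DECODE_MS_PER_FRAME` loosely matches the `Decode=` column in our logs), not measured on Metal:

```python
SYNC_OVERHEAD_MS = 5.0       # assumed fixed cost per host<->device sync (illustrative)
DECODE_MS_PER_FRAME = 18.0   # roughly the Decode= column observed above


def total_latency_ms(n_frames: int, frames_per_sync: int) -> float:
    """Total wall time if we synchronize once per `frames_per_sync` frames."""
    syncs = -(-n_frames // frames_per_sync)  # ceiling division
    return n_frames * DECODE_MS_PER_FRAME + syncs * SYNC_OVERHEAD_MS


# Syncing every frame vs. every 4 frames over 100 frames of audio:
per_frame = total_latency_ms(100, 1)
batched = total_latency_ms(100, 4)
print(f"sync every frame  : {per_frame:.0f} ms")
print(f"sync every 4 frames: {batched:.0f} ms")
```

The trade-off is that batching syncs adds buffering latency before the first frame of each batch, so the batch size would need to stay well under the jitter buffer of the playback side.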

❓ Requested Guidance

Are there specific MPS-optimized kernels recommended for the interleaved text-audio attention mechanism?
Is there a way to reduce synchronization points in the generate() loop for transformers==5.0.0 to achieve a consistent <80ms ITL?
