# Architecture: Global Transformer + Delay-Pattern (MossTTSDelay)

This document details the **MossTTSDelay** architecture, the production-grade variant of the MOSS-TTS family. It employs a single Transformer backbone with Multi-Head Parallel Prediction and Delay-Pattern Scheduling to achieve fast, stable, long-form speech synthesis. The overall architecture is shown in the figure.


## 1. Overview: Parallel Heads + Delay Pattern

Unlike the MossTTSLocal architecture, which uses a hierarchical "Temporal + Depth" approach, MossTTSDelay integrates all modeling into a single large-scale Transformer. It achieves efficient multi-codebook modeling by shifting the RVQ layers along the time axis, allowing the model to predict all codebook layers for a given step simultaneously through multiple linear heads.

### Key Components

- **Unified Transformer Backbone:** A large-scale language model (based on the Qwen-8B scale) that handles text encoding, prosody modeling, and audio token prediction in a single forward pass.
- **Multi-Head Output Layer:** The backbone is equipped with $1 + N_q$ (where $N_q = 32$) prediction heads. One head handles the primary sequence, while the other 32 heads predict the RVQ codebook layers in parallel.
- **Delay-Pattern Scheduling:** A specialized data formatting technique that introduces a 1-step offset between successive RVQ layers, enabling causal dependency modeling across codebook depths without requiring an additional "Depth Transformer" (see the sketch after this list).
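
A minimal sketch of that delay-pattern layout, assuming RVQ codes arrive as a `(num_layers, num_frames)` integer array; the pad value and exact shift convention here are illustrative assumptions, not the model's actual token ids:

```python
import numpy as np

PAD = -1  # illustrative pad id; the real model uses a dedicated pad token

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift RVQ layer k right by k steps, so layer k of frame t
    lands at sequence step t + k (the "diagonal" layout)."""
    num_layers, num_frames = codes.shape
    out = np.full((num_layers, num_frames + num_layers - 1), PAD, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + num_frames] = codes[k]
    return out

# Toy example: 4 layers x 5 frames. After the shift, column t holds
# layer 1 of frame t, layer 2 of frame t-1, layer 3 of frame t-2, ...
toy = np.arange(20).reshape(4, 5)
print(apply_delay_pattern(toy))
```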

## 2. Technical Specifications

| Feature | Specification |
| --- | --- |
| Backbone Model | Initialized from a Qwen-8B-scale model |
| Prediction Heads | 33 LM heads (1 main + 32 RVQ heads) |
| Audio Tokenizer | Cat (Causal Audio Tokenizer) |
| Sampling Rate | 24,000 Hz |
| Frame Rate | 12.5 Hz (1 s ≈ 12.5 frames) |
| Codebooks | 32 RVQ layers (10-bit each) |
| Generation Mode | Parallel autoregressive (delay-pattern) |
| Primary Advantage | Inference speed & long-context stability |
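
A quick back-of-the-envelope check of the token budget these numbers imply (my arithmetic from the table above, not a figure stated in the source):

```python
frame_rate_hz = 12.5   # frames per second
num_layers = 32        # RVQ codebooks per frame
bits_per_code = 10     # 10-bit codebooks -> 2**10 = 1024 entries each

tokens_per_second = frame_rate_hz * num_layers           # 12.5 * 32 = 400 tokens/s
bitrate_kbps = tokens_per_second * bits_per_code / 1000  # 4.0 kbps

print(f"{tokens_per_second:.0f} RVQ tokens/s, {bitrate_kbps:.1f} kbps")
```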

## 3. Core Mechanism: Multi-Head Parallel Prediction

The defining characteristic of MossTTSDelay is its computational efficiency. By attaching 32 independent linear heads to the final hidden state of the Transformer backbone, the model can generate an entire frame's worth of multi-layer RVQ tokens in a single forward step.

**Why this is faster than MossTTSLocal:**

- **No Nested Loops:** While the Local architecture requires a secondary "Local Transformer" to iterate through each RVQ layer within one time step, MossTTSDelay computes all layers in parallel.
- **Direct Projection:** The relationship between codebook layers is captured by the backbone's internal representations and the delay pattern, removing the latency overhead of a dedicated depth-modeling module (a sketch of this head layout follows below).
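
A minimal PyTorch sketch of the 1 + 32 head layout; the hidden size, text vocabulary size, and module names are illustrative assumptions, not the released model's actual dimensions:

```python
import torch
import torch.nn as nn

class MultiHeadOutput(nn.Module):
    """Illustrative 1 + N_q parallel prediction heads over a shared hidden state."""

    def __init__(self, hidden_size: int = 4096, text_vocab_size: int = 152_064,
                 codebook_size: int = 1024, num_rvq_layers: int = 32):
        super().__init__()
        self.main_head = nn.Linear(hidden_size, text_vocab_size)  # primary sequence
        self.rvq_heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size) for _ in range(num_rvq_layers)
        )

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, hidden) -- the backbone's final hidden state.
        # All 33 projections read the same hidden state; there is no inner
        # autoregressive loop over RVQ depth within a time step.
        main_logits = self.main_head(h)
        rvq_logits = torch.stack([head(h) for head in self.rvq_heads], dim=2)
        return main_logits, rvq_logits  # (B, T, V_text) and (B, T, 32, 1024)
```

One backbone forward thus yields logits for all 32 codebook layers at once, which is exactly why there is no per-layer latency term at inference time.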

## 4. Prediction Topology: Delay-Pattern

To maintain the hierarchical dependency of RVQ (where Layer $k$ should ideally "see" the information from Layer $k-1$), MossTTSDelay uses Delay-Pattern Scheduling.

**The Pattern:** At each training or inference step $t$, the sequence is structured such that:

- Head 1 predicts Layer 1 of Frame $t$.
- Head 2 predicts Layer 2 of Frame $t-1$.
- Head 3 predicts Layer 3 of Frame $t-2$.
- ... and so on.

**Dependency Modeling:** Because the Transformer is causal, when the model predicts tokens for step $t$, it has already seen the tokens from step $t-1$ in its context. Due to the 1-step shift, the information for Layer $k-1$ (at step $t$) is already present in the history when the model predicts Layer $k$ (at step $t+1$). This "diagonal" dependency effectively models the coarse-to-fine structure of the audio tokenizer.
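
To visualize the diagonal, a tiny script (using $N_q = 4$ instead of 32 for readability) that prints which (layer, frame) pair each head is responsible for at each step:

```python
NUM_LAYERS = 4  # small N_q for readability; the real model uses 32

# At sequence step t, head k (1-indexed) emits layer k of frame t - (k - 1).
for t in range(5):
    cells = []
    for k in range(1, NUM_LAYERS + 1):
        frame = t - (k - 1)
        cells.append(f"L{k}/f{frame}" if frame >= 0 else f"L{k}/pad")
    print(f"step {t}:  " + "  ".join(cells))

# step 0:  L1/f0  L2/pad  L3/pad  L4/pad
# step 1:  L1/f1  L2/f0  L3/pad  L4/pad
# step 2:  L1/f2  L2/f1  L3/f0  L4/pad
# ...
```

Reading down any column shows that layer $k$ of a given frame is always emitted one step after layer $k-1$ of the same frame, so the causal context always contains the coarser layers.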


## 5. Evaluation & Performance

According to `moss_tts_model_card.md`, MossTTSDelay-8B is the recommended model for production use and long-form stability:

| Metric | Result (Seed-TTS-Eval) |
| --- | --- |
| EN SIM (Speaker Similarity) | 0.7146 |
| ZH SIM (Speaker Similarity) | 0.7705 |
| EN WER (Word Error Rate) | 1.79% |
| ZH CER (Character Error Rate) | 1.32% |

**Conclusion:** MossTTSDelay offers superior long-context stability and faster inference than the Local variant. Its 8B-parameter scale provides the capacity needed for complex prosody and ultra-long (up to 1 hour) speech generation.


## 6. Architecture Comparison

| Aspect | MossTTSDelay (Architecture A) | MossTTSLocal (Architecture B) |
| --- | --- | --- |
| Structure | Single Transformer (8B) | Temporal + Depth Transformers (1.7B) |
| Scheduling | Delay-pattern (diagonal shift) | Per-step synchronous blocks |
| Prediction Heads | 33 parallel heads | Single latent head + local module |
| Inference Speed | High (parallel RVQ prediction) | Moderate (sequential RVQ prediction) |
| Stability | Excellent for long-form (1 h+) | Optimized for short-segment metrics |
| Best For | Production, scalable apps, narration | Research, quality benchmarks |