# Architecture: Global Transformer + Delay-Pattern (MossTTSDelay)

This document details the **MossTTSDelay** architecture, the production-grade variant of the MOSS-TTS family. It employs a single Transformer backbone with Multi-Head Parallel Prediction and Delay-Pattern Scheduling to achieve fast, stable, long-form speech synthesis. The overall architecture is shown in the figure.


## 1. Overview: Parallel Heads + Delay Pattern

Unlike the MossTTSLocal architecture, which uses a hierarchical "Temporal + Depth" approach, MossTTSDelay integrates all modeling into a single large-scale Transformer. It achieves efficient multi-codebook modeling by shifting the RVQ layers along the time axis, allowing the model to predict all codebook layers for a given step simultaneously through multiple linear heads.

### Key Components

- **Unified Transformer Backbone:** A large-scale language model (based on the Qwen-8B scale) that handles text encoding, prosody modeling, and audio token prediction in a single forward pass.
- **Multi-Head Output Layer:** The backbone is equipped with $1 + N_q$ (where $N_q = 32$) prediction heads. One head handles the primary sequence, while the other 32 heads predict the RVQ codebook layers in parallel.
- **Delay-Pattern Scheduling:** A specialized data formatting technique that introduces a 1-step offset between successive RVQ layers, enabling causal dependency modeling across codebook depths without requiring an additional "Depth Transformer" (see the sketch after this list).
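
A minimal sketch of that delay-pattern layout, assuming RVQ codes arrive as a `(num_layers, num_frames)` integer array; the pad value and exact shift convention here are illustrative assumptions, not the model's actual token ids:

```python
import numpy as np

PAD = -1  # illustrative pad id; the real model uses a dedicated pad token

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift RVQ layer k right by k steps, so layer k of frame t
    lands at sequence step t + k (the "diagonal" layout)."""
    num_layers, num_frames = codes.shape
    out = np.full((num_layers, num_frames + num_layers - 1), PAD, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + num_frames] = codes[k]
    return out

# Toy example: 4 layers x 5 frames. After the shift, column t holds
# layer 1 of frame t, layer 2 of frame t-1, layer 3 of frame t-2, ...
toy = np.arange(20).reshape(4, 5)
print(apply_delay_pattern(toy))
```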

## 2. Technical Specifications

| Feature | Specification |
| --- | --- |
| Backbone Model | Initialized from a Qwen-8B-scale model |
| Prediction Heads | 33 LM heads (1 main + 32 RVQ heads) |
| Audio Tokenizer | Cat (Causal Audio Tokenizer) |
| Sampling Rate | 24,000 Hz |
| Frame Rate | 12.5 Hz (1 s ≈ 12.5 frames) |
| Codebooks | 32 RVQ layers (10-bit each) |
| Generation Mode | Parallel autoregressive (delay-pattern) |
| Primary Advantage | Inference speed & long-context stability |
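
A quick back-of-the-envelope check of the token budget these numbers imply (my arithmetic from the table above, not a figure stated in the source):

```python
frame_rate_hz = 12.5   # frames per second
num_layers = 32        # RVQ codebooks per frame
bits_per_code = 10     # 10-bit codebooks -> 2**10 = 1024 entries each

tokens_per_second = frame_rate_hz * num_layers           # 12.5 * 32 = 400 tokens/s
bitrate_kbps = tokens_per_second * bits_per_code / 1000  # 4.0 kbps

print(f"{tokens_per_second:.0f} RVQ tokens/s, {bitrate_kbps:.1f} kbps")
```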

## 3. Core Mechanism: Multi-Head Parallel Prediction

The defining characteristic of MossTTSDelay is its computational efficiency. By attaching 32 independent linear heads to the final hidden state of the Transformer backbone, the model can generate an entire frame's worth of multi-layer RVQ tokens in a single forward step.

**Why this is faster than MossTTSLocal:**

- **No Nested Loops:** While the Local architecture requires a secondary "Local Transformer" to iterate through each RVQ layer within one time step, MossTTSDelay computes all layers in parallel.
- **Direct Projection:** The relationship between codebook layers is captured by the backbone's internal representations and the delay pattern, removing the latency overhead of a dedicated depth-modeling module (a sketch of this head layout follows below).
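
A minimal PyTorch sketch of the 1 + 32 head layout; the hidden size, text vocabulary size, and module names are illustrative assumptions, not the released model's actual dimensions:

```python
import torch
import torch.nn as nn

class MultiHeadOutput(nn.Module):
    """Illustrative 1 + N_q parallel prediction heads over a shared hidden state."""

    def __init__(self, hidden_size: int = 4096, text_vocab_size: int = 152_064,
                 codebook_size: int = 1024, num_rvq_layers: int = 32):
        super().__init__()
        self.main_head = nn.Linear(hidden_size, text_vocab_size)  # primary sequence
        self.rvq_heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size) for _ in range(num_rvq_layers)
        )

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, hidden) -- the backbone's final hidden state.
        # All 33 projections read the same hidden state; there is no inner
        # autoregressive loop over RVQ depth within a time step.
        main_logits = self.main_head(h)
        rvq_logits = torch.stack([head(h) for head in self.rvq_heads], dim=2)
        return main_logits, rvq_logits  # (B, T, V_text) and (B, T, 32, 1024)
```

One backbone forward thus yields logits for all 32 codebook layers at once, which is exactly why there is no per-layer latency term at inference time.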

## 4. Prediction Topology: Delay-Pattern

To maintain the hierarchical dependency of RVQ (where Layer $k$ should ideally "see" the information from Layer $k-1$), MossTTSDelay uses Delay-Pattern Scheduling.

**The Pattern:** At each training or inference step $t$, the sequence is structured such that:

- Head 1 predicts Layer 1 of Frame $t$.
- Head 2 predicts Layer 2 of Frame $t-1$.
- Head 3 predicts Layer 3 of Frame $t-2$.
- ... and so on.

**Dependency Modeling:** Because the Transformer is causal, when the model predicts tokens for step $t$, it has already seen the tokens from step $t-1$ in its context. Due to the 1-step shift, the information for Layer $k-1$ (at step $t$) is already present in the history when the model predicts Layer $k$ (at step $t+1$). This "diagonal" dependency effectively models the coarse-to-fine structure of the audio tokenizer.
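
To visualize the diagonal, a tiny script (using $N_q = 4$ instead of 32 for readability) that prints which (layer, frame) pair each head is responsible for at each step:

```python
NUM_LAYERS = 4  # small N_q for readability; the real model uses 32

# At sequence step t, head k (1-indexed) emits layer k of frame t - (k - 1).
for t in range(5):
    cells = []
    for k in range(1, NUM_LAYERS + 1):
        frame = t - (k - 1)
        cells.append(f"L{k}/f{frame}" if frame >= 0 else f"L{k}/pad")
    print(f"step {t}:  " + "  ".join(cells))

# step 0:  L1/f0  L2/pad  L3/pad  L4/pad
# step 1:  L1/f1  L2/f0  L3/pad  L4/pad
# step 2:  L1/f2  L2/f1  L3/f0  L4/pad
# ...
```

Reading down any column shows that layer $k$ of a given frame is always emitted one step after layer $k-1$ of the same frame, so the causal context always contains the coarser layers.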


## 5. Evaluation & Performance

According to `moss_tts_model_card.md`, MossTTSDelay-8B is the recommended model for production use and long-form stability:

| Metric | Result (Seed-TTS-Eval) |
| --- | --- |
| EN SIM (Speaker Similarity) | 0.7146 |
| ZH SIM (Speaker Similarity) | 0.7705 |
| EN WER (Word Error Rate) | 1.79% |
| ZH CER (Character Error Rate) | 1.32% |

**Conclusion:** MossTTSDelay offers superior long-context stability and faster inference than the Local variant. Its 8B-parameter scale provides the capacity needed for complex prosody and ultra-long (up to 1 hour) speech generation.


## 6. Architecture Comparison

| Aspect | MossTTSDelay (Architecture A) | MossTTSLocal (Architecture B) |
| --- | --- | --- |
| Structure | Single Transformer (8B) | Temporal + Depth Transformers (1.7B) |
| Scheduling | Delay-pattern (diagonal shift) | Per-step synchronous blocks |
| Prediction Heads | 33 parallel heads | Single latent head + local module |
| Inference Speed | High (parallel RVQ prediction) | Moderate (sequential RVQ prediction) |
| Stability | Excellent for long-form (1 h+) | Optimized for short-segment metrics |
| Best For | Production, scalable apps, narration | Research, quality benchmarks |