Converting MOSS-TTS Weights to GGUF

English | 简体中文

This guide walks through converting the original MOSS-TTS (HuggingFace) weights into the GGUF format used by the llama.cpp inference backend. If you just want to use the pre-converted weights, skip this guide and download them directly:

huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF

Overview

The conversion pipeline has three steps:

  1. Extract weights: split the MOSS-TTS model into a standalone Qwen3 backbone (safetensors), embedding tables (.npy), and LM head matrices (.npy).
  2. Convert to GGUF: convert the Qwen3 backbone safetensors to a full-precision (f16) GGUF file using llama.cpp's convert_hf_to_gguf.py.
  3. Quantize: quantize the f16 GGUF to a smaller format (e.g. Q4_K_M) using llama-quantize.

OpenMOSS-Team/MOSS-TTS (HuggingFace)
  │
  ▼  Step 1: extract_weights.py
  ├── qwen3_backbone/     (safetensors + config.json)
  ├── embeddings/          (33 × .npy)
  └── lm_heads/            (33 × .npy)
        │
        ▼  Step 2: convert_hf_to_gguf.py
        backbone_f16.gguf
        │
        ▼  Step 3: llama-quantize
        backbone_q4km.gguf

Prerequisites

  • Python >= 3.10
  • safetensors, numpy, torch, huggingface_hub (pip install safetensors numpy torch huggingface_hub)
  • A compiled llama.cpp tree (for convert_hf_to_gguf.py and llama-quantize)

Building llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..

After building, you will have:

  • llama.cpp/convert_hf_to_gguf.py — HF-to-GGUF conversion script
  • llama.cpp/build/bin/llama-quantize — quantization tool

Step 1: Extract Weights

This step splits the full MOSS-TTS model into three component groups: the Qwen3 backbone, the embedding tables, and the LM heads. If --model is a HuggingFace repo ID rather than a local path, the script downloads the model automatically.

python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model OpenMOSS-Team/MOSS-TTS \
    --output weights/extracted

To use a local model directory instead of downloading:

python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model /path/to/MOSS-TTS \
    --output weights/extracted

Output structure

weights/extracted/
├── qwen3_backbone/
│   ├── config.json                          # Qwen3ForCausalLM config
│   ├── model-00001-of-00004.safetensors     # backbone shards
│   ├── model-00002-of-00004.safetensors
│   ├── model-00003-of-00004.safetensors
│   ├── model-00004-of-00004.safetensors
│   ├── model.safetensors.index.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── ...
├── embeddings/
│   ├── embed_tokens.npy      # shared text embedding table
│   ├── emb_ext_00.npy        # audio embedding codebook 0
│   ├── emb_ext_01.npy
│   └── ...                   # (32 audio codebooks total)
├── lm_heads/
│   ├── lm_head_text.npy      # text LM head
│   ├── lm_head_audio_00.npy  # audio LM head 0
│   ├── lm_head_audio_01.npy
│   └── ...                   # (32 audio heads total)
└── extraction_meta.json       # metadata (vocab sizes, paths, etc.)
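
Before moving on to Step 2, it can help to confirm that the extraction produced every file in the tree above. The following is a minimal sketch, not part of the MOSS-TTS tooling: the helper names `expected_files` and `missing_files` are hypothetical, and it assumes the audio indices are zero-padded to two digits and run from 00 to 31, which the tree implies but does not spell out.

```python
from pathlib import Path

N_AUDIO = 32  # audio codebooks / audio heads, per the output tree


def expected_files(n_audio: int = N_AUDIO) -> list[str]:
    """Relative paths the extraction step is expected to produce."""
    files = [
        "qwen3_backbone/config.json",
        "embeddings/embed_tokens.npy",
        "lm_heads/lm_head_text.npy",
        "extraction_meta.json",
    ]
    # Assumption: indices are two-digit, zero-padded, starting at 00.
    for i in range(n_audio):
        files.append(f"embeddings/emb_ext_{i:02d}.npy")
        files.append(f"lm_heads/lm_head_audio_{i:02d}.npy")
    return files


def missing_files(root: str) -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected_files() if not (base / rel).is_file()]
```

Calling `missing_files("weights/extracted")` returns an empty list when everything is in place. The backbone shard files are deliberately not checked, since their count depends on how the model was saved.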

Step 2: Convert Backbone to GGUF

Use llama.cpp's conversion script to turn the extracted Qwen3 backbone into a GGUF file:

python llama.cpp/convert_hf_to_gguf.py \
    weights/extracted/qwen3_backbone \
    --outfile weights/backbone_f16.gguf \
    --outtype f16

This produces a ~16 GB f16 GGUF file.
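
A quick way to sanity-check the converted file without loading it is to read its header. The sketch below assumes the GGUF version 2 or later layout (a 4-byte `GGUF` magic, a little-endian uint32 version, then uint64 tensor and metadata key-value counts); `read_gguf_header` is a hypothetical helper, not a llama.cpp API.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: GGUF v2+ layout, little-endian, 64-bit counts.
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

For example, `read_gguf_header("weights/backbone_f16.gguf")` should report a non-zero tensor count; a `ValueError` suggests the conversion was interrupted or the path points at the wrong file.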

Step 3: Quantize

Quantize the f16 GGUF to a smaller format. Q4_K_M is a good balance of quality and size:

llama.cpp/build/bin/llama-quantize \
    weights/backbone_f16.gguf \
    weights/backbone_q4km.gguf \
    Q4_K_M

This reduces the file from ~16 GB to ~4.8 GB.

Other quantization options

Type      Approx. size   BPW    Notes
Q4_K_M    ~4.8 GB        4.91   Recommended default
Q5_K_M    ~5.7 GB        5.69   Slightly better quality
Q6_K      ~6.6 GB        6.56   Near-lossless for most uses
Q8_0      ~8.7 GB        8.50   Highest quality quantization
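
The sizes above follow approximately from parameter count × bits per weight (BPW) ÷ 8. As a rough sketch, the snippet below assumes about 8e9 parameters, a figure inferred from the ~16 GB f16 file at 16 bits per weight rather than taken from the model card. Expect the estimates to differ from the table by a few percent, since the true parameter count is not exactly 8e9 and GGUF files also carry metadata.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~8e9 parameters, inferred from ~16 GB at 16 bits per weight.
N_PARAMS = 8e9

for name, bpw in [("Q4_K_M", 4.91), ("Q5_K_M", 5.69), ("Q6_K", 6.56), ("Q8_0", 8.50)]:
    print(f"{name}: ~{estimate_size_gb(N_PARAMS, bpw):.1f} GB")
```

If a quantized file lands far outside this kind of estimate, that usually points to a wrong input file or type argument.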

All-in-One Example

# 0. Build llama.cpp (one-time)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j && cd ..

# 1. Extract weights
python moss_tts_delay/llama_cpp/conversion/extract_weights.py \
    --model OpenMOSS-Team/MOSS-TTS \
    --output weights/extracted

# 2. Convert to f16 GGUF
python llama.cpp/convert_hf_to_gguf.py \
    weights/extracted/qwen3_backbone \
    --outfile weights/backbone_f16.gguf \
    --outtype f16

# 3. Quantize to Q4_K_M
llama.cpp/build/bin/llama-quantize \
    weights/backbone_f16.gguf \
    weights/backbone_q4km.gguf \
    Q4_K_M

# Done! Use the quantized backbone + embeddings + lm_heads for inference.
# See the llama.cpp backend README for usage instructions.
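
For repeated runs, steps 1 to 3 can also be driven from Python. This is a minimal sketch under the same paths as the commands above; `build_commands` and `run_pipeline` are hypothetical helpers, and the output file name is derived from the quantization type only to mirror the `backbone_q4km.gguf` naming used in this guide.

```python
import subprocess
from pathlib import Path


def build_commands(model: str, out_dir: str = "weights",
                   llama_cpp: str = "llama.cpp",
                   quant: str = "Q4_K_M") -> list[list[str]]:
    """Return the argument lists for steps 1-3, in order."""
    out = Path(out_dir)
    extracted = out / "extracted"
    f16 = out / "backbone_f16.gguf"
    # e.g. "Q4_K_M" -> "backbone_q4km.gguf"
    quantized = out / f"backbone_{quant.lower().replace('_', '')}.gguf"
    return [
        ["python", "moss_tts_delay/llama_cpp/conversion/extract_weights.py",
         "--model", model, "--output", extracted.as_posix()],
        ["python", f"{llama_cpp}/convert_hf_to_gguf.py",
         (extracted / "qwen3_backbone").as_posix(),
         "--outfile", f16.as_posix(), "--outtype", "f16"],
        [f"{llama_cpp}/build/bin/llama-quantize",
         f16.as_posix(), quantized.as_posix(), quant],
    ]


def run_pipeline(model: str, **kwargs) -> None:
    """Run each step, stopping at the first non-zero exit code."""
    for cmd in build_commands(model, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("OpenMOSS-Team/MOSS-TTS")` from the repository root is intended to be equivalent to steps 1 to 3 above; it assumes llama.cpp has already been built.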

Using the Converted Weights

After conversion, arrange the weights for the llama.cpp backend:

weights/
├── backbone_q4km.gguf          # from Step 3
├── embeddings/                  # from Step 1 (weights/extracted/embeddings/)
│   ├── embed_tokens.npy
│   └── emb_ext_*.npy
├── lm_heads/                    # from Step 1 (weights/extracted/lm_heads/)
│   ├── lm_head_text.npy
│   └── lm_head_audio_*.npy
└── tokenizer/                   # from Step 1 (weights/extracted/qwen3_backbone/)
    ├── tokenizer.json
    └── tokenizer_config.json
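
The layout above can be assembled by hand or with a few copy calls. The following sketch shows one way to do it with the standard library; `arrange_weights` is a hypothetical helper, and it copies rather than moves so the extracted directory stays intact.

```python
import shutil
from pathlib import Path


def arrange_weights(extracted: str, quantized_gguf: str, dest: str) -> None:
    """Copy the Step 1 and Step 3 outputs into the layout shown above."""
    src, out = Path(extracted), Path(dest)
    out.mkdir(parents=True, exist_ok=True)

    # Quantized backbone: skip the copy if it already sits in `dest`.
    gguf = Path(quantized_gguf)
    target = out / gguf.name
    if gguf.resolve() != target.resolve():
        shutil.copy2(gguf, target)

    # Embedding tables and LM heads are copied as whole directories.
    for sub in ("embeddings", "lm_heads"):
        shutil.copytree(src / sub, out / sub, dirs_exist_ok=True)

    # Tokenizer files live next to the backbone after extraction.
    tok = out / "tokenizer"
    tok.mkdir(exist_ok=True)
    for name in ("tokenizer.json", "tokenizer_config.json"):
        shutil.copy2(src / "qwen3_backbone" / name, tok / name)
```

With the paths used in this guide, the call would be `arrange_weights("weights/extracted", "weights/backbone_q4km.gguf", "weights")`.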

Then update your config YAML (e.g. configs/llama_cpp/default.yaml) to point to these paths and run inference:

python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

Troubleshooting

  • convert_hf_to_gguf.py fails with "unknown model architecture": Make sure you are converting the qwen3_backbone/ directory (not the original MOSS-TTS directory). The config.json must declare "architectures": ["Qwen3ForCausalLM"].
  • Out of memory during extraction: The extraction script uses lazy loading, so peak memory should be roughly one safetensors shard (~5 GB). If memory is still tight, close other applications.
  • Quantization produces unexpected size: Verify you are quantizing the f16 GGUF (not an already-quantized file). Double-check the quantization type argument.
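
For the "unknown model architecture" case, the config check described in the first item can be scripted. This is a small sketch using only the standard library; `declares_qwen3` is a hypothetical helper name.

```python
import json


def declares_qwen3(config_path: str) -> bool:
    """True if config.json lists Qwen3ForCausalLM under "architectures"."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return "Qwen3ForCausalLM" in config.get("architectures", [])
```

Running `declares_qwen3("weights/extracted/qwen3_backbone/config.json")` should return `True`; a `False` result suggests the path points at the original MOSS-TTS directory rather than the extracted backbone.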