This repository is a C/C++ library + CLI that runs several neural audio codecs (currently WavTokenizer-Large, DAC, Mimi, Qwen3-TTS-Tokenizer) using ggml graphs so execution can be offloaded via ggml backends (CPU/CUDA/Vulkan/Metal/etc.).
The intended architecture is llama.cpp-style:
- Build model forward passes as ggml graphs (ops).
- Execute via ggml_backend + ggml_backend_sched so backends can offload.
- Avoid bespoke CPU-side tensor math buffers when possible.
- `include/codec.h` — public C API (model load/init, encode/decode, batch decode)
- `src/codec.cpp` — top-level dispatch + model loading + backend selection
- `src/models/` — per-architecture graph builders and glue
  - `wavtokenizer.cpp/.h`
  - `dac.cpp/.h`
  - `mimi.cpp/.h`
  - `qwen3_tts_tokenizer.cpp/.h`
- `src/runtime/` — graph cache + execution runtime
  - `graph.cpp/.h` — graph cache keyed by (kind, n_frames, n_q, hop, etc.)
  - `graph_exec.cpp` — ggml_backend scheduler init + graph compute
  - `tensor_utils.*`, `gguf_kv.*` — tensor helpers / metadata
- `src/ops/` — small wrappers around ggml ops + a few custom compositions
  - `ggml_ops.*` — layernorm/groupnorm/linear/unary/snake/pad/crop helpers
  - `conv1d.*`, `convtr1d.*` — conv wrappers (keep only if ggml lacks the needed variant)
- `src/batch/` — sequence-level batch container + decode loop (MVP)
- `examples/` — demo/inspection binaries (e.g. batch decode)
- `ggml/` — ggml submodule/subproject
CMake project with ggml as a subdirectory.
Typical CPU build:
```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Enable a GPU backend (example: CUDA):

```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -j
```

Backends are intended to be selected via ggml backend selection logic.
In `src/codec.cpp` the backend is selected roughly as:
- if `codec_model_params.use_gpu` is true: call `ggml_backend_load_all()` then `ggml_backend_init_best()`
- else: CPU backend
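A minimal sketch of that selection, assuming a recent ggml with dynamic backend loading (`pick_backend` is an illustrative name, not the repo's actual function):

```cpp
#include "ggml-backend.h"

// Sketch only: mirrors the use_gpu branch described above.
static ggml_backend_t pick_backend(bool use_gpu) {
    if (use_gpu) {
        ggml_backend_load_all();          // discover/load available backend libs
        return ggml_backend_init_best();  // prefer the best non-CPU device
    }
    return ggml_backend_init_by_type(GGML_BACKEND_DEVICE_TYPE_CPU, /*params=*/NULL);
}
```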
`src/runtime/graph_exec.cpp` uses:
- `ggml_backend_sched_new(...)`
- `ggml_backend_sched_graph_compute(...)`
This is the core mechanism enabling CPU/GPU split + offload.
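Roughly, the scheduler path looks like the following sketch; the exact `ggml_backend_sched_new` argument list varies across ggml revisions (newer ones add an op-offload flag), and the variable names here are illustrative:

```cpp
// CPU backend goes last so it serves as the fallback for unsupported ops.
ggml_backend_t backends[2] = { gpu_backend, cpu_backend };

ggml_backend_sched_t sched = ggml_backend_sched_new(
    backends, /*bufts=*/NULL, /*n_backends=*/2,
    /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);

// Splits the graph across backends, allocates, and computes.
ggml_backend_sched_graph_compute(sched, graph);
```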
Key rule: graphs should be constructed in a way that ggml can place tensors on supported backends; avoid pulling intermediate tensors out to CPU buffers.
Graphs are cached by a small key (see `codec_graph_cache_key` in internal headers), typically including:
- graph kind (encode/decode per model)
- `n_frames`, `n_q`, `hop`, input sizes, `latent_dim`, etc.
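For orientation, a hypothetical rendering of such a key (the real fields live in `codec_graph_cache_key`):

```cpp
// Hypothetical sketch; keys are compared field-wise on lookup,
// a hit reuses the built graph, a miss rebuilds it.
struct codec_graph_cache_key_sketch {
    int kind;        // encode vs decode, per model
    int n_frames;    // frames in this request
    int n_q;         // number of codebooks
    int hop;         // hop size
    int n_in;        // input size
    int latent_dim;  // latent width
};
```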
Flow:
- `codec_graph_cache_get_or_build(...)` builds the graph in an eval arena (`ggml_init` with `no_alloc=true`).
- `codec_graph_prepare_io(...)` allocates tensors for the graph in a backend buffer.
- `codec_graph_compute(...)` runs scheduler compute.
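The eval-arena step boils down to building tensor metadata only and leaving data allocation to the backend; a minimal sketch, with the arena size as an assumption:

```cpp
// no_alloc=true: the context holds tensor/graph metadata only;
// actual buffers are allocated later in backend memory.
struct ggml_init_params ip = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 8192 + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
struct ggml_context * eval_ctx = ggml_init(ip);
```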
Important constraints:
- When switching to a different graph allocation, a scheduler reset may be required to avoid dangling allocations (see comments in `codec_graph_prepare_io`).
Prefer using ggml ops directly when available.
`src/ops/ggml_ops.cpp` provides small helpers that are either:
- thin wrappers over ggml primitives (`ggml_norm`, `ggml_group_norm`, `ggml_mul_mat`, activations)
- composed ops built from primitives (e.g. DAC `snake` implemented as `x + sin(ax)^2 / a`)
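As an example of the composed style, DAC's snake can be written entirely from primitives; a sketch assuming `alpha` is a per-channel tensor shaped to broadcast against `x` (the helper name is illustrative):

```cpp
// snake(x) = x + sin(alpha * x)^2 / alpha, built from ggml ops only.
static struct ggml_tensor * op_snake(struct ggml_context * ctx,
                                     struct ggml_tensor * x,
                                     struct ggml_tensor * alpha) {
    struct ggml_tensor * ax = ggml_mul(ctx, x, alpha);   // alpha * x (broadcast)
    struct ggml_tensor * s  = ggml_sin(ctx, ax);         // sin(alpha * x)
    struct ggml_tensor * s2 = ggml_mul(ctx, s, s);       // sin(...)^2
    struct ggml_tensor * y  = ggml_div(ctx, s2, alpha);  // ... / alpha
    return ggml_add(ctx, x, y);
}
```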
If a needed op is missing in ggml:
- First try composing from existing ops.
- If composition is impossible or the op is performance-critical, add a custom op (CPU SIMD first) and keep a path to backend support later.
- `ggml_conv_transpose_1d` requires `p0 == 0` and `d0 == 1`; use crop for padding.
- `conv1d` weight layout is `[k, in, out]`.
- `conv_transpose_1d` weight layout is `[k, out, in]`.
- Prefer `ggml_cont` only when needed; many ops require contiguous tensors.
- Keep all math inside ggml; do not add CPU-only tensor math paths.
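For instance, the `p0 == 0` constraint means padding has to be emulated by cropping the output; a sketch (the helper name is illustrative, and the resulting view is non-contiguous, so `ggml_cont` may be needed downstream):

```cpp
// Run the transposed conv unpadded, then crop `pad` samples from each end
// of the time axis (dim 0) to match a PyTorch layer with padding=pad.
static struct ggml_tensor * op_convtr1d_padded(struct ggml_context * ctx,
                                               struct ggml_tensor * w,  // [k, out, in]
                                               struct ggml_tensor * x,
                                               int stride, int pad) {
    struct ggml_tensor * y = ggml_conv_transpose_1d(ctx, w, x,
                                                    stride, /*p0=*/0, /*d0=*/1);
    return ggml_view_2d(ctx, y,
                        y->ne[0] - 2*pad, y->ne[1],
                        y->nb[1], (size_t) pad * y->nb[0]);
}
```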
Models are loaded from .gguf.
Some tensors that could be generated at runtime should instead be baked into GGUF during conversion (for reproducibility + avoiding runtime FP32→FP16 conversions).
If conversion scripts are involved, regenerate the .gguf after converter changes (a stale .gguf is a common source of “missing tensor” errors).
If a model’s original PyTorch code is needed during conversion or as a reference:
- Put the upstream source under `.model-src/` (local-only).
- Converters should read from `.model-src/<repo>/...` rather than importing from the network.
- The runtime must not depend on the original Python source.
ggml/ is a submodule. Do not edit it directly unless explicitly asked to update the submodule.
If an op is missing:
- Prefer composition in `src/ops/`.
- If a true kernel is required, plan a submodule update (upstream or fork) rather than editing files in place.
- Avoid CPU-only fallbacks; keep the path compatible with ggml backends.
- Keep encode/decode numerics stable (unit/regression tests where possible).
- Avoid introducing new CPU-only intermediate buffers; build everything as ggml tensors.
- Never reshape/transpose weights at runtime. If weights need reshaping, do it in the GGUF converter or via gguf transpose ops during conversion.
- When touching graph execution / backend scheduler code: be careful with allocation lifetimes (`eval_ctx`, scheduler reset semantics).
- Prefer small, reviewable commits.
We track Python deps in two files:
- `requirements.txt` for conversion/build utilities.
- `requirements-e2e.txt` for end-to-end tests (HF refs + audio).
Keep them minimal and deterministic (pin versions when CI is sensitive).
- GGUF converter first
  - Add the converter in `scripts/converters/`.
  - Bake all needed weights/metadata into GGUF. No runtime weight transforms.
  - Confirm tensor layout: ggml expects `[k, in, out]` for conv1d and `[k, out, in]` for conv_transpose_1d (see `ggml_conv_transpose_1d` constraints).
- Runtime model struct
  - Add metadata fields in the `codec_*` struct and initialize them in `codec_*_init`.
  - Read GGUF keys with sane defaults but validate shapes early (see the sketch after this list).
- Graph build
  - Build encode/decode forward graphs using ggml ops only.
  - Cache the graph with a compact key (kind, n_frames, n_q, hop, n_in, latent_dim).
  - If the graph is large, ensure graph size + backend scheduler capacity are adequate (see `src/runtime/graph.cpp` and `src/runtime/graph_exec.cpp`).
- Weights and IO
  - Use `codec_*_copy_*` helpers to map GGUF tensors into graph tensors.
  - Avoid any CPU-only math or bespoke tensor loops unless absolutely necessary.
- E2E tests
  - Add/update the model entry in `tests/e2e/config.json` (sample rate, n_q, gguf path).
  - Ensure the HF reference runs with the same sample rate/hop.
  - Run `python tests/e2e/runner.py --models <name>`.
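For the "Runtime model struct" step, reading a key with a default and validating a conv weight's shape might look like the sketch below; the function names are illustrative, not the repo's actual helpers (and older ggml revisions declare the gguf API in `ggml.h` rather than `gguf.h`):

```cpp
#include "ggml.h"
#include "gguf.h"

// Read a u32 metadata key, falling back to a default when absent.
static uint32_t read_u32_or(const struct gguf_context * kv,
                            const char * key, uint32_t dflt) {
    const int64_t id = gguf_find_key(kv, key);
    return id < 0 ? dflt : gguf_get_val_u32(kv, id);
}

// Validate early: ggml conv1d weights are expected as [k, in, out].
static bool conv1d_weight_ok(const struct ggml_tensor * w,
                             int64_t k, int64_t n_in, int64_t n_out) {
    return w && w->ne[0] == k && w->ne[1] == n_in && w->ne[2] == n_out;
}
```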
- Prefer ggml primitives
  - Implement as a composition in `src/ops/ggml_ops.cpp` when possible (sketch below).
- If a custom op is needed
  - Implement it in the ggml backend (CPU + optional GPU stubs).
  - Keep the API minimal and add a small targeted test.
- Respect ggml constraints
  - Many ops impose shape/stride constraints; confirm in the `ggml/` sources.
  - Example: `ggml_conv_transpose_1d` enforces `p0 == 0` and `d0 == 1`; use crop for padding.
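A composed helper usually stays a few lines of primitives; for example, a layernorm with learned scale/shift (a sketch; name and signature are illustrative, not the repo's actual helper):

```cpp
// LayerNorm over ne[0], then elementwise scale and shift (w, b broadcast).
static struct ggml_tensor * op_layer_norm(struct ggml_context * ctx,
                                          struct ggml_tensor * x,
                                          struct ggml_tensor * w,
                                          struct ggml_tensor * b, float eps) {
    struct ggml_tensor * y = ggml_norm(ctx, x, eps);
    y = ggml_mul(ctx, y, w);
    return ggml_add(ctx, y, b);
}
```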
- Graph size assertions: if you hit `GGML_ASSERT(cgraph->n_nodes < cgraph->size)` or scheduler hash-set asserts, increase graph/scheduler capacity (see the sketch after this list).
- Sample rate mismatches: ensure the model `sample_rate` in GGUF and the E2E config match the HF reference.
- Silent tensor layout mistakes: verify tensor shapes against ggml expectations and PyTorch definitions.
- Runtime weight fixes: do not reshape/transpose weights at runtime; fix the converter instead.
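For the graph-size case, the usual fix is allocating the graph with an explicit capacity instead of the default; a sketch (the capacity value is an assumption; the repo's actual sizing lives in `src/runtime/graph.cpp`):

```cpp
// Default graphs hold GGML_DEFAULT_GRAPH_SIZE nodes; large decoders may need
// more, or the n_nodes < size assert fires while building the graph.
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, /*size=*/8192, /*grads=*/false);
```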
If you need to understand execution:
- `src/runtime/graph_exec.cpp` (scheduler + compute)
- `src/runtime/graph.cpp` (cache + arena)
If you need to understand a model forward:
- `src/models/mimi.cpp` / `dac.cpp` / `wavtokenizer.cpp`
If you need to add/replace an op:
- `src/ops/ggml_ops.cpp` (+ possibly ggml upstream)
- `codec-model-dev` — end-to-end guide for adding a model (converter → ggml graphs → tests).
- `codec-op-dev` — guidance for adding/adjusting ggml ops safely.
Use the skill files for step-by-step workflows; they encode the preferred design constraints for this repo.
Models are wired via a switch-based vtable registry in `src/codec.cpp`. `codec_model` stores:
- shared/core metadata fields (sample rate, hop, n_q, etc.)
- `impl` (opaque model-specific struct)
- `vtable` (init/encode/decode/decode_latent)
Model-specific structs live in `src/models/<model>.h` and are not defined in `src/codec_internal.h`. Core code must not cast `impl`; only model files cast their own `impl`.
- Define the model struct in `src/models/<model>.h`.
- Implement the model graph + init in `src/models/<model>.cpp`.
- Provide a `codec_<model>_vtable()` with `create_impl`/`destroy_impl`/`init`/`encode`/`decode`.
- Register it in the switch in `codec_model_vtable_for_arch()` in `src/codec.cpp`.
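A hypothetical rendering of that wiring (the enum values and exact vtable fields here are assumptions; the real definitions live in `src/codec.cpp` and the internal headers):

```cpp
struct codec_model_vtable {
    void * (*create_impl)(void);
    void   (*destroy_impl)(void * impl);
    bool   (*init)  (struct codec_model * model);
    bool   (*encode)(struct codec_model * model /* , io ... */);
    bool   (*decode)(struct codec_model * model /* , io ... */);
};

const struct codec_model_vtable * codec_model_vtable_for_arch(enum codec_arch arch) {
    switch (arch) {
        case CODEC_ARCH_MIMI: return codec_mimi_vtable();
        case CODEC_ARCH_DAC:  return codec_dac_vtable();
        // new models: add a case returning codec_<model>_vtable()
        default:              return NULL;
    }
}
```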
If a model reuses another model’s encoder (e.g. Qwen3 reuses Mimi), put both configs in a model-specific impl struct and call the shared encoder helpers (`codec_mimi_encode_with(...)`).