# MumbleMED

MumbleMED helps you build medical speech datasets for ASR fine-tuning without needing recorded speech. You get synthetic audio (from an LLM + TTS pipeline), aligned transcripts, and train/val/test splits that play nicely with Whisper-style models.

## Publication
Warmer S, Idrissi-Yaghir A, Arzideh K, Schäfer H, Hosters B, Sadok N, Lang S, Schuler M, Hartmann S, Haubold J, Umutlu L, Forsting M, Friedrich CM, Nensa F, Borys K, Hosch R. MumbleMed: An end-to-end framework for fine-tuning automatic speech recognition models to medical language using large language and text-to-speech models.
Preprint to be made available as soon as possible.
- Publication
- What you get
- Prerequisites
- Quick start
- Modes: llm vs real
- Installation
- Configuration
- CLI reference
- Outputs
- Transcription comparison report
- Coding systems
- TTS and languages
- Speaker reference samples
- Customizing the LLM
- Environment variables
- Troubleshooting
- Disclaimer
- License

## What you get
You get .wav files (all TTS-generated, so no recording or privacy headaches), CSVs with paths, transcripts, speaker ids and durations, and ready-made train/val/test splits plus a small stats.json. The tool supports two modes—fully synthetic text from an LLM, or turning existing report text into speech—and you can plug in your own coding systems and switch TTS language.
It’s a good fit if you’re doing domain adaptation for medical ASR, need more data when real speech is hard to get, or want to try out ASR pipelines with realistic medical language before committing to big recording efforts.

## Prerequisites
You’ll need Python 3.10–3.14 and Poetry (or pip install -e . if you prefer). The pipeline talks to an LLM via any OpenAI-compatible Chat Completions API—local or hosted—so you need a running model and its URL. TTS is handled by Chatterbox (included); for German we use an optional Kartoffelbox patch, which needs a Hugging Face token and the right MODEL_REPO / T3_CHECKPOINT in your env. You also need a folder of short reference audio clips (speaker voices for TTS). For the synthetic (llm) mode you’ll need coding-system CSVs (display and code columns)—we default to icd10gm.csv, ops.csv, and radlex.csv in one directory.
A GPU speeds up TTS, and --whisper is there if you want to cap segment length for Whisper fine-tuning.
## Quick start

```bash
# 1. Clone and enter the repo
git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED

# 2. Install with Poetry
poetry install
eval $(poetry env activate)

# 3. Copy the env template and fill in at least: LLM_NAME, LLM_ENDPOINT,
#    LLM_DATASET_PATH, LLM_CSV_PATH, SPEAKER_VOICES_PATH, CODING_SYSTEMS_PATH
cp .env.example .env

# 4. Make sure everything is wired correctly (no audio is generated)
mumblemed llm --dry-run

# 5. Run a small synthetic run (use whatever output paths you like)
mumblemed llm \
    --num-docs 10 \
    --num-workers 1 \
    --dataset-path ./out/audio \
    --csv-path ./out/csvs \
    --whisper \
    --verbose
```

After step 5 you should see `.wav` files under `./out/audio/` and `train.csv`, `val.csv`, `test.csv`, `stats.json` under `./out/csvs/`. Tweak paths and `--num-docs` as you like. If you're using a hosted LLM, set `LLM_API_KEY` in `.env`.
## Modes: llm vs real

Two ways to use the pipeline:

- `llm`: Fully synthetic. The LLM generates medical text from your coding systems (ICD, OPS, RADLEX, or your own), then we chunk it, normalize for TTS, and synthesize. You choose where the audio and CSVs go (`--dataset-path`, `--csv-path`). Best when you don't have real reports and want full control over vocabulary and style.
- `real`: You already have report text in a CSV (column `report_clear`, one report per row). We normalize that text and run it through TTS, then split. You only set `--dataset-path`; CSVs and audio live in a run subfolder. Best when you have documents and want matching synthetic speech.
In both cases all audio is synthetic; we never use pre-recorded speech as input.
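The `real` mode input can be prepared with a few lines of Python. This is a sketch: the report texts and the `reports.csv` filename are invented placeholders; only the `report_clear` column name comes from the documentation.

```python
import csv

# Wrap a few de-identified report texts into the CSV shape that
# `mumblemed real` expects: one report per row, column "report_clear".
# The reports below are invented placeholders.
reports = [
    "CT thorax: no evidence of pulmonary embolism. Lungs clear.",
    "MRI brain: small chronic lacunar infarct in the left basal ganglia.",
]

with open("reports.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["report_clear"])
    writer.writeheader()
    for text in reports:
        writer.writerow({"report_clear": text})
```

You would then point `--input-csv` at `reports.csv` when running `mumblemed real`.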
## Installation

With Poetry (recommended):

```bash
git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED
poetry install
eval $(poetry env activate)
```

Or with pip: `pip install -e .` (plus any extra deps from `pyproject.toml` if needed). Then copy `.env.example` to `.env` and fill in your LLM endpoint, paths, and so on (see Environment variables).
To double-check that everything is set up correctly, run mumblemed llm --dry-run or python scripts/verify_pipeline.py --mode llm. Neither command generates audio; they just validate config and inputs.
## Configuration

Three ways to set options: a `.env` file (paths, model name, endpoint, API key, etc.), a YAML or JSON config file (handy for reproducible runs), or CLI flags. CLI overrides config, and config overrides `.env`. Config keys use snake_case and match the long CLI options (`dataset_path`, `num_docs`, `coding_systems`, `tts_language`, etc.). The full list is in Environment variables and CLI reference.
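A config file for a reproducible run might look like the sketch below. The keys mirror the long CLI options in snake_case as described above; the values are placeholders, not recommended settings.

```yaml
# my_config.yaml -- illustrative values only
num_docs: 100
num_workers: 2
dataset_path: ./out/audio
csv_path: ./out/csvs
coding_systems: ["ICD", "OPS"]
tts_language: de
whisper: true
```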
Example:

```bash
mumblemed llm --config my_config.yaml
mumblemed real --config real_config.json
```

## CLI reference

Global:
- `--config PATH`: Load options from YAML or JSON.
- `--verbose`: More log output.
mumblemed llm — Synthetic dataset (LLM + TTS).
| Option | Description |
|---|---|
| `--num-docs` | Number of synthetic documents to generate. |
| `--num-workers` | Parallel workers. |
| `--dataset-path` | Required. Directory for output audio. |
| `--csv-path` | Required. Directory for train/val/test CSVs and stats. |
| `--model-name` | LLM model name (e.g. for your endpoint). |
| `--llm-endpoint` | Base URL of the OpenAI-compatible API. |
| `--llm-api-key` | API key (for hosted endpoints). |
| `--gpu-id` | CUDA device id for TTS. |
| `--seed` | Random seed for reproducibility. |
| `--coding-systems-path` | Directory with coding system CSVs. |
| `--coding-systems` | Comma-separated list, e.g. `ICD,OPS,RADLEX` (default: all). |
| `--coding-system-files` | Optional JSON: custom name → filename, e.g. `{"MY":"my.csv"}`. |
| `--tts-language` | `de` (German, Kartoffelbox) or `en` (default Chatterbox). |
| `--whisper` | Restrict train/val to segments ≤ 30 s. |
| `--dry-run` | Validate config and inputs only; no generation. |
mumblemed real — Dataset from real document text (CSV → TTS → splits).
| Option | Description |
|---|---|
| `--name` | Required. Run name (used in output filenames). |
| `--input-csv` | Required. CSV with column `report_clear`. |
| `--num-docs` | How many rows to process. |
| `--num-workers` | Parallel workers. |
| `--dataset-path` | Required. Base path for output (run subdir created under it). |
| `--model-name`, `--llm-endpoint`, `--llm-api-key`, `--gpu-id` | Same as `llm`. |
| `--tts-language` | Same as `llm`. |
| `--whisper` | Same as `llm`. |
| `--dry-run` | Validate only. |
mumblemed visualize — HTML report comparing reference vs baseline vs fine-tuned transcriptions.
| Option | Description |
|---|---|
| `--input` | Full path to input CSV. Alternative to `--input-dir` + `--input-file`. |
| `--input-dir` | Directory containing the input CSV. Use with `--input-file`. |
| `--input-file` | Input CSV filename. Use with `--input-dir`. Ignored if `--input` is set. |
| `--output` | Required. Path for the output HTML file. |
| `--col-reference` | Column name for reference / ground truth (default: `ground_truth`). |
| `--col-baseline` | Column name for baseline transcription (default: `whisper_v2`). |
| `--col-finetuned` | Column name for fine-tuned transcription (default: `mumble_med`). |
| `--title` | Title in the HTML report (default: `Medical ASR Performance Analysis`). |
You must specify either --input <path> or both --input-dir and --input-file.
## Outputs

For `llm`: the CSVs and `stats.json` go under `--csv-path`, and the `.wav` files under `--dataset-path`. For `real`: everything goes under `--dataset-path/real-<name>/` (same structure, with `<name>` in the filenames). The CSVs list paths, normalized transcripts, speaker id, duration, and (for `llm`) document/sentence ids and optional code columns; `stats.json` gives you basic counts and duration stats per split.
Remember: the pipeline does not guarantee that every sample is valid. Plan for a quality-check step in your workflow (e.g. filtering, sampling, or manual review) before using the data for training or evaluation.
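As a starting point for such a quality gate, here is a minimal sketch. The column names `transcript` and `duration` are assumptions for illustration; check the header of your generated CSVs for the actual names.

```python
# Minimal quality gate over rows from a split CSV (e.g. read with
# csv.DictReader). Drops near-empty transcripts and over-long audio.
# Column names "transcript" and "duration" are assumed, not guaranteed.
def filter_rows(rows, max_duration=30.0, min_words=3):
    kept = []
    for row in rows:
        if len(row["transcript"].split()) < min_words:   # near-empty segment
            continue
        if float(row["duration"]) > max_duration:        # over-long audio
            continue
        kept.append(row)
    return kept

rows = [
    {"transcript": "Der Patient zeigt keine Auffälligkeiten im CT.", "duration": "4.2"},
    {"transcript": "Ok.", "duration": "0.6"},                       # too short
    {"transcript": "Langer Befundtext " * 40, "duration": "45.0"},  # too long
]
print(len(filter_rows(rows)))  # → 1 (only the first row survives)
```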
## Transcription comparison report

After fine-tuning an ASR model you can compare its transcriptions to a baseline and to the reference (ground truth) in an HTML report. The report shows word-level differences: substitutions, deletions, and insertions for both baseline and fine-tuned output vs reference.
CSV format: Your CSV must have three text columns (you can choose the names):
- Reference (ground truth) — the reference transcript.
- Baseline — transcription from the baseline model (e.g. vanilla Whisper).
- Fine-tuned — transcription from your fine-tuned model.
Default column names: ground_truth, whisper_v2, mumble_med. If your CSV uses different names, pass --col-reference, --col-baseline, and --col-finetuned.
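To get a feel for what the report counts, here is a rough sketch of word-level edit counting using Python's `difflib`. This illustrates the idea only; it is not the tool's actual implementation.

```python
from difflib import SequenceMatcher

def word_edits(reference: str, hypothesis: str):
    """Count word-level substitutions, deletions, and insertions
    between a reference and a hypothesis transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"sub": 0, "del": 0, "ins": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            # A replace span may mix substitutions with extra dels/ins
            counts["sub"] += min(i2 - i1, j2 - j1)
            counts["del"] += max(0, (i2 - i1) - (j2 - j1))
            counts["ins"] += max(0, (j2 - j1) - (i2 - i1))
        elif op == "delete":
            counts["del"] += i2 - i1
        elif op == "insert":
            counts["ins"] += j2 - j1
    return counts

print(word_edits("the tumor is malignant", "the tumour is malignant and large"))
# → {'sub': 1, 'del': 0, 'ins': 2}
```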
Examples:
Using a full input path and output file:

```bash
mumblemed visualize \
    --input transcription_results/transcriptions.csv \
    --output report.html
```

Using input directory and input file separately:

```bash
mumblemed visualize \
    --input-dir transcription_results \
    --input-file transcriptions.csv \
    --output report.html
```

With custom column names:

```bash
mumblemed visualize \
    --input-dir my_results \
    --input-file eval.csv \
    --output asr_comparison.html \
    --col-reference ref_text \
    --col-baseline baseline_whisper \
    --col-finetuned ft_model
```

Open the generated HTML in a browser to inspect the side-by-side comparison. Example of the evaluation report:
## Coding systems

Coding systems are used only in `llm` mode to give the LLM realistic medical terms to work with. We ship with three logical names: ICD (file `icd10gm.csv`), OPS (`ops.csv`), and RADLEX (`radlex.csv`). Put the files in one directory and point `--coding-systems-path` (or `CODING_SYSTEMS_PATH`) at it. Each CSV must have columns `display` and `code`.
You can enable only some of them—e.g. --coding-systems ICD,OPS or coding_systems: ["ICD", "OPS"] in config. Default is all three. To add your own system: create a CSV in the same format, put it in that directory, then register it via coding_system_files (e.g. {"MYCODES": "mycodes.csv"} in config or CODING_SYSTEM_FILES in env) and add the name to coding_systems. Only the systems you list are loaded.
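For example, a custom coding system file in the required `display`/`code` format could be generated like this. The terms, codes, and the `mycodes.csv` filename below are invented placeholders.

```python
import csv

# An illustrative custom coding system in the expected two-column format.
entries = [
    {"display": "Fracture of distal radius", "code": "S52.5"},
    {"display": "Pneumonia, unspecified organism", "code": "J18.9"},
]

with open("mycodes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["display", "code"])
    writer.writeheader()
    writer.writerows(entries)
```

Place the file in your coding-systems directory, register it (e.g. `{"MYCODES": "mycodes.csv"}` via `coding_system_files`), and add `MYCODES` to `coding_systems`.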
## TTS and languages

Out of the box we assume German: the default is `--tts-language de`, and for that you need the Kartoffelbox patch (`MODEL_REPO`, `T3_CHECKPOINT`, `HF_TOKEN` in `.env`). For English or other languages, use `--tts-language en`; then we use the default Chatterbox model and you don't need the patch; just put reference clips in the right language in `SPEAKER_VOICES_PATH`. Matching the reference audio to the target language helps a lot.
## Speaker reference samples

For the TTS voices used in this project we recorded one uniform sentence per speaker in both German and English. Using the same sentence for every speaker lets us capture comparable voice patterns across all of them, which keeps the synthetic data more consistent.
The sentence we used is:
- English: The patient was diagnosed with a malignant tumor and will undergo further radiological and histopathological evaluation.
- German: Der Patient wurde mit einem malignen Tumor diagnostiziert und wird sich weiteren radiologischen und histopathologischen Untersuchungen unterziehen.
If you supply your own reference clips in SPEAKER_VOICES_PATH, you can use any content you like; the pipeline doesn’t depend on this exact sentence. Using a fixed sentence (or short script) across speakers is still a good idea if you want comparable voice characteristics in your dataset.
## Customizing the LLM

If you want to change how the synthetic text reads or how we normalize for TTS, edit the prompts in `mumblemed/utils/llm.py`: `generate_synthetic_medical_text()` drives the narrative (specialty, language, structure), and `process_text_structure()` controls how we expand abbreviations and punctuation for speech. Output locations are always set with `--dataset-path` and `--csv-path` (or the corresponding env vars); there is no separate "LLM audio path".
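To illustrate the kind of normalization `process_text_structure()` performs, here is a toy sketch. The abbreviation mapping is invented; this is not the project's actual code.

```python
import re

# Invented mapping of German medical abbreviations to spoken forms,
# purely for illustration of TTS text normalization.
ABBREVIATIONS = {
    "Pat.": "Patient",
    "V.a.": "Verdacht auf",
    "z.B.": "zum Beispiel",
}

def normalize_for_tts(text: str) -> str:
    for short, long_form in ABBREVIATIONS.items():
        text = text.replace(short, long_form)
    # Collapse repeated whitespace so pauses stay natural.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_tts("Pat. mit V.a. Pneumonie, z.B. im CT sichtbar."))
# → "Patient mit Verdacht auf Pneumonie, zum Beispiel im CT sichtbar."
```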
## Environment variables

The app reads `.env` from the current directory. R = required for that mode, O = optional.

| Variable | llm | real | Description |
|---|---|---|---|
| `LLM_NAME` | R | R | Model name for your endpoint. |
| `LLM_ENDPOINT` | R | R | Base URL (e.g. `http://127.0.0.1:8000/v1`). |
| `LLM_API_KEY` | R if hosted | R if hosted | API key for non-local endpoints. |
| `LLM_DATASET_PATH` | R | — | Where to write generated audio (llm). |
| `LLM_CSV_PATH` | R | — | Where to write CSVs (llm). |
| `REAL_DATASET_PATH` | — | R | Base path for real-document runs. |
| `SPEAKER_VOICES_PATH` | R | R | Directory of reference audio files. |
| `CODING_SYSTEMS_PATH` | R | — | Directory with coding system CSVs. |
| `CODING_SYSTEMS` | O | — | Comma-separated, e.g. `ICD,OPS,RADLEX`. Default: all. |
| `CODING_SYSTEM_FILES` | O | — | JSON for custom name → filename. |
| `TTS_LANGUAGE` | O | O | `de` or `en` (default `de`). |
| `GPU_ID` | O | O | CUDA device for TTS. |
| `MODEL_REPO`, `T3_CHECKPOINT`, `HF_TOKEN` | O | O | For German TTS (Kartoffelbox) only. |
| `NUM_SYNTHETIC_DOCS`, `NUM_REAL_DOCUMENTS`, `NUM_WORKERS` | O | O | Default doc counts and parallelism. |
## Troubleshooting

- **"Missing required …"**: That option has to come from env or the CLI. For `llm` you need `LLM_DATASET_PATH`, `LLM_CSV_PATH`, `SPEAKER_VOICES_PATH`, and `CODING_SYSTEMS_PATH` set somewhere.
- **Dry-run first**: `mumblemed llm --dry-run` and `mumblemed real … --dry-run` check paths and config without generating any audio. Use them when something's misconfigured.
- **NLTK complaining**: The first run may need network access so NLTK can pull data; or pre-download the resources it asks for.
- **LLM / API**: The endpoint has to speak the OpenAI Chat Completions API. For hosted services, set `LLM_API_KEY`.
- **No speaker voices**: `SPEAKER_VOICES_PATH` must be a directory with at least one audio file; we use those as TTS references.
- **German TTS**: For `tts_language=de` you need `MODEL_REPO`, `T3_CHECKPOINT`, and `HF_TOKEN`. For English, use `--tts-language en` and you can skip those.
## Disclaimer

**Ongoing research.** This project is research software and is under active development. APIs and behaviour may change as we iterate.

**Output quality.** The pipeline can produce samples that are not fully correct or consistent (e.g. transcript errors, odd phrasing, or audio artefacts). We do not guarantee that every generated item is valid. If you use MumbleMED in your own project, you should add a quality-check step (e.g. automatic filters, spot checks, or human review) before relying on the data for training or evaluation.
## License

MIT.

