# MumbleMED

MumbleMED helps you build medical speech datasets for ASR fine-tuning without needing recorded speech. You get synthetic audio (from an LLM + TTS pipeline), aligned transcripts, and train/val/test splits that play nicely with Whisper-style models.

## Publication
Warmer S, Idrissi-Yaghir A, Arzideh K, Schäfer H, Hosters B, Sadok N, Lang S, Schuler M, Hartmann S, Haubold J, Umutlu L, Forsting M, Friedrich CM, Nensa F, Borys K, Hosch R. MumbleMed: An end-to-end framework for fine-tuning automatic speech recognition models to medical language using large language and text-to-speech models.
Preprint to be made available as soon as possible.
- Publication
- What you get
- Prerequisites
- Quick start
- Modes: llm vs real
- Installation
- Configuration
- CLI reference
- Outputs
- Transcription comparison report
- Coding systems
- TTS and languages
- Speaker reference samples
- Customizing the LLM
- Environment variables
- Troubleshooting
- Disclaimer
- License

## What you get
You get .wav files (all TTS-generated, so no recording or privacy headaches), CSVs with paths, transcripts, speaker ids and durations, and ready-made train/val/test splits plus a small stats.json. The tool supports two modes—fully synthetic text from an LLM, or turning existing report text into speech—and you can plug in your own coding systems and switch TTS language.
It’s a good fit if you’re doing domain adaptation for medical ASR, need more data when real speech is hard to get, or want to try out ASR pipelines with realistic medical language before committing to big recording efforts.

## Prerequisites
You’ll need Python 3.10–3.14 and Poetry (or pip install -e . if you prefer). The pipeline talks to an LLM via any OpenAI-compatible Chat Completions API—local or hosted—so you need a running model and its URL. TTS is handled by Chatterbox (included); for German we use an optional Kartoffelbox patch, which needs a Hugging Face token and the right MODEL_REPO / T3_CHECKPOINT in your env. You also need a folder of short reference audio clips (speaker voices for TTS). For the synthetic (llm) mode you’ll need coding-system CSVs (display and code columns)—we default to icd10gm.csv, ops.csv, and radlex.csv in one directory.
A GPU speeds up TTS, and --whisper is there if you want to cap segment length for Whisper fine-tuning.
## Quick start

```bash
# 1. Clone and enter the repo
git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED

# 2. Install with Poetry
poetry install
eval $(poetry env activate)

# 3. Copy the env template and fill in at least: LLM_NAME, LLM_ENDPOINT,
#    LLM_DATASET_PATH, LLM_CSV_PATH, SPEAKER_VOICES_PATH, CODING_SYSTEMS_PATH
cp .env.example .env

# 4. Make sure everything is wired correctly (no audio is generated)
mumblemed llm --dry-run

# 5. Run a small synthetic run (use whatever output paths you like)
mumblemed llm \
    --num-docs 10 \
    --num-workers 1 \
    --dataset-path ./out/audio \
    --csv-path ./out/csvs \
    --whisper \
    --verbose
```

After step 5 you should see `.wav` files under `./out/audio/` and `train.csv`, `val.csv`, `test.csv`, `stats.json` under `./out/csvs/`. Tweak paths and `--num-docs` as you like. If you're using a hosted LLM, set `LLM_API_KEY` in `.env`.
## Modes: llm vs real

Two ways to use the pipeline:

- `llm`: Fully synthetic. The LLM generates medical text from your coding systems (ICD, OPS, RADLEX, or your own), then we chunk it, normalize for TTS, and synthesize. You choose where the audio and CSVs go (`--dataset-path`, `--csv-path`). Best when you don't have real reports and want full control over vocabulary and style.
- `real`: You already have report text in a CSV (column `report_clear`, one report per row). We normalize that text and run it through TTS, then split. You only set `--dataset-path`; CSVs and audio live in a run subfolder. Best when you have documents and want matching synthetic speech.
In both cases all audio is synthetic; we never use pre-recorded speech as input.
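The `real` mode input can be prepared with a few lines of Python. This is a sketch: the report texts and the `reports.csv` filename are invented placeholders; only the `report_clear` column name comes from the documentation.

```python
import csv

# Wrap a few de-identified report texts into the CSV shape that
# `mumblemed real` expects: one report per row, column "report_clear".
# The reports below are invented placeholders.
reports = [
    "CT thorax: no evidence of pulmonary embolism. Lungs clear.",
    "MRI brain: small chronic lacunar infarct in the left basal ganglia.",
]

with open("reports.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["report_clear"])
    writer.writeheader()
    for text in reports:
        writer.writerow({"report_clear": text})
```

You would then point `--input-csv` at `reports.csv` when running `mumblemed real`.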
## Installation

With Poetry (recommended):

```bash
git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED
poetry install
eval $(poetry env activate)
```

Or with pip: `pip install -e .` (plus any extra deps from `pyproject.toml` if needed). Then copy `.env.example` to `.env` and fill in your LLM endpoint, paths, and so on (see Environment variables).
To double-check that everything is set up correctly, run mumblemed llm --dry-run or python scripts/verify_pipeline.py --mode llm. Neither command generates audio; they just validate config and inputs.
## Configuration

Three ways to set options: a `.env` file (paths, model name, endpoint, API key, etc.), a YAML or JSON config file (handy for reproducible runs), or CLI flags. CLI overrides config, and config overrides `.env`. Config keys use snake_case and match the long CLI options (`dataset_path`, `num_docs`, `coding_systems`, `tts_language`, etc.). The full list is in Environment variables and CLI reference.
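A config file for a reproducible run might look like the sketch below. The keys mirror the long CLI options in snake_case as described above; the values are placeholders, not recommended settings.

```yaml
# my_config.yaml -- illustrative values only
num_docs: 100
num_workers: 2
dataset_path: ./out/audio
csv_path: ./out/csvs
coding_systems: ["ICD", "OPS"]
tts_language: de
whisper: true
```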
Example:

```bash
mumblemed llm --config my_config.yaml
mumblemed real --config real_config.json
```

## CLI reference

Global:
- `--config PATH`: Load options from YAML or JSON.
- `--verbose`: More log output.
mumblemed llm — Synthetic dataset (LLM + TTS).
| Option | Description |
|---|---|
| `--num-docs` | Number of synthetic documents to generate. |
| `--num-workers` | Parallel workers. |
| `--dataset-path` | Required. Directory for output audio. |
| `--csv-path` | Required. Directory for train/val/test CSVs and stats. |
| `--model-name` | LLM model name (e.g. for your endpoint). |
| `--llm-endpoint` | Base URL of the OpenAI-compatible API. |
| `--llm-api-key` | API key (for hosted endpoints). |
| `--gpu-id` | CUDA device id for TTS. |
| `--seed` | Random seed for reproducibility. |
| `--coding-systems-path` | Directory with coding system CSVs. |
| `--coding-systems` | Comma-separated list, e.g. `ICD,OPS,RADLEX` (default: all). |
| `--coding-system-files` | Optional JSON: custom name → filename, e.g. `{"MY":"my.csv"}`. |
| `--tts-language` | `de` (German, Kartoffelbox) or `en` (default Chatterbox). |
| `--whisper` | Restrict train/val to segments ≤ 30 s. |
| `--dry-run` | Validate config and inputs only; no generation. |
mumblemed real — Dataset from real document text (CSV → TTS → splits).
| Option | Description |
|---|---|
| `--name` | Required. Run name (used in output filenames). |
| `--input-csv` | Required. CSV with column `report_clear`. |
| `--num-docs` | How many rows to process. |
| `--num-workers` | Parallel workers. |
| `--dataset-path` | Required. Base path for output (run subdir created under it). |
| `--model-name`, `--llm-endpoint`, `--llm-api-key`, `--gpu-id` | Same as `llm`. |
| `--tts-language` | Same as `llm`. |
| `--whisper` | Same as `llm`. |
| `--dry-run` | Validate only. |
mumblemed visualize — HTML report comparing reference vs baseline vs fine-tuned transcriptions.
| Option | Description |
|---|---|
| `--input` | Full path to input CSV. Alternative to `--input-dir` + `--input-file`. |
| `--input-dir` | Directory containing the input CSV. Use with `--input-file`. |
| `--input-file` | Input CSV filename. Use with `--input-dir`. Ignored if `--input` is set. |
| `--output` | Required. Path for the output HTML file. |
| `--col-reference` | Column name for reference / ground truth (default: `ground_truth`). |
| `--col-baseline` | Column name for baseline transcription (default: `whisper_v2`). |
| `--col-finetuned` | Column name for fine-tuned transcription (default: `mumble_med`). |
| `--title` | Title in the HTML report (default: `Medical ASR Performance Analysis`). |
You must specify either --input <path> or both --input-dir and --input-file.
## Outputs

For `llm`: the CSVs and `stats.json` go under `--csv-path`, and the `.wav` files under `--dataset-path`. For `real`: everything goes under `--dataset-path/real-<name>/` (same structure, with `<name>` in the filenames). The CSVs list paths, normalized transcripts, speaker id, duration, and (for `llm`) document/sentence ids and optional code columns; `stats.json` gives you basic counts and duration stats per split.
Remember: the pipeline does not guarantee that every sample is valid. Plan for a quality-check step in your workflow (e.g. filtering, sampling, or manual review) before using the data for training or evaluation.
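As a starting point for such a quality gate, here is a minimal sketch. The column names `transcript` and `duration` are assumptions for illustration; check the header of your generated CSVs for the actual names.

```python
# Minimal quality gate over rows from a split CSV (e.g. read with
# csv.DictReader). Drops near-empty transcripts and over-long audio.
# Column names "transcript" and "duration" are assumed, not guaranteed.
def filter_rows(rows, max_duration=30.0, min_words=3):
    kept = []
    for row in rows:
        if len(row["transcript"].split()) < min_words:   # near-empty segment
            continue
        if float(row["duration"]) > max_duration:        # over-long audio
            continue
        kept.append(row)
    return kept

rows = [
    {"transcript": "Der Patient zeigt keine Auffälligkeiten im CT.", "duration": "4.2"},
    {"transcript": "Ok.", "duration": "0.6"},                       # too short
    {"transcript": "Langer Befundtext " * 40, "duration": "45.0"},  # too long
]
print(len(filter_rows(rows)))  # → 1 (only the first row survives)
```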
## Transcription comparison report

After fine-tuning an ASR model you can compare its transcriptions to a baseline and to the reference (ground truth) in an HTML report. The report shows word-level differences: substitutions, deletions, and insertions for both baseline and fine-tuned output vs reference.
CSV format: Your CSV must have three text columns (you can choose the names):
- Reference (ground truth) — the reference transcript.
- Baseline — transcription from the baseline model (e.g. vanilla Whisper).
- Fine-tuned — transcription from your fine-tuned model.
Default column names: ground_truth, whisper_v2, mumble_med. If your CSV uses different names, pass --col-reference, --col-baseline, and --col-finetuned.
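To get a feel for what the report counts, here is a rough sketch of word-level edit counting using Python's `difflib`. This illustrates the idea only; it is not the tool's actual implementation.

```python
from difflib import SequenceMatcher

def word_edits(reference: str, hypothesis: str):
    """Count word-level substitutions, deletions, and insertions
    between a reference and a hypothesis transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"sub": 0, "del": 0, "ins": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            # A replace span may mix substitutions with extra dels/ins
            counts["sub"] += min(i2 - i1, j2 - j1)
            counts["del"] += max(0, (i2 - i1) - (j2 - j1))
            counts["ins"] += max(0, (j2 - j1) - (i2 - i1))
        elif op == "delete":
            counts["del"] += i2 - i1
        elif op == "insert":
            counts["ins"] += j2 - j1
    return counts

print(word_edits("the tumor is malignant", "the tumour is malignant and large"))
# → {'sub': 1, 'del': 0, 'ins': 2}
```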
Examples:
Using a full input path and output file:

```bash
mumblemed visualize \
    --input transcription_results/transcriptions.csv \
    --output report.html
```

Using input directory and input file separately:

```bash
mumblemed visualize \
    --input-dir transcription_results \
    --input-file transcriptions.csv \
    --output report.html
```

With custom column names:

```bash
mumblemed visualize \
    --input-dir my_results \
    --input-file eval.csv \
    --output asr_comparison.html \
    --col-reference ref_text \
    --col-baseline baseline_whisper \
    --col-finetuned ft_model
```

Open the generated HTML in a browser to inspect the side-by-side comparison. Example of the evaluation report:
## Coding systems

Coding systems are used only in `llm` mode to give the LLM realistic medical terms to work with. We ship with three logical names: ICD (file `icd10gm.csv`), OPS (`ops.csv`), and RADLEX (`radlex.csv`). Put the files in one directory and point `--coding-systems-path` (or `CODING_SYSTEMS_PATH`) at it. Each CSV must have columns `display` and `code`.
You can enable only some of them—e.g. --coding-systems ICD,OPS or coding_systems: ["ICD", "OPS"] in config. Default is all three. To add your own system: create a CSV in the same format, put it in that directory, then register it via coding_system_files (e.g. {"MYCODES": "mycodes.csv"} in config or CODING_SYSTEM_FILES in env) and add the name to coding_systems. Only the systems you list are loaded.
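For example, a custom coding system file in the required `display`/`code` format could be generated like this. The terms, codes, and the `mycodes.csv` filename below are invented placeholders.

```python
import csv

# An illustrative custom coding system in the expected two-column format.
entries = [
    {"display": "Fracture of distal radius", "code": "S52.5"},
    {"display": "Pneumonia, unspecified organism", "code": "J18.9"},
]

with open("mycodes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["display", "code"])
    writer.writeheader()
    writer.writerows(entries)
```

Place the file in your coding-systems directory, register it (e.g. `{"MYCODES": "mycodes.csv"}` via `coding_system_files`), and add `MYCODES` to `coding_systems`.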
## TTS and languages

Out of the box we assume German: the default is `--tts-language de`, and for that you need the Kartoffelbox patch (`MODEL_REPO`, `T3_CHECKPOINT`, `HF_TOKEN` in `.env`). For English or other languages, use `--tts-language en`; then we use the default Chatterbox model and you don't need the patch; just put reference clips in the right language in `SPEAKER_VOICES_PATH`. Matching the reference audio to the target language helps a lot.
## Speaker reference samples

For the TTS voices used in this project we recorded one uniform sentence per speaker in both German and English. Using the same sentence for every speaker lets us capture comparable voice patterns across all of them, which keeps the synthetic data more consistent.
The sentence we used is:
- English: The patient was diagnosed with a malignant tumor and will undergo further radiological and histopathological evaluation.
- German: Der Patient wurde mit einem malignen Tumor diagnostiziert und wird sich weiteren radiologischen und histopathologischen Untersuchungen unterziehen.
If you supply your own reference clips in SPEAKER_VOICES_PATH, you can use any content you like; the pipeline doesn’t depend on this exact sentence. Using a fixed sentence (or short script) across speakers is still a good idea if you want comparable voice characteristics in your dataset.
## Customizing the LLM

If you want to change how the synthetic text reads or how we normalize for TTS, edit the prompts in `mumblemed/utils/llm.py`: `generate_synthetic_medical_text()` drives the narrative (specialty, language, structure), and `process_text_structure()` controls how we expand abbreviations and punctuation for speech. Output locations are always set with `--dataset-path` and `--csv-path` (or the corresponding env vars); there is no separate "LLM audio path".
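To illustrate the kind of normalization `process_text_structure()` performs, here is a toy sketch. The abbreviation mapping is invented; this is not the project's actual code.

```python
import re

# Invented mapping of German medical abbreviations to spoken forms,
# purely for illustration of TTS text normalization.
ABBREVIATIONS = {
    "Pat.": "Patient",
    "V.a.": "Verdacht auf",
    "z.B.": "zum Beispiel",
}

def normalize_for_tts(text: str) -> str:
    for short, long_form in ABBREVIATIONS.items():
        text = text.replace(short, long_form)
    # Collapse repeated whitespace so pauses stay natural.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_tts("Pat. mit V.a. Pneumonie, z.B. im CT sichtbar."))
# → "Patient mit Verdacht auf Pneumonie, zum Beispiel im CT sichtbar."
```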
## Environment variables

The app reads `.env` from the current directory. R = required for that mode, O = optional.

| Variable | llm | real | Description |
|---|---|---|---|
| `LLM_NAME` | R | R | Model name for your endpoint. |
| `LLM_ENDPOINT` | R | R | Base URL (e.g. `http://127.0.0.1:8000/v1`). |
| `LLM_API_KEY` | R if hosted | R if hosted | API key for non-local endpoints. |
| `LLM_DATASET_PATH` | R | — | Where to write generated audio (llm). |
| `LLM_CSV_PATH` | R | — | Where to write CSVs (llm). |
| `REAL_DATASET_PATH` | — | R | Base path for real-document runs. |
| `SPEAKER_VOICES_PATH` | R | R | Directory of reference audio files. |
| `CODING_SYSTEMS_PATH` | R | — | Directory with coding system CSVs. |
| `CODING_SYSTEMS` | O | — | Comma-separated, e.g. `ICD,OPS,RADLEX`. Default: all. |
| `CODING_SYSTEM_FILES` | O | — | JSON for custom name → filename. |
| `TTS_LANGUAGE` | O | O | `de` or `en` (default `de`). |
| `GPU_ID` | O | O | CUDA device for TTS. |
| `MODEL_REPO`, `T3_CHECKPOINT`, `HF_TOKEN` | O | O | For German TTS (Kartoffelbox) only. |
| `NUM_SYNTHETIC_DOCS`, `NUM_REAL_DOCUMENTS`, `NUM_WORKERS` | O | O | Default doc counts and parallelism. |
## Troubleshooting

- **"Missing required …"**: That option has to come from env or the CLI. For `llm` you need `LLM_DATASET_PATH`, `LLM_CSV_PATH`, `SPEAKER_VOICES_PATH`, and `CODING_SYSTEMS_PATH` set somewhere.
- **Dry-run first**: `mumblemed llm --dry-run` and `mumblemed real … --dry-run` check paths and config without generating any audio. Use them when something's misconfigured.
- **NLTK complaining**: The first run may need network access so NLTK can pull data; or pre-download the resources it asks for.
- **LLM / API**: The endpoint has to speak the OpenAI Chat Completions API. For hosted services, set `LLM_API_KEY`.
- **No speaker voices**: `SPEAKER_VOICES_PATH` must be a directory with at least one audio file; we use those as TTS references.
- **German TTS**: For `tts_language=de` you need `MODEL_REPO`, `T3_CHECKPOINT`, and `HF_TOKEN`. For English, use `--tts-language en` and you can skip those.
## Disclaimer

**Ongoing research.** This project is research software and is under active development. APIs and behaviour may change as we iterate.

**Output quality.** The pipeline can produce samples that are not fully correct or consistent (e.g. transcript errors, odd phrasing, or audio artefacts). We do not guarantee that every generated item is valid. If you use MumbleMED in your own project, you should add a quality-check step (e.g. automatic filters, spot checks, or human review) before relying on the data for training or evaluation.
## License

MIT.

