
MumbleMED

MumbleMED helps you build medical speech datasets for ASR fine-tuning without needing recorded speech. You get synthetic audio (from an LLM + TTS pipeline), aligned transcripts, and train/val/test splits that play nicely with Whisper-style models.

Publication

Warmer S, Idrissi-Yaghir A, Arzideh K, Schäfer H, Hosters B, Sadok N, Lang S, Schuler M, Hartmann S, Haubold J, Umutlu L, Forsting M, Friedrich CM, Nensa F, Borys K, Hosch R. MumbleMed: An end-to-end framework for fine-tuning automatic speech recognition models to medical language using large language and text-to-speech models.

A preprint will be made available as soon as possible.


What you get

You get .wav files (all TTS-generated, so no recording or privacy headaches), CSVs with paths, transcripts, speaker IDs, and durations, and ready-made train/val/test splits plus a small stats.json. The tool supports two modes (fully synthetic text from an LLM, or turning existing report text into speech), and you can plug in your own coding systems and switch the TTS language.

It’s a good fit if you’re doing domain adaptation for medical ASR, need more data when real speech is hard to get, or want to try out ASR pipelines with realistic medical language before committing to big recording efforts.


Prerequisites

You’ll need:

  • Python 3.10–3.14 and Poetry (or pip install -e . if you prefer).
  • An LLM reachable via any OpenAI-compatible Chat Completions API—local or hosted—so you need a running model and its URL.
  • TTS: Chatterbox is included; for German we use an optional Kartoffelbox patch, which needs a Hugging Face token and the right MODEL_REPO / T3_CHECKPOINT in your env.
  • A folder of short reference audio clips (speaker voices for TTS).
  • For the synthetic (llm) mode: coding-system CSVs with display and code columns—we default to icd10gm.csv, ops.csv, and radlex.csv in one directory.

A GPU speeds up TTS, and --whisper is there if you want to cap segment length for Whisper fine-tuning.


Quick start

# 1. Clone and enter the repo
git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED

# 2. Install with Poetry
poetry install
eval $(poetry env activate)

# 3. Copy the env template and fill in at least: LLM_NAME, LLM_ENDPOINT,
#    LLM_DATASET_PATH, LLM_CSV_PATH, SPEAKER_VOICES_PATH, CODING_SYSTEMS_PATH
cp .env.example .env

# 4. Make sure everything is wired correctly (no audio is generated)
mumblemed llm --dry-run

# 5. Run a small synthetic run (use whatever output paths you like)
mumblemed llm \
  --num-docs 10 \
  --num-workers 1 \
  --dataset-path ./out/audio \
  --csv-path ./out/csvs \
  --whisper \
  --verbose

After step 5 you should see .wav files under ./out/audio/ and train.csv, val.csv, test.csv, stats.json under ./out/csvs/. Tweak paths and --num-docs as you like. If you’re using a hosted LLM, set LLM_API_KEY in .env.


Modes: llm vs real

Two ways to use the pipeline:

  • llm — Fully synthetic. The LLM generates medical text from your coding systems (ICD, OPS, RADLEX, or your own), then we chunk it, normalize for TTS, and synthesize. You choose where the audio and CSVs go (--dataset-path, --csv-path). Best when you don’t have real reports and want full control over vocabulary and style.
  • real — You already have report text in a CSV (column report_clear, one report per row). We normalize that text and run it through TTS, then split. You only set --dataset-path; CSVs and audio live in a run subfolder. Best when you have documents and want matching synthetic speech.

In both cases all audio is synthetic; we never use pre-recorded speech as input.
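
For real mode, the input is just a CSV with one report per row in a report_clear column (the column name comes from the docs above; the file name and report texts below are invented examples). A minimal sketch of preparing such a file:

```python
import csv

# Hypothetical input for `mumblemed real`: one report per row in a column
# named `report_clear` (the only column the pipeline requires).
rows = [
    {"report_clear": "CT of the chest shows no acute pathology."},
    {"report_clear": "MRI of the knee demonstrates a medial meniscus tear."},
]

with open("reports.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["report_clear"])
    writer.writeheader()
    writer.writerows(rows)
```

You would then pass this file via --input-csv reports.csv.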


Installation

With Poetry (recommended):

git clone https://github.com/UMEssen/MumbleMED.git
cd MumbleMED
poetry install
eval $(poetry env activate)

Or with pip: pip install -e . (and any extra deps from pyproject.toml if needed). Then copy .env.example to .env and fill in your LLM endpoint, paths, and so on (see Environment variables).

To double-check that everything is set up correctly, run mumblemed llm --dry-run or python scripts/verify_pipeline.py --mode llm. Neither command generates audio; they just validate config and inputs.


Configuration

Three ways to set options: a .env file (paths, model name, endpoint, API key, etc.), a YAML or JSON config file (handy for reproducible runs), or CLI flags. CLI overrides config, and config overrides .env. Config keys use snake_case and match the long CLI options (dataset_path, num_docs, coding_systems, tts_language, etc.). Full list is in Environment variables and CLI reference.

Example:

mumblemed llm --config my_config.yaml
mumblemed real --config real_config.json
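
A sketch of what my_config.yaml might look like. Keys mirror the long CLI options in snake_case as described above; all values are illustrative, and the whisper and seed key names are assumptions derived from the corresponding flags.

```yaml
# my_config.yaml -- keys mirror the long CLI options in snake_case.
# All values are illustrative placeholders.
num_docs: 100
num_workers: 2
dataset_path: ./out/audio
csv_path: ./out/csvs
coding_systems: ["ICD", "OPS"]
tts_language: de
whisper: true   # key name assumed from the --whisper flag
seed: 42        # key name assumed from the --seed flag
```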

CLI reference

Global:

  • --config PATH — Load options from YAML or JSON.
  • --verbose — More log output.

mumblemed llm — Synthetic dataset (LLM + TTS).

| Option | Description |
| --- | --- |
| --num-docs | Number of synthetic documents to generate. |
| --num-workers | Parallel workers. |
| --dataset-path | Required. Directory for output audio. |
| --csv-path | Required. Directory for train/val/test CSVs and stats. |
| --model-name | LLM model name (e.g. for your endpoint). |
| --llm-endpoint | Base URL of the OpenAI-compatible API. |
| --llm-api-key | API key (for hosted endpoints). |
| --gpu-id | CUDA device id for TTS. |
| --seed | Random seed for reproducibility. |
| --coding-systems-path | Directory with coding system CSVs. |
| --coding-systems | Comma-separated list, e.g. ICD,OPS,RADLEX (default: all). |
| --coding-system-files | Optional JSON: custom name → filename, e.g. {"MY":"my.csv"}. |
| --tts-language | de (German, Kartoffelbox) or en (default Chatterbox model). |
| --whisper | Restrict train/val to segments ≤ 30 s. |
| --dry-run | Validate config and inputs only; no generation. |

mumblemed real — Dataset from real document text (CSV → TTS → splits).

| Option | Description |
| --- | --- |
| --name | Required. Run name (used in output filenames). |
| --input-csv | Required. CSV with column report_clear. |
| --num-docs | How many rows to process. |
| --num-workers | Parallel workers. |
| --dataset-path | Required. Base path for output (run subdir created under it). |
| --model-name, --llm-endpoint, --llm-api-key, --gpu-id | Same as llm. |
| --tts-language | Same as llm. |
| --whisper | Same as llm. |
| --dry-run | Validate only. |

mumblemed visualize — HTML report comparing reference vs baseline vs fine-tuned transcriptions.

| Option | Description |
| --- | --- |
| --input | Full path to input CSV. Alternative to --input-dir + --input-file. |
| --input-dir | Directory containing the input CSV. Use with --input-file. |
| --input-file | Input CSV filename. Use with --input-dir. Ignored if --input is set. |
| --output | Required. Path for the output HTML file. |
| --col-reference | Column name for reference / ground truth (default: ground_truth). |
| --col-baseline | Column name for baseline transcription (default: whisper_v2). |
| --col-finetuned | Column name for fine-tuned transcription (default: mumble_med). |
| --title | Title in the HTML report (default: Medical ASR Performance Analysis). |

You must specify either --input <path> or both --input-dir and --input-file.


Outputs

For llm: the CSVs and stats.json go under --csv-path, and the .wav files under --dataset-path. For real: everything goes under --dataset-path/real-<name>/ (same structure, with <name> in the filenames). The CSVs list paths, normalized transcripts, speaker id, duration, and (for llm) document/sentence ids and optional code columns; stats.json gives you basic counts and duration stats per split.

Remember: the pipeline does not guarantee that every sample is valid. Plan for a quality-check step in your workflow (e.g. filtering, sampling, or manual review) before using the data for training or evaluation.
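
One way to build that quality gate is a simple post-hoc filter over the split CSVs. This is a hypothetical sketch, not part of MumbleMED: the transcript and duration column names are assumptions, so check them against the actual header of your train.csv first.

```python
import csv

# Hypothetical quality gate: drop rows with an empty transcript or an
# implausible duration. Column names "transcript" and "duration" are
# assumptions -- verify them against the real CSV header.
def filter_split(in_path, out_path, max_dur=30.0):
    """Write a filtered copy of a split CSV; return (total, kept) counts."""
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    kept = [
        r for r in rows
        if r.get("transcript", "").strip()
        and 0.0 < float(r.get("duration", 0)) <= max_dur
    ]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(kept)
    return len(rows), len(kept)
```

Combine an automatic pass like this with spot checks of the audio itself.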


Transcription comparison report

After fine-tuning an ASR model you can compare its transcriptions to a baseline and to the reference (ground truth) in an HTML report. The report shows word-level differences: substitutions, deletions, and insertions for both baseline and fine-tuned output vs reference.

CSV format: Your CSV must have three text columns (you can choose the names):

  • Reference (ground truth) — the reference transcript.
  • Baseline — transcription from the baseline model (e.g. vanilla Whisper).
  • Fine-tuned — transcription from your fine-tuned model.

Default column names: ground_truth, whisper_v2, mumble_med. If your CSV uses different names, pass --col-reference, --col-baseline, and --col-finetuned.

Examples:

Using a full input path and output file:

mumblemed visualize \
  --input transcription_results/transcriptions.csv \
  --output report.html

Using input directory and input file separately:

mumblemed visualize \
  --input-dir transcription_results \
  --input-file transcriptions.csv \
  --output report.html

With custom column names:

mumblemed visualize \
  --input-dir my_results \
  --input-file eval.csv \
  --output asr_comparison.html \
  --col-reference ref_text \
  --col-baseline baseline_whisper \
  --col-finetuned ft_model

Open the generated HTML in a browser to inspect the side-by-side comparison. Example of the evaluation report:

Example of the transcription comparison HTML report
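
The report performs this word-level alignment internally. As a rough illustration of the idea only (not MumbleMED's actual implementation), here is a stdlib sketch that counts substitutions, deletions, and insertions between a reference and a hypothesis:

```python
from difflib import SequenceMatcher

def word_edits(reference: str, hypothesis: str):
    """Count word-level (substitutions, deletions, insertions).
    A stdlib sketch of the idea, not MumbleMED's actual algorithm."""
    ref, hyp = reference.split(), hypothesis.split()
    subs = dels = ins = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":
            subs += min(i2 - i1, j2 - j1)
            # A length mismatch inside a replace counts as extra del/ins.
            dels += max(0, (i2 - i1) - (j2 - j1))
            ins += max(0, (j2 - j1) - (i2 - i1))
        elif op == "delete":
            dels += i2 - i1
        elif op == "insert":
            ins += j2 - j1
    return subs, dels, ins
```

For example, comparing "the patient has a tumor" against "the patient had tumor" yields one substitution and one deletion.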


Coding systems

Coding systems are used only in llm mode to give the LLM realistic medical terms to work with. We ship with three logical names: ICD (file icd10gm.csv), OPS (ops.csv), and RADLEX (radlex.csv). Put the files in one directory and point --coding-systems-path (or CODING_SYSTEMS_PATH) at it. Each CSV must have columns display and code.

You can enable only some of them—e.g. --coding-systems ICD,OPS or coding_systems: ["ICD", "OPS"] in config. Default is all three. To add your own system: create a CSV in the same format, put it in that directory, then register it via coding_system_files (e.g. {"MYCODES": "mycodes.csv"} in config or CODING_SYSTEM_FILES in env) and add the name to coding_systems. Only the systems you list are loaded.
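
To make the custom-system workflow concrete, this hypothetical sketch creates a MYCODES system with the two required columns (display, code); the system name, file name, and entries are all invented examples:

```python
import csv, json

# Hypothetical custom coding system "MYCODES". The required columns
# ("display", "code") come from the README; the rows are invented.
entries = [
    {"display": "Acute myocardial infarction", "code": "MY-001"},
    {"display": "Pneumothorax", "code": "MY-002"},
]
with open("mycodes.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(w_f := f, fieldnames=["display", "code"])
    w.writeheader()
    w.writerows(entries)

# Registration value for CODING_SYSTEM_FILES (env) or
# coding_system_files (config):
registration = json.dumps({"MYCODES": "mycodes.csv"})
```

You would then drop mycodes.csv into your CODING_SYSTEMS_PATH directory and add MYCODES to coding_systems.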


TTS and languages

Out of the box we assume German: default is --tts-language de, and for that you need the Kartoffelbox patch (MODEL_REPO, T3_CHECKPOINT, HF_TOKEN in .env). For English or other languages, use --tts-language en; then we use the default Chatterbox model and you don’t need the patch—just put reference clips in the right language in SPEAKER_VOICES_PATH. Matching the reference audio to the target language helps a lot.


Speaker reference samples

For the TTS voices used in this project we recorded one uniform sentence per speaker in both German and English. Using the same sentence for every speaker lets us capture comparable voice patterns across all of them, which keeps the synthetic data more consistent.

The sentence we used is:

  • English: The patient was diagnosed with a malignant tumor and will undergo further radiological and histopathological evaluation.
  • German: Der Patient wurde mit einem malignen Tumor diagnostiziert und wird sich weiteren radiologischen und histopathologischen Untersuchungen unterziehen.

If you supply your own reference clips in SPEAKER_VOICES_PATH, you can use any content you like; the pipeline doesn’t depend on this exact sentence. Using a fixed sentence (or short script) across speakers is still a good idea if you want comparable voice characteristics in your dataset.


Customizing the LLM

If you want to change how the synthetic text reads or how we normalize for TTS, edit the prompts in mumblemed/utils/llm.py. generate_synthetic_medical_text() drives the narrative (specialty, language, structure); process_text_structure() controls how we expand abbreviations and punctuation for speech. Output locations are always set with --dataset-path and --csv-path (or the corresponding env vars)—there’s no separate “LLM audio path”.


Environment variables

The app reads .env from the current directory. R = required for that mode, O = optional.

| Variable | llm | real | Description |
| --- | --- | --- | --- |
| LLM_NAME | R | R | Model name for your endpoint. |
| LLM_ENDPOINT | R | R | Base URL (e.g. http://127.0.0.1:8000/v1). |
| LLM_API_KEY | R if hosted | R if hosted | API key for non-local endpoints. |
| LLM_DATASET_PATH | R | | Where to write generated audio. |
| LLM_CSV_PATH | R | | Where to write CSVs. |
| REAL_DATASET_PATH | | R | Base path for real-document runs. |
| SPEAKER_VOICES_PATH | R | R | Directory of reference audio files. |
| CODING_SYSTEMS_PATH | R | | Directory with coding system CSVs. |
| CODING_SYSTEMS | O | | Comma-separated, e.g. ICD,OPS,RADLEX. Default: all. |
| CODING_SYSTEM_FILES | O | | JSON for custom name → filename. |
| TTS_LANGUAGE | O | O | de or en (default de). |
| GPU_ID | O | O | CUDA device for TTS. |
| MODEL_REPO, T3_CHECKPOINT, HF_TOKEN | O | O | For German TTS (Kartoffelbox) only. |
| NUM_SYNTHETIC_DOCS, NUM_REAL_DOCUMENTS, NUM_WORKERS | O | O | Default doc counts and parallelism. |
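
For reference, an illustrative .env for a local llm-mode run; every value is a placeholder to replace with your own:

```ini
# Illustrative .env -- all values are placeholders.
LLM_NAME=my-llm-model
LLM_ENDPOINT=http://127.0.0.1:8000/v1
LLM_API_KEY=
LLM_DATASET_PATH=./out/audio
LLM_CSV_PATH=./out/csvs
SPEAKER_VOICES_PATH=./speaker_voices
CODING_SYSTEMS_PATH=./coding_systems
TTS_LANGUAGE=de
GPU_ID=0
```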

Troubleshooting

“Missing required …” — That option has to come from env or the CLI. For llm you need LLM_DATASET_PATH, LLM_CSV_PATH, SPEAKER_VOICES_PATH, and CODING_SYSTEMS_PATH set somewhere.

Dry-run first: mumblemed llm --dry-run and mumblemed real … --dry-run check paths and config without generating any audio. Use them when something’s misconfigured.

NLTK complaining — First run may need network access so NLTK can pull data; or pre-download the resources it asks for.

LLM / API — The endpoint has to speak the OpenAI Chat Completions API. For hosted services, set LLM_API_KEY.

No speaker voices: SPEAKER_VOICES_PATH must be a directory with at least one audio file; we use those as TTS references.

German TTS — For tts_language=de you need MODEL_REPO, T3_CHECKPOINT, and HF_TOKEN. For English, use --tts-language en and you can skip those.


Disclaimer

Ongoing research. This project is research software and is under active development. APIs and behaviour may change as we iterate.

Output quality. The pipeline can produce samples that are not fully correct or consistent (e.g. transcript errors, odd phrasing, or audio artefacts). We do not guarantee that every generated item is valid. If you use MumbleMED in your own project, you should add a quality-check step (e.g. automatic filters, spot checks, or human review) before relying on the data for training or evaluation.


License

MIT.
