A web-based voice cloning application powered by Qwen3-TTS. Generate natural-sounding speech in any voice using just a short audio sample.
First-Run Setup — The app guides you through downloading required models on first launch.
Voice Profiles — Create and manage voice profiles from reference audio samples.
Generate Speech — Synthesize speech with your cloned voice and preview with word-level timestamps.
Audio Library & Studio — Browse generated clips and combine them into sequences with timeline editing.
- Zero-shot voice cloning - Clone any voice with just 5-15 seconds of reference audio
- Multiple model support - Choose between 0.6B and 1.7B parameter models
  - `Qwen/Qwen3-TTS-12Hz-1.7B-Base` - Higher quality, more VRAM required
  - `Qwen/Qwen3-TTS-12Hz-0.6B-Base` - Faster, lower resource usage
- Voice profiles - Save reference audio + transcript for quick reuse
- Word-level timestamps - Automatic alignment for precise editing
- Timeline editor - Combine multiple clips into sequences with drag-and-drop
- Word-level editing - Select and delete words directly from the transcript
- Waveform trimming - Visual trim controls with real-time preview
- Gap controls - Add silence between clips
- Undo/redo - Full edit history support
- Seamless preview - Server-side audio combining for gapless playback
- Export - Download as WAV or MP3
- Project management - Save and load studio projects
- Cross-platform Whisper - Auto-detects and uses the best backend for your hardware:
  - Apple Silicon: `mlx-whisper` with `mlx-community/whisper-large-v3-turbo`
  - NVIDIA GPU / CPU: `faster-whisper` with `Systran/faster-whisper-large-v3-turbo`
- Word timestamps - Precise timing for each word
- Python 3.10-3.12 (3.13+ not yet supported due to onnxruntime dependency)
- FFmpeg (for audio processing)
- CUDA-compatible GPU recommended (CPU works but slower)
- Clone the repository:

  ```bash
  git clone https://github.com/transcriptionstream/mimic.git
  cd mimic
  ```

- Create a virtual environment (use Python 3.10-3.12):

  ```bash
  python3.12 -m venv venv          # or python3.11, python3.10
  source venv/bin/activate         # On Windows: venv\Scripts\activate
  ```

- Install dependencies (automatically installs the correct Whisper backend for your platform):

  ```bash
  pip install -r requirements.txt
  ```

- Start the server:

  ```bash
  python app.py
  ```

- Open http://localhost:8000 in your browser
On first launch, the app will guide you through downloading the required models:
- TTS Model (~3.4 GB) - For voice synthesis
- Whisper Model (~1.5 GB) - For transcription
- Upload reference audio - Record or upload 5-15 seconds of the voice you want to clone
- Add transcript - Type what was said in the reference audio (or use auto-transcribe)
- Enter target text - Type what you want the cloned voice to say
- Generate - Click generate and wait for the audio
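
Conceptually, the four steps above boil down to one generation request carrying the reference audio, its transcript, and the target text. The sketch below shows a plausible request shape; the field names and helper function are illustrative assumptions, not mimic's documented API:

```python
# Hypothetical shape of a generation request (field names are illustrative,
# not mimic's documented API).
def build_generation_request(
    ref_audio_path: str,
    ref_transcript: str,
    target_text: str,
    model: str = "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
) -> dict:
    return {
        "reference_audio": ref_audio_path,  # 5-15 s sample of the voice to clone
        "reference_text": ref_transcript,   # what is said in the sample
        "text": target_text,                # what the cloned voice should say
        "model": model,                     # which TTS checkpoint to use
    }


req = build_generation_request(
    "uploads/voice.wav", "Hello there.", "Welcome to mimic."
)
```
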
- After uploading reference audio and transcript, click "Save as Profile"
- Give your profile a name
- Select the profile from the dropdown for future generations
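
Since a profile bundles reference audio with its transcript, the saved metadata plausibly looks something like the JSON below. This layout is an illustrative assumption; field names and paths are not taken from mimic's actual `profiles.py`:

```json
{
  "name": "narrator",
  "reference_audio": "data/profiles/narrator/reference.wav",
  "transcript": "The quick brown fox jumps over the lazy dog."
}
```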
- Click "Audio Studio" to open the editor
- Add clips from your generation history to the timeline
- Drag to reorder, click to select and trim
- Use word-level editing to remove unwanted words
- Preview your sequence, then export
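
The server-side combining behind the seamless preview can be sketched with NumPy: concatenate the clips into one buffer, inserting the configured silence gap between them. This is a minimal sketch under assumed names and a 24 kHz mono format, not mimic's actual implementation:

```python
# Minimal sketch of server-side clip combining (assumed approach):
# join mono float32 clips, inserting silence between them, so the
# preview plays back gaplessly from a single buffer.
import numpy as np


def combine_clips(clips, gap_seconds=0.0, sample_rate=24000):
    """Concatenate mono clips with gap_seconds of silence between each pair."""
    gap = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        if i > 0 and gap.size:
            parts.append(gap)
        parts.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(parts) if parts else np.zeros(0, dtype=np.float32)


a = np.ones(24000, dtype=np.float32)  # 1.0 s clip
b = np.ones(12000, dtype=np.float32)  # 0.5 s clip
combined = combine_clips([a, b], gap_seconds=0.25)
# 24000 + 6000 (gap) + 12000 = 42000 samples
```

The combined buffer could then be written out with soundfile (e.g. `sf.write("preview.wav", combined, 24000)`) for the WAV export path.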
- Backend: FastAPI, Python
- Frontend: Vanilla JavaScript, CSS
- TTS Engine: Qwen3-TTS
- Transcription: Whisper Large V3 Turbo via mlx-whisper or faster-whisper
- Audio Processing: soundfile, numpy, pydub
```
mimic/
├── app.py                 # FastAPI application
├── mimic/
│   ├── models.py          # Model download management
│   ├── tts.py             # Qwen3-TTS integration
│   ├── transcribe.py      # Whisper transcription
│   ├── profiles.py        # Voice profile management
│   ├── history.py         # Generation history
│   └── studio.py          # Audio studio backend
├── static/
│   ├── index.html         # Main UI
│   ├── styles.css         # Styling
│   └── js/
│       ├── main.js        # Entry point
│       ├── app.js         # Main application class
│       ├── studio.js      # Audio studio module
│       └── ui.js          # UI utilities
└── data/
    ├── models/            # Downloaded TTS models
    ├── profiles/          # Saved voice profiles
    ├── uploads/           # Uploaded audio files
    └── history/           # Generated audio history
```
MIT License








