Mimic

A web-based voice cloning application powered by Qwen3-TTS. Generate natural-sounding speech in any voice using just a short audio sample.

Screenshots

First-Run Setup — The app guides you through downloading required models on first launch.

Screenshots: Welcome, Downloading, Ready

Voice Profiles — Create and manage voice profiles from reference audio samples.

Screenshots: Getting Started, Voice Profiles, Create Profile

Generate Speech — Generate speech with your cloned voice and preview with word-level timestamps.

Screenshot: Generate

Audio Library & Studio — Browse generated clips and combine them into sequences with timeline editing.

Screenshots: Library, Audio Studio

Features

Voice Cloning

  • Zero-shot voice cloning - Clone any voice with just 5-15 seconds of reference audio
  • Multiple model support - Choose between 0.6B and 1.7B parameter models
    • Qwen/Qwen3-TTS-12Hz-1.7B-Base - Higher quality, more VRAM required
    • Qwen/Qwen3-TTS-12Hz-0.6B-Base - Faster, lower resource usage
  • Voice profiles - Save reference audio + transcript for quick reuse
  • Word-level timestamps - Automatic alignment for precise editing

Audio Studio

  • Timeline editor - Combine multiple clips into sequences with drag-and-drop
  • Word-level editing - Select and delete words directly from the transcript
  • Waveform trimming - Visual trim controls with real-time preview
  • Gap controls - Add silence between clips
  • Undo/redo - Full edit history support
  • Seamless preview - Server-side audio combining for gapless playback
  • Export - Download as WAV or MP3
  • Project management - Save and load studio projects
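The timeline and gap controls come down to placing clips end to end with optional silence between them. The following is a minimal sketch of that layout arithmetic, not the code in studio.py; the `Clip` class, its field names, and `layout_timeline` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    name: str                # hypothetical label for a generated clip
    duration: float          # clip length in seconds
    gap_after: float = 0.0   # silence to insert after this clip, in seconds


def layout_timeline(clips):
    """Return ([(name, start, end), ...], total_length) for a clip sequence."""
    placed = []
    cursor = 0.0
    for i, clip in enumerate(clips):
        start = cursor
        end = start + clip.duration
        placed.append((clip.name, start, end))
        cursor = end
        # Silence only goes between clips, never after the last one.
        if i < len(clips) - 1:
            cursor += clip.gap_after
    return placed, cursor
```

Given the start and end of every clip, a server-side combiner can write each clip at its offset and fill the remaining spans with silence, which is one way to get gapless preview playback.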

Transcription

  • Cross-platform Whisper - Auto-detects and uses the best backend for your hardware:
    • Apple Silicon: mlx-whisper with mlx-community/whisper-large-v3-turbo
    • NVIDIA GPU / CPU: faster-whisper with Systran/faster-whisper-large-v3-turbo
  • Word timestamps - Precise timing for each word
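As a rough illustration of the backend selection described above, the sketch below picks a backend from the platform alone. It is an assumption about how such detection could work, not the logic in transcribe.py; `pick_whisper_backend` is a hypothetical helper, and only the package and model names come from this README.

```python
import platform


def pick_whisper_backend(system=None, machine=None):
    """Return (package, model) for the current or the given platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Native Python on Apple Silicon reports Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return ("mlx-whisper", "mlx-community/whisper-large-v3-turbo")
    # NVIDIA GPU and CPU-only machines both use faster-whisper.
    return ("faster-whisper", "Systran/faster-whisper-large-v3-turbo")
```

Calling `pick_whisper_backend()` with no arguments inspects the running machine; passing explicit values makes the choice easy to check for other platforms.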

Installation

Prerequisites

  • Python 3.10-3.12 (3.13+ not yet supported due to onnxruntime dependency)
  • FFmpeg (for audio processing)
  • CUDA-compatible GPU recommended (CPU works but is slower)
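Before installing, the first two prerequisites can be checked with a short script. This is a sketch that uses only the Python standard library and is not part of the repository; the function names are hypothetical.

```python
import shutil
import sys


def python_version_supported(version):
    """True for Python 3.10 through 3.12, the range listed above."""
    major_minor = tuple(version[:2])
    return (3, 10) <= major_minor <= (3, 12)


def check_prerequisites():
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if not python_version_supported(sys.version_info):
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            "is outside the supported 3.10-3.12 range"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

The script does not check for a GPU, since the GPU is recommended rather than required.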

Setup

  1. Clone the repository:
       git clone https://github.com/transcriptionstream/mimic.git
       cd mimic
  2. Create a virtual environment (use Python 3.10-3.12):
       python3.12 -m venv venv  # or python3.11, python3.10
       source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies (automatically installs the correct Whisper backend for your platform):
       pip install -r requirements.txt
  4. Start the server:
       python app.py
  5. Open http://localhost:8000 in your browser
  6. On first launch, the app will guide you through downloading the required models:

    • TTS Model (~3.4 GB) - For voice synthesis
    • Whisper Model (~1.5 GB) - For transcription

Usage

Basic Voice Cloning

  1. Upload reference audio - Record or upload 5-15 seconds of the voice you want to clone
  2. Add transcript - Type what was said in the reference audio (or use auto-transcribe)
  3. Enter target text - Type what you want the cloned voice to say
  4. Generate - Click generate and wait for the audio
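Step 1 asks for 5-15 seconds of reference audio. For an uncompressed WAV file, the length can be checked ahead of upload with the standard library. The helper names below are hypothetical, and other formats such as MP3 would need a different reader.

```python
import wave


def reference_duration_seconds(path):
    """Duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(seconds, low=5.0, high=15.0):
    """True when the clip falls inside the 5-15 second range suggested above."""
    return low <= seconds <= high
```

For example, `is_good_reference_length(reference_duration_seconds("sample.wav"))` reports whether a hypothetical `sample.wav` is within range.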

Creating Voice Profiles

  1. After uploading reference audio and transcript, click "Save as Profile"
  2. Give your profile a name
  3. Select the profile from the dropdown for future generations
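Conceptually, a profile pairs a name with a reference audio file and its transcript. The sketch below stores that as one JSON file per profile. The actual on-disk format used by profiles.py is not documented here, so the `VoiceProfile` fields, file layout, and function names are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class VoiceProfile:
    name: str          # display name chosen by the user
    audio_path: str    # path to the reference audio file
    transcript: str    # what is said in the reference audio


def save_profile(profile, profiles_dir):
    """Write the profile as <profiles_dir>/<name>.json and return the path."""
    directory = Path(profiles_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{profile.name}.json"
    target.write_text(json.dumps(asdict(profile), indent=2), encoding="utf-8")
    return target


def load_profile(name, profiles_dir):
    """Read a profile previously written by save_profile."""
    text = (Path(profiles_dir) / f"{name}.json").read_text(encoding="utf-8")
    return VoiceProfile(**json.loads(text))
```

Keeping the transcript next to the audio is what makes a profile reusable: neither has to be supplied again for later generations.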

Audio Studio

  1. Click "Audio Studio" to open the editor
  2. Add clips from your generation history to the timeline
  3. Drag to reorder, click to select and trim
  4. Use word-level editing to remove unwanted words
  5. Preview your sequence, then export
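Word-level editing in step 4 relies on the word timestamps: removing a word means cutting its time span out of the clip. The sketch below turns a set of deleted word indices into the audio ranges to keep. It is an illustration of the idea, not the studio's implementation; `keep_ranges` is a hypothetical helper and assumes the words are sorted by start time.

```python
def keep_ranges(words, deleted, clip_duration):
    """Return [(start, end), ...] of audio to keep after deleting words.

    words: list of (text, start_seconds, end_seconds), sorted by start time
    deleted: set of indices into words that should be removed
    clip_duration: total clip length in seconds
    """
    ranges = []
    cursor = 0.0  # start of the current stretch of kept audio
    for i, (_text, start, end) in enumerate(words):
        if i in deleted:
            # Keep everything from the cursor up to the deleted word.
            if start > cursor:
                ranges.append((cursor, start))
            cursor = max(cursor, end)
    # Keep whatever remains after the last deleted word.
    if cursor < clip_duration:
        ranges.append((cursor, clip_duration))
    return ranges
```

Concatenating the kept ranges in order yields the edited clip. Adjacent deleted words collapse into a single cut because the cursor only ever moves forward.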

Tech Stack

  • Backend: FastAPI, Python
  • Frontend: Vanilla JavaScript, CSS
  • TTS Engine: Qwen3-TTS
  • Transcription: Whisper Large V3 Turbo via mlx-whisper or faster-whisper
  • Audio Processing: soundfile, numpy, pydub

Project Structure

mimic/
├── app.py              # FastAPI application
├── mimic/
│   ├── models.py       # Model download management
│   ├── tts.py          # Qwen3-TTS integration
│   ├── transcribe.py   # Whisper transcription
│   ├── profiles.py     # Voice profile management
│   ├── history.py      # Generation history
│   └── studio.py       # Audio studio backend
├── static/
│   ├── index.html      # Main UI
│   ├── styles.css      # Styling
│   └── js/
│       ├── main.js     # Entry point
│       ├── app.js      # Main application class
│       ├── studio.js   # Audio studio module
│       └── ui.js       # UI utilities
└── data/
    ├── models/         # Downloaded TTS models
    ├── profiles/       # Saved voice profiles
    ├── uploads/        # Uploaded audio files
    └── history/        # Generated audio history

License

MIT License
