A fully self-contained, locally-hosted web application providing comprehensive AI capabilities including Text-to-Speech, Speech-to-Text, Large Language Model chat, Autonomous Agent Mode, and a Visual Pipeline Builder.
- High-quality speech synthesis using ChatterboxTTS
- RTX 50-Series Optimization: FP16 precision, CUDA Graphs, and embedding caching for lightning-fast generation
- Voice cloning with reference audio support
- Adjustable parameters (temperature, exaggeration, CFG weight)
- Accurate transcription using Faster-Whisper
- Support for audio file upload
- Live microphone recording and transcription
- Multiple language support
- Configurable model sizes (tiny to large-v3-turbo)
- Context-aware conversations powered by Ollama
- Web Search Plugin: Real-time internet browsing via DuckDuckGo to answer questions about current events
- Calculator Plugin: Solves math problems accurately
- Vision Plugin: Analyzes screenshots and images
- Chat history management
- Screen Analysis: The AI can "see" your screen and understand UI elements
- Auto-Execution: Performs multi-step tasks (clicking, typing, navigating) autonomously
- Task Planning: Breaks down complex goals into actionable steps
- Permission System: Granular control over what the agent can do (mouse, keyboard, file system)
- Drag-and-drop interface to create custom AI workflows
- Connect blocks: Mic → STT → LLM → TTS → Audio Output
- Mobile-Friendly: Responsive design that adapts layout (vertical on PC, horizontal on mobile)
- Real-time execution logging
- Smart Self-Healing Setup: Automatically detects missing environments, creates virtual environments, and resolves dependency conflicts (e.g., fixing broken pip installs).
- System Integrity Checks: Verifies critical folders and files on every startup.
- Real-time system status monitoring (CPU, RAM, GPU VRAM)
- GPU/CPU detection with automatic fallback
- Model configuration display
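The status readout can be approximated with the standard library alone. This is only a sketch of the idea; the app itself presumably uses something like `psutil` for CPU/RAM percentages and `torch.cuda` for VRAM, which are omitted here:

```python
import os
import shutil

def system_status(path="."):
    """Rough status snapshot using only the standard library.
    (A fuller version would add psutil for CPU/RAM usage and
    torch.cuda.mem_get_info() for GPU VRAM.)"""
    du = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),
        "disk_free_gb": round(du.free / 1e9, 1),
        "disk_total_gb": round(du.total / 1e9, 1),
    }

print(system_status())
```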
Cloning the repository is not enough! You must ensure the AI models are available:
- Install Ollama: Download from ollama.com.
- Pull a Chat Model: Run `ollama pull llama3.1` (or `mistral`, `gemma`, or any other model your system can run) in your terminal.
- First Run Downloads: The app automatically downloads the TTS and STT models on first launch. Ensure you have a stable internet connection.
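To verify that Ollama is running and see which models are installed, you can query its local HTTP API (the `/api/tags` endpoint). A small sketch that returns an empty list when the server is down:

```python
import json
import urllib.request

def installed_ollama_models(host="http://localhost:11434"):
    """List locally installed Ollama models via its /api/tags endpoint.
    Returns [] if the Ollama server is not reachable."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return []

print(installed_ollama_models())
```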
```powershell
# Run the application (it installs dependencies automatically; this may not be enough, so double-check!)
.\start.ps1
# OR
.\start.bat
```

```bash
# Install dependencies first
pip install -r requirements.txt
# Run the application
python web_app.py
```

Then open your browser to: https://localhost:5000
Note: You will see a "Not Secure" warning because the app uses a self-signed certificate. This is required for microphone access. Click "Advanced" -> "Proceed to localhost (unsafe)" to continue.
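The warning exists because browsers only allow microphone access (`getUserMedia`) in a secure context, so the app must serve HTTPS even locally. In Flask this is a one-liner via `ssl_context`; the sketch below is an illustration, not the app's actual startup code, and `"adhoc"` requires the `pyopenssl` package:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    # "adhoc" generates a throwaway self-signed certificate on startup,
    # which is exactly what triggers the browser's "Not Secure" warning.
    app.run(host="0.0.0.0", port=5000, ssl_context="adhoc")
```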
- Python: 3.11 (the voice-cloning TTS strictly requires 3.11)
- RAM: 16GB minimum (32GB recommended for Agent Mode or smarter models)
- GPU: NVIDIA GPU with CUDA support (RTX 3050 or better recommended; RTX 50-series optimized)
- Storage: 20GB+ free space for models
All dependencies are automatically installed on first run:
- Flask (web framework)
- PyTorch & TorchAudio (deep learning)
- ChatterboxTTS (text-to-speech)
- Faster-Whisper (speech-to-text)
- Ollama (LLM backend)
- DuckDuckGo Search (web browsing)
- PyAutoGUI & Pillow (screen interaction)
- SoundDevice & SoundFile (audio processing)
```
├── web_app.py          # Main Flask application with auto-dependency management
├── agent.py            # Autonomous agent logic (screen analysis, task execution)
├── tts_optimizer.py    # RTX 50-series specific optimizations
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── start.ps1           # PowerShell startup script (with integrity checks)
├── start.bat           # Batch startup script (with integrity checks)
├── setup/
│   ├── smart_setup.py  # Self-healing dependency installer
│   └── ...
├── templates/          # HTML templates
│   ├── index.html      # Home page
│   ├── tts.html        # Text-to-Speech page
│   ├── stt.html        # Speech-to-Text page
│   ├── chat.html       # AI Chat page
│   ├── agent.html      # Agent Mode page
│   ├── pipeline.html   # Visual Pipeline Builder
│   └── ...
├── static/             # Static assets (CSS, JS)
├── uploads/            # Uploaded files (auto-created)
└── outputs/            # Generated audio files (auto-created)
```
- View system status
- Quick navigation to all features
- Real-time model status updates
- Click "Initialize TTS" to load the model
- Enter text in the text box
- (Optional) Upload reference audio for voice cloning
- Click "Generate Speech"
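Under the hood, a minimal ChatterboxTTS call looks roughly like the sketch below. Parameter names follow the Chatterbox project, but treat the exact signature as an assumption rather than this app's code; the import is guarded so the sketch loads even without the library installed:

```python
# Sketch only: requires `pip install chatterbox-tts` and ideally a CUDA GPU.
try:
    from chatterbox.tts import ChatterboxTTS
except ImportError:  # library not installed; keep the sketch importable
    ChatterboxTTS = None

def synthesize(text, ref_audio=None, device="cuda"):
    """Generate speech; passing ref_audio enables voice cloning.
    Returns None when ChatterboxTTS is unavailable."""
    if ChatterboxTTS is None:
        return None
    model = ChatterboxTTS.from_pretrained(device=device)
    return model.generate(
        text,
        audio_prompt_path=ref_audio,  # optional voice-cloning reference
        temperature=0.8,              # sampling randomness
        exaggeration=0.5,             # expressiveness
        cfg_weight=0.5,               # classifier-free guidance strength
    )
```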
- Click "Initialize STT" and choose model size (Turbo recommended)
- Upload a file or use "Live Recording" to transcribe speech
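Transcription with Faster-Whisper boils down to a couple of calls; a hedged sketch (the model size and compute-type choices mirror the options listed later under configuration, and the import is guarded in case the library is missing):

```python
# Sketch: requires `pip install faster-whisper`.
try:
    from faster_whisper import WhisperModel
except ImportError:
    WhisperModel = None

def transcribe(audio_path, size="large-v3-turbo", device="cuda"):
    """Transcribe an audio file; returns None if faster-whisper is missing."""
    if WhisperModel is None:
        return None
    compute = "float16" if device == "cuda" else "int8"
    model = WhisperModel(size, device=device, compute_type=compute)
    segments, info = model.transcribe(audio_path)
    return "".join(seg.text for seg in segments).strip()
```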
- Click "Initialize LLM" to connect to Ollama
- Enable plugins like Web Search or Vision
- Ask questions about current events or math problems
- The AI will use tools to provide accurate answers
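The Calculator plugin's core idea, evaluating arithmetic deterministically instead of letting the LLM guess, can be sketched with a safe AST walker. This is an illustration, not the app's actual plugin code:

```python
import ast
import operator

# Whitelist of allowed operations: anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calc(expr):
    """Safely evaluate a pure-arithmetic expression string."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

print(calc("2 * (3 + 4)"))  # 14
```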
- Click "Initialize Agent"
- Select an Ollama model (e.g., LLaVA or Qwen3-VL for vision)
- Type a task: "Open Notepad and write a poem about AI"
- Watch as the agent takes control of your mouse and keyboard to complete the task
- Emergency Stop: Click "Cancel Task" at any time
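The permission system described above can be sketched as a simple gate between the planner and the executor. All names here are hypothetical, chosen only to show the shape of the idea:

```python
from dataclasses import dataclass

@dataclass
class Permissions:
    """Granular grants the user toggles before a task runs."""
    mouse: bool = False
    keyboard: bool = False
    filesystem: bool = False

@dataclass
class Step:
    action: str  # e.g. "click", "type", "save"
    needs: str   # which permission the step requires

def allowed_steps(plan, perms):
    """Filter a plan down to steps the user has granted permission for."""
    return [s for s in plan if getattr(perms, s.needs)]

plan = [Step("click", "mouse"), Step("type", "keyboard"), Step("save", "filesystem")]
granted = Permissions(mouse=True, keyboard=True)  # filesystem stays denied
print([s.action for s in allowed_steps(plan, granted)])  # ['click', 'type']
```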
- Drag blocks from the sidebar (Mic, STT, LLM, TTS, Speaker)
- Connect them to form a chain
- Click "Run Pipeline" to execute the flow step-by-step
- Great for testing custom interactions without coding
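Conceptually, running a pipeline just threads each block's output into the next block while logging progress. A minimal sketch with stub blocks standing in for Mic → STT → LLM → TTS → Speaker (not the builder's real execution engine):

```python
def run_pipeline(blocks, payload):
    """Execute connected blocks in order: each block's output feeds the next."""
    log = []
    for name, fn in blocks:
        payload = fn(payload)
        log.append(f"{name}: ok")
    return payload, log

# Stub blocks: real ones would call the STT/LLM/TTS backends.
blocks = [
    ("STT", lambda audio: "transcribed text"),
    ("LLM", lambda text: f"reply to: {text}"),
    ("TTS", lambda text: b"wav-bytes"),
]
out, log = run_pipeline(blocks, b"raw-audio")
print(log)  # ['STT: ok', 'LLM: ok', 'TTS: ok']
```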
- Device: Auto-detected (GPU if available, else CPU)
- RTX 50-Series: Automatically enables FP16 and CUDA Graphs if detected
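The auto-detection described above amounts to a small decision rule. The sketch below is a heuristic illustration, not the app's actual logic (in practice the inputs would come from `torch.cuda.is_available()` and `torch.cuda.get_device_name(0)`):

```python
def pick_device_and_dtype(cuda_available, gpu_name=""):
    """Heuristic: use the GPU when present, and FP16 on RTX 50-series cards."""
    if not cuda_available:
        return "cpu", "float32"
    if "RTX 50" in gpu_name:
        return "cuda", "float16"
    return "cuda", "float32"

print(pick_device_and_dtype(True, "NVIDIA GeForce RTX 5090"))  # ('cuda', 'float16')
```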
- Model Size: tiny, base, small, medium, large-v3, large-v3-turbo
- Compute Type: Auto, float16 (GPU), int8 (CPU)
- Provider: Ollama (local), OpenAI, Anthropic
- Model: Select from installed Ollama models
This application is designed for LOCAL USE ONLY
- Agent Mode: Grants the AI control over your mouse and keyboard. Use with caution and monitor execution.
- Web Server: Runs on all network interfaces (0.0.0.0) by default for local network access (so you can use it on other devices).
- No Authentication: Do not expose to the public internet.
Copyright © 2026 Zitacron. All rights reserved.
Made with ❤️ for easy AI interaction