We covered Chatterbox TTS earlier this month for command-line voice cloning. Voicebox takes a different approach: a polished desktop app with a graphical interface, five TTS engines bundled, and tools for creating multi-voice productions.
Jamie Pine’s open-source project packages everything into one download. No Docker, no Python environments, no command-line setup.
What You Get
Voicebox runs entirely on your machine. Your voice samples, generated audio, and model weights never leave your computer.
The app bundles five text-to-speech engines:
| Engine | Languages | Strength |
|---|---|---|
| Qwen3-TTS | 10 | Highest quality, natural intonation, delivery instructions |
| LuxTTS | 1 (English) | Lightweight, CPU-efficient |
| Chatterbox Multilingual | 23 | Broadest language support |
| Chatterbox Turbo | 1 (English) | Fast, supports emotion tags like [laugh] |
| TADA (HumeAI) | 1 (English) | Extended coherent speech |
Voice cloning works from as little as 3 seconds of audio. You can upload a file, record from your microphone, or capture system audio playing on your computer.
Beyond basic cloning, Voicebox includes:
- Timeline editor for multi-voice projects with track arrangement
- Audio effects like pitch shift, reverb, delay, compression, and filters
- Presets to save and reuse effect chains
- REST API for programmatic access from other tools
- Whisper integration for automatic transcription of reference audio
Requirements
Minimum specs:
- macOS 11+ or Windows 10+
- 8GB RAM
- 5GB free disk space
Recommended:
- 16GB RAM
- NVIDIA GPU with CUDA support
- 10GB+ free disk space
Voicebox handles GPU acceleration automatically: MLX on Apple Silicon, CUDA on NVIDIA, ROCm on AMD, DirectML on Windows. CPU inference works but runs slower.
Linux builds aren’t available yet. You can build from source if you’re comfortable with Tauri/Rust development.
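Before downloading, it can help to confirm the disk has room for the app and its models. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the 5GB and 10GB figures above can be treated as binary gigabytes.

```python
import shutil

GIB = 1024 ** 3  # assumption: treat the article's "GB" as GiB


def free_gib(path="."):
    """Free space on the volume containing `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def meets_disk_requirement(path=".", required_gib=5):
    """True if the volume has at least `required_gib` GiB free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    print(f"Free space: {free_gib():.1f} GiB")
    print("Minimum (5GB):", meets_disk_requirement(required_gib=5))
    print("Recommended (10GB):", meets_disk_requirement(required_gib=10))
```

Pass the path of the drive where the app data will live if it differs from the current directory.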
Installation
Download from the releases page:
macOS:
- Apple Silicon: voicebox_aarch64.app.tar.gz
- Intel: voicebox_x64.app.tar.gz
Extract the archive and drag the app to /Applications/.
Windows:
- MSI: voicebox_x64_en-US.msi
- Setup: voicebox_x64-setup.exe
Run the installer and follow the prompts.
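If you are scripting the download, picking the right file comes down to the operating system and CPU architecture. Here is a rough sketch: the file names are the ones listed in this article (check the releases page for the current ones), the `pick_asset` helper is hypothetical, and on Windows it arbitrarily prefers the MSI over the setup executable.

```python
import platform

# Asset names as listed in this article; they may differ on the releases page.
ASSETS = {
    ("Darwin", "arm64"): "voicebox_aarch64.app.tar.gz",   # Apple Silicon
    ("Darwin", "x86_64"): "voicebox_x64.app.tar.gz",      # Intel Mac
    ("Windows", "AMD64"): "voicebox_x64_en-US.msi",       # 64-bit Windows
}


def pick_asset(system=None, machine=None):
    """Return the asset name for this machine, or None if no build is listed."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return ASSETS.get((system, machine))


if __name__ == "__main__":
    print(pick_asset() or "No prebuilt download listed; build from source.")
```

On Linux the lookup returns `None`, which matches the current lack of prebuilt Linux downloads.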
On first launch, Voicebox downloads the Qwen3-TTS model (2-4GB). This is a one-time download; the model is stored locally afterward. A green status indicator in the bottom-left corner confirms the backend is running.
Creating Your First Voice Clone
- Create a voice profile - Go to Voice Profiles, click New, and provide audio. You have three options: upload a file, record with your mic, or capture system audio
- Name and tag - Give it a recognizable name and set the language
- Generate - Open the generation panel, select your profile, type text, click generate
- Review - Output appears in your generation history with playback controls
For best results, use clean audio without background noise. Natural speech patterns work better than monotone reading. 10-30 seconds is ideal.
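To check a reference clip against that guidance before uploading it, you can measure its length. This sketch uses Python's standard `wave` module, so it only handles uncompressed WAV files. The function names and category labels are hypothetical; the 3-second and 10-30 second thresholds come from this article.

```python
import wave


def wav_seconds(path):
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_clip(seconds):
    """Classify a clip length against the guidance in this article."""
    if seconds < 3:
        return "too short"          # below the 3-second minimum
    if seconds < 10:
        return "usable"             # works, but shorter than ideal
    if seconds <= 30:
        return "ideal"              # the suggested 10-30 second range
    return "over the suggested range"
```

For example, `rate_clip(wav_seconds("sample.wav"))` labels a clip stored at the hypothetical path `sample.wav`. Length is only one factor; the script says nothing about background noise or delivery.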
Multi-Voice Projects
The timeline editor lets you build longer productions with multiple voices:
- Create separate voice profiles for each character
- Open the timeline view
- Add text clips to different tracks
- Arrange, trim, and overlap as needed
- Apply per-track or global effects
- Export the mixed result
This is useful for podcasts, audiobooks, or any content with multiple speakers.
Audio Effects
Built-in effects include:
- Pitch shifting
- Reverb and delay
- Chorus
- Compression
- EQ and filters
Save combinations as presets. Effects apply non-destructively - you can always go back to the raw generated audio.
API Access
Voicebox runs a local REST API for integration with other tools. The server starts automatically with the desktop app.
Documentation lives at http://localhost:PORT/docs once running. Use cases include batch processing, automation scripts, or feeding generated audio into other applications.
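A script that depends on the API can first confirm the backend is reachable. The sketch below only requests the `/docs` page mentioned above and uses Python's standard library. The port is left as a parameter because this article does not give it, and no generation endpoints are assumed; look those up in the docs page itself. The helper names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def docs_url(port, host="localhost"):
    """Build the docs URL for a given port (the port is yours to supply)."""
    return f"http://{host}:{port}/docs"


def backend_is_up(port, timeout=2.0):
    """True if the docs page answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(docs_url(port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
```

Call `backend_is_up(port)` with the port your install reports, and only start a batch job once it returns `True`.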
What This Replaces
ElevenLabs pricing for comparison:
| Plan | Cost/Month | Characters/Month |
|---|---|---|
| Free | $0 | 10,000 |
| Starter | $5 | 30,000 |
| Creator | $22 | 100,000 |
| Pro | $99 | 500,000 |
| Scale | $330 | 2,000,000 |
Voicebox: $0, unlimited characters, nothing uploaded to anyone’s servers.
The tradeoff is that you supply the hardware and wait for the initial model download. If you generate voice audio regularly, the savings add up quickly: the Creator plan alone comes to $264 a year.
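To put numbers on that comparison, the short sketch below works out the yearly cost and the price per 1,000 characters for each paid plan. The prices are the ones in the table above and may have changed since; the helper names are hypothetical.

```python
# (plan, USD per month, characters per month), copied from the table above
PLANS = [
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
]


def annual_cost(monthly_usd):
    """Twelve months at the listed monthly price."""
    return monthly_usd * 12


def usd_per_1k_chars(monthly_usd, chars):
    """Effective price per 1,000 characters if the allowance is fully used."""
    return monthly_usd / chars * 1000


def cheapest_plan(chars_needed):
    """Lowest-priced paid plan whose allowance covers chars_needed, else None."""
    for name, _usd, chars in PLANS:  # PLANS is ordered by price
        if chars >= chars_needed:
            return name
    return None


for name, usd, chars in PLANS:
    print(f"{name}: ${annual_cost(usd)}/year, "
          f"${usd_per_1k_chars(usd, chars):.3f} per 1,000 characters")
```

Someone generating 250,000 characters a month would need the Pro plan, which is $1,188 a year at the listed price.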
Privacy Notes
All processing happens locally. Voicebox stores data in platform-specific locations:
- macOS: ~/Library/Application Support/com.voicebox.app/
- Windows: %APPDATA%/com.voicebox.app/
- Linux: ~/.config/com.voicebox.app/
No telemetry, no account required, no cloud dependency. The MIT license allows commercial use.
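For backup or cleanup scripts, the storage locations listed above can be resolved in code. This is a sketch under a few assumptions: the helper name is hypothetical, the Linux branch uses the fixed `~/.config` path from this article rather than honoring `XDG_CONFIG_HOME`, and the Windows branch expects `APPDATA` to be set.

```python
import os
import sys
from pathlib import Path

APP_ID = "com.voicebox.app"  # directory name as given in this article


def voicebox_data_dir(platform=sys.platform, env=os.environ, home=None):
    """Return the data directory described in this article for a platform."""
    home = Path(home) if home else Path.home()
    if platform == "darwin":
        return home / "Library" / "Application Support" / APP_ID
    if platform.startswith("win"):
        return Path(env["APPDATA"]) / APP_ID
    # Anything else is treated as Linux.
    return home / ".config" / APP_ID


if __name__ == "__main__":
    data_dir = voicebox_data_dir()
    print(data_dir, "(exists)" if data_dir.exists() else "(not found)")
```

The function only builds the path; it never creates or modifies anything in that directory.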
Chatterbox vs Voicebox
Both are legitimate options. Quick comparison:
- Chatterbox TTS Server: Docker-based, web UI, single engine family, lighter setup, command-line/API focused
- Voicebox: Native desktop app, five engines, timeline editor, effects processing, designed for production work
If you just need quick voice cloning with minimal setup, Chatterbox TTS Server is simpler. If you’re building longer content or want the effects pipeline, Voicebox offers more.
What You Can Do
- Download from voicebox.sh or GitHub releases
- Install and let it download the initial model
- Create a voice profile from a short audio clip
- Generate your first clone
- Explore the timeline editor if you’re building multi-voice content
Standard caution: voice cloning technology can be misused. Get consent before cloning someone’s voice, and don’t use this for impersonation or fraud.