Self-Host Voicebox: A Desktop Voice Cloning Studio That Runs Offline

We covered Chatterbox TTS earlier this month for command-line voice cloning. Voicebox takes a different approach: a polished desktop app with a graphical interface, five TTS engines bundled, and tools for creating multi-voice productions.

Jamie Pine’s open-source project packages everything into one download. No Docker, no Python environments, no command-line setup.

What You Get

Voicebox runs entirely on your machine. Your voice samples, generated audio, and model weights never leave your computer.

The app bundles five text-to-speech engines:

Engine	Languages	Strength
Qwen3-TTS	10	Highest quality, natural intonation, delivery instructions
LuxTTS	1 (English)	Lightweight, CPU-efficient
Chatterbox Multilingual	23	Broadest language support
Chatterbox Turbo	1 (English)	Fast, supports emotion tags like [laugh]
TADA (HumeAI)	1 (English)	Extended coherent speech

Voice cloning works from as little as 3 seconds of audio. You can upload a file, record from your microphone, or capture system audio playing on your computer.

Beyond basic cloning, Voicebox includes:

Timeline editor for multi-voice projects with track arrangement
Audio effects like pitch shift, reverb, delay, compression, and filters
Presets to save and reuse effect chains
REST API for programmatic access from other tools
Whisper integration for automatic transcription of reference audio

Requirements

Minimum specs:

macOS 11+ or Windows 10+
8GB RAM
5GB free disk space

Recommended:

16GB RAM
NVIDIA GPU with CUDA support
10GB+ free disk space

Voicebox handles GPU acceleration automatically: MLX on Apple Silicon, CUDA on NVIDIA, ROCm on AMD, DirectML on Windows. CPU inference works but runs slower.
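To make that mapping concrete, here is a rough sketch of the selection logic in Python. It is not Voicebox's actual code: the function name, the flags, and above all the order in which the checks run are assumptions made for illustration.

```python
def pick_backend(system: str, machine: str,
                 has_nvidia: bool = False, has_amd: bool = False) -> str:
    """Map a machine description to one of the backends named in the article.

    `system` and `machine` are the values platform.system() and
    platform.machine() would return. The priority order is a guess,
    not something documented by the project.
    """
    if system == "Darwin" and machine == "arm64":
        return "MLX"        # Apple Silicon
    if has_nvidia:
        return "CUDA"       # NVIDIA GPU
    if has_amd:
        return "ROCm"       # AMD GPU
    if system == "Windows":
        return "DirectML"   # generic Windows GPU path
    return "CPU"            # fallback: works, but slower


print(pick_backend("Darwin", "arm64"))
print(pick_backend("Windows", "AMD64", has_nvidia=True))
```

In the real app none of this is something you configure; the point is only that the backend follows from the hardware it finds.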

Linux builds aren’t available yet. You can build from source if you’re comfortable with Tauri/Rust development.

Installation

Download from the releases page:

macOS:

Apple Silicon: voicebox_aarch64.app.tar.gz
Intel: voicebox_x64.app.tar.gz

Extract and drag to /Applications/.

Windows:

MSI: voicebox_x64_en-US.msi
Setup: voicebox_x64-setup.exe

Run the installer and follow the prompts.

On first launch, Voicebox downloads the Qwen3-TTS model (2-4GB). This happens once; the model is then stored locally. A green status indicator in the bottom-left confirms the backend is running.

Creating Your First Voice Clone

Create a voice profile - Go to Voice Profiles, click New, and provide audio. Three options: upload a file, record with your mic, or capture system audio
Name and tag - Give it a recognizable name and set the language
Generate - Open the generation panel, select your profile, type text, click generate
Review - Output appears in your generation history with playback controls

For best results, use clean audio without background noise. Natural speech patterns work better than monotone reading. Cloning works from as little as 3 seconds, but a 10-30 second sample is ideal.
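If you're preparing clips in bulk, a quick length check before importing can save some trial and error. The sketch below uses Python's standard wave module and only handles uncompressed WAV files. The thresholds come from the numbers above; the function names and the "too short / usable / ideal" labels are made up for this example.

```python
import wave

MIN_SECONDS = 3.0            # minimum clip length mentioned in the article
IDEAL_RANGE = (10.0, 30.0)   # recommended clip length from the article


def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_clip(seconds: float) -> str:
    """Classify a clip length against the article's guidance."""
    if seconds < MIN_SECONDS:
        return "too short"
    if IDEAL_RANGE[0] <= seconds <= IDEAL_RANGE[1]:
        return "ideal"
    return "usable"
```

Calling rate_clip(clip_seconds("sample.wav")) on a 12-second recording would return "ideal". Length is the only thing this checks; background noise and delivery still need a listen.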

Multi-Voice Projects

The timeline editor lets you build longer productions with multiple voices:

Create separate voice profiles for each character
Open the timeline view
Add text clips to different tracks
Arrange, trim, and overlap as needed
Apply per-track or global effects
Export the mixed result

This is useful for podcasts, audiobooks, or any content with multiple speakers.
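Before opening the timeline, it can help to split a script by speaker so you know which lines belong on which track. This is a plain-text preprocessing sketch, not a Voicebox feature or file format: the "SPEAKER: line" convention and the function name are assumptions.

```python
from collections import defaultdict


def split_script(script: str) -> dict:
    """Group 'SPEAKER: text' lines by speaker.

    Each entry keeps its position in the script so the clips can be
    arranged in the original order later.
    """
    tracks = defaultdict(list)
    position = 0
    for raw in script.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a speaker label
        speaker, text = line.split(":", 1)
        tracks[speaker.strip()].append((position, text.strip()))
        position += 1
    return dict(tracks)


script = """HOST: Welcome back.
GUEST: Thanks for having me.

HOST: Let's start."""
for speaker, clips in split_script(script).items():
    print(speaker, clips)
```

Each speaker maps to one voice profile and one track; the position numbers give the order for arranging clips.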

Audio Effects

Built-in effects include:

Pitch shifting
Reverb and delay
Chorus
Compression
EQ and filters

Save combinations as presets. Effects apply non-destructively - you can always go back to the raw generated audio.

API Access

Voicebox runs a local REST API for integration with other tools. The server starts automatically with the desktop app.

API documentation is available at http://localhost:PORT/docs while the app is running, where PORT is the port the local server listens on. Use cases include batch processing, automation scripts, or feeding generated audio into other applications.
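A script that depends on the API should first confirm the backend is reachable. The sketch below does only that, using Python's standard library. It assumes nothing about the API beyond the /docs address above; the port is left as a parameter because the article doesn't specify it, and the function names are illustrative.

```python
import urllib.request


def docs_url(port: int, host: str = "localhost") -> str:
    """Build the documentation URL for a given port."""
    return f"http://{host}:{port}/docs"


def backend_is_up(port: int, timeout: float = 2.0) -> bool:
    """Return True if the docs page answers with HTTP 200.

    Connection errors, timeouts, and HTTP error statuses all count
    as 'not up' (URLError and HTTPError are subclasses of OSError).
    """
    try:
        with urllib.request.urlopen(docs_url(port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Once backend_is_up returns True for your port, the docs page itself is the place to look up the actual endpoints and request formats.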

What This Replaces

ElevenLabs pricing for comparison:

Plan	Cost/Month	Characters/Month
Free	$0	10,000
Starter	$5	30,000
Creator	$22	100,000
Pro	$99	500,000
Scale	$330	2,000,000

Voicebox: $0, unlimited characters, nothing uploaded to anyone’s servers.

The tradeoff is that you supply the hardware and wait for the initial model download. If you regularly generate more than the free tier's 10,000 characters a month, every paid tier above is a recurring cost that a local setup avoids.
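To put numbers on that, the sketch below takes the plan prices and quotas from the table and works out the cost per 1,000 characters and the cheapest plan for a given monthly volume. It assumes the listed prices are monthly and ignores overage charges, annual discounts, and the cost of your own hardware and electricity.

```python
# Plan name -> (monthly price in USD, characters included per month),
# copied from the comparison table in the article.
PLANS = {
    "Free": (0, 10_000),
    "Starter": (5, 30_000),
    "Creator": (22, 100_000),
    "Pro": (99, 500_000),
    "Scale": (330, 2_000_000),
}


def cost_per_1k_chars(price: float, chars: int) -> float:
    """Price in USD per 1,000 characters if the whole quota is used."""
    return price * 1000 / chars


def cheapest_plan(chars_per_month: int):
    """Name of the lowest-priced plan whose quota covers the volume, or None."""
    fitting = [(price, name) for name, (price, chars) in PLANS.items()
               if chars >= chars_per_month]
    return min(fitting)[1] if fitting else None


def yearly_cost(chars_per_month: int):
    """Twelve months of the cheapest fitting plan, or None if none fits."""
    plan = cheapest_plan(chars_per_month)
    return None if plan is None else PLANS[plan][0] * 12


print(cheapest_plan(250_000), yearly_cost(250_000))
```

At 250,000 characters a month, the smallest plan that fits is Pro at $99, which comes to $1,188 a year.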

Privacy Notes

All processing happens locally. Voicebox stores data in platform-specific locations:

macOS: ~/Library/Application Support/com.voicebox.app/
Windows: %APPDATA%/com.voicebox.app/
Linux: ~/.config/com.voicebox.app/
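If you want to back up or inspect that data from a script, the locations above can be resolved like this. It is a minimal sketch based only on the paths listed here; the function name is made up, and the fallback used when APPDATA is unset is an assumption (the usual Windows default).

```python
import os
import platform
from pathlib import Path
from typing import Optional

APP_ID = "com.voicebox.app"  # directory name taken from the paths in the article


def voicebox_data_dir(system: Optional[str] = None) -> Path:
    """Return the data directory listed in the article for a platform.

    `system` takes the values platform.system() returns
    ("Darwin", "Windows", "Linux"); it defaults to the current OS.
    """
    system = system or platform.system()
    home = Path.home()
    if system == "Darwin":
        return home / "Library" / "Application Support" / APP_ID
    if system == "Windows":
        appdata = os.environ.get("APPDATA")
        # Assumption: fall back to the usual default if APPDATA is unset.
        base = Path(appdata) if appdata else home / "AppData" / "Roaming"
        return base / APP_ID
    # Linux path as listed in the article; XDG_CONFIG_HOME is not consulted.
    return home / ".config" / APP_ID


print(voicebox_data_dir())
```

Check that the directory exists before relying on it, since the app may not have created it yet on a fresh install.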

No telemetry, no account required, no cloud dependency. The MIT license allows commercial use.

Chatterbox vs Voicebox

Both are legitimate options. Quick comparison:

Chatterbox TTS Server - Docker-based, web UI, single engine family, lighter setup, command-line/API focused

Voicebox - Native desktop app, five engines, timeline editor, effects processing, designed for production work

If you just need quick voice cloning with minimal setup, Chatterbox TTS Server is simpler. If you’re building longer content or want the effects pipeline, Voicebox offers more.

What You Can Do

Download from voicebox.sh or GitHub releases
Install and let it download the initial model
Create a voice profile from a short audio clip
Generate your first clone
Explore the timeline editor if you’re building multi-voice content

Standard caution: voice cloning technology can be misused. Get consent before cloning someone’s voice, and don’t use this for impersonation or fraud.