Self-Host Voicebox: A Desktop Voice Cloning Studio That Runs Offline

Voicebox is a free, open-source desktop app for voice cloning. Five TTS engines, 23 languages, timeline editor. All offline, zero cloud uploads.

We covered Chatterbox TTS earlier this month for command-line voice cloning. Voicebox takes a different approach: a polished desktop app with a graphical interface, five TTS engines bundled, and tools for creating multi-voice productions.

Jamie Pine’s open-source project packages everything into one download. No Docker, no Python environments, no command-line setup.

What You Get

Voicebox runs entirely on your machine. Your voice samples, generated audio, and model weights never leave your computer.

The app bundles five text-to-speech engines:

Engine                  | Languages   | Strength
Qwen3-TTS               | 10          | Highest quality, natural intonation, delivery instructions
LuxTTS                  | 1 (English) | Lightweight, CPU-efficient
Chatterbox Multilingual | 23          | Broadest language support
Chatterbox Turbo        | 1 (English) | Fast, supports emotion tags like [laugh]
TADA (HumeAI)           | 1 (English) | Extended coherent speech

Voice cloning works from as little as 3 seconds of audio. You can upload a file, record from your microphone, or capture system audio playing on your computer.

Beyond basic cloning, Voicebox includes:

  • Timeline editor for multi-voice projects with track arrangement
  • Audio effects like pitch shift, reverb, delay, compression, and filters
  • Presets to save and reuse effect chains
  • REST API for programmatic access from other tools
  • Whisper integration for automatic transcription of reference audio

Requirements

Minimum specs:

  • macOS 11+ or Windows 10+
  • 8GB RAM
  • 5GB free disk space

Recommended:

  • 16GB RAM
  • NVIDIA GPU with CUDA support
  • 10GB+ free disk space
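If you want to check the disk figures above before downloading anything, a few lines of Python will do it. This is a minimal sketch using only the standard library. The function name disk_verdict is made up for illustration, and treating 1 GB as 10^9 bytes is an assumption; none of this is part of Voicebox itself.

```python
import shutil

GB = 10**9  # assumption: decimal gigabytes

MIN_FREE_GB = 5           # minimum free disk space listed above
RECOMMENDED_FREE_GB = 10  # recommended free disk space listed above


def disk_verdict(free_bytes: int) -> str:
    """Classify free space against the minimum and recommended figures."""
    if free_bytes < MIN_FREE_GB * GB:
        return "below minimum"
    if free_bytes < RECOMMENDED_FREE_GB * GB:
        return "meets minimum"
    return "meets recommended"


if __name__ == "__main__":
    # Free bytes on the drive that holds the current directory.
    free = shutil.disk_usage(".").free
    print(f"{free / GB:.1f} GB free: {disk_verdict(free)}")
```

Run it from the drive where you plan to install, since model weights take up most of the space.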

Voicebox handles GPU acceleration automatically: MLX on Apple Silicon, CUDA on NVIDIA, ROCm on AMD, DirectML on Windows. CPU inference works but runs slower.

Linux builds aren’t available yet. You can build from source if you’re comfortable with Tauri/Rust development.

Installation

Download from the releases page:

macOS:

  • Apple Silicon: voicebox_aarch64.app.tar.gz
  • Intel: voicebox_x64.app.tar.gz

Extract and drag to /Applications/.

Windows:

  • MSI: voicebox_x64_en-US.msi
  • Setup: voicebox_x64-setup.exe

Run the installer and follow the prompts.
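If you are scripting setup across several machines, the file choice above can be expressed as a small lookup. This is only a sketch: the file names come from the lists above, while the keys are the values Python's platform module typically reports, which is an assumption worth verifying on your own hardware. For Windows it returns the MSI, though the setup.exe is an equally valid pick.

```python
import platform
from typing import Optional

# File names copied from the lists above. Keys are (system, machine) pairs
# as commonly reported by platform.system() and platform.machine().
ASSETS = {
    ("Darwin", "arm64"): "voicebox_aarch64.app.tar.gz",
    ("Darwin", "x86_64"): "voicebox_x64.app.tar.gz",
    ("Windows", "AMD64"): "voicebox_x64_en-US.msi",
}


def pick_asset(system: str, machine: str) -> Optional[str]:
    """Return the release file for a platform, or None if none is listed."""
    return ASSETS.get((system, machine))


if __name__ == "__main__":
    name = pick_asset(platform.system(), platform.machine())
    print(name or "No prebuilt download listed for this platform.")
```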

On first launch, Voicebox downloads the Qwen3-TTS model (2-4GB). This happens once, and the model is stored locally. A green status indicator in the bottom-left confirms the backend is running.

Creating Your First Voice Clone

  1. Create a voice profile - Go to Voice Profiles, click New, and provide audio. Three options: upload a file, record with your mic, or capture system audio
  2. Name and tag - Give it a recognizable name and set the language
  3. Generate - Open the generation panel, select your profile, type text, click generate
  4. Review - Output appears in your generation history with playback controls

For best results, use clean audio without background noise. Natural speech patterns work better than monotone reading. Cloning works from as little as 3 seconds, but a clip of 10-30 seconds is ideal.
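To check a reference clip's length before importing it, something like the following works for WAV files. It is a rough sketch built on Python's standard wave module, so it only reads uncompressed PCM WAV. The function names and category labels are made up for illustration, and the thresholds are the 3-second floor and 10-30 second range mentioned in this article.

```python
import wave

MIN_SECONDS = 3          # cloning works from about 3 seconds
IDEAL_RANGE = (10, 30)   # the range this article calls ideal


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV; source is a file path string or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_clip(seconds: float) -> str:
    """Label a clip length against the thresholds above."""
    low, high = IDEAL_RANGE
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds < low:
        return "below ideal range"
    if seconds <= high:
        return "ideal"
    return "above ideal range"
```

Calling rate_clip(wav_duration_seconds("reference.wav")) on your own file (the file name here is a placeholder) tells you whether to record a longer take.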

Multi-Voice Projects

The timeline editor lets you build longer productions with multiple voices:

  1. Create separate voice profiles for each character
  2. Open the timeline view
  3. Add text clips to different tracks
  4. Arrange, trim, and overlap as needed
  5. Apply per-track or global effects
  6. Export the mixed result

This is useful for podcasts, audiobooks, or any content with multiple speakers.

Audio Effects

Built-in effects include:

  • Pitch shifting
  • Reverb and delay
  • Chorus
  • Compression
  • EQ and filters

Save combinations as presets. Effects apply non-destructively - you can always go back to the raw generated audio.

API Access

Voicebox runs a local REST API for integration with other tools. The server starts automatically with the desktop app.

Documentation lives at http://localhost:PORT/docs once the app is running, where PORT is the port the local server listens on. Use cases include batch processing, automation scripts, and feeding generated audio into other applications.
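A script that talks to the local API might look roughly like this. Treat every specific here as a placeholder: the /generate path and the profile_id and text field names are hypothetical, not taken from Voicebox's documentation, and the port is a required argument because it depends on your install. Check the /docs page for the real endpoint names and request schema before relying on any of it.

```python
import json
import urllib.request


def build_generate_request(port: int, profile_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to a HYPOTHETICAL generation endpoint."""
    url = f"http://localhost:{port}/generate"  # hypothetical path, see /docs
    payload = {"profile_id": profile_id, "text": text}  # hypothetical fields
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> bytes:
    """Send the request to the local server and return the raw response body."""
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read()
```

Keeping request construction separate from sending makes it easy to loop over a list of lines for batch work, and to adjust the payload in one place once you have read the actual schema.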

What This Replaces

ElevenLabs pricing for comparison:

Plan    | Cost | Characters/Month
Free    | $0   | 10,000
Starter | $5   | 30,000
Creator | $22  | 100,000
Pro     | $99  | 500,000
Scale   | $330 | 2,000,000

Voicebox: $0, unlimited characters, nothing uploaded to anyone’s servers.

The tradeoff is that you supply the hardware and wait for the initial model download. If you generate voice audio regularly, the math works out quickly.
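To put numbers on that, here is a small sketch that works out the per-1,000-character price of each tier and finds the cheapest plan that covers a given monthly volume. The figures are copied from the table above and may be out of date. It assumes a flat monthly price and ignores overage charges and annual discounts.

```python
# (plan, monthly cost in USD, characters per month) from the table above
PLANS = [
    ("Free", 0, 10_000),
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
]


def cost_per_1k_chars(cost: float, chars: int) -> float:
    """Monthly cost divided by quota, scaled to 1,000 characters."""
    return cost / chars * 1000


def cheapest_plan_for(chars_needed: int):
    """Lowest-cost listed plan whose quota covers the usage, or None."""
    fits = [plan for plan in PLANS if plan[2] >= chars_needed]
    return min(fits, key=lambda plan: plan[1]) if fits else None


if __name__ == "__main__":
    for name, cost, chars in PLANS:
        print(f"{name}: ${cost_per_1k_chars(cost, chars):.3f} per 1,000 characters")
```

For example, cheapest_plan_for(300_000) returns the Pro entry at $99 a month, which is the recurring cost a local setup avoids at that volume.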

Privacy Notes

All processing happens locally. Voicebox stores data in platform-specific locations:

  • macOS: ~/Library/Application Support/com.voicebox.app/
  • Windows: %APPDATA%/com.voicebox.app/
  • Linux (source builds only): ~/.config/com.voicebox.app/

No telemetry, no account required, no cloud dependency. The MIT license allows commercial use.
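If you want to back up or wipe your voice data, a helper like this resolves the folder for the current platform. It is a sketch that mirrors the paths listed above. The function name and its parameters are made up for illustration, and it assumes the Windows location comes from the APPDATA environment variable.

```python
import os
import sys
from pathlib import Path
from typing import Optional

APP_ID = "com.voicebox.app"  # directory name from the list above


def voicebox_data_dir(platform_name: str, home: Path,
                      appdata: Optional[str] = None) -> Path:
    """Map a sys.platform value to the data directory listed above."""
    if platform_name == "darwin":
        return home / "Library" / "Application Support" / APP_ID
    if platform_name.startswith("win"):
        if appdata is None:
            raise ValueError("APPDATA is not set")
        return Path(appdata) / APP_ID
    # Everything else is treated as Linux.
    return home / ".config" / APP_ID


if __name__ == "__main__":
    print(voicebox_data_dir(sys.platform, Path.home(), os.environ.get("APPDATA")))
```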

Chatterbox vs Voicebox

Both are legitimate options. Quick comparison:

Chatterbox TTS Server - Docker-based, web UI, single engine family, lighter setup, command-line/API focused

Voicebox - Native desktop app, five engines, timeline editor, effects processing, designed for production work

If you just need quick voice cloning with minimal setup, Chatterbox TTS Server is simpler. If you’re building longer content or want the effects pipeline, Voicebox offers more.

What You Can Do

  1. Download from voicebox.sh or GitHub releases
  2. Install and let it download the initial model
  3. Create a voice profile from a short audio clip
  4. Generate your first clone
  5. Explore the timeline editor if you’re building multi-voice content

Standard caution: voice cloning technology can be misused. Get consent before cloning someone’s voice, and don’t use this for impersonation or fraud.