Mistral just dropped an open-weight text-to-speech model that clones voices from three seconds of audio, runs on consumer GPUs, and beats ElevenLabs Flash v2.5 in blind listening tests.
Voxtral TTS is a 4 billion parameter model you can download from Hugging Face and run locally. For anyone tired of paying per-character fees to cloud TTS services while their voice data gets processed on someone else’s servers, this changes the game.
What Voxtral Actually Does
The model takes text and a reference voice sample—as little as three seconds—and generates speech that sounds like the reference speaker. It works in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
According to Mistral’s technical breakdown, Voxtral is built on three components:
- A 3.4 billion parameter transformer decoder (based on Ministral 3B)
- A 390 million parameter flow-matching acoustic transformer
- A 300 million parameter neural audio codec
The decoder processes your text and voice prompt and generates semantic tokens; the flow-matching transformer then converts those into acoustic output in 16 function evaluations, and the neural codec decodes the final waveform. First audio arrives in about 70 milliseconds.
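Those 16 function evaluations are the ODE-solver steps a flow-matching model takes to push Gaussian noise toward its target. As a toy sketch only—a made-up linear velocity field stands in for the 390M-parameter acoustic transformer, and nothing here is Voxtral's actual code—Euler integration with 16 evaluations looks like:

```python
import numpy as np

def toy_velocity_field(x, t, target):
    # Stand-in for the acoustic transformer: a flow-matching model
    # predicts a velocity that moves the current sample toward the data.
    return target - x  # toy linear field, NOT the real network

def flow_matching_sample(target, num_evals=16, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_evals
    for step in range(num_evals):          # 16 function evaluations
        t = step * dt
        x = x + dt * toy_velocity_field(x, t, target)
    return x

target = np.array([1.0, -2.0, 0.5])
sample = flow_matching_sample(target)
```

Fewer evaluations would cut latency at the cost of sample quality; a small fixed count like 16 is what makes a ~70 ms time-to-first-audio plausible.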
The ElevenLabs Comparison
Here’s where it gets interesting. In blind human evaluations, Mistral claims testers preferred Voxtral over ElevenLabs Flash v2.5 roughly 63% of the time on stock voices, and nearly 70% on custom voice cloning tasks.
The testing involved recognizable voices across all nine supported languages, with annotators rating naturalness, accent accuracy, and how closely the output matched the original reference.
Voxtral reportedly matches ElevenLabs v3 (their premium tier) on overall quality while offering comparable time-to-first-audio latency.
The key difference: ElevenLabs charges per character. Voxtral’s weights are free.
Running It Locally
You need:
- GPU: 16GB VRAM minimum
- Software: vLLM 0.18.0 or later, plus vllm-omni
Basic setup:
pip install -U vllm
pip install git+https://github.com/vllm-project/vllm-omni.git
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
Once running, you hit it via API:
import httpx

payload = {
    "input": "Your text here",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",  # or use custom voice reference
}

# timeout raised from httpx's 5-second default; generation can take longer
response = httpx.post("http://localhost:8000/v1/audio/speech", json=payload, timeout=60)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)
Voxtral ships with 20 preset voices. For custom cloning, you provide a 5-25 second reference sample.
At concurrency of 32, Mistral reports throughput of about 1,430 characters per second per GPU. That’s enough for real-time applications.
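That throughput claim is easy to sanity-check. Assuming a typical speaking rate of roughly 15 characters per second (~150 words per minute—my assumption, not a Mistral figure), each stream runs about three times faster than real time:

```python
# Reported: ~1,430 chars/sec per GPU at concurrency 32
throughput_chars_per_sec = 1430
concurrency = 32
per_stream = throughput_chars_per_sec / concurrency  # ~44.7 chars/sec per stream

# Assumption: natural speech runs around ~15 chars/sec (~150 wpm)
speaking_rate = 15
realtime_factor = per_stream / speaking_rate  # ~3x real time per stream

print(f"{per_stream:.1f} chars/sec per stream, ~{realtime_factor:.1f}x real time")
```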
Privacy Angle
This is where local TTS models matter most.
When you use ElevenLabs, OpenAI’s TTS, or other cloud services, your voice data—including any voice you’re cloning—gets sent to their servers. Their privacy policies govern what happens next.
With Voxtral running locally, nothing leaves your machine. Your voice samples stay on your hardware. Your generated audio never touches external servers.
For businesses handling sensitive communications, healthcare providers generating patient-facing audio, or anyone who’d rather not hand over their voice biometrics to a third party, local inference is the only option that makes sense.
Mistral explicitly markets this angle: “Enterprises can download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party.”
The Catch: CC BY-NC License
The open weights come under Creative Commons Attribution-NonCommercial 4.0.
Translation: you can use Voxtral for personal projects, research, and internal tools. Commercial products require a different arrangement with Mistral.
This isn’t unusual for open-weight models from commercial AI companies. Mistral is giving away capability while protecting their revenue from enterprise customers who’d otherwise build products on free weights.
If you’re a business that wants to deploy Voxtral commercially, you’ll need to contact Mistral for licensing. The API is priced at $0.016 per 1,000 characters—significantly cheaper than ElevenLabs—if you’d rather pay than self-host.
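For anyone weighing the API against self-hosting, the per-character pricing is simple to project (the 300,000-character script length below is an illustrative assumption, not a Mistral example):

```python
price_per_1k_chars = 0.016  # Mistral API pricing per 1,000 characters

def tts_cost(num_chars):
    """Dollar cost to synthesize num_chars of text via the API."""
    return num_chars / 1000 * price_per_1k_chars

# A ~300,000-character novel-length audiobook script (illustrative)
print(f"${tts_cost(300_000):.2f}")  # $4.80
```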
What This Means
Open-weight TTS has been lagging behind commercial offerings for years. The gap was wide enough that most serious applications defaulted to ElevenLabs, Play.ht, or similar cloud services despite the privacy tradeoffs and per-character costs.
Voxtral narrows that gap dramatically. A model that matches ElevenLabs quality, clones voices from three-second samples, runs on mid-range gaming GPUs, and costs nothing to run locally represents a genuine shift in what’s possible without cloud dependencies.
The 16GB VRAM requirement keeps it out of reach for laptops and lower-end desktops. But anyone with an RTX 4080 or better—or access to cloud GPUs for batch processing—can now generate professional-quality voice cloning without external APIs.
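The 16GB floor is roughly what you'd expect from the parameter count alone (a back-of-envelope sketch; real usage also includes activations, KV cache, and runtime overhead):

```python
params = 4e9          # ~4B parameters across the three components
bytes_per_param = 2   # fp16/bf16 weights

weight_gb = params * bytes_per_param / 1024**3
print(f"~{weight_gb:.1f} GB for weights alone")  # ~7.5 GB, leaving headroom in 16 GB
```

This is also why quantized variants (mentioned below as likely community work) could meaningfully lower the VRAM bar: 8-bit weights would halve that figure.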
For the local AI community, this is the TTS equivalent of what Stable Diffusion did for image generation: taking a capability that was locked behind expensive APIs and making it something you can run in your basement.
Try It
If you’ve got the hardware:
- Download from Hugging Face
- Install vLLM and vllm-omni
- Run the serve command
- Hit the API with your text and voice samples
For those without beefy GPUs, Mistral’s API at $0.016/1k characters is the cheapest quality TTS option available. You’re still sending data to their servers, but at least it’s significantly cheaper than alternatives.
The model is new enough that community tooling is still catching up. Expect integrations with local AI frameworks, llama.cpp-style CPU implementations, and quantized versions for lower VRAM requirements over the coming weeks.