ElevenLabs charges $99/month for their Pro API tier. OpenAI’s TTS runs $15 per million characters. If you’re generating voiceovers, audiobooks, podcasts, or accessibility audio, those costs add up fast.
Kokoro TTS is an 82-million parameter open-source model that ranked #1 on the TTS Arena leaderboard in January 2025 — ahead of models 10-15x its size. It runs on a CPU. It’s Apache-licensed. And it exposes an OpenAI-compatible API, meaning anything that works with OpenAI’s text-to-speech works with Kokoro out of the box.
Here’s how to replace your paid TTS service this afternoon.
What You’re Getting
Kokoro-82M delivers surprisingly natural speech from a model small enough to fit on a Raspberry Pi. The numbers:
- 54 voices across 8 languages (English, Japanese, Mandarin Chinese, Spanish, French, Hindi, Italian, Brazilian Portuguese)
- Under 2GB VRAM — or runs entirely on CPU
- 35-100x realtime speed on a mid-range GPU, 3-5x on CPU alone
- Multiple output formats: MP3, WAV, Opus, FLAC, M4A, PCM
- Streaming support for real-time applications
- Voice mixing — blend two voices with custom weights
It’s not quite ElevenLabs. The expressiveness gap is real — paid services still lead on emotional nuance and voice cloning. But for narration, accessibility, content production, and automation, Kokoro is close enough that the price difference stops making sense.
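Those realtime multipliers translate directly into wall-clock generation time. A rough sketch of the arithmetic (the 50x and 4x figures below are assumed mid-range points from the GPU and CPU ranges above, not benchmarks):

```python
def generation_minutes(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock minutes needed to synthesize `audio_hours` of audio
    at a given realtime speed multiplier."""
    return audio_hours * 60 / realtime_factor

# 10 hours of finished audio at 50x realtime (mid-range GPU)
print(generation_minutes(10, 50))  # 12.0 minutes
# The same workload at 4x realtime (CPU only)
print(generation_minutes(10, 4))   # 150.0 minutes
```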
The Cost Math
Let’s say you generate 10 hours of audio per month — a reasonable workload for a content creator or accessibility team.
| Service | Monthly Cost | Annual Cost |
|---|---|---|
| ElevenLabs Pro API | $99/mo | $1,188 |
| OpenAI TTS Standard | ~$45/mo | ~$540 |
| Kokoro (self-hosted) | $0 | $0 |
Over three years, self-hosting Kokoro saves roughly $1,600-$3,600 depending on which service you’re replacing. Your only cost is the electricity to run the model.
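The three-year figures fall straight out of the table. A quick sketch of the arithmetic, using the monthly costs above:

```python
# Monthly costs from the comparison table
ELEVENLABS_PRO = 99.0
OPENAI_TTS = 45.0  # approximate, usage-based

def three_year_savings(monthly_cost: float) -> float:
    """Total spend avoided over 36 months by self-hosting at $0/month."""
    return monthly_cost * 36

print(three_year_savings(ELEVENLABS_PRO))  # 3564.0
print(three_year_savings(OPENAI_TTS))      # 1620.0
```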
Hardware Requirements
This is where Kokoro stands apart from most local AI models. You don’t need a gaming PC.
Minimum (CPU-only):
- Any modern CPU (Intel i5 / AMD Ryzen 5 or better)
- 4GB RAM
- 350MB disk space
- No GPU required
Recommended (GPU-accelerated):
- NVIDIA GPU with 2GB+ VRAM (even a GTX 1060 works)
- 8GB RAM
- Docker installed
That’s it. No 24GB VRAM card. No $2,000 hardware investment. If your computer was made after 2018, it can probably run Kokoro.
Option 1: Kokoro-FastAPI (Recommended)
Kokoro-FastAPI is the most mature self-hosted wrapper. It gives you a Docker container with an OpenAI-compatible API, a web UI, and GPU or CPU inference.
Step 1: Run the Container
CPU (works everywhere):

```bash
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest
```

GPU (NVIDIA with CUDA):

```bash
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest
```
That’s literally it. One command. The container downloads the model on first run.
Step 2: Verify It’s Working
Open your browser to http://localhost:8880/web for the built-in web interface. You can test voices, adjust speed, and generate audio files directly.
The API docs live at http://localhost:8880/docs.
Step 3: Generate Speech
Using curl:

```bash
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Self-hosted text to speech. No subscriptions. No data collection. Just your words, your voice, your hardware.",
    "voice": "af_heart"
  }' \
  --output speech.mp3
```
Using Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1",
    api_key="not-needed"  # the SDK requires a key; Kokoro ignores it
)

# Calling stream_to_file on a plain response is deprecated in recent SDK
# versions; the streaming-response form is the recommended pattern.
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_heart",
    input="Your text here.",
) as response:
    response.stream_to_file("output.mp3")
```
Notice: the Python code uses OpenAI’s official SDK. You just point base_url at your local server. Any tool, script, or application that already uses OpenAI TTS works without code changes.
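For book-length jobs you’ll likely want to split the text into smaller requests yourself. A minimal sentence-aware splitter, as a sketch — the 2,000-character limit is an arbitrary assumption for illustration, not a documented Kokoro-FastAPI limit:

```python
import re

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars,
    breaking on sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=10))
# ['One. Two.', 'Three.']
```

Each chunk then goes through the same speech call, and the resulting audio files can be concatenated afterward.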
Step 4: Docker Compose (Persistent Setup)
For a production-style deployment, create a docker-compose.yml:
```yaml
services:
  kokoro-tts:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8880:8880"
    restart: unless-stopped
    volumes:
      - kokoro-data:/app/data

volumes:
  kokoro-data:
```
For GPU, swap the image to ghcr.io/remsky/kokoro-fastapi-gpu:latest and add:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - capabilities: [gpu]
```
Start it with docker compose up -d and it runs in the background, surviving reboots.
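If it’s easier to copy one file, here is the GPU fragment merged into the compose file from Step 4 (same service name and volume; adjust paths to taste):

```yaml
services:
  kokoro-tts:
    image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
    ports:
      - "8880:8880"
    restart: unless-stopped
    volumes:
      - kokoro-data:/app/data
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

volumes:
  kokoro-data:
```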
Option 2: Kokoro Web (Browser-Based)
If you want something even simpler, Kokoro Web can run the model directly in your browser using WebGPU — no server, no Docker, no Python. For a persistent setup, it also ships as a single container:

```yaml
services:
  kokoro-web:
    image: ghcr.io/eduardolat/kokoro-web:latest
    ports:
      - "3000:3000"
    environment:
      - KW_SECRET_API_KEY=your-secret-key
    restart: unless-stopped
```
Access it at http://localhost:3000. It also exposes an OpenAI-compatible API on the same port.
The trade-off: Kokoro Web is newer (MIT-licensed, ~590 GitHub stars) and less battle-tested than Kokoro-FastAPI. But if you want a quick setup with a clean UI and don’t need advanced features like voice mixing, it works.
Available Voices
Kokoro ships with 54 voices. Here are some highlights:
| Voice ID | Language | Description |
|---|---|---|
| af_heart | English (US) | Warm, natural female voice — the default |
| af_bella | English (US) | Clear, professional female |
| am_adam | English (US) | Deep, steady male |
| am_michael | English (US) | Conversational male |
| bf_emma | English (UK) | British female |
| bm_george | English (UK) | British male |
| jf_alpha | Japanese | Japanese female |
| ff_siwis | French | French female |
You can also blend voices with weighted combinations:
```bash
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "A blend of two voices.",
    "voice": "af_heart(0.7)+af_bella(0.3)"
  }' \
  --output blended.mp3
```
This creates a voice that’s 70% af_heart and 30% af_bella. Useful for finding a tone that fits your project without training a custom voice.
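If you’re sweeping blend ratios programmatically, a small helper can build the weighted voice string. A sketch, assuming the `name(weight)+name(weight)` syntax shown above:

```python
def blend_voices(weights: dict[str, float]) -> str:
    """Build a Kokoro voice-mix string like 'af_heart(0.7)+af_bella(0.3)'.
    Weights are normalized so they sum to 1."""
    total = sum(weights.values())
    return "+".join(f"{voice}({w / total:g})" for voice, w in weights.items())

print(blend_voices({"af_heart": 0.7, "af_bella": 0.3}))
# af_heart(0.7)+af_bella(0.3)
```

Because the weights are normalized, you can pass any relative values (for example `{"a": 1, "b": 1}` becomes a 50/50 mix) and iterate until a blend fits your project.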
Practical Use Cases
Audiobook generation. Convert entire books to audio locally. One benchmark test generated a complete audiobook in 8.5 minutes — compared to 2+ hours through ElevenLabs’ API.
Podcast production. Generate narration segments, intros, or entire scripted episodes. Combine with a local Whisper instance for a fully private audio pipeline.
Accessibility. Add text-to-speech to your applications without sending user content to third parties. Particularly important for healthcare, legal, or education contexts.
Home automation. Give your smart home a voice that doesn’t route through Amazon or Google servers.
Content creation. Generate voiceovers for videos, tutorials, and presentations. At zero marginal cost, you can iterate on scripts without watching your API balance drain.
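A fully private pipeline doesn’t even need the OpenAI SDK — the Python standard library is enough to talk to the local endpoint. A minimal sketch (`synth_request` is illustrative glue, not part of any Kokoro API; the endpoint and voice are the ones used earlier):

```python
import json
import urllib.request

def synth_request(text: str, voice: str = "af_heart") -> urllib.request.Request:
    """Build the POST request for the local Kokoro endpoint.
    Nothing is sent here; pass the result to urllib.request.urlopen."""
    payload = json.dumps({"model": "kokoro", "input": text, "voice": voice})
    return urllib.request.Request(
        "http://localhost:8880/v1/audio/speech",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )

req = synth_request("Chapter one. It begins.")
print(req.get_method())  # POST
```

Sending it is one more line — `urllib.request.urlopen(req).read()` returns the audio bytes to write to disk.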
Integration with Existing Tools
Because Kokoro-FastAPI implements the OpenAI speech API specification, it works with:
- Open WebUI — direct integration documented
- n8n workflows — swap the OpenAI TTS node URL to http://kokoro-tts:8880/v1
- Home Assistant — any TTS integration that supports custom OpenAI endpoints
- Any application using the OpenAI Python/Node SDK — just change the base_url
What It Can’t Do
Let’s be honest about the gaps:
- Voice cloning. Kokoro can’t replicate a specific person’s voice. ElevenLabs still leads here by a wide margin.
- Emotional range. The voices are natural but not as expressive as premium paid options. Sarcasm, whispers, and dramatic pauses don’t land the same way.
- Real-time conversation. Latency is low on GPU but not instant. For sub-100ms voice responses in a chatbot, you’ll want dedicated streaming TTS infrastructure.
- Some languages. Eight languages is solid but nowhere near ElevenLabs’ 30+ language coverage.
If you need cloned voices or hyper-realistic emotional delivery, paid services are still worth it. For everything else, Kokoro is good enough — and getting better.
Privacy Considerations
Running Kokoro locally means your text never leaves your machine. No API calls, no data collection, no terms-of-service granting usage rights over your content.
This matters specifically for:
- Medical content where patient information appears in text
- Legal documents that can’t be sent to third-party servers
- Business communications with confidential information
- Personal content you’d rather not have in anyone’s training data
The model weights are downloaded once from Hugging Face. After that, the system works entirely offline.
What’s Next
Kokoro is under active development. The model was trained on hundreds of hours of data for just $1,000 in compute costs — proving that quality TTS doesn’t require massive budgets. The community is already building:
- Fine-tuned voices for specific use cases
- Additional language support
- Improved emotional expressiveness
- Rust-based inference for even lower latency
The Apache license means anyone can extend, modify, and commercially deploy Kokoro without restrictions.
What You Can Do
- Try it right now. Run the single Docker command above. You’ll have working TTS in under five minutes.
- Test it against your current service. Generate the same text with Kokoro and your paid provider. See if you can tell the difference.
- Swap one workflow. Replace one automation or content pipeline that uses paid TTS. Keep the paid service as a fallback until you trust the output.
- Point your existing code at it. If you’re using OpenAI’s TTS API, changing the base URL is the only modification needed.
Text-to-speech shouldn’t cost $99/month. An 82-million parameter model just proved that it doesn’t have to.