Self-Host Kokoro TTS: Drop-In Replacement for ElevenLabs and OpenAI Text-to-Speech

An 82-million parameter model that runs on a CPU, sounds nearly as good as ElevenLabs, and costs nothing. Here's how to set it up.

[Image: A condenser microphone in a recording studio with warm lighting]

ElevenLabs charges $99/month for their Pro API tier. OpenAI’s TTS runs $15 per million characters. If you’re generating voiceovers, audiobooks, podcasts, or accessibility audio, those costs add up fast.

Kokoro TTS is an 82-million parameter open-source model that ranked #1 on the TTS Arena leaderboard in January 2025 — ahead of models 10-15x its size. It runs on a CPU. It’s Apache-licensed. And it exposes an OpenAI-compatible API, meaning anything that works with OpenAI’s text-to-speech works with Kokoro out of the box.

Here’s how to replace your paid TTS service this afternoon.

What You’re Getting

Kokoro-82M delivers surprisingly natural speech from a model small enough to fit on a Raspberry Pi. The numbers:

  • 54 voices across 8 languages (English, Japanese, Chinese, Korean, French, German, Italian, Portuguese)
  • Under 2GB VRAM — or runs entirely on CPU
  • 35-100x realtime speed on a mid-range GPU, 3-5x on CPU alone
  • Multiple output formats: MP3, WAV, Opus, FLAC, M4A, PCM
  • Streaming support for real-time applications
  • Voice mixing — blend two voices with custom weights

It’s not quite ElevenLabs. The expressiveness gap is real — paid services still lead on emotional nuance and voice cloning. But for narration, accessibility, content production, and automation, Kokoro is close enough that the price difference stops making sense.

The Cost Math

Let’s say you generate 10 hours of audio per month — a reasonable workload for a content creator or accessibility team.

Service                 Monthly Cost    Annual Cost
ElevenLabs Pro API      $99/mo          $1,188
OpenAI TTS Standard     ~$45/mo         ~$540
Kokoro (self-hosted)    $0              $0

Over three years, self-hosting Kokoro saves roughly $1,600-$3,600 depending on which service you’re replacing. Your only cost is the electricity to run the model.
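You can reproduce these figures in a few lines of Python. The monthly rates below are the assumptions from the table; adjust them to match your actual plan and usage:

```python
# Rough cost comparison for ~10 hours of generated audio per month.
# Monthly rates are the table's assumptions, not quotes from the providers.
MONTHLY_COST = {
    "ElevenLabs Pro API": 99.00,
    "OpenAI TTS Standard": 45.00,  # implies ~3M characters at $15/M chars
    "Kokoro (self-hosted)": 0.00,
}

def three_year_savings(paid_service: str) -> float:
    """Savings over 36 months from switching the given service to Kokoro."""
    return (MONTHLY_COST[paid_service] - MONTHLY_COST["Kokoro (self-hosted)"]) * 36

for name, monthly in MONTHLY_COST.items():
    print(f"{name}: ${monthly * 12:,.0f}/yr, ${monthly * 36:,.0f} over 3 years")
```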

Hardware Requirements

This is where Kokoro stands apart from most local AI models. You don’t need a gaming PC.

Minimum (CPU-only):

  • Any modern CPU (Intel i5 / AMD Ryzen 5 or better)
  • 4GB RAM
  • 350MB disk space
  • No GPU required

Recommended (GPU-accelerated):

  • NVIDIA GPU with 2GB+ VRAM (even a GTX 1060 works)
  • 8GB RAM
  • Docker installed

That’s it. No 24GB VRAM card. No $2,000 hardware investment. If your computer was made after 2018, it can probably run Kokoro.

Option 1: Kokoro-FastAPI (Recommended)

Kokoro-FastAPI is the most mature self-hosted wrapper. It gives you a Docker container with an OpenAI-compatible API, a web UI, and your choice of GPU or CPU inference.

Step 1: Run the Container

CPU (works everywhere):

docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

GPU (NVIDIA with CUDA):

docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

That’s literally it. One command. The container downloads the model on first run.

Step 2: Verify It’s Working

Open your browser to http://localhost:8880/web for the built-in web interface. You can test voices, adjust speed, and generate audio files directly.

The API docs live at http://localhost:8880/docs.

Step 3: Generate Speech

Using curl:

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Self-hosted text to speech. No subscriptions. No data collection. Just your words, your voice, your hardware.",
    "voice": "af_heart"
  }' \
  --output speech.mp3

Using Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8880/v1",
    api_key="not-needed"
)

response = client.audio.speech.create(
    model="kokoro",
    voice="af_heart",
    input="Your text here."
)

response.stream_to_file("output.mp3")

Notice: the Python code uses OpenAI’s official SDK. You just point base_url at your local server. Any tool, script, or application that already uses OpenAI TTS works without code changes.
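If you’d rather not pull in the SDK at all, the endpoint is plain JSON over HTTP, so the standard library is enough. A stdlib-only sketch (the helper names are mine; the payload fields mirror the curl example above):

```python
import json
import urllib.request

KOKORO_URL = "http://localhost:8880/v1/audio/speech"

def speech_payload(text: str, voice: str = "af_heart", fmt: str = "mp3") -> bytes:
    """Build the JSON body for the OpenAI-style /v1/audio/speech endpoint."""
    return json.dumps({
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": fmt,  # mp3, wav, opus, flac, m4a, or pcm
    }).encode("utf-8")

def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    """POST the text to the local Kokoro server and save the audio response."""
    req = urllib.request.Request(
        KOKORO_URL,
        data=speech_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

# With the container from Step 1 running:
# synthesize("No SDK required.", "speech.mp3")
```

This is handy for minimal environments (containers, embedded scripts) where installing the openai package isn’t worth it.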

Step 4: Docker Compose (Persistent Setup)

For a production-style deployment, create a docker-compose.yml:

services:
  kokoro-tts:
    image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
    ports:
      - "8880:8880"
    restart: unless-stopped
    volumes:
      - kokoro-data:/app/data

volumes:
  kokoro-data:

For GPU, swap the image to ghcr.io/remsky/kokoro-fastapi-gpu:latest and add:

    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Start it with docker compose up -d and it runs in the background, surviving reboots.

Option 2: Kokoro Web (Browser-Based)

If you want something even simpler, Kokoro Web runs the model directly in your browser using WebGPU, so inference happens client-side with no Python environment to manage. To self-host your own instance, use Docker Compose:

services:
  kokoro-web:
    image: ghcr.io/eduardolat/kokoro-web:latest
    ports:
      - "3000:3000"
    environment:
      - KW_SECRET_API_KEY=your-secret-key
    restart: unless-stopped

Access it at http://localhost:3000. It also exposes an OpenAI-compatible API on the same port.

The trade-off: Kokoro Web is newer (MIT-licensed, ~590 GitHub stars) and less battle-tested than Kokoro-FastAPI. But if you want a quick setup with a clean UI and don’t need advanced features like voice mixing, it works.

Available Voices

Kokoro ships with 54 voices. Here are some highlights:

Voice ID      Language        Description
af_heart      English (US)    Warm, natural female voice (the default)
af_bella      English (US)    Clear, professional female
am_adam       English (US)    Deep, steady male
am_michael    English (US)    Conversational male
bf_emma       English (UK)    British female
bm_george     English (UK)    British male
jf_alpha      Japanese        Japanese female
ff_siwis      French          French female

You can also blend voices with weighted combinations:

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "A blend of two voices.",
    "voice": "af_heart(0.7)+af_bella(0.3)"
  }' \
  --output blended.mp3

This creates a voice that’s 70% af_heart and 30% af_bella. Useful for finding a tone that fits your project without training a custom voice.
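If you’re scripting blend experiments, a small helper can build the weight syntax shown above. This is my own convenience function, assuming the voice(weight)+voice(weight) format Kokoro-FastAPI accepts:

```python
def blend(weights: dict[str, float]) -> str:
    """Build a Kokoro voice-mix string like 'af_heart(0.7)+af_bella(0.3)'.

    Weights are normalized so they always sum to 1.0, which keeps the
    mix valid even if you pass raw ratios like {"a": 2, "b": 1}.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return "+".join(f"{voice}({w / total:g})" for voice, w in weights.items())

print(blend({"af_heart": 0.7, "af_bella": 0.3}))  # af_heart(0.7)+af_bella(0.3)
```

Looping this over a grid of weights and listening to the samples is a quick way to converge on a voice for a project.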

Practical Use Cases

Audiobook generation. Convert entire books to audio locally. One benchmark test generated a complete audiobook in 8.5 minutes — compared to 2+ hours through ElevenLabs’ API.

Podcast production. Generate narration segments, intros, or entire scripted episodes. Combine with a local Whisper instance for a fully private audio pipeline.

Accessibility. Add text-to-speech to your applications without sending user content to third parties. Particularly important for healthcare, legal, or education contexts.

Home automation. Give your smart home a voice that doesn’t route through Amazon or Google servers.

Content creation. Generate voiceovers for videos, tutorials, and presentations. At zero marginal cost, you can iterate on scripts without watching your API balance drain.
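For the audiobook and content-creation cases, the main glue code you need is something that splits a long text into chunks the API can handle and keeps the breaks at sentence boundaries. A sketch of the chunking side (the 1,000-character limit is an arbitrary assumption, not a documented Kokoro constraint):

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on sentence
    ends where possible so the TTS output keeps natural pauses. A single
    sentence longer than max_chars is passed through as its own chunk."""
    chunks: list[str] = []
    current = ""
    for sentence in text.replace("\n", " ").split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        if not sentence.endswith("."):
            sentence += "."
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then goes through the same /v1/audio/speech call as in Step 3, and the resulting files can be concatenated with ffmpeg or pydub.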

Integration with Existing Tools

Because Kokoro-FastAPI implements the OpenAI speech API specification, it works with:

  • Open WebUI — direct integration documented
  • n8n workflows — swap the OpenAI TTS node URL to http://kokoro-tts:8880/v1
  • Home Assistant — any TTS integration that supports custom OpenAI endpoints
  • Any application using the OpenAI Python/Node SDK — just change the base_url

What It Can’t Do

Be honest about the gaps:

  • Voice cloning. Kokoro can’t replicate a specific person’s voice. ElevenLabs still leads here by a wide margin.
  • Emotional range. The voices are natural but not as expressive as premium paid options. Sarcasm, whispers, and dramatic pauses don’t land the same way.
  • Real-time conversation. Latency is low on GPU but not instant. For sub-100ms voice responses in a chatbot, you’ll want dedicated streaming TTS infrastructure.
  • Some languages. Eight languages is solid but nowhere near ElevenLabs’ 30+ language coverage.

If you need cloned voices or hyper-realistic emotional delivery, paid services are still worth it. For everything else, Kokoro is good enough — and getting better.

Privacy Considerations

Running Kokoro locally means your text never leaves your machine. No API calls, no data collection, no terms-of-service granting usage rights over your content.

This matters specifically for:

  • Medical content where patient information appears in text
  • Legal documents that can’t be sent to third-party servers
  • Business communications with confidential information
  • Personal content you’d rather not have in anyone’s training data

The model weights are downloaded once from Hugging Face. After that, the system works entirely offline.

What’s Next

Kokoro is under active development. The model was trained on hundreds of hours of data for just $1,000 in compute costs — proving that quality TTS doesn’t require massive budgets. The community is already building:

  • Fine-tuned voices for specific use cases
  • Additional language support
  • Improved emotional expressiveness
  • Rust-based inference for even lower latency

The Apache license means anyone can extend, modify, and commercially deploy Kokoro without restrictions.

What You Can Do

  1. Try it right now. Run the single Docker command above. You’ll have working TTS in under five minutes.
  2. Test it against your current service. Generate the same text with Kokoro and your paid provider. See if you can tell the difference.
  3. Swap one workflow. Replace one automation or content pipeline that uses paid TTS. Keep the paid service as a fallback until you trust the output.
  4. Point your existing code at it. If you’re using OpenAI’s TTS API, changing the base URL is the only modification needed.

Text-to-speech shouldn’t cost $99/month. An 82-million parameter model just proved that it doesn’t have to.