Self-Host Your Own AI Code Assistant With Continue and Ollama

GitHub Copilot costs $10 a month for individuals and $19 per user per month for businesses. The code around your cursor is sent to Microsoft’s servers as you type. Every snippet of proprietary code you’re working on goes to an API endpoint you don’t control, gets processed by a model you can’t inspect, and is potentially used in ways the privacy policy leaves vague.

You don’t need any of that. Continue is an open-source AI code assistant that plugs into VS Code and JetBrains. Pair it with Ollama running a local model, and you get autocomplete, chat, and code editing that works entirely on your machine. No subscription. No code leaving your network. No telemetry either, once you switch off Continue’s anonymous usage reporting in its settings.

This guide gets you from zero to working local code completion in about 20 minutes.

What You Need

Hardware:

8 GB RAM — Runs the 1.5B autocomplete model comfortably. Good enough for inline completions.
16 GB RAM — The sweet spot. Runs a 7B model for autocomplete and a separate chat model side by side.
32 GB RAM or a dedicated GPU — Full setup with a 32B reasoning model for chat, 7B for autocomplete, and embeddings for codebase indexing.

No GPU required for the basic setup. Ollama runs fine on CPU, just slower. If you have an NVIDIA GPU with 6+ GB VRAM or an Apple Silicon Mac, completions will feel near-instant.
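
The tiers above can be turned into a quick check. This is a minimal sketch, not part of Ollama or Continue: both function names are hypothetical, and the 16 GB threshold simply mirrors the list above.

```shell
# Hypothetical helper: maps total RAM (in GB) to the autocomplete model
# tag this guide suggests. The threshold mirrors the hardware tiers above.
pick_autocomplete_model() {
  ram_gb="$1"
  if [ "$ram_gb" -ge 16 ]; then
    echo "qwen2.5-coder:7b"
  else
    echo "qwen2.5-coder:1.5b"
  fi
}

# Best-effort RAM detection: /proc/meminfo on Linux, sysctl on macOS.
# Linux reports slightly less than the installed RAM, so round to nearest GB.
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo
  else
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
  fi
}
```

Running pick_autocomplete_model "$(total_ram_gb)" prints the tag to pull in Step 2. Treat the result as a starting point: a machine that is already busy with other work may be better off with the smaller model.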

Software:

VS Code (or JetBrains IDE)
A terminal

That’s it.

Step 1: Install Ollama

Ollama handles downloading, running, and serving AI models locally. One command:

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com.

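Start the service. The Linux installer and the desktop apps usually start it for you, so if this command reports that the address is already in use, Ollama is already running and you can move on: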

ollama serve

Verify it’s running:

curl http://localhost:11434

You should see “Ollama is running.” That’s your local AI server, done.
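
Because ollama serve stays in the foreground, a script that launches it in the background has to wait until the port answers before doing anything else. Here is a minimal sketch, assuming curl is installed and Ollama is on its default port 11434. The name wait_for_ollama is hypothetical.

```shell
# Polls the Ollama root endpoint until it answers or attempts run out.
# $1 = base URL, $2 = max attempts, $3 = seconds between attempts.
wait_for_ollama() {
  url="${1:-http://localhost:11434}"
  attempts="${2:-10}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s keeps it quiet.
    if curl -fs --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A typical use would be to run ollama serve in the background and then call wait_for_ollama before pulling models. The function returns non-zero if the server never responds.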

Step 2: Pull Your Models

You need at least one model. Here’s what to grab depending on your hardware:

Autocomplete model (the important one):

# Light and fast — works on any machine with 8 GB RAM
ollama pull qwen2.5-coder:1.5b

# Better quality — needs 16 GB RAM or a GPU with 8+ GB VRAM
ollama pull qwen2.5-coder:7b

The 7B instruct version of Qwen 2.5 Coder scores 88.4% on HumanEval, higher than GPT-4’s 87.1%. The 1.5B version is the one to use for autocomplete: small enough to respond quickly, accurate enough to be genuinely useful. Both sizes are released under Apache 2.0, which permits commercial use.

Chat model (optional but recommended):

# Good all-rounder for code questions and refactoring
ollama pull llama3.1:8b

# Heavyweight reasoning — if you have the RAM
ollama pull deepseek-r1:32b

The chat model handles “explain this code,” “refactor this function,” and interactive debugging. You can skip it initially and add it later.
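
These downloads are large, so it can help to skip anything already on disk. The sketch below assumes the usual ollama list layout: a header row, then one model per line with the tag in the first column. The names model_in_list and pull_if_missing are hypothetical.

```shell
# Reads `ollama list` output on stdin; succeeds if tag $1 is present.
model_in_list() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Pulls a model only when `ollama list` does not already show it.
pull_if_missing() {
  if ollama list | model_in_list "$1"; then
    echo "$1 already present"
  else
    ollama pull "$1"
  fi
}
```

For example, pull_if_missing qwen2.5-coder:1.5b followed by pull_if_missing llama3.1:8b downloads only what is missing. The match is on the exact tag, so a model pulled without a tag shows up as name:latest.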

Step 3: Install Continue

Open VS Code and install the Continue extension from the marketplace. Search for “Continue” and pick the one published by Continue (continue.dev).

After installing, Continue opens a setup panel. You can close it — we’ll configure everything manually for a cleaner setup.

Step 4: Configure Continue

Open your Continue config. Hit Ctrl+Shift+P (or Cmd+Shift+P on Mac) and search for “Continue: Open Config File.”

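Replace the contents with this. The name, version, and schema lines are the header that Continue’s YAML config format expects at the top of the file:

name: Local Assistant
version: 1.0.0
schema: v1
models: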
  - name: Qwen2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit

  - name: Qwen2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete

If you pulled a separate chat model like Llama 3.1, use that for chat instead:

name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Llama 3.1 8B
    provider: ollama
    model: llama3.1:8b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit

  - name: Qwen2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete

Save the file. Continue picks up changes immediately.

Step 5: Test It

Open any code file in VS Code. Start typing a function — you should see ghost text suggestions appear after a brief pause. Press Tab to accept.

Try the chat: select some code, press Ctrl+L (Cmd+L on Mac) to send it to the Continue sidebar, and ask “What does this function do?”

If completions aren’t showing up:

Check Ollama is running: curl http://localhost:11434
Verify the model is pulled: ollama list
Check the Continue output panel for errors: View → Output → Continue
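
If all three checks pass and the editor still shows nothing, find out whether the model itself responds. The sketch below sends one non-streaming request to Ollama’s /api/generate endpoint, bypassing Continue entirely. The helper names are hypothetical, and the payload builder assumes the prompt contains no double quotes or backslashes.

```shell
# Builds the JSON body for a non-streaming /api/generate request.
# Assumes $1 (model tag) and $2 (prompt) need no JSON escaping.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Sends the prompt to a local Ollama server and prints the raw JSON reply.
smoke_test_model() {
  curl -fsS http://localhost:11434/api/generate \
    -d "$(generate_payload "$1" "$2")"
}
```

Try smoke_test_model qwen2.5-coder:1.5b "def add(a, b):". A JSON reply with a response field means Ollama and the model are fine, so the problem is on the Continue side. An error here points back at Ollama.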

Picking the Right Model

The benchmarks tell a clear story:

Model	Size	HumanEval	Best For	License
Qwen 2.5 Coder 1.5B	1.5B	—	Autocomplete (speed)	Apache 2.0
Qwen 2.5 Coder 7B	7B	88.4%	Autocomplete + chat	Apache 2.0
Codestral	22B	—	Inline completion (FIM)	Non-Production
DeepSeek Coder V2 Lite	16B (2.4B active)	81.1%	Low VRAM chat	MIT code, DeepSeek model license

Qwen 2.5 Coder is the default recommendation. Its 32B flagship is reported as the strongest open model on more than 10 coding benchmarks covering generation, completion, reasoning, and repair, and the 1.5B and 7B sizes used in this guide are both Apache 2.0.

Codestral is arguably the best for inline completions, having ranked at the top of the Copilot Arena leaderboard, but its non-production license makes it a non-starter for professional work unless you negotiate a separate agreement with Mistral.

Tuning for Performance

A few config tweaks that make a real difference:

Enable multiline completions if you’re only getting single-line suggestions:

tabAutocompleteOptions:
  multilineCompletions: always

Reduce context length if your machine struggles:

models:
  - name: Qwen2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    defaultCompletionOptions:
      contextLength: 2048

Avoid thinking models for autocomplete. Models like DeepSeek R1 generate slowly because they reason through problems step by step. That’s exactly what you don’t want for autocomplete, where speed matters more than depth. Use thinking models for chat, fast models for completions.

What You Save

GitHub Copilot Individual: $10/month or $100/year. Copilot Business: $19/user/month, which is $228/year per seat. Copilot Enterprise: $39/user/month.

This setup: $0/month, forever. The electricity to run a 1.5B model is negligible — a few watts above idle.
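
For a team, the subscription side of that comparison is simple arithmetic. Here is a small sketch, with annual_cost as a hypothetical helper and whole-dollar monthly prices as input.

```shell
# Yearly cost in dollars: monthly price per seat x seats x 12 months.
annual_cost() {
  echo $(( $1 * $2 * 12 ))
}
```

Ten Business seats come to annual_cost 19 10, which prints 2280. Ten Enterprise seats come to annual_cost 39 10, which prints 4680.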

But the real value isn’t the money. It’s that your proprietary code never leaves your machine. No corporate IP getting sent to Microsoft’s servers. No training data concerns. No enterprise compliance headaches about where your code ends up. If you work with client code, medical data, financial systems, or anything under NDA, this distinction matters.

What You Lose

Honesty check: local models aren’t as good as Copilot on every task. GPT-4-class cloud models still have an edge on complex, multi-file refactoring and understanding large codebases holistically. The 1.5B autocomplete model handles line completions and short function bodies well, but it won’t architect a new module for you.

The gap is closing fast. The 32B Qwen 2.5 Coder model already matches GPT-4o on HumanEval, and the 7B model handles most practical coding tasks without breaking a sweat. For day-to-day autocomplete (finishing the line you’re typing, suggesting the next function call, filling in boilerplate), local models are genuinely competitive.

Going Further

Once you’re comfortable with the basic setup, there are a few upgrades worth exploring:

Add codebase indexing. Continue can index your project files so the chat model understands your full codebase. Pull an embeddings model (ollama pull nomic-embed-text) and enable indexing in the config.
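
Here is a minimal sketch of the config side. The function only prints a YAML entry to paste under the existing models: list in your Continue config. The entry name is arbitrary, and the embed role is an assumption based on Continue’s YAML config format, so check it against the current docs.

```shell
# Prints a model entry for embeddings, meant to be pasted under `models:`.
# Assumes Continue's YAML config accepts an `embed` role for Ollama models.
print_embed_model_yaml() {
  cat <<'EOF'
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text
    roles:
      - embed
EOF
}
```

Run ollama pull nomic-embed-text first, then print_embed_model_yaml, and copy the output into the config below your chat and autocomplete entries.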

Try Tabby as an alternative. Tabby is a self-hosted coding assistant built for team deployments. It runs as a server with its own dashboard and supports multiple users — useful if you want to set this up for a whole team.

Swap models freely. The beauty of this stack is that every component is interchangeable. New coding model drops that beats Qwen? Pull it with Ollama, update one line in your config, and you’re running it. No vendor lock-in, no subscription changes, no migration.
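
That one-line update can be scripted. The sketch below rewrites any model: line whose value exactly matches the old tag and leaves everything else alone. The name swap_model is hypothetical, and the code assumes the config keeps each model: key on its own line, as in the examples in this guide.

```shell
# Replaces the model tag in a Continue YAML config.
# $1 = config path, $2 = old tag, $3 = new tag.
swap_model() {
  tmp="$(mktemp)" || return 1
  awk -v old="$2" -v new="$3" '
    $1 == "model:" && $2 == old {
      indent = $0
      sub(/model:.*/, "", indent)   # keep only the leading whitespace
      print indent "model: " new
      next
    }
    { print }
  ' "$1" > "$tmp" && mv "$tmp" "$1"
}
```

For example, swap_model ~/.continue/config.yaml qwen2.5-coder:1.5b qwen2.5-coder:7b, assuming your config lives at that path. The display name on the line above is untouched, so update it by hand if you want the label in the editor to match.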

You own the whole stack. That’s the point.