Local AI Code Assistant: Setting Up Copilot Without the Cloud

Set up a fully local AI code assistant using Continue with Ollama in VS Code, Tabby for self-hosted completions, and Aider for terminal-based coding. Includes model benchmarks and configuration.

A local AI code assistant gives you the benefits of GitHub Copilot without sending your code to the cloud. Your proprietary code, API keys, and internal documentation stay on your machine. This guide covers three approaches: Continue with Ollama for VS Code integration, Tabby for self-hosted completions, and Aider for terminal-based AI coding. Each setup runs entirely locally using open-source code models that have become competitive with cloud alternatives.

Approach 1: Continue + Ollama (VS Code)

Continue is the most popular open-source AI code assistant. It integrates directly into VS Code and JetBrains IDEs, providing chat, autocomplete, and inline editing powered by your local Ollama models.

Step 1: Install Ollama and Code Models

# Install Ollama if not already installed
curl -fsSL https://ollama.com/install.sh | sh

# Pull code-optimized models
ollama pull qwen2.5-coder:7b      # Primary code model (4.7 GB)
ollama pull qwen2.5-coder:1.5b    # Fast autocomplete model (1.0 GB)

# Optional: larger model for better quality (needs 16+ GB VRAM)
ollama pull qwen2.5-coder:14b     # Higher quality (9.0 GB)

# Optional: general model for explanations
ollama pull llama3.1:8b            # General chat (4.7 GB)

# Required for Continue's @codebase search (embeddings)
ollama pull nomic-embed-text       # Embedding model (274 MB)
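The download sizes above roughly track the quantized model file size. As a back-of-envelope check (a rule of thumb, not an exact formula; the parameter counts below are the published sizes for these Qwen releases), a ~Q4-quantized GGUF needs about `params × 4.5 bits / 8` bytes plus roughly 10% overhead, with some additional VRAM at runtime for the KV cache:

```python
def approx_gguf_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Back-of-envelope file size for a ~Q4-quantized model.

    Rule of thumb only: params * bits/8 bytes, plus ~10% for embeddings,
    metadata, and layers kept at higher precision.
    """
    return round(params_billions * bits_per_weight / 8 * 1.1, 1)

# Rough sanity checks against the Ollama download sizes above
print(approx_gguf_size_gb(7.6))   # qwen2.5-coder:7b  (7.6B params) -> ~4.7 GB
print(approx_gguf_size_gb(14.7))  # qwen2.5-coder:14b (14.7B params) -> ~9.1 GB
```

This is why the 7B model shows up as a 4.7 GB download: the weights are stored at roughly 4.5 bits each, not 16.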

Step 2: Install Continue Extension

  1. Open VS Code
  2. Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
  3. Search for “Continue”
  4. Install “Continue - Codestral, Claude, and more”
  5. Click the Continue icon in the sidebar

Step 3: Configure Continue

Continue’s configuration lives in ~/.continue/config.json (recent releases also support a config.yaml). Here’s an optimized configuration:

{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "contextLength": 8192,
      "completionOptions": {
        "temperature": 0.1,
        "maxTokens": 2048
      }
    },
    {
      "title": "Qwen 2.5 Coder 14B",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b",
      "contextLength": 8192,
      "completionOptions": {
        "temperature": 0.1,
        "maxTokens": 4096
      }
    },
    {
      "title": "Llama 3.1 8B (Chat)",
      "provider": "ollama",
      "model": "llama3.1:8b",
      "contextLength": 8192,
      "completionOptions": {
        "temperature": 0.3,
        "maxTokens": 2048
      }
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b",
    "contextLength": 4096
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  },
  "customCommands": [
    {
      "name": "test",
      "prompt": "Write unit tests for the selected code. Use the project's existing test framework and conventions.",
      "description": "Generate unit tests"
    },
    {
      "name": "review",
      "prompt": "Review the selected code for bugs, security issues, and improvements. Be specific and actionable.",
      "description": "Code review"
    },
    {
      "name": "docs",
      "prompt": "Add comprehensive documentation to the selected code. Include parameter descriptions, return values, and usage examples.",
      "description": "Generate documentation"
    }
  ]
}
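A single JSON typo (a trailing comma, a misnamed key) can silently break Continue's model list. A quick sanity-check script, assuming only the key names used in the config above:

```python
import json
from pathlib import Path

REQUIRED_MODEL_KEYS = {"title", "provider", "model"}

def check_continue_config(text: str) -> list[str]:
    """Return a list of problems found in a Continue config.json string."""
    problems = []
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    for i, model in enumerate(cfg.get("models", [])):
        missing = REQUIRED_MODEL_KEYS - model.keys()
        if missing:
            problems.append(f"models[{i}] missing keys: {sorted(missing)}")
    if "tabAutocompleteModel" not in cfg:
        problems.append("no tabAutocompleteModel entry (Continue falls back to defaults)")
    return problems

if __name__ == "__main__":
    path = Path.home() / ".continue" / "config.json"
    for problem in check_continue_config(path.read_text()) or ["config looks OK"]:
        print(problem)
```

Run it after every config change; "invalid JSON" from this script means Continue would also fail to load the file.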

Step 4: Using Continue

Chat (Cmd+L / Ctrl+L):

  • Ask questions about your codebase
  • Request code generation
  • Debug errors by pasting stack traces
  • Get explanations of complex code

Autocomplete (Tab):

  • Continue suggests completions as you type
  • Press Tab to accept, Escape to dismiss
  • Uses the smaller tabAutocompleteModel for speed

Inline Edit (Cmd+I / Ctrl+I):

  • Select code and press Cmd+I
  • Describe the change you want
  • Continue rewrites the selected code

Codebase Context (@codebase):

  • Type @codebase in chat to search your entire project
  • Continue uses embeddings to find relevant files
  • Requires nomic-embed-text model

// Example chat prompts:
@codebase How does the authentication flow work?
@file:src/auth.ts Explain this middleware
/test (with code selected)
/review (with code selected)

Performance Optimization for Continue

// In config.json - optimize autocomplete speed
{
  "tabAutocompleteOptions": {
    "debounceDelay": 500,
    "maxPromptTokens": 1024,
    "prefixPercentage": 0.7,
    "multilineCompletions": "auto",
    "useCache": true
  }
}
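debounceDelay is the main lever for balancing responsiveness against load: Continue waits that many milliseconds after your last keystroke before requesting a completion, so a fast typing burst produces one request instead of one per keystroke. A minimal sketch of the idea, using hypothetical keystroke timestamps:

```python
def completions_fired(keystroke_times_ms: list[int], debounce_ms: int) -> int:
    """Count completion requests: one fires for each keystroke followed
    by a pause of at least debounce_ms (the last keystroke always fires)."""
    count = 0
    for i, t in enumerate(keystroke_times_ms):
        next_t = keystroke_times_ms[i + 1] if i + 1 < len(keystroke_times_ms) else None
        if next_t is None or next_t - t >= debounce_ms:
            count += 1
    return count

# Ten keystrokes 100 ms apart: with a 500 ms debounce only the final
# pause triggers a request; with a 50 ms debounce every keystroke does.
burst = [i * 100 for i in range(10)]
print(completions_fired(burst, 500))  # -> 1
print(completions_fired(burst, 50))   # -> 10
```

Lowering debounceDelay makes suggestions appear sooner but multiplies the number of requests your GPU has to serve.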

For faster autocomplete, use the smallest model that gives acceptable quality:

| Model | Size | Speed (t/s) | Quality | Best for |
|---|---|---|---|---|
| qwen2.5-coder:0.5b | 395 MB | 100+ | Basic | Simple completions |
| qwen2.5-coder:1.5b | 1.0 GB | 60-80 | Good | Recommended autocomplete |
| qwen2.5-coder:3b | 2.0 GB | 40-60 | Better | Multi-line completions |
| qwen2.5-coder:7b | 4.7 GB | 25-40 | Best | Chat + completions |
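Those throughput numbers translate directly into perceived latency. Ignoring prompt-processing time (which adds more on top), a rough estimate for a typical completion:

```python
def completion_latency_ms(tokens: int, tokens_per_sec: float) -> int:
    """Rough generation time for a completion, ignoring prompt processing."""
    return round(tokens / tokens_per_sec * 1000)

# A 25-token single-line completion at each model's low-end speed
for name, tps in [("0.5b", 100), ("1.5b", 60), ("3b", 40), ("7b", 25)]:
    print(f"qwen2.5-coder:{name}: ~{completion_latency_ms(25, tps)} ms")
```

At 7B speeds a one-line completion already takes around a second, which is exactly why a small dedicated tabAutocompleteModel is recommended even when a larger chat model fits in VRAM.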

Approach 2: Tabby (Self-Hosted Code Completion Server)

Tabby is a self-hosted AI coding assistant that provides a GitHub Copilot-compatible API. It supports VS Code, JetBrains, Vim/Neovim, and Emacs.

Step 1: Install Tabby

# Docker (recommended, with NVIDIA GPU)
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -v tabby_data:/data \
  --name tabby \
  --restart unless-stopped \
  tabbyml/tabby \
  serve --model Qwen2.5-Coder-7B --device cuda

# Docker (CPU only)
docker run -d \
  -p 8080:8080 \
  -v tabby_data:/data \
  --name tabby \
  --restart unless-stopped \
  tabbyml/tabby \
  serve --model Qwen2.5-Coder-3B --device cpu

# Or run a prebuilt binary (download from the TabbyML/tabby GitHub releases)
./tabby serve --model Qwen2.5-Coder-7B --device cuda
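Once the server is up, you can sanity-check it before wiring up an editor. The sketch below builds and sends a request to Tabby's /v1/completions fill-in-the-middle endpoint; the payload shape (a language plus prefix/suffix segments) follows Tabby's published API, but treat the exact field names as an assumption to verify against your server's own Swagger docs at /swagger-ui:

```python
import json
import urllib.request

def build_completion_request(prefix: str, suffix: str = "",
                             language: str = "python") -> dict:
    """Payload for Tabby's /v1/completions fill-in-the-middle endpoint."""
    return {"language": language, "segments": {"prefix": prefix, "suffix": suffix}}

def complete(server: str, prefix: str) -> str:
    """POST a completion request and return the first suggestion's text."""
    body = json.dumps(build_completion_request(prefix)).encode()
    req = urllib.request.Request(
        f"{server}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("http://localhost:8080", "def fib(n):\n    "))
```

If this returns a plausible continuation of `fib`, the editor extensions in the next step should work with the same endpoint.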

Step 2: Install Tabby Extension

VS Code:

  1. Install “Tabby” extension from marketplace
  2. Open Settings (Cmd+,)
  3. Search for “Tabby”
  4. Set endpoint to http://localhost:8080

JetBrains:

  1. Install “Tabby” plugin from marketplace
  2. Settings > Tabby > Server endpoint: http://localhost:8080

Vim/Neovim:

" Using vim-plug
Plug 'TabbyML/vim-tabby'

" In init.vim/init.lua
let g:tabby_server_url = 'http://localhost:8080'

Step 3: Repository-Level Context

Tabby can index your entire repository for better context-aware completions:

# In ~/.tabby/config.toml (mounted at /data/config.toml in the Docker setup
# above), add the repository to index:
[[repositories]]
name = "my-project"
git_url = "file:///path/to/your/repo"

# Newer releases also let you add repositories from the admin UI
# at http://localhost:8080

Tabby vs. Continue

| Feature | Continue | Tabby |
|---|---|---|
| Chat interface | Yes | Limited |
| Autocomplete | Yes | Yes (optimized) |
| Inline edit | Yes | No |
| Repository indexing | Basic (@codebase) | Advanced |
| IDE support | VS Code, JetBrains | VS Code, JetBrains, Vim, Emacs |
| Backend | Any Ollama model | Tabby’s own model serving |
| Custom commands | Yes | No |
| Self-hosted team use | Via Ollama | Built-in team features |

Approach 3: Aider (Terminal AI Coding)

Aider is a terminal-based AI coding assistant that works with your git repository. It can read, understand, and edit multiple files at once.

Step 1: Install Aider

pip install aider-chat

# Or with pipx for isolated install
pipx install aider-chat

Step 2: Configure with Ollama

# Set Ollama as the backend
export OLLAMA_API_BASE=http://localhost:11434

# Run Aider with a local model
aider --model ollama/qwen2.5-coder:7b

# With a larger model for complex tasks
aider --model ollama/qwen2.5-coder:14b

Step 3: Using Aider

# Start Aider in your project directory
cd /path/to/project
aider --model ollama/qwen2.5-coder:7b

# Add files to the chat context
> /add src/auth.py src/models.py

# Ask for changes
> Add rate limiting to the login endpoint. 
> Use a sliding window of 5 attempts per minute.

# Aider will:
# 1. Read the files
# 2. Propose changes as a diff
# 3. Apply changes to the files
# 4. Create a git commit

# Run tests after changes
> /run pytest tests/test_auth.py

# Undo the last change
> /undo

Aider Configuration File

Create ~/.aider.conf.yml:

# Default model
model: ollama/qwen2.5-coder:7b

# Git behavior
auto-commits: true
auto-lint: true

# Context
map-tokens: 2048

# Editor
editor-model: ollama/qwen2.5-coder:7b

Aider Best Practices

  1. Start small: Add only the files Aider needs to see. Less context = better results.
  2. Be specific: “Add input validation to the create_user function” works better than “improve the code.”
  3. Review diffs: Always review Aider’s proposed changes before accepting.
  4. Use git: Aider works best with git. Commits let you easily undo changes.
  5. Iterate: Make one change at a time. Complex multi-file refactors work better as a series of small changes.

Best Code Models Compared

Benchmark Comparison

| Model | Size | HumanEval | MBPP | MultiPL-E | Best for |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 1.0 GB | 55.2 | 51.4 | 48.1 | Fast autocomplete |
| Qwen 2.5 Coder 7B | 4.7 GB | 76.8 | 72.3 | 68.5 | Best 7B code model |
| Qwen 2.5 Coder 14B | 9.0 GB | 82.1 | 77.8 | 74.2 | Sweet spot |
| Qwen 2.5 Coder 32B | 19 GB | 87.4 | 82.1 | 79.3 | Near-cloud quality |
| DeepSeek Coder V2 16B | 9.4 GB | 78.6 | 73.5 | 70.1 | Reasoning-heavy code |
| CodeLlama 34B | 19 GB | 72.4 | 68.2 | 64.8 | Older but proven |
| StarCoder 2 15B | 9.2 GB | 68.5 | 65.3 | 62.1 | FIM/autocomplete |
| Llama 3.1 8B | 4.7 GB | 62.3 | 59.8 | 55.4 | General + some code |

HumanEval, MBPP: Python benchmarks. MultiPL-E: Multi-language. Higher = better.
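If you're choosing between these models for a fixed VRAM budget, HumanEval points per gigabyte is a crude but useful tiebreaker. The snippet below ranks the entries using the numbers copied from the table above:

```python
# (model, size_gb, humaneval) copied from the benchmark table above
MODELS = [
    ("Qwen 2.5 Coder 1.5B", 1.0, 55.2),
    ("Qwen 2.5 Coder 7B", 4.7, 76.8),
    ("Qwen 2.5 Coder 14B", 9.0, 82.1),
    ("Qwen 2.5 Coder 32B", 19.0, 87.4),
    ("DeepSeek Coder V2 16B", 9.4, 78.6),
    ("CodeLlama 34B", 19.0, 72.4),
    ("StarCoder 2 15B", 9.2, 68.5),
    ("Llama 3.1 8B", 4.7, 62.3),
]

def rank_by_efficiency(models):
    """Sort models by HumanEval points per GB, best first."""
    return sorted(models, key=lambda m: m[2] / m[1], reverse=True)

for name, size, score in rank_by_efficiency(MODELS):
    print(f"{name}: {score / size:.1f} HumanEval pts/GB")
```

Small models dominate points-per-GB, which is why every setup below pairs a small autocomplete model with the largest chat model the hardware allows, rather than one mid-size model for both.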

Language Support by Model

| Model | Python | JavaScript | TypeScript | Go | Rust | Java | C++ |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 Coder | A | A | A | A | A | A | A |
| DeepSeek Coder V2 | A | A | A | B+ | A | A | B+ |
| CodeLlama | A | B+ | B | B | B | B+ | B |
| StarCoder 2 | A | A | B+ | B+ | B+ | A | B+ |

A = Excellent, B+ = Good, B = Adequate

Budget Setup (8 GB VRAM)

# Chat: Qwen 2.5 Coder 7B
# Autocomplete: Qwen 2.5 Coder 1.5B
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5b

Continue config: Use 7B for chat, 1.5B for tab autocomplete.

Mid-Range Setup (16 GB VRAM)

# Chat: Qwen 2.5 Coder 14B
# Autocomplete: Qwen 2.5 Coder 3B
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:3b

High-End Setup (24 GB VRAM)

# Chat: Qwen 2.5 Coder 32B Q4
# Autocomplete: Qwen 2.5 Coder 7B
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b

Apple Silicon Setup (32 GB Unified Memory)

# Chat: Qwen 2.5 Coder 14B
# Autocomplete: Qwen 2.5 Coder 3B
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:3b

Tips for Better Code Generation

Write Good Prompts

# Bad: "Fix this code"
# Good: "Fix the TypeError in the parse_json function. The error occurs 
#        when the input contains nested arrays. The function should handle 
#        nested structures recursively."

# Bad: "Write a web server"
# Good: "Write a FastAPI endpoint that accepts POST requests with a JSON 
#        body containing 'url' and 'depth' fields, crawls the URL to the 
#        specified depth, and returns the extracted text content."

Provide Context

# In Continue chat, reference specific files:
# @file:src/models/user.py Add a 'last_login' field as a datetime, 
# nullable, with a default of None. Update the __repr__ method.

Use System Prompts

In Ollama Modelfile, add coding-specific system prompts:

cat > ~/Modelfile-coder << 'EOF'
FROM qwen2.5-coder:7b
SYSTEM """You are an expert software engineer. Follow these rules:
1. Write clean, well-documented code
2. Include type hints for Python
3. Follow the project's existing code style
4. Handle errors explicitly
5. Write code that is testable
6. Prefer standard library solutions over third-party packages"""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF

ollama create coder -f ~/Modelfile-coder

Integrating with Development Workflow

Git Commit Messages

# Generate commit messages from staged changes
git diff --cached | ollama run qwen2.5-coder:7b "Write a concise git commit message for these changes:"

Code Review

# Review a PR diff
git diff main..feature-branch | ollama run qwen2.5-coder:7b "Review this code diff for bugs, security issues, and improvements:"

Documentation Generation

# Generate docs for a file
cat src/auth.py | ollama run qwen2.5-coder:7b "Generate comprehensive documentation for this Python module, including function docstrings and a module-level overview:"
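These one-liners are easy to script. The sketch below wraps the commit-message pipeline in Python so it could be dropped into a prepare-commit-msg git hook; the `git` and `ollama` invocations mirror the shell one-liner above, and assume both tools are on your PATH:

```python
import subprocess

def build_ollama_cmd(model: str, prompt: str) -> list[str]:
    """Command line for piping stdin through a local Ollama model."""
    return ["ollama", "run", model, prompt]

def commit_message_from_staged(model: str = "qwen2.5-coder:7b") -> str:
    """Pipe the staged diff through a local model and return the suggestion."""
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout
    if not diff.strip():
        return ""
    cmd = build_ollama_cmd(model, "Write a concise git commit message for these changes:")
    result = subprocess.run(cmd, input=diff, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(commit_message_from_staged() or "nothing staged")
```

Because the diff is passed on stdin, the same pattern works for the review and documentation pipelines: only the prompt string changes.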

Frequently Asked Questions

Can a local code assistant match GitHub Copilot quality?

For common languages and frameworks (Python, JavaScript, TypeScript, Go, Rust), a local code model like Qwen 2.5 Coder 14B or 32B approaches Copilot quality. For niche languages or highly specialized frameworks, Copilot still has an edge due to its larger model and training data. The gap has narrowed significantly since 2024, and for many developers the privacy and cost benefits of local code assistance outweigh the quality difference.

Which code model is best for local use?

Qwen 2.5 Coder is the current leader for local code assistance across all sizes. At 7B, it outperforms all other 7B code models on HumanEval and MBPP benchmarks. At 32B, it competes with much larger cloud models. DeepSeek Coder V2 is a strong alternative for reasoning-heavy code tasks. For autocomplete (fill-in-middle), Qwen 2.5 Coder and StarCoder 2 both excel.

How much VRAM do I need for a local code assistant?

For tab completion (autocomplete), a 3B model works well and needs only 2-4 GB VRAM. For chat-based code assistance (explaining code, writing functions), a 7B model needs 6-8 GB VRAM. For the best experience, a 14B-32B code model with 12-24 GB VRAM provides quality close to cloud services.

Can I use a local code assistant with JetBrains IDEs?

Yes. Continue supports JetBrains IDEs (IntelliJ, PyCharm, WebStorm, etc.) in addition to VS Code. Tabby also has JetBrains plugins. Both connect to Ollama or any OpenAI-compatible API server running locally.