A local AI code assistant gives you the benefits of GitHub Copilot without sending your code to the cloud. Your proprietary code, API keys, and internal documentation stay on your machine. This guide covers three approaches: Continue with Ollama for VS Code integration, Tabby for self-hosted completions, and Aider for terminal-based AI coding. Each setup runs entirely locally using open-source code models that have become competitive with cloud alternatives.
Approach 1: Continue + Ollama (VS Code)
Continue is the most popular open-source AI code assistant. It integrates directly into VS Code and JetBrains IDEs, providing chat, autocomplete, and inline editing powered by your local Ollama models.
Step 1: Install Ollama and Code Models
# Install Ollama if not already installed
curl -fsSL https://ollama.com/install.sh | sh
# Pull code-optimized models
ollama pull qwen2.5-coder:7b # Primary code model (4.7 GB)
ollama pull qwen2.5-coder:1.5b # Fast autocomplete model (1.0 GB)
# Optional: larger model for better quality (needs 16+ GB VRAM)
ollama pull qwen2.5-coder:14b # Higher quality (9.0 GB)
# Optional: general model for explanations
ollama pull llama3.1:8b # General chat (4.7 GB)
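All of the tools below talk to Ollama through its local HTTP API (http://localhost:11434 by default). As a quick mental model, here is a minimal sketch of the request body a client sends to the /api/generate endpoint; the helper function name is my own, but the payload fields match Ollama's API:

```python
import json

def build_generate_request(model: str, prompt: str,
                           temperature: float = 0.1,
                           num_predict: int = 256) -> tuple[str, bytes]:
    """Build a request for Ollama's POST /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return a single JSON object instead of a token stream
        "options": {"temperature": temperature, "num_predict": num_predict},
    }
    return "http://localhost:11434/api/generate", json.dumps(payload).encode()

url, body = build_generate_request("qwen2.5-coder:7b", "def fib(n):")
```

POST that body to the returned URL (with a running `ollama serve`) and the completion comes back in the `response` field of the JSON reply.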
Step 2: Install Continue Extension
- Open VS Code
- Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
- Search for “Continue”
- Install “Continue - Codestral, Claude, and more”
- Click the Continue icon in the sidebar
Step 3: Configure Continue
Continue’s configuration lives in ~/.continue/config.json. Here’s an optimized configuration:
{
"models": [
{
"title": "Qwen 2.5 Coder 7B",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"contextLength": 8192,
"completionOptions": {
"temperature": 0.1,
"maxTokens": 2048
}
},
{
"title": "Qwen 2.5 Coder 14B",
"provider": "ollama",
"model": "qwen2.5-coder:14b",
"contextLength": 8192,
"completionOptions": {
"temperature": 0.1,
"maxTokens": 4096
}
},
{
"title": "Llama 3.1 8B (Chat)",
"provider": "ollama",
"model": "llama3.1:8b",
"contextLength": 8192,
"completionOptions": {
"temperature": 0.3,
"maxTokens": 2048
}
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder 1.5B",
"provider": "ollama",
"model": "qwen2.5-coder:1.5b",
"contextLength": 4096
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text"
},
"customCommands": [
{
"name": "test",
"prompt": "Write unit tests for the selected code. Use the project's existing test framework and conventions.",
"description": "Generate unit tests"
},
{
"name": "review",
"prompt": "Review the selected code for bugs, security issues, and improvements. Be specific and actionable.",
"description": "Code review"
},
{
"name": "docs",
"prompt": "Add comprehensive documentation to the selected code. Include parameter descriptions, return values, and usage examples.",
"description": "Generate documentation"
}
]
}
Step 4: Using Continue
Chat (Cmd+L / Ctrl+L):
- Ask questions about your codebase
- Request code generation
- Debug errors by pasting stack traces
- Get explanations of complex code
Autocomplete (Tab):
- Continue suggests completions as you type
- Press Tab to accept, Escape to dismiss
- Uses the smaller tabAutocompleteModel for speed
Inline Edit (Cmd+I / Ctrl+I):
- Select code and press Cmd+I
- Describe the change you want
- Continue rewrites the selected code
Codebase Context (@codebase):
- Type @codebase in chat to search your entire project
- Continue uses embeddings to find relevant files
- Requires the nomic-embed-text model (ollama pull nomic-embed-text)
// Example chat prompts:
@codebase How does the authentication flow work?
@file:src/auth.ts Explain this middleware
/test (with code selected)
/review (with code selected)
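Under the hood, @codebase embeds chunks of your files with nomic-embed-text, embeds your question, and ranks chunks by cosine similarity. A toy sketch of that ranking step (the three-dimensional vectors here are made up; real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec: list[float], chunks: dict[str, list[float]]) -> list[str]:
    """Return chunk names sorted most-similar-first to the query embedding."""
    return sorted(chunks, key=lambda name: cosine(query_vec, chunks[name]), reverse=True)

chunks = {
    "src/auth.ts":  [0.9, 0.1, 0.0],
    "src/utils.ts": [0.1, 0.8, 0.3],
}
print(rank_chunks([1.0, 0.0, 0.0], chunks))  # src/auth.ts ranks first
```

The top-ranked chunks are what Continue stuffs into the prompt as context, which is why the embeddings model must be pulled before @codebase works.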
Performance Optimization for Continue
// In config.json - optimize autocomplete speed
{
"tabAutocompleteOptions": {
"debounceDelay": 500,
"maxPromptTokens": 1024,
"prefixPercentage": 0.7,
"multilineCompletions": "auto",
"useCache": true
}
}
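The prefixPercentage option controls how the maxPromptTokens budget is split between the code before and after your cursor. A sketch of that split, keeping the tokens nearest the cursor (the helper name and the words-as-tokens simplification are my own):

```python
def split_budget(prefix_tokens: list[str], suffix_tokens: list[str],
                 max_prompt_tokens: int = 1024,
                 prefix_percentage: float = 0.7) -> tuple[list[str], list[str]]:
    """Trim prefix/suffix so the combined prompt fits the token budget.

    Keeps the tokens nearest the cursor: the END of the prefix and the
    START of the suffix.
    """
    prefix_budget = int(max_prompt_tokens * prefix_percentage)   # 716 of 1024
    suffix_budget = max_prompt_tokens - prefix_budget            # remaining 308
    return prefix_tokens[-prefix_budget:], suffix_tokens[:suffix_budget]
```

A larger prefixPercentage favors the code you just wrote, which usually matters more for the next completion than the code below the cursor.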
For faster autocomplete, use the smallest model that gives acceptable quality:
| Model | Size | Speed (t/s) | Quality | Best For |
|---|---|---|---|---|
| qwen2.5-coder:0.5b | 395 MB | 100+ | Basic | Simple completions |
| qwen2.5-coder:1.5b | 1.0 GB | 60-80 | Good | Recommended autocomplete |
| qwen2.5-coder:3b | 2.0 GB | 40-60 | Better | Multi-line completions |
| qwen2.5-coder:7b | 4.7 GB | 25-40 | Best | Chat + completions |
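The speed column translates directly into perceived latency: tokens to generate divided by tokens per second, plus time to first token. A back-of-envelope estimate (the 150 ms first-token figure is illustrative, not measured):

```python
def completion_latency_ms(tokens: int, tokens_per_sec: float,
                          first_token_ms: float = 150.0) -> float:
    """Rough wall-clock estimate for one autocomplete suggestion."""
    return first_token_ms + tokens / tokens_per_sec * 1000.0

# A 30-token, single-line completion at 60 t/s (the 1.5B model):
print(round(completion_latency_ms(30, 60)))  # 650 ms
```

At 25 t/s (the 7B model), the same completion takes roughly twice as long, which is why the smaller model is the recommended autocomplete default even when the 7B fits in VRAM.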
Approach 2: Tabby (Self-Hosted Code Completion Server)
Tabby is a self-hosted AI coding assistant, an open-source alternative to GitHub Copilot with its own completion server and editor extensions. It supports VS Code, JetBrains, Vim/Neovim, and Emacs.
Step 1: Install Tabby
# Docker (recommended, with NVIDIA GPU)
docker run -d \
--gpus all \
-p 8080:8080 \
-v tabby_data:/data \
--name tabby \
--restart unless-stopped \
tabbyml/tabby \
serve --model Qwen2.5-Coder-7B --device cuda
# Docker (CPU only)
docker run -d \
-p 8080:8080 \
-v tabby_data:/data \
--name tabby \
--restart unless-stopped \
tabbyml/tabby \
serve --model Qwen2.5-Coder-3B --device cpu
# Or run the standalone binary
# (prebuilt releases at github.com/TabbyML/tabby/releases)
tabby serve --model Qwen2.5-Coder-7B --device cuda
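Editor extensions talk to the server through Tabby's POST /v1/completions endpoint, sending the code around the cursor as prefix/suffix segments. A sketch of that request body (the field names follow Tabby's published API; the helper function is my own):

```python
import json

def tabby_completion_request(language: str, prefix: str, suffix: str = "") -> bytes:
    """Serialize a request body for Tabby's POST /v1/completions endpoint."""
    return json.dumps({
        "language": language,
        "segments": {"prefix": prefix, "suffix": suffix},
    }).encode()

body = tabby_completion_request("python", "def add(a, b):\n    return ")
```

Any editor that can produce this payload can use Tabby, which is how the Vim and Emacs integrations work without a heavyweight plugin.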
Step 2: Install Tabby Extension
VS Code:
- Install “Tabby” extension from marketplace
- Open Settings (Cmd+,)
- Search for “Tabby”
- Set endpoint to http://localhost:8080
JetBrains:
- Install “Tabby” plugin from marketplace
- Settings > Tabby > Server endpoint: http://localhost:8080
Vim/Neovim:
" Using vim-plug
Plug 'TabbyML/vim-tabby'
" In init.vim/init.lua
let g:tabby_server_url = 'http://localhost:8080'
Step 3: Repository-Level Context
Tabby can index your entire repository for better context-aware completions:
# Add the repository to Tabby's config file (config.toml in the data
# directory, i.e. /data/config.toml inside the Docker volume):
#
#   [[repositories]]
#   name = "my-project"
#   git_url = "file:///path/to/your/repo"
#
# Newer Tabby versions manage repositories and trigger indexing from the
# web UI at http://localhost:8080 instead.
Tabby vs. Continue
| Feature | Continue | Tabby |
|---|---|---|
| Chat interface | Yes | Limited |
| Autocomplete | Yes | Yes (optimized) |
| Inline edit | Yes | No |
| Repository indexing | Basic (@codebase) | Advanced |
| IDE support | VS Code, JetBrains | VS Code, JetBrains, Vim, Emacs |
| Backend | Any Ollama model | Tabby’s own model serving |
| Custom commands | Yes | No |
| Self-hosted team use | Via Ollama | Built-in team features |
Approach 3: Aider (Terminal AI Coding)
Aider is a terminal-based AI coding assistant that works with your git repository. It can read, understand, and edit multiple files at once.
Step 1: Install Aider
pip install aider-chat
# Or with pipx for isolated install
pipx install aider-chat
Step 2: Configure with Ollama
# Set Ollama as the backend
export OLLAMA_API_BASE=http://localhost:11434
# Run Aider with a local model
aider --model ollama/qwen2.5-coder:7b
# With a larger model for complex tasks
aider --model ollama/qwen2.5-coder:14b
Step 3: Using Aider
# Start Aider in your project directory
cd /path/to/project
aider --model ollama/qwen2.5-coder:7b
# Add files to the chat context
> /add src/auth.py src/models.py
# Ask for changes
> Add rate limiting to the login endpoint.
> Use a sliding window of 5 attempts per minute.
# Aider will:
# 1. Read the files
# 2. Propose changes as a diff
# 3. Apply changes to the files
# 4. Create a git commit
# Run tests after changes
> /run pytest tests/test_auth.py
# Undo the last change
> /undo
Aider Configuration File
Create ~/.aider.conf.yml:
# Default model
model: ollama/qwen2.5-coder:7b
# Git behavior
auto-commits: true
auto-lint: true
# Context
map-tokens: 2048
# Editor
editor-model: ollama/qwen2.5-coder:7b
Aider Best Practices
- Start small: Add only the files Aider needs to see. Less context = better results.
- Be specific: “Add input validation to the create_user function” works better than “improve the code.”
- Review diffs: Always review Aider’s proposed changes before accepting.
- Use git: Aider works best with git. Commits let you easily undo changes.
- Iterate: Make one change at a time. Complex multi-file refactors work better as a series of small changes.
Best Code Models Compared
Benchmark Comparison
| Model | Size | HumanEval | MBPP | MultiPL-E | Best For |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 1.0 GB | 55.2 | 51.4 | 48.1 | Fast autocomplete |
| Qwen 2.5 Coder 7B | 4.7 GB | 76.8 | 72.3 | 68.5 | Best 7B code model |
| Qwen 2.5 Coder 14B | 9.0 GB | 82.1 | 77.8 | 74.2 | Sweet spot |
| Qwen 2.5 Coder 32B | 19 GB | 87.4 | 82.1 | 79.3 | Near-cloud quality |
| DeepSeek Coder V2 16B | 9.4 GB | 78.6 | 73.5 | 70.1 | Reasoning-heavy code |
| CodeLlama 34B | 19 GB | 72.4 | 68.2 | 64.8 | Older but proven |
| StarCoder 2 15B | 9.2 GB | 68.5 | 65.3 | 62.1 | FIM/autocomplete |
| Llama 3.1 8B | 4.7 GB | 62.3 | 59.8 | 55.4 | General + some code |
HumanEval, MBPP: Python benchmarks. MultiPL-E: Multi-language. Higher = better.
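These scores are pass@1 rates: the fraction of problems solved by a single generated sample. The standard unbiased estimator (introduced with HumanEval in the Codex paper) is pass@k = 1 - C(n-c, k)/C(n, k), where n samples were drawn per problem and c of them passed the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c passing."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 4, 1))  # 0.4, matching a per-sample pass rate of 40%
```

For k=1 this reduces to the plain pass fraction c/n, so the table values can be read directly as "percent of problems solved on the first try."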
Language Support by Model
| Model | Python | JavaScript | TypeScript | Go | Rust | Java | C++ |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 Coder | A | A | A | A | A | A | A |
| DeepSeek Coder V2 | A | A | A | B+ | A | A | B+ |
| CodeLlama | A | B+ | B | B | B | B+ | B |
| StarCoder 2 | A | A | B+ | B+ | B+ | A | B+ |
A = Excellent, B+ = Good, B = Adequate
Recommended Setups
Budget Setup (8 GB VRAM)
# Chat: Qwen 2.5 Coder 7B
# Autocomplete: Qwen 2.5 Coder 1.5B
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5b
Continue config: Use 7B for chat, 1.5B for tab autocomplete.
Mid-Range Setup (16 GB VRAM)
# Chat: Qwen 2.5 Coder 14B
# Autocomplete: Qwen 2.5 Coder 3B
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:3b
High-End Setup (24 GB VRAM)
# Chat: Qwen 2.5 Coder 32B Q4
# Autocomplete: Qwen 2.5 Coder 7B
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b
Apple Silicon Setup (32 GB Unified Memory)
# Chat: Qwen 2.5 Coder 14B
# Autocomplete: Qwen 2.5 Coder 3B
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5-coder:3b
Tips for Better Code Generation
Write Good Prompts
# Bad: "Fix this code"
# Good: "Fix the TypeError in the parse_json function. The error occurs
# when the input contains nested arrays. The function should handle
# nested structures recursively."
# Bad: "Write a web server"
# Good: "Write a FastAPI endpoint that accepts POST requests with a JSON
# body containing 'url' and 'depth' fields, crawls the URL to the
# specified depth, and returns the extracted text content."
Provide Context
# In Continue chat, reference specific files:
# @file:src/models/user.py Add a 'last_login' field as a datetime,
# nullable, with a default of None. Update the __repr__ method.
Use System Prompts
In Ollama Modelfile, add coding-specific system prompts:
cat > ~/Modelfile-coder << 'EOF'
FROM qwen2.5-coder:7b
SYSTEM """You are an expert software engineer. Follow these rules:
1. Write clean, well-documented code
2. Include type hints for Python
3. Follow the project's existing code style
4. Handle errors explicitly
5. Write code that is testable
6. Prefer standard library solutions over third-party packages"""
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
ollama create coder -f ~/Modelfile-coder
Integrating with Development Workflow
Git Commit Messages
# Generate commit messages from staged changes
git diff --cached | ollama run qwen2.5-coder:7b "Write a concise git commit message for these changes:"
Code Review
# Review a PR diff
git diff main..feature-branch | ollama run qwen2.5-coder:7b "Review this code diff for bugs, security issues, and improvements:"
Documentation Generation
# Generate docs for a file
cat src/auth.py | ollama run qwen2.5-coder:7b "Generate comprehensive documentation for this Python module, including function docstrings and a module-level overview:"
Next Steps
- Set up RAG for your codebase: Local RAG Chatbot — index your documentation
- Choose the right model: Model selection guide
- Optimize for your platform: Windows, macOS, Linux
- Deploy for your team: Enterprise Local AI