How to Choose the Right Local LLM for Your Use Case

A practical decision framework for selecting the best local LLM based on your task type, hardware capabilities, VRAM budget, and quality requirements — covering Llama, Mistral, Gemma, Qwen, DeepSeek, Phi, and more.

Choosing the right local LLM means matching your specific task, hardware constraints, and quality expectations to the right model family, parameter count, and quantization level. For most users, the best starting point is a 7B-8B parameter model from the Llama, Qwen, or Mistral families: these run on any modern GPU with 6+ GB of VRAM and deliver strong performance across general tasks. There is no single “best” model; the right choice depends on whether you need a general assistant, a coding expert, a reasoning powerhouse, a creative writer, or a specialized tool for RAG and embeddings.

The local LLM landscape in 2026 offers an extraordinary range of choices. Dozens of model families, hundreds of fine-tunes, and multiple quantization formats compete for your attention. This guide cuts through the noise with a practical decision framework: understand your task, know your hardware limits, pick the right model family and size, and choose the appropriate quantization level.

What are the major local LLM families?

Before diving into recommendations, you need to understand the major model families available for local deployment. Each family has different strengths, architectures, and licensing terms.

Meta Llama

Llama 3.2 (2024-2025) is the most widely used open-weight model family. Across the Llama 3.1 and 3.2 releases, sizes span 1B, 3B, 8B, 70B, and 405B parameters, with multimodal variants that handle both text and images.

  • Strengths: Excellent general-purpose quality, strong instruction following, massive community support, extensive fine-tune ecosystem.
  • Best sizes: 8B for everyday use, 70B for near-frontier quality.
  • License: Llama Community License — free for most commercial use (restrictions apply to companies with 700M+ monthly active users).
  • Why choose it: The safest default choice. Enormous community, thousands of fine-tunes available, supported by every inference engine.

Qwen (Alibaba)

Qwen 2.5 is a comprehensive model family covering language, code, and math. Available at 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B, and 110B sizes, with specialized variants for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math).

  • Strengths: Strong multilingual support (especially Chinese-English), excellent coding capabilities, competitive math performance, wide range of sizes.
  • Best sizes: 7B for general use, 14B-32B for high quality, Coder-7B/32B for programming.
  • License: Apache 2.0 for most sizes — the most permissive license among top model families.
  • Why choose it: Best coding models at the 7B scale. Excellent multilingual support. Apache 2.0 license.

Mistral / Mixtral

Mistral models are known for efficiency and strong performance relative to their size. The Mixtral series uses a mixture-of-experts (MoE) architecture where only a subset of parameters is active for each token, giving larger-model quality with smaller-model inference cost.

  • Strengths: Efficient architecture, strong European multilingual support, fast inference, excellent instruction following.
  • Key models: Mistral 7B (dense), Mixtral 8x7B (MoE, 47B total but 13B active), Mixtral 8x22B (MoE, 176B total but 44B active), Mistral Large.
  • License: Apache 2.0 for most models.
  • Why choose it: MoE architecture gives you larger-model quality at near-small-model speeds. Strong for multilingual European applications.

Google Gemma

Gemma 2 offers compact, high-quality models at 2B, 9B, and 27B parameters. Trained on Google’s data and infrastructure, these models punch above their weight class.

  • Strengths: High quality for their size, strong instruction following, good safety tuning, efficient architecture.
  • Best sizes: 9B for a quality boost over 7B models, 27B for high-quality output.
  • License: Gemma Terms of Use — permissive for most commercial use.
  • Why choose it: The 9B model is often considered the best model in the 7B-10B range for general quality.

DeepSeek

DeepSeek has produced several standout models, including DeepSeek-V2 (a general MoE model) and DeepSeek-R1 (a reasoning-focused model that rivals GPT-4 on math and logic).

  • Strengths: Exceptional reasoning and mathematics, innovative MoE architecture with very efficient inference, strong coding capabilities.
  • Key models: DeepSeek-R1 (reasoning specialist), DeepSeek-Coder-V2 (coding), DeepSeek-V2.5 (general purpose).
  • License: MIT for most models — fully permissive.
  • Why choose it: Best-in-class reasoning and math capabilities. DeepSeek-R1 competes with GPT-4 and o1 on mathematical reasoning benchmarks.

Microsoft Phi

Phi-3 and Phi-4 are “small language models” (SLMs) designed to maximize quality at small parameter counts through high-quality training data.

  • Strengths: Extraordinary quality for their size (3.8B and 14B), efficient inference, strong reasoning for a small model.
  • Best sizes: Phi-3 Mini (3.8B) for constrained hardware, Phi-4 (14B) for quality.
  • License: MIT.
  • Why choose it: When you need the most capable model that fits in minimal memory. Phi-3 Mini at 3.8B outperforms many 7B models on benchmarks.

Cohere Command-R

Command-R and Command-R+ are designed specifically for retrieval-augmented generation (RAG) and tool use.

  • Strengths: Built-in citation generation, excellent at grounded generation from retrieved documents, strong tool-use capabilities, multilingual.
  • Best sizes: Command-R (35B) for RAG workflows, Command-R+ (104B) for maximum quality.
  • License: CC-BY-NC for research; commercial license available.
  • Why choose it: Purpose-built for RAG. If your primary use case is question-answering over documents, Command-R produces better-grounded, better-cited answers than general-purpose models.

How do you choose a model by task type?

The most important factor in model selection is what you want the model to do. Different tasks have different requirements for model capabilities, size, and specialization.

Recommendation matrix by use case

| Use Case | Recommended Model (Budget) | Recommended Model (Quality) | Min VRAM | Notes |
|---|---|---|---|---|
| General chat | Llama 3.2 8B, Gemma 2 9B | Llama 3.2 70B, Qwen2.5 72B | 6 GB | Start with 8B; upgrade to 70B if quality matters |
| Coding | Qwen2.5-Coder 7B | DeepSeek-Coder-V2 16B, Qwen2.5-Coder 32B | 6 GB | Coding models outperform general models significantly |
| Math/reasoning | Phi-4 14B, Qwen2.5-Math 7B | DeepSeek-R1, Qwen2.5-Math 72B | 6 GB | Reasoning models use chain-of-thought; expect longer outputs |
| Creative writing | Llama 3.2 8B, Mistral 7B | Llama 3.2 70B, Mixtral 8x22B | 6 GB | Larger models produce more nuanced, coherent long-form writing |
| RAG / document Q&A | Llama 3.2 8B, Command-R 35B | Command-R+ 104B, Qwen2.5 72B | 6 GB | RAG quality depends more on retrieval than model size |
| Summarization | Phi-3 Mini 3.8B, Llama 3.2 8B | Llama 3.2 70B | 4 GB | Even small models summarize well; larger models are more faithful |
| Translation | Qwen2.5 7B, Mistral 7B | Qwen2.5 72B, Mixtral 8x22B | 6 GB | Qwen excels at Chinese-English; Mistral at European languages |
| Classification / extraction | Phi-3 Mini 3.8B, Llama 3.2 3B | Any 7B+ model | 2 GB | Small models are excellent for structured extraction tasks |
| Embeddings | nomic-embed-text, mxbai-embed-large | BGE-large, E5-mistral-7b | 1 GB | Embedding models are small and fast; use the best one you can fit |
| Vision / multimodal | LLaVA 7B, Llama 3.2 Vision 11B | Llama 3.2 Vision 90B, Qwen-VL 72B | 8 GB | Vision models need extra VRAM for image encoding |
| Roleplay / characters | Mistral 7B fine-tunes, Llama 3.2 8B | Llama 3.2 70B, Mixtral 8x22B | 6 GB | Community fine-tunes on Hugging Face excel here |
| Function calling / agents | Llama 3.2 8B, Mistral 7B | Command-R+ 104B, Qwen2.5 72B | 6 GB | Choose models specifically trained for tool use |

Key insights from the table

  1. 7B-8B models cover most use cases adequately. You do not need a large model for everyday tasks. Reserve 70B+ models for tasks where you have noticed quality shortfalls with smaller models.

  2. Specialized models beat general models at specific tasks. Qwen2.5-Coder 7B outperforms Llama 3.2 70B on coding benchmarks despite being 10x smaller. DeepSeek-R1 beats general 70B models on math. Always check if a specialized model exists for your task.

  3. RAG quality is mostly about retrieval. For RAG workflows, the quality of your chunking, embedding, and retrieval pipeline matters more than the language model size. An 8B model with excellent retrieval outperforms a 70B model with poor retrieval.

  4. Embedding models are separate from language models. You need a dedicated embedding model for RAG and semantic search. These are small (100M-500M parameters typically) and run alongside your language model.
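To make the embedding point concrete, here is a toy sketch of what retrieval does with embedding vectors. The four-dimensional vectors below are invented for illustration; a real embedding model such as nomic-embed-text produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim embeddings; a real model would produce these vectors.
query = [0.9, 0.1, 0.0, 0.2]                  # "How do I reset my password?"
docs = {
    "password-reset guide": [0.8, 0.2, 0.1, 0.3],
    "billing FAQ":          [0.1, 0.9, 0.4, 0.0],
}

# Retrieval ranks documents by similarity to the query vector.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # -> password-reset guide
```

The language model never sees these vectors; it only receives the text of the top-ranked documents, which is why retrieval quality matters so much in RAG.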

How does model size affect quality?

Model quality scales with parameter count, but the relationship is not linear. There are diminishing returns at each size increase, and the importance of training data quality and methodology has grown relative to raw parameter count.

Quality tiers by parameter count

| Parameter Count | Quality Tier | Comparable Cloud Model | Practical Reality |
|---|---|---|---|
| 1B-3B | Basic | Below GPT-3.5 | Simple tasks: classification, extraction, short Q&A. Not suitable for complex reasoning or long-form writing. |
| 7B-8B | Good | GPT-3.5 level | Capable general assistant. Handles chat, coding, summarization, and RAG well. Occasional errors on complex reasoning. |
| 13B-14B | Strong | Between GPT-3.5 and GPT-4 | Noticeably better coherence, instruction following, and knowledge than 7B. Good for professional use. |
| 30B-34B | Very strong | Approaching GPT-4 on many tasks | High-quality output across most tasks. Strong reasoning and long-form writing. |
| 70B-72B | Excellent | Competitive with GPT-4 on most tasks | Near-frontier quality. Excels at complex reasoning, nuanced writing, and multi-step tasks. |
| 100B+ | Frontier-class | Rivals or matches GPT-4 | The best open-weight models. DeepSeek-R1 at 671B (MoE) competes with GPT-4 on reasoning. |

The size vs specialization trade-off

A common mistake is assuming bigger is always better. In practice:

  • Qwen2.5-Coder 7B outperforms Llama 3.2 70B on HumanEval (coding benchmark)
  • Phi-3 Mini 3.8B outperforms many 7B general models on reasoning benchmarks
  • DeepSeek-R1 Distill 7B outperforms Llama 3.2 70B on GSM8K (math)
  • A fine-tuned 7B model on your domain data often outperforms a general 70B model for that specific domain

The takeaway: pick the right model family and specialization first, then pick the largest size your hardware can run at Q4+ quantization. A specialized 7B model will often serve you better than a generic 30B model.

How do you choose a model based on available VRAM?

Your hardware determines which models you can practically run. Here is what fits at each VRAM tier, assuming Q4_K_M quantization (the most common sweet spot of quality and size):

Models by VRAM budget

| Available VRAM | Models That Fit (Q4_K_M) | Recommended Pick | Notes |
|---|---|---|---|
| 4 GB | 1B-3B models | Phi-3 Mini 3.8B (tight), Llama 3.2 3B | Very limited; suitable for basic tasks only |
| 6 GB | Up to 7B | Llama 3.2 8B (Q3), Qwen2.5 7B, Mistral 7B | The entry point for useful local AI |
| 8 GB | Up to 8B comfortably | Llama 3.2 8B (Q4), Gemma 2 9B (Q3), Qwen2.5-Coder 7B | Good balance; most 7B-8B models fit with context room |
| 10-12 GB | Up to 13B | Llama 3.2 8B (Q8), Qwen2.5 14B (Q3-Q4), Phi-4 14B (Q3) | Can run higher-quality quants of smaller models |
| 16 GB | Up to 14B comfortably | Qwen2.5 14B (Q5), Phi-4 14B (Q4-Q5), Gemma 2 27B (Q2) | Sweet spot for most users. 14B models at good quality. |
| 24 GB | Up to 30B; 70B at Q2 | Qwen2.5 32B (Q4), Mixtral 8x7B, Llama 3.2 70B (Q2) | The enthusiast tier. Excellent range of model options. |
| 32 GB (unified) | Up to 30B comfortably | Qwen2.5 32B (Q5), Command-R 35B (Q4) | Apple Silicon sweet spot. 30B models with generous context. |
| 48 GB | Up to 70B at Q3-Q4 | Llama 3.2 70B (Q3-Q4), Qwen2.5 72B (Q3) | 70B models become practical. Multi-GPU or Apple Silicon. |
| 64 GB (unified) | Up to 70B comfortably | Llama 3.2 70B (Q4-Q5), Qwen2.5 72B (Q4) | High-quality 70B inference on Apple Silicon. |
| 96 GB+ | Up to 70B at Q8; 100B+ | Llama 3.2 70B (Q8), Command-R+ 104B, DeepSeek-V2 236B (Q2) | Professional/enterprise tier. |
| 192 GB+ | 405B at Q3-Q4 | Llama 3.2 405B (Q3-Q4) | Maximum capability. Mac Ultra or multi-GPU server. |
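The table's figures follow from a simple rule of thumb: weight memory is roughly parameters (in billions) times bits per weight, divided by 8, plus a few gigabytes of headroom for KV cache and runtime overhead. A minimal sketch; the 2 GB headroom value is an assumption, and real usage varies with context length:

```python
def estimated_vram_gb(params_b, bits_per_weight, headroom_gb=2.0):
    """Rough VRAM estimate: weight bytes plus fixed headroom (assumed 2 GB)."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + headroom_gb

def fits(params_b, bits_per_weight, vram_gb):
    """Does a model of this size/quant plausibly fit in the given VRAM?"""
    return estimated_vram_gb(params_b, bits_per_weight) <= vram_gb

# An 8B model at Q4_K_M (~4.8 bits/weight) needs about 4.8 + 2 = 6.8 GB:
print(round(estimated_vram_gb(8, 4.8), 1))  # 6.8
print(fits(8, 4.8, 8))    # True  -> fits an 8 GB card
print(fits(70, 4.8, 24))  # False -> a 70B Q4 model does not fit in 24 GB
```

Treat the output as a first filter: longer context windows and vision encoders push real usage above this estimate.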

Partial GPU offloading

If your model does not fully fit in VRAM, most inference engines (llama.cpp, Ollama) support partial GPU offloading. This loads some model layers on the GPU and the rest in system RAM. Layers on the GPU process at full GPU speed; layers in RAM process at CPU speed. The result is a blended speed between full-GPU and full-CPU inference.

Example: A 70B Q4 model (~44 GB) on an RTX 4090 (24 GB VRAM) with 64 GB system RAM. Roughly 55% of layers run on the GPU and 45% on the CPU. You might get 10-15 tok/s instead of 25 tok/s (full GPU) or 3-5 tok/s (full CPU). This is a practical compromise when you want to run a model slightly larger than your VRAM.
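The blended speed can be approximated with a per-layer time model: each token spends part of its time in GPU layers and part in CPU layers, so effective throughput is a weighted harmonic mean of the two speeds. This is a simplification that ignores transfer overhead and uneven layer sizes, so treat the result as a rough estimate rather than a prediction:

```python
def blended_tok_per_s(gpu_fraction, gpu_speed, cpu_speed):
    """Approximate throughput when gpu_fraction of layers run on the GPU.

    Time per token blends GPU-layer time and CPU-layer time, so the
    effective speed is a weighted harmonic mean of the two speeds.
    """
    return 1 / (gpu_fraction / gpu_speed + (1 - gpu_fraction) / cpu_speed)

# 55% of layers on a GPU that would do 25 tok/s alone, the rest on a
# CPU that would do 5 tok/s alone (figures from the example above):
print(round(blended_tok_per_s(0.55, 25, 5), 1))  # 8.9
```

Note how the slower device dominates: even with over half the layers on the GPU, throughput sits much closer to the CPU-only figure, which is why offloading just a few layers to RAM costs so much speed.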

How does quantization affect model selection?

Quantization reduces model size by representing weights with fewer bits. The trade-off is between quality retention, memory usage, and inference speed. This directly affects which models you can run on your hardware.

Quick quantization reference

| Quantization | Bits per Weight | Quality Retention | Size vs FP16 | When to Use |
|---|---|---|---|---|
| Q8_0 | 8 | ~99.5% | 50% | Maximum quality when VRAM allows |
| Q6_K | 6.6 | ~99% | 40% | Excellent quality, moderate savings |
| Q5_K_M | 5.5 | ~98% | 35% | Great balance — recommended if VRAM is not tight |
| Q4_K_M | 4.8 | ~97% | 30% | The community default. Best overall value. |
| Q4_K_S | 4.5 | ~96% | 28% | Slightly smaller than Q4_K_M with minimal quality loss |
| Q3_K_M | 3.9 | ~93% | 25% | Noticeable quality drop. Use when you need to squeeze a model in. |
| Q3_K_S | 3.5 | ~90% | 22% | Significant quality loss. Only for models where size matters most. |
| Q2_K | 2.7 | ~82% | 17% | Severe quality degradation. Last resort for running a model at all. |
| IQ2_XS | 2.3 | ~75% | 15% | Experimental. Significant artifacts. Not recommended for general use. |

The golden rule: Run the largest model that fits in your VRAM at Q4_K_M or better. A 13B model at Q5 almost always outperforms a 30B model at Q2, because aggressive quantization degrades the larger model more than the size advantage compensates for.
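The golden rule is easy to check against the table's own numbers. A sketch comparing a 13B model at Q5_K_M with a 30B model at Q2_K, using the approximate bits-per-weight and quality-retention figures from the table:

```python
def weights_gb(params_b, bits_per_weight):
    """Approximate weight file size in GB for a quantized model."""
    return params_b * bits_per_weight / 8

small = {"size_gb": weights_gb(13, 5.5), "quality_retention": 0.98}  # Q5_K_M
large = {"size_gb": weights_gb(30, 2.7), "quality_retention": 0.82}  # Q2_K

# Both land near a ~10 GB budget -- the 13B is even slightly smaller:
print(small["size_gb"] < large["size_gb"])  # True
# ...but the 13B keeps ~98% of its quality versus ~82% for the 30B at Q2,
# which is why the smaller model at a gentler quant usually wins.
print(small["quality_retention"] > large["quality_retention"])  # True
```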

For a detailed explanation of quantization formats and methods, see Understanding Quantization.

How do you evaluate a model before committing to it?

Downloading and testing models takes time, so it helps to evaluate candidates before downloading. Here is a practical evaluation process:

Step 1: Check benchmarks

Key benchmarks to look for in model cards and leaderboards:

| Benchmark | What It Measures | Good Score (7B) | Good Score (70B) |
|---|---|---|---|
| MMLU | General knowledge and reasoning | 60-70% | 78-85% |
| HumanEval | Python code generation | 40-55% (general), 70%+ (code model) | 70-85% |
| GSM8K | Grade-school math reasoning | 55-70% | 85-95% |
| MT-Bench | Multi-turn conversation quality | 7.0-8.0 | 8.5-9.5 |
| ARC-Challenge | Science reasoning | 55-65% | 70-80% |
| TruthfulQA | Factual accuracy | 45-55% | 55-70% |
| Winogrande | Common-sense reasoning | 75-82% | 83-88% |

Caution: Benchmarks are useful but imperfect. Models can be overfitted on benchmark datasets, and benchmark scores do not always correlate with real-world usefulness. Use benchmarks as a first filter, not the final word.

Step 2: Read community feedback

The most reliable evaluation comes from real users:

  • r/LocalLLaMA on Reddit: The most active community for local AI. Search for your target model to find user reviews, comparisons, and real-world usage reports.
  • Hugging Face model cards: Read the model card for training details, intended use cases, known limitations, and community discussions in the model’s “Community” tab.
  • TheBloke’s quantization notes: If using GGUF quantizations, TheBloke and other quantizers often include perplexity measurements and quality notes.

Step 3: Run a quick test

Download the Q4_K_M quantization of your candidate model and test it on prompts representative of your actual use case. Do not test with generic benchmarks — test with the specific tasks you need the model for:

  • If you need a coding assistant, give it real code problems from your codebase
  • If you need a chat assistant, have a multi-turn conversation on topics you care about
  • If you need a RAG model, test it with retrieved context and questions from your domain
  • If you need summarization, feed it real documents and evaluate the summaries

A 30-minute hands-on test is worth more than any benchmark score.

Step 4: Compare with a fallback

Always compare your candidate against a known baseline. Run the same prompts through Llama 3.2 8B (the de facto standard) and your candidate model. This gives you a concrete reference point for whether the new model is actually better for your specific needs.
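One way to keep this comparison honest is to run identical prompts through both models and review the outputs side by side. A minimal harness sketch: the lambda "generators" here are stand-ins for however you actually call your runtime (for example, the Ollama HTTP API), not a real client.

```python
def compare(prompts, generators):
    """Run each prompt through every model and collect outputs for review."""
    results = {}
    for prompt in prompts:
        results[prompt] = {name: gen(prompt) for name, gen in generators.items()}
    return results

# Stub generators so the sketch is self-contained; replace with real calls.
generators = {
    "baseline-llama-8b": lambda p: f"[baseline answer to: {p}]",
    "candidate-qwen-14b": lambda p: f"[candidate answer to: {p}]",
}

out = compare(["Summarize this contract clause."], generators)
for prompt, answers in out.items():
    for model, answer in answers.items():
        print(f"{model}: {answer}")
```

Keeping the prompts fixed and the review manual is deliberate: for a handful of representative tasks, your own judgment is a better scorer than any automated metric.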

What are the best models for specific scenarios?

Here are concrete recommendations for common local AI scenarios, updated for early 2026:

Scenario: Daily-driver AI assistant on a laptop (8-16 GB VRAM)

Primary model: Llama 3.2 8B (Q4_K_M) — 4.9 GB, fits easily in 8 GB VRAM
Why: Best all-around model at this size. Strong instruction following, good knowledge, reliable outputs. Massive community support means you can find help easily.
Alternative: Gemma 2 9B (Q4_K_M) — slightly higher quality on some benchmarks, slightly larger.

Scenario: Coding assistant integrated with IDE (8-16 GB VRAM)

Primary model: Qwen2.5-Coder 7B (Q5_K_M) — ~5.3 GB
Why: Purpose-built for code. Outperforms general-purpose models on HumanEval, MBPP, and real-world coding tasks. Supports 40+ programming languages.
Alternative: DeepSeek-Coder-V2 Lite 16B (Q3_K_M) — larger but stronger for complex code generation.
Tool: Connect via Continue.dev or Cody for IDE integration with Ollama backend.

Scenario: Private document Q&A with RAG (16-24 GB VRAM)

Language model: Llama 3.2 8B or Qwen2.5 14B for answer generation
Embedding model: nomic-embed-text (137M parameters, ~0.3 GB) or mxbai-embed-large (335M, ~0.7 GB)
Why: RAG quality depends more on retrieval quality than model size. Use the saved VRAM for longer context windows. Command-R 35B is ideal if you have 24 GB VRAM, as it produces citations natively.

Scenario: Maximum quality on a single RTX 4090 (24 GB VRAM)

Primary model: Qwen2.5 32B (Q4_K_M) — ~20 GB
Why: At 32B, Qwen2.5 is significantly more capable than any 7B-14B model. Fits in 24 GB with room for a reasonable context window. Strong across all task types.
Alternative: Mixtral 8x7B (Q4_K_M) — MoE architecture activates only 13B parameters per token, giving 47B-quality output at 13B speed. Uses ~26 GB for weights but only 13B worth of compute per token.

Scenario: Near-frontier quality on Apple Silicon (48-64 GB unified memory)

Primary model: Llama 3.2 70B (Q4_K_M) — ~44 GB
Why: 70B models represent the practical ceiling for local frontier quality. On a Mac with 48-64 GB unified memory, you get 15-25 tok/s with excellent output quality that rivals GPT-4 for most tasks.
Alternative: Qwen2.5 72B for stronger multilingual and coding capabilities.

Scenario: Offline research and analysis (any hardware)

Primary model: Whatever fits your hardware at Q4+
Supplement: An embedding model for semantic search over your research corpus
Key insight: For offline use, download multiple models in advance so you can switch based on task. Keep a small model (3B-7B) for quick queries and a larger model (14B-70B) for deep analysis.

How do mixture-of-experts (MoE) models change the equation?

MoE models like Mixtral and DeepSeek-V2 use a fundamentally different architecture that affects model selection:

In a traditional “dense” model, every parameter is used for every token. A 70B dense model uses all 70B parameters for each token it generates.

In an MoE model, the model has a large total parameter count but only activates a subset (the “experts”) for each token. Mixtral 8x7B has 47B total parameters but activates only 13B per token. DeepSeek-V2 has 236B total parameters but activates only 21B per token.

Implications for model selection

| Aspect | Dense Models | MoE Models |
|---|---|---|
| VRAM needed | Proportional to total parameters | Proportional to total parameters (all weights must be loaded) |
| Inference speed | Proportional to total parameters | Proportional to active parameters (faster than total size suggests) |
| Quality | Scales with total parameters | Can exceed dense models of similar active parameter count |
| Best for | Predictable resource usage | Getting larger-model quality at faster speeds |

Key insight: MoE models require VRAM for all their parameters but run at the speed of their active parameter count. Mixtral 8x7B needs the VRAM of a ~47B model but generates tokens at the speed of a ~13B model, with quality often exceeding dense 13B models. This makes MoE models excellent choices when you have sufficient VRAM but want faster inference.
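The arithmetic behind this insight, using the figures above (Q4_K_M at roughly 4.8 bits per weight; all values are approximations):

```python
def weights_gb(params_b, bits_per_weight):
    """Approximate weight memory in GB for a quantized model."""
    return params_b * bits_per_weight / 8

mixtral = {"total_b": 47, "active_b": 13}    # Mixtral 8x7B (MoE)
dense_13b = {"total_b": 13, "active_b": 13}  # a dense 13B model

# VRAM tracks TOTAL parameters -- all experts must be loaded:
print(round(weights_gb(mixtral["total_b"], 4.8), 1))    # 28.2
print(round(weights_gb(dense_13b["total_b"], 4.8), 1))  # 7.8
# ...but per-token compute tracks ACTIVE parameters, 13B for both,
# so Mixtral generates at roughly dense-13B speed with higher quality.
print(mixtral["active_b"] == dense_13b["active_b"])  # True
```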

What about fine-tuned and community models?

The Hugging Face ecosystem contains thousands of fine-tuned variants of base models. These fine-tunes adjust the model’s behavior for specific tasks or styles:

Types of fine-tunes

  • Instruction-tuned: The base model trained to follow instructions (e.g., Llama 3.2-Instruct). Always use instruction-tuned variants for chat and task completion.
  • Domain-specific: Models fine-tuned on medical, legal, financial, or scientific data. Useful when you need specialized knowledge.
  • Code-tuned: Models fine-tuned on code datasets (e.g., CodeLlama, Qwen2.5-Coder). Significantly better at programming tasks.
  • Chat/roleplay-tuned: Models optimized for engaging, character-consistent conversation. Popular in the creative community.
  • Merged models: Models created by combining weights from multiple fine-tunes. Can blend strengths of different models.

How to evaluate fine-tunes

  1. Check the base model: Fine-tunes inherit the base model’s fundamental capabilities. A fine-tune of Llama 3.2 8B cannot exceed the knowledge and reasoning capacity of the 8B architecture.
  2. Read the training details: Good model cards explain what data was used for fine-tuning, how many epochs were trained, and what the intended use case is.
  3. Check download counts and likes: On Hugging Face, popular models with many downloads and positive community feedback are generally safer choices.
  4. Test on your specific task: The only reliable way to know if a fine-tune is better for your use case is to test it.

Recommendation: Start with the official instruction-tuned variant of a major model family (e.g., meta-llama/Llama-3.2-8B-Instruct). Only explore community fine-tunes after you have established a baseline and identified specific areas where the official model falls short.

What is the decision framework summary?

Here is a step-by-step process for choosing your model:

  1. Identify your primary task — Chat, coding, reasoning, creative writing, RAG, or something else?
  2. Check your VRAM/memory budget — How much VRAM (GPU) or unified memory (Apple Silicon) do you have available?
  3. Pick a model family — Choose based on your task (Llama for general, Qwen for coding, DeepSeek for reasoning, etc.)
  4. Pick the largest size that fits — At Q4_K_M quantization, what is the largest parameter count that fits in your VRAM with 2-4 GB headroom for context?
  5. Check for specialized variants — Is there a code-tuned, math-tuned, or domain-specific variant?
  6. Download and test — Run your actual use-case prompts through the model for 30 minutes
  7. Compare against baseline — Test the same prompts against Llama 3.2 8B to see if your chosen model is actually better for your needs
  8. Adjust quantization — If the model fits but context is tight, try Q3_K_M. If you have VRAM to spare, try Q5_K_M or Q6_K for quality improvement.
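For illustration, steps 1-4 can be sketched as a small helper. The task-to-family mapping and size list below are simplifications of the tables above, not an exhaustive or authoritative picker, and not every family ships every size:

```python
# Hypothetical shortlist derived from the recommendation matrix above.
FAMILY_BY_TASK = {
    "general": "Llama 3.2",
    "coding": "Qwen2.5-Coder",
    "reasoning": "DeepSeek-R1 Distill",
    "rag": "Command-R",
}
SIZES_B = [3, 7, 14, 32, 70]  # common parameter counts, in billions

def suggest(task, vram_gb, bits_per_weight=4.8, headroom_gb=3.0):
    """Steps 1-4: pick a family for the task, then the largest size that
    fits in VRAM at Q4_K_M with headroom for context (rough estimates)."""
    family = FAMILY_BY_TASK.get(task, "Llama 3.2")
    fitting = [s for s in SIZES_B
               if s * bits_per_weight / 8 + headroom_gb <= vram_gb]
    size = max(fitting) if fitting else min(SIZES_B)
    return f"{family} ~{size}B at Q4_K_M"

print(suggest("coding", 8))    # Qwen2.5-Coder ~7B at Q4_K_M
print(suggest("general", 24))  # Llama 3.2 ~32B at Q4_K_M
```

Steps 5-8 (checking specialized variants, testing, comparing, and tuning quantization) remain manual: they depend on your actual prompts, not on arithmetic.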

This process takes about an hour and will save you days of running the wrong model. The local AI community is helpful — if you are uncertain, post your use case and hardware on r/LocalLLaMA and you will get informed recommendations within hours.


Once you have chosen a model, understanding quantization will help you pick the right quality/performance trade-off. Read Understanding LLM Quantization for a deep dive into GGUF, GPTQ, AWQ, and EXL2 formats.

Frequently Asked Questions

What is the best local LLM for general chat?

For general chat, Llama 3.2 8B offers the best balance of quality and hardware requirements, running on any GPU with 6+ GB VRAM. If you have 24+ GB VRAM, Llama 3.2 70B provides near-frontier conversational quality.

What is the best local model for coding?

Qwen2.5-Coder 7B and DeepSeek-Coder-V2 16B are the top choices. Qwen2.5-Coder matches or exceeds GPT-3.5 on coding benchmarks at just 7B parameters. For more VRAM, DeepSeek-Coder-V2 offers strong multi-language code generation.

Should I use a smaller model with better quantization or a larger model with aggressive quantization?

Generally, a smaller model at Q5-Q8 quantization outperforms a larger model squeezed into the same VRAM at Q2-Q3. For example, a 13B model at Q5 typically beats a 30B model at Q2 in terms of coherence and accuracy. Only drop below Q4 if the larger model excels specifically at your task.

How do I know if a model is good before downloading it?

Check the model's performance on benchmarks relevant to your use case (MMLU for knowledge, HumanEval for coding, GSM8K for math). Read community reviews on Reddit's r/LocalLLaMA and Hugging Face model cards. Start with well-known model families (Llama, Qwen, Mistral) rather than obscure fine-tunes.

What size model should I start with?

Start with a 7B-8B parameter model. These run on virtually any modern hardware, produce good-quality output for most tasks, and let you learn the ecosystem without hardware constraints. Upgrade to 13B-14B or larger once you have identified specific tasks where the smaller model falls short.