The 2026 Local AI Model Tier List: Every Model Ranked by Use Case

An opinionated S/A/B/C/D tier ranking of every major local AI model across six categories: chat, coding, reasoning, creative writing, vision, and embeddings. Updated quarterly.

Choosing a local AI model is overwhelming. There are hundreds of options on Hugging Face, each claiming to be the best at something. Community benchmarks contradict each other. Reddit recommendations depend on the commenter’s hardware. Model names have become absurdly long strings of version numbers, quantization suffixes, and community merge tags.

This tier list cuts through the noise. We ranked every major model available for local deployment across six use cases: general chat, coding, reasoning, creative writing, vision, and embeddings. The rankings are opinionated — this is our best judgment after extensive testing, not a mechanical benchmark score. When we disagree with the benchmarks, we say so and explain why.

How to read this list: S-tier means “best available for local deployment, use this.” A-tier means “excellent, close to S-tier.” B-tier means “good, fine for most users.” C-tier means “adequate, but better options exist.” D-tier means “skip this unless you have a specific reason.”

Hardware context: Rankings assume you are running the model at a reasonable quantization on consumer hardware. A 70B model crushed to an aggressive quantization just to squeeze into limited VRAM is going to perform like a worse model — use the recommended quantization or better.
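As a rough rule of thumb, a model's weight footprint is parameter count times bits per weight divided by eight, plus runtime overhead for the KV cache and buffers. A back-of-envelope sketch — the bits-per-weight figures and the 1.2 overhead factor are approximations, not measured constants:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache and runtime.

    Approximate effective bits per weight: ~4.5 for Q4_K_M, ~5.5 for Q5_K_M,
    ~8.5 for Q8_0 (actual GGUF file sizes vary per model).
    """
    weight_gb = params_billion * bits_per_weight / 8  # decimal gigabytes
    return round(weight_gb * overhead, 1)

# A 32B model at Q4_K_M: ~21.6 GB, fits a 24GB card with room for context.
print(estimate_vram_gb(32, 4.5))
```

Actual usage grows with context length, so treat the estimate as a floor, not a guarantee.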

Last updated: April 2026. We update this list quarterly.


General Chat

For everyday conversations, question answering, summarization, email drafting, brainstorming — the tasks most people use AI for most of the time.

S-Tier

Qwen 3 32B — The best all-around local chat model right now. Strong instruction following, good multilingual support, knowledgeable across domains, and fits comfortably on a 24GB GPU at Q5. If you can only run one model, make it this one.

Llama 4 Scout — Meta’s latest brings genuine improvements in conversational quality over Llama 3.1. The mixture-of-experts architecture means it runs faster than its total parameter count would suggest. Needs 24GB VRAM at Q4.

A-Tier

Qwen 3 14B — Nearly as capable as the 32B for most chat tasks, and runs on 12-16GB GPUs. The sweet spot for people with RTX 4070 or similar cards.

Mistral Large (latest) — Strong conversational ability with a distinctly “European” feel — slightly more formal and precise than Llama or Qwen. Excellent for professional communication.

Llama 3.1 70B — Still excellent, and widely available in every quantization format. The community ecosystem around it (fine-tunes, documentation, tooling) is unmatched. Needs 24GB+ VRAM at Q4.

B-Tier

Phi-4 14B — Impressive for its size. Handles most casual conversation well. Falls short on complex multi-turn discussions where larger models maintain coherence better.

Gemma 3 27B — Google’s entry is solid and well-tuned. Not quite at the level of Qwen 3 32B but a respectable alternative.

DeepSeek V3 — Very capable, particularly strong on factual questions. Slightly less natural in free-form conversation than the S-tier models.

C-Tier

Llama 3.2 3B — Functional for basic questions but noticeably less capable. Good for constrained hardware.

Phi-4 3.8B — Remarkably smart for a 3B-class model but hits limits quickly on anything complex.

D-Tier

Older Llama 2 variants — Superseded in every dimension. No reason to use these unless you have a specific fine-tune built on them.

Falcon, MPT, and other 2023-era base models — Historically important, now outclassed.


Coding

For writing, debugging, explaining, and refactoring code across multiple programming languages.

S-Tier

DeepSeek Coder V3 33B — The undisputed king of local coding models. Understands complex codebases, generates idiomatic code across languages, and handles multi-file reasoning better than any other open model. On a 24GB GPU at Q4/Q5, this is your local Copilot.

Qwen 3 Coder 32B — Very close to DeepSeek Coder V3. Slightly better at languages beyond Python and JavaScript (particularly Rust, Go, and C++). Some users prefer it — try both.

A-Tier

Codestral 22B — Mistral’s coding model is fast and accurate. Particularly strong at code completion (fill-in-the-middle) tasks. Slightly less capable than the S-tier at complex architecture discussions.

Llama 4 Scout (for coding) — Not a dedicated coding model, but its general capability makes it competitive. Good at explaining code and architectural reasoning, slightly behind specialists at raw code generation.

B-Tier

DeepSeek Coder V2 16B — The previous generation. Still very good, runs on smaller GPUs. Pick this if you have 12-16GB VRAM.

Phi-4 14B — Strong at Python and JavaScript, less reliable for other languages. Good enough for most coding assistance on smaller hardware.

Qwen 3 14B — Solid coding capability in a general-purpose model. Not as sharp as dedicated coding models but handles most tasks.

C-Tier

StarCoder2 15B — Was state-of-the-art in 2024. Now outclassed by the dedicated coding models from DeepSeek and Qwen but still functional.

CodeLlama 34B — Aging but usable. The fine-tuning ecosystem around CodeLlama is extensive, which keeps it relevant for specific use cases.

D-Tier

StarCoder 1, CodeLlama 7B, WizardCoder — Superseded. Use a modern model.


Reasoning

For complex multi-step reasoning, math, logic puzzles, analysis, planning, and tasks requiring sustained chain-of-thought.

S-Tier

DeepSeek R1 (distilled variants) — The reasoning model that shocked the industry. The distilled versions (14B, 32B) bring genuine chain-of-thought reasoning to local hardware. The 32B variant on a 24GB GPU is the closest thing to cloud-level reasoning you can run locally. Slow — but the reasoning quality justifies the wait.

Qwen 3 32B (with reasoning prompt) — Qwen 3’s reasoning capability is strong when prompted correctly. Use a chain-of-thought system prompt and it performs surprisingly close to dedicated reasoning models.
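If you run Qwen 3 through an OpenAI-compatible endpoint (as Ollama and llama.cpp's server expose), the reasoning prompt is just a system message. A minimal sketch — the model tag and the exact prompt wording are our own choices, not anything canonical:

```python
def build_reasoning_request(question: str, model: str = "qwen3:32b") -> dict:
    """Build a chat-completions payload that nudges the model into
    step-by-step reasoning before it commits to an answer."""
    system = (
        "Think through the problem step by step, checking each step, "
        "then give your final answer on its own line."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # low temperature keeps the chain-of-thought focused
    }

payload = build_reasoning_request("Is 2^31 - 1 prime? Explain.")
```

POST the payload to your server's chat-completions route and the model will emit its working before the answer.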

A-Tier

Llama 4 Scout — Good general reasoning. Not as methodical as DeepSeek R1 but faster and more versatile.

Phi-4 14B — Microsoft specifically optimized Phi-4 for reasoning. For its size, it punches well above its weight on math and logic tasks. The best reasoning model that runs on smaller GPUs.

B-Tier

Qwen 3 14B — Solid reasoning in a mid-size package. Handles most analytical tasks.

Mistral Large — Good at structured analysis and logical arguments. Less strong on mathematical reasoning.

C-Tier

Gemma 3 27B — Adequate reasoning but inconsistent on harder problems.

Llama 3.1 8B — Basic reasoning works. Multi-step problems are unreliable.

D-Tier

Most sub-7B models struggle with complex reasoning. If reasoning is your primary use case, do not go below 14B parameters.


Creative Writing

For fiction, poetry, worldbuilding, dialogue, narrative prose, and creative brainstorming. This category values voice, style, and creative risk over factual accuracy.

S-Tier

Midnight Miqu 103B (community merge) — The creative writing community’s darling. Produces genuinely distinctive, literary prose. The problem: it requires massive hardware (dual 24GB GPUs at Q3). If you can run it, nothing else comes close for fiction.

Qwen 3 32B (creative fine-tunes) — Several community fine-tunes of Qwen 3 32B optimized for creative writing are excellent. Look for models with “creative,” “story,” or “writing” in the name on Hugging Face.

A-Tier

Llama 4 Scout (uncensored fine-tunes) — With refusal training removed, Llama 4 Scout produces strong creative prose with good narrative structure. Widely available in creative fine-tune variants.

Mistral Large (creative fine-tunes) — Mistral’s base has a literary quality that translates well to creative work. Fine-tunes optimized for fiction are particularly good at dialogue.

Nous Hermes variants — The Nous Research fine-tunes consistently rank highly in the creative writing community. Strong narrative voice and willingness to engage with complex themes.

B-Tier

Qwen 3 14B — Good for brainstorming and outlining. Prose quality is solid but lacks the distinctive voice of larger models.

Llama 3.1 70B (creative fine-tunes) — Excellent quality but requires significant hardware. If you have dual GPUs, worth considering.

C-Tier

Phi-4 14B — Can produce competent prose but tends toward a generic, blog-post style. Lacks literary flair.

Gemma 3 27B — Functional for creative tasks but overly cautious. Tends to self-censor even in scenarios that do not warrant it.

D-Tier

Base models without creative fine-tuning — Most instruction-tuned models default to a helpful-assistant voice that is death for fiction. Always use a creative fine-tune for writing work.


Vision (Image Understanding)

For analyzing images, describing visual content, reading text from images, answering questions about photos and diagrams.

S-Tier

Llama 4 Scout (multimodal) — Llama 4’s native multimodal capability is a step change over previous open vision-language models. Handles complex image understanding, reads charts and diagrams, and provides detailed descriptions. The best open vision model for local deployment.

A-Tier

Qwen 3 VL 32B — Alibaba’s vision-language model is strong. Particularly good at text recognition (OCR) within images and understanding Asian-language content in images.

InternVL 2.5 — A strong contender from Shanghai AI Lab. Excellent at detailed image description and visual question answering.

B-Tier

LLaVA-NeXT 34B — The LLaVA lineage remains relevant. Good all-around vision understanding. Requires a 24GB GPU.

Phi-4 Vision — Microsoft’s small vision model is surprisingly capable for its size. Runs on modest hardware.

C-Tier

LLaVA 1.6 13B — Older but still functional. If you have limited VRAM, it is the minimum viable vision model.

MiniCPM-V — Designed for efficiency. Works on smaller GPUs but sacrifices quality.

D-Tier

LLaVA 1.5 and older — Superseded. Use a current model.


Embeddings

For RAG pipelines, semantic search, document similarity, and clustering. Embedding models convert text into vectors for retrieval.
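Whatever embedding model you pick, the retrieval step itself is simple: embed the query, then rank stored document vectors by cosine similarity. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 3) -> list:
    """Return the ids of the k documents most similar to the query vector."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy vectors; a real pipeline would store 768-d vectors from a model
# like Nomic Embed Text v2 and embed the query with the same model.
docs = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # → ['a', 'b']
```

The one rule that matters: query and documents must be embedded with the same model, which is why switching embedding models means re-indexing everything.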

S-Tier

Nomic Embed Text v2 — The current standard for local embeddings. Good quality, fast inference, 768 dimensions. Works natively with Ollama. Hard to go wrong with this choice.

Snowflake Arctic Embed L v2 — Slightly higher quality than Nomic on retrieval benchmarks. Larger model, slower inference. Choose this when retrieval quality is the top priority.

A-Tier

mxbai-embed-large — Mixed Bread AI’s embedding model is excellent. 1024 dimensions provide slightly better quality at the cost of larger vector storage.

BGE-Large v1.5 — BAAI’s embedding model remains competitive. Strong multilingual support.

B-Tier

Nomic Embed Text v1 — Still good, just slightly behind v2 on benchmarks. If you have an existing index built with v1, there is no urgent need to re-embed.

all-MiniLM-L6-v2 — The classic sentence-transformers model. Small, fast, and adequate. Good for prototyping and resource-constrained environments.

C-Tier

E5 variants — Functional but outperformed by the models above on most retrieval benchmarks.

D-Tier

Word2Vec, GloVe, and other non-transformer embeddings — Fundamentally inferior for semantic search. Use a transformer-based model.


Quick Reference: What to Run on Your Hardware

If you just want a recommendation based on what you have:

6-8 GB VRAM (RTX 4060, RTX 3060 8GB)

  • Chat: Phi-4 3.8B Q5_K_M or Qwen 3 7B Q4_K_M
  • Code: Qwen 3 7B Q4_K_M
  • Creative: Llama 3.1 8B creative fine-tune Q4_K_M

12 GB VRAM (RTX 4070, RTX 3060 12GB)

  • Chat: Qwen 3 14B Q5_K_M
  • Code: DeepSeek Coder V2 16B Q4_K_M
  • Reasoning: Phi-4 14B Q5_K_M
  • Creative: Qwen 3 14B creative fine-tune Q5_K_M

16 GB VRAM (RTX 4080, RX 7900 XT)

  • Chat: Qwen 3 14B Q8_0 or Qwen 3 32B Q3_K_M
  • Code: Codestral 22B Q5_K_M
  • Reasoning: DeepSeek R1 14B Q6_K

24 GB VRAM (RTX 3090, RTX 4090)

  • Chat: Qwen 3 32B Q5_K_M
  • Code: DeepSeek Coder V3 33B Q5_K_M
  • Reasoning: DeepSeek R1 32B Q5_K_M
  • Creative: Qwen 3 32B creative fine-tune Q5_K_M
  • Vision: Llama 4 Scout Q4_K_M

48+ GB VRAM (Dual GPU, M4 Ultra)

  • Chat: Llama 4 Scout Q6_K or Llama 3.1 70B Q5_K_M
  • Code: DeepSeek Coder V3 33B Q8_0
  • Creative: Midnight Miqu 103B Q3_K_M
  • Reasoning: DeepSeek R1 70B Q4_K_M
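The brackets above collapse into a simple lookup. A sketch that encodes the chat and code picks by VRAM floor — the names come straight from the table; the function is just illustrative glue:

```python
# (vram_floor_gb, chat pick, code pick) per bracket, highest first.
BRACKETS = [
    (48, "Llama 4 Scout Q6_K",  "DeepSeek Coder V3 33B Q8_0"),
    (24, "Qwen 3 32B Q5_K_M",   "DeepSeek Coder V3 33B Q5_K_M"),
    (16, "Qwen 3 14B Q8_0",     "Codestral 22B Q5_K_M"),
    (12, "Qwen 3 14B Q5_K_M",   "DeepSeek Coder V2 16B Q4_K_M"),
    (6,  "Phi-4 3.8B Q5_K_M",   "Qwen 3 7B Q4_K_M"),
]

def recommend(vram_gb: int, use_case: str) -> str:
    """Return the table's chat or code pick for a given VRAM budget."""
    col = {"chat": 0, "code": 1}[use_case]
    for floor, *picks in BRACKETS:
        if vram_gb >= floor:
            return picks[col]
    raise ValueError("below 6 GB VRAM there is no recommended local model here")
```

For example, `recommend(24, "code")` returns the DeepSeek Coder V3 33B pick, while an 8GB card falls through to the smallest bracket.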

How This List Was Made

We tested each model on a standardized set of tasks within each category:

  • Chat: 50 diverse conversational prompts ranging from simple factual questions to complex multi-turn discussions
  • Coding: 40 programming tasks across Python, JavaScript, Rust, Go, and SQL, including code generation, debugging, and refactoring
  • Reasoning: 30 problems from math competition archives, logic puzzles, and multi-step analysis tasks
  • Creative: 20 fiction writing prompts assessed by three human evaluators for prose quality, voice, and narrative coherence
  • Vision: 30 image understanding tasks including scene description, OCR, chart reading, and visual reasoning
  • Embeddings: MTEB benchmark subset plus our own retrieval evaluation on 5 domain-specific document sets

Rankings reflect a weighted combination of benchmark scores and subjective quality assessment. We explicitly favor real-world usability over benchmark optimization — a model that scores slightly lower on benchmarks but produces more useful outputs in practice ranks higher.
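The weighting itself is a single line of arithmetic. A sketch with made-up numbers — the 60/40 split toward subjective quality is purely illustrative; we do not publish our exact weights:

```python
def tier_score(benchmark: float, subjective: float,
               w_subjective: float = 0.6) -> float:
    """Blend a 0-100 benchmark score with a 0-100 subjective rating,
    deliberately weighting real-world usability over benchmark wins."""
    return round((1 - w_subjective) * benchmark + w_subjective * subjective, 1)

# A model that benchmarks lower but feels better in practice can outrank
# a benchmark-optimized one.
print(tier_score(benchmark=82, subjective=90))  # → 86.8
print(tier_score(benchmark=88, subjective=78))  # → 82.0
```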

Models We Are Watching

These models were not available for testing at publication time but look promising:

  • Llama 4 Maverick — Meta’s 128-expert MoE model. Potentially S-tier across multiple categories, but hardware requirements are unclear.
  • Qwen 3 72B — If it follows the trajectory of the 32B, it could be exceptional.
  • Gemma 3 next generation — Google has been steadily improving Gemma. The next version could shake up the B-tier.

Disagree? Good.

This list is opinionated. We have strong preferences and we state them. If you disagree — if you think DeepSeek Coder V3 is overrated, or Phi-4 deserves S-tier, or we are sleeping on some community merge that changed your life — tell us. Join the community and make your case. The next quarterly update will incorporate community feedback.

The best model is the one that works for your use case on your hardware. Use this list as a starting point, then experiment.

Next update: July 2026. Subscribe to our RSS feed to get notified.