Quantization is the process of reducing the numerical precision of a model’s weights — from 16-bit floating point down to 8-bit, 4-bit, or even 2-bit integers — to shrink the model’s memory footprint and increase inference speed, with a controlled trade-off in output quality. A 7B parameter model in full precision (FP16) requires approximately 14 GB of memory; the same model quantized to 4-bit (Q4_K_M) needs only about 4.4 GB — a 70% reduction — while retaining roughly 97% of the original quality. Quantization is what makes running large language models on consumer hardware practical.
Without quantization, local AI would be impractical for most users. A 70B model at full precision requires 140 GB of memory — far beyond any consumer GPU. At 4-bit quantization, that same model fits in approximately 44 GB, making it runnable on a Mac with 48 GB unified memory or a pair of 24 GB GPUs. Quantization is not a compromise; it is an enabling technology that unlocks the entire local AI ecosystem.
This guide explains how quantization works, compares every major quantization format, provides detailed quality and performance benchmarks, and gives you a decision framework for choosing the right quantization for your hardware and use case.
What is quantization and why does it matter?
How model weights are stored
A large language model is fundamentally a massive collection of numerical parameters called “weights.” These weights are the learned values that encode the model’s knowledge and capabilities. During inference, these weights are multiplied with input data through matrix multiplication operations to produce outputs.
In their original training format, weights are stored as FP16 (16-bit floating point) or BF16 (Brain Float 16) numbers. Each weight occupies 2 bytes of memory. The math is straightforward:
- 7B parameters × 2 bytes = 14 GB
- 13B parameters × 2 bytes = 26 GB
- 70B parameters × 2 bytes = 140 GB
- 405B parameters × 2 bytes = 810 GB
These sizes exceed the VRAM of consumer GPUs (typically 8-24 GB), making full-precision models impractical for local deployment beyond the smallest models.
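The memory math generalizes to any bit-width. A minimal sketch (the function name is mine, GB here means 10^9 bytes, and real usage adds KV-cache and activation overhead on top of the weights):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes) at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(7, 16))    # FP16 7B  -> 14.0
print(model_memory_gb(70, 16))   # FP16 70B -> 140.0
print(model_memory_gb(70, 4.8))  # ~4.8 effective bits/weight, 70B -> ~42
```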
How quantization reduces size
Quantization replaces high-precision floating-point weights with lower-precision integers. Instead of storing each weight as a 16-bit float, you store it as an 8-bit, 4-bit, or even 2-bit integer, along with scaling factors that allow approximate reconstruction of the original values.
The key insight is that neural network weights are redundant. Not every weight needs full 16-bit precision to contribute meaningfully to the model’s output. Many weights cluster around similar values and can be represented with fewer bits without significantly affecting the model’s behavior.
Think of it like image compression: a JPEG photo is much smaller than the raw image file, and for most purposes, the visual quality is indistinguishable. Similarly, a quantized model is much smaller than the full-precision model, and for most tasks, the output quality is nearly identical.
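The core mechanism fits in a few lines of NumPy. This is a minimal symmetric-quantization sketch (function names are mine, not any production format): store low-precision integer codes plus one scale, and reconstruct approximate floats on the way back.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric 8-bit quantization: int8 codes plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights cluster near zero
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes)  # 4096 bytes, vs 8192 for FP16: half the memory
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # error is at most half a step
```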
The quantization trade-off
Quantization involves a three-way trade-off:
- Size reduction: Fewer bits per weight means a smaller model file and less memory needed.
- Speed improvement: Smaller models load faster and can be read from memory faster, increasing tokens per second.
- Quality loss: Lower precision means less accurate weight representations, which can degrade output quality.
The art of quantization is finding the sweet spot where size and speed improvements are maximized while quality loss remains imperceptible for your use case.
What are the different bit-widths used in quantization?
Full precision: FP16 and BF16
- FP16 (Float16): 16-bit floating point. The standard training format. 2 bytes per weight.
- BF16 (BFloat16): Google’s 16-bit format with a wider exponent range than FP16. Used in training; functionally equivalent to FP16 for inference quality.
- FP32 (Float32): 32-bit floating point. 4 bytes per weight. Rarely used for LLM inference because it doubles memory with negligible quality improvement over FP16.
These are “full precision” — the baseline against which all quantized formats are compared.
INT8 (8-bit integer)
8-bit quantization stores each weight as an integer spanning 256 values (for example, -128 to 127 in the symmetric case), with scaling factors to map the integer range back to approximate floating-point values. This halves memory usage compared to FP16 with minimal quality loss (typically less than 0.5% on benchmarks).
INT8 is widely supported and is considered “nearly lossless” quantization. It is a safe choice when you have enough memory to afford 8-bit but not full precision.
INT4 (4-bit integer)
4-bit quantization stores each weight as an integer from 0 to 15. This reduces memory by 75% compared to FP16. Quality loss becomes measurable but remains small with good quantization techniques (typically 1-3% on benchmarks).
4-bit is the most popular quantization level for local AI because it offers the best balance of size, speed, and quality. The majority of models downloaded through Ollama and LM Studio use 4-bit quantization.
INT3 and INT2 (3-bit and 2-bit)
3-bit and 2-bit quantization pushes the limits of compression. Quality degradation becomes significant — noticeable in longer generations, complex reasoning, and factual accuracy. These levels are useful when you need to fit a much larger model into limited memory and are willing to accept quality trade-offs.
Mixed precision
Modern quantization methods do not apply a uniform bit-width across all weights. Instead, they use mixed precision — assigning more bits to important layers (like attention layers) and fewer bits to less important layers (like feedforward layers). This yields better quality than uniform quantization at the same average bit rate.
GGUF’s “K-quant” system and EXL2’s variable bitrate both use this approach.
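A simplified sketch of the per-block scaling these schemes build on (the layout and names here are illustrative, not the real GGUF encoding): each small block of weights gets its own scale, which adapts to local weight magnitudes, and the per-block scales are where the fractional bits-per-weight figures come from.

```python
import numpy as np

def quantize_blocks(w: np.ndarray, bits: int = 4, block: int = 32):
    """Symmetric group-wise quantization: one scale per `block` weights."""
    levels = 2 ** (bits - 1) - 1                  # 7 for 4-bit symmetric
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / levels
    scales[scales == 0] = 1.0                     # guard all-zero blocks
    q = np.round(blocks / scales)
    return q, scales

rng = np.random.default_rng(1)
w = rng.normal(size=2048).astype(np.float32)
q, scales = quantize_blocks(w)
w_hat = (q * scales).ravel()

# Storage cost: 4 bits per weight + one 16-bit scale per 32 weights = 4.5 bpw,
# which is where figures like Q4_K_S's ~4.5 bits/weight come from.
print(4 + 16 / 32)                                # 4.5
print(np.abs(w - w_hat).max() <= scales.max() / 2 + 1e-6)
```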
How does GGUF quantization work?
GGUF (GPT-Generated Unified Format) is the quantization format created by the llama.cpp project and is the most widely used format for local AI. It is supported by Ollama, LM Studio, GPT4All, llama.cpp, and many other tools.
GGUF advantages
- Flexible deployment: Runs on CPU, NVIDIA GPU (CUDA), AMD GPU (ROCm), Apple GPU (Metal), and Vulkan
- Partial GPU offloading: Load some layers on GPU and the rest on CPU — no need for the entire model to fit in VRAM
- Wide quantization range: Available from Q2_K (2.7 bits) to Q8_0 (8 bits)
- Self-contained: A single GGUF file contains everything needed — model weights, tokenizer, architecture config
- Universal support: The de facto standard for local AI tools
GGUF quantization levels explained
GGUF offers many quantization levels based on the k-quant system. K-quant names follow the pattern Q{bits}_K_{tier}, where bits is the approximate bit-width and the tier (S=small, M=medium, L=large) indicates the quality/size variant within that bit-width. Legacy levels like Q4_0 and Q5_0 predate this scheme and use uniform quantization.
| Level | Bits/Weight | Size (7B model) | Quality | Speed | Description |
|---|---|---|---|---|---|
| Q2_K | 2.7 | 3.0 GB | Poor | Fastest | Severe quality loss. Only for fitting models that otherwise do not fit. |
| Q3_K_S | 3.5 | 3.4 GB | Low | Very fast | Significant quality loss. Noticeable on reasoning and coherence. |
| Q3_K_M | 3.9 | 3.6 GB | Below average | Very fast | Noticeable quality drop vs Q4. Acceptable for simple tasks. |
| Q3_K_L | 4.1 | 3.9 GB | Fair | Fast | Better than Q3_K_M. Reasonable for chat when VRAM is tight. |
| Q4_0 | 4.0 | 4.0 GB | Fair | Fast | Legacy uniform 4-bit. Superseded by Q4_K variants. |
| Q4_K_S | 4.5 | 4.2 GB | Good | Fast | Good quality. Slightly worse than Q4_K_M. |
| Q4_K_M | 4.8 | 4.4 GB | Very good | Fast | The community standard. Best balance of quality, size, and speed. |
| Q5_0 | 5.0 | 4.7 GB | Good | Moderate | Legacy uniform 5-bit. Superseded by Q5_K variants. |
| Q5_K_S | 5.3 | 5.0 GB | Very good | Moderate | Good upgrade from Q4_K_M if you have the VRAM. |
| Q5_K_M | 5.5 | 5.3 GB | Excellent | Moderate | Near-full-precision quality. Recommended when VRAM allows. |
| Q6_K | 6.6 | 5.9 GB | Excellent | Moderate | Barely distinguishable from FP16 on most tasks. |
| Q8_0 | 8.0 | 7.7 GB | Near-perfect | Slower | Essentially lossless. Use when you want maximum quality. |
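The naming scheme can be parsed mechanically. A hypothetical helper, just to make the convention concrete:

```python
def parse_gguf_level(name: str) -> dict:
    """Split a GGUF level name like 'Q4_K_M' into its parts (sketch)."""
    parts = name.split("_")
    tiers = {"S": "small", "M": "medium", "L": "large"}
    return {
        "bits": int(parts[0][1:]),                     # nominal bit-width
        "kquant": len(parts) > 1 and parts[1] == "K",  # k-quant vs legacy
        "tier": tiers.get(parts[2]) if len(parts) > 2 else None,
    }

print(parse_gguf_level("Q4_K_M"))  # {'bits': 4, 'kquant': True, 'tier': 'medium'}
print(parse_gguf_level("Q8_0"))    # {'bits': 8, 'kquant': False, 'tier': None}
```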
GGUF imatrix quantization (IQ series)
The “IQ” variants use importance matrix data to further optimize quantization. They measure which weights are most important using calibration data and assign precision accordingly:
| Level | Bits/Weight | Size (7B model) | Quality vs Standard | Notes |
|---|---|---|---|---|
| IQ1_S | 1.6 | 1.8 GB | Very poor | Experimental. Extreme compression. |
| IQ2_XS | 2.3 | 2.5 GB | Poor | Better than Q2_K at similar sizes. |
| IQ2_S | 2.5 | 2.7 GB | Below average | Usable for simple tasks. |
| IQ3_XS | 3.3 | 3.3 GB | Fair | Better than Q3_K_S at similar sizes. |
| IQ3_M | 3.6 | 3.5 GB | Fair-good | Competitive with Q3_K_M. |
| IQ4_XS | 4.3 | 4.1 GB | Good | Slightly smaller than Q4_K_S with similar quality. |
IQ quantizations are newer and can offer better quality at very low bit rates (2-3 bits) compared to standard K-quants. They require an importance matrix generated from calibration data, which quantization providers like TheBloke and bartowski typically include.
How does GPTQ quantization work?
GPTQ (from the paper “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”) is a GPU-optimized quantization method that uses calibration data to minimize the quantization error layer by layer.
How GPTQ differs from GGUF
GPTQ takes a different approach to quantization. Instead of simple rounding (as in basic GGUF), GPTQ:
- Processes the model one layer at a time
- For each layer, uses a calibration dataset to measure the impact of quantizing each weight
- Applies a second-order correction (based on the Hessian matrix) to adjust remaining weights to compensate for quantization error
- This “error compensation” produces better quality at the same bit rate compared to naive quantization
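The error-compensation idea can be illustrated with a deliberately simplified one-dimensional toy. Real GPTQ uses second-order (Hessian) statistics from calibration data to decide how to redistribute error across remaining weights; in this sketch, each weight’s rounding error is simply carried into the next weight before it is rounded, which already keeps accumulated error bounded instead of letting it random-walk:

```python
import numpy as np

def quantize_with_error_feedback(w: np.ndarray, scale: float) -> np.ndarray:
    """Toy stand-in for GPTQ-style compensation: fold each weight's
    rounding error into the next weight before rounding it."""
    q = np.empty_like(w)
    carry = 0.0
    for i, wi in enumerate(w):
        target = wi + carry
        q[i] = np.round(target / scale) * scale
        carry = target - q[i]
    return q

rng = np.random.default_rng(2)
w = rng.normal(0.0, 1.0, size=256)
scale = 0.5
naive = np.round(w / scale) * scale
comp = quantize_with_error_feedback(w, scale)

# Running-sum error (a crude proxy for correlated downstream effects)
# stays bounded by half a quantization step with feedback:
print(np.abs(np.cumsum(w - comp)).max() <= scale / 2 + 1e-9)   # True
print(np.abs(np.cumsum(w - naive)).max())                      # much larger
```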
GPTQ specifications
| Aspect | Details |
|---|---|
| Typical bit-widths | 4-bit (most common), 8-bit, 3-bit |
| Group size | 128 (standard), 64, 32 — smaller groups = better quality but larger file |
| Format | SafeTensors files with quantization config |
| GPU requirement | Requires NVIDIA GPU with CUDA. Does not support CPU inference. |
| Supported engines | ExLlamaV2, vLLM, AutoGPTQ, TGI |
| Offloading | Limited. Best when entire model fits in VRAM. |
When to use GPTQ
- You have a modern NVIDIA GPU with enough VRAM to hold the entire model
- You want fast GPU inference and do not need CPU fallback
- You are using ExLlamaV2 or vLLM as your inference engine
- You want good quality at 4-bit with error compensation
GPTQ quality comparison
At 4-bit precision, GPTQ typically offers slightly better quality than basic GGUF Q4_0 due to its error compensation technique. However, GGUF’s newer K-quant methods (Q4_K_M, Q4_K_S) have largely closed this gap. The main advantage of GPTQ is speed on NVIDIA GPUs with engines optimized for it.
How does AWQ quantization work?
AWQ (Activation-Aware Weight Quantization) is a more recent quantization method that observes which weights have the largest impact on model activations and preserves those weights at higher precision.
AWQ’s approach
AWQ’s key insight is that not all weights are equally important. A small fraction of weights (roughly 1%) has a disproportionate impact on model outputs. AWQ:
- Analyzes the model using calibration data to identify “salient” weights — those that produce the largest activations
- Applies a per-channel scaling transformation to protect salient weights from quantization error
- Quantizes all weights to 4-bit, but the scaling ensures that important weights lose less precision
This activation-aware approach yields better quality than naive quantization at the same bit rate, particularly for smaller models where every bit of precision matters more.
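The per-channel scaling trick is easy to demonstrate: scaling a weight column up and the matching activation down leaves the layer’s output mathematically unchanged, but the scaled-up (salient) weights lose less relative precision when quantized. A sketch, where the names and the fixed `alpha` are illustrative (real AWQ searches for the scales per layer using calibration data):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=(8, 128))                  # weights: (out_features, in_features)
x = rng.normal(size=128)                       # one activation vector
act_mag = np.abs(rng.normal(size=128)) + 0.1   # per-input-channel activation stats

alpha = 0.5                 # balance factor; real AWQ searches this per layer
s = act_mag ** alpha        # scale salient (high-activation) channels up
w_scaled = w * s            # fold s into the weight columns (these get quantized)
x_scaled = x / s            # fold 1/s into the activations

# Mathematically a no-op for the layer output...
print(np.allclose(w_scaled @ x_scaled, w @ x))  # True
# ...but after 4-bit quantization, the scaled-up columns keep more precision.
```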
AWQ specifications
| Aspect | Details |
|---|---|
| Typical bit-widths | 4-bit (standard) |
| Group size | 128 (standard) |
| Format | SafeTensors files with AWQ config |
| GPU requirement | Requires modern NVIDIA GPU. Optimized for Ampere+ (RTX 30xx, 40xx). |
| Supported engines | vLLM, TGI, AutoAWQ |
| Key advantage | Faster inference than GPTQ on modern GPUs due to optimized kernels |
When to use AWQ
- You are deploying with vLLM or TGI for production serving
- You have a modern NVIDIA GPU (Ampere or newer — RTX 3000/4000 series)
- You want the fastest 4-bit GPU inference with good quality
- You are serving multiple concurrent users and throughput matters
AWQ vs GPTQ
| Dimension | AWQ | GPTQ |
|---|---|---|
| Quality at 4-bit | Slightly better on average | Good, with error compensation |
| Inference speed | Faster on Ampere+ GPUs | Good, but slightly slower than AWQ on modern hardware |
| GPU support | Ampere+ recommended | Broader GPU support (works on older NVIDIA GPUs) |
| Engine support | vLLM, TGI, AutoAWQ | ExLlamaV2, vLLM, AutoGPTQ, TGI |
| Quantization speed | Faster to quantize | Slower to quantize (Hessian computation) |
| Community adoption | Growing; increasingly the default for vLLM | Established; large library of pre-quantized models |
For most users deploying with vLLM on modern NVIDIA GPUs, AWQ is the recommended choice. For users on older GPUs or using ExLlamaV2, GPTQ is the better option.
How does EXL2 quantization work?
EXL2 is the quantization format used by the ExLlamaV2 inference engine. It offers a unique capability: variable bitrate across the model.
EXL2’s variable bitrate
Unlike GGUF (where you choose a single quantization level like Q4_K_M for the entire model) or GPTQ/AWQ (which use a fixed 4-bit or 8-bit for all weights), EXL2 allows you to specify a target average bitrate (e.g., 4.0 bits per weight) and then distributes bits non-uniformly across layers.
For example, at a 4.0 bpw (bits per weight) target, EXL2 might:
- Assign 6-8 bits to the most important attention layers
- Assign 3-4 bits to less important feedforward layers
- Average out to exactly 4.0 bits per weight overall
This variable allocation produces better quality than uniform quantization at the same average bit rate.
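The allocation can be sketched as a toy greedy budget problem. This helper is hypothetical; ExLlamaV2’s real allocator measures per-layer quantization error against calibration data rather than taking importance scores as input:

```python
def allocate_bits(importance, target_avg, choices=(3, 4, 5, 6, 8)):
    """Give high-importance layers more bits while keeping the average
    bits-per-weight at or below target_avg. Deliberately simplified."""
    n = len(importance)
    bits = [min(choices)] * n                    # everyone starts at the floor
    budget = target_avg * n - sum(bits)          # spare bits to hand out
    for i in sorted(range(n), key=lambda j: -importance[j]):
        for b in sorted(choices, reverse=True):  # try the biggest upgrade first
            if b > bits[i] and b - bits[i] <= budget:
                budget -= b - bits[i]
                bits[i] = b
                break
    return bits

layer_importance = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
bits = allocate_bits(layer_importance, target_avg=4.0)
print(bits, sum(bits) / len(bits))   # [8, 6, 3, 3, 3, 3, 3, 3] 4.0
```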
EXL2 specifications
| Aspect | Details |
|---|---|
| Bit-widths | Variable. Common targets: 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.0 bpw |
| Format | SafeTensors with EXL2 quantization tables |
| GPU requirement | NVIDIA GPU with CUDA (compute capability 6.0+, i.e., Pascal or newer). GPU-only — no CPU inference. |
| Supported engines | ExLlamaV2 only |
| Key advantage | Fastest token generation speeds; variable bitrate for optimal quality |
| Key limitation | ExLlamaV2-only; no CPU support; smaller community |
When to use EXL2
- You are using ExLlamaV2 as your inference engine
- You want the absolute fastest token generation on NVIDIA GPUs
- You want fine-grained control over the quality/size trade-off (e.g., 3.5 bpw instead of being locked to Q3 or Q4)
- You are an advanced user comfortable with a smaller ecosystem
EXL2 performance
ExLlamaV2 with EXL2 quantization achieves some of the fastest token generation speeds available:
| Model | Quantization | GPU | Tokens/Second |
|---|---|---|---|
| Llama 3.1 8B | 4.0 bpw EXL2 | RTX 4090 | 110-140 |
| Llama 3.1 8B | 4.0 bpw EXL2 | RTX 3090 | 80-100 |
| Llama 3.1 70B | 4.0 bpw EXL2 | 2× RTX 4090 | 18-25 |
| Mixtral 8x7B | 4.5 bpw EXL2 | RTX 4090 | 45-60 |
These speeds are typically 10-30% faster than GGUF on the same hardware for pure GPU inference, making EXL2 the performance king for NVIDIA GPU users who want maximum speed.
How do all the quantization formats compare?
Here is a comprehensive comparison of all major quantization formats:
| Format | Bits | CPU Support | GPU Support | Partial Offload | Quality (4-bit) | Speed (4-bit, GPU) | Primary Engine | Ecosystem Size |
|---|---|---|---|---|---|---|---|---|
| GGUF | 2-8 | Yes | CUDA, Metal, ROCm, Vulkan | Yes | Very good (K-quants) | Good | llama.cpp, Ollama | Largest |
| GPTQ | 3-8 | No | CUDA | Limited | Good | Good | ExLlamaV2, vLLM | Large |
| AWQ | 4 | No | CUDA (Ampere+) | No | Very good | Very good | vLLM, TGI | Growing |
| EXL2 | 2-8 (variable) | No | CUDA | No | Excellent (variable bitrate) | Best | ExLlamaV2 | Smaller |
| ONNX | 4-16 | Yes | Various | Depends | Good | Varies | ONNX Runtime | Moderate |
| BitsAndBytes (NF4) | 4 | Yes (slow) | CUDA | N/A | Good | Moderate | Transformers (HF) | Large |
Decision tree for choosing a format
```
Start here: What is your inference engine?
│
├── Ollama or LM Studio → GGUF
│
├── llama.cpp → GGUF
│
├── vLLM (production serving) →
│   ├── Modern NVIDIA GPU (RTX 30xx/40xx)? → AWQ
│   └── Older NVIDIA GPU? → GPTQ
│
├── ExLlamaV2 (maximum speed) → EXL2
│
├── TGI (Hugging Face serving) →
│   ├── Want fastest inference? → AWQ
│   └── Want broadest compatibility? → GPTQ
│
└── Hugging Face Transformers (research/dev) → BitsAndBytes NF4
```
For 90% of local AI users: Use GGUF Q4_K_M. It is supported by the widest range of tools, runs on any hardware, and offers excellent quality. You only need to consider other formats if you are optimizing for specific deployment scenarios.
What does quantization quality loss actually look like?
Understanding the theoretical quality numbers is useful, but what does quantization quality loss actually look like in practice?
Perplexity benchmarks
Perplexity is the standard metric for measuring quantization quality. Lower perplexity means the model better predicts the next token. Here are representative perplexity measurements for Llama 3.1 8B at different quantization levels (measured on the wikitext-2 dataset):
| Quantization | Perplexity | Increase vs FP16 | Practical Impact |
|---|---|---|---|
| FP16 (baseline) | 6.14 | — | Perfect quality (reference) |
| Q8_0 | 6.15 | +0.01 (+0.2%) | Indistinguishable from FP16 |
| Q6_K | 6.17 | +0.03 (+0.5%) | Indistinguishable from FP16 |
| Q5_K_M | 6.21 | +0.07 (+1.1%) | No perceptible difference for most tasks |
| Q4_K_M | 6.32 | +0.18 (+2.9%) | Minimal impact. The sweet spot. |
| Q4_K_S | 6.36 | +0.22 (+3.6%) | Slight quality drop vs Q4_K_M. Still very good. |
| Q3_K_M | 6.58 | +0.44 (+7.2%) | Noticeable on complex reasoning. OK for chat. |
| Q3_K_S | 6.79 | +0.65 (+10.6%) | Clear quality drop. Visible in coherence. |
| Q2_K | 7.62 | +1.48 (+24.1%) | Significant degradation. Frequent errors. |
| IQ2_XS | 8.10 | +1.96 (+31.9%) | Severe quality loss. Not recommended for general use. |
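For reference, perplexity is just the exponential of the average negative log-likelihood per token. A perplexity of 6.14 loosely means the model is as uncertain, on average, as if it were choosing uniformly among about 6.14 candidate tokens:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over a sequence of tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigned every token probability 1/6.14 would score:
print(round(perplexity([math.log(1 / 6.14)] * 100), 2))   # 6.14
```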
Where quality loss manifests
At Q4_K_M (the recommended level), quality loss is negligible for most tasks. When you drop to Q3 and below, degradation shows up in specific ways:
- Factual accuracy: The model is more likely to confuse similar facts, mix up dates, or hallucinate details.
- Reasoning chains: Multi-step logical reasoning breaks down. The model may reach incorrect conclusions despite starting correctly.
- Long-form coherence: Extended generations (1000+ tokens) may lose thematic consistency or introduce contradictions.
- Instruction following: The model may ignore parts of complex instructions or produce the wrong output format.
- Rare knowledge: Obscure facts and specialized domain knowledge are the first to degrade.
For simple tasks like summarization, classification, extraction, and basic chat, even Q3 quantization is often adequate. For complex tasks like coding, analysis, and multi-step reasoning, staying at Q4 or above is strongly recommended.
How do you quantize a model yourself?
While pre-quantized models are readily available on Hugging Face, you may want to quantize a model yourself — perhaps a new release without existing quantizations, a custom fine-tune, or a specific quantization level that is not available.
Quantizing to GGUF
The llama.cpp project provides the llama-quantize tool:
```shell
# Step 1: Convert the model from Hugging Face format to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf

# Step 2: Quantize to your desired level
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```
For imatrix quantization (IQ variants), you first generate an importance matrix using calibration data:
```shell
# Generate the importance matrix from calibration data
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat

# Quantize with the importance matrix
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```
Quantizing to GPTQ
Use the AutoGPTQ library:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # target bit-width
    group_size=128,  # weights per scaling group; smaller = better quality, larger file
    desc_act=True,   # activation-order processing; better quality, slower inference
)

model = AutoGPTQForCausalLM.from_pretrained(
    "/path/to/model",
    quantize_config,
)

# calibration_dataset: a list of tokenized examples (a few hundred samples
# from a general corpus or your own domain) used to measure quantization error
model.quantize(calibration_dataset)
model.save_quantized("/path/to/output")
```
Quantizing to AWQ
Use the AutoAWQ library:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("/path/to/model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/model")

model.quantize(
    tokenizer,
    quant_config={"w_bit": 4, "q_group_size": 128},
)
model.save_quantized("/path/to/output")
```
Quantizing to EXL2
Use the ExLlamaV2 quantizer:
```shell
python convert.py \
    -i /path/to/model \
    -o /path/to/output \
    -b 4.0 \
    -c /path/to/calibration-data.parquet
```
The -b flag specifies the target bits per weight. EXL2 will automatically distribute bits across layers to optimize quality at the target average.
What about newer quantization methods?
The quantization landscape continues to evolve. Several newer methods are worth knowing about:
AQLM (Additive Quantization of Language Models)
AQLM uses multi-codebook quantization — instead of mapping each weight to a single integer, it maps groups of weights to combinations from multiple codebooks. This achieves impressive quality at 2-bit precision, outperforming Q2_K GGUF significantly. Still emerging but promising for extreme compression scenarios.
QuIP# (Quantization with Incoherence Processing)
QuIP# applies random rotations to model weights before quantization to reduce coherence (correlation between weights), which makes them easier to quantize without quality loss. Achieves strong results at 2-4 bits but requires specialized inference support.
HQQ (Half-Quadratic Quantization)
HQQ is a fast, calibration-free quantization method. Unlike GPTQ and AWQ, which require running calibration data through the model, HQQ quantizes based on weight statistics alone. This makes it much faster to quantize (minutes instead of hours) with competitive quality.
GGUF’s evolving quantization
The llama.cpp team continues to improve GGUF quantization. Recent additions include:
- Improved IQ (importance-matrix) quantization for better quality at 2-3 bits
- Better support for MoE model quantization
- Optimized kernels for mixed-precision inference
What is the recommended quantization for each scenario?
| Scenario | Recommended Format | Recommended Level | Reasoning |
|---|---|---|---|
| First-time user with Ollama | GGUF | Q4_K_M | Best overall balance; widest compatibility |
| Power user, plenty of VRAM | GGUF or EXL2 | Q5_K_M or 5.0 bpw | Quality boost with minimal size increase |
| VRAM-constrained, want larger model | GGUF | Q3_K_M or IQ3_M | Squeeze in a larger model with acceptable quality |
| Extremely limited VRAM (4-6 GB) | GGUF | Q2_K or IQ2_S | Last resort; significant quality trade-off |
| Production serving with vLLM | AWQ | 4-bit | Fastest throughput on modern NVIDIA GPUs |
| Maximum speed on NVIDIA GPU | EXL2 | 4.0-4.5 bpw | ExLlamaV2 achieves fastest tok/s |
| Apple Silicon (Metal) | GGUF | Q4_K_M or Q5_K_M | GGUF is the only format with Metal support |
| CPU-only inference | GGUF | Q4_K_M | GGUF is the only format with strong CPU support |
| AMD GPU (ROCm) | GGUF | Q4_K_M | GGUF’s ROCm support is the most mature |
| Research / quality comparison | GGUF | Q8_0 or FP16 | Minimize quantization effects for fair evaluation |
How do you choose between a smaller model at high quantization vs a larger model at low quantization?
This is one of the most common dilemmas in local AI. For example, with 12 GB of VRAM, you could run:
- Llama 3.1 8B at Q8_0 (7.7 GB) — maximum quality for this model size
- Llama 3.1 8B at Q4_K_M (4.4 GB) — standard quality with headroom for context
- Qwen2.5 14B at Q3_K_M (6.9 GB) — larger model, lower quantization
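A back-of-the-envelope check of which options fit. The 1.5 GB context/overhead figure is an illustrative assumption; real KV-cache size depends on context length, layer count, and cache precision:

```python
def fits_in_vram(model_gb: float, vram_gb: float = 12.0, overhead_gb: float = 1.5) -> bool:
    """Weights plus KV-cache/activation headroom must fit in VRAM."""
    return model_gb + overhead_gb <= vram_gb

options = {"8B Q8_0": 7.7, "8B Q4_K_M": 4.4, "14B Q3_K_M": 6.9, "70B Q4_K_M": 42.5}
for name, gb in options.items():
    print(f"{name}: {'fits' if fits_in_vram(gb) else 'does not fit'} in 12 GB")
```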
The general rule
A smaller model at Q4+ quantization usually outperforms a larger model at Q2-Q3 quantization. Quality degradation is steeply nonlinear below roughly 3 bits per weight: rounding errors that are negligible at 4-bit begin to compound through the network’s layers, and they can erase the larger model’s capability advantage entirely.
However, there are exceptions:
- If the larger model was specifically trained for your task (e.g., a 14B code model vs an 8B general model), the specialization advantage may outweigh the quantization penalty
- For tasks that depend primarily on knowledge breadth rather than reasoning precision (like trivia or translation), larger models retain their knowledge advantage even at lower quantization
- MoE models handle lower quantization somewhat better because only a subset of weights is active per token
Practical recommendation
- Start with the larger model at Q4_K_M if it fits
- If it does not fit at Q4_K_M, try Q3_K_M
- If it does not fit at Q3_K_M, drop to the smaller model at Q5_K_M or Q4_K_M
- Never go below Q2_K for any model you plan to use seriously
- Test both options on your specific tasks — the right answer depends on your use case
Summary: quantization in practice
Quantization transforms local AI from a theoretical possibility into a daily-use reality. Here are the essential takeaways:
- Q4_K_M GGUF is the default choice for 90% of users. It offers approximately 70% size reduction with approximately 97% quality retention.
- GGUF is the universal format. It works everywhere — CPU, NVIDIA GPU, AMD GPU, Apple Silicon. Use it unless you have a specific reason not to.
- AWQ is the best choice for production GPU serving with vLLM on modern NVIDIA hardware.
- EXL2 is the performance champion for maximum tokens per second on NVIDIA GPUs via ExLlamaV2.
- GPTQ is a solid GPU format with broad engine support, slightly eclipsed by AWQ on modern hardware.
- Stay at Q4 or above for tasks requiring reasoning, accuracy, and coherence. Only go lower when the model size increase justifies the quality trade-off.
- Pre-quantized models are readily available on Hugging Face. You rarely need to quantize yourself unless you are working with custom fine-tunes or new model releases.
Quantization is not a compromise — it is a precision engineering discipline that makes the entire local AI ecosystem possible. Understanding it lets you make informed choices about the quality, speed, and memory trade-offs that define your local AI experience.
For help choosing which model to quantize, see How to Choose the Right Local LLM. For hardware recommendations to run your chosen quantization level, see the Local AI Hardware Guide. For definitions of quantization-related terms, see our Local AI Glossary.