Quantization is the process of reducing the numerical precision of a model’s weights — from 16-bit floating point down to 8-bit, 4-bit, or even 2-bit integers — to shrink the model’s memory footprint and increase inference speed, with a controlled trade-off in output quality. A 7B parameter model in full precision (FP16) requires approximately 14 GB of memory; the same model quantized to 4-bit (Q4_K_M) needs only about 4.4 GB — a 70% reduction — while retaining roughly 97% of the original quality. Quantization is what makes running large language models on consumer hardware practical.
Without quantization, local AI would be impractical for most users. A 70B model at full precision requires 140 GB of memory — far beyond any consumer GPU. At 4-bit quantization, that same model fits in approximately 44 GB, making it runnable on a Mac with 48 GB unified memory or a pair of 24 GB GPUs. Quantization is not a compromise; it is an enabling technology that unlocks the entire local AI ecosystem.
This guide explains how quantization works, compares every major quantization format, provides detailed quality and performance benchmarks, and gives you a decision framework for choosing the right quantization for your hardware and use case.
What is quantization and why does it matter?
How model weights are stored
A large language model is fundamentally a massive collection of numerical parameters called “weights.” These weights are the learned values that encode the model’s knowledge and capabilities. During inference, these weights are multiplied with input data through matrix multiplication operations to produce outputs.
In their original training format, weights are stored as FP16 (16-bit floating point) or BF16 (Brain Float 16) numbers. Each weight occupies 2 bytes of memory. The math is straightforward:
- 7B parameters × 2 bytes = 14 GB
- 13B parameters × 2 bytes = 26 GB
- 70B parameters × 2 bytes = 140 GB
- 405B parameters × 2 bytes = 810 GB
These sizes exceed the VRAM of consumer GPUs (typically 8-24 GB), making full-precision models impractical for local deployment beyond the smallest models.
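The memory math generalizes to any bit-width. A minimal sketch (the function name is mine, GB here means 10^9 bytes, and real usage adds KV-cache and activation overhead on top of the weights):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes) at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(7, 16))    # FP16 7B  -> 14.0
print(model_memory_gb(70, 16))   # FP16 70B -> 140.0
print(model_memory_gb(70, 4.8))  # ~4.8 effective bits/weight, 70B -> ~42
```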
How quantization reduces size
Quantization replaces high-precision floating-point weights with lower-precision integers. Instead of storing each weight as a 16-bit float, you store it as an 8-bit, 4-bit, or even 2-bit integer, along with scaling factors that allow approximate reconstruction of the original values.
The key insight is that neural network weights are redundant. Not every weight needs full 16-bit precision to contribute meaningfully to the model’s output. Many weights cluster around similar values and can be represented with fewer bits without significantly affecting the model’s behavior.
Think of it like image compression: a JPEG photo is much smaller than the raw image file, and for most purposes, the visual quality is indistinguishable. Similarly, a quantized model is much smaller than the full-precision model, and for most tasks, the output quality is nearly identical.
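The core mechanism fits in a few lines of NumPy. This is a minimal symmetric-quantization sketch (function names are mine, not any production format): store low-precision integer codes plus one scale, and reconstruct approximate floats on the way back.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric 8-bit quantization: int8 codes plus a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights cluster near zero
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes)  # 4096 bytes, vs 8192 for FP16: half the memory
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # error is at most half a step
```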
The quantization trade-off
Quantization involves a three-way trade-off:
- Size reduction: Fewer bits per weight means a smaller model file and less memory needed.
- Speed improvement: Smaller models load faster and can be read from memory faster, increasing tokens per second.
- Quality loss: Lower precision means less accurate weight representations, which can degrade output quality.
The art of quantization is finding the sweet spot where size and speed improvements are maximized while quality loss remains imperceptible for your use case.
What are the different bit-widths used in quantization?
Full precision: FP16 and BF16
- FP16 (Float16): 16-bit floating point. The standard training format. 2 bytes per weight.
- BF16 (BFloat16): Google’s 16-bit format with a wider exponent range than FP16. Used in training; functionally equivalent to FP16 for inference quality.
- FP32 (Float32): 32-bit floating point. 4 bytes per weight. Rarely used for LLM inference because it doubles memory with negligible quality improvement over FP16.
These are “full precision” — the baseline against which all quantized formats are compared.
INT8 (8-bit integer)
8-bit quantization stores each weight as an integer spanning 256 values (for example, -128 to 127 in the symmetric case), with scaling factors to map the integer range back to approximate floating-point values. This halves memory usage compared to FP16 with minimal quality loss (typically less than 0.5% on benchmarks).
INT8 is widely supported and is considered “nearly lossless” quantization. It is a safe choice when you have enough memory to afford 8-bit but not full precision.
INT4 (4-bit integer)
4-bit quantization stores each weight as an integer from 0 to 15. This reduces memory by 75% compared to FP16. Quality loss becomes measurable but remains small with good quantization techniques (typically 1-3% on benchmarks).
4-bit is the most popular quantization level for local AI because it offers the best balance of size, speed, and quality. The majority of models downloaded through Ollama and LM Studio use 4-bit quantization.
INT3 and INT2 (3-bit and 2-bit)
3-bit and 2-bit quantization pushes the limits of compression. Quality degradation becomes significant — noticeable in longer generations, complex reasoning, and factual accuracy. These levels are useful when you need to fit a much larger model into limited memory and are willing to accept quality trade-offs.
Mixed precision
Modern quantization methods do not apply a uniform bit-width across all weights. Instead, they use mixed precision — assigning more bits to important layers (like attention layers) and fewer bits to less important layers (like feedforward layers). This yields better quality than uniform quantization at the same average bit rate.
GGUF’s “K-quant” system and EXL2’s variable bitrate both use this approach.
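A simplified sketch of the per-block scaling these schemes build on (the layout and names here are illustrative, not the real GGUF encoding): each small block of weights gets its own scale, which adapts to local weight magnitudes, and the per-block scales are where the fractional bits-per-weight figures come from.

```python
import numpy as np

def quantize_blocks(w: np.ndarray, bits: int = 4, block: int = 32):
    """Symmetric group-wise quantization: one scale per `block` weights."""
    levels = 2 ** (bits - 1) - 1                  # 7 for 4-bit symmetric
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / levels
    scales[scales == 0] = 1.0                     # guard all-zero blocks
    q = np.round(blocks / scales)
    return q, scales

rng = np.random.default_rng(1)
w = rng.normal(size=2048).astype(np.float32)
q, scales = quantize_blocks(w)
w_hat = (q * scales).ravel()

# Storage cost: 4 bits per weight + one 16-bit scale per 32 weights = 4.5 bpw,
# which is where figures like Q4_K_S's ~4.5 bits/weight come from.
print(4 + 16 / 32)                                # 4.5
print(np.abs(w - w_hat).max() <= scales.max() / 2 + 1e-6)
```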
How does GGUF quantization work?
GGUF (GPT-Generated Unified Format) is the quantization format created by the llama.cpp project and is the most widely used format for local AI. It is supported by Ollama, LM Studio, GPT4All, llama.cpp, and many other tools.
GGUF advantages
- Flexible deployment: Runs on CPU, NVIDIA GPU (CUDA), AMD GPU (ROCm), Apple GPU (Metal), and Vulkan
- Partial GPU offloading: Load some layers on GPU and the rest on CPU — no need for the entire model to fit in VRAM
- Wide quantization range: Available from Q2_K (2.7 bits) to Q8_0 (8 bits)
- Self-contained: A single GGUF file contains everything needed — model weights, tokenizer, architecture config
- Universal support: The de facto standard for local AI tools
GGUF quantization levels explained
GGUF offers many quantization levels based on the k-quant system. K-quant names follow the pattern Q{bits}_K_{tier}, where bits is the approximate bit-width and the tier (S=small, M=medium, L=large) indicates the quality/size variant within that bit-width. Legacy levels like Q4_0 and Q5_0 predate this scheme and use uniform quantization.
| Level | Bits/Weight | Size (7B model) | Quality | Speed | Description |
|---|---|---|---|---|---|
| Q2_K | 2.7 | 3.0 GB | Poor | Fastest | Severe quality loss. Only for fitting models that otherwise do not fit. |
| Q3_K_S | 3.5 | 3.4 GB | Low | Very fast | Significant quality loss. Noticeable on reasoning and coherence. |
| Q3_K_M | 3.9 | 3.6 GB | Below average | Very fast | Noticeable quality drop vs Q4. Acceptable for simple tasks. |
| Q3_K_L | 4.1 | 3.9 GB | Fair | Fast | Better than Q3_K_M. Reasonable for chat when VRAM is tight. |
| Q4_0 | 4.0 | 4.0 GB | Fair | Fast | Legacy uniform 4-bit. Superseded by Q4_K variants. |
| Q4_K_S | 4.5 | 4.2 GB | Good | Fast | Good quality. Slightly worse than Q4_K_M. |
| Q4_K_M | 4.8 | 4.4 GB | Very good | Fast | The community standard. Best balance of quality, size, and speed. |
| Q5_0 | 5.0 | 4.7 GB | Good | Moderate | Legacy uniform 5-bit. Superseded by Q5_K variants. |
| Q5_K_S | 5.3 | 5.0 GB | Very good | Moderate | Good upgrade from Q4_K_M if you have the VRAM. |
| Q5_K_M | 5.5 | 5.3 GB | Excellent | Moderate | Near-full-precision quality. Recommended when VRAM allows. |
| Q6_K | 6.6 | 5.9 GB | Excellent | Moderate | Barely distinguishable from FP16 on most tasks. |
| Q8_0 | 8.0 | 7.7 GB | Near-perfect | Slower | Essentially lossless. Use when you want maximum quality. |
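The naming scheme can be parsed mechanically. A hypothetical helper, just to make the convention concrete:

```python
def parse_gguf_level(name: str) -> dict:
    """Split a GGUF level name like 'Q4_K_M' into its parts (sketch)."""
    parts = name.split("_")
    tiers = {"S": "small", "M": "medium", "L": "large"}
    return {
        "bits": int(parts[0][1:]),                     # nominal bit-width
        "kquant": len(parts) > 1 and parts[1] == "K",  # k-quant vs legacy
        "tier": tiers.get(parts[2]) if len(parts) > 2 else None,
    }

print(parse_gguf_level("Q4_K_M"))  # {'bits': 4, 'kquant': True, 'tier': 'medium'}
print(parse_gguf_level("Q8_0"))    # {'bits': 8, 'kquant': False, 'tier': None}
```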
GGUF imatrix quantization (IQ series)
The “IQ” variants use importance matrix data to further optimize quantization. They measure which weights are most important using calibration data and assign precision accordingly:
| Level | Bits/Weight | Size (7B model) | Quality vs Standard | Notes |
|---|---|---|---|---|
| IQ1_S | 1.6 | 1.8 GB | Very poor | Experimental. Extreme compression. |
| IQ2_XS | 2.3 | 2.5 GB | Poor | Better than Q2_K at similar sizes. |
| IQ2_S | 2.5 | 2.7 GB | Below average | Usable for simple tasks. |
| IQ3_XS | 3.3 | 3.3 GB | Fair | Better than Q3_K_S at similar sizes. |
| IQ3_M | 3.6 | 3.5 GB | Fair-good | Competitive with Q3_K_M. |
| IQ4_XS | 4.3 | 4.1 GB | Good | Slightly smaller than Q4_K_S with similar quality. |
IQ quantizations are newer and can offer better quality at very low bit rates (2-3 bits) compared to standard K-quants. They require an importance matrix generated from calibration data, which quantization providers like TheBloke and bartowski typically include.
How does GPTQ quantization work?
GPTQ (from the paper “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”) is a GPU-optimized quantization method that uses calibration data to minimize the quantization error layer by layer.
How GPTQ differs from GGUF
GPTQ takes a different approach to quantization. Instead of simple rounding (as in basic GGUF), GPTQ:
- Processes the model one layer at a time
- For each layer, uses a calibration dataset to measure the impact of quantizing each weight
- Applies a second-order correction (based on the Hessian matrix) to adjust remaining weights to compensate for quantization error
- This “error compensation” produces better quality at the same bit rate compared to naive quantization
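The error-compensation idea can be illustrated with a deliberately simplified one-dimensional toy. Real GPTQ uses second-order (Hessian) statistics from calibration data to decide how to redistribute error across remaining weights; in this sketch, each weight’s rounding error is simply carried into the next weight before it is rounded, which already keeps accumulated error bounded instead of letting it random-walk:

```python
import numpy as np

def quantize_with_error_feedback(w: np.ndarray, scale: float) -> np.ndarray:
    """Toy stand-in for GPTQ-style compensation: fold each weight's
    rounding error into the next weight before rounding it."""
    q = np.empty_like(w)
    carry = 0.0
    for i, wi in enumerate(w):
        target = wi + carry
        q[i] = np.round(target / scale) * scale
        carry = target - q[i]
    return q

rng = np.random.default_rng(2)
w = rng.normal(0.0, 1.0, size=256)
scale = 0.5
naive = np.round(w / scale) * scale
comp = quantize_with_error_feedback(w, scale)

# Running-sum error (a crude proxy for correlated downstream effects)
# stays bounded by half a quantization step with feedback:
print(np.abs(np.cumsum(w - comp)).max() <= scale / 2 + 1e-9)   # True
print(np.abs(np.cumsum(w - naive)).max())                      # much larger
```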
GPTQ specifications
| Aspect | Details |
|---|---|
| Typical bit-widths | 4-bit (most common), 8-bit, 3-bit |
| Group size | 128 (standard), 64, 32 — smaller groups = better quality but larger file |
| Format | SafeTensors files with quantization config |
| GPU requirement | Requires NVIDIA GPU with CUDA. Does not support CPU inference. |
| Supported engines | ExLlamaV2, vLLM, AutoGPTQ, TGI |
| Offloading | Limited. Best when entire model fits in VRAM. |
When to use GPTQ
- You have a modern NVIDIA GPU with enough VRAM to hold the entire model
- You want fast GPU inference and do not need CPU fallback
- You are using ExLlamaV2 or vLLM as your inference engine
- You want good quality at 4-bit with error compensation
GPTQ quality comparison
At 4-bit precision, GPTQ typically offers slightly better quality than basic GGUF Q4_0 due to its error compensation technique. However, GGUF’s newer K-quant methods (Q4_K_M, Q4_K_S) have largely closed this gap. The main advantage of GPTQ is speed on NVIDIA GPUs with engines optimized for it.
How does AWQ quantization work?
AWQ (Activation-Aware Weight Quantization) is a more recent quantization method that observes which weights have the largest impact on model activations and preserves those weights at higher precision.
AWQ’s approach
AWQ’s key insight is that not all weights are equally important. A small fraction of weights (roughly 1%) has a disproportionate impact on model outputs. AWQ:
- Analyzes the model using calibration data to identify “salient” weights — those that produce the largest activations
- Applies a per-channel scaling transformation to protect salient weights from quantization error
- Quantizes all weights to 4-bit, but the scaling ensures that important weights lose less precision
This activation-aware approach yields better quality than naive quantization at the same bit rate, particularly for smaller models where every bit of precision matters more.
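The per-channel scaling trick is easy to demonstrate: scaling a weight column up and the matching activation down leaves the layer’s output mathematically unchanged, but the scaled-up (salient) weights lose less relative precision when quantized. A sketch, where the names and the fixed `alpha` are illustrative (real AWQ searches for the scales per layer using calibration data):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=(8, 128))                  # weights: (out_features, in_features)
x = rng.normal(size=128)                       # one activation vector
act_mag = np.abs(rng.normal(size=128)) + 0.1   # per-input-channel activation stats

alpha = 0.5                 # balance factor; real AWQ searches this per layer
s = act_mag ** alpha        # scale salient (high-activation) channels up
w_scaled = w * s            # fold s into the weight columns (these get quantized)
x_scaled = x / s            # fold 1/s into the activations

# Mathematically a no-op for the layer output...
print(np.allclose(w_scaled @ x_scaled, w @ x))  # True
# ...but after 4-bit quantization, the scaled-up columns keep more precision.
```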
AWQ specifications
| Aspect | Details |
|---|---|
| Typical bit-widths | 4-bit (standard) |
| Group size | 128 (standard) |
| Format | SafeTensors files with AWQ config |
| GPU requirement | Requires modern NVIDIA GPU. Optimized for Ampere+ (RTX 30xx, 40xx). |
| Supported engines | vLLM, TGI, AutoAWQ |
| Key advantage | Faster inference than GPTQ on modern GPUs due to optimized kernels |
When to use AWQ
- You are deploying with vLLM or TGI for production serving
- You have a modern NVIDIA GPU (Ampere or newer — RTX 3000/4000 series)
- You want the fastest 4-bit GPU inference with good quality
- You are serving multiple concurrent users and throughput matters
AWQ vs GPTQ
| Dimension | AWQ | GPTQ |
|---|---|---|
| Quality at 4-bit | Slightly better on average | Good, with error compensation |
| Inference speed | Faster on Ampere+ GPUs | Good, but slightly slower than AWQ on modern hardware |
| GPU support | Ampere+ recommended | Broader GPU support (works on older NVIDIA GPUs) |
| Engine support | vLLM, TGI, AutoAWQ | ExLlamaV2, vLLM, AutoGPTQ, TGI |
| Quantization speed | Faster to quantize | Slower to quantize (Hessian computation) |
| Community adoption | Growing; increasingly the default for vLLM | Established; large library of pre-quantized models |
For most users deploying with vLLM on modern NVIDIA GPUs, AWQ is the recommended choice. For users on older GPUs or using ExLlamaV2, GPTQ is the better option.
How does EXL2 quantization work?
EXL2 is the quantization format used by the ExLlamaV2 inference engine. It offers a unique capability: variable bitrate across the model.
EXL2’s variable bitrate
Unlike GGUF (where you choose a single quantization level like Q4_K_M for the entire model) or GPTQ/AWQ (which use a fixed 4-bit or 8-bit for all weights), EXL2 allows you to specify a target average bitrate (e.g., 4.0 bits per weight) and then distributes bits non-uniformly across layers.
For example, at a 4.0 bpw (bits per weight) target, EXL2 might:
- Assign 6-8 bits to the most important attention layers
- Assign 3-4 bits to less important feedforward layers
- Average out to exactly 4.0 bits per weight overall
This variable allocation produces better quality than uniform quantization at the same average bit rate.
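The allocation can be sketched as a toy greedy budget problem. This helper is hypothetical; ExLlamaV2’s real allocator measures per-layer quantization error against calibration data rather than taking importance scores as input:

```python
def allocate_bits(importance, target_avg, choices=(3, 4, 5, 6, 8)):
    """Give high-importance layers more bits while keeping the average
    bits-per-weight at or below target_avg. Deliberately simplified."""
    n = len(importance)
    bits = [min(choices)] * n                    # everyone starts at the floor
    budget = target_avg * n - sum(bits)          # spare bits to hand out
    for i in sorted(range(n), key=lambda j: -importance[j]):
        for b in sorted(choices, reverse=True):  # try the biggest upgrade first
            if b > bits[i] and b - bits[i] <= budget:
                budget -= b - bits[i]
                bits[i] = b
                break
    return bits

layer_importance = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
bits = allocate_bits(layer_importance, target_avg=4.0)
print(bits, sum(bits) / len(bits))   # [8, 6, 3, 3, 3, 3, 3, 3] 4.0
```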
EXL2 specifications
| Aspect | Details |
|---|---|
| Bit-widths | Variable. Common targets: 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.0 bpw |
| Format | SafeTensors with EXL2 quantization tables |
| GPU requirement | NVIDIA GPU with CUDA (compute capability 6.0+, i.e., Pascal or newer). GPU-only — no CPU inference. |
| Supported engines | ExLlamaV2 only |
| Key advantage | Fastest token generation speeds; variable bitrate for optimal quality |
| Key limitation | ExLlamaV2-only; no CPU support; smaller community |
When to use EXL2
- You are using ExLlamaV2 as your inference engine
- You want the absolute fastest token generation on NVIDIA GPUs
- You want fine-grained control over the quality/size trade-off (e.g., 3.5 bpw instead of being locked to Q3 or Q4)
- You are an advanced user comfortable with a smaller ecosystem
EXL2 performance
ExLlamaV2 with EXL2 quantization achieves some of the fastest token generation speeds available:
| Model | Quantization | GPU | Tokens/Second |
|---|---|---|---|
| Llama 3.1 8B | 4.0 bpw EXL2 | RTX 4090 | 110-140 |
| Llama 3.1 8B | 4.0 bpw EXL2 | RTX 3090 | 80-100 |
| Llama 3.1 70B | 4.0 bpw EXL2 | 2× RTX 4090 | 18-25 |
| Mixtral 8x7B | 4.5 bpw EXL2 | RTX 4090 | 45-60 |
These speeds are typically 10-30% faster than GGUF on the same hardware for pure GPU inference, making EXL2 the performance king for NVIDIA GPU users who want maximum speed.
How do all the quantization formats compare?
Here is a comprehensive comparison of all major quantization formats:
| Format | Bits | CPU Support | GPU Support | Partial Offload | Quality (4-bit) | Speed (4-bit, GPU) | Primary Engine | Ecosystem Size |
|---|---|---|---|---|---|---|---|---|
| GGUF | 2-8 | Yes | CUDA, Metal, ROCm, Vulkan | Yes | Very good (K-quants) | Good | llama.cpp, Ollama | Largest |
| GPTQ | 3-8 | No | CUDA | Limited | Good | Good | ExLlamaV2, vLLM | Large |
| AWQ | 4 | No | CUDA (Ampere+) | No | Very good | Very good | vLLM, TGI | Growing |
| EXL2 | 2-8 (variable) | No | CUDA | No | Excellent (variable bitrate) | Best | ExLlamaV2 | Smaller |
| ONNX | 4-16 | Yes | Various | Depends | Good | Varies | ONNX Runtime | Moderate |
| BitsAndBytes (NF4) | 4 | Yes (slow) | CUDA | N/A | Good | Moderate | Transformers (HF) | Large |
Decision tree for choosing a format
```
Start here: What is your inference engine?
│
├── Ollama or LM Studio → GGUF
│
├── llama.cpp → GGUF
│
├── vLLM (production serving) →
│   ├── Modern NVIDIA GPU (RTX 30xx/40xx)? → AWQ
│   └── Older NVIDIA GPU? → GPTQ
│
├── ExLlamaV2 (maximum speed) → EXL2
│
├── TGI (Hugging Face serving) →
│   ├── Want fastest inference? → AWQ
│   └── Want broadest compatibility? → GPTQ
│
└── Hugging Face Transformers (research/dev) → BitsAndBytes NF4
```
For 90% of local AI users: Use GGUF Q4_K_M. It is supported by the widest range of tools, runs on any hardware, and offers excellent quality. You only need to consider other formats if you are optimizing for specific deployment scenarios.
What does quantization quality loss actually look like?
Understanding the theoretical quality numbers is useful, but what does quantization quality loss actually look like in practice?
Perplexity benchmarks
Perplexity is the standard metric for measuring quantization quality. Lower perplexity means the model better predicts the next token. Here are representative perplexity measurements for Llama 3.1 8B at different quantization levels (measured on the wikitext-2 dataset):
| Quantization | Perplexity | Increase vs FP16 | Practical Impact |
|---|---|---|---|
| FP16 (baseline) | 6.14 | — | Perfect quality (reference) |
| Q8_0 | 6.15 | +0.01 (+0.2%) | Indistinguishable from FP16 |
| Q6_K | 6.17 | +0.03 (+0.5%) | Indistinguishable from FP16 |
| Q5_K_M | 6.21 | +0.07 (+1.1%) | No perceptible difference for most tasks |
| Q4_K_M | 6.32 | +0.18 (+2.9%) | Minimal impact. The sweet spot. |
| Q4_K_S | 6.36 | +0.22 (+3.6%) | Slight quality drop vs Q4_K_M. Still very good. |
| Q3_K_M | 6.58 | +0.44 (+7.2%) | Noticeable on complex reasoning. OK for chat. |
| Q3_K_S | 6.79 | +0.65 (+10.6%) | Clear quality drop. Visible in coherence. |
| Q2_K | 7.62 | +1.48 (+24.1%) | Significant degradation. Frequent errors. |
| IQ2_XS | 8.10 | +1.96 (+31.9%) | Severe quality loss. Not recommended for general use. |
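For reference, perplexity is just the exponential of the average negative log-likelihood per token. A perplexity of 6.14 loosely means the model is as uncertain, on average, as if it were choosing uniformly among about 6.14 candidate tokens:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over a sequence of tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigned every token probability 1/6.14 would score:
print(round(perplexity([math.log(1 / 6.14)] * 100), 2))   # 6.14
```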
Where quality loss manifests
At Q4_K_M (the recommended level), quality loss is negligible for most tasks. When you drop to Q3 and below, degradation shows up in specific ways:
- Factual accuracy: The model is more likely to confuse similar facts, mix up dates, or hallucinate details.
- Reasoning chains: Multi-step logical reasoning breaks down. The model may reach incorrect conclusions despite starting correctly.
- Long-form coherence: Extended generations (1000+ tokens) may lose thematic consistency or introduce contradictions.
- Instruction following: The model may ignore parts of complex instructions or produce the wrong output format.
- Rare knowledge: Obscure facts and specialized domain knowledge are the first to degrade.
For simple tasks like summarization, classification, extraction, and basic chat, even Q3 quantization is often adequate. For complex tasks like coding, analysis, and multi-step reasoning, staying at Q4 or above is strongly recommended.
How do you quantize a model yourself?
While pre-quantized models are readily available on Hugging Face, you may want to quantize a model yourself — perhaps a new release without existing quantizations, a custom fine-tune, or a specific quantization level that is not available.
Quantizing to GGUF
The llama.cpp project provides the llama-quantize tool:
```shell
# Step 1: Convert the model from Hugging Face format to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf

# Step 2: Quantize to your desired level
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```
For imatrix quantization (IQ variants), you first generate an importance matrix using calibration data:
```shell
# Generate the importance matrix from calibration data
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat

# Quantize with the importance matrix
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```
Quantizing to GPTQ
Use the AutoGPTQ library:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # target bit-width
    group_size=128,  # weights per scaling group; smaller = better quality, larger file
    desc_act=True,   # activation-order processing; better quality, slower inference
)

model = AutoGPTQForCausalLM.from_pretrained(
    "/path/to/model",
    quantize_config,
)

# calibration_dataset: a list of tokenized examples (a few hundred samples
# from a general corpus or your own domain) used to measure quantization error
model.quantize(calibration_dataset)
model.save_quantized("/path/to/output")
```
Quantizing to AWQ
Use the AutoAWQ library:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("/path/to/model")
tokenizer = AutoTokenizer.from_pretrained("/path/to/model")

model.quantize(
    tokenizer,
    quant_config={"w_bit": 4, "q_group_size": 128},
)
model.save_quantized("/path/to/output")
```
Quantizing to EXL2
Use the ExLlamaV2 quantizer:
```shell
python convert.py \
    -i /path/to/model \
    -o /path/to/output \
    -b 4.0 \
    -c /path/to/calibration-data.parquet
```
The -b flag specifies the target bits per weight. EXL2 will automatically distribute bits across layers to optimize quality at the target average.
What about newer quantization methods?
The quantization landscape continues to evolve. Several newer methods are worth knowing about:
AQLM (Additive Quantization of Language Models)
AQLM uses multi-codebook quantization — instead of mapping each weight to a single integer, it maps groups of weights to combinations from multiple codebooks. This achieves impressive quality at 2-bit precision, outperforming Q2_K GGUF significantly. Still emerging but promising for extreme compression scenarios.
QuIP# (Quantization with Incoherence Processing)
QuIP# applies random rotations to model weights before quantization to reduce coherence (correlation between weights), which makes them easier to quantize without quality loss. Achieves strong results at 2-4 bits but requires specialized inference support.
HQQ (Half-Quadratic Quantization)
HQQ is a fast, calibration-free quantization method. Unlike GPTQ and AWQ, which require running calibration data through the model, HQQ quantizes based on weight statistics alone. This makes it much faster to quantize (minutes instead of hours) with competitive quality.
GGUF’s evolving quantization
The llama.cpp team continues to improve GGUF quantization. Recent additions include:
- Improved IQ (importance-matrix) quantization for better quality at 2-3 bits
- Better support for MoE model quantization
- Optimized kernels for mixed-precision inference
What is the recommended quantization for each scenario?
| Scenario | Recommended Format | Recommended Level | Reasoning |
|---|---|---|---|
| First-time user with Ollama | GGUF | Q4_K_M | Best overall balance; widest compatibility |
| Power user, plenty of VRAM | GGUF or EXL2 | Q5_K_M or 5.0 bpw | Quality boost with minimal size increase |
| VRAM-constrained, want larger model | GGUF | Q3_K_M or IQ3_M | Squeeze in a larger model with acceptable quality |
| Extremely limited VRAM (4-6 GB) | GGUF | Q2_K or IQ2_S | Last resort; significant quality trade-off |
| Production serving with vLLM | AWQ | 4-bit | Fastest throughput on modern NVIDIA GPUs |
| Maximum speed on NVIDIA GPU | EXL2 | 4.0-4.5 bpw | ExLlamaV2 achieves fastest tok/s |
| Apple Silicon (Metal) | GGUF | Q4_K_M or Q5_K_M | GGUF is the only format with Metal support |
| CPU-only inference | GGUF | Q4_K_M | GGUF is the only format with strong CPU support |
| AMD GPU (ROCm) | GGUF | Q4_K_M | GGUF’s ROCm support is the most mature |
| Research / quality comparison | GGUF | Q8_0 or FP16 | Minimize quantization effects for fair evaluation |
How do you choose between a smaller model at high quantization vs a larger model at low quantization?
This is one of the most common dilemmas in local AI. For example, with 12 GB of VRAM, you could run:
- Llama 3.1 8B at Q8_0 (7.7 GB) — maximum quality for this model size
- Llama 3.1 8B at Q4_K_M (4.4 GB) — standard quality with headroom for context
- Qwen2.5 14B at Q3_K_M (6.9 GB) — larger model, lower quantization
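A back-of-the-envelope check of which options fit. The 1.5 GB context/overhead figure is an illustrative assumption; real KV-cache size depends on context length, layer count, and cache precision:

```python
def fits_in_vram(model_gb: float, vram_gb: float = 12.0, overhead_gb: float = 1.5) -> bool:
    """Weights plus KV-cache/activation headroom must fit in VRAM."""
    return model_gb + overhead_gb <= vram_gb

options = {"8B Q8_0": 7.7, "8B Q4_K_M": 4.4, "14B Q3_K_M": 6.9, "70B Q4_K_M": 42.5}
for name, gb in options.items():
    print(f"{name}: {'fits' if fits_in_vram(gb) else 'does not fit'} in 12 GB")
```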
The general rule
A smaller model at Q4+ quantization usually outperforms a larger model at Q2-Q3 quantization. Quality degradation is steeply nonlinear below roughly 3 bits per weight: rounding errors that are negligible at 4-bit begin to compound through the network’s layers, and they can erase the larger model’s capability advantage entirely.
However, there are exceptions:
- If the larger model was specifically trained for your task (e.g., a 14B code model vs an 8B general model), the specialization advantage may outweigh the quantization penalty
- For tasks that depend primarily on knowledge breadth rather than reasoning precision (like trivia or translation), larger models retain their knowledge advantage even at lower quantization
- MoE models handle lower quantization somewhat better because only a subset of weights is active per token
Practical recommendation
- Start with the larger model at Q4_K_M if it fits
- If it does not fit at Q4_K_M, try Q3_K_M
- If it does not fit at Q3_K_M, drop to the smaller model at Q5_K_M or Q4_K_M
- Never go below Q2_K for any model you plan to use seriously
- Test both options on your specific tasks — the right answer depends on your use case
Summary: quantization in practice
Quantization transforms local AI from a theoretical possibility into a daily-use reality. Here are the essential takeaways:
- Q4_K_M GGUF is the default choice for 90% of users. It offers approximately 70% size reduction with approximately 97% quality retention.
- GGUF is the universal format. It works everywhere — CPU, NVIDIA GPU, AMD GPU, Apple Silicon. Use it unless you have a specific reason not to.
- AWQ is the best choice for production GPU serving with vLLM on modern NVIDIA hardware.
- EXL2 is the performance champion for maximum tokens per second on NVIDIA GPUs via ExLlamaV2.
- GPTQ is a solid GPU format with broad engine support, slightly eclipsed by AWQ on modern hardware.
- Stay at Q4 or above for tasks requiring reasoning, accuracy, and coherence. Only go lower when the model size increase justifies the quality trade-off.
- Pre-quantized models are readily available on Hugging Face. You rarely need to quantize yourself unless you are working with custom fine-tunes or new model releases.
Quantization is not a compromise — it is a precision engineering discipline that makes the entire local AI ecosystem possible. Understanding it lets you make informed choices about the quality, speed, and memory trade-offs that define your local AI experience.
For help choosing which model to quantize, see How to Choose the Right Local LLM. For hardware recommendations to run your chosen quantization level, see the Local AI Hardware Guide. For definitions of quantization-related terms, see our Local AI Glossary.