Model quantization is what makes running large language models on consumer hardware possible, and the format you choose affects quality, speed, memory usage, and which tools you can use. GGUF, GPTQ, AWQ, and EXL2 are the four dominant quantization formats in the local LLM ecosystem as of 2026, each optimized for different hardware and use cases. This comparison provides the data-driven analysis you need to pick the right format for your models, hardware, and inference stack.
Quick Comparison
| Feature | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Full name | GPT-Generated Unified Format | Generative Pre-trained Transformer Quantization | Activation-aware Weight Quantization | ExLlamaV2 format |
| Developer | Georgi Gerganov (llama.cpp) | IST Austria + community | MIT Han Lab | turboderp |
| Quantization type | Post-training (weight-only) | Post-training (weight-only) | Post-training (weight-only) | Post-training (mixed precision) |
| Bit widths | 2, 3, 4, 5, 6, 8-bit + IQ | 2, 3, 4, 8-bit | 4-bit (primarily) | Flexible (2-8 bit, per-layer) |
| CPU inference | Excellent (SIMD optimized) | No (GPU only) | No (GPU only) | No (GPU only) |
| NVIDIA GPU | Yes (CUDA offloading) | Yes (CUDA kernels) | Yes (CUDA kernels) | Yes (custom CUDA kernels) |
| AMD GPU | Yes (ROCm, Vulkan) | Limited | Limited | No |
| Apple Silicon | Yes (Metal) | No | No | No |
| Hybrid CPU+GPU | Yes (layer offloading) | No | No | No |
| Primary engine | llama.cpp, Ollama | AutoGPTQ, vLLM, TGI | AutoAWQ, vLLM, TGI | ExLlamaV2 |
| Calibration data | Optional (imatrix) | Required | Required | Required |
| Self-contained | Yes (weights + metadata) | Separate config needed | Separate config needed | Separate config needed |
| File structure | Single .gguf file | Model directory (safetensors) | Model directory (safetensors) | Model directory |
| HF availability | Thousands of models | Thousands of models | Thousands of models | Hundreds of models |
Quality / Size / Speed Comparison
The following table compares Llama 3.1 8B across formats at approximately 4-bit quantization, measured on an NVIDIA RTX 4090. Perplexity is measured on WikiText-2 (lower is better). FP16 baseline perplexity for this model is ~6.2.
| Format & Quant | Size (GB) | Perplexity | Generation (tok/s) | Prompt (tok/s) | VRAM (GB) |
|---|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 6.20 | ~75 | ~2,800 | 16.5 |
| GGUF Q4_K_M | 4.9 | 6.38 | ~110 | ~3,200 | 5.8 |
| GGUF Q4_K_S | 4.6 | 6.42 | ~115 | ~3,300 | 5.5 |
| GGUF IQ4_XS | 4.3 | 6.35 | ~105 | ~3,000 | 5.2 |
| GPTQ 4-bit 128g | 4.5 | 6.40 | ~120 | ~3,500 | 5.5 |
| GPTQ 4-bit 32g | 5.0 | 6.32 | ~105 | ~3,200 | 6.0 |
| AWQ 4-bit 128g | 4.5 | 6.36 | ~125 | ~3,800 | 5.4 |
| EXL2 4.0 bpw | 4.2 | 6.33 | ~150 | ~4,000 | 5.3 |
| EXL2 3.5 bpw | 3.7 | 6.55 | ~160 | ~4,200 | 4.8 |
| EXL2 5.0 bpw | 5.2 | 6.26 | ~135 | ~3,600 | 6.2 |
| GGUF Q5_K_M | 5.7 | 6.28 | ~100 | ~2,900 | 6.6 |
| GGUF Q6_K | 6.6 | 6.23 | ~90 | ~2,600 | 7.4 |
| GGUF Q8_0 | 8.5 | 6.21 | ~80 | ~2,200 | 9.3 |
Key observations:
- EXL2 leads in speed on NVIDIA GPUs due to turboderp’s custom CUDA kernels optimized specifically for mixed-precision inference
- AWQ is the fastest among standard formats because its activation-aware approach enables more efficient GPU kernels
- GGUF offers the most size/quality options ranging from Q2 (aggressive compression) through Q8 (near-lossless)
- Quality is remarkably close across all formats at 4-bit — the perplexity difference is typically under 0.1, which is imperceptible in most practical use
- GGUF IQ quantizations (imatrix-based) punch above their weight, achieving quality comparable to GPTQ/AWQ at similar sizes
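For reference, the perplexity metric used in the table is just the exponential of the average per-token negative log-likelihood on the evaluation text. A minimal sketch of the computation (the probabilities below are made-up illustrative numbers, not outputs of any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood over the token stream)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to five reference tokens.
print(round(perplexity([0.42, 0.10, 0.73, 0.05, 0.61]), 2))
```

A model that assigned every token probability 0.5 would score a perplexity of exactly 2.0, which is why small perplexity deltas like the 6.38 vs 6.20 above correspond to very small average probability differences.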
Quality Retention Deep Dive
GGUF
GGUF offers the widest range of quantization levels, from extremely aggressive Q2_K (2-bit) to near-lossless Q8_0 (8-bit). The K-quant variants (Q4_K_M, Q5_K_M, etc.) use mixed precision within the model — more important layers get higher precision, less important layers get lower precision.
The IQ (importance-matrix) quantizations represent the state of the art in GGUF quality. By measuring the importance of each weight using calibration data, imatrix quantizations allocate bits more intelligently. IQ4_XS achieves quality comparable to standard Q4_K_M while being 10-15% smaller.
GGUF’s quality advantage is flexibility: you can choose exactly the quality-size tradeoff you need, from aggressive 2-bit compression for memory-constrained systems to 8-bit for maximum quality.
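The principle behind importance-weighted quantization can be shown in a toy sketch. This is not llama.cpp's actual imatrix algorithm, just an illustration of the idea: when picking a group's quantization scale, minimize reconstruction error weighted by per-weight importance, so the weights that matter most are reproduced most faithfully:

```python
import numpy as np

def quantize_group(w, scale, bits=4):
    """Symmetric round-to-nearest quantization, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def weighted_error(w, imp, scale):
    """Reconstruction error with each weight's error scaled by its importance."""
    return float(np.sum(imp * (w - quantize_group(w, scale)) ** 2))

def best_scale(w, imp, bits=4, n_candidates=100):
    """Grid-search the group scale that minimizes importance-weighted error."""
    absmax = np.abs(w).max() / (2 ** (bits - 1) - 1)
    candidates = np.append(absmax * np.linspace(0.5, 1.2, n_candidates), absmax)
    return min(candidates, key=lambda s: weighted_error(w, imp, s))

rng = np.random.default_rng(0)
w = rng.normal(size=128)                # one quantization group of weights
imp = rng.uniform(0.1, 10.0, size=128)  # hypothetical per-weight importances

naive = np.abs(w).max() / 7             # plain absmax scale for 4-bit
tuned = best_scale(w, imp)
print(weighted_error(w, imp, tuned) <= weighted_error(w, imp, naive))  # True
```

Because the plain absmax scale is included among the candidates, the importance-tuned scale is never worse under the weighted metric; in practice it is usually strictly better for the high-importance weights.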
GPTQ
GPTQ quantizes weights by solving a layer-wise quantization optimization problem using second-order information (the Hessian matrix). The group size parameter controls the granularity of quantization — smaller groups (32) preserve more quality at a slight size cost, larger groups (128) are more compact.
GPTQ produces consistent, reliable quantizations. The format is well-understood and widely supported. Quality at 4-bit with group size 128 is comparable to GGUF Q4_K_M.
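The size cost of smaller groups is easy to quantify: each group carries its own scale (and typically a zero-point), so per-weight overhead grows as groups shrink. A back-of-envelope sketch, assuming an fp16 scale and a packed 4-bit zero-point per group (typical choices, used here as an assumption rather than a fixed part of the format):

```python
def effective_bpw(bits, group_size, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group metadata: one scale and one
    zero-point amortized over every weight in the group."""
    return bits + (scale_bits + zero_bits) / group_size

for g in (32, 64, 128):
    print(f"group size {g:>3}: {effective_bpw(4, g):.3f} effective bpw")
```

This is consistent with the benchmark table above, where 4-bit at group size 32 comes out roughly 10% larger on disk than at group size 128.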
AWQ
AWQ (Activation-aware Weight Quantization) identifies weights that are important based on activation patterns in calibration data, and protects those weights during quantization. This activation-aware approach achieves slightly better quality than GPTQ at the same bit width in many benchmarks.
AWQ’s primary practical advantage is speed: its activation-aware approach produces weight layouts that GPU kernels can dequantize especially efficiently, which is what underpins its throughput edge over standard GPTQ kernels.
EXL2
EXL2’s per-layer mixed-precision approach is the most flexible. During quantization, EXL2 allocates bits per layer based on measured sensitivity — layers that affect output quality most get more bits, while less sensitive layers are compressed more aggressively. You specify a target bits-per-weight (bpw), and EXL2 distributes bits optimally across layers.
This flexibility means EXL2 can target any bpw between ~2.0 and ~8.0, and the quality at any given average bpw tends to match or exceed other formats because the bit allocation is globally optimized rather than uniform.
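The size implied by a bpw target is simple arithmetic: parameter count times bpw over eight, plus some allowance for parts kept at higher precision. A sketch (the 5% overhead factor is an assumption for illustration, not an EXL2 constant):

```python
def model_size_gb(n_params_billion, bpw, overhead=1.05):
    """Approximate quantized model size: params * bpw / 8 bytes, plus a
    rough 5% allowance (assumed) for embeddings and metadata that stay
    at higher precision."""
    return n_params_billion * 1e9 * bpw / 8 * overhead / 1e9

for bpw in (3.5, 4.0, 5.0):
    print(f"{bpw} bpw -> {model_size_gb(8.0, bpw):.1f} GB")
```

For an 8B model this lands close to the EXL2 row sizes in the benchmark table above, which is a quick sanity check when deciding what bpw will fit your VRAM.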
Speed Analysis
On NVIDIA GPU
EXL2 is fastest on NVIDIA GPUs because ExLlamaV2 uses custom CUDA kernels that are purpose-built for mixed-precision dequantization and matrix multiplication. These kernels are optimized for the RTX 3000/4000/5000 series consumer GPUs.
AWQ is second-fastest because its quantization approach produces weight layouts that are efficiently dequantized by GPU kernels. The Marlin and GEMM kernels used for AWQ inference in vLLM are highly optimized.
GPTQ speed depends on the kernel implementation. With the fast Marlin kernel (available in vLLM and newer AutoGPTQ versions), GPTQ matches AWQ speeds. With older kernels, it is 10-20% slower.
GGUF on NVIDIA GPU is competitive but typically 10-30% slower than EXL2/AWQ because llama.cpp’s CUDA kernels are designed for generality (supporting many quantization types) rather than being specialized for a single format.
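A useful mental model for these numbers: single-stream generation is memory-bandwidth-bound, because each new token requires streaming roughly the entire weight set from VRAM once. Dividing memory bandwidth by model size therefore gives a hard ceiling on tok/s, and how close an engine gets to that ceiling is largely a function of kernel quality, which is exactly where EXL2 and AWQ pull ahead. A sketch using the RTX 4090's published ~1008 GB/s bandwidth:

```python
def decode_tokps_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream generation speed: one full read of
    the weights per generated token (ignores KV cache reads and compute)."""
    return bandwidth_gb_s / model_size_gb

RTX_4090_BW = 1008  # GB/s, the card's published memory bandwidth spec
for name, size in [("GGUF Q4_K_M", 4.9), ("EXL2 4.0 bpw", 4.2)]:
    print(f"{name}: <= {decode_tokps_ceiling(RTX_4090_BW, size):.0f} tok/s")
```

The measured ~110-150 tok/s figures in the benchmark table sit at roughly 50-65% of these ceilings, and the gap between formats is kernel efficiency, not quantization quality.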
On CPU
Only GGUF supports efficient CPU inference. llama.cpp’s CPU kernels use SIMD instructions (AVX2, AVX-512 on x86; NEON on ARM) for fast dequantization and matrix multiplication. No other format has comparable CPU support.
On Apple Silicon
Only GGUF supports Apple Silicon inference through Metal compute shaders. MLX uses its own format. GPTQ, AWQ, and EXL2 do not work on Apple Silicon.
VRAM Usage
VRAM usage is primarily determined by model size after quantization, plus overhead for the KV cache and engine runtime.
At 4-bit quantization, all formats produce similarly sized models (within 10-15% of each other for the same model). The differences come from:
- Group size: Smaller groups (32 vs 128) add overhead for per-group parameters
- Mixed precision: EXL2 and GGUF K-quants use varying precision per layer, so VRAM depends on the specific bit allocation
- Engine overhead: ExLlamaV2 has minimal overhead (~200 MB). vLLM has more overhead (~500-800 MB) due to PagedAttention data structures. llama.cpp overhead varies by configuration.
For VRAM-constrained scenarios (fitting the largest possible model in limited VRAM), EXL2 at a low bpw (3.0-3.5) or GGUF Q3_K types offer the most aggressive compression while maintaining usable quality.
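The KV-cache component of that overhead can be estimated directly from the model architecture. A sketch using Llama 3.1 8B's published dimensions (32 layers, 8 KV heads under GQA, head dim 128) with an fp16 cache:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per position in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128; fp16 cache.
print(f"{kv_cache_gib(32, 8, 128, 8192):.2f} GiB at 8k context")
print(f"{kv_cache_gib(32, 8, 128, 131072):.1f} GiB at the full 128k context")
```

At long contexts the cache, not the quantized weights, can dominate VRAM, which is why engines offer KV-cache quantization as a separate knob from weight quantization.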
Tooling Support
| Tool | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Ollama | Yes | No | No | No |
| LM Studio | Yes | No | No | No |
| llama.cpp | Yes | No | No | No |
| vLLM | Experimental | Yes | Yes | No |
| ExLlamaV2 | No | Yes | No | Yes |
| Text Generation Inference | No | Yes | Yes | No |
| Hugging Face Transformers | Partial (dequantized on load) | Yes | Yes | No |
| Jan | Yes | No | No | No |
| GPT4All | Yes | No | No | No |
| Kobold.cpp | Yes | No | No | No |
| TabbyAPI | No | Yes | No | Yes |
| Mullama | Yes | No | No | No |
GGUF dominates the consumer/desktop tooling ecosystem. GPTQ and AWQ dominate the GPU server ecosystem. EXL2 is supported by a smaller but dedicated set of tools.
CPU Compatibility
This dimension is decisive for many users.
GGUF is the only format with production-quality CPU inference. llama.cpp’s CPU backend supports:
- x86-64: AVX2 (most modern CPUs), AVX-512 (server CPUs, recent Intel desktop)
- ARM: NEON (Apple Silicon, Raspberry Pi 4/5, ARM servers)
- Hybrid CPU+GPU: offload specific layers to GPU while running others on CPU
This makes GGUF the only option for users without a dedicated GPU, users with limited VRAM who need CPU+GPU hybrid inference, and Apple Silicon Macs.
GPTQ, AWQ, EXL2 are GPU-only formats. They rely on GPU-specific dequantization kernels and do not have optimized CPU implementations. While some libraries can technically run them on CPU, performance is impractical for interactive use.
When to Choose Each Format
Choose GGUF if you:
- Use Ollama, LM Studio, llama.cpp, or any desktop AI tool
- Need CPU inference or CPU+GPU hybrid inference
- Run on Apple Silicon
- Want the widest range of quantization options
- Need a single self-contained model file
- Prioritize tooling compatibility over raw GPU speed
Choose GPTQ if you:
- Deploy on NVIDIA GPU servers using vLLM or TGI
- Want a well-established, widely-supported GPU format
- Need 4-bit or 8-bit quantization with good quality
- Are already in the Hugging Face Transformers ecosystem
Choose AWQ if you:
- Deploy on NVIDIA GPUs and want the best speed among standard formats
- Use vLLM or TGI for serving
- Want efficient GPU kernels (Marlin)
- Prioritize inference speed alongside quality
Choose EXL2 if you:
- Have an NVIDIA consumer GPU and want maximum speed
- Want fine-grained control over bits-per-weight
- Use ExLlamaV2 or TabbyAPI
- Need the best quality-per-bit through mixed-precision allocation
The Bottom Line
The quantization format landscape is divided by hardware. If you run inference on CPU, on Apple Silicon, or need maximum tool compatibility, GGUF is the only practical choice. If you run inference on NVIDIA GPUs, all four formats work, and the choice comes down to your inference engine (Ollama requires GGUF, vLLM favors AWQ/GPTQ, ExLlamaV2 favors EXL2, though it also loads GPTQ). Quality differences between formats at the same bit width are small enough that tool compatibility and inference speed, not quality, should drive your decision.