Model quantization is what makes running large language models on consumer hardware possible, and the format you choose affects quality, speed, memory usage, and which tools you can use. GGUF, GPTQ, AWQ, and EXL2 are the four dominant quantization formats in the local LLM ecosystem as of 2026, each optimized for different hardware and use cases. This comparison provides the data-driven analysis you need to pick the right format for your models, hardware, and inference stack.
Quick Comparison
| Feature | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Full name | GPT-Generated Unified Format | Generative Pre-trained Transformer Quantization | Activation-aware Weight Quantization | ExLlamaV2 format |
| Developer | Georgi Gerganov (llama.cpp) | IST Austria + community | MIT Han Lab | turboderp |
| Quantization type | Post-training (weight-only) | Post-training (weight-only) | Post-training (weight-only) | Post-training (mixed precision) |
| Bit widths | 2, 3, 4, 5, 6, 8-bit + IQ | 2, 3, 4, 8-bit | 4-bit (primarily) | Flexible (2-8 bit, per-layer) |
| CPU inference | Excellent (SIMD optimized) | No (GPU only) | No (GPU only) | No (GPU only) |
| NVIDIA GPU | Yes (CUDA offloading) | Yes (CUDA kernels) | Yes (CUDA kernels) | Yes (custom CUDA kernels) |
| AMD GPU | Yes (ROCm, Vulkan) | Limited | Limited | No |
| Apple Silicon | Yes (Metal) | No | No | No |
| Hybrid CPU+GPU | Yes (layer offloading) | No | No | No |
| Primary engine | llama.cpp, Ollama | AutoGPTQ, vLLM, TGI | AutoAWQ, vLLM, TGI | ExLlamaV2 |
| Calibration data | Optional (imatrix) | Required | Required | Required |
| Self-contained | Yes (weights + metadata) | Separate config needed | Separate config needed | Separate config needed |
| File structure | Single .gguf file | Model directory (safetensors) | Model directory (safetensors) | Model directory |
| HF availability | Thousands of models | Thousands of models | Thousands of models | Hundreds of models |
Quality / Size / Speed Comparison
The following table compares Llama 3.1 8B across formats at approximately 4-bit quantization, measured on an NVIDIA RTX 4090. Perplexity is measured on WikiText-2 (lower is better). FP16 baseline perplexity for this model is ~6.2.
| Format & Quant | Size (GB) | Perplexity | Generation (tok/s) | Prompt (tok/s) | VRAM (GB) |
|---|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 6.20 | ~75 | ~2,800 | 16.5 |
| GGUF Q4_K_M | 4.9 | 6.38 | ~110 | ~3,200 | 5.8 |
| GGUF Q4_K_S | 4.6 | 6.42 | ~115 | ~3,300 | 5.5 |
| GGUF IQ4_XS | 4.3 | 6.35 | ~105 | ~3,000 | 5.2 |
| GPTQ 4-bit 128g | 4.5 | 6.40 | ~120 | ~3,500 | 5.5 |
| GPTQ 4-bit 32g | 5.0 | 6.32 | ~105 | ~3,200 | 6.0 |
| AWQ 4-bit 128g | 4.5 | 6.36 | ~125 | ~3,800 | 5.4 |
| EXL2 4.0 bpw | 4.2 | 6.33 | ~150 | ~4,000 | 5.3 |
| EXL2 3.5 bpw | 3.7 | 6.55 | ~160 | ~4,200 | 4.8 |
| EXL2 5.0 bpw | 5.2 | 6.26 | ~135 | ~3,600 | 6.2 |
| GGUF Q5_K_M | 5.7 | 6.28 | ~100 | ~2,900 | 6.6 |
| GGUF Q6_K | 6.6 | 6.23 | ~90 | ~2,600 | 7.4 |
| GGUF Q8_0 | 8.5 | 6.21 | ~80 | ~2,200 | 9.3 |
Key observations:
- EXL2 leads in speed on NVIDIA GPUs due to turboderp’s custom CUDA kernels optimized specifically for mixed-precision inference
- AWQ is the fastest among standard formats because its activation-aware approach enables more efficient GPU kernels
- GGUF offers the most size/quality options ranging from Q2 (aggressive compression) through Q8 (near-lossless)
- Quality is remarkably close across all formats at 4-bit — the perplexity difference is typically under 0.1, which is imperceptible in most practical use
- GGUF IQ quantizations (imatrix-based) punch above their weight, achieving quality comparable to GPTQ/AWQ at similar sizes
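For reference, the perplexity metric used in the table is just the exponential of the average per-token negative log-likelihood on the evaluation text. A minimal sketch of the computation (the probabilities below are made-up illustrative numbers, not outputs of any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood over the token stream)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to five reference tokens.
print(round(perplexity([0.42, 0.10, 0.73, 0.05, 0.61]), 2))
```

A model that assigned every token probability 0.5 would score a perplexity of exactly 2.0, which is why small perplexity deltas like the 6.38 vs 6.20 above correspond to very small average probability differences.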
Quality Retention Deep Dive
GGUF
GGUF offers the widest range of quantization levels, from extremely aggressive Q2_K (2-bit) to near-lossless Q8_0 (8-bit). The K-quant variants (Q4_K_M, Q5_K_M, etc.) use mixed precision within the model — more important layers get higher precision, less important layers get lower precision.
The IQ (importance-matrix) quantizations represent the state of the art in GGUF quality. By measuring the importance of each weight using calibration data, imatrix quantizations allocate bits more intelligently. IQ4_XS achieves quality comparable to standard Q4_K_M while being 10-15% smaller.
GGUF’s quality advantage is flexibility: you can choose exactly the quality-size tradeoff you need, from aggressive 2-bit compression for memory-constrained systems to 8-bit for maximum quality.
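The principle behind importance-weighted quantization can be shown in a toy sketch. This is not llama.cpp's actual imatrix algorithm, just an illustration of the idea: when picking a group's quantization scale, minimize reconstruction error weighted by per-weight importance, so the weights that matter most are reproduced most faithfully:

```python
import numpy as np

def quantize_group(w, scale, bits=4):
    """Symmetric round-to-nearest quantization, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def weighted_error(w, imp, scale):
    """Reconstruction error with each weight's error scaled by its importance."""
    return float(np.sum(imp * (w - quantize_group(w, scale)) ** 2))

def best_scale(w, imp, bits=4, n_candidates=100):
    """Grid-search the group scale that minimizes importance-weighted error."""
    absmax = np.abs(w).max() / (2 ** (bits - 1) - 1)
    candidates = np.append(absmax * np.linspace(0.5, 1.2, n_candidates), absmax)
    return min(candidates, key=lambda s: weighted_error(w, imp, s))

rng = np.random.default_rng(0)
w = rng.normal(size=128)                # one quantization group of weights
imp = rng.uniform(0.1, 10.0, size=128)  # hypothetical per-weight importances

naive = np.abs(w).max() / 7             # plain absmax scale for 4-bit
tuned = best_scale(w, imp)
print(weighted_error(w, imp, tuned) <= weighted_error(w, imp, naive))  # True
```

Because the plain absmax scale is included among the candidates, the importance-tuned scale is never worse under the weighted metric; in practice it is usually strictly better for the high-importance weights.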
GPTQ
GPTQ quantizes weights by solving a layer-wise quantization optimization problem using second-order information (the Hessian matrix). The group size parameter controls the granularity of quantization — smaller groups (32) preserve more quality at a slight size cost, larger groups (128) are more compact.
GPTQ produces consistent, reliable quantizations. The format is well-understood and widely supported. Quality at 4-bit with group size 128 is comparable to GGUF Q4_K_M.
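The size cost of smaller groups is easy to quantify: each group carries its own scale (and typically a zero-point), so per-weight overhead grows as groups shrink. A back-of-envelope sketch, assuming an fp16 scale and a packed 4-bit zero-point per group (typical choices, used here as an assumption rather than a fixed part of the format):

```python
def effective_bpw(bits, group_size, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group metadata: one scale and one
    zero-point amortized over every weight in the group."""
    return bits + (scale_bits + zero_bits) / group_size

for g in (32, 64, 128):
    print(f"group size {g:>3}: {effective_bpw(4, g):.3f} effective bpw")
```

This is consistent with the benchmark table above, where 4-bit at group size 32 comes out roughly 10% larger on disk than at group size 128.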
AWQ
AWQ (Activation-aware Weight Quantization) identifies weights that are important based on activation patterns in calibration data, and protects those weights during quantization. This activation-aware approach achieves slightly better quality than GPTQ at the same bit width in many benchmarks.
AWQ’s primary practical advantage is speed: its activation-aware approach produces weight layouts that GPU kernels can dequantize especially efficiently, which is what underpins its throughput edge over standard GPTQ kernels.
EXL2
EXL2’s per-layer mixed-precision approach is the most flexible. During quantization, EXL2 allocates bits per layer based on measured sensitivity — layers that affect output quality most get more bits, while less sensitive layers are compressed more aggressively. You specify a target bits-per-weight (bpw), and EXL2 distributes bits optimally across layers.
This flexibility means EXL2 can target any bpw between ~2.0 and ~8.0, and the quality at any given average bpw tends to match or exceed other formats because the bit allocation is globally optimized rather than uniform.
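The size implied by a bpw target is simple arithmetic: parameter count times bpw over eight, plus some allowance for parts kept at higher precision. A sketch (the 5% overhead factor is an assumption for illustration, not an EXL2 constant):

```python
def model_size_gb(n_params_billion, bpw, overhead=1.05):
    """Approximate quantized model size: params * bpw / 8 bytes, plus a
    rough 5% allowance (assumed) for embeddings and metadata that stay
    at higher precision."""
    return n_params_billion * 1e9 * bpw / 8 * overhead / 1e9

for bpw in (3.5, 4.0, 5.0):
    print(f"{bpw} bpw -> {model_size_gb(8.0, bpw):.1f} GB")
```

For an 8B model this lands close to the EXL2 row sizes in the benchmark table above, which is a quick sanity check when deciding what bpw will fit your VRAM.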
Speed Analysis
On NVIDIA GPU
EXL2 is fastest on NVIDIA GPUs because ExLlamaV2 uses custom CUDA kernels that are purpose-built for mixed-precision dequantization and matrix multiplication. These kernels are optimized for the RTX 3000/4000/5000 series consumer GPUs.
AWQ is second-fastest because its quantization approach produces weight layouts that are efficiently dequantized by GPU kernels. The Marlin and GEMM kernels used for AWQ inference in vLLM are highly optimized.
GPTQ speed depends on the kernel implementation. With the fast Marlin kernel (available in vLLM and newer AutoGPTQ versions), GPTQ matches AWQ speeds. With older kernels, it is 10-20% slower.
GGUF on NVIDIA GPU is competitive but typically 10-30% slower than EXL2/AWQ because llama.cpp’s CUDA kernels are designed for generality (supporting many quantization types) rather than being specialized for a single format.
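A useful mental model for these numbers: single-stream generation is memory-bandwidth-bound, because each new token requires streaming roughly the entire weight set from VRAM once. Dividing memory bandwidth by model size therefore gives a hard ceiling on tok/s, and how close an engine gets to that ceiling is largely a function of kernel quality, which is exactly where EXL2 and AWQ pull ahead. A sketch using the RTX 4090's published ~1008 GB/s bandwidth:

```python
def decode_tokps_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream generation speed: one full read of
    the weights per generated token (ignores KV cache reads and compute)."""
    return bandwidth_gb_s / model_size_gb

RTX_4090_BW = 1008  # GB/s, the card's published memory bandwidth spec
for name, size in [("GGUF Q4_K_M", 4.9), ("EXL2 4.0 bpw", 4.2)]:
    print(f"{name}: <= {decode_tokps_ceiling(RTX_4090_BW, size):.0f} tok/s")
```

The measured ~110-150 tok/s figures in the benchmark table sit at roughly 50-65% of these ceilings, and the gap between formats is kernel efficiency, not quantization quality.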
On CPU
Only GGUF supports efficient CPU inference. llama.cpp’s CPU kernels use SIMD instructions (AVX2, AVX-512 on x86; NEON on ARM) for fast dequantization and matrix multiplication. No other format has comparable CPU support.
On Apple Silicon
Only GGUF supports Apple Silicon inference through Metal compute shaders. MLX uses its own format. GPTQ, AWQ, and EXL2 do not work on Apple Silicon.
VRAM Usage
VRAM usage is primarily determined by model size after quantization, plus overhead for the KV cache and engine runtime.
At 4-bit quantization, all formats produce similarly sized models (within 10-15% of each other for the same model). The differences come from:
- Group size: Smaller groups (32 vs 128) add overhead for per-group parameters
- Mixed precision: EXL2 and GGUF K-quants use varying precision per layer, so VRAM depends on the specific bit allocation
- Engine overhead: ExLlamaV2 has minimal overhead (~200 MB). vLLM has more overhead (~500-800 MB) due to PagedAttention data structures. llama.cpp overhead varies by configuration.
For VRAM-constrained scenarios (fitting the largest possible model in limited VRAM), EXL2 at a low bpw (3.0-3.5) or GGUF Q3_K types offer the most aggressive compression while maintaining usable quality.
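The KV-cache component of that overhead can be estimated directly from the model architecture. A sketch using Llama 3.1 8B's published dimensions (32 layers, 8 KV heads under GQA, head dim 128) with an fp16 cache:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per position in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128; fp16 cache.
print(f"{kv_cache_gib(32, 8, 128, 8192):.2f} GiB at 8k context")
print(f"{kv_cache_gib(32, 8, 128, 131072):.1f} GiB at the full 128k context")
```

At long contexts the cache, not the quantized weights, can dominate VRAM, which is why engines offer KV-cache quantization as a separate knob from weight quantization.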
Tooling Support
| Tool | GGUF | GPTQ | AWQ | EXL2 |
|---|---|---|---|---|
| Ollama | Yes | No | No | No |
| LM Studio | Yes | No | No | No |
| llama.cpp | Yes | No | No | No |
| vLLM | Experimental | Yes | Yes | No |
| ExLlamaV2 | No | Yes | No | Yes |
| Text Generation Inference | No | Yes | Yes | No |
| Hugging Face Transformers | Partial (dequantized on load) | Yes | Yes | No |
| Jan | Yes | No | No | No |
| GPT4All | Yes | No | No | No |
| Kobold.cpp | Yes | No | No | No |
| TabbyAPI | No | Yes | No | Yes |
| Mullama | Yes | No | No | No |
GGUF dominates the consumer/desktop tooling ecosystem. GPTQ and AWQ dominate the GPU server ecosystem. EXL2 is supported by a smaller but dedicated set of tools.
CPU Compatibility
This dimension is decisive for many users.
GGUF is the only format with production-quality CPU inference. llama.cpp’s CPU backend supports:
- x86-64: AVX2 (most modern CPUs), AVX-512 (server CPUs, recent Intel desktop)
- ARM: NEON (Apple Silicon, Raspberry Pi 4/5, ARM servers)
- Hybrid CPU+GPU: offload specific layers to GPU while running others on CPU
This makes GGUF the only option for users without a dedicated GPU, users with limited VRAM who need CPU+GPU hybrid inference, and Apple Silicon Macs.
GPTQ, AWQ, EXL2 are GPU-only formats. They rely on GPU-specific dequantization kernels and do not have optimized CPU implementations. While some libraries can technically run them on CPU, performance is impractical for interactive use.
When to Choose Each Format
Choose GGUF if you:
- Use Ollama, LM Studio, llama.cpp, or any desktop AI tool
- Need CPU inference or CPU+GPU hybrid inference
- Run on Apple Silicon
- Want the widest range of quantization options
- Need a single self-contained model file
- Prioritize tooling compatibility over raw GPU speed
Choose GPTQ if you:
- Deploy on NVIDIA GPU servers using vLLM or TGI
- Want a well-established, widely-supported GPU format
- Need 4-bit or 8-bit quantization with good quality
- Are already in the Hugging Face Transformers ecosystem
Choose AWQ if you:
- Deploy on NVIDIA GPUs and want the best speed among standard formats
- Use vLLM or TGI for serving
- Want efficient GPU kernels (Marlin)
- Prioritize inference speed alongside quality
Choose EXL2 if you:
- Have an NVIDIA consumer GPU and want maximum speed
- Want fine-grained control over bits-per-weight
- Use ExLlamaV2 or TabbyAPI
- Need the best quality-per-bit through mixed-precision allocation
The Bottom Line
The quantization format landscape is divided by hardware. If you run inference on CPU, on Apple Silicon, or need maximum tool compatibility, GGUF is the only practical choice. If you run inference on NVIDIA GPUs, all four formats work, and the choice comes down to your inference engine (Ollama requires GGUF, vLLM favors AWQ/GPTQ, ExLlamaV2 favors EXL2, though it also loads GPTQ). Quality differences between formats at the same bit width are small enough that tool compatibility and inference speed, not quality, should drive your decision.