Running a local LLM is fundamentally a memory problem. The model weights need to fit in memory, and faster memory means faster inference. This guide breaks down exactly what hardware you need based on your budget, which GPUs to buy (new and used), how to maximize CPU-only performance, and why Apple Silicon has become a favorite for local AI enthusiasts.
The One Rule: VRAM Is King
The single most important specification for local AI is memory — specifically, fast memory that can feed the model quickly.
Here’s the hierarchy from fastest to slowest:
- GPU VRAM (HBM): Found in datacenter GPUs like A100/H100. Extremely fast, extremely expensive.
- GPU VRAM (GDDR6/GDDR6X): Found in consumer NVIDIA GPUs. Fast and relatively affordable.
- Apple Unified Memory: Shared between CPU and GPU. Good bandwidth, very large capacity.
- System RAM (DDR5): CPU-only inference. Usable for smaller models.
- System RAM (DDR4): Slower CPU inference. Works but noticeably slower.
- NVMe SSD: Too slow for real-time inference. Used only for storage.
The practical implication: get as much fast memory as your budget allows.
VRAM Requirements by Model Size
This table shows how much memory you need for different model sizes at the most common quantization levels:
| Model Size | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|
| 1B | 0.7 GB | 0.8 GB | 0.9 GB | 1.1 GB | 2.0 GB |
| 3B | 1.9 GB | 2.2 GB | 2.5 GB | 3.2 GB | 6.0 GB |
| 7B | 4.4 GB | 5.0 GB | 5.5 GB | 7.5 GB | 14 GB |
| 8B | 4.9 GB | 5.5 GB | 6.2 GB | 8.5 GB | 16 GB |
| 13B | 7.4 GB | 8.4 GB | 9.3 GB | 13 GB | 26 GB |
| 14B | 8.2 GB | 9.2 GB | 10 GB | 14 GB | 28 GB |
| 32B | 18 GB | 21 GB | 23 GB | 32 GB | 64 GB |
| 34B | 19 GB | 22 GB | 25 GB | 34 GB | 68 GB |
| 70B | 38 GB | 43 GB | 48 GB | 70 GB | 140 GB |
Important: Add 1-2 GB to these numbers for KV cache (context processing). Longer conversations need more KV cache memory.
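The table follows a simple rule of thumb: bytes ≈ parameters × bits-per-weight ÷ 8. A rough sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation, and real files vary by a few percent):

```shell
# Rough model memory: params (billions) x bits-per-weight / 8.
# Bits are passed in tenths so we can stay in integer math
# (Q4_K_M is ~4.5 bits/weight -> pass 45).
estimate_gb() {
  params_b=$1
  bits_tenths=$2
  gb_x100=$(( params_b * bits_tenths * 10 / 8 ))
  printf '%d.%02d GB\n' $(( gb_x100 / 100 )) $(( gb_x100 % 100 ))
}

estimate_gb 8 45    # 8B at Q4_K_M  -> ~4.50 GB (table: 4.9 GB)
estimate_gb 70 45   # 70B at Q4_K_M -> ~39.37 GB (table: 38 GB)
```

The small gap versus the table comes from per-file overhead (embeddings and some layers are kept at higher precision), plus the KV cache noted above.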
NVIDIA GPUs: The Gold Standard
NVIDIA dominates local AI because of CUDA, its proprietary compute platform. Every major inference engine supports CUDA out of the box.
Consumer GPUs Ranked for Local AI
| GPU | VRAM | Memory BW | Max Model (Q4) | New Price | Used Price | Value Rating |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB | 1008 GB/s | 32B | $1,600 | $1,200 | Good |
| RTX 4080 Super | 16 GB | 736 GB/s | 13B | $1,000 | $800 | OK |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 13B | $800 | $600 | OK |
| RTX 3090 | 24 GB | 936 GB/s | 32B | Disc. | $700 | Excellent |
| RTX 3090 Ti | 24 GB | 1008 GB/s | 32B | Disc. | $750 | Excellent |
| RTX 3060 12 GB | 12 GB | 360 GB/s | 7-8B | $250 | $150 | Best Budget |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 13B | $500 | $400 | OK |
| RTX 3080 10 GB | 10 GB | 760 GB/s | 7B | Disc. | $250 | Good |
| RTX A6000 | 48 GB | 768 GB/s | 70B | $3,500 | $1,800 | Best Pro |
| Tesla P40 | 24 GB | 346 GB/s | 32B | - | $150 | Ultra Budget |
Best Picks by Budget
Under $200: RTX 3060 12 GB (used) The best value in local AI. 12 GB VRAM runs any 7-8B model comfortably and can squeeze in some 13B models at Q3. Available used for $130-180.
```bash
# Verify your GPU is detected
nvidia-smi

# Check VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```
Under $400: RTX 3080 10 GB or RTX 3060 12 GB (new equivalent) The 3080 is faster but has less VRAM (10 GB vs 12 GB). For AI, the 3060’s extra 2 GB of VRAM often matters more than raw speed.
Under $800: RTX 3090 (used) The sweet spot for serious local AI. 24 GB VRAM runs 32B models easily. Used prices have dropped significantly.
Under $1,500: RTX 4090 The best consumer GPU for local AI. 24 GB VRAM with the fastest consumer memory bandwidth. If you’re building a new system, this is the card to get.
Enterprise/Pro: RTX A6000 or dual GPUs 48 GB VRAM runs 70B models. Used A6000s can be found for $1,500-2,000.
The Used GPU Market
Buying used GPUs is the best way to maximize VRAM per dollar for local AI. Here’s what to look for:
Best used buys:
- RTX 3060 12 GB ($130-180): Best entry point
- RTX 3090 ($650-800): Best mid-range
- Tesla P40 ($100-170): 24 GB VRAM on the extreme cheap, but no video output and needs a blower cooler
- RTX A6000 ($1,500-2,000): Best high-end
Where to buy:
- eBay (check seller ratings, buy with buyer protection)
- r/hardwareswap
- Local listings (Facebook Marketplace, Craigslist)
- Refurbished from NVIDIA certified sellers
What to verify:
- Run `nvidia-smi` immediately to confirm the VRAM amount
- Run a stress test (FurMark or similar) for 30 minutes
- Check that all VRAM is accessible (no bad memory)
- Inspect for physical damage, especially the PCIe connector
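A small helper for the VRAM check: compare the card's advertised capacity against what the driver reports. On a live system you'd feed it the output of `nvidia-smi`; the 3% slack for driver-reserved memory is an assumption.

```shell
# Compare reported VRAM (MiB) against the advertised amount, allowing
# ~3% slack for driver-reserved memory.
check_vram() {
  reported_mib=$1; expected_mib=$2
  min_mib=$(( expected_mib * 97 / 100 ))
  if [ "$reported_mib" -ge "$min_mib" ]; then
    echo "OK: ${reported_mib} MiB"
  else
    echo "SHORT: ${reported_mib} MiB (expected ~${expected_mib})"
  fi
}

# Live usage (requires an NVIDIA driver):
#   check_vram "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)" 12288
check_vram 12288 12288   # a healthy RTX 3060 12 GB
check_vram 10240 12288   # a card that is not what the listing claimed
```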
The Tesla P40: Budget Secret
The NVIDIA Tesla P40 deserves special mention. It’s a datacenter GPU from 2016 with 24 GB VRAM that sells used for $100-170. The catches:
- No video output (it’s a compute-only card)
- No hardware video decode
- Uses a passive cooler designed for server airflow (you need to add a fan or aftermarket cooler)
- Very slow FP16 (the P40's FP16 throughput is a small fraction of its FP32 rate, but quantized models don't depend on it)
- PCIe Gen 3 x16
- Loud if using a blower fan
For pure LLM inference with quantized models, the P40 is remarkably capable. It won’t win speed contests, but it runs 32B Q4 models that a $500 consumer GPU can’t fit.
```bash
# P40 setup on Linux
sudo apt install nvidia-driver-535

# Reboot, then verify
nvidia-smi
```
AMD GPUs
AMD GPUs work for local AI through ROCm (Radeon Open Compute), but support is less mature than NVIDIA’s CUDA.
Current State of AMD Support
| Feature | Status |
|---|---|
| Ollama | Supported on Linux (ROCm) |
| llama.cpp | Supported via hipBLAS |
| LM Studio | Limited support |
| vLLM | Supported for select GPUs |
| PyTorch | Supported via ROCm |
| Windows support | Experimental/limited |
Supported AMD GPUs
Well supported (ROCm officially supported):
- RX 7900 XTX (24 GB) — best AMD option for local AI
- RX 7900 XT (20 GB)
- RX 7800 XT (16 GB)
- Radeon PRO W7900 (48 GB)
Community supported (may need workarounds):
- RX 6900 XT (16 GB)
- RX 6800 XT (16 GB)
- RX 6700 XT (12 GB)
AMD Setup
```bash
# Install ROCm on Ubuntu
sudo apt install rocm-hip-libraries rocm-hip-sdk

# Set environment for unsupported GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # For RDNA 3

# Install Ollama (auto-detects ROCm)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection
ollama run llama3.2 --verbose
```
Bottom line: If you’re buying new hardware specifically for AI, NVIDIA is the safer choice. If you already have an AMD GPU, it can work well on Linux with some setup effort.
Apple Silicon: The Unified Memory Advantage
Apple’s M-series chips have a unique advantage for local AI: unified memory. The GPU and CPU share the same pool of RAM, which means a MacBook with 64 GB of unified memory can load a 70B Q4 model that would require two high-end GPUs (or a 48 GB workstation card) on a PC.
Performance by Chip
| Chip | Max Memory | Memory BW | 7B Q4 (t/s) | 13B Q4 (t/s) | 70B Q4 (t/s) |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | N/A |
| M1 Pro | 32 GB | 200 GB/s | 25-35 | 15-20 | N/A |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 25-30 | 10-15 |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 35-40 | 20-25 |
| M2 | 24 GB | 100 GB/s | 18-25 | 10-14 | N/A |
| M2 Pro | 32 GB | 200 GB/s | 28-38 | 18-22 | N/A |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 28-33 | 12-18 |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 38-45 | 22-28 |
| M3 | 24 GB | 100 GB/s | 20-28 | 12-16 | N/A |
| M3 Pro | 36 GB | 150 GB/s | 30-40 | 20-25 | N/A |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 30-35 | 14-20 |
| M4 | 32 GB | 120 GB/s | 25-32 | 14-18 | N/A |
| M4 Pro | 48 GB | 273 GB/s | 35-45 | 25-30 | N/A |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 38-45 | 20-28 |
t/s = tokens per second for generation. Actual speeds vary by model and context length.
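These numbers track memory bandwidth closely: generation is memory-bound, so a loose upper bound is bandwidth divided by the bytes read per token, which is roughly the model size. A sketch with example figures from the tables above:

```shell
# tokens/s ceiling ~= memory bandwidth / model size
# (every generated token reads essentially all the weights once)
ceiling_tps() {
  bw_gbs=$1        # memory bandwidth in GB/s
  model_gb_x10=$2  # model size in tenths of a GB (4.9 GB -> 49)
  echo $(( bw_gbs * 10 / model_gb_x10 ))
}

ceiling_tps 400 49   # M1 Max, 8B Q4 (~4.9 GB): ~81 t/s ceiling
ceiling_tps 68 49    # M1, same model: ~13 t/s ceiling
```

Real-world speeds land well below the ceiling (the table shows 35-45 t/s for the M1 Max) because compute, KV-cache reads, and framework overhead all take their share.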
Recommended Models by Apple Silicon Config
| Config | Best Models |
|---|---|
| M1/M2/M3/M4 8 GB | Phi-3 Mini 3.8B, Llama 3.2 1B |
| M1/M2/M3/M4 16 GB | Llama 3.1 8B, Qwen 2.5 7B |
| M-Pro 18-36 GB | Qwen 2.5 14B, Llama 3.1 8B Q8 |
| M-Max 32-64 GB | Qwen 2.5 32B, Llama 3.1 8B FP16 |
| M-Max/Ultra 64-128 GB | Llama 3.1 70B Q4, Qwen 2.5 72B Q4 |
| M-Ultra 128-192 GB | Llama 3.1 70B Q8, multiple models |
Apple Silicon Optimization
For best performance on Apple Silicon, use MLX or Ollama (which uses Metal acceleration by default):
```bash
# Ollama uses Metal automatically
ollama run llama3.1:8b

# For MLX (Apple's ML framework)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Hello, world"

# Check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000
```
Mac Buying Guide for Local AI
| Budget | Recommendation | Max Model |
|---|---|---|
| $800-1000 | Used M1 Max MacBook Pro 32 GB | 14B |
| $1200-1500 | Used M2 Max MacBook Pro 64 GB | 32B |
| $1800-2500 | Used M1/M2 Ultra Mac Studio 128 GB | 70B |
| $2500-3500 | New M4 Max MacBook Pro 128 GB | 70B |
| $4000+ | New M3 Ultra Mac Studio 256 GB+ | 70B Q8 or 100B+ |
CPU-Only Inference
If you don’t have a dedicated GPU and aren’t on Apple Silicon, you can still run local LLMs using your CPU.
What Makes a Good CPU for LLM Inference?
1. AVX2/AVX-512 Support: SIMD instructions that accelerate matrix math. Most CPUs from 2015+ have AVX2. AVX-512 (found in some Intel and AMD Zen 4 chips) gives a further boost.
2. Core Count: More cores help with prompt processing (the “prefill” phase). For generation, single-thread performance matters more.
3. RAM Speed: Faster RAM means faster token generation. DDR5 is ~50% faster than DDR4 for inference.
4. RAM Amount: Determines the maximum model size you can load.
5. Cache Size: Large L3 caches (32 MB+) help with inference performance.
CPU Performance Expectations
| CPU | RAM | 7B Q4 Speed | 13B Q4 Speed |
|---|---|---|---|
| Intel i5-12400 | 32 GB DDR5 | 8-12 t/s | 4-7 t/s |
| Intel i7-13700K | 64 GB DDR5 | 12-18 t/s | 7-10 t/s |
| AMD Ryzen 7 7700X | 32 GB DDR5 | 10-15 t/s | 6-9 t/s |
| AMD Ryzen 9 7950X | 64 GB DDR5 | 15-22 t/s | 9-13 t/s |
| Intel i5-10400 | 32 GB DDR4 | 5-8 t/s | 3-5 t/s |
| AMD Ryzen 5 5600X | 32 GB DDR4 | 7-10 t/s | 4-6 t/s |
Optimizing CPU Inference
```bash
# Check AVX support
lscpu | grep -i avx

# For llama.cpp, build with optimal CPU flags
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Set thread count to your physical core count (not logical/SMT threads).
# Ollama does this automatically; for llama.cpp on an 8-core CPU:
./llama-cli -m model.gguf -t 8 -ngl 0

# Monitor CPU usage during inference
htop
```
RAM configuration tips:
- Use dual-channel RAM (two sticks, not one). This roughly doubles memory bandwidth.
- For DDR5, transfer rate matters: DDR5-6000 gives roughly 15% better inference speed than DDR5-4800.
- Fill all memory channels if possible.
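These tips follow from the peak-bandwidth arithmetic: channels × 8 bytes per transfer × transfer rate (MT/s). A quick check of why the second stick matters:

```shell
# Peak DRAM bandwidth (GB/s) = channels x 8 bytes/transfer x MT/s / 1000
peak_bw_gbs() {
  channels=$1
  mts=$2
  echo $(( channels * 8 * mts / 1000 ))
}

peak_bw_gbs 1 6000   # single-channel DDR5-6000: 48 GB/s
peak_bw_gbs 2 6000   # dual-channel DDR5-6000:   96 GB/s
peak_bw_gbs 2 3200   # dual-channel DDR4-3200:   51 GB/s
```

Since token generation is bandwidth-bound, dropping to a single stick roughly halves generation speed; this is also why DDR4 systems lag DDR5 in the table above.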
Intel Arc GPUs
Intel’s Arc GPUs (A770, A750) support local AI through SYCL/oneAPI and increasingly through community efforts.
| GPU | VRAM | Status |
|---|---|---|
| Arc A770 16 GB | 16 GB | Experimental support in llama.cpp, Ollama |
| Arc A750 8 GB | 8 GB | Experimental support |
Support is improving but still behind NVIDIA and AMD. Not recommended as a primary purchase for AI use, but if you already have one, it’s worth trying.
```bash
# llama.cpp with SYCL (Intel GPU) support
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release -j$(nproc)
```
Multi-GPU Setups
If one GPU isn’t enough VRAM, you can split a model across multiple GPUs.
How It Works
The model’s layers are divided across GPUs. During inference, data flows from one GPU to the next through the layers. This is called pipeline parallelism (or layer splitting).
Requirements
- Same GPU vendor (e.g., two NVIDIA cards). Mixing NVIDIA and AMD in one inference run generally doesn't work.
- Sufficient PCIe lanes (x8 per GPU minimum, x16 preferred)
- A motherboard with multiple PCIe x16 slots
- Adequate power supply (two GPUs = 2x power draw)
- Good case airflow (two GPUs generate significant heat)
Multi-GPU with Ollama
```bash
# On Linux, Ollama detects multiple NVIDIA GPUs automatically.
# Restrict which GPUs are visible:
export CUDA_VISIBLE_DEVICES=0,1

# Run Ollama normally -- it will split the model
ollama run llama3.1:70b
```
Multi-GPU with llama.cpp
```bash
# Offload all layers to GPU (-ngl 99) and split them evenly
# between two GPUs with --tensor-split
./llama-cli -m model.gguf -ngl 99 --tensor-split 0.5,0.5
```
Practical Multi-GPU Builds
| Setup | Total VRAM | Max Model (Q4) | Approx Cost (Used) |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 32B | $300-360 |
| 2x RTX 3090 | 48 GB | 70B | $1,400-1,600 |
| 2x Tesla P40 | 48 GB | 70B | $250-350 |
| 4x RTX 3060 12 GB | 48 GB | 70B | $600-720 |
Partial GPU Offload
If your model is too large for your VRAM but you have plenty of RAM, you can offload some layers to the GPU and keep the rest in RAM. This gives you GPU speed for the offloaded layers and CPU speed for the rest.
```bash
# In llama.cpp, -ngl controls how many layers go to the GPU.
# Llama 3.1 8B has 32 layers:
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -ngl 20   # 20 layers on GPU, rest on CPU

# In Ollama, partial offload happens automatically when VRAM is
# insufficient; you can lower it per session with the num_gpu parameter:
#   ollama run llama3.1:8b
#   >>> /set parameter num_gpu 20
```
Rule of thumb: Offloading even 50% of layers to GPU gives a significant speedup over pure CPU inference.
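That rule of thumb falls out of a simple serial model: time per token ≈ f/gpu_speed + (1−f)/cpu_speed, where f is the fraction of layers on the GPU. The 40 t/s and 8 t/s figures below are assumed example speeds, not measurements:

```shell
# Effective t/s with a fraction of layers offloaded (serial model).
effective_tps() {
  gpu_tps=$1; cpu_tps=$2; offload_pct=$3
  # microseconds per token spent in GPU layers and CPU layers
  gpu_us=$(( 1000000 * offload_pct / (100 * gpu_tps) ))
  cpu_us=$(( 1000000 * (100 - offload_pct) / (100 * cpu_tps) ))
  echo $(( 1000000 / (gpu_us + cpu_us) ))
}

effective_tps 40 8 0     # pure CPU: 8 t/s
effective_tps 40 8 50    # half the layers offloaded: ~13 t/s
effective_tps 40 8 100   # fully offloaded: 40 t/s
```

Note the asymmetry: the slow CPU layers dominate total time, which is why even a 50% offload gives a meaningful boost but the last few layers moved to GPU give the biggest one.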
Building a Dedicated Local AI Machine
If you’re building a system specifically for local AI, here are recommended builds:
Budget Build ($400-600)
- CPU: AMD Ryzen 5 5600 or Intel i5-12400F
- RAM: 32 GB DDR4-3200 (2x16)
- GPU: Used RTX 3060 12 GB ($150)
- Storage: 500 GB NVMe SSD
- PSU: 550W 80+ Bronze
- Case: Any mid-tower with decent airflow
Runs: 7-8B models at full GPU speed, 13B with partial offload
Mid-Range Build ($1,000-1,500)
- CPU: AMD Ryzen 7 7700X or Intel i5-13600K
- RAM: 64 GB DDR5-6000 (2x32)
- GPU: Used RTX 3090 24 GB ($700)
- Storage: 1 TB NVMe SSD
- PSU: 850W 80+ Gold
- Case: Full tower with good airflow
Runs: Up to 32B models at full GPU speed, 70B with heavy partial offload
High-End Build ($3,000-4,000)
- CPU: AMD Ryzen 9 7950X
- RAM: 128 GB DDR5-6000 (4x32)
- GPU: RTX 4090 24 GB ($1,600)
- Storage: 2 TB NVMe SSD
- PSU: 1000W 80+ Platinum
- Case: Full tower with excellent airflow
Runs: Up to 32B at maximum GPU speed with room for large contexts
Extreme Build ($5,000-8,000)
- CPU: AMD Threadripper 7960X or Intel Xeon w5
- RAM: 256 GB DDR5
- GPU: 2x RTX 3090 or RTX A6000 48 GB
- Storage: 4 TB NVMe SSD
- PSU: 1600W 80+ Titanium
- Case: Server chassis or E-ATX tower
Runs: 70B+ models at full GPU speed
Power Consumption and Running Costs
Local AI uses electricity, but less than you might think:
| Setup | Idle | Inference | Monthly Cost (24/7) |
|---|---|---|---|
| CPU only (65W TDP) | 30W | 65-100W | $5-8 |
| RTX 3060 | 15W | 170-200W | $8-15 |
| RTX 3090 | 20W | 300-350W | $15-25 |
| RTX 4090 | 15W | 350-450W | $18-30 |
| M4 MacBook Pro | 5W | 20-40W | $2-4 |
Based on $0.12/kWh US average. Models are only loaded during active use in most setups.
For comparison, ChatGPT Plus costs $20/month, and moderate API usage can easily exceed $50/month. A local setup can pay for itself within months once it replaces a subscription or API bill.
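You can sanity-check the table with watts × hours × rate (the usage hours below are illustrative assumptions):

```shell
# Monthly electricity cost = W x hours/day x 30 days x cents-per-kWh / 1000
monthly_cost() {
  watts=$1; hours_per_day=$2; cents_per_kwh=$3
  cents=$(( watts * hours_per_day * 30 * cents_per_kwh / 1000 ))
  printf '$%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

monthly_cost 350 24 12   # RTX 3090 at full load 24/7: $30.24
monthly_cost 350 2 12    # 2 hours/day of actual inference: $2.52
```

The second figure is the realistic one: GPUs idle at 15-20 W, and most setups only load the model during active use.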
Next Steps
Now that you understand the hardware landscape:
- Already have hardware? Jump to your platform guide: Windows, macOS, or Linux
- Need to choose a model? Read How to Choose the Right Local LLM
- Ready for quick start? Follow Your First Local AI in 5 Minutes
- Want to understand the full stack? See The Local AI Stack