Local AI Hardware Guide: What Do You Need to Run LLMs?


Running a local LLM is fundamentally a memory problem. The model weights need to fit in memory, and faster memory means faster inference. This guide breaks down exactly what hardware you need based on your budget, which GPUs to buy (new and used), how to maximize CPU-only performance, and why Apple Silicon has become a favorite for local AI enthusiasts.

The One Rule: VRAM Is King

The single most important specification for local AI is memory — specifically, fast memory that can feed the model quickly.

Here’s the hierarchy from fastest to slowest:

  1. GPU VRAM (HBM): Found in datacenter GPUs like A100/H100. Extremely fast, extremely expensive.
  2. GPU VRAM (GDDR6X): Found in consumer NVIDIA GPUs. Fast and affordable.
  3. Apple Unified Memory: Shared between CPU and GPU. Good bandwidth, very large capacity.
  4. System RAM (DDR5): CPU-only inference. Usable for smaller models.
  5. System RAM (DDR4): Slower CPU inference. Works but noticeably slower.
  6. NVMe SSD: Too slow for real-time inference. Used only for storage.

The practical implication: get as much fast memory as your budget allows.

VRAM Requirements by Model Size

This table shows how much memory you need for different model sizes at the most common quantization levels:

| Model Size | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|
| 1B | 0.7 GB | 0.8 GB | 0.9 GB | 1.1 GB | 2.0 GB |
| 3B | 1.9 GB | 2.2 GB | 2.5 GB | 3.2 GB | 6.0 GB |
| 7B | 4.4 GB | 5.0 GB | 5.5 GB | 7.5 GB | 14 GB |
| 8B | 4.9 GB | 5.5 GB | 6.2 GB | 8.5 GB | 16 GB |
| 13B | 7.4 GB | 8.4 GB | 9.3 GB | 13 GB | 26 GB |
| 14B | 8.2 GB | 9.2 GB | 10 GB | 14 GB | 28 GB |
| 32B | 18 GB | 21 GB | 23 GB | 32 GB | 64 GB |
| 34B | 19 GB | 22 GB | 25 GB | 34 GB | 68 GB |
| 70B | 38 GB | 43 GB | 48 GB | 70 GB | 140 GB |

Important: Add 1-2 GB to these numbers for KV cache (context processing). Longer conversations need more KV cache memory.
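The table follows a simple back-of-envelope rule: file size in GB ≈ parameters (in billions) × effective bits per weight ÷ 8, plus KV cache headroom. A small sketch, using approximate bits-per-weight values for common GGUF quants (the exact figures vary by model architecture):

```shell
# Back-of-envelope model memory: params (billions) x effective bits/weight / 8,
# plus KV cache headroom. The bpw values are approximations, not exact
# llama.cpp figures: Q4_K_M ~4.85, Q5_K_M ~5.7, Q8_0 ~8.5.
estimate_vram() {
  params_b=$1   # parameter count in billions
  bpw=$2        # effective bits per weight for the chosen quant
  kv_gb=$3      # KV cache / overhead headroom in GB
  awk -v p="$params_b" -v b="$bpw" -v k="$kv_gb" \
    'BEGIN { printf "%.1f GB\n", p * b / 8 + k }'
}

estimate_vram 8 4.85 2    # Llama 3.1 8B at Q4_K_M
estimate_vram 70 4.85 2   # a 70B model at Q4_K_M
```

Real quants mix precisions across tensors, so the table drifts a little from this rule, but it is close enough for sizing a purchase.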

NVIDIA GPUs: The Gold Standard

NVIDIA dominates local AI because of CUDA, its proprietary compute platform. Every major inference engine supports CUDA out of the box.

Consumer GPUs Ranked for Local AI

| GPU | VRAM | Memory BW | Max Model (Q4) | New Price | Used Price | Value Rating |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB | 1008 GB/s | 32B | $1,600 | $1,200 | Good |
| RTX 4080 Super | 16 GB | 736 GB/s | 13B | $1,000 | $800 | OK |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 13B | $800 | $600 | OK |
| RTX 3090 | 24 GB | 936 GB/s | 32B | Disc. | $700 | Excellent |
| RTX 3090 Ti | 24 GB | 1008 GB/s | 32B | Disc. | $750 | Excellent |
| RTX 3060 12 GB | 12 GB | 360 GB/s | 7-8B | $250 | $150 | Best Budget |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 13B | $500 | $400 | OK |
| RTX 3080 10 GB | 10 GB | 760 GB/s | 7B | Disc. | $250 | Good |
| RTX A6000 | 48 GB | 768 GB/s | 70B | $3,500 | $1,800 | Best Pro |
| Tesla P40 | 24 GB | 346 GB/s | 32B | - | $150 | Ultra Budget |

Best Picks by Budget

Under $200: RTX 3060 12 GB (used). The best value in local AI. 12 GB of VRAM runs any 7-8B model comfortably and can squeeze in some 13B models at Q3. Available used for $130-180.

# Verify your GPU is detected
nvidia-smi

# Check VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

Under $400: RTX 3080 10 GB or RTX 3060 12 GB (new equivalent). The 3080 is faster but has less VRAM (10 GB vs 12 GB). For AI, the 3060’s extra 2 GB of VRAM often matters more than raw speed.

Under $800: RTX 3090 (used). The sweet spot for serious local AI. 24 GB of VRAM runs 32B models easily. Used prices have dropped significantly.

Under $1,500: RTX 4090. The best consumer GPU for local AI: 24 GB of VRAM with the fastest consumer memory bandwidth. If you’re building a new system, this is the card to get.

Enterprise/Pro: RTX A6000 or dual GPUs. 48 GB of VRAM runs 70B models. Used A6000s can be found for $1,500-2,000.

The Used GPU Market

Buying used GPUs is the best way to maximize VRAM per dollar for local AI. Here’s what to look for:

Best used buys:

  • RTX 3060 12 GB ($130-180): Best entry point
  • RTX 3090 ($650-800): Best mid-range
  • Tesla P40 ($100-170): 24 GB VRAM on the extreme cheap, but no video output and needs a blower cooler
  • RTX A6000 ($1,500-2,000): Best high-end

Where to buy:

  • eBay (check seller ratings, buy with buyer protection)
  • r/hardwareswap
  • Local listings (Facebook Marketplace, Craigslist)
  • Refurbished from NVIDIA certified sellers

What to verify:

  • Run nvidia-smi immediately to confirm VRAM amount
  • Run a stress test (FurMark or similar) for 30 minutes
  • Check that all VRAM is accessible (no bad memory)
  • Inspect for physical damage, especially to the PCIe connector

The Tesla P40: Budget Secret

The NVIDIA Tesla P40 deserves special mention. It’s a datacenter GPU from 2016 with 24 GB VRAM that sells used for $100-170. The catches:

  • No video output (it’s a compute-only card)
  • No hardware video decode
  • Uses a passive cooler designed for server airflow (you need to add a fan or aftermarket cooler)
  • FP16 throughput is severely cut down, so inference runs in FP32 instead (quantized models work fine this way)
  • PCIe Gen 3 x16
  • Loud if using a blower fan

For pure LLM inference with quantized models, the P40 is remarkably capable. It won’t win speed contests, but it runs 32B Q4 models that a $500 consumer GPU can’t fit.

# P40 setup on Linux
sudo apt install nvidia-driver-535
# Reboot, then verify
nvidia-smi

AMD GPUs

AMD GPUs work for local AI through ROCm (Radeon Open Compute), but support is less mature than NVIDIA’s CUDA.

Current State of AMD Support

| Feature | Status |
|---|---|
| Ollama | Supported on Linux (ROCm) |
| llama.cpp | Supported via hipBLAS |
| LM Studio | Limited support |
| vLLM | Supported for select GPUs |
| PyTorch | Supported via ROCm |
| Windows support | Experimental/limited |

Supported AMD GPUs

Well supported (ROCm officially supported):

  • RX 7900 XTX (24 GB) — best AMD option for local AI
  • RX 7900 XT (20 GB)
  • RX 7800 XT (16 GB)
  • Radeon PRO W7900 (48 GB)

Community supported (may need workarounds):

  • RX 6900 XT (16 GB)
  • RX 6800 XT (16 GB)
  • RX 6700 XT (12 GB)

AMD Setup

# Install ROCm on Ubuntu
sudo apt install rocm-hip-libraries rocm-hip-sdk

# Set environment for unsupported GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # For RDNA 3

# Install Ollama (auto-detects ROCm)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection
ollama run llama3.2 --verbose

Bottom line: If you’re buying new hardware specifically for AI, NVIDIA is the safer choice. If you already have an AMD GPU, it can work well on Linux with some setup effort.

Apple Silicon: The Unified Memory Advantage

Apple’s M-series chips have a unique advantage for local AI: unified memory. The GPU and CPU share the same pool of RAM, which means a MacBook with 64 GB of unified memory can load a 70B Q4 model that would otherwise require multiple GPUs or a $2,000+ workstation card on a PC.

Performance by Chip

| Chip | Max Memory | Memory BW | 7B Q4 (t/s) | 13B Q4 (t/s) | 70B Q4 (t/s) |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | N/A |
| M1 Pro | 32 GB | 200 GB/s | 25-35 | 15-20 | N/A |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 25-30 | 10-15 |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 35-40 | 20-25 |
| M2 | 24 GB | 100 GB/s | 18-25 | 10-14 | N/A |
| M2 Pro | 32 GB | 200 GB/s | 28-38 | 18-22 | N/A |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 28-33 | 12-18 |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 38-45 | 22-28 |
| M3 | 24 GB | 100 GB/s | 20-28 | 12-16 | N/A |
| M3 Pro | 36 GB | 150 GB/s | 30-40 | 20-25 | N/A |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 30-35 | 14-20 |
| M4 | 32 GB | 120 GB/s | 25-32 | 14-18 | N/A |
| M4 Pro | 48 GB | 273 GB/s | 35-45 | 25-30 | N/A |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 38-45 | 20-28 |

t/s = tokens per second for generation. Actual speeds vary by model and context length.
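A useful sanity check on these numbers: generation is memory-bound, so every new token requires reading all the weights once, which puts a hard ceiling of bandwidth ÷ model size on tokens per second. Real-world throughput typically lands at 30-60% of that ceiling (an assumption consistent with the table above):

```shell
# Generation speed ceiling: each new token reads every weight once, so
# tokens/sec cannot exceed memory bandwidth divided by model size.
tps_ceiling() {
  bw_gbs=$1    # memory bandwidth in GB/s
  model_gb=$2  # quantized model size in GB
  awk -v bw="$bw_gbs" -v m="$model_gb" \
    'BEGIN { printf "%.0f t/s\n", bw / m }'
}

tps_ceiling 400 4.4    # M1 Max with a 7B Q4 model
tps_ceiling 1008 18    # RTX 4090 with a 32B Q4 model
```

The same math explains why a faster GPU core barely helps generation speed: the bottleneck is moving weights, not computing on them.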

Recommended Models by Configuration

| Config | Best Models |
|---|---|
| M1/M2/M3/M4 8 GB | Phi-3 Mini 3.8B, Llama 3.2 1B |
| M1/M2/M3/M4 16 GB | Llama 3.1 8B, Qwen 2.5 7B |
| M-Pro 18-36 GB | Qwen 2.5 14B, Llama 3.1 8B Q8 |
| M-Max 32-64 GB | Qwen 2.5 32B, Llama 3.1 8B FP16 |
| M-Max/Ultra 64-128 GB | Llama 3.1 70B Q4, Qwen 2.5 72B Q4 |
| M-Ultra 128-192 GB | Llama 3.1 70B Q8, multiple models |

Apple Silicon Optimization

For best performance on Apple Silicon, use MLX or Ollama (which uses Metal acceleration by default):

# Ollama uses Metal automatically
ollama run llama3.1:8b

# For MLX (Apple's ML framework)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Hello, world"

# Check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000

Mac Buying Guide for Local AI

| Budget | Recommendation | Max Model |
|---|---|---|
| $800-1,000 | Used M1 Max MacBook Pro, 32 GB | 14B |
| $1,200-1,500 | Used M2 Max MacBook Pro, 64 GB | 32B |
| $1,800-2,500 | Used M1/M2 Ultra Mac Studio, 128 GB | 70B |
| $2,500-3,500 | New M4 Max MacBook Pro, 128 GB | 70B |
| $4,000+ | New M3 Ultra Mac Studio, 96-512 GB | 70B Q8 or 100B+ |

CPU-Only Inference

If you don’t have a dedicated GPU and aren’t on Apple Silicon, you can still run local LLMs using your CPU.

What Makes a Good CPU for LLM Inference?

1. AVX2/AVX-512 Support: SIMD instructions that accelerate matrix math. Most CPUs from 2015+ have AVX2. AVX-512 (found in some Intel and AMD Zen 4 chips) gives a further boost.

2. Core Count: More cores help with prompt processing (the “prefill” phase). For generation, single-thread performance matters more.

3. RAM Speed: Faster RAM means faster token generation. DDR5 is ~50% faster than DDR4 for inference.

4. RAM Amount: Determines the maximum model size you can load.

5. Cache Size: Large L3 caches (32 MB+) help with inference performance.

CPU Performance Expectations

| CPU | RAM | 7B Q4 Speed | 13B Q4 Speed |
|---|---|---|---|
| Intel i5-12400 | 32 GB DDR5 | 8-12 t/s | 4-7 t/s |
| Intel i7-13700K | 64 GB DDR5 | 12-18 t/s | 7-10 t/s |
| AMD Ryzen 7 7800X3D | 32 GB DDR5 | 10-15 t/s | 6-9 t/s |
| AMD Ryzen 9 7950X | 64 GB DDR5 | 15-22 t/s | 9-13 t/s |
| Intel i5-10400 | 32 GB DDR4 | 5-8 t/s | 3-5 t/s |
| AMD Ryzen 5 5600X | 32 GB DDR4 | 7-10 t/s | 4-6 t/s |

Optimizing CPU Inference

# Check AVX support
lscpu | grep -i avx

# For llama.cpp, build with optimal CPU flags
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Set thread count (generation is usually fastest at the physical core
# count, not the logical/SMT count). Ollama handles this automatically;
# for llama.cpp on an 8-core CPU:
./llama-cli -m model.gguf -t 8 -ngl 0

# Monitor CPU usage during inference
htop

RAM configuration tips:

  • Use dual-channel RAM (two sticks, not one). This roughly doubles memory bandwidth.
  • For DDR5, higher transfer rates matter. DDR5-6000 gives ~15% better inference speed than DDR5-4800.
  • Fill all memory channels if possible.
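The dual-channel advice comes straight from the bandwidth math: each 64-bit channel moves 8 bytes per transfer, so theoretical bandwidth is channels × 8 bytes × transfer rate. A quick sketch:

```shell
# Theoretical RAM bandwidth: each 64-bit channel moves 8 bytes per transfer.
mem_bandwidth() {
  channels=$1  # populated memory channels
  mts=$2       # transfer rate in MT/s (e.g. 3200 for DDR4-3200)
  awk -v c="$channels" -v r="$mts" \
    'BEGIN { printf "%.1f GB/s\n", c * 8 * r / 1000 }'
}

mem_bandwidth 2 3200   # dual-channel DDR4-3200
mem_bandwidth 2 6000   # dual-channel DDR5-6000
mem_bandwidth 1 6000   # one stick: half the bandwidth
```

Compare the results to the GPU table above and it is clear why even fast DDR5 cannot match a midrange GPU's 300+ GB/s.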

Intel Arc GPUs

Intel’s Arc GPUs (A770, A750) support local AI through SYCL/oneAPI and increasingly through community efforts.

| GPU | VRAM | Status |
|---|---|---|
| Arc A770 16 GB | 16 GB | Experimental support in llama.cpp, Ollama |
| Arc A750 8 GB | 8 GB | Experimental support |

Support is improving but still behind NVIDIA and AMD. Not recommended as a primary purchase for AI use, but if you already have one, it’s worth trying.

# llama.cpp with SYCL (Intel GPU) support
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release -j$(nproc)

Multi-GPU Setups

If one GPU isn’t enough VRAM, you can split a model across multiple GPUs.

How It Works

The model’s layers are divided across GPUs. During inference, data flows from one GPU to the next through the layers. This is called pipeline parallelism (or layer splitting).

Requirements

  • Same GPU vendor (e.g., two NVIDIA cards); mixing NVIDIA and AMD in one setup doesn’t work
  • Sufficient PCIe lanes (x8 per GPU minimum, x16 preferred)
  • A motherboard with multiple PCIe x16 slots
  • Adequate power supply (two GPUs = 2x power draw)
  • Good case airflow (two GPUs generate significant heat)

Multi-GPU with Ollama

# On Linux, Ollama detects multiple NVIDIA GPUs automatically
# Set visible GPUs
export CUDA_VISIBLE_DEVICES=0,1

# Run Ollama normally - it will split the model
ollama run llama3.1:70b

Multi-GPU with llama.cpp

# Split layers across GPUs:
#   -ngl 99 offloads all layers to GPU
#   --tensor-split 0.5,0.5 splits evenly between two GPUs
./llama-cli -m model.gguf -ngl 99 --tensor-split 0.5,0.5
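With mismatched cards, split proportionally to VRAM rather than evenly. A small helper (hypothetical, not part of llama.cpp) that computes the ratios for two GPUs:

```shell
# Compute --tensor-split ratios proportional to each card's VRAM,
# e.g. a 24 GB RTX 3090 paired with a 12 GB RTX 3060.
split_ratio() {
  awk -v a="$1" -v b="$2" \
    'BEGIN { t = a + b; printf "%.2f,%.2f\n", a / t, b / t }'
}

split_ratio 24 12   # pass the output to --tensor-split
```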

Practical Multi-GPU Builds

| Setup | Total VRAM | Max Model (Q4) | Approx Cost (Used) |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 32B | $300-360 |
| 2x RTX 3090 | 48 GB | 70B | $1,400-1,600 |
| 2x Tesla P40 | 48 GB | 70B | $250-350 |
| 4x RTX 3060 12 GB | 48 GB | 70B | $600-720 |

Partial GPU Offload

If your model is too large for your VRAM but you have plenty of RAM, you can offload some layers to the GPU and keep the rest in RAM. This gives you GPU speed for the offloaded layers and CPU speed for the rest.

# In llama.cpp, -ngl controls how many layers go to GPU
# A 7B model has ~32 layers
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -ngl 20  # 20 layers on GPU, rest on CPU

# In Ollama, partial offload happens automatically when VRAM is insufficient.
# You can set the layer count per session from the interactive prompt:
ollama run llama3.1:8b
>>> /set parameter num_gpu 20

Rule of thumb: Offloading even 50% of layers to GPU gives a significant speedup over pure CPU inference.
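To choose an -ngl value up front instead of by trial and error, you can estimate how many layers fit in free VRAM, assuming layers are roughly equal in size (true to a first approximation; the helper name is illustrative):

```shell
# Estimate an -ngl value: how many roughly-equal layers fit in free VRAM?
# Layer sizes vary a little in practice; treat the result as a starting point.
layers_on_gpu() {
  vram_gb=$1    # free VRAM in GB
  model_gb=$2   # quantized model file size in GB
  n_layers=$3   # total layer count (~32 for 7-8B, ~40 for 13B)
  overhead=$4   # GB reserved for KV cache and buffers
  awk -v v="$vram_gb" -v m="$model_gb" -v n="$n_layers" -v o="$overhead" \
    'BEGIN { l = int((v - o) / (m / n)); if (l > n) l = n; if (l < 0) l = 0; print l }'
}

layers_on_gpu 8 4.9 32 1.5   # 8 GB card, 8B Q4 model: everything fits
layers_on_gpu 6 7.4 40 1     # 6 GB card, 13B Q4 model: partial offload
```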

Building a Dedicated Local AI Machine

If you’re building a system specifically for local AI, here are recommended builds:

Budget Build ($400-600)

  • CPU: AMD Ryzen 5 5600 or Intel i5-12400F
  • RAM: 32 GB DDR4-3200 (2x16)
  • GPU: Used RTX 3060 12 GB ($150)
  • Storage: 500 GB NVMe SSD
  • PSU: 550W 80+ Bronze
  • Case: Any mid-tower with decent airflow

Runs: 7-8B models at full GPU speed, 13B with partial offload

Mid-Range Build ($1,000-1,500)

  • CPU: AMD Ryzen 7 7700X or Intel i5-13600K
  • RAM: 64 GB DDR5-6000 (2x32)
  • GPU: Used RTX 3090 24 GB ($700)
  • Storage: 1 TB NVMe SSD
  • PSU: 850W 80+ Gold
  • Case: Full tower with good airflow

Runs: Up to 32B models at full GPU speed, 70B with heavy partial offload

High-End Build ($3,000-4,000)

  • CPU: AMD Ryzen 9 7950X
  • RAM: 128 GB DDR5-6000 (4x32)
  • GPU: RTX 4090 24 GB ($1,600)
  • Storage: 2 TB NVMe SSD
  • PSU: 1000W 80+ Platinum
  • Case: Full tower with excellent airflow

Runs: Up to 32B at maximum GPU speed with room for large contexts

Extreme Build ($5,000-8,000)

  • CPU: AMD Threadripper 7960X or Intel Xeon w5
  • RAM: 256 GB DDR5
  • GPU: 2x RTX 3090 or RTX A6000 48 GB
  • Storage: 4 TB NVMe SSD
  • PSU: 1600W 80+ Titanium
  • Case: Server chassis or E-ATX tower

Runs: 70B+ models at full GPU speed

Power Consumption and Running Costs

Local AI uses electricity, but less than you might think:

| Setup | Idle | Inference | Monthly Cost (24/7) |
|---|---|---|---|
| CPU only (65W TDP) | 30W | 65-100W | $5-8 |
| RTX 3060 | 15W | 170-200W | $8-15 |
| RTX 3090 | 20W | 300-350W | $15-25 |
| RTX 4090 | 15W | 350-450W | $18-30 |
| M4 MacBook Pro | 5W | 20-40W | $2-4 |
Based on $0.12/kWh US average. Models are only loaded during active use in most setups.

For comparison, ChatGPT Plus costs $20/month, and API usage for moderate use can easily exceed $50/month. A local setup can pay for itself in subscription savings within months.
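You can recompute the table for your own usage pattern. A sketch using the $0.12/kWh rate above and an assumed idle/load split (the 4-hours-a-day figure is an assumption, not a measurement):

```shell
# Monthly electricity cost for a given idle/load pattern at $0.12/kWh.
monthly_cost() {
  idle_w=$1; load_w=$2; load_h=$3  # watts idle, watts under load, load hours/day
  awk -v i="$idle_w" -v l="$load_w" -v h="$load_h" \
    'BEGIN { kwh = ((24 - h) * i + h * l) / 1000 * 30
             printf "$%.2f/month (%.0f kWh)\n", kwh * 0.12, kwh }'
}

monthly_cost 20 325 4   # RTX 3090: idle most of the day, 4 h of inference
monthly_cost 15 400 4   # RTX 4090, same pattern
```

With a few hours of daily use rather than continuous load, even a 3090 costs well under the 24/7 column above.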


Frequently Asked Questions

What is the minimum hardware needed to run a local LLM?

The absolute minimum is any modern x86-64 CPU with AVX2 support and 8 GB of RAM. This lets you run small 1-3B parameter models at usable speeds via CPU inference. For a good experience with 7B models, you want 16 GB of RAM and ideally any NVIDIA GPU with 8+ GB VRAM.

Is it worth buying a GPU specifically for local AI?

Yes, if you plan to use local AI regularly. A GPU with 8-12 GB VRAM (like a used RTX 3060 12 GB for around $200) transforms the experience from 'usable' to 'fast.' The speed improvement over CPU-only is typically 5-10x. If you already have a gaming GPU, you likely already have everything you need.

Can I use multiple GPUs for local AI?

Yes. llama.cpp, Ollama (on Linux), and vLLM all support multi-GPU inference. The model is split across GPUs, allowing you to run larger models. Two RTX 3060 12 GB cards (24 GB total) can run a 30B model. However, multi-GPU adds complexity and inter-GPU communication overhead.

Is Apple Silicon or NVIDIA better for local AI?

It depends on your use case. NVIDIA GPUs offer higher raw inference speed per dollar. Apple Silicon offers unified memory, meaning a Mac with 64 GB can run 70B models that would require multiple GPUs or a workstation card on a PC. For maximum speed, NVIDIA wins. For running the largest possible model, Apple Silicon's unified memory architecture is hard to beat.