Running a local LLM is fundamentally a memory problem. The model weights need to fit in memory, and faster memory means faster inference. This guide breaks down exactly what hardware you need based on your budget, which GPUs to buy (new and used), how to maximize CPU-only performance, and why Apple Silicon has become a favorite for local AI enthusiasts.
The One Rule: VRAM Is King
The single most important specification for local AI is memory — specifically, fast memory that can feed the model quickly.
Here’s the hierarchy from fastest to slowest:
- GPU VRAM (HBM): Found in datacenter GPUs like A100/H100. Extremely fast, extremely expensive.
- GPU VRAM (GDDR6/GDDR6X): Found in consumer NVIDIA GPUs. Fast and relatively affordable.
- Apple Unified Memory: Shared between CPU and GPU. Good bandwidth, very large capacity.
- System RAM (DDR5): CPU-only inference. Usable for smaller models.
- System RAM (DDR4): Slower CPU inference. Works but noticeably slower.
- NVMe SSD: Too slow for real-time inference. Used only for storage.
The practical implication: get as much fast memory as your budget allows.
VRAM Requirements by Model Size
This table shows how much memory you need for different model sizes at the most common quantization levels:
| Model Size | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | FP16 |
|---|---|---|---|---|---|
| 1B | 0.7 GB | 0.8 GB | 0.9 GB | 1.1 GB | 2.0 GB |
| 3B | 1.9 GB | 2.2 GB | 2.5 GB | 3.2 GB | 6.0 GB |
| 7B | 4.4 GB | 5.0 GB | 5.5 GB | 7.5 GB | 14 GB |
| 8B | 4.9 GB | 5.5 GB | 6.2 GB | 8.5 GB | 16 GB |
| 13B | 7.4 GB | 8.4 GB | 9.3 GB | 13 GB | 26 GB |
| 14B | 8.2 GB | 9.2 GB | 10 GB | 14 GB | 28 GB |
| 32B | 18 GB | 21 GB | 23 GB | 32 GB | 64 GB |
| 34B | 19 GB | 22 GB | 25 GB | 34 GB | 68 GB |
| 70B | 38 GB | 43 GB | 48 GB | 70 GB | 140 GB |
Important: Add 1-2 GB to these numbers for KV cache (context processing). Longer conversations need more KV cache memory.
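The table follows a simple rule of thumb: bytes ≈ parameters × bits-per-weight ÷ 8. A rough sketch (the ~4.5 bits/weight figure for Q4_K_M is an approximation, and real files vary by a few percent):

```shell
# Rough model memory: params (billions) x bits-per-weight / 8.
# Bits are passed in tenths so we can stay in integer math
# (Q4_K_M is ~4.5 bits/weight -> pass 45).
estimate_gb() {
  params_b=$1
  bits_tenths=$2
  gb_x100=$(( params_b * bits_tenths * 10 / 8 ))
  printf '%d.%02d GB\n' $(( gb_x100 / 100 )) $(( gb_x100 % 100 ))
}

estimate_gb 8 45    # 8B at Q4_K_M  -> ~4.50 GB (table: 4.9 GB)
estimate_gb 70 45   # 70B at Q4_K_M -> ~39.37 GB (table: 38 GB)
```

The small gap versus the table comes from per-file overhead (embeddings and some layers are kept at higher precision), plus the KV cache noted above.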
NVIDIA GPUs: The Gold Standard
NVIDIA dominates local AI because of CUDA, its proprietary compute platform. Every major inference engine supports CUDA out of the box.
Consumer GPUs Ranked for Local AI
| GPU | VRAM | Memory BW | Max Model (Q4) | New Price | Used Price | Value Rating |
|---|---|---|---|---|---|---|
| RTX 4090 | 24 GB | 1008 GB/s | 32B | $1,600 | $1,200 | Good |
| RTX 4080 Super | 16 GB | 736 GB/s | 13B | $1,000 | $800 | OK |
| RTX 4070 Ti Super | 16 GB | 672 GB/s | 13B | $800 | $600 | OK |
| RTX 3090 | 24 GB | 936 GB/s | 32B | Disc. | $700 | Excellent |
| RTX 3090 Ti | 24 GB | 1008 GB/s | 32B | Disc. | $750 | Excellent |
| RTX 3060 12 GB | 12 GB | 360 GB/s | 7-8B | $250 | $150 | Best Budget |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | 13B | $500 | $400 | OK |
| RTX 3080 10 GB | 10 GB | 760 GB/s | 7B | Disc. | $250 | Good |
| RTX A6000 | 48 GB | 768 GB/s | 70B | $3,500 | $1,800 | Best Pro |
| Tesla P40 | 24 GB | 346 GB/s | 32B | - | $150 | Ultra Budget |
Best Picks by Budget
Under $200: RTX 3060 12 GB (used) The best value in local AI. 12 GB VRAM runs any 7-8B model comfortably and can squeeze in some 13B models at Q3. Available used for $130-180.
```bash
# Verify your GPU is detected
nvidia-smi

# Check VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```
Under $400: RTX 3080 10 GB or RTX 3060 12 GB (new equivalent) The 3080 is faster but has less VRAM (10 GB vs 12 GB). For AI, the 3060’s extra 2 GB of VRAM often matters more than raw speed.
Under $800: RTX 3090 (used) The sweet spot for serious local AI. 24 GB VRAM runs 32B models easily. Used prices have dropped significantly.
Under $1,500: RTX 4090 The best consumer GPU for local AI. 24 GB VRAM with the fastest consumer memory bandwidth. If you’re building a new system, this is the card to get.
Enterprise/Pro: RTX A6000 or dual GPUs 48 GB VRAM runs 70B models. Used A6000s can be found for $1,500-2,000.
The Used GPU Market
Buying used GPUs is the best way to maximize VRAM per dollar for local AI. Here’s what to look for:
Best used buys:
- RTX 3060 12 GB ($130-180): Best entry point
- RTX 3090 ($650-800): Best mid-range
- Tesla P40 ($100-170): 24 GB VRAM on the extreme cheap, but no video output and needs a blower cooler
- RTX A6000 ($1,500-2,000): Best high-end
Where to buy:
- eBay (check seller ratings, buy with buyer protection)
- r/hardwareswap
- Local listings (Facebook Marketplace, Craigslist)
- Refurbished from NVIDIA certified sellers
What to verify:
- Run `nvidia-smi` immediately to confirm the VRAM amount
- Run a stress test (FurMark or similar) for 30 minutes
- Check that all VRAM is accessible (no bad memory)
- Inspect for physical damage, especially the PCIe connector
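A small helper for the VRAM check: compare the card's advertised capacity against what the driver reports. On a live system you'd feed it the output of `nvidia-smi`; the 3% slack for driver-reserved memory is an assumption.

```shell
# Compare reported VRAM (MiB) against the advertised amount, allowing
# ~3% slack for driver-reserved memory.
check_vram() {
  reported_mib=$1; expected_mib=$2
  min_mib=$(( expected_mib * 97 / 100 ))
  if [ "$reported_mib" -ge "$min_mib" ]; then
    echo "OK: ${reported_mib} MiB"
  else
    echo "SHORT: ${reported_mib} MiB (expected ~${expected_mib})"
  fi
}

# Live usage (requires an NVIDIA driver):
#   check_vram "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)" 12288
check_vram 12288 12288   # a healthy RTX 3060 12 GB
check_vram 10240 12288   # a card that is not what the listing claimed
```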
The Tesla P40: Budget Secret
The NVIDIA Tesla P40 deserves special mention. It’s a datacenter GPU from 2016 with 24 GB VRAM that sells used for $100-170. The catches:
- No video output (it’s a compute-only card)
- No hardware video decode
- Uses a passive cooler designed for server airflow (you need to add a fan or aftermarket cooler)
- Very slow FP16 (the P40's FP16 throughput is a small fraction of its FP32 rate, but quantized models don't depend on it)
- PCIe Gen 3 x16
- Loud if using a blower fan
For pure LLM inference with quantized models, the P40 is remarkably capable. It won’t win speed contests, but it runs 32B Q4 models that a $500 consumer GPU can’t fit.
```bash
# P40 setup on Linux
sudo apt install nvidia-driver-535

# Reboot, then verify
nvidia-smi
```
AMD GPUs
AMD GPUs work for local AI through ROCm (Radeon Open Compute), but support is less mature than NVIDIA’s CUDA.
Current State of AMD Support
| Feature | Status |
|---|---|
| Ollama | Supported on Linux (ROCm) |
| llama.cpp | Supported via hipBLAS |
| LM Studio | Limited support |
| vLLM | Supported for select GPUs |
| PyTorch | Supported via ROCm |
| Windows support | Experimental/limited |
Supported AMD GPUs
Well supported (ROCm officially supported):
- RX 7900 XTX (24 GB) — best AMD option for local AI
- RX 7900 XT (20 GB)
- RX 7800 XT (16 GB)
- Radeon PRO W7900 (48 GB)
Community supported (may need workarounds):
- RX 6900 XT (16 GB)
- RX 6800 XT (16 GB)
- RX 6700 XT (12 GB)
AMD Setup
```bash
# Install ROCm on Ubuntu
sudo apt install rocm-hip-libraries rocm-hip-sdk

# Set environment for unsupported GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # For RDNA 3

# Install Ollama (auto-detects ROCm)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection
ollama run llama3.2 --verbose
```
Bottom line: If you’re buying new hardware specifically for AI, NVIDIA is the safer choice. If you already have an AMD GPU, it can work well on Linux with some setup effort.
Apple Silicon: The Unified Memory Advantage
Apple’s M-series chips have a unique advantage for local AI: unified memory. The GPU and CPU share the same pool of RAM, which means a MacBook with 64 GB of unified memory can load a 70B Q4 model that would require two high-end GPUs (or a 48 GB workstation card) on a PC.
Performance by Chip
| Chip | Max Memory | Memory BW | 7B Q4 (t/s) | 13B Q4 (t/s) | 70B Q4 (t/s) |
|---|---|---|---|---|---|
| M1 | 16 GB | 68 GB/s | 15-20 | 8-12 | N/A |
| M1 Pro | 32 GB | 200 GB/s | 25-35 | 15-20 | N/A |
| M1 Max | 64 GB | 400 GB/s | 35-45 | 25-30 | 10-15 |
| M1 Ultra | 128 GB | 800 GB/s | 45-55 | 35-40 | 20-25 |
| M2 | 24 GB | 100 GB/s | 18-25 | 10-14 | N/A |
| M2 Pro | 32 GB | 200 GB/s | 28-38 | 18-22 | N/A |
| M2 Max | 96 GB | 400 GB/s | 38-48 | 28-33 | 12-18 |
| M2 Ultra | 192 GB | 800 GB/s | 50-60 | 38-45 | 22-28 |
| M3 | 24 GB | 100 GB/s | 20-28 | 12-16 | N/A |
| M3 Pro | 36 GB | 150 GB/s | 30-40 | 20-25 | N/A |
| M3 Max | 128 GB | 400 GB/s | 40-50 | 30-35 | 14-20 |
| M4 | 32 GB | 120 GB/s | 25-32 | 14-18 | N/A |
| M4 Pro | 48 GB | 273 GB/s | 35-45 | 25-30 | N/A |
| M4 Max | 128 GB | 546 GB/s | 50-60 | 38-45 | 20-28 |
t/s = tokens per second for generation. Actual speeds vary by model and context length.
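These numbers track memory bandwidth closely: generation is memory-bound, so a loose upper bound is bandwidth divided by the bytes read per token, which is roughly the model size. A sketch with example figures from the tables above:

```shell
# tokens/s ceiling ~= memory bandwidth / model size
# (every generated token reads essentially all the weights once)
ceiling_tps() {
  bw_gbs=$1        # memory bandwidth in GB/s
  model_gb_x10=$2  # model size in tenths of a GB (4.9 GB -> 49)
  echo $(( bw_gbs * 10 / model_gb_x10 ))
}

ceiling_tps 400 49   # M1 Max, 8B Q4 (~4.9 GB): ~81 t/s ceiling
ceiling_tps 68 49    # M1, same model: ~13 t/s ceiling
```

Real-world speeds land well below the ceiling (the table shows 35-45 t/s for the M1 Max) because compute, KV-cache reads, and framework overhead all take their share.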
Recommended Models by Apple Silicon Config
| Config | Best Models |
|---|---|
| M1/M2/M3/M4 8 GB | Phi-3 Mini 3.8B, Llama 3.2 1B |
| M1/M2/M3/M4 16 GB | Llama 3.1 8B, Qwen 2.5 7B |
| M-Pro 18-36 GB | Qwen 2.5 14B, Llama 3.1 8B Q8 |
| M-Max 32-64 GB | Qwen 2.5 32B, Llama 3.1 8B FP16 |
| M-Max/Ultra 64-128 GB | Llama 3.1 70B Q4, Qwen 2.5 72B Q4 |
| M-Ultra 128-192 GB | Llama 3.1 70B Q8, multiple models |
Apple Silicon Optimization
For best performance on Apple Silicon, use MLX or Ollama (which uses Metal acceleration by default):
```bash
# Ollama uses Metal automatically
ollama run llama3.1:8b

# For MLX (Apple's ML framework)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "Hello, world"

# Check GPU utilization
sudo powermetrics --samplers gpu_power -i 1000
```
Mac Buying Guide for Local AI
| Budget | Recommendation | Max Model |
|---|---|---|
| $800-1000 | Used M1 Max MacBook Pro 32 GB | 14B |
| $1200-1500 | Used M2 Max MacBook Pro 64 GB | 32B |
| $1800-2500 | Used M1/M2 Ultra Mac Studio 128 GB | 70B |
| $2500-3500 | New M4 Max MacBook Pro 128 GB | 70B |
| $4000+ | New M3 Ultra Mac Studio 256 GB+ | 70B Q8 or 100B+ |
CPU-Only Inference
If you don’t have a dedicated GPU and aren’t on Apple Silicon, you can still run local LLMs using your CPU.
What Makes a Good CPU for LLM Inference?
1. AVX2/AVX-512 Support: SIMD instructions that accelerate matrix math. Most CPUs from 2015+ have AVX2. AVX-512 (found in some Intel and AMD Zen 4 chips) gives a further boost.
2. Core Count: More cores help with prompt processing (the “prefill” phase). For generation, single-thread performance matters more.
3. RAM Speed: Faster RAM means faster token generation. DDR5 is ~50% faster than DDR4 for inference.
4. RAM Amount: Determines the maximum model size you can load.
5. Cache Size: Large L3 caches (32 MB+) help with inference performance.
CPU Performance Expectations
| CPU | RAM | 7B Q4 Speed | 13B Q4 Speed |
|---|---|---|---|
| Intel i5-12400 | 32 GB DDR5 | 8-12 t/s | 4-7 t/s |
| Intel i7-13700K | 64 GB DDR5 | 12-18 t/s | 7-10 t/s |
| AMD Ryzen 7 7700X | 32 GB DDR5 | 10-15 t/s | 6-9 t/s |
| AMD Ryzen 9 7950X | 64 GB DDR5 | 15-22 t/s | 9-13 t/s |
| Intel i5-10400 | 32 GB DDR4 | 5-8 t/s | 3-5 t/s |
| AMD Ryzen 5 5600X | 32 GB DDR4 | 7-10 t/s | 4-6 t/s |
Optimizing CPU Inference
```bash
# Check AVX support
lscpu | grep -i avx

# For llama.cpp, build with optimal CPU flags
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Set thread count to your physical core count (not logical/SMT threads).
# Ollama does this automatically; for llama.cpp on an 8-core CPU:
./llama-cli -m model.gguf -t 8 -ngl 0

# Monitor CPU usage during inference
htop
```
RAM configuration tips:
- Use dual-channel RAM (two sticks, not one). This roughly doubles memory bandwidth.
- For DDR5, transfer rate matters: DDR5-6000 gives roughly 15% better inference speed than DDR5-4800.
- Fill all memory channels if possible.
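These tips follow from the peak-bandwidth arithmetic: channels × 8 bytes per transfer × transfer rate (MT/s). A quick check of why the second stick matters:

```shell
# Peak DRAM bandwidth (GB/s) = channels x 8 bytes/transfer x MT/s / 1000
peak_bw_gbs() {
  channels=$1
  mts=$2
  echo $(( channels * 8 * mts / 1000 ))
}

peak_bw_gbs 1 6000   # single-channel DDR5-6000: 48 GB/s
peak_bw_gbs 2 6000   # dual-channel DDR5-6000:   96 GB/s
peak_bw_gbs 2 3200   # dual-channel DDR4-3200:   51 GB/s
```

Since token generation is bandwidth-bound, dropping to a single stick roughly halves generation speed; this is also why DDR4 systems lag DDR5 in the table above.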
Intel Arc GPUs
Intel’s Arc GPUs (A770, A750) support local AI through SYCL/oneAPI and increasingly through community efforts.
| GPU | VRAM | Status |
|---|---|---|
| Arc A770 16 GB | 16 GB | Experimental support in llama.cpp, Ollama |
| Arc A750 8 GB | 8 GB | Experimental support |
Support is improving but still behind NVIDIA and AMD. Not recommended as a primary purchase for AI use, but if you already have one, it’s worth trying.
```bash
# llama.cpp with SYCL (Intel GPU) support
cmake -B build -DGGML_SYCL=ON
cmake --build build --config Release -j$(nproc)
```
Multi-GPU Setups
If one GPU isn’t enough VRAM, you can split a model across multiple GPUs.
How It Works
The model’s layers are divided across GPUs. During inference, data flows from one GPU to the next through the layers. This is called pipeline parallelism (or layer splitting).
Requirements
- Same GPU vendor (e.g., two NVIDIA cards). Mixing NVIDIA and AMD in one inference run generally doesn't work.
- Sufficient PCIe lanes (x8 per GPU minimum, x16 preferred)
- A motherboard with multiple PCIe x16 slots
- Adequate power supply (two GPUs = 2x power draw)
- Good case airflow (two GPUs generate significant heat)
Multi-GPU with Ollama
```bash
# On Linux, Ollama detects multiple NVIDIA GPUs automatically.
# Restrict which GPUs are visible:
export CUDA_VISIBLE_DEVICES=0,1

# Run Ollama normally -- it will split the model
ollama run llama3.1:70b
```
Multi-GPU with llama.cpp
```bash
# Offload all layers to GPU (-ngl 99) and split them evenly
# between two GPUs with --tensor-split
./llama-cli -m model.gguf -ngl 99 --tensor-split 0.5,0.5
```
Practical Multi-GPU Builds
| Setup | Total VRAM | Max Model (Q4) | Approx Cost (Used) |
|---|---|---|---|
| 2x RTX 3060 12 GB | 24 GB | 32B | $300-360 |
| 2x RTX 3090 | 48 GB | 70B | $1,400-1,600 |
| 2x Tesla P40 | 48 GB | 70B | $250-350 |
| 4x RTX 3060 12 GB | 48 GB | 70B | $600-720 |
Partial GPU Offload
If your model is too large for your VRAM but you have plenty of RAM, you can offload some layers to the GPU and keep the rest in RAM. This gives you GPU speed for the offloaded layers and CPU speed for the rest.
```bash
# In llama.cpp, -ngl controls how many layers go to the GPU.
# Llama 3.1 8B has 32 layers:
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -ngl 20   # 20 layers on GPU, rest on CPU

# In Ollama, partial offload happens automatically when VRAM is
# insufficient; you can lower it per session with the num_gpu parameter:
#   ollama run llama3.1:8b
#   >>> /set parameter num_gpu 20
```
Rule of thumb: Offloading even 50% of layers to GPU gives a significant speedup over pure CPU inference.
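That rule of thumb falls out of a simple serial model: time per token ≈ f/gpu_speed + (1−f)/cpu_speed, where f is the fraction of layers on the GPU. The 40 t/s and 8 t/s figures below are assumed example speeds, not measurements:

```shell
# Effective t/s with a fraction of layers offloaded (serial model).
effective_tps() {
  gpu_tps=$1; cpu_tps=$2; offload_pct=$3
  # microseconds per token spent in GPU layers and CPU layers
  gpu_us=$(( 1000000 * offload_pct / (100 * gpu_tps) ))
  cpu_us=$(( 1000000 * (100 - offload_pct) / (100 * cpu_tps) ))
  echo $(( 1000000 / (gpu_us + cpu_us) ))
}

effective_tps 40 8 0     # pure CPU: 8 t/s
effective_tps 40 8 50    # half the layers offloaded: ~13 t/s
effective_tps 40 8 100   # fully offloaded: 40 t/s
```

Note the asymmetry: the slow CPU layers dominate total time, which is why even a 50% offload gives a meaningful boost but the last few layers moved to GPU give the biggest one.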
Building a Dedicated Local AI Machine
If you’re building a system specifically for local AI, here are recommended builds:
Budget Build ($400-600)
- CPU: AMD Ryzen 5 5600 or Intel i5-12400F
- RAM: 32 GB DDR4-3200 (2x16)
- GPU: Used RTX 3060 12 GB ($150)
- Storage: 500 GB NVMe SSD
- PSU: 550W 80+ Bronze
- Case: Any mid-tower with decent airflow
Runs: 7-8B models at full GPU speed, 13B with partial offload
Mid-Range Build ($1,000-1,500)
- CPU: AMD Ryzen 7 7700X or Intel i5-13600K
- RAM: 64 GB DDR5-6000 (2x32)
- GPU: Used RTX 3090 24 GB ($700)
- Storage: 1 TB NVMe SSD
- PSU: 850W 80+ Gold
- Case: Full tower with good airflow
Runs: Up to 32B models at full GPU speed, 70B with heavy partial offload
High-End Build ($3,000-4,000)
- CPU: AMD Ryzen 9 7950X
- RAM: 128 GB DDR5-6000 (4x32)
- GPU: RTX 4090 24 GB ($1,600)
- Storage: 2 TB NVMe SSD
- PSU: 1000W 80+ Platinum
- Case: Full tower with excellent airflow
Runs: Up to 32B at maximum GPU speed with room for large contexts
Extreme Build ($5,000-8,000)
- CPU: AMD Threadripper 7960X or Intel Xeon w5
- RAM: 256 GB DDR5
- GPU: 2x RTX 3090 or RTX A6000 48 GB
- Storage: 4 TB NVMe SSD
- PSU: 1600W 80+ Titanium
- Case: Server chassis or E-ATX tower
Runs: 70B+ models at full GPU speed
Power Consumption and Running Costs
Local AI uses electricity, but less than you might think:
| Setup | Idle | Inference | Monthly Cost (24/7) |
|---|---|---|---|
| CPU only (65W TDP) | 30W | 65-100W | $5-8 |
| RTX 3060 | 15W | 170-200W | $8-15 |
| RTX 3090 | 20W | 300-350W | $15-25 |
| RTX 4090 | 15W | 350-450W | $18-30 |
| M4 MacBook Pro | 5W | 20-40W | $2-4 |
Based on $0.12/kWh US average. Models are only loaded during active use in most setups.
For comparison, ChatGPT Plus costs $20/month, and moderate API usage can easily exceed $50/month. A local setup can pay for itself within months once it replaces a subscription or API bill.
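You can sanity-check the table with watts × hours × rate (the usage hours below are illustrative assumptions):

```shell
# Monthly electricity cost = W x hours/day x 30 days x cents-per-kWh / 1000
monthly_cost() {
  watts=$1; hours_per_day=$2; cents_per_kwh=$3
  cents=$(( watts * hours_per_day * 30 * cents_per_kwh / 1000 ))
  printf '$%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

monthly_cost 350 24 12   # RTX 3090 at full load 24/7: $30.24
monthly_cost 350 2 12    # 2 hours/day of actual inference: $2.52
```

The second figure is the realistic one: GPUs idle at 15-20 W, and most setups only load the model during active use.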
Next Steps
Now that you understand the hardware landscape:
- Already have hardware? Jump to your platform guide: Windows, macOS, or Linux
- Need to choose a model? Read How to Choose the Right Local LLM
- Ready for quick start? Follow Your First Local AI in 5 Minutes
- Want to understand the full stack? See The Local AI Stack