Running Local LLMs on Windows: Complete Setup Guide

Step-by-step guide to running local LLMs on Windows with NVIDIA and AMD GPUs. Covers Ollama, LM Studio, WSL2, CUDA setup, and troubleshooting common issues.

Windows is the most popular desktop operating system, and it runs local LLMs well thanks to native support from Ollama, LM Studio, and other tools. This guide walks you through the complete setup process, from GPU driver installation to running your first model, with specific instructions for both NVIDIA and AMD hardware. Whether you want a simple one-click experience or a full development environment with WSL2, this guide covers every path.

Prerequisites

Before starting, verify your system meets these requirements:

  • Windows 10 version 21H2 or later, or Windows 11
  • 8 GB RAM minimum (16 GB recommended)
  • 10 GB free disk space (more for models)
  • A 64-bit processor with AVX2 support (most Intel CPUs since 2013, most AMD CPUs since 2015)

Check your system:

# Open PowerShell and check Windows version
winver

# Check CPU features (look for AVX2)
wmic cpu get Name, NumberOfCores, NumberOfLogicalProcessors

Step 1: GPU Driver Setup

NVIDIA GPU Setup

NVIDIA GPUs offer the best local AI experience on Windows. You need the latest Game Ready or Studio drivers.

Install or update drivers:

  1. Open GeForce Experience or go to nvidia.com/drivers
  2. Download the latest driver for your GPU
  3. Install with “Express” settings
  4. Restart your computer

Verify the installation:

# Open PowerShell or Command Prompt
nvidia-smi

You should see output showing your GPU name, driver version, and CUDA version. If nvidia-smi is not recognized, the drivers didn’t install correctly.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 560.xx   Driver Version: 560.xx   CUDA Version: 12.6             |
|-------------------------------+----------------------+----------------------|
| GPU  Name                     | VRAM Usage           | GPU Utilization      |
|-------------------------------+----------------------+----------------------|
|   0  RTX 4090                 | 0MiB / 24GB          | 0%                   |
+-------------------------------+----------------------+----------------------+

CUDA Toolkit (optional, needed for building from source):

Download from developer.nvidia.com/cuda-downloads. For Ollama and LM Studio, you do NOT need the CUDA Toolkit — the drivers alone are sufficient.

AMD GPU Setup

AMD GPU support on Windows uses ROCm (for Ollama) or Vulkan (for LM Studio and llama.cpp).

  1. Download the latest AMD Adrenalin drivers from amd.com/drivers
  2. Install with default settings
  3. Restart your computer

Supported GPUs: RX 6600 and newer (RDNA 2+) for best compatibility. Older GCN GPUs may work with Vulkan but performance will be limited.

Intel Arc GPU Setup

Intel Arc support is experimental. Install the latest Intel drivers and try LM Studio or llama.cpp with Vulkan backend.

Step 2: Install Ollama (Recommended)

Ollama is the fastest path to running a local LLM on Windows.

Installation

  1. Download the installer from ollama.com/download
  2. Run the .exe installer
  3. Follow the prompts (default settings are fine)
  4. Ollama starts automatically and runs in the system tray

Verify Installation

Open PowerShell or Command Prompt:

ollama --version

Run Your First Model

# Download and run Llama 3.1 8B
ollama run llama3.1:8b

The first run downloads the model (~4.7 GB). Subsequent runs start instantly.

# Try different models
ollama run phi3:mini          # Small and fast (2.3 GB)
ollama run qwen2.5:7b         # Strong general purpose
ollama run qwen2.5-coder:7b   # Best for coding
ollama run llama3.1:70b       # Large model (needs 48+ GB RAM/VRAM)
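As a rough way to size models against your hardware: quantized weights take about parameters × bits-per-weight ÷ 8 bytes, plus runtime overhead. A quick sketch of that arithmetic (the 1.2× overhead factor is an assumption, not a measured value):

```python
# Rule-of-thumb VRAM estimate: weights only, plus a fudge factor for
# buffers and runtime overhead. Real usage also grows with context length.
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Approximate memory needed to hold a quantized model's weights."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(estimate_vram_gb(8, 4))   # 8B model at 4-bit: about 4.8 GB
print(estimate_vram_gb(70, 4))  # 70B model at 4-bit: about 42 GB
```

The numbers line up with the sizes above: llama3.1:8b is a ~4.7 GB Q4 download, and the 70B model lands in "needs 48+ GB" territory once overhead and context are included.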

Verify GPU Is Being Used

While a model is running, check GPU utilization:

# In a new PowerShell window
nvidia-smi

# Or open Task Manager > Performance > GPU
# Look for "3D" or "Compute" usage

If the GPU shows 0% utilization while the model is generating text, Ollama is running on CPU. This usually means a driver issue.
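If you want to script this check, nvidia-smi can emit machine-readable output with --query-gpu=utilization.gpu --format=csv,noheader,nounits (one percentage per GPU line). A small sketch; the parser is a pure function, so the sample string below stands in for real output:

```python
import subprocess

def parse_gpu_utilization(csv_output: str) -> list:
    """Parse nvidia-smi CSV output: one integer percentage per GPU line."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

def gpu_utilization() -> list:
    """Query the real tool; raises FileNotFoundError if drivers are missing."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_utilization(result.stdout)

# While a model is generating, you expect nonzero numbers here:
print(parse_gpu_utilization("87\n"))  # [87]
```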

Ollama Configuration on Windows

Ollama stores models in C:\Users\<username>\.ollama\models by default. To change the storage location:

# Set environment variable (persistent)
[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama\models", "User")

# Restart Ollama from system tray

Other useful environment variables:

# Change API listen address (default: localhost only)
[System.Environment]::SetEnvironmentVariable("OLLAMA_HOST", "0.0.0.0:11434", "User")

# Set GPU layers (useful for partial offload)
[System.Environment]::SetEnvironmentVariable("OLLAMA_NUM_GPU", "35", "User")

# Limit models loaded simultaneously
[System.Environment]::SetEnvironmentVariable("OLLAMA_MAX_LOADED_MODELS", "2", "User")
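If your own scripts need to agree with Ollama about where models live or where the API listens, you can mirror the same fallbacks (the defaults below are the ones described above; the helper functions themselves are illustrative, not part of Ollama):

```python
import os
from pathlib import Path

def ollama_models_dir() -> Path:
    """OLLAMA_MODELS if set, otherwise the per-user default location."""
    custom = os.environ.get("OLLAMA_MODELS")
    return Path(custom) if custom else Path.home() / ".ollama" / "models"

def ollama_host() -> str:
    """OLLAMA_HOST if set, otherwise the localhost-only default."""
    return os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")

print(ollama_models_dir())
print(ollama_host())
```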

Step 3: Install LM Studio (Alternative)

LM Studio provides a graphical desktop experience with an integrated model browser.

Installation

  1. Download from lmstudio.ai
  2. Run the installer
  3. Launch LM Studio

Using LM Studio

Browse and download models:

  1. Click the Search icon in the left sidebar
  2. Search for a model (e.g., “llama 3.1 8b”)
  3. Select a quantization level (Q4_K_M recommended)
  4. Click Download

Chat with a model:

  1. Click the Chat icon
  2. Select your downloaded model from the dropdown
  3. Start typing

Run as a local server:

  1. Click the Server icon
  2. Load a model
  3. Toggle the server on
  4. The API is available at http://localhost:1234/v1
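Because the endpoint speaks the OpenAI chat-completions format, any OpenAI-compatible client can talk to it. A standard-library sketch (the model name is a placeholder for whatever you loaded; the actual send, shown in the comment, needs the server toggled on):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:1234/v1"):
    """Build an OpenAI-style chat completion request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})

# With the server running:
# with urllib.request.urlopen(build_chat_request("llama-3.1-8b", "Hi")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
req = build_chat_request("llama-3.1-8b", "Hi")
print(req.full_url)  # http://localhost:1234/v1/chat/completions
```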

LM Studio is particularly good for browsing Hugging Face models and trying different quantization levels without command-line work.

Step 4: WSL2 Setup (Advanced)

Windows Subsystem for Linux 2 gives you a full Linux environment inside Windows. This is useful for:

  • Running vLLM (which doesn’t support native Windows)
  • Using Docker with GPU passthrough
  • Running Linux-only tools
  • Building llama.cpp from source with full optimization

Install WSL2

# Open PowerShell as Administrator
wsl --install

# This installs Ubuntu by default. Restart when prompted.
# After restart, set up your Linux username and password.

Enable GPU in WSL2

NVIDIA GPU passthrough works automatically in WSL2 if you have the correct Windows drivers installed. No separate Linux driver is needed.

# Inside WSL2, verify GPU access
nvidia-smi

If nvidia-smi works, your GPU is available inside WSL2.

Install Ollama in WSL2

curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama (it may need to be started manually in WSL)
ollama serve &

# Run a model
ollama run llama3.1:8b

Docker with GPU Support in WSL2

# Install Docker in WSL2
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU in Docker
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi

# Run Ollama in Docker with GPU
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Step 5: Add a Web Interface

For a ChatGPT-like experience, install Open WebUI.

Using Docker Desktop

Install Docker Desktop for Windows first, then run the following from Command Prompt (the ^ line continuations are cmd syntax; in PowerShell, use a backtick instead):

docker run -d -p 3000:8080 ^
  --add-host=host.docker.internal:host-gateway ^
  -v open-webui:/app/backend/data ^
  --name open-webui ^
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser.

Without Docker

If you don’t want Docker, Open WebUI can be installed via pip:

pip install open-webui
open-webui serve

Performance Optimization

NVIDIA GPU Optimization

# Set GPU to maximum performance mode
# Open NVIDIA Control Panel > Manage 3D Settings
# Set "Power management mode" to "Prefer maximum performance"

# Or via command line (requires admin):
nvidia-smi -pm 1
nvidia-smi -pl 350  # Set power limit (adjust for your GPU)

Windows-Specific Tweaks

Disable memory compression (can interfere with model loading):

# PowerShell as Administrator
Disable-MMAgent -MemoryCompression

Increase virtual memory (for models larger than available RAM):

  1. Open System Properties > Advanced > Performance Settings
  2. Advanced tab > Virtual Memory > Change
  3. Set custom size: Initial = RAM size, Maximum = 2x RAM size

Exclude Ollama from antivirus scanning:

  1. Open Windows Security > Virus & threat protection
  2. Manage settings > Exclusions > Add exclusion
  3. Add folder: C:\Users\<username>\.ollama
  4. Add process: ollama.exe

Context Length and Memory

Longer conversations use more memory. If you run out of VRAM mid-conversation:

# Reduce context length in Ollama
# Create a Modelfile
echo "FROM llama3.1:8b" > Modelfile
echo "PARAMETER num_ctx 2048" >> Modelfile
ollama create llama3.1-short -f Modelfile
ollama run llama3.1-short
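The memory that grows with conversation length is the KV cache: roughly 2 (keys and values) × layers × KV heads × head dimension × bytes per value, per token of context. The defaults below are Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) with an fp16 cache; treat the result as an estimate:

```python
def kv_cache_mb(num_ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in MiB for a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return num_ctx * per_token / (1024 ** 2)

print(kv_cache_mb(2048))    # 256.0 MiB at the reduced context above
print(kv_cache_mb(131072))  # 16384.0 MiB at the full 128K context
```

This is why dropping num_ctx to 2048 can rescue a session that was running out of VRAM.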

Troubleshooting

“CUDA not available” or Model Running on CPU

# Check NVIDIA driver
nvidia-smi

# If nvidia-smi fails:
# 1. Reinstall NVIDIA drivers from nvidia.com/drivers
# 2. Make sure to select "Clean install" option
# 3. Restart computer

# Check Ollama GPU detection while a model is loaded
ollama ps
# The PROCESSOR column should show "100% GPU"; a CPU/GPU split means partial offload

Ollama Won’t Start

# Check if Ollama is already running
tasklist | findstr ollama

# Kill existing instance
taskkill /f /im ollama.exe

# Restart
ollama serve

# Check logs (Command Prompt syntax; in PowerShell: type $env:LOCALAPPDATA\Ollama\server.log)
type %LOCALAPPDATA%\Ollama\server.log

Model Download Fails

# Check disk space
wmic logicaldisk get size,freespace,caption

# If space is low, move models to another drive
[System.Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama\models", "User")

# Retry download
ollama pull llama3.1:8b
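The same disk-space check can be scripted; a small sketch using Python's standard library (the ~5 GB threshold matches the llama3.1:8b download size mentioned earlier):

```python
import shutil

def free_gb(path: str = ".") -> float:
    """Free space in GB on the drive containing `path`."""
    return shutil.disk_usage(path).free / 1e9

# llama3.1:8b needs roughly 5 GB of free space to download
if free_gb() < 5:
    print("Low disk space: point OLLAMA_MODELS at another drive")
else:
    print("Enough room for the download")
```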

Out of Memory Errors

# Check how much VRAM is available
nvidia-smi

# Try a smaller model
ollama run phi3:mini

# Or use a lower quantization
ollama run llama3.1:8b-instruct-q3_K_M

# Close other GPU applications (games, video editors, Chrome with hardware acceleration)

WSL2 GPU Not Detected

# In Windows PowerShell (admin)
wsl --update

# Ensure Windows NVIDIA driver is up to date (not a Linux driver!)
# The WSL2 kernel uses the Windows driver through a shim

# Restart WSL
wsl --shutdown
wsl

Recommended Setups

Here’s a recommended setup for different Windows users:

Casual User

Install LM Studio. Browse models in the GUI, download what interests you, chat directly. No command line needed.

Developer

Install Ollama natively. Use it from PowerShell and integrate with your IDE via Continue or similar extensions. Use the API for scripting.
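As a sketch of that scripting workflow, Ollama's native REST API accepts POSTs to /api/generate on port 11434 (endpoint and field names are from Ollama's API; the send step is left as a comment so it only runs against a live instance):

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "llama3.1:8b",
                           host: str = "http://localhost:11434"):
    """Non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})

# With Ollama running, read the "response" field of the reply:
# with urllib.request.urlopen(build_generate_request("Explain WSL2")) as r:
#     print(json.load(r)["response"])
```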

Power User

Install both Ollama (for daily use) and WSL2 (for advanced tools). Run Open WebUI in Docker Desktop for a web interface. Use WSL2 for vLLM, fine-tuning, or other Linux-specific tools.

Team Lead

Set up Ollama + Open WebUI in Docker Desktop. Configure Open WebUI with user accounts for your team. See our Open WebUI guide for multi-user setup details.

Next Steps

Your Windows machine is now running local AI. From here, experiment with more models via ollama pull, wire the API into your editor, or add Open WebUI for a browser-based interface.

Frequently Asked Questions

Does Ollama work on Windows without WSL?

Yes. Ollama has a native Windows installer that runs directly on Windows without WSL2. It automatically detects NVIDIA GPUs via CUDA and supported AMD GPUs via ROCm. The native installer is the recommended approach for most users.

Should I use native Windows or WSL2 for local AI?

For most users, native Windows with Ollama or LM Studio is simpler and works well. Use WSL2 if you need Linux-specific tools, want to run vLLM, need Docker with GPU passthrough, or prefer a Linux development environment. WSL2 adds complexity but gives you access to the full Linux AI ecosystem.

Why is my model running slowly on Windows?

The most common causes are: (1) the model is running on CPU instead of GPU; check GPU utilization with nvidia-smi or Task Manager. (2) NVIDIA drivers are outdated; update to the latest version. (3) The model is too large for your VRAM, forcing partial CPU offload. (4) Antivirus software is interfering; add Ollama to your exclusion list.

Can I use an AMD GPU for local AI on Windows?

Yes, with limitations. Ollama supports AMD GPUs on Windows via ROCm for RDNA 2 and newer GPUs (RX 6000/7000 series). LM Studio also supports AMD GPUs via Vulkan. Performance is typically 60-80% of equivalent NVIDIA hardware due to less mature software optimization. For the best AMD experience, consider running Linux instead.