Running Local LLMs on Linux: Ubuntu, Fedora, and Arch Guide

Complete guide to running local LLMs on Linux. Covers NVIDIA CUDA and AMD ROCm setup, Ollama installation, building llama.cpp from source, systemd services, and performance tuning for Ubuntu, Fedora, and Arch.

Linux is the ideal operating system for running local LLMs. It offers the best GPU driver support, the most flexible configuration options, and the full ecosystem of AI tools without the limitations of Windows or macOS. This guide covers GPU setup for both NVIDIA (CUDA) and AMD (ROCm), Ollama installation, building llama.cpp from source for maximum performance, setting up systemd services for production use, and performance tuning across Ubuntu, Fedora, and Arch Linux.

Prerequisites

Check your system:

# Kernel version (5.15+ recommended)
uname -r

# CPU features (AVX2 needed for efficient inference)
lscpu | grep -i avx

# Total RAM
free -h

# List GPUs
lspci | grep -i 'vga\|3d\|display'
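The checks above can be wrapped in a small scriptable helper. A sketch (the `cpu_simd_tier` function name is hypothetical, not a standard tool) that classifies a CPU flags string by its best SIMD tier:

```shell
# cpu_simd_tier: classify a CPU flags string by best SIMD tier (hypothetical helper)
cpu_simd_tier() {
  case " $1 " in
    *" avx512f "*) echo "avx512" ;;   # AVX-512 foundation present
    *" avx2 "*)    echo "avx2" ;;     # good for efficient inference
    *" avx "*)     echo "avx" ;;      # workable but slower
    *)             echo "none" ;;     # expect very slow CPU inference
  esac
}

# Against the live system:
# cpu_simd_tier "$(grep -m1 '^flags' /proc/cpuinfo)"
```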

GPU Setup: NVIDIA (CUDA)

NVIDIA GPUs with CUDA provide the fastest local AI inference on Linux. Setup varies by distribution.

Ubuntu 22.04 / 24.04

Option A: Ubuntu’s packaged driver (simplest)

# List available drivers
ubuntu-drivers list

# Install the recommended driver
sudo ubuntu-drivers install

# Reboot
sudo reboot

Option B: NVIDIA’s official driver

# Add NVIDIA's repository
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update

# Install the latest driver
sudo apt install -y nvidia-driver-560

# Reboot
sudo reboot

Option C: CUDA Toolkit (includes driver)

# Download CUDA installer
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run

# Install
sudo sh cuda_12.6.0_560.28.03_linux.run

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Fedora 39 / 40

# Install RPM Fusion repositories
sudo dnf install -y \
  https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install NVIDIA driver
sudo dnf install -y akmod-nvidia xorg-x11-drv-nvidia-cuda

# Wait for kernel module to build (important!)
sudo akmods --force
sudo dracut --force

# Reboot
sudo reboot

Arch Linux

# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils cuda

# For the latest driver (may be needed for newest GPUs)
# Use the nvidia-dkms package:
sudo pacman -S nvidia-dkms nvidia-utils cuda

# Reboot
sudo reboot

Verify NVIDIA Installation

# Check driver and CUDA version
nvidia-smi

# Expected output:
# Driver Version: 560.xx    CUDA Version: 12.6
# And your GPU listed with memory info

# Check CUDA compiler (if CUDA Toolkit installed)
nvcc --version
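For scripted checks on a headless box, nvidia-smi's machine-readable query output is easier to work with than the default table. The `--query-gpu` flags are real nvidia-smi options; the `parse_gpu_csv` helper is a hypothetical convenience:

```shell
# nvidia-smi emits one comma-separated line per GPU with:
#   nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader
# parse_gpu_csv: turn one such line into a short report (hypothetical helper)
parse_gpu_csv() {
  IFS=',' read -r name total used <<EOF
$1
EOF
  printf '%s: %s used of %s\n' "$name" "${used# }" "${total# }"
}

# nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader | \
#   while read -r line; do parse_gpu_csv "$line"; done
```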

Headless Server Setup

For servers without a display:

# Install driver without OpenGL (headless)
sudo apt install -y nvidia-headless-560 nvidia-utils-560

# Or from NVIDIA .run file:
sudo sh cuda_*.run --no-opengl-files

# Verify
nvidia-smi

GPU Setup: AMD (ROCm)

AMD GPUs use ROCm (Radeon Open Compute) for GPU-accelerated inference on Linux.

Supported AMD GPUs

Officially supported by ROCm:

  • RX 7900 XTX, 7900 XT, 7800 XT, 7700 XT (RDNA 3)
  • RX 6900 XT, 6800 XT, 6800, 6700 XT (RDNA 2)
  • Radeon PRO W7900, W7800
  • Instinct MI210, MI250, MI300X

Community supported (may need HSA_OVERRIDE_GFX_VERSION):

  • RX 7600, 6600 XT, 6600
  • Older RDNA/GCN GPUs

Ubuntu ROCm Installation

# Add AMD's repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt install -y ./amdgpu-install_6.0.60000-1_all.deb

# Install ROCm
sudo amdgpu-install -y --usecase=rocm

# Add user to required groups
sudo usermod -aG render,video $USER

# Reboot
sudo reboot

Fedora ROCm Installation

# ROCm support on Fedora requires manual setup
# Install from AMD's repository or build from source
# Check AMD's documentation for the latest Fedora instructions

# Alternative: use Ollama, which bundles its own ROCm libraries
curl -fsSL https://ollama.com/install.sh | sh

Arch Linux ROCm

# Install ROCm from AUR or official repos
sudo pacman -S rocm-hip-sdk rocm-opencl-sdk

# Add user to groups
sudo usermod -aG render,video $USER

# Reboot
sudo reboot

Verify ROCm Installation

# Check ROCm
rocm-smi

# Check HIP (ROCm's CUDA equivalent)
hipconfig --full

# For unsupported GPUs, set override
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # For RDNA 3
export HSA_OVERRIDE_GFX_VERSION=10.3.0  # For RDNA 2
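The right override value follows from your GPU's gfx target (gfx1100 → 11.0.0, gfx1030 → 10.3.0). A small lookup for scripting this, covering the RDNA generations above (`hsa_override_for` is a hypothetical helper):

```shell
# hsa_override_for: map an amdgpu gfx target to an HSA_OVERRIDE_GFX_VERSION value
# (hypothetical helper; covers the common RDNA targets from this guide)
hsa_override_for() {
  case "$1" in
    gfx110*) echo "11.0.0" ;;  # RDNA 3 (RX 7000 series)
    gfx103*) echo "10.3.0" ;;  # RDNA 2 (RX 6000 series)
    *)       echo "" ;;        # unknown: leave unset
  esac
}

# export HSA_OVERRIDE_GFX_VERSION="$(hsa_override_for gfx1102)"  # e.g. RX 7600
```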

Installing Ollama

Ollama is the recommended starting point for all Linux distributions.

One-Line Install

curl -fsSL https://ollama.com/install.sh | sh

This script:

  • Downloads the Ollama binary to /usr/local/bin/ollama
  • Creates an ollama system user
  • Creates a systemd service
  • Starts Ollama automatically
  • Detects and configures NVIDIA CUDA or AMD ROCm

Manual Install

If you prefer not to pipe to shell:

# Download binary
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama


# Create user
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo usermod -aG render,video ollama  # For AMD GPU access

# Create systemd service
sudo tee /etc/systemd/system/ollama.service << 'EOF'
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HOME=/usr/share/ollama"
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=default.target
EOF

# Start and enable
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama

Verify Installation

# Check service status
systemctl status ollama

# Check version
ollama --version

# Run a model
ollama run llama3.1:8b

# Check GPU is being used
ollama run llama3.1:8b --verbose

Ollama Configuration

Configure via environment variables in the systemd service:

# Edit the service
sudo systemctl edit ollama

# Add environment variables in the override file:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=30m"

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama

Key environment variables:

Variable                  Default            Description
OLLAMA_HOST               127.0.0.1:11434    Listen address (use 0.0.0.0 for network access)
OLLAMA_MODELS             ~/.ollama/models   Model storage directory
OLLAMA_NUM_PARALLEL       1                  Concurrent request slots
OLLAMA_MAX_LOADED_MODELS  1                  Models loaded simultaneously
OLLAMA_KEEP_ALIVE         5m                 How long to keep a model in memory
OLLAMA_NUM_GPU            auto               Number of GPU layers
CUDA_VISIBLE_DEVICES      all                Which GPUs to use
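Once the server is up, any HTTP client can talk to it. The /api/generate endpoint and its model/prompt/stream fields are Ollama's documented API; the `build_generate_request` helper is a hypothetical convenience for one-off scripting:

```shell
# build_generate_request: assemble a JSON body for Ollama's /api/generate endpoint
# (hypothetical helper; keep prompts free of unescaped quotes, or use jq instead)
build_generate_request() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# curl -s http://localhost:11434/api/generate \
#   -d "$(build_generate_request llama3.1:8b 'Why is the sky blue?')"
```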

Building llama.cpp from Source

Building from source gives you maximum performance with optimizations for your specific hardware.

NVIDIA GPU Build

# Install dependencies
sudo apt install -y build-essential cmake git  # Ubuntu
sudo dnf install -y gcc gcc-c++ cmake git      # Fedora
sudo pacman -S base-devel cmake git             # Arch

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA
cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify
./build/bin/llama-cli --version

AMD GPU Build

cd llama.cpp

# Build with ROCm/HIP
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DAMDGPU_TARGETS="gfx1100;gfx1030"  # Adjust for your GPU
cmake --build build --config Release -j$(nproc)

AMD GPU target strings:

  • gfx1100 — RDNA 3 (RX 7900 series)
  • gfx1030 — RDNA 2 (RX 6800/6900 series)
  • gfx1010 — RDNA 1 (RX 5700 series)

CPU-Only Build (Optimized)

cd llama.cpp

cmake -B build \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

The -DGGML_NATIVE=ON flag enables CPU-specific optimizations (AVX2, AVX-512, etc.) for your exact processor.

Running llama.cpp Server

# Download a model (e.g., from Hugging Face)
wget https://huggingface.co/bartowski/Llama-3.1-8B-Instruct-GGUF/resolve/main/Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run the server
./build/bin/llama-server \
  -m Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --threads $(nproc)
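llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint alongside its native API. A sketch of a single-turn request body builder (`build_chat_request` is a hypothetical name; the JSON shape follows the standard OpenAI chat format):

```shell
# build_chat_request: single-turn chat body for /v1/chat/completions
# (hypothetical helper; content must not contain unescaped JSON quotes)
build_chat_request() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}' "$1" "$2"
}

# curl -s http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d "$(build_chat_request 'Say hello' 64)"
```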

llama.cpp as a systemd Service

sudo tee /etc/systemd/system/llamacpp.service << 'EOF'
[Unit]
Description=llama.cpp Server
After=network.target

[Service]
Type=simple
User=llamacpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  -m /opt/models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable llamacpp
sudo systemctl start llamacpp

Installing vLLM (Production Serving)

vLLM is the go-to for serving models to multiple users with high throughput.

# Install via pip (requires Python 3.9+)
pip install vllm

# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

# Multi-GPU (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192

vLLM as a systemd Service

sudo tee /etc/systemd/system/vllm.service << 'EOF'
[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=vllm
Environment="CUDA_VISIBLE_DEVICES=0,1"
ExecStart=/usr/local/bin/vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm

Performance Tuning

NVIDIA GPU Tuning

# Set persistence mode (keeps GPU initialized)
sudo nvidia-smi -pm 1

# Lock application clocks for consistent performance
# (values are GPU-specific; list valid pairs with: nvidia-smi -q -d SUPPORTED_CLOCKS)
sudo nvidia-smi -ac 1593,1410  # Memory,Graphics clocks

# Disable ECC for consumer GPUs (frees ~6% VRAM on some cards)
# WARNING: Only do this if you understand the trade-off
sudo nvidia-smi --ecc-config=0
# Requires reboot

# Monitor GPU during inference
watch -n 0.5 nvidia-smi

CPU Tuning

# Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Make it persistent (Ubuntu)
sudo apt install -y cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtils

# Fedora: use cpupower from kernel-tools
sudo dnf install -y kernel-tools
sudo cpupower frequency-set -g performance

# Verify
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Memory Tuning

# Increase huge pages (improves memory access patterns)
echo 'vm.nr_hugepages=1024' | sudo tee -a /etc/sysctl.d/99-hugepages.conf
sudo sysctl --system

# Disable transparent huge pages if causing latency spikes
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Check NUMA topology (for multi-socket systems)
numactl --hardware
# Run Ollama pinned to the NUMA node nearest your GPU:
numactl --cpunodebind=0 --membind=0 ollama serve

IO Tuning

# Use a fast NVMe SSD for model storage
# Check disk speed
sudo hdparm -tT /dev/nvme0n1

# Mount with noatime for slightly better IO
# In /etc/fstab, add noatime to your mount options

Network Tuning (For Multi-User Serving)

# Increase connection limits
echo 'net.core.somaxconn=65535' | sudo tee -a /etc/sysctl.d/99-network.conf
echo 'net.ipv4.tcp_max_syn_backlog=65535' | sudo tee -a /etc/sysctl.d/99-network.conf
sudo sysctl --system

Docker Setup

Install Docker

# Ubuntu
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER

# Fedora
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable docker
sudo systemctl start docker

# Arch
sudo pacman -S docker
sudo systemctl enable docker
sudo systemctl start docker

NVIDIA Container Toolkit

# Add repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi

Run Ollama in Docker

# NVIDIA GPU
docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  --restart unless-stopped \
  ollama/ollama

# AMD GPU
docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  --restart unless-stopped \
  ollama/ollama:rocm

# Pull a model
docker exec -it ollama ollama pull llama3.1:8b
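The docker run flags above can also be captured in a Compose file, giving you one declarative config that survives host reboots. A sketch for the NVIDIA case (the file name and layout are assumptions; the deploy.resources device reservation is Compose's standard GPU syntax, and the NVIDIA Container Toolkit must already be configured):

```yaml
# docker-compose.yml (sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```

Start it with docker compose up -d.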

Firewall Configuration

If serving models over the network:

# UFW (Ubuntu)
sudo ufw allow 11434/tcp  # Ollama
sudo ufw allow 8080/tcp   # llama.cpp server
sudo ufw allow 3000/tcp   # Open WebUI

# firewalld (Fedora)
sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

# iptables (Arch/any)
sudo iptables -A INPUT -p tcp --dport 11434 -j ACCEPT

Monitoring and Logging

GPU Monitoring

# Real-time NVIDIA monitoring
watch -n 0.5 nvidia-smi

# Detailed per-process GPU usage
nvidia-smi pmon -i 0 -s u -d 1

# AMD GPU monitoring
watch -n 1 rocm-smi

# Install nvtop for a nice TUI
sudo apt install -y nvtop  # Ubuntu
sudo dnf install -y nvtop  # Fedora
sudo pacman -S nvtop       # Arch
nvtop

Ollama Logs

# View service logs
journalctl -u ollama -f

# View last 100 lines
journalctl -u ollama -n 100

# Filter by time
journalctl -u ollama --since "1 hour ago"

System Monitoring During Inference

# Combined monitoring
htop    # CPU and RAM
nvtop   # GPU
iotop   # Disk IO

# All in one terminal with tmux
tmux new-session -d -s monitor 'htop' \; \
  split-window -h 'nvtop' \; \
  attach

Troubleshooting

NVIDIA Driver Issues

# Check if driver is loaded
lsmod | grep nvidia

# If not loaded, try:
sudo modprobe nvidia

# If modprobe fails, check secure boot
mokutil --sb-state
# If enabled, either disable secure boot or sign the NVIDIA module

# Check dmesg for errors
dmesg | grep -i nvidia

ROCm Not Detecting GPU

# Check permissions
groups $USER  # Should include render and video

# Check device nodes
ls -la /dev/kfd /dev/dri/render*

# Try override for unsupported GPU
export HSA_OVERRIDE_GFX_VERSION=11.0.0
rocm-smi

Ollama CUDA Errors

# Common fix: reinstall Ollama to get matching CUDA libraries
curl -fsSL https://ollama.com/install.sh | sh

# Check CUDA library path
ldconfig -p | grep cuda

# Set library path explicitly
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Out of Memory

# Check available VRAM
nvidia-smi

# Check system RAM
free -h

# Kill other GPU processes
nvidia-smi  # Note PIDs
kill <pid>

# Use a smaller model or lower quantization
ollama run llama3.1:8b-instruct-q3_K_M
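A rough rule of thumb before downloading a model: weight memory ≈ parameters × bits per weight / 8, plus a few GB for KV cache and runtime overhead. The `est_weight_gb` helper below is a hypothetical sketch of that arithmetic:

```shell
# est_weight_gb: approximate weight memory in GB for a model
#   $1 = parameters in billions, $2 = quantization bits per weight
# (hypothetical helper; ignores KV cache and runtime overhead)
est_weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'
}

# An 8B model at ~4.5 effective bits (Q4_K_M) needs roughly:
# est_weight_gb 8 4.5   # weights alone, before cache and overhead
```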

Next Steps

Your Linux system is fully configured for local AI. From here, pair your inference server with Open WebUI for a browser interface, experiment with different models and quantizations in Ollama, or move to vLLM when you need high-throughput multi-user serving.

Frequently Asked Questions

Which Linux distribution is best for local AI?

Ubuntu 22.04/24.04 LTS has the best out-of-box support because NVIDIA, AMD, and most AI frameworks target it first. Fedora is also excellent, with solid NVIDIA support via RPM Fusion. Arch works well but requires more manual setup. The inference engines themselves (Ollama, llama.cpp) run on any distribution.

How do I check if my GPU is being used for inference?

For NVIDIA, run nvidia-smi and look for your process and its VRAM usage in the process list. For AMD, use rocm-smi. You can also run 'ollama run model --verbose', which reports the backend in use and the tokens-per-second rate. If a 7B model generates only 5-15 t/s, you're likely running on CPU.
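That heuristic can be scripted; the threshold below is a rough assumption for a 7-8B Q4 model on modern hardware, and `classify_speed` is a hypothetical helper:

```shell
# classify_speed: rough CPU-vs-GPU guess from tokens/sec for a 7-8B Q4 model
# (hypothetical helper; the 20 t/s threshold is an assumption, not a standard)
classify_speed() {
  awk -v t="$1" 'BEGIN { if (t < 20) print "likely CPU"; else print "likely GPU" }'
}
```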

Can I run local AI on a headless Linux server?

Yes. Ollama, llama.cpp, and vLLM all work perfectly on headless servers. You don't need a display, desktop environment, or X server. Install NVIDIA drivers with --no-opengl-files flag for headless setups, and serve models via API. Pair with Open WebUI for a web interface accessible from other machines.

How do I make Ollama start automatically on boot?

The Ollama install script creates a systemd service automatically. Verify with 'systemctl status ollama'. If it's not enabled, run 'sudo systemctl enable ollama'. The service runs as the 'ollama' user and starts automatically on boot.