Building a Local Voice Assistant: Whisper + LLM + TTS

Build a fully local voice assistant pipeline with speech-to-text (Whisper.cpp), an LLM for processing (Ollama), and text-to-speech (Piper/Kokoro). Includes latency optimization and wake word detection.

A local voice assistant runs entirely on your hardware: your voice never leaves your machine, there is no cloud dependency, and it works offline. The pipeline consists of three components: speech-to-text (STT) to convert your voice to text, a large language model (LLM) to generate a response, and text-to-speech (TTS) to speak the response aloud. This guide walks through building each component, connecting them into a pipeline, and optimizing latency for a conversational experience.

Architecture Overview

┌───────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Microphone│ ──> │ Whisper  │ ──> │  Ollama  │ ──> │ Piper/   │ ──> Speaker
│  (Audio)  │     │  (STT)   │     │  (LLM)   │     │ Kokoro   │
│           │     │          │     │          │     │  (TTS)   │
└───────────┘     └──────────┘     └──────────┘     └──────────┘
    Input          Text from        Generated        Audio
    audio          speech           response         output

Optional additions:

  • Wake word detection: Trigger the pipeline with a keyword like “Hey Computer”
  • Streaming TTS: Start speaking before the full LLM response is ready
  • Voice Activity Detection (VAD): Automatically detect when you start/stop speaking

Part 1: Speech-to-Text with Whisper.cpp

Whisper.cpp is a C/C++ port of OpenAI’s Whisper model, optimized for local inference.

Install Whisper.cpp

# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# For NVIDIA GPU acceleration
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Download Whisper Models

# Download models (from whisper.cpp directory)
bash models/download-ggml-model.sh tiny.en     # 75 MB, fast, English only
bash models/download-ggml-model.sh base.en     # 142 MB, good balance
bash models/download-ggml-model.sh small.en    # 466 MB, better accuracy
bash models/download-ggml-model.sh medium.en   # 1.5 GB, high accuracy
bash models/download-ggml-model.sh large-v3    # 3.1 GB, best accuracy, multilingual

Whisper Model Comparison

Model      Size    English WER  Speed (CPU)  Speed (GPU)  Languages
tiny.en    75 MB   8.2%         Real-time    10x RT       English
base.en    142 MB  5.9%         ~Real-time   8x RT        English
small.en   466 MB  4.4%         0.5x RT      6x RT        English
medium.en  1.5 GB  3.8%         0.2x RT      4x RT        English
large-v3   3.1 GB  3.0%         0.1x RT      2x RT        99 languages

WER = Word Error Rate (lower is better). RT = Real-Time (1x = transcribes as fast as audio plays).
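The RT column converts directly into wall-clock time: transcription time is clip length divided by the RT factor. A quick sanity check:

```python
# Estimate transcription time from the RT (real-time) factor in the table above.
def transcription_seconds(audio_seconds, rt_factor):
    """rt_factor: 1.0 = as fast as playback, 10.0 = ten times faster."""
    return audio_seconds / rt_factor

# A 30-second clip with tiny.en on GPU (~10x RT) takes about 3 seconds:
print(transcription_seconds(30, 10.0))  # 3.0
# The same clip with large-v3 on CPU (~0.1x RT) takes about 5 minutes:
print(transcription_seconds(30, 0.1))  # 300.0
```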

Test Whisper

# Transcribe a WAV file
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav

# Real-time microphone transcription (requires a build with -DWHISPER_SDL2=ON)
./build/bin/whisper-stream -m models/ggml-base.en.bin --step 3000 --length 8000

# Output as text only (no timestamps)
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav --no-timestamps --output-txt

Whisper as a Server

# Run Whisper HTTP server
./build/bin/whisper-server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8178

# Send audio for transcription
curl -X POST http://localhost:8178/inference \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F response_format=json
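The same endpoint is easy to call from Python. A minimal client sketch, assuming the `/inference` route and the `file`/`response_format` fields from the curl example above; the `text` key in the JSON reply is an assumption, so verify against your build's server output:

```python
import requests

def transcribe_via_server(audio_path, host="localhost", port=8178):
    """Send a WAV file to whisper-server and return the transcribed text."""
    url = f"http://{host}:{port}/inference"
    with open(audio_path, "rb") as f:
        resp = requests.post(
            url,
            files={"file": f},
            data={"response_format": "json"},
        )
    resp.raise_for_status()
    return resp.json().get("text", "").strip()

# Usage (with whisper-server running):
# text = transcribe_via_server("audio.wav")
```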

Python Whisper Alternative

If you prefer Python over C++:

pip install faster-whisper

# faster-whisper uses CTranslate2 for optimized inference
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cuda", compute_type="float16")
# For CPU: device="cpu", compute_type="int8"

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Part 2: LLM Processing with Ollama

The LLM receives the transcribed text and generates a response.

Model Selection for Voice

Voice assistants need models that are:

  • Fast: Low latency is critical for conversation
  • Concise: Long responses feel unnatural when spoken
  • Instruction-following: Must respond appropriately to conversational queries

VRAM      Recommended Model  Why
4 GB      Phi-3 Mini 3.8B    Fast, good instruction following
6-8 GB    Llama 3.1 8B       Best all-around at this size
12-16 GB  Qwen 2.5 14B       Higher quality responses
24 GB     Qwen 2.5 32B Q4    Near-cloud quality
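If you script your setup, the table can live in code. A small helper sketch; the Ollama tag strings here are illustrative assumptions, so confirm the exact tags with `ollama list` or the Ollama model library:

```python
# Map available VRAM (GB) to a suggested model tag, following the table above.
# Tag strings are illustrative; check the Ollama library for exact names.
def recommend_model(vram_gb):
    if vram_gb >= 24:
        return "qwen2.5:32b"  # run a Q4 quant at this size
    if vram_gb >= 12:
        return "qwen2.5:14b"
    if vram_gb >= 6:
        return "llama3.1:8b"
    return "phi3:mini"

print(recommend_model(8))   # llama3.1:8b
print(recommend_model(4))   # phi3:mini
```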

Configure for Voice

Create a voice-optimized Modelfile:

cat > ~/Modelfile-voice << 'EOF'
FROM llama3.1:8b

SYSTEM """You are a helpful voice assistant. Follow these rules:
1. Keep responses concise (1-3 sentences for simple questions)
2. Use natural, spoken language (avoid bullet points, code blocks, markdown)
3. Be conversational and friendly
4. If the question is ambiguous, ask for clarification
5. For complex topics, give a brief answer and offer to explain more"""

PARAMETER temperature 0.7
PARAMETER num_ctx 2048
PARAMETER num_predict 256
PARAMETER stop "<|eot_id|>"
EOF

ollama create voice-assistant -f ~/Modelfile-voice
ollama run voice-assistant

Streaming API for Lower Latency

import requests
import json

def get_llm_response_stream(text):
    """Stream LLM response token by token for lower time-to-first-token."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "voice-assistant",
            "prompt": text,
            "stream": True,
        },
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")  # yield each token as it arrives

Part 3: Text-to-Speech

Piper is a fast, lightweight TTS engine that runs in real-time on CPU.

# Install Piper
pip install piper-tts

# Or download binary
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz

# Download a voice model
# Browse voices at: https://rhasspy.github.io/piper-samples/
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices

# US English, medium quality (recommended starting point)
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

Using Piper

# Command line
echo "Hello, how can I help you today?" | \
  piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
  --output_file response.wav

# Play immediately
echo "Hello, how can I help you today?" | \
  piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
  --output-raw | aplay -r 22050 -f S16_LE -t raw -

Piper in Python

import subprocess

def text_to_speech_piper(text, model_path, output_file="response.wav"):
    """Convert text to speech using Piper."""
    result = subprocess.run(
        ["piper", "--model", model_path, "--output_file", output_file],
        input=text.encode(),
        capture_output=True,
    )
    result.check_returncode()  # raise CalledProcessError if Piper failed
    return output_file

Option B: Kokoro TTS (Higher Quality)

Kokoro produces more natural-sounding speech but requires more resources.

pip install kokoro-onnx soundfile

# Download the model and voices file (published on the kokoro-onnx
# GitHub releases page)
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

from kokoro_onnx import Kokoro
import soundfile as sf

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# Generate speech
audio, sr = kokoro.create(
    "Hello, I'm your local AI assistant. How can I help you today?",
    voice="af_heart",  # Available: af_heart, af_bella, am_adam, am_michael, etc.
    speed=1.0,
)

sf.write("response.wav", audio, sr)

Option C: Coqui TTS / XTTS (Cloneable Voices)

For voice cloning and the most natural output:

pip install TTS

# List available models
tts --list_models

# Generate with XTTS v2
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Hello from your local AI" \
    --speaker_wav reference_voice.wav \
    --language_idx en \
    --out_path response.wav

TTS Comparison

Engine      Speed (CPU)          Quality    Voice Cloning  Size      Best For
Piper       Very fast (~50x RT)  Good       No             15-80 MB  Speed-critical, RPi
Kokoro      Fast (~10x RT)       Very good  No             ~100 MB   Quality on desktop
Coqui XTTS  Slow (~1x RT)        Excellent  Yes            ~2 GB     Maximum quality
Bark        Very slow            Excellent  Limited        ~5 GB     Expressive speech

Part 4: Connecting the Pipeline

Complete Python Pipeline

#!/usr/bin/env python3
"""Local voice assistant: Whisper + Ollama + Piper"""

import subprocess
import tempfile
import os
import requests
import pyaudio
import wave

# Configuration
WHISPER_MODEL = os.path.expanduser("~/whisper.cpp/models/ggml-base.en.bin")
WHISPER_BIN = os.path.expanduser("~/whisper.cpp/build/bin/whisper-cli")
PIPER_MODEL = os.path.expanduser("~/.local/share/piper/voices/en_US-lessac-medium.onnx")
OLLAMA_MODEL = "voice-assistant"
SAMPLE_RATE = 16000
CHANNELS = 1
RECORD_SECONDS = 5  # Max recording length


def record_audio(filename, duration=RECORD_SECONDS):
    """Record audio from microphone."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=1024,
    )
    
    print("🎤 Listening...")
    frames = []
    for _ in range(0, int(SAMPLE_RATE / 1024 * duration)):
        data = stream.read(1024)
        frames.append(data)
    
    print("Processing...")
    stream.stop_stream()
    stream.close()
    p.terminate()
    
    wf = wave.open(filename, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(frames))
    wf.close()


def transcribe(audio_file):
    """Transcribe audio using Whisper.cpp."""
    result = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", audio_file, "--no-timestamps"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()


def get_response(text):
    """Get LLM response from Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": text,
            "stream": False,
        },
    )
    return response.json()["response"]


def speak(text):
    """Convert text to speech and play it."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        tmp_path = f.name
    
    subprocess.run(
        ["piper", "--model", PIPER_MODEL, "--output_file", tmp_path],
        input=text.encode(),
        capture_output=True,
    )
    
    # Play audio
    subprocess.run(["aplay", tmp_path], capture_output=True)
    os.unlink(tmp_path)


def main():
    print("Local Voice Assistant")
    print("Press Ctrl+C to exit\n")
    
    while True:
        try:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                audio_path = f.name
            
            # 1. Record
            record_audio(audio_path)
            
            # 2. Transcribe
            text = transcribe(audio_path)
            os.unlink(audio_path)
            
            if not text.strip():
                print("(no speech detected)")
                continue
            
            print(f"You: {text}")
            
            # 3. Get LLM response
            response = get_response(text)
            print(f"Assistant: {response}")
            
            # 4. Speak response
            speak(response)
            
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break


if __name__ == "__main__":
    main()

Install Dependencies

pip install pyaudio requests numpy

# System dependencies for PyAudio
sudo apt install -y portaudio19-dev python3-pyaudio  # Ubuntu
sudo dnf install -y portaudio-devel                   # Fedora
brew install portaudio                                 # macOS

Run the Pipeline

# Make sure Ollama is running
ollama serve &

# Create the voice model
ollama create voice-assistant -f ~/Modelfile-voice

# Run the assistant
python3 voice_assistant.py

Part 5: Advanced Features

Voice Activity Detection (VAD)

Instead of fixed recording duration, detect when speech starts and stops:

import webrtcvad
import pyaudio
import collections

def record_with_vad(sample_rate=16000):
    """Record audio, automatically stopping when speech ends."""
    vad = webrtcvad.Vad(2)  # Aggressiveness 0-3
    
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=480,  # 30ms at 16kHz
    )
    
    ring_buffer = collections.deque(maxlen=30)  # 900ms of 30ms frames
    triggered = False
    voiced_frames = []
    
    print("Listening... (speak to begin)")
    
    while True:
        frame = stream.read(480)
        is_speech = vad.is_speech(frame, sample_rate)
        
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.8 * ring_buffer.maxlen:
                triggered = True
                print("Recording...")
                voiced_frames.extend([f for f, _ in ring_buffer])
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.8 * ring_buffer.maxlen:
                break
    
    stream.stop_stream()
    stream.close()
    p.terminate()
    
    return b"".join(voiced_frames)

Install the VAD library:

pip install webrtcvad

Wake Word Detection

Add a wake word using OpenWakeWord (pretrained models include "hey jarvis"; a custom phrase like "Hey Computer" requires training your own model):

pip install openwakeword

from openwakeword.model import Model as WakeWordModel
from openwakeword.utils import download_models
import pyaudio
import numpy as np

def listen_for_wake_word(callback, wake_word="hey_jarvis"):
    """Listen for a pretrained wake word, then call callback."""
    download_models()  # fetch the bundled pretrained models on first run
    model = WakeWordModel(wakeword_models=[wake_word])
    
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1280,  # 80ms chunks, as openWakeWord expects
    )
    
    print(f"Waiting for wake word '{wake_word}'...")
    
    while True:
        audio = np.frombuffer(stream.read(1280), dtype=np.int16)
        prediction = model.predict(audio)
        
        if prediction[wake_word] > 0.5:
            print("Wake word detected!")
            callback()

# Usage: record_with_vad() returns raw PCM bytes, but transcribe()
# expects a WAV file, so write the frames out first
import tempfile
import wave

def handle_voice_command():
    audio_bytes = record_with_vad()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        wav_path = f.name
    wf = wave.open(wav_path, "wb")
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(audio_bytes)
    wf.close()
    
    text = transcribe(wav_path)
    response = get_response(text)
    speak(response)

listen_for_wake_word(handle_voice_command)

Streaming TTS (Reduce Perceived Latency)

Start speaking before the full LLM response is ready:

import threading
import queue

def stream_response_with_tts(prompt):
    """Stream LLM output and speak sentences as they complete."""
    tts_queue = queue.Queue()
    
    def tts_worker():
        while True:
            sentence = tts_queue.get()
            if sentence is None:
                break
            speak(sentence)
    
    # Start TTS worker thread
    tts_thread = threading.Thread(target=tts_worker, daemon=True)
    tts_thread.start()
    
    # Stream LLM response
    current_sentence = ""
    for token in get_llm_response_stream(prompt):
        current_sentence += token
        
        # Check if we have a complete sentence
        if any(current_sentence.rstrip().endswith(p) for p in '.!?'):
            sentence = current_sentence.strip()
            if sentence:
                tts_queue.put(sentence)
                current_sentence = ""
    
    # Handle remaining text
    if current_sentence.strip():
        tts_queue.put(current_sentence.strip())
    
    tts_queue.put(None)  # Signal end
    tts_thread.join()
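The endswith check above splits too early on abbreviations like "Dr." or "e.g.". A slightly more careful splitter; still a sketch (the abbreviation list is illustrative), not a full sentence tokenizer:

```python
import re

# Common abbreviations that end with a period (illustrative; extend as needed).
_ABBREV = ("Dr.", "Mr.", "Mrs.", "Ms.", "e.g.", "i.e.", "etc.")
# Split after ., !, or ? when followed by whitespace.
_SENT_END = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Split text into sentences, re-joining splits caused by abbreviations."""
    parts = _SENT_END.split(text.strip())
    sentences = []
    for part in parts:
        if sentences and sentences[-1].endswith(_ABBREV):
            sentences[-1] += " " + part  # re-join after an abbreviation
        else:
            sentences.append(part)
    return [s for s in sentences if s]

print(split_sentences("See Dr. Smith today. Thanks!"))
# ['See Dr. Smith today.', 'Thanks!']
```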

Part 6: Latency Optimization

Bottleneck Analysis

Component            Typical Latency  Optimization
VAD/Recording        0.5-2s           Use shorter ring buffer
Whisper (base, GPU)  0.3-0.8s         Use tiny model for speed
Whisper (base, CPU)  0.5-2s           Use tiny.en model
LLM first token      0.5-2s           Use smaller model, lower context
LLM full response    1-5s             Limit max_tokens, stream
TTS synthesis        0.1-0.5s         Use Piper (fastest)
Audio playback       0.1s             Stream audio
Total (GPU)          2-4s
Total (CPU)          5-10s

Quick Wins

  1. Use Whisper tiny.en for English-only with low latency
  2. Use a 3B model for the LLM if speed matters more than quality
  3. Use Piper for TTS (50x real-time on CPU)
  4. Stream TTS to start speaking while LLM is still generating
  5. Reduce context with num_ctx 1024 in the Modelfile
  6. Keep models loaded with OLLAMA_KEEP_ALIVE=30m
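For items 5 and 6, the relevant knobs look like this (values are illustrative):

```shell
# Server side: keep models loaded in memory between requests
export OLLAMA_KEEP_ALIVE=30m
ollama serve

# Or per request, via the API's keep_alive field
curl http://localhost:11434/api/generate -d '{
  "model": "voice-assistant",
  "prompt": "hello",
  "keep_alive": "30m"
}'

# Smaller context: in the Modelfile, set
#   PARAMETER num_ctx 1024
```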

Hardware Recommendations

Budget  Setup                                           Expected Latency
Low     CPU only, tiny.en + Phi-3 Mini + Piper          5-8s
Medium  RTX 3060 12 GB, base.en + Llama 3.1 8B + Piper  2-4s
High    RTX 4090, small.en + Qwen 2.5 14B + Kokoro      1.5-3s
Apple   M3 Pro 36 GB, base.en + Llama 3.1 8B + Piper    2-4s

Frequently Asked Questions

What is the end-to-end latency of a local voice assistant?

On a modern system with a GPU, expect 2-4 seconds total latency: 0.5-1.5 seconds for speech-to-text (Whisper), 1-2 seconds for the LLM to start generating, and 0.2-0.5 seconds for TTS to begin speaking. Streaming TTS can start speaking before the full LLM response is complete, reducing perceived latency. CPU-only systems will see 5-10+ second latency.

Can I run the full voice pipeline on CPU only?

Yes, but expect higher latency. Whisper tiny/base models transcribe in near-real-time on CPU. The LLM is the bottleneck -- use a 1-3B model for acceptable response times. Piper TTS is very fast on CPU. Total latency on a modern CPU might be 5-10 seconds for short responses, which is usable for simple tasks but not for fluid conversation.

How does local voice quality compare to cloud services like Alexa or Google Assistant?

Speech recognition (Whisper large) matches or exceeds cloud services for most languages. LLM quality depends on model size but 7B+ models handle conversational tasks well. TTS is the area with the most gap -- local TTS voices are good but not yet as natural as the best cloud voices. Kokoro and some Piper voices are approaching cloud quality. The main advantage is complete privacy and offline capability.