A local voice assistant runs entirely on your hardware: your voice never leaves your machine, there is no cloud dependency, and it works offline. The pipeline consists of three components: speech-to-text (STT) to convert your voice to text, a large language model (LLM) to generate a response, and text-to-speech (TTS) to speak the response aloud. This guide walks through building each component, connecting them into a pipeline, and optimizing latency for a conversational experience.
Architecture Overview
┌────────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
│ Microphone │ ──>  │ Whisper  │ ──>  │  Ollama  │ ──>  │  Piper/  │ ──> Speaker
│  (Audio)   │      │  (STT)   │      │  (LLM)   │      │  Kokoro  │
│            │      │          │      │          │      │  (TTS)   │
└────────────┘      └──────────┘      └──────────┘      └──────────┘
  Input audio       Text from         Generated         Audio output
                    speech            response
Optional additions:
- Wake word detection: Trigger the pipeline with a keyword like “Hey Computer”
- Streaming TTS: Start speaking before the full LLM response is ready
- Voice Activity Detection (VAD): Automatically detect when you start/stop speaking
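Before building each component, the whole loop can be sketched as three functions composed, with an early exit when STT hears nothing. The stage implementations below are stand-in lambdas; Parts 1-3 build the real Whisper, Ollama, and Piper versions.

```python
def run_pipeline(audio, stt, llm, tts):
    """One conversational turn: audio bytes -> text -> reply -> audio."""
    text = stt(audio)          # speech-to-text
    if not text.strip():
        return None            # silence: skip the LLM and TTS stages
    reply = llm(text)          # generate a response
    return tts(reply)          # text-to-speech

# Stand-in stages, just to show the data flow:
out = run_pipeline(
    b"<raw audio>",
    stt=lambda audio: "what time is it",
    llm=lambda text: f"You asked: {text}",
    tts=lambda reply: reply.encode(),
)
print(out)  # b'You asked: what time is it'
```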
Part 1: Speech-to-Text with Whisper.cpp
Whisper.cpp is a C/C++ port of OpenAI’s Whisper model, optimized for local inference.
Install Whisper.cpp
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# For NVIDIA GPU acceleration
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Download Whisper Models
# Download models (from whisper.cpp directory)
bash models/download-ggml-model.sh tiny.en # 75 MB, fast, English only
bash models/download-ggml-model.sh base.en # 142 MB, good balance
bash models/download-ggml-model.sh small.en # 466 MB, better accuracy
bash models/download-ggml-model.sh medium.en # 1.5 GB, high accuracy
bash models/download-ggml-model.sh large-v3 # 3.1 GB, best accuracy, multilingual
Whisper Model Comparison
| Model | Size | English WER | Speed (CPU) | Speed (GPU) | Languages |
|---|---|---|---|---|---|
| tiny.en | 75 MB | 8.2% | Real-time | 10x RT | English |
| base.en | 142 MB | 5.9% | ~Real-time | 8x RT | English |
| small.en | 466 MB | 4.4% | 0.5x RT | 6x RT | English |
| medium.en | 1.5 GB | 3.8% | 0.2x RT | 4x RT | English |
| large-v3 | 3.1 GB | 3.0% | 0.1x RT | 2x RT | 99 languages |
WER = Word Error Rate (lower is better). RT = Real-Time (1x = transcribes as fast as audio plays).
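The RT factors in the table translate directly into wall-clock cost: at N× real-time, a clip takes (audio length ÷ N) seconds to transcribe. A quick sanity check:

```python
def transcription_seconds(audio_seconds, rt_factor):
    """At N x real-time, transcribing takes (audio length / N) seconds."""
    return audio_seconds / rt_factor

# small.en on CPU (~0.5x RT): a 10 s clip takes ~20 s to transcribe
print(transcription_seconds(10, 0.5))  # 20.0
# large-v3 on GPU (~2x RT): the same clip takes ~5 s
print(transcription_seconds(10, 2.0))  # 5.0
```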
Test Whisper
# Transcribe a WAV file
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav
# Real-time microphone transcription uses the separate stream example,
# which requires SDL2 (rebuild with -DWHISPER_SDL2=ON)
./build/bin/whisper-stream -m models/ggml-base.en.bin --step 3000 --length 8000
# Output as text only (no timestamps)
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav --no-timestamps --output-txt
Whisper as a Server
# Run Whisper HTTP server
./build/bin/whisper-server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8178
# Send audio for transcription
curl -X POST http://localhost:8178/inference \
-H "Content-Type: multipart/form-data" \
-F [email protected] \
-F response_format=json
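The same endpoint is easy to call from Python. A minimal client sketch, assuming the server from above is listening on port 8178 and that the JSON reply carries the transcript under a "text" key:

```python
import requests

def transcribe_via_server(wav_path, url="http://localhost:8178/inference"):
    """Send a WAV file to a running whisper-server instance.

    Assumes the server returns JSON with the transcript under "text".
    """
    with open(wav_path, "rb") as f:
        resp = requests.post(
            url,
            files={"file": f},                    # same field as the curl example
            data={"response_format": "json"},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json().get("text", "").strip()
```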
Python Whisper Alternative
If you prefer Python over C++:
pip install faster-whisper
# faster-whisper uses CTranslate2 for optimized inference
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cuda", compute_type="float16")
# For CPU: device="cpu", compute_type="int8"

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Part 2: LLM Processing with Ollama
The LLM receives the transcribed text and generates a response.
Model Selection for Voice
Voice assistants need models that are:
- Fast: Low latency is critical for conversation
- Concise: Long responses feel unnatural when spoken
- Instruction-following: Must respond appropriately to conversational queries
| VRAM | Recommended Model | Why |
|---|---|---|
| 4 GB | Phi-3 Mini 3.8B | Fast, good instruction following |
| 6-8 GB | Llama 3.1 8B | Best all-around at this size |
| 12-16 GB | Qwen 2.5 14B | Higher quality responses |
| 24 GB | Qwen 2.5 32B Q4 | Near-cloud quality |
Configure for Voice
Create a voice-optimized Modelfile:
cat > ~/Modelfile-voice << 'EOF'
FROM llama3.1:8b
SYSTEM """You are a helpful voice assistant. Follow these rules:
1. Keep responses concise (1-3 sentences for simple questions)
2. Use natural, spoken language (avoid bullet points, code blocks, markdown)
3. Be conversational and friendly
4. If the question is ambiguous, ask for clarification
5. For complex topics, give a brief answer and offer to explain more"""
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
PARAMETER num_predict 256
PARAMETER stop "<|eot_id|>"
EOF
ollama create voice-assistant -f ~/Modelfile-voice
ollama run voice-assistant
Streaming API for Lower Latency
import requests
import json
def get_llm_response_stream(text):
    """Stream the LLM response for lower time-to-first-token."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "voice-assistant",
            "prompt": text,
            "stream": True,
        },
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")  # yield each token as it arrives
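A convenient companion to the streaming generator is a helper that regroups the token stream into complete sentences, using the same end-of-sentence check the streaming-TTS code in Part 5 applies, so downstream stages can work sentence by sentence. A sketch:

```python
def sentences_from_stream(tokens):
    """Group a token stream into complete sentences.

    Lets TTS start speaking each sentence while later ones are
    still being generated.
    """
    buf = ""
    for token in tokens:
        buf += token
        if buf.rstrip().endswith((".", "!", "?")):
            sentence = buf.strip()
            if sentence:
                yield sentence
            buf = ""
    if buf.strip():  # flush any trailing fragment without punctuation
        yield buf.strip()

# Works on any iterable of string chunks:
print(list(sentences_from_stream(["Paris", " is the capital", ". Anything else?"])))
```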
Part 3: Text-to-Speech
Option A: Piper TTS (Recommended for Speed)
Piper is a fast, lightweight TTS engine that runs in real-time on CPU.
# Install Piper
pip install piper-tts
# Or download binary
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz
# Download a voice model
# Browse voices at: https://rhasspy.github.io/piper-samples/
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices
# US English, medium quality (recommended starting point)
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
Using Piper
# Command line
echo "Hello, how can I help you today?" | \
piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
--output_file response.wav
# Play immediately
echo "Hello, how can I help you today?" | \
piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
--output-raw | aplay -r 22050 -f S16_LE -t raw -
Piper in Python
import subprocess

def text_to_speech_piper(text, model_path, output_file="response.wav"):
    """Convert text to speech using Piper."""
    process = subprocess.Popen(
        ["piper", "--model", model_path, "--output_file", output_file],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    process.communicate(input=text.encode())
    return output_file
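The --output-raw trick from the command line also works from Python: pipe Piper's raw PCM straight into aplay so playback begins before synthesis finishes. A sketch, assuming a 22050 Hz voice model and that both piper and aplay are on PATH:

```python
import subprocess

def speak_streaming(text, model_path):
    """Stream Piper's raw 16-bit PCM into aplay for immediate playback."""
    piper = subprocess.Popen(
        ["piper", "--model", model_path, "--output-raw"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    aplay = subprocess.Popen(
        ["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
        stdin=piper.stdout,
    )
    piper.stdout.close()       # let aplay see EOF when piper finishes
    piper.stdin.write(text.encode())
    piper.stdin.close()
    aplay.wait()
```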
Option B: Kokoro TTS (Higher Quality)
Kokoro produces more natural-sounding speech but requires more resources.
pip install kokoro-onnx soundfile
# Download kokoro-v1.0.onnx and voices-v1.0.bin from the kokoro-onnx
# GitHub releases page (instantiating Kokoro does not download them)
from kokoro_onnx import Kokoro
import soundfile as sf

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# Generate speech
audio, sr = kokoro.create(
    "Hello, I'm your local AI assistant. How can I help you today?",
    voice="af_heart",  # Available: af_heart, af_bella, am_adam, am_michael, etc.
    speed=1.0,
)
sf.write("response.wav", audio, sr)
Option C: Coqui TTS / XTTS (Cloneable Voices)
For voice cloning and the most natural output:
pip install TTS
# List available models
tts --list_models
# Generate with XTTS v2
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Hello from your local AI" \
--speaker_wav reference_voice.wav \
--language_idx en \
--out_path response.wav
TTS Comparison
| Engine | Speed (CPU) | Quality | Voice Cloning | Size | Best For |
|---|---|---|---|---|---|
| Piper | Very fast (~50x RT) | Good | No | 15-80 MB | Speed-critical, RPi |
| Kokoro | Fast (~10x RT) | Very good | No | ~100 MB | Quality on desktop |
| Coqui XTTS | Slow (~1x RT) | Excellent | Yes | ~2 GB | Maximum quality |
| Bark | Very slow | Excellent | Limited | ~5 GB | Expressive speech |
Part 4: Connecting the Pipeline
Complete Python Pipeline
#!/usr/bin/env python3
"""Local voice assistant: Whisper + Ollama + Piper"""
import subprocess
import tempfile
import os
import wave

import requests
import pyaudio
# Configuration
WHISPER_MODEL = os.path.expanduser("~/whisper.cpp/models/ggml-base.en.bin")
WHISPER_BIN = os.path.expanduser("~/whisper.cpp/build/bin/whisper-cli")
PIPER_MODEL = os.path.expanduser("~/.local/share/piper/voices/en_US-lessac-medium.onnx")
OLLAMA_MODEL = "voice-assistant"
SAMPLE_RATE = 16000
CHANNELS = 1
RECORD_SECONDS = 5 # Max recording length
def record_audio(filename, duration=RECORD_SECONDS):
    """Record audio from microphone."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=1024,
    )
    print("🎤 Listening...")
    frames = []
    for _ in range(0, int(SAMPLE_RATE / 1024 * duration)):
        data = stream.read(1024)
        frames.append(data)
    print("Processing...")
    stream.stop_stream()
    stream.close()
    wf = wave.open(filename, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(frames))
    wf.close()
    p.terminate()
def transcribe(audio_file):
    """Transcribe audio using Whisper.cpp."""
    result = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", audio_file, "-nt"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

def get_response(text):
    """Get LLM response from Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": text,
            "stream": False,
        },
    )
    return response.json()["response"]

def speak(text):
    """Convert text to speech and play it."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        tmp_path = f.name
    subprocess.run(
        ["piper", "--model", PIPER_MODEL, "--output_file", tmp_path],
        input=text.encode(),
        capture_output=True,
    )
    # Play audio
    subprocess.run(["aplay", tmp_path], capture_output=True)
    os.unlink(tmp_path)

def main():
    print("Local Voice Assistant")
    print("Press Ctrl+C to exit\n")
    while True:
        try:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                audio_path = f.name
            # 1. Record
            record_audio(audio_path)
            # 2. Transcribe
            text = transcribe(audio_path)
            os.unlink(audio_path)
            if not text.strip():
                print("(no speech detected)")
                continue
            print(f"You: {text}")
            # 3. Get LLM response
            response = get_response(text)
            print(f"Assistant: {response}")
            # 4. Speak response
            speak(response)
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break

if __name__ == "__main__":
    main()
Install Dependencies
pip install pyaudio requests numpy
# System dependencies for PyAudio
sudo apt install -y portaudio19-dev python3-pyaudio # Ubuntu
sudo dnf install -y portaudio-devel # Fedora
brew install portaudio # macOS
Run the Pipeline
# Make sure Ollama is running
ollama serve &
# Create the voice model
ollama create voice-assistant -f ~/Modelfile-voice
# Run the assistant
python3 voice_assistant.py
Part 5: Advanced Features
Voice Activity Detection (VAD)
Instead of fixed recording duration, detect when speech starts and stops:
pip install webrtcvad
import webrtcvad
import pyaudio
import collections

def record_with_vad(sample_rate=16000):
    """Record audio, automatically stopping when speech ends."""
    vad = webrtcvad.Vad(2)  # Aggressiveness 0-3
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=480,  # 30 ms at 16 kHz
    )
    ring_buffer = collections.deque(maxlen=30)  # 900 ms
    triggered = False
    voiced_frames = []
    print("Listening... (speak to begin)")
    while True:
        frame = stream.read(480)
        is_speech = vad.is_speech(frame, sample_rate)
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.8 * ring_buffer.maxlen:
                triggered = True
                print("Recording...")
                voiced_frames.extend([f for f, _ in ring_buffer])
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.8 * ring_buffer.maxlen:
                break
    stream.stop_stream()
    stream.close()
    p.terminate()
    return b"".join(voiced_frames)
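Note that record_with_vad returns raw PCM bytes, while the Whisper CLI expects a WAV file. A small helper bridges the two, assuming 16-bit mono at 16 kHz to match the recording settings above:

```python
import wave

def save_wav(pcm_bytes, path, sample_rate=16000):
    """Wrap raw 16-bit mono PCM in a WAV container so Whisper can read it."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return path
```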
Wake Word Detection
Add a wake word (like “Hey Computer”) using OpenWakeWord:
pip install openwakeword
from openwakeword import Model as WakeWordModel
import pyaudio
import numpy as np
import wave

def listen_for_wake_word(callback, wake_word="hey_jarvis"):
    """Listen for a wake word, then call callback.

    openWakeWord ships pretrained models such as "alexa", "hey_mycroft",
    and "hey_jarvis"; a custom phrase like "hey computer" requires
    training your own model.
    """
    model = WakeWordModel(wakeword_models=[wake_word])
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1280,
    )
    print(f"Waiting for wake word '{wake_word}'...")
    while True:
        audio = np.frombuffer(stream.read(1280), dtype=np.int16)
        prediction = model.predict(audio)
        if prediction[wake_word] > 0.5:
            print("Wake word detected!")
            callback()

# Usage
def handle_voice_command():
    pcm = record_with_vad()
    # Whisper expects a WAV file, so wrap the raw PCM first
    with wave.open("command.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(pcm)
    text = transcribe("command.wav")
    response = get_response(text)
    speak(response)

listen_for_wake_word(handle_voice_command)
Streaming TTS (Reduce Perceived Latency)
Start speaking before the full LLM response is ready:
import threading
import queue

def stream_response_with_tts(prompt):
    """Stream LLM output and speak sentences as they complete."""
    tts_queue = queue.Queue()

    def tts_worker():
        while True:
            sentence = tts_queue.get()
            if sentence is None:
                break
            speak(sentence)

    # Start TTS worker thread
    tts_thread = threading.Thread(target=tts_worker, daemon=True)
    tts_thread.start()

    # Stream LLM response
    current_sentence = ""
    for token in get_llm_response_stream(prompt):
        current_sentence += token
        # Check if we have a complete sentence
        if any(current_sentence.rstrip().endswith(p) for p in ".!?"):
            sentence = current_sentence.strip()
            if sentence:
                tts_queue.put(sentence)
            current_sentence = ""
    # Handle remaining text
    if current_sentence.strip():
        tts_queue.put(current_sentence.strip())
    tts_queue.put(None)  # Signal end
    tts_thread.join()
Part 6: Latency Optimization
Bottleneck Analysis
| Component | Typical Latency | Optimization |
|---|---|---|
| VAD/Recording | 0.5-2s | Use shorter ring buffer |
| Whisper (base, GPU) | 0.3-0.8s | Use tiny model for speed |
| Whisper (base, CPU) | 0.5-2s | Use tiny.en model |
| LLM first token | 0.5-2s | Use smaller model, lower context |
| LLM full response | 1-5s | Limit num_predict, stream |
| TTS synthesis | 0.1-0.5s | Use Piper (fastest) |
| Audio playback | 0.1s | Stream audio |
| Total (GPU) | 2-4s | |
| Total (CPU) | 5-10s | |
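To find out which rows of the table your setup actually hits, time each stage of one turn. A small sketch using a context manager; the stage bodies here are placeholders for the real transcribe/get_response/speak calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record wall-clock seconds for one pipeline stage."""
    start = time.perf_counter()
    yield
    timings[label] = time.perf_counter() - start

# Wrap each stage of a single turn, then print the breakdown:
timings = {}
with timed("stt", timings):
    text = "what time is it"       # transcribe(audio_path)
with timed("llm", timings):
    reply = "It's three o'clock."  # get_response(text)
with timed("tts", timings):
    pass                           # speak(reply)
for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {secs:.3f}s")
```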
Quick Wins
- Use Whisper tiny.en for English-only transcription with low latency
- Use a 3B model for the LLM if speed matters more than quality
- Use Piper for TTS (~50x real-time on CPU)
- Stream TTS to start speaking while the LLM is still generating
- Reduce context with num_ctx 1024 in the Modelfile
- Keep models loaded with OLLAMA_KEEP_ALIVE=30m
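The num_ctx and num_predict limits don't have to live in the Modelfile: Ollama's /api/generate also accepts per-request overrides in an "options" object. A sketch (ask_fast is a hypothetical helper name):

```python
import requests

def ask_fast(prompt, url="http://localhost:11434/api/generate"):
    """Query Ollama with per-request context and output limits."""
    r = requests.post(url, json={
        "model": "voice-assistant",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 1024, "num_predict": 128},
    }, timeout=120)
    r.raise_for_status()
    return r.json()["response"]
```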
Hardware Recommendations
| Budget | Setup | Expected Latency |
|---|---|---|
| Low | CPU only, tiny.en + Phi-3 Mini + Piper | 5-8s |
| Medium | RTX 3060 12 GB, base.en + Llama 3.1 8B + Piper | 2-4s |
| High | RTX 4090, small.en + Qwen 2.5 14B + Kokoro | 1.5-3s |
| Apple | M3 Pro 36 GB, base.en + Llama 3.1 8B + Piper | 2-4s |
Next Steps
- Add document knowledge: Local RAG Chatbot — make your voice assistant answer from documents
- Choose your model: Model selection guide
- Deploy as a service: Docker deployment guide
- Home automation: Integrate with Home Assistant for smart home control