A local voice assistant runs entirely on your hardware: your voice never leaves your machine, there is no cloud dependency, and it works offline. The pipeline consists of three components: speech-to-text (STT) to convert your voice to text, a large language model (LLM) to generate a response, and text-to-speech (TTS) to speak the response aloud. This guide walks through building each component, connecting them into a pipeline, and optimizing latency for a conversational experience.
Architecture Overview
┌────────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐
│ Microphone │ ──>  │ Whisper  │ ──>  │  Ollama  │ ──>  │  Piper/  │ ──> Speaker
│  (Audio)   │      │  (STT)   │      │  (LLM)   │      │  Kokoro  │
│            │      │          │      │          │      │  (TTS)   │
└────────────┘      └──────────┘      └──────────┘      └──────────┘
  Input audio       Text from         Generated         Audio output
                    speech            response
Optional additions:
- Wake word detection: Trigger the pipeline with a keyword like “Hey Computer”
- Streaming TTS: Start speaking before the full LLM response is ready
- Voice Activity Detection (VAD): Automatically detect when you start/stop speaking
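Before building each component, the whole loop can be sketched as three functions composed, with an early exit when STT hears nothing. The stage implementations below are stand-in lambdas; Parts 1-3 build the real Whisper, Ollama, and Piper versions.

```python
def run_pipeline(audio, stt, llm, tts):
    """One conversational turn: audio bytes -> text -> reply -> audio."""
    text = stt(audio)          # speech-to-text
    if not text.strip():
        return None            # silence: skip the LLM and TTS stages
    reply = llm(text)          # generate a response
    return tts(reply)          # text-to-speech

# Stand-in stages, just to show the data flow:
out = run_pipeline(
    b"<raw audio>",
    stt=lambda audio: "what time is it",
    llm=lambda text: f"You asked: {text}",
    tts=lambda reply: reply.encode(),
)
print(out)  # b'You asked: what time is it'
```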
Part 1: Speech-to-Text with Whisper.cpp
Whisper.cpp is a C/C++ port of OpenAI’s Whisper model, optimized for local inference.
Install Whisper.cpp
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# For NVIDIA GPU acceleration
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# For Apple Silicon (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Download Whisper Models
# Download models (from whisper.cpp directory)
bash models/download-ggml-model.sh tiny.en # 75 MB, fast, English only
bash models/download-ggml-model.sh base.en # 142 MB, good balance
bash models/download-ggml-model.sh small.en # 466 MB, better accuracy
bash models/download-ggml-model.sh medium.en # 1.5 GB, high accuracy
bash models/download-ggml-model.sh large-v3 # 3.1 GB, best accuracy, multilingual
Whisper Model Comparison
| Model | Size | English WER | Speed (CPU) | Speed (GPU) | Languages |
|---|---|---|---|---|---|
| tiny.en | 75 MB | 8.2% | Real-time | 10x RT | English |
| base.en | 142 MB | 5.9% | ~Real-time | 8x RT | English |
| small.en | 466 MB | 4.4% | 0.5x RT | 6x RT | English |
| medium.en | 1.5 GB | 3.8% | 0.2x RT | 4x RT | English |
| large-v3 | 3.1 GB | 3.0% | 0.1x RT | 2x RT | 99 languages |
WER = Word Error Rate (lower is better). RT = Real-Time (1x = transcribes as fast as audio plays).
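The RT factors in the table translate directly into wall-clock cost: at N× real-time, a clip takes (audio length ÷ N) seconds to transcribe. A quick sanity check:

```python
def transcription_seconds(audio_seconds, rt_factor):
    """At N x real-time, transcribing takes (audio length / N) seconds."""
    return audio_seconds / rt_factor

# small.en on CPU (~0.5x RT): a 10 s clip takes ~20 s to transcribe
print(transcription_seconds(10, 0.5))  # 20.0
# large-v3 on GPU (~2x RT): the same clip takes ~5 s
print(transcription_seconds(10, 2.0))  # 5.0
```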
Test Whisper
# Transcribe a WAV file
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav
# Real-time microphone transcription uses the separate stream example,
# which requires SDL2 (rebuild with -DWHISPER_SDL2=ON)
./build/bin/whisper-stream -m models/ggml-base.en.bin --step 3000 --length 8000
# Output as text only (no timestamps)
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav --no-timestamps --output-txt
Whisper as a Server
# Run Whisper HTTP server
./build/bin/whisper-server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8178
# Send audio for transcription
curl -X POST http://localhost:8178/inference \
-H "Content-Type: multipart/form-data" \
-F [email protected] \
-F response_format=json
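The same endpoint is easy to call from Python. A minimal client sketch, assuming the server from above is listening on port 8178 and that the JSON reply carries the transcript under a "text" key:

```python
import requests

def transcribe_via_server(wav_path, url="http://localhost:8178/inference"):
    """Send a WAV file to a running whisper-server instance.

    Assumes the server returns JSON with the transcript under "text".
    """
    with open(wav_path, "rb") as f:
        resp = requests.post(
            url,
            files={"file": f},                    # same field as the curl example
            data={"response_format": "json"},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json().get("text", "").strip()
```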
Python Whisper Alternative
If you prefer Python over C++:
pip install faster-whisper
# faster-whisper uses CTranslate2 for optimized inference
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cuda", compute_type="float16")
# For CPU: device="cpu", compute_type="int8"

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Part 2: LLM Processing with Ollama
The LLM receives the transcribed text and generates a response.
Model Selection for Voice
Voice assistants need models that are:
- Fast: Low latency is critical for conversation
- Concise: Long responses feel unnatural when spoken
- Instruction-following: Must respond appropriately to conversational queries
| VRAM | Recommended Model | Why |
|---|---|---|
| 4 GB | Phi-3 Mini 3.8B | Fast, good instruction following |
| 6-8 GB | Llama 3.1 8B | Best all-around at this size |
| 12-16 GB | Qwen 2.5 14B | Higher quality responses |
| 24 GB | Qwen 2.5 32B Q4 | Near-cloud quality |
Configure for Voice
Create a voice-optimized Modelfile:
cat > ~/Modelfile-voice << 'EOF'
FROM llama3.1:8b
SYSTEM """You are a helpful voice assistant. Follow these rules:
1. Keep responses concise (1-3 sentences for simple questions)
2. Use natural, spoken language (avoid bullet points, code blocks, markdown)
3. Be conversational and friendly
4. If the question is ambiguous, ask for clarification
5. For complex topics, give a brief answer and offer to explain more"""
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
PARAMETER num_predict 256
PARAMETER stop "<|eot_id|>"
EOF
ollama create voice-assistant -f ~/Modelfile-voice
ollama run voice-assistant
Streaming API for Lower Latency
import requests
import json
def get_llm_response_stream(text):
    """Stream the LLM response for lower time-to-first-token."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "voice-assistant",
            "prompt": text,
            "stream": True,
        },
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")  # yield each token as it arrives
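A convenient companion to the streaming generator is a helper that regroups the token stream into complete sentences, using the same end-of-sentence check the streaming-TTS code in Part 5 applies, so downstream stages can work sentence by sentence. A sketch:

```python
def sentences_from_stream(tokens):
    """Group a token stream into complete sentences.

    Lets TTS start speaking each sentence while later ones are
    still being generated.
    """
    buf = ""
    for token in tokens:
        buf += token
        if buf.rstrip().endswith((".", "!", "?")):
            sentence = buf.strip()
            if sentence:
                yield sentence
            buf = ""
    if buf.strip():  # flush any trailing fragment without punctuation
        yield buf.strip()

# Works on any iterable of string chunks:
print(list(sentences_from_stream(["Paris", " is the capital", ". Anything else?"])))
```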
Part 3: Text-to-Speech
Option A: Piper TTS (Recommended for Speed)
Piper is a fast, lightweight TTS engine that runs in real-time on CPU.
# Install Piper
pip install piper-tts
# Or download binary
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar xzf piper_linux_x86_64.tar.gz
# Download a voice model
# Browse voices at: https://rhasspy.github.io/piper-samples/
mkdir -p ~/.local/share/piper/voices
cd ~/.local/share/piper/voices
# US English, medium quality (recommended starting point)
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
Using Piper
# Command line
echo "Hello, how can I help you today?" | \
piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
--output_file response.wav
# Play immediately
echo "Hello, how can I help you today?" | \
piper --model ~/.local/share/piper/voices/en_US-lessac-medium.onnx \
--output-raw | aplay -r 22050 -f S16_LE -t raw -
Piper in Python
import subprocess

def text_to_speech_piper(text, model_path, output_file="response.wav"):
    """Convert text to speech using Piper."""
    process = subprocess.Popen(
        ["piper", "--model", model_path, "--output_file", output_file],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    process.communicate(input=text.encode())
    return output_file
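The --output-raw trick from the command line also works from Python: pipe Piper's raw PCM straight into aplay so playback begins before synthesis finishes. A sketch, assuming a 22050 Hz voice model and that both piper and aplay are on PATH:

```python
import subprocess

def speak_streaming(text, model_path):
    """Stream Piper's raw 16-bit PCM into aplay for immediate playback."""
    piper = subprocess.Popen(
        ["piper", "--model", model_path, "--output-raw"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    aplay = subprocess.Popen(
        ["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
        stdin=piper.stdout,
    )
    piper.stdout.close()       # let aplay see EOF when piper finishes
    piper.stdin.write(text.encode())
    piper.stdin.close()
    aplay.wait()
```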
Option B: Kokoro TTS (Higher Quality)
Kokoro produces more natural-sounding speech but requires more resources.
pip install kokoro-onnx soundfile
# Download kokoro-v1.0.onnx and voices-v1.0.bin from the kokoro-onnx
# GitHub releases page (instantiating Kokoro does not download them)
from kokoro_onnx import Kokoro
import soundfile as sf

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# Generate speech
audio, sr = kokoro.create(
    "Hello, I'm your local AI assistant. How can I help you today?",
    voice="af_heart",  # Available: af_heart, af_bella, am_adam, am_michael, etc.
    speed=1.0,
)
sf.write("response.wav", audio, sr)
Option C: Coqui TTS / XTTS (Cloneable Voices)
For voice cloning and the most natural output:
pip install TTS
# List available models
tts --list_models
# Generate with XTTS v2
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--text "Hello from your local AI" \
--speaker_wav reference_voice.wav \
--language_idx en \
--out_path response.wav
TTS Comparison
| Engine | Speed (CPU) | Quality | Voice Cloning | Size | Best For |
|---|---|---|---|---|---|
| Piper | Very fast (~50x RT) | Good | No | 15-80 MB | Speed-critical, RPi |
| Kokoro | Fast (~10x RT) | Very good | No | ~100 MB | Quality on desktop |
| Coqui XTTS | Slow (~1x RT) | Excellent | Yes | ~2 GB | Maximum quality |
| Bark | Very slow | Excellent | Limited | ~5 GB | Expressive speech |
Part 4: Connecting the Pipeline
Complete Python Pipeline
#!/usr/bin/env python3
"""Local voice assistant: Whisper + Ollama + Piper"""
import subprocess
import tempfile
import os
import wave

import requests
import pyaudio
# Configuration
WHISPER_MODEL = os.path.expanduser("~/whisper.cpp/models/ggml-base.en.bin")
WHISPER_BIN = os.path.expanduser("~/whisper.cpp/build/bin/whisper-cli")
PIPER_MODEL = os.path.expanduser("~/.local/share/piper/voices/en_US-lessac-medium.onnx")
OLLAMA_MODEL = "voice-assistant"
SAMPLE_RATE = 16000
CHANNELS = 1
RECORD_SECONDS = 5 # Max recording length
def record_audio(filename, duration=RECORD_SECONDS):
    """Record audio from microphone."""
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=CHANNELS,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=1024,
    )
    print("🎤 Listening...")
    frames = []
    for _ in range(0, int(SAMPLE_RATE / 1024 * duration)):
        data = stream.read(1024)
        frames.append(data)
    print("Processing...")
    stream.stop_stream()
    stream.close()
    wf = wave.open(filename, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(frames))
    wf.close()
    p.terminate()
def transcribe(audio_file):
    """Transcribe audio using Whisper.cpp."""
    result = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", audio_file, "-nt"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

def get_response(text):
    """Get LLM response from Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": OLLAMA_MODEL,
            "prompt": text,
            "stream": False,
        },
    )
    return response.json()["response"]

def speak(text):
    """Convert text to speech and play it."""
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        tmp_path = f.name
    subprocess.run(
        ["piper", "--model", PIPER_MODEL, "--output_file", tmp_path],
        input=text.encode(),
        capture_output=True,
    )
    # Play audio
    subprocess.run(["aplay", tmp_path], capture_output=True)
    os.unlink(tmp_path)

def main():
    print("Local Voice Assistant")
    print("Press Ctrl+C to exit\n")
    while True:
        try:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
                audio_path = f.name
            # 1. Record
            record_audio(audio_path)
            # 2. Transcribe
            text = transcribe(audio_path)
            os.unlink(audio_path)
            if not text.strip():
                print("(no speech detected)")
                continue
            print(f"You: {text}")
            # 3. Get LLM response
            response = get_response(text)
            print(f"Assistant: {response}")
            # 4. Speak response
            speak(response)
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break

if __name__ == "__main__":
    main()
Install Dependencies
pip install pyaudio requests numpy
# System dependencies for PyAudio
sudo apt install -y portaudio19-dev python3-pyaudio # Ubuntu
sudo dnf install -y portaudio-devel # Fedora
brew install portaudio # macOS
Run the Pipeline
# Make sure Ollama is running
ollama serve &
# Create the voice model
ollama create voice-assistant -f ~/Modelfile-voice
# Run the assistant
python3 voice_assistant.py
Part 5: Advanced Features
Voice Activity Detection (VAD)
Instead of fixed recording duration, detect when speech starts and stops:
pip install webrtcvad
import webrtcvad
import pyaudio
import collections

def record_with_vad(sample_rate=16000):
    """Record audio, automatically stopping when speech ends."""
    vad = webrtcvad.Vad(2)  # Aggressiveness 0-3
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=sample_rate,
        input=True,
        frames_per_buffer=480,  # 30 ms at 16 kHz
    )
    ring_buffer = collections.deque(maxlen=30)  # 900 ms
    triggered = False
    voiced_frames = []
    print("Listening... (speak to begin)")
    while True:
        frame = stream.read(480)
        is_speech = vad.is_speech(frame, sample_rate)
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            if num_voiced > 0.8 * ring_buffer.maxlen:
                triggered = True
                print("Recording...")
                voiced_frames.extend([f for f, _ in ring_buffer])
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            if num_unvoiced > 0.8 * ring_buffer.maxlen:
                break
    stream.stop_stream()
    stream.close()
    p.terminate()
    return b"".join(voiced_frames)
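Note that record_with_vad returns raw PCM bytes, while the Whisper CLI expects a WAV file. A small helper bridges the two, assuming 16-bit mono at 16 kHz to match the recording settings above:

```python
import wave

def save_wav(pcm_bytes, path, sample_rate=16000):
    """Wrap raw 16-bit mono PCM in a WAV container so Whisper can read it."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return path
```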
Wake Word Detection
Add a wake word (like “Hey Computer”) using OpenWakeWord:
pip install openwakeword
from openwakeword import Model as WakeWordModel
import pyaudio
import numpy as np
import wave

def listen_for_wake_word(callback, wake_word="hey_jarvis"):
    """Listen for a wake word, then call callback.

    openWakeWord ships pretrained models such as "alexa", "hey_mycroft",
    and "hey_jarvis"; a custom phrase like "hey computer" requires
    training your own model.
    """
    model = WakeWordModel(wakeword_models=[wake_word])
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1280,
    )
    print(f"Waiting for wake word '{wake_word}'...")
    while True:
        audio = np.frombuffer(stream.read(1280), dtype=np.int16)
        prediction = model.predict(audio)
        if prediction[wake_word] > 0.5:
            print("Wake word detected!")
            callback()

# Usage
def handle_voice_command():
    pcm = record_with_vad()
    # Whisper expects a WAV file, so wrap the raw PCM first
    with wave.open("command.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(pcm)
    text = transcribe("command.wav")
    response = get_response(text)
    speak(response)

listen_for_wake_word(handle_voice_command)
Streaming TTS (Reduce Perceived Latency)
Start speaking before the full LLM response is ready:
import threading
import queue

def stream_response_with_tts(prompt):
    """Stream LLM output and speak sentences as they complete."""
    tts_queue = queue.Queue()

    def tts_worker():
        while True:
            sentence = tts_queue.get()
            if sentence is None:
                break
            speak(sentence)

    # Start TTS worker thread
    tts_thread = threading.Thread(target=tts_worker, daemon=True)
    tts_thread.start()

    # Stream LLM response
    current_sentence = ""
    for token in get_llm_response_stream(prompt):
        current_sentence += token
        # Check if we have a complete sentence
        if any(current_sentence.rstrip().endswith(p) for p in ".!?"):
            sentence = current_sentence.strip()
            if sentence:
                tts_queue.put(sentence)
            current_sentence = ""
    # Handle remaining text
    if current_sentence.strip():
        tts_queue.put(current_sentence.strip())
    tts_queue.put(None)  # Signal end
    tts_thread.join()
Part 6: Latency Optimization
Bottleneck Analysis
| Component | Typical Latency | Optimization |
|---|---|---|
| VAD/Recording | 0.5-2s | Use shorter ring buffer |
| Whisper (base, GPU) | 0.3-0.8s | Use tiny model for speed |
| Whisper (base, CPU) | 0.5-2s | Use tiny.en model |
| LLM first token | 0.5-2s | Use smaller model, lower context |
| LLM full response | 1-5s | Limit num_predict, stream |
| TTS synthesis | 0.1-0.5s | Use Piper (fastest) |
| Audio playback | 0.1s | Stream audio |
| Total (GPU) | 2-4s | |
| Total (CPU) | 5-10s | |
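To find out which rows of the table your setup actually hits, time each stage of one turn. A small sketch using a context manager; the stage bodies here are placeholders for the real transcribe/get_response/speak calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record wall-clock seconds for one pipeline stage."""
    start = time.perf_counter()
    yield
    timings[label] = time.perf_counter() - start

# Wrap each stage of a single turn, then print the breakdown:
timings = {}
with timed("stt", timings):
    text = "what time is it"       # transcribe(audio_path)
with timed("llm", timings):
    reply = "It's three o'clock."  # get_response(text)
with timed("tts", timings):
    pass                           # speak(reply)
for stage, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {secs:.3f}s")
```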
Quick Wins
- Use Whisper tiny.en for English-only transcription with low latency
- Use a 3B model for the LLM if speed matters more than quality
- Use Piper for TTS (~50x real-time on CPU)
- Stream TTS to start speaking while the LLM is still generating
- Reduce context with num_ctx 1024 in the Modelfile
- Keep models loaded with OLLAMA_KEEP_ALIVE=30m
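The num_ctx and num_predict limits don't have to live in the Modelfile: Ollama's /api/generate also accepts per-request overrides in an "options" object. A sketch (ask_fast is a hypothetical helper name):

```python
import requests

def ask_fast(prompt, url="http://localhost:11434/api/generate"):
    """Query Ollama with per-request context and output limits."""
    r = requests.post(url, json={
        "model": "voice-assistant",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 1024, "num_predict": 128},
    }, timeout=120)
    r.raise_for_status()
    return r.json()["response"]
```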
Hardware Recommendations
| Budget | Setup | Expected Latency |
|---|---|---|
| Low | CPU only, tiny.en + Phi-3 Mini + Piper | 5-8s |
| Medium | RTX 3060 12 GB, base.en + Llama 3.1 8B + Piper | 2-4s |
| High | RTX 4090, small.en + Qwen 2.5 14B + Kokoro | 1.5-3s |
| Apple | M3 Pro 36 GB, base.en + Llama 3.1 8B + Piper | 2-4s |
Next Steps
- Add document knowledge: Local RAG Chatbot — make your voice assistant answer from documents
- Choose your model: Model selection guide
- Deploy as a service: Docker deployment guide
- Home automation: Integrate with Home Assistant for smart home control