RAG (Retrieval-Augmented Generation) is the most practical way to make a local LLM answer questions about your own documents, knowledge base, or data. Instead of fine-tuning the model (which is expensive and inflexible), RAG retrieves relevant document chunks at query time and includes them in the LLM's prompt. This guide walks through building a complete local RAG pipeline from scratch, using Ollama for inference, an embedding model for vector search, and ChromaDB for storage. No data leaves your machine.
How RAG Works
The RAG pipeline has two phases:
Ingestion Phase (One-time)
Documents → Chunking → Embedding → Vector Database
- Load documents: Read PDFs, text files, markdown, web pages
- Chunk: Split documents into smaller passages (typically 500-1000 characters)
- Embed: Convert each chunk into a numerical vector using an embedding model
- Store: Save vectors in a vector database (ChromaDB, FAISS)
Query Phase (Every question)
User Query → Embed Query → Search → Top K Chunks → LLM + Context → Answer
- Embed the question: Convert the user’s question into a vector
- Search: Find the most similar document chunks in the vector database
- Build prompt: Combine the retrieved chunks with the question in a prompt
- Generate: Send the prompt to the LLM, which generates an answer grounded in the retrieved context
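The query phase above can be sketched in a few lines of plain Python. The bag-of-words `embed` function below is a toy stand-in for a real embedding model, but the retrieve-by-cosine-similarity logic is the same idea the full pipeline uses:

```python
from math import sqrt

# Toy "embedding": bag-of-words counts over a fixed vocabulary.
# A real system uses a neural embedding model instead.
def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion phase: embed each passage once, keep (vector, text) pairs.
passages = [
    "the revenue grew ten percent in the third quarter",
    "the company hired a new head of engineering",
    "risk factors include supply chain delays",
]
vocab = sorted({w for p in passages for w in p.split()})
store = [(embed(p, vocab), p) for p in passages]

# Query phase: embed the question, rank passages by similarity, return top k.
def retrieve(question, k=1):
    qv = embed(question, vocab)
    ranked = sorted(store, key=lambda item: cosine(qv, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve("how much did revenue grow in the quarter")[0])
```

Swap the toy `embed` for a real embedding model and the list for a vector database and you have the architecture of the rest of this guide.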
Prerequisites
# Ollama with models
ollama pull llama3.1:8b # Chat model
ollama pull nomic-embed-text # Embedding model
# Python packages
pip install \
langchain \
langchain-ollama \
langchain-chroma \
langchain-community \
chromadb \
pypdf \
unstructured
Step 1: Document Loading
Loading PDFs
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
# Single PDF
loader = PyPDFLoader("report.pdf")
documents = loader.load()
# Directory of PDFs
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
Loading Multiple File Types
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
UnstructuredMarkdownLoader,
CSVLoader,
)
from langchain_community.document_loaders import DirectoryLoader
# Create loaders for different file types
loaders = [
DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader),
DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader),
DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader),
DirectoryLoader("./docs", glob="**/*.csv", loader_cls=CSVLoader),
]
documents = []
for loader in loaders:
documents.extend(loader.load())
print(f"Loaded {len(documents)} documents from all sources")
Loading Web Pages
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader([
"https://example.com/page1",
"https://example.com/page2",
])
web_docs = loader.load()
Step 2: Chunking Strategies
Chunking determines how documents are split into retrievable passages, and it is one of the highest-impact decisions for RAG quality.
Basic Recursive Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Max characters per chunk
chunk_overlap=200, # Overlap between chunks
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
Semantic Chunking
Splits at natural topic boundaries rather than fixed character counts:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
semantic_splitter = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.split_documents(documents)
Markdown-Aware Splitting
For markdown documents, split by headers to preserve structure:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)
Choosing Chunk Parameters
| Document Type | Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 500-800 | 100 | Precise answers need focused chunks |
| Legal documents | 1000-1500 | 200 | Context is critical for legal text |
| Narrative/prose | 1000-2000 | 200 | Larger chunks preserve storytelling |
| Code documentation | 500-1000 | 100 | Functions and explanations as units |
| FAQ/Q&A | 300-500 | 50 | Each Q&A pair should be one chunk |
| Research papers | 800-1200 | 150 | Balance between sections and detail |
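To see what `chunk_size` and `chunk_overlap` actually control, here is a minimal fixed-size chunker in plain Python. Production splitters like `RecursiveCharacterTextSplitter` also prefer to break at separators (paragraphs, sentences), which this sketch omits:

```python
def chunk_text(text, chunk_size, overlap):
    """Fixed-size chunking: each chunk repeats the last `overlap`
    characters of the previous one, so a sentence cut at a boundary
    still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

pieces = chunk_text("A" * 250, chunk_size=100, overlap=20)
print(len(pieces), [len(p) for p in pieces])  # → 4 [100, 100, 90, 10]
```

Note the undersized final chunk: real splitters handle this by merging small tails or splitting on natural boundaries instead of raw character counts.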
Step 3: Embedding Models
The embedding model converts text into vectors. The quality of these embeddings directly determines retrieval accuracy: if similar texts don't map to nearby vectors, the right chunks will never be retrieved.
Using Ollama Embeddings
from langchain_ollama import OllamaEmbeddings
# nomic-embed-text: Good quality, fast, 768 dimensions
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Test embedding
test_vector = embeddings.embed_query("What is machine learning?")
print(f"Vector dimensions: {len(test_vector)}") # 768
Embedding Model Options
| Model | Dimensions | Size | Quality | Speed |
|---|---|---|---|---|
| nomic-embed-text | 768 | 274 MB | Good | Fast |
| mxbai-embed-large | 1024 | 670 MB | Very good | Medium |
| all-minilm | 384 | 45 MB | Decent | Very fast |
| snowflake-arctic-embed | 1024 | 670 MB | Very good | Medium |
# Pull your chosen embedding model
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
Step 4: Vector Database Setup
ChromaDB (Recommended for Local)
ChromaDB runs embedded (no separate server) and persists to disk.
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create vector store and embed all chunks
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
collection_name="my_documents"
)
print(f"Stored {vectorstore._collection.count()} chunks in ChromaDB")
Loading an Existing Database
# Load previously created database
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings,
collection_name="my_documents"
)
Adding New Documents
# Add more documents to existing database
new_docs = PyPDFLoader("new_report.pdf").load()
new_chunks = text_splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)
FAISS Alternative
FAISS (by Meta) is faster for large collections but doesn’t persist to disk by default.
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(chunks, embeddings)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load from disk
vectorstore = FAISS.load_local(
"./faiss_index",
embeddings,
allow_dangerous_deserialization=True
)
Step 5: Retrieval Configuration
Basic Retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # Return top 4 most relevant chunks
)
# Test retrieval
results = retriever.invoke("What are the key findings?")
for doc in results:
print(f"Source: {doc.metadata.get('source', 'unknown')}")
print(f"Content: {doc.page_content[:200]}...")
print("---")
MMR Retrieval (Diversity)
Maximum Marginal Relevance balances relevance with diversity to avoid returning near-duplicate chunks:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 4,
"fetch_k": 20, # Fetch 20 candidates, return 4 diverse ones
"lambda_mult": 0.5 # 0 = max diversity, 1 = max relevance
}
)
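Under the hood, MMR is a greedy loop: each step picks the candidate with the best trade-off between relevance to the query and redundancy with already-selected chunks. A self-contained sketch with toy 2-D vectors, using dot product as a stand-in for the real similarity function:

```python
def mmr(query_vec, doc_vecs, k, lambda_mult, sim):
    """Greedy MMR: each step picks the candidate that best balances
    similarity to the query against similarity to chunks already chosen."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = sim(query_vec, doc_vecs[i])
            redundancy = max(
                (sim(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def dot(a, b):  # toy similarity measure
    return sum(x * y for x, y in zip(a, b))

query = [0.8, 0.6]
docs = [[1.0, 0.0], [0.95, 0.05], [0.1, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=0.5, sim=dot))  # → [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0, sim=dot))  # → [0, 1]
```

With `lambda_mult=0.5` the near-duplicate doc 1 is skipped in favor of the more distinct doc 2; with `lambda_mult=1.0` the result is plain similarity ranking.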
Similarity Score Threshold
Only return chunks above a relevance threshold:
retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.7, # Only return if similarity > 0.7
"k": 6
}
)
Step 6: Building the RAG Chain
Basic RAG Chain with LangChain
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="llama3.1:8b",
temperature=0.1, # Low temperature for factual answers
)
# RAG prompt template
template = """Answer the question based only on the following context.
If the context doesn't contain enough information to answer the question,
say "I don't have enough information to answer that question."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Helper to format retrieved documents
def format_docs(docs):
return "\n\n---\n\n".join(
f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
for doc in docs
)
# Build the chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Ask a question
response = rag_chain.invoke("What are the main conclusions of the report?")
print(response)
RAG Chain with Sources
from langchain_core.runnables import RunnableParallel
# Sub-chain that formats the retrieved documents and generates the answer
answer_chain = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | prompt
    | llm
    | StrOutputParser()
)
# Chain that returns both the answer and the source documents
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=answer_chain)
result = rag_chain_with_sources.invoke("What is the budget for Q3?")
print("Answer:", result["answer"])
print("\nSources:")
for doc in result["context"]:
print(f" - {doc.metadata.get('source')}: {doc.page_content[:100]}...")
Conversational RAG (Chat History)
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
# Prompt with chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
("system", """Given a chat history and the latest user question,
formulate a standalone question that can be understood without
the chat history. Do NOT answer the question, just reformulate it."""),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
])
# Full conversational RAG prompt
qa_prompt = ChatPromptTemplate.from_messages([
("system", """Answer the question based on the following context.
If unsure, say you don't know.
Context: {context}"""),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
])
# Usage with chat history
chat_history = []
def ask(question):
# Simple conversational chain
context_docs = retriever.invoke(question)
context = format_docs(context_docs)
messages = qa_prompt.format_messages(
context=context,
chat_history=chat_history,
input=question
)
response = llm.invoke(messages)
chat_history.append(HumanMessage(content=question))
chat_history.append(AIMessage(content=response.content))
return response.content
print(ask("What were the Q3 results?"))
print(ask("How do they compare to Q2?")) # Uses chat history
Step 7: Alternative with LlamaIndex
LlamaIndex provides a more streamlined API for RAG:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Configure global settings
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.chunk_size = 1000
Settings.chunk_overlap = 200
# Load and index documents
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(documents)
# Save index to disk
index.storage_context.persist(persist_dir="./llama_index_store")
# Create query engine
query_engine = index.as_query_engine(
similarity_top_k=4,
streaming=True,
)
# Ask questions
response = query_engine.query("Summarize the key findings")
print(response)
print("\nSources:")
for node in response.source_nodes:
print(f" Score: {node.score:.3f} - {node.metadata.get('file_name', 'unknown')}")
LlamaIndex Chat Engine (Conversational)
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
similarity_top_k=4,
)
response = chat_engine.chat("What are the main points?")
print(response)
response = chat_engine.chat("Can you elaborate on the second point?")
print(response)
Step 8: Building a Complete Application
Streamlit RAG App
# app.py
import streamlit as st
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import tempfile
import os
st.title("Local RAG Chatbot")
st.caption("Upload documents and ask questions. Everything runs locally.")
# Initialize components
@st.cache_resource
def get_embeddings():
return OllamaEmbeddings(model="nomic-embed-text")
@st.cache_resource
def get_llm():
return ChatOllama(model="llama3.1:8b", temperature=0.1)
embeddings = get_embeddings()
llm = get_llm()
# File upload
uploaded_files = st.file_uploader(
"Upload PDF documents",
type=["pdf"],
accept_multiple_files=True
)
if uploaded_files:
with st.spinner("Processing documents..."):
documents = []
for file in uploaded_files:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
tmp.write(file.getvalue())
loader = PyPDFLoader(tmp.name)
documents.extend(loader.load())
os.unlink(tmp.name)
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
st.success(f"Processed {len(documents)} pages into {len(chunks)} chunks")
# Chat interface
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
st.chat_message(msg["role"]).write(msg["content"])
if question := st.chat_input("Ask a question about your documents"):
    if not uploaded_files:
        st.warning("Upload at least one document before asking questions.")
        st.stop()
    st.session_state.messages.append({"role": "user", "content": question})
    st.chat_message("user").write(question)
template = """Answer based on this context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
return "\n\n".join(d.page_content for d in docs)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
with st.chat_message("assistant"):
response = chain.invoke(question)
st.write(response)
st.session_state.messages.append({"role": "assistant", "content": response})
# Run the app
streamlit run app.py
Step 9: Prompt Engineering for RAG
The prompt template significantly impacts answer quality.
Effective RAG Prompts
# Strict factual (no hallucination)
strict_template = """You are a precise assistant that answers questions
using ONLY the provided context. Follow these rules:
1. Only use information from the context below
2. If the context doesn't contain the answer, say "This information is not in the provided documents"
3. Quote relevant passages when possible
4. Cite the source document
Context:
{context}
Question: {question}
Answer:"""
# Analytical (synthesize across documents)
analytical_template = """Analyze the following context to answer the question.
Synthesize information across multiple sources if needed.
Highlight any contradictions or gaps in the available information.
Context:
{context}
Question: {question}
Analysis:"""
# Conversational (natural dialogue)
conversational_template = """You are a helpful assistant with access to
a knowledge base. Use the context below to answer naturally.
If you're unsure, say so.
Context:
{context}
Question: {question}
Response:"""
Step 10: Evaluation and Improvement
Testing Retrieval Quality
# Test if the right chunks are being retrieved
test_questions = [
"What was the revenue in Q3?",
"Who is the CEO?",
"What are the risk factors?",
]
for question in test_questions:
    # similarity_search_with_score returns (doc, score) pairs;
    # for Chroma the score is a distance, so lower means more similar
    results = vectorstore.similarity_search_with_score(question, k=4)
    print(f"\nQ: {question}")
    for i, (doc, score) in enumerate(results):
        print(f"  [{i+1}] score={score:.3f} - {doc.page_content[:100]}...")
Common Issues and Fixes
| Problem | Symptom | Fix |
|---|---|---|
| Chunks too large | Answer includes irrelevant info | Reduce chunk_size to 500 |
| Chunks too small | Missing context | Increase chunk_size to 1500 |
| Wrong chunks retrieved | Irrelevant results | Try better embedding model (mxbai-embed-large) |
| Hallucination | Makes up facts | Use stricter prompt, lower temperature |
| Slow retrieval | Long wait for results | Reduce k, use FAISS instead of ChromaDB |
| Missing answers | "Not in documents" when it is | Increase k to 6-8, check chunking boundaries |
Improving Quality
- Better chunking: Use semantic chunking or document-structure-aware splitting
- Better embeddings: Upgrade from nomic-embed-text to mxbai-embed-large
- Hybrid search: Combine vector similarity with keyword (BM25) search
- Reranking: After retrieval, rerank results with a cross-encoder model
- Larger LLM: Use a 14B or 32B model for better comprehension
- Query expansion: Rephrase the query to improve retrieval
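Hybrid search is typically implemented by running vector and keyword retrieval separately, then merging the two ranked lists. One common merge is Reciprocal Rank Fusion (RRF), which needs only ranks, not raw scores; a pure-Python sketch:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in; highest total wins.
    `rankings` is a list of ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # ranked by embedding similarity
keyword_hits = ["d1", "d9", "d3"]  # ranked by BM25
print(rrf_merge([vector_hits, keyword_hits]))  # → ['d1', 'd3', 'd9', 'd7']
```

Documents ranked well by both retrievers float to the top. In LangChain, a similar effect is available out of the box by combining a BM25 retriever with the vector retriever via an ensemble retriever.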
No-Code RAG with Open WebUI
If you don’t want to write code, Open WebUI has built-in RAG:
# Start Open WebUI with Ollama
docker compose up -d # Using the compose file from earlier guides
# In the Open WebUI interface:
# 1. Click the + button in a chat
# 2. Upload documents (PDF, TXT, etc.)
# 3. Ask questions about the uploaded content
# Open WebUI handles chunking, embedding, and retrieval automatically
Next Steps
- Deploy for your team: Enterprise Local AI
- Add voice input: Local Voice Assistant
- Fine-tune for better results: Fine-Tuning guide
- Choose better models: Model selection guide