Most RAG tutorials show you the happy path. They index three documents, ask one perfectly-scoped question, and celebrate when the model regurgitates the answer. Then you try it with your actual data and it falls apart — wrong chunks retrieved, hallucinated answers, context windows overflowing, embeddings that do not capture semantic meaning.
This guide builds a RAG pipeline that works on real-world data. Not “works in a demo.” Works when you throw 5,000 messy documents at it, ask ambiguous questions, and expect reliable answers. We will use Ollama for inference and embeddings, ChromaDB as the vector store, and LangChain for orchestration — all running locally, no API keys needed.
Prerequisites
- Python 3.11+
- Ollama installed and running (ollama serve)
- At least 16GB RAM (32GB recommended)
- A GPU with 8GB+ VRAM for the language model (or a capable CPU if you are patient)
Pull the models we will use:
ollama pull qwen3:14b
ollama pull nomic-embed-text
Project Setup
mkdir local-rag && cd local-rag
python -m venv venv
source venv/bin/activate
pip install langchain langchain-ollama langchain-chroma chromadb pypdf docx2txt unstructured
Create the project structure:
local-rag/
documents/ # Drop your files here
chroma_db/ # Persistent vector store
ingest.py # Document ingestion pipeline
query.py # Query interface
config.py # Shared configuration
Step 1: Configuration
# config.py
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "nomic-embed-text"
CHAT_MODEL = "qwen3:14b"
CHROMA_PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "local_docs"
# Chunking configuration — these values are tuned for general-purpose use
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
# Retrieval configuration
TOP_K = 5
The chunk size and overlap values matter more than most tutorials acknowledge. Here is the reasoning:
- 512 tokens is small enough to be semantically focused but large enough to contain a complete thought. Smaller chunks (128-256) retrieve more precisely but lose context. Larger chunks (1024+) include more context but dilute relevance.
- 50-token overlap ensures that important information split across chunk boundaries is captured in at least one chunk.
We will revisit these values when we discuss tuning.
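These two knobs are easier to reason about with a toy splitter in hand. The sketch below is not the LangChain splitter, just a fixed-window version showing that the stride between chunks is chunk_size minus chunk_overlap. One caveat worth noting: with `length_function=len`, as in our config, the splitter measures sizes in characters, not tokens.

```python
def naive_split(text, chunk_size=512, chunk_overlap=50):
    """Fixed-window splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so every boundary region appears twice."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("x" * 2000)
print(len(chunks), len(chunks[0]))  # 5 chunks; the first is 512 characters long
```

The real `RecursiveCharacterTextSplitter` prefers to break at the listed separators rather than at hard character offsets, but the size/overlap trade-off works the same way.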
Step 2: Document Ingestion
# ingest.py
import os
import sys
import time

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    Docx2txtLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

from config import *


def get_loader(file_path: str):
    """Return the appropriate loader based on file extension."""
    ext = os.path.splitext(file_path)[1].lower()
    loaders = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
        ".md": UnstructuredMarkdownLoader,
        ".docx": Docx2txtLoader,
    }
    loader_class = loaders.get(ext)
    if loader_class is None:
        print(f"Skipping unsupported file type: {ext}")
        return None
    return loader_class(file_path)


def ingest_documents(docs_dir: str = "./documents"):
    """Load, chunk, embed, and store documents."""
    # Collect all documents
    all_docs = []
    for root, _dirs, files in os.walk(docs_dir):
        for file in files:
            file_path = os.path.join(root, file)
            loader = get_loader(file_path)
            if loader:
                try:
                    docs = loader.load()
                    # Add source metadata
                    for doc in docs:
                        doc.metadata["source"] = file_path
                        doc.metadata["filename"] = file
                    all_docs.extend(docs)
                    print(f"Loaded: {file} ({len(docs)} pages/sections)")
                except Exception as e:
                    print(f"Error loading {file}: {e}")

    if not all_docs:
        print("No documents found. Add files to the documents/ directory.")
        sys.exit(1)
    print(f"\nTotal documents loaded: {len(all_docs)}")

    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    chunks = splitter.split_documents(all_docs)
    print(f"Total chunks created: {len(chunks)}")

    # Create embeddings and store in ChromaDB
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url=OLLAMA_BASE_URL,
    )
    print("Embedding and storing chunks (this may take a while)...")
    start = time.time()

    # Process in batches to avoid overwhelming Ollama
    batch_size = 100
    vectorstore = None
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        if vectorstore is None:
            vectorstore = Chroma.from_documents(
                documents=batch,
                embedding=embeddings,
                persist_directory=CHROMA_PERSIST_DIR,
                collection_name=COLLECTION_NAME,
            )
        else:
            vectorstore.add_documents(batch)
        print(f"  Processed {min(i + batch_size, len(chunks))}/{len(chunks)} chunks")

    elapsed = time.time() - start
    print(f"\nIngestion complete in {elapsed:.1f}s")
    print(f"Vector store saved to {CHROMA_PERSIST_DIR}")


if __name__ == "__main__":
    ingest_documents()
Pain Point #1: Ingestion is slow. Embedding 5,000 chunks with Nomic Embed Text on a CPU takes about 15 minutes. On a GPU, it takes about 2 minutes. If you have thousands of documents, run ingestion once and persist the vector store. Incremental ingestion (only processing new/changed files) is a valuable improvement we cover in the tuning section.
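Those timings imply a throughput of roughly 5-6 chunks per second on CPU and around 40 on GPU. For planning larger ingestions, a back-of-envelope estimator helps (the throughput figures below are illustrative, back-derived from the timings above; measure your own):

```python
def estimate_ingest_minutes(num_chunks, chunks_per_second):
    """Back-of-envelope ingestion time from an embedding throughput estimate."""
    return num_chunks / chunks_per_second / 60

# Illustrative throughputs derived from the CPU/GPU timings quoted above
cpu_minutes = estimate_ingest_minutes(5000, 5.5)
gpu_minutes = estimate_ingest_minutes(5000, 40)
print(round(cpu_minutes), round(gpu_minutes))  # roughly 15 and 2
```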
Step 3: The Query Pipeline
# query.py
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

from config import *

# System prompt that reduces hallucination
PROMPT_TEMPLATE = """You are a helpful assistant answering questions based on the
provided context. Use ONLY the context below to answer. If the context does not
contain enough information to answer the question, say "I don't have enough
information in the provided documents to answer that question."

Do not make up information. Do not use knowledge outside of the provided context.
If you are unsure, say so.

Context:
{context}

Question: {question}

Answer:"""


def create_chain():
    """Create the RAG chain."""
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url=OLLAMA_BASE_URL,
    )
    vectorstore = Chroma(
        persist_directory=CHROMA_PERSIST_DIR,
        embedding_function=embeddings,
        collection_name=COLLECTION_NAME,
    )
    retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance for diversity
        search_kwargs={
            "k": TOP_K,
            "fetch_k": 20,  # Fetch more, then re-rank for diversity
        },
    )
    llm = OllamaLLM(
        model=CHAT_MODEL,
        base_url=OLLAMA_BASE_URL,
        temperature=0.1,  # Low temperature for factual answers
    )
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE,
        input_variables=["context", "question"],
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
    return chain


def query(question: str):
    """Ask a question and get an answer with sources."""
    chain = create_chain()
    result = chain.invoke({"query": question})
    print(f"\nQuestion: {question}")
    print(f"\nAnswer: {result['result']}")
    print("\nSources:")
    seen = set()
    for doc in result["source_documents"]:
        source = doc.metadata.get("filename", "unknown")
        if source not in seen:
            seen.add(source)
            print(f"  - {source}")
    return result


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        question = " ".join(sys.argv[1:])
    else:
        question = input("Ask a question: ")
    query(question)
Step 4: Run It
# Drop some documents in the documents/ directory
cp ~/papers/*.pdf ./documents/
cp ~/notes/*.md ./documents/
# Ingest
python ingest.py
# Query
python query.py "What are the key findings from the Q3 report?"
If everything is set up correctly, you should get an answer grounded in your documents with source citations. But getting it to work well requires tuning.
The Tuning Guide: Fixing Common Problems
Problem: “The retrieved chunks are irrelevant”
This is the most common RAG failure. The embedding model retrieves chunks that are semantically similar to the query but do not actually contain the answer.
Solutions:
- Try different chunk sizes. If your documents have long, detailed paragraphs, increase CHUNK_SIZE to 1024. If they are dense technical documents with many discrete facts, decrease to 256.
- Use MMR retrieval (already set up in our code). Maximal Marginal Relevance ensures retrieved chunks are diverse, not just five paraphrases of the same passage.
- Add metadata filtering. If your documents span different topics, add metadata during ingestion and filter at query time:
# During ingestion, add topic metadata
doc.metadata["topic"] = "engineering"  # or detect automatically

# During retrieval, filter
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": TOP_K,
        "filter": {"topic": "engineering"},
    },
)
- Use a better embedding model. Nomic Embed Text is good and runs fast, but if retrieval quality is critical, consider mxbai-embed-large or snowflake-arctic-embed:
ollama pull mxbai-embed-large
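For intuition about what MMR retrieval is doing under the hood, here is a toy greedy implementation over made-up similarity scores. This is not ChromaDB's implementation, just the core scoring idea: relevance to the query minus a redundancy penalty against documents already selected, weighted by a parameter lambda.

```python
def mmr(query_sim, pairwise_sim, k, lam=0.5):
    """Toy Maximal Marginal Relevance: greedily pick documents that score well
    against the query but poorly against what is already selected."""
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR picks 0, then skips 1 for the distinct doc 2
query_sim = [0.9, 0.88, 0.7]
pairwise = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr(query_sim, pairwise, k=2))  # [0, 2]
```

With lam=1.0 the penalty vanishes and this degenerates to plain top-k by similarity, which is exactly the "five paraphrases of the same passage" failure mode MMR exists to avoid.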
Problem: “The model hallucinates despite having context”
The model ignores the retrieved context and makes up information.
Solutions:
- Lower the temperature. We set 0.1 in the code, but for pure factual Q&A, try 0.0.
- Strengthen the system prompt. The prompt template above is deliberately aggressive about constraining the model to the context. If hallucination persists, make it stronger:
IMPORTANT: Base your answer SOLELY on the context above. If the answer is not
explicitly stated in the context, respond with "This information is not available
in the provided documents." Never supplement with outside knowledge.
- Use a more instruction-following model. Some models are better at following the “only use this context” instruction. Qwen 3 and Llama 4 Scout are particularly good at this. Older or smaller models tend to ignore instructions more.
Problem: “Answers are too vague or too short”
The model gives correct but unhelpful answers like “Yes, the document mentions this.”
Solutions:
- Adjust the prompt to request detailed answers:
Provide a detailed, comprehensive answer. Include specific numbers, dates,
and names from the context when available.
- Increase TOP_K to retrieve more chunks, giving the model more context to synthesize from. Try 8-10 instead of 5.
- Use the “refine” chain type instead of “stuff” for longer answers that need to synthesize across many chunks:
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",  # Iteratively refines answer with each chunk
    retriever=retriever,
    return_source_documents=True,
)
Problem: “Ingestion is too slow for my document volume”
If you have tens of thousands of documents, the initial ingestion can take hours.
Solutions:
- Batch embedding on GPU. Make sure Ollama is using your GPU for the embedding model. Check with ollama ps during ingestion.
- Implement incremental ingestion. Track which files have been processed and only ingest new or modified files:
import hashlib
import json
import os

PROCESSED_FILE = "./chroma_db/processed.json"

def get_file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get_processed():
    if os.path.exists(PROCESSED_FILE):
        with open(PROCESSED_FILE) as f:
            return json.load(f)
    return {}

def save_processed(processed):
    with open(PROCESSED_FILE, "w") as f:
        json.dump(processed, f)
- Use a faster embedding model. nomic-embed-text is a good balance of speed and quality. If speed is critical and you can sacrifice some retrieval quality, smaller embedding models exist.
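Wiring hash tracking into the ingestion walk might look like the sketch below. It is self-contained with a throwaway directory for demonstration; `files_to_ingest` is a hypothetical helper, not part of the pipeline above:

```python
import hashlib
import os
import tempfile

def file_hash(path):
    """Content hash used to detect modified files (md5 is fine for change detection)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def files_to_ingest(docs_dir, processed):
    """Return files that are new or changed since the last run, updating
    the processed map in place."""
    todo = []
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            path = os.path.join(root, name)
            h = file_hash(path)
            if processed.get(path) != h:
                todo.append(path)
                processed[path] = h
    return todo

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("hello")
    processed = {}
    print(len(files_to_ingest(d, processed)))  # 1: new file, needs ingesting
    print(len(files_to_ingest(d, processed)))  # 0: unchanged, skipped
```

In the real pipeline you would pass the returned paths to the loader loop and persist the processed map with the save_processed helper shown earlier.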
Problem: “Context window overflow with many retrieved chunks”
When TOP_K is high and chunks are large, the total context can exceed the model’s context window.
Solutions:
- Calculate your budget. A 14B model with 8K context can hold about 6,000 tokens of retrieved context plus the prompt and expected answer. With 512-token chunks and TOP_K=5, that is 2,560 tokens — well within budget. But TOP_K=15 with 1024-token chunks would overflow.
- Use a re-ranking step to filter retrieved chunks before stuffing them into the prompt:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

compressor = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
This uses the LLM to evaluate whether each retrieved chunk is actually relevant to the question before including it in the final prompt. It is slower but significantly improves answer quality.
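That budget arithmetic is worth encoding as a quick pre-flight check before raising TOP_K. A sketch, where `fits_context` is a hypothetical helper and `prompt_overhead` and `answer_budget` are rough assumptions rather than measured token counts:

```python
def fits_context(num_chunks, chunk_tokens, context_window=8192,
                 prompt_overhead=500, answer_budget=1024):
    """Rough pre-flight check: retrieved chunks + prompt template + expected
    answer must fit within the model's context window."""
    used = num_chunks * chunk_tokens + prompt_overhead + answer_budget
    return used <= context_window

print(fits_context(5, 512))    # True: the article's default settings fit easily
print(fits_context(15, 1024))  # False: overflows an 8K window
```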
Adding a Chat Interface
For a more interactive experience, wrap the query pipeline in a simple chat loop:
# chat.py
from query import create_chain


def chat():
    chain = create_chain()
    print("RAG Chat — ask questions about your documents (type 'quit' to exit)")
    print("-" * 60)
    while True:
        question = input("\nYou: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue
        result = chain.invoke({"query": question})
        print(f"\nAssistant: {result['result']}")
        sources = set(d.metadata.get("filename", "?") for d in result["source_documents"])
        print(f"\n[Sources: {', '.join(sources)}]")


if __name__ == "__main__":
    chat()
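Note that this loop treats every question independently, so follow-ups like "what about the second one?" will not retrieve well. A crude way to handle that without a full conversational chain is to prefix recent turns onto the new query before it hits the retriever; `contextualize` below is a hypothetical helper, not part of query.py:

```python
def contextualize(question, history, max_turns=2):
    """Prefix the last few Q/A turns onto a new question so retrieval and the
    model can resolve references to earlier answers. Crude but often effective."""
    recent = history[-max_turns:]
    prefix = "\n".join(f"Q: {q}\nA: {a}" for q, a in recent)
    return f"{prefix}\n\nFollow-up question: {question}" if prefix else question

history = [("What does the report cover?", "It covers Q3 results.")]
print(contextualize("How does that compare to Q2?", history))
```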
Production Hardening Checklist
If you are deploying this for real use (not just experimentation), address these items:
- Persistent ChromaDB storage. Our code already persists to disk, but consider running ChromaDB as a separate server process for multi-client access:
chroma run --path ./chroma_db --port 8000
- Document deduplication. If the same content appears in multiple files, you will get duplicate chunks in your vector store. Hash-based deduplication during ingestion prevents this.
- Access control. ChromaDB does not have built-in access control. If multiple users share the same RAG system, implement filtering at the application layer or use separate collections per user/team.
- Monitoring. Log every query, the retrieved chunks, and the generated answer. This log is invaluable for debugging retrieval quality issues.
- Regular re-indexing. As your document corpus changes, schedule periodic re-ingestion to keep the vector store current.
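The hash-based deduplication item from the checklist can be sketched in a few lines. This toy version operates on plain strings; in the pipeline you would hash each chunk's page_content during ingestion, before the embedding step:

```python
import hashlib

def dedupe_chunks(chunks):
    """Keep only the first occurrence of each chunk, compared by a hash of
    its normalized (stripped, lowercased) text."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

print(dedupe_chunks(["Same text.", "same text. ", "Different text."]))
```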
Why This Stack
A brief justification for each component choice:
- Ollama over raw llama.cpp: Model management, API compatibility, and multi-model support with minimal configuration. The overhead is negligible.
- ChromaDB over Qdrant/Milvus/Weaviate: Simplest setup, embedded mode works great for up to ~1M documents, and the Python API is clean. For larger scale, Qdrant is worth considering.
- LangChain over LlamaIndex/raw code: Despite its complexity reputation, LangChain’s RAG chains are well-tested and handle edge cases (token counting, chunk overflow, source tracking) that you would otherwise have to implement yourself.
- Nomic Embed Text over other embedding models: Good quality, runs fast on Ollama, and produces 768-dimensional embeddings that balance quality with storage efficiency.
Next Steps
This pipeline handles straightforward document Q&A well. To go further:
- Multi-modal RAG: Index images and tables from PDFs using vision models
- Hybrid search: Combine vector similarity with BM25 keyword search for better retrieval
- Agentic RAG: Use an agent that can reformulate queries, search multiple collections, and chain multiple retrieval steps
- Evaluation: Use RAGAS or similar frameworks to systematically measure retrieval and generation quality
The full code from this tutorial is available in our GitHub repository. Questions? Join our community Discord.
This tutorial was last updated April 2026. We will update it as the underlying tools evolve.