RAG (Retrieval-Augmented Generation) is the most practical way to make a local LLM answer questions about your own documents, knowledge base, or data. Instead of fine-tuning the model (which is expensive and inflexible), RAG retrieves relevant document chunks at query time and includes them in the LLM's prompt. This guide walks through building a complete local RAG pipeline from scratch, using Ollama for inference, an embedding model for vector search, and ChromaDB for storage. No data leaves your machine.
How RAG Works
The RAG pipeline has two phases:
Ingestion Phase (One-time)
Documents → Chunking → Embedding → Vector Database
- Load documents: Read PDFs, text files, markdown, web pages
- Chunk: Split documents into smaller passages (typically 500-1000 characters)
- Embed: Convert each chunk into a numerical vector using an embedding model
- Store: Save vectors in a vector database (ChromaDB, FAISS)
Query Phase (Every question)
User Query → Embed Query → Search → Top K Chunks → LLM + Context → Answer
- Embed the question: Convert the user’s question into a vector
- Search: Find the most similar document chunks in the vector database
- Build prompt: Combine the retrieved chunks with the question in a prompt
- Generate: Send the prompt to the LLM, which generates an answer grounded in the retrieved context
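The query phase above can be sketched in a few lines of plain Python. The bag-of-words `embed` function below is a toy stand-in for a real embedding model, but the retrieve-by-cosine-similarity logic is the same idea the full pipeline uses:

```python
from math import sqrt

# Toy "embedding": bag-of-words counts over a fixed vocabulary.
# A real system uses a neural embedding model instead.
def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion phase: embed each passage once, keep (vector, text) pairs.
passages = [
    "the revenue grew ten percent in the third quarter",
    "the company hired a new head of engineering",
    "risk factors include supply chain delays",
]
vocab = sorted({w for p in passages for w in p.split()})
store = [(embed(p, vocab), p) for p in passages]

# Query phase: embed the question, rank passages by similarity, return top k.
def retrieve(question, k=1):
    qv = embed(question, vocab)
    ranked = sorted(store, key=lambda item: cosine(qv, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve("how much did revenue grow in the quarter")[0])
```

Swap the toy `embed` for a real embedding model and the list for a vector database and you have the architecture of the rest of this guide.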
Prerequisites
# Ollama with models
ollama pull llama3.1:8b # Chat model
ollama pull nomic-embed-text # Embedding model
# Python packages
pip install \
langchain \
langchain-ollama \
langchain-chroma \
langchain-community \
chromadb \
pypdf \
unstructured
Step 1: Document Loading
Loading PDFs
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
# Single PDF
loader = PyPDFLoader("report.pdf")
documents = loader.load()
# Directory of PDFs
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
Loading Multiple File Types
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
UnstructuredMarkdownLoader,
CSVLoader,
)
from langchain_community.document_loaders import DirectoryLoader
# Create loaders for different file types
loaders = [
DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader),
DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader),
DirectoryLoader("./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader),
DirectoryLoader("./docs", glob="**/*.csv", loader_cls=CSVLoader),
]
documents = []
for loader in loaders:
documents.extend(loader.load())
print(f"Loaded {len(documents)} documents from all sources")
Loading Web Pages
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader([
"https://example.com/page1",
"https://example.com/page2",
])
web_docs = loader.load()
Step 2: Chunking Strategies
Chunking determines how documents are split into retrievable passages, and it is one of the highest-impact decisions for RAG quality.
Basic Recursive Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Max characters per chunk
chunk_overlap=200, # Overlap between chunks
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
Semantic Chunking
Splits at natural topic boundaries rather than fixed character counts:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
semantic_splitter = SemanticChunker(
embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.split_documents(documents)
Markdown-Aware Splitting
For markdown documents, split by headers to preserve structure:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
md_chunks = md_splitter.split_text(markdown_text)
Choosing Chunk Parameters
| Document Type | Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Technical docs | 500-800 | 100 | Precise answers need focused chunks |
| Legal documents | 1000-1500 | 200 | Context is critical for legal text |
| Narrative/prose | 1000-2000 | 200 | Larger chunks preserve storytelling |
| Code documentation | 500-1000 | 100 | Functions and explanations as units |
| FAQ/Q&A | 300-500 | 50 | Each Q&A pair should be one chunk |
| Research papers | 800-1200 | 150 | Balance between sections and detail |
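To see what `chunk_size` and `chunk_overlap` actually control, here is a minimal fixed-size chunker in plain Python. Production splitters like `RecursiveCharacterTextSplitter` also prefer to break at separators (paragraphs, sentences), which this sketch omits:

```python
def chunk_text(text, chunk_size, overlap):
    """Fixed-size chunking: each chunk repeats the last `overlap`
    characters of the previous one, so a sentence cut at a boundary
    still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

pieces = chunk_text("A" * 250, chunk_size=100, overlap=20)
print(len(pieces), [len(p) for p in pieces])  # → 4 [100, 100, 90, 10]
```

Note the undersized final chunk: real splitters handle this by merging small tails or splitting on natural boundaries instead of raw character counts.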
Step 3: Embedding Models
The embedding model converts text into vectors. The quality of these embeddings directly determines retrieval accuracy: if similar texts don't map to nearby vectors, the right chunks will never be retrieved.
Using Ollama Embeddings
from langchain_ollama import OllamaEmbeddings
# nomic-embed-text: Good quality, fast, 768 dimensions
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Test embedding
test_vector = embeddings.embed_query("What is machine learning?")
print(f"Vector dimensions: {len(test_vector)}") # 768
Embedding Model Options
| Model | Dimensions | Size | Quality | Speed |
|---|---|---|---|---|
| nomic-embed-text | 768 | 274 MB | Good | Fast |
| mxbai-embed-large | 1024 | 670 MB | Very good | Medium |
| all-minilm | 384 | 45 MB | Decent | Very fast |
| snowflake-arctic-embed | 1024 | 670 MB | Very good | Medium |
# Pull your chosen embedding model
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
Step 4: Vector Database Setup
ChromaDB (Recommended for Local)
ChromaDB runs embedded (no separate server) and persists to disk.
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create vector store and embed all chunks
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
collection_name="my_documents"
)
print(f"Stored {vectorstore._collection.count()} chunks in ChromaDB")
Loading an Existing Database
# Load previously created database
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings,
collection_name="my_documents"
)
Adding New Documents
# Add more documents to existing database
new_docs = PyPDFLoader("new_report.pdf").load()
new_chunks = text_splitter.split_documents(new_docs)
vectorstore.add_documents(new_chunks)
FAISS Alternative
FAISS (by Meta) is faster for large collections but doesn’t persist to disk by default.
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(chunks, embeddings)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load from disk
vectorstore = FAISS.load_local(
"./faiss_index",
embeddings,
allow_dangerous_deserialization=True
)
Step 5: Retrieval Configuration
Basic Retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # Return top 4 most relevant chunks
)
# Test retrieval
results = retriever.invoke("What are the key findings?")
for doc in results:
print(f"Source: {doc.metadata.get('source', 'unknown')}")
print(f"Content: {doc.page_content[:200]}...")
print("---")
MMR Retrieval (Diversity)
Maximum Marginal Relevance balances relevance with diversity to avoid returning near-duplicate chunks:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 4,
"fetch_k": 20, # Fetch 20 candidates, return 4 diverse ones
"lambda_mult": 0.5 # 0 = max diversity, 1 = max relevance
}
)
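Under the hood, MMR is a greedy loop: each step picks the candidate with the best trade-off between relevance to the query and redundancy with already-selected chunks. A self-contained sketch with toy 2-D vectors, using dot product as a stand-in for the real similarity function:

```python
def mmr(query_vec, doc_vecs, k, lambda_mult, sim):
    """Greedy MMR: each step picks the candidate that best balances
    similarity to the query against similarity to chunks already chosen."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = sim(query_vec, doc_vecs[i])
            redundancy = max(
                (sim(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def dot(a, b):  # toy similarity measure
    return sum(x * y for x, y in zip(a, b))

query = [0.8, 0.6]
docs = [[1.0, 0.0], [0.95, 0.05], [0.1, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=0.5, sim=dot))  # → [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0, sim=dot))  # → [0, 1]
```

With `lambda_mult=0.5` the near-duplicate doc 1 is skipped in favor of the more distinct doc 2; with `lambda_mult=1.0` the result is plain similarity ranking.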
Similarity Score Threshold
Only return chunks above a relevance threshold:
retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.7, # Only return if similarity > 0.7
"k": 6
}
)
Step 6: Building the RAG Chain
Basic RAG Chain with LangChain
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="llama3.1:8b",
temperature=0.1, # Low temperature for factual answers
)
# RAG prompt template
template = """Answer the question based only on the following context.
If the context doesn't contain enough information to answer the question,
say "I don't have enough information to answer that question."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Helper to format retrieved documents
def format_docs(docs):
return "\n\n---\n\n".join(
f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
for doc in docs
)
# Build the chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Ask a question
response = rag_chain.invoke("What are the main conclusions of the report?")
print(response)
RAG Chain with Sources
from langchain_core.runnables import RunnableParallel
# Sub-chain that formats the retrieved documents and generates the answer
answer_chain = (
    RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
    | prompt
    | llm
    | StrOutputParser()
)
# Chain that returns both the answer and the source documents
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=answer_chain)
result = rag_chain_with_sources.invoke("What is the budget for Q3?")
print("Answer:", result["answer"])
print("\nSources:")
for doc in result["context"]:
print(f" - {doc.metadata.get('source')}: {doc.page_content[:100]}...")
Conversational RAG (Chat History)
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
# Prompt with chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
("system", """Given a chat history and the latest user question,
formulate a standalone question that can be understood without
the chat history. Do NOT answer the question, just reformulate it."""),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
])
# Full conversational RAG prompt
qa_prompt = ChatPromptTemplate.from_messages([
("system", """Answer the question based on the following context.
If unsure, say you don't know.
Context: {context}"""),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
])
# Usage with chat history
chat_history = []
def ask(question):
# Simple conversational chain
context_docs = retriever.invoke(question)
context = format_docs(context_docs)
messages = qa_prompt.format_messages(
context=context,
chat_history=chat_history,
input=question
)
response = llm.invoke(messages)
chat_history.append(HumanMessage(content=question))
chat_history.append(AIMessage(content=response.content))
return response.content
print(ask("What were the Q3 results?"))
print(ask("How do they compare to Q2?")) # Uses chat history
Step 7: Alternative with LlamaIndex
LlamaIndex provides a more streamlined API for RAG:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Configure global settings
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.chunk_size = 1000
Settings.chunk_overlap = 200
# Load and index documents
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(documents)
# Save index to disk
index.storage_context.persist(persist_dir="./llama_index_store")
# Create query engine
query_engine = index.as_query_engine(
similarity_top_k=4,
streaming=True,
)
# Ask questions
response = query_engine.query("Summarize the key findings")
print(response)
print("\nSources:")
for node in response.source_nodes:
print(f" Score: {node.score:.3f} - {node.metadata.get('file_name', 'unknown')}")
LlamaIndex Chat Engine (Conversational)
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
similarity_top_k=4,
)
response = chat_engine.chat("What are the main points?")
print(response)
response = chat_engine.chat("Can you elaborate on the second point?")
print(response)
Step 8: Building a Complete Application
Streamlit RAG App
# app.py
import streamlit as st
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import tempfile
import os
st.title("Local RAG Chatbot")
st.caption("Upload documents and ask questions. Everything runs locally.")
# Initialize components
@st.cache_resource
def get_embeddings():
return OllamaEmbeddings(model="nomic-embed-text")
@st.cache_resource
def get_llm():
return ChatOllama(model="llama3.1:8b", temperature=0.1)
embeddings = get_embeddings()
llm = get_llm()
# File upload
uploaded_files = st.file_uploader(
"Upload PDF documents",
type=["pdf"],
accept_multiple_files=True
)
if uploaded_files:
with st.spinner("Processing documents..."):
documents = []
for file in uploaded_files:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
tmp.write(file.getvalue())
loader = PyPDFLoader(tmp.name)
documents.extend(loader.load())
os.unlink(tmp.name)
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
st.success(f"Processed {len(documents)} pages into {len(chunks)} chunks")
# Chat interface
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
st.chat_message(msg["role"]).write(msg["content"])
if question := st.chat_input("Ask a question about your documents"):
    if not uploaded_files:
        st.warning("Upload at least one document before asking questions.")
        st.stop()
    st.session_state.messages.append({"role": "user", "content": question})
    st.chat_message("user").write(question)
template = """Answer based on this context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
return "\n\n".join(d.page_content for d in docs)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
with st.chat_message("assistant"):
response = chain.invoke(question)
st.write(response)
st.session_state.messages.append({"role": "assistant", "content": response})
# Run the app
streamlit run app.py
Step 9: Prompt Engineering for RAG
The prompt template significantly impacts answer quality.
Effective RAG Prompts
# Strict factual (no hallucination)
strict_template = """You are a precise assistant that answers questions
using ONLY the provided context. Follow these rules:
1. Only use information from the context below
2. If the context doesn't contain the answer, say "This information is not in the provided documents"
3. Quote relevant passages when possible
4. Cite the source document
Context:
{context}
Question: {question}
Answer:"""
# Analytical (synthesize across documents)
analytical_template = """Analyze the following context to answer the question.
Synthesize information across multiple sources if needed.
Highlight any contradictions or gaps in the available information.
Context:
{context}
Question: {question}
Analysis:"""
# Conversational (natural dialogue)
conversational_template = """You are a helpful assistant with access to
a knowledge base. Use the context below to answer naturally.
If you're unsure, say so.
Context:
{context}
Question: {question}
Response:"""
Step 10: Evaluation and Improvement
Testing Retrieval Quality
# Test if the right chunks are being retrieved
test_questions = [
"What was the revenue in Q3?",
"Who is the CEO?",
"What are the risk factors?",
]
for question in test_questions:
    # similarity_search_with_score returns (doc, score) pairs;
    # for Chroma the score is a distance, so lower means more similar
    results = vectorstore.similarity_search_with_score(question, k=4)
    print(f"\nQ: {question}")
    for i, (doc, score) in enumerate(results):
        print(f"  [{i+1}] score={score:.3f} - {doc.page_content[:100]}...")
Common Issues and Fixes
| Problem | Symptom | Fix |
|---|---|---|
| Chunks too large | Answer includes irrelevant info | Reduce chunk_size to 500 |
| Chunks too small | Missing context | Increase chunk_size to 1500 |
| Wrong chunks retrieved | Irrelevant results | Try better embedding model (mxbai-embed-large) |
| Hallucination | Makes up facts | Use stricter prompt, lower temperature |
| Slow retrieval | Long wait for results | Reduce k, use FAISS instead of ChromaDB |
| Missing answers | "Not in documents" when it is | Increase k to 6-8, check chunking boundaries |
Improving Quality
- Better chunking: Use semantic chunking or document-structure-aware splitting
- Better embeddings: Upgrade from nomic-embed-text to mxbai-embed-large
- Hybrid search: Combine vector similarity with keyword (BM25) search
- Reranking: After retrieval, rerank results with a cross-encoder model
- Larger LLM: Use a 14B or 32B model for better comprehension
- Query expansion: Rephrase the query to improve retrieval
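Hybrid search is typically implemented by running vector and keyword retrieval separately, then merging the two ranked lists. One common merge is Reciprocal Rank Fusion (RRF), which needs only ranks, not raw scores; a pure-Python sketch:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in; highest total wins.
    `rankings` is a list of ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # ranked by embedding similarity
keyword_hits = ["d1", "d9", "d3"]  # ranked by BM25
print(rrf_merge([vector_hits, keyword_hits]))  # → ['d1', 'd3', 'd9', 'd7']
```

Documents ranked well by both retrievers float to the top. In LangChain, a similar effect is available out of the box by combining a BM25 retriever with the vector retriever via an ensemble retriever.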
No-Code RAG with Open WebUI
If you don’t want to write code, Open WebUI has built-in RAG:
# Start Open WebUI with Ollama
docker compose up -d # Using the compose file from earlier guides
# In the Open WebUI interface:
# 1. Click the + button in a chat
# 2. Upload documents (PDF, TXT, etc.)
# 3. Ask questions about the uploaded content
# Open WebUI handles chunking, embedding, and retrieval automatically
Next Steps
- Deploy for your team: Enterprise Local AI
- Add voice input: Local Voice Assistant
- Fine-tune for better results: Fine-Tuning guide
- Choose better models: Model selection guide