Running your own AI chatbot locally is no longer a moonshot project reserved for deep pockets or cloud credits. With Ollama handling model inference and LangChain orchestrating retrieval-augmented generation (RAG), you can assemble a fully offline, privacy-preserving Q&A system that answers questions based on your own documents. This guide walks you through every step—installation, document ingestion, retrieval, and conversation chain—so you can have a working chatbot on your desktop. Expect no fluff, no fabrications, just concrete commands and architecture decisions that work today.
Cloud-based chatbots send your data to third-party servers and incur per-token costs that scale with usage. A local setup eliminates both concerns. RAG (Retrieval-Augmented Generation) means the model doesn’t rely solely on its training data; it retrieves relevant chunks from your local documents before generating an answer. This gives you factual grounding on proprietary or niche information while keeping everything private. The trade-off is that you need a machine with a decent GPU or at least 8 GB of RAM to run models comfortably. CPU-only setups work but run slower—expect 5–10 seconds per response instead of 1–2 seconds on GPU. For most internal knowledge bases or personal document archives, that delay is acceptable.
Another nuance: local models are smaller (2B to 13B parameters) compared to GPT-4. They handle factual retrieval well but may struggle with complex reasoning or creative writing. If your use case is straightforward Q&A on PDFs, Markdown files, or code comments, local RAG is a perfect fit. For heavy analytical tasks, consider hybrid approaches where you route only simple queries to the local model and escalate complex ones to a remote API.
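A minimal sketch of such a router, assuming a naive length-and-keyword heuristic and placeholder local_llm / remote_llm clients (both hypothetical names, not part of the setup below):

# Illustrative query router: short factual questions go to the local model,
# longer analytical ones escalate to a remote API client.
def route_query(query, local_llm, remote_llm, max_local_words=30):
    analytical = ("compare", "analyze", "trade-off", "why")
    is_complex = len(query.split()) > max_local_words or any(
        word in query.lower() for word in analytical
    )
    chosen = remote_llm if is_complex else local_llm
    return chosen.invoke(query)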
You need Python 3.10 or later, pip, and familiarity with the terminal. The core tools are Ollama for local model inference, LangChain (with its community integrations) for RAG orchestration, ChromaDB as the vector store, and sentence-transformers for embeddings.
All components are open-source and free. The total download size is under 4 GB (including model weights). No credit card required.
Visit ollama.com and download the installer for your OS (Windows, macOS, Linux). After installation, open a terminal and run:
ollama pull llama3.2
This downloads the 3B-parameter Llama 3.2 model (optimized for chat and instruction following). If you have limited resources, use tinyllama (1.1B) which runs on 4 GB RAM. The 3B model strikes a good balance between speed and response quality for document QA. Pulling the model may take 2–3 minutes depending on your internet speed.
Verify the model works with ollama run llama3.2. Type a test prompt like “What is RAG?” and check the output. Press Ctrl+D to exit. Ollama runs as a background service on port 11434. LangChain will connect to it via the Ollama LLM wrapper.
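If you want to confirm the service is up before wiring in LangChain, a quick request to the local API works; this sketch assumes the default port and the requests package:

import requests

# List the models Ollama has available locally; llama3.2 should appear here.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
print([m["name"] for m in resp.json()["models"]])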
Llama 3.2 is great for English with code and technical text. For multilingual documents, Mistral 7B or Gemma 2B handle non-English content more reliably. If your documents contain heavy math or reasoning, try CodeLlama 7B. Test with a sample document before committing; the swap takes one command.
Isolate your dependencies to avoid version conflicts. Run these commands:
mkdir local-rag && cd local-rag
python -m venv venv
source venv/bin/activate   (on macOS/Linux)
venv\Scripts\activate   (on Windows)
Install the required packages:
pip install langchain langchain-community chromadb sentence-transformers
This installs LangChain core with community integrations, ChromaDB’s Python client, and the embedding model. The total install size is around 500 MB. If you’re on a slow connection, you can skip sentence-transformers and use Ollama’s built-in nomic-embed-text model instead—but that requires an additional model pull (ollama pull nomic-embed-text) and increases latency slightly.
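If you take the nomic-embed-text route, the only change later is the embeddings class; a minimal sketch, assuming you have already pulled the model:

# Embeddings served by Ollama instead of sentence-transformers
# (requires: ollama pull nomic-embed-text).
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")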
Place your PDFs, text files, or Markdown docs in a folder named docs. For this tutorial, assume you have a file called company_policy.pdf. The code below loads it and splits it into chunks of 1000 characters with 200-character overlap:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("./docs/company_policy.pdf")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")
A chunk size of 1000 characters works well for most prose. For code-heavy documents, reduce chunk size to 500. Overlap prevents context loss at chunk boundaries. If your documents are short (under 2000 chars), skip splitting entirely—just use the raw document as one chunk.
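For the code-heavy case, LangChain's splitter can also use language-aware separators instead of the plain defaults; a sketch for Python sources, using the smaller 500-character chunks suggested above:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Split along Python syntax boundaries (defs, classes) rather than blank lines.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50,
)
code_chunks = code_splitter.split_documents(documents)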
Use DirectoryLoader to load all files in a folder automatically:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
Extend to .txt and .md by adding another DirectoryLoader with an appropriate loader class (e.g., TextLoader), as sketched below. A common mistake is a glob pattern that doesn't match your files; the loader then silently returns an empty list rather than raising an error.
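One way to combine file types is to run one DirectoryLoader per extension and concatenate the results; a sketch, assuming .txt and .md files can both go through TextLoader:

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# One loader per file type, merged into a single document list.
pdf_loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
txt_loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
md_loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = pdf_loader.load() + txt_loader.load() + md_loader.load()
print(f"Loaded {len(documents)} documents")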
Embed each chunk and store it in ChromaDB:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
vectorstore.persist()
The first run downloads the embedding model (about 80 MB); subsequent runs load it from disk. The persist_directory saves the vector database so you don't re-embed every time you restart. If your documents change, re-run the script or incrementally add the new material with vectorstore.add_documents(new_chunks). Note that persist() only writes to disk when it is explicitly called; if your script crashes before then, you may lose recent additions. For production, persist after each batch of updates or switch to a persistent ChromaDB backend.
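On a later run, reopening the persisted store and adding only the new chunks might look like this (a sketch; new_chunks stands in for whatever you just split):

from langchain_community.vectorstores import Chroma

# Reload the persisted store instead of re-embedding the whole corpus,
# then append only the freshly split chunks.
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
vectorstore.add_documents(new_chunks)
vectorstore.persist()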
Create a retriever that fetches the top 3 most relevant chunks for a query:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
Then connect to Ollama and build the RAG chain:
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
llm = Ollama(model="llama3.2", temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever
)
The stuff chain type passes all retrieved chunks into the prompt at once. For most use cases with k=3, this works fine. If your chunks are large or you have many documents, switch to map_reduce or refine to avoid exceeding the model's context window (Ollama defaults to a small window, typically 2048 to 4096 tokens, unless you raise num_ctx). A temperature of 0.2 keeps answers close to deterministic; increase it to 0.7 for more creative responses, but beware of hallucinated facts.
By default, the chain's prompt may not instruct the model firmly enough to prioritize the retrieved context over what it remembers from training. Add a custom prompt override:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the user's question. If the answer isn't in the context, say "I don't have enough information to answer that." Do not make up information.
Context: {context}
Question: {question}
Answer: """
prompt = PromptTemplate.from_template(prompt_template)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
The return_source_documents=True flag lets you inspect which chunks were used—a critical debugging tool. Without the custom prompt, models occasionally ignore the context and answer based on their own training, which defeats RAG’s purpose.
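To see exactly what the model was given, you can print each source document's metadata; a small sketch with a hypothetical query:

# Inspect which chunks grounded the answer; useful when responses look off.
result = qa_chain.invoke({"query": "What is the vacation policy?"})
for doc in result["source_documents"]:
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "n/a")
    print(f"{source} (page {page}): {doc.page_content[:120]}...")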
Create a simple interactive loop:
while True:
    query = input("\nYour question: ")
    if query.lower() in ["exit", "quit"]:
        break
    result = qa_chain.invoke({"query": query})
    print(f"Answer: {result['result']}")
    print(f"\nSource chunks used: {len(result['source_documents'])}")
Try questions like “What is the vacation policy?” or “How do I file a complaint?”. The first response may take 3–5 seconds as the model loads into memory. Subsequent responses should be faster. If answers are poor, check the number of retrieved chunks—too few may miss relevant info, too many may confuse the model. Adjust k in the retriever accordingly.
Several factors affect accuracy and latency:

- Generation length: long answers add latency, so set max_tokens in the LLM call to cap generation (a sketch follows below).
- Embedding model: general-purpose embeddings can miss domain vocabulary; for specialized content, consider a domain-tuned model such as the biomedical pritamdeka/BioBERT-mnli-snli available on Hugging Face.
- Disk usage: the persist_directory can grow large, so monitor disk usage. For 10,000 chunks of 1000 chars, expect about 100 MB of vector data.

A real-world edge case: a user asked "What is the deadline for annual reports?" but their document used "filing date" instead. The retriever failed because the embeddings didn't recognize the semantic similarity. Solution: enrich your chunk text with synonyms or use a larger embedding model like intfloat/e5-mistral-7b-instruct (requires more RAM).
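To cap generation as flagged in the list above, the LangChain Ollama wrapper exposes num_predict (Ollama's equivalent of max_tokens) along with num_ctx; a sketch, after which you would rebuild qa_chain so it picks up the new settings:

from langchain_community.llms import Ollama

# Limit generated tokens to keep latency predictable and set the context
# size that the prompt plus retrieved chunks must fit into.
llm = Ollama(
    model="llama3.2",
    temperature=0.2,
    num_predict=256,
    num_ctx=4096,
)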
Another common mistake: forgetting to set search_type to “mmr” for maximum marginal relevance—this re-ranks chunks for diversity and avoids retrieving three almost identical paragraphs. Change retriever instantiation to:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5}
)
This is especially useful when your document set contains many similar passages (e.g., FAQs).
To run the chatbot in a web interface, wrap the qa_chain in a simple Flask or Gradio app. The latter takes only 5 extra lines of code and provides a chat UI. However, for a 10-minute build, the terminal version is sufficient and easier to debug.
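A minimal Gradio wrapper around the existing qa_chain might look like this (pip install gradio); the details are an illustration rather than part of the ten-minute build:

import gradio as gr

# Wrap the RAG chain in a browser chat UI; history is ignored because
# RetrievalQA is stateless across turns.
def respond(message, history):
    result = qa_chain.invoke({"query": message})
    return result["result"]

gr.ChatInterface(respond).launch()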
Your local RAG chatbot is now ready. The entire setup—from installing Ollama to answering your first question—should take under ten minutes if your documents are already on disk. The system is fully private, costs nothing per query, and scales to thousands of documents. The next step is to experiment with different models, embedding strategies, and chunking rules to match your specific content. Keep the prompt sharp, the chunks short, and always validate answers against the source.