Running your own AI chatbot locally is no longer a moonshot project reserved for deep pockets or cloud credits. With Ollama handling model inference and LangChain orchestrating retrieval-augmented generation (RAG), you can assemble a fully offline, privacy-preserving Q&A system that answers questions based on your own documents. This guide walks you through every step—installation, document ingestion, retrieval, and conversation chain—so you can have a working chatbot on your desktop. Expect no fluff, no fabrications, just concrete commands and architecture decisions that work today.
Cloud-based chatbots send your data to third-party servers and incur per-token costs that scale with usage. A local setup eliminates both concerns. RAG (Retrieval-Augmented Generation) means the model doesn’t rely solely on its training data; it retrieves relevant chunks from your local documents before generating an answer. This gives you factual grounding on proprietary or niche information while keeping everything private. The trade-off is that you need a machine with a decent GPU or at least 8 GB of RAM to run models comfortably. CPU-only setups work but run slower—expect 5–10 seconds per response instead of 1–2 seconds on GPU. For most internal knowledge bases or personal document archives, that delay is acceptable.
Another nuance: local models are smaller (2B to 13B parameters) compared to GPT-4. They handle factual retrieval well but may struggle with complex reasoning or creative writing. If your use case is straightforward Q&A on PDFs, Markdown files, or code comments, local RAG is a perfect fit. For heavy analytical tasks, consider hybrid approaches where you route only simple queries to the local model and escalate complex ones to a remote API.
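A minimal sketch of such a router, assuming a naive length-and-keyword heuristic and placeholder local_llm / remote_llm clients (both hypothetical names, not part of the setup below):

# Illustrative query router: short factual questions go to the local model,
# longer analytical ones escalate to a remote API client.
def route_query(query, local_llm, remote_llm, max_local_words=30):
    analytical = ("compare", "analyze", "trade-off", "why")
    is_complex = len(query.split()) > max_local_words or any(
        word in query.lower() for word in analytical
    )
    chosen = remote_llm if is_complex else local_llm
    return chosen.invoke(query)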
You need Python 3.10 or later, pip, and familiarity with the terminal. The core tools are Ollama for local model inference, LangChain (with its community integrations) for RAG orchestration, ChromaDB as the vector store, and sentence-transformers for embeddings.
All components are open-source and free. The total download size is under 4 GB (including model weights). No credit card required.
Visit ollama.com and download the installer for your OS (Windows, macOS, Linux). After installation, open a terminal and run:
ollama pull llama3.2
This downloads the 3B-parameter Llama 3.2 model (optimized for chat and instruction following). If you have limited resources, use tinyllama (1.1B) which runs on 4 GB RAM. The 3B model strikes a good balance between speed and response quality for document QA. Pulling the model may take 2–3 minutes depending on your internet speed.
Verify the model works with ollama run llama3.2. Type a test prompt like “What is RAG?” and check the output. Press Ctrl+D to exit. Ollama runs as a background service on port 11434. LangChain will connect to it via the Ollama LLM wrapper.
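If you want to confirm the service is up before wiring in LangChain, a quick request to the local API works; this sketch assumes the default port and the requests package:

import requests

# List the models Ollama has available locally; llama3.2 should appear here.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
print([m["name"] for m in resp.json()["models"]])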
Llama 3.2 is great for English with code and technical text. For multilingual documents, Mistral 7B or Gemma 2B handle non-English content more reliably. If your documents contain heavy math or reasoning, try CodeLlama 7B. Test with a sample document before committing; the swap takes one command.
Isolate your dependencies to avoid version conflicts. Run these commands:
mkdir local-rag && cd local-rag
python -m venv venv
source venv/bin/activate   (on macOS/Linux)
venv\Scripts\activate   (on Windows)
Install the required packages:
pip install langchain langchain-community chromadb sentence-transformers
This installs LangChain core with community integrations, ChromaDB’s Python client, and the embedding model. The total install size is around 500 MB. If you’re on a slow connection, you can skip sentence-transformers and use Ollama’s built-in nomic-embed-text model instead—but that requires an additional model pull (ollama pull nomic-embed-text) and increases latency slightly.
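If you take the nomic-embed-text route, the only change later is the embeddings class; a minimal sketch, assuming you have already pulled the model:

# Embeddings served by Ollama instead of sentence-transformers
# (requires: ollama pull nomic-embed-text).
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")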
Place your PDFs, text files, or Markdown docs in a folder named docs. For this tutorial, assume you have a file called company_policy.pdf. The code below loads it and splits it into chunks of 1000 characters with 200-character overlap:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("./docs/company_policy.pdf")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")
A chunk size of 1000 characters works well for most prose. For code-heavy documents, reduce chunk size to 500. Overlap prevents context loss at chunk boundaries. If your documents are short (under 2000 chars), skip splitting entirely—just use the raw document as one chunk.
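For the code-heavy case, LangChain's splitter can also use language-aware separators instead of the plain defaults; a sketch for Python sources, using the smaller 500-character chunks suggested above:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Split along Python syntax boundaries (defs, classes) rather than blank lines.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50,
)
code_chunks = code_splitter.split_documents(documents)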
Use DirectoryLoader to load all files in a folder automatically:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
Extend to .txt and .md by adding another DirectoryLoader with an appropriate loader class (e.g., TextLoader), as sketched below. A common mistake is a glob pattern that doesn't match your files; the loader then silently returns an empty list rather than raising an error.
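One way to combine file types is to run one DirectoryLoader per extension and concatenate the results; a sketch, assuming .txt and .md files can both go through TextLoader:

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader

# One loader per file type, merged into a single document list.
pdf_loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
txt_loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
md_loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = pdf_loader.load() + txt_loader.load() + md_loader.load()
print(f"Loaded {len(documents)} documents")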
Embed each chunk and store it in ChromaDB:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
vectorstore.persist()
The first run downloads the embedding model (about 80 MB); subsequent runs load it from disk. The persist_directory saves the vector database so you don't re-embed every time you restart. If your documents change, re-run the script or incrementally add the new material with vectorstore.add_documents(new_chunks). Note that persist() only writes to disk when it is explicitly called; if your script crashes before then, you may lose recent additions. For production, persist after each batch of updates or switch to a persistent ChromaDB backend.
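On a later run, reopening the persisted store and adding only the new chunks might look like this (a sketch; new_chunks stands in for whatever you just split):

from langchain_community.vectorstores import Chroma

# Reload the persisted store instead of re-embedding the whole corpus,
# then append only the freshly split chunks.
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
vectorstore.add_documents(new_chunks)
vectorstore.persist()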
Create a retriever that fetches the top 3 most relevant chunks for a query:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
Then connect to Ollama and build the RAG chain:
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
llm = Ollama(model="llama3.2", temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever
)
The stuff chain type passes all retrieved chunks into the prompt at once. For most use cases with k=3, this works fine. If your chunks are large or you have many documents, switch to map_reduce or refine to avoid exceeding the model's context window (Ollama defaults to a small window, typically 2048 to 4096 tokens, unless you raise num_ctx). A temperature of 0.2 keeps answers close to deterministic; increase it to 0.7 for more creative responses, but beware of hallucinated facts.
By default, the chain's prompt may not instruct the model firmly enough to prioritize the retrieved context over what it remembers from training. Add a custom prompt override:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the user's question. If the answer isn't in the context, say "I don't have enough information to answer that." Do not make up information.
Context: {context}
Question: {question}
Answer: """
prompt = PromptTemplate.from_template(prompt_template)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
The return_source_documents=True flag lets you inspect which chunks were used—a critical debugging tool. Without the custom prompt, models occasionally ignore the context and answer based on their own training, which defeats RAG’s purpose.
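To see exactly what the model was given, you can print each source document's metadata; a small sketch with a hypothetical query:

# Inspect which chunks grounded the answer; useful when responses look off.
result = qa_chain.invoke({"query": "What is the vacation policy?"})
for doc in result["source_documents"]:
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "n/a")
    print(f"{source} (page {page}): {doc.page_content[:120]}...")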
Create a simple interactive loop:
while True:
    query = input("\nYour question: ")
    if query.lower() in ["exit", "quit"]:
        break
    result = qa_chain.invoke({"query": query})
    print(f"Answer: {result['result']}")
    print(f"\nSource chunks used: {len(result['source_documents'])}")
Try questions like “What is the vacation policy?” or “How do I file a complaint?”. The first response may take 3–5 seconds as the model loads into memory. Subsequent responses should be faster. If answers are poor, check the number of retrieved chunks—too few may miss relevant info, too many may confuse the model. Adjust k in the retriever accordingly.
Several factors affect accuracy and latency:

- Generation length: long answers add latency, so set max_tokens in the LLM call to cap generation (a sketch follows below).
- Embedding model: general-purpose embeddings can miss domain vocabulary; for specialized content, consider a domain-tuned model such as the biomedical pritamdeka/BioBERT-mnli-snli available on Hugging Face.
- Disk usage: the persist_directory can grow large, so monitor disk usage. For 10,000 chunks of 1000 chars, expect about 100 MB of vector data.

A real-world edge case: a user asked "What is the deadline for annual reports?" but their document used "filing date" instead. The retriever failed because the embeddings didn't recognize the semantic similarity. Solution: enrich your chunk text with synonyms or use a larger embedding model like intfloat/e5-mistral-7b-instruct (requires more RAM).
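To cap generation as flagged in the list above, the LangChain Ollama wrapper exposes num_predict (Ollama's equivalent of max_tokens) along with num_ctx; a sketch, after which you would rebuild qa_chain so it picks up the new settings:

from langchain_community.llms import Ollama

# Limit generated tokens to keep latency predictable and set the context
# size that the prompt plus retrieved chunks must fit into.
llm = Ollama(
    model="llama3.2",
    temperature=0.2,
    num_predict=256,
    num_ctx=4096,
)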
Another common mistake: forgetting to set search_type to “mmr” for maximum marginal relevance—this re-ranks chunks for diversity and avoids retrieving three almost identical paragraphs. Change retriever instantiation to:
retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5}
)
This is especially useful when your document set contains many similar passages (e.g., FAQs).
To run the chatbot in a web interface, wrap the qa_chain in a simple Flask or Gradio app. The latter takes only 5 extra lines of code and provides a chat UI. However, for a 10-minute build, the terminal version is sufficient and easier to debug.
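A minimal Gradio wrapper around the existing qa_chain might look like this (pip install gradio); the details are an illustration rather than part of the ten-minute build:

import gradio as gr

# Wrap the RAG chain in a browser chat UI; history is ignored because
# RetrievalQA is stateless across turns.
def respond(message, history):
    result = qa_chain.invoke({"query": message})
    return result["result"]

gr.ChatInterface(respond).launch()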
Your local RAG chatbot is now ready. The entire setup—from installing Ollama to answering your first question—should take under ten minutes if your documents are already on disk. The system is fully private, costs nothing per query, and scales to thousands of documents. The next step is to experiment with different models, embedding strategies, and chunking rules to match your specific content. Keep the prompt sharp, the chunks short, and always validate answers against the source.