Retrieval-Augmented Generation (RAG): Grounding LLMs in Reality
November 15, 2024
10 min read
Tags: RAG, LLM, Vector Database, LangChain
What You Need to Learn
- Vector Embeddings: Converting text to numbers
- Vector Databases: Storing and searching embeddings (Pinecone, Weaviate, Chroma)
- Retrieval Strategies: Semantic search, hybrid search
- LLM Integration: Combining retrieved context with prompts
- Evaluation: Measuring relevance and accuracy
ELI5: What is RAG?
Imagine an exam:
Without RAG (Just LLM):
- Student memorized everything
- Sometimes makes up answers (hallucination)
- Can't access new information
With RAG:
- Student has access to textbooks during exam
- Looks up relevant info first
- Then writes answer based on facts
- Answers are grounded in the retrieved sources, so it's far less likely to make things up
Real Example:
- Question: "What's our Q4 2024 carbon emissions?"
- RAG: Searches your company database → Finds report → LLM generates answer
- Result: Accurate, grounded in real data!
System Design: RAG Architecture
┌─────────────────────────────────────────┐
│ 1. Document Ingestion                   │
│    PDFs, Docs → Text Chunks             │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│ 2. Embedding Generation                 │
│    Text → Embedding Model →             │
│    → Vector Embeddings                  │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│ 3. Vector Storage                       │
│    Store in Vector DB                   │
│    (Pinecone, Chroma, Databricks)       │
└─────────────────────────────────────────┘
Query Time:
┌─────────────────────────────────────────┐
│ User Query                              │
│     ↓                                   │
│ Embed Query → Vector Search →           │
│   → Retrieve Top-K Docs                 │
│     ↓                                   │
│ Combine Query + Retrieved Docs →        │
│   → LLM Prompt                          │
│     ↓                                   │
│ LLM Generates Answer                    │
│     ↓                                   │
│ Return Response to User                 │
└─────────────────────────────────────────┘
How Vector Search Works
Traditional Search (Keyword)
Query: "machine learning"
Results: Documents with exact words "machine" AND "learning"
Problem: Misses "artificial intelligence", "neural networks"
Vector Search (Semantic)
Query: "machine learning"
→ Embedding: [0.2, -0.5, 0.8, ...]
Document 1: "AI and neural networks"
→ Embedding: [0.3, -0.4, 0.7, ...] ← SIMILAR!
Document 2: "Cooking recipes"
→ Embedding: [-0.8, 0.2, -0.3, ...] ← NOT SIMILAR
Returns Document 1 (semantically related)
Distance Metrics:
- Cosine Similarity (most common; see the code sketch below)
- Euclidean Distance
- Dot Product
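To make cosine similarity concrete, here is a minimal sketch using the toy vectors from the example above (real embeddings have hundreds or thousands of dimensions; these numbers are just for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); close to 1.0 means similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([0.2, -0.5, 0.8])          # "machine learning"
doc_ai = np.array([0.3, -0.4, 0.7])         # "AI and neural networks"
doc_cooking = np.array([-0.8, 0.2, -0.3])   # "Cooking recipes"

print(cosine_similarity(query, doc_ai))       # high score -> semantically similar
print(cosine_similarity(query, doc_cooking))  # low score  -> not similar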
Building a RAG System (Step-by-Step)
Step 1: Document Processing
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into chunks
# (`documents` is a list of Document objects from a loader such as PyPDFLoader)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # characters per chunk
    chunk_overlap=200  # overlap so context isn't lost at chunk boundaries
)
chunks = text_splitter.split_documents(documents)
Why chunk? LLM context windows are limited (roughly 4K-128K tokens), and smaller chunks make retrieval more precise.
Step 2: Generate Embeddings
from langchain.embeddings import OpenAIEmbeddings

# Requires OPENAI_API_KEY in the environment; returns one vector per chunk
embeddings = OpenAIEmbeddings()
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
Step 3: Store in Vector DB
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
Step 4: Query & Retrieve
# Retrieve top 3 most relevant chunks
docs = vectorstore.similarity_search(
    "What are the benefits of RAG?",
    k=3
)
Step 5: Generate Answer
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),   # temperature=0 for deterministic, factual answers
    retriever=vectorstore.as_retriever()
)
response = qa_chain.run("What are the benefits of RAG?")
Advanced RAG Techniques
1. Hybrid Search
Combine keyword + semantic search
Query: "Q4 revenue 2024"
→ Keyword: Find exact "Q4 2024"
→ Semantic: Find similar financial terms
→ Merge results
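A common way to merge the keyword and semantic result lists is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming you already have the two ranked lists of document IDs (the helper name, the example IDs, and the constant k=60 are illustrative):

def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, k=60):
    # Each input is a list of doc IDs ordered best-first.
    # RRF score: sum over lists of 1 / (k + rank); k dampens the top ranks.
    scores = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "q4_report" ranks well in both lists, so it rises to the top of the merge
merged = reciprocal_rank_fusion(
    keyword_ranked=["q4_report", "press_release", "memo"],
    semantic_ranked=["finance_summary", "q4_report", "memo"],
)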
2. Re-ranking
Retrieve 100 candidates → Re-rank them → Send the top 10 to the LLM
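A sketch of that re-ranking step using a cross-encoder from the sentence-transformers library, reusing the `vectorstore` from Step 3 (the model name and the 100/10 cut-offs are just examples):

from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly -- slower than embedding
# search, so apply them only to the small set of retrieved candidates.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What are the benefits of RAG?"
candidates = vectorstore.similarity_search(query, k=100)
scores = reranker.predict([(query, doc.page_content) for doc in candidates])

# Keep the 10 highest-scoring documents for the LLM prompt
top_docs = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)[:10]]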
3. Query Expansion
Original: "ML algorithms"
Expanded: "machine learning algorithms", "AI models", "neural networks"
→ Better retrieval
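A minimal sketch of query expansion: run retrieval once per query variant and de-duplicate the results before prompting. The variants are hard-coded here (in practice an LLM or synonym list would generate them), and it reuses the `vectorstore` from Step 3:

query_variants = [
    "ML algorithms",
    "machine learning algorithms",
    "AI models",
    "neural networks",
]

seen, expanded_results = set(), []
for variant in query_variants:
    for doc in vectorstore.similarity_search(variant, k=3):
        # De-duplicate on content so the same chunk isn't sent to the LLM twice
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            expanded_results.append(doc)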
4. Contextual Compression
Retrieved doc too long? → Compress relevant parts only
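LangChain ships a retriever wrapper for this pattern; a sketch assuming the same older `langchain` imports used in the steps above (the compressor asks an LLM to extract only the passages relevant to the query):

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap the base retriever: retrieved chunks get trimmed to their relevant parts
compressor = LLMChainExtractor.from_llm(OpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What are the benefits of RAG?"
)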
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Data Updates | Real-time | Requires retraining |
| Cost | Lower (no training) | Higher (GPU hours) |
| Accuracy | Grounded in retrieved sources | Can still hallucinate |
| Setup Time | Minutes | Hours/Days |
| Use Case | Q&A, Knowledge base | Task-specific behavior |
Best of Both: Fine-tune for style, RAG for facts!
Real-World Example
At Nike, a sustainability assistant built with RAG:
Architecture:
- Data: Carbon emissions metrics in Databricks
- Embeddings: all-MiniLM-L6-v2 model
- Vector Store: Databricks Vector Search
- LLM: GPT-4 for generation
- Validation: Cross-check with ESG frameworks
Query: "What's our Scope 3 emissions from logistics?"
RAG Process:
- Retrieve: Logistics emissions data from vector DB
- Context: Include GHG Protocol definitions
- Generate: LLM creates compliant response
- Validate: Check against regulatory thresholds
Result: Accurate, compliant responses for sustainability reporting!
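As a small illustration of the retrieval step, a sketch of embedding the query with the all-MiniLM-L6-v2 model via the sentence-transformers library (the `search_logistics_index` call is hypothetical; the real system queries Databricks Vector Search):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings

# Embed the user question, then look it up in the vector index
query_vector = model.encode("What's our Scope 3 emissions from logistics?")
# results = search_logistics_index(query_vector, top_k=5)  # hypothetical index lookup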
Common Challenges & Solutions
Challenge: Chunking Strategy
Problem: How big should chunks be?
Solution:
- Test different sizes (500, 1000, 2000 chars)
- Use overlap (200 chars) for context
- Metadata (source, date) for filtering
Challenge: Retrieval Accuracy
Problem: Wrong documents retrieved
Solution:
- Better embeddings model
- Hybrid search
- Query rewriting
Challenge: Cost
Problem: Embedding costs for large docs
Solution:
- Cache embeddings
- Batch processing
- Use cheaper embedding models (e.g., text-embedding-3-small instead of ada-002)
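A minimal sketch of caching so each unique chunk is only embedded (and billed) once. It keeps an in-memory dict keyed by a content hash and reuses the `embeddings` object from Step 2; a production system would persist the cache:

import hashlib

embedding_cache = {}

def embed_with_cache(text: str):
    # Hash the text so identical chunks reuse the stored vector
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embeddings.embed_query(text)  # paid API call
    return embedding_cache[key]

vectors = [embed_with_cache(chunk.page_content) for chunk in chunks]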
Best Practices
- Chunk Size: Balance between context and specificity
- Metadata Filtering: Add date, source, category
- Evaluation: Measure retrieval quality, not just the final LLM answer
- Version Control: Track embedding model versions
- Monitoring: Log queries, retrieval quality, user feedback
- Fallback: Handle "no relevant docs found" gracefully
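A sketch of that fallback: check retrieval scores before calling the LLM, reusing the `vectorstore` and `qa_chain` from the steps above. Chroma's `similarity_search_with_score` returns a distance where lower means closer; score semantics differ across vector stores, and the threshold here is an arbitrary example to tune on your own data:

MAX_DISTANCE = 0.5  # example threshold -- tune against your own queries

def answer(question: str) -> str:
    results = vectorstore.similarity_search_with_score(question, k=3)
    relevant = [doc for doc, distance in results if distance <= MAX_DISTANCE]
    if not relevant:
        # Fallback: admit the gap instead of letting the LLM guess
        return "I couldn't find relevant documents for that question."
    return qa_chain.run(question)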
Evaluation Metrics
- Retrieval Precision: % of retrieved docs that are relevant
- Retrieval Recall: % of relevant docs that were retrieved
- Answer Correctness: Compare to ground truth
- Hallucination Rate: % of made-up facts
- Latency: Time to retrieve + generate
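A minimal sketch of computing retrieval precision and recall for a single query, given ground-truth relevant document IDs (the IDs below are illustrative):

def retrieval_precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0  # % of retrieved docs that are relevant
    recall = len(hits) / len(relevant) if relevant else 0.0       # % of relevant docs that were retrieved
    return precision, recall

# Example: 2 of 3 retrieved docs are relevant, and 2 of 4 relevant docs were found
precision, recall = retrieval_precision_recall(
    retrieved_ids=["doc1", "doc2", "doc7"],
    relevant_ids=["doc1", "doc2", "doc3", "doc4"],
)
print(precision, recall)  # ~0.67, 0.5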