
Retrieval-Augmented Generation (RAG): Grounding LLMs in Reality

November 15, 2024
10 min read
RAG · LLM · Vector Database · LangChain

What You Need to Learn

  1. Vector Embeddings: Converting text to numbers
  2. Vector Databases: Storing and searching embeddings (Pinecone, Weaviate, Chroma)
  3. Retrieval Strategies: Semantic search, hybrid search
  4. LLM Integration: Combining retrieved context with prompts
  5. Evaluation: Measuring relevance and accuracy

ELI5: What is RAG?

Imagine an exam:

Without RAG (Just LLM):

  • Student memorized everything
  • Sometimes makes up answers (hallucination)
  • Can't access new information

With RAG:

  • Student has access to textbooks during exam
  • Looks up relevant info first
  • Then writes answer based on facts
  • Much less likely to make things up (answers are grounded in sources)

Real Example:

  • Question: "What are our Q4 2024 carbon emissions?"
  • RAG: Searches your company database → Finds report → LLM generates answer
  • Result: Accurate, grounded in real data!

System Design: RAG Architecture

┌─────────────────────────────────────────┐
│         1. Document Ingestion           │
│   PDFs, Docs → Text Chunks             │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         2. Embedding Generation         │
│   Text → Embedding Model →              │
│      → Vector Embeddings                │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         3. Vector Storage               │
│   Store in Vector DB                    │
│   (Pinecone, Chroma, Databricks)        │
└─────────────────────────────────────────┘

Query Time:
┌─────────────────────────────────────────┐
│   User Query                            │
│      ↓                                  │
│   Embed Query → Vector Search →         │
│      → Retrieve Top-K Docs              │
│      ↓                                  │
│   Combine Query + Retrieved Docs →      │
│      → LLM Prompt                       │
│      ↓                                  │
│   LLM Generates Answer                  │
│      ↓                                  │
│   Return Response to User               │
└─────────────────────────────────────────┘
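
In the query-time flow, "Combine Query + Retrieved Docs → LLM Prompt" usually comes down to plain string formatting. A minimal sketch, assuming the retrieved chunks expose a page_content attribute as in LangChain; the prompt wording is illustrative:

def build_prompt(query, retrieved_docs):
    # Concatenate the retrieved chunks into one context block
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    # Tell the LLM to answer only from the supplied context
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )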

How Vector Search Works

Traditional Search (Keyword)

Query: "machine learning"
Results: Documents with exact words "machine" AND "learning"
Problem: Misses documents that say "artificial intelligence" or "neural networks" instead

Vector Search (Semantic)

Query: "machine learning"
→ Embedding: [0.2, -0.5, 0.8, ...]

Document 1: "AI and neural networks"
→ Embedding: [0.3, -0.4, 0.7, ...]  ← SIMILAR!

Document 2: "Cooking recipes"
→ Embedding: [-0.8, 0.2, -0.3, ...]  ← NOT SIMILAR

Returns Document 1 (semantically related)

Distance Metrics:

  • Cosine Similarity (most common)
  • Euclidean Distance
  • Dot Product
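
To make "similar vs. not similar" concrete, here is cosine similarity computed by hand on the toy vectors above (3-dimensional for illustration; real embeddings have hundreds or thousands of dimensions):

import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|): 1.0 means same direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.2, -0.5, 0.8]
doc1 = [0.3, -0.4, 0.7]    # "AI and neural networks"
doc2 = [-0.8, 0.2, -0.3]   # "Cooking recipes"

print(cosine_similarity(query, doc1))  # ~0.99 -> very similar
print(cosine_similarity(query, doc2))  # negative -> not similar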

Building a RAG System (Step-by-Step)

Step 1: Document Processing

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # characters per chunk
    chunk_overlap=200  # overlap for context
)

chunks = text_splitter.split_documents(documents)

Why chunk? LLM context windows are limited (roughly 4K-128K tokens), and smaller, focused chunks make retrieval more precise.

Step 2: Generate Embeddings

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])

Step 3: Store in Vector DB

from langchain.vectorstores import Chroma

# Chroma.from_documents embeds the chunks itself using the embedding model,
# so you don't need to pass the raw vectors from Step 2
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

Step 4: Query & Retrieve

# Retrieve top 3 most relevant chunks
docs = vectorstore.similarity_search(
    "What are the benefits of RAG?", 
    k=3
)

Step 5: Generate Answer

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    retriever=vectorstore.as_retriever()
)

response = qa_chain.run("What are the benefits of RAG?")

Advanced RAG Techniques

1. Hybrid Search

Combine keyword + semantic search

Query: "Q4 revenue 2024"
→ Keyword: Find exact "Q4 2024"
→ Semantic: Find similar financial terms
→ Merge results
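
A common way to merge the two result lists is reciprocal rank fusion (RRF). This sketch is generic and not tied to any particular vector DB; it assumes each search returns an ordered list of document IDs (the IDs below are made up):

def reciprocal_rank_fusion(result_lists, k=60):
    # Each list is ranked best-first; a doc earns 1/(k + rank) for every list it appears in
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_42", "doc_7", "doc_13"]    # e.g. BM25 results
semantic_hits = ["doc_7", "doc_42", "doc_99"]   # e.g. vector search results

merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# doc_42 and doc_7 rank highest because both searches agree on them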

2. Re-ranking

Retrieve 100 docs → Re-rank top 10 → Send to LLM
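
A typical implementation of the re-rank step is a cross-encoder that scores each (query, document) pair. A sketch using the sentence-transformers library; the checkpoint name is one public MS MARCO re-ranker, swap in whatever model you use:

from sentence_transformers import CrossEncoder

# A cross-encoder reads query and document together, so it is slower but more
# accurate than the bi-encoder used for the initial (wide) retrieval
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=10):
    scores = reranker.predict([(query, doc.page_content) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# candidates = vectorstore.similarity_search(query, k=100)   # cast a wide net
# best_docs = rerank(query, candidates)                      # then keep the top 10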

3. Query Expansion

Original: "ML algorithms"
Expanded: "machine learning algorithms", "AI models", "neural networks"
→ Better retrieval
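
A simple version asks the LLM for paraphrases, runs the search once per variant, and deduplicates the hits. A sketch reusing the vectorstore and the OpenAI llm from the steps above; the prompt wording is illustrative:

def expand_query(llm, query, n=3):
    # Ask the LLM for alternative phrasings of the same information need
    prompt = f"Rewrite this search query {n} different ways, one per line: {query}"
    return [query] + [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def expanded_search(llm, vectorstore, query, k=3):
    seen, results = set(), []
    for variant in expand_query(llm, query):
        for doc in vectorstore.similarity_search(variant, k=k):
            if doc.page_content not in seen:        # deduplicate across variants
                seen.add(doc.page_content)
                results.append(doc)
    return results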

4. Contextual Compression

Retrieved doc too long? → Compress relevant parts only
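
LangChain ships a retriever wrapper for this pattern: ContextualCompressionRetriever with LLMChainExtractor asks an LLM to keep only the passages that answer the query. A sketch building on the vectorstore from Step 3 (API names as in the classic langchain package used above; check your installed version):

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The extractor trims each retrieved document down to the parts that are
# relevant to the query, so less irrelevant text reaches the final prompt
compressor = LLMChainExtractor.from_llm(OpenAI(temperature=0))

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

compressed_docs = compression_retriever.get_relevant_documents(
    "What are the benefits of RAG?"
)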


RAG vs Fine-Tuning

Aspect       | RAG                  | Fine-Tuning
Data Updates | Real-time            | Requires retraining
Cost         | Lower (no training)  | Higher (GPU hours)
Accuracy     | Factual (grounded)   | May hallucinate
Setup Time   | Minutes              | Hours/Days
Use Case     | Q&A, Knowledge base  | Task-specific behavior

Best of Both: Fine-tune for style, RAG for facts!


Real-World Example

At Nike, a sustainability assistant was built with RAG:

Architecture:

  1. Data: Carbon emissions metrics in Databricks
  2. Embeddings: All-MiniLM-L6-v2 model
  3. Vector Store: Databricks Vector Search
  4. LLM: GPT-4 for generation
  5. Validation: Cross-check with ESG frameworks

Query: "What's our Scope 3 emissions from logistics?"

RAG Process:

  1. Retrieve: Logistics emissions data from vector DB
  2. Context: Include GHG Protocol definitions
  3. Generate: LLM creates compliant response
  4. Validate: Check against regulatory thresholds

Result: Accurate, compliant responses for sustainability reporting!


Common Challenges & Solutions

Challenge: Chunking Strategy

Problem: How big should chunks be?

Solution:

  • Test different sizes (500, 1000, 2000 chars); see the sketch below
  • Use overlap (200 chars) to preserve context across chunk boundaries
  • Add metadata (source, date) to enable filtering
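
A rough way to run that test: rebuild the index per chunk size and check whether known-relevant text comes back for a handful of hand-written test queries. The test cases below are placeholders you would replace with your own; documents and embeddings come from the earlier steps:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# (query, phrase that should appear in a correct retrieval)
test_cases = [("What are the benefits of RAG?", "grounded")]

for size in (500, 1000, 2000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=200)
    chunks = splitter.split_documents(documents)
    store = Chroma.from_documents(chunks, embeddings)

    hits = 0
    for query, expected in test_cases:
        retrieved = store.similarity_search(query, k=3)
        hits += any(expected in doc.page_content for doc in retrieved)
    print(f"chunk_size={size}: {hits}/{len(test_cases)} test queries hit")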

Challenge: Retrieval Accuracy

Problem: The wrong documents get retrieved

Solution:

  • A better embedding model
  • Hybrid search
  • Query rewriting

Challenge: Cost

Problem: Embedding costs for large document collections

Solution:

  • Cache embeddings (sketch below)
  • Batch processing
  • Use a cheaper embedding model (e.g., text-embedding-ada-002)
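
A minimal cache keyed by a hash of the chunk text avoids paying twice for unchanged chunks when documents are re-ingested. A sketch; a production cache would live in a persistent key-value store rather than a dict:

import hashlib

embedding_cache = {}  # in practice: Redis, SQLite, or similar

def embed_with_cache(embeddings, texts):
    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in embedding_cache:             # only pay for new or changed text
            embedding_cache[key] = embeddings.embed_query(text)
        vectors.append(embedding_cache[key])
    return vectors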

Best Practices

  1. Chunk Size: Balance between context and specificity
  2. Metadata Filtering: Add date, source, category
  3. Evaluation: Measure retrieval accuracy (not just LLM)
  4. Version Control: Track embedding model versions
  5. Monitoring: Log queries, retrieval quality, user feedback
  6. Fallback: Handle "no relevant docs found" gracefully
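
For practice 6, a simple guard checks retrieval scores before calling the LLM at all. A sketch using Chroma's similarity_search_with_score, which returns (document, distance) pairs where lower means more similar; the 0.8 cutoff is an illustrative value to tune on your own data:

def answer_with_fallback(qa_chain, vectorstore, query, max_distance=0.8):
    results = vectorstore.similarity_search_with_score(query, k=3)
    relevant = [doc for doc, distance in results if distance <= max_distance]

    if not relevant:
        # Be explicit instead of letting the LLM guess from weak context
        return "I couldn't find anything relevant in the knowledge base for that question."
    return qa_chain.run(query)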

Evaluation Metrics

  • Retrieval Precision: % of retrieved docs that are relevant
  • Retrieval Recall: % of relevant docs that were retrieved
  • Answer Correctness: Compare to ground truth
  • Hallucination Rate: % of made-up facts
  • Latency: Time to retrieve + generate
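
Retrieval precision and recall can be computed directly from a small hand-labeled set of queries with known relevant documents. A sketch working on document IDs; the example IDs are made up:

def retrieval_precision_recall(retrieved_ids, relevant_ids):
    # retrieved_ids: IDs returned by the vector store for one query
    # relevant_ids: hand-labeled ground-truth IDs for that query
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 3 docs retrieved, 2 of them relevant, 1 relevant doc missed
precision, recall = retrieval_precision_recall(["d1", "d2", "d3"], ["d1", "d3", "d7"])
print(precision, recall)  # 0.67 precision, 0.67 recall (rounded)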