
Large Language Models (LLMs): GPT, BERT, and Beyond

November 20, 2024
11 min read
LLM · GPT · BERT · NLP · AI


What You Need to Learn

  1. Transformer Architecture: Self-attention, encoder-decoder
  2. Pre-training Methods: Masked language modeling, causal language modeling
  3. Fine-tuning: Task-specific adaptation
  4. Prompt Engineering: Zero-shot, few-shot, chain-of-thought
  5. Deployment: API vs self-hosted, cost optimization

ELI5: What are Large Language Models?

Imagine a student who read the entire internet:

Traditional Program:

  • You: "Translate 'hello' to Spanish"
  • Computer: "Hola" (looks up in dictionary)

Large Language Model (LLM):

  • Computer read billions of webpages
  • Learned patterns in language
  • You: "Translate 'hello' to Spanish"
  • Computer: "Based on patterns I've seen, it's 'Hola'"

Magic: LLMs can do tasks they were never explicitly trained for!

How? By understanding language patterns:

  • Grammar rules (without being told)
  • Context (what comes before/after)
  • Relationships (synonyms, antonyms)
  • Even reasoning (to some extent!)

System Design: LLM Architecture

┌─────────────────────────────────────────┐
│         Pre-training Phase              │
│  Internet Text → Transformer →          │
│     → Base Model (billions of params)   │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         Fine-tuning Phase               │
│  Task-specific data →                   │
│     → Specialized Model                 │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         Inference Phase                 │
│  User Prompt → Model →                  │
│     → Generated Response                │
└─────────────────────────────────────────┘

Model Components:
┌─────────────────────────────────────────┐
│  Input Text                             │
│     ↓                                   │
│  Tokenization (words → numbers)         │
│     ↓                                   │
│  Embedding Layer (numbers → vectors)    │
│     ↓                                   │
│  Transformer Blocks (24-96 layers)      │
│   - Self-Attention                      │
│   - Feed Forward                        │
│     ↓                                   │
│  Output Layer                           │
│     ↓                                   │
│  Generated Text                         │
└─────────────────────────────────────────┘
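
A minimal sketch of the first few steps, tokenization, embedding, and the transformer stack, using the Hugging Face transformers library (the model choice bert-base-uncased and the printed shapes are illustrative, and transformers plus torch are assumed to be installed):

# Sketch: text → token IDs → contextual vectors
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenization: words → numbers (integer token IDs)
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
print(inputs["input_ids"])  # tensor of token IDs, including [CLS]/[SEP] markers

# Embedding layer + transformer blocks: token IDs → one context-aware vector per token
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size), e.g. (1, 8, 768)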

How LLMs Generate Text

Example: Complete "The cat sat on the"

  1. Tokenization: Break the text into pieces
    • ["The", "cat", "sat", "on", "the", "???"]
  2. Embedding: Convert tokens to numbers (vectors)
  3. Attention: Look at the context
    • "mat" makes sense (cats sit on mats)
    • "moon" doesn't make sense (cats don't sit on moons)
  4. Probability Distribution:
    • 60% → "mat"
    • 20% → "floor"
    • 10% → "couch"
    • 10% → other
  5. Selection: Greedy decoding picks "mat" (the highest probability); sampling draws from the distribution, so a lower-probability word is occasionally chosen instead

Result: "The cat sat on the mat"
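
A toy sketch of that last step; the probabilities mirror the example above (with "sofa" standing in for "other") and contrast greedy decoding with sampling:

# Toy next-token selection over the distribution from the example above
import random

next_token_probs = {"mat": 0.60, "floor": 0.20, "couch": 0.10, "sofa": 0.10}

# Greedy decoding: always take the highest-probability token
greedy_pick = max(next_token_probs, key=next_token_probs.get)
print(greedy_pick)  # "mat"

# Sampling: draw proportionally to probability, so lower-probability
# tokens are occasionally chosen (this is what temperature controls)
tokens, weights = zip(*next_token_probs.items())
sampled_pick = random.choices(tokens, weights=weights, k=1)[0]
print(sampled_pick)  # usually "mat", sometimes "floor", "couch", or "sofa"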


GPT vs BERT: Key Differences

GPT (Generative Pre-trained Transformer)

Goal: Generate next word

Training: "The cat sat on the ___" (causal language modeling: no mask, left-to-right only)
GPT learns: Predict what comes next
Use: Text generation, completion, chat

BERT (Bidirectional Encoder Representations)

Goal: Understand context

Training: "The cat [MASK] on the mat"
BERT learns: What word fits here? (looks both ways!)
Use: Classification, Q&A, understanding

Key Difference: GPT = Generator, BERT = Understander
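
The contrast is easy to see with the Hugging Face pipeline API; a minimal sketch, where the model choices gpt2 and bert-base-uncased are illustrative stand-ins for GPT- and BERT-style models:

# GPT-style: generate what comes next (left-to-right)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The cat sat on the", max_new_tokens=3)[0]["generated_text"])

# BERT-style: fill in a masked word using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The cat [MASK] on the mat.")[0]["token_str"])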


Prompt Engineering

The art of asking LLMs the right way

Zero-Shot

Prompt: "Classify sentiment: 'This movie was terrible'"
Response: "Negative"

Few-Shot

Prompt: 
"Classify sentiment:
Example 1: 'I loved it' → Positive
Example 2: 'It was okay' → Neutral
Example 3: 'Worst ever' → Negative

New: 'This movie was terrible'"
Response: "Negative"

Chain-of-Thought

Prompt: "What's a 15% tip on $82.50? Think step by step."
Response:
"Step 1: Calculate 10% = $8.25
Step 2: Calculate 5% = $4.13
Step 3: Add them: $8.25 + $4.13 = $12.38"
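
All three styles are easy to standardize as templates; here is a minimal plain-Python sketch (the template wording and the helper names build_few_shot_prompt / build_cot_prompt are illustrative assumptions):

# Building few-shot and chain-of-thought prompts from simple templates
FEW_SHOT_EXAMPLES = [
    ("I loved it", "Positive"),
    ("It was okay", "Neutral"),
    ("Worst ever", "Negative"),
]

def build_few_shot_prompt(text):
    # Prepend labeled examples so the model can infer the task and output format
    lines = ["Classify sentiment:"]
    lines += [f"Example: '{example}' -> {label}" for example, label in FEW_SHOT_EXAMPLES]
    lines.append(f"New: '{text}' ->")
    return "\n".join(lines)

def build_cot_prompt(question):
    # "Think step by step" nudges the model to spell out intermediate reasoning
    return f"{question} Think step by step."

print(build_few_shot_prompt("This movie was terrible"))
print(build_cot_prompt("What's a 15% tip on $82.50?"))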

Fine-Tuning vs Prompting

When to Fine-Tune

  • Domain-specific language (legal, medical)
  • Consistent output format needed
  • Privacy concerns (can't send data to API)
  • Cost optimization (many queries)
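
If fine-tuning is the right call, the workflow looks roughly like this; a minimal sketch with Hugging Face transformers and datasets, where the toy dataset, model choice, and hyperparameters are illustrative assumptions rather than a production recipe:

# Minimal fine-tuning sketch: adapt a pre-trained model to sentiment classification
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny in-memory dataset just to show the wiring; use real labeled data in practice
train_data = Dataset.from_dict({
    "text": ["I loved it", "Worst ever", "Best film of the year", "Terrible movie"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch the examples
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_data.map(tokenize, batched=True),
)
trainer.train()  # adapts the pre-trained weights to the labeled task data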

When to Prompt

  • Quick prototyping
  • Varied tasks
  • No labeled data for training
  • Using latest model capabilities

LLM in Production

Challenges

  1. Cost: GPT-4 API pricing is roughly $0.03 (input) to $0.06 (output) per 1K tokens, and varies by model
  2. Latency: Responses can take 2-10 seconds
  3. Hallucinations: Models make up facts confidently
  4. Context Limits: Context windows range from roughly 4K to 128K tokens, depending on the model

Solutions

# Example: Optimize cost with caching
# (uses the legacy pre-1.0 openai SDK interface, matching openai.ChatCompletion)
from functools import lru_cache

import openai

@lru_cache(maxsize=1000)
def get_llm_response(prompt):
    # Identical prompts are served from the in-memory cache;
    # only new prompts trigger an API call
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

Real-World Example

At Nike, an agentic AI sustainability assistant combined:

  • Base: GPT-4 for reasoning
  • RAG: Retrieve carbon metrics from Databricks
  • Prompt: Structured for compliance reporting
  • Validation: Cross-check with regulatory frameworks
  • Result: Accurate, compliant responses for sustainability queries

Key Insight: LLM + Domain Data + Validation = Production-Ready AI
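
A very rough sketch of that pattern follows; the helpers retrieve_carbon_metrics and validate_against_framework are hypothetical placeholders (stubbed out here), not real Nike or Databricks APIs, and the prompt wording is illustrative:

# Hypothetical sketch of the LLM + Domain Data + Validation pattern
# (uses the legacy pre-1.0 openai SDK interface)
import openai

def retrieve_carbon_metrics(question):
    # Hypothetical RAG step: in production this would query Databricks tables
    # or a vector store; here we return a hard-coded stand-in
    return "factory_A: 120 tCO2e (2023)\nfactory_B: 95 tCO2e (2023)"

def validate_against_framework(answer):
    # Hypothetical validation step: in production this would cross-check the
    # answer against regulatory frameworks; here we only require non-empty text
    return bool(answer.strip())

def answer_sustainability_query(question):
    metrics = retrieve_carbon_metrics(question)   # 1. RAG: ground in domain data
    prompt = (                                    # 2. Structured prompt
        "You are a sustainability reporting assistant.\n"
        f"Carbon metrics:\n{metrics}\n\n"
        f"Question: {question}\n"
        "Answer using only the metrics above."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for compliance reporting
    )
    answer = response["choices"][0]["message"]["content"]
    if not validate_against_framework(answer):    # 3. Validation before returning
        raise ValueError("Answer failed compliance validation")
    return answer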


Best Practices

  1. Prompt Templates: Standardize for consistency
  2. Temperature Control: 0 = deterministic, 1 = creative
  3. Validation: Always verify critical outputs
  4. Cost Monitoring: Track token usage
  5. Fallbacks: Handle API failures gracefully
  6. Version Control: Lock model versions for reproducibility
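
A small sketch that combines several of these practices, a pinned model version, temperature 0, token tracking, and a retry/fallback path (the model names, retry policy, and legacy pre-1.0 openai SDK usage are illustrative assumptions):

# Sketch: pinned model version, deterministic output, cost tracking, graceful fallback
import time

import openai

PRIMARY_MODEL = "gpt-4-0613"       # pin an explicit version for reproducibility
FALLBACK_MODEL = "gpt-3.5-turbo"   # cheaper fallback if the primary keeps failing

def ask(prompt, retries=2):
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        for attempt in range(retries):
            try:
                response = openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,  # deterministic output for repeatable behavior
                )
                # Track token usage for cost monitoring
                print(f"{model}: {response['usage']['total_tokens']} tokens")
                return response["choices"][0]["message"]["content"]
            except openai.error.OpenAIError:
                time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError("All models failed; fall back to a cached or canned response")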