
Large Language Models (LLMs): GPT, BERT, and Beyond

November 20, 2024
11 min read
LLM · GPT · BERT · NLP · AI


What You Need to Learn

  1. Transformer Architecture: Self-attention, encoder-decoder
  2. Pre-training Methods: Masked language modeling, causal language modeling
  3. Fine-tuning: Task-specific adaptation
  4. Prompt Engineering: Zero-shot, few-shot, chain-of-thought
  5. Deployment: API vs self-hosted, cost optimization

ELI5: What are Large Language Models?

Imagine a student who read the entire internet:

Traditional Program:

  • You: "Translate 'hello' to Spanish"
  • Computer: "Hola" (looks up in dictionary)

Large Language Model (LLM):

  • Computer read billions of webpages
  • Learned patterns in language
  • You: "Translate 'hello' to Spanish"
  • Computer: "Based on patterns I've seen, it's 'Hola'"

Magic: LLMs can do tasks they were never explicitly trained for!

How? By understanding language patterns:

  • Grammar rules (without being told)
  • Context (what comes before/after)
  • Relationships (synonyms, antonyms)
  • Even reasoning (to some extent!)

System Design: LLM Architecture

┌─────────────────────────────────────────┐
│         Pre-training Phase              │
│  Internet Text → Transformer →          │
│     → Base Model (billions of params)   │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         Fine-tuning Phase               │
│  Task-specific data →                   │
│     → Specialized Model                 │
└─────────────┬───────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│         Inference Phase                 │
│  User Prompt → Model →                  │
│     → Generated Response                │
└─────────────────────────────────────────┘

Model Components:
┌─────────────────────────────────────────┐
│  Input Text                             │
│     ↓                                   │
│  Tokenization (words → numbers)         │
│     ↓                                   │
│  Embedding Layer (numbers → vectors)    │
│     ↓                                   │
│  Transformer Blocks (24-96 layers)      │
│   - Self-Attention                      │
│   - Feed Forward                        │
│     ↓                                   │
│  Output Layer                           │
│     ↓                                   │
│  Generated Text                         │
└─────────────────────────────────────────┘
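
A minimal sketch of the first few steps, tokenization, embedding, and the transformer stack, using the Hugging Face transformers library (the model choice bert-base-uncased and the printed shapes are illustrative, and transformers plus torch are assumed to be installed):

# Sketch: text → token IDs → contextual vectors
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenization: words → numbers (integer token IDs)
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
print(inputs["input_ids"])  # tensor of token IDs, including [CLS]/[SEP] markers

# Embedding layer + transformer blocks: token IDs → one context-aware vector per token
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size), e.g. (1, 8, 768)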

How LLMs Generate Text

Example: Complete "The cat sat on the"

  1. Tokenization: Break the text into pieces
    • ["The", "cat", "sat", "on", "the", "???"]
  2. Embedding: Convert tokens to numbers (vectors)
  3. Attention: Look at the context
    • "mat" makes sense (cats sit on mats)
    • "moon" doesn't make sense (cats don't sit on moons)
  4. Probability Distribution:
    • 60% → "mat"
    • 20% → "floor"
    • 10% → "couch"
    • 10% → other
  5. Selection: Greedy decoding picks "mat" (the highest probability); sampling draws from the distribution, so a lower-probability word is occasionally chosen instead

Result: "The cat sat on the mat"
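
A toy sketch of that last step; the probabilities mirror the example above (with "sofa" standing in for "other") and contrast greedy decoding with sampling:

# Toy next-token selection over the distribution from the example above
import random

next_token_probs = {"mat": 0.60, "floor": 0.20, "couch": 0.10, "sofa": 0.10}

# Greedy decoding: always take the highest-probability token
greedy_pick = max(next_token_probs, key=next_token_probs.get)
print(greedy_pick)  # "mat"

# Sampling: draw proportionally to probability, so lower-probability
# tokens are occasionally chosen (this is what temperature controls)
tokens, weights = zip(*next_token_probs.items())
sampled_pick = random.choices(tokens, weights=weights, k=1)[0]
print(sampled_pick)  # usually "mat", sometimes "floor", "couch", or "sofa"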


GPT vs BERT: Key Differences

GPT (Generative Pre-trained Transformer)

Goal: Generate next word

Training: "The cat sat on the ___" (causal language modeling: no mask, left-to-right only)
GPT learns: Predict what comes next
Use: Text generation, completion, chat

BERT (Bidirectional Encoder Representations)

Goal: Understand context

Training: "The cat [MASK] on the mat"
BERT learns: What word fits here? (looks both ways!)
Use: Classification, Q&A, understanding

Key Difference: GPT = Generator, BERT = Understander
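
The contrast is easy to see with the Hugging Face pipeline API; a minimal sketch, where the model choices gpt2 and bert-base-uncased are illustrative stand-ins for GPT- and BERT-style models:

# GPT-style: generate what comes next (left-to-right)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The cat sat on the", max_new_tokens=3)[0]["generated_text"])

# BERT-style: fill in a masked word using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The cat [MASK] on the mat.")[0]["token_str"])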


Prompt Engineering

The art of asking LLMs the right way

Zero-Shot

Prompt: "Classify sentiment: 'This movie was terrible'"
Response: "Negative"

Few-Shot

Prompt: 
"Classify sentiment:
Example 1: 'I loved it' → Positive
Example 2: 'It was okay' → Neutral
Example 3: 'Worst ever' → Negative

New: 'This movie was terrible'"
Response: "Negative"

Chain-of-Thought

Prompt: "What's a 15% tip on $82.50? Think step by step."
Response:
"Step 1: Calculate 10% = $8.25
Step 2: Calculate 5% = $4.13
Step 3: Add them: $8.25 + $4.13 = $12.38"
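
All three styles are easy to standardize as templates; here is a minimal plain-Python sketch (the template wording and the helper names build_few_shot_prompt / build_cot_prompt are illustrative assumptions):

# Building few-shot and chain-of-thought prompts from simple templates
FEW_SHOT_EXAMPLES = [
    ("I loved it", "Positive"),
    ("It was okay", "Neutral"),
    ("Worst ever", "Negative"),
]

def build_few_shot_prompt(text):
    # Prepend labeled examples so the model can infer the task and output format
    lines = ["Classify sentiment:"]
    lines += [f"Example: '{example}' -> {label}" for example, label in FEW_SHOT_EXAMPLES]
    lines.append(f"New: '{text}' ->")
    return "\n".join(lines)

def build_cot_prompt(question):
    # "Think step by step" nudges the model to spell out intermediate reasoning
    return f"{question} Think step by step."

print(build_few_shot_prompt("This movie was terrible"))
print(build_cot_prompt("What's a 15% tip on $82.50?"))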

Fine-Tuning vs Prompting

When to Fine-Tune

  • Domain-specific language (legal, medical)
  • Consistent output format needed
  • Privacy concerns (can't send data to API)
  • Cost optimization (many queries)
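
If fine-tuning is the right call, the workflow looks roughly like this; a minimal sketch with Hugging Face transformers and datasets, where the toy dataset, model choice, and hyperparameters are illustrative assumptions rather than a production recipe:

# Minimal fine-tuning sketch: adapt a pre-trained model to sentiment classification
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny in-memory dataset just to show the wiring; use real labeled data in practice
train_data = Dataset.from_dict({
    "text": ["I loved it", "Worst ever", "Best film of the year", "Terrible movie"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch the examples
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_data.map(tokenize, batched=True),
)
trainer.train()  # adapts the pre-trained weights to the labeled task data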

When to Prompt

  • Quick prototyping
  • Varied tasks
  • No labeled data for training
  • Using latest model capabilities

LLM in Production

Challenges

  1. Cost: GPT-4 API pricing is roughly $0.03 (input) to $0.06 (output) per 1K tokens, and varies by model
  2. Latency: Responses can take 2-10 seconds
  3. Hallucinations: Models make up facts confidently
  4. Context Limits: Context windows range from roughly 4K to 128K tokens, depending on the model

Solutions

# Example: Optimize cost with caching
# (uses the legacy pre-1.0 openai SDK interface, matching openai.ChatCompletion)
from functools import lru_cache

import openai

@lru_cache(maxsize=1000)
def get_llm_response(prompt):
    # Identical prompts are served from the in-memory cache;
    # only new prompts trigger an API call
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

Real-World Example

At Nike, an agentic AI sustainability assistant combined:

  • Base: GPT-4 for reasoning
  • RAG: Retrieve carbon metrics from Databricks
  • Prompt: Structured for compliance reporting
  • Validation: Cross-check with regulatory frameworks
  • Result: Accurate, compliant responses for sustainability queries

Key Insight: LLM + Domain Data + Validation = Production-Ready AI
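
A very rough sketch of that pattern follows; the helpers retrieve_carbon_metrics and validate_against_framework are hypothetical placeholders (stubbed out here), not real Nike or Databricks APIs, and the prompt wording is illustrative:

# Hypothetical sketch of the LLM + Domain Data + Validation pattern
# (uses the legacy pre-1.0 openai SDK interface)
import openai

def retrieve_carbon_metrics(question):
    # Hypothetical RAG step: in production this would query Databricks tables
    # or a vector store; here we return a hard-coded stand-in
    return "factory_A: 120 tCO2e (2023)\nfactory_B: 95 tCO2e (2023)"

def validate_against_framework(answer):
    # Hypothetical validation step: in production this would cross-check the
    # answer against regulatory frameworks; here we only require non-empty text
    return bool(answer.strip())

def answer_sustainability_query(question):
    metrics = retrieve_carbon_metrics(question)   # 1. RAG: ground in domain data
    prompt = (                                    # 2. Structured prompt
        "You are a sustainability reporting assistant.\n"
        f"Carbon metrics:\n{metrics}\n\n"
        f"Question: {question}\n"
        "Answer using only the metrics above."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for compliance reporting
    )
    answer = response["choices"][0]["message"]["content"]
    if not validate_against_framework(answer):    # 3. Validation before returning
        raise ValueError("Answer failed compliance validation")
    return answer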


Best Practices

  1. Prompt Templates: Standardize for consistency
  2. Temperature Control: 0 = deterministic, 1 = creative
  3. Validation: Always verify critical outputs
  4. Cost Monitoring: Track token usage
  5. Fallbacks: Handle API failures gracefully
  6. Version Control: Lock model versions for reproducibility
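
A small sketch that combines several of these practices, a pinned model version, temperature 0, token tracking, and a retry/fallback path (the model names, retry policy, and legacy pre-1.0 openai SDK usage are illustrative assumptions):

# Sketch: pinned model version, deterministic output, cost tracking, graceful fallback
import time

import openai

PRIMARY_MODEL = "gpt-4-0613"       # pin an explicit version for reproducibility
FALLBACK_MODEL = "gpt-3.5-turbo"   # cheaper fallback if the primary keeps failing

def ask(prompt, retries=2):
    for model in (PRIMARY_MODEL, FALLBACK_MODEL):
        for attempt in range(retries):
            try:
                response = openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0,  # deterministic output for repeatable behavior
                )
                # Track token usage for cost monitoring
                print(f"{model}: {response['usage']['total_tokens']} tokens")
                return response["choices"][0]["message"]["content"]
            except openai.error.OpenAIError:
                time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError("All models failed; fall back to a cached or canned response")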