
Deep Learning Architectures: CNNs, RNNs, and Transformers

November 28, 2024
12 min read
Deep Learning · Neural Networks · PyTorch · Transformers

Deep Learning Architectures

What You Need to Learn

  1. Neural Network Basics: Neurons, layers, backpropagation
  2. CNNs: Convolutional Neural Networks for images
  3. RNNs/LSTMs: Recurrent networks for sequences
  4. Transformers: Attention mechanism, BERT, GPT
  5. Training Techniques: Batch normalization, dropout, learning rate scheduling

ELI5: What are Neural Networks?

Your brain learning to recognize your friend's face:

  1. Input Layer = Your eyes see features (hair color, eye shape, nose)
  2. Hidden Layers = Your brain combines features ("brown hair + blue eyes + small nose")
  3. Output Layer = Recognition! "That's Sarah!"

Artificial Neural Network:

  • Same idea, but with math
  • Each "neuron" is a simple calculation
  • Many layers of neurons = "deep" learning
  • Learns by adjusting connections (weights)

Example: Teaching a network to recognize cats:

  • Show 1000 cat pictures → network adjusts weights
  • Show 1000 non-cat pictures → network adjusts more
  • New picture → network predicts: cat or not cat!
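
In code, a tiny version of this is just a few layers. Here is a minimal PyTorch sketch (the feature count, layer sizes, and random data are illustrative assumptions, not a real cat detector):

# Example: tiny feedforward "cat vs. not cat" classifier over 10 pre-extracted features
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(10, 16),   # input features -> hidden layer
    nn.ReLU(),           # non-linearity lets layers combine features
    nn.Linear(16, 1),    # hidden layer -> single output score
    nn.Sigmoid(),        # squash to a 0..1 "cat probability"
)

x = torch.randn(4, 10)   # a batch of 4 made-up examples
print(net(x).shape)      # torch.Size([4, 1])

Training adjusts the weights inside the two Linear layers, exactly like the "show 1000 pictures, adjust weights" loop above.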

System Design: Neural Network Architecture Types

┌────────────────────────────────────────────┐
│  Feedforward Neural Network (FNN)          │
│  Input → Hidden → Hidden → Output          │
│  Use: Tabular data, simple classification  │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│  Convolutional Neural Network (CNN)        │
│  Conv → Pool → Conv → Pool → Dense         │
│  Use: Images, spatial data                 │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│  Recurrent Neural Network (RNN/LSTM)       │
│  Input[t] → Hidden[t] → Output[t]          │
│                 ↓                          │
│  Input[t+1] → Hidden[t+1] → Output[t+1]    │
│  Use: Time series, text sequences          │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│  Transformer Architecture                  │
│  Input → Embedding → Attention →           │
│       → Feed Forward → Output              │
│  Use: NLP, translation, GPT/BERT           │
└────────────────────────────────────────────┘

CNN: Understanding Convolutional Layers

Why CNNs for Images?

Traditional neural network: 1000x1000 image = 1M pixels = 1M weights per neuron = HUGE!

CNN Solution: Local patterns matter more

Image (cat):
┌─────────────┐
│ ╱\_╱\_       │  ← Ears (pattern)
│ (• . •)     │  ← Eyes (pattern)
│  > ^ <      │  ← Whiskers (pattern)
└─────────────┘

Convolution:
- Small filter slides across image
- Detects edges, then shapes, then objects
- Shares weights (efficient!)

Layers:

  1. Conv Layer: Detect features (edges, textures)
  2. Pooling: Reduce size, keep important info
  3. Conv Layer: Detect higher features (eyes, ears)
  4. Pooling: Reduce more
  5. Dense: Final classification
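
Putting those five steps together in PyTorch, a minimal sketch (the channel counts, 32x32 input size, and two output classes are illustrative assumptions):

# Example: small CNN following the Conv -> Pool -> Conv -> Pool -> Dense pattern
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 1. detect low-level features (edges)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 2. 32x32 -> 16x16, keep strongest responses
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 3. detect higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 4. 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 2),                     # 5. final classification: cat / not cat
)

x = torch.randn(1, 3, 32, 32)   # one RGB image, 32x32 pixels
print(cnn(x).shape)             # torch.Size([1, 2])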

LSTM: Handling Sequential Data

Problem with basic RNNs: Forget long-term context

LSTM (Long Short-Term Memory): Remembers important info, forgets irrelevant

# Example: LSTM for time-series forecasting
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        # batch_first=True means inputs are shaped (batch, seq_len, features)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # map final hidden state to one value

    def forward(self, x):
        lstm_out, _ = self.lstm(x)                 # (batch, seq_len, hidden_size)
        predictions = self.fc(lstm_out[:, -1, :])  # keep only the last time step
        return predictions
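
Hypothetical usage (the batch size, sequence length, and feature count are illustrative assumptions):

import torch

model = LSTMModel(input_size=8, hidden_size=64, num_layers=2)
x = torch.randn(32, 50, 8)   # (batch=32, seq_len=50, features=8)
print(model(x).shape)        # torch.Size([32, 1]) -- one prediction per sequence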

Use Cases:

  • Stock price prediction
  • Language translation
  • Speech recognition
  • Weather forecasting

Transformer Architecture

Revolutionary idea: Attention is all you need!

Problem: RNNs process sequentially (slow)

Solution: Process all words simultaneously, use "attention" to find relationships

Sentence: "The cat sat on the mat"

Attention Mechanism:
- "sat" pays attention to "cat" (who sat?)
- "sat" pays attention to "mat" (sat where?)
- Learns relationships without sequential processing
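
A minimal sketch of that computation, single-head scaled dot-product self-attention over random embeddings (the sequence length and embedding size are illustrative assumptions):

# Example: scaled dot-product self-attention, the core Transformer operation
import torch
import torch.nn.functional as F

seq_len, d_model = 6, 16            # 6 tokens ("The cat sat on the mat"), 16-dim embeddings
x = torch.randn(seq_len, d_model)   # token embeddings (random stand-ins)

# Learned projections would produce queries, keys, and values; random here
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5     # how strongly each word attends to every other word
weights = F.softmax(scores, dim=-1)   # each row sums to 1: one attention pattern per word
output = weights @ V                  # each word becomes a weighted mix of all words
print(weights.shape, output.shape)    # torch.Size([6, 6]) torch.Size([6, 16])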

Key Components:

  1. Self-Attention: Words relate to other words
  2. Multi-Head Attention: Multiple attention patterns
  3. Positional Encoding: Remember word order
  4. Feed Forward: Standard neural network layers

Famous Transformers:

  • BERT: Bidirectional Encoder Representations from Transformers (understanding)
  • GPT: Generative Pre-trained Transformer (generation)
  • T5: Text-to-Text Transfer Transformer
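
To experiment with these models directly, one common route is the Hugging Face transformers library (an extra dependency, assumed installed here):

# Example: load a pre-trained BERT encoder and run one sentence through it
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)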

Training Deep Networks

Challenges

Vanishing Gradients: Deep networks stop learning because gradients shrink toward zero as they flow back through many layers. Solutions:

  • Batch Normalization
  • Residual Connections (ResNet)
  • Better activations (ReLU, GELU)

Overfitting: The network memorizes the training data instead of generalizing. Solutions:

  • Dropout (randomly turn off neurons)
  • Data augmentation
  • Early stopping
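
In code, batch normalization and dropout are just extra layers dropped into the model; a minimal sketch (the layer sizes and 0.5 dropout rate are illustrative assumptions):

# Example: batch normalization + dropout inside a small classifier
import torch.nn as nn

regularized_net = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # keeps activations well-scaled so gradients don't vanish
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zeroes neurons during training to fight overfitting
    nn.Linear(128, 10),
)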

Best Practices

# Example training loop
import torch
import torch.nn as nn
import torch.optim as optim

model = MyModel()  # MyModel, dataloader, and num_epochs are defined elsewhere
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass: compute predictions and loss
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['labels'])

        # Backward pass: clear old gradients, backpropagate, update weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
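
Learning rate scheduling (from the topics list above) plugs into the same loop; a sketch using PyTorch's built-in StepLR with an illustrative decay schedule:

# Example: decay the learning rate by 10x every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(num_epochs):
    for batch in dataloader:
        ...                   # same forward/backward steps as above
    scheduler.step()          # update the learning rate once per epoch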

Real-World Example

At UNT, energy consumption forecasting:

  • LSTM + GRU hybrid architecture
  • Input: Temperature, occupancy, time features
  • 15% better accuracy than traditional models
  • SHAP for interpretability (which features matter?)

Choosing the Right Architecture

Data Type   | Architecture | Example
----------- | ------------ | -----------------
Images      | CNN          | Face recognition
Text        | Transformer  | ChatGPT
Time Series | LSTM/GRU     | Stock prediction
Tabular     | FNN          | Fraud detection
Audio       | CNN + RNN    | Speech-to-text