Deep Learning Architectures: CNNs, RNNs, and Transformers
What You Need to Learn
- Neural Network Basics: Neurons, layers, backpropagation
- CNNs: Convolutional Neural Networks for images
- RNNs/LSTMs: Recurrent networks for sequences
- Transformers: Attention mechanism, BERT, GPT
- Training Techniques: Batch normalization, dropout, learning rate scheduling
ELI5: What are Neural Networks?
Your brain learning to recognize your friend's face:
- Input Layer = Your eyes see features (hair color, eye shape, nose)
- Hidden Layers = Your brain combines features ("brown hair + blue eyes + small nose")
- Output Layer = Recognition! "That's Sarah!"
Artificial Neural Network:
- Same idea, but with math
- Each "neuron" is a simple calculation
- Many layers of neurons = "deep" learning
- Learns by adjusting connections (weights)
Example: Teaching a network to recognize cats:
- Show 1000 cat pictures → network adjusts weights
- Show 1000 non-cat pictures → network adjusts more
- New picture → network predicts: cat or not cat!
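A minimal PyTorch sketch of that cat/not-cat idea (the layer sizes, fake data, and training settings below are illustrative assumptions, not part of the original example):
# A tiny "cat or not cat" network learning by adjusting its weights
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(10, 16),   # input features -> hidden layer
    nn.ReLU(),
    nn.Linear(16, 1),    # hidden layer -> single cat/not-cat score
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

features = torch.randn(8, 10)                   # 8 fake examples, 10 features each
labels = torch.randint(0, 2, (8, 1)).float()    # 1 = cat, 0 = not cat
for step in range(100):
    loss = loss_fn(net(features), labels)       # how wrong are the predictions?
    optimizer.zero_grad()
    loss.backward()                             # compute how each weight should change
    optimizer.step()                            # adjust the weights (the "learning")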
System Design: Neural Network Architecture Types
┌─────────────────────────────────────────┐
│ Feedforward Neural Network (FNN) │
│ Input → Hidden → Hidden → Output │
│ Use: Tabular data, simple classification│
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Convolutional Neural Network (CNN) │
│ Conv → Pool → Conv → Pool → Dense │
│ Use: Images, spatial data │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Recurrent Neural Network (RNN/LSTM) │
│ Input[t] → Hidden[t] → Output[t] │
│ ↓ │
│ Input[t+1] → Hidden[t+1] → Output[t+1] │
│ Use: Time series, text sequences │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ Transformer Architecture │
│ Input → Embedding → Attention → │
│ → Feed Forward → Output │
│ Use: NLP, translation, GPT/BERT │
└─────────────────────────────────────────┘
CNN: Understanding Convolutional Layers
Why CNNs for Images?
Traditional fully connected network: a 1000x1000 image = 1M pixels = 1M weights per first-layer neuron = HUGE!
CNN Solution: Local patterns matter more
Image (cat):
┌─────────────┐
│ ╱\_╱\_ │ ← Ears (pattern)
│ (• . •) │ ← Eyes (pattern)
│ > ^ < │ ← Whiskers (pattern)
└─────────────┘
Convolution:
- Small filter slides across image
- Detects edges, then shapes, then objects
- Shares weights (efficient!)
Typical layer stack (see the code sketch after this list):
- Conv Layer: Detect features (edges, textures)
- Pooling: Reduce size, keep important info
- Conv Layer: Detect higher features (eyes, ears)
- Pooling: Reduce more
- Dense: Final classification
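Here is a minimal PyTorch sketch of that Conv → Pool → Conv → Pool → Dense stack (the channel counts, 32x32 input size, and two output classes are illustrative assumptions, not a reference design):
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # detect low-level features (edges, textures)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample, keep the strongest responses
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # detect higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)              # flatten feature maps into one vector per image
        return self.classifier(x)

scores = SmallCNN()(torch.randn(4, 3, 32, 32))   # batch of four 32x32 RGB images -> shape (4, 2)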
LSTM: Handling Sequential Data
Problem with basic RNNs: they forget long-term context as sequences grow.
LSTM (Long Short-Term Memory): uses gates to remember important information and forget what is irrelevant.
# Example: LSTM for time-series forecasting
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True)        # input shape: (batch, seq_len, features)
        self.fc = nn.Linear(hidden_size, 1)          # map hidden state to a single prediction

    def forward(self, x):
        lstm_out, _ = self.lstm(x)                   # run the whole sequence through the LSTM
        predictions = self.fc(lstm_out[:, -1, :])    # use the last time step's hidden state
        return predictions
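Usage sketch (the sizes and dummy batch below are assumptions for illustration):
import torch
model = LSTMModel(input_size=4, hidden_size=32, num_layers=2)
batch = torch.randn(8, 24, 4)    # 8 series, 24 time steps, 4 features each
print(model(batch).shape)        # torch.Size([8, 1]): one prediction per series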
Use Cases:
- Stock price prediction
- Language translation
- Speech recognition
- Weather forecasting
Transformer Architecture
Revolutionary idea: Attention is all you need!
Problem: RNNs process sequentially (slow)
Solution: Process all words simultaneously, use "attention" to find relationships
Sentence: "The cat sat on the mat"
Attention Mechanism:
- "sat" pays attention to "cat" (who sat?)
- "sat" pays attention to "mat" (sat where?)
- Learns relationships without sequential processing
Key Components (self-attention is sketched in code after this list):
- Self-Attention: Words relate to other words
- Multi-Head Attention: Multiple attention patterns
- Positional Encoding: Remember word order
- Feed Forward: Standard neural network layers
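A minimal sketch of single-head self-attention in PyTorch (the tensor sizes and random projection matrices are illustrative assumptions; real transformers use multiple heads, learned projections, masking, and positional encodings):
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # queries, keys, values for every word
    scores = q @ k.T / k.shape[-1] ** 0.5    # how strongly each word attends to every other word
    weights = F.softmax(scores, dim=-1)      # each row is one word's attention distribution
    return weights @ v                       # weighted mix of value vectors

d_model = 16
x = torch.randn(6, d_model)                  # 6 embedded tokens: "The cat sat on the mat"
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape: (6, 16)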
Famous Transformers:
- BERT: Bidirectional Encoder (understanding)
- GPT: Generative Pre-trained (generation)
- T5: Text-to-Text Transfer
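To try these models quickly, the Hugging Face transformers library exposes them through a simple pipeline API; a minimal sketch (assumes `pip install transformers`; the default and named models are downloaded on first use):
from transformers import pipeline

classifier = pipeline("sentiment-analysis")             # BERT-style encoder: understanding text
print(classifier("Transformers changed NLP."))

generator = pipeline("text-generation", model="gpt2")   # GPT-style decoder: generating text
print(generator("The transformer architecture", max_new_tokens=20))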
Training Deep Networks
Challenges
Vanishing Gradients: deep networks stop learning because gradients shrink toward 0 as they are propagated back through many layers. Solutions (the first two are sketched in code after this list):
- Batch Normalization
- Residual Connections (ResNet)
- Better activations (ReLU, GELU)
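A minimal sketch of a residual block that combines batch normalization with a skip connection (the channel counts and layer arrangement are illustrative assumptions):
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),    # keep activations well-scaled from layer to layer
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))   # skip connection: gradients can flow straight through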
Overfitting: the network memorizes the training data instead of learning patterns that generalize. Solutions (dropout is sketched in code after this list):
- Dropout (randomly turn off neurons during training)
- Data augmentation
- Early stopping
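A minimal sketch of dropout inside a small classifier (the 0.5 drop probability and layer sizes are illustrative assumptions):
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # during training, randomly zero half the activations
    nn.Linear(64, 10),
)
model.train()   # dropout active while training
model.eval()    # dropout disabled at inference time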
Best Practices
# Example training loop (MyModel, dataloader, and num_epochs are placeholders
# you define for your own task)
import torch
import torch.nn as nn
import torch.optim as optim

model = MyModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['labels'])
        # Backward pass
        optimizer.zero_grad()   # clear gradients accumulated from the previous step
        loss.backward()         # compute gradients via backpropagation
        optimizer.step()        # update the weights
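Two details worth noting: optimizer.zero_grad() is called before loss.backward() because PyTorch accumulates gradients across iterations by default, and model.train() / model.eval() should be toggled explicitly once the model contains dropout or batch normalization layers, since those behave differently during training and inference.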
Real-World Example
An energy consumption forecasting project at UNT:
- LSTM + GRU hybrid architecture
- Input: Temperature, occupancy, time features
- 15% better accuracy than traditional models
- SHAP for interpretability (which features matter?)
Choosing the Right Architecture
| Data Type | Architecture | Example |
|---|---|---|
| Images | CNN | Face recognition |
| Text | Transformer | ChatGPT |
| Time Series | LSTM/GRU | Stock prediction |
| Tabular | FNN | Fraud detection |
| Audio | CNN + RNN | Speech-to-text |