Building Self-Attention Layers
- Understand the motivation for attention: the limitations of fixed-window n-gram models
- Explore how word meaning changes with context using static vs. contextual embeddings (e.g., the "bank" problem)
- Learn the mechanics of self-attention: Query, Key, Value, dot products, and weighted sums
- Manually compute attention scores and visualize how softmax creates a probabilistic focus over the context (see the first sketch after this list)
- Implement self-attention layers in PyTorch using toy examples and evaluate their outputs
- Visualize attention heatmaps from real LLMs to interpret which words the model attends to (see the GPT-2 sketch below)
- Compare loss curves of self-attention models vs. trigram models and observe their learning dynamics
- Understand how embeddings evolve through transformer layers and extract them from GPT-2
- Build both single-head and multi-head transformer models; compare their predictions and training performance (see the multi-head sketch below)
- Implement a Mixture-of-Experts (MoE) attention model and observe its gating behavior on different inputs (see the MoE sketch below)
- Evaluate self-attention vs. MoE vs. n-gram models on fluency, generalization, and loss curves
- Run a meta-evaluation across all models to compare generation quality and training stability
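
As a reference point for the manual attention computation, here is a minimal sketch of scaled dot-product self-attention on a toy input. The projection matrices are random stand-ins for learned weights, and the sequence length and embedding size are arbitrary choices, not values from the chapter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "sentence" of 4 tokens, each embedded in 8 dimensions.
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)

# Query, Key, and Value projections (random here; learned in a real layer).
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: every query dotted with every key,
# scaled by sqrt(d_model) to keep the softmax well-behaved.
scores = Q @ K.T / d_model ** 0.5      # shape: (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)    # each row sums to 1

# Each output token is a weighted sum of the value vectors.
out = weights @ V                      # shape: (seq_len, d_model)

print(weights)    # rows of this matrix are the attention "heatmap"
print(out.shape)
```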
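
The heatmap and layer-wise embedding objectives can be explored with the Hugging Face transformers library. The sketch below assumes the library is installed and uses the public "gpt2" checkpoint; the example sentence and the indices printed at the end are illustrative only.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained(
    "gpt2", output_attentions=True, output_hidden_states=True
)
model.eval()

text = "The bank raised interest rates near the river bank."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len)
# hidden_states: embeddings after each layer, shape (batch, seq_len, 768)
print(len(outputs.attentions), outputs.attentions[0].shape)
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```

Each attention matrix can be rendered with a standard plotting tool (e.g., matplotlib's imshow) to produce the per-head heatmaps referred to above, and the hidden states give the layer-by-layer embeddings of each token.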
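
For the single-head vs. multi-head comparison, PyTorch's built-in nn.MultiheadAttention gives a quick sketch; the embedding size, head count, and random input below are arbitrary assumptions rather than the chapter's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, num_heads = 32, 4
x = torch.randn(1, 6, d_model)   # (batch, seq_len, d_model)

# Same input through single-head and multi-head self-attention.
single = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
multi = nn.MultiheadAttention(d_model, num_heads=num_heads, batch_first=True)

out_single, w_single = single(x, x, x)  # weights: (batch, seq_len, seq_len)
out_multi, w_multi = multi(x, x, x, average_attn_weights=False)
# w_multi: (batch, num_heads, seq_len, seq_len) -- one heatmap per head
print(w_single.shape, w_multi.shape)
```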
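
A Mixture-of-Experts layer can be sketched as a softmax gate mixing the outputs of a few small expert MLPs. TinyMoE is a hypothetical name, and the dense mixing over all experts shown here is a simplification of the sparse top-k routing used in production MoE models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal MoE layer: a softmax gate mixes the outputs of expert MLPs."""
    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        gate_weights = F.softmax(self.gate(x), dim=-1)        # (B, S, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, S, D, E)
        out = (expert_out * gate_weights.unsqueeze(-2)).sum(-1)         # (B, S, D)
        return out, gate_weights

torch.manual_seed(0)
moe = TinyMoE(d_model=16)
x = torch.randn(2, 5, 16)
out, gates = moe(x)
print(gates[0, 0])   # which experts the gate favours for the first token
```

Printing the gate weights for different inputs is the simplest way to observe the gating behavior mentioned in the objectives: tokens with different statistics tend to be routed to different experts as training progresses.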