Learning Plan
Week 0: Overview of ML Fundamentals
The ML fundamentals section introduces model evaluation, classical algorithms, and related concepts, which become the building blocks for the topics in the following weeks.
Week 1-2: Probability Foundations + Markov Assumption
- Probability_and_Markov_Overview
- Topics:
- Conditional probability and Bayes' rule
- Naive Bayes and Gaussian Naive Bayes
- Joint and marginal distributions
- Markov Assumption: what it is and why it matters in NLP
- Resources:
- StatQuest: Conditional Probability (YouTube)
- StatQuest: Bayes' Rule
- 3Blue1Brown: Bayes theorem, the geometry of changing beliefs
- StatQuest: Naive Bayes
- StatQuest: Gaussian Naive Bayes
- Khan Academy - Probability & Statistics
- Speech and Language Processing by Jurafsky & Martin Ch. 3 (Markov models)
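As a warm-up for this week, Bayes' rule and the law of total probability can be worked through numerically. A minimal sketch on a toy spam-filter example (all probabilities below are made-up numbers for illustration):

```python
# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.3                 # prior: fraction of messages that are spam
p_word_given_spam = 0.8      # likelihood: word appears in spam
p_word_given_ham = 0.1       # likelihood: word appears in non-spam

# Marginal P(word) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the message is spam given the word appears
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # -> 0.7742
```

Seeing the word moves the probability of spam from the 0.3 prior up to about 0.77, which is exactly the "updating beliefs" picture from the 3Blue1Brown video.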
Week 3: N-gram Models & Language Modeling
- Ngram_Language_Modeling
- Topics:
- What is an n-gram?
- How n-gram language models work
- Perplexity and limitations of n-gram models
- Activities:
- Implement a bigram/trigram model on a toy corpus
- Resources:
- The Illustrated Transformer - start with n-gram part
- Happy-LLM intro chapter
- Optional: n-gram language model notebook
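The bigram activity above can be sketched in a few lines with a maximum-likelihood estimate and no smoothing (the toy corpus here is invented for illustration):

```python
from collections import defaultdict, Counter

# Toy corpus for a minimal bigram language model
corpus = "the cat sat on the mat the cat ran".split()

# counts[w1][w2] = number of times w1 is immediately followed by w2
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def bigram_prob(w1, w2):
    """P(w2 | w1), estimated by maximum likelihood (no smoothing)."""
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

# "the" is followed by cat, mat, cat -> P(cat | the) = 2/3
print(bigram_prob("the", "cat"))
```

Extending this to trigrams just means keying the counts on word pairs; the zero probabilities for unseen bigrams are a good concrete entry point into the smoothing and sparsity limitations discussed this week.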
Week 4: Intro to Information Theory
- Topics:
- Entropy, Cross-Entropy, KL Divergence
- Why they matter in language modeling
- Activities:
- Manually compute entropy of a simple probability distribution
- Implement cross-entropy loss
- Resources:
- 3Blue1Brown: But what is entropy?
- Stanford CS224n Lecture 1
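Both activities for this week fit in a few lines. A minimal sketch for discrete distributions, using base-2 logs so the results are in bits (the distributions `p` and `q` below are arbitrary examples):

```python
import math

def entropy(p):
    """H(p) = -sum p_i log2 p_i, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log2 q_i: expected code length using model q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): the extra bits paid for using q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # "true" distribution
q = [1/3, 1/3, 1/3]     # model distribution
print(entropy(p))        # -> 1.5 bits
```

The connection to language modeling: training minimizes the cross-entropy between the data distribution and the model, and perplexity from Week 3 is just `2 ** cross_entropy`.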
Week 5-6: Linear Algebra for ML
- Linear_Algebra_for_ML
- Topics:
- Vectors, Matrices, Matrix Multiplication
- Dot product, norms, projections
- Eigenvalues & Singular Value Decomposition (SVD)
- Activities:
- Practice via small matrix coding problems (NumPy or PyTorch)
- Resources:
- 3Blue1Brown: Essence of Linear Algebra
- Stanford CS229 Linear Algebra Review
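The "small matrix coding problems" activity can start from the week's core operations in NumPy, for example (the vectors and matrix below are arbitrary):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

dot = a @ b                      # dot product -> 3.0
norm_a = np.linalg.norm(a)       # Euclidean norm -> 5.0
proj = (dot / (b @ b)) * b       # projection of a onto b -> [3, 0]

# SVD factors A into U @ diag(S) @ Vt and reconstructs it exactly
A = np.array([[1.0, 2.0], [3.0, 4.0]])
U, S, Vt = np.linalg.svd(A)
A_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(A, A_rebuilt))  # -> True
```

Truncating `S` to its largest values gives the low-rank approximations that show up later in embeddings and model compression.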
Week 7: Calculus + Gradient Descent
- Calculus_and_Gradient_Descent
- Topics:
- Partial Derivatives
- Chain Rule
- Gradients and optimization intuition
- Activities:
- Derive gradients of simple functions
- Visualize gradient descent in 2D
- Resources:
- Khan Academy Calculus (focus on multivariable sections)
- Gradient Descent Visualization (YouTube)
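A minimal version of the 2D gradient descent activity, on the bowl-shaped function f(x, y) = x² + y² whose gradient is (2x, 2y) (starting point and learning rate are arbitrary choices):

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 + y^2 at point p."""
    return 2 * p

p = np.array([3.0, -2.0])   # arbitrary starting point
lr = 0.1                    # learning rate
for _ in range(100):
    p = p - lr * grad(p)    # step against the gradient

print(p)  # very close to the minimum at (0, 0)
```

Each step multiplies the coordinates by (1 - 2·lr) = 0.8, so the iterates shrink geometrically toward the origin; plotting the path of `p` over the iterations gives the visualization this week asks for.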
Week 8-9: Neural Networks & Backpropagation
- Neural_Networks_and_Deep_Learning_Overview
- Topics:
- Introduction to Perceptron Algorithm
- Structure of a feedforward neural network
- Activation functions (ReLU, softmax)
- Backpropagation algorithm
- Activities:
- Implement a simple NN from scratch (e.g., on MNIST or XOR)
- Derive gradient of softmax + cross-entropy
- Resources:
- Michael Nielsen's NN book: http://neuralnetworksanddeeplearning.com/
- CS231n lecture on backprop
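For the "derive gradient of softmax + cross-entropy" activity, the combined gradient has the famously simple form dL/dz = softmax(z) − one_hot(y). A sketch that checks the derivation numerically with finite differences (logits and class index below are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def loss(z, y):
    """Cross-entropy loss for logits z and true class index y."""
    return -np.log(softmax(z)[y])

z = np.array([2.0, 1.0, 0.1])      # arbitrary logits
y = 0                              # true class index

# Analytic gradient: softmax(z) - one_hot(y)
analytic = softmax(z).copy()
analytic[y] -= 1.0

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # -> True
```

The same gradient-check trick is worth reusing when implementing the from-scratch network: compare every backprop gradient against finite differences before trusting the training loop.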
Week 10: Integration and Project
- Integration_and_Project
- Goal:
- Build a mini-project combining n-gram + neural net ideas
- Example: Predict the next word using both n-gram and a small MLP
- Outcome:
- Review all learned concepts
- Prepare to transition to Happy-LLM's transformer section
Phase 2: Modern Deep Learning - Attention & Transformers
Week 11-12: Attention Mechanisms
- Attention & Transformers Overview
- Topics:
- Why attention? The sequence-to-sequence motivation
- Attention fundamentals: Query, Key, Value framework
- Attention mathematics: Scaled dot-product attention
- Attention vs. RNNs: parallelization and long-range dependencies
- Activities:
- Implement basic attention mechanism from scratch
- Visualize attention weights on simple sequences
- Compare attention to fixed-context approaches
- Resources:
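The "implement basic attention from scratch" activity reduces to the week's core formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. A minimal NumPy sketch (shapes and random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns outputs and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query vectors of dimension 4
K = rng.normal(size=(5, 4))   # 5 key vectors
V = rng.normal(size=(5, 4))   # 5 value vectors

out, weights = attention(Q, K, V)
print(out.shape)              # (3, 4): one output per query
```

The `weights` matrix is exactly what the visualization activity plots: row i shows how query i distributes its attention over the five keys. The √d_k scaling keeps the dot products from saturating the softmax as dimensions grow.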
Week 13-14: Self-Attention & Multi-Head Attention
- Self-Attention Overview
- Multi-Head Attention Overview
- Topics:
- Self-attention mechanism: attending within a sequence
- Permutation equivariance and the need for positional encoding
- Positional encodings: sinusoidal vs. learned
- Multi-head attention: parallel attention heads
- What do different attention heads learn?
- Activities:
- Implement self-attention from scratch (NumPy or PyTorch)
- Implement multi-head attention
- Visualize attention patterns across multiple heads
- Add positional encodings and observe effects
- Resources:
- Attention? Attention! - Lilian Weng
- The Annotated Transformer - Harvard NLP
- Original paper: Attention is All You Need
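For the positional-encoding activity, the sinusoidal scheme from "Attention Is All You Need" uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch (sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
```

These encodings are simply added to the token embeddings; without them, self-attention is permutation equivariant and cannot tell word order apart, which is easy to demonstrate in the "observe effects" part of the activity.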
Week 15-16: Transformer Architecture
- Transformer Architecture Overview
- Topics:
- Full transformer architecture: encoder and decoder stacks
- Encoder: multi-head self-attention + feed-forward networks
- Decoder: masked self-attention + cross-attention + feed-forward
- Residual connections and layer normalization
- Training transformers: learning rate warmup, label smoothing
- Activities:
- Implement a complete transformer from scratch
- Train on a small machine translation task
- Experiment with different hyperparameters (heads, layers, dimensions)
- Analyze attention patterns in trained model
- Resources:
- Andrej Karpathy: Let's build GPT
- Transformer Explainer - Interactive visualization
- BertViz - Attention visualization tool
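The decoder's masked self-attention listed above is implemented by setting the scores for future positions to −∞ before the softmax, so each position can attend only to itself and earlier positions. A minimal sketch of the causal mask (sequence length and scores are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.default_rng(1).normal(size=(n, n))  # raw attention scores

# True above the diagonal = future positions, which must be hidden
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)   # exp(-inf) = 0 after softmax

weights = softmax(scores, axis=-1)
print(np.triu(weights, k=1).sum())         # 0.0: no attention to the future
```

This is the single change that turns bidirectional self-attention into the causal attention used for autoregressive generation, and it is the mask Karpathy builds early in "Let's build GPT".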
Week 17-18: BERT and GPT - Modern Applications
- BERT & GPT Overview
- Topics:
- BERT (encoder-only): Masked language modeling, bidirectional context
- GPT (decoder-only): Causal language modeling, autoregressive generation
- Pre-training vs. fine-tuning paradigm
- Transfer learning with transformers
- Prompt engineering and few-shot learning (GPT-3)
- Instruction tuning and RLHF (ChatGPT, InstructGPT)
- Activities:
- Fine-tune a pre-trained BERT model for text classification
- Generate text with GPT-2/GPT-3 using different prompting strategies
- Compare encoder-only vs. decoder-only architectures
- Experiment with prompt engineering for various tasks
- Resources:
- BERT paper - Devlin et al.
- GPT-2 paper - Radford et al.
- GPT-3 paper - Brown et al.
- Hugging Face Transformers Course
- Stanford CS224N: Pre-training
Week 19-20: Advanced Topics & Integration Project
- Topics:
- Efficient transformers: Sparse attention, Linformer, Reformer
- Long-context models: Relative position encoding, ALiBi
- Vision transformers (ViT): applying transformers to images
- Multimodal transformers: CLIP, DALL-E, GPT-4
- State-space models and alternatives to attention
- Integration Project:
- Build an end-to-end NLP application using transformers
- Examples:
- Question answering system with BERT
- Text summarization with BART/T5
- Chatbot with GPT-2 fine-tuning
- Code generation assistant
- Resources: