Learning Plan
Week 0: Overview of ML Fundamentals
The ML fundamentals section introduces model evaluation, classical algorithms, and related concepts, which become the building blocks for the topics in the following weeks.
Week 1-2: Probability Foundations + Markov Assumption
- Probability_and_Markov_Overview
- Topics:
- Conditional probability and Bayes' rule
- Naive Bayes and Gaussian Naive Bayes
- Joint and marginal distributions
- Markov Assumption: what it is and why it matters in NLP
- Resources:
- StatQuest: Conditional Probability (YouTube)
- StatQuest: Bayes' Rule
- 3Blue1Brown: Bayes theorem, the geometry of changing beliefs
- StatQuest: Naive Bayes
- StatQuest: Gaussian Naive Bayes
- Khan Academy - Probability & Statistics
- Speech and Language Processing by Jurafsky & Martin Ch. 3 (Markov models)
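As a warm-up for this week, Bayes' rule and the law of total probability can be worked through numerically. A minimal sketch on a toy spam-filter example (all probabilities below are made-up numbers for illustration):

```python
# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam = 0.3                 # prior: fraction of messages that are spam
p_word_given_spam = 0.8      # likelihood: word appears in spam
p_word_given_ham = 0.1       # likelihood: word appears in non-spam

# Marginal P(word) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: probability the message is spam given the word appears
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # -> 0.7742
```

Seeing the word moves the probability of spam from the 0.3 prior up to about 0.77, which is exactly the "updating beliefs" picture from the 3Blue1Brown video.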
Week 3: N-gram Models & Language Modeling
- Ngram_Language_Modeling
- Topics:
- What is an n-gram?
- How n-gram language models work
- Perplexity and limitations of n-gram models
- Activities:
- Implement a bigram/trigram model on a toy corpus
- Resources:
- The Illustrated Transformer - start with n-gram part
- Happy-LLM intro chapter
- Optional: n-gram language model notebook
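The bigram activity above can be sketched in a few lines with a maximum-likelihood estimate and no smoothing (the toy corpus here is invented for illustration):

```python
from collections import defaultdict, Counter

# Toy corpus for a minimal bigram language model
corpus = "the cat sat on the mat the cat ran".split()

# counts[w1][w2] = number of times w1 is immediately followed by w2
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def bigram_prob(w1, w2):
    """P(w2 | w1), estimated by maximum likelihood (no smoothing)."""
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

# "the" is followed by cat, mat, cat -> P(cat | the) = 2/3
print(bigram_prob("the", "cat"))
```

Extending this to trigrams just means keying the counts on word pairs; the zero probabilities for unseen bigrams are a good concrete entry point into the smoothing and sparsity limitations discussed this week.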
Week 4: Intro to Information Theory
- Topics:
- Entropy, Cross-Entropy, KL Divergence
- Why they matter in language modeling
- Activities:
- Manually compute entropy of a simple probability distribution
- Implement cross-entropy loss
- Resources:
- 3Blue1Brown: But what is entropy?
- Stanford CS224n Lecture 1
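Both activities for this week fit in a few lines. A minimal sketch for discrete distributions, using base-2 logs so the results are in bits (the distributions `p` and `q` below are arbitrary examples):

```python
import math

def entropy(p):
    """H(p) = -sum p_i log2 p_i, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log2 q_i: expected code length using model q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): the extra bits paid for using q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # "true" distribution
q = [1/3, 1/3, 1/3]     # model distribution
print(entropy(p))        # -> 1.5 bits
```

The connection to language modeling: training minimizes the cross-entropy between the data distribution and the model, and perplexity from Week 3 is just `2 ** cross_entropy`.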
Week 5-6: Linear Algebra for ML
- Linear_Algebra_for_ML
- Topics:
- Vectors, Matrices, Matrix Multiplication
- Dot product, norms, projections
- Eigenvalues & Singular Value Decomposition (SVD)
- Activities:
- Practice via small matrix coding problems (NumPy or PyTorch)
- Resources:
- 3Blue1Brown: Essence of Linear Algebra
- Stanford CS229 Linear Algebra Review
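The "small matrix coding problems" activity can start from the week's core operations in NumPy, for example (the vectors and matrix below are arbitrary):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

dot = a @ b                      # dot product -> 3.0
norm_a = np.linalg.norm(a)       # Euclidean norm -> 5.0
proj = (dot / (b @ b)) * b       # projection of a onto b -> [3, 0]

# SVD factors A into U @ diag(S) @ Vt and reconstructs it exactly
A = np.array([[1.0, 2.0], [3.0, 4.0]])
U, S, Vt = np.linalg.svd(A)
A_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(A, A_rebuilt))  # -> True
```

Truncating `S` to its largest values gives the low-rank approximations that show up later in embeddings and model compression.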
Week 7: Calculus + Gradient Descent
- Calculus_and_Gradient_Descent
- Topics:
- Partial Derivatives
- Chain Rule
- Gradients and optimization intuition
- Activities:
- Derive gradients of simple functions
- Visualize gradient descent in 2D
- Resources:
- Khan Academy Calculus (focus on multivariable sections)
- Gradient Descent Visualization (YouTube)
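A minimal version of the 2D gradient descent activity, on the bowl-shaped function f(x, y) = x² + y² whose gradient is (2x, 2y) (starting point and learning rate are arbitrary choices):

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 + y^2 at point p."""
    return 2 * p

p = np.array([3.0, -2.0])   # arbitrary starting point
lr = 0.1                    # learning rate
for _ in range(100):
    p = p - lr * grad(p)    # step against the gradient

print(p)  # very close to the minimum at (0, 0)
```

Each step multiplies the coordinates by (1 - 2·lr) = 0.8, so the iterates shrink geometrically toward the origin; plotting the path of `p` over the iterations gives the visualization this week asks for.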
Week 8-9: Neural Networks & Backpropagation
- Neural_Networks_and_Deep_Learning_Overview
- Topics:
- Introduction to Perceptron Algorithm
- Structure of a feedforward neural network
- Activation functions (ReLU, softmax)
- Backpropagation algorithm
- Activities:
- Implement a simple NN from scratch (e.g., on MNIST or XOR)
- Derive gradient of softmax + cross-entropy
- Resources:
- Michael Nielsen's NN book: http://neuralnetworksanddeeplearning.com/
- CS231n lecture on backprop
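For the "derive gradient of softmax + cross-entropy" activity, the combined gradient has the famously simple form dL/dz = softmax(z) − one_hot(y). A sketch that checks the derivation numerically with finite differences (logits and class index below are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def loss(z, y):
    """Cross-entropy loss for logits z and true class index y."""
    return -np.log(softmax(z)[y])

z = np.array([2.0, 1.0, 0.1])      # arbitrary logits
y = 0                              # true class index

# Analytic gradient: softmax(z) - one_hot(y)
analytic = softmax(z).copy()
analytic[y] -= 1.0

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # -> True
```

The same gradient-check trick is worth reusing when implementing the from-scratch network: compare every backprop gradient against finite differences before trusting the training loop.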
Week 10: Integration and Project
- Integration_and_Project
- Goal:
- Build a mini-project combining n-gram + neural net ideas
- Example: Predict the next word using both n-gram and a small MLP
- Outcome:
- Review all learned concepts
- Prepare to transition to Happy-LLM's transformer section
Phase 2: Modern Deep Learning - Attention & Transformers
Week 11-12: Attention Mechanisms
- Attention & Transformers Overview
- Topics:
- Why attention? The sequence-to-sequence motivation
- Attention fundamentals: Query, Key, Value framework
- Attention mathematics: Scaled dot-product attention
- Attention vs. RNNs: parallelization and long-range dependencies
- Activities:
- Implement basic attention mechanism from scratch
- Visualize attention weights on simple sequences
- Compare attention to fixed-context approaches
- Resources:
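The "implement basic attention from scratch" activity reduces to the week's core formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. A minimal NumPy sketch (shapes and random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns outputs and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query vectors of dimension 4
K = rng.normal(size=(5, 4))   # 5 key vectors
V = rng.normal(size=(5, 4))   # 5 value vectors

out, weights = attention(Q, K, V)
print(out.shape)              # (3, 4): one output per query
```

The `weights` matrix is exactly what the visualization activity plots: row i shows how query i distributes its attention over the five keys. The √d_k scaling keeps the dot products from saturating the softmax as dimensions grow.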
Week 13-14: Self-Attention & Multi-Head Attention
- Self-Attention Overview
- Multi-Head Attention Overview
- Topics:
- Self-attention mechanism: attending within a sequence
- Permutation equivariance and the need for positional encoding
- Positional encodings: sinusoidal vs. learned
- Multi-head attention: parallel attention heads
- What do different attention heads learn?
- Activities:
- Implement self-attention from scratch (NumPy or PyTorch)
- Implement multi-head attention
- Visualize attention patterns across multiple heads
- Add positional encodings and observe effects
- Resources:
- Attention? Attention! - Lilian Weng
- The Annotated Transformer - Harvard NLP
- Original paper: Attention is All You Need
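For the positional-encoding activity, the sinusoidal scheme from "Attention Is All You Need" uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch (sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
```

These encodings are simply added to the token embeddings; without them, self-attention is permutation equivariant and cannot tell word order apart, which is easy to demonstrate in the "observe effects" part of the activity.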
Week 15-16: Transformer Architecture
- Transformer Architecture Overview
- Topics:
- Full transformer architecture: encoder and decoder stacks
- Encoder: multi-head self-attention + feed-forward networks
- Decoder: masked self-attention + cross-attention + feed-forward
- Residual connections and layer normalization
- Training transformers: learning rate warmup, label smoothing
- Activities:
- Implement a complete transformer from scratch
- Train on a small machine translation task
- Experiment with different hyperparameters (heads, layers, dimensions)
- Analyze attention patterns in trained model
- Resources:
- Andrej Karpathy: Let's build GPT
- Transformer Explainer - Interactive visualization
- BertViz - Attention visualization tool
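The decoder's masked self-attention listed above is implemented by setting the scores for future positions to −∞ before the softmax, so each position can attend only to itself and earlier positions. A minimal sketch of the causal mask (sequence length and scores are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.default_rng(1).normal(size=(n, n))  # raw attention scores

# True above the diagonal = future positions, which must be hidden
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)   # exp(-inf) = 0 after softmax

weights = softmax(scores, axis=-1)
print(np.triu(weights, k=1).sum())         # 0.0: no attention to the future
```

This is the single change that turns bidirectional self-attention into the causal attention used for autoregressive generation, and it is the mask Karpathy builds early in "Let's build GPT".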
Week 17-18: BERT and GPT - Modern Applications
- BERT & GPT Overview
- Topics:
- BERT (encoder-only): Masked language modeling, bidirectional context
- GPT (decoder-only): Causal language modeling, autoregressive generation
- Pre-training vs. fine-tuning paradigm
- Transfer learning with transformers
- Prompt engineering and few-shot learning (GPT-3)
- Instruction tuning and RLHF (ChatGPT, InstructGPT)
- Activities:
- Fine-tune a pre-trained BERT model for text classification
- Generate text with GPT-2/GPT-3 using different prompting strategies
- Compare encoder-only vs. decoder-only architectures
- Experiment with prompt engineering for various tasks
- Resources:
- BERT paper - Devlin et al.
- GPT-2 paper - Radford et al.
- GPT-3 paper - Brown et al.
- Hugging Face Transformers Course
- Stanford CS224N: Pre-training
Week 19-20: Advanced Topics & Integration Project
- Topics:
- Efficient transformers: Sparse attention, Linformer, Reformer
- Long-context models: Relative position encoding, ALiBi
- Vision transformers (ViT): applying transformers to images
- Multimodal transformers: CLIP, DALL-E, GPT-4
- State-space models and alternatives to attention
- Integration Project:
- Build an end-to-end NLP application using transformers
- Examples:
- Question answering system with BERT
- Text summarization with BART/T5
- Chatbot with GPT-2 fine-tuning
- Code generation assistant
- Resources: