Overview
The Position Problem¶
Self-attention is permutation-equivariant: if you shuffle the input tokens, you get the same output vectors, just shuffled. The mechanism itself carries no notion of order.
Consider these two sentences:

1. "The cat chased the mouse"
2. "The mouse chased the cat"
Without positional information, self-attention would produce identical representations (just reordered), even though the meanings are opposite!
The Problem: Attention mechanisms have no inherent notion of sequence order.
The Solution: Positional encoding - inject position information into the model.
Why RNNs Don't Need This¶
Recurrent Neural Networks (RNNs) process sequentially:
```
h_1 = f(x_1, h_0)
h_2 = f(x_2, h_1)  ← knows it comes after x_1
h_3 = f(x_3, h_2)  ← knows it comes after x_2
```
Position information is implicit in the recurrence structure.
Transformers: Process all positions in parallel → must explicitly encode positions.
Positional Encoding Methods¶
1. Sinusoidal Positional Encoding (Original Transformer)¶
Add position-dependent patterns using sine and cosine functions:

\[
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]

Where:

- \(pos\): position in sequence (0, 1, 2, ...)
- \(i\): index of the sine/cosine pair (0 to \(d_{model}/2 - 1\))
- \(d_{model}\): model dimension
2. Learned Positional Embeddings (BERT, GPT)¶
Learn position embeddings the same way as word embeddings:

\[
PE(pos) = E_{pos}
\]

Where each position \(pos\) has its own learnable vector \(E_{pos} \in \mathbb{R}^{d_{model}}\).
Sinusoidal Positional Encoding¶
Intuition¶
Use different frequencies for different dimensions:

- Low frequencies: Slowly varying across positions (capture global structure)
- High frequencies: Rapidly varying (capture fine-grained position differences)

Think of it like a binary clock with smooth transitions:

- Least significant bit: Alternates every position
- Most significant bit: Changes very slowly
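This intuition can be checked directly by printing the wavelength (period, in positions) of a few sin/cos pairs for a 128-dimensional model; a quick illustration, not library code:

```python
import math

d_model = 128
for i in [0, 16, 32, 48, 63]:  # a few of the 64 sin/cos pair indices
    # Each pair oscillates as sin(pos / 10000^(2i/d_model)),
    # so its period is 2*pi * 10000^(2i/d_model) positions
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"pair {i:2d}: wavelength ~ {wavelength:12.1f} positions")
```

Pair 0 completes a full cycle every ~6.3 positions, while the last pair takes tens of thousands of positions: the fast pairs are the "least significant bits", the slow pairs the "most significant".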
Mathematical Formula¶
For position \(pos\) and pair index \(i\):

\[
PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]

Even dimensions use sine; odd dimensions use cosine.
Example: 4-Dimensional Encoding¶
For \(d_{model} = 4\), the divisors are \(10000^{0} = 1\) (dims 0, 1) and \(10000^{1/2} = 100\) (dims 2, 3):

```
PE[0] = [sin(0/1), cos(0/1), sin(0/100), cos(0/100)]
      = [0.000, 1.000, 0.000, 1.000]

PE[1] = [sin(1/1), cos(1/1), sin(1/100), cos(1/100)]
      = [0.841, 0.540, 0.010, 1.000]

PE[2] = [sin(2/1), cos(2/1), sin(2/100), cos(2/100)]
      = [0.909, -0.416, 0.020, 1.000]
```
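The rows above can be reproduced in a few lines of PyTorch, as a quick sanity check of the formula:

```python
import torch

d_model = 4
positions = torch.arange(3).unsqueeze(1).float()  # positions 0, 1, 2 -> shape (3, 1)
# Divisors 10000^(2i/d_model) for pair indices i = 0, 1 -> [1.0, 100.0]
div_term = 10000.0 ** (torch.arange(0, d_model, 2).float() / d_model)

PE = torch.zeros(3, d_model)
PE[:, 0::2] = torch.sin(positions / div_term)  # even dimensions
PE[:, 1::2] = torch.cos(positions / div_term)  # odd dimensions

print(PE[1])  # ~ [0.841, 0.540, 0.010, 1.000]
```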
Visualization¶
For a 128-dim model and 50 positions, positional encoding looks like a heatmap:
```
         Dim 0   Dim 1   Dim 2   Dim 3  ...  Dim 127
Pos 0  [  0.00    1.00    0.00    1.00  ...    1.00 ]
Pos 1  [  0.84    0.54    0.76    0.65  ...    1.00 ]
Pos 2  [  0.91   -0.42    0.99   -0.16  ...    1.00 ]
...
Pos 49 [ -0.95    0.30   -1.00    0.02  ...    1.00 ]
```

(The highest dimensions vary so slowly that they are still ≈ 1.00 even at position 49.)
Each row is a unique "fingerprint" for that position.
Why Sinusoidal Works¶
1. Unique Encodings¶
Each position gets a unique vector - no two positions have the same encoding.
2. Relative Position Information¶
For any fixed offset \(k\), the encoding at position \(pos + k\) is a linear function of the encoding at position \(pos\):

\[
PE(pos + k) = T_k \, PE(pos)
\]

Where \(T_k\) is a linear transformation (a block-diagonal rotation matrix) that depends only on \(k\), not on \(pos\).
This means the model can learn to attend based on relative positions, not just absolute positions.
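This is easy to verify numerically for a single sin/cos pair: the 2×2 rotation matrix for angle \(k\) maps the pair at \(pos\) onto the pair at \(pos + k\). (Here the pair's frequency is fixed at 1; the full \(T_k\) is block-diagonal with one such rotation per frequency.)

```python
import math
import torch

pos, k = 5.0, 3.0  # arbitrary position and offset

pe_pos     = torch.tensor([math.sin(pos), math.cos(pos)])
pe_shifted = torch.tensor([math.sin(pos + k), math.cos(pos + k)])

# Rotation matrix that depends only on the offset k, not on pos
T_k = torch.tensor([[ math.cos(k), math.sin(k)],
                    [-math.sin(k), math.cos(k)]])

print(torch.allclose(T_k @ pe_pos, pe_shifted, atol=1e-5))  # True
```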
3. Generalization to Longer Sequences¶
Sinusoidal encodings generalize to sequence lengths not seen during training:

- Trained on sequences up to 512 tokens
- Can handle 1024+ tokens at inference
- Encodings are deterministic, not learned
4. Smooth Transitions¶
Adjacent positions have similar encodings:

- \(PE(pos)\) and \(PE(pos+1)\) are close in embedding space
- Changes are gradual across the sequence
- Helps the model learn smooth position-dependent functions
Learned Positional Embeddings¶
Approach¶
Treat positions like vocabulary tokens:
```python
# Create a learnable embedding vector for each position index
self.position_embedding = nn.Embedding(max_seq_length, d_model)

# At forward pass
pos_ids = torch.arange(seq_len)  # [0, 1, 2, ..., seq_len-1]
pos_encodings = self.position_embedding(pos_ids)
```
Each position index (0, 1, 2, ...) has a learned \(d_{model}\)-dimensional vector.
Advantages¶
- Flexibility: Can learn arbitrary position patterns
- Task-specific: Adapts to specific task needs
- Empirically strong: Often performs as well as or better than sinusoidal
Disadvantages¶
- Fixed maximum length: Can't generalize beyond training length
- More parameters: \(O(\text{max\_seq\_length} \times d_{model})\)
- Less interpretable: No mathematical structure
Used In¶
- BERT: Learned positions (max 512 tokens)
- GPT: Learned positions
- T5: Relative position biases (variant)
How to Add Positional Encoding¶
Option 1: Addition (Standard)¶
Add positional encoding to token embeddings:

\[
\text{Input} = \text{Token Embeddings} + \text{Positional Encodings}
\]

Why addition?

- Simple and effective
- Preserves dimensionality
- Token and position information mix naturally
Option 2: Concatenation (Rare)¶
Concatenate token and positional embeddings:

\[
\text{Input} = [\,\text{Token Embeddings}\,;\,\text{Positional Encodings}\,]
\]

Drawbacks:

- Doubles the dimension
- More parameters downstream
- Less commonly used
Standard practice: Use addition.
Implementation: Sinusoidal¶
```python
import math
import torch

def create_sinusoidal_positional_encoding(seq_len, d_model):
    """
    Create sinusoidal positional encoding matrix.

    Args:
        seq_len: Maximum sequence length
        d_model: Model dimension

    Returns:
        PE: (seq_len, d_model) positional encoding matrix
    """
    PE = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))
    # Even dimensions: sin
    PE[:, 0::2] = torch.sin(position * div_term)
    # Odd dimensions: cos
    PE[:, 1::2] = torch.cos(position * div_term)
    return PE

# Usage
seq_len = 100
d_model = 512
PE = create_sinusoidal_positional_encoding(seq_len, d_model)

# Add to token embeddings
token_embeddings = ...  # (batch, seq_len, d_model)
input_with_position = token_embeddings + PE[:token_embeddings.size(1), :]
```
Implementation: Learned¶
```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_seq_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(seq_len, device=self.position_embeddings.weight.device)
        return self.position_embeddings(positions)

# Usage
pos_encoder = LearnedPositionalEncoding(max_seq_len=512, d_model=512)
token_embeddings = ...  # (batch, seq_len, d_model)
pos_encodings = pos_encoder(token_embeddings.size(1))  # (seq_len, d_model)
input_with_position = token_embeddings + pos_encodings
```
Relative Positional Encoding (Advanced)¶
Instead of absolute positions, encode relative distances between tokens.
T5 Relative Position Biases¶
Add learned biases to attention scores based on distance:

\[
\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} + b_{i-j}
\]

Where \(b_{i-j}\) is a learned bias for the relative distance \(i-j\).

Advantages:

- Focuses on relative positions (often more important than absolute)
- Better generalization to longer sequences
- Used in T5, DeBERDa
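A minimal sketch of the idea, with distances simply clipped to a fixed window (T5's real scheme buckets distances logarithmically; `RelativeBias` and `max_distance` are illustrative names, not T5's API):

```python
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """One learned scalar bias per head per clipped relative distance."""
    def __init__(self, num_heads, max_distance=8):
        super().__init__()
        self.max_distance = max_distance
        # Distances are clipped to [-max_distance, max_distance]
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]  # rel[i, j] = j - i
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)  # (num_heads, seq_len, seq_len)

# The bias broadcasts over the batch dimension of the raw attention logits
bias = RelativeBias(num_heads=8)(seq_len=16)   # (8, 16, 16)
scores = torch.randn(2, 8, 16, 16) + bias      # (batch, heads, seq, seq)
```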
Rotary Position Embedding (RoPE)¶
Rotate query and key vectors based on position:

\[
q_m' = R_m q_m, \qquad k_n' = R_n k_n
\]

Where \(R_m\) is a rotation matrix for position \(m\).

Properties:

- The dot product \(q_m' \cdot k_n'\) depends only on the relative position \(m - n\)
- Very efficient
- Used in: GPT-Neo, PaLM, LLaMA
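A compact sketch of the rotation, pairing adjacent dimensions (production implementations differ in how they pair dimensions and cache the angles):

```python
import torch

def rope(x, base=10000.0):
    """Rotate adjacent dimension pairs of x by position-dependent angles.

    x: (seq_len, d) with d even; row m is treated as position m.
    """
    seq_len, d = x.shape
    pos = torch.arange(seq_len).float()[:, None]         # (seq_len, 1)
    freq = base ** (-torch.arange(0, d, 2).float() / d)  # (d/2,)
    angle = pos * freq                                   # (seq_len, d/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[:, 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out

# With the same underlying vectors placed at every position, the rotated
# dot product depends only on the offset m - n
q = torch.randn(1, 64).repeat(16, 1)
k = torch.randn(1, 64).repeat(16, 1)
q_rot, k_rot = rope(q), rope(k)
print(torch.allclose(q_rot[2] @ k_rot[5], q_rot[7] @ k_rot[10], atol=1e-4))  # True
```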
Comparing Methods¶
| Method | Pros | Cons | Used In |
|---|---|---|---|
| Sinusoidal | No extra params, generalizes beyond training length, interpretable | May underperform learned | Original Transformer |
| Learned Absolute | Flexible, task-adaptive, empirically strong | Fixed max length, more params | BERT, GPT |
| Relative (T5) | Generalizes well, focuses on relative distance | More complex, added computation | T5, DeBERTa |
| RoPE | Efficient, strong performance, relative position | More complex implementation | LLaMA, GPT-Neo |
When Position Matters Most¶
Position information is crucial for:
- Word Order: "dog bites man" vs "man bites dog"
- Syntax: Subject-verb-object order
- Temporal Sequences: "before" vs "after"
- Positional References: "first", "last", "next"
Less crucial for:

- Bag-of-words tasks (sentiment from keywords)
- Set-based reasoning (where order doesn't matter)
Ablation Studies¶
What happens without positional encoding?
Results (on various tasks):

- Translation: ~5-10 BLEU score drop
- Question Answering: ~10-15% accuracy drop
- Language Modeling: ~0.5-1.0 perplexity increase
Conclusion: Positional encoding is essential for most NLP tasks.
Visualizing Position Information¶
Similarity Matrix¶
Compute cosine similarity between positional encodings:
```
Similarity[i, j] = cos(PE[i], PE[j])
```

For sinusoidal encodings, you'll see:

- High similarity for nearby positions
- Periodic patterns (due to the sine/cosine waves)
- A gradual decrease with distance
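Computing the matrix takes only a few lines, reusing the sinusoidal construction from the implementation section above:

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

PE = sinusoidal_pe(50, 128)
PE_unit = PE / PE.norm(dim=1, keepdim=True)  # normalize rows to unit length
similarity = PE_unit @ PE_unit.T             # (50, 50) cosine similarities

# Nearby positions are more similar than distant ones
print(similarity[0, 1].item() > similarity[0, 20].item())  # True
```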
Attention Pattern Changes¶
With positional encoding:

- Attention patterns become position-aware
- Models can learn position-specific behaviors
- Enables local vs. global attention strategies
Summary¶
The Problem: Transformers process all tokens in parallel → no inherent position information
The Solution: Add positional encodings to input embeddings
Two Main Approaches:

1. Sinusoidal (Original Transformer)
    - Deterministic sine/cosine functions
    - Generalizes to unseen lengths
    - No extra parameters
2. Learned (BERT, GPT)
    - Learnable embedding per position
    - Task-adaptive
    - Fixed maximum length
Modern Variants:

- Relative position biases (T5)
- Rotary Position Embeddings (RoPE), used in LLaMA
Usage: \(\text{Input to Transformer} = \text{Token Embeddings} + \text{Positional Encodings}\)
Next Steps¶
Ready to see the full architecture? - Transformer Architecture
Want to implement positional encoding? - Positional Encoding Implementation
Interested in advanced position methods? - Relative Position Encoding
Related Topics¶
- Self-Attention - Why we need position information
- Transformer Architecture - Where positional encoding fits
- Vector Operations - Understanding embeddings