Saturday, January 24, 2026

Self Attention

 

x → Embedding → MultiHeadAttention (per-head Q/K/V projections to a lower dim) → Concat heads → Project back to embed_dim → Add(x) → LayerNorm → FFN → Add → LayerNorm
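A minimal sketch of this block in PyTorch (the dims d_model=512, n_heads=8, d_ff=2048 are illustrative assumptions, not values from the post; nn.MultiheadAttention performs the per-head projections, concat, and output projection internally):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Fuses per-head Q/K/V projections, concat, and the output projection
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (Batch, Seq Len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # Add(x) -> LayerNorm
        x = self.norm2(x + self.ffn(x))      # FFN -> Add -> LayerNorm
        return x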



Vocab to embedding
torch.nn.Embedding(vocab_size, embed_dim)


Batch X Seq Len token ids (conceptually one-hot: Batch X Seq Len X Vocab) → Batch X Seq Len X embed_dim
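A quick shape check (the sizes here are made-up):

import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 512
emb = nn.Embedding(vocab_size, embed_dim)

tokens = torch.randint(0, vocab_size, (2, 16))   # (Batch=2, Seq Len=16) token ids
print(emb(tokens).shape)                         # torch.Size([2, 16, 512])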


PE (positional encoding): Batch X Seq Len X embed_dim, the same shape as the embeddings so the two can be added elementwise
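One common way to build a PE of that shape is the fixed sinusoidal encoding from "Attention Is All You Need"; a sketch assuming that variant, since the post does not name one:

import math
import torch

def sinusoidal_pe(seq_len, embed_dim):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)       # (Seq Len, 1)
    div = torch.exp(torch.arange(0, embed_dim, 2, dtype=torch.float)
                    * (-math.log(10000.0) / embed_dim))               # (embed_dim / 2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(pos * div)    # even dims get sin
    pe[:, 1::2] = torch.cos(pos * div)    # odd dims get cos
    return pe.unsqueeze(0)                # (1, Seq Len, embed_dim), broadcasts over Batch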



Self Attention - Q, K matrices & attention weight scores
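A single-head sketch of how Q and K are produced from x and turned into scores (dims are illustrative; the 1/sqrt(d_k) scaling follows the standard scaled dot-product formulation):

import math
import torch
import torch.nn as nn

B, T, d_model, d_k = 2, 16, 512, 64
x = torch.randn(B, T, d_model)

W_q = nn.Linear(d_model, d_k, bias=False)   # learned Q projection
W_k = nn.Linear(d_model, d_k, bias=False)   # learned K projection

Q, K = W_q(x), W_k(x)                                # (B, T, d_k) each
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (B, T, T) raw attention scores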


Self Attention - Attention weight softmax example
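A tiny worked example of the softmax step (made-up scores for one query over three keys):

import torch

scores = torch.tensor([2.0, 1.0, 0.1])    # raw scores for one query
weights = torch.softmax(scores, dim=-1)   # exponentiate and normalize to sum to 1
print(weights)                            # tensor([0.6590, 0.2424, 0.0986])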



Self Attention - Attention-weighted features
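Putting the pieces together: the softmaxed weights mix the value vectors into attention-weighted features (a self-contained sketch with the same illustrative dims as above):

import torch
import torch.nn as nn

B, T, d_model, d_k = 2, 16, 512, 64
x = torch.randn(B, T, d_model)
W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))

Q, K, V = W_q(x), W_k(x), W_v(x)
attn = torch.softmax(Q @ K.transpose(-2, -1) / d_k**0.5, dim=-1)   # (B, T, T), rows sum to 1
features = attn @ V                                                # (B, T, d_k) weighted sum of values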


Linear Attention
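A sketch of the reordering trick behind linear attention, assuming the kernel feature map phi(x) = elu(x) + 1 of Katharopoulos et al. (2020), since the post does not name a variant. Replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) avoids materializing the T x T score matrix, so cost grows linearly in sequence length:

import torch
import torch.nn.functional as F

B, T, d_k = 2, 16, 64
Q, K, V = (torch.randn(B, T, d_k) for _ in range(3))

phi = lambda x: F.elu(x) + 1                       # positive feature map

kv = phi(K).transpose(-2, -1) @ V                  # (B, d_k, d_k): no (T, T) matrix needed
z = phi(Q) @ phi(K).sum(dim=1, keepdim=True).transpose(-2, -1)   # (B, T, 1) normalizer
out = (phi(Q) @ kv) / z                            # (B, T, d_k) attention-weighted features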


