Guts of the Transformer

The Guts

A standard Transformer is an encoder-decoder model.

There are 4 main steps in the encoder:

  1. Word Embedding: Because numbers not strings.
  2. Positional Encoding: Because order matters and, unlike a recurrent model, a transformer has no inherent sense of it.
  3. Self-Attention: Because for each word in the input sequence we need to know how relevant every other word in the input sequence is to it.
  4. Residual Connections between 2 and 3: Because without them, deep stacks of layers lose track of earlier information and gradients struggle to flow back. (Steps 2-4 are sketched in code right after this list.)
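
To make steps 2-4 concrete, here's a minimal single-head NumPy sketch. The toy sizes and the randomly initialized `Wq`, `Wk`, `Wv` projection matrices are stand-ins for parameters a trained model would learn; the point is the shape of the computation, not a faithful implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Step 2: sinusoidal positional encoding, so the model can tell positions apart.
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (seq_len, d_model)

def self_attention(x, Wq, Wk, Wv):
    # Step 3: scaled dot-product self-attention (single head).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])                 # how relevant each word is to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over the input positions
    return weights @ v

# Toy dimensions; in a real model these come from the config.
seq_len, d_model = 5, 8
x = np.random.randn(seq_len, d_model)                       # step 1: pretend these are word embeddings
x = x + positional_encoding(seq_len, d_model)               # step 2: inject order information
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
out = x + self_attention(x, Wq, Wk, Wv)                     # steps 3 + 4: attention plus the residual connection
```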

There are 5 main steps in the decoder:

  1. Word Embedding: Because numbers not strings.
  2. Positional Encoding: Because order matters and, unlike a recurrent model, a transformer has no inherent sense of it.
  3. Self-Attention: Because for each word in the output sequence we need to know how relevant every other word in the output sequence is to it.
  4. Encoder-Decoder Attention: Because for every word in the output sequence we need to know how relevant each input word is to it.
  5. Residual Connections between 2 and 3, and between 3 and 4: Because without them, deep stacks of layers lose track of earlier information and gradients struggle to flow back. (Step 4 is sketched in code right after this list.)
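
The new piece relative to the encoder is step 4, so here's the same kind of toy NumPy sketch for encoder-decoder (cross) attention: queries come from the output sequence, keys and values from the encoder's output. Shapes and random matrices are again placeholders, not learned parameters.

```python
import numpy as np

def cross_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    # Step 4: queries come from the output (decoder) sequence,
    # keys and values come from the input (encoder) sequence.
    q = decoder_x @ Wq
    k, v = encoder_out @ Wk, encoder_out @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])                 # (out_len, in_len): relevance of each input word to each output word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy shapes; encoder_out would really be the output of the encoder stack.
in_len, out_len, d_model = 7, 5, 8
encoder_out = np.random.randn(in_len, d_model)
decoder_x = np.random.randn(out_len, d_model)               # decoder activations after steps 1-3
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
decoder_x = decoder_x + cross_attention(decoder_x, encoder_out, Wq, Wk, Wv)  # step 5: residual around step 4
```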

There's also a bunch of layer normalization sprinkled in between these steps.
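
A bare-bones sketch of what that normalization does (leaving out the learned gain and bias a real layer norm has):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Re-center and re-scale each position's vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

normalized = layer_norm(np.random.randn(5, 8))   # toy (seq_len, d_model) activations
```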

The Variants

| Types | Details | Example | Usecase |
| --- | --- | --- | --- |
| Decoder-only | Uses Masked Self-Attention instead of regular Self-Attention (in step 3 of the decoder) because we want each word to attend only to the previous words, not future words, i.e. auto-regressive (a toy version is sketched below the table). | GPT | Expansion: given a prompt, generate a response. |
| Encoder-only | Uses regular Self-Attention (in step 3 of the encoder) because we want each word to attend to all other words in the input sequence, whether before or after it. | BERT | Extraction: given some text, generate a context-aware embedding useful for downstream tasks like classification, clustering, similarity search for RAG, etc. |
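
To make the masked self-attention in the Decoder-only row concrete, here's a toy NumPy sketch of a causal mask: scores for future positions (above the diagonal) are pushed to a large negative value before the softmax, so each word effectively attends only to itself and the words before it. The random projections are, as before, stand-ins for learned weights.

```python
import numpy as np

def masked_self_attention(x, Wq, Wk, Wv):
    # Decoder-only (GPT-style) variant of step 3: each position may only
    # attend to itself and the positions before it.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal = future positions
    scores = np.where(future, -1e9, scores)                  # huge negative score -> ~zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq_len, d_model = 5, 8
x = np.random.randn(seq_len, d_model)                        # stand-in for embedded + position-encoded tokens
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
out = masked_self_attention(x, Wq, Wk, Wv)
```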