guts of the transformer
The Guts
A standard Transformer is an encoder-decoder model.
There are 4 main steps in the encoder (sketched in code right after the list):
- Word Embedding: Because numbers not strings.
- Positional Encoding: Because order matters, and unlike a recurrent model, a transformer has no inherent sense of order.
- Self-Attention: Because for each word in the input sequence we need to know how relevant every other word in the input sequence is to it.
- Residual Connections between 2 and 3: Because otherwise deep models tend to wash out what earlier layers computed, and gradients have a harder time flowing back.
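To make those 4 steps concrete, here's a minimal NumPy sketch: toy dimensions, random numbers standing in for the learned embeddings and the `Wq`/`Wk`/`Wv` projection matrices (names are mine), a single attention head, and none of the normalization or feed-forward machinery.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model)[None, :]                    # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])               # even dimensions get sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])               # odd dimensions get cosine
    return pe

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # how relevant every word is to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V

# toy example: 4 "words", model dimension 8
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))                # step 1: word embeddings (random stand-ins)
X = X + positional_encoding(seq_len, d_model)          # step 2: add positional encoding
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
attended = self_attention(X, Wq, Wk, Wv)               # step 3: self-attention
out = X + attended                                     # step 4: residual connection
```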
There are 5 main steps in the decoder:
- Word Embedding: Because numbers not strings.
- Positional Encoding: Because order matters, and unlike a recurrent model, a transformer has no inherent sense of order.
- Self-Attention: Because for each word in the output sequence we need to know how relevant every other word in the output sequence is to it.
- Encoder-Decoder Attention: Because for every word in the output sequence we need to know how relevant each input word is to it (a sketch follows this list).
- Residual Connections between 2 and 3, and between 3 and 4: Because otherwise deep models tend to wash out what earlier layers computed, and gradients have a harder time flowing back.
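The genuinely new piece versus the encoder is step 4. Continuing the same kind of sketch (single head, made-up `Wq`/`Wk`/`Wv`, no normalization), encoder-decoder attention is just attention where the queries come from the decoder side and the keys/values come from the encoder output:

```python
import numpy as np

def encoder_decoder_attention(dec_x, enc_out, Wq, Wk, Wv):
    """Cross-attention: queries from the output sequence, keys/values from the encoder output."""
    Q = dec_x @ Wq                                     # what each output word is looking for
    K, V = enc_out @ Wk, enc_out @ Wv                  # what each input word offers
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # relevance of each input word to each output word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the input sequence
    return weights @ V                                 # shape: (dec_len, d_model)
```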
Plus a bunch of layer normalization sprinkled in between.
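Concretely, that's layer normalization applied around each of the sub-layers above. A bare-bones version (the real thing also has a learnable scale and shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# the original Transformer applies it right after each residual connection:
# x = layer_norm(x + sublayer(x))
```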
The Variants
| Type | Details | Example | Use case |
|---|---|---|---|
| Decoder-only | Uses Masked Self-Attention instead of regular Self-Attention (in step 3 of the decoder) because we want each word to attend only to the previous words, not future words, i.e. it is auto-regressive (see the sketch below the table). | GPT | Expansion: given a prompt, generate a response. |
| Encoder-only | Uses regular Self-Attention (in step 3 of the encoder) because we want each word to attend to all other words in the input sequence, whether before or after it. | BERT | Extraction: given some text, generate a context-aware embedding useful for downstream tasks like classification, clustering, similarity search for RAG, etc. |
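To see the masking difference concretely, here's the self-attention sketch from above with one change for the decoder-only / GPT style: a causal mask pushes the scores for future positions to a large negative number before the softmax, so their weights come out as effectively zero. The function name is mine, not from any library.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention with a causal mask: position i can only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal = future words
    scores = np.where(future, -1e9, scores)                  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

An encoder-only model like BERT just skips the mask, so each word's embedding can draw on words both before and after it.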