Attention Is All You Need
Paper: Attention Is All You Need
This is the 2017 paper that introduced the Transformer, the architecture behind most modern language models. The central idea is surprisingly clean: instead of processing sequences step by step with recurrence, the model uses attention to let every token look directly at every other token in the sequence and weight the relevant ones most heavily.
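To make "every token looks at every other token" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: the function names and the toy embeddings are made up, and the token vectors are used directly as queries, keys, and values, whereas the real model first passes them through learned linear projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); each argument is a list of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input (assumed values): three tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token attends to each of the others. No row depends on another row's result, so all positions can be computed at once, which is where the parallelism comes from.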
Key points
- The Transformer removes recurrence and convolution from sequence modeling. It relies on self-attention and feed-forward layers instead.
- Self-attention makes the model much easier to parallelize during training because tokens do not have to be processed one after another.
- Multi-head attention lets the model attend to different kinds of relationships at the same time. One head might learn syntax-like relationships, while another might track longer-range dependencies.
- Positional encodings are added to the input embeddings because attention alone has no built-in notion of word order. These encodings give the model a sense of where each token sits in the sequence.
- The encoder-decoder structure remains useful for translation: the encoder builds contextual representations of the input, and the decoder generates the output while attending to both previous output tokens and the encoded input.
- The paper showed strong machine translation results on WMT 2014 English-to-German and English-to-French, at a lower training cost than earlier recurrent or convolutional approaches.
- The bigger insight was not just better translation. It showed that attention could be the main mechanism for learning sequence relationships, which later became the foundation for BERT, GPT-style models, and much of modern AI.
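The positional-encoding point is easy to see in code. The sketch below follows the sinusoidal scheme the paper describes, where even dimensions use a sine and odd dimensions a cosine of the position at geometrically spaced frequencies. The function names and the toy embeddings are assumptions made for illustration.

```python
import math


def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position signals."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each (even, odd) pair of dimensions shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position signal to each token embedding, element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(emb, row)]
        for emb, row in zip(embeddings, pe)
    ]


# Toy input (assumed values): the same embedding repeated at three positions.
same = [[0.5, 0.5, 0.5, 0.5]] * 3
shifted = add_positions(same)
```

Without the added signal, the three identical embeddings in `same` would be indistinguishable to attention. After `add_positions`, each row of `shifted` is different, so the model can tell the positions apart.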
Why it matters
Before this paper, recurrent models like LSTMs were the default choice for sequence tasks. The Transformer changed that by making sequence learning more parallel, scalable, and flexible. That scalability is the reason the architecture became so important once researchers started training on very large datasets.
The title is almost comically bold, but it aged well.