Transformers have fundamentally reshaped the AI landscape — powering models like ChatGPT and driving major innovations across Google Search, recommendation engines, and enterprise analytics. From smarter user interfaces to advanced automation and real-time insight generation, transformer-based models such as GPT, BERT, and T5 are enabling businesses to streamline workflows, personalize customer experiences, uncover valuable insights, and accelerate product development.
But what makes transformers so effective, and how do they actually work?
In this blog post, we’ll break down the architecture that powers modern AI. You’ll learn why transformers were invented, the differences between encoder-only, decoder-only, and full transformer models, and when each is best suited — whether you’re a data scientist or machine learning engineer building applications, a product manager making roadmap decisions, or a business leader evaluating AI’s strategic value. We’ll also explore how the inner components work, with diagrams and practical examples to ground the theory in real-world use.
Before transformers, models like RNNs and LSTMs were used for sequential data. These models process tokens one at a time, which limits parallelism and makes it hard to capture long-range dependencies.
Transformers changed the game by using self-attention, which allows each token to directly consider all other tokens in the sequence simultaneously — capturing relationships between words, no matter how far apart they are.
This makes transformers faster to train, more scalable, and dramatically more powerful.
There are three main variants of the transformer architecture, each optimized for different types of tasks:
| Architecture | Example Models | Best For |
|---|---|---|
| Full Transformer | T5, BART, MarianMT | Translation, summarization, multimodal |
| Encoder-Only | BERT, RoBERTa | Classification, sentence similarity, QA |
| Decoder-Only | GPT-2/3/4, LLaMA, PaLM, Claude | Text generation, chat, code completion |
Full Transformer (encoder–decoder):

Input Tokens (e.g. English)
↓
[Encoder Stack]
↓
Context Representations
↓
[Decoder Stack with Cross-Attention]
↓
Output Tokens (e.g. French)
Encoder-Only:

Input Tokens
↓
[Encoder Stack]
↓
Contextual Embeddings → used for classification or sentence-level tasks
Decoder-Only:

Prompt/Input Tokens
↓
[Decoder Stack with Masked Self-Attention]
↓
Autoregressive Output → one token at a time
Each encoder layer in the stack follows the same sequence: multi-head self-attention, an add & norm step, and a position-wise feed-forward network. Before the first layer, an input embedding step converts each input token into a vector and adds a positional signal.
Each input token (like “The”, “cat”, “sleeps”) is mapped to a vector using a learned token embedding matrix. These vectors represent the meaning of each token in a high-dimensional space. Let’s break it down:
Given a vocabulary of size \(V\) and an embedding dimension \(d\), the token embedding matrix is \(E_{\text{token}} \in \mathbb{R}^{V \times d}\). If your sentence is ["The", "cat", "sleeps"] and these tokens map to token IDs [12, 45, 230], then the embedding layer looks up rows 12, 45, and 230 of \(E_{\text{token}}\), producing one \(d\)-dimensional vector per token. These vectors are learned during training via backpropagation.
We add a vector to each token embedding that tells the model its position in the sequence. There are two common variants: fixed sinusoidal encodings (as in the original Transformer paper) and learned positional embeddings.
For the token at position \(p\), with token ID \(t_p\):
\[X_p = E_{\text{token}}[t_p] + \text{PE}[p]\]This combined vector \(X_p\), which encodes both what the word is and where it appears, becomes the input to the first transformer layer.
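To make the lookup-and-add step concrete, here is a minimal NumPy sketch. The vocabulary size, embedding dimension, and the use of sinusoidal encodings are illustrative assumptions, and the embedding matrix is randomly initialized rather than learned:

```python
import numpy as np

vocab_size, d = 1000, 8                      # illustrative sizes, not from a real model

# Learned token embedding matrix (random stand-in here)
E_token = np.random.randn(vocab_size, d) * 0.02

def sinusoidal_positional_encoding(n, d):
    """Fixed sinusoidal positional encodings, as in the original Transformer paper."""
    pos = np.arange(n)[:, None]              # (n, 1) positions
    i = np.arange(d)[None, :]                # (1, d) feature indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # (n, d)

token_ids = np.array([12, 45, 230])          # "The", "cat", "sleeps" (assumed IDs)
X = E_token[token_ids] + sinusoidal_positional_encoding(len(token_ids), d)
print(X.shape)                               # (3, 8): one position-aware vector per token
```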
Each word looks at all the other words in the sentence and decides how much attention to pay to each of them.
This helps the model understand context. Multi-head means this is done in multiple ways at once. One head might look at subject-verb, another at adjectives, etc. For example, if the word is “sleeps”, attention helps it realize that “cat” is the subject performing the action.
Each input token vector \(x_i \in \mathbb{R}^{1 \times d}\) from the input matrix \(X \in \mathbb{R}^{n \times d}\) is transformed into a query, a key, and a value vector:
\[q_i = x_i W^Q, \quad k_i = x_i W^K, \quad v_i = x_i W^V\]where \(W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}\) are learned weight matrices, and \(d_k\) is typically \(d / h\), with \(h\) being the number of heads.
Stacking across all tokens:
\[Q = X W^Q \in \mathbb{R}^{n \times d_k}\] \[K = X W^K \in \mathbb{R}^{n \times d_k}\] \[V = X W^V \in \mathbb{R}^{n \times d_k}\]For each query-key pair, compute a score:
\[\text{score}_{ij} = \frac{Q_i K_j^T}{\sqrt{d_k}}\]This measures how much token \(i\) should attend to token \(j\).
Convert scores to attention weights:
\[\alpha_{ij} = \text{softmax}_j\left(\text{score}_{ij}\right)\]Each \(\alpha_{ij} \in [0,1]\), and \(\sum_j \alpha_{ij} = 1\).
Intuition: a weight \(\alpha_{ij}\) close to 1 means token \(i\) relies heavily on token \(j\) for context, while a weight close to 0 means token \(j\) is mostly ignored.
Use the attention weights to combine the values:
\[\text{output}_i = \sum_j \alpha_{ij} V_j\]This is the context-aware representation for token \(i\).
So, the full attention operation is:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]This function maps the queries to a weighted sum of values, using scores derived from the keys.
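The same formula in a short NumPy sketch; the input matrix and projection weights below are random stand-ins for learned values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over each row."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n, n) raw scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ V                                     # context-aware outputs

n, d, d_k = 3, 8, 4                                        # assumed sizes (d_k = d / h, h = 2)
X = np.random.randn(n, d)                                  # stand-in for the embedded tokens
W_Q, W_K, W_V = (np.random.randn(d, d_k) * 0.1 for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                                           # (3, 4)
```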
Repeat the attention process above \(h\) times, each with its own set of learned \(W^Q, W^K, W^V\) matrices:
\[\text{head}_i = \text{Attention}(Q^{(i)}, K^{(i)}, V^{(i)}) \in \mathbb{R}^{n \times d_k}, \quad i = 1, \dots, h\]Concatenate all the heads along the feature dimension and apply a final linear projection:
\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O\]where \(W^O \in \mathbb{R}^{(h d_k) \times d}\) is also learned.
\(W^O\) determines how to combine the different “perspectives” from all attention heads into a single, unified vector that can be used by the next layer. It decides how much weight to give to each head’s output, essentially blending them into a coherent representation for each token.
Each output vector is a blend of others — how much it blends depends on the attention scores. That’s how the model learns context.
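Building on the scaled_dot_product_attention function from the sketch above, multi-head attention can be sketched as follows; the head count and random weights are again placeholders for learned parameters:

```python
def multi_head_attention(X, h=2):
    """Run h attention heads with separate projections, then mix them with W_O."""
    n, d = X.shape
    d_k = d // h
    heads = []
    for _ in range(h):
        W_Q, W_K, W_V = (np.random.randn(d, d_k) * 0.1 for _ in range(3))
        heads.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = np.random.randn(h * d_k, d) * 0.1        # learned in practice, random here
    return np.concatenate(heads, axis=-1) @ W_O    # (n, d): one blended vector per token

attn_out = multi_head_attention(X)                 # X from the previous sketch
```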
Adds the attention output back to the original input (residual connection), then applies layer normalization.
This applies Layer Normalization to the sum of the input matrix \(X\) and the multi-head attention output, producing output matrix \(Z \in \mathbb{R}^{n \times d}\).
LayerNorm is applied per token, i.e., on each row \(z \in \mathbb{R}^{1 \times d}\) of \(Z\). It normalizes the feature vector by adjusting its mean and variance:
\[\mu = \frac{1}{d} \sum_{i=1}^{d} z_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (z_i - \mu)^2\]Then:
\[\text{LayerNorm}(z) = \gamma \cdot \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]where \(\gamma\) and \(\beta\) are learned scale and shift parameters, and \(\epsilon\) is a small constant added for numerical stability.
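A minimal sketch of the Add & Norm step; in a real model \(\gamma\) and \(\beta\) are learned per feature, while here they default to identity values:

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each token vector (each row) to zero mean and unit variance, then scale and shift."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

X = np.random.randn(3, 8)           # stand-in for the layer input ...
attn_out = np.random.randn(3, 8)    # ... and for the multi-head attention output
Z = layer_norm(X + attn_out)        # residual connection, then LayerNorm
```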
Each token’s vector (after attention) is passed through a small neural network — the same one for every position — to refine its representation.
Each token goes through this same transformation separately — it’s position-wise, not sequence-wide.
For example, if the word “sleeps” attends to “cat” in attention, the FFN helps turn that into a refined idea like “subject performs action.”
It consists of two fully connected layers with an activation in between:
\[\text{FFN}(x_i) = \text{GELU}(x_i W_1 + b_1) W_2 + b_2 \in \mathbb{R}^{1 \times d}\]This is applied independently to each token vector in the sequence, with shared weights across positions.
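A sketch of the position-wise feed-forward network; the hidden width of \(4d\) is a common convention assumed here, and the weights are random placeholders:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(Z, d_ff=None):
    """Two linear layers with GELU in between, applied to each row (token) independently."""
    n, d = Z.shape
    d_ff = d_ff or 4 * d                               # common expansion factor (assumed)
    W1, b1 = np.random.randn(d, d_ff) * 0.1, np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d) * 0.1, np.zeros(d)
    return gelu(Z @ W1 + b1) @ W2 + b2                 # (n, d)

Z = np.random.randn(3, 8)                              # stand-in for the Add & Norm output
ffn_out = feed_forward(Z)
```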
This block is repeated \(N\) times to build deeper semantic understanding.
Each decoder layer includes all of the above plus masking and cross-attention:
Same as the encoder: tokens are converted into vectors and combined with positional encodings to retain order.
Target sequence: ["The", "cat", "sleeps"]
Shifted decoder input: ["<BOS>", "The", "cat"]
<BOS> stands for Beginning of Sequence. It’s a special token inserted at the start of the decoder input to indicate the start of generation.
The decoder generates tokens one by one, using only the tokens that came before. During training, we shift the decoder input one position to the right, so that at every position the model must predict the next token in the target sequence. This teaches the model autoregressive generation: predicting the next word based only on previously generated ones.
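A tiny sketch of how the shifted (input, target) pairs line up during training, using the example sentence above:

```python
target = ["The", "cat", "sleeps"]
decoder_input = ["<BOS>"] + target[:-1]        # right-shift: ["<BOS>", "The", "cat"]

# At step i the decoder sees decoder_input[: i + 1] and must predict target[i]
for i, (seen, to_predict) in enumerate(zip(decoder_input, target)):
    print(f"step {i}: input ends with {seen!r:10} -> predict {to_predict!r}")
```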
In the decoder, masked self-attention ensures that each token can only attend to earlier tokens — not to future ones.
Compute Q, K, V projections just like in the encoder:
\[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]Compute raw attention scores:
\[\text{scores} = \frac{QK^T}{\sqrt{d_k}}\]Apply causal mask: Set all positions \((i,j)\) where \(j > i\) to \(-\infty\):
\[\text{scores}_{ij} = -\infty \text{ if } j > i\]Apply softmax:
\[\alpha_{ij} = \text{softmax}_j(\text{scores}_{ij})\]Compute output:
\[\text{output}_i = \sum_j \alpha_{ij} V_j\]Each token only “looks left” — at the tokens that came before.
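The same attention computation with the causal mask applied, sketched in NumPy with random stand-in projections:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Self-attention in which position i may only attend to positions j <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)     # True where j > i
    scores = np.where(future, -np.inf, scores)             # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d, d_k = 3, 8, 4                                        # assumed sizes
X = np.random.randn(n, d)                                  # stand-in decoder inputs
W_Q, W_K, W_V = (np.random.randn(d, d_k) * 0.1 for _ in range(3))
out = masked_self_attention(X @ W_Q, X @ W_K, X @ W_V)     # row i ignores tokens after i
```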
Cross-attention allows the decoder to look at the encoder’s output — i.e., the representation of the input sequence.
Use the decoder’s hidden states to compute queries:
\[Q = X_{\text{decoder}} W^Q\]Use the encoder’s output (fixed after encoding) to compute keys and values:
\[K = X_{\text{encoder}} W^K, \quad V = X_{\text{encoder}} W^V\]Compute attention scores:
\[\text{scores} = \frac{QK^T}{\sqrt{d_k}}\]Apply softmax to get weights:
\[\alpha_{ij} = \text{softmax}_j(\text{scores}_{ij})\]Compute weighted sum of values:
\[\text{output}_i = \sum_j \alpha_{ij} V_j\]This output is then passed forward in the decoder layer.
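A sketch of cross-attention, where queries come from the decoder states and keys/values come from the fixed encoder output; all shapes and weights below are illustrative:

```python
import numpy as np

def cross_attention(X_dec, X_enc, d_k=4):
    """Decoder queries attend over encoder keys and values."""
    d = X_dec.shape[1]
    W_Q, W_K, W_V = (np.random.randn(d, d_k) * 0.1 for _ in range(3))
    Q = X_dec @ W_Q                                        # (n_dec, d_k)
    K, V = X_enc @ W_K, X_enc @ W_V                        # (n_enc, d_k) each
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_dec, n_enc)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (n_dec, d_k)

X_enc = np.random.randn(5, 8)     # encoder output for a 5-token source sentence
X_dec = np.random.randn(3, 8)     # decoder states for 3 tokens generated so far
out = cross_attention(X_dec, X_enc)
```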
This mechanism is what allows sequence-to-sequence models to perform tasks like summarization, translation, and more.
Decoder layers are also repeated \(N\) times for generation depth; each layer further refines the representation.
Despite these limitations, transformers remain the dominant architecture in NLP and are being extended to vision, audio, robotics, and multimodal applications.
Although GPT is a decoder-only transformer, it can handle tasks traditionally associated with encoder–decoder models because of how it’s trained and how prompting works:
GPT models are trained on datasets that include examples of translation, summarization, Q&A, etc. These are framed as text-in → text-out tasks.
In decoder-only transformers, you can turn nearly any problem into a single text string, which GPT learns to respond to appropriately:
Summarize: Climate change is accelerating due to... → Summary:
Translate: Hello, how are you? → French:
Describe this image: <image tokens> → A dog jumping over a fence.
This lets GPT solve problems without separate encoder/decoder modules.
Decoder-only transformers like GPT have proven incredibly powerful, as they can perform many tasks just by clever prompting, without needing a full encoder-decoder structure.
Still, encoder-only models remain valuable for understanding tasks, and full encoder–decoder transformers for structured input-to-output tasks such as translation and summarization.
The choice depends on task structure and deployment goals.
Understanding these architectures is essential to mastering LLMs like GPT, BERT, Claude, Gemini, LLaMA, and beyond.
For further inquiries or collaboration, feel free to contact me at my email.