Is Attention Really All You Need?
New research challenges the necessity of attention mechanisms in Transformers.
Zhang's paper "Attention Is Not What You Need" proposes replacing self-attention with operations on Grassmann manifolds, claiming competitive language modeling performance with linear complexity. An independent replication produced substantially worse results. The paper omits critical implementation details, making faithful reproduction difficult.
Transformers and Attention
The Transformer, introduced in 2017, replaced recurrence and convolutions with a single mechanism: self-attention [1]. Every token attends to every other token, and the model learns which relationships matter. This turned out to be remarkably effective, and it became the foundation for BERT [2], GPT-3 [3], and most of what followed. For nearly a decade, the field has operated on a shared assumption: attention is all you need. Zhang challenges that assumption directly, asking an important question [4]:
Is explicit self-attention really the fundamental ingredient we need for strong sequence modeling and reasoning?
Their answer is no.
Why It Matters
Attention's effectiveness comes at a cost: the attention matrix scales quadratically with sequence length, making long contexts expensive. This has motivated years of work on making self-attention faster, from sparse attention [5] to low-rank approximations [6]. These are technically demanding research areas, but they share a common goal: reducing the complexity of attention rather than replacing it.
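The quadratic cost is easy to see in a minimal single-head sketch. This is illustrative only, not the full Vaswani formulation (no multi-head, no masking, no output projection): the score matrix holds one entry per token pair, so memory and compute grow with n².

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over n tokens of dimension d."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 128, 64
x = rng.normal(size=(n, d))
out, weights = self_attention(x, rng.normal(size=(d, d)),
                              rng.normal(size=(d, d)), rng.normal(size=(d, d)))
# weights is the full n x n matrix whose size motivates sub-quadratic alternatives
```

Doubling the sequence length quadruples the size of `weights`, which is exactly the scaling the efficiency literature tries to avoid.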
Zhang's proposal sidesteps the problem entirely. Because the Grassmann mechanism operates on local token pairs within a fixed neighborhood rather than computing pairwise interactions across the full sequence, the complexity drops from O(n²) to O(n). If that holds up, it means better scaling is not a matter of endlessly optimizing attention. It means not needing attention at all.
What Zhang Proposes
The core idea is to replace the attention matrix with geometric operations on a mathematical structure called a Grassmann manifold [7]. Instead of computing pairwise interactions across all tokens, the Grassmann head operates on local token pairs within a fixed neighborhood. Because each token interacts with a small, fixed set of neighbors regardless of sequence length, the computation scales linearly rather than quadratically.
The mechanism encodes relationships between nearby tokens as points on the manifold, then fuses those geometric features back into the hidden states through a learned gate. The paper argues this is a more structured way to propagate information through a sequence, and that the geometric constraints may offer a cleaner route to understanding what the model has learned.
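To make the cost structure concrete, here is a sketch of local-neighborhood mixing fused through a learned gate. This is emphatically not Zhang's actual Grassmann operation, which the paper does not fully specify; the feature map, window size, and gate placement below are my own placeholders, chosen only to show why the computation is linear in sequence length.

```python
import numpy as np

def local_gated_mix(x, window, Wf, Wg):
    """Neighborhood mixing fused through a learned gate. NOT Zhang's actual
    Grassmann operation; a stand-in illustrating the cost structure: each
    token touches a fixed window of neighbors, so total work is
    O(n * window) instead of O(n^2)."""
    mixed = np.zeros_like(x)
    for offset in range(1, window + 1):
        shifted = np.roll(x, offset, axis=0)  # token i sees token i - offset
        shifted[:offset] = 0.0                # causal zero-padding at the start
        mixed += np.tanh(shifted @ Wf)        # placeholder for the geometric feature map
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))    # learned sigmoid gate, per token
    return gate * mixed + (1.0 - gate) * x    # fuse features into the hidden states

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                  # 16 tokens, hidden size 8
out = local_gated_mix(x, window=4,
                      Wf=rng.normal(size=(8, 8)), Wg=rng.normal(size=(8, 8)))
```

Because `window` is fixed, no n × n matrix ever materializes, which is the property the paper's linear-complexity claim rests on.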
Whether that interpretability argument holds in practice is an open question. For the full technical details, see section 3 of the paper [4:1].
Experimental Setup
My replication targeted the language modeling results from section 5.1 of Zhang's paper, following the architecture in section 3 and the experimental setup in section 4.1.
Dataset and Tokenization
WikiText-2-raw with the BERT tokenizer (vocabulary size: 30,522). This matches Zhang's setup exactly.
Model Architecture
Decoder-only Transformer following the original "Attention Is All You Need" paper, with minor modifications as specified by Zhang:
| Component | Value |
|---|---|
| Layers | 6 |
| Heads | 4 |
| Model Dimension (d_model) | 256 |
| Feed-Forward Dimension (d_ff) | 1024 |
| Max Context Length | 128 |
| Activation | GELU* |
| Normalization | Post-Norm LayerNorm / Pre-Norm** |
| Positional Encoding | Learned embeddings* |
| Weight Tying | Enabled† |
* The original Transformer uses ReLU and sinusoidal positional encodings. Zhang departs from this.
** Both variants were evaluated. Post-Norm aligns with Zhang's reported configuration.
† Inferred from parameter counts.
Training Configuration
Zhang does not specify key hyperparameters for the language modeling task. The choices below follow standard practice from related work and may differ from Zhang's actual setup.
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1 × 10⁻³ |
| LR Schedule | Cosine decay |
| Warmup | None |
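The schedule in the table can be written as a closed-form function of the step index. The exact formula (minimum LR of zero, per-step granularity) is my assumption, since Zhang does not specify it; the replication used `warmup_steps=0`, matching the table.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, warmup_steps=0):
    """Cosine decay from base_lr to 0, with optional linear warmup.
    Min LR and step granularity are assumptions, not from the paper."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, total_steps=1000) for s in range(1001)]
# starts at base_lr and decays smoothly toward zero
```

Setting `warmup_steps` to a few hundred steps is the conventional remedy for the Post-Norm instability discussed below.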
Ambiguities
The paper specifies dropout for the classification task (section 4.2) but not for language modeling. It mentions dropout "after the Grassmann mixing layer" without clarifying placement across the full architecture. My replication followed the original Transformer convention: dropout rate 0.1 applied uniformly to sub-layer outputs, attention weights, and embeddings [1:1].
Post-Norm Transformers are sensitive to initialization and typically require learning rate warmup to avoid instability early in training [8]. Zhang reports 30 epochs with no warmup mentioned. Without the author's code, it is not possible to know whether this diverges from the actual training setup.
Replication Results
Figure: validation perplexity, reported vs. replication (lower is better).
The baseline Transformer matched Zhang's reported numbers closely. The Grassmann Transformer did not. The replication produced a validation perplexity of 356.6 against the paper's reported 275.7, a gap of roughly 29%. A difference of this size points to either a fundamental implementation divergence or results that cannot be reproduced from the information provided.
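The 29% figure follows directly from the two perplexities:

```python
reported, replicated = 275.7, 356.6
gap = (replicated - reported) / reported   # relative gap over the paper's number
print(f"relative gap: {gap:.1%}")          # → relative gap: 29.3%
```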
Parameter counts add to the concern. Matching the baseline's reported 12.59M parameters required enabling weight tying, which the paper does not mention. The Grassmann model's parameter counts could not be reconciled with the reported values at all. Without the author's code, there is no way to verify that the architecture described in the paper is the one that was actually trained.
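A back-of-the-envelope count shows why tying had to be inferred. The conventions below (biases on all linear layers, two LayerNorms per block, no bias on the tied output head) are my assumptions, not details from the paper, but under them the tied total lands close to the reported 12.59M while the untied total does not.

```python
# Baseline parameter count under assumed conventions (biases on all linear
# layers, standard LayerNorm, no bias on the tied output head).
vocab, d_model, d_ff, n_layers, ctx = 30522, 256, 1024, 6, 128

embed = vocab * d_model                         # token embeddings (tied with output head)
pos = ctx * d_model                             # learned positional embeddings
attn = 4 * (d_model * d_model + d_model)        # Q, K, V, output projections + biases
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
norms = 2 * 2 * d_model                         # two LayerNorms per block (weight + bias)
block = attn + ffn + norms
final_norm = 2 * d_model

total_tied = embed + pos + n_layers * block + final_norm
total_untied = total_tied + vocab * d_model     # separate output projection matrix

print(total_tied / 1e6, total_untied / 1e6)     # ~12.59M tied vs. ~20.40M untied
```

The roughly 7.8M-parameter output matrix that tying removes is what makes the reported 12.59M unreachable without it.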
Communication with the Author
I contacted the author requesting clarification on training details and parameter counts. The author did not respond.
A detailed parameter count breakdown, full replication scripts, and environment details are available in a public GitHub repository.
Bottom Line
The Grassmann Transformer is a genuinely interesting idea, and the question Zhang is asking is worth asking. The field has focused more on making attention efficient than on questioning whether it is the right foundation.
The replication results do not support the paper's empirical claims. A 29% perplexity gap, unresolved parameter count discrepancies, and missing training details make the reported numbers hard to accept at face value. The conceptual contribution stands on its own, but the empirical case needs stronger support and, at minimum, a public codebase.
Readers curious about attention-free architectures may find more traction elsewhere. State-space models [9] have produced stronger empirical results on similar problems and represent a more mature line of work for those interested in moving beyond attention.
For now, attention really is all you need.