Is Attention Really All You Need?
New research challenges the necessity of attention mechanisms in Transformers.
Zhang's paper "Attention Is Not What You Need" proposes replacing self-attention with operations on Grassmann manifolds, claiming competitive language modeling performance with linear complexity. An independent replication produced substantially worse results. The paper omits critical implementation details, making faithful reproduction difficult.
Transformers and Attention
The Transformer, introduced in 2017, replaced recurrence and convolutions with a single mechanism: self-attention [1]. Every token attends to every other token, and the model learns which relationships matter. This turned out to be remarkably effective, and it became the foundation for BERT [2], GPT-3 [3], and most of what followed. For nearly a decade, the field has operated on a shared assumption: attention is all you need. Zhang challenges that assumption directly, asking an important question [4]:
Is explicit self-attention really the fundamental ingredient we need for strong sequence modeling and reasoning?
Their answer is no.
Why It Matters
Attention's effectiveness comes at a cost: the attention matrix scales quadratically with sequence length, making long contexts expensive. This has motivated years of work on making self-attention faster, from sparse attention [5] to low-rank approximations [6]. These are technically demanding research areas, but they share a common goal: reducing the complexity of attention rather than replacing it.
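The quadratic cost is easy to see in a minimal single-head sketch. This is illustrative only, not the full Vaswani formulation (no multi-head, no masking, no output projection): the score matrix holds one entry per token pair, so memory and compute grow with n².

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over n tokens of dimension d."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 128, 64
x = rng.normal(size=(n, d))
out, weights = self_attention(x, rng.normal(size=(d, d)),
                              rng.normal(size=(d, d)), rng.normal(size=(d, d)))
# weights is the full n x n matrix whose size motivates sub-quadratic alternatives
```

Doubling the sequence length quadruples the size of `weights`, which is exactly the scaling the efficiency literature tries to avoid.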
Zhang's proposal sidesteps the problem entirely. Because the Grassmann mechanism operates on local token pairs within a fixed neighborhood rather than computing pairwise interactions across the full sequence, the complexity drops from O(n²) to O(n). If that holds up, it means better scaling is not a matter of endlessly optimizing attention. It means not needing attention at all.
What Zhang Proposes
The core idea is to replace the attention matrix with geometric operations on a mathematical structure called a Grassmann manifold [7]. Instead of computing pairwise interactions across all tokens, the Grassmann head operates on local token pairs within a fixed neighborhood. Because each token interacts with a small, fixed set of neighbors regardless of sequence length, the computation scales linearly rather than quadratically.
The mechanism encodes relationships between nearby tokens as points on the manifold, then fuses those geometric features back into the hidden states through a learned gate. The paper argues this is a more structured way to propagate information through a sequence, and that the geometric constraints may offer a cleaner route to understanding what the model has learned.
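To make the cost structure concrete, here is a sketch of local-neighborhood mixing fused through a learned gate. This is emphatically not Zhang's actual Grassmann operation, which the paper does not fully specify; the feature map, window size, and gate placement below are my own placeholders, chosen only to show why the computation is linear in sequence length.

```python
import numpy as np

def local_gated_mix(x, window, Wf, Wg):
    """Neighborhood mixing fused through a learned gate. NOT Zhang's actual
    Grassmann operation; a stand-in illustrating the cost structure: each
    token touches a fixed window of neighbors, so total work is
    O(n * window) instead of O(n^2)."""
    mixed = np.zeros_like(x)
    for offset in range(1, window + 1):
        shifted = np.roll(x, offset, axis=0)  # token i sees token i - offset
        shifted[:offset] = 0.0                # causal zero-padding at the start
        mixed += np.tanh(shifted @ Wf)        # placeholder for the geometric feature map
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))    # learned sigmoid gate, per token
    return gate * mixed + (1.0 - gate) * x    # fuse features into the hidden states

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                  # 16 tokens, hidden size 8
out = local_gated_mix(x, window=4,
                      Wf=rng.normal(size=(8, 8)), Wg=rng.normal(size=(8, 8)))
```

Because `window` is fixed, no n × n matrix ever materializes, which is the property the paper's linear-complexity claim rests on.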
Whether that interpretability argument holds in practice is an open question. For the full technical details, see section 3 of the paper [4:1].
Experimental Setup
My replication targeted the language modeling results from section 5.1 of Zhang's paper, following the architecture in section 3 and the experimental setup in section 4.1.
Dataset and Tokenization
WikiText-2-raw with the BERT tokenizer (vocabulary size: 30,522). This matches Zhang's setup exactly.
Model Architecture
Decoder-only Transformer following the original "Attention Is All You Need" paper, with minor modifications as specified by Zhang:
| Component | Value |
|---|---|
| Layers | 6 |
| Heads | 4 |
| Model Dimension (d_model) | 256 |
| Feed-Forward Dimension (d_ff) | 1024 |
| Max Context Length | 128 |
| Activation | GELU* |
| Normalization | Post-Norm LayerNorm / Pre-Norm** |
| Positional Encoding | Learned embeddings* |
| Weight Tying | Enabled† |
* The original Transformer uses ReLU and sinusoidal positional encodings. Zhang departs from this.
** Both variants were evaluated. Post-Norm aligns with Zhang's reported configuration.
† Inferred from parameter counts.
Training Configuration
Zhang does not specify key hyperparameters for the language modeling task. The choices below follow standard practice from related work and may differ from Zhang's actual setup.
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1 × 10⁻³ |
| LR Schedule | Cosine decay |
| Warmup | None |
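The schedule in the table can be written as a closed-form function of the step index. The exact formula (minimum LR of zero, per-step granularity) is my assumption, since Zhang does not specify it; the replication used `warmup_steps=0`, matching the table.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, warmup_steps=0):
    """Cosine decay from base_lr to 0, with optional linear warmup.
    Min LR and step granularity are assumptions, not from the paper."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, total_steps=1000) for s in range(1001)]
# starts at base_lr and decays smoothly toward zero
```

Setting `warmup_steps` to a few hundred steps is the conventional remedy for the Post-Norm instability discussed below.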
Ambiguities
The paper specifies dropout for the classification task (section 4.2) but not for language modeling. It mentions dropout "after the Grassmann mixing layer" without clarifying placement across the full architecture. My replication followed the original Transformer convention: dropout rate 0.1 applied uniformly to sub-layer outputs, attention weights, and embeddings [1:1].
Post-Norm Transformers are sensitive to initialization and typically require learning rate warmup to avoid instability early in training [8]. Zhang reports 30 epochs with no warmup mentioned. Without the author's code, it is not possible to know whether this diverges from the actual training setup.
Replication Results
Figure: validation perplexity, reported vs. replication (lower is better).
The baseline Transformer matched Zhang's reported numbers closely. The Grassmann Transformer did not. The replication produced a validation perplexity of 356.6 against the paper's reported 275.7, a gap of roughly 29%. A difference of this size points to either a fundamental implementation divergence or results that cannot be reproduced from the information provided.
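The 29% figure follows directly from the two perplexities:

```python
reported, replicated = 275.7, 356.6
gap = (replicated - reported) / reported   # relative gap over the paper's number
print(f"relative gap: {gap:.1%}")          # → relative gap: 29.3%
```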
Parameter counts add to the concern. Matching the baseline's reported 12.59M parameters required enabling weight tying, which the paper does not mention. The Grassmann model's parameter counts could not be reconciled with the reported values at all. Without the author's code, there is no way to verify that the architecture described in the paper is the one that was actually trained.
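A back-of-the-envelope count shows why tying had to be inferred. The conventions below (biases on all linear layers, two LayerNorms per block, no bias on the tied output head) are my assumptions, not details from the paper, but under them the tied total lands close to the reported 12.59M while the untied total does not.

```python
# Baseline parameter count under assumed conventions (biases on all linear
# layers, standard LayerNorm, no bias on the tied output head).
vocab, d_model, d_ff, n_layers, ctx = 30522, 256, 1024, 6, 128

embed = vocab * d_model                         # token embeddings (tied with output head)
pos = ctx * d_model                             # learned positional embeddings
attn = 4 * (d_model * d_model + d_model)        # Q, K, V, output projections + biases
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
norms = 2 * 2 * d_model                         # two LayerNorms per block (weight + bias)
block = attn + ffn + norms
final_norm = 2 * d_model

total_tied = embed + pos + n_layers * block + final_norm
total_untied = total_tied + vocab * d_model     # separate output projection matrix

print(total_tied / 1e6, total_untied / 1e6)     # ~12.59M tied vs. ~20.40M untied
```

The roughly 7.8M-parameter output matrix that tying removes is what makes the reported 12.59M unreachable without it.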
Communication with the Author
I contacted the author requesting clarification on training details and parameter counts. The author did not respond.
A detailed parameter count breakdown, full replication scripts, and environment details are available in a public GitHub repository.
Bottom Line
The Grassmann Transformer is a genuinely interesting idea, and the question Zhang is asking is worth asking. The field has focused more on making attention efficient than on questioning whether it is the right foundation.
The replication results do not support the paper's empirical claims. A 29% perplexity gap, unresolved parameter count discrepancies, and missing training details make the reported numbers hard to accept at face value. The conceptual contribution stands on its own, but the empirical case needs stronger support and, at minimum, a public codebase.
Readers curious about attention-free architectures may find more traction elsewhere. State-space models [9] have produced stronger empirical results on similar problems and represent a more mature line of work for those interested in moving beyond attention.
For now, attention really is all you need.