1. Introduction
For the past several years, nearly every successful large-scale sequence model has converged on the same architectural pattern: transformers and their variants. Sparse attention, linear attention, grouped-query attention, kernel tricks — the surface details change, but the underlying mechanism remains the same.
This has produced a familiar question:
Are transformers inevitable, or are we simply stuck?
The answer is neither. What is happening is more specific: the field has largely committed to one particular way of building global structure, and transformers saturate that choice extremely well.
Once the alternatives are made explicit, both the limits of transformers and the shape of what comes next become much clearer.
2. The Core Question: How Is Global Structure Built?
Any sequence model that aims to perform non-trivial reasoning must answer one fundamental question:
How does information from distant parts of the sequence come together?
There are only a few fundamentally different answers. Everything else is variation.
3. Explicit Comparison: The Transformer Regime
Transformers build global structure by explicitly comparing tokens to each other.
Each layer:
- embeds tokens in a shared space,
- computes similarity scores between all token pairs,
- aggregates information based on those scores,
- repeats the process in bounded depth.
This gives transformers two defining properties:
- Random access — any token can directly query any other.
- Symmetry — attention itself is permutation-invariant, so relationships are not tied to sequence order or direction (order is reintroduced separately through positional encodings).
The cost is obvious: O(n²) interactions. The payoff is equally clear: maximal expressiveness for arbitrary global retrieval and comparison.
This is why transformers dominate tasks such as:
- language modeling,
- code understanding,
- cross-document reasoning,
- retrieval-augmented generation.
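The mechanism described above can be sketched as a single attention head in NumPy. This is a minimal illustration, not a production implementation: masking, multi-head splitting, and positional encodings are all omitted, and the shapes are arbitrary.

```python
import numpy as np

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention (illustrative sketch).

    x: (n, d) token embeddings; wq, wk, wv: (d, d) projections.
    The (n, n) score matrix is the explicit all-pairs comparison:
    every token can directly query every other, at O(n^2) cost.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])      # (n, n): all token pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
    return w @ v                                # aggregate by similarity

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
wq, wk, wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = attention(x, wq, wk, wv)
assert out.shape == (n, d)
```

The (n, n) intermediate is the whole story: it is what grants random access, and it is what costs O(n²).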
Variants that keep explicit comparison but reduce cost (sparsity, kernels, approximations) remain inside this regime. They change how efficiently comparison is approximated, not what kind of structure is being computed.
3.1 Hardware Alignment of Transformers
The persistence of transformers is not just architectural — it is also hardware-driven.
Dense attention has:
- high arithmetic intensity,
- predictable memory access patterns,
- minimal control flow,
- excellent tiling into SRAM / shared memory.
In practice, large attention blocks amortize memory movement from high-bandwidth memory (HBM) and keep GPUs saturated. By contrast, many “efficient” alternatives reduce FLOPs but introduce:
- serial dependencies,
- irregular memory access,
- lower arithmetic intensity.
As a result, O(n²) attention often runs closer to peak hardware utilization than O(n) alternatives, particularly on modern accelerators.
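A back-of-envelope model makes the intensity gap concrete. It assumes ideal tiling, where each operand is read from HBM exactly once; real kernels only approximate this, but the ordering between the two regimes holds.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte of HBM traffic for an (m, k) @ (k, n) matmul,
    assuming each operand is read once and the output written once
    (the ideal that SRAM tiling approaches)."""
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# QK^T for one attention head: n = 4096 tokens, head dim 128, fp16.
qk = matmul_intensity(4096, 128, 4096)
# One step of a sequential recurrence is a matrix-vector product,
# which is memory-bound: intensity below 1 FLOP per byte.
mv = matmul_intensity(1, 128, 128)
print(f"QK^T: {qk:.0f} FLOPs/byte, mat-vec: {mv:.2f} FLOPs/byte")
```

Over a hundred FLOPs per byte for the dense block versus roughly one for the recurrent step: the dense block can saturate compute units, while the recurrence waits on memory.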
3.2 The KV Cache Problem
In practice, the dominant bottleneck for long-context transformers is no longer raw attention FLOPs, but the memory footprint and bandwidth of the key–value (KV) cache during inference.
For autoregressive generation, the KV cache grows linearly with context length and must be:
- stored in high-bandwidth memory,
- read at every decoding step,
- kept resident to avoid recomputation.
As context windows push into hundreds of thousands or millions of tokens, KV cache traffic — not attention compute — becomes the primary scaling limit.
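The footprint is easy to quantify. The configuration below is hypothetical (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache), chosen only to be representative of a 70B-class model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache: one key and one value vector per token,
    per layer, per KV head, kept resident for the whole decode."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

gb = kv_cache_bytes(1_000_000, 80, 8, 128) / 1e9
print(f"KV cache at 1M tokens: {gb:.0f} GB per sequence")
```

Hundreds of gigabytes per sequence, all of which must be read at every decoding step — this is why bandwidth, not FLOPs, dominates.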
This is the concrete pain point that hardware-aware state-space models address. By replacing explicit token–token comparison with a constant-sized recurrent state, models such as Mamba eliminate the KV cache entirely. The trade is explicit: per-layer memory and bandwidth drop from O(n) to O(1), in exchange for compressed global structure.
This reframes the comparison:
- Transformers pay for expressiveness primarily in memory bandwidth.
- SSMs buy efficiency by fixing memory cost at O(1) per layer.
The architectural divide is therefore as much about memory systems as about computation.
4. Explicit Dynamics: The State-Space Regime
State-space models (SSMs) such as S4 and Mamba, together with related recurrent and convolutional architectures such as RWKV and Hyena, take a genuinely different approach.
Instead of explicitly comparing tokens, they:
- maintain a finite-dimensional state,
- update it sequentially as tokens arrive,
- let global context accumulate implicitly through dynamics.
This replaces explicit comparison with state evolution.
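The state-evolution loop can be sketched as a generic linear recurrence. Input-dependent parameters (as in Mamba), discretization, and scan-based parallelization are all omitted; only the fixed-size-state bottleneck is shown, with arbitrary illustrative dimensions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence (a generic sketch, not any
    specific model): h_t = A h_{t-1} + B x_t, y_t = C h_t. The state h
    has a fixed dimension regardless of sequence length, so all global
    context must pass through this bottleneck."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:               # O(n) sequential updates
        h = A @ h + B @ x_t     # state evolution absorbs the new token
        ys.append(C @ h)        # readout from the compressed state
    return np.stack(ys)

rng = np.random.default_rng(0)
d, m, n = 16, 4, 100
A = 0.9 * np.eye(d)             # stable dynamics: spectral radius < 1
B = 0.1 * rng.standard_normal((d, m))
C = 0.1 * rng.standard_normal((m, d))
y = ssm_scan(rng.standard_normal((n, m)), A, B, C)
assert y.shape == (n, m)
```

Note that the cost per token is independent of how many tokens came before — the defining contrast with attention.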
The benefits are real:
- linear-time computation,
- streaming capability,
- low memory footprint,
- strong performance on very long sequences with local or structured dependencies.
But the limitation is structural:
If the state has dimension d, it cannot faithfully encode O(n²) independent token–token relationships when n ≫ d.
Information is compressed as it flows forward. Some distinctions are lost by design.
This is not a flaw. It is the tradeoff.
SSMs excel when:
- long-range dependencies are compressible,
- locality dominates,
- throughput and context length matter more than arbitrary retrieval.
5. The Role of Data (Often Under-Emphasized)
Architecture alone does not determine how global structure is learned.
Training data matters enormously:
- Natural language has strong locality, redundancy, and hierarchical structure.
- Code has explicit scoping, repetition, and long-range references.
- Video and audio have smooth temporal dynamics.
Transformers succeed partly because:
- their inductive bias is weak,
- large datasets teach them which comparisons matter.
SSMs succeed where:
- the data itself is compressible,
- long-range dependencies can be summarized rather than retrieved exactly.
In other words:
Architecture determines what can be represented; data determines what needs to be represented.
6. Implicit Constraints: The Variational / Lagrangian Regime
A third regime replaces explicit comparison and explicit dynamics with implicit global constraints.
These models define:
- an energy, action, or constraint functional,
- whose stationary point defines the representation.
Examples include:
- Deep Equilibrium Models (DEQs),
- closed-loop / equilibrium transformers,
- modern Hopfield-style associative memory networks.
6.1 Implicit Depth and Gradient Flow
In these models:
- depth is not the number of layers,
- it is the number of iterations required to reach equilibrium.
This yields effectively unbounded depth without explicit stacking.
Gradients are computed via implicit differentiation, rather than back-propagating through each iteration step. This mitigates classical vanishing/exploding gradient issues, but shifts sensitivity to conditioning and solver stability.
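A toy equilibrium layer makes "iterations as depth" concrete. The contraction on W is imposed by hand here so that convergence is guaranteed; training such a layer would differentiate implicitly through the fixed point z* rather than through the loop. All shapes and constants are illustrative.

```python
import numpy as np

def deq_layer(x, W, U, max_iter=200, tol=1e-6):
    """Iterate z <- tanh(W z + U x) to a fixed point. Effective 'depth'
    is the iteration count, which is data-dependent. With the spectral
    norm of W below 1, the map is a contraction and must converge."""
    z = np.zeros_like(x)
    for i in range(1, max_iter + 1):
        z_next = np.tanh(W @ z + U @ x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter

rng = np.random.default_rng(0)
d = 32
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)        # enforce contraction by rescaling
U = 0.3 * rng.standard_normal((d, d))
x = rng.standard_normal(d)
z_star, steps = deq_layer(x, W, U)
# The fixed-point residual ||z* - f(z*)|| certifies the equilibrium.
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ x))
```

Without the hand-imposed contraction, nothing guarantees the loop terminates within `max_iter` — which is exactly the fragility described below.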
6.2 Practical Costs
- inference time is data-dependent,
- convergence is not guaranteed in bounded steps,
- conditioning matters enormously,
- hardware utilization is poor due to iterative solvers and control flow.
These models are powerful for:
- global consistency,
- constraint satisfaction,
- associative reasoning,
but remain operationally fragile at scale.
6.3 Quantization and Numerical Stability
An under-appreciated advantage of transformers is their robustness to aggressive quantization. Attention-based models routinely operate at 8-bit — and increasingly 4-bit — precision with minimal degradation.
This robustness follows from:
- feed-forward algebraic structure,
- bounded activations via normalization,
- absence of iterative convergence during inference.
By contrast, it remains an open question whether variational and equilibrium models can maintain stable convergence under heavy quantization. Because these models rely on:
- fixed-point iteration,
- implicit solvers,
- conditioning-sensitive dynamics,
reduced numerical precision may affect convergence guarantees directly, rather than merely degrading output quality.
As hardware efficiency increasingly depends on low-precision arithmetic, quantization tolerance becomes a first-class architectural constraint.
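The asymmetry can be illustrated with plain symmetric int8 weight quantization. In a feed-forward matmul, the quantization error is a one-shot perturbation of the output; in a fixed-point solver, the same perturbation is re-injected at every iteration and can move or destabilize the equilibrium itself. The sizes and scales below are arbitrary.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: one fp scale, int8 codes."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = 0.02 * rng.standard_normal((256, 256))
x = rng.standard_normal(256)

q, s = quantize_int8(w)
y_fp = w @ x
y_q = (q.astype(np.float32) * s) @ x   # dequantize, then matmul
rel_err = np.linalg.norm(y_q - y_fp) / np.linalg.norm(y_fp)
# For a feed-forward op, this small relative error is the whole story;
# inside an iterative solver, it perturbs every step of the convergence.
print(f"relative output error: {rel_err:.4f}")
```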
7. Empirical Signatures of the Three Regimes
- Transformers excel at precise global retrieval when data supports it and hardware can sustain dense compute.
- SSMs excel when data structure allows aggressive compression and long sequential propagation.
- Variational models excel when the task is fundamentally about satisfying constraints rather than retrieving facts.
8. A Practical Decision Guide
The right architectural question is not “what’s best?”, but:
What must be preserved — and what can be traded away?
- Need arbitrary random access → Transformers
- Dependencies compressible, very long context → SSMs
- Need global consistency → Variational components
- Need multiple capabilities → Hybrid designs
9. Hybrids: Not Speculative, Already Here
Hybrid systems are not just algorithmic compromises — they are hardware-aware decompositions:
- dense attention where arithmetic intensity is high,
- state-space models where memory bandwidth dominates,
- retrieval and tools where exact operations matter,
- variational components where constraint satisfaction outweighs throughput.
Successful hybrids reflect a single principle: explicit comparison is powerful but expensive, and should be used only where it is indispensable.
An illustrative analogy: the distinction between explicit comparison and state-based dynamics can be made intuitive by contrasting composition with continuation in music. Writing a new piece requires global structural decisions: motif selection, contrast, recurrence, and long-range planning. This is analogous to explicit comparison, where distant elements are actively related and reinterpreted. By contrast, extending an already-determined piece, maintaining its harmonic field, texture, and atmosphere, is primarily a matter of smooth propagation of state. This is where state-space dynamics excel. The analogy helps clarify why hybrid systems work best when these roles are separated in time or function: explicit mechanisms for planning and constraint-setting, followed by dynamic mechanisms for execution and continuation.
This also explains why many naïve hybrids fail. When multiple mechanisms are applied indiscriminately to the same global-structure problem, the system pays the costs of each without gaining the benefits of either. Effective hybrids are not blends; they are partitions, with clear division of responsibility between comparison, propagation, and constraint enforcement.
9.1 Hybrids as the Emerging Production Consensus
The move toward hybrid architectures is no longer speculative. By 2025, it has become the dominant pattern in large-scale production models, particularly for long-context workloads where both expressiveness and efficiency matter.
Several recent systems exemplify this convergence:
- Jamba (AI21) combines state-space layers with transformer attention and mixture-of-experts routing, achieving context lengths beyond 256K tokens while maintaining high throughput.
- Falcon-H1 (TII) interleaves parallel attention with Mamba-2 layers, targeting multilingual and long-context settings where memory bandwidth is the primary constraint.
- Bamba (IBM) provides an open-source hybrid explicitly designed to reduce the memory overhead associated with full attention.
- Related architectures (e.g. Zamba, Heracles, and similar designs) typically allocate 10–50% of layers to explicit attention, with the remainder implemented as state-space dynamics.
Across balanced benchmarks, these hybrids consistently outperform both pure transformers and pure SSMs, not by inventing new primitives, but by assigning each mechanism to the role it performs best.
This pattern reinforces the central claim of this paper: progress is not coming from replacing attention wholesale, but from restricting its use to the subproblems that genuinely require explicit comparison, while delegating long-range propagation and continuity to more efficient dynamics.
10. Additional Axes and Open Frontiers
The three-regime framework captures the dominant architectural tradeoffs, but several additional axes sharpen the picture.
10.1 Recurrence vs. Parallelization
- Transformers are fully parallelizable across sequence length.
- SSMs are recurrent at inference; linear SSMs nonetheless train in parallel via convolutional or associative-scan formulations (the key trick behind S4 and Mamba), while nonlinear recurrence remains truly sequential.
This affects not just inference, but training efficiency and scalability. Parallelism enables higher utilization and faster convergence per unit of wall-clock time; recurrence enables constant memory and streaming computation. This remains a deep computational divide.
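The escape hatch for linear recurrences is associativity: two steps of h_t = a_t·h_{t-1} + b_t compose into a single step of the same form, so the final state can be computed by a log-depth reduction tree instead of a length-n chain. A minimal demonstration:

```python
import numpy as np

def combine(left, right):
    """Compose two steps of h -> a*h + b. Applying `left` then `right`
    is again a step of the same form; this associativity is what makes
    log-depth parallel scans possible for linear recurrences."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 8)
b = rng.standard_normal(8)

# Reference: sequential recurrence from h_0 = 0.
h = 0.0
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t

# Tree-shaped reduction: pair up neighbors and combine, log2(n) rounds.
steps = list(zip(a, b))
while len(steps) > 1:
    steps = [combine(steps[i], steps[i + 1]) for i in range(0, len(steps), 2)]
a_total, b_total = steps[0]
assert abs((a_total * 0.0 + b_total) - h) < 1e-9   # same final state
```

A full parallel scan recovers every prefix state with the same combine operator; this sketch shows only the final state. Nonlinear recurrences admit no such decomposition, which is why they stay sequential.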
10.2 Generalization and Out-of-Distribution Behavior
Different inductive biases lead to different generalization properties:
- Transformers often generalize better on compositional and retrieval-based tasks.
- SSMs often generalize better on temporal extrapolation and dynamical continuation.
OOD reliability is therefore architecture-dependent, not merely data-dependent.
10.3 Explicit Externalization: Tools and Memory
When global structure cannot be efficiently computed or compressed internally, it is externalized:
- retrieval systems,
- databases,
- code interpreters,
- symbolic engines.
This is not a failure mode but a fourth regime: explicit externalization of global structure. Modern systems already rely on this pathway to route around O(n²) limits.
10.4 The Long Tail of Specialized Inductive Biases
Highly structured data (graphs, sets, geometry) often favors specialized architectures:
- graph neural networks,
- equivariant models,
- domain-specific solvers.
These increasingly appear as components in hybrid systems, reinforcing the shift toward modular design.
11. “But Large Transformers Already Work — Isn’t That Enough?”
Yes — when O(n²) is affordable.
But context windows are already pressing hardware limits, and many domains (video, audio, large codebases, agent memory) naturally exceed them. Existing systems already rely on retrieval, chunking, tools, and external structure.
Hybrids are not about replacing transformers. They are about extending the regimes where transformers remain usable.
12. Conclusion: Strategic Hybridization, Not Architectural Revolution
Transformers dominate not because they are inevitable, but because they sit at the intersection of:
- expressive global comparison,
- data regimes that tolerate weak inductive bias,
- hardware that rewards dense, regular computation.
Progress beyond them is not coming from overthrow, but from strategic hybridization:
- identifying where explicit comparison is indispensable,
- replacing it elsewhere with dynamics, constraints, or external tools,
- and aligning architecture choices with data structure and hardware realities.
This is not stagnation. It is the mark of a maturing engineering discipline — one that understands its tradeoffs and designs accordingly.
