They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are (conditional) lower bounds showing that sub-n^2 algorithms can't compute exact attention: https://arxiv.org/pdf/2302.13214
I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.
Numerical error may in fact be a key factor in why quadratic attention, in practice, exhibits context rot as context grows longer, analogous to an RNN:
https://www.anthropic.com/engineering/effective-context-engi...
> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus
This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it's successful in that.
[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428
> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.
I.e., the claim is that this method reproduces the results of conventional attention up to float16 numerical precision.
Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); n digits each looking at n other digits is quadratic. Any sub-quadratic multiplication must therefore lose some information.
The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.
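A minimal sketch of that incremental pattern, using toy single-head attention in NumPy (not any particular framework's KV-cache API):

    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    k_cache, v_cache = [], []              # grows by one entry per generated token

    def attend_new_token(q, k, v):
        """Attention for the newest token against all cached tokens: O(N) work per step."""
        k_cache.append(k)
        v_cache.append(v)
        K = np.stack(k_cache)              # (N, d) cached keys
        V = np.stack(v_cache)              # (N, d) cached values
        scores = K @ q / np.sqrt(d)        # one score per cached token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                 # (d,) output for the new token

    # Each decoding step only computes scores for the one new token,
    # so generating N tokens costs O(N^2) total but O(N) per step.
    for _ in range(10):
        q, k, v = rng.standard_normal((3, d))
        out = attend_new_token(q, k, v)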
That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.
In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.
The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
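For contrast, a toy linear recurrence in the SSM spirit (a hypothetical fixed-matrix recurrence, not a real SSM with discretization or gating): the state has a fixed size, so each new token costs O(1) no matter how long the context already is:

    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    A = np.eye(d) * 0.9                        # hypothetical fixed state-transition matrix
    B = rng.standard_normal((d, d)) * 0.01     # hypothetical input projection

    def step(state, x):
        # One token of work: touches only the fixed-size state, never the full history.
        return A @ state + B @ x

    state = np.zeros(d)
    for x in rng.standard_normal((1000, d)):   # 1000 tokens, constant cost per token
        state = step(state, x)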
The forward pass propagates from inputs to outputs, computing the thing the model was trained for. The reverse/backward pass propagates from outputs back toward inputs, but it's calculating the gradients of the parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether that moves the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.
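A minimal illustration with PyTorch autograd (a toy one-parameter-pair model, nothing transformer-specific): the forward pass produces the output and the loss, and the backward pass leaves gradients on the parameters rather than reconstructing the inputs:

    import torch

    # Toy model: y = w * x + b, loss = (y - target)^2
    w = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(0.5, requires_grad=True)
    x, target = torch.tensor(3.0), torch.tensor(10.0)

    # Forward pass: inputs -> output -> loss
    y = w * x + b
    loss = (y - target) ** 2

    # Backward pass: loss -> gradients attached to the parameters
    loss.backward()
    print(w.grad, b.grad)   # how nudging each parameter would move the loss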
The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.
And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.
Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
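A quick demonstration with NumPy's FFT (the standard zero-padded FFT convolution, nothing exotic):

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(1000)
    b = rng.standard_normal(1000)

    # Direct convolution: every a[i] meets every b[j] -> O(n^2)
    direct = np.convolve(a, b)

    # FFT convolution: O(n log n), same result up to floating-point error
    n = len(a) + len(b) - 1
    fft = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

    print(np.max(np.abs(direct - fft)))    # on the order of 1e-11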
Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:
    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b). Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
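A quick check, with hypothetical random inputs, that the O(n*m) double loop and the O(n + m) factored form agree up to float rounding:

    import random

    a = [random.random() for _ in range(500)]
    b = [random.random() for _ in range(500)]

    total = 0.0
    for i in range(len(a)):            # O(n*m) pairs
        for j in range(len(b)):
            total += a[i] * b[j]

    fast = sum(a) * sum(b)             # O(n + m), same value
    assert abs(total - fast) < 1e-6 * abs(total)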
And they really do mean that: their results show +/- 1 on log10 plots.
\iiint f(x, y, z)\, dx\, dy\, dz = \int \Big[ \int g(x, y)\, dx \Big] \Big[ \int h(y, z)\, dz \Big]\, dy
which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).
My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.
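A sketch of that trick, assuming the integrand factors as f(x, y, z) = g(x, y) * h(y, z) (the g and h below are just placeholder functions): the brute-force rectangle rule evaluates an n^3 grid, the factored form only O(n^2) values:

    import numpy as np

    g = lambda x, y: np.exp(-x * x) * np.cos(y)    # placeholder factors
    h = lambda y, z: np.sin(y + z)

    n = 200
    x = np.linspace(0, 1, n); y = np.linspace(0, 1, n); z = np.linspace(0, 1, n)
    dx, dy, dz = x[1] - x[0], y[1] - y[0], z[1] - z[0]

    # O(n^3): evaluate f = g*h on the full 3-D grid
    F = g(x[:, None, None], y[None, :, None]) * h(y[None, :, None], z[None, None, :])
    brute = F.sum() * dx * dy * dz

    # O(n^2): inner integrals over x and z for each y, then one integral over y
    Gx = g(x[:, None], y[None, :]).sum(axis=0) * dx    # ~ \int g(x, y) dx, shape (n,)
    Hz = h(y[:, None], z[None, :]).sum(axis=1) * dz    # ~ \int h(y, z) dz, shape (n,)
    fast = (Gx * Hz).sum() * dy

    print(brute, fast)    # agree up to floating-point rounding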
Attention also has some specific properties.
And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? >>44055347
Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similar to Fourier series with Gibbs oscillations).
Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids; going beyond cubic approximations tends to make errors worse (as measured by SNR).
By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier