zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. thomas+Yc 2026-02-04 15:33:26
>>fheins+(OP)
There's a graveyard of hundreds of papers promising "approximate near-linear-time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub-n^2 algorithms can't work in general: https://arxiv.org/pdf/2302.13214
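
For reference, the quadratic cost comes from exact attention materializing an n-by-n score matrix. A minimal numpy sketch (shapes and names are illustrative, not from the paper):

    import numpy as np

    def exact_attention(Q, K, V):
        # Q, K, V have shape (n, d). The score matrix S is (n, n), so both
        # memory and compute grow as n^2 -- the quadratic cost in question.
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))
        W /= W.sum(axis=-1, keepdims=True)
        return W @ V

    n, d = 4096, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = exact_attention(Q, K, V)  # materializes a 4096 x 4096 score matrix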

2. jcarre+gu 2026-02-04 16:47:25
>>thomas+Yc
The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

i.e., the claim is that this method reproduces the results of conventional attention up to Float16 numerical precision.
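
That kind of claim is easy to sanity-check if you assume the method amounts to replacing exp with its truncated Taylor series inside softmax (my reading of "P Taylor terms", not the paper's actual code). A rough numpy sketch:

    import numpy as np
    from math import factorial

    def exact_attention(Q, K, V):
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        W = np.exp(S - S.max(axis=-1, keepdims=True))
        return (W / W.sum(axis=-1, keepdims=True)) @ V

    def taylor_attention(Q, K, V, P=4):
        # Replace exp(s) with the truncated series sum_{p=0..P} s^p / p!
        # (an assumption for illustration; the paper's expansion may differ).
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        E = sum(S**p / factorial(p) for p in range(P + 1))
        return (E / E.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    n, d = 256, 64
    # NB: scores are kept small here so the naive truncation converges;
    # the paper's "symmetry-aware" construction presumably handles realistic ranges.
    Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
    err = np.abs(taylor_attention(Q, K, V, P=4) - exact_attention(Q, K, V)).max()
    print(err, np.finfo(np.float16).eps)  # elementwise error vs. Float16 resolution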

3. fheins+Qz 2026-02-04 17:13:04
>>jcarre+gu
The method is more general: the GitHub repository's first example uses eight Taylor terms (P = 8).
4. torgin+y22 2026-02-05 00:45:34
>>fheins+Qz
I'm clueless about this whole thing, but from my EE education I remember that in general:

Taylor approximations converge slowly if the function they're representing is discontinuous (the error falls off quadratically when the function is continuous, only linearly when it isn't), and they tend to create highly energetic swings near discontinuities (similar to the Gibbs oscillations of Fourier series).
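
The Fourier side of that is easy to see numerically. Here's a throwaway sketch (nothing to do with the paper itself) comparing a square wave, where the Gibbs overshoot at the jump never goes away, with a continuous triangle wave, where the error keeps shrinking:

    import numpy as np

    def square_partial_sum(x, N):
        # Fourier partial sum of a square wave (odd harmonics, coefficients ~ 1/k).
        return (4 / np.pi) * sum(np.sin((2*k - 1) * x) / (2*k - 1) for k in range(1, N + 1))

    def triangle_partial_sum(x, N):
        # Fourier partial sum of a triangle wave (coefficients ~ 1/k^2).
        return (8 / np.pi**2) * sum((-1)**(k - 1) * np.sin((2*k - 1) * x) / (2*k - 1)**2
                                    for k in range(1, N + 1))

    x = np.linspace(-np.pi, np.pi, 20001)
    triangle = (2 / np.pi) * np.arcsin(np.sin(x))
    for N in (8, 32, 128):
        gibbs_overshoot = square_partial_sum(x, N).max() - 1.0
        triangle_error = np.abs(triangle_partial_sum(x, N) - triangle).max()
        # The overshoot settles near 0.18 (~9% of the jump) and never vanishes;
        # the triangle-wave error keeps shrinking as N grows.
        print(N, gibbs_overshoot, triangle_error)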

Moreover, Taylor series are inherently nonlinear, while much of the mathematical toolset around AI assumes linearity (cue linear algebra), sigmoids being the exception, and going beyond cubic approximations tends to make errors worse (as measured by SNR).
