zlacker

The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

I.e., the claim is that this method reproduces the results of conventional attention up to float16 numerical precision.

replies(3): >>fheins+A5 >>energy+19 >>kristj+Sw

>>jcarre+(OP)
The method is more general. The GitHub repository's first example uses eight Taylor terms (P = 8).

replies(1): >>torgin+iy1

>>jcarre+(OP)
It converges on conventional attention as P goes up.
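
A minimal sketch of that convergence, assuming the "Taylor terms" are the leading terms of the Maclaurin series of exp(x) (the thread does not spell out what is expanded, and the input range [-1, 1] is also an assumption). The names `taylor_exp` and `max_abs_error` are hypothetical helpers, not anything from the paper or its repository:

```python
import math

def taylor_exp(x, num_terms):
    """Sum of the first `num_terms` terms of the Maclaurin series of exp(x)."""
    return sum(x**k / math.factorial(k) for k in range(num_terms))

def max_abs_error(num_terms, lo=-1.0, hi=1.0, samples=201):
    """Largest |exp(x) - truncated series| over an evenly spaced grid on [lo, hi]."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(math.exp(x) - taylor_exp(x, num_terms)) for x in xs)

# Assumed term counts to compare; the error shrinks as more terms are kept.
for p in (2, 4, 8):
    print(p, max_abs_error(p))
```

How small the error gets for a given P depends heavily on the range of the inputs, so this says nothing by itself about the error figures reported in the paper.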

>>jcarre+(OP)
> approximately the same magnitude

And they really do mean that: their results show +/- 1 on log10 plots.

replies(1): >>cptroo+Xy1

>>fheins+A5
I'm clueless about this whole thing, but from my EE education I remember that in general:

Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similarly to Fourier series with Gibbs oscillations).

Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids, and going beyond cubic approximations tends to make errors worse (as expressed in SNR).

>>kristj+Sw
I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) are all showing `log_10(|Y - \dot{Y}|)` as having a median of ~-3 (difference of 0.001) and a max of ~-1.5 (difference of 0.035), and this is with only 3 Taylor terms.
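
For anyone checking the arithmetic, here is a small sketch that converts between log10 values read off a plot and absolute differences. It also prints float16 machine epsilon (2**-10, from its 10 explicit mantissa bits) purely as a point of comparison. The helper name `from_log10` is made up for illustration:

```python
import math

def from_log10(value):
    """Convert a log10 reading from a plot back to an absolute difference."""
    return 10.0 ** value

# Absolute difference corresponding to a log10 reading of -3.
print(from_log10(-3))

# log10 of an absolute difference of 0.035.
print(math.log10(0.035))

# float16 machine epsilon and its log10, for comparison only.
float16_eps = 2.0 ** -10
print(float16_eps, math.log10(float16_eps))
```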