zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. blueco+Ca[view] [source] 2026-02-04 15:22:00
>>fheins+(OP)
I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention / the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the Taylor expansion would fail to represent the values well. This just... softens attention's ability to attend?

(I'm imagining that if in the context there are ~4-8 "similar" attention targets that should be sharp, and regular attention learns to select the correct one, this Taylor-approximated version would wash out any difference and they'd all loosely be attended to, and it'd fail to isolate the correct signal)
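Quick numeric sketch of the wash-out worry (mine, not from the paper): approximate exp with a plain 4-term Maclaurin series and look at what happens to a single "needle" logit among near-ties. I keep the logits nonnegative so the truncated series stays positive; the paper's symmetry-aware construction presumably handles this differently.

```python
import math

def taylor_exp(x, terms=4):
    """Truncated Maclaurin series of exp: sum of x**p / p! for p < terms."""
    return sum(x**p / math.factorial(p) for p in range(terms))

def weights(logits, f):
    """Softmax-style normalized weights using f in place of exp."""
    vals = [f(x) for x in logits]
    z = sum(vals)
    return [v / z for v in vals]

# One sharp "needle" logit among distractors.
logits = [10.0, 1.0, 1.0, 1.0]

soft = weights(logits, math.exp)
tay  = weights(logits, taylor_exp)

print(f"softmax winner weight:  {soft[0]:.4f}")  # → 0.9996
print(f"taylor-4 winner weight: {tay[0]:.4f}")   # → 0.9661
```

The truncated series grows polynomially rather than exponentially, so the winner's share of the mass drops and the distractors soak up the rest — exactly the "softened attention" effect.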

Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.

But also, maybe this, combined with some other selective (sparse-attention) trick, means the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness, and all together it could actually be part of the larger solution.

2. energy+Fb[view] [source] 2026-02-04 15:27:11
>>blueco+Ca
> this is where the taylor expression would fail to represent the values well.

"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"
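For a sense of scale (my own back-of-envelope, not the paper's scheme): Float16 has a 10-bit mantissa, so its relative resolution is about 2^-10 ≈ 1e-3. A raw 4-term Maclaurin series of exp only hits that error level for small arguments, so presumably the paper's symmetry-aware construction keeps the effective argument small before the series is applied:

```python
import math

# Float16 has a 10-bit mantissa, so its relative resolution is 2**-10.
FP16_EPS = 2.0**-10  # ≈ 9.8e-4

def taylor_exp(x, terms=4):
    """Truncated Maclaurin series of exp: sum of x**p / p! for p < terms."""
    return sum(x**p / math.factorial(p) for p in range(terms))

def rel_err(x):
    """Relative error of the 4-term series against the true exp."""
    return abs(taylor_exp(x) - math.exp(x)) / math.exp(x)

for x in (0.1, 0.25, 0.5, 1.0):
    flag = "below" if rel_err(x) < FP16_EPS else "above"
    print(f"x={x:4}: relative error {rel_err(x):.2e} ({flag} fp16 resolution)")
```

At x = 0.1 or 0.25 the series is well under Float16 resolution; by x = 1.0 it's off by ~2%. So "P = 4 suffices" and "the sharp cases are where it fails" aren't necessarily in tension — it depends on what range of arguments the method actually feeds into the expansion.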

3. seanhu+Gg[view] [source] 2026-02-04 15:49:17
>>energy+Fb
I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
4. vlovic+Bp[view] [source] 2026-02-04 16:27:46
>>seanhu+Gg
A bounded elementwise error on the weights is, by definition, a stricter evaluation criterion than "performance" metrics obtained by running the model.