zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. mapont+cb[view] [source] 2026-02-04 15:25:37
>>fheins+(OP)
This uses a Taylor expansion to approximate softmax, but that IS only an approximation. I wonder exactly how much that trade-off costs in terms of accuracy vs. performance? I note that they claim it's close to float16 precision with four Taylor terms.
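
Here's a quick sanity check of the kind of error I mean, in plain NumPy. To be clear, this is a naive row-wise Taylor expansion of exp, not the paper's symmetry-aware scheme, so it's only a toy illustration of how the truncation error compares to just running softmax in float16:

    # Naive 4-term Taylor softmax vs. exact softmax vs. float16 softmax.
    # Toy check only -- NOT the paper's symmetry-aware construction.
    import math
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(size=(64, 128))          # fake attention logits

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)    # standard max-subtraction trick
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def taylor_softmax(x, terms=4):
        # exp(x) ~= 1 + x + x^2/2! + x^3/3!  (first `terms` terms, no stabilization,
        # so accuracy degrades quickly for large-magnitude scores)
        e = sum(x**n / math.factorial(n) for n in range(terms))
        return e / e.sum(axis=-1, keepdims=True)

    exact  = softmax(scores)
    approx = taylor_softmax(scores)
    halfp  = softmax(scores.astype(np.float16)).astype(np.float64)

    print("max |taylor4 - exact| :", np.abs(approx - exact).max())
    print("max |float16 - exact| :", np.abs(halfp - exact).max())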

My other concern is that the Taylor expansion itself is fairly complex. I wonder how well GPUs handle it compared to good old-fashioned softmax? The last time I used a Taylor series in a custom Triton kernel it was still very slow. That could just have been my own janky vibe-coded implementation, though.
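
For the speed side, a crude elementwise micro-benchmark is about all I can offer (assumes PyTorch is installed; uses CUDA if available, else CPU). It only measures the raw pointwise math, not a fused attention kernel, so it says nothing about what a proper Triton implementation of the paper's method would do:

    # Crude micro-benchmark: exp vs. a 4-term Taylor polynomial in Horner form.
    # Measures only elementwise throughput, not a full attention kernel.
    import time
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=device)

    def taylor_exp4(x):
        # 1 + x + x^2/2 + x^3/6, evaluated with Horner's rule
        return 1 + x * (1 + x * (0.5 + x * (1.0 / 6.0)))

    def bench(fn, iters=50):
        for _ in range(5):                       # warm-up
            fn(x)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters

    print("exp        :", bench(torch.exp))
    print("taylor (4) :", bench(taylor_exp4))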

◧◩
2. slashd+IQ[view] [source] 2026-02-04 18:22:59
>>mapont+cb
If the model is trained with the approximate softmax, then why does it matter? We only need the behavior of softmax, not an exact numerical solution.
◧◩◪
3. mapont+161[view] [source] 2026-02-04 19:30:59
>>slashd+IQ
I guess what I'm saying is that I'd love to see an LLM actually have its attention mechanism replaced with this and get benchmarked on real-world tasks against quadratic attention. They don't seem to have done that here. They claim it's close to being the same, but my experience tells me it needs to do better than "pretty close."
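
Something like this toy swap, scaled up to a real model and real benchmarks, is what I have in mind (plain PyTorch, naive Taylor softmax again rather than the paper's scheme, so purely illustrative):

    # Miniature version of the swap: single-head scaled dot-product attention
    # with exact softmax vs. a naive 4-term Taylor softmax, comparing outputs.
    import math
    import torch

    torch.manual_seed(0)
    B, T, D = 2, 256, 64                         # batch, sequence length, head dim
    q, k, v = (torch.randn(B, T, D) for _ in range(3))

    def attention(q, k, v, softmax_fn):
        scores = (q @ k.transpose(-2, -1)) / D**0.5
        return softmax_fn(scores) @ v

    def taylor_softmax(x, terms=4):
        e = sum(x**n / math.factorial(n) for n in range(terms))
        return e / e.sum(dim=-1, keepdim=True)

    out_exact  = attention(q, k, v, lambda s: torch.softmax(s, dim=-1))
    out_taylor = attention(q, k, v, taylor_softmax)
    # Output drift says nothing about downstream task quality -- that's
    # exactly why I want to see end-to-end benchmarks.
    print("max output drift:", (out_exact - out_taylor).abs().max().item())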

They also haven't tried to write a high-performance Triton kernel yet. If it goes the way my last experiment with Taylor did, they're in for some bad news.

I'm just a hobbyist though, so it's certainly possible that people with more time/resources could outperform me without much effort. I just want to see it tested on something familiar and benchmarkable.

[go to top]