Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

>>fheins+(OP)
There's a graveyard of hundreds of papers on "approximate near-linear-time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub-n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214

>>thomas+Yc
> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.
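
For anyone wondering how "constant cost per token" is even possible, here is a minimal sketch of the general idea, assuming a plain second-order Taylor truncation exp(x) ≈ 1 + x + x^2/2 of the softmax kernel. This is not the paper's algorithm; the function names, the toy sizes, and the unscaled scores are my own assumptions for illustration. The point is only that each Taylor term factors into a part that depends on the query and a part that depends on the key, so the key/value parts can be kept as running sums of fixed size.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_causal_attention(Q, K, V):
    """Reference: softmax attention where token i attends to tokens 0..i.
    Scores are the raw q.k (no 1/sqrt(d) scaling), an assumption of this sketch."""
    out = []
    for i, q in enumerate(Q):
        w = [math.exp(dot(q, K[j])) for j in range(i + 1)]
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1)) / z
                    for c in range(len(V[0]))])
    return out

def taylor2_causal_attention(Q, K, V):
    """Same computation with exp(x) replaced by 1 + x + x^2/2.
    The running sums below have a size that depends only on d and dv,
    so the work per token does not grow with the token's position."""
    d, dv = len(K[0]), len(V[0])
    s0 = [0.0] * dv                                          # sum of v
    s1 = [[0.0] * dv for _ in range(d)]                      # sum of k_a * v
    s2 = [[[0.0] * dv for _ in range(d)] for _ in range(d)]  # sum of k_a * k_b * v
    z0, z1 = 0.0, [0.0] * d                                  # normalizer sums
    z2 = [[0.0] * d for _ in range(d)]
    out = []
    for q, k, v in zip(Q, K, V):
        # Fold the current key/value into the fixed-size state.
        z0 += 1.0
        for c in range(dv):
            s0[c] += v[c]
        for a in range(d):
            z1[a] += k[a]
            for c in range(dv):
                s1[a][c] += k[a] * v[c]
            for b in range(d):
                z2[a][b] += k[a] * k[b]
                for c in range(dv):
                    s2[a][b][c] += k[a] * k[b] * v[c]
        # Read out with the current query; 1 + x + x^2/2 > 0, so den > 0.
        den = z0 + dot(q, z1) + 0.5 * sum(
            q[a] * q[b] * z2[a][b] for a in range(d) for b in range(d))
        row = []
        for c in range(dv):
            num = s0[c] + sum(q[a] * s1[a][c] for a in range(d)) + 0.5 * sum(
                q[a] * q[b] * s2[a][b][c] for a in range(d) for b in range(d))
            row.append(num / den)
        out.append(row)
    return out

# Toy data with small scores, where a second-order truncation is accurate.
random.seed(0)
n, d, dv = 64, 4, 3
Q = [[random.uniform(-0.2, 0.2) for _ in range(d)] for _ in range(n)]
K = [[random.uniform(-0.2, 0.2) for _ in range(d)] for _ in range(n)]
V = [[random.uniform(-1.0, 1.0) for _ in range(dv)] for _ in range(n)]
exact = exact_causal_attention(Q, K, V)
approx = taylor2_causal_attention(Q, K, V)
max_err = max(abs(e - a) for re, ra in zip(exact, approx) for e, a in zip(re, ra))
print("max abs difference:", max_err)
```

The catch is visible in the state sizes: an order-p term needs on the order of d^p numbers per value channel, and the truncation error grows with |q.k|. So the cost is constant in sequence length, but how large that constant has to be for a given precision is the open question. Note also that the second-order sums are symmetric in (a, b), so roughly half of those entries are redundant.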

>>kristj+sm
It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.
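
To put numbers on the counting argument, a rough sketch (the function name is made up for illustration): with causal attention, token i scores itself against tokens 0..i, so the total number of query-key scores is n(n+1)/2.

```python
def causal_score_count(n):
    """Count query-key dot products when token i attends to tokens 0..i."""
    count = 0
    for i in range(n):
        for _ in range(i + 1):
            count += 1
    return count

for n in (10, 100, 1000):
    # The loop count matches the closed form n(n+1)/2.
    print(n, causal_score_count(n), n * (n + 1) // 2)
```

Multiplying the sequence length by 10 multiplies the number of scores by roughly 100, which is the quadratic growth being described.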

>>logicc+qu
Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

>>naaski+IV
It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.

>>logicc+aA1
Neural nets are structured as matrix multiplication, yet they are universal approximators.
