zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. thomas+Yc[view] [source] 2026-02-04 15:33:26
>>fheins+(OP)
There's a graveyard of hundreds of papers claiming "approximate near-linear-time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are (conditional) lower bounds suggesting that sub-n^2 attention algorithms can't work in general: https://arxiv.org/pdf/2302.13214

2. kristj+sm[view] [source] 2026-02-04 16:15:41
>>thomas+Yc
> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful at that.
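
For context, here's roughly what "constant cost per token" can mean mechanically: truncate the Taylor series of exp(q.k), turn it into a feature map, and keep running sums over the keys/values, so each new token does a fixed amount of work regardless of sequence length. A minimal numpy sketch of that generic trick (not claiming this is the paper's symmetry-aware construction):

    import numpy as np

    def phi(x):
        # Feature map for a 2nd-order Taylor truncation of exp(q.k):
        # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2
        return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)])

    def taylor_causal_attention(Q, K, V):
        n, d = Q.shape
        f = 1 + d + d * d               # feature dimension (grows with d, not n)
        S = np.zeros((f, V.shape[1]))   # running sum of outer(phi(k_i), v_i)
        z = np.zeros(f)                 # running sum of phi(k_i)
        out = np.empty_like(V)
        for t in range(n):              # fixed work per token, independent of n
            S += np.outer(phi(K[t]), V[t])
            z += phi(K[t])
            q = phi(Q[t])
            out[t] = (q @ S) / (q @ z)
        return out

Whether the truncation error and the feature dimension (1 + d + d^2 here) can be kept under control is the hard part, and presumably where the "symmetry-aware" bit comes in.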

3. logicc+qu[view] [source] 2026-02-04 16:47:42
>>kristj+sm
It can't be successful at that, any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2): N tokens each attending over up to N tokens is quadratic. Any sub-quadratic attention must therefore lose some information and be unable to support perfect recall on longer sequences.
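
To make the counting explicit: exact causal softmax attention materializes (at least implicitly) an n x n score matrix, and that all-pairs interaction is what has to be either paid for or approximated. A plain numpy sketch, nothing specific to the paper:

    import numpy as np

    def exact_causal_attention(Q, K, V):
        n, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)              # n x n: every token scores every token
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)   # causal: token t only sees tokens <= t
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)          # softmax over each row
        return w @ V                               # O(n^2 d) overall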
4. oasisa+TH[view] [source] 2026-02-04 17:48:19
>>logicc+qu
That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.
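
Concretely: every one of the n DFT outputs depends on all n inputs, so the "obvious" computation is an n x n matrix-vector product, yet the FFT gets the identical answer in O(n log n) by exploiting structure. Quick numpy check (just the analogy, not the paper's method):

    import numpy as np

    n = 1024
    x = np.random.randn(n)

    # Naive DFT: build the n x n transform matrix -- every output uses every input.
    k = np.arange(n)
    F = np.exp(-2j * np.pi * np.outer(k, k) / n)
    naive = F @ x                     # O(n^2)

    fast = np.fft.fft(x)              # O(n log n), same numbers up to rounding
    print(np.allclose(naive, fast))   # True

Same answer, wildly different cost. The real question is whether the softmax interaction has comparable exploitable structure, which is what results like the one linked upthread try to pin down.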