zlacker

[return to "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"]
1. thomas+Yc 2026-02-04 15:33:26
>>fheins+(OP)
There's a graveyard of hundreds of papers promising "approximate near-linear-time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub-n^2 algorithms can't reproduce exact attention in general: https://arxiv.org/pdf/2302.13214

2. kristj+sm 2026-02-04 16:15:41
>>thomas+Yc
> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it's successful at that.
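
For context, the usual trick behind "constant cost per token" claims is to replace exp(q.k) with a truncated Taylor series, so the attention sums can be kept as running prefix sums and each new token costs O(1) in the sequence length. A minimal sketch of that general idea, not necessarily the paper's symmetry-aware scheme (names are mine):

    import numpy as np

    def taylor_linear_attention(q, k, v):
        """Causal attention with exp(q.k) replaced by 1 + q.k + (q.k)^2 / 2.

        Running sums over a feature map phi(k) make the cost of each new
        token independent of how many tokens came before it.  Generic
        second-order Taylor linearization; illustrative only.
        """
        n, d = q.shape
        q = q / np.sqrt(d)  # usual attention scaling

        def phi(x):
            # phi(x) . phi(y) == 1 + x.y + (x.y)^2 / 2
            return np.concatenate([np.ones(1), x, np.outer(x, x).ravel() / np.sqrt(2)])

        s = np.zeros((1 + d + d * d, v.shape[1]))  # running sum of phi(k_j) v_j^T
        z = np.zeros(1 + d + d * d)                # running sum of phi(k_j)
        out = np.zeros_like(v)
        for t in range(n):
            fk = phi(k[t])
            s += np.outer(fk, v[t])                # per-token work: O(d^2 * d_v), no n
            z += fk
            fq = phi(q[t])
            out[t] = (fq @ s) / (fq @ z)
        return out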

3. logicc+qu 2026-02-04 16:47:42
>>kristj+sm
It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token is to be able to look at every previous token without loss of information, the computation must be O(n^2); N tokens each looking at up to N tokens is quadratic. Any sub-quadratic attention must therefore lose some information and be unable to support perfect recall on longer sequences.
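
To make the cost concrete, here is the exact computation written out directly (a plain NumPy sketch, no batching or heads); the score row for token t has t+1 entries, and summing those rows over all tokens gives the ~n^2/2 term:

    import numpy as np

    def exact_causal_attention(q, k, v):
        """Reference implementation: token t attends to tokens 0..t exactly."""
        n, d = q.shape
        out = np.zeros_like(v)
        for t in range(n):                                  # n tokens ...
            scores = q[t] @ k[:t + 1].T / np.sqrt(d)        # ... each scoring up to n keys
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[t] = weights @ v[:t + 1]
        return out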
4. helloh+TB 2026-02-04 17:22:34
>>logicc+qu
I'm not saying whether the paper is correct (I can't tell), but I don't think your argument holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits each looking at N other digits is quadratic. Any sub-quadratic multiplication must therefore lose some information.
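
And we know that argument fails for multiplication: Karatsuba-style algorithms are exact and sub-quadratic, roughly O(n^1.585) digit operations. A quick sketch (non-negative integers assumed):

    def karatsuba(x: int, y: int) -> int:
        """Exact sub-quadratic multiplication.

        Every digit still influences the result, yet the algorithm never
        forms all n^2 digit pairs explicitly -- which is the point of the
        analogy.
        """
        if x < 10 or y < 10:
            return x * y
        m = max(x.bit_length(), y.bit_length()) // 2
        hi_x, lo_x = x >> m, x & ((1 << m) - 1)
        hi_y, lo_y = y >> m, y & ((1 << m) - 1)
        z0 = karatsuba(lo_x, lo_y)
        z2 = karatsuba(hi_x, hi_y)
        z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2
        return (z2 << (2 * m)) + (z1 << m) + z0

    assert karatsuba(123456789, 987654321) == 123456789 * 987654321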

5. action+5L 2026-02-04 18:00:25
>>helloh+TB
Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?
6. helloh+s01 2026-02-04 19:03:27
>>action+5L
Well, for multiplication, complexity is defined directly in terms of the number of digits/bits. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in practice it doesn't matter, since there is no value in anything beyond float16 for machine learning.
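
To put a rough number on the precision point: running the same full attention in float16 and float64 on random inputs should agree to roughly float16 resolution anyway (a quick NumPy sketch, exact numbers will vary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 512, 64
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

    def full_attention(q, k, v, dtype):
        q, k, v = (a.astype(dtype) for a in (q, k, v))
        scores = (q @ k.T) / float(np.sqrt(d))
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w = w / w.sum(axis=-1, keepdims=True)
        return w @ v

    ref = full_attention(q, k, v, np.float64)
    half = full_attention(q, k, v, np.float16)
    # expect a small gap, on the order of float16 resolution, not zero
    print(np.abs(ref - half.astype(np.float64)).max())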