This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it's successful at that.
Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2): n digits each looking at n other digits is quadratic. Any sub-quadratic multiplication must therefore lose some information.
Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
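A quick sanity check of that claim (a toy sketch assuming NumPy; the array sizes are arbitrary), comparing the direct O(n^2) definition against the O(n log n) FFT route:

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(1000), rng.random(1000)

# Direct convolution touches every (i, j) pair: O(n^2).
direct = np.convolve(a, b)

# FFT route: transform, multiply pointwise, transform back: O(n log n).
n = len(a) + len(b) - 1
fft_based = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

assert np.allclose(direct, fft_based)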
Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:
total = 0
for i in range(len(a)):
    for j in range(len(b)):
        total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b). Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
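To make that concrete (nothing more than a toy check of the identity sum_i sum_j a[i]*b[j] == sum(a) * sum(b)):

import random

a = [random.random() for _ in range(100)]
b = [random.random() for _ in range(200)]

quadratic = sum(x * y for x in a for y in b)  # O(n*m) double loop
linear = sum(a) * sum(b)                      # O(n + m)

assert abs(quadratic - linear) < 1e-6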
If the integrand happens to factor as f(x, y, z) = g(x, y) * h(y, z), then
\iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy,
which greatly accelerated numerical integration in my case (O(n^2) function evaluations rather than O(n^3)).
My advisor was not particularly impressed, and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer: this integration was done millions of times, for different function parameters, in an inner loop). But it was clever and all mine, and I was proud of it.
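For the curious, here is a toy reconstruction of the trick (g and h are made-up integrands of my own choosing, and NumPy is assumed):

import numpy as np

n = 100
x = y = z = np.linspace(0.0, 1.0, n)
dx = dy = dz = x[1] - x[0]

g = np.exp(-np.add.outer(x**2, y))  # g(x, y) on the grid, shape (n, n)
h = np.cos(np.add.outer(y, z))      # h(y, z) on the grid, shape (n, n)

# O(n^3) reference: evaluate f(x, y, z) = g(x, y) * h(y, z) on the full grid.
f = g[:, :, None] * h[None, :, :]
brute = f.sum() * dx * dy * dz

# O(n^2): integrate out x and z first, then y.
gx = g.sum(axis=0) * dx             # \int g(x, y) dx, a function of y
hz = h.sum(axis=1) * dz             # \int h(y, z) dz, a function of y
fast = (gx * hz).sum() * dy

assert np.allclose(brute, fast)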
Attention also has specific structure of its own that tricks like these could exploit.
And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? >>44055347
By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier
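The building block there is carry-save addition; here is a hypothetical Python sketch of the 3:2 compressor that Dadda/Wallace reduction layers are made of (compress_3_to_2 is my own name for it, not anything from the article):

def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    # Full-adder logic applied to whole words: three addends become two
    # using only bitwise ops, so no carry has to ripple anywhere.
    partial_sum = a ^ b ^ c
    carries = ((a & b) | (a & c) | (b & c)) << 1  # majority bits become carries
    return partial_sum, carries

s, c = compress_3_to_2(13, 7, 9)
assert s + c == 13 + 7 + 9  # the pair preserves the total

Stacking these layers collapses all the partial products of a fixed-width multiply in a fixed number of gate delays, leaving just one final addition.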