They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if its successful in that.
Fundamentally, multiplication need to look at every pair of integer from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.
By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier