Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

>>fheins+(OP)
I skimmed the paper, and I think I completely lost the plot.

Sections 2.1 through 2.4 talk about decomposing the per-token-pair attention (key vector from the ith token with query vector from the jth token, where, in inference, the jth token is the one being sampled) into an approximation that is only mildly outrageously exponential in size compared to the original exponential-of-a-dot-product. And they get something that's a polynomial (in the mathematical sense -- you're literally evaluating a polynomial) and has a size that's manageable at 4th order.
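To make the "literally evaluating a polynomial" point concrete, here is a minimal sketch that replaces exp(q.k) with its truncated Taylor series. The vectors and the helper names (dot, taylor_exp, num_monomials) are made up for illustration and are not from the paper. The monomial count is the standard stars-and-bars formula for degree-n monomials in d variables; whether the paper counts its terms exactly this way is an assumption.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def taylor_exp(x, order):
    """Truncated Taylor series of exp(x) around 0: sum of x**n / n! for n = 0..order."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

def num_monomials(d, n):
    """Number of distinct degree-n monomials in d variables (stars and bars)."""
    return math.comb(d + n - 1, n)

# Illustrative query/key vectors (hypothetical values).
q = [0.3, -0.2, 0.5, 0.1]
k = [0.4, 0.1, -0.3, 0.2]
s = dot(q, k)

# The error against the true exponential shrinks as the order grows.
for order in range(1, 5):
    approx = taylor_exp(s, order)
    print(order, approx, abs(approx - math.exp(s)))

# Expanding (q.k)**n naively gives d**n products; many of them are duplicates.
d = 64
for n in range(1, 5):
    print(n, d ** n, num_monomials(d, n))
```

The second loop is only meant to show why exploiting the repeated terms matters: the naive count d**n grows much faster than the count of distinct monomials.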

Okay, great, they took something simple and made it bigger and nastier but less transcendental without losing too much precision. (As far as I know, there is really nothing special about the exp in attention in the first place, so trying to approximate it well seems mostly useful insofar as it will keep existing models working.)

But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix. Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).

So what did the paper gain? Is there some detail that I missed or that the paper completely glossed over that explains why there is any gain of efficiency at all?

For what it's worth, the paper's overall claim is, in some sense, impossible. You can think of attention as being a sort of vector database, and this gets more accurate the sharper you make the exponential. If you replace softmax with actual max, a query locates the key that is the closest match to the query and returns the associated value. This operation is a plain linear search. It's possible (in principle anyway) to do lots of queries and recover the entire contents of the database, and I think that any paper claiming to do it faster than linear time should explain how it's compressing the data and where the loss is.
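The vector-database picture can be sketched in a few lines. This is a toy, not anything from the paper: the keys, values, query, and the sharpness multiplier are all made-up illustration values. As the multiplier grows, the softmax-weighted read converges on the value stored under the best-matching key.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, sharpness):
    """Softmax-weighted average of values; sharpness scales the dot products."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

# A three-entry "database" with scalar values (hypothetical data).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [10.0, 20.0, 60.0]
query = [0.1, 0.9]  # best match is the second key, whose value is 20.0

for sharpness in (1.0, 10.0, 100.0):
    print(sharpness, attend(query, keys, values, sharpness))
```

Note that every read still touches every key, which is the linear search described above.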

In language model terms, imagine a prompt like so:

    1: [string 1]
    2: [string 2]
    3: [string 3]
    ...
    n: [string n]
    
    Tell me the string associated with the number k.

As long as there's enough precision and enough query/key space to fit some embedding of the number k that will match the right thing (and there is a lot of room in high-dimensional spaces), one might expect a transformer to be able to answer this question. But this obviously requires memory with size linear in the prompt length. If you try to get rid of that, you necessarily lose something. (This is not to say that nice attention scaling is impossible -- one could imagine schemes where it takes the model multiple tokens to answer the question, and the number of tokens needed could scale, say, logarithmically with prompt size. But you still need that linear memory.)

>>amluto+Ix
> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention

This is where you’ve gone off track. The “hidden state” for their model is a fixed-size thing, like in an RNN, not per token. For a transformer, the “hidden state” is called the KV cache, and it grows with sequence length. This is why their method is linear, not quadratic.
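A rough sketch of the fixed-size-state idea, using the generic linear-attention trick with an order-2 Taylor feature map. This is not the paper's construction (it goes to higher order and exploits symmetry); the class and function names here are hypothetical. The point is only that once exp(q.k) is replaced by a polynomial, the sum over past tokens can be folded into a running state whose size does not depend on sequence length.

```python
import math
import random

def phi(x):
    """Order-2 Taylor feature map: dot(phi(q), phi(k)) == 1 + q.k + (q.k)**2 / 2."""
    feats = [1.0] + list(x)
    feats += [a * b / math.sqrt(2.0) for a in x for b in x]
    return feats

class RunningState:
    """Fixed-size summary of all (key, value) pairs seen so far."""

    def __init__(self, d_k, d_v):
        self.n_feats = 1 + d_k + d_k * d_k
        self.S = [[0.0] * d_v for _ in range(self.n_feats)]  # sum of phi(k) v^T
        self.z = [0.0] * self.n_feats                        # sum of phi(k)

    def update(self, key, value):
        """Fold one token into the state; cost is independent of history length."""
        for i, fi in enumerate(phi(key)):
            self.z[i] += fi
            for j, vj in enumerate(value):
                self.S[i][j] += fi * vj

    def read(self, query):
        """Normalized attention output for one query, using only the state."""
        f = phi(query)
        denom = sum(fi * zi for fi, zi in zip(f, self.z))
        d_v = len(self.S[0])
        return [sum(f[i] * self.S[i][j] for i in range(len(f))) / denom
                for j in range(d_v)]

    def size(self):
        """Number of floats held, regardless of how many tokens were seen."""
        return self.n_feats * (len(self.S[0]) + 1)

def quadratic_reference(query, keys, values):
    """Same polynomial kernel, computed the KV-cache way: loop over every key."""
    weights = []
    for key in keys:
        s = sum(a * b for a, b in zip(query, key))
        weights.append(1.0 + s + s * s / 2.0)
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

# Made-up data: 50 tokens, d_k = 3, d_v = 2.
rng = random.Random(0)
d_k, d_v = 3, 2
keys = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(50)]
values = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(50)]
query = [0.2, -0.4, 0.7]

state = RunningState(d_k, d_v)
for key, value in zip(keys, values):
    state.update(key, value)

out_state = state.read(query)
out_ref = quadratic_reference(query, keys, values)
print(out_state, out_ref)
print("floats held in state:", state.size())
```

The two outputs agree up to floating-point error, but the state read never looks at the individual keys. The price is that the state only reproduces the truncated polynomial kernel, not the exact exponential.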

The Taylor Series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use the Taylor series!), it’s for the entire tensor-level softmax(QK) computation.
