zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. jmward+zv[view] [source] 2026-02-04 00:06:10
>>matt_d+(OP)
I built guided window attn (literally predict the position of the window) a while ago and that works great. Why are we still stuck on any form of attn that looks at the entire context in any meaningful way? Do humans work this way? Do I need a whole book to predict the next word? Who out there is working on really new unique ways to deal with infinite history, other than me of course :)
2. cs702+cz[view] [source] 2026-02-04 00:28:22
>>jmward+zv
> Who out there is working on ... infinite history?

Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:

* RWKV: https://arxiv.org/abs/2006.16236 / https://arxiv.org/abs/2404.05892 / https://arxiv.org/abs/2305.13048

* Linear attention: https://arxiv.org/abs/2503.14456

* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060

* Linear RNNs: https://arxiv.org/abs/2410.01201

Industry OTOH has gone all-in on Transformers.

3. jmward+nD[view] [source] 2026-02-04 00:54:08
>>cs702+cz
RNNs have two huge issues:

- Long context: recurrence degrades the signal, for the same reason that 'deep' NN architectures don't go much past 3-4 layers before you need residual connections and the like.
- (This is the big one) Training performance is terrible, since you can't parallelize them across a sequence like you can with causal masked attn in transformers.
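The parallelization gap in the second point can be sketched as follows (a toy scalar RNN and a prefix-mean standing in for attention, both illustrative assumptions): the RNN loop has a strict step-to-step data dependence, while each causal-attention output depends only on the raw inputs, so all positions can be computed independently.

```python
import math

def rnn_forward(xs, w):
    """Sequential: h_t depends on h_{t-1}, so this loop cannot be
    parallelized across the sequence dimension."""
    h, hs = 0.0, []
    for x in xs:  # strict data dependence from step to step
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs

def causal_attn_forward(xs):
    """Each position needs only the raw inputs x_0..x_t, so every
    position can be computed independently (in any order, or at once
    as one big matmul in a real transformer)."""
    def out(t):  # toy stand-in for attention: mean over the causal prefix
        return sum(xs[: t + 1]) / (t + 1)
    return [out(t) for t in range(len(xs))]
```

In a real transformer the per-position computation is a batched matmul over the whole sequence, which is what makes training throughput so much better than a sequential scan.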

On the huge benefit side though you get:

- Guaranteed state size, so perfect batch packing, perfect memory use, easy load/unload from a batch, and O(1) token generation, which generally means massive performance gains in inference.
- Unlimited context (well, no need for a position embedding or similar system).

Taking the best of both worlds is definitely where the future is. An architecture that trains parallelized, has a fixed state size so you can load/unload and pack batches perfectly, has unlimited context (with perfect recall), etc. That is the real architecture to go for.
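The fixed-state, O(1)-per-token point can be sketched with a linear-attention-style recurrence (in the spirit of the Katharopoulos et al. paper linked upthread): the whole state is a d x d matrix plus a d-vector normalizer, updated once per token, so memory never grows with context length. The tiny dimension and identity feature map here are toy assumptions.

```python
def step(S, z, q, k, v):
    """One token of linear attention: O(d^2) work per token,
    state size fixed regardless of how long the history is.
    S is a d x d matrix (sum of outer(k, v)), z a d-vector (sum of k)."""
    d = len(k)
    for i in range(d):              # S += outer(k, v)
        for j in range(d):
            S[i][j] += k[i] * v[j]
    for i in range(d):              # z += k
        z[i] += k[i]
    num = [sum(q[i] * S[i][j] for i in range(d)) for j in range(d)]
    den = sum(q[i] * z[i] for i in range(d)) or 1.0  # avoid divide-by-zero
    return [n / den for n in num]
```

Note the contrast with a KV cache: here nothing is appended per token, so batch slots can be swapped in and out with constant memory.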

4. zozbot+UG[view] [source] 2026-02-04 01:17:27
>>jmward+nD
RNN training cannot be parallelized along the sequence dimension like attention can, but RNNs can still be trained in batches on many sequences simultaneously. Given the sizes of modern training sets and the limits on context size for transformer-based models, it's not clear how important this limitation still is. It may have mattered more in the early days of attention-based models, when being able to run quick experimental training jobs on relatively small amounts of training data was important.
5. jmward+EK[view] [source] 2026-02-04 01:43:28
>>zozbot+UG
To get a similar token/sec in training, though, you would need to trade sequence length for batch size, so you could have the massive batch size, but then won't you start hitting memory issues with any reasonable sequence length? You would have to do something like a minibatch along the sequence and cut the gradients after a short number of tokens on each sequence. So how will they learn truly long sequences for recall? Or is there a different trick I am missing here?
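Cutting gradients along the sequence like that is truncated BPTT. A toy sketch (a hypothetical scalar RNN, using forward-mode sensitivities instead of autograd so it is self-contained): the hidden state is carried across chunk boundaries, but the gradient recurrence is reset there, so nothing before the current chunk contributes to the weight update, which is exactly why long-range recall is hard to learn this way.

```python
import math

def run_loss(xs, w):
    """Toy scalar RNN h_t = tanh(w*h_{t-1} + x_t); loss = sum_t h_t."""
    h, L = 0.0, 0.0
    for x in xs:
        h = math.tanh(w * h + x)
        L += h
    return L

def tbptt_grad(xs, w, chunk_len):
    """dLoss/dw via the sensitivity recurrence
    s_t = (1 - h_t^2) * (h_{t-1} + w * s_{t-1}),
    with s reset at chunk boundaries: the carried-in hidden state is
    treated as a constant, i.e. gradients are cut between chunks."""
    h, s, grad = 0.0, 0.0, 0.0
    for t, x in enumerate(xs):
        if t % chunk_len == 0:
            s = 0.0  # detach: no gradient flows into earlier chunks
        h_new = math.tanh(w * h + x)
        s = (1.0 - h_new * h_new) * (h + w * s)
        h = h_new
        grad += s
    return grad
```

With `chunk_len` covering the whole sequence this is full BPTT; with short chunks the state still flows forward, but the learning signal for anything beyond a chunk boundary is gone.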