zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. jmward+zv 2026-02-04 00:06:10
>>matt_d+(OP)
I built guided window attn (literally predict the position of the window) a while ago and it works great. Why are we still stuck on forms of attn that meaningfully look at the entire context? Do humans work this way? Do I need a whole book to predict the next word? Who out there is working on genuinely new ways to deal with infinite history, other than me of course :)
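
To make "literally predict the position of the window" concrete, here is a rough PyTorch-style sketch of the idea (not my actual code; the function and argument names are made up, and a real implementation would gather just the window instead of masking a full score matrix):

    import torch

    def guided_window_attention(q, k, v, predicted_pos, window=64):
        # q, k, v: (T, d); predicted_pos: (T,) long tensor with one predicted
        # window centre per query, e.g. regressed by a small "guide" head.
        T, d = q.shape
        idx = torch.arange(T)
        centre = torch.minimum(predicted_pos, idx)          # never point into the future
        lo = (centre - window // 2).clamp(min=0)
        key_pos = idx.unsqueeze(0)                          # (1, T)
        in_window = (key_pos >= lo.unsqueeze(1)) & (key_pos < (lo + window).unsqueeze(1))
        allowed = in_window & (key_pos <= idx.unsqueeze(1)) # keep it causal
        scores = (q @ k.T) / d ** 0.5
        scores = scores.masked_fill(~allowed, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

The point is that each query only ever attends to a fixed-size window, so per-token cost doesn't depend on how long the history is.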
2. cs702+cz 2026-02-04 00:28:22
>>jmward+zv
> Who out there is working on ... infinite history?

Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:

* RWKV: https://arxiv.org/abs/2305.13048 / https://arxiv.org/abs/2404.05892

* Linear attention: https://arxiv.org/abs/2006.16236 / https://arxiv.org/abs/2503.14456 (the basic recurrence is sketched below)

* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060

* Linear RNNs: https://arxiv.org/abs/2410.01201

Industry OTOH has gone all-in on Transformers.
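
What everything above has in common is a fixed-size recurrent state, which is where the "infinite history" part comes from. A minimal sketch, assuming the simplest unnormalized linear-attention recurrence (the papers add normalization, decay, gating, and so on; the names here are mine):

    import torch

    def linear_attention_step(S, q_t, k_t, v_t):
        # Fold the whole history into a fixed d x d state S, so memory
        # stays O(d^2) no matter how long the sequence gets.
        S = S + torch.outer(k_t, v_t)   # accumulate key-value outer products
        y_t = q_t @ S                   # read out with the current query
        return S, y_t

    # Scan over an arbitrarily long stream; the state never grows.
    d = 64
    S = torch.zeros(d, d)
    for _ in range(10_000):
        q_t, k_t, v_t = torch.randn(3, d)
        S, y_t = linear_attention_step(S, q_t, k_t, v_t)

The tradeoff is that the history gets compressed lossily into S, unlike full softmax attention.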

3. virapt+gC 2026-02-04 00:47:31
>>cs702+cz
> Industry OTOH has gone all-in on Transformers.

It's so annoying. Transformers keep improving and recurrent networks are harder to train, so until we hit some real wall, companies don't seem eager to diverge. It's like lithium batteries improving just fast enough that it was never profitable to work on sodium ones, even though we'd really like the sodium ones to win out.

4. cs702+u46 2026-02-05 15:30:25
>>virapt+gC
I'd add this to the list of linear-attention RNNs:

https://arxiv.org/abs/2602.00294

Recently saw it on HN.
