zlacker

FlashAttention-T: Towards Tensorized Attention

submitted by matt_d+(OP) on 2026-02-03 21:15:48 | 115 points | 51 comments

2. saagar+8g 2026-02-03 22:40:32
>>matt_d+(OP)
Less annoying link directly to the paper: https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=...

3. SpaceM+ng 2026-02-03 22:41:36
>>saagar+8g
Link if you don't want it to automatically download the file:

https://dl.acm.org/doi/pdf/10.1145/3774934.3786425

7. Maxiou+us 2026-02-03 23:49:29
>>sigbot+jo
Yep, https://github.com/poad42/cuda-fp8-ampere is another recent attempt at squeezing whatever's left out of Ampere.

18. cs702+cz 2026-02-04 00:28:22
>>jmward+zv
> Who out there is working on ... infinite history?

Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:

* RWKV: https://arxiv.org/abs/2006.16236 / https://arxiv.org/abs/2404.05892 / https://arxiv.org/abs/2305.13048

* Linear attention: https://arxiv.org/abs/2503.14456

* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060

* Linear RNNs: https://arxiv.org/abs/2410.01201

Industry OTOH has gone all-in on Transformers.
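
For anyone wondering why these promise unbounded history: a linear-attention layer can be run as an RNN whose entire state is one fixed-size matrix, so the memory footprint does not grow with context length. A minimal NumPy sketch of that recurrence (my own illustration, not taken from any specific paper above):

    import numpy as np

    def linear_attention_step(S, z, q, k, v):
        """One recurrent step of linear attention.

        S: (d, d) running sum of outer products k_t v_t^T
        z: (d,)   running sum of keys, used for normalization
        q, k, v: (d,) current query/key/value after a positive feature map
        """
        S = S + np.outer(k, v)          # accumulate key-value associations
        z = z + k                       # accumulate keys for the denominator
        out = (q @ S) / (q @ z + 1e-6)  # read out with the current query
        return S, z, out

    d = 4
    rng = np.random.default_rng(0)
    S, z = np.zeros((d, d)), np.zeros(d)
    for _ in range(1000):                          # arbitrarily long sequence,
        q, k, v = np.exp(rng.normal(size=(3, d)))  # constant-size state
        S, z, out = linear_attention_step(S, z, q, k, v)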

20. measur+Gz 2026-02-04 00:31:09
>>vlovic+ew
I'm correct on the technical level as well: https://chatgpt.com/s/t_698293481e308191838b4131c1b605f1

25. sigbot+hD 2026-02-04 00:53:39
>>coolsu+DB
https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...

cooperative execution yeah

as you can tell I do not do CUDA for a living :D

46. imtrin+8v1 2026-02-04 08:39:30
>>simian+gp
Most of the benefit you can get from scaling up a single layer inside a neural network can usually be obtained more cheaply by adding more layers instead.

Here is an illustrative example: you can write a higher-order polynomial as a recursive chain of first-order polynomials (Horner's method).
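
To make that concrete, a tiny sketch (my own) of Horner's method, where one first-order step per coefficient recovers the full higher-order polynomial:

    def horner(coeffs, x):
        """Evaluate c[0]*x^n + c[1]*x^(n-1) + ... + c[n] as a chain of
        first-order steps: acc = acc * x + c (one "layer" per coefficient)."""
        acc = 0.0
        for c in coeffs:
            acc = acc * x + c
        return acc

    # 3x^3 - 2x^2 + 5x - 7 at x = 2:  24 - 8 + 10 - 7 = 19
    assert horner([3, -2, 5, -7], 2.0) == 19.0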

Things like TreeConnect [0] scale better if each TreeConnect layer has a depth of two and you add more TreeConnect layers to compensate for the lost expressivity, rather than choosing a higher depth per layer.

Attention pairs every token with every other token; n^10 would mean pairing each token with nine other tokens at once. The primary benefit of doing that is a "function" that takes the interactions of 10 tokens as input and produces a single output, but you already have that with a ten-layer network: the interaction of two tokens can form a combined token that carries the information of both, and the network can repeat this step across layers to accumulate the desired information into a single super token, then make a decision based on all ten input tokens.

[0] https://ieeexplore.ieee.org/document/8576141
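
A toy sketch of the accumulation argument above (mine, not from the TreeConnect paper): with one pairwise "combine" per layer, a stack of layers already yields an output that depends on all ten input tokens, without ever forming a 10-way interaction term.

    def combine(a, b):
        # stand-in for one attention layer mixing two tokens' information;
        # here a "token" is just the set of input positions it has absorbed
        return a | b

    tokens = [{i} for i in range(10)]   # ten input tokens
    acc = tokens[0]
    for t in tokens[1:]:                # one pairwise interaction per "layer"
        acc = combine(acc, t)
    assert acc == set(range(10))        # the running token has seen all ten inputs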

49. shenbe+oX1 2026-02-04 12:15:41
>>crysta+WA
There are two ingredients that don't fit into the "attention-is-kernel-smoothing" view as far as I can tell: positional encoding and causal masking (which is another way of doing positional encoding, I guess).
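
For anyone who hasn't seen the kernel-smoothing reading: vanilla softmax attention is a Nadaraya-Watson estimator with kernel exp(q.k / sqrt(d)), and causal masking is exactly the part you have to bolt on by restricting the smoothing window to earlier positions. A rough single-head NumPy sketch of that view (my illustration, positional encoding omitted):

    import numpy as np

    def attention_as_smoothing(Q, K, V, causal=True):
        """Nadaraya-Watson smoothing of V with kernel exp(q.k / sqrt(d))."""
        d = Q.shape[-1]
        weights = np.exp(Q @ K.T / np.sqrt(d))          # unnormalized kernel values
        if causal:
            weights *= np.tril(np.ones_like(weights))   # only smooth over the past
        weights /= weights.sum(axis=-1, keepdims=True)  # normalize per query
        return weights @ V                              # weighted average of values

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 5, 8))                # 5 tokens, head dim 8
    out = attention_as_smoothing(Q, K, V)               # (5, 8)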

Also, simplicial attention is pretty much what the OP was going for, but the hardware lottery is such that it's going to be pretty difficult to get competitive on the engineering side. Not that people aren't trying (e.g. https://arxiv.org/pdf/2507.02754).
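
For the shape of what "simplicial" means here: 2-simplicial attention scores each query against pairs of keys with a trilinear form, so attention weights live on triples (i, j, k) rather than pairs. A naive O(n^3) toy version (my sketch, ignoring the windowing and kernel tricks the linked paper needs to make this tractable; the exact value aggregation there may differ):

    import numpy as np

    def two_simplicial_attention(Q, K1, K2, V1, V2):
        """Toy 2-simplicial attention: trilinear logits over (query, key pair)."""
        n, d = Q.shape
        # logits[i, j, k] = sum_c Q[i, c] * K1[j, c] * K2[k, c]
        logits = np.einsum('ic,jc,kc->ijk', Q, K1, K2) / np.sqrt(d)
        flat = logits.reshape(n, -1)
        w = np.exp(flat - flat.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # softmax over all key pairs
        w = w.reshape(n, n, n)
        # value for a pair (j, k) taken here as the elementwise product V1[j] * V2[k]
        return np.einsum('ijk,jc,kc->ic', w, V1, V2)

    rng = np.random.default_rng(0)
    Q, K1, K2, V1, V2 = rng.normal(size=(5, 6, 8))     # 6 tokens, head dim 8
    out = two_simplicial_attention(Q, K1, K2, V1, V2)  # (6, 8)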

51. cs702+u46 2026-02-05 15:30:25
>>virapt+gC
I'd add this to the list of linear-attention RNNs:

https://arxiv.org/abs/2602.00294

Recently saw it on HN.
