zlacker

[return to "FlashAttention-T: Towards Tensorized Attention"]
1. jmward+zv[view] [source] 2026-02-04 00:06:10
>>matt_d+(OP)
I built guided window attn (literally predict the position of the window) a while ago and it works great. Why are we still stuck on forms of attn that look at the entire context in any meaningful way? Do humans work this way? Do I need a whole book to predict the next word? Who out there is working on genuinely new ways to deal with infinite history, other than me of course :)
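
To give a sense of what I mean (toy sketch only, made-up names and shapes, not my actual code): each query predicts how far back to look, and attention happens inside a small window centred there, so the cost is O(L * window) instead of O(L^2).

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def guided_window_attention(q, k, v, w_offset, window=64):
        """q, k, v: (L, d) arrays. w_offset: (d,) vector that maps a query to a
        predicted look-back distance. Cost is O(L * window) rather than O(L^2)."""
        L, d = q.shape
        out = np.zeros_like(v)
        for t in range(L):
            # Predict how far into the past this query should look (clamped to [0, t]).
            # In a real model this offset head needs a differentiable training
            # signal -- that's the hard part.
            dist = int(np.clip(np.round(q[t] @ w_offset), 0, t))
            centre = t - dist
            lo = max(0, centre - window // 2)
            hi = min(t + 1, centre + window // 2 + 1)  # causal: never look past t
            scores = (k[lo:hi] @ q[t]) / np.sqrt(d)
            out[t] = softmax(scores) @ v[lo:hi]
        return out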
2. mapont+wR[view] [source] 2026-02-04 02:35:08
>>jmward+zv
> Who out there is working on really new unique ways to deal with infinite history, other than me of course :)

I'm working on a novel (I think) linear attention mechanism in my personal lab that's O(L) for effectively infinite context. I haven't yet decided how much of it is going to be open source, but I agree with you that it's important to figure this out.
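
For anyone wondering what I mean by O(L) concretely: the usual starting point is the standard kernelised linear-attention recurrence below (elu(x)+1 feature map, textbook stuff, emphatically not my mechanism). A fixed-size state summarises the whole history, so each step costs the same no matter how long the context is.

    import numpy as np

    def phi(x):
        # Positive feature map (elu(x) + 1); any positive kernel feature map works.
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(q, k, v):
        """q, k, v: (L, d). One left-to-right pass; state is O(d^2), not O(L)."""
        L, d = q.shape
        S = np.zeros((d, d))   # running sum of outer(phi(k_i), v_i)
        z = np.zeros(d)        # running sum of phi(k_i), used for normalisation
        out = np.zeros_like(v)
        for t in range(L):
            S += np.outer(phi(k[t]), v[t])
            z += phi(k[t])
            qt = phi(q[t])
            out[t] = (qt @ S) / (qt @ z + 1e-6)
        return out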

Was your work open? Is there some place I can read more about it? I'm trying to figure out what to do with my thing on the off-chance that it actually does turn out to work the way I want it to.

3. jmward+Zk1[view] [source] 2026-02-04 07:19:10
>>mapont+wR
I'm trying to figure the same thing out for my stuff. I figured out a simple way to train location prediction, so I'm using it both for guided window attn (predict a distance into the past to look at) and for memory (predict an x, y location for a 2d window into a memory store that will be helpful to look at). I suspect there are a lot of people out there who have found that one weird trick but haven't released it because they don't know how to capitalize on the idea. Why give OpenAI and others the keys to the future for free?
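
Very rough illustration of the memory half of that (toy sketch, every detail here is invented for the example, not my real code): predict an (x, y) from the query, cut a small patch out of a 2d grid of memory slots around it, and attend only over that patch. Training the location head to actually be useful is the part that took the work.

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def read_memory_window(query, memory, w_xy, patch=4):
        """query: (d,). memory: (H, W, d) grid of memory slots.
        w_xy: (d, 2) head mapping the query to a normalised (x, y) in [0, 1]."""
        H, W, d = memory.shape
        xy = 1.0 / (1.0 + np.exp(-(query @ w_xy)))    # sigmoid keeps the location in-bounds
        cx = int(xy[0] * (W - 1))
        cy = int(xy[1] * (H - 1))
        x0, x1 = max(0, cx - patch // 2), min(W, cx + patch // 2 + 1)
        y0, y1 = max(0, cy - patch // 2), min(H, cy + patch // 2 + 1)
        window = memory[y0:y1, x0:x1].reshape(-1, d)  # flatten the patch to (k, d)
        scores = (window @ query) / np.sqrt(d)
        return softmax(scores) @ window               # (d,) read vector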