Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:
* RWKV: https://arxiv.org/abs/2305.13048 / https://arxiv.org/abs/2404.05892 / https://arxiv.org/abs/2503.14456
* Linear attention: https://arxiv.org/abs/2006.16236
* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060
* Linear RNNs: https://arxiv.org/abs/2410.01201
Industry OTOH has gone all-in on Transformers.
It's so annoying. Transformers keep improving, and recurrent networks are harder to train, so until we hit some real wall, companies don't seem eager to diverge. It's like lithium batteries improving so quickly that it was never profitable to work on sodium ones, even though we'd unfortunately really like the sodium ones to be better.
On the huge benefit side, though, you get:
* a guaranteed state size, so perfect batch packing, perfect memory use, easy load/unload from a batch, and O(1) work per generated token, which generally means massive performance gains in inference
* unlimited context (well, no need for a concept of a position embedding or similar system)
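To make the fixed-state point concrete, here's a minimal, purely illustrative sketch (toy code and made-up shapes, not any real model) contrasting the two decoding regimes: a recurrent cell updates a constant-size state per token, while a Transformer's KV cache grows with the context.

```python
import numpy as np

d = 64                                   # hidden/state size (illustrative)
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01

def rnn_step(state, x):
    # Recurrent decoding: the state is one fixed-size vector, so memory is
    # constant and the work per generated token is O(1) in context length.
    return np.tanh(state @ W_h + x @ W_x)

def attn_step(kv_cache, q, k, v):
    # Transformer decoding: the KV cache grows by one (k, v) pair per token,
    # so memory grows with context and each step costs O(t) in tokens seen.
    kv_cache.append((k, v))
    ks = np.stack([kv[0] for kv in kv_cache])   # (t, d)
    vs = np.stack([kv[1] for kv in kv_cache])   # (t, d)
    w = np.exp(ks @ q / np.sqrt(d))
    w /= w.sum()
    return w @ vs

state, cache = np.zeros(d), []
for _ in range(10):
    x = rng.normal(size=d)
    state = rnn_step(state, x)                  # state stays (d,)
    _ = attn_step(cache, x, x, x)               # cache holds one more entry
```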
Taking the best of both worlds is definitely where the future is: an architecture that trains in parallel, has a fixed state size so you can load/unload and pack batches perfectly, unlimited context (with perfect recall), etc. That is the real architecture to go for.
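For what it's worth, the kernelized linear attention line of work linked above is one existing attempt at exactly this trade: the same computation can be written as a parallel attention-like form for training and as a fixed-size recurrent state update for inference. A minimal sketch under that formulation (illustrative shapes and feature map, no claim to match any particular paper's details):

```python
import numpy as np

def phi(x):
    # A simple positive feature map (ELU + 1), common in kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def parallel_form(Q, K, V):
    # Training-time form: one big causal-masked matmul over the whole sequence.
    # Shapes: Q, K are (L, d), V is (L, d_v).
    L = Q.shape[0]
    A = phi(Q) @ phi(K).T                   # (L, L) unnormalized weights
    A *= np.tril(np.ones((L, L)))           # causal mask
    return (A @ V) / (A.sum(axis=1, keepdims=True) + 1e-6)

def recurrent_form(Q, K, V):
    # Inference-time form: fixed-size state -- a (d, d_v) matrix and a (d,) vector --
    # updated once per token, i.e. O(L) total, O(1) per step.
    d, d_v = Q.shape[1], V.shape[1]
    S, z, outs = np.zeros((d, d_v)), np.zeros(d), []
    for q, k, v in zip(phi(Q), phi(K), V):
        S += np.outer(k, v)                 # accumulate key-value associations
        z += k
        outs.append((q @ S) / (q @ z + 1e-6))
    return np.stack(outs)

L, d, d_v = 8, 4, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d_v))
assert np.allclose(parallel_form(Q, K, V), recurrent_form(Q, K, V))
```

The two forms compute the same outputs; the recurrent one is what gives the constant memory and per-token cost at inference.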
I'm working on a novel (I think) linear attention mechanism in my personal lab that's O(L) for effectively infinite context. I haven't yet decided how much of it is going to be open source, but I agree with you that it's important to figure this out.
Is your work open? Is there some place I can read more about it? I'm trying to figure out what to do with my thing on the off-chance that it actually does turn out to work the way I want it to.
https://arxiv.org/abs/2602.00294
Recently saw it on HN.