Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:
* RWKV: https://arxiv.org/abs/2305.13048 / https://arxiv.org/abs/2404.05892 / https://arxiv.org/abs/2503.14456
* Linear attention: https://arxiv.org/abs/2006.16236
* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060
* Linear RNNs: https://arxiv.org/abs/2410.01201
Industry OTOH has gone all-in on Transformers.
It's so annoying. Transformers keep improving, and recurrent networks are harder to train, so until we hit some real wall, companies don't seem eager to diverge. It's like lithium batteries improving so quickly that it was never profitable to work on sodium ones, even though we'd unfortunately really like the sodium ones to be better.
On the huge benefit side, though, you get:
* a guaranteed state size, so perfect batch packing, perfect memory use, easy load/unload from a batch, and O(1) work per generated token, which generally means massive performance gains in inference
* unlimited context (well, no need for a concept of a position embedding or similar system)
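To make the fixed-state point concrete, here's a minimal, purely illustrative sketch (toy code and made-up shapes, not any real model) contrasting the two decoding regimes: a recurrent cell updates a constant-size state per token, while a Transformer's KV cache grows with the context.

```python
import numpy as np

d = 64                                   # hidden/state size (illustrative)
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01

def rnn_step(state, x):
    # Recurrent decoding: the state is one fixed-size vector, so memory is
    # constant and the work per generated token is O(1) in context length.
    return np.tanh(state @ W_h + x @ W_x)

def attn_step(kv_cache, q, k, v):
    # Transformer decoding: the KV cache grows by one (k, v) pair per token,
    # so memory grows with context and each step costs O(t) in tokens seen.
    kv_cache.append((k, v))
    ks = np.stack([kv[0] for kv in kv_cache])   # (t, d)
    vs = np.stack([kv[1] for kv in kv_cache])   # (t, d)
    w = np.exp(ks @ q / np.sqrt(d))
    w /= w.sum()
    return w @ vs

state, cache = np.zeros(d), []
for _ in range(10):
    x = rng.normal(size=d)
    state = rnn_step(state, x)                  # state stays (d,)
    _ = attn_step(cache, x, x, x)               # cache holds one more entry
```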
Taking the best of both worlds is definitely where the future is: an architecture that trains in parallel, has a fixed state size so you can load/unload and pack batches perfectly, unlimited context (with perfect recall), etc. That is the real architecture to go for.
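For what it's worth, the kernelized linear attention line of work linked above is one existing attempt at exactly this trade: the same computation can be written as a parallel attention-like form for training and as a fixed-size recurrent state update for inference. A minimal sketch under that formulation (illustrative shapes and feature map, no claim to match any particular paper's details):

```python
import numpy as np

def phi(x):
    # A simple positive feature map (ELU + 1), common in kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def parallel_form(Q, K, V):
    # Training-time form: one big causal-masked matmul over the whole sequence.
    # Shapes: Q, K are (L, d), V is (L, d_v).
    L = Q.shape[0]
    A = phi(Q) @ phi(K).T                   # (L, L) unnormalized weights
    A *= np.tril(np.ones((L, L)))           # causal mask
    return (A @ V) / (A.sum(axis=1, keepdims=True) + 1e-6)

def recurrent_form(Q, K, V):
    # Inference-time form: fixed-size state -- a (d, d_v) matrix and a (d,) vector --
    # updated once per token, i.e. O(L) total, O(1) per step.
    d, d_v = Q.shape[1], V.shape[1]
    S, z, outs = np.zeros((d, d_v)), np.zeros(d), []
    for q, k, v in zip(phi(Q), phi(K), V):
        S += np.outer(k, v)                 # accumulate key-value associations
        z += k
        outs.append((q @ S) / (q @ z + 1e-6))
    return np.stack(outs)

L, d, d_v = 8, 4, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d_v))
assert np.allclose(parallel_form(Q, K, V), recurrent_form(Q, K, V))
```

The two forms compute the same outputs; the recurrent one is what gives the constant memory and per-token cost at inference.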
I'm working on a novel (I think) linear attention mechanism in my personal lab that's O(L) for effectively infinite context. I haven't yet decided how much of it is going to be open source, but I agree with you that it's important to figure this out.
Is your work open? Is there some place I can read more about it? I'm trying to figure out what to do with my thing on the off-chance that it actually does turn out to work the way I want it to.
https://arxiv.org/abs/2602.00294
Recently saw it on HN.