Many people are still working on improving RNNs, mostly in academia. Examples off the top of my head:
* RWKV: https://arxiv.org/abs/2305.13048 / https://arxiv.org/abs/2404.05892 / https://arxiv.org/abs/2503.14456
* Linear attention: https://arxiv.org/abs/2006.16236
* State space models: https://arxiv.org/abs/2312.00752 / https://arxiv.org/abs/2405.21060
* Linear RNNs: https://arxiv.org/abs/2410.01201
Industry OTOH has gone all-in on Transformers.
On the huge benefit side, though, you get:
* A fixed state size, so you get perfect batch packing, perfect memory use, and easy load/unload of sequences from a batch. Generating each token costs O(1), regardless of how much context came before, so you generally see massive performance gains at inference.
* Unlimited context (well, no need for a position embedding or a similar mechanism).
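To make the fixed-state point concrete, here's a rough sketch of what "load/unload from a batch" looks like when the entire per-sequence memory is one fixed-size vector. Everything in it (`Slot`, `step`, `load`, `unload`, `STATE_SIZE`, the `decay` constant, and the toy elementwise recurrence) is made up for illustration and isn't taken from any of the papers above.

```python
from dataclasses import dataclass, field

STATE_SIZE = 4  # hypothetical fixed state width, independent of sequence length


@dataclass
class Slot:
    """One batch slot: the whole per-sequence memory is a fixed-size vector."""
    state: list = field(default_factory=lambda: [0.0] * STATE_SIZE)
    tokens_seen: int = 0


def step(slot, x, decay=0.9):
    """Consume one token embedding x (length STATE_SIZE).

    Toy elementwise linear recurrence (an assumption, not a real model):
        h_t = decay * h_{t-1} + (1 - decay) * x_t
    The cost per token does not depend on how many tokens came before.
    """
    slot.state = [decay * h + (1.0 - decay) * xi for h, xi in zip(slot.state, x)]
    slot.tokens_seen += 1
    return slot.state


def unload(slot):
    """Snapshot a sequence so its batch slot can be handed to another one."""
    return (list(slot.state), slot.tokens_seen)


def load(snapshot):
    """Restore a sequence into a fresh slot from its snapshot."""
    state, tokens_seen = snapshot
    return Slot(state=list(state), tokens_seen=tokens_seen)


slot = Slot()
for _ in range(1000):
    step(slot, [1.0, 0.0, -1.0, 0.5])

snapshot = unload(slot)   # same size after 1000 tokens as after 1
resumed = load(snapshot)
print(len(resumed.state), resumed.tokens_seen)
```

Compare that with a Transformer KV cache, where the thing you would have to snapshot grows with every token consumed.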
Taking the best of both worlds is definitely where it's at for the future: an architecture that trains in parallel, has a fixed state size so you can load/unload and pack batches perfectly, has unlimited context (with perfect recall), and so on. That is the real architecture to go for.
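For the "train parallelized" half, one common trick (a sketch under my own assumptions, not any specific paper's method) is to keep the recurrence linear, h_t = a_t * h_{t-1} + b_t. Each step is then an affine map, composing affine maps is associative, and so all the hidden states can be computed with a prefix scan instead of a strictly sequential loop. The scalar toy below uses hypothetical names (`combine`, `scan`, `sequential`, `parallel`) and runs the two halves one after the other; a real implementation would evaluate them concurrently, for example with something like `jax.lax.associative_scan`.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential(steps, h0=0.0):
    """Reference: the usual token-by-token RNN loop."""
    out, h = [], h0
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def scan(steps):
    """Inclusive prefix scan by divide and conquer.

    The two recursive calls are independent of each other, which is
    where the parallelism would come from on real hardware.
    """
    if len(steps) <= 1:
        return list(steps)
    mid = len(steps) // 2
    left, right = scan(steps[:mid]), scan(steps[mid:])
    carry = left[-1]  # composition of every step in the left half
    return left + [combine(carry, r) for r in right]


def parallel(steps, h0=0.0):
    """Apply each prefix composition to the initial state."""
    return [a * h0 + b for a, b in scan(steps)]


steps = [(0.5, 1.0), (0.9, -2.0), (1.1, 0.5), (0.3, 4.0), (0.7, 0.0)]
seq = sequential(steps, h0=2.0)
par = parallel(steps, h0=2.0)
print(all(abs(s - p) < 1e-9 for s, p in zip(seq, par)))
```

The catch is that this only works because the step is linear in h; put a nonlinearity such as tanh around the recurrence and the composition is no longer a simple (a, b) pair.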