There are probably loads of ways you can make language models with 100M parameters more efficient, but most of them won't scale to models with 100B parameters.
IIRC there is a bit of a phase transition that happens around 7B parameters where the distribution of activations changes qualitatively.
Anthropic has interpretability papers where their method doesn't work for 'small' models (~5B parameters) but works great for models with >50B parameters.
For example, check out the proceedings of the AGI Conference, which has been going on for 16 years. https://www.agi-conference.org/
I have faith in Ilya. He's not going to allow this blunder to define his reputation.
He's going to go all in on research to find something to replace Transformers, leaving everyone else in the dust.