zlacker

1. sebzim+(OP)[view] [source] 2023-11-20 11:36:37
Yeah, but testing whether they work does, and that's the problem.

There are probably loads of ways you can make language models with 100M parameters more efficient, but most of them won't scale to models with 100B parameters.

IIRC there is a bit of a phase transition that happens around 7B parameters where the distribution of activations changes qualitatively.

Anthropic have interpretability papers where their method does not work for 'small' models (with ~5B parameters) but works great for models with >50B parameters.
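
If you want to poke at that yourself, here's a rough sketch (not taken from any of those papers, just one way to do it) that measures how heavy-tailed the hidden-state activations are and what fraction of dimensions become outliers. The model names and the 6*std cutoff are placeholder assumptions on my part.

    # Rough sketch: compare activation statistics across model sizes.
    # Model names and the 6*std outlier cutoff are illustrative, not canonical.
    import torch
    from transformers import AutoModel, AutoTokenizer

    def activation_stats(model_name, text="The quick brown fox jumps over the lazy dog."):
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        with torch.no_grad():
            out = model(**tok(text, return_tensors="pt"))
        stats = []
        for h in out.hidden_states:                # one tensor per layer: (1, seq, dim)
            h = h.squeeze(0).float()
            z = (h - h.mean()) / h.std()
            kurtosis = z.pow(4).mean().item()      # ~3 for Gaussian, large if heavy-tailed
            outlier_frac = (h.abs().max(dim=0).values > 6 * h.std()).float().mean().item()
            stats.append((kurtosis, outlier_frac))
        return stats

    # e.g. compare a small and a mid-sized checkpoint and look at the last layer:
    # for name in ["gpt2", "EleutherAI/gpt-j-6b"]:
    #     print(name, activation_stats(name)[-1])

If the phase transition is real, you'd expect those numbers to jump somewhere around that size rather than grow smoothly.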

replies(1): >>kvetch+ga
2. kvetch+ga[view] [source] 2023-11-20 12:47:18
>>sebzim+(OP)
Deep NNs aren't the only path to AGI... they could actually be one of the worst paths.

For example, check out the proceedings of the AGI Conference, which has been going on for 16 years. https://www.agi-conference.org/

I have faith in Ilya. He's not going to allow this blunder to define his reputation.

He's going to go all in on research to find something to replace Transformers, leaving everyone else in the dust.
