Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.
Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And below FAIR? D2 lacrosse. There are exceptions to such a brash generalization (NVIDIA's group comes to mind), but it's a very good rule of thumb. Or a rule of your whole face, the next time you're tempted to doze off behind the wheel of a Tesla.
There are two big reasons for this:
- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.
- the current approach of “how many Falcon Heavy launches’ worth of TPUs can I throw at the same basic masked attention with residual connections and a cute Fourier positional encoding” (a minimal sketch follows this list) inherently favors deep pockets, and obviously MSFT, sorry, OpenAI, has those, but deep pockets also scale outcomes non-linearly when you’ve got in-house hardware built for mixed-precision matrix multiplication.
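To make that “masked attention with residual feedback” phrase concrete, here is a minimal single-head causal-attention block with a residual connection, written in JAX. Everything here (the shapes, the names, the single head) is my own toy illustration, not anyone's production code:

```python
# Minimal single-head causal (masked) self-attention with a residual
# connection. Real models add multi-head projections, layer norm,
# positional encodings, and an MLP block on top of this.
import jax
import jax.numpy as jnp

def masked_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_model)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / jnp.sqrt(x.shape[-1])      # (seq, seq)
    # Causal mask: position i may only attend to positions <= i.
    mask = jnp.tril(jnp.ones_like(scores))
    scores = jnp.where(mask == 1, scores, -1e9)
    weights = jax.nn.softmax(scores, axis=-1)
    return x + weights @ v                        # residual path

key = jax.random.PRNGKey(0)
kx, kq, kk, kv = jax.random.split(key, 4)
d = 64
x = jax.random.normal(kx, (16, d))                # 16 tokens, d_model = 64
wq, wk, wv = (jax.random.normal(k, (d, d)) / jnp.sqrt(d) for k in (kq, kk, kv))
out = masked_attention(x, wq, wk, wv)             # (16, 64)
```

The point of the sketch is that the block itself is tiny; the moat is in how many of these you can stack and how much data you can push through them.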
Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing 10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
This is ... very incorrect. I am very certain (95%+) that Google had nothing even close to GPT-3 at the time of its release. It's been 2 full years since GPT-3 was released, and even longer since OpenAI actually trained it.
That's not even to mention the other things OpenAI and FAIR have released that were SOTA at the time of release (DALL-E 1, Jukebox, Poker, Diplomacy, Codex).
Google Brain and DeepMind have done a lot of great work, but to imply that they essentially have a monopoly on SOTA results, and that any SOTA results other labs have achieved are just due to Google delaying publication, is ridiculous.
GPT is, well, opaque. It’s somewhere between common knowledge and conspiracy theory that it gets a helping hand from Mechanical Turk-style human labelers when it gets in over its head.
The exact story of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just over-fitting Wikipedia as you feed more corpus and compute into its insatiable maw has always seemed long on claims and light on reproducibility.
I don’t think there are many attention skeptics in language modeling; it’s a good idea, and one you can demo on a gaming PC. Transformers demonstrably work, and a better beam search (or whatever decoding scheme) hits the armchair Turing test harder for a given compute budget.
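And for calibration on how small the decoding side of this is, a toy beam search is only a dozen lines; `log_probs_fn` below is a hypothetical stand-in for whatever model's per-step output you're decoding from, not any real API:

```python
# Toy beam search over a next-token distribution.
import jax
import jax.numpy as jnp

def beam_search(log_probs_fn, bos_id, beam_width=4, max_len=20):
    beams = [([bos_id], 0.0)]                 # (token_list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = log_probs_fn(tokens)  # (vocab_size,)
            top_ids = jnp.argsort(log_probs)[-beam_width:]
            for tok in top_ids:
                candidates.append((tokens + [int(tok)],
                                   score + float(log_probs[tok])))
        # Keep only the highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Fake "model" so the sketch runs: random but deterministic per prefix length.
fake_lm = lambda tokens: jax.nn.log_softmax(
    jax.random.normal(jax.random.PRNGKey(len(tokens)), (100,)))
print(beam_search(fake_lm, bos_id=0))
```

The gap between that toy and a production decoder is exactly the compute-budget point above.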
But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these things are basically asking the question: “if I overfit all human language on the Internet, is that a bad thing?”
It’s my personal suspicion that this is the dominant term, and it’s my personal belief that Google’s ability to do both data and model parallelism at Jeff Dean levels, while simultaneously building out hardware to the exact precision required, is a long way ahead of anyone else’s.
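As a rough illustration of what doing data and model parallelism at once means mechanically (a sketch under my own assumptions: toy shapes, a mesh whose model axis is size 1 so it runs on a single device, and nothing resembling Google's actual tooling):

```python
# Sketch of combined data + model parallelism with JAX's sharding API:
# shard the batch along one mesh axis and the weight columns along the
# other, and let the compiler insert the necessary collectives.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(-1, 1)  # (data, model) device grid
mesh = Mesh(devices, axis_names=("data", "model"))

batch = jnp.ones((32, 1024))                      # toy activations
weights = jnp.ones((1024, 4096))                  # one big layer

# Batch rows split across 'data'; weight columns split across 'model'.
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))
weights = jax.device_put(weights, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    return jax.nn.relu(x @ w)

out = layer(batch, weights)                       # output sharded over both axes
```

Doing that across thousands of chips, with the numerics tuned to the hardware's mixed-precision units, is the part I suspect is hard to match.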
But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer review, let alone a jury. It’s just an educated guess.