Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.
Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization (NVIDIA’s group comes to mind), but it’s a very good rule of thumb. Or your whole face, the next time you’re tempted to doze behind the wheel of a Tesla.
There are two big reasons for this:
- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.
- the current approach of “how many Falcon Heavy launches’ worth of TPU can I throw at the same basic masked attention with residual connections and a cute Fourier coloring” inherently favors deep pockets, and obviously MSFT (sorry, OpenAI) has those, but deep pockets also scale outcomes non-linearly when you’ve got in-house hardware built for mixed-precision multiplies.
Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing $10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
I'm not sure it matters. The history of computing shows that within the decade we will all have the ability to train and use these models.
For example: the high-frequency trading industry is estimated to have made somewhere between $2-3 billion in profit across all of 2020. That’s a good weekend at Google.
HFT shops pay well, but not much better than what top performers at FAANG make.
People work in HFT because without taking a pay cut they can play real ball: they want to try themselves against the best.
Heavy-duty learning people are no different: they want competitive TC, but maybe even more they want to be where the action is.
That’s currently Blade Runner Industries Ltd, but that could change.
Google clearly demonstrates their unrivaled capability to leverage massive quantities of data and compute, but it’s premature to declare that they’ve secured victory in the AI Wars.
And I don’t think whatever iteration of PaLM was cooking at the time GPT-3 started getting press would have looked too shabby.
I think Google crushed OpenAI on both GPT and DALL-E in short order because OpenAI published twice and someone had had enough.
But if you’re interested I’m happy to (attempt) answers to anything that was jargon: by virtue of HN my answers will be peer-reviewed in real time, and with only modest luck, a true expert might chime in.
But in general it’s more likely due to the fact that it’s going to happen anyway: if we share our approaches and research findings, we’ll just get there sooner.
If I were going to cite evidence for Alphabet’s “supremacy” in AI, I would’ve picked something more novel and surprising such as AlphaFold, or perhaps even Gato.
It’s not clear to me that Google has anything which compares to Reality Labs, although this may simply be my own ignorance.
Nvidia surely scooped Google with Instant Neural Graphics Primitives, in spite of Google publishing dozens of (often very interesting) NeRF papers. It’s not a war, all these works build on one another.
I’ve got no interest in moralizing on this, but if any of the big actors wanted to, they could put a meaningful (if not overwhelming) subset of the corpus on S3, put the source code on GitHub, and you could, on a modest budget, see an epoch or three.
I’m not holding my breath.
The bits and pieces I saw first hand tie out reasonably well with that account.
And to be equally clear, I have no inside baseball on how Brain/DM choose when to publish. I have some watercooler chat on the friendly but serious rivalry between those groups, but that’s about it.
I’m looking from the outside in at OpenAI getting all the press and attention, which sounds superficial but sooner or later turns into actual hires of actual star-bound postdocs, and Google laying a little low for a few years.
Then we get Gato, Imagen, and PaLM in the space of like what, 2 months?
Clearly I’m speculating that someone pulled the trigger, but I don’t think it’s like, absurd.
It’s Boltzmann and Szilard who did the original “kT” work on the thermodynamics governing energy dissipation in these scenarios, and Rolf Landauer who did the really interesting work on applying that thermodynamics to lower bounds on the energy expenditure of a given computation.
I said Maxwell’s Demon because it’s the best known example of a deep connection between useful work and computation. But it was sloppy.
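For reference, and this is me filling in the number rather than quoting anyone: Landauer’s bound says that erasing one bit of information must dissipate at least kT ln 2 of energy, which at room temperature works out to roughly

    E_{\min} = k_B T \ln 2 \approx (1.38 \times 10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})(\ln 2) \approx 2.9 \times 10^{-21}\,\mathrm{J}

per bit. Today’s hardware spends many orders of magnitude more than that per operation, which is what “nowhere close” means a few comments up.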
This is ... very incorrect. I am very certain (95%+) that Google had nothing even close to GPT-3 at the time of its release. It's been 2 full years since GPT-3 was released, and even longer since OpenAI actually trained it.
That's not even to mention the other things OpenAI/FAIR have released that were SOTA at the time of release (DALL-E 1, Jukebox, Poker, Diplomacy, Codex).
Google Brain and Deepmind have done a lot of great work, but to imply that they essentially have a monopoly on SOTA results and all SOTA results other labs have achieved are just due to Google delaying publication is ridiculous.
You could’ve had the same reaction years ago when Google published GoogLeNet followed by a series of increasingly powerful Inception models - namely that Google would wind up owning the DNN space. But it didn’t play out that way, perhaps because Google dragged its feet releasing the models and training code, and by the time it did, there were simpler and more powerful models available like ResNet.
Meta’s recent release of the actual OPT LLM weights is probably going to have more impact than PaLM, unless Google can be persuaded to open up that model.
I don’t know what “we should grab a coffee or a beer sometime” means in the hyper-global post-C19 era, but I’d love to speak more on this without dragging a whole HN comment thread through it.
Drop me a line if you’re inclined: ben.reesman at gmail
The TL;DR is that people had been trying for ages to capture long-distance relationships (in the input or output, not inside the black box) in a way that was amenable to traditional neural-network training techniques. That’s non-obvious to do, because your basic NN takes its input without any notion of distance, or put more plainly: it can know all the words in a sentence but struggles with what order they’re in without some help.
The state of the art for a while was something called an LSTM, and those gadgets are still useful sometimes, but they have mostly been obsoleted by this attention/transformer business.
That paper had a number of cool things in it but two stand out:
- by blinding an NN to some parts of the input (“masking”) you can incentivize/compel it to look at (“attend to”) others. That’s a gross oversimplification, but it gets the gist of it I think. People have come up with very clever ways to boost up this or that part of the input in a context-dependent way.
- by playing with some trigonometry you can get a unique shape that can be expressed as a sum on top of something else, which gives the model its “bearings,” so to speak, as to “where” it is in the input: such-and-such a word is closer to the beginning of a paragraph, that sort of thing. People have also gotten very clever about how to do this, but the idea is the same: how do I tell a neural network that there’s structure in what would otherwise be a pile of numbers? (There’s a toy sketch of both ideas right after this list.)
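To make those two bullets concrete, here’s the promised toy NumPy sketch. The names and shapes are mine, not from “Attention Is All You Need” or any production library, and it leaves out almost everything else a real transformer has:

    # Toy sketch of masked attention + sinusoidal positional encodings.
    # Purely illustrative; not any real library's API.
    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        """The trigonometry trick: each position gets a unique pattern of
        sines/cosines that is simply summed onto the token embeddings."""
        pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
        angles = pos / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles)                       # even dims: sine
        enc[:, 1::2] = np.cos(angles)                       # odd dims: cosine
        return enc

    def masked_attention(q, k, v, mask):
        """Scaled dot-product attention; masked positions get a huge negative
        score, so the softmax gives them ~zero weight ('blinding' the model)."""
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)          # (seq_len, seq_len)
        scores = np.where(mask, scores, -1e9)    # hide what the mask blocks
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    # Toy usage: 5 tokens, 8-dim embeddings, causal (look-behind-only) mask.
    seq_len, d_model = 5, 8
    x = np.random.randn(seq_len, d_model) + sinusoidal_positions(seq_len, d_model)
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    out = masked_attention(x, x, x, causal_mask)   # (5, 8)

The real thing adds learned query/key/value projections, multiple heads, residual connections, and layer norm on top, but the “blinding” and the trigonometric “bearings” are exactly these two tricks.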
I did a bit of disclaimer on my original post but not enough to withstand detailed scrutiny. This is sort of the trouble with trying to talk about cutting-edge research in what amounts to a tweet: what’s the right amount of oversimplified, emphatic statement to add legitimate insight but not overstep into being just full of shit.
I obviously don’t know that publication schedules at heavy-duty learning shops are deliberate and factor-in other publications. The only one I know anything concretely about is FAIR and even that’s badly dated knowledge.
I was trying to squeeze into a few hundred characters my very strong belief that Brain and DM haven’t let themselves be scooped since ResNet, based on my even stronger belief that no one has the muscle to do it.
To the extent that my oversimplification detracted from the conversation I regret that.
GPT is, well, opaque. It’s somewhere between common knowledge and conspiracy theory that it gets a helping hand from human Mechanical Turks when it gets in over its head.
The exact details of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just over-fitting Wikipedia the more corpus and compute you feed to its insatiable maw have always seemed a little big on claims and light on reproducibility.
I don’t think there are many attention skeptics in language modeling; it’s a good idea that you can demo on a gaming PC. Transformers demonstrably work, and a better beam search (or whatever) hits the armchair Turing test harder for a given compute budget.
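For anyone to whom “beam search” is jargon: it just means keeping the k most promising partial outputs at each decoding step instead of greedily committing to one. A toy sketch, with a made-up uniform next_token_logprobs standing in for whatever model actually produces the scores (nothing here is a real library’s API):

    # Toy beam-search decoder; purely illustrative.
    import math

    def next_token_logprobs(tokens, vocab):
        # Hypothetical stand-in for a real LM's next-token scores:
        # uniform log-probability over a tiny vocabulary.
        return [(tok, math.log(1.0 / len(vocab))) for tok in vocab]

    def beam_search(next_fn, start, vocab, beam_width=3, max_len=5, eos="<eos>"):
        """Keep the `beam_width` highest-scoring partial sequences at each step."""
        beams = [([start], 0.0)]                 # (tokens, total log-prob)
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == eos:            # finished beams carry forward
                    candidates.append((tokens, score))
                    continue
                for tok, logp in next_fn(tokens, vocab):
                    candidates.append((tokens + [tok], score + logp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        return beams

    print(beam_search(next_token_logprobs, "<bos>", ["the", "cat", "<eos>"]))

A wider beam spends more compute per generated sentence and, up to a point, tends to read better from the same model, which is all the compute-budget remark above means.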
But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these things are basically asking the question: “if I overfit all human language on the Internet, is that a bad thing?”
It’s my personal suspicion that this is the dominant term, and it’s my personal belief that Google’s ability to do both corpus and model parallelism at Jeff Dean levels while simultaneously building out hardware to the exact precision required is unique by a long way.
But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer-review, let alone a jury. It’s just an educated guess.