Without a fairly deep grounding in this stuff it’s hard to appreciate how far ahead Brain and DM are.
Neither OpenAI nor FAIR ever has the top score on anything unless Google delays publication. And short of FAIR? D2 lacrosse. There are exceptions to such a brash generalization (NVIDIA’s group comes to mind), but it’s a very good rule of thumb. Or your whole face, the next time you’re tempted to doze behind the wheel of a Tesla.
There are two big reasons for this:
- the talent wants to work with the other talent, and through a combination of foresight and deep pockets Google got that exponent on their side right around the time NVIDIA cards started breaking ImageNet. Winning the Hinton bidding war clinched it.
- the current approach of “how many Falcon Heavy launches’ worth of TPU can I throw at the same basic masked attention with residual connections and a cute Fourier coloring” inherently favors deep pockets, and obviously MSFT (sorry, OpenAI) has those, but deep pockets also scale outcomes non-linearly when you’ve got in-house hardware built for mixed-precision multiplies.
Now clearly we’re nowhere close to Maxwell’s Demon on this stuff, and sooner or later some bright spark is going to break the logjam of needing $10-100MM in compute to squeeze a few points out of a language benchmark. But the incentives are weird here: who, exactly, does it serve for us plebs to be able to train these things from scratch?
I'm not sure it matters. The history of computing shows that within the decade we will all have the ability to train and use these models.
For example: the high-frequency trading industry is estimated to have made somewhere between $2-3 billion in profit across all of 2020. That’s a good weekend at Google.
HFT shops pay well, but not much better than what top performers at FAANG make.
People work in HFT because without taking a pay cut they can play real ball: they want to try themselves against the best.
Heavy-duty learning people are no different: they want competitive TC, but maybe even more they want to be where the action is.
That’s currently Blade Runner Industries Ltd, but that could change.
Google clearly demonstrates their unrivaled capability to leverage massive quantities of data and compute, but it’s premature to declare that they’ve secured victory in the AI Wars.
And I don’t think whatever iteration of PaLM was cooking at the time GPT-3 started getting press would have looked too shabby.
I think Google crushed OpenAI on both GPT and DALL-E in short order because OpenAI published twice and someone had had enough.
But if you’re interested I’m happy to (attempt) answers to anything that was jargon: by virtue of HN my answers will be peer-reviewed in real time, and with only modest luck, a true expert might chime in.
But in general it’s more likely due to the fact that it’s going to happen anyway: if we share our approaches and research findings, we’ll just get there sooner.
If I were going to cite evidence for Alphabet’s “supremacy” in AI, I would’ve picked something more novel and surprising such as AlphaFold, or perhaps even Gato.
It’s not clear to me that Google has anything which compares to Reality Labs, although this may simply be my own ignorance.
Nvidia surely scooped Google with Instant Neural Graphics Primitives, in spite of Google publishing dozens of (often very interesting) NeRF papers. It’s not a war, all these works build on one another.
I’ve got no interest in moralizing on this, but if any of the big actors wanted to, they could put a meaningful (if not overwhelming) subset of the corpus on S3, put the source code on GitHub, and you could, on a modest budget, see an epoch or three.
I’m not holding my breath.
The bits and pieces I saw first hand tie out reasonably well with that account.
And to be equally clear, I have no inside baseball on how Brain/DM choose when to publish. I have some watercooler chat on the friendly but serious rivalry between those groups, but that’s about it.
I’m looking from the outside in at OpenAI getting all the press and attention, which sounds superficial but sooner or later turns into actual hires of actual star-bound postdocs, and Google laying a little low for a few years.
Then we get Gato, Imagen, and PaLM in the space of like what, 2 months?
Clearly I’m speculating that someone pulled the trigger, but I don’t think it’s like, absurd.
It’s Boltzmann and Szilard who did the original “kT” work on the thermodynamics governing energy dissipation in these scenarios, and Rolf Landauer who did the really interesting work on applying that thermodynamics to lower bounds on the energy expenditure of a given computation.
I said Maxwell’s Demon because it’s the best known example of a deep connection between useful work and computation. But it was sloppy.
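For reference, and this is me filling in the number rather than quoting anyone: Landauer’s bound says that erasing one bit of information must dissipate at least kT ln 2 of energy, which at room temperature works out to roughly

    E_{\min} = k_B T \ln 2 \approx (1.38 \times 10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})(\ln 2) \approx 2.9 \times 10^{-21}\,\mathrm{J}

per bit. Today’s hardware spends many orders of magnitude more than that per operation, which is what “nowhere close” means a few comments up.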
This is ... very incorrect. I am very certain (95%+) that Google had nothing even close to GPT-3 at the time of its release. It's been 2 full years since GPT-3 was released, and even longer since OpenAI actually trained it.
That's not even to mention the other things OpenAI/FAIR have released that were SOTA at the time of release (DALL-E 1, Jukebox, Poker, Diplomacy, Codex).
Google Brain and Deepmind have done a lot of great work, but to imply that they essentially have a monopoly on SOTA results and all SOTA results other labs have achieved are just due to Google delaying publication is ridiculous.
You could’ve had the same reaction years ago when Google published GoogLeNet followed by a series of increasingly powerful Inception models - namely that Google would wind up owning the DNN space. But it didn’t play out that way, perhaps because Google dragged its feet releasing the models and training code, and by the time it did, there were simpler and more powerful models available like ResNet.
Meta’s recent release of the actual OPT LLM weights is probably going to have more impact than PaLM, unless Google can be persuaded to open up that model.
I don’t know what “we should grab a coffee or a beer sometime” means in the hyper-global post-C19 era, but I’d love to speak more on this without dragging a whole HN comment thread through it.
Drop me a line if you’re inclined: ben.reesman at gmail
The TL;DR is that people had been trying for ages to capture long-distance relationships (in the input or output, not inside the black box) in a way that was amenable to traditional neural-network training techniques. That’s non-obvious to do, because your basic NN takes its input without any notion of distance, or put more plainly: it can know all the words in a sentence but struggles with what order they’re in without some help.
The state of the art for a while was something called an LSTM, and those gadgets are still useful sometimes, but they have mostly been obsoleted by this attention/transformer business.
That paper had a number of cool things in it but two stand out:
- by blinding an NN to some parts of the input (“masking”) you can incentivize/compel it to look at (“attend to”) others. That’s a gross oversimplification, but it gets the gist of it I think. People have come up with very clever ways to boost up this or that part of the input in a context-dependent way.
- by playing with some trigonometry you can get a unique shape that can be expressed as a sum on top of something else, which gives the model its “bearings,” so to speak, as to “where” it is in the input: such-and-such a word is closer to the beginning of a paragraph, that sort of thing. People have also gotten very clever about how to do this, but the idea is the same: how do I tell a neural network that there’s structure in what would otherwise be a pile of numbers? (There’s a toy sketch of both ideas right after this list.)
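To make those two bullets concrete, here’s the promised toy NumPy sketch. The names and shapes are mine, not from “Attention Is All You Need” or any production library, and it leaves out almost everything else a real transformer has:

    # Toy sketch of masked attention + sinusoidal positional encodings.
    # Purely illustrative; not any real library's API.
    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        """The trigonometry trick: each position gets a unique pattern of
        sines/cosines that is simply summed onto the token embeddings."""
        pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
        angles = pos / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles)                       # even dims: sine
        enc[:, 1::2] = np.cos(angles)                       # odd dims: cosine
        return enc

    def masked_attention(q, k, v, mask):
        """Scaled dot-product attention; masked positions get a huge negative
        score, so the softmax gives them ~zero weight ('blinding' the model)."""
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)          # (seq_len, seq_len)
        scores = np.where(mask, scores, -1e9)    # hide what the mask blocks
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    # Toy usage: 5 tokens, 8-dim embeddings, causal (look-behind-only) mask.
    seq_len, d_model = 5, 8
    x = np.random.randn(seq_len, d_model) + sinusoidal_positions(seq_len, d_model)
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    out = masked_attention(x, x, x, causal_mask)   # (5, 8)

The real thing adds learned query/key/value projections, multiple heads, residual connections, and layer norm on top, but the “blinding” and the trigonometric “bearings” are exactly these two tricks.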
I did a bit of disclaimer on my original post but not enough to withstand detailed scrutiny. This is sort of the trouble with trying to talk about cutting-edge research in what amounts to a tweet: what’s the right amount of oversimplified, emphatic statement to add legitimate insight but not overstep into being just full of shit.
I obviously don’t know that publication schedules at heavy-duty learning shops are deliberate and factor-in other publications. The only one I know anything concretely about is FAIR and even that’s badly dated knowledge.
I was trying to squeeze into a few hundred characters my very strong belief that Brain and DM haven’t let themselves be scooped since ResNet, based on my even stronger belief that no one has the muscle to do it.
To the extent that my oversimplification detracted from the conversation I regret that.
GPT is, well, opaque. It’s somewhere between common knowledge and conspiracy theory that it gets a helping hand from human Mechanical Turks when it gets in over its head.
The exact details of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just over-fitting Wikipedia the more corpus and compute you feed to its insatiable maw have always seemed a little big on claims and light on reproducibility.
I don’t think there are many attention skeptics in language modeling; it’s a good idea that you can demo on a gaming PC. Transformers demonstrably work, and a better beam search (or whatever) hits the armchair Turing test harder for a given compute budget.
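For anyone to whom “beam search” is jargon: it just means keeping the k most promising partial outputs at each decoding step instead of greedily committing to one. A toy sketch, with a made-up uniform next_token_logprobs standing in for whatever model actually produces the scores (nothing here is a real library’s API):

    # Toy beam-search decoder; purely illustrative.
    import math

    def next_token_logprobs(tokens, vocab):
        # Hypothetical stand-in for a real LM's next-token scores:
        # uniform log-probability over a tiny vocabulary.
        return [(tok, math.log(1.0 / len(vocab))) for tok in vocab]

    def beam_search(next_fn, start, vocab, beam_width=3, max_len=5, eos="<eos>"):
        """Keep the `beam_width` highest-scoring partial sequences at each step."""
        beams = [([start], 0.0)]                 # (tokens, total log-prob)
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == eos:            # finished beams carry forward
                    candidates.append((tokens, score))
                    continue
                for tok, logp in next_fn(tokens, vocab):
                    candidates.append((tokens + [tok], score + logp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        return beams

    print(beam_search(next_token_logprobs, "<bos>", ["the", "cat", "<eos>"]))

A wider beam spends more compute per generated sentence and, up to a point, tends to read better from the same model, which is all the compute-budget remark above means.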
But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these things are basically asking the question: “if I overfit all human language on the Internet, is that a bad thing?”
It’s my personal suspicion that this is the dominant term, and it’s my personal belief that Google’s ability to do both corpus and model parallelism at Jeff Dean levels while simultaneously building out hardware to the exact precision required is unique by a long way.
But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer-review, let alone a jury. It’s just an educated guess.