benree (OP) | 2022-05-24 01:58:19
At the risk of sounding like I’m trying to defend a position that I’ve already conceded is an oversimplification, I’m frankly a little skeptical of how we can even know that.

GPT is opaque. It’s somewhere between common knowledge and conspiracy theory that it gets a helping hand from Mechanical Turk-style human workers when it gets in over its head.

The exact story of why a BERT-style transformer, or any of the zillion other lookalikes, isn’t just overfitting Wikipedia as you feed more corpus and compute into its insatiable maw has always seemed a little long on claims and light on reproducibility.

I don’t think there are many attention skeptics in language modeling; it’s a good idea, and you can demo it on a gaming PC. Transformers demonstrably work, and a better beam search (or whatever) hits the armchair Turing test harder for a given compute budget.
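And I mean literally demo it: the core operation fits in a forum comment. A minimal numpy sketch of scaled dot-product attention, single head, no masking or batching, dimensions invented for illustration:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # scaled dot-product attention: each query position takes a
        # softmax-weighted average of the value vectors
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))   # 4 query positions, dim 8
    K = rng.normal(size=(6, 8))   # 6 key/value positions
    V = rng.normal(size=(6, 8))
    print(attention(Q, K, V).shape)   # -> (4, 8)

Everything interesting about the big models is what happens when you stack and scale that, not the operation itself.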

But having seen some of this stuff play out at scale, and admittedly this is purely anecdotal, these models are basically running the experiment: “if I overfit all human language on the Internet, is that a bad thing?”
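To make the overfitting worry concrete: the classic diagnostic is the gap between training and held-out perplexity. A toy sketch with a deliberately dumb unigram model, corpora and smoothing values all made up for illustration:

    import numpy as np
    from collections import Counter

    train    = "the cat sat on the mat and the cat slept".split()
    held_out = "the dog sat on the rug and the dog slept".split()

    counts = Counter(train)
    vocab = set(train) | set(held_out)

    def logprob(tok, alpha):
        # add-alpha smoothed unigram estimate, fit on train only
        return np.log((counts[tok] + alpha) / (len(train) + alpha * len(vocab)))

    def perplexity(tokens, alpha):
        return np.exp(-np.mean([logprob(t, alpha) for t in tokens]))

    for alpha in (0.01, 1.0):
        print(alpha, perplexity(train, alpha), perplexity(held_out, alpha))
    # the near-unsmoothed model "fits" train better and held-out much
    # worse; that widening gap is the overfitting signal

The open question with the giant models is what that gap even means when the training set is approximately “the Internet” and there’s no clean held-out set left.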

It’s my personal suspicion that this is the dominant term, and my personal belief that Google’s ability to do both corpus (data) and model parallelism at Jeff Dean levels, while simultaneously building out hardware tuned to exactly that workload, puts it in a class of its own.
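Translating the jargon: corpus/data parallelism shards the batch across machines, model parallelism shards the weights. A toy single-process numpy sketch of the two decompositions; real systems obviously add gradient all-reduces and cross-device communication on top:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 16))    # a batch of 8 examples
    W = rng.normal(size=(16, 32))   # one weight matrix

    # data (corpus) parallelism: shard the batch, replicate the weights
    outs = [shard @ W for shard in np.split(X, 4, axis=0)]
    assert np.allclose(np.concatenate(outs, axis=0), X @ W)

    # model parallelism: shard the weights, replicate the batch
    outs = [X @ w for w in np.split(W, 4, axis=1)]
    assert np.allclose(np.concatenate(outs, axis=1), X @ W)

The math is trivial; the moat is doing both at once across thousands of accelerators without the communication costs eating you alive.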

But, to be more accurate than I was in my original comment, I don’t know most of that in the sense that would be required by peer review, let alone a jury. It’s just an educated guess.
