My AI skeptic friends are all nuts

>>tablet+(OP)
I feel like we get one of these articles that addresses valid AI criticisms with poor arguments every week and at this point I’m ready to write a boilerplate response because I already know what they’re going to say.

Interns don’t cost 20 bucks a month but training users in the specifics of your org is important.

Knowing what is important or pointless comes with understanding the skill set.

>>ofjcih+21
I feel the opposite, and pretty much every metric we have shows basically linear improvement of these models over time.

The criticisms I hear are almost always gotchas, and when confronted with the benchmarks they either don’t actually know how they are built or don’t want to contribute to them. They just want to complain or seem like a contrarian from what I can tell.

Are LLMs perfect? Absolutely not. Do we have metrics to tell us how good they are? Yes

I’ve found very few critics that actually understand ML on a deep level. For instance Gary Marcus didn’t know what a test train split was. Unfortunately, rage bait like this makes money

>>mounta+S3
"pretty much every metric we have shows basically linear improvement of these models over time."

They're also trained on random data scraped off the Internet which might include benchmarks, code that looks like them, and AI articles with things like chain of thought. There's been some effort to filter obvious benchmarks but is that enough? I cant know if the AI's are getting smarter on their own or more cheat sheets are in the training data.

Just brainstorming, one thing I came up with is training them on datasets from before the benchmarks or much AI-generated material existed. Keep testing algorithmic improvements on that in addition to models trained on up to date data. That might be a more accurate assessment.

>>nickps+qJ
thats not a bad idea, very expensive though, and you end up with a pretty useless model in most regards.

A lot of the trusted benchmarks today are somewhat dynamic or have a hidden set.

>>mounta+pn2
That could happen. One would need to risk it to take the approach. However, if it was trained on legal data, then there might be a market for it among those not risking copyright infringement. Think FairlyTrained.org.

"somewhat dynamic or have a hidden set"

Are there example inputs and outputs for the dynamic ones online? And are the hidden sets online? (I haven't looked at benchmark internals in a while.)

zlacker