zlacker

I feel the opposite, and pretty much every metric we have shows basically linear improvement of these models over time.

The criticisms I hear are almost always gotchas, and when confronted with the benchmarks they either don’t actually know how they are built or don’t want to contribute to them. They just want to complain or seem like a contrarian from what I can tell.

Are LLMs perfect? Absolutely not. Do we have metrics to tell us how good they are? Yes

I’ve found very few critics that actually understand ML on a deep level. For instance Gary Marcus didn’t know what a test train split was. Unfortunately, rage bait like this makes money

replies(3): >>attemp+o5 >>Night_+r5 >>nickps+yF

>>mounta+(OP)
>I feel the opposite, and pretty much every metric we have shows basically linear improvement of these models over time.

Wait, what kind of metric are you talking about? When I did my masters in 2023 SOTA models where trying to push the boundaries by minuscule amounts. And sometimes blatantly changing the way they measure "success" to beat the previous SOTA

replies(1): >>mounta+kh

>>mounta+(OP)
Models are absolutely not improving linearly. They improve logarithmically with size, and we've already just about hit the limits of compute without becoming totally unreasonable from a space/money/power/etc standpoint.

We can use little tricks here and there to try to make them better, but fundamentally they're about as good as they're ever going to get. And none of their shortcomings are growing pains - they're fundamental to the way an LLM operates.

replies(2): >>_dain_+6f >>mounta+ch

>>Night_+r5
remember in 2022 when we "hit a wall"? everyone said that back then. turned out we didn't.

and in 2023 and 2024 and january 2025 and ...

all those "walls" collapsed like paper. they were phantoms; ppl literally thinking the gaps between releases were permanent flatlines.

money obviously isn't an issue here, VCs are pouring in billions upon billions. they're building whole new data centres and whole fucking power plants for these things; electricity and compute aren't limits. neither is data, since increasingly the models get better through self-play.

>fundamentally they're about as good as they're ever going to get

one trillion percent cope and denial

replies(2): >>jhonof+HA >>yahooz+yo1

>>Night_+r5
Most of the benchmarks are in fact improving linearly, we often don't even know the size. You can find this out but just looking at the scores over time.

And yes, it often is small things that make models better. It always has been, bit by slow they get more powerful, this has been happening since the dawn of machine learning

>>attemp+o5
Almost every single major benchmark, and yes progress is incremental but it adds up, this has always been the case

replies(1): >>attemp+dY

>>_dain_+6f
The difference in quality between model versions has slowed down imo, I know the benchmarks don't say that but as a person who uses LLMs everyday, the difference between Claude 3.5 and the cutting edge today is not very large at all, and that model came out a year ago. The jumps are getting smaller I think, unless the stuff in house is just way ahead of what is public at the moment.

>>mounta+(OP)
"pretty much every metric we have shows basically linear improvement of these models over time."

They're also trained on random data scraped off the Internet which might include benchmarks, code that looks like them, and AI articles with things like chain of thought. There's been some effort to filter obvious benchmarks but is that enough? I cant know if the AI's are getting smarter on their own or more cheat sheets are in the training data.

Just brainstorming, one thing I came up with is training them on datasets from before the benchmarks or much AI-generated material existed. Keep testing algorithmic improvements on that in addition to models trained on up to date data. That might be a more accurate assessment.

replies(1): >>mounta+xj2

>>mounta+kh
We were talking about linear improvements and I have yet to see it

replies(1): >>mounta+aj2

>>_dain_+6f
Yet we are still at the “treat it like a junior” level

>>attemp+dY
check the benchmarks or make one of your own

replies(1): >>attemp+KQ2

>>nickps+yF
thats not a bad idea, very expensive though, and you end up with a pretty useless model in most regards.

A lot of the trusted benchmarks today are somewhat dynamic or have a hidden set.

replies(1): >>nickps+6j3

>>mounta+aj2
I checked the BlEU-Score and Perplexity of popular models and both have stagnated around 2021. As a disclaimer this was a cursory check and I didn't dive into the details of how individuals scores were evaluated.

replies(1): >>mounta+b15

>>mounta+xj2
That could happen. One would need to risk it to take the approach. However, if it was trained on legal data, then there might be a market for it among those not risking copyright infringement. Think FairlyTrained.org.

"somewhat dynamic or have a hidden set"

Are there example inputs and outputs for the dynamic ones online? And are the hidden sets online? (I haven't looked at benchmark internals in a while.)

>>attemp+KQ2
on what benchmarks? pretty much every major one is linear improvement