zlacker

1. aspenm+(OP) 2026-01-02 20:54:59
You might want to be more specific, because benchmarks abound and they paint a pretty consistent picture. LMArena "vibes" paint another picture. I don't know what you are doing to "check" the frontier LLMs, but whatever it is doesn't seem to match more careful measurement...

You don't actually have to take people's word for it: read epoch.ai's developments, look into the benchmark literature, look at ARC-AGI...

replies(1): >>qualif+ua8
2. qualif+ua8 2026-01-05 16:20:10
>>aspenm+(OP)
That's half the problem, though. I can see the benchmarks. I can see a number go up on some chart, or the AI score higher on some niche math or programming test, but those results don't seem to connect to meaningful improvements in daily usage of the software when those updates hit the public.

That's where the skepticism comes in, because one side of the discussion is hyping up exponential growth and the other is seeing something that looks more logarithmic instead.

I realize anecdotes aren't as useful as numbers for this kind of analysis, but there's such a wide gap between what people observe in practice and what the tests and metrics show that it's hard not to wonder about those numbers.
