Here someone just claimed that it is "entirely clear" LLMs will become super-human, without any evidence.
https://en.wikipedia.org/wiki/Extraordinary_claims_require_e...
The way you've framed it seems like the only evidence you will accept is after it's actually happened.
In my mind, at this point we either need (a) some previously "hidden" super-massive source of training data, or (b) another architectural breakthrough. Without either, this is a game of optimization, and the scaling curves are going to plateau really fast.
a) it hasn't even been a year since the last big breakthrough: the first reasoning model, o1, only came out last September, and we don't know how far that line will go yet. I'd wait a second before assuming the low-hanging fruit is gone.
b) I think coding is a really good environment for agents / reinforcement learning. Rather than requiring a continual supply of new training data, we give the model coding tasks to execute (writing / maintaining / modifying) and then test its code for correctness. We could for example take the entire history of a code-base and just give the model its changing unit + integration tests to implement. My hunch (with no extraordinary evidence) is that this is how coding agents start to nail some of the higher-level abilities.
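To make the idea concrete, here is a toy sketch of that training loop, under my assumptions: the "repo history" is a list of tasks paired with the tests that were added at that point, and `propose_patch` is a hypothetical stand-in for the model (here it just returns a fixed implementation so the loop actually runs). The reward signal is simply whether the candidate code passes the tests.

```python
# Toy sketch: reward a coding agent for making a repo's evolving test
# suite pass at each point in its history. `propose_patch` is a
# hypothetical placeholder for a real model.

def propose_patch(task_description: str) -> str:
    # A real agent would generate this from the task + failing tests;
    # here we hard-code a correct implementation to keep the sketch runnable.
    return "def add(a, b):\n    return a + b\n"

def run_tests(impl_src: str, test_src: str) -> float:
    """Reward signal: 1.0 if the candidate passes the tests, else 0.0."""
    namespace = {}
    try:
        exec(impl_src, namespace)   # load the candidate implementation
        exec(test_src, namespace)   # run the assertions against it
        return 1.0
    except Exception:
        return 0.0

# Stand-in for "the entire history of a code-base": each step is a task
# plus the unit tests that the commit at that point had to satisfy.
history = [
    ("implement add", "assert add(2, 3) == 5"),
    ("handle negatives", "assert add(-1, 1) == 0"),
]

total_reward = sum(
    run_tests(propose_patch(task), tests) for task, tests in history
)
```

In a real setup the tests would run in a sandboxed subprocess rather than via `exec`, and `total_reward` would feed a policy-gradient update; the point is just that correctness-checked coding tasks give you a reward signal without any new human-written training data.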
GPT-4 was another big improvement, and was the first time I found it useful for non-trivial queries. 4o was nice, and there was a decent bump with the reasoning models, especially for coding. However, since o1 it's felt a lot more like optimization than systematic improvement, and I don't see a way for current reasoning models to advance to the point of designing and implementing medium-or-larger coding projects without the assistance of a human.
Like the other commenter mentioned, I'm sure it will happen eventually with architectural improvements, but I wouldn't bet on 1-5 years.
Theoretical limitations of multi-layer Transformer https://arxiv.org/abs/2412.02975
o4 has no problem with the examples from the paper above (appendix A). You can see its reasoning here is also sound: https://chatgpt.com/share/681b468c-3e80-8002-bafe-279bbe9e18.... Not conclusive, unfortunately, since the paper falls within the date range of its training data. Still, reasoning models killed off a large class of "easy logic errors" that people discovered in the earlier generations.
They are not reasoning in any real sense; they are writing pages and pages of text before giving you the answer. This is not so different from the "ever bigger training data" method, just applied to the output instead of the input.