People who say this nonsense need to start properly defining human-level intelligence, because GPT-4 performs at least at average human level on nearly anything you throw at it, often well above.
Give criteria that 4 fails that a significant chunk of the human population doesn't also fail and we can talk.
Else this is just another instance of people struggling to see what's right in front of them.
Just blows my mind the lengths some will go to in order to ignore what is already easily verifiable right now. "I'll know AGI when I see it," my ass.
"Average human level" is pretty boring though. Computers have been doing arithmetic at well above "average human level" since they were first invented. The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well. Which is clearly still not the case.
It seems you have the wrong idea of what is being conveyed, or of what average human intelligence is. It isn't about being able to do math. It is being able to invent, mimic quickly, abstract, memorize, specialize, and generalize. There's a reason humans have occupied every continent on Earth and even places beyond it. It's far more than being able to do arithmetic or play chess. This all seems unimpressive to us because it is normal, to us. But it certainly isn't normal if we look outside ourselves. Yes, there's intelligence in many lifeforms, even ants, but there is some ineffable, difficult-to-express uniqueness to human intelligence (specifically in its generality) that is being referenced here.
To put it one way, a group of machines that could think at the level of an average teenager (or even lower), but 100x faster, would probably outmatch a group of human scientists at solving complex and novel math problems. That isn't "average human level" but below it. "Average human level" is just a shortcut term for this ineffable _capacity_ to generalize and adapt so well. Because we don't even have a fucking definition of intelligence.
But this is exactly why average is boring.
If you ask ChatGPT what it's like to be in the US Navy, it will have texts written by Navy sailors in its training data and produce something based on those texts in response to related questions.
If you ask the average person what it's like to be in the US Navy, they haven't been in the Navy, may not know anyone who is, haven't taken any time to research it, so their answers will be poor. ChatGPT could plausibly give a better response.
But ask the same questions of someone who has been in the Navy and they'll answer them better than ChatGPT, even if the average person who has been in the Navy is no more intelligent than the average person who hasn't.
It's not better at reasoning. It's barely even capable of it, but has access to training data that the average person lacks.
You are wrong, and there are many papers showing otherwise.
Algorithmic, causal, inferential, analogical:
LLMs reason just fine
https://arxiv.org/abs/2212.09196
https://arxiv.org/abs/2305.00050
You'll often see these works discussing zero-shot performance, but many of these tasks are either not zero-shot or not even a known n-shot. Take a good example: Imagen[0] claims zero-shot MS-COCO performance but trains on LAION. COCO classes exist in LAION, along with very similar captions. Browse COCO[1] and the clip-retrieval explorer[2] for LAION: take the first sample from the COCO aircraft category and you'll find almost identical images and captions with many of the same keywords. That isn't zero-shot.
Why does this matter? Because of dataset contamination[3]: evaluation data leaking into training. You can't conclude that a model has learned something if it had access to the evaluation data. Test sets have always been a proxy for generalization and MUST be recognized as proxies.
This gets really difficult with LLMs, where all we know is that they've scraped a large swath of the internet, including GitHub and Reddit. I show some explicit examples and explanation with code generation here[4]. From there you might even see how difficult it is to generate novel test sets that aren't actually contaminated, which is my complaint about HumanEval: we can find dupes or near dupes on GitHub despite the problems being "hand written."
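To make "near dupes" concrete, here's a minimal sketch of the kind of check I mean: flag eval prompts whose n-gram overlap with any scraped document crosses a threshold. The function names, n-gram size, and threshold are my own illustrative choices, not anything from the linked works, and a real dedup pipeline would use something like MinHash or suffix arrays rather than this brute-force comparison.

    # Rough contamination check: flag eval prompts whose character n-grams
    # overlap heavily with any training document. Brute force for clarity;
    # large-scale dedup would use MinHash or suffix arrays instead.

    def ngrams(text, n=8):
        text = " ".join(text.lower().split())  # normalize case and whitespace
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def flag_contaminated(eval_prompts, train_docs, threshold=0.35):
        """Yield (prompt, doc, score) triples that look like near-duplicates."""
        indexed = [(doc, ngrams(doc)) for doc in train_docs]
        for prompt in eval_prompts:
            p = ngrams(prompt)
            for doc, d in indexed:
                score = jaccard(p, d)
                if score >= threshold:
                    yield prompt, doc, score

If a HumanEval-style prompt scores high against scraped GitHub text, a correct completion tells you as much about memorization as about generalization.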
As for your sources: they all use GPT, and we don't know what data those models do and don't contain. We do know they were trained on Reddit and GitHub, and that should be enough to tell you that certain things like physics and coding problems[5] are spoiled. If you look at all the datasets used for evaluation in the works you listed, I think you'll find reason to believe there's a good chance these too are spoiled. (Other datasets are spoiled, and there's lots of experimentation demonstrating that the causal reasoning isn't as good as the benchmark performance suggests.)
Now mind you, this doesn't mean that LMs can't do causal reasoning. They definitely can, including causal discovery[6]. But all of this tells us that it is fucking hard to evaluate models, and even harder when we don't know what they were trained on, and that maybe we need to be a bit more nuanced and stop claiming things so confidently. There are a lot of people trying to sell snake oil right now. These are very powerful tools that are going to change the world, but they are complex and people don't know much about them. We saw plenty of snake oil salesmen at the birth of the internet too. That didn't mean the internet wasn't important or wasn't going to change the course of humanity; it just meant people were profiting off of the confusion and complexity.
[0] https://arxiv.org/abs/2205.11487
[1] https://cocodataset.org/#explore
[2] https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2....
[3] https://twitter.com/alon_jacovi/status/1659212730300268544
[4] https://news.ycombinator.com/item?id=35806152
[5] https://twitter.com/random_walker/status/1637929631037927424