Compared to what, exactly? Because over the last 50 years there have been dramatic improvements[1].
[1]: https://www.brookings.edu/research/the-evolution-of-global-p...
It's true - there's room to do better. So, so much better. But discarding the progress of the last 50 years is so unbelievably counter-productive.
Happened last week. Download here.[1]
Yes, the open models are worse, but are getting better. There will be plenty of high quality commercial alternatives.
"move fast and break things"
It very much feels like they are trying to build a legislative moat, blocking out competitors and even open source projects. Ridiculous.
I don't fear what this technology does to us, I fear what we do to each other because of it. This is just the start.
> Let ChatGPT visit a website and have your email stolen.
> Plugins, Prompt Injection and Cross Plug-in Request Forgery.
0: https://twitter.com/wunderwuzzi23/status/1659411665853779971
Metaculus has some probabilities [1] of what kind of regulation might actually happen by ~2024-2026, e.g. requiring disclosure of human/non-human, restricting APIs, reporting on large training runs, etc.
https://twitter.com/exteriorpower/status/1659069336227819520
How does this not justify what the above person stated - that poverty is running rampant? More than 600 million people are still in extreme poverty. A record 100 million are displaced due to conflict in their countries. So I have to ask: what exactly is unbelievably counter-productive here? I would argue that placating ourselves is.
[1] https://social.desa.un.org/sites/default/files/inline-files/...
I imagine an important concern is the learning & improvement velocity. Humans get old, tired, etc. GPUs do not. It isn't the case now, but it is fuzzy how fast we could collectively get there. Break problem domains out into modules, send them off to the silicon dojos until your models exceed human capabilities, and then roll them back up. You can pick from OpenGPT plugins, so why wouldn't an LLM hypervisor/orchestrator do the same?
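A toy sketch of what I mean by an orchestrator, assuming made-up module names and a trivially simple routing rule (nothing here is a real API):

    # Hypothetical: a tiny "hypervisor" that routes a task to whichever
    # specialist module reports the best fit, then runs it.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Module:
        name: str
        fit: Callable[[str], float]   # self-reported fit for the task, 0..1
        run: Callable[[str], str]

    def orchestrate(task: str, modules: list[Module]) -> str:
        best = max(modules, key=lambda m: m.fit(task))
        return best.run(task)

    modules = [
        Module("math", lambda t: 0.9 if any(c.isdigit() for c in t) else 0.1,
               lambda t: f"[math module handles: {t}]"),
        Module("prose", lambda t: 0.5,
               lambda t: f"[prose module handles: {t}]"),
    ]
    print(orchestrate("what is 12 * 7?", modules))

Swap the toy fit functions for model-scored routing and the run functions for calls into specialized models and you have the rough shape of it.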
https://waitbutwhy.com/2015/01/artificial-intelligence-revol...
https://waitbutwhy.com/2015/01/artificial-intelligence-revol...
I think you are the one who's disconnected. Ask your average crackhead on the block if they're happy, and then compare the answer to your average college dropout stocking groceries. People who haven't seen both sides tend to think happiness follows Maslow's hierarchy of needs or is a linear function of material wealth - it doesn't. It seems like a joke, but this post https://www.reddit.com/r/drugscirclejerk/comments/8iyp0c/i_f... describes exactly what I mean. I genuinely believe some homeless people are happier than some working-class people.
Case in point: you just spouted more metrics at me that track the well-being of the economy, not the well-being of the average person. I do not care about your numbers, because time and again they have been gamed. We should consider the idea that if we can take steps forward, we can also take steps backward.
And while we're at it, I should ask: have you ever had to deal with a dead-end job with subpar pay? Were you ever forced to work in an abusive environment? If so, then you can agree with me that it's a terrible state to be in - not the same as being homeless, certainly, but still terrible.
And if not, then why are you talking about things you don't know about? Do you really think economic metrics are a viable substitute for this lack of knowledge?
You are wrong, and there are many papers showing otherwise.
Algorithmic, Causal, Inference, Analogical
LLMs reason just fine
https://arxiv.org/abs/2212.09196
https://arxiv.org/abs/2305.00050
You'll often see these works discussing zero-shot performance, but many of these tasks are either not zero-shot or not even a known n-shot. Let's take a good example: Imagen[0] claims zero-shot MS-COCO performance but trains on LAION. COCO classes exist in LAION, and there are similar texts. Explore COCO[1] and clip retrieval[2] for LAION. The example given is the first sample from COCO aircraft, and you'll find almost identical images and captions with many of the same keywords. This isn't zero-shot.
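You don't need special tooling to sanity-check this kind of overlap. A rough sketch, assuming you've dumped some LAION captions to a local text file (the file name, the caption, and the 0.8 cutoff are all placeholders of mine):

    # Rough contamination check: fuzzy-match a COCO-style test caption against
    # scraped LAION captions. Filename, caption, and threshold are illustrative.
    from difflib import SequenceMatcher

    test_caption = "an airplane parked on the tarmac at an airport"
    threshold = 0.8  # arbitrary similarity cutoff

    with open("laion_captions_sample.txt") as f:
        for line in f:
            candidate = line.strip().lower()
            score = SequenceMatcher(None, test_caption, candidate).ratio()
            if score >= threshold:
                print(f"{score:.2f}  {candidate}")

The clip-retrieval UI[2] does this far better (on both the image and text side); the point is just that the check is cheap to run.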
Why does this matter? Dataset contamination[3] in the evaluation process. You can't conclude that a model has learned something if it had access to the evaluation data. Test sets have always been proxies for generalization and MUST be recognized as proxies.
This gets really difficult with LLMs, where all we know is that they've scraped a large swath of the internet, including GitHub and Reddit. I show some explicit examples and an explanation for code generation here[4]. From there you might even see how difficult it is to generate novel test sets that aren't actually contaminated, which is my complaint about HumanEval: we can find dupes or near dupes on GitHub despite these being "hand written."
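To make "near dupes" concrete for code, here's a minimal token-set Jaccard check between a HumanEval-style solution and whatever GitHub files you've pulled down (the paths and the 0.7 threshold are placeholders, not any standard):

    # Near-duplicate check for code via token-set Jaccard similarity.
    # Paths and threshold are placeholders.
    import re
    from pathlib import Path

    def tokens(src: str) -> set[str]:
        return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", src))

    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / max(1, len(a | b))

    reference = Path("humaneval_solution.py").read_text()
    for path in Path("github_dump").rglob("*.py"):
        sim = jaccard(tokens(reference), tokens(path.read_text(errors="ignore")))
        if sim > 0.7:
            print(f"{sim:.2f}  {path}")

More serious pipelines use hashed n-grams or embeddings, but even this catches the obvious copies.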
As for your sources: they all use GPT, and we don't know what data it does and doesn't have. But we do know it was trained on Reddit and GitHub. That should be enough to tell you that certain things like physics and coding problems[5] are spoiled. If you look at all the datasets used for evaluation in the works you listed, I think you'll find good reason to believe these are spoiled too. (Other datasets are spoiled as well, and there's a lot of experimentation demonstrating that the causal reasoning isn't as good as the raw performance suggests.)
Now mind you, this doesn't mean that LMs can't do causal reasoning. They definitely can, including causal discovery[6]. But all of this tells us that it is fucking hard to evaluate models, and even harder when we don't know what they were trained on. Maybe we need to be a bit more nuanced and stop claiming things so confidently. There are a lot of people trying to sell snake oil right now. These are very powerful tools that are going to change the world, but they are complex and people don't know much about them. We saw many snake oil salesmen at the birth of the internet too. That didn't mean the internet wasn't important or wasn't going to change the course of humanity. It just meant that people were profiting off of the confusion and complexity.
[0] https://arxiv.org/abs/2205.11487
[1] https://cocodataset.org/#explore
[2] https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2....
[3] https://twitter.com/alon_jacovi/status/1659212730300268544
[4] https://news.ycombinator.com/item?id=35806152
[5] https://twitter.com/random_walker/status/1637929631037927424
What you describe is impossible with these 3.
https://arxiv.org/abs/2212.09196 - new evaluation set introduced with the paper, modelled after tests that previously only had visual equivalents. Contamination literally impossible.
https://arxiv.org/abs/2204.02329 - effect of explanations on questions, introduced with the paper. Dataset concerns make no sense.
https://arxiv.org/abs/2211.09066 - new prompting method introduced to improve algorithmic calculations. Dataset concerns make no sense.
The Causal paper is the only one where worries about dataset contamination make any sense at all.
The papers were linked in another comment. Three of them don't even have anything to do with testing on an existing dataset. So yeah, actual.
For the world model papers:
https://arxiv.org/abs/2210.13382
https://arxiv.org/abs/2305.11169
>Lack of access to cameras or vehicle controls isn't why it can't drive a car.
It would be best to wait until what you say can be evaluated. That is your hunch, not fact.
>The existence of numerous ChatGPT jailbreaks is evidence to the contrary.
No, it's not. People fall for social engineering and do what you ask. If you think people can't be easily derailed, boy do I have a bridge to sell you.
>Many people are of below average intelligence, or give up when something is hard but not impossible.
OK, but that doesn't help your point, and many above-average people don't reach expert level either. If you want to rationalize all of that as "gave up when it wasn't impossible", go ahead lol, but reality paints a very different picture.
>If you have one machine that will make one attempt to solve a problem a day and succeeds 90% of the time and another that will make a billion attempts to solve a problem a second and succeeds 10% of the time, which one has solved more problems by the end of the week?
"Problems" aren't made equal. Practically speaking, it's very unlikely the billion per second thinker is solving any of the caliber of problems the one attempt per day is solving. Solving more "problems" does not make you a super intelligence.
And the base model was excellently calibrated: https://openai.com/research/gpt-4
"Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced."
Next?
I'll assume good faith, but let's try to keep that in mind both ways.
> What you describe is impossible with these 3.
Definitely possible. I did not write my comment as a paper, but I did provide plenty of evidence. I specifically ask that you pay close attention to my HumanEval comment and click that link, where I am much more specific about how a "novel" dataset may not actually be novel. This is a complicated topic and we must connect many dots, so care is needed. You have no reason to trust my claim that I am an ML researcher, but I assure you this is what I do, and I have a special place in my heart for evaluation metrics and understanding their limitations. This is actually key: if you don't understand the limits of a metric, you don't understand your work. If you don't understand the limits of your datasets and how they can be hacked, you don't understand your work.
=== Webb et al ===
Let's see what they are using to evaluate.
> To answer this question, we evaluated the language model GPT-3 on a range of zero-shot analogy tasks, and performed direct comparisons with human behavior. These tasks included a novel text-based matrix reasoning task based on Raven's Progressive Matrices, a visual analogy problem set commonly viewed as one of the best measures of fluid intelligence
Okay, so they created a new dataset. Great, but do we have the HumanEval issues? Raven's Progressive Matrices were introduced in 1938 (the referenced paper), and you'll find many existing code sets for them on GitHub that are almost a decade old, even ML ones that are more than 7 years old. We can also find them on blogspot, wordpress, and wikipedia, which are the top three domains in Common Crawl (used for GPT-3)[0]. This automatically disqualifies this claim from the paper:
> Strikingly, we found that GPT-3 performed as well or better than college students in most conditions, __despite receiving no direct training on this task.__
It may be technically correct, since there is no "direct" training, but it is clear that the model was trained on these types of problems. And that's not the only work they did:
> GPT-3 also displayed strong zero-shot performance on letter string analogies, four-term verbal analogies, and identification of analogies between stories.
I think we can see that these are also obviously going to be in the training data: GPT-3 had access to examples, similar questions, and even in-depth breakdowns of why the answers are correct.
Contamination isn't "literally impossible"; it's trivially demonstrated. This exactly matches my complaint about HumanEval.
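And if you'd rather check than take my word for it, the Common Crawl index can be queried directly. A sketch; the snapshot id and the URL are just examples (GPT-3 used several 2016-2019 crawls), and a miss on one URL obviously proves nothing:

    # Query the Common Crawl CDX index to see whether a given page was captured
    # in a given snapshot. Snapshot id and page are examples only.
    import json
    import urllib.parse
    import urllib.request

    snapshot = "CC-MAIN-2019-18"
    page = "en.wikipedia.org/wiki/Raven's_Progressive_Matrices"
    query = urllib.parse.urlencode({"url": page, "output": "json"})
    endpoint = f"https://index.commoncrawl.org/{snapshot}-index?{query}"

    with urllib.request.urlopen(endpoint) as resp:
        for line in resp.read().decode().splitlines():
            record = json.loads(line)
            print(record.get("timestamp"), record.get("status"), record.get("url"))

Do the same for the decade-old RPM repos on GitHub and the blogspot/wordpress writeups and you get a decent picture of how much of this material the crawl actually holds.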
=== Lampinen et al ===
We need only look at the example on the second page.
Task instruction:
> Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence.
Answer explanation:
> Explanation: David's eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.
You just have to ask yourself whether this prompt and answer are potentially anywhere in Common Crawl. I think we know there are many blogspot posts with questions similar to SAT and IQ tests, which is what this experiment resembles.
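This is checkable, not hand-wavy. The usual contamination heuristic is n-gram overlap between a test item and the training corpus (the GPT-3 paper itself uses a 13-gram style check, if memory serves). A minimal version, with the corpus path as a placeholder for whatever crawl slice you can actually get:

    # Minimal n-gram overlap contamination check. Corpus path is a placeholder.
    import re

    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    test_item = ("Answer these questions by identifying whether the second sentence "
                 "is an appropriate paraphrase of the first, metaphorical sentence.")
    test_grams = ngrams(test_item)

    with open("crawl_slice.txt") as f:
        corpus_grams = ngrams(f.read())

    overlap = test_grams & corpus_grams
    print(f"{len(overlap)} of {len(test_grams)} 13-grams also appear in the corpus slice")

Run that against SAT-prep and IQ-test blog posts before calling this zero-shot.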
=== Conclusion ===
You have strong critiques of my response but little to back them up. I'll reiterate, because it was in my initial response: you are not performing zero-shot testing when your test set includes similar data. That's not what zero-shot is. I wrote more about this a few months back[1] and it may be worth reading. What would change my opinion is not a claim that the dataset did not exist prior to the crawl, but evidence that the model was not trained on data significantly similar to the test set. This is, again, my original complaint about HumanEval, and these papers do nothing to address it.
I'll go even further. I'd encourage you to look at this paper[2], where data isn't just exactly de-duplicated but near de-duplicated, and there is an increase in performance as a result. But I'm not going to explain everything to you. I will tell you that you need to look at Figures 4, 6, 7, A3, ESPECIALLY A4, A5, and A6 VERY carefully. Think about how these results can be explained and their relationship to random pruning. I'll also say that their ImageNet results ARE NOT zero-shot (for the reasons given previously).
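For what near-deduplication looks like mechanically, here's a sketch with an off-the-shelf embedding model; the model choice and the 0.9 threshold are mine, not the paper's:

    # Semantic near-duplicate detection: embed texts, flag pairs with high cosine
    # similarity. Model and threshold are illustrative choices.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    train_texts = ["a jetliner sits on the runway at the airport",
                   "two dogs playing in the snow"]
    test_texts = ["an airplane parked on the tarmac at an airport"]

    train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)
    test_emb = model.encode(test_texts, convert_to_tensor=True, normalize_embeddings=True)

    sims = util.cos_sim(test_emb, train_emb)  # shape: (len(test_texts), len(train_texts))
    for i, row in enumerate(sims):
        for j, score in enumerate(row):
            flag = "NEAR-DUP" if float(score) > 0.9 else ""
            print(f"{float(score):.2f}  {test_texts[i]!r} ~ {train_texts[j]!r}  {flag}")

Exact string de-duplication would never catch the jetliner/airplane pair; that's the point of doing it in embedding space.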
But we're coming back to the same TLDR: evaluating models is already a hard and noisy process, and models that have scraped a significant portion of the internet are substantially harder to evaluate. If you can provide strong evidence that there isn't contamination, I'll take these works more seriously. This is the point you are not addressing: you have to back up the claims, not just state them. In the meantime, I have strong evidence that these, and many other, datasets are contaminated. That even includes many causal datasets that you have not listed but that were used in other works. Essentially: if the test set is on GitHub, it is contaminated. Again, see HumanEval and the specific response I linked. You can't just say "wrong," drop some sources, and leave it at that. That's not how academic conversations happen.
[0] https://commoncrawl.github.io/cc-crawl-statistics/plots/doma...
For anyone following along, they are in my sibling comment. Linked papers here[0]. The exact same conversation is happening there, but sourced.
> 3 of them don't even have anything to do with a existing dataset testing
Specifically, I address this claim and bring strong evidence for why you should doubt it, especially with this exact wording. The short version: when you scrape the entire internet for training data, you get a lot of overlap, and you can't confidently call these evaluations "zero-shot." All the experiments in the linked works use datasets that are not significantly different from data found in the training set. For those that are "hand written," see my complaints (linked) about HumanEval.
https://www.metaculus.com/questions/5121/date-of-artificial-...
I don't know what would change your mind.