You'll often see these works discussing zero-shot performance. But many of these tasks are not actually zero-shot, and often we can't even establish what n-shot they are. Take a concrete example: Imagen[0] claims zero-shot MS-COCO performance but trains on LAION. COCO classes exist in LAION, along with very similar captions. Explore COCO[1] and try clip retrieval[2] against LAION: take the first sample from the COCO aircraft category and you'll find nearly identical images and captions sharing many of the same keywords. That isn't zero-shot.
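To make this concrete, here's a minimal sketch of the kind of check I mean: score the word overlap between a COCO-style test caption and captions returned by clip retrieval for LAION. The captions below are hypothetical placeholders standing in for whatever the retrieval UI actually returns for your query.

```python
# Minimal sketch: score word overlap between a COCO-style test caption and
# captions returned by clip retrieval for LAION. The LAION "hits" below are
# hypothetical placeholders for whatever the retrieval UI returns.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased, punctuation-stripped word sets."""
    ta = {w.strip(".,!?") for w in a.lower().split()}
    tb = {w.strip(".,!?") for w in b.lower().split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

coco_caption = "A large passenger airplane flying through the air."  # COCO-style caption
laion_hits = [
    "large passenger airplane flying through the air",        # hypothetical retrieval result
    "a commercial jet airliner flying in a clear blue sky",    # hypothetical retrieval result
]

for hit in laion_hits:
    print(f"{token_jaccard(coco_caption, hit):.2f}  {hit}")
# A score near 1.0 means the "zero-shot" test caption is effectively in the training set.
```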
Why does this matter? Because dataset contamination[3] corrupts the evaluation process. You can't conclude that a model has learned something if it had access to the evaluation data. Test sets have always been a proxy for generalization and MUST be recognized as proxies.
This gets really difficult with LLMs, where all we know is that they've scraped a large swath of the internet, including GitHub and Reddit. I give some explicit examples and an explanation for code generation here[4]. From there you might see how hard it is to build novel test sets that aren't actually contaminated, which is my complaint about HumanEval: we can find dupes or near-dupes on GitHub despite the problems being "hand written."
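Here's a rough sketch of what I mean by a near dupe: normalize away comments, docstrings, and whitespace, then compare a benchmark solution against code found in the wild. The benchmark function below is a HumanEval-style problem (paraphrased, not the exact canonical solution) and the "GitHub" snippet is a hypothetical renamed copy; the mechanics are the point.

```python
# Rough sketch of a near-dupe check: normalize away comments, docstrings, and
# whitespace, then compare a benchmark solution to a snippet found on GitHub.
import re
from difflib import SequenceMatcher

def normalize(code: str) -> str:
    code = re.sub(r'"""[\s\S]*?"""', "", code)  # drop docstrings
    code = re.sub(r"#.*", "", code)             # drop comments
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    return "\n".join(lines)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# HumanEval-style problem, paraphrased from memory -- not the exact canonical solution.
benchmark_solution = '''
def has_close_elements(numbers, threshold):
    """Check if any two numbers are closer than threshold."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
'''

# Hypothetical GitHub snippet: same logic, renamed identifiers, a stray comment.
github_snippet = '''
def close_pair_exists(nums, eps):
    # brute force pairwise check
    for i, x in enumerate(nums):
        for j, y in enumerate(nums):
            if i != j and abs(x - y) < eps:
                return True
    return False
'''

print(f"similarity: {similarity(benchmark_solution, github_snippet):.2f}")
# A high ratio between "hand written" test code and pre-existing public code
# is exactly the near-dupe problem.
```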
As for your sources: they all use GPT, and we don't know what data it does and doesn't contain. We do know it was trained on Reddit and GitHub. That alone should tell you that certain things, like physics and coding problems[5], are spoiled. If you look at all the datasets used for evaluation in the works you listed, I think you'll find good reason to believe these are spoiled too. (Other datasets are spoiled as well, and there's plenty of experimentation demonstrating that the causal reasoning isn't as good as the raw performance suggests.)
Now mind you, this doesn't mean that LMs can't do causal reasoning. They definitely can, including causal discovery[6]. But all of this tells us that it is fucking hard to evaluate models, and even harder when we don't know what they were trained on. Maybe we need to be a bit more nuanced and stop claiming things so confidently. There are a lot of people trying to sell snake oil right now. These are very powerful tools that are going to change the world, but they are complex and people don't know much about them. We saw plenty of snake oil salesmen at the birth of the internet too. That didn't mean the internet wasn't important or wasn't going to change the course of humanity. It just meant that people were profiting off the confusion and complexity.
[0] https://arxiv.org/abs/2205.11487
[1] https://cocodataset.org/#explore
[2] https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2....
[3] https://twitter.com/alon_jacovi/status/1659212730300268544
[4] https://news.ycombinator.com/item?id=35806152
[5] https://twitter.com/random_walker/status/1637929631037927424
What you describe is impossible with these 3.
https://arxiv.org/abs/2212.09196 - new evaluation set introduced with the paper, modelled after tests that previously only had visual equivalents. Contamination is literally impossible.
https://arxiv.org/abs/2204.02329 - studies the effect of explanations on questions introduced with the paper. Dataset concerns make no sense.
https://arxiv.org/abs/2211.09066 - new prompting method introduced to improve algorithmic calculations. Dataset concerns make no sense.
The causal paper is the only one where worries about dataset contamination make any sense at all.
I'll assume good faith, but let's try to keep this in mind both ways.
> What you describe is impossible with these 3.
It's definitely possible. I did not write my comment as a paper, but I did provide plenty of evidence. I specifically ask that you pay close attention to my HumanEval comment and click that link, where I am much more specific about how a "novel" dataset may not actually be novel. This is a complicated topic and we must connect many dots, so care is needed. You have no reason to trust my claim that I am an ML researcher, but I assure you this is what I do, and I have a special place in my heart for evaluation metrics and understanding their limitations. This is key: if you don't understand the limits of a metric, you don't understand your work. If you don't understand the limits of your datasets and how they can be hacked, you don't understand your work.
=== Webb et al ===
Let's see what they are using to evaluate.
> To answer this question, we evaluated the language model GPT-3 on a range of zero-shot analogy tasks, and performed direct comparisons with human behavior. These tasks included a novel text-based matrix reasoning task based on Raven's Progressive Matrices, a visual analogy problem set commonly viewed as one of the best measures of fluid intelligence
Okay, so they created a new dataset. Great, but do we run into the same issues as HumanEval? Raven's Progressive Matrices were introduced in 1938 (the referenced paper), and you'll find many existing code sets for them on GitHub that are almost a decade old, including ML ones that are more than seven years old. We can also find them on Blogspot, WordPress, and Wikipedia, which are the top three domains in Common Crawl (used for GPT-3)[0]. This automatically disqualifies this claim from the paper:
> Strikingly, we found that GPT-3 performed as well or better than college students in most conditions, __despite receiving no direct training on this task.__
It may be technically correct, since there was no "direct" training, but it is clear that the model was trained on these types of problems. And that's not the only work they did:
> GPT-3 also displayed strong zero-shot performance on letter string analogies, four-term verbal analogies, and identification of analogies between stories.
I think we can see that these are also obviously going to be in the training data: GPT-3 had access to examples, similar questions, and even in-depth breakdowns of why the answers are correct.
Contamination isn't "literally impossible"; it's trivially demonstrable. This exactly matches my complaint about HumanEval.
=== Lampinen et al ===
We need only look at the example on the second page.
Task instruction:
> Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence.
Answer explanation:
> Explanation: David's eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.
You just have to ask yourself whether this prompt and answer are likely to appear anywhere in Common Crawl. I think we know there are many Blogspot posts with questions similar to SAT and IQ tests, which is essentially what this experiment is.
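For reference, the GPT-3 paper's own appendix ran an n-gram overlap analysis against its training data for exactly this reason. Below is a simplified sketch of that kind of contamination check; the choice of n and the corpus shard are assumptions, and a real check would scan the actual Common Crawl text, not one placeholder string.

```python
# Simplified n-gram contamination check, in the spirit of the overlap analysis
# in the GPT-3 paper's appendix: flag a test item if any of its n-grams also
# appears in the training text. n=8 and the shard below are arbitrary stand-ins.

def ngrams(text: str, n: int = 8) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(test_item: str, train_shard: str, n: int = 8) -> bool:
    return not ngrams(test_item, n).isdisjoint(ngrams(train_shard, n))

test_prompt = ("Answer these questions by identifying whether the second sentence "
               "is an appropriate paraphrase of the first, metaphorical sentence.")
train_shard = "..."  # placeholder: in practice, text extracted from Common Crawl WET files

print(is_contaminated(test_prompt, train_shard))  # a real check scans the whole corpus
```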
=== Conclusion ===
You have strong critiques of my response but little to back them up. I'll reiterate, because it was in my initial response: you are not performing zero-shot testing when the training data includes data similar to your test set. That's not what zero-shot is. I wrote more about this a few months back[1] and it may be worth reading. What would change my mind is not a claim that the dataset did not exist prior to the crawl, but evidence that the model was not trained on data significantly similar to the test set. This is, again, my original complaint about HumanEval, and these papers do nothing to address it.
I'll go even further. I'd encourage you to look at this paper[2], where the data isn't just exactly de-duplicated but near de-duplicated, and performance increases as a result. I'm not going to explain everything to you, but I will tell you to look at Figures 4, 6, 7, A3, ESPECIALLY A4, A5, and A6 VERY carefully. Think about how these results can be explained and about their relationship to random pruning. I'll also note that their ImageNet results ARE NOT zero-shot (for the reasons given previously).
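If the mechanics aren't clear, here's the core idea behind near de-duplication stripped down to a sketch (this is not that paper's exact pipeline): embed every example and flag pairs above a cosine-similarity threshold as duplicates even when the bytes differ. The encoder, the threshold, and the random embeddings below are all stand-ins.

```python
# Core idea behind *near* de-duplication (not the paper's exact pipeline):
# embed every example, then treat pairs above a cosine-similarity threshold
# as duplicates even when the bytes differ. Encoder, threshold, and the
# random embeddings here are all stand-ins.
import numpy as np

def near_duplicate_pairs(emb: np.ndarray, threshold: float = 0.95):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    sims = emb @ emb.T                                      # pairwise cosine similarity
    n = len(emb)
    return [(i, j, float(sims[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))              # stand-in for e.g. CLIP embeddings
emb[1] = emb[0] + 0.01 * rng.normal(size=512)  # plant one near-duplicate
print(near_duplicate_pairs(emb))               # only the planted pair should appear
```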
But we come back to the same TLDR: evaluating models is already a hard and noisy process, and models that have scraped a significant portion of the internet are substantially harder to evaluate. If you can provide strong evidence that there isn't contamination, I'll take these works more seriously. This is the point you are not addressing: you have to back up the claims, not just state them. In the meantime, I have strong evidence that these, and many other, datasets are contaminated. That even includes many causal datasets that you have not listed but that were used in other works. Essentially: if the test set is on GitHub, it is contaminated. Again, see HumanEval and the specific response I linked. You can't just say "wrong," drop some sources, and leave it at that. That's not how academic conversations happen.
[0] https://commoncrawl.github.io/cc-crawl-statistics/plots/doma...