1. famous+(OP) 2023-05-22 23:25:33
I don't think you took more than a passing glance, if any, at those papers.

What you describe is impossible with these 3.

https://arxiv.org/abs/2212.09196 - a new evaluation set introduced with the paper, modelled after tests that previously had only visual equivalents. Contamination is literally impossible.

https://arxiv.org/abs/2204.02329 - studies the effect of explanations on questions introduced with the paper. Dataset concerns make no sense.

https://arxiv.org/abs/2211.09066 - a new prompting method introduced to improve algorithmic calculations. Dataset concerns make no sense.

The causal paper is the only one where worries about dataset contamination make any sense at all.

replies(1): >>godels+sj
2. godels+sj 2023-05-23 02:45:51
>>famous+(OP)
> I don't think you took more than a passing glance, if any at those papers.

I'll assume good faith, but let's try to keep this in mind both ways.

> What you describe is impossible with these 3.

Definitely possible. I did not write my comment as a paper, but I did provide plenty of evidence. I specifically ask that you pay close attention to my HumanEval comment and click that link, where I am much more specific about how a "novel" dataset may not actually be novel. This is a complicated topic and we must connect many dots, so care is needed. You have no reason to trust my claim that I am an ML researcher, but I assure you that this is what I do. I have a special place in my heart for evaluation metrics and for understanding their limitations. This is actually key: if you don't understand the limits of a metric, you don't understand your work. If you don't understand the limits of your datasets and how they could be hacked, you don't understand your work.

=== Webb et al ===

Let's see what they are using to evaluate:

> To answer this question, we evaluated the language model GPT-3 on a range of zero-shot analogy tasks, and performed direct comparisons with human behavior. These tasks included a novel text-based matrix reasoning task based on Raven’s Progressive Matrices, a visual analogy problem set commonly viewed as one of the best measures of fluid intelligence

Okay, so they created a new dataset. Great, but do we have the HumanEval issues? Raven's Progressive Matrices were introduced in 1938 (the referenced paper), and you'll find many existing implementations and question sets on GitHub that are almost a decade old, even ML ones that are more than 7 years old. We can also find them on Blogspot, WordPress, and Wikipedia, which are the top three domains in Common Crawl (used for GPT-3)[0]. This automatically disqualifies this claim from the paper:

> Strikingly, we found that GPT-3 performed as well or better than college students in most conditions, __despite receiving no direct training on this task.__

It may be technically correct, since there is no "direct" training, but it is clear that the model was trained on these types of problems. And that's not the only work they did:

> GPT-3 also displayed strong zero-shot performance on letter string analogies, four-term verbal analogies, and identification of analogies between stories.

I think we can see that these are also obviously going to be in the training data: GPT-3 had access to examples, similar questions, and even in-depth breakdowns of why the answers are correct.

Contamination isn't "literally impossible"; it's trivially demonstrated. This exactly matches my complaint about HumanEval.
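
To make "trivially demonstrated" concrete, here's a minimal sketch of the kind of check I mean: flag a test item if it shares a long n-gram with any crawled document. The 13-gram window and the in-memory corpus are my own illustrative choices (roughly the kind of 13-gram overlap test the GPT-3 paper used in its own contamination analysis), not anything from Webb et al.:

    import re

    def ngrams(text, n=13):
        # lowercase, strip punctuation, split on whitespace
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def is_contaminated(test_item, corpus_docs, n=13):
        # flag the item if it shares any n-gram with any corpus document
        item_grams = ngrams(test_item, n)
        return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

    # hypothetical usage:
    # flagged = [item for item in test_items if is_contaminated(item, crawl_sample)]

Note that this only catches near-verbatim overlap; "trained on these types of problems" is a weaker, harder-to-detect condition, which is exactly my point.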

=== Lampinen et al ===

We need only look at the example on the second page.

Task instruction:

> Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence.

Answer explanation:

> Explanation: David’s eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.

You just have to ask yourself whether this prompt and answer could appear anywhere in Common Crawl. I think we know there are many Blogspot posts with questions similar to SAT and IQ tests, which is essentially what this experiment resembles.
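
If you'd rather check than take my word for it, here is a rough sketch of what that search could look like: stream a Common Crawl WET (extracted text) shard and look for normalized key phrases from the test item. The file name and phrases are placeholders, and a real check would cover many shards with fuzzier matching:

    import gzip
    import re

    def normalize(s):
        # lowercase and collapse whitespace so formatting differences don't hide a match
        return re.sub(r"\s+", " ", s.lower())

    def phrase_hits(wet_path, phrases):
        # count lines in a gzipped WET (extracted text) shard containing any target phrase
        targets = [normalize(p) for p in phrases]
        hits = 0
        with gzip.open(wet_path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                line = normalize(line)
                if any(t in line for t in targets):
                    hits += 1
        return hits

    # hypothetical usage with a locally downloaded shard:
    # print(phrase_hits("example.warc.wet.gz",
    #                   ["eyes were daggers", "appropriate paraphrase of the first"]))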

=== Conclusion ===

You have strong critiques of my response but little to back them up. I'll reiterate, because it was in my initial response: you are not performing zero-shot testing when your test set includes similar data. That's not what zero-shot is. I wrote more about this a few months back[1] and it may be worth reading. What would change my opinion is not a claim that the dataset did not exist prior to the crawl, but evidence that the model was not trained on data significantly similar to the test set. This is, again, my original complaint about HumanEval, and these papers do nothing to address it.

I'll go even further. I'd encourage you to look at this paper[2], where data isn't just exactly de-duplicated but near-de-duplicated, and there is an increase in performance as a result. I'm not going to explain everything to you, but I will tell you that you need to look at Figures 4, 6, 7, A3, ESPECIALLY A4, A5, and A6 VERY carefully. Think about how these results can be explained and their relationship to random pruning. I'll also note that their ImageNet results ARE NOT zero-shot (for the reasons given previously).
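
For anyone unfamiliar with what "near de-duplicated" means in practice, here is a minimal illustration using MinHash over word shingles. The paper above works in embedding space, so treat this only as a sketch of the general idea; the shingle size, hash count, and threshold are arbitrary choices of mine:

    import hashlib
    import re

    def shingles(text, k=5):
        # overlapping k-word shingles of a normalized document
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def minhash_signature(text, num_hashes=64):
        # signature[i] = minimum of the i-th salted hash over all shingles
        sh = shingles(text)
        return [
            min(int(hashlib.sha1(str(i).encode() + s.encode()).hexdigest(), 16) for s in sh)
            for i in range(num_hashes)
        ]

    def estimated_jaccard(sig_a, sig_b):
        # fraction of matching slots approximates Jaccard similarity of the shingle sets
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    # hypothetical usage: call two documents near-duplicates above some threshold
    # if estimated_jaccard(minhash_signature(item), minhash_signature(page)) > 0.8: ...

Exact string matching misses pairs that differ by a few words or a rephrasing; that is the gap near-de-duplication is meant to close.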

But we're coming back to the same TLDR: evaluating models is already a hard and noisy process, and models that have scraped a significant portion of the internet are substantially harder to evaluate. If you can provide strong evidence that there isn't contamination, then I'll take these works more seriously. This is the point you are not addressing: you have to back up the claims, not just state them. In the meantime, I have strong evidence that these, and many other, datasets are contaminated. This even includes many causal datasets that you have not listed but that were used in other works. Essentially: if the test set is on GitHub, it is contaminated. Again, see HumanEval and the specific response that I linked. You can't just say "wrong," drop some sources, and leave it at that. That's not how academic conversations happen.

[0] https://commoncrawl.github.io/cc-crawl-statistics/plots/doma...

[1] https://news.ycombinator.com/item?id=35489811

[2] https://arxiv.org/abs/2303.09540
