For anyone following along, the sources are in my sibling comment; the papers are linked there[0]. The exact same conversation is happening there, but with sources.
> 3 of them don't even have anything to do with a existing dataset testing
I address this claim specifically and bring strong evidence for why you should doubt it, especially with this exact wording. The short version: when you scrape the entire internet for training data, you end up with substantial overlap with the evaluation sets, so you can't confidently call these evaluations "zero shot." All experiments in the linked works use datasets that are not significantly different from data found in the training set. For the ones that are "hand written," see my complaints (linked) about HumanEval.
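
To make the overlap concern concrete, here's a minimal sketch of the kind of n-gram contamination check these claims would need to survive. The whitespace tokenization, the 13-token window, and names like `train_docs` are my illustrative assumptions (roughly in the spirit of GPT-3-style dedup checks), not anything from the linked papers:

```python
# Minimal sketch of an n-gram contamination check (illustrative assumptions:
# whitespace tokenization, 13-gram window; `train_docs`/`eval_items` are
# hypothetical names, not from the linked papers).

def ngrams(tokens, n=13):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs, n=13):
    """Collect every n-gram seen anywhere in the training corpus."""
    index = set()
    for doc in train_docs:
        index |= ngrams(doc.split(), n)
    return index

def contaminated(eval_items, train_index, n=13):
    """Flag eval items that share at least one n-gram with training data."""
    return [item for item in eval_items
            if ngrams(item.split(), n) & train_index]
```

Any eval item that repeats even one 13-token span from the training corpus gets flagged, and a flagged item can no longer honestly be evaluated as "zero shot." That's the bar the linked experiments don't clear.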