I don't really see how this is different from "LLMs can't multiply 20-digit numbers"--which, btw, most humans can't do either. I tried it once (using pen and paper) and consistently made errors somewhere.
The point is to construct non-circular ways of quantifying model performance on reasoning. That the LLM may have seen prior exemplars of any given problem is exactly the obstacle to establishing that its performance reflects reasoning rather than synthesis of historical data.
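
To make that concrete, here's a minimal sketch of what a non-circular check could look like: generate fresh problem instances programmatically (multi-digit multiplication, since it came up above) so the exact question can't have been memorized, and grade against a ground-truth solver. The query_model call is a hypothetical placeholder, not any particular API.

    import random

    def make_problem(n_digits=20, seed=None):
        # Fresh instance each run, so the exact string is vanishingly
        # unlikely to appear verbatim in any training corpus.
        rng = random.Random(seed)
        a = rng.randrange(10**(n_digits - 1), 10**n_digits)
        b = rng.randrange(10**(n_digits - 1), 10**n_digits)
        return f"What is {a} * {b}?", a * b

    def score(model_answer: str, truth: int) -> bool:
        # Credit only an exact match on the final integer.
        digits = "".join(ch for ch in model_answer if ch.isdigit())
        return digits == str(truth)

    if __name__ == "__main__":
        prompt, truth = make_problem(seed=0)
        print(prompt)
        # reply = query_model(prompt)   # hypothetical model call
        # print(score(reply, truth))

Whether arithmetic is the right probe is a separate argument (per the parent comment), but the same generate-and-grade pattern applies to any problem family you can sample and verify mechanically.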