"The LLMs we tested couldn't find reasoning errors but can correct them" is accurate. Trying small language golf experiments on existing models just tells you about their training data.
It's quite likely that a transformer could be trained successfully for this specific task; a rough sketch of what that fine-tuning could look like follows.
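A minimal sketch, assuming a Hugging Face setup: the base model, the binary per-step error labels, and the toy examples are all illustrative assumptions, not anything from the paper being discussed.

```python
# Sketch: fine-tune a small transformer to classify whether a single
# reasoning step contains an error. Model choice, label scheme, and
# data are hypothetical placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # 0 = step is fine, 1 = step contains an error
)

# Hypothetical training examples: problem plus one reasoning step,
# labeled by whether that step is wrong.
examples = [
    {"text": "Q: 2+2*3? Step: 2+2=4, so 4*3=12.", "label": 1},  # order-of-operations error
    {"text": "Q: 2+2*3? Step: 2*3=6, so 2+6=8.", "label": 0},   # correct step
]
dataset = Dataset.from_list(examples).map(
    lambda e: tokenizer(e["text"], truncation=True, padding="max_length", max_length=128)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="error-finder",
        num_train_epochs=1,
        report_to="none",  # skip external experiment loggers
    ),
    train_dataset=dataset,
)
trainer.train()
```

In practice you'd want thousands of labeled reasoning traces rather than two toy rows, but the point stands: whether current off-the-shelf LLMs happen to do this is a separate question from whether the task is learnable.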
Also, many of these models gain new capabilities with each release, so negative results on today's checkpoints may not hold for the next ones.