LLMs cannot find reasoning errors, but can correct them

>>koie+(OP)
Stop doing self-correction within the context of the model's own generation.

The previous paper on self correction told the model "you previously said X - are there errors with this?"

This one has the mistakes statically added to the prompt in a task prompt and response without additional context immediately before asking if it has any errors.

Think about the training data.

How often does the training data of most of the Internet reflect users identifying issues with their own output?

How often does the training data reflect users identifying issues with someone else's output?

Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."

This may even be part of why the classifier did a better job at identifying issues - less the fine tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).

It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that a LLM has a concept of self vs other or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.

I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)

>>kromem+UE
The obvious way to do this would be as adversarial networks like in GANs for image generation. Have the existing LLM as the generator trained exactly as at present but with an additional penalty for being found to have committed an error and have another network trained at the same time as a validator where its fitness function is finding errors in the output of the generator.

People must be doing this, probably just takes a while for the research to bear fruit.

Some of these errors are so obvious I can’t imagine this would be too hard. For an example, try asking an LLM “generate me a system of two equations in two unknowns. Both the coefficients and the solutions must be integers between -10 and 10”. In my experience it will generate a valid system. Some of the time the coefficients will be in the range specified. Probably about a third to a half the time the solution it gives will be wrong and when you ask for an explanation of the solution it will make some basic arithmetic error (eg flipping a sign etc). Then when you point out the error it will correct.

zlacker