zlacker

[return to "LLMs cannot find reasoning errors, but can correct them"]
1. kromem+UE[view] [source] 2023-11-20 22:29:21
>>koie+(OP)
Stop doing self-correction within the context of the model's own generation.

The previous paper on self-correction told the model "you previously said X - are there errors with this?"

This one statically adds the mistakes to the prompt - a task prompt plus a response containing the errors, with no additional context - immediately before asking whether there are any errors.

Think about the training data.

How often does the training data of most of the Internet reflect users identifying issues with their own output?

How often does the training data reflect users identifying issues with someone else's output?

Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."

This may even be part of why the classifier did a better job at identifying issues - less the fine-tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).

It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that an LLM has a concept of self vs. other, or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.

I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)
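
To make the framing concrete, here's a minimal sketch of the two prompt setups I mean - the task, the wrong answer, and the prompt wording are all made up for illustration, not taken from the paper:

    task = "What is 17 + 25?"
    answer = "17 + 25 = 32"  # deliberately wrong; the correct answer is 42

    # Framing 1: the output is presented as the model's own prior answer
    self_prompt = (
        f"Task: {task}\n"
        f"You previously answered: {answer}\n"
        "Are there any errors in your answer? If so, correct them."
    )

    # Framing 2: the same output is presented as someone else's work
    other_prompt = (
        f"Task: {task}\n"
        f"A student submitted this answer: {answer}\n"
        "Are there any errors in the student's answer? If so, correct them."
    )

    # Send each prompt to the same model and compare how often the
    # mistake gets caught under each framing.
    print(self_prompt)
    print(other_prompt)

Same mistake, same reviewer; the only thing that changes is whose answer the model thinks it is reviewing.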

◧◩
2. NoToP+fG1[view] [source] 2023-11-21 05:30:26
>>kromem+UE
With due respect (and I actually mean due respect), this embodies exactly what is wrong with the modern approach to AI. Who cares that there are no examples in the training set? True AI should be capable of taking a few steps off book without getting flummoxed. When you learn your first language, the teacher does not stand before the class and provide examples of ungrammatical statements, yet you figure out the rules of grammar just fine.

There is something fundamentally flawed in the approach, not in the data.

◧◩◪
3. seanhu+IM1[view] [source] 2023-11-21 06:35:53
>>NoToP+fG1
There are training methodologies that do this, but they don't necessarily work in this case (or no one has got them to work that well yet).

For example, reinforcement learning: AlphaZero famously learned by playing itself at chess and Go, and became much stronger than the purpose-built first version, AlphaGo.

Or, as another example, generative adversarial networks, where you have a generator network producing images and a validator (discriminator) network trying to spot the fakes.

In both of these examples it's easy to see how you build the loss functions for training, because the domains are quite constrained. For a game, you penalize versions of the model that lose and reward those that win. For GANs the initial insight was huge, but once you have it the way forward is clear: you reward the generator for slipping fake images past the validator, and you reward the validator for finding the fakes in a stream that mixes real and generated images.
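
To make that concrete, here's a toy sketch of the two GAN loss terms in PyTorch - the tiny linear networks and tensor shapes are made up purely to show the shape of the objective, not a real architecture:

    import torch
    import torch.nn.functional as F

    # Toy generator/validator pair; real GANs use much bigger networks,
    # but the two losses have this shape.
    generator = torch.nn.Linear(16, 784)   # noise -> fake "image"
    validator = torch.nn.Linear(784, 1)    # image -> real/fake score

    real_images = torch.randn(8, 784)      # stand-in for a batch of real images
    noise = torch.randn(8, 16)
    fake_images = generator(noise)

    # Validator is rewarded for labelling real as real and fake as fake.
    d_loss = (
        F.binary_cross_entropy_with_logits(validator(real_images), torch.ones(8, 1))
        + F.binary_cross_entropy_with_logits(validator(fake_images.detach()), torch.zeros(8, 1))
    )

    # Generator is rewarded for slipping fakes past the validator,
    # i.e. for making the validator label its fakes as real.
    g_loss = F.binary_cross_entropy_with_logits(validator(fake_images), torch.ones(8, 1))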

For an open-ended model like an LLM it's not so easy to see how you do this in the general case. GPT models are actually pretty good at "zero-shot" learning (answering without examples) and "transfer" learning (where lessons from one domain are applied to a related domain).

Your example of language is interesting, because you don't learn your first language from any sort of teacher - you learn it from your parents and others talking around you and to you, so you have lots of examples to draw on. You then try out various sounds and words; everyone looks confused, but they become more excited as you get closer to saying something that is a real word, until eventually you hit on the magic recipe, say the word "DUCK!" (or whatever), and everyone loses their minds. So you have lots of positive reinforcement that you're on the right track, and you have a huge number of examples. You're not just fed the Hacker News comment section, some papers on quantum mechanics, and all the English literature that has fallen out of copyright, and then left to get on with it.

◧◩◪◨
4. NoToP+na2[view] [source] 2023-11-21 09:58:14
>>seanhu+IM1
I wish I could take credit for my example, but it's perhaps the most famous example in all of linguistics, and it's the thing that made Noam Chomsky's name in the field.

To summarise it quickly, Chomsky's contention was that all the world's languages can be described by shockingly few degrees of freedom on the same universal grammar, and that we learn language surprisingly fast relative to the training data because all we are really picking up are those parameters; the rest is hard-wired from birth, the same way horses come out of the womb already hard-wired to gallop.

Decades later, very few things have truly stood the test of being universal among languages, but it was still a valuable contribution because he poked a serious hole in the pure behaviourist reinforcement theories that were in vogue back then.

[go to top]