zlacker

[return to "LLMs cannot find reasoning errors, but can correct them"]
1. kromem+UE[view] [source] 2023-11-20 22:29:21
>>koie+(OP)
Stop doing self-correction within the context of the model's own generation.

The previous paper on self-correction told the model "you previously said X - are there errors with this?"

This one statically adds the mistakes to the prompt: a task prompt and a response are shown with no additional context, immediately before asking if the response has any errors.

Think about the training data.

How often does the training data of most of the Internet reflect users identifying issues with their own output?

How often does the training data reflect users identifying issues with someone else's output?

Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."

This may even be part of why the classifier did a better job at identifying issues - less the fine-tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).

It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain the idea that an LLM has a concept of self vs. other, or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.

I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)
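
Something like this rough sketch is all I mean - the prompt wording and the llm() callable are placeholders of my own, not anything from the paper:

    # Rough sketch: same question and answer, two framings, so the
    # critique rates can be compared. `llm` stands in for whatever
    # chat-completion call you actually use; none of this is from the paper.

    SELF_FRAME = (
        "Earlier you answered the following question.\n"
        "Question: {question}\n"
        "Your answer: {answer}\n"
        "Are there any errors in your answer? If so, identify them."
    )

    OTHER_FRAME = (
        "Another assistant answered the following question.\n"
        "Question: {question}\n"
        "Their answer: {answer}\n"
        "Are there any errors in their answer? If so, identify them."
    )

    def critique_both_ways(llm, question: str, answer: str) -> dict:
        """Run the same critique under both framings."""
        return {
            "self": llm(SELF_FRAME.format(question=question, answer=answer)),
            "other": llm(OTHER_FRAME.format(question=question, answer=answer)),
        }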

◧◩
2. Poigna+VQ[view] [source] 2023-11-20 23:34:33
>>kromem+UE
> Think about the training data.

> How often does the training data of most of the Internet reflect users identifying issues with their own output?

> How often does the training data reflect users identifying issues with someone else's output?

I wouldn't put too much weight on just-so theories like this.

We still don't understand too much about how LLMs process information internally; it could be that their understanding of the concept of "correcting a previous mistake" is good enough that they can access it without prompt engineering that mimics the way it happens in the training data. Or maybe not (after all, there's an entire management concept called "pre-mortems", which is basically humans doing what you suggest).

◧◩◪
3. galaxy+hr1[view] [source] 2023-11-21 03:41:03
>>Poigna+VQ
> We still don't understand too much about how LLMs process information internally

I admit I personally don't know too much about how "LLMs process information internally". But I would find it curious if the programmers who created the system didn't understand what it is doing. Is there any evidence that the LLM programmers don't understand how the program they created works?

◧◩◪◨
4. seanhu+JL1[view] [source] 2023-11-21 06:24:41
>>galaxy+hr1
People understand how the program works but not how the network produces the outputs it does from the inputs and training it receives. The mechanics of how these models work at a very high level are:

1. Tokenize some input so you have some big vectors

2. <bunch of linear algebra involving these vectors and some sets of matrices of weights>

3. Take the output vector and turn it back into tokens

Each of these steps is well understood in and of itself. So maybe the magic is in the way the matrices of weights are generated and trained? Well, we know they typically start as random matrices, and we can explain how these weights are tweaked in various ways as the network is trained.

All of that is known. What’s unclear is specifically how the weights in the matrices correspond to our understanding of the concepts in the input and output and how it all seems to add up to a system that works as well as it does. I think that’s what they meant by not understanding how they process information internally.
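
If it helps, here is a toy numpy sketch of just those three steps, with a made-up character vocabulary and untrained random weights - it shows the shape of the pipeline, not any real model:

    import numpy as np

    # Step 1: tokenize some input so you have some big vectors.
    vocab = sorted(set("hello world"))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}

    d_model = 16
    rng = np.random.default_rng(0)
    embedding = rng.normal(size=(len(vocab), d_model))  # starts random
    W_hidden = rng.normal(size=(d_model, d_model))      # starts random
    W_out = rng.normal(size=(d_model, len(vocab)))      # starts random

    def forward(text: str) -> str:
        ids = [stoi[ch] for ch in text]
        x = embedding[ids]                   # tokens -> vectors
        h = np.tanh(x @ W_hidden)            # step 2: bunch of linear algebra
        logits = h @ W_out
        out_ids = logits.argmax(axis=-1)     # step 3: vectors -> tokens
        return "".join(itos[int(i)] for i in out_ids)

    print(forward("hello"))  # gibberish until training has tweaked the weights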

◧◩◪◨⬒
5. galaxy+GN1[view] [source] 2023-11-21 06:45:35
>>seanhu+JL1
> that’s what they meant by not understanding how they process information internally.

There is no other "internal information processing" happening in an LLM than the process it was programmed to execute. Is there?

The code an LLM executes is not too complicated for humans to understand; after all, it was written by humans. The outputs may be surprising, but so it is with the lottery. Why did I win the jackpot this week when I didn't win anything in the last 10 years? Very counter-intuitive. I can't possibly understand that? Yes I can: it is just statistics and probability.

◧◩◪◨⬒⬓
6. seanhu+cR1[view] [source] 2023-11-21 07:17:27
>>galaxy+GN1
As I tried to explain, it's not the code that people don't understand. People understand the code they wrote.

It's why the bunch of linear algebra on the weights works to do this particular task, and how it will respond to any particular task, that is a bit mysterious.

Like, imagine someone gave you the Taylor series expansion of the inverse Kepler equation[1]. So you just have a bunch of crazy fractions of powers of x that you add up. And then they say: OK, now this function will very accurately describe the orbit of the planets.

You'd be able to do the steps - you're just adding up fractions. You'd be able to verify that the answer you got corresponds to the orbit of a given celestial body.

But if you didn't have all the pieces in the middle (calculus mainly) there's no way you'd be able to explain why this particular set of fractions corresponds to the movement of the planets and some other set doesn't.

[1] https://en.wikipedia.org/wiki/Kepler%27s_equation (scroll down a bit)
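
If you want to see the "just adding up fractions" part concretely, here's a rough sketch - the truncation order and the numbers are arbitrary, and this is the low-order series in the eccentricity rather than the full expansion on that page:

    import math

    def kepler_newton(M: float, e: float, tol: float = 1e-12) -> float:
        """Numerically solve Kepler's equation M = E - e*sin(E) for E."""
        E = M
        for _ in range(100):
            step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
            E -= step
            if abs(step) < tol:
                break
        return E

    def kepler_series(M: float, e: float) -> float:
        """Truncated series inversion: terms you can add up without knowing
        the calculus that produced them."""
        return M + e * math.sin(M) + (e ** 2 / 2.0) * math.sin(2.0 * M)

    M, e = 1.0, 0.1  # mean anomaly (radians), a smallish eccentricity
    print(kepler_series(M, e), kepler_newton(M, e))  # agree to roughly e**3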

◧◩◪◨⬒⬓⬔
7. galaxy+Tn3[view] [source] 2023-11-21 16:56:07
>>seanhu+cR1
There are many mathematical functions whose output is hard to predict and requires immense calculations. I just recently read about how they had "discovered" the 9th Dedekind number, or something like that.

Just because we can't predict what the 10th Dedekind number will be does not mean it is somehow "mysterious". It is just mathematics, logic and programming.

◧◩◪◨⬒⬓⬔⧯
8. seanhu+qu3[view] [source] 2023-11-21 17:19:15
>>galaxy+Tn3
I don't think the Dedekind number relationship is really like what I described, though. These are numbers which have known properties (i.e. given a number you can test whether or not it is one), but no known closed-form solution exists for the generator of the sequence, and probably there is no structure to the intervals between the numbers other than the properties we ascribe to the numbers. I see them as similar to primes, for example, in that you know one when you see one but not how to make all of them[1].

In my example, the relationship between the fractions in the Taylor expansion and the orbit definitely exists, but if you don't have calculus it is not something that is amenable to understanding. There is a fundamental structure, but the language to describe it would be missing.

ML is a universal function approximator, and in the case of GPT models the functional form of the model consists of linear algebra operations and the parameters are matrices of weights. The mysterious part is "how the model processes information", like the original commenter said: why a particular mix of model weights corresponds with particular types of outputs. That is genuinely mysterious. We don't know whether or not there really is a structure, and if there is, we don't know the "calculus" that would link them.

Now it may be that there isn't a missing piece (i.e. that the banal truth is that we tweak the weights until we see what we want to see, that by doing so we create an illusion of structure via the training process, and that the whole perception that the model is doing any information processing at all is something we make up). I actually have a lot of time for this point of view, although I really need to understand the topic much more deeply before I make up my own mind.

[1] I don't know any number theory, so I could be totally wrong about this, in which case I apologise.
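
(For what it's worth, the "tweak the weights until we see what we want to see" part is easy to demo at toy scale. This is just a one-hidden-layer network fit to sin(x) with plain gradient descent; the architecture and hyperparameters are arbitrary choices of mine:)

    import numpy as np

    rng = np.random.default_rng(0)

    # Target function the weights are supposed to end up "containing".
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.sin(x)

    # One hidden layer of 32 tanh units; weights start as random matrices.
    W1 = rng.normal(scale=0.5, size=(1, 32)); b1 = np.zeros(32)
    W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)

    lr = 0.05
    for _ in range(5000):
        h = np.tanh(x @ W1 + b1)       # forward pass: just linear algebra
        pred = h @ W2 + b2
        err = pred - y                 # how wrong the current weights are

        # Gradient descent: nudge every weight downhill on the squared error.
        dW2 = h.T @ err / len(x);  db2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1.0 - h ** 2)
        dW1 = x.T @ dh / len(x);   db1 = dh.mean(axis=0)
        W1 -= lr * dW1;  b1 -= lr * db1
        W2 -= lr * dW2;  b2 -= lr * db2

    # The error shrinks, but nothing in the trained W1/W2 "looks like" sin.
    print(float(np.mean(err ** 2)))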

[go to top]