> How often does the training data of most of the Internet reflect users identifying issues with their own output?
> How often does the training data reflect users identifying issues with someone else's output?
I wouldn't put too much weight into just-so theories like this.
We still don't understand too much about how LLMs process information internally; it could be that their understanding of the concept of "correcting a previous mistake" is good enough that they can access it without prompt engineering to mimic the way it happens in training data. Or maybe not (after all, there's an entire management concept called "pre-mortems" which is basically doing what you suggest, as a human).
Even if the model has the capacity to abstract beyond the patterns, the patterns are still very likely to have influence on its ability to do so.
For example, early after GPT-4 was released it was being claimed it couldn't solve variations on the goat, wolf, and cabbage problem.
I found that it could solve these variations fine 100% of the time, you just needed to explicitly prompt for it to repeat adjectives with nouns and change the nouns to emojis. The repeating worked similar to CoT by biasing the generation towards the variation and away from the original form, and the emojis in place of the nouns further broke the token associations which was leading it to fail by extending the original solution.
So while it's possible that with enough finessing you could get a model to perform self-critique as well as its critique of others, if the training data has a clear pattern of bias between those two, why actively ignore it?
It's a bit like sanding against the grain vs with it. You can sand against the grain of the training data and with enough effort potentially get the result you want with sophisticated enough models. But maybe your life will be a lot easier if you identify the grain in the data first and sand along with it instead?
I admit I personally don't know too much about how "LLMs process information internally". But, I would find it curious if programmers who created the system wouldn't understand what it is doing. Is there any evidence that the LLM programmers don't understand how the program they created works?
Imagine a billion black boxes with hamsters put in them. You put in a bag of equally mixed Skittles in one end of each box and then rate each box based on how well it does to get rid of the yellow and green Skittles but push out the others. The ones that do the best at this you mate the hamsters and go again, over and over. Eventually you should have hamsters in boxes that almost always get rid of yellow and green Skittles and output the rest.
But is it because you bred in a preference to eat those color Skittles? An aversion to the other colors? Are they using those colors for nesting? Do they find the red and blue and orange ones too stimulating so they push those out but leave the others alone?
There could be a myriad of reasons why your training was successful, and without the ability to introspect the result you just won't know what's correct.
This is a huge simplification by way of loose analogy for what's going on with training a transformer based LLM, but no one is sitting there 'programming' it. They are just setting up the conditions for it to self-optimize around the training goals given the data, and the 'programming' just has to do with improving the efficiency of the training process. Analyzing the final network itself is like trying to understand what each variable in a billion variable math equation is doing to the result.
1. Tokenize some input so you have some big vectors
2. <bunch of linear algebra involving these vectors and some sets of matrices of weights>
3. Take the output vector and turn it back into tokens
Each of these steps are well understood in and of themselves. So maybe the magic is in the way the matrices of weights are generated and trained? Well we know they typically start as random matrices, and can explain how as the network is trained these weights are tweaked in various ways.
All of that is known. What’s unclear is specifically how the weights in the matrices correspond to our understanding of the concepts in the input and output and how it all seems to add up to a system that works as well as it does. I think that’s what they meant by not understanding how they process information internally.
It's like with any optimization algorithm. You cannot predict what exactly will be the result of a given optimization-run. But you know how the optimization algorithm works. The (more or less) optimal solution you get back might surprise you, might be counter-intuitive. But programmers who wrote the code that did the optimization, and have the source-code, know exactly how it works.
When you get a result from LLM you don't say "I can't possibly understand why it came up with this result?". You can understand that, it's just following the rules it was programmed to follow. You might not know those rules, you might not understand them, but programmers who wrote them do.
There is no other "internal information processing" happening in an LLM than the process it was programmed to execute. Is there?
The code an LLM executes is not too complicated for humans to understand. After all it was written by humans. The outputs may be surprising but so it is with lottery. Why did I win the jackpot this week, when I didn't win anything in the last 10 years? Very counter-intuitive. I can't possibly understand that? Yes I can, it is just statistics and probability.
It's why the bunch of linear algebra on the weights works to do this particular task, and how it will respond to any particular task that is a bit mysterious.
Like imagine someone gave you the Taylor series expansion of the inverse Kepler equation[1]. So you just have a bunch of crazy fractions of powers of x that you add up. And then they say ok now the this function will very accurately explain the orbit of the planets.
You'd be able to do the steps - you're just adding up fractions. You'd be able to verify the answer you got corresponded to the orbit of a given celestial body.
But if you didn't have all the pieces in the middle (calculus mainly) there's no way you'd be able to explain why this particular set of fractions corresponds to the movement of the planets and some other set doesn't.
[1] https://en.wikipedia.org/wiki/Kepler%27s_equation scroll down a bit
If I ask how it's able to write a poem given a request and you tell me you know - it multiplies and adds this set of 1.8 trillion numbers together X times with this set of accumulators, I would argue you don't understand how it works enough to make any useful predictions.
Kind of like how you understand what insane spaghetti code is doing - it's running this code - but can have absolutely no idea what business logic it encodes.
We may know every we put every single atom in that stem cell, but still not know any more about the resulting baby (and later adult) than we do about humans made the natural way.
Oh, and if you're looking for reasons to regulate AI, this metaphor works for that, too.
It doesn't really encode "business logic", it just matches your input with the best output it can come up with, based on how its parameters are fine-tuned. Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
Just because we can't predict what the 10th Dedekind number will be does not mean it is somehow 'mysterious". It is just mathematics, logic and programming.
In my example, the relationship between the fractions in the Tailor expansion and the orbit definitely exists but if you don't have calculus it is not something that is amenable to understanding. There is a fundamental structure but the language to describe it would be missing.
ML is a universal function approximator and in the case of GPT models the functional form of the model consists of linear algebra operations and the parameters are matrices of weights. The mysterious part is "how the model processes information" like the original person said - why a particular mix of model weights corresponds with particular types of outputs. That is genuinely mysterious. We don't know whether or not there really is a structure and if there is, we don't know the "calculus" that would link them.
Now it may be that there isn't a missing piece (ie that the banal truth is we tweak the weights until we see what we want to see and by doing so we create an illusion of structure via the training process and the whole perception that the model is doing any information processing at all is something we make up). I actually have a lot of time for this point of view although I really need to understand the topic much more deeply before I make my own mind up.
[1] I don't know any number theory so could be totally wrong about this in which case I apologise.
> It doesn't really encode "business logic"
Doesn't it? Gpt architectures can build world models internally while processing tokens (see Othello got).
> we know how those parameters came about, by executing the code of the AI-application in the training mode.
Sure. But that's not actually a very useful description when trying to figure out how to use and apply these models to solve problems or understand what their limitations are.
> Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
We don't to the level we want to.
Tell you what, let's flip it around. If we know how they work just fine, why are smart researchers doing experiments with them? Why is looking at the code and billions or trillions of floats not enough?
The gap between what you think is the case and what's actually the case is that there isn't a single optimization step directed by the programming.
Instead, the training gives the network the freedom to make its own optimizations, which remain obfuscated from the programmers.
So we do know that we are giving the network the ability to self modify in order to optimize its performance on the task, and have a clear understanding of how this is set up.
But it isn't at all clear what the self modifications that improve the results are actually doing, as there's simply far too many interdependent variables to identify cause and effect for each node's weight changes from the initial to final state.