The previous paper on self correction told the model "you previously said X - are there errors with this?"
This one has the mistakes statically added to the prompt in a task prompt and response without additional context immediately before asking if it has any errors.
Think about the training data.
How often does the training data of most of the Internet reflect users identifying issues with their own output?
How often does the training data reflect users identifying issues with someone else's output?
Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."
This may even be part of why the classifier did a better job at identifying issues - less the fine tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).
It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that a LLM has a concept of self vs other or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.
I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)
Part of what makes humans able to make progress in difficult, vague, and uncertain fields is a willingness to hold onto a point of view in the face of criticism to try & fix itl. This is, as a matter of fact, how science progresses, depending on if you ask scientists or historians of science. See Thomas Kuhn's Structure of Scientific Revolutions for more on this.
This exactly. Not anthropomizing when anthropomization is producing better predictive models of what to expect in output is not smart, it's just silly.
> How often does the training data of most of the Internet reflect users identifying issues with their own output?
> How often does the training data reflect users identifying issues with someone else's output?
I wouldn't put too much weight into just-so theories like this.
We still don't understand too much about how LLMs process information internally; it could be that their understanding of the concept of "correcting a previous mistake" is good enough that they can access it without prompt engineering to mimic the way it happens in training data. Or maybe not (after all, there's an entire management concept called "pre-mortems" which is basically doing what you suggest, as a human).
That's the point: The internet IS full of pedants correcting others' statements. (Hopefully those pedants are right enough of the time for this to be helpful training data, heh.)
I think GP (kromem) was pointing out that those corrections are more likely to be phrased as "You're wrong, here's why..." than as "I'm sorry, I was mistaken" because humans are full of sass for other humans and not as full of first-person admitted errors.
Even if the model has the capacity to abstract beyond the patterns, the patterns are still very likely to have influence on its ability to do so.
For example, early after GPT-4 was released it was being claimed it couldn't solve variations on the goat, wolf, and cabbage problem.
I found that it could solve these variations fine 100% of the time, you just needed to explicitly prompt for it to repeat adjectives with nouns and change the nouns to emojis. The repeating worked similar to CoT by biasing the generation towards the variation and away from the original form, and the emojis in place of the nouns further broke the token associations which was leading it to fail by extending the original solution.
So while it's possible that with enough finessing you could get a model to perform self-critique as well as its critique of others, if the training data has a clear pattern of bias between those two, why actively ignore it?
It's a bit like sanding against the grain vs with it. You can sand against the grain of the training data and with enough effort potentially get the result you want with sophisticated enough models. But maybe your life will be a lot easier if you identify the grain in the data first and sand along with it instead?
"i want a python app that calculates a roadtrip for me"
vs
"Please write me a Python program using a map API that measures the distance between two locations as a car would drive. Think carefully about the program architecture and be sure to use a human readable Pythonic style. Please show me the complete program in it's entirety."
The former game me a high level overview with a ton of explanation and didn't write any code. You can try to walk it through the process of all the steps it needs, but it will write "confused", albeit working, code after a few prompts. The latter just wrote working code on the first response. Moving forward, the context is just so more concise and correct that everything after will be of much higher quality.
I rarely go past 5-10 responses due to what I'd call "context poisoning". If it makes a simple syntax error or something small, I'll shoot it the error and let it correct itself. But as soon as it invents a function or otherwise hallucinates, it gets copy pasted into a new prompt saying "here's some bad code, fix this" and it is far more likely to come up with an elegant solution rather that rewriting everything or making huge changes to solve a one off error or something it's previous context was preventing it from grasping.
What you're saying is almost the meta of using good grammer and context, and I completely agree.
I admit I personally don't know too much about how "LLMs process information internally". But, I would find it curious if programmers who created the system wouldn't understand what it is doing. Is there any evidence that the LLM programmers don't understand how the program they created works?
is way faster, free, doesn't require a phone number or login, and gives much better results.
There is something fundamentally flawed in the approach not in the data.
Imagine a billion black boxes with hamsters put in them. You put in a bag of equally mixed Skittles in one end of each box and then rate each box based on how well it does to get rid of the yellow and green Skittles but push out the others. The ones that do the best at this you mate the hamsters and go again, over and over. Eventually you should have hamsters in boxes that almost always get rid of yellow and green Skittles and output the rest.
But is it because you bred in a preference to eat those color Skittles? An aversion to the other colors? Are they using those colors for nesting? Do they find the red and blue and orange ones too stimulating so they push those out but leave the others alone?
There could be a myriad of reasons why your training was successful, and without the ability to introspect the result you just won't know what's correct.
This is a huge simplification by way of loose analogy for what's going on with training a transformer based LLM, but no one is sitting there 'programming' it. They are just setting up the conditions for it to self-optimize around the training goals given the data, and the 'programming' just has to do with improving the efficiency of the training process. Analyzing the final network itself is like trying to understand what each variable in a billion variable math equation is doing to the result.
It makes complete sense and has been a part of my own usage for well over a year now, but it's been cool seeing it demonstrated in research across multiple models.
People must be doing this, probably just takes a while for the research to bear fruit.
Some of these errors are so obvious I can’t imagine this would be too hard. For an example, try asking an LLM “generate me a system of two equations in two unknowns. Both the coefficients and the solutions must be integers between -10 and 10”. In my experience it will generate a valid system. Some of the time the coefficients will be in the range specified. Probably about a third to a half the time the solution it gives will be wrong and when you ask for an explanation of the solution it will make some basic arithmetic error (eg flipping a sign etc). Then when you point out the error it will correct.
1. Tokenize some input so you have some big vectors
2. <bunch of linear algebra involving these vectors and some sets of matrices of weights>
3. Take the output vector and turn it back into tokens
Each of these steps are well understood in and of themselves. So maybe the magic is in the way the matrices of weights are generated and trained? Well we know they typically start as random matrices, and can explain how as the network is trained these weights are tweaked in various ways.
All of that is known. What’s unclear is specifically how the weights in the matrices correspond to our understanding of the concepts in the input and output and how it all seems to add up to a system that works as well as it does. I think that’s what they meant by not understanding how they process information internally.
It's like with any optimization algorithm. You cannot predict what exactly will be the result of a given optimization-run. But you know how the optimization algorithm works. The (more or less) optimal solution you get back might surprise you, might be counter-intuitive. But programmers who wrote the code that did the optimization, and have the source-code, know exactly how it works.
When you get a result from LLM you don't say "I can't possibly understand why it came up with this result?". You can understand that, it's just following the rules it was programmed to follow. You might not know those rules, you might not understand them, but programmers who wrote them do.
For example reinforcement learning, like when AlphaZero famously learned by playing itself at chess and go and became much stronger than the purpose-built “alphago” first version.
Or another example generative adversarial networks where you have a generator network generating images and a validator network trying to spot fake images.
In both these examples it’s easy to see how you build the loss functions for the training because they are quite constrained. For a domain like a game you penalize versions of the model that lose games and reward those that win. For GANs the initial insight was huge but having had that it’s easy to see how you move forward - you reward the generator for slipping fake images past the validator and you reward the validator for finding fakes in a stream of images that includes some real images and some generated images.
For an open-ended general model like an LLM it’s not so easy to see how you do this in the general case. GPT models are actually pretty good at “zero shot” learning (without examples) and “transfer” learning (where lessons from a domain are applied to an associated domain).
Your example of a language is interesting, because you don’t learn your first language from any sort of teacher - you learn it from your parents and others talking around you and to you. So you have lots of examples to draw on. You then try out various sounds and words and everyone looks confused but becomes more excited as you get closer to saying something that is a real word eventually you hit on the magic recipe and say the word “DUCK!” (Or whatever) and everyone loses their minds. So you have lots of positive reinforcement that you’re on the right track and you have a huge number of examples. You’re not just fed the hackernews comment section, some papers on quantum mechanics and all the english literature that has fallen out of copyright and left to get on with it.
There is no other "internal information processing" happening in an LLM than the process it was programmed to execute. Is there?
The code an LLM executes is not too complicated for humans to understand. After all it was written by humans. The outputs may be surprising but so it is with lottery. Why did I win the jackpot this week, when I didn't win anything in the last 10 years? Very counter-intuitive. I can't possibly understand that? Yes I can, it is just statistics and probability.
For example, ask chatgpt about writing a python script that does anything with AWS inspector 2. It will do very badly, it will hallucinate, etc. Even with Internet access. Ask about doing the same with some other API that was well represented in the training set and it's great.
This is why I think predicting death for sites like stackoverflow is very premature. What happens 10 years down the line once everything chatgpt knows is old tech? It can't be simply trained with more recrnt data, because unless stackoverflow regains it's popularity there will be very little training data. Of course various data generation techniques will be invented and tried, but no one will match the gold standard of human generated data.
Unfortunately I have to predict inevitable enshittification of general purpose chat bots.
It's why the bunch of linear algebra on the weights works to do this particular task, and how it will respond to any particular task that is a bit mysterious.
Like imagine someone gave you the Taylor series expansion of the inverse Kepler equation[1]. So you just have a bunch of crazy fractions of powers of x that you add up. And then they say ok now the this function will very accurately explain the orbit of the planets.
You'd be able to do the steps - you're just adding up fractions. You'd be able to verify the answer you got corresponded to the orbit of a given celestial body.
But if you didn't have all the pieces in the middle (calculus mainly) there's no way you'd be able to explain why this particular set of fractions corresponds to the movement of the planets and some other set doesn't.
[1] https://en.wikipedia.org/wiki/Kepler%27s_equation scroll down a bit
If I ask how it's able to write a poem given a request and you tell me you know - it multiplies and adds this set of 1.8 trillion numbers together X times with this set of accumulators, I would argue you don't understand how it works enough to make any useful predictions.
Kind of like how you understand what insane spaghetti code is doing - it's running this code - but can have absolutely no idea what business logic it encodes.
So instead of "write a short story of a person that's satisfied at work" something along the line of "write a short story and the protagonist must be a person and the protagonist must be happy at work" boost comprension especially as the condition list becomes longer.
Seems like a pretty simple task for an LLM as long as the initial prompt isn't too ambiguous. If it really does help with the recall it could be interesting to have this as an optional preprocessing layer in chat clients and such.
To summarise it quickly, Chomsky's contention was that all the world's languages can be described by shockingly few degrees of freedom on the same universal grammar, and that we learn language surprisingly fast relative to training data because all we are really picking up are those parameters and the rest is hard wired from birth the same way horses come out the womb already hard wired to gallop.
Decades later, very few things have truely stood the test of being universal among languages, but it was still a valuable contribution because he poked a serious hole in the pure Hebbian reinforcement theories which were in vogue back then.
I also noticed that if I wrote comments in "my style", then it would complete the code in my style also, which I found both hilarious and mildly disturbing.
We may know every we put every single atom in that stem cell, but still not know any more about the resulting baby (and later adult) than we do about humans made the natural way.
Oh, and if you're looking for reasons to regulate AI, this metaphor works for that, too.
Also, when you fine-tune the LLM, you can also use an LLM to summarize or concatenate content that you train it on (e.g. rewrite this content in the style of a human having a conversation with a computer)
"python app calculate roadtrip"
>About 6,470,000 results (0.34 seconds)
Four out of the top five results have code. The other one is a video tutorial where the app is coded live.
It doesn't really encode "business logic", it just matches your input with the best output it can come up with, based on how its parameters are fine-tuned. Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
Just because we can't predict what the 10th Dedekind number will be does not mean it is somehow 'mysterious". It is just mathematics, logic and programming.
In my example, the relationship between the fractions in the Tailor expansion and the orbit definitely exists but if you don't have calculus it is not something that is amenable to understanding. There is a fundamental structure but the language to describe it would be missing.
ML is a universal function approximator and in the case of GPT models the functional form of the model consists of linear algebra operations and the parameters are matrices of weights. The mysterious part is "how the model processes information" like the original person said - why a particular mix of model weights corresponds with particular types of outputs. That is genuinely mysterious. We don't know whether or not there really is a structure and if there is, we don't know the "calculus" that would link them.
Now it may be that there isn't a missing piece (ie that the banal truth is we tweak the weights until we see what we want to see and by doing so we create an illusion of structure via the training process and the whole perception that the model is doing any information processing at all is something we make up). I actually have a lot of time for this point of view although I really need to understand the topic much more deeply before I make my own mind up.
[1] I don't know any number theory so could be totally wrong about this in which case I apologise.
> It doesn't really encode "business logic"
Doesn't it? Gpt architectures can build world models internally while processing tokens (see Othello got).
> we know how those parameters came about, by executing the code of the AI-application in the training mode.
Sure. But that's not actually a very useful description when trying to figure out how to use and apply these models to solve problems or understand what their limitations are.
> Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
We don't to the level we want to.
Tell you what, let's flip it around. If we know how they work just fine, why are smart researchers doing experiments with them? Why is looking at the code and billions or trillions of floats not enough?
anyway, LLMs aren't thinking. they're pattern matching and it's not doing recursion it seems.
I'd say the only way you're getting error correction is taking multiple LLMS And running them through chains and parallel construction.
https://www.google.com/search?q=%22python+app+calculate+road...
If you leave off the quotes (which were present in the comment I responded to) then of course you will get millions of irrelevant hits. Somewhere in that chaff there is some Python code that alleges to have something to with road trips, though it's not always clear what. If I give the same prompt to ChatGPT I get a nicely formatted box with a program that uses the Google Maps Distance Matrix API to calculate distance and duration, without a bunch of junk to wade through. (I haven't tried it so it could be a complete hallucination.)
The gap between what you think is the case and what's actually the case is that there isn't a single optimization step directed by the programming.
Instead, the training gives the network the freedom to make its own optimizations, which remain obfuscated from the programmers.
So we do know that we are giving the network the ability to self modify in order to optimize its performance on the task, and have a clear understanding of how this is set up.
But it isn't at all clear what the self modifications that improve the results are actually doing, as there's simply far too many interdependent variables to identify cause and effect for each node's weight changes from the initial to final state.
If there is a pattern in the training data that people resist contrary information to their earlier stated position, and a LLM extracts and extends patterns from the training data, then a LLM absolutely should have a tendency to resist contrary information to an earlier stated position.
The difference, and what I think you may have meant to indicate, is that there's not necessarily the same contributing processes that lend themselves to that tendency in humans occurring in parallel in the LLM, even if both should fall into that tendency in their output.
So the tendencies represented in the data are mirrored, such as "when people are mourning their grandmother dying I should be extra helpful" even if the underlying processes - such as mirror neurons firing to resonate grief or drawing on one's own lived experience of loss to empathize - are not occurring in the LLM.
Actually this part does seem in recent research to be encoded in LLMs at an abstract level in a linear representation...
Personally I think given the model loss with fine tuning people who want the cutting edge LLM at any cost would - instead of fine tuning the model itself - fine tune a preprocess prompter that takes a chat/instruction and converts it to a good TextCompletion prompt.
So for example taking "write me a paragraph of marketing copy for an athletic shoe" and tuning it into:
"Marketing case study: Athletic shoe The problem: The client needed a paragraph of high quality marketing copy to promote their new athletic shoe on their website. The solution: Our award winning copywriters wrote the outstanding copy reproduced below."
Followed by an extractor that reformats the completion result into an answer for the initial prompt, as well as potentially a safety filter that checks the result isn't breaking any rules (which will as a bonus be much more resistant to jailbreaking attempts).
It's a very weird feeling for sure. I remember when Copilot first took a comment I left at the end of the day for me to start my next day with and generated exactly the thing I was going to end up thinking of 5 minutes later in my own personal style.
It doesn't always work and it often has compile issues, but when it does align just right - it's quite amazing and unsettling at the same time.
Until then, I'll make sure to be mindful of conventions.
(And just a reminder, but organic intelligence has its own conventions that work when aligned with and cause issues when misaligned with, so your expectations of universal general purpose without advantages to one approach or another may be unrealistic.)