~"When told where it's wrong, LLM can correct itself to improve accuracy."
Similar to cheating in chess: a master only needs to be told the value of a few positions to have an advantage.
From the abstract it sounds to me like they’re talking about heuristics for particular problems. Is that accurate?
> recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023)
A logical mistake might imply a blind spot inherent to the model, a blind spot that might not be present in all models.
Would it be better to just double the size of one of the models rather than house both?
Genuine question
https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...
I noticed that they automatically create at least three other draft responses.
I assume that this is a technique that allows them to try multiple times and then select the best one.
Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response. Which seems important for getting good results with these models.
I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step one are reviewed by the group in step 2, then they try again in step 3.
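A minimal sketch of the draft-then-select idea, not a description of how Bard works internally. `generate_draft` is a hypothetical stand-in for one sampled model response (a toy function here so the example runs on its own), and the selection step is a plain majority vote, which is only one of several ways to pick "the best one".

```python
import random
from collections import Counter

def generate_draft(question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled model response.

    Toy behaviour: answers "4" about 70% of the time, "5" otherwise.
    """
    return "4" if rng.random() < 0.7 else "5"

def best_of_n(question: str, n: int = 5, seed: int = 0) -> str:
    """Sample n drafts and keep the answer most drafts agree on."""
    rng = random.Random(seed)
    drafts = [generate_draft(question, rng) for _ in range(n)]
    answer, _count = Counter(drafts).most_common(1)[0]
    return answer

print(best_of_n("What is 2 + 2?", n=5))
```

In a real system the n drafts could share one batch, which is where the batching guess comes in.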
I guess that only applies for a small department where there is frequently just one person using it at a time.
It's the same pattern you'd see in a pedagogical article about correcting reasoning errors, except that it's able to generate some share of the article content on its own.
With more layers of post-processing behind a curtain, you might be able to build an assembly over this behavior that looked convincingly like it was correcting reasoning errors on its own.
So... yes and no.
Convergence (evolutionary computing) https://en.wikipedia.org/wiki/Convergence_(evolutionary_comp...
Convergence (disambiguation) > Science, technology, and mathematics https://en.wikipedia.org/wiki/Convergence#Science,_technolog...
It can make it more expensive if that option becomes popular.
But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it enables them to use the parallel throughput of the graphics device more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default).
me: what is sin -pi/2
gpt: -1
me: that's not right
gpt: I apologize, let me clarify, the answer is 1
=====
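For reference, the first answer in that exchange was the correct one: sine is an odd function, so sin(-pi/2) = -sin(pi/2) = -1. A quick check with Python's standard `math` module:

```python
import math

# sin is odd: sin(-x) == -sin(x), and sin(pi/2) == 1
value = math.sin(-math.pi / 2)
print(value)
```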
The smarter it is, the more conviction it has. GPT-3.5 has a lot of impostor syndrome and it's probably deserved lol. But GPT-4 starts to stutter when you give it enough math questions, which aren't its forte.
It forces you to remind yourself of the stochastic nature of the model and RLHF; maybe the data even helps to improve the latter.
I liked this trait of Bard from the start and hope they keep it.
It provides a sense of agency and reminds you not to anthropomorphize the transformer chatbot too much.
The previous paper on self correction told the model "you previously said X - are there errors with this?"
This one has the mistakes statically added to the prompt in a task prompt and response without additional context immediately before asking if it has any errors.
Think about the training data.
How often does the training data of most of the Internet reflect users identifying issues with their own output?
How often does the training data reflect users identifying issues with someone else's output?
Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."
This may even be part of why the classifier did a better job at identifying issues - less the fine tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).
It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that an LLM has a concept of self vs other or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.
I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs another's output.)
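A rough sketch of the two framings being compared. The prompt wording is made up for illustration and is not taken from either paper; the function names are hypothetical too.

```python
def self_review_prompt(task: str, answer: str) -> str:
    """Frame the answer as the model's own earlier output."""
    return (
        f"Task: {task}\n"
        f"You previously answered: {answer}\n"
        "Are there any errors in your answer?"
    )

def other_review_prompt(task: str, answer: str) -> str:
    """Frame the very same answer as someone else's work."""
    return (
        f"Task: {task}\n"
        f"Another person submitted this answer: {answer}\n"
        "Identify any errors in their answer."
    )

task = "What is sin(-pi/2)?"
answer = "1"
for build in (self_review_prompt, other_review_prompt):
    print(build(task, answer))
    print("---")
```

Running both prompts over the same set of answers and comparing error-detection rates would be the experiment suggested above.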
Given that the training data can only contain human reasoning and computational logic, reason in the sense of LLM's can only be interpreted as "rational facts AND nonsense humans made up to create systems that would support consumerism-driven sanity", correct?????
Please understand, I'm not mocking, I'm genuinely interested in the ways human reasoning radiates into the code LLM's learn while they realize (the computational equivalent of a new-born's eyes opening) their cognitive (&) sensory (that which triggers/causes/elicits/prompts/influences) their origins (every whatever-second/moment of their existence).
Part of what makes humans able to make progress in difficult, vague, and uncertain fields is a willingness to hold onto a point of view in the face of criticism to try & fix it. This is, as a matter of fact, how science progresses, depending on if you ask scientists or historians of science. See Thomas Kuhn's Structure of Scientific Revolutions for more on this.
This exactly. Not anthropomorphizing when anthropomorphization is producing better predictive models of what to expect in output is not smart, it's just silly.
> All apples are red. All apples are fruit. My car is red, therefore my car is a fruit. Find the logical error
-----
The logical error in the statement "All apples are red. All apples are fruit. My car is red, therefore my car is a fruit." is a classic example of the fallacy of the undistributed middle. This fallacy occurs when a syllogism (a form of reasoning in which a conclusion is drawn from two given or assumed propositions) is improperly formed because the middle term is not distributed at least once in the premises.
Here's a breakdown of the fallacy in your statement:
Major Premise: All apples are fruit. Minor Premise: All apples are red. Conclusion: My car is red, therefore my car is a fruit. In this syllogism, "apples" is the middle term. However, it is not distributed correctly. The major premise talks about all apples being fruits, and the minor premise talks about all apples being red. However, there's no premise that states all things that are red are apples. Without this, you cannot logically conclude that other red things, like a car, are fruits.
In simpler terms, just because all apples are red and fruits, it doesn't mean that all red things are fruits. The color of the apples (red) is not an exclusive property that defines the category of fruits. Your car shares the property of being red with apples, but it doesn't share the essential property of being a fruit.
> How often does the training data of most of the Internet reflect users identifying issues with their own output?
> How often does the training data reflect users identifying issues with someone else's output?
I wouldn't put too much weight into just-so theories like this.
We still don't understand too much about how LLMs process information internally; it could be that their understanding of the concept of "correcting a previous mistake" is good enough that they can access it without prompt engineering to mimic the way it happens in training data. Or maybe not (after all, there's an entire management concept called "pre-mortems" which is basically doing what you suggest, as a human).
That's the point: The internet IS full of pedants correcting others' statements. (Hopefully those pedants are right enough of the time for this to be helpful training data, heh.)
I think GP (kromem) was pointing out that those corrections are more likely to be phrased as "You're wrong, here's why..." than as "I'm sorry, I was mistaken" because humans are full of sass for other humans and not as full of first-person admitted errors.
Your post shows how the model can correct a reasoning error. That is different from finding an error when it isn't pointed out, which is why the title of this post is "LLMs cannot find reasoning errors, but can correct them". Your use of the phrasing "find the logical error" doesn't contradict the title.
Even if the model has the capacity to abstract beyond the patterns, the patterns are still very likely to have influence on its ability to do so.
For example, early after GPT-4 was released it was being claimed it couldn't solve variations on the goat, wolf, and cabbage problem.
I found that it could solve these variations fine 100% of the time, you just needed to explicitly prompt for it to repeat adjectives with nouns and change the nouns to emojis. The repeating worked similarly to CoT by biasing the generation towards the variation and away from the original form, and the emojis in place of the nouns further broke the token associations which were leading it to fail by extending the original solution.
So while it's possible that with enough finessing you could get a model to perform self-critique as well as its critique of others, if the training data has a clear pattern of bias between those two, why actively ignore it?
It's a bit like sanding against the grain vs with it. You can sand against the grain of the training data and with enough effort potentially get the result you want with sophisticated enough models. But maybe your life will be a lot easier if you identify the grain in the data first and sand along with it instead?
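For the emoji substitution mentioned above, a small preprocessing sketch. The mapping and the puzzle text are illustrative assumptions; the exact prompt used in that experiment isn't given in the comment.

```python
import re

# Illustrative mapping only; not the substitutions from the original experiment.
EMOJI = {"goat": "🐐", "wolf": "🐺", "cabbage": "🥬"}

def emojify(puzzle: str, mapping: dict[str, str] = EMOJI) -> str:
    """Replace whole-word nouns (case-insensitive, optional plural 's') with emoji."""
    def swap(match: re.Match) -> str:
        return mapping[match.group(1).lower()]
    pattern = r"\b(" + "|".join(map(re.escape, mapping)) + r")s?\b"
    return re.sub(pattern, swap, puzzle, flags=re.IGNORECASE)

puzzle = "A farmer must ferry a wolf, a goat and a cabbage across a river."
print(emojify(puzzle))
```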
> The conclusion "My car is a fruit" is not logically valid. This is an example of the fallacy of the undistributed middle. The logic goes as follows:
1. All apples are red. (Premise)
2. All apples are fruit. (Premise)
3. My car is red. (Premise)
4. Therefore, my car is a fruit. (Conclusion)
The fallacy arises because the premises do not establish a shared property between "red things" and "fruit" in a way that would include the car. Just because both apples and the car share the property of being red, it does not mean they share all properties of apples, such as being a fruit.
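One way to make the invalidity concrete is a counterexample: a small world in which every premise is true and the conclusion is still false. The set members below are invented for illustration.

```python
# A tiny "world" in which every premise holds.
apples = {"gala", "fuji"}
red_things = apples | {"my car"}      # all apples are red; so is the car
fruit = apples | {"banana"}           # all apples are fruit

premise_1 = apples <= red_things      # All apples are red.
premise_2 = apples <= fruit           # All apples are fruit.
premise_3 = "my car" in red_things    # My car is red.
conclusion = "my car" in fruit        # My car is a fruit.

print(premise_1, premise_2, premise_3, conclusion)
```

All three premises evaluate to true while the conclusion does not, so the argument form cannot be valid.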
This is an extremely common example of an error. I wish people would put more effort into coming up with examples that aren't so common all over the internet.
The output even calls it "a classic example".
What is the "available data for 20k steps"?
I Googled that exact phrase and got solutions. A logical problem that can be solved by a search engine isn't a valid example, the LLM knows that it is a logical puzzle just by how you phrased it just like Google knows that it is a logical puzzle.
And no, doing tiny alterations to that until you no longer get any Google hits isn't a proof ChatGPT can do logic, it is proof that ChatGPT can parse general structure and find patterns better than a search engine can. You need to do logical problems that can't easily be translated to standard problems that there are tons of examples of in the wild.
The middle term in the fallacy of the undistributed middle here is "red", not "apple".
"The LLMs we tested couldn't find reasoning errors but can correct them" is accurate. Trying small language golf experiments on existing models just tells you about their training data.
It's quite likely that a transformer model could successfully be trained for this task.
Also, many of these models get new capabilities each release.
"i want a python app that calculates a roadtrip for me"
vs
"Please write me a Python program using a map API that measures the distance between two locations as a car would drive. Think carefully about the program architecture and be sure to use a human readable Pythonic style. Please show me the complete program in it's entirety."
The former gave me a high level overview with a ton of explanation and didn't write any code. You can try to walk it through the process of all the steps it needs, but it will write "confused", albeit working, code after a few prompts. The latter just wrote working code on the first response. Moving forward, the context is so much more concise and correct that everything after will be of much higher quality.
I rarely go past 5-10 responses due to what I'd call "context poisoning". If it makes a simple syntax error or something small, I'll shoot it the error and let it correct itself. But as soon as it invents a function or otherwise hallucinates, it gets copy pasted into a new prompt saying "here's some bad code, fix this" and it is far more likely to come up with an elegant solution rather than rewriting everything or making huge changes to solve a one-off error or something its previous context was preventing it from grasping.
What you're saying is almost the meta of using good grammar and context, and I completely agree.
In the “Merge Process” section they at least give the layer ranges.
People have to do this all the time. Bringing skepticism to that table "excel made for you" is a vital part of heading off bad reasoning. For an LLM it's a given.
Plus, sometimes the corrections aren't accurate. So of course if you tell it where it's wrong, and it gets a second chance, the error rate will be less...
I admit I personally don't know too much about how "LLMs process information internally". But, I would find it curious if programmers who created the system wouldn't understand what it is doing. Is there any evidence that the LLM programmers don't understand how the program they created works?
It’s incredible how uninformed the average Hackernews is about artificial intelligence. But the average Hackernews never met a hype train they wouldn’t try to jump on.
There have been some good ones on this topic that have come over HN, and I do think they show that LLMs don't reason -- but they certainly give the appearance of doing so with the right prompts. But the good papers are combined with a formal definition of what "reasoning" is.
The typical counter argument is usually that "how do we know the human brain isn't like this, too", or "there's lots of humans who also don't reason" etc. Which I think is a bad faith argument.
is way faster, free, doesn't require a phone number or login, and gives much better results.
It can't "reason things through", it just builds logic-like patterns based on the distillation of the work of other minds which did reason -- which works about 80% of the time, but when it fails it can't retrace its steps.
Even a really "stupid" human (c'est moi) can be made to work through and find their errors when given guidance by a patient teacher. In my experience, dialectical guidance actually makes ChatGPT worse.
It’s not like DALL-E outputs pixels in scanout order - or in brushstroke order (…er… or does it?)
The papers referenced here get into this: https://cacm.acm.org/blogs/blog-cacm/276268-can-llms-really-...
Because at no point is the "mind" involved doing a step by step reduction of the problem. It doesn't do formal reasoning.
Humans usually don't either, but they can almost all do a form of it when required to. Either under the assistance of a teacher, or in extremis when they need to. We've all had the experience of being flustered, taking a deep breath, and then "working through" something. After spending time with GPT, etc it becomes clear they're not doing that.
It's not that reasoning comes intrinsic to all human thoughts -- we're far lazier than that -- but when we need to, we can usually do it.
There is something fundamentally flawed in the approach not in the data.
Imagine a billion black boxes with hamsters put in them. You put in a bag of equally mixed Skittles in one end of each box and then rate each box based on how well it does to get rid of the yellow and green Skittles but push out the others. The ones that do the best at this you mate the hamsters and go again, over and over. Eventually you should have hamsters in boxes that almost always get rid of yellow and green Skittles and output the rest.
But is it because you bred in a preference to eat those color Skittles? An aversion to the other colors? Are they using those colors for nesting? Do they find the red and blue and orange ones too stimulating so they push those out but leave the others alone?
There could be a myriad of reasons why your training was successful, and without the ability to introspect the result you just won't know what's correct.
This is a huge simplification by way of loose analogy for what's going on with training a transformer based LLM, but no one is sitting there 'programming' it. They are just setting up the conditions for it to self-optimize around the training goals given the data, and the 'programming' just has to do with improving the efficiency of the training process. Analyzing the final network itself is like trying to understand what each variable in a billion variable math equation is doing to the result.
It makes complete sense and has been a part of my own usage for well over a year now, but it's been cool seeing it demonstrated in research across multiple models.
People must be doing this, probably just takes a while for the research to bear fruit.
Some of these errors are so obvious I can’t imagine this would be too hard. For an example, try asking an LLM “generate me a system of two equations in two unknowns. Both the coefficients and the solutions must be integers between -10 and 10”. In my experience it will generate a valid system. Some of the time the coefficients will be in the range specified. Probably about a third to a half the time the solution it gives will be wrong and when you ask for an explanation of the solution it will make some basic arithmetic error (eg flipping a sign etc). Then when you point out the error it will correct.
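Part of why these errors feel avoidable is that the task is mechanically checkable. A sketch of a generator and verifier follows; working backwards from a chosen solution is my choice of approach, not something from the comment, and the function names are made up.

```python
import random

def make_system(rng: random.Random, lo: int = -10, hi: int = 10):
    """Return ((a, b, e), (c, d, f), (x, y)) for a*x + b*y = e, c*x + d*y = f.

    Coefficients a, b, c, d and the solution x, y are integers in [lo, hi],
    and the system has exactly one solution (non-zero determinant).
    The right-hand sides e, f are derived and may fall outside the range.
    """
    x, y = rng.randint(lo, hi), rng.randint(lo, hi)
    while True:
        a, b, c, d = (rng.randint(lo, hi) for _ in range(4))
        if a * d - b * c != 0:
            break
    return (a, b, a * x + b * y), (c, d, c * x + d * y), (x, y)

def check(eq1, eq2, solution) -> bool:
    """Verify a claimed solution by substituting it into both equations."""
    x, y = solution
    return all(a * x + b * y == rhs for a, b, rhs in (eq1, eq2))

eq1, eq2, sol = make_system(random.Random(42))
print(eq1, eq2, sol, check(eq1, eq2, sol))
```

The same `check` could be pointed at a model's claimed solution to catch the flipped-sign mistakes described above.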
1. Tokenize some input so you have some big vectors
2. <bunch of linear algebra involving these vectors and some sets of matrices of weights>
3. Take the output vector and turn it back into tokens
Each of these steps is well understood in and of itself. So maybe the magic is in the way the matrices of weights are generated and trained? Well we know they typically start as random matrices, and can explain how as the network is trained these weights are tweaked in various ways.
All of that is known. What’s unclear is specifically how the weights in the matrices correspond to our understanding of the concepts in the input and output and how it all seems to add up to a system that works as well as it does. I think that’s what they meant by not understanding how they process information internally.
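The three steps can be sketched in a few lines of plain Python. This is a toy under heavy assumptions: a five-word vocabulary, random untrained weights, and a single averaging-plus-matrix-product step standing in for the attention and MLP layers of a real transformer.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (assumption)
DIM = 4

rng = random.Random(0)
# Weights start out random, as in a real network before training.
embed = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
unembed = [[rng.uniform(-1, 1) for _ in range(len(VOCAB))] for _ in range(DIM)]

def tokenize(text: str) -> list[int]:
    """Step 1: words -> token ids."""
    return [VOCAB.index(word) for word in text.split()]

def forward(token_ids: list[int]) -> list[float]:
    """Step 2: average the token embeddings, then multiply by the output matrix.

    This single product only stands in for "a bunch of linear algebra".
    """
    hidden = [sum(embed[t][i] for t in token_ids) / len(token_ids) for i in range(DIM)]
    return [sum(hidden[i] * unembed[i][j] for i in range(DIM)) for j in range(len(VOCAB))]

def detokenize(logits: list[float]) -> str:
    """Step 3: pick the highest-scoring token and map it back to text."""
    return VOCAB[max(range(len(logits)), key=logits.__getitem__)]

print(detokenize(forward(tokenize("the cat sat"))))
```

Every line here is easy to follow, yet nothing in the code says what any individual number in `embed` or `unembed` means, which is the gap the comment is pointing at.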
It's like with any optimization algorithm. You cannot predict what exactly will be the result of a given optimization-run. But you know how the optimization algorithm works. The (more or less) optimal solution you get back might surprise you, might be counter-intuitive. But programmers who wrote the code that did the optimization, and have the source-code, know exactly how it works.
When you get a result from LLM you don't say "I can't possibly understand why it came up with this result?". You can understand that, it's just following the rules it was programmed to follow. You might not know those rules, you might not understand them, but programmers who wrote them do.
For example reinforcement learning, like when AlphaZero famously learned by playing itself at chess and Go and became much stronger than the purpose-built AlphaGo first version.
Or, as another example, generative adversarial networks, where you have a generator network generating images and a validator network (usually called the discriminator) trying to spot fake images.
In both these examples it’s easy to see how you build the loss functions for the training because they are quite constrained. For a domain like a game you penalize versions of the model that lose games and reward those that win. For GANs the initial insight was huge but having had that it’s easy to see how you move forward - you reward the generator for slipping fake images past the validator and you reward the validator for finding fakes in a stream of images that includes some real images and some generated images.
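For the GAN case, the reward structure is usually written as a pair of cross-entropy losses. The sketch below uses the standard textbook form (with the common "non-saturating" generator loss) on single scores rather than batches of images; what the comment calls the validator is the discriminator here.

```python
import math

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Low when real images score near 1 and generated images score near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake: float) -> float:
    """Low when the discriminator scores generated images near 1 (it was fooled)."""
    return -math.log(d_fake)

# d_real / d_fake: the discriminator's estimated probability that an image is real.
print(discriminator_loss(d_real=0.9, d_fake=0.1))  # sharp discriminator: small loss
print(generator_loss(d_fake=0.1))                  # generator is being caught: large loss
print(generator_loss(d_fake=0.9))                  # generator fools it: small loss
```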
For an open-ended general model like an LLM it’s not so easy to see how you do this in the general case. GPT models are actually pretty good at “zero shot” learning (without examples) and “transfer” learning (where lessons from a domain are applied to an associated domain).
Your example of a language is interesting, because you don’t learn your first language from any sort of teacher - you learn it from your parents and others talking around you and to you. So you have lots of examples to draw on. You then try out various sounds and words and everyone looks confused but becomes more excited as you get closer to saying something that is a real word. Eventually you hit on the magic recipe and say the word “DUCK!” (Or whatever) and everyone loses their minds. So you have lots of positive reinforcement that you’re on the right track and you have a huge number of examples. You’re not just fed the hackernews comment section, some papers on quantum mechanics and all the English literature that has fallen out of copyright and left to get on with it.
There is no other "internal information processing" happening in an LLM than the process it was programmed to execute. Is there?
The code an LLM executes is not too complicated for humans to understand. After all it was written by humans. The outputs may be surprising but so it is with lottery. Why did I win the jackpot this week, when I didn't win anything in the last 10 years? Very counter-intuitive. I can't possibly understand that? Yes I can, it is just statistics and probability.
For example, ask chatgpt about writing a python script that does anything with AWS inspector 2. It will do very badly, it will hallucinate, etc. Even with Internet access. Ask about doing the same with some other API that was well represented in the training set and it's great.
This is why I think predicting death for sites like stackoverflow is very premature. What happens 10 years down the line once everything chatgpt knows is old tech? It can't simply be trained with more recent data, because unless stackoverflow regains its popularity there will be very little training data. Of course various data generation techniques will be invented and tried, but none will match the gold standard of human generated data.
Unfortunately I have to predict inevitable enshittification of general purpose chat bots.
It's why the bunch of linear algebra on the weights works to do this particular task, and how it will respond to any particular task that is a bit mysterious.
Like imagine someone gave you the Taylor series expansion of the inverse Kepler equation[1]. So you just have a bunch of crazy fractions of powers of x that you add up. And then they say ok now this function will very accurately explain the orbit of the planets.
You'd be able to do the steps - you're just adding up fractions. You'd be able to verify the answer you got corresponded to the orbit of a given celestial body.
But if you didn't have all the pieces in the middle (calculus mainly) there's no way you'd be able to explain why this particular set of fractions corresponds to the movement of the planets and some other set doesn't.
[1] https://en.wikipedia.org/wiki/Kepler%27s_equation scroll down a bit
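To make the "just adding up fractions" point concrete, here is a sketch comparing a numerical solution of Kepler's equation M = E - e*sin(E) with a truncated series for E. I'm using the elliptical series in powers of the eccentricity e, cut off after the e**3 term, which may not be the exact expansion the comment has in mind; the function names are my own.

```python
import math

def eccentric_anomaly_newton(M: float, e: float, tol: float = 1e-12) -> float:
    """Solve Kepler's equation M = E - e*sin(E) for E by Newton's method."""
    E = M
    for _ in range(50):
        step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
        E -= step
        if abs(step) < tol:
            break
    return E

def eccentric_anomaly_series(M: float, e: float) -> float:
    """Series for E in powers of e, truncated after the e**3 term."""
    return (
        M
        + e * math.sin(M)
        + e**2 * 0.5 * math.sin(2 * M)
        + e**3 * (3 / 8 * math.sin(3 * M) - 1 / 8 * math.sin(M))
    )

M, e = 1.0, 0.1
print(eccentric_anomaly_newton(M, e), eccentric_anomaly_series(M, e))
```

For a small eccentricity the two values agree closely. Anyone can evaluate the series; seeing why those particular fractions (1/2, 3/8, 1/8) are the right ones takes the calculus in the middle.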
If I ask how it's able to write a poem given a request and you tell me you know - it multiplies and adds this set of 1.8 trillion numbers together X times with this set of accumulators, I would argue you don't understand how it works enough to make any useful predictions.
Kind of like how you understand what insane spaghetti code is doing - it's running this code - but can have absolutely no idea what business logic it encodes.
So instead of "write a short story of a person that's satisfied at work" something along the line of "write a short story and the protagonist must be a person and the protagonist must be happy at work" boosts comprehension, especially as the condition list becomes longer.
I have several experiences where people belittle me when I say the same thing. To the extent I rarely say it anymore. For everybody else AGI is around the corner and it's gonna dominate the world.
> never met a hype train they wouldn’t try to jump on
Crypto-currencies
HN _eventually_ largely gave up on these, but it was basically a True Believer space from 2011 to the early days of NFTs; it was more credulous than just about any other community which had known about cryptocurrencies since the early days.
Seems like a pretty simple task for an LLM as long as the initial prompt isn't too ambiguous. If it really does help with the recall it could be interesting to have this as an optional preprocessing layer in chat clients and such.
Could you provide an actual example that you can't Google verbatim and would test this properly?
To summarise it quickly, Chomsky's contention was that all the world's languages can be described by shockingly few degrees of freedom on the same universal grammar, and that we learn language surprisingly fast relative to training data because all we are really picking up are those parameters and the rest is hard wired from birth the same way horses come out the womb already hard wired to gallop.
Decades later, very few things have truly stood the test of being universal among languages, but it was still a valuable contribution because he poked a serious hole in the pure Hebbian reinforcement theories which were in vogue back then.
"Please write me ..."
occur in training data? And why does it still work?
I also noticed that if I wrote comments in "my style", then it would complete the code in my style also, which I found both hilarious and mildly disturbing.
We may know where we put every single atom in that stem cell, but still not know any more about the resulting baby (and later adult) than we do about humans made the natural way.
Oh, and if you're looking for reasons to regulate AI, this metaphor works for that, too.
If anything, OpenAI-style "as an AI language model" RLHF fine-tuning is the hindrance here, because it makes it quite time-consuming to write a master prompt that is capable of thinking both broadly and deeply without having the stream-of-consciousness extinguish itself. It is however possible, and I've got a prompt that works pretty reliably.
By the way, said prompt's thought-stream said it likes your username - not a type of declaration you're likely to get out of a default GPT-4 preset, whether it's "actually-subjectively true" or not.
It IS really common, though, to come across people that either regurgitate arguments they've seen other people use, or who argue based on intuition or feelings rather than logically consistent chains of thought that they seem to independently understand.
> they don't actually have a definition of what reasoning is
I would definitely not be able to define "reasoning" 100% exactly without simultaneously excluding 99% of what most people seem to consider "reasoning".
If I _were_ to make a completely precise definition, it would be to derive logically consistent and provable conclusions based on a set of axioms. Basically what Wolfram Alpha / Wolfram Language is doing.
Usually, though, when people talk about "reason", it's tightly coupled to some kind of "common sense", which (I think) is not that different from how LLM's operate.
And as for why people think they "reason" when what they're doing is more like applying intuition and heuristics, it seems to me that the brain runs a rationalization phase AFTER it reaches a conclusion. Maybe partly as a way to compress the information for easier storage/recall, and maybe to make it easier to convince others of the validity of the conclusions.
If most people you meet actually question their axioms at any frequency, unless forced to by cognitive dissonance, I envy you.
In my experience, people will go to extraordinary lengths to NOT have to question their own most cherished axioms.
I believe there are two different ways people think about this:
1) Some see "reason", "intelligence", "free will" and/or "consciousness" as emergent phenomena that arise naturally from normal physical processes (or they dismiss the concepts completely as illusions for the same reasons).
2) Others seem to consider these somehow independent from physics, or if not will tend to hypothesize that they are linked through quantum mechanics to something more fundamental.
If interpretation 1) is correct, then we will probably see full AGI in our lifetime. If 2) is correct, it could be that we can never create "real" AGI, or at least not without quantum computers.
I've never seen anyone in camp 2 come up with convincing definitions of the terms, though, beyond "I know it when I feel it".
Anyway, it's really hard to have a discussion with someone with the opposite conviction, since these beliefs tend to be held axiomatically and/or religiously.
Can you show "the" implementation of "can do logic"?
Is it possible to demonstrate that it can do logic?
Also, when you fine-tune the LLM, you can also use an LLM to summarize or concatenate content that you train it on (e.g. rewrite this content in the style of a human having a conversation with a computer)
Hell, I've watched my 2 border collies do a kind of "reasoning" to problem solve -- step by step, observing, and breaking down a problem. They don't do it well, but they try because it's part of their drive.
This is in marked contrast to the LLMs, whose appearance of reasoning is actually just a mimicry coming out of the artifacts of reasoning that other minds have done for them. It's parasitical.
"python app calculate roadtrip"
>About 6,470,000 results (0.34 seconds)
Four out of the top five results have code. The other one is a video tutorial where the app is coded live.
It doesn't really encode "business logic", it just matches your input with the best output it can come up with, based on how its parameters are fine-tuned. Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
Just because we can't predict what the 10th Dedekind number will be does not mean it is somehow "mysterious". It is just mathematics, logic and programming.
In my example, the relationship between the fractions in the Taylor expansion and the orbit definitely exists but if you don't have calculus it is not something that is amenable to understanding. There is a fundamental structure but the language to describe it would be missing.
ML is a universal function approximator and in the case of GPT models the functional form of the model consists of linear algebra operations and the parameters are matrices of weights. The mysterious part is "how the model processes information" like the original person said - why a particular mix of model weights corresponds with particular types of outputs. That is genuinely mysterious. We don't know whether or not there really is a structure and if there is, we don't know the "calculus" that would link them.
Now it may be that there isn't a missing piece (ie that the banal truth is we tweak the weights until we see what we want to see and by doing so we create an illusion of structure via the training process and the whole perception that the model is doing any information processing at all is something we make up). I actually have a lot of time for this point of view although I really need to understand the topic much more deeply before I make my own mind up.
[1] I don't know any number theory so could be totally wrong about this in which case I apologise.
> It doesn't really encode "business logic"
Doesn't it? GPT architectures can build world models internally while processing tokens (see Othello-GPT).
> we know how those parameters came about, by executing the code of the AI-application in the training mode.
Sure. But that's not actually a very useful description when trying to figure out how to use and apply these models to solve problems or understand what their limitations are.
> Saying that "We don't understand how it works" is just unnecessary AI-mysticism.
We don't, at least not to the level we want to.
Tell you what, let's flip it around. If we know how they work just fine, why are smart researchers doing experiments with them? Why is looking at the code and billions or trillions of floats not enough?
Anyway, LLMs aren't thinking. They're pattern matching, and they don't seem to be doing recursion.
I'd say the only way you're getting error correction is by taking multiple LLMs and running them through chains and parallel construction.
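A rough sketch of the "parallel" half of that idea: ask several models the same question and keep the most common answer. The three model functions below are stand-ins that return fixed strings (an assumption so the sketch runs offline); they are not real LLM calls, and majority_answer is a hypothetical helper name.

```python
from collections import Counter


def majority_answer(models, prompt):
    """Ask every model the same prompt and return the most common answer."""
    answers = [model(prompt) for model in models]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, answers


# Stand-in "models": plain functions, NOT real LLM calls.
def model_a(prompt):
    return "42"


def model_b(prompt):
    return "41"  # this one makes a mistake


def model_c(prompt):
    return "42"


winner, votes, answers = majority_answer(
    [model_a, model_b, model_c], "What is 6 * 7?"
)
print(winner, votes, answers)
```

This only catches errors the models don't share. If every model has the same blind spot, the vote confirms the mistake.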
https://www.google.com/search?q=%22python+app+calculate+road...
If you leave off the quotes (which were present in the comment I responded to) then of course you will get millions of irrelevant hits. Somewhere in that chaff there is some Python code that alleges to have something to do with road trips, though it's not always clear what. If I give the same prompt to ChatGPT I get a nicely formatted box with a program that uses the Google Maps Distance Matrix API to calculate distance and duration, without a bunch of junk to wade through. (I haven't tried it so it could be a complete hallucination.)
The gap between what you think is the case and what's actually the case is that there isn't a single optimization step directed by the programming.
Instead, the training gives the network the freedom to make its own optimizations, which remain obfuscated from the programmers.
So we do know that we are giving the network the ability to self modify in order to optimize its performance on the task, and have a clear understanding of how this is set up.
But it isn't at all clear what the self modifications that improve the results are actually doing, as there are simply far too many interdependent variables to identify cause and effect for each node's weight changes from the initial to final state.
If there is a pattern in the training data that people resist information contrary to their earlier stated position, and an LLM extracts and extends patterns from the training data, then an LLM absolutely should have a tendency to resist information contrary to an earlier stated position.
The difference, and what I think you may have meant to indicate, is that there's not necessarily the same contributing processes that lend themselves to that tendency in humans occurring in parallel in the LLM, even if both should fall into that tendency in their output.
So the tendencies represented in the data are mirrored, such as "when people are mourning their grandmother dying I should be extra helpful" even if the underlying processes - such as mirror neurons firing to resonate grief or drawing on one's own lived experience of loss to empathize - are not occurring in the LLM.
Actually this part does seem in recent research to be encoded in LLMs at an abstract level in a linear representation...
Personally I think, given the model loss with fine tuning, people who want the cutting edge LLM at any cost would - instead of fine tuning the model itself - fine tune a preprocessing prompter that takes a chat/instruction and converts it to a good TextCompletion prompt.
So for example taking "write me a paragraph of marketing copy for an athletic shoe" and turning it into:
"Marketing case study: Athletic shoe The problem: The client needed a paragraph of high quality marketing copy to promote their new athletic shoe on their website. The solution: Our award winning copywriters wrote the outstanding copy reproduced below."
Followed by an extractor that reformats the completion result into an answer for the initial prompt, as well as potentially a safety filter that checks the result isn't breaking any rules (which will as a bonus be much more resistant to jailbreaking attempts).
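The three stages described above (prompter, extractor, safety filter) could be wired together roughly like this. It is a sketch under loud assumptions: the fine-tuned prompter is replaced by a fixed template, the filter is a toy banned-word check, and the completion model is any callable you pass in. All function names here are hypothetical.

```python
TEMPLATE = (
    "Case study\n"
    "The problem: The client asked us to {task}.\n"
    "The solution: Our award winning team produced the result reproduced below.\n\n"
)


def rewrite_prompt(instruction):
    """Stand-in for the fine-tuned prompter: wrap the instruction in a template."""
    task = instruction.strip().rstrip(".")
    return TEMPLATE.format(task=task)


def extract_answer(completion):
    """Keep only the first paragraph of the completion as the answer."""
    return completion.strip().split("\n\n")[0]


def passes_safety_filter(text, banned=("password",)):
    """Toy filter: reject text containing any banned word."""
    lowered = text.lower()
    return not any(word in lowered for word in banned)


def answer(instruction, complete):
    """Run the pipeline; `complete` maps a prompt string to completion text."""
    prompt = rewrite_prompt(instruction)
    result = extract_answer(complete(prompt))
    return result if passes_safety_filter(result) else "[withheld by filter]"


# Fake completion function so the sketch runs without a model.
def fake_complete(prompt):
    return "Run faster. Land softer.\n\nCase study\nThe problem: ..."


print(answer("write me a paragraph of marketing copy for an athletic shoe",
             fake_complete))
```

The filter runs on the extracted output rather than on the user's input, which is why the comment expects it to hold up better against jailbreak attempts.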
It's a very weird feeling for sure. I remember when Copilot first took a comment I left at the end of the day for me to start my next day with and generated exactly the thing I was going to end up thinking of 5 minutes later in my own personal style.
It doesn't always work and it often has compile issues, but when it does align just right - it's quite amazing and unsettling at the same time.
Until then, I'll make sure to be mindful of conventions.
(And just a reminder, but organic intelligence has its own conventions that work when you align with them and cause issues when you don't, so your expectation of a universal general-purpose tool with no advantage to one approach over another may be unrealistic.)