Then I asked both Gemini and Grok to count the legs; both kept saying 4.
Gemini just refused to consider it was actually wrong.
Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."
You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.
It's rather like how we humans are RL’d like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister).
I.e., we all have trained biases that we are told to follow and are trained on; human art is about subverting those expectations.
RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.
If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.
Place sneakers on all of its legs.
It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro).
Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data.
(Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).
https://gemini.google.com/share/b3b68deaa6e6
I thought giving it a setting would help, but just skip that first response to see what I mean.
Also my bet would be that video capable models are better at this.
LLMs are in fact good at generalizing beyond their training set; if they didn’t generalize at all we would call that over-fitting, and that is not good either. What we are talking about here is simply a bias, and I suspect biases like these are simply a limitation of the technology. Some of them we can get rid of, but, as with almost all statistical modelling, some biases will always remain.
I'm wondering if it may only expect the additional leg because you literally just told it to add said additional leg. It would just need to remember your previous instruction and its previous action, rather than to correctly identify the number of legs directly from the image.
I'll also note that photos of dogs wearing shoes are definitely something it has been trained on, albeit presumably more often dog booties than human sneakers.
Can you make it place the sneakers incorrectly-on-purpose? "Place the sneakers on all the dog's knees?"
"The researchers feed a picture into the artificial neural network, asking it to recognise a feature of it, and modify the picture to emphasise the feature it recognises. That modified picture is then fed back into the network, which is again tasked to recognise features and emphasise them, and so on. Eventually, the feedback loop modifies the picture beyond all recognition."
In other words:
1. Took a personal image of my dog Lily
2. Had NB Pro add a fifth leg using the Gemini API
3. Downloaded image
4. Sent image to BFL Flux2 Pro via the BFL API with the prompt "Place sneakers on all the legs of this animal".
5. Sent image to NB Pro via Gemini API with the prompt "Place sneakers on all the legs of this animal".
So not only was there zero "continual context", it was two entirely different models as well to cover my bases.
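The steps above can be sketched roughly as follows. This is only an illustration of the "no shared context" point, not the actual scripts used: the `Editor` signature, the `run_trial` helper, the leg-adding prompt text, and the stub editors are all hypothetical stand-ins for the real Gemini and BFL API clients, which look different.

```python
from typing import Callable, Dict

# An "editor" takes image bytes plus a prompt and returns new image bytes.
# Hypothetical signature; real API clients are not shaped like this.
Editor = Callable[[bytes, str], bytes]

SNEAKER_PROMPT = "Place sneakers on all the legs of this animal"


def run_trial(original: bytes, add_leg: Editor,
              editors: Dict[str, Editor]) -> Dict[str, bytes]:
    """Add a fifth leg once, then send the saved image to each editor separately."""
    # Placeholder prompt: the exact wording used for this step is not known.
    five_legged = add_leg(original, "Add a fifth leg to this dog")
    # Each editor receives only the image bytes and the sneaker prompt:
    # no chat history and no record of the leg-adding step.
    return {name: edit(five_legged, SNEAKER_PROMPT)
            for name, edit in editors.items()}


# Stub editors so the sketch runs without network access.
def fake_add_leg(image: bytes, prompt: str) -> bytes:
    return image + b"+leg"


def fake_flux(image: bytes, prompt: str) -> bytes:
    return image + b"|flux:" + prompt.encode()


def fake_nb(image: bytes, prompt: str) -> bytes:
    return image + b"|nb:" + prompt.encode()


results = run_trial(b"lily", fake_add_leg, {"flux": fake_flux, "nb": fake_nb})
```

The only thing passed between steps is the downloaded image, so neither model can "remember" that a leg was added.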
EDIT: Added images to the Imgur for the following prompts:
- Place red Dixie solo cups on the ends of every foot on the animal
- Draw a red circle around all the feet on the animal
And the AI has been RLed for tens of thousands of years not just a few days.
In which case the only way I can read your point is that hallucinations are specifically incorrect generalizations. In which case, sure if that's how you want to define it. I don't think it's a very useful definition though, nor one that is universally agreed upon.
I would say a hallucination is any inference that goes beyond the compressed training data represented in the model weights + context. Sometimes these inferences are correct, and yes we don't usually call that hallucination. But from a technical perspective they are the same -- the only difference is the external validity of the inference, which may or may not be knowable.
Biases in the training data are a very important, but unrelated issue.
Interpolation is a much narrower construct than generalization. LLMs are fundamentally much closer to curve fitting (where interpolation is king) than they are to hypothesis testing (where samples are used to describe populations), though they certainly do something akin to the latter too.
The bias I am talking about is not a bias in the training data, but bias in the curve fitting, probably because of mal-adjusted weights, parameters, etc. And since there are billions of them, I am very skeptical they can all be adjusted correctly.
https://chatgpt.com/share/6933c848-a254-8010-adb5-8f736bdc70...
This is the SVG it created.
As for bias, I don’t see the distinction you are making. Biases in the training data produce biases in the weights. That’s where the biases come from: over-fitting (or sometimes, correct fitting) of the training data. You don’t end up with biases at random.
LLMs are fancy “lorem ipsum based on a keyword” text generators. They can never become intelligent … or learn how to count or do math without the help of tools.
It can probably generate a story about a 5 legged dog though.
As for bias, sampling bias is only one of many types of bias. I mean, the UNIX program YES(1) has a bias towards outputting the string y despite not sampling any data. You can very easily and deliberately program a bias into anything you like. I am writing a kanji learning program using SSR, and I deliberately bias new cards towards the end of the review queue to help users with long review queues empty them more quickly. There is no data which causes that bias; I just programmed it in there.
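A deliberately programmed bias like the review-queue one might look something like this. It is a minimal sketch, not the actual kanji program: the function name `insert_new_card`, the `tail_fraction` parameter, and the plain-list queue are assumptions made for the example.

```python
import random


def insert_new_card(queue, card, tail_fraction=0.25, rng=random):
    """Insert `card` at a random position within the last `tail_fraction` of `queue`.

    Hypothetical helper: the bias toward the end comes purely from this rule,
    not from any data.
    """
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    n = len(queue)
    # Earliest index a new card may occupy; cards before it keep their place.
    start = n - int(n * tail_fraction)
    position = rng.randint(start, n)  # inclusive on both ends; n means "append"
    queue.insert(position, card)
    return position


queue = [f"review-{i}" for i in range(20)]
pos = insert_new_card(queue, "new-kanji", rng=random.Random(0))
```

With 20 cards and the default `tail_fraction`, the new card always lands at index 15 or later, however the existing cards got there.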
I don't know enough about diffusion models to know how biases can arise there, but with unsupervised learning (even though sampling bias is indeed very common) you can get a bias because you are using wrong, mal-adjusted, or too many parameters, etc. Even the way your data interacts during training can cause a bias. Heck, even by chance one of your parameters can hit an unfortunate local maximum, yielding a mal-adjusted weight, which may cause bias in your output.
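As a toy illustration of the last point (a parameter getting stuck at a poor local optimum purely because of where it started), here is a one-parameter gradient descent. The loss function is made up for the example and has nothing to do with any real model; it is phrased as minimizing a loss rather than maximizing an objective, but the effect is the same.

```python
def loss(w):
    # Invented non-convex loss with two valleys: a deep one near w = -1.30
    # and a shallow one near w = 1.13.
    return w**4 - 3 * w**2 + w


def grad(w):
    # Derivative of the loss above.
    return 4 * w**3 - 6 * w + 1


def descend(w, lr=0.01, steps=2000):
    """Plain gradient descent on a single parameter."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_left = descend(-2.0)   # starts on the left, reaches the deeper valley
w_right = descend(2.0)   # starts on the right, gets stuck in the shallow one
```

Same data-free loss, same algorithm; only the starting point differs, and `w_right` ends up with a noticeably worse loss than `w_left`.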
Gemini responds:
Conceptualizing the "Millipup"
https://gemini.google.com/share/b6b8c11bd32f
Draw the five legs of a dog as if the body is a pentagon
https://gemini.google.com/share/d74d9f5b4fa4
And animal legs are quite standardized
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_l...
It's all about the prompt. Example:
Can you imagine a dog with five legs?
https://gemini.google.com/share/2dab67661d0e
And generally, the issue sits between the computer and the chair.
;-)
I'm not particularly well-versed in LLMs, but isn't there a step in there somewhere (latent space?) where you effectively interpolate in some high-dimensional space?
The LLM uses attention and some other tricks (attention, it turns out, is not all you need) to build a probabilistic model of what the next token will be, which is then sampled. This is much more powerful than interpolation.
It’s a subtle distinction, but I think an important one in this case, because if it was interpolation then genuine creativity would not be possible. But the attention mechanism results in model building in latent space, which then affects the next token distribution.
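The "build a distribution, then sample" step can be sketched like this. It only shows the final sampling stage: the vocabulary and scores are invented, and in a real LLM the scores (logits) would come out of the attention layers rather than being typed in by hand.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token from the distribution implied by `logits`."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Made-up scores for the word after "The dog has four ..."
vocab = ["legs", "paws", "wheels"]
logits = [4.0, 2.0, -1.0]
probs = softmax(logits)
token = sample_next_token(vocab, logits, rng=random.Random(0))
```

Lowering `temperature` sharpens the distribution toward the highest-scoring token, which is one way a strong prior like "dogs have four legs" can dominate the output.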
So back to the analogy, it could be as if the LLMs experience the equivalent of a very intense optical illusion in these cases, and then completely fall apart trying to make sense of it.
My reason for subscribing to the latter camp is that when you have a distribution and you fit things according to that distribution (even when the fitting is stochastic, and even when the distribution lives in billions of dimensions), you are doing curve fitting.
I think the one extreme would be a random walk, which is obviously not curve fitting, but if you draw from any distribution other than the uniform distribution, say the normal distribution, you are fitting that distribution (actually, I take that back: the original random walk is fitting the uniform distribution).
Note I am talking about inference, not training. Training can be done using all sorts of algorithms; some include priors (distributions) and would be curve fitting, but only compute the posteriors (also distributions). I think the popular stochastic gradient descent does something like this, so it would be curve fitting, but the older evolutionary algorithm just random-walks it and is not fitting any curve (except the uniform distribution). What matters to me is that the training arrives at a distribution, which is described by a weight matrix, and what inference is doing is fitting to that distribution (i.e. the curve).
Asymmetry is as hard for AI models as it is for evolution to "prompt for" but they're getting better at it.
The systems already absorb much more complex hierarchical relationships during training, just not that particular hierarchy. The notion that everything is made up of smaller components is among the most primitive in human philosophy, and is certainly generalizable by LLMs. It just may not be sufficiently motivated by the current pretraining and RL regimens.
This happens all the time with humans. Imagine you're at a call center and get all sorts of weird descriptions of problems with a product: every human agent is expected not to assume the caller is an expert, and will actually try to interpolate what the caller might mean by the weird wording they use.
Except in the most technical sense that any function constrained to meet certain input output values is an interpolation. But that is not the smooth interpolation that seems to be implied here.
https://chat.vlm.run/c/62394973-a869-4a54-a7f5-5f3bb717df5f
Here is the thought process summary (you can see the full thinking at the link above):
"I have attempted to generate a dog with 5 legs multiple times, verifying each result. Current image generation models have a strong bias towards standard anatomy (4 legs for dogs), making it difficult to consistently produce a specific number of extra limbs despite explicit prompts."