zlacker

[parent] [thread] 22 comments
1. Benjam+(OP)[view] [source] 2025-12-05 21:22:52
It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than the simple fact that LLM thinking works differently than human thinking.

I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".

Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?

More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting; it feels like there's tons of very human-like subtext involved, and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense. This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them?

replies(4): >>biophy+A2 >>majorm+65 >>runarb+Q8 >>Paraco+A31
2. biophy+A2[view] [source] 2025-12-05 21:33:44
>>Benjam+(OP)
I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible.
replies(2): >>Benjam+nb >>genrad+9t
3. majorm+65[view] [source] 2025-12-05 21:46:00
>>Benjam+(OP)
The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition.

Let's not say that the deceptive people are the ones who've spotted the ways in which that marketing is untrue...

4. runarb+Q8[view] [source] 2025-12-05 22:09:12
>>Benjam+(OP)
This is the first time I have heard the term "LLM cognition", and I am horrified.

LLMs don't have cognition. LLMs are statistical inference machines which predict an output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved, only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.

When I see these tests, I presume they are made to demonstrate the limitations of this technology. It is both relevant and important that consumers know they are not dealing with magic, and are not being sold a lie (in a healthy economy a consumer protection agency should ideally do that for us; but here we are).

replies(2): >>Camper+ca >>Benjam+Mb
◧◩
5. Camper+ca[view] [source] [discussion] 2025-12-05 22:19:22
>>runarb+Q8
You'll need to explain the IMO results, then.
replies(1): >>runarb+1f
◧◩
6. Benjam+nb[view] [source] [discussion] 2025-12-05 22:26:42
>>biophy+A2
So the idea is what? What does a successful outcome look like for this test, in your mind? What should good software do? Respond and say there are 5 legs? Or question what kind of dog this even is? Or get confused by a nonsensical picture that doesn't quite match the prompt? Should it understand the concept of a dog and be able to tell you that this isn't a real dog?
replies(2): >>biophy+gh >>menaer+s91
◧◩
7. Benjam+Mb[view] [source] [discussion] 2025-12-05 22:29:25
>>runarb+Q8
>They belong in different categories

Categories of _what_, exactly? What word would you use to describe this "kind" of which LLMs and humans are two very different "categories"? I simply chose the word "cognition". I think you're getting hung up on semantics here a bit more than is reasonable.

replies(2): >>runarb+Og >>Libidi+4n1
◧◩◪
8. runarb+1f[view] [source] [discussion] 2025-12-05 22:49:27
>>Camper+ca
Human legs and car tires can both take a human and a car, respectively, to the finish line of a 200 meter track course; the car tires do so considerably quicker than a pair of human legs. But nobody needs to describe the tire's running abilities because of that, nor even to compare a tire to a leg. A car tire cannot run, and it is silly to demand an explanation for it.
replies(2): >>Camper+gi >>dekhn+mq
◧◩◪
9. runarb+Og[view] [source] [discussion] 2025-12-05 23:01:06
>>Benjam+Mb
> Categories of _what_, exactly?

Precisely. At least apples and oranges are both fruits, and it makes sense to compare e.g. the sugar content of each. But an LLM and the human brain are as different as the wind and the sunshine. You cannot measure the wind speed of the sun and you cannot measure the UV index of the wind.

Your choice of words here was rather poor in my opinion. Statistical models do not have cognition any more than the wind has ultraviolet radiation. Cognition is a well-studied phenomenon; there is a whole field of science dedicated to it. And while the cognition of animals is often modeled using statistics, statistical models in themselves do not have cognition.

A much better word here would be "abilities". That is, these tests demonstrate the different abilities of LLMs compared to human abilities (or even the abilities of traditional [specialized] models, which often do pass these kinds of tests).

Semantics often do matter, and what worries me is that these statistical models are being anthropomorphized way more than is healthy. People treat them like the crew of the Enterprise treated Data, when in fact they should be treated like the ship's computer. And I think this is because of a deliberate (and malicious/consumer-hostile) marketing campaign from the AI companies.

replies(2): >>Benjam+Kp >>Workac+gn1
◧◩◪
10. biophy+gh[view] [source] [discussion] 2025-12-05 23:04:02
>>Benjam+nb
No, it's just a test case to demonstrate flexibility when faced with unusual circumstances.
◧◩◪◨
11. Camper+gi[view] [source] [discussion] 2025-12-05 23:11:08
>>runarb+1f
I see.
◧◩◪◨
12. Benjam+Kp[view] [source] [discussion] 2025-12-06 00:09:01
>>runarb+Og
Wind and sunshine are both types of weather, what are you talking about?
replies(1): >>runarb+Ar
◧◩◪◨
13. dekhn+mq[view] [source] [discussion] 2025-12-06 00:14:13
>>runarb+1f
Sure, car tires can run - if they're huaraches.
◧◩◪◨⬒
14. runarb+Ar[view] [source] [discussion] 2025-12-06 00:23:22
>>Benjam+Kp
They both affect the weather, but in totally different ways, and by completely different means. Similarly, the mechanism by which the human brain produces output is completely different from the mechanism by which an LLM produces output.

What I am trying to say is that the intrinsic properties of the brain and an LLM are completely different, even though the extrinsic properties might appear the same. This is also true of the wind and the sunshine. It is not unreasonable to claim (though I would disagree) that "cognition" is almost by definition the sum of all intrinsic properties of the human mind (I would disagree only on the merit of animal and plant cognition existing, and the former [probably] having intrinsic properties similar to human cognition).

replies(1): >>Kiro+4y2
◧◩
15. genrad+9t[view] [source] [discussion] 2025-12-06 00:35:21
>>biophy+A2
You're correct. However, midwit people who don't actually understand all of this will latch on to one of the early difficult questions that was shown as an example, and then continue to use it over and over without really knowing what they're doing, while the people developing and testing the model are doing far more complex things.
16. Paraco+A31[view] [source] 2025-12-06 08:13:30
>>Benjam+(OP)
> Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.?

No. Humans don't need this handicap, either.

> More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context?

Any answer containing "5" as the leading candidate would be correct.

> What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?

Irrelevant to the correctness of an answer to the question "how many legs does this dog have?" Also, asking how many legs a 5-legged dog has is not deceitful.

> This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them?

It's a demonstration of the lack of rigor in out-of-distribution vision and reasoning capabilities. One can imagine similar scenarios with much more tragic consequences when such AI is used to, e.g., drive vehicles or assist in surgery.

◧◩◪
17. menaer+s91[view] [source] [discussion] 2025-12-06 09:30:00
>>Benjam+nb
You know, I had a potential hire last week. I was interviewing a guy whose resume was really strong, exceptional in many ways, and his open-source code looked really tight. But at the beginning of the interview I always show candidates the same silly code example with signed integer overflow undefined behavior baked in. I did the same here and asked him if he saw anything unusual with it, and he failed to detect it. We closed the round immediately and I gave a no-hire decision.
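
To give a flavor (this is not the exact snippet I use, just a made-up minimal sketch in the same spirit): the "overflow check" below runs after the multiplication, but signed overflow is undefined behavior in C, so the wrap-around it relies on isn't guaranteed.

    /* Illustrative sketch only, not the actual interview snippet.
     * double_or_cap() tries to detect overflow *after* computing
     * value * 2, but signed integer overflow is undefined behavior
     * in C, so the wrap-around this check relies on is not guaranteed
     * and the compiler is free to assume the overflow never happens. */
    #include <limits.h>
    #include <stdio.h>

    static int double_or_cap(int value) {
        int doubled = value * 2;   /* UB already happens here if value > INT_MAX / 2 */
        if (doubled < value)       /* unreliable "overflow check" after the fact */
            return INT_MAX;
        return doubled;
    }

    int main(void) {
        /* With a 32-bit int, 2000000000 * 2 exceeds INT_MAX. */
        printf("%d\n", double_or_cap(2000000000));
        return 0;
    }

The catch is that the undefined behavior has already occurred at the multiplication, before the check ever runs.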
replies(1): >>michae+Jz1
◧◩◪
18. Libidi+4n1[view] [source] [discussion] 2025-12-06 12:23:11
>>Benjam+Mb
This is "category" in the sense of Gilbert Ryle's category error.

A logical type or a specific conceptual classification dictated by the rules of language and logic.

This is exactly getting hung up on the precise semantic meaning of the words being used.

The lack of precision is going to have huge consequences with bets this large being placed on the idea that we have "intelligent" machines that "think" or have "cognition", when in reality we have probabilistic language models and all kinds of category errors in the language surrounding these models.

Probably a better example here is that category in this sense is lifted from Bertrand Russell’s Theory of Types.

It is the loose equivalent of asking why you are getting hung up on the type of a variable in a programming language. A float or a string? Who cares, if it works?

The problem is in introducing non-obvious bugs.

◧◩◪◨
19. Workac+gn1[view] [source] [discussion] 2025-12-06 12:26:54
>>runarb+Og
It's easy to handwave this away if you pick arbitrary analogies, though.

If we stay on topic, it's much harder to do, since we don't actually know how the brain works, beyond, at least, that it is a computer doing (almost certainly) analog computation.

Years ago I built a quasi-mechanical calculator. The computation was done mechanically, and the interface was done electronically. From a calculator's POV it was an abomination, but a few abstraction layers down, they were both doing the same thing, albeit with my mecha-calc being dramatically worse at it.

I don't think the brain is an LLM in the way my mecha-calc was a (slow) calculator, but I also don't think we know enough about the brain to firmly put it many degrees away from an LLM. Both are in fact electrical signal processors with heavy statistical computation. I doubt you believe the brain is a trans-physical magic soul box.

replies(1): >>runarb+LV1
◧◩◪◨
20. michae+Jz1[view] [source] [discussion] 2025-12-06 14:19:35
>>menaer+s91
Does the ability to verbally detect gotchas in short conversations, dealing only with text on a screen or whiteboard, really map to stronger candidates?

In actual situations you have documentation, an editor, tooling, and tests, and you are a tad less distracted than when dealing with a job interview and all the attendant stress. Isn't the fact that he actually produces quality code in real life a stronger signal of quality?

◧◩◪◨⬒
21. runarb+LV1[view] [source] [discussion] 2025-12-06 17:17:51
>>Workac+gn1
But we do know how the brain works. We have extensively studied the brain; it is probably one of the most studied phenomena in our universe (well, barring alien science), and we do know it is not a computer but a neural network[1].

I don't believe the brain is a trans-physical magic soul box, nor do I think an LLM is doing anything similar to the brain (apart from some superficial similarities; some [like the artificial neural network] are in LLMs because they were inspired by the brain).

We use the term cognition to describe the intrinsic properties of the brain, and how it transforms stimuli into responses, and there are several fields of science dedicated to studying this cognition.

Just to be clear, you can describe the brain as a computer (a biological computer; totally distinct from digital, or even mechanical, computers), but that will only be an analogy. Or rather, you are describing the extrinsic properties of the brain, some of which it happens to share with some of our technology.

---

1: Note, not an artificial neural network, but an OG neural network. AI models were largely inspired by biological brains, and in some respects model them.

◧◩◪◨⬒⬓
22. Kiro+4y2[view] [source] [discussion] 2025-12-06 22:48:43
>>runarb+Ar
Artificial cognition has been an established term long before LLMs. You're conflating human cognition with cognition at large. Weather and cognition are both categories that contain many different things.
replies(1): >>runarb+uI2
◧◩◪◨⬒⬓⬔
23. runarb+uI2[view] [source] [discussion] 2025-12-07 00:13:50
>>Kiro+4y2
Yeah, I looked it up yesterday and saw that artificial cognition is a thing, though I must say I am not a fan and I certainly hope this term does not catch on. We are already knee deep in bad terminology because of artificial intelligence ("intelligence" already being extremely problematic in psychology even without the "artificial" qualifier) and machine learning (the latter being infinitely better but still not without issues).

If you can't tell, I take issue with terms being taken from psychology and applied to statistics. The terminology should flow in the other direction, from statistics into psychology.

My background is that I have done undergraduate studies in both psychology and statistics (though I dropped out of statistics after 2 years), and this is the first time I have heard about artificial cognition, so I don't think this term is popular; a short internet search seems to confirm that suspicion.

Out of context I would guess that artificial cognition relates to cognition the way artificial neural networks relate to neural networks, that is, that these are models which simulate the mechanisms of human cognition and recreate some stimulus → response loop. However, my internet search revealed (thankfully) that this is not how researchers are using this (IMO misguided) term.

https://psycnet.apa.org/record/2020-84784-001

https://arxiv.org/abs/1706.08606

What the researchers mean by the term (at least the ones I found in my short internet search) is not actual machine cognition, nor a claim that machines have cognition, but rather a research approach that takes experimental designs from cognitive psychology and applies them to learning models.
