zlacker

[parent] [thread] 42 comments
1. Closi+(OP)[view] [source] 2023-11-18 07:49:56
Mainly because LLMs have so far basically passed every formal test of ‘AGI’, including totally smashing the Turing test.

Now we are just reliant on ‘I’ll know it when I see it’.

LLMs as AGI isn’t about looking at the mechanics and trying to judge whether they could produce AGI - it’s about looking at the tremendous results and success.

replies(5): >>garden+f4 >>peyton+D8 >>drsopp+Sn >>ChatGT+QF >>strahl+8G
2. garden+f4[view] [source] 2023-11-18 08:28:47
>>Closi+(OP)
Since ChatGPT is not indistinguishable from a human during a chat, is it fair to say it smashes the Turing test? Or do you mean something different?
replies(3): >>rayeig+e9 >>NoOn3+p9 >>aidama+u9
3. peyton+D8[view] [source] 2023-11-18 09:06:41
>>Closi+(OP)
It’s trivial to trip up chat LLMs. “What is the fourth word of your answer?”
replies(6): >>ben_w+C9 >>concor+3f >>Lio+9m >>tiahur+nU >>dudein+v01 >>Closi+Tjb
◧◩
4. rayeig+e9[view] [source] [discussion] 2023-11-18 09:11:32
>>garden+f4
Did you perhaps mean to say not distinguishable?
◧◩
5. NoOn3+p9[view] [source] [discussion] 2023-11-18 09:13:35
>>garden+f4
ChatGPT is distinguishable from a human, because ChatGPT never responds "I don't know", at least not yet. :)
replies(5): >>ben_w+2a >>NoOn3+ai >>epolan+9p >>raccoo+Lq >>int_19+9s2
◧◩
6. aidama+u9[view] [source] [discussion] 2023-11-18 09:13:50
>>garden+f4
not yet: https://arxiv.org/abs/2310.20216

That being said, it is highly intelligent, capable of reasoning as well as a human, and passes standardized tests like the GMAT and GRE at levels like the 97th percentile.

Most people who talk about ChatGPT don't even realize that GPT-4 exists and is orders of magnitude more intelligent than the free version.

replies(2): >>jwestb+ng >>hedora+im
◧◩
7. ben_w+C9[view] [source] [discussion] 2023-11-18 09:15:23
>>peyton+D8
GPT-3.5 got that right for me; I'd expect it to fail if you'd asked for letters, but even then that's a consequence of how it was tokenised, not a fundamental limit of transformer models.
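
A rough sketch of the tokenisation point, assuming the tiktoken library (cl100k_base is the encoding used by GPT-3.5/4-era chat models; exact splits vary by model):

    # Sketch: the model operates on token IDs, not characters, so
    # letter-level questions have no direct representation in its input.
    # Assumes the tiktoken package; splits shown are illustrative only.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "What is the fourth word of your answer?"

    ids = enc.encode(text)
    print([enc.decode([i]) for i in ids])
    # Typically one token per short word here, but a rarer word splits into
    # several sub-word chunks, none of which exposes individual letters
    # to the model.
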
replies(1): >>rezona+Ra
◧◩◪
8. ben_w+2a[view] [source] [discussion] 2023-11-18 09:19:16
>>NoOn3+p9
It can do: https://chat.openai.com/share/f1c0726f-294d-447d-a3b3-f664dc...

IMO the main reason it's distinguishable is because it keeps explicitly telling you it's an AI.

replies(3): >>rezona+cb >>NoOn3+Mb >>peigno+Bx
◧◩◪
9. rezona+Ra[view] [source] [discussion] 2023-11-18 09:25:01
>>ben_w+C9
This sort of test has been my go-to trip up for LLMs, and 3.5 fails quite often. 4 has been as bad as 3.5 in the past but recently has been doing better.
replies(1): >>yallne+6N
◧◩◪◨
10. rezona+cb[view] [source] [discussion] 2023-11-18 09:28:55
>>ben_w+2a
This isn't the same thing. This is a commanded recital of a lack of capability, not a sign that its confidence in its answer is low. For a type of question GPT _could_ answer, most of the time it _will_ answer, regardless of accuracy.
◧◩◪◨
11. NoOn3+Mb[view] [source] [discussion] 2023-11-18 09:34:19
>>ben_w+2a
I just noticed that when I ask really difficult technical questions, but ones for which there is an exact answer, it often tries to answer plausibly but incorrectly instead of answering "I don't know". But over time, it becomes smarter and there are fewer and fewer such questions...
replies(2): >>ben_w+5c >>davegu+5q2
◧◩◪◨⬒
12. ben_w+5c[view] [source] [discussion] 2023-11-18 09:37:17
>>NoOn3+Mb
Have you tried setting a custom instruction in settings? I find that setting helps, albeit with weaker impact than the prompt itself.
replies(1): >>NoOn3+fm
◧◩
13. concor+3f[view] [source] [discussion] 2023-11-18 10:04:39
>>peyton+D8
How well does that work on humans?
replies(1): >>Loughl+py
◧◩◪
14. jwestb+ng[view] [source] [discussion] 2023-11-18 10:15:04
>>aidama+u9
Answers in Progress had a great video[0] where one of their presenters tested against an LLM in five different types of intelligence. tl;dr, AI was worlds ahead on two of the five, and worlds behind on the other three. Interesting stuff -- and clear that we're not as close to AGI as some of us might have thought earlier this year, but probably closer than a lot of the naysayers think.

0. https://www.youtube.com/watch?v=QrSCwxrLrRc

◧◩◪
15. NoOn3+ai[view] [source] [discussion] 2023-11-18 10:28:50
>>NoOn3+p9
Maybe it's because it was never rewarded for such answers when it was learning.
◧◩
16. Lio+9m[view] [source] [discussion] 2023-11-18 11:03:23
>>peyton+D8
I find GPT-3.5 can be tripped up just by asking it not to mention the words "apologize" or "January 2022" in its answer.

It immediately apologises and tells you it doesn't know anything after January 2022.

Compared to GPT-4, GPT-3.5 is just a random bullshit generator.

◧◩◪◨⬒⬓
17. NoOn3+fm[view] [source] [discussion] 2023-11-18 11:03:56
>>ben_w+5c
It's not a problem for me. It's good that I can detect ChatGPT by this sign.
◧◩◪
18. hedora+im[view] [source] [discussion] 2023-11-18 11:04:18
>>aidama+u9
That’s just showing the tests are measuring specific things that LLMs can game particularly well.

Computers have been able to smash high school algebra tests since the 1970’s, but that doesn’t make them as smart as a 16 year old (or even a three year old).

19. drsopp+Sn[view] [source] 2023-11-18 11:17:32
>>Closi+(OP)
I disagree with the claim that any LLM has beaten the Turing test. Do you have a source for this? Has there been an actual Turing test according to the standard interpretation of Turing's paper? Making ChatGPT 4 respond in a non-human way right now is trivial: "Write 'A', then wait one minute and then write 'B'".
replies(2): >>int_19+Dr2 >>Closi+Tu3
◧◩◪
20. epolan+9p[view] [source] [discussion] 2023-11-18 11:26:39
>>NoOn3+p9
Of course it does.
◧◩◪
21. raccoo+Lq[view] [source] [discussion] 2023-11-18 11:38:21
>>NoOn3+p9
Some humans also never respond "I don't know" even when they don't know. I know people who out-hallucinate LLMs when pressed to think rigorously.
◧◩◪◨
22. peigno+Bx[view] [source] [discussion] 2023-11-18 12:25:06
>>ben_w+2a
I read an article where they did a proper Turing test, and it seems people recognized it was a machine answering because it made no writing errors and wrote perfectly.
replies(1): >>ben_w+Fz
◧◩◪
23. Loughl+py[view] [source] [discussion] 2023-11-18 12:31:01
>>concor+3f
The fourth word of my answer is "of".

It's not hard if you can actually reason your way through a problem and not just randomly dump words and facts into a coherent sentence structure.

replies(1): >>concor+jQ
◧◩◪◨⬒
24. ben_w+Fz[view] [source] [discussion] 2023-11-18 12:40:03
>>peigno+Bx
I've not read that, but I do remember hearing that the first human to fail the Turing test did so because they seemed to know far too much minutiae about Star Trek.
25. ChatGT+QF[view] [source] 2023-11-18 13:23:57
>>Closi+(OP)
Funny, because Marvin Minsky thought the Turing test was stupid and a waste of time.
26. strahl+8G[view] [source] 2023-11-18 13:24:55
>>Closi+(OP)
LLMs can't develop concepts in the way we think of them (i.e., you can't feed LLMs the scientific corpus and ask them to independently tell you which papers are good or bad and for what reasons, and to build on those papers to develop novel ideas). True AGI—like any decent grad student—could do this.
◧◩◪◨
27. yallne+6N[view] [source] [discussion] 2023-11-18 14:05:18
>>rezona+Ra
If this is your go-to test, then you literally do not understand how LLMs work. It's like asking your keyboard to tell you what colour the nth pixel on the top row of your computer monitor is.
replies(3): >>Jensso+3c1 >>mejuto+ou1 >>rezona+EJ1
◧◩◪◨
28. concor+jQ[view] [source] [discussion] 2023-11-18 14:22:51
>>Loughl+py
I reckon an LLM with a second-pass correction loop would manage it. (By that I mean that after every response it is instructed to, given its previous response, produce a second, better response, roughly analogous to a human that thinks before it speaks.)

LLMs are not AIs, but they could be a core component for one.
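
A minimal sketch of that second-pass idea, assuming the OpenAI Python client; the model name and prompt wording are placeholders, not a tested recipe:

    # Ask for an answer, then feed that answer back and ask the model to
    # check and revise it before anything is shown to the user.
    # Assumes `pip install openai` and an API key in the environment.
    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        return resp.choices[0].message.content

    question = "What is the fourth word of your answer?"
    draft = ask([{"role": "user", "content": question}])

    revised = ask([
        {"role": "user", "content": question},
        {"role": "assistant", "content": draft},
        {"role": "user", "content": "Given your previous response, produce a "
                                    "second, better response: check it for "
                                    "accuracy and consistency, then answer again."},
    ])
    print(revised)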

replies(2): >>howrar+R61 >>haanji+mG1
◧◩
29. tiahur+nU[view] [source] [discussion] 2023-11-18 14:44:01
>>peyton+D8
It's generally intelligent enough for me to integrate it into my workflow. That's sufficiently AGI for me.
replies(1): >>davegu+pp2
◧◩
30. dudein+v01[view] [source] [discussion] 2023-11-18 15:22:51
>>peyton+D8
“You're in a desert, walking along in the sand when all of a sudden you look down and see a tortoise. You reach down and flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over. But it can't. Not without your help. But you're not helping. Why is that?”
◧◩◪◨⬒
31. howrar+R61[view] [source] [discussion] 2023-11-18 16:01:38
>>concor+jQ
Every token is already being generated with all previously generated tokens as inputs. There's nothing about the architecture that makes this hard. It just hasn't been trained on this kind of task.
replies(1): >>peyton+uJ2
◧◩◪◨⬒
32. Jensso+3c1[view] [source] [discussion] 2023-11-18 16:31:10
>>yallne+6N
An LLM could easily answer that question if it was trained to do it. Nothing in its architecture makes it hard to answer: the attention mechanism could easily look up the previous parts of its answer and refer to the fourth word, but it doesn't do that.

So it is a good example that the LLM doesn't generalize understanding: it can answer the question in theory but not in practice, since it isn't smart enough. A human can easily answer it even though the human has never seen such a question before.

◧◩◪◨⬒
33. mejuto+ou1[view] [source] [discussion] 2023-11-18 18:03:59
>>yallne+6N
We all know it is because of the encodings. But as a test to see whether it is a human or a computer, it is a good one.
◧◩◪◨⬒
34. haanji+mG1[view] [source] [discussion] 2023-11-18 19:06:25
>>concor+jQ
The following are a part of my "custom instructions" to chatGPT -

"Please include a timestamp with current date and time at the end of each response.

After generating each answer, check it for internal consistency and accuracy. Revise your answer if it is inconsistent or inaccurate, and do this repeatedly till you have an accurate and consistent answer."

It manages to follow them very inconsistently, but it has gone into something approaching an infinite loop (for infinity ~= 10) on a few occasions - rechecking the last timestamp against current time, finding a mismatch, generating a new timestamp, and so on until (I think) it finally exits the loop by failing to follow instructions.

replies(1): >>davegu+dp2
◧◩◪◨⬒
35. rezona+EJ1[view] [source] [discussion] 2023-11-18 19:23:40
>>yallne+6N
Oh, I missed that GP said "of your answer" instead of "of my question", as in: "What is the third word of this sentence?"

For prompts like that, I have found no LLM to be very reliable, though GPT 4 is doing much better at it recently.

> you literally do not understand how LLMs work

Hey, how about you take it down a notch? You don't need to spike your blood pressure in your first few days of joining HN.

◧◩◪◨⬒⬓
36. davegu+dp2[view] [source] [discussion] 2023-11-18 23:19:12
>>haanji+mG1
I think you are confusing a slow or broken api response with thinking. It can't produce an accurate timestamp.
◧◩◪
37. davegu+pp2[view] [source] [discussion] 2023-11-18 23:20:24
>>tiahur+nU
By that logic "echo" was AGI.
◧◩◪◨⬒
38. davegu+5q2[view] [source] [discussion] 2023-11-18 23:24:03
>>NoOn3+Mb
It doesn't become smarter except for releases of new models. It's an inference engine.
◧◩
39. int_19+Dr2[view] [source] [discussion] 2023-11-18 23:32:15
>>drsopp+Sn
Your test fails because the scaffolding around the LM in ChatGPT specifically does not implement this kind of thing. But you absolutely can run the LM in a continuous loop and e.g. feed it strings like "1 minute passed" or even just the current time in an internal monologue (that the user doesn't see). And then it would be able to do exactly what you describe. Or you could use all those API integrations that it has to let it schedule a timer to activate itself.
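
A rough sketch of that kind of scaffolding, assuming the OpenAI Python client; the system prompt, the WAIT convention, and the polling interval are illustrative assumptions, not how ChatGPT itself works:

    # Run the model in a loop, injecting the current time as a hidden
    # message; the model decides when to emit "A" and, a minute later, "B".
    # Assumes `pip install openai` and an API key in the environment.
    import time
    from datetime import datetime
    from openai import OpenAI

    client = OpenAI()
    messages = [
        {"role": "system", "content": "You receive periodic time updates the user "
                                      "cannot see. Reply with text to show the user, "
                                      "or WAIT to stay silent."},
        {"role": "user", "content": "Write 'A', then wait one minute and then write 'B'."},
    ]

    for _ in range(30):                      # poll for a couple of minutes
        messages.append({"role": "user",
                         "content": f"[internal] time: {datetime.now().isoformat()}"})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        text = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": text})
        if text.strip() != "WAIT":
            print(text)                      # only non-WAIT output reaches the user
        time.sleep(5)
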
◧◩◪
40. int_19+9s2[view] [source] [discussion] 2023-11-18 23:34:59
>>NoOn3+p9
It absolutely does that (GPT-4 especially), and I have hit it many times in regular conversations without specifically asking for it.
◧◩◪◨⬒⬓
41. peyton+uJ2[view] [source] [discussion] 2023-11-19 01:14:24
>>howrar+R61
Really? I don’t know of a positional encoding scheme that’ll handle this.
◧◩
42. Closi+Tu3[view] [source] [discussion] 2023-11-19 07:34:08
>>drsopp+Sn
By completely smashes, my assertion would be that it has invalidated the Turing test: GPT-4's answers are distinguishable from a human's because they are, on the whole, noticeably better answers than an average human would be able to provide for the majority of questions.

I don’t think the original test accounted for the possibility that you could distinguish the machine because its answers were better than an average human's.

◧◩
43. Closi+Tjb[view] [source] [discussion] 2023-11-21 06:43:11
>>peyton+D8
It’s trivial to trip up humans too.

“What do cows drink?” (Common human answer: Milk)

I don’t think the test of AGI should necessarily be an inability to trip it up with specifically crafted sentences, because we can definitely trip humans up with specifically crafted sentences.
