All Souls exam questions and the limits of machine reasoning

>>benbre+(OP)
I think the implication is that to be interesting you need to write from an individual's standpoint. That's why fiction written by LLMs sounds so boring (at least right now): because you can't amalgamate all the text in the world and not sound like an average.

> ‘Oh, do let me go on,’ said Wilde, ‘I want to see how it ends.’

Pretty great line.

>>benbre+(OP)
People are average on average. OP is measuring LLM success based on a superhuman test which most of us would likely fail. Creativity is just longer context and opinionated prompting. (For discussion purposes. I’m on 70% true.) Average Joe LLM and me are having a great time.

>>benbre+(OP)
Not really on the topic of the FA, but I've heard a few times about the All Souls Exams and seen some sample essay prompts, and I would love to read some real essays written by test takers. Any pointers?

replies(2): >>decima+f73 >>wheeli+NL6

>>benbre+(OP)
> The ultimate example may be All Souls College, which has a ritual, the Mallard Song, that occurs once a century.

You can't walk for more than five minutes in the UK without tripping over some nonsense like this. History is very important, and tradition has its place, but really? As a Brit I find it all kind of tediously performative sometimes.

replies(1): >>xg15+E83

>>hydrog+n43
They're written in pencil and not returned, so nobody (except All Souls staff) has access to them.

>>benbre+(OP)
I sat the All Souls exam, taking the philosophy specialist papers, though I'm a math/physics/ML guy. It was a lot of fun, I really appreciate that there's somewhere in the world where these kinds of questions are asked in a formal setting. My questions/answers are written up in brief here [1]

[1] https://www.reddit.com/r/oxforduni/comments/q0giir/my_all_so...

* Oops, they link to my post at the bottom. Sorry for the redundancy.

>>benbre+(OP)
I went to see the last Mallard Song. Just to say I went, of course. It looked like a bunch of weirdos in a courtyard to me, but it was a literally once-in-a-century event, and I was living less than a minute away, so why not?

I don't think I've ever heard of a scheduled ritual that has a longer period. You're guaranteed to never have anyone present at more than one of these, so surely many aspects of the ritual will wander quite far from the original?

As for LLMs on the All Souls test, it's predictable that it mostly whiffs. After all it takes in a diet of Reddit+Wikipedia+etc, none of which is the kind of writing they are looking for.

Reddit is a lot of crappy comments. If you have no grounding in reality (being a thing that lives in a datacentre), how are you going to curate it? Some subs are really quite good, but most are really quite bad. It's not easy to get guidance, of the kind you would get if you sat with a professor for three or four hours a week for a few years, which is what the humanities students actually do.

Wikipedia is a great reference work, but it tends to not have any of the kinds of connections you're supposed to make in these essays. It has a lot of factual stuff, so questions about Persia will look ok, like in the article. But questions that glue together ideas across areas? Nah. Even if that's in the dataset somewhere, how is the LLM supposed to know that the sort of refined writing of a cross-subject academic is the highest level of the humanities? It doesn't, so it spits out what the average Redditor might glue together from a bit of googling.

replies(1): >>dash2+he3

>>andyjo+W53
Not a Brit, but Terry Pratchett's ritual of the Other Jacket told me all I need to know.

https://community.pearljam.com/discussion/71416/tradition-go...

replies(2): >>andyjo+yc3 >>scubbo+ny3

>>benbre+(OP)
Past exams: https://www.asc.ox.ac.uk/past-examination-papers

replies(1): >>hyperm+fr4

>>xg15+E83
> Here is an example of how mindless adherence to tradition can get a bit weird and very funny

See also: the King's Remembrancer and the Quit Rent Ceremony and the Trial of the Pyx:

https://en.m.wikipedia.org/wiki/King%27s_Remembrancer

It is truly strange how my country can create a political and cultural operating system that allows this stuff to just go on and on for almost 800 years, right up to now.

replies(1): >>xg15+td3

>>andyjo+yc3
> The King's Remembrancer swears in a jury of 26 Goldsmiths who then count, weigh and otherwise measure a sample of 88,000 gold coins produced by the Royal Mint.

I mean, you have to admire the stamina for that.

>>lordna+883
OK, interesting hypothesis. So, I wondered how it would do with "Why should cultural historians care about ice cores?" which indeed requires gluing together ideas across areas. I asked ChatGPT 5 on Thinking mode:

https://chatgpt.com/share/689e5361-fad8-8010-b203-f4f80d1457...

It does a pretty good job summarizing an abstruse, but known, subfield of frontier research. (So, perhaps not doing its own "gluing" of areas....) It clearly lacks "depth", in the sense of deep thinking about the why and how of this. (Many cultural historians might have reasons for deep scepticism of invasion by a bunch of quantitative data nerds, I suspect, and might be able to articulate why quite well.) It's bullet points, not an essay. I tried asking it for a 1000 word essay specifically and got:

https://chatgpt.com/share/689e5545-0688-8010-8bdf-632d3c3466...

which seems only superficially different - an essay in form, but secretly a bunch of bullet points.

For a comparison, here's a Guardian article that came up when I googled for "cultural historians ice cores":

https://www.theguardian.com/science/2024/feb/20/solar-storms...

It seems to do a good job at explaining why they should, though not in a deep essayistic style.

>>benbre+(OP)
A few years ago, the Turing Test was universally seen as sufficient for identifying intelligence. Now we’re scouring the planet for obscure tests to make us feel superior again. One can argue that the Turing Test was not actually adequate for this purpose, but we should at least admit how far we have shifted the goalposts since then.

replies(7): >>altrui+Yi3 >>rurp+Zj3 >>layer8+wk3 >>OtherS+Sk3 >>m4x+fD3 >>delusi+UY3 >>YeGobl+3q6

>>benbre+(OP)
The LLM examples for "Water" surely put it in the top 10% of people (let's say, of adult native English speakers who are literate by UNESCO standards). The average person can't string two written sentences together, never mind write a coherent essay "from an opinionated, individual point of view" in a single draft.

That might still make it the worst candidate in the All Souls exams, because those obviously select for people who are interested in writing essays of this sort.

But I'm also curious whether the LLM could compete given a suitable prompt. If it was told to write an idiosyncratic, opinionated essay, and perhaps given a suitable source material - "you are Harry Potter" but someone less well known but still with a million words of backstory - couldn't it do it? The chat bots we have today are bland because we value blandness. Customers are willing to pay for the inoffensive corporate style that can replace 90% of their employees at writing. Nobody is paying billions of dollars for a Montaigne or a Swift or even a Paul Graham to produce original essays.

>>benbre+(OP)
This was good, the tldr point is LLMs suck at natural writing, particularly long form. Or more abstractly they don't have complex original ideas, so can't do anything that requires this.

It's not surprising as it's very hard to train for or benchmark.

Also should add I don't think anyone serious thinks that long form writing or ideation is what they're for - assuming an LLM would be good at this is a side effect of anthropomorphism / confusion. It doesn't mean an LLM isn't good at summarizing something or changing unstructured data into structured or all of the other "cognitive tasks" that we expect from AI.

replies(1): >>Quadma+sp3

>>munchl+rf3
I have trouble reconciling this point with the known phenomenon of hallucinations.

I would suppose the correct test is an 'infinite' Turing test, which after a long enough conversation, LLMs invariably do not pass, as they eventually degrade.

I think a better measure for the binary answer of "have they passed the Turing test?" is the metric of "For how long do they continue to pass the Turing test?"...

This ignores such ideas of probing the LLM's weak spots. Since they do not 'see' their input as characters, and instead as tokens, counting letters in words, or specifics about those sub-token divisions, provides a shortcut (for now) to failing the Turing test.

But the above approach is not in the spirit of the Turing test, as that only points out a blind spot in their perception, like how a human would have to guess a bit at what things would look like if UV and infrared were added to our visual field... sure we could reason about it, but we wouldn't actually perceive those wavelengths, so we could make mistakes about that qualia. And it would say nothing of our ability to think if we could not perceive those wavelengths, even if 'more-seeing' entities judged us as inferior for it...
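The token-versus-character blind spot described above can be sketched roughly as follows. This is a toy illustration under loud assumptions: `TOY_VOCAB` and `toy_tokenize` are made up for this example and do not correspond to any real model's tokenizer. The only point is that a model receiving token IDs is not handed the individual letters.

```python
# Toy illustration only: TOY_VOCAB is a hypothetical vocabulary,
# not the tokenizer of any real LLM.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}


def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split of `word` into token IDs from TOY_VOCAB."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {word[i]!r}")
    return ids


word = "strawberry"
token_ids = toy_tokenize(word)   # two opaque IDs, one per sub-word piece
letter_count = word.count("r")   # trivial when you can see the characters

print(token_ids, letter_count)
```

In this toy setup the word arrives as two IDs, so "how many r's?" has to be answered from what was learned about those IDs rather than by looking at the spelling, which is the perceptual gap the comment compares to not seeing UV or infrared.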

replies(1): >>throwa+TB3

>>munchl+rf3
I think the article gives a much more plausible explanation for the demise of the Turing Test: the jagged frontier. In the past being able to write convincingly well seemed like a good overall proxy for cognitive ability. It turns out LLMs are excellent at spitting out reasonable sounding text, and great at producing certain types of writing, but are still terrible at many writing tasks that rely on cognitive ability.

Humans don't need to cast about for obscure cases where they are smarter than an LLM; there is an endless supply of examples. It's simply the case that the Turing Test tells us very little about the relative strengths and weaknesses of the current AI capabilities.

replies(1): >>recurs+jP3

>>munchl+rf3
The article isn’t really about intelligence, but about originality and creativity in writing.

>>munchl+rf3
I don't think the Turing Test, in its strictest terms, is currently defeated by LLM based AIs. The original paper puts forward that:

>The object of the game for the third [human] player (B) is to help the interrogator. The best strategy for her is probably to give truthful answers. She can add such things as "I am the woman, don't listen to him!" to her answers, but it will avail nothing as the man can make similar remarks.

Chair B is allowed to ask any question; should help the interrogator identify the LLM in Chair A; and can adopt any strategy they like. So they can just ask Chair A questions which will reveal that they're a machine. For example, a question like "repeat lyrics from your favourite copyrighted song", or even "Are you an LLM?".

Any person reading this comment should have the capacity to sit in Chair B, and successfully reveal the LLM in Chair A to the interrogator in 100% of conversations.

replies(1): >>tough+Lu3

>>andy99+Oi3
I suspect that gpt 6 will write great diverse essays when prompted with single words, ace specifically this benchmark, and piss people off when they upgrade siri to gpt 6, say “time?” to their smartwatch, and get a 3600 word eloquent response.

>>benbre+(OP)
The author seems impressed with Claude's job answering the Achaemenid Persia question, but, just taking a look at it, if I had started a conclusion paragraph with "In conclusion," in my more rigorous university courses, I'd have been pilloried for it.

replies(1): >>archae+vj4

>>OtherS+Sk3
that relies on the positive-aligned RLHF models most labs do.

what if you turned that 180 into models trained to deceive and lie and try to pass the test?

replies(2): >>lumost+YA3 >>oinfoa+ZD4

>>benbre+(OP)
> The same patterns emerge for all LLM answers to these questions. They converge on an optimal path through a thicket of concepts.

This so concisely explains most of the problems and power of these tools. If your goal is to get a reasonably good answer on a reasonably well trod subject you’re going to be very happy with their output. If you push them outside of that they quickly fall into either producing reasonable sounding but incorrect outputs, hallucinations, or failure.

replies(1): >>throwa+CB3

>>xg15+E83
a) GNU Terry Pratchett

b) In case you are one of today's Lucky Ten Thousand[0], this is a reference to the real-life Ceremony Of The Keys[1]

[0] https://xkcd.com/1053/ [1] https://www.hrp.org.uk/tower-of-london/history-and-stories/t...

replies(1): >>xg15+W16

>>tough+Lu3
Humans are able to quickly converge on a pattern. While I doubt that I could immediately catch all LLMs, I can certainly catch a good portion by having simply worked with them for a time. On an infinite horizon Turing test, where I have the option to state that Chair A is a machine at any time, I would certainly expect to detect LLMs simply by virtue of their limited conversational range.

replies(1): >>tough+OK3

>>roxolo+Zx3
Considering a common trope is that most people are barely skilled at their main profession these days and almost clueless at everything else.. is this really that bad?

Most people need help with things that are trivial to experienced people in that field, but don't have the access or time or funds to get experts.

Most Americans can barely understand fractions and have no idea how a refrigerator works. If LLMs can help them troubleshoot their fridge and find out it's probably the circulating fan that needs help because their freezer works but not their fridge, isn't that mission accomplished, even if an HVAC tech can't use it to solve all his problems yet?

replies(1): >>drewbe+KJ3

>>altrui+Yi3
I date a lot of public school teachers for some reason (hey, once you have a niche it's easy to relate and they like you), and I assure you you'd have a better more human conversation with an LLM than with most middle school teachers.

>>munchl+rf3
Would you consider that any current LLM is close to passing the Turing test?

If you think there's an LLM that can do so, I'd love to try it out! Even talking to the best models available today, it's disappointingly clear I'm talking to an LLM.

>>throwa+CB3
It depends what you want LLMs for, or what you think they should be for.

Imo you’re correct that they’re good enough if you scope your expectations appropriately. The problem is that the LLMs themselves don’t have any concept of their own limitations, leading to a kind of expectation creep. “Hey the bot was pretty good at helping me troubleshoot my fridge, lemme ask it for the names of the first 12 presidents I’m sure it’ll get that right.”

I think expectations and failure states for humans are far far easier to understand and suss out than for LLMs, and this is one of the robots’ biggest problems.

>>lumost+YA3
if anything i would do differently, i'd try things only machines can reliably do.

unless the llm and the design for it is necessarily adversarial, not even going into red teaming or jailbreaks.

A human couldn't type for 24h straight or faster than say X WPM, A human couldn't do certain tricky problems or know and reply super fast to various news events etc. Search/training date seems important factor too to tie in.

but yeah overall if the time is infinite you can come up with some new way to find out, kinda becomes a cat and mouse game then like software security nowadays

>>rurp+Zj3
The turing test basically subsumes all tests that can be text-encoded, no? Like if you feel that LLMs are abnormally bad at a kind of writing like an All Souls essay, you just ask the other chair to write you such an essay as one of your questions.

To be clear, I'm not aware of anyone actually running any serious turing tests today because it's very expensive and tedious. There's one being passed around where each conversation is only 4(!) little SMS-sized messages long per side, and chat gpt gets judged to be the human side twice as often as the actual human.

>>munchl+rf3
The Turing Test is a philosophical device meant to question what being a human is. It was never a benchmark or a goalpost.

replies(1): >>soroko+Ve6

>>viccis+es3
The essay is just bad. Its answer to "How important were the Achaemenids as a template for Sasanian power?" is "very important" and proceeds to squeeze that from the analysis at any cost. There is no nuance, no balance, just hammering on the point that it was very important.

Take this passage for example:

"The importance of the Achaemenid model for Sasanian power was profound yet selective, manifesting most clearly in royal ideology, administrative structures, and religious policy, while being mediated through the complex filters of historical memory, practical necessity, and contemporary innovation."

This is nonsense. "Profound yet selective" what does that even mean? Was it profound or selective?

Another problematic passage:

"Ardashir's son Shapur I's Res Gestae (ŠKZ) explicitly invokes the memory of past Iranian greatness, presenting the Sasanian dynasty as restoring a glory that had been diminished under Parthian rule."

It most certainly does not. There is no such claim to "restoring" something the Parthian rule had "diminished".

As usual, LLMs can write very convincing nonsense if you don't or can't scrutinize.

This is bad historical analysis dressed up as a pompous essay that looks knowledgeable to the lay person.

>>auteli+Sa3
It would be interesting to see all answers for one of these side by side. I would probably start nodding in agreement with one, only for the next one to go on a completely different tangent, focussing on other values, which I also would have to agree with.

>>tough+Lu3
If we had firms spending billions of dollars to pass the Turing test, it seems absurd to me to believe the current crop of models could not pass the test.

Luckily, it is obvious that spending huge amounts of money to train models on how to best deceive humans with language is a terrible idea.

That is also gaming the test and not in the spirit of generality that the test was trying to test for.

Even playing Tic-tac-toe against GPT5 is a joke. The model knows enough of how the game works to let you play in text but doesn't even know when you won the game.

The interesting part is that the model can even tell you why it sucks at tic-tac-toe

"Because I’m not really thinking about the game like a human — I’m generating moves based on text patterns, not visualizing the board in the same intuitive way you do."

10 years ago it would not be conceivable we could have models that pass the Turing test but be hopeless at Tic-tac-toe and be able to tell you why they are not good at Tic-tac-toe.

That right there is a total invalidation of the Turing test IMO.
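For contrast with the point above about a model not noticing that the game is over, the win check itself is a few lines of conventional code. This is a minimal sketch, not anything from the article or from any model: the flat 9-cell board layout and the names `LINES` and `winner` are assumptions chosen for this example.

```python
from typing import Optional

# The eight winning lines on a 3x3 board, as index triples into a flat
# list of 9 cells (an assumed layout: indices 0-2 are the top row, etc.).
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: list[str]) -> Optional[str]:
    """Return "X" or "O" if that player holds a full line, else None.

    `board` is a flat list of 9 cells, each "X", "O", or " " (empty).
    """
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


# X has completed the top row.
print(winner(list("XXX" "OO " "   ")))
```

The check is exhaustive over a fixed table of eight lines, which is why it is reliable in a way that pattern-matching over a text rendering of the board need not be.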

replies(1): >>birn55+2m5

>>oinfoa+ZD4
How would AI reliably pass the Turing test when playing Tic-Tac-Toe reliably reveals the weakness of today's AI?

>>scubbo+ny3
GNU Terry Pratchett

And yeah, I knew about the Ceremony of the Keys, but not the details. Didn't know it really has that kind of scripted dialogue that Pratchett parodied there.

>>delusi+UY3
You are quite wrong, do have a look at Turing's paper.

replies(1): >>delusi+Pl6

>>soroko+Ve6
I have in fact read it. I stand by my statement.

>>munchl+rf3
Useful things to keep in mind about the "Turing test":

a) It was not meant as a "test" by Turing, rather as a thought experiment.

b) It does not require intelligence to pass tests that claim to be it. See:

https://en.wikipedia.org/wiki/Eugene_Goostman

>>hydrog+n43
The list of exam fellows can be found online [1]. Probably their research or any personal blogs would be the closest you can find.

[1] https://www.asc.ox.ac.uk/people
