Basically, if humans have had meaningful discussions about a topic, the product of their reasoning is there in the training data for the LLM, right?
Seems to me, the "how many R's are there in the word strawberry?" problem is very suggestive of the idea that LLM systems cannot reason. If they could, the question would not be difficult.
The fact is, humans may never have actually discussed that topic in any meaningful way that got captured in the training data.
And because of that, and because of how specific the question is, the LLM has no clear relationships to map into a response. It just does its best case: whatever the math deems most probable.
That seems plausible enough to support the opinion that LLMs cannot reason.
What we do know is that LLMs can work with anything expressed in terms of relationships between words.
There are a ton of reasoning templates contained in that data.
Put another way:
Maybe LLM systems are poor at deduction, save for the examples contained in the data. But there are a ton of examples!
So that weakness is hard to notice.
Maybe LLM systems are fantastic at inference! And so those many examples get mapped onto the prompt at hand very well.
And we do notice that, and see it as real thinking, not just some horribly complex surface containing a bazillion relationships...
Other examples exist.
[0] That example is due to tokenization. D'oh! I knew better, too.
Ah well.
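For what it's worth, here's a minimal sketch of what I mean, assuming OpenAI's tiktoken package and the cl100k_base encoding (both are just my choices for illustration; the exact splits depend on whatever tokenizer the model actually uses):

    # Sketch: a BPE tokenizer hands the model chunks, not letters,
    # so "count the R's" has no direct representation in its input.
    # Assumes tiktoken is installed (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word = "strawberry"
    token_ids = enc.encode(word)

    # Show the pieces the model actually operates on.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")

    # Counting characters over the raw string is trivial in ordinary code.
    print(word.count("r"))  # 3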
I think it's the counting parts of problems that current models are shaky with, and I imagine that's a training-data problem.