
1. thomas+ (OP) | 2025-05-23 22:30:25
That sounds to me more like evidence that an LLM is never reasoning at all, even when it looks like it is.

The mock conversation written between think tags is not a conversation. It's the sequence of tokens judged most likely to follow the prompt by a model that was trained on example conversations.

Why is that different? In a real conversation, participants use logic to choose what is worth saying next: the next statement has already been judged logically sound in the speaker's mind before it is spoken. In a mock conversation (the LLM's CoT), there is no such step. The next statement is only judged statistically familiar, then written immediately.
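
To make that concrete, here is a minimal sketch of the loop that produces the text between think tags, assuming the Hugging Face transformers library and a placeholder model name (any causal language model behaves the same way): the only decision made at each step is which token scores highest, and nothing ever checks whether the text written so far is logically sound.

    # Minimal sketch of next-token generation. Assumes the Hugging Face
    # `transformers` library; the model name is a hypothetical placeholder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-open-chat-model"  # placeholder, not a real checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "<think>\nThe user asked whether 17 is prime."
    ids = tokenizer(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(50):
            logits = model(ids).logits[:, -1, :]    # scores for every candidate next token
            next_id = torch.argmax(logits, dim=-1)  # keep whichever continuation is most familiar
            ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)
            # Nothing here checks whether the text so far is logically sound;
            # the only criterion at every step is likelihood.

    print(tokenizer.decode(ids[0]))

Real deployments sample from the distribution instead of taking the argmax, but the point stands: the loop optimizes familiarity, not validity.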

The end result of a desirable CoT interaction is text that could have been written by a thoughtful, logical conversationalist. Whether the mock conversation is actually logically consistent with the mock conclusion is irrelevant, because the LLM is only concerned with how familiar that conclusion is given the prompt, the mock conversation, and its training data.

The overall vibe of how something is written acts as a replacement for actual logic. Logical deduction is replaced with stylistic markers of confidence, conversational turns, and so on. It all works out in the end because we are so consistent in the style we use when writing real logical deductions that we have ended up providing an invisible semantics for the LLM to follow.

There is something meaningful in that invisible semantics that we are entirely blind to. Unfortunately, it doesn't follow rules the way logic does, so it's not a trustworthy replacement for logic. Fortunately, it's useful for more general exploration.
