zlacker

[return to "The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]"]
1. bicepj+RF[view] [source] 2025-06-06 23:46:11
>>amrrs+(OP)
The study challenges the assumption that more “thinking”, i.e. longer reasoning traces, necessarily leads to better problem-solving in LRMs.
2. bayind+oI[view] [source] 2025-06-07 00:14:00
>>bicepj+RF
As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to decode a single BASE64 string.

Flash answered correctly in ~2 seconds, at most. Pro answered very wrongly after thinking and elaborating for ~5 minutes.

Flash had also given a wrong answer for the same string in the past, but it has since improved.

The prompt was the same in both cases: "Hey, can you decode $BASE64_string?"
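(If anyone wants to reproduce this, the ground truth is one standard-library call; the string below is a placeholder, not the one I actually tested with.)

```python
import base64

# Placeholder input; substitute your own $BASE64_string here.
encoded = "SGVsbG8sIHdvcmxkIQ=="

# Ground-truth decode to compare the models' answers against.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # Hello, world!
```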

I have no further comments.

3. rafter+KX[view] [source] 2025-06-07 03:44:55
>>bayind+oI
Well, that's not a very convincing argument. That's just a failure to recognize when a tool (a Base64 decoder) is needed, not a reasoning problem at all, right?
4. Jensso+Al1[view] [source] 2025-06-07 10:36:45
>>rafter+KX
Translating to Base64 is a good test of how well a model works as a language translator without changing things, because it's the same skill for an AI model.

If the model changes things, it means it didn't really capture the translation patterns for Base64, so who knows what it will miss when translating between natural languages if it can't even do Base64?
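To be clear about why any deviation counts as failure: Base64 is a fixed, lossless mapping, so a faithful "translation" must round-trip exactly. A minimal check (the sample text is just illustrative):

```python
import base64

text = "Translation should be lossless."

# Encode, then decode: a correct Base64 "translator" must
# reproduce the original byte-for-byte.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
assert base64.b64decode(encoded).decode("utf-8") == text
print(encoded)
```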

[go to top]