zlacker

Well, "reasoning" refers to chain-of-thought, and if you look at the generated prompts it's not hard to see why it's called that.

That said, it's fascinating to me that it works (and empirically, it does work; a reasoning model generating tens of thousands of tokens while working out the problem does produce better results). I wish I knew why. A priori I wouldn't have expected it, since there's no new input. That means it's all "in there" in the weights already. I don't see why it couldn't just one shot it without all the reasoning. And maybe the future will bring us more distilled models that can do that, or they can tease out all that reasoning with more generated training data, so that what is currently dispersed around the weights and has to be routed through the prompt becomes more immediately accessible in the weights. But for now "reasoning" works.

But then, at the back of my mind is the easy answer: maybe you can't optimize it. Maybe the model has to "reason" to "organize its thoughts" and get the best results. After all, if you give me a complicated problem I'll write down hypotheses and outline approaches and double check results for consistency and all that. But now we're getting dangerously close to the "anthropomorphization" that this article is lamenting.

>>losved+(OP)
CoT gives the model more time to think and process the inputs it has. To give an extreme example, suppose you are using next token prediction to answer 'Is P==NP?' The tiny number of input tokens means that there's a tiny amount of compute to dedicate to producing an answer. A scratchpad allows us to break free of the short-inputs problem.
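A back-of-the-envelope sketch of the "more tokens, more compute" point. It assumes the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per token and ignores attention cost entirely; the 7B parameter count and the token counts are made-up numbers for illustration, not measurements of any real model.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Rough forward-pass cost: ~2 FLOPs per parameter per token (rule of thumb)."""
    return 2 * n_params * n_tokens


# Hypothetical numbers, chosen only to show the scale of the difference.
N_PARAMS = 7_000_000_000      # assumed 7B-parameter model
PROMPT_TOKENS = 6             # a short question such as 'Is P==NP?'
SCRATCHPAD_TOKENS = 10_000    # assumed length of a chain-of-thought trace

# Compute spent before the answer when the model must reply immediately.
direct = forward_flops(N_PARAMS, PROMPT_TOKENS)

# Compute spent before the answer when a scratchpad is generated first.
with_cot = forward_flops(N_PARAMS, PROMPT_TOKENS + SCRATCHPAD_TOKENS)

ratio = with_cot / direct
print(f"direct: {direct:.3e} FLOPs, with scratchpad: {with_cot:.3e} FLOPs, ratio: {ratio:.0f}x")
```

Under these assumptions the scratchpad multiplies the compute available before the final answer by roughly the ratio of total tokens to prompt tokens. Whether that extra compute is used well is a separate question.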

Meanwhile, things can happen in the latent representation which aren't reflected in the intermediate outputs. You could, instead of using CoT, say "Write a recipe for a vegetarian chile, along with a lengthy biographical story relating to the recipe. Afterwards, I will ask you again about my original question." And the latents can still help model the primary problem, yielding a better answer than you would have gotten with the short input alone.

Along these lines, I believe there are chain-of-thought studies which find that the content of the intermediate outputs doesn't actually matter all that much...

>>losved+(OP)
> I don't see why it couldn't just one shot it without all the reasoning.

That reminds me of deep neural networks, where a single-layer network could achieve the same results but the layer would have to be excessively large. Maybe we're re-using the same kind of improvement, scaling in length instead of width because of our computation limitations?
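A toy analogy for the width-versus-length trade, using bit parity. This is only an illustration of "one huge step versus many small steps"; it is not a claim about how neural networks or LLMs actually compute anything, and both function names are made up for this sketch.

```python
from itertools import product


def parity_lookup_table(n_bits: int) -> dict:
    """'Wide, one-shot' approach: precompute an answer for every possible input.

    The table needs 2**n_bits entries, so it grows exponentially with input size.
    """
    return {bits: sum(bits) % 2 for bits in product((0, 1), repeat=n_bits)}


def parity_sequential(bits) -> int:
    """'Long' approach: reuse one tiny XOR step once per input bit."""
    acc = 0
    for b in bits:
        acc ^= b  # each step only needs the running result and one new bit
    return acc


table = parity_lookup_table(4)
print(len(table))  # size of the one-shot table for 4 bits
print(all(table[bits] == parity_sequential(bits) for bits in table))
```

Both approaches give the same answers; the sequential one trades a table that doubles with every extra bit for a number of steps that grows linearly.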

>>losved+(OP)
Using more tokens = more compute to use for a given problem. I think most of the benefit of CoT has more to do with autoregressive models being unable to “think ahead” and revise their output, and less to do with actual reasoning. The fact that an LLM can have incorrect reasoning in its CoT and still produce the right answer, or that it can “lie” in its CoT to avoid being detected as cheating on RL tasks, makes me believe that the semantic content of CoT is an illusion, and that the improved performance is from being able to explore and revise in some internal space using more compute before producing a final output.
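A minimal sketch of the "can't revise its output" point, assuming a plain append-only decoding loop. `toy_next_token` is a scripted stand-in invented for this example, not a real model or library call; it only shows that a correction can happen solely by emitting more tokens, never by editing earlier ones.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new: int,
             stop: str = "<eos>") -> List[str]:
    """Append-only decoding: each new token is conditioned on everything so far."""
    context = list(prompt)
    for _ in range(max_new):
        tok = next_token(context)
        if tok == stop:
            break
        context.append(tok)  # earlier tokens are never modified or removed
    return context[len(prompt):]


def toy_next_token(context: List[str]) -> str:
    """Hypothetical scripted 'model': wrong draft first, then a recheck, then a fix."""
    if "draft=5" not in context:
        return "draft=5"
    if "recheck" not in context:
        return "recheck"
    if "final=4" not in context:
        return "final=4"
    return "<eos>"


output = generate(toy_next_token, ["2+2=?"], max_new=10)
print(output)
```

The wrong draft stays in the output; the only way to "revise" is to keep generating, which is one reading of why a longer trace helps.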

>>losved+(OP)
I like this mental model, which rests heavily on the "be careful not to anthropomorphize" approach:

It was already common to use a document extender (LLM) against a hidden document, which resembles a movie or theater play where a character named User is interrogating a character named Bot.

Chain-of-thought switches the movie/script style to film noir, where the [Detective] Bot character has additional content which is not actually "spoken" at the User character. The extra words in the script add a certain kind of metaphorical inertia.

>>losved+(OP)
It looks superficially like reasoning, but is it reasoning?