A non-anthropomorphized view of LLMs

>>zdw+(OP)
I have the technical knowledge to know how LLMs work, but I still find it pointless to not anthropomorphize, at least to an extent.

The language of "generator that stochastically produces the next word" is just not very useful when you're talking about, e.g., an LLM that is answering complex world modeling questions or generating a creative story. It's at the wrong level of abstraction, just as if you were discussing a UI events API and you were talking about zeros and ones, or voltages in transistors. Technically fine but totally useless for reaching any conclusion about the high-level system.

We need a higher abstraction level to talk about higher level phenomena in LLMs as well, and the problem is that we have no idea what happens internally at those higher abstraction levels. So, considering that LLMs somehow imitate humans (at least in terms of output), anthropomorphization is the best abstraction we have, hence people naturally resort to it when discussing what LLMs can do.
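
For reference, the low-level description above ("generator that stochastically produces the next word") boils down to something like the toy sketch below. Everything in it is an assumption made for illustration: the vocabulary, the scores, and the function names are invented, and in a real LLM the scores would be recomputed from the whole context at every step instead of being fixed.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(vocab, logits, rng, temperature=1.0):
    """Draw one word at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores; a real model derives the scores
# from the preceding text, which is where all the interesting work happens.
vocab = ["cat", "dog", "the", "sat"]
logits = [2.0, 1.0, 0.5, -1.0]

rng = random.Random(0)  # seeded so repeated runs draw the same words
words = [sample_next_word(vocab, logits, rng) for _ in range(5)]
print(words)
```

Which is accurate as far as it goes, and says nothing about where the scores come from.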

>>Al-Khw+uK
On the contrary, anthropomorphism IMO is the main problem with narratives around LLMs - people are genuinely talking about them thinking and reasoning (actively encouraged by the companies selling them) when they are doing nothing of the sort, and it is completely distorting discussions on their use and perceptions of their utility.

>>grey-a+cL
Well "reasoning" refers to Chain-of-Thought and if you look at the generated prompts it's not hard to see why it's called that.

That said, it's fascinating to me that it works (and empirically, it does work; a reasoning model generating tens of thousands of tokens while working out the problem does produce better results). I wish I knew why. A priori I wouldn't have expected it, since there's no new input. That means it's all "in there" in the weights already. I don't see why it couldn't just one shot it without all the reasoning. And maybe the future will bring us more distilled models that can do that, or they can tease out all that reasoning with more generated training data, to move it from dispersed around the weights -> prompt -> more immediately accessible in the weights. But for now "reasoning" works.

But then, at the back of my mind is the easy answer: maybe you can't optimize it. Maybe the model has to "reason" to "organize its thoughts" and get the best results. After all, if you give me a complicated problem I'll write down hypotheses and outline approaches and double check results for consistency and all that. But now we're getting dangerously close to the "anthropomorphization" that this article is lamenting.

>>losved+op1
CoT gives the model more time to think and process the inputs it has. To give an extreme example, suppose you are using next token prediction to answer 'Is P==NP?' The tiny number of input tokens means that there's a tiny amount of compute to dedicate to producing an answer. A scratchpad allows us to break free of the short-inputs problem.
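
A back-of-the-envelope sketch of that compute argument, assuming a decoder-only model with causal attention where every token position attends to itself and everything before it. The function name and the token counts are made up, and the count ignores model width and depth entirely; it only shows how the amount of work grows with the number of tokens in play.

```python
def attention_pairs(prompt_tokens, generated_tokens):
    """Count (query, key) pairs under causal attention.

    Assumption: each of the n positions attends to itself and all
    earlier positions, giving 1 + 2 + ... + n = n * (n + 1) / 2 pairs.
    """
    n = prompt_tokens + generated_tokens
    return n * (n + 1) // 2

# Hypothetical sizes: a 5-token question answered in a single token,
# versus the same question with a 1000-token scratchpad first.
direct = attention_pairs(prompt_tokens=5, generated_tokens=1)
with_scratchpad = attention_pairs(prompt_tokens=5, generated_tokens=1000)

print(direct)           # 21
print(with_scratchpad)  # 505515
```

Under these assumptions the scratchpad run does roughly four orders of magnitude more attention work before committing to an answer, regardless of what the scratchpad tokens actually say.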

Meanwhile, things can happen in the latent representation which aren't reflected in the intermediate outputs. You could, instead of using CoT, say "Write a recipe for a vegetarian chili, along with a lengthy biographical story relating to the recipe. Afterwards, I will ask you again about my original question." And the latents can still help model the primary problem, yielding a better answer than you would have gotten with the short input alone.

Along these lines, I believe there are chain of thought studies which find that the content of the intermediate outputs doesn't actually matter all that much...

zlacker