> LLMs have hidden state not necessarily directly reflected in the tokens being produced, and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer-term outcomes (or predictions, if you prefer).
But what does it mean for an LLM to output a token in opposition to its hidden state? If there's a longer-term goal, it either needs to be verbalized in the output stream, or somehow reconstructed anew from the visible context on every token.
There's some work (a link would be great) that disentangles whether chain-of-thought helps because it gives the model more FLOPs to work with, or because it makes its subgoals explicit: outputting "Okay, let's reason through this step by step..." versus just filler like "...". What that work finds is that even placeholder tokens like "..." can help.
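To make the shape of that comparison concrete, here's a minimal sketch using the Hugging Face transformers API. The model, prompts, and filler string are my own illustrative choices, not details from that work, and an off-the-shelf model isn't expected to actually reproduce the filler-token effect:

```python
# Sketch of the comparison: explicit reasoning vs. filler tokens vs. no extra tokens.
# Model name, question, and prompt variants are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the actual work used larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Q: If I have 3 apples and buy 2 more, how many do I have?\nA:"

prompts = {
    # Explicit, verbalized reasoning before the answer.
    "verbal_cot": question + " Okay, let's reason through this step by step.",
    # Placeholder tokens: extra forward passes, but no verbalized subgoals.
    "filler": question + " ..........",
    # No extra tokens at all, as a baseline.
    "direct": question,
}

for name, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
    print(f"{name}: {completion!r}")
```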
That does seem to imply some notion of evolving hidden state, and I can see how that idea comes in.
But crucially, in autoregressive models, this state isn't persisted across time. Each token is generated afresh, based only on the visible history. The model's internal (hidden) layers are certainly rich, structured, and "non-verbal", but any nefarious intention or conclusion has to be re-derived on every forward pass.
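To make that statelessness concrete, here's a minimal sketch of a greedy decoding loop (the model and prompt are mine, purely illustrative): every step is a fresh forward pass over the visible tokens, and nothing else survives between steps.

```python
# Minimal greedy autoregressive decoding: the only thing carried from one
# step to the next is the token sequence itself. (Real implementations cache
# per-token activations as an optimization, but that cache is a pure function
# of the visible history, not extra persistent state.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("The hidden state of an LLM", return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        # Fresh forward pass over the full visible history; any internal
        # representations from the previous step are gone.
        logits = model(tokens).logits
    next_token = logits[0, -1].argmax()  # greedy pick of the next token
    tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```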
Goals, such as they are, are essentially programs, or simulations, that the LLM runs to help it predict (generate) future tokens.
Anyway, the whole original article is a rejection of anthropomorphism. I think the anthropomorphism is useful, but you still need to think of LLMs as deeply defective minds. And I totally reject the idea that they have intrinsic moral weight or consciousness or anything close to that.