zlacker

[return to "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]
1. valine+r7 2025-05-23 17:09:04
>>nyrikk+(OP)
I think it’s helpful to remember that language models don’t produce tokens; they produce a distribution over possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn’t mean a useful reasoning trace isn’t also contained within the latent space.

It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens; high-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the latents of every previous position, the same latents you can project into a distribution over next tokens.
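
A minimal numpy sketch of that last step (toy sizes and names, not taken from any real model): the final-layer latent is high-dimensional, the unembedding projection collapses it into a distribution over next tokens, and only that collapsed view ever reaches the sampler.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, vocab = 64, 1000                   # toy sizes

    h_final = rng.standard_normal(d_model)      # final-layer latent for the current position
    W_unembed = rng.standard_normal((d_model, vocab)) / np.sqrt(d_model)

    logits = h_final @ W_unembed                # project the latent down to token scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # the full next-token distribution

    token_id = rng.choice(vocab, p=probs)       # the sampler only ever sees this low-dim view

    # h_final is not consumed by this step: in a real decoder, keys/values derived
    # from it (and from every earlier latent) sit in the cache that later positions
    # attend to.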

2. jacob0+3k 2025-05-23 18:44:32
>>valine+r7
So you’re saying that the reasoning trace represents sequential connections between the full distributions rather than between the tokens sampled from them?
3. valine+xk 2025-05-23 18:46:41
>>jacob0+3k
The lower-dimensional logits are discarded; the original high-dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.
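
A toy decode loop in the same spirit (pure numpy, made-up sizes, a stand-in "attention" per layer, nothing from a real model): every layer keeps its own cache of latents for every position, while the logits exist only long enough to sample one token and are then thrown away.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, vocab, n_layers = 64, 1000, 4          # toy sizes

    W_embed   = rng.standard_normal((vocab, d_model)) * 0.1
    W_unembed = rng.standard_normal((d_model, vocab)) * 0.1
    W_layer   = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_layers)]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    latent_cache = [[] for _ in range(n_layers)]    # one list of latents per layer, never discarded

    token = 0                                       # arbitrary start token
    for step in range(8):
        h = W_embed[token]                          # embed the previously sampled token
        for l in range(n_layers):
            latent_cache[l].append(h)               # this layer's latent for this position
            past = np.stack(latent_cache[l])        # every latent this layer has produced so far
            weights = softmax(past @ h)             # stand-in attention over all previous latents
            h = (weights @ past) @ W_layer[l]       # weighted average, then this layer's transform
        logits = h @ W_unembed                      # low-dimensional projection of the final latent
        token = rng.choice(vocab, p=softmax(logits))
        # `logits` is overwritten on the next step; `latent_cache` keeps every layer's
        # latent for every position the model has ever processed.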

4. bcoate+TM 2025-05-23 22:21:55
>>valine+xk
Either I’m wildly misunderstanding or that can’t possibly be true: if you sample at high temperature and it picks a very low-probability token, the continuation is consistent with the chosen token, not with the more likely ones.
5. valine+tN 2025-05-23 22:29:46
>>bcoate+TM
Attention computes a weighted average of all previous latents. So yes, the input to the forward pass is the new token, but after it passes through an attention head its representation contains a little bit of every previous latent.
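
A small numpy sketch of that mixing (toy sizes, single head, output projection and residual add omitted, none of it from a real model): the query comes from the newly sampled, possibly low-probability token, but the output is a weighted average over value vectors derived from every earlier latent, so the earlier latents leak into the new position no matter what the sampler chose.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                                          # toy dimension

    past_latents = rng.standard_normal((10, d))     # latents of the 10 previous positions
    h_new = rng.standard_normal(d)                  # latent of the token the sampler just picked

    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    q = h_new @ W_q
    K = past_latents @ W_k
    V = past_latents @ W_v

    weights = np.exp(K @ q / np.sqrt(d))
    weights /= weights.sum()                        # softmax over the previous positions

    mixed = weights @ V                             # weighted average of (projected) previous latents
    # `mixed` gets added back into the new position's residual stream, so its latent
    # now carries a little bit of every earlier latent, even if the sampled token
    # itself was a low-probability draw.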