
[return to "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]
1. valine+r7 2025-05-23 17:09:04
>>nyrikk+(OP)
I think it's helpful to remember that language models are not producing tokens; they are producing a distribution over possible next tokens. Just because your sampler picks a sequence of tokens containing incorrect reasoning doesn't mean a useful reasoning trace isn't also contained within the latent space.

It's a misconception that transformers reason in token space. Tokens don't attend to other tokens; high-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the latents at every previous position, the same latents you can project into a distribution over next tokens.
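To make the distinction concrete, here's a minimal PyTorch-style sketch (the shapes and names like `lm_head` are illustrative assumptions, not from the paper): the final-layer latent is projected to a full next-token distribution, and the sampler then collapses that distribution to a single token.

```python
import torch

# Hypothetical decoder-only setup: `hidden` stands in for the final-layer
# latents at each position; `lm_head` projects a latent to vocabulary logits.
vocab_size, d_model, seq_len = 50_000, 768, 16

hidden = torch.randn(seq_len, d_model)          # latents after the final layer
lm_head = torch.nn.Linear(d_model, vocab_size)  # projection from latent space to token space

logits = lm_head(hidden[-1])                    # only the last position's latent is projected
probs = torch.softmax(logits, dim=-1)           # full distribution over possible next tokens

sampled = torch.multinomial(probs, num_samples=1)  # the sampler keeps exactly one token...
# ...while `hidden` (and `probs`) still carry everything the model represented at this step.
```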

2. jacob0+3k 2025-05-23 18:44:32
>>valine+r7
So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?