zlacker

[parent] [thread] 14 comments
1. valine+(OP)[view] [source] 2025-05-23 17:09:04
I think it’s helpful to remember that language models are not producing tokens; they are producing a distribution over possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn’t mean a useful reasoning trace isn’t also contained within the latent space.

It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens; high-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the latents of all previous positions, the same latents you can project into a distribution over next tokens.
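
Here’s a rough sketch of what I mean, using GPT-2 through Hugging Face transformers as a stand-in (the checkpoint, prompt, and variable names are just illustrative):

    # Sketch only: one forward pass yields a full next-token distribution *and*
    # the latent that produced it; the sampler only ever sees the distribution.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    last_latent = out.hidden_states[-1][0, -1]        # final-layer latent, shape (768,)
    next_dist = torch.softmax(out.logits[0, -1], -1)  # distribution over ~50k tokens

    sampled = torch.multinomial(next_dist, 1)  # the sampler keeps one token;
    print(tok.decode(sampled))                 # last_latent still holds everything
                                               # that produced the whole distribution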

replies(4): >>woadwa+U8 >>jacob0+Cc >>x_flyn+M61 >>aiiizz+Aw1
2. woadwa+U8[view] [source] 2025-05-23 18:17:48
>>valine+(OP)
> Just because your sampler picks a sequence of tokens that contain incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space.

That's essentially the core idea in Coconut [1][2]: keeping the reasoning traces in a continuous space.

[1]: https://arxiv.org/abs/2412.06769

[2]: https://github.com/facebookresearch/coconut
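
A very rough sketch of the continuous-thought loop as I understand it from the paper (my paraphrase with GPT-2 as a stand-in, not the authors' code):

    # My paraphrase of the continuous-thought idea: feed the last hidden state
    # back in as the next input embedding instead of sampling a token.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

    ids = tok("2 + 3 * 4 =", return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)            # (1, seq, 768)

    with torch.no_grad():
        for _ in range(4):                                # a few latent "thought" steps
            out = model(inputs_embeds=embeds)
            thought = out.hidden_states[-1][:, -1:, :]    # last latent, never decoded
            embeds = torch.cat([embeds, thought], dim=1)  # append it like a new token
        logits = model(inputs_embeds=embeds).logits[:, -1, :]

    print(tok.decode(logits.argmax(-1)))                  # project to vocab only at the end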

3. jacob0+Cc[view] [source] 2025-05-23 18:44:32
>>valine+(OP)
So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?
replies(1): >>valine+6d
4. valine+6d[view] [source] [discussion] 2025-05-23 18:46:41
>>jacob0+Cc
The lower-dimensional logits are discarded; the original high-dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.
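
For concreteness, every layer hands you its own latents if you ask for them (GPT-2 via Hugging Face as a stand-in; the shapes are that model’s, not anything universal):

    # GPT-2 stand-in: each layer produces its own latent for every position;
    # only the final layer's latents normally get pushed through the unembedding.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

    out = model(**tok("Roses are red, violets are", return_tensors="pt"))
    for i, h in enumerate(out.hidden_states):   # embedding output + 12 block outputs
        print(f"layer {i}: latents of shape {tuple(h.shape)}")   # (1, seq_len, 768)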

replies(2): >>jacob0+Lg >>bcoate+sF
5. jacob0+Lg[view] [source] [discussion] 2025-05-23 19:08:23
>>valine+6d
I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role--they feed into the logits.
replies(1): >>valine+gi
6. valine+gi[view] [source] [discussion] 2025-05-23 19:16:25
>>jacob0+Lg
The dimensionality, I suppose, depends on the vocab size and your hidden dimension, but that’s not really relevant. It’s a single linear projection to go from latents to logits.

Reasoning is definitely not happening in the linear projection to logits, if that’s what you mean.
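
To be concrete about how small that step is (toy code; the dimensions are GPT-2’s, but the point is general):

    # Toy numbers: latents -> logits is a single linear map, so there is no
    # room for multi-step computation in it.
    import torch
    import torch.nn as nn

    d_model, vocab_size = 768, 50257
    lm_head = nn.Linear(d_model, vocab_size, bias=False)

    latent = torch.randn(d_model)           # final-layer latent for one position
    logits = lm_head(latent)                # (50257,) scores, one per vocab entry
    probs = torch.softmax(logits, dim=-1)   # the distribution the sampler sees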

replies(1): >>pyinst+r11
7. bcoate+sF[view] [source] [discussion] 2025-05-23 22:21:55
>>valine+6d
Either I'm wildly misunderstanding or that can't possibly be true--if you sample at high temperature and it chooses a very low-probability token, the continuation is consistent with the chosen token, not with the more likely ones.
replies(1): >>valine+2G
8. valine+2G[view] [source] [discussion] 2025-05-23 22:29:46
>>bcoate+sF
Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.
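
In plain PyTorch, the mixing step for a single head looks roughly like this (random tensors, just to show the shapes):

    # One head's output for the newest position is a weighted average over
    # value vectors built from every prior position's latent.
    import torch

    d = 64
    q = torch.randn(1, d)        # query built from the newest token's latent
    K = torch.randn(10, d)       # keys built from the 10 previous latents
    V = torch.randn(10, d)       # values built from the same 10 latents

    weights = torch.softmax(q @ K.T / d**0.5, dim=-1)   # (1, 10), sums to 1
    mixed = weights @ V          # (1, d): a blend containing a bit of every latent
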
9. pyinst+r11[view] [source] [discussion] 2025-05-24 03:21:32
>>valine+gi
Where does it happen ?
replies(1): >>valine+8Z1
10. x_flyn+M61[view] [source] 2025-05-24 04:44:03
>>valine+(OP)
What the model is doing in latent space is auxiliary to anthropomorphic interpretations of the tokens, though. And if the latent reasoning matched a ground-truth procedure (A*), then we'd expect it to be projectable to semantic tokens, but it isn't. So it seems the model has learned an alternative method for solving these problems.
replies(2): >>refulg+771 >>valine+QZ1
11. refulg+771[view] [source] [discussion] 2025-05-24 04:51:58
>>x_flyn+M61
It is worth pointing out that "latent space" is meaningless.

There's a lot of stuff that makes this hard to discuss. E.g. by "projectable to semantic tokens" you mean "able to be written down"... right?

Something I do to make an idea really stretch its legs is reword it in the voice of Fat Tony, the Taleb character.

Setting that aside, why do we think this path finding can't be written down?

Is Claude/Gemini Plays Pokemon just an iterated A* search?

12. aiiizz+Aw1[view] [source] 2025-05-24 11:46:44
>>valine+(OP)
Is that really true? E.g. Anthropic said that the model can make decisions about all the tokens before a single token is produced.
replies(1): >>valine+FY1
13. valine+FY1[view] [source] [discussion] 2025-05-24 16:43:41
>>aiiizz+Aw1
That’s true, yeah. The model can do that because calculating latents is independent of next-token prediction. You do a forward pass for each token in your sequence without the final projection to logits.
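
For example, the bare backbone without its LM head already gives you a latent for every position, with no next-token distribution in sight (GPT-2 as a stand-in):

    # GPT-2 stand-in: the bare backbone (no LM head) still computes a latent for
    # every position in the prompt; no next-token distribution is ever formed.
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    backbone = AutoModel.from_pretrained("gpt2")   # GPT-2 without its lm_head

    out = backbone(**tok("Plan the whole answer first", return_tensors="pt"))
    print(out.last_hidden_state.shape)             # (1, seq_len, 768): latents only
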
14. valine+8Z1[view] [source] [discussion] 2025-05-24 16:47:03
>>pyinst+r11
My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.
15. valine+QZ1[view] [source] [discussion] 2025-05-24 16:53:03
>>x_flyn+M61
You’re thinking about this like the final layer of the model is all that exists. It’s highly likely reasoning is happening at a lower layer, in a different latent space that can’t natively be projected into logits.
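
One way to poke at that is the "logit lens" trick, forcing an intermediate layer's latents through the final layer norm and unembedding anyway (GPT-2 as a stand-in; whether that counts as a "native" projection is exactly the question):

    # "Logit lens": push an intermediate layer's latent through the final layer
    # norm and unembedding and see what token it lands on.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

    out = model(**tok("The sum of 17 and 25 is", return_tensors="pt"))
    for layer in (3, 6, 9, 11):                    # a few of GPT-2's 12 blocks
        latent = out.hidden_states[layer][0, -1]   # that layer's latent, last position
        logits = model.lm_head(model.transformer.ln_f(latent))
        print(layer, tok.decode([int(logits.argmax())]))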