So the solution to this is to build a multimodal LLM that accepts both text and raw logits as input.
You query the model once to produce a reasoning trace, then feed that trace's logits into a second query to produce the answer.
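Here's a minimal sketch of what that two-stage query could look like, using Hugging Face `transformers` with GPT-2 as a stand-in model. The original doesn't specify how logits would enter the second query, so the `logit_proj` layer below is a hypothetical (and here untrained) projection from vocab-sized logits into the model's embedding space; treat this as one plausible wiring, not the actual architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is 17 * 24? Let's think step by step."
inputs = tok(prompt, return_tensors="pt")

# Query 1: generate the reasoning trace, keeping the logits at each step.
trace = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
    pad_token_id=tok.eos_token_id,
)
# trace.scores is one [batch, vocab] tensor per generated token.
trace_logits = torch.stack(trace.scores, dim=1)  # [batch, steps, vocab]

# Query 2: map the logits into embedding space and feed them as soft inputs
# ahead of an answer prompt. `logit_proj` is hypothetical; a real system
# would learn this projection rather than use a random Linear layer.
logit_proj = torch.nn.Linear(model.config.vocab_size, model.config.n_embd)
soft_embeds = logit_proj(trace_logits)  # [batch, steps, hidden]

answer_ids = tok("\nA:", return_tensors="pt").input_ids
answer_embeds = model.get_input_embeddings()(answer_ids)
stage2 = torch.cat([soft_embeds, answer_embeds], dim=1)

answer = model.generate(
    inputs_embeds=stage2,
    max_new_tokens=8,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(answer[0], skip_special_tokens=True))
```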
These steps have to happen sequentially, because no training data contains the logits: they only exist once the first query has actually been run. Training can only be parallelized (via teacher forcing) when the full input and output are known in advance.
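For contrast, this is the standard parallel setup that last sentence refers to: with teacher forcing, the full target sequence is known up front, so the loss at every position comes out of one forward pass. The logits in the sketch above only exist after query 1 has decoded autoregressively, so there is no equivalent (text, logits) dataset to do this over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Known input + output, tokenized together: every next-token prediction is
# scored against a target we already have, in a single parallel forward pass.
batch = tok("Q: What is 17 * 24? A: 408", return_tensors="pt")
out = model(**batch, labels=batch.input_ids)
print(out.loss)  # loss over all positions at once, no sequential decoding
```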