So the solution to this is to build a multimodal LLM that accepts both text and raw logits as input.
You query the model once to produce a reasoning trace, then feed that trace's logits into a second query to produce the answer.
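Here's a minimal sketch of what that two-stage query could look like, using Hugging Face `transformers` with GPT-2 as a stand-in model. The original doesn't specify how logits would enter the second query, so the `logit_proj` layer below is a hypothetical (and here untrained) projection from vocab-sized logits into the model's embedding space; treat this as one plausible wiring, not the actual architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is 17 * 24? Let's think step by step."
inputs = tok(prompt, return_tensors="pt")

# Query 1: generate the reasoning trace, keeping the logits at each step.
trace = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
    pad_token_id=tok.eos_token_id,
)
# trace.scores is one [batch, vocab] tensor per generated token.
trace_logits = torch.stack(trace.scores, dim=1)  # [batch, steps, vocab]

# Query 2: map the logits into embedding space and feed them as soft inputs
# ahead of an answer prompt. `logit_proj` is hypothetical; a real system
# would learn this projection rather than use a random Linear layer.
logit_proj = torch.nn.Linear(model.config.vocab_size, model.config.n_embd)
soft_embeds = logit_proj(trace_logits)  # [batch, steps, hidden]

answer_ids = tok("\nA:", return_tensors="pt").input_ids
answer_embeds = model.get_input_embeddings()(answer_ids)
stage2 = torch.cat([soft_embeds, answer_embeds], dim=1)

answer = model.generate(
    inputs_embeds=stage2,
    max_new_tokens=8,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(answer[0], skip_special_tokens=True))
```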
These steps have to happen sequentially, because no training data contains the logits: they only exist once the first query has actually been run. Training can only be parallelized (via teacher forcing) when the full input and output are known in advance.
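For contrast, this is the standard parallel setup that last sentence refers to: with teacher forcing, the full target sequence is known up front, so the loss at every position comes out of one forward pass. The logits in the sketch above only exist after query 1 has decoded autoregressively, so there is no equivalent (text, logits) dataset to do this over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Known input + output, tokenized together: every next-token prediction is
# scored against a target we already have, in a single parallel forward pass.
batch = tok("Q: What is 17 * 24? A: 408", return_tensors="pt")
out = model(**batch, labels=batch.input_ids)
print(out.loss)  # loss over all positions at once, no sequential decoding
```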