zlacker

No, that's not quite what I mean. I used the logits in another reply to point out that there is data specific to the generation process that is not available from the tokens, but there are also the network activations, which add up to that state.

Processing a token is a bit like a tick in a CPU, where the model weights are the program code, and tokens are both input and output. The computation that occurs logically retains concepts and plans over multiple token generation steps.

That the process is fully deterministic is no more interesting than saying a variable in a single-threaded program is not state because you can recompute its value by replaying the program with the same inputs. It seems to me that this uninteresting distinction is the GP's issue.
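
To make the analogy concrete, here is a rough toy sketch, not a real transformer. The names `step` and `run` and the arithmetic update rule are made up purely for illustration. The hidden `state` stands in for activations, `step` for one "tick" (one forward pass), and the emitted tokens are a lossy view of that state.

```python
def step(state, token):
    """One 'tick': a hypothetical stand-in for a single forward pass."""
    # Toy update rule (an assumption for illustration, not how a network works).
    new_state = (state * 31 + token) % 1000
    # The emitted token is a many-to-one projection of the hidden state.
    next_token = new_state % 7
    return new_state, next_token


def run(prompt, n_steps):
    """Feed the prompt, then generate n_steps tokens autoregressively."""
    state, token = 0, 0
    for t in prompt:
        state, token = step(state, t)
    generated = []
    for _ in range(n_steps):
        generated.append(token)
        state, token = step(state, token)
    return state, generated


if __name__ == "__main__":
    # Replaying with the same inputs recomputes the same state and tokens.
    first = run([3, 1, 4], 5)
    second = run([3, 1, 4], 5)
    print(first == second)

    # Two different hidden states can emit the same token, so the token
    # alone does not tell you which state produced it.
    print(step(0, 7), step(0, 14))
```

Replaying `run` with the same prompt gives the same `state` every time, yet `state` is plainly still state while the loop is running. It is also carried across steps, and it holds more than the emitted token reveals.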