Processing a token is a bit like a tick in a CPU: the model weights are the program code, and the tokens are both input and output. The computation logically retains concepts and plans across multiple token generation steps.
That the process is fully deterministic is no more interesting than saying that a variable in a single-threaded program is not state because you can recompute its value by replaying the program with the same inputs. It seems to me that this uninteresting distinction is the GP's issue.
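To make the analogy concrete, here is a toy sketch in Python. It is not how any real model works: `step` is a made-up pure function standing in for one forward pass, and `generate` is a hypothetical loop that feeds each output back in as input. The point is only that `context` carries information from step to step, and being able to replay it does not make it any less state.

```python
def step(context):
    # Hypothetical stand-in for one forward pass: a pure, deterministic
    # function of every token seen so far (fixed "weights" are the constants).
    return (sum(context) * 31 + len(context)) % 97


def generate(prompt, n_steps):
    # The growing context is the state: each new token depends on all
    # earlier tokens, including ones this loop produced itself.
    context = list(prompt)
    for _ in range(n_steps):
        context.append(step(context))
    return context


first = generate([3, 1, 4], 5)
replay = generate([3, 1, 4], 5)

# Replaying with the same inputs reproduces the same sequence...
print(first == replay)
# ...yet every step still read the accumulated context to decide what came next.
print(first)
```

Changing a single early token changes everything generated after it, which is exactly what you would expect from state, deterministic or not.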