E.g., does it pick 'the' as the next token because there's a strong probability of 'planet' as the token after?
Is it only past state that influences the choice of 'the'? Or is the model predicting many tokens in advance and only returning the first one in the output?
If it does predict many, I'd consider that state hidden in the model weights.
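For what it's worth, here's a toy sketch of the two decoding strategies being contrasted. The probability table and all numbers are invented for illustration; in reality a model would supply these distributions. Plain autoregressive decoding scores only p(token | context), while explicit lookahead would also score the step after:

    # Toy stand-in for a model's next-token distributions; all numbers invented.
    probs = {
        ("garden",): {"the": 0.4, "a": 0.6},
        ("garden", "the"): {"planet": 0.9, "idea": 0.1},
        ("garden", "a"): {"planet": 0.3, "idea": 0.5, "rock": 0.2},
    }

    def greedy_next(context):
        # Score each candidate by p(token | context) alone -- no lookahead.
        dist = probs[tuple(context)]
        return max(dist, key=dist.get)

    def lookahead_next(context):
        # Score each candidate by its best two-token continuation:
        # argmax_t  p(t | ctx) * max_t' p(t' | ctx + t)
        dist = probs[tuple(context)]
        def pair_score(t):
            return dist[t] * max(probs[tuple(context) + (t,)].values())
        return max(dist, key=pair_score)

    print(greedy_next(["garden"]))     # 'a'   (0.6 beats 0.4 on this step alone)
    print(lookahead_next(["garden"]))  # 'the' (0.4*0.9 = 0.36 beats 0.6*0.5 = 0.30)

Standard sampling works like greedy_next, one token at a time; whether planning-like state nevertheless exists internally is the interesting question.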
https://www.anthropic.com/research/tracing-thoughts-language...
There's going to be a lot more "an apple" in the corpus than "an pear".
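A tiny illustration of that point (toy corpus, counts made up): plain bigram statistics already tie the article to the word that follows it, so picking "an" because "apple" is coming can fall out of next-token frequencies alone, no explicit lookahead required.

    from collections import Counter

    # Toy corpus standing in for web-scale text; only the counts matter here.
    corpus = "he ate an apple and an apple and a pear and a pear and a pear".split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    # What follows each article? The article choice already correlates with
    # the *next* word, purely from corpus statistics.
    for article in ("a", "an"):
        print(article, {b: c for (a, b), c in bigrams.items() if a == article})
    # a {'pear': 3}
    # an {'apple': 2}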