zlacker

1. ianbut+(OP)[view] [source] 2023-06-10 15:26:09
Potentially more correct, yes. Raising the temperature frees the model to choose lower-probability tokens to some degree; technically, it boosts their probabilities, which may be more correct depending on the task.
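A minimal sketch of how temperature scaling changes the output distribution. This is pure Python, and the logits are made up for illustration; real models apply the same idea to vocabulary-sized logit vectors:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature before exponentiating:
    # T > 1 flattens the distribution (boosts low-probability tokens),
    # T < 1 sharpens it toward the most likely token.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=2.0))  # least likely token gains probability
```

At temperature 2 the least likely token's probability roughly doubles relative to temperature 1, which is the "boost" described above.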

There are also sampling schemes, top_p and top_k, each of which can help choose tokens that are less probable (but still among the most probable) yet more correct, and they are often used together for the best effect.
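A toy sketch of both filters, assuming we already have a probability distribution over the vocabulary (the numbers here are invented). top_k keeps the k most probable tokens; top_p (nucleus sampling) keeps the smallest set whose cumulative probability reaches p; both renormalize before sampling:

```python
def top_k_filter(probs, k):
    # Keep the k highest-probability token indices, then renormalize.
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def top_p_filter(probs, p):
    # Keep the smallest prefix of tokens (in descending probability order)
    # whose cumulative probability reaches p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Made-up 4-token distribution.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.8))
```

With these numbers both filters happen to keep the same two tokens; in general top_p adapts the cutoff to how peaked the distribution is, while top_k is a fixed-size cutoff, which is why combining them can work well.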

And then there are various decoding methods like beam search, where the best overall beam may not contain the most probable individual token at each step.

By default, a simple greedy search is used, which always chooses the highest-probability next token.
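The contrast between the two can be shown with a toy "model" of hand-picked conditional distributions (all probabilities here are invented): greedy commits to the locally best token, while beam search keeps several partial sequences and can find a higher-probability sequence that starts with a less probable token.

```python
# Toy "language model": conditional next-token distributions for a
# two-step sequence. The probabilities are made up for illustration.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(steps=2):
    # Always take the single most probable next token.
    seq, prob = (), 1.0
    for _ in range(steps):
        tok, p = max(MODEL[seq].items(), key=lambda kv: kv[1])
        seq, prob = seq + (tok,), prob * p
    return seq, prob

def beam_search(width=2, steps=2):
    # Keep the `width` most probable partial sequences at each step.
    beams = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), prob * p)
            for seq, prob in beams
            for tok, p in MODEL[seq].items()
        ]
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams[0]

print(greedy())       # commits to "A" first; total prob ≈ 0.30
print(beam_search())  # finds ("B", "x"); total prob ≈ 0.36
```

Greedy takes "A" (0.6) and ends with probability about 0.30, while the beam finds the sequence starting with the less probable "B" (0.4) whose total probability is about 0.36, illustrating why the best beam need not contain the locally best token.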
