zlacker

RL in ChatGPT is used for that: to generate text that maximizes reward. But if you have other domains with their reward functions, then you could plan on them

replies(1): >>PaulHo+z9

>>qumpis+(OP)
My impression is that the complex optimization happens during training but that the actual inference is using some kind of greedy algorithm like beam search. If the inference algorithm was using simulated annealing or something like that that would be different.