I noticed that they automatically create at least three other draft responses.
I assume that this is a technique that allows them to try multiple times and then select the best one.
Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response, which seems important for getting good results with these models.
I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step one are reviewed by the group in step 2, then they try again in step 3.
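Something like this is what I have in mind, as a rough sketch only (HuggingFace transformers with gpt2 as a stand-in, and the "pick the longest draft" step is just a placeholder for whatever review/ranking pass a real system would use; no idea what Bard actually does internally):

    # Sample several "drafts" for one prompt in a single batched generate
    # call, then select one of them in a separate review step.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Explain why batching helps LLM inference throughput:"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,          # stochastic sampling -> drafts differ
            temperature=0.9,
            max_new_tokens=60,
            num_return_sequences=3,  # the three drafts share one batch on the GPU
            pad_token_id=tokenizer.eos_token_id,
        )

    drafts = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    # Placeholder "step 2": a real system might rank drafts with a reward
    # model or a second LLM pass; here we just take the longest one.
    best = max(drafts, key=len)
    print(best)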
I guess that only applies to a small department where there's frequently just one person using it at a time.
It can make it more expensive if that option becomes popular.
But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it lets them use the parallel throughput of the GPU more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default.)
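For concreteness, a toy sketch of what I mean (gpt2 as a stand-in; real serving stacks do continuous batching and a lot more than this): several users' prompts get padded into one batch, so every decoding step is a single forward pass over all of them instead of one sequential pass per request.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
    tokenizer.padding_side = "left"            # left-pad for decoder-only generation
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Pretend these arrive from three different customers at the same time.
    requests = [
        "Summarize the plot of Hamlet.",
        "Write a haiku about GPUs.",
        "What is attention in a transformer?",
    ]

    batch = tokenizer(requests, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=40,
                             pad_token_id=tokenizer.eos_token_id)

    for req, seq in zip(requests, out):
        print(req, "->", tokenizer.decode(seq, skip_special_tokens=True))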
It forces you to remind yourself of the stochastic nature of the model and of RLHF; maybe the data on which drafts users pick even helps to improve the latter.
I liked this trait of Bard from the start and hope they keep it.
It provides a sense of agency and reminds you not to anthropomorphize the transformer chatbot too much.
It’s not like DALL-E outputs pixels in scanout order - or in brushstroke order (…er… or does it?)