zlacker

[return to "LLMs cannot find reasoning errors, but can correct them"]
1. ilaksh+Hm[view] [source] 2023-11-20 21:01:43
>>koie+(OP)
I was just testing Bard with some very simple coding exercises and it did well.

I noticed that they automatically create at least three other draft responses.

I assume that this is a technique that allows them to try multiple times and then select the best one.

Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response, which seems important for getting good results with these models.

I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step 1 are reviewed by the group in step 2, then they try again in step 3 (rough sketch below).

I guess that only applies in a small department where there is frequently just one person using it at a time, so the batch slots aren't already filled by other customers' requests.
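
Something like this sketch, using the OpenAI Python client just because it's familiar. The model name, prompts, and three-step structure are all placeholders for the idea; the n parameter is a real API option for getting several completions from one request:

    # Sketch of the generate -> review -> retry loop described above.
    # Model names and prompts are placeholders.
    from openai import OpenAI

    client = OpenAI()
    task = "Write a Python function that reverses a linked list."

    # Step 1: draft several candidates in one batched request (n=3).
    drafts = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": task}],
        n=3,
    )
    candidates = [c.message.content for c in drafts.choices]

    # Step 2: the "group review" -- show all drafts back to the model.
    review = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            task + "\n\nCandidate answers:\n---\n" + "\n---\n".join(candidates)
            + "\n\nPoint out any bugs or mistakes in each candidate."}],
    )

    # Step 3: try again with the review as feedback.
    final = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": candidates[0]},
            {"role": "user", "content": "A reviewer said:\n"
                + review.choices[0].message.content
                + "\n\nRewrite your answer with those issues fixed."},
        ],
    )
    print(final.choices[0].message.content)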

2. Millio+Kt[view] [source] 2023-11-20 21:31:38
>>ilaksh+Hm
IIRC there were some OpenAI docs that recommended doing exactly this: make n generations and use a smaller fine-tuned model to select the best one.
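
Roughly this pattern, as a sketch. Model names are placeholders, and I'm using a cheaper chat model as the selector rather than a purpose-built fine-tuned reranker, which is what the docs described:

    # Sketch of best-of-n: sample n generations, let a smaller model pick.
    from openai import OpenAI

    client = OpenAI()
    prompt = "Explain why the sky is blue in two sentences."

    # n=4 samples from the "big" model in a single request.
    gens = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        n=4,
        temperature=1.0,  # keep some diversity across samples
    )
    candidates = [c.message.content for c in gens.choices]

    # A smaller model picks the best candidate by index.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    pick = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Question: {prompt}\n\nCandidates:\n{numbered}\n\n"
            "Reply with only the number of the best candidate."}],
    )
    best = candidates[int(pick.choices[0].message.content.strip())]
    print(best)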
3. DaiPlu+9A[view] [source] 2023-11-20 22:00:51
>>Millio+Kt
...does this directly relate to the high operating costs of LLMs-as-a-service, if for every request they have to run n redundant LLM generations? So if they could improve things so that a single prompt/response had a higher chance of being high quality, they wouldn't need to run the alternatives?
4. ilaksh+9B[view] [source] 2023-11-20 22:07:12
>>DaiPlu+9A
A lot of services don't run multiple generations per request.

It can make it more expensive if that option becomes popular.

But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it lets them use the parallel throughput of the GPU more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard and generate multiple drafts by default.)
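
To make that concrete, here's a toy sketch with Hugging Face transformers (gpt2 only because it's tiny; a real serving stack would use continuous batching, but the idea is the same: one forward pass over several customers' prompts):

    # Sketch: three independent prompts (as if from different customers)
    # run through the model in one batched generate() call.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token
    tokenizer.padding_side = "left"            # left-pad for decoder-only generation
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompts = [
        "Customer A: the capital of France is",
        "Customer B: two plus two equals",
        "Customer C: the sky is",
    ]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    # One forward pass per decode step for the whole batch, instead of
    # three separate sequential generations -- that's the throughput win.
    outputs = model.generate(**inputs, max_new_tokens=10,
                             pad_token_id=tokenizer.eos_token_id)
    for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        print(text)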
