zlacker

IIRC there were some OpenAI docs that recommended doing exactly this: make n generations and use a smaller fine-tuned model to select the best one.
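
A rough sketch of that generate-n-then-select idea, for anyone who wants to see the shape of it. Everything named here is hypothetical: `generate` stands in for a call to the large model, `score` stands in for the smaller fine-tuned selector, and the toy versions below exist only so the snippet runs on its own.

```python
from itertools import cycle
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    # On ties, max() keeps the earliest candidate.
    return max(candidates, key=lambda c: score(prompt, c))


def make_toy_generator(canned: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for the large model: cycles through canned replies."""
    replies = cycle(canned)

    def generate(prompt: str) -> str:
        return next(replies)

    return generate


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in for the selector: counts distinct words."""
    return float(len(set(candidate.split())))
```

Usage would look like `best_of_n("question", make_toy_generator(["a", "a b c", "a b"]), toy_score, n=3)`. Note that the cost of the large model scales with n here, while the selector only has to read the finished candidates.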

replies(2): >>Tostin+y4 >>DaiPlu+p6

>>Millio+(OP)
Right, most inference servers support this already.

>>Millio+(OP)
...does this directly relate to the high operating costs of LLMs-as-a-service, if for every request they have to run n-many redundant LLM requests? So if they could improve things so that a single prompt/request+response has a higher chance of being high-quality they wouldn't need to run alternatives?

replies(2): >>ilaksh+p7 >>Millio+r51

>>DaiPlu+p6
A lot of people don't run multiple at a time.

Serving can get more expensive if that option becomes popular.

But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it enables them to use the parallel throughput of the graphics device more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default).
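
To make the batching point concrete, here is a back-of-the-envelope sketch. It assumes a simplified model that is my own, not any particular server's behavior: one forward pass produces one token for every sequence in the batch, a pass costs roughly the same whether the batch holds one sequence or several, and a batch runs until its longest sequence finishes (plain static batching). The function names are made up for illustration.

```python
from typing import List


def forward_passes_unbatched(output_lengths: List[int]) -> int:
    """Each request runs alone, so every generated token costs one pass."""
    return sum(output_lengths)


def forward_passes_batched(output_lengths: List[int], max_batch_size: int) -> int:
    """Requests are grouped in arrival order into fixed batches.

    Assumption: a batch keeps running until its longest sequence is done,
    so it costs as many passes as that longest sequence has tokens.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    total = 0
    for start in range(0, len(output_lengths), max_batch_size):
        batch = output_lengths[start:start + max_batch_size]
        total += max(batch)
    return total
```

With four requests producing 10, 20, 30 and 40 tokens, the unbatched count is 100 passes, while a batch size of 4 needs 40 under these assumptions. Real servers are more involved than this, but the sketch shows why sharing passes across customers helps the operator.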

>>DaiPlu+p6
Another point: Now that I think about it, I doubt this is compatible with streaming the output to the user, which might be an issue in some cases.

replies(1): >>DaiPlu+p81

>>Millio+r51
I thought the char-by-char teleprinter thing was just an effect (y’know, for user-engagement and to make the interaction feel more genuine) - and that these systems just return output in buffered blocks/pages or whatever-it-is that they wired-up their network to do.

It’s not like DALL-E outputs pixels in scanout order - or in brushstroke order (…er… or does it?)

replies(1): >>ilaksh+FE1

>>DaiPlu+p81
It's not an effect at all. It calculates and outputs one token at a time. The algorithm requires all previous tokens in order to output the next one. DALL-E is a totally different algorithm. It does not have a scanout or brushstrokes.
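
A minimal sketch of that token-by-token loop, in case it helps. The `next_token` callable is a hypothetical stand-in for the model's forward pass plus sampling, and `toy_next_token` is an arbitrary rule chosen only so the example runs; neither is a real library API. The point is that each step reads the whole sequence so far, and each token can be sent to the user as soon as it exists.

```python
from typing import Callable, Iterator, List, Optional


def generate_stream(
    prompt_tokens: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
    stop_token: Optional[int] = None,
) -> Iterator[int]:
    """Yield new tokens one at a time, each computed from all earlier tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # needs the full sequence so far
        if tok == stop_token:
            return
        tokens.append(tok)
        yield tok  # a server could flush this to the client right away


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for the model: depends on every previous token."""
    return sum(tokens) % 10
```

Iterating with `for tok in generate_stream([1, 2], toy_next_token, 4): ...` hands over each token as it is produced, which is what the typing effect in chat UIs reflects.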