I noticed that they automatically create at least three other draft responses.
I assume that this is a technique that allows them to try multiple times and then select the best one.
Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response, which seems important for getting good results with these models.
I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step one are reviewed by the group in step 2, then they try again in step 3.
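Something like this is what I have in mind, as a rough sketch only (HuggingFace transformers with gpt2 as a stand-in, and the "pick the longest draft" step is just a placeholder for whatever review/ranking pass a real system would use; no idea what Bard actually does internally):

    # Sample several "drafts" for one prompt in a single batched generate
    # call, then select one of them in a separate review step.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Explain why batching helps LLM inference throughput:"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,          # stochastic sampling -> drafts differ
            temperature=0.9,
            max_new_tokens=60,
            num_return_sequences=3,  # the three drafts share one batch on the GPU
            pad_token_id=tokenizer.eos_token_id,
        )

    drafts = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    # Placeholder "step 2": a real system might rank drafts with a reward
    # model or a second LLM pass; here we just take the longest one.
    best = max(drafts, key=len)
    print(best)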
I guess that only applies to a small department where there's frequently just one person using it at a time.
It can make it more expensive if that option becomes popular.
But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it lets them use the parallel throughput of the GPU more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default.)
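For concreteness, a toy sketch of what I mean (gpt2 as a stand-in; real serving stacks do continuous batching and a lot more than this): several users' prompts get padded into one batch, so every decoding step is a single forward pass over all of them instead of one sequential pass per request.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
    tokenizer.padding_side = "left"            # left-pad for decoder-only generation
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Pretend these arrive from three different customers at the same time.
    requests = [
        "Summarize the plot of Hamlet.",
        "Write a haiku about GPUs.",
        "What is attention in a transformer?",
    ]

    batch = tokenizer(requests, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=40,
                             pad_token_id=tokenizer.eos_token_id)

    for req, seq in zip(requests, out):
        print(req, "->", tokenizer.decode(seq, skip_special_tokens=True))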
It forces you to remind yourself of the stochastic nature of the model and of RLHF; maybe the data on which drafts users pick even helps to improve the latter.
I liked this trait of Bard from the start and hope they keep it.
It provides a sense of agency and reminds you not to anthropomorphize the transformer chatbot too much.
It’s not like DALL-E outputs pixels in scanout order - or in brushstroke order (…er… or does it?)