zlacker

[parent] [thread] 9 comments
1. valine+(OP)[view] [source] 2023-11-20 20:28:09
I wonder if separate LLMs can find each other’s logical mistakes. If I ask llama to find the logical mistake in Yi output, would that work better than llama finding a mistake in llama output?

A logical mistake might imply a blind spot inherent to the model, a blind spot that might not be present in all models.
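
This would be easy to try with two local models behind an OpenAI-compatible server. A rough sketch (model names and endpoint are placeholders, not a tested setup):

    # Sketch: have model B critique model A's answer.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    question = "If all squares are rectangles, are all rectangles squares? Explain."

    # Model A writes the answer.
    answer = client.chat.completions.create(
        model="llama-2-70b-chat",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Model B looks for logical mistakes in model A's answer.
    critique = client.chat.completions.create(
        model="yi-34b-chat",
        messages=[{"role": "user", "content":
                   f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
                   "Point out any logical mistakes in the answer, or say 'none found'."}],
    ).choices[0].message.content

    print(critique)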

replies(2): >>EricMa+W5 >>sevagh+R8
2. EricMa+W5[view] [source] 2023-11-20 20:52:20
>>valine+(OP)
wouldn't this effectively be using a "model" twice the size?

Would it be better to just double the size of one of the models rather than house both?

Genuine question

replies(4): >>valine+G7 >>averev+Pg >>rainco+ik >>sevagh+GP2
◧◩
3. valine+G7[view] [source] [discussion] 2023-11-20 20:59:51
>>EricMa+W5
Maybe. Goliath 120B took two different llama variants and interwove the layers. Surprisingly, Goliath 120B quantized to 2-bit is outperforming llama 70B at 4-bit in many benchmarks.

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...
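
Conceptually the interleaving is just stacking decoder-layer ranges from two same-architecture checkpoints. A toy sketch (made-up model names and layer ranges, not the actual Goliath recipe):

    # Toy frankenmerge: interleave decoder layers from two Llama-architecture models.
    # Layer ranges and model names are made up for illustration only.
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    a = AutoModelForCausalLM.from_pretrained("model-a-70b")
    b = AutoModelForCausalLM.from_pretrained("model-b-70b")

    # Stack overlapping slices of each model's layers (a 70B Llama has 80 layers).
    merged = nn.ModuleList(
        list(a.model.layers[0:40])
        + list(b.model.layers[20:60])
        + list(a.model.layers[40:80])
    )

    a.model.layers = merged
    a.config.num_hidden_layers = len(merged)
    a.save_pretrained("toy-merged-model")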

replies(1): >>ghotli+4H
4. sevagh+R8[view] [source] 2023-11-20 21:03:37
>>valine+(OP)
I frequently share responses between ChatGPT (paid version with GPT4) and Copilot-X to break an impasse when trying to generate or fix a tricky piece of code.
◧◩
5. averev+Pg[view] [source] [discussion] 2023-11-20 21:38:17
>>EricMa+W5
Parsing is faster than generating, so having a small model produce the whole output and then having Goliath emit only a single-token "good/bad" evaluation would be faster than having Goliath produce everything. This would be the extreme, ad hoc, iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
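
In crude (non-speculative) form that might look something like this; model names, endpoint, and the one-word scoring prompt are all placeholders:

    # Sketch: cheap model drafts, the big model only emits a one-token verdict.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    def draft(question):
        return client.chat.completions.create(
            model="small-7b-chat",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

    def verdict(question, answer):
        # Constrain the big model to a single token: "good" or "bad".
        return client.chat.completions.create(
            model="goliath-120b",
            max_tokens=1,
            messages=[{"role": "user", "content":
                       f"Question:\n{question}\n\nDraft answer:\n{answer}\n\n"
                       "Reply with exactly one word: good or bad."}],
        ).choices[0].message.content.strip().lower()

    question = "Prove that the sum of two even integers is even."
    for _ in range(3):  # retry the cheap draft until the big model approves
        answer = draft(question)
        if verdict(question, answer) == "good":
            break
    print(answer)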
◧◩
6. rainco+ik[view] [source] [discussion] 2023-11-20 21:53:27
>>EricMa+W5
I think the relationship between model size and training time isn't linear, so a model twice the size takes more resources to train than two of the original models.
◧◩◪
7. ghotli+4H[view] [source] [discussion] 2023-11-21 00:02:44
>>valine+G7
Do you happen to have a link to where that interwoven layers bit is described? As far as I can tell it's not clear on the model cards.
replies(1): >>valine+rX
◧◩◪◨
8. valine+rX[view] [source] [discussion] 2023-11-21 01:55:41
>>ghotli+4H
The model page is the only info I’ve found on it. As far as I can tell there’s no paper published on the technique.

In the “Merge Process” section they at least give the layer ranges.

https://huggingface.co/alpindale/goliath-120b

replies(1): >>ghotli+x51
◧◩◪◨⬒
9. ghotli+x51[view] [source] [discussion] 2023-11-21 02:47:43
>>valine+rX
Ah, actually reviewing that more closely I found a link to it in the acknowledgements.

https://github.com/cg123/mergekit

◧◩
10. sevagh+GP2[view] [source] [discussion] 2023-11-21 15:37:44
>>EricMa+W5
I believe another factor is that the model sometimes responds better to your prompt than at other times. This way you get two dice rolls at your prompt hitting "the good path."