zlacker

[parent] [thread] 9 comments
1. valine+(OP)[view] [source] 2023-11-20 20:28:09
I wonder if separate LLMs can find each other’s logical mistakes. If I ask llama to find the logical mistake in Yi output, would that work better than llama finding a mistake in llama output?

A logical mistake might imply a blind spot inherent to the model, a blind spot that might not be present in all models.
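
This would be easy to try with two local models behind an OpenAI-compatible server. A rough sketch (model names and endpoint are placeholders, not a tested setup):

    # Sketch: have model B critique model A's answer.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    question = "If all squares are rectangles, are all rectangles squares? Explain."

    # Model A writes the answer.
    answer = client.chat.completions.create(
        model="llama-2-70b-chat",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Model B looks for logical mistakes in model A's answer.
    critique = client.chat.completions.create(
        model="yi-34b-chat",
        messages=[{"role": "user", "content":
                   f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
                   "Point out any logical mistakes in the answer, or say 'none found'."}],
    ).choices[0].message.content

    print(critique)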

replies(2): >>EricMa+W5 >>sevagh+R8
2. EricMa+W5[view] [source] 2023-11-20 20:52:20
>>valine+(OP)
wouldn't this effectively be using a "model" twice the size?

Would it be better to just double the size of one of the models rather than house both?

Genuine question

replies(4): >>valine+G7 >>averev+Pg >>rainco+ik >>sevagh+GP2
◧◩
3. valine+G7[view] [source] [discussion] 2023-11-20 20:59:51
>>EricMa+W5
Maybe. Goliath 120B took two different llama variants and interwove the layers. Surprisingly, Goliath 120B quantized to 2-bit is outperforming llama 70B at 4-bit in many benchmarks.

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...
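
Conceptually the interleaving is just stacking decoder-layer ranges from two same-architecture checkpoints. A toy sketch (made-up model names and layer ranges, not the actual Goliath recipe):

    # Toy frankenmerge: interleave decoder layers from two Llama-architecture models.
    # Layer ranges and model names are made up for illustration only.
    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    a = AutoModelForCausalLM.from_pretrained("model-a-70b")
    b = AutoModelForCausalLM.from_pretrained("model-b-70b")

    # Stack overlapping slices of each model's layers (a 70B Llama has 80 layers).
    merged = nn.ModuleList(
        list(a.model.layers[0:40])
        + list(b.model.layers[20:60])
        + list(a.model.layers[40:80])
    )

    a.model.layers = merged
    a.config.num_hidden_layers = len(merged)
    a.save_pretrained("toy-merged-model")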

replies(1): >>ghotli+4H
4. sevagh+R8[view] [source] 2023-11-20 21:03:37
>>valine+(OP)
I frequently share responses between ChatGPT (paid version with GPT4) and Copilot-X to break an impasse when trying to generate or fix a tricky piece of code.
◧◩
5. averev+Pg[view] [source] [discussion] 2023-11-20 21:38:17
>>EricMa+W5
Parsing is faster than generating, so having a small model produce the whole output and then having Goliath emit only a single-token "good/bad" evaluation would be faster than having Goliath produce everything. This would be the extreme, ad hoc, iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
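
In crude (non-speculative) form that might look something like this; model names, endpoint, and the one-word scoring prompt are all placeholders:

    # Sketch: cheap model drafts, the big model only emits a one-token verdict.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    def draft(question):
        return client.chat.completions.create(
            model="small-7b-chat",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

    def verdict(question, answer):
        # Constrain the big model to a single token: "good" or "bad".
        return client.chat.completions.create(
            model="goliath-120b",
            max_tokens=1,
            messages=[{"role": "user", "content":
                       f"Question:\n{question}\n\nDraft answer:\n{answer}\n\n"
                       "Reply with exactly one word: good or bad."}],
        ).choices[0].message.content.strip().lower()

    question = "Prove that the sum of two even integers is even."
    for _ in range(3):  # retry the cheap draft until the big model approves
        answer = draft(question)
        if verdict(question, answer) == "good":
            break
    print(answer)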
◧◩
6. rainco+ik[view] [source] [discussion] 2023-11-20 21:53:27
>>EricMa+W5
I think the relationship between model size and training time isn't linear, so a model twice the size takes more resources to train than two of the original models.
◧◩◪
7. ghotli+4H[view] [source] [discussion] 2023-11-21 00:02:44
>>valine+G7
Do you happen to have a link to where that interwoven layers bit is described? As far as I can tell it's not clear on the model cards.
replies(1): >>valine+rX
◧◩◪◨
8. valine+rX[view] [source] [discussion] 2023-11-21 01:55:41
>>ghotli+4H
The model page is the only info I’ve found on it. As far as I can tell there’s no paper published on the technique.

In the “Merge Process” section they at least give the layer ranges.

https://huggingface.co/alpindale/goliath-120b

replies(1): >>ghotli+x51
◧◩◪◨⬒
9. ghotli+x51[view] [source] [discussion] 2023-11-21 02:47:43
>>valine+rX
Ah, actually reviewing that more closely I found a link to it in the acknowledgements.

https://github.com/cg123/mergekit

◧◩
10. sevagh+GP2[view] [source] [discussion] 2023-11-21 15:37:44
>>EricMa+W5
I believe another factor is that the model sometimes responds better to your prompt than at other times. This way you get two dice rolls at your prompt hitting "the good path."