zlacker

[parent] [thread] 7 comments
1. EricMa+(OP)[view] [source] 2023-11-20 20:52:20
Wouldn't this effectively be using a "model" twice the size?

Would it be better to just double the size of one of the models rather than hosting both?

Genuine question

replies(4): >>valine+K1 >>averev+Ta >>rainco+me >>sevagh+KJ2
2. valine+K1[view] [source] 2023-11-20 20:59:51
>>EricMa+(OP)
Maybe. Goliath 120B took two different Llama variants and interwove their layers. Surprisingly, Goliath 120B quantized to 2-bit is outperforming Llama 70B at 4-bit in many benchmarks.

https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...

replies(1): >>ghotli+8B
3. averev+Ta[view] [source] 2023-11-20 21:38:17
>>EricMa+(OP)
Parsing (processing tokens that already exist) is faster than generating new ones, so having a small model produce a whole output and then having Goliath produce only a single-token "good/bad" evaluation of it would be faster than having Goliath generate everything. This would be the extreme, ad hoc, iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
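
Roughly, something like this toy sketch, where both model calls are stand-in functions and every name and scoring rule is hypothetical; it's only meant to show "small model drafts, big model gives one cheap verdict":

    # Toy sketch: a small model drafts complete candidate answers, and the large
    # model is asked only for a single good/bad verdict per candidate, which is
    # far cheaper than having it generate every token itself. All names and the
    # scoring rule here are made up for illustration.

    def small_model_generate(prompt: str, n_candidates: int = 4) -> list[str]:
        """Cheap draft model: produces several complete candidate answers."""
        return [f"{prompt} -> candidate answer {i}" for i in range(n_candidates)]

    def large_model_verdict(prompt: str, candidate: str) -> bool:
        """Expensive model used only for evaluation: one pass over
        prompt + candidate, emitting a single good/bad token (toy rule here)."""
        return hash((prompt, candidate)) % 3 != 0

    def answer(prompt: str) -> str:
        candidates = small_model_generate(prompt)
        approved = [c for c in candidates if large_model_verdict(prompt, c)]
        # Take the first candidate the big model approves; fall back otherwise.
        return approved[0] if approved else candidates[0]

    if __name__ == "__main__":
        print(answer("Why is the sky blue?"))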
4. rainco+me[view] [source] 2023-11-20 21:53:27
>>EricMa+(OP)
I think the relationship between model size and training cost isn't linear, so a model twice the size takes more resources to train than two models of the original size.
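
As a rough back-of-the-envelope, assuming the common C ≈ 6·N·D FLOPs approximation and that the bigger model also gets proportionally more training tokens (Chinchilla-style); the token counts below are made up purely for illustration:

    # Rough training-cost comparison using the common C ≈ 6 * N * D approximation
    # (FLOPs ≈ 6 * parameters * training tokens). Token counts are assumptions,
    # not real training budgets.

    def train_flops(params: float, tokens: float) -> float:
        return 6 * params * tokens

    N = 70e9      # a 70B-parameter model
    D = 1.4e12    # assume ~1.4T training tokens for the 70B run

    two_small = 2 * train_flops(N, D)        # train two 70B models
    one_double = train_flops(2 * N, 2 * D)   # one 140B model, tokens scaled up too

    print(f"two 70B runs: {two_small:.2e} FLOPs")
    print(f"one 140B run: {one_double:.2e} FLOPs")
    print(f"ratio: {one_double / two_small:.1f}x")  # ~2x the cost of two smaller runs

If the token count were held fixed, cost would scale roughly linearly with parameter count; the non-linearity comes from also scaling the data along with the model.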
5. ghotli+8B[view] [source] [discussion] 2023-11-21 00:02:44
>>valine+K1
Do you happen to have a link to where that interwoven layers bit is described? As far as I can tell it's not clear on the model cards.
replies(1): >>valine+vR
6. valine+vR[view] [source] [discussion] 2023-11-21 01:55:41
>>ghotli+8B
The model page is the only info I’ve found on it. As far as I can tell there’s no paper published on the technique.

In the “Merge Process” section they at least give the layer ranges.

https://huggingface.co/alpindale/goliath-120b
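
As I understand it, it's just a passthrough merge that alternates layer ranges from the two source models into one deeper stack. A toy illustration, where the model names and ranges are placeholders rather than the actual Goliath recipe:

    # Toy illustration of interleaving layer ranges from two models into one
    # deeper "frankenmerge" stack. Names and ranges are placeholders; see the
    # model card's "Merge Process" section for the real layer ranges.

    slices = [
        ("model-a-70b", (0, 16)),
        ("model-b-70b", (8, 24)),
        ("model-a-70b", (16, 32)),
        ("model-b-70b", (24, 40)),
        # ...continue alternating until both source stacks are covered
    ]

    merged_layers = []
    for model_name, (start, end) in slices:
        for layer_idx in range(start, end):
            # In a real passthrough merge the layer weights are copied verbatim;
            # here we only record where each layer of the new model comes from.
            merged_layers.append((model_name, layer_idx))

    print(f"merged depth so far: {len(merged_layers)} layers")
    print(merged_layers[:5])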

replies(1): >>ghotli+BZ
7. ghotli+BZ[view] [source] [discussion] 2023-11-21 02:47:43
>>valine+vR
Ah, actually reviewing that more closely I found a link to it in the acknowledgements.

https://github.com/cg123/mergekit

8. sevagh+KJ2[view] [source] 2023-11-21 15:37:44
>>EricMa+(OP)
I believe another factor is that the model sometimes responds better to your prompt than at other times. This way you get two dice rolls at your prompt hitting "the good path."