I would hope the article gave some more details on model merging. Is it merging two different fine-tuned models, one fine-tuned on dogs, another fine-tuned on cats, and the merged model is somehow good on both cats and dogs, as if by magic?
Like fine-tune one model just on Python and test it thoroughly, fine-tune one on Java and test it thoroughly, and then if the need arises for a project that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.
Pretty magical indeed! Not to mention that a separate, smaller model of half a billion parameters could figure out how to merge the two together. If the cost of LMs could be reduced by a factor of 100, why not reduce it by a factor of 1000?
It doesn't have order-of-magnitude benefits (or, I'd wager, even 50%) in enabling smaller models. But you nailed it exactly. Fine-tune on dogs, fine-tune on cats, then... just... average the weights. And you get something better than the original, with minimal loss from fine-tuning.
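In PyTorch terms it's roughly this. A minimal sketch, assuming both checkpoints were fine-tuned from the same base model so their state_dicts share keys and shapes (the file names are hypothetical):

    import torch

    # Two checkpoints fine-tuned from the SAME base model (hypothetical files).
    cats = torch.load("finetuned_on_cats.pt")  # state_dict
    dogs = torch.load("finetuned_on_dogs.pt")  # state_dict

    # Uniform average of every parameter tensor; a weighted average
    # (alpha * a + (1 - alpha) * b) is the obvious generalization.
    merged = {name: (cats[name] + dogs[name]) / 2.0 for name in cats}

    torch.save(merged, "merged_cats_and_dogs.pt")

This is the plain weight-averaging ("model soup") flavor of merging; fancier schemes mostly differ in how they weight or select the parameter deltas.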
LoRAs end up being more popular for that use case because they're easier to combine, mix and match, and scale. Model merging is still a key technique for building a successful base model.
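The reason LoRAs mix so easily: each adapter is just a low-rank delta on top of a frozen base weight, so you can scale and add them per adapter. A rough sketch, with shapes and scale factors made up purely for illustration:

    import torch

    d, r = 1024, 8
    base_W = torch.randn(d, d)  # frozen base-model weight (illustrative)

    # Two LoRA adapters trained separately, e.g. one on Python, one on Java.
    A_py, B_py = torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01
    A_java, B_java = torch.randn(r, d) * 0.01, torch.randn(d, r) * 0.01

    def lora_delta(B, A, scale):
        # Each adapter contributes a rank-r update: scale * (B @ A)
        return scale * (B @ A)

    # Mix and match: weight each adapter independently, then add to the base.
    merged_W = base_W + lora_delta(B_py, A_py, 0.7) + lora_delta(B_java, A_java, 0.3)

Because the adapters stay separate from the base weights, you can also just swap them in and out at load time instead of merging at all.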