Mistral 7B Fine-Tune Optimized

>>tosh+(OP)
>Model merging is, to me, one of the most counterintuitive empirical results in modern deep learning. It turns out that you can actually more-or-less naively merge the weights of two different models and produce a new one that captures some or all of the abilities of its parents!

I would hope the article give some more details on model merging. Is it merging two different fine-tuned models, one fine-tuned on dogs, another fine-tuned on cats, and the merging of the two different models is good on cats and dogs as if by magic?

Like fine-tune one model just on Python and test it thoroughly, fine-tune one on Java and test it thoroughly, and then if the need arises for a project that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.

Pretty magical indeed! Let alone the fact, that a separate smaller model of half a billion parameters could figure out how to merge the two together. If the cost of LMs could be reduced by a factor of 100, why not reduce it by a factor of 1000?

>>empora+8U
This is not so surprising if you consider the fact that finetuning is extremely sparse and barely imparts any new knowledge to the model. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch"[1] made this clear:

> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]

Insofar as those adaptations are mostly distinct, you can just preserve both sets and that's what explains successes of merging, I guess.

1. https://arxiv.org/abs/2311.03099

zlacker