I wish the article gave more detail on model merging. Is it merging two different fine-tuned models, say one fine-tuned on dogs and another on cats, so that the merged model is good at both cats and dogs, as if by magic?
Like fine-tune one model just on Python and test it thoroughly, fine-tune one on Java and test it thoroughly, and then if the need arises for a project that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.
Pretty magical indeed! Not to mention that a separate, smaller model of half a billion parameters could figure out how to merge the two. If the cost of LMs could be reduced by a factor of 100, why not by a factor of 1000?
> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]
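
To make the quoted drop-and-rescale step concrete, here is a minimal sketch in plain Python. It assumes "delta parameters" means fine-tuned weights minus base weights, and it treats the model as a flat list of floats. The names `dare`, `base`, `finetuned` and `rng` are mine, not the paper's, and this is not the authors' implementation.

```python
import random

def dare(base, finetuned, p, rng):
    """Sketch of drop-and-rescale on a flat list of weights.

    Assumption: delta = finetuned - base. Each delta is zeroed with
    probability p; survivors are multiplied by 1 / (1 - p), so the
    expected value of each delta is unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    result = []
    for b, f in zip(base, finetuned):
        delta = f - b
        if rng.random() < p:
            delta = 0.0          # dropped: this weight falls back to the base value
        else:
            delta *= scale       # kept: rescaled to compensate for the drops
        result.append(b + delta)
    return result

if __name__ == "__main__":
    rng = random.Random(0)
    base = [0.0] * 100_000
    finetuned = [1.0] * 100_000
    sparse = dare(base, finetuned, p=0.9, rng=rng)
    kept = sum(1 for w in sparse if w != 0.0)
    print("fraction of deltas kept:", kept / len(sparse))
    print("mean weight after DARE:", sum(sparse) / len(sparse))
```

The rescaling is the part that matters: without it, dropping 90% of the deltas would shrink the average update by 10x, while with it the average update stays roughly where it was.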
Insofar as those adaptations are mostly distinct, you can just preserve both sets, and that's what explains the success of merging, I guess.
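
A rough sketch of that intuition, assuming the simplest possible merge: add both sparsified delta sets onto the shared base weights. The helpers `merge` and `overlap` are hypothetical names for illustration, and the actual merging methods discussed in the paper may combine deltas differently.

```python
def merge(base, deltas):
    """Add every delta list onto a copy of the base weights (simple additive merge)."""
    merged = list(base)
    for delta in deltas:
        if len(delta) != len(base):
            raise ValueError("delta length must match base length")
        for i, d in enumerate(delta):
            merged[i] += d
    return merged

def overlap(delta_a, delta_b):
    """Count positions where both deltas are nonzero, i.e. where they can interfere."""
    return sum(1 for a, b in zip(delta_a, delta_b) if a != 0.0 and b != 0.0)

if __name__ == "__main__":
    base = [1.0, 2.0, 3.0, 4.0]
    python_delta = [0.5, 0.0, 0.0, 0.0]   # hypothetical sparsified "Python" delta
    java_delta = [0.0, 0.0, -1.0, 0.0]    # hypothetical sparsified "Java" delta
    print(merge(base, [python_delta, java_delta]))
    print("colliding positions:", overlap(python_delta, java_delta))
```

If each model keeps only 1% of its deltas and the kept positions are independent, only about 0.01 × 0.01 = 0.01% of positions are expected to be touched by both, which is why the two sets can mostly coexist.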