Arcee’s MergeKit: A Toolkit for Merging Large Language Models

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers,
Vlad Karpukhin, Brian Benedict, Mark McQuade, Jacob Solawetz
Arcee, Florida, USA
{charles, shamane, malikeh, luke, vlad, benedict, mark, jacob}@arcee.ai

Abstract

The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in the development of a vast number of task-specific models, typically specialized in individual tasks and unable to utilize each other’s strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI, including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world’s most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.

1 Introduction

Over the last year, open-source LLMs have developed rapidly, and these models are accessible via the Hugging Face model hub Wolf et al. (2019). They are typically trained on corpora comprising trillions of tokens and contain between 1 and 70 billion parameters Minaee et al. (2024); Zhang et al. (2024). The landscape of open-source checkpoints is rich and varied, with a distinction generally made between pretrained checkpoints Zhuang et al. (2020) and those specifically aligned for instruction-following tasks across diverse domains (e.g., coding Roziere et al. (2023), medical applications Wu et al. (2023), etc.). However, fine-tuning a separate model for each task raises two major challenges: (1) for each new task, the task-specific model must be stored and deployed separately, and (2) models trained independently cannot utilize insights from related tasks to enhance performance within their domain or generalize beyond it Sanh et al. (2021); Ramé et al. (2023); Yadav et al. (2024); Yu et al. (2023).

Training these models from scratch represents a formidable investment, exemplified by the Mistral-7B model Jiang et al. (2023), which demands an outlay of 2 to 3 million USD. In addition, further fine-tuning pretrained models can lead to catastrophic forgetting De Lange et al. (2021), which is a phenomenon occurring when further refinement of these models results in a degradation of their original general capabilities, thereby impeding their ability to perform optimally across a range of tasks Cheng et al. (2023); Wu et al. (2024). Furthermore, the process of aligning these models to respond in a favorable fashion takes extensive effort to gather human preference data, and is impossible for most teams to replicate Wang et al. (2023); Rafailov et al. (2024). This backdrop sets the stage for the pivotal question of how to maximally leverage existing pretrained checkpoints for research and industrial use cases. In this context, research in model merging has emerged as a transformative strategy, which combines the parameters of multiple individual models, often trained for specific tasks, into a single unified model. It enables multitask learning and continual learning while reducing the risk of catastrophic forgetting, all without the prohibitive costs of retraining from scratch Yadav et al. (2023); Yu et al. (2023).

MergeKit (https://github.com/arcee-ai/MergeKit) is a centralized library that provides tooling to execute merging strategies as they are formulated by the community. It facilitates the integration of various state-of-the-art merging techniques, offering a platform for the community to seamlessly combine models and set new benchmarks. MergeKit’s architecture is designed with a focus on scalability and functionality, providing support for execution on memory-constrained CPUs as well as on accelerated GPUs.

To date, MergeKit encompasses a broad array of merging techniques and has been instrumental in the development of thousands of merged models, many of which have ranked at or near the top of the Open LLM Leaderboard (e.g., BioMistral Labrak et al. (2024), OpenPipe’s Mistral 7B Fine-Tune Optimized Corbitt (2023), etc.).

The contributions of this paper are manifold, aiming to:

1. Provide an overview of model merging research to date.

2. Introduce key objectives, architectural decisions, and development principles of MergeKit to establish an extensible foundation for the future efforts of the model merging community.

2 Background & Related Work

2.1 The Concept of Model Merging

Model merging Ainsworth et al. (2022), though a relatively recent focal point within the research community, builds upon a foundation laid by numerous prior studies. At its core, model merging involves integrating two or more pretrained models — whether they’ve been trained on identical tasks or distinct architectures — into a unified model that retains the strengths and capabilities of all the original models. Much of this builds upon concepts explored with weight averaging, such as in Utans (1996). The success of these techniques relies on the concept of mode connectivity Garipov et al. (2018). In the simplest cases, techniques take advantage of the Linear Mode Connectivity (LMC) Entezari et al. (2021) of models fine-tuned from a common pretrained model Nagarajan and Kolter (2019); Neyshabur et al. (2021). Other works build on this concept by employing the permutation symmetry and applying transformations to the weights of models that bring them into common regions in the loss landscape Ainsworth et al. (2022); Stoica et al. (2023); Verma and Elbayad (2024).

2.2 Different Types of Model Merging

In the development of our toolkit, as depicted in Figure 1, we categorize both existing and anticipated model merging techniques. This systematic classification aims to enhance our understanding of their capabilities, focusing on two critical aspects: weight initializations and the architectural configurations of various checkpoints.

2.2.1 Merging Models with Both Identical Architectures and Initializations

This section explores various model merging techniques that utilize the Linear Mode Connectivity (LMC) Nagarajan and Kolter (2019) between two neural network checkpoints to derive a final merged model. Essentially, the methods described here employ forms of linear interpolation techniques. A key requirement for these methods is that the models to be merged must have identical architectures as well as identical initializations. However, models with the same initializations that have been fine-tuned are also compatible with these techniques.

The simplest method, built upon the results of the weight averaging literature Utans (1996); Smith and Gashler (2017); Garipov et al. (2018); Izmailov et al. (2018) and the Model Soups Wortsman et al. (2022) approach, is linear averaging of weights. This technique relies on linear mode connectivity and is the foundation of most others.
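As a minimal sketch of linear weight averaging, the snippet below merges checkpoints that share parameter names and shapes. Plain Python lists stand in for tensors, and the function name `linear_merge` and the toy parameter names are assumptions made for illustration, not part of MergeKit.

```python
from typing import Dict, List, Sequence

# Parameter name -> flattened weights (lists stand in for real tensors).
StateDict = Dict[str, List[float]]


def linear_merge(models: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Return the weighted average of checkpoints with identical parameter names."""
    if not models or len(models) != len(weights):
        raise ValueError("need at least one model and exactly one weight per model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    merged: StateDict = {}
    for name in models[0]:
        # Walk the same position of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        merged[name] = [
            sum(w * x for w, x in zip(weights, col)) / total for col in columns
        ]
    return merged


# Toy checkpoints (hypothetical values).
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
soup = linear_merge([model_a, model_b], weights=[0.5, 0.5])
print(soup)
```

With equal weights this reduces to a uniform average; unequal weights bias the result toward one parent.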

Task Arithmetic Ilharco et al. (2022) expands upon this approach by introducing the concept of task vectors, showing that performing arithmetic on the differences between fine-tuned models and a common base model is both useful and semantically meaningful.
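The idea of task vectors can be sketched in a few lines: subtract the base weights from each fine-tuned model, sum the differences, and add a scaled version back to the base. The helper names, the `scale` parameter, and the toy checkpoints are assumptions for illustration only.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]  # parameter name -> flattened weights


def task_vector(finetuned: StateDict, base: StateDict) -> StateDict:
    """Difference between a fine-tuned checkpoint and its base model."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(
    base: StateDict, vectors: Sequence[StateDict], scale: float = 1.0
) -> StateDict:
    """Add the scaled sum of task vectors to the base model."""
    merged: StateDict = {}
    for name, base_vals in base.items():
        summed = [sum(col) for col in zip(*(v[name] for v in vectors))]
        merged[name] = [b + scale * s for b, s in zip(base_vals, summed)]
    return merged


# Hypothetical base model and two fine-tunes of it.
base = {"w": [1.0, 1.0, 1.0]}
math_model = {"w": [2.0, 1.0, 1.0]}
code_model = {"w": [1.0, 1.0, 3.0]}

vectors = [task_vector(math_model, base), task_vector(code_model, base)]
merged = apply_task_vectors(base, vectors, scale=0.5)
print(merged)
```

A negative `scale` subtracts a task vector instead, which corresponds to removing a behavior rather than adding one.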

Trim, Elect Sign & Merge (TIES merging) Yadav et al. (2023), Model Breadcrumbs Davari and Belilovsky (2023), and Drop And REscale (DARE) Yu et al. (2023) further introduce methods for sparsifying and combining these task vectors that enable larger numbers of models to be combined into one without degrading capabilities.
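The sparsification steps can be illustrated on flat task vectors. The sketch below follows the published descriptions: TIES trims each task vector to its largest-magnitude entries, elects a sign per parameter from the summed trimmed values, and averages only the entries that agree with that sign; DARE randomly drops entries and rescales the survivors. Function names, the `density` and `drop_rate` parameters, and the toy vectors are assumptions for illustration, not MergeKit's implementation.

```python
import random
from typing import List, Sequence


def trim(vector: Sequence[float], density: float) -> List[float]:
    """Keep the `density` fraction of entries with the largest magnitude."""
    keep = max(1, round(density * len(vector)))
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(ranked[:keep])
    return [x if i in kept else 0.0 for i, x in enumerate(vector)]


def ties_merge(vectors: Sequence[Sequence[float]], density: float) -> List[float]:
    """Trim, elect a sign per parameter, then average the agreeing entries."""
    trimmed = [trim(v, density) for v in vectors]
    merged = []
    for column in zip(*trimmed):
        elected = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [x for x in column if x * elected > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare(vector: Sequence[float], drop_rate: float, rng: random.Random) -> List[float]:
    """Drop each entry with probability `drop_rate`, rescale the rest."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_prob = 1.0 - drop_rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in vector]


# Two hypothetical task vectors with a sign conflict at index 0.
tv_a = [4.0, -3.0, 0.1, 2.0]
tv_b = [-1.0, -5.0, 3.0, 0.2]
ties_delta = ties_merge([tv_a, tv_b], density=0.75)
dare_delta = dare(tv_a, drop_rate=0.5, rng=random.Random(0))
print(ties_delta, dare_delta)
```

In both cases the resulting delta would be added back to the common base model, as in Task Arithmetic.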

The use of the Spherical Linear intERPolation (SLERP) technique Shoemake (1985) to interpolate between model checkpoints represents an extension of simple weight averaging, and its success shows that there is often a spherical path with a lower loss barrier than a direct linear interpolation. SLERP (https://github.com/Digitous/LLM-SLERP-Merge) leverages the geometric and rotational properties within the models’ vector space, ensuring a blend that more accurately embodies the characteristics of both parent models.
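For reference, the SLERP formula of Shoemake (1985) applied to two flattened weight vectors can be sketched as follows. The fallback to linear interpolation for nearly collinear or zero vectors and the `eps` threshold are assumptions of this sketch, not a description of any particular implementation.

```python
import math
from typing import List, Sequence


def slerp(t: float, v0: Sequence[float], v1: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between v0 (t=0) and v1 (t=1)."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    lerp = [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # the angle is undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp  # nearly collinear: the spherical formula is unstable
    s0 = math.sin((1.0 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


midpoint = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(midpoint)
```

For two orthogonal unit vectors the midpoint stays on the unit circle, whereas a plain average would have norm of about 0.71.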

It should be emphasized that the methods described above do not require training data for merging operations or further fine-tuning post-merging. Other approaches expand on this paradigm by introducing varied weightings for different neural network parameters, which requires computing activations on training data to determine these weightings. Matena and Raffel (2022) explore the use of the Fisher information matrix, and Jin et al. (2022) introduce the RegMean method, which allows merges to produce optimal weights with respect to $L_2$ distance to model predictions while keeping training data private.
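To make the idea of per-parameter weightings concrete, the sketch below averages parameters weighted by a diagonal Fisher information estimate, in the spirit of Matena and Raffel (2022). It assumes the Fisher values have already been computed from data, and it falls back to a plain mean where all estimates are zero; both the fallback and the names used are assumptions of this illustration.

```python
from typing import List, Sequence


def fisher_merge(
    models: Sequence[Sequence[float]],
    fishers: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> List[float]:
    """Per-parameter average weighted by each model's diagonal Fisher estimate."""
    merged = []
    for params, infos in zip(zip(*models), zip(*fishers)):
        denom = sum(infos)
        if denom < eps:
            # No model is confident about this parameter: use a plain mean.
            merged.append(sum(params) / len(params))
        else:
            merged.append(sum(f * p for f, p in zip(infos, params)) / denom)
    return merged


# Hypothetical flattened parameters and precomputed Fisher diagonals.
theta_a, fisher_a = [1.0, 0.0], [3.0, 0.0]
theta_b, fisher_b = [3.0, 4.0], [1.0, 0.0]
merged = fisher_merge([theta_a, theta_b], [fisher_a, fisher_b])
print(merged)
```

The first parameter is pulled toward model A because its Fisher estimate is three times larger there.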

2.2.2 Merging Models with Identical Architectures and Different Initializations

This section explores merging methods that extend beyond simply combining checkpoints with identical initializations. Previous research has indicated that, for checkpoints trained from different initializations, straightforward approaches that rely on Linear Mode Connectivity (LMC) fall short Ainsworth et al. (2022). This line of work therefore exploits the permutation symmetry of neural networks to align checkpoints trained from different initializations prior to merging Verma and Elbayad (2024); Ainsworth et al. (2022).

Git-Rebasin Ainsworth et al. (2022) explores the impact of permutation symmetries in neural network loss landscapes on model merging. They introduce algorithms for aligning the weights of two independently-trained models to achieve functionally equivalent weight configurations, facilitating model merging in weight space. Their empirical results demonstrate this capability across various architectures and datasets, revealing that neural network optimization often leads to a single, permutation-symmetry-informed basin.

Similarly, prior work on Optimizing Mode Connectivity via Neuron Alignment Tatro et al. (2020) and Optimal Transport Fusion (OTFusion) Singh and Jaggi (2020) posits that permutation symmetries of neural network hidden units can be exploited to reduce the interpolation barrier between models. These methods introduce various strategies for assigning correspondences between model neurons and perform simple interpolation in the transformed weight space.
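The permutation symmetry these methods rely on can be demonstrated on a toy two-layer network: reordering the hidden units (rows of the first weight matrix and the matching columns of the second) leaves the function unchanged, yet naively averaging the two equivalent weight sets does not. The network, weights, and helper names below are invented for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mlp(w1: Matrix, w2: Matrix, x: List[float]) -> List[float]:
    """Two-layer network: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def permute_hidden(w1: Matrix, w2: Matrix, perm: List[int]):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    return [w1[i] for i in perm], [[row[i] for i in perm] for row in w2]


def average(a: Matrix, b: Matrix) -> Matrix:
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]  # 3 hidden units, 2 inputs
w2 = [[1.0, 2.0, -1.0]]                      # 1 output
x = [2.0, 1.0]

p1, p2 = permute_hidden(w1, w2, perm=[2, 0, 1])
y_original = mlp(w1, w2, x)[0]
y_permuted = mlp(p1, p2, x)[0]               # same function, different weights
y_naive_avg = mlp(average(w1, p1), average(w2, p2), x)[0]
print(y_original, y_permuted, y_naive_avg)
```

Alignment methods search for the permutation that undoes this kind of mismatch before interpolating.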

Imfeld et al. (2023); Verma and Elbayad (2024) extend these methods to support Transformer-based model architectures. Jordan et al. (2022) expound on the problem of variance collapse in interpolated deep networks and propose a rescaling step that further reduces loss barriers between permuted models.

It is important to note that the above methodologies facilitate merging of models which share architectures and sizes, despite them not being linearly mode connected — differences that may stem from varied random initializations, extensive continued pretraining, or other distinct reasons.

Further expanding the scope, compared to above-mentioned methods, ZipIt Stoica et al. (2023) explores the possibility of merging models of similar architectures that have been trained on distinct tasks. This method generalizes model merging by supporting ‘zipping’ of correlated features within and across each model, and it permits partial merging of models up to a specified layer, thereby producing a multi-head model. This represents a significant step forward in the flexibility and applicability of model merging techniques, addressing the challenge of preserving and integrating the knowledge of models from different domains within a single unified framework without any additional training.

2.2.3 Fusing Models with Different Architectures

While not strictly model merging, Composition to Augment Language Models (CALM) Bansal et al. (2024) and recent approaches like knowledge fusion approach for large language models (FUSELLM) Wan et al. (2024) represent significant steps towards the fusion of models with diverse architectures. CALM utilizes cross-attention mechanisms to blend representations from different models, aiming to leverage their combined strengths and functionalities, thus facilitating integration across varied neural network structures. Similarly, FUSELLM Wan et al. (2024) focuses on harnessing the generative capabilities of source Large Language Models (LLMs) by aligning and fusing their probabilistic distributions. This strategy aims to externalize and amplify the collective knowledge and unique advantages of individual models, enhancing the overall capabilities of the fused LLM. In contrast to the approaches discussed above, the methods outlined in this section require additional continual pretraining of the models.

Figure 1: Classification of model merging methods. We currently support the model merging methods outlined on the left, and we are actively working to incorporate additional merging techniques such as ZipIt, OT Fusion, and Git Rebasin.

2.3 Practical Use Cases of Model Merging

Model merging has found its place in a variety of practical applications, significantly impacting the landscape of machine learning models available on platforms such as HuggingFace’s model hub Wolf et al. (2019). These merged models, which will be detailed further, have demonstrated competitive performance across a range of tasks. A notable example of this is BioMistral Labrak et al. (2024), a project that merges domain-adapted checkpoints with existing Mistral chat variants, showcasing the efficacy of model merging in enhancing performance in specialized domains. OpenPipe’s Mistral 7B Fine-Tune Optimized Corbitt (2023) demonstrates the promise of merging fine-tuned models to produce a high-quality base for further tuning as in Choshen et al. (2022). Wei et al. (2024) illustrate that employing the MergeKit tool for model fusion is a successful method for enhancing the performance of hallucination detection.

The success stories of merged models underscore the technique’s value in continuous learning and multitask learning scenarios. By utilizing the broad spectrum of open-sourced LLMs, model merging enables the creation of versatile and robust models capable of excelling at multiple tasks simultaneously or adapting to new domains without the need for training from scratch. This approach not only maximizes the utility of existing resources but also paves the way for innovative solutions in leveraging pretrained models for complex, real-world problems.

3 Library Design: Key Design Principles

MergeKit has been engineered to facilitate the straightforward application of both current and forthcoming model merging techniques. Our repository includes detailed tutorials and IPython notebooks (https://github.com/arcee-ai/mergekit/tree/main/examples) to guide users through the process of utilizing MergeKit effectively. This section is dedicated to outlining the fundamental design principles underpinning the library, with the aim of assisting the open-source community in adopting our toolkit and incorporating new techniques.

3.1 User-Centric Design: Intuitive Interface and YAML Configuration Control

The primary interface for MergeKit is through YAML configuration files that allow users of all skill levels to define complex merge operations without the need for coding experience. This approach both democratizes the use of MergeKit and fosters community engagement by making merge recipes easily repeatable, shareable, and remixable.

A YAML configuration file (for example, https://github.com/arcee-ai/mergekit/blob/main/examples/ties.yml) defines the merge method, input models, and any parameters necessary for the merging algorithm selected. Parameters can be set globally or targeted to specific model components, and can be specified as constant scalar values or as layer-varying interpolated gradients. These different levels of granularity offer an easy introduction for simple merges while allowing power users to define truly complex operations.
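To illustrate what a layer-varying interpolated gradient means, the sketch below expands either a scalar or a short list of anchor values into one value per layer by linear interpolation. The dictionary mirrors the general shape of a merge configuration, but the key names are assumed from the example file referenced above, the model names are placeholders, and the `expand_gradient` helper is hypothetical; the example file in the repository remains the authoritative reference.

```python
from typing import List, Sequence, Union

ParamValue = Union[float, Sequence[float]]


def expand_gradient(value: ParamValue, num_layers: int) -> List[float]:
    """Expand a scalar or a list of anchor values into one value per layer."""
    if isinstance(value, (int, float)):
        return [float(value)] * num_layers
    anchors = [float(v) for v in value]
    if len(anchors) == 1 or num_layers == 1:
        return [anchors[0]] * num_layers
    result = []
    for layer in range(num_layers):
        # Position of this layer along the anchor list, in [0, len(anchors) - 1].
        pos = layer / (num_layers - 1) * (len(anchors) - 1)
        lo = min(int(pos), len(anchors) - 2)
        frac = pos - lo
        result.append((1.0 - frac) * anchors[lo] + frac * anchors[lo + 1])
    return result


# Dictionary form of a merge recipe (assumed keys, placeholder model names).
config = {
    "merge_method": "ties",
    "base_model": "example-org/base-7b",
    "models": [
        {"model": "example-org/finetune-a", "parameters": {"weight": 1.0}},
        {"model": "example-org/finetune-b", "parameters": {"weight": [0.0, 1.0]}},
    ],
}

per_layer = [
    expand_gradient(m["parameters"]["weight"], num_layers=5) for m in config["models"]
]
print(per_layer)
```

Here the second model contributes nothing at the first layer and fully at the last, while the first model keeps a constant weight.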

3.2 Modularity: Plug-and-Play Components

MergeKit is designed with composability and reusability as guiding principles. Merge methods are designed to be interchangeable and easy-to-add. Components are structured such that they can be added, removed, or interchanged to allow customization and experimentation. Wherever possible, components are designed to be useful standalone for external use. For instance, MergeKit’s lazy tensor loading functionality is a core component of the toolkit, but is also simple and convenient to pull into one-off scripts.

3.3 Interoperability: Framework Compatibility

Engineered for flawless integration with the HuggingFace Transformers library Wolf et al. (2019) and its model hub, MergeKit enables users to effortlessly combine various open-sourced checkpoints, facilitating the merging of diverse models available in the community.

3.4 Scalability: Efficiency and Performance Optimization

MergeKit is designed specifically to address the challenge of merging large pretrained language models, ensuring compatibility and efficiency across a diverse range of computational resources. At the heart of its efficiency is an out-of-core approach to model merging. By loading only the tensors necessary for each individual operation into working memory, MergeKit can scale from a high-end research cluster all the way down to a personal laptop with no GPU and limited RAM.
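The out-of-core idea can be sketched with a generator that processes one tensor at a time, so that only the inputs for the current tensor are held in memory. The loader callables below stand in for reading a single tensor from disk; all names are hypothetical and the sketch is not MergeKit's loading code.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

Tensor = List[float]
Loader = Callable[[str], Tensor]  # loads ONE named tensor on demand


def merge_out_of_core(
    tensor_names: Iterable[str],
    loaders: Sequence[Loader],
    merge_fn: Callable[[List[Tensor]], Tensor],
) -> Iterator[Tuple[str, Tensor]]:
    """Yield merged tensors one by one, holding only the current inputs."""
    for name in tensor_names:
        inputs = [load(name) for load in loaders]  # only this tensor is resident
        yield name, merge_fn(inputs)
        # `inputs` is rebound on the next iteration, releasing the old tensors.


def make_loader(checkpoint: Dict[str, Tensor]) -> Loader:
    """Stand-in for lazy disk access; a real loader would read from a file."""
    return lambda name: list(checkpoint[name])


def mean(tensors: List[Tensor]) -> Tensor:
    return [sum(col) / len(col) for col in zip(*tensors)]


ckpt_a = {"embed": [1.0, 3.0], "head": [2.0]}
ckpt_b = {"embed": [3.0, 5.0], "head": [4.0]}
merged = dict(
    merge_out_of_core(ckpt_a.keys(), [make_loader(ckpt_a), make_loader(ckpt_b)], mean)
)
print(merged)
```

In practice each yielded tensor would be written to the output checkpoint immediately rather than collected into a dictionary.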

3.4.1 Computational Graph Scheduling

MergeKit internally represents a merge as a directed acyclic graph of operations, or Task instances. This representation is used to schedule the execution of tasks such that the working set needed at any given time is minimized. Execution of the graph also implicitly handles eviction of intermediate values that are no longer needed. This infrastructure allows developers to build new merge methods that benefit from MergeKit’s memory efficiency and hardware scalability with little to no extra effort.
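A minimal sketch of graph execution with eviction is shown below. It runs tasks in a topological order from the standard library and frees each intermediate value once its last consumer has run. Unlike the scheduler described above, it makes no attempt to choose the order that minimizes the working set, and the task names and `execute` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Any, Callable, Dict, List, Set, Tuple


def execute(
    graph: Dict[str, List[str]],         # task -> tasks it depends on
    funcs: Dict[str, Callable[..., Any]],
    targets: Set[str],
) -> Tuple[Dict[str, Any], int]:
    """Run tasks in dependency order, evicting values with no remaining users."""
    remaining_uses = {task: 0 for task in graph}
    for deps in graph.values():
        for dep in deps:
            remaining_uses[dep] += 1

    values: Dict[str, Any] = {}
    peak = 0
    for task in TopologicalSorter(graph).static_order():
        values[task] = funcs[task](*(values[d] for d in graph[task]))
        peak = max(peak, len(values))    # size of the live working set
        for dep in graph[task]:
            remaining_uses[dep] -= 1
            if remaining_uses[dep] == 0 and dep not in targets:
                del values[dep]          # no later task needs this value
    return {t: values[t] for t in targets}, peak


graph = {"load_a": [], "load_b": [], "avg": ["load_a", "load_b"], "save": ["avg"]}
funcs = {
    "load_a": lambda: [1.0, 3.0],
    "load_b": lambda: [3.0, 5.0],
    "avg": lambda a, b: [(x + y) / 2 for x, y in zip(a, b)],
    "save": lambda t: list(t),           # stand-in for writing to disk
}
results, peak = execute(graph, funcs, targets={"save"})
print(results, peak)
```

Because eviction is driven by reference counts on the graph, a new merge method only has to declare its dependencies to benefit from it.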

3.5 Community Engagement and Support: Regular Updates and Maintenance

We facilitate discussions, feedback, and collaboration among users and contributors, and ensure that MergeKit stays current with the latest developments in model merging and machine learning.

4 Extensibility of MergeKit

Given the rapid success of model merging techniques and the anticipated development of innovative methods, we invite the community to contribute novel merging strategies and enhancements, thereby contributing to the growth and refinement of MergeKit. This section aims to provide a streamlined guide on integrating new merging methods into MergeKit, utilizing existing functionalities where applicable to facilitate the process.

To incorporate a new merging method into MergeKit, contributors should familiarize themselves with several key Python modules within the repository:

• graph.py: Handles the scheduling, execution, and data management throughout the merge process. This is the heart of MergeKit’s performance and resource efficiency. Working with this module to guide the lifecycle of intermediate results and data movement across devices is crucial.

• merge_methods/base.py: Defines the interface that new merge methods must implement.

• plan.py: Responsible for planning the merging process. If a new merging strategy has different steps involved or inputs required in combining multiple models, they should be defined here.

• architecture.py: This module deals with the structures of different checkpoints. When adding a new method, ensure compatibility with existing model architectures.

Each module plays a distinct role in the merging pipeline and must be considered when extending MergeKit’s capabilities. Figure 2 is a graphical representation of the repository structure, indicating where to find these modules.
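The plug-in style described above can be pictured as an abstract interface plus a registry that maps a method name from the configuration to an implementation. Everything in the sketch below (class names, the `merge_tensor` signature, the registry) is invented for illustration and does not mirror the actual contents of merge_methods/base.py; contributors should consult that module for the real interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type

Tensor = List[float]


class MergeMethodSketch(ABC):
    """Hypothetical interface: one method that merges a single tensor."""

    name: str

    @abstractmethod
    def merge_tensor(self, tensors: Sequence[Tensor], **params) -> Tensor:
        ...


REGISTRY: Dict[str, MergeMethodSketch] = {}


def register(cls: Type[MergeMethodSketch]) -> Type[MergeMethodSketch]:
    """Class decorator that makes a method selectable by name."""
    REGISTRY[cls.name] = cls()
    return cls


@register
class LinearSketch(MergeMethodSketch):
    name = "linear"

    def merge_tensor(self, tensors: Sequence[Tensor], **params) -> Tensor:
        weights = params.get("weights", [1.0] * len(tensors))
        total = sum(weights)
        return [
            sum(w * x for w, x in zip(weights, col)) / total for col in zip(*tensors)
        ]


# Look the method up by the name a configuration file would carry.
method = REGISTRY["linear"]
merged = method.merge_tensor([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 1.0])
print(merged)
```

Under this pattern, adding a method means adding one class; scheduling and tensor loading stay untouched.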

5 Popularity and Effectiveness of MergeKit

Model	Medical Benchmarks			General Benchmarks
	USMLE	MedMCQA	PubMedQA	Arc Challenge	HellaSwag	MMLU
Llama2-7B-Chat Touvron et al. (2023)	35.90	35.45	73.40	44.20	55.40	46.37
Meditron-7B Chen et al. (2023)	38.40	24.07	71.40	40.20	54.50	33.06
MeditronLlama-7B-Lerp	39.10	36.65	75.60	46.76	58.66	48.44
MeditronLlama-7B-Slerp	39.20	36.91	75.60	46.84	58.67	47.97
MeditronLlama-7B-Dare-Ties	36.37	27.56	72.20	42.92	54.79	41.17
MeditronLlama-7B-Ties	38.73	32.27	75.60	45.05	58.23	45.03

Table 1: Comparison of the Llama2-7B Chat and Meditron-7B Chen et al. (2023) models, plus their merged variants, using MergeKit techniques across medical and general benchmarks. It highlights the best-performing models in bold for each metric.

The use of model merging techniques in the development and refinement of LLMs has gained considerable attention within the machine learning community. This trend is evidenced by the Open LLM Leaderboard Beeching et al. (2023) data as of March 15th, 2024, which highlights the increasing prevalence of merged models among high-performing LLMs. Specifically, merged models represent 20% of the top 50 and 34% of the top 100 models, underscoring their significance in current LLM advancements. In addition, the current best-performing 3B (liminerity/Phigments12) and 7B (liminerity/M7-7b) open-source models and the third-ranked 13B open-source model (RubielLabarta/LogoS-7Bx2-MoE-13B-v0.2) on the Open LLM Leaderboard are merged models built with MergeKit. The MergeKit repository’s rapid rise in popularity is illustrated by Figure 3, which shows a significant and accelerating number of GitHub stars over time, an indicator of its growing influence and user endorsement in the developer community.

Figure 4 shows a visual comparison of the performance of the top 50 open-source models ranked by their ‘Average’ scores on the Open LLM Leaderboard Park (2023). The ‘Average’ score is the mean of the evaluation scores on the ‘ARC’ Clark et al. (2018), ‘HellaSwag’ Zellers et al. (2019), ‘MMLU’ Hendrycks et al. (2021), ‘TruthfulQA’ Lin et al. (2022), ‘Winogrande’ Sakaguchi et al. (2019), and ‘GSM8K’ Cobbe et al. (2021) benchmarks. The dotted plot in black illustrates the parameter count for each model, and the bar chart represents the average score of each model. Merged models are depicted in ruby red bars, while the rest are colored in pink. The majority of leading merged models contain around 7 to 13 billion parameters. Among them, the top-performing merged model, ‘RubielLabarta/LogoS-7Bx2-MoE-13B-v0.2’ (https://huggingface.co/RubielLabarta/LogoS-7Bx2-MoE-13B-v0.2), ranks 17th on the list. Remarkably, apart from models utilizing mixture of experts (MoE) Gale et al. (2023), the majority of unmerged models surpassing this performance comprise approximately 35 to 70+ billion parameters. As depicted by the Open LLM Leaderboard, merged models, particularly those incorporating a mixture of experts (MoE) architecture, emerge as formidable contenders, offering superior performance while optimizing computational resources, a testament to the transformative potential of model merging in advancing the frontier of natural language processing.

5.1 Practical Example: Applying Model Merging in Medical Domain

As illustrated in Table 1, we experimented with a range of merging techniques available in MergeKit, including Linear intERPolation (LERP), SLERP, TIES, and DARE-TIES, to merge the Meditron-7B Chen et al. (2023) checkpoint with the Llama2-7B chat model Touvron et al. (2023). The Meditron-7B checkpoint is based on the Llama2-7B base model and is extensively pretrained on a comprehensively curated medical corpus, so both models share the Llama2-7B base model. According to the findings, the merged models outperform the Meditron-7B model across the medical benchmarks, namely the US Medical License Exam (USMLE) Jin et al. (2021), Medical Multiple-Choice Question Answering (MedMCQA) Pal et al. (2022), and PubMed (https://pubmed.ncbi.nlm.nih.gov/) Question Answering (PubMedQA) Jin et al. (2019), with the single exception of DARE-TIES on USMLE (36.37 versus 38.40). Furthermore, models merged using the LERP and SLERP techniques exhibit superior performance over the Llama2-7B chat model in general benchmarks. Our empirical experiments highlight the varying capabilities of merged models and provide comparative performance insights. Within the medical domain, the SLERP method appears to outperform others. More importantly, these experiments reveal how model merging can lead to the development of more generalized models with enhanced capabilities across diverse applications.

6 Conclusion and Future Work

In this paper, we have introduced MergeKit, an innovative open-source tool designed to facilitate the seamless integration of large language models. Beyond detailing the functionalities of the library, we have provided a synthesis of recent literature on model merging from an engineering standpoint, focusing specifically on two primary categories: Merging Models with Identical Architectures and Initializations, and Merging Models with Identical Architectures but Different Initializations. Furthermore, we offered insights on incorporating new merging techniques, thereby encouraging researchers in the open-source community to contribute their novel methods by leveraging our existing capabilities. It is crucial to emphasize that MergeKit represents a dynamic project, committed to the continuous incorporation of new methodologies through collaborative efforts with the open-source community. For an up-to-date list of currently supported methods, we invite readers to consult the README of our repository (https://github.com/arcee-ai/mergekit.git).

Ethics Statement

As stewards of the open-source community dedicated to the advancement of LLMs, our work with MergeKit underscores a commitment to democratizing access to cutting-edge AI technologies while fostering an environment of ethical integrity and continuous improvement. By providing an open-source toolkit that enables the merging of model checkpoints, we aim to enhance the collaborative capabilities of researchers, developers, and practitioners across the globe, encouraging innovation and the sharing of knowledge. In doing so, we are acutely aware of the necessity to uphold principles of fairness, accountability, and transparency within this community. This includes the proactive identification and mitigation of biases within merged models, ensuring the ethical use of data, and maintaining the privacy and security of information. Our commitment extends beyond technological advancements, encompassing the responsibility to engage with diverse stakeholders, gather feedback, and adapt our approaches to address ethical concerns effectively. We recognize the imperative to continually evolve our practices, striving for solutions that not only push the boundaries of AI but also do so with an unwavering commitment to the improvement of society.

References

Ainsworth et al. (2022) Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. 2022. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836.
Bansal et al. (2024) Rachit Bansal, Bidisha Samanta, Siddharth Dalmia, Nitish Gupta, Shikhar Vashishth, Sriram Ganapathy, Abhishek Bapna, Prateek Jain, and Partha Talukdar. 2024. Llm augmented llms: Expanding capabilities through composition. arXiv preprint arXiv:2401.02412.
Beeching et al. (2023) Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Chen et al. (2023) Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
Cheng et al. (2023) Daixuan Cheng, Shaohan Huang, and Furu Wei. 2023. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530.
Choshen et al. (2022) Leshem Choshen, Elad Venezian, Noam Slonim, and Yoav Katz. 2022. Fusing finetuned models for better pretraining. arXiv preprint arXiv:2204.03044.
Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems.
Corbitt (2023) Kyle Corbitt. 2023. How we built “mistral 7b fine-tune optimized,” the best 7b model for fine-tuning.
Davari and Belilovsky (2023) MohammadReza Davari and Eugene Belilovsky. 2023. Model breadcrumbs: Scaling multi-task model merging with sparse masks. arXiv preprint arXiv:2312.06795.
De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385.
Entezari et al. (2021) Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. 2021. The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296.
Gale et al. (2023) Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5.
Garipov et al. (2018) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. 2018. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding.
Ilharco et al. (2022) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
Imfeld et al. (2023) Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, and Sidak Pal Singh. 2023. Transformer fusion with optimal transport. arXiv preprint arXiv:2310.05719.
Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
Jin et al. (2019) Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China. Association for Computational Linguistics.
Jin et al. (2022) Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. 2022. Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849.
Jordan et al. (2022) Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, and Behnam Neyshabur. 2022. REPAIR: Renormalizing permuted activations for interpolation repair. arXiv preprint arXiv:2211.08403.
Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. BioMistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373.
Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods.
Matena and Raffel (2022) Michael S Matena and Colin A Raffel. 2022. Merging models with Fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716.
Minaee et al. (2024) Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196.
Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J Zico Kolter. 2019. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32.
Neyshabur et al. (2021) Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. 2021. What is being transferred in transfer learning? arXiv preprint arXiv:2008.11687.
Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR.
Park (2023) Daniel Park. 2023. Open-llm-leaderboard-report.
Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Ramé et al. (2023) Alexandre Ramé, Kartik Ahuja, Jianyu Zhang, Matthieu Cord, Léon Bottou, and David Lopez-Paz. 2023. Model ratatouille: Recycling diverse models for out-of-distribution generalization. In International Conference on Machine Learning, pages 28656–28679. PMLR.
Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An adversarial Winograd schema challenge at scale.
Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
Shoemake (1985) Ken Shoemake. 1985. Animating rotation with quaternion curves. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pages 245–254.
Singh and Jaggi (2020) Sidak Pal Singh and Martin Jaggi. 2020. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055.
Smith and Gashler (2017) Joshua Smith and Michael Gashler. 2017. An investigation of how neural networks learn from the experiences of peers through periodic weight averaging. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 731–736. IEEE.
Stoica et al. (2023) George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, and Judy Hoffman. 2023. ZipIt! Merging models from different tasks without training. arXiv preprint arXiv:2305.03053.
Tatro et al. (2020) Norman Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, and Rongjie Lai. 2020. Optimizing mode connectivity via neuron alignment. Advances in Neural Information Processing Systems, 33:15300–15311.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Utans (1996) Joachim Utans. 1996. Weight averaging for neural networks and local resampling schemes. In Proc. AAAI-96 Workshop on Integrating Multiple Learned Models. AAAI Press, pages 133–138. Citeseer.
Verma and Elbayad (2024) Neha Verma and Maha Elbayad. 2024. Merging text transformer models from different initializations. arXiv preprint arXiv:2403.00986.
Wan et al. (2024) Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. 2024. Knowledge fusion of large language models. arXiv preprint arXiv:2401.10491.
Wang et al. (2023) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.
Wei et al. (2024) Chengcheng Wei, Ze Chen, Songtan Fang, Jiarong He, and Max Gao. 2024. OPDAI at SemEval-2024 task 6: Small LLMs can accelerate hallucination detection with weakly supervised data. arXiv preprint arXiv:2402.12913.
Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Wortsman et al. (2022) Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR.
Wu et al. (2023) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. PMC-LLaMA: Further finetuning LLaMA on medical papers. arXiv preprint arXiv:2304.14454.
Wu et al. (2024) Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, and Ying Shan. 2024. LLaMA Pro: Progressive LLaMA with block expansion. arXiv preprint arXiv:2401.02415.
Yadav et al. (2023) Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023. Resolving interference when merging models. arXiv preprint arXiv:2306.01708.
Yadav et al. (2024) Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. 2024. TIES-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36.
Yu et al. (2023) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2023. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099.
Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence?
Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.