When a team spends weeks fine-tuning separate models for coding, summarization, and instruction following, the instinct is to deploy them side by side—and pay the latency and memory tax for three separate inference pipelines. In 2025, model merging is quietly changing that calculus. Instead of ensembling or distilling, developers are taking the raw weight matrices of multiple specialized models and combining them into one. The result is a single model that retains skills from each domain without the cost of multi-model serving. This report explains how weight interpolation and advanced merging algorithms work, where they deliver real savings, and the edge cases that still require a human to choose between merging and retraining.
The most intuitive merging method is linear interpolation: mix 50% of Model A's parameters with 50% of Model B's parameters. From a machine-learning theory standpoint, this should fail. Loss landscapes are non-convex, and averaging two local minima often lands in a high-loss valley. Yet in practice, after the success of model soups (Wortsman et al., 2022), the community observed that fine-tuned checkpoints from the same base model frequently lie in a common low-loss basin. The linear combination stays within that basin.
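To make this concrete, here is a minimal sketch of 50/50 linear interpolation between two fine-tunes of the same base, assuming Hugging Face checkpoints; the model IDs are placeholders, and any two checkpoints that share an architecture and tokenizer would work the same way.

```python
# Minimal sketch of linear weight interpolation between two fine-tunes of the
# same base model. Model IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM

alpha = 0.5  # interpolation coefficient: fraction of model A in the blend

model_a = AutoModelForCausalLM.from_pretrained("org/model-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("org/model-b", torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()
merged_state = {}
for name, param_a in model_a.state_dict().items():
    # Every tensor is blended elementwise; shapes must match exactly.
    merged_state[name] = alpha * param_a + (1.0 - alpha) * state_b[name]

model_a.load_state_dict(merged_state)
model_a.save_pretrained("./merged-linear")
```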
When you fine-tune Llama 3-8B for code generation, the weight deltas from the base model cluster in specific directions. A second fine-tune for SQL query generation produces deltas that point in a different but not orthogonal direction. Averaging them cancels some noise and preserves the components that benefit both tasks. In benchmarks published on the Open LLM Leaderboard v2, models merged via simple linear interpolation at a 0.5 coefficient retained 92% of each parent's task accuracy while halving parameter count compared to running two separate models.
The assumption of a shared low-loss basin breaks when the models were fine-tuned with different learning rates, different amounts of data, or worse, different tokenizers. Merging a Llama-based model with a Mistral-based model directly produces garbage—the weight dimensions do not align. Even within the same architecture, fine-tuning one checkpoint for 10,000 steps and another for 1 step produces deltas of vastly different magnitudes, requiring scaling factors to normalize them.
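One way to picture the normalization step is to shrink the larger delta so both fine-tunes contribute at comparable magnitude before averaging. The sketch below is an illustrative scheme, not a standard recipe; the checkpoint names are placeholders.

```python
# Sketch of delta ("task vector") averaging with a crude norm-matching rescale.
# One illustrative normalization scheme among many; checkpoint names are placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base").state_dict()
ft_a = AutoModelForCausalLM.from_pretrained("org/code-finetune").state_dict()
ft_b = AutoModelForCausalLM.from_pretrained("org/sql-finetune").state_dict()

merged = {}
for name, base_w in base.items():
    if not torch.is_floating_point(base_w):
        merged[name] = base_w  # leave integer buffers untouched
        continue
    delta_a = ft_a[name] - base_w
    delta_b = ft_b[name] - base_w
    # If one fine-tune drifted much further from the base, shrink its delta so
    # both contribute at comparable magnitude before averaging.
    norm_a, norm_b = delta_a.norm(), delta_b.norm()
    if norm_b > norm_a and norm_b > 0:
        delta_b = delta_b * (norm_a / norm_b)
    elif norm_a > norm_b and norm_a > 0:
        delta_a = delta_a * (norm_b / norm_a)
    merged[name] = base_w + 0.5 * (delta_a + delta_b)
```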
Linear interpolation is the baseline. Production deployments in 2025 use structured merging algorithms that prune and rescale deltas before averaging.
Introduced by Yadav et al. in the 2023 paper "TIES-Merging: Resolving Interference When Merging Models," TIES resolves the problem of conflicting sign directions between models. The algorithm proceeds in three steps:
1. Trim: for each model's delta from the base, keep only the largest-magnitude entries (a configurable density, often 10–20%) and zero the rest.
2. Elect signs: for every parameter, choose a single sign by magnitude-weighted majority across the trimmed deltas.
3. Disjoint merge: average only the delta values whose sign agrees with the elected sign, discarding the conflicting ones.
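Below is a condensed per-tensor sketch of those three steps; production implementations expose additional options (per-model weights, global rescaling) that are omitted here.

```python
# Condensed per-tensor sketch of TIES: trim, elect signs, disjoint merge.
# `density` is the fraction of entries kept in each delta.
import torch

def ties_merge_tensor(deltas: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    trimmed = []
    for d in deltas:
        # 1. Trim: keep only the top-`density` fraction of entries by magnitude.
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    # 2. Elect signs: choose a per-parameter sign from the summed trimmed deltas.
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3. Disjoint merge: average only the entries that agree with the elected sign.
    agrees = torch.sign(stacked) == elected_sign
    kept = stacked * agrees
    return kept.sum(dim=0) / agrees.sum(dim=0).clamp(min=1)
```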
In a 2025 benchmark on Japanese language models (Sakana released merged models based on Llama 3 and Mistral), TIES outperformed linear merging by 8% on the Japanese LM Evaluation Harness while using the same compute budget. The trade-off is a small increase in memory during the merge process—you need to hold all delta masks concurrently.
DARE takes the opposite approach: it randomly drops a large fraction (90–99%) of each delta vector and rescales the surviving 1–10% by 1 / (1 − p), where p is the drop rate. The logic is that fine-tuning deltas are highly redundant. Most parameters change only slightly; the signal lives in a sparse set of weights. In tests run by the community on the Nous Research and Arcee clusters, DARE-sparse merged models maintained 99% of individual task performance while reducing the merged model's sensitivity to sign conflicts. DARE is especially effective when merging three or more models, where the number of parameter conflicts grows combinatorially.
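A sketch of the drop-and-rescale step on a single delta tensor:

```python
# DARE's core step: drop a fraction p of the delta's entries at random, then
# rescale the survivors by 1 / (1 - p) so the expected delta is preserved.
import torch

def dare_delta(delta: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    keep = (torch.rand_like(delta, dtype=torch.float32) >= p).to(delta.dtype)
    return delta * keep / (1.0 - p)
```

The sparse, rescaled deltas from each parent are then added back onto the base weights, optionally after a TIES-style sign election to resolve the remaining conflicts.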
Until late 2024, merging required writing custom PyTorch scripts and managing CUDA memory manually. In 2025, two tools dominate the ecosystem.
Maintained by Charles Goddard and the Arcee team, MergeKit is an open-source library that supports linear, TIES, DARE, and a "passthrough" method that copies layers unchanged for layer-stacking merges. It runs on CPU or a single GPU. A typical command-line invocation looks like:
mergekit-yaml config.yaml ./merged-model --cuda
The YAML config lists each source model with its merge weight (e.g., 0.6 for one parent and 0.4 for the other) and names the merge method. MergeKit detects parameter-shape mismatches and offers tokenizer-alignment options for models whose vocabularies differ. As of February 2025, the Hugging Face Hub lists over 5,000 merges created with MergeKit, many of which show up in the top 20 of the Open LLM Leaderboard.
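A minimal end-to-end run might look like the following. The model IDs are placeholders, and the config fields follow the examples in the MergeKit repository at the time of writing; check the README for the exact schema your version expects.

```python
# Hedged example: write a TIES config and call the MergeKit CLI shown above.
import pathlib
import subprocess

config = """\
merge_method: ties
base_model: org/base-model
models:
  - model: org/code-finetune
    parameters:
      weight: 0.6
      density: 0.5
  - model: org/sql-finetune
    parameters:
      weight: 0.4
      density: 0.5
dtype: bfloat16
"""

pathlib.Path("config.yaml").write_text(config)
subprocess.run(["mergekit-yaml", "config.yaml", "./merged-model", "--cuda"], check=True)
```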
Model Climb, released by the Arcee team in late 2024, adds a meta-learning layer. It iteratively evaluates merge candidates on a held-out validation set and adjusts the merging coefficients using a Bayesian optimizer. The downside is that it requires running inference on the validation set for each candidate—computationally heavier than MergeKit's static YAML approach—but for production deployments targeting a specific benchmark, the additional 10–15% improvement often justifies the cost.
Model merging is not a universal solution. There are three failure modes that teams encounter in production.
Merging a coding-expert model with a creative-writing model often preserves both skills at 80–90% of their original accuracy. But if you merge five models (coding, math, translation, summarization, and reasoning), the interference becomes destructive. The merged model might post a respectable average across all five tasks while falling below the minimum acceptable threshold on the two that matter most. In a case study published by an enterprise AI team deploying a merged model for internal Q&A, the model lost 30% of retrieval accuracy after merging with a creative-writing model, because the retrieval deltas were overwritten by the creative-writing deltas.
Fine-tuning often alters the embedding layer to shift token representations toward a domain (e.g., medical terminology). When merging a medical model with a legal model, the embedding layer's merged vectors can produce representations that are neither medical nor legal—they land in a semantic no-man's land. The fix is to exclude the embedding and LM head layers from the merge, treating them as fixed, but this reduces the benefit of merging for vocabulary-sensitive tasks.
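A sketch of that fix: blend the transformer blocks but keep the vocabulary-sensitive tensors from one parent untouched. The name patterns assume Llama-style checkpoints (embed_tokens, lm_head); other architectures name these tensors differently, and the model IDs are placeholders.

```python
# Merge everything except the embedding and LM-head layers, which are kept from
# model A. SKIP_PATTERNS assumes Llama-style parameter names.
import torch
from transformers import AutoModelForCausalLM

SKIP_PATTERNS = ("embed_tokens", "lm_head")

model_a = AutoModelForCausalLM.from_pretrained("org/medical-ft", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("org/legal-ft", torch_dtype=torch.bfloat16)
state_b = model_b.state_dict()

merged = {}
for name, w_a in model_a.state_dict().items():
    if any(pattern in name for pattern in SKIP_PATTERNS):
        merged[name] = w_a  # keep vocabulary-sensitive layers fixed
    else:
        merged[name] = 0.5 * w_a + 0.5 * state_b[name]

model_a.load_state_dict(merged)
model_a.save_pretrained("./merged-fixed-vocab")
```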
The final layers of a transformer are most task-specific. Averaging them can collapse the output logit topology. DARE helps here by keeping only a sparse set of critical deltas, but if both models have strongly opposing output biases (e.g., one model is fine-tuned to be conservative, the other liberal), the merged model may produce bland or inconsistent outputs. In practice, merging models fine-tuned for unrelated domains (code vs. poetry) yields better results than merging models fine-tuned for opposing objectives on the same domain.
Based on current best practices from teams at Sakana AI, Nous Research, and Arcee, here is a repeatable workflow:
1. Confirm that every candidate was fine-tuned from the same base checkpoint and tokenizer; cross-architecture merges produce garbage.
2. Start with two parents, not five; interference grows with every additional model.
3. Merge with MergeKit using TIES or DARE rather than plain linear averaging, and rescale deltas if the fine-tunes differ greatly in training length.
4. Exclude the embedding and LM-head layers from the merge if the domains have divergent vocabularies.
5. Evaluate the merged model on held-out benchmarks for every parent task, compare against each parent, and only then deploy.
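Before committing to a full benchmark run in step 5, a quick sanity check along these lines can catch a badly broken merge; the paths and prompts below are placeholders.

```python
# Compare mean language-modeling loss of the merged model and one parent on a
# few held-out prompts as a cheap pre-benchmark sanity check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = [
    "Write a SQL query that counts orders per customer.",
    "Explain what a Python generator is in one sentence.",
]

def mean_loss(model_path: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    losses = []
    for text in prompts:
        batch = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

print("parent:", mean_loss("org/code-finetune"))
print("merged:", mean_loss("./merged-model"))
```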
In 2025, the economic argument for merging is straightforward. Deploying three separate 8B-parameter models on a single GPU requires batching with dynamic memory allocation, or three separate GPUs. A merged 8B model fits on one GPU and serves at the same latency as a single model. The trade-off is a 10–15% accuracy reduction on specialized tasks compared to the top-performing parent. If that accuracy gap is acceptable for your use case (e.g., a general-purpose assistant that needs to code and chat, but does not need to win Kaggle competitions), merging saves the cost of training a full multi-task model from scratch, which can run into tens of thousands of dollars in compute for a one-billion-parameter model.
Multi-task fine-tuning (MTF) from the same base also works, but it requires curating a balanced dataset across all three domains and training carefully to avoid catastrophic forgetting within a single run. Merging, by contrast, reuses the already-produced fine-tuned weights, so it adds zero training compute. The only cost is the merge computation itself, which for an 8B model takes under 30 minutes on a single A100.
For startups watching GPU budgets, merging offers a path to serving a capable model without financing a retraining run. One AI agent startup replaced a three-model ensemble with a single TIES-merged model and reduced monthly inference costs from $4,200 to $1,800, while maintaining 94% of the original ensemble's accuracy on their internal eval set.
Your next experiment should be straightforward: take two existing fine-tuned models from the same base family, use MergeKit with TIES, and evaluate the result on a relevant benchmark. If the accuracy holds, the savings in deployment cost and latency will speak for themselves.