When a team spends weeks fine-tuning separate models for coding, summarization, and instruction following, the instinct is to deploy them side by side—and pay the latency and memory tax for three separate inference pipelines. In 2025, model merging is quietly changing that calculus. Instead of ensembling or distilling, developers are taking the raw weight matrices of multiple specialized models and combining them into one. The result is a single model that retains skills from each domain without the cost of multi-model serving. This report explains how weight interpolation and advanced merging algorithms work, where they deliver real savings, and the edge cases that still require a human to choose between merging and retraining.
The most intuitive merging method is linear interpolation: mix 50% of Model A's parameters with 50% of Model B's parameters. From a machine-learning theory standpoint, this should fail. Loss landscapes are non-convex, and averaging two local minima often lands in a high-loss valley. Yet in practice, after the success of model soups (Wortsman et al., 2022), the community observed that fine-tuned checkpoints from the same base model frequently lie in a common low-loss basin. The linear combination stays within that basin.
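To make this concrete, here is a minimal sketch of 50/50 linear interpolation between two fine-tunes of the same base, assuming Hugging Face checkpoints; the model IDs are placeholders, and any two checkpoints that share an architecture and tokenizer would work the same way.

```python
# Minimal sketch of linear weight interpolation between two fine-tunes of the
# same base model. Model IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM

alpha = 0.5  # interpolation coefficient: fraction of model A in the blend

model_a = AutoModelForCausalLM.from_pretrained("org/model-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("org/model-b", torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()
merged_state = {}
for name, param_a in model_a.state_dict().items():
    # Every tensor is blended elementwise; shapes must match exactly.
    merged_state[name] = alpha * param_a + (1.0 - alpha) * state_b[name]

model_a.load_state_dict(merged_state)
model_a.save_pretrained("./merged-linear")
```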
When you fine-tune Llama 3-8B for code generation, the weight deltas from the base model cluster in specific directions. A second fine-tune for SQL query generation produces deltas that point in a different but not orthogonal direction. Averaging them cancels some noise and preserves the components that benefit both tasks. In benchmarks published on the Open LLM Leaderboard v2, models merged via simple linear interpolation at a 0.5 coefficient retained 92% of each parent's task accuracy while halving parameter count compared to running two separate models.
The assumption of a shared low-loss basin breaks when the models were fine-tuned with different learning rates, different amounts of data, or worse, different tokenizers. Merging a Llama-based model with a Mistral-based model directly produces garbage—the weight dimensions do not align. Even within the same architecture, fine-tuning one checkpoint for 10,000 steps and another for 1 step produces deltas of vastly different magnitudes, requiring scaling factors to normalize them.
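One way to picture the normalization step is to shrink the larger delta so both fine-tunes contribute at comparable magnitude before averaging. The sketch below is an illustrative scheme, not a standard recipe; the checkpoint names are placeholders.

```python
# Sketch of delta ("task vector") averaging with a crude norm-matching rescale.
# One illustrative normalization scheme among many; checkpoint names are placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base").state_dict()
ft_a = AutoModelForCausalLM.from_pretrained("org/code-finetune").state_dict()
ft_b = AutoModelForCausalLM.from_pretrained("org/sql-finetune").state_dict()

merged = {}
for name, base_w in base.items():
    if not torch.is_floating_point(base_w):
        merged[name] = base_w  # leave integer buffers untouched
        continue
    delta_a = ft_a[name] - base_w
    delta_b = ft_b[name] - base_w
    # If one fine-tune drifted much further from the base, shrink its delta so
    # both contribute at comparable magnitude before averaging.
    norm_a, norm_b = delta_a.norm(), delta_b.norm()
    if norm_b > norm_a and norm_b > 0:
        delta_b = delta_b * (norm_a / norm_b)
    elif norm_a > norm_b and norm_a > 0:
        delta_a = delta_a * (norm_b / norm_a)
    merged[name] = base_w + 0.5 * (delta_a + delta_b)
```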
Linear interpolation is the baseline. Production deployments in 2025 use structured merging algorithms that prune and rescale deltas before averaging.
Introduced by Yadav et al. in the 2023 paper "TIES-Merging: Resolving Interference When Merging Models," TIES resolves the problem of conflicting sign directions between models. The algorithm proceeds in three steps:
1. Trim: for each model's delta from the base, keep only the largest-magnitude entries (a configurable density, often 10–20%) and zero the rest.
2. Elect signs: for every parameter, choose a single sign by magnitude-weighted majority across the trimmed deltas.
3. Disjoint merge: average only the delta values whose sign agrees with the elected sign, discarding the conflicting ones.
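Below is a condensed per-tensor sketch of those three steps; production implementations expose additional options (per-model weights, global rescaling) that are omitted here.

```python
# Condensed per-tensor sketch of TIES: trim, elect signs, disjoint merge.
# `density` is the fraction of entries kept in each delta.
import torch

def ties_merge_tensor(deltas: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    trimmed = []
    for d in deltas:
        # 1. Trim: keep only the top-`density` fraction of entries by magnitude.
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)
    # 2. Elect signs: choose a per-parameter sign from the summed trimmed deltas.
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3. Disjoint merge: average only the entries that agree with the elected sign.
    agrees = torch.sign(stacked) == elected_sign
    kept = stacked * agrees
    return kept.sum(dim=0) / agrees.sum(dim=0).clamp(min=1)
```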
In a 2025 benchmark on Japanese language models (Sakana released merged models based on Llama 3 and Mistral), TIES outperformed linear merging by 8% on the Japanese LM Evaluation Harness while using the same compute budget. The trade-off is a small increase in memory during the merge process—you need to hold all delta masks concurrently.
DARE takes the opposite approach: it randomly drops a large fraction (90–99%) of each delta vector and rescales the surviving 1–10% by 1 / (1 − p), where p is the drop rate. The logic is that fine-tuning deltas are highly redundant. Most parameters change only slightly; the signal lives in a sparse set of weights. In tests run by the community on the Nous Research and Arcee clusters, DARE-sparse merged models maintained 99% of individual task performance while reducing the merged model's sensitivity to sign conflicts. DARE is especially effective when merging three or more models, where the number of parameter conflicts grows combinatorially.
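A sketch of the drop-and-rescale step on a single delta tensor:

```python
# DARE's core step: drop a fraction p of the delta's entries at random, then
# rescale the survivors by 1 / (1 - p) so the expected delta is preserved.
import torch

def dare_delta(delta: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    keep = (torch.rand_like(delta, dtype=torch.float32) >= p).to(delta.dtype)
    return delta * keep / (1.0 - p)
```

The sparse, rescaled deltas from each parent are then added back onto the base weights, optionally after a TIES-style sign election to resolve the remaining conflicts.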
Until late 2024, merging required writing custom PyTorch scripts and managing CUDA memory manually. In 2025, two tools dominate the ecosystem.
Maintained by Charles Goddard and the Arcee team, MergeKit is an open-source library that supports linear, TIES, DARE, and a "passthrough" method that copies layers unchanged for layer-stacking merges. It runs on CPU or a single GPU. A typical command-line invocation looks like:
mergekit-yaml config.yaml ./merged-model --cuda
The YAML config lists each source model with its merge weight (e.g., 0.6 for one parent and 0.4 for the other) and names the merge method. MergeKit detects parameter-shape mismatches and offers tokenizer-alignment options for models whose vocabularies differ. As of February 2025, the Hugging Face Hub lists over 5,000 merges created with MergeKit, many of which show up in the top 20 of the Open LLM Leaderboard.
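A minimal end-to-end run might look like the following. The model IDs are placeholders, and the config fields follow the examples in the MergeKit repository at the time of writing; check the README for the exact schema your version expects.

```python
# Hedged example: write a TIES config and call the MergeKit CLI shown above.
import pathlib
import subprocess

config = """\
merge_method: ties
base_model: org/base-model
models:
  - model: org/code-finetune
    parameters:
      weight: 0.6
      density: 0.5
  - model: org/sql-finetune
    parameters:
      weight: 0.4
      density: 0.5
dtype: bfloat16
"""

pathlib.Path("config.yaml").write_text(config)
subprocess.run(["mergekit-yaml", "config.yaml", "./merged-model", "--cuda"], check=True)
```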
Model Climb, released by the Arcee team in late 2024, adds a meta-learning layer. It iteratively evaluates merge candidates on a held-out validation set and adjusts the merging coefficients using a Bayesian optimizer. The downside is that it requires running inference on the validation set for each candidate—computationally heavier than MergeKit's static YAML approach—but for production deployments targeting a specific benchmark, the additional 10–15% improvement often justifies the cost.
Model merging is not a universal solution. There are three failure modes that teams encounter in production.
Merging a coding-expert model with a creative-writing model often preserves both skills at 80–90% of their original accuracy. But if you merge five models (coding, math, translation, summarization, and reasoning), the interference becomes destructive. The merged model might post a respectable average across all five tasks while falling below the minimum acceptable threshold on the two that matter most. In a case study published by an enterprise AI team deploying a merged model for internal Q&A, the model lost 30% of retrieval accuracy after merging with a creative-writing model, because the retrieval deltas were overwritten by the creative-writing deltas.
Fine-tuning often alters the embedding layer to shift token representations toward a domain (e.g., medical terminology). When merging a medical model with a legal model, the embedding layer's merged vectors can produce representations that are neither medical nor legal—they land in a semantic no-man's land. The fix is to exclude the embedding and LM head layers from the merge, treating them as fixed, but this reduces the benefit of merging for vocabulary-sensitive tasks.
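A sketch of that fix: blend the transformer blocks but keep the vocabulary-sensitive tensors from one parent untouched. The name patterns assume Llama-style checkpoints (embed_tokens, lm_head); other architectures name these tensors differently, and the model IDs are placeholders.

```python
# Merge everything except the embedding and LM-head layers, which are kept from
# model A. SKIP_PATTERNS assumes Llama-style parameter names.
import torch
from transformers import AutoModelForCausalLM

SKIP_PATTERNS = ("embed_tokens", "lm_head")

model_a = AutoModelForCausalLM.from_pretrained("org/medical-ft", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("org/legal-ft", torch_dtype=torch.bfloat16)
state_b = model_b.state_dict()

merged = {}
for name, w_a in model_a.state_dict().items():
    if any(pattern in name for pattern in SKIP_PATTERNS):
        merged[name] = w_a  # keep vocabulary-sensitive layers fixed
    else:
        merged[name] = 0.5 * w_a + 0.5 * state_b[name]

model_a.load_state_dict(merged)
model_a.save_pretrained("./merged-fixed-vocab")
```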
The final layers of a transformer are most task-specific. Averaging them can collapse the output logit topology. DARE helps here by keeping only a sparse set of critical deltas, but if both models have strongly opposing output biases (e.g., one model is fine-tuned to be conservative, the other liberal), the merged model may produce bland or inconsistent outputs. In practice, merging models fine-tuned for unrelated domains (code vs. poetry) yields better results than merging models fine-tuned for opposing objectives on the same domain.
Based on current best practices from teams at Sakana AI, Nous Research, and Arcee, here is a repeatable workflow:
1. Confirm that every candidate was fine-tuned from the same base checkpoint and tokenizer; cross-architecture merges produce garbage.
2. Start with two parents, not five; interference grows with every additional model.
3. Merge with MergeKit using TIES or DARE rather than plain linear averaging, and rescale deltas if the fine-tunes differ greatly in training length.
4. Exclude the embedding and LM-head layers from the merge if the domains have divergent vocabularies.
5. Evaluate the merged model on held-out benchmarks for every parent task, compare against each parent, and only then deploy.
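Before committing to a full benchmark run in step 5, a quick sanity check along these lines can catch a badly broken merge; the paths and prompts below are placeholders.

```python
# Compare mean language-modeling loss of the merged model and one parent on a
# few held-out prompts as a cheap pre-benchmark sanity check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = [
    "Write a SQL query that counts orders per customer.",
    "Explain what a Python generator is in one sentence.",
]

def mean_loss(model_path: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    losses = []
    for text in prompts:
        batch = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

print("parent:", mean_loss("org/code-finetune"))
print("merged:", mean_loss("./merged-model"))
```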
In 2025, the economic argument for merging is straightforward. Deploying three separate 8B-parameter models on a single GPU requires batching with dynamic memory allocation, or three separate GPUs. A merged 8B model fits on one GPU and serves at the same latency as a single model. The trade-off is a 10–15% accuracy reduction on specialized tasks compared to the top-performing parent. If that accuracy gap is acceptable for your use case (e.g., a general-purpose assistant that needs to code and chat, but does not need to win Kaggle competitions), merging saves the cost of training a full multi-task model from scratch, which can run into tens of thousands of dollars in compute for a one-billion-parameter model.
Multi-task fine-tuning (MTF) from the same base also works, but it requires curating a balanced dataset across all three domains and training carefully to avoid catastrophic forgetting within a single run. Merging, by contrast, reuses the already-produced fine-tuned weights, so it adds zero training compute. The only cost is the merge computation itself, which for an 8B model takes under 30 minutes on a single A100.
For startups watching GPU budgets, merging offers a path to serving a capable model without financing a retraining run. One AI agent startup replaced a three-model ensemble with a single TIES-merged model and reduced monthly inference costs from $4,200 to $1,800, while maintaining 94% of the original ensemble's accuracy on their internal eval set.
Your next experiment should be straightforward: take two existing fine-tuned models from the same base family, use MergeKit with TIES, and evaluate the result on a relevant benchmark. If the accuracy holds, the savings in deployment cost and latency will speak for themselves.