Behavior Knowledge Merge in Reinforced Agentic Models

1Georgia Institute of Technology, 2Dartmouth College, 3University of Notre Dame
xyuan300@gatech.edu

Teaser Image

Figure 1: RAM merges behaviors from multiple specialized RL agents into a single generalist agent by preserving unique task vectors.

Abstract

TL;DR: We propose RAM (Reinforced Agent Merging), a new framework to merge RL-trained agents. Unlike SFT merging, RAM explicitly preserves "unique" parameter updates that encode task-specific behaviors, preventing the performance dilution common in standard averaging methods.

Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT) and are suboptimal at preserving task-specific capabilities in RL-trained agentic models.

The root cause is a task-vector mismatch between RL and SFT: on-policy RL induces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense, globally comparable task vectors. When standard global averaging is applied under this mismatch, RL's non-overlapping task vectors, which encode critical task-specific behaviors, are attenuated, and the corresponding parameter updates are diluted.

To address this issue, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework explicitly designed for RL-trained agentic models. RAM disentangles shared and task-specific unique parameter updates, averaging the shared components while selectively preserving and rescaling the unique ones to counteract parameter-update dilution. Experiments across multiple agent domains and model architectures demonstrate that RAM not only surpasses existing merging baselines but also unlocks synergy among agents, achieving performance superior to that of the specialized agents within their own domains.

Motivations

1. Heterogeneity in Reinforced Task Vectors

Existing model merging methods typically assume that task vectors (the parameter updates introduced during post-training) are dense and globally comparable, as is the case for SFT updates. However, our analysis reveals a critical mismatch between these assumptions and the nature of reinforcement learning (RL). As shown in Figure 2, RL-induced task vectors exhibit extreme sparsity and heterogeneity. For instance, the coding agent modifies only 3.2% of its parameters, whereas the memory agent updates over 54%. More importantly, these updates are distributed across disparate regions of the parameter space, creating unique, non-overlapping patterns for different capabilities.

Heterogeneity in Sparsity and Distribution

Figure 2: Left: Reinforced task vectors exhibit varying degrees of sparsity across different agents (Code, Tool, Memory). Right: The non-zero elements of these vectors are distributed heterogeneously, with significant portions being "unique" to specific tasks rather than shared.
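To make the sparsity and overlap statistics above concrete, the sketch below shows one way they could be computed from model checkpoints. This is an illustrative PyTorch example, not the paper's analysis code; the helper names (task_vector, unique_fraction) and the tolerance eps are our own assumptions.

import torch

def task_vector(base_state, agent_state):
    """Task vector = agent parameters minus base parameters, per tensor."""
    return {k: agent_state[k] - base_state[k] for k in base_state}

def nonzero_mask(tv, eps=1e-8):
    """Boolean mask of the parameters an agent actually changed."""
    return {k: v.abs() > eps for k, v in tv.items()}

def sparsity(mask):
    """Fraction of parameters that were updated at all."""
    changed = sum(m.sum().item() for m in mask.values())
    total = sum(m.numel() for m in mask.values())
    return changed / total

def unique_fraction(masks, i):
    """Fraction of agent i's updated parameters touched by no other agent."""
    uniq, changed = 0, 0
    for k in masks[i]:
        other_any = torch.zeros_like(masks[i][k])
        for j, m in enumerate(masks):
            if j != i:
                other_any |= m[k]
        uniq += (masks[i][k] & ~other_any).sum().item()
        changed += masks[i][k].sum().item()
    return uniq / max(changed, 1)

Given the base checkpoint and one state dict per agent (e.g., Code, Tool, Memory), sparsity(...) yields per-agent update ratios of the kind reported above, and unique_fraction(...) quantifies how much of each agent's update overlaps with no other agent.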


2. The "Signal Dilution" Problem

Why do standard merging methods fail in this regime? The culprit is a phenomenon we term Signal Dilution. Standard methods rely on global averaging (e.g., uniform 1/N weights). When an agent has a "unique" update for a specific parameter while the others do not, averaging divides this critical signal by the number of models, effectively treating the other agents' zero updates as meaningful contributions. Figure 3 demonstrates the impact of this dilution: unique regions drive significant performance gains (orange bars), but when we simulate standard averaging by diluting these unique vectors (teal bars), performance drops sharply. This necessitates a method like RAM that can disentangle and preserve these unique signals.

Signal Dilution Analysis

Figure 3: Analysis of Signal Dilution. The "unique" regions of task vectors (orange) are critical for domain-specific performance (e.g., Coding). Applying standard averaging dilutes these signals (teal), causing significant performance regression across all domains.
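The dilution effect can be reproduced with a toy numerical example. The snippet below is a minimal illustration under assumed values (3 agents, one unique update), not the experiment behind Figure 3; the dilution-aware variant at the end simply averages over contributing agents and is not RAM itself.

import torch

N = 3                                  # number of agents being merged
base = torch.zeros(5)                  # toy base parameters
unique_update = torch.tensor([0., 0., 1.5, 0., 0.])  # only agent 0 touches index 2

agents = [base + unique_update] + [base.clone() for _ in range(N - 1)]

# Standard global averaging treats the other agents' zero updates as signal.
merged = torch.stack(agents).mean(dim=0)
print(merged[2])                       # 0.5 -> the 1.5 update is diluted to 1.5 / N

# Dilution-aware alternative: average only over agents that changed the entry.
stacked = torch.stack([a - base for a in agents])
changed = (stacked.abs() > 1e-8).sum(dim=0).clamp(min=1)
preserved = base + stacked.sum(dim=0) / changed
print(preserved[2])                    # 1.5 -> the unique behavior is kept

Global averaging scales the unique update at index 2 down to 1.5 / N = 0.5, whereas averaging only over the agents that actually changed the entry keeps it at full magnitude.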

Method

Our framework, Reinforced Agent Merging (RAM), addresses the task-vector mismatch between RL and SFT. The pipeline consists of three main steps (a code sketch follows the list):

Method Pipeline

1. Disentanglement: We separate shared knowledge (common across agents) from unique knowledge (task-specific).
2. Selective Preservation: Instead of global averaging, we apply a mask to preserve the magnitude of unique updates.
3. Rescaling: Parameters are rescaled to ensure the merged model retains the specialized capabilities of the original agents.
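A minimal sketch of how these three steps could compose is shown below. It assumes binary update masks and a simple keep-at-full-magnitude rule for unique entries; the actual RAM implementation (e.g., its masking thresholds and rescaling factors) may differ.

import torch

def merge_ram_style(base_state, agent_states, eps=1e-8):
    """Merge a base model with several agents' state dicts (illustrative only)."""
    merged = {}
    for name, base in base_state.items():
        # Task vectors and update masks for this tensor.
        tvs = torch.stack([s[name] - base for s in agent_states])
        masks = tvs.abs() > eps
        n_updates = masks.sum(dim=0)            # how many agents touched each entry

        shared = n_updates > 1                  # step 1: disentanglement
        unique = n_updates == 1

        delta = torch.zeros_like(base)
        # Shared entries: average over the agents that contributed.
        denom = n_updates.clamp(min=1)
        delta[shared] = (tvs.sum(dim=0) / denom)[shared]
        # Unique entries: keep the single agent's update at full magnitude
        # instead of dividing it by the number of agents (steps 2 and 3).
        delta[unique] = tvs.sum(dim=0)[unique]

        merged[name] = base + delta
    return merged

In this sketch, shared entries (touched by more than one agent) are averaged over their contributors, while unique entries retain the single contributing agent's update undivided, which is exactly the signal that global 1/N averaging dilutes.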

Results

BibTeX

@article{yuan2026behavior,
  title={Behavior Knowledge Merge in Reinforced Agentic Models},
  author={Yuan, Xiangchi and Shi, Dachuan and Zhang, Chunhui and Liu, Zheyuan and Yao, Shenglong and Vosoughi, Soroush and Lee, Wenke},
  journal={arXiv preprint arXiv:2601.13572},
  year={2026}
}