Arxiv Papers of Today

Generated: 2026-05-07 18:26:47 (UTC+8); Arxiv release time: 2026-05-07 20:00 EDT (2026-05-08 08:00 UTC+8)

40 relevant papers today

Keyword: reinforcement learning

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

自由能量驱动强化学习，采用自适应优势塑形，用于大型语言模型中的无监督推理

Authors: Yiming Huang, Zhenbo Shi, Xin-Cheng Wen, Jichuan Zeng, Cuiyun Gao, Peiyi Han, Chuanyi Liu
Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.04065
Pdf link: https://arxiv.org/pdf/2605.04065
Abstract Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.
中文摘要 无监督强化学习（RL）已成为大型语言模型（LLM）实现自我提升的一种有前景的范式。然而，现有的基于无监督强化学习的方法通常无法适应模型在训练过程中不断演变的推理能力。因此，在缺乏真实标签监督的情况下，这些方法可能会误导策略优化。为解决这一问题，我们提出了FREIA，一种基于强化学习的新算法，它建立在两项关键创新之上：（1）自由能驱动奖励（FER）基于自由能原理调整奖励，以平衡共识与探索。（2）自适应优势塑形（AAS）根据采样奖励的统计特性自适应地调整学习信号。在三类推理任务的九个数据集上的实证评估表明，FREIA优于其他基于无监督强化学习的基线方法。值得注意的是，在数学推理任务中，使用DeepSeek-R1-Distill-Qwen-1.5B模型时，FREIA的Pass@1平均比其他方法高出0.5到3.5分。

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

适应以茁壮成长！自适应幂均值策略优化以提升LLM推理能力

Authors: Yiming Huang, Zhenbo Shi, Shuzheng Gao, Cuiyun Gao, Peiyi Han, Chuanyi Liu
Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.04066
Pdf link: https://arxiv.org/pdf/2605.04066
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.
中文摘要 带可验证奖励的强化学习（RLVR）是一种提升大型语言模型（LLM）推理能力的重要范式。然而，现有方法通常依赖静态策略优化方案，这与模型不断演变的推理能力不匹配。为解决这一问题，我们提出了自适应功率-均值策略优化（APMPO），该方案包含两项主要创新：功率-平均策略优化（PMPO）和反馈自适应剪裁（FAC）。具体来说，PMPO引入了一个广义的幂均目标。这使得模型能够自适应地从算术平均的信号放大行为过渡到几何平均值的一致性强化行为。FAC根据实时奖励统计自适应调整削波边界，克服静态机制的局限。利用这些创新，APMPO提升了学习动态和推理表现。在九个数据集上、三项推理任务中的大量实验展示了APMPO相较于最先进的基于RLVR的基线的优势。例如，APMPO在数学推理基准测试中，平均Pass@1分数比GRPO提高了3.0分，使用Qwen2.5-3B-Instruct。
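
The transition that the APMPO abstract describes, from the arithmetic mean to the geometric mean, can be illustrated with the generalized power mean. The sketch below is a minimal illustration under stated assumptions, not the paper's objective: the `ratios` values are hypothetical, and how PMPO chooses or schedules the exponent `p` is not specified in the abstract.

```python
import math


def power_mean(values, p):
    """Generalized power mean of positive values.

    p = 1 gives the arithmetic mean; the limit p -> 0 gives the geometric mean.
    """
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case p -> 0: exponential of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token quantities (an assumption for illustration only).
ratios = [0.5, 1.0, 2.0]
arithmetic = power_mean(ratios, 1.0)    # dominated by the largest value
intermediate = power_mean(ratios, 0.5)  # lies between the two extremes
geometric = power_mean(ratios, 0.0)     # penalizes inconsistency across values
```

Because the power mean is non-decreasing in `p`, lowering `p` from 1 toward 0 moves the aggregate smoothly from the arithmetic to the geometric mean.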

Designing a double deep reinforcement learning selection tool for resilient demand prediction

设计一个用于韧性需求预测的双深度强化学习选择工具

Authors: Bilel Abderrahmane Benziane, Benoit Lardeux, Ayoub Mcharek, Maher Jridi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04068
Pdf link: https://arxiv.org/pdf/2605.04068
Abstract The use of artificial intelligence in supply chain forecasting has attracted many scientific studies for several decades. However, the process of selecting an appropriate forecasting solution becomes a daunting task. This complexity arises due to the distinct features inherent to each dataset. Research to tackle this issue has been performed since the eighties but recent development of demand forecasting has opened new perspectives. This research aims to enhance automatic forecasting model selection by proposing a novel architecture that acts as a double deep reinforcement learning agent, selecting automatically a forecasting model from the forecasting committee at the time of prediction. Moreover, a novel early-stopping approach based on average reward convergence has been introduced to expedite training time. To evaluate the model's performance, an empirical study was conducted utilizing grocery sales datasets and snack demands datasets. The experimental results demonstrate the robustness of the proposed approach when compared to state-of-the-art methods.
中文摘要 人工智能在供应链预测中的应用，几十年来吸引了众多科学研究。然而，选择合适的预测方案的过程变得艰巨。这种复杂性源于每个数据集固有的独特特征。自八十年代以来，针对这一问题的研究就已展开，但近年来需求预测的发展带来了新的视角。本研究旨在通过提出一种新颖架构，作为双深度强化学习代理，在预测时自动从预测委员会中选择预测模型，从而提升自动预测模型选择。此外，基于平均奖励收敛性的新型早期停止方法被引入以加快训练时间。为评估模型表现，进行了利用杂货销售数据集和零食需求数据集的实证研究。实验结果表明，与最先进方法相比，该方法具有鲁棒性。

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

平衡聚合：理解并纠正GRPO中的聚合偏差

Authors: Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.04077
Pdf link: https://arxiv.org/pdf/2605.04077
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理和代码生成能力的核心范式，GRPO式训练因其简洁有效而被广泛采用。然而，一个重要的设计选择仍未被充分探讨：如何在每个采样组内聚合词元级的策略梯度项。标准GRPO使用序列聚合，而近期研究则主张词元聚合是更好的替代方案。我们证明这两种规则会引入不同的优化偏差：词元聚合引入了符号与长度的耦合，而序列聚合通过序列级的等权重隐式降低了较长响应的权重。为解决这一矛盾，我们提出了平衡聚合（BA），这是一种简单的即插即用替代方案，分别在正、负子集内计算词元级均值，然后用基于序列数量的权重将二者组合。使用Qwen2.5-Math-7B和Qwen3-1.7B在DAPO-17k和Polaris上进行的实验，在六个推理和编码基准上评估，结果显示BA在训练稳定性和最终性能上始终优于标准的词元聚合和序列聚合。我们的分析进一步表明，词元聚合和序列聚合的相对有效性主要取决于响应长度的变化以及正负样本的长度差距，凸显了聚合是GRPO式RLVR中的一个关键设计维度。
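
The three aggregation rules named in the abstract can be compared on a toy group. This is a sketch of one reading of the abstract, not the authors' implementation: the exact weighting in `balanced_aggregate` (token mean per sign subset, weighted by the number of sequences in that subset) is an assumption, and the per-token terms are made-up numbers.

```python
def sequence_aggregate(terms):
    """Mean over sequences of each sequence's own token mean."""
    return sum(sum(t) / len(t) for t in terms) / len(terms)


def token_aggregate(terms):
    """Mean over all tokens in the group, regardless of sequence."""
    total_tokens = sum(len(t) for t in terms)
    return sum(sum(t) for t in terms) / total_tokens


def balanced_aggregate(terms, advantages):
    """Token means inside the positive and negative subsets,
    combined with sequence-count weights (assumed form)."""
    pos = [t for t, a in zip(terms, advantages) if a > 0]
    neg = [t for t, a in zip(terms, advantages) if a < 0]
    n = len(pos) + len(neg)
    if n == 0:
        return 0.0
    total = 0.0
    for subset in (pos, neg):
        if subset:
            total += len(subset) * token_aggregate(subset)
    return total / n


# Hypothetical per-token gradient terms: two short positive responses
# and one long negative response.
terms = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0], [-1.0] * 6]
advantages = [1.0, 1.0, -1.0]

seq_value = sequence_aggregate(terms)
tok_value = token_aggregate(terms)
bal_value = balanced_aggregate(terms, advantages)
```

In this toy group the long negative response pulls the token aggregate down, while the balanced rule keeps the positive and negative subsets weighted by how many sequences each contains.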

Constraint-Enhanced Reinforcement Learning Based on Dynamic Decoupled Spherical Radial Squashing

基于动态解耦球面径向压缩的约束增强强化学习

Authors: Qijun Liao, Zhaoxin Yu, Jue Yang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.04185
Pdf link: https://arxiv.org/pdf/2605.04185
Abstract When deploying reinforcement learning policies to physical robots, actuator rate constraints (hard limits on how fast each joint can move per control step) are unavoidable. These limits vary substantially across joints due to differences in motor inertia, power bandwidth, and transmission stiffness, creating pronounced heterogeneity that existing methods fail to handle geometrically: the per-joint feasible region forms a high-dimensional box in action-increment space, yet QP projection and spherical parameterization methods impose isotropic ball-shaped constraints, exponentially under-covering the true feasible set as heterogeneity grows. This paper proposes Dynamic Decoupled Spherical Radial Squashing (DD-SRad), which resolves this mismatch by computing a position-adaptive radius independently for each actuator, achieving tight alignment with the true per-joint feasible region. DD-SRad satisfies per-step hard constraints with probability 1, preserves well-conditioned gradients throughout training, and admits exact policy gradient backpropagation with zero runtime solver overhead. MuJoCo benchmark experiments demonstrate the highest task return at zero constraint violation, matching the unconstrained upper bound, with a 30% to 50% improvement in constraint-space coverage over spherical baselines. High-fidelity IsaacLab simulations with Unitree H1 and G1 humanoid robots confirm end-to-end optimality parameterized directly from official joint specifications, validating a systematic pathway from hardware datasheets to safe deployment.
中文摘要 在将强化学习策略部署到实体机器人时，执行器速率约束（即每个关节在每个控制步内移动速度的硬性限制）是不可避免的。由于电机惯量、功率带宽和传动刚度的差异，这些限制在各关节之间差异显著，造成了现有方法在几何上无法处理的明显异质性：各关节的可行区域在动作增量空间中构成一个高维箱体，而QP投影和球面参数化方法施加的是各向同性的球形约束，随着异质性增大，它们对真实可行集的覆盖不足程度呈指数级加剧。本文提出了动态解耦球面径向压缩（DD-SRad），通过为每个执行器独立计算位置自适应半径，实现与真实的各关节可行区域的紧密对齐，从而解决了这一不匹配。DD-SRad以概率1满足每步硬约束，在整个训练过程中保持良态的梯度，并支持精确的策略梯度反向传播，且运行时求解器开销为零。MuJoCo基准实验表明，该方法在零约束违反的情况下取得了最高的任务回报，与无约束上界持平，并且约束空间覆盖率比球面基线提升了30%至50%。在IsaacLab中使用Unitree H1和G1人形机器人进行的高保真仿真证实了端到端的最优性，其参数直接取自官方关节规格，验证了一条从硬件数据手册到安全部署的系统化路径。
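
The box-shaped feasible region described in the abstract can be made concrete with a per-joint squashing function. The sketch below is not DD-SRad itself, whose formula the abstract does not give; it is a plain per-joint `tanh` squash, with hypothetical rate limits and position bounds, showing how each joint can use its own limit instead of a shared ball radius.

```python
import math


def squash_increment(raw, prev, rate_limit, low, high):
    """Map an unbounded raw output to a feasible action increment for one joint.

    Assumes low <= prev <= high. The increment respects both the rate limit
    and the position bounds, so the constraint holds by construction.
    """
    lo = max(-rate_limit, low - prev)  # most negative feasible increment
    hi = min(rate_limit, high - prev)  # most positive feasible increment
    center = 0.5 * (hi + lo)
    half_width = 0.5 * (hi - lo)
    return center + half_width * math.tanh(raw)


# Hypothetical two-joint example: a slow joint and a fast joint near its upper bound.
rate_limits = [0.05, 0.5]
previous = [0.0, 0.9]
raw_outputs = [3.0, 3.0]
increments = [
    squash_increment(u, a, d, -1.0, 1.0)
    for u, a, d in zip(raw_outputs, previous, rate_limits)
]
```

Since `tanh` is smooth, gradients flow through the squash without a projection solver, which is the general property the abstract attributes to squashing-style parameterizations.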

Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

用于提炼黑匣子强化学习策略的层级支持向量状态划分

Authors: Senne Deproost, Mehrdad Asadi, Ann Nowé
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2605.04254
Pdf link: https://arxiv.org/pdf/2605.04254
Abstract We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state-action pairs with linear support vector machine splits, SVSP constructs a compact and structured representation of the original policy. Our method improves mean return by 7.4% over previous critic-driven state partitioning attempts such as Voronoi State Partitioning (VSP) and by 2.8% over the original TD3 policy, while reducing the number of required subpolicies relative to VSP by 82.1%. Our results pave the path towards a more flexible form of distillation where both the decision boundary and surrogate models can be chosen within a margin of the original black box behavior.
中文摘要 我们介绍状态向量空间划分（SVSP），这是一种利用一组人类可理解子策略模拟黑箱强化学习策略的新方法。通过将状态动作对的提炼数据集与线性支持向量机拆分进行划分，SVSP构建了一个紧凑且结构化的原始策略表示。我们的方法使平均收益比之前的批评驱动状态划分尝试（如Voronoi状态划分（VSP）提升+7.4%，比原始TD3策略提升+2.8%，同时将VSP所需的子策略数量减少了82.1%。我们的结果为一种更灵活的蒸馏形式铺平了道路，在那里决策边界模型和替代模型都可以在原始黑箱行为的范围内选择。

Explaining and Preventing Alignment Collapse in Iterative RLHF

解释和防止迭代RLHF中的对齐崩溃

Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.04266
Pdf link: https://arxiv.org/pdf/2605.04266
Abstract Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained, creating a feedback loop. Building on the Stackelberg game formulation of this interaction, we derive an analytical decomposition of the policy's true optimization gradient into a standard policy gradient and a parameter-steering term that captures the policy's influence on the RM's future parameters. We show that standard iterative RLHF, which drops this steering term entirely, suffers from alignment collapse: the policy systematically exploits the RM's blind spots, producing low-quality, high-reward outputs whose feedback reinforces the very errors it exploits. To mitigate this, we propose foresighted policy optimization (FPO), a mechanism-design intervention that restores the missing steering term by regularizing the policy's parameter-steering effect on RM updates. We instantiate FPO via a scalable first-order approximation and demonstrate that it prevents alignment collapse on both controlled environments and an LLM alignment pipeline using Llama-3.2-1B.
中文摘要 来自人类反馈的强化学习（RLHF）通常假设采用静态或非战略性奖励模型（RM）。然而，在迭代部署中，策略生成用于重新训练RM的数据，形成反馈循环。基于斯塔克尔伯格博弈对该相互作用的表述，我们推导出策略真实优化梯度的分析分解为标准策略梯度和一个参数引导项，以捕捉策略对RM未来参数的影响。我们展示了标准迭代RLHF，即完全放弃该引导项，存在对齐崩溃问题：该策略系统性地利用了平衡值的盲点，产生低质量但高回报的输出，而反馈反而强化了它所利用的错误。为缓解这一问题，我们提出了前瞻性策略优化（FPO），这是一种机制设计干预，通过正则化策略对RM更新的参数引导效应，恢复缺失的引导项。我们通过可扩展的一阶近似实现FPO，并证明它能防止在受控环境和使用Llama-3.2-1B的LLM比对流水线上发生对齐崩溃。

Efficiently Aligning Language Models with Online Natural Language Feedback

高效地将语言模型与在线自然语言反馈对齐

Authors: Christine Ye, Joe Benton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04356
Pdf link: https://arxiv.org/pdf/2605.04356
Abstract Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, broadly beneficial deployments of AI may require us to train models with strong capabilities in "fuzzy", hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts are still able to provide high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in Qwen3-8B and Haiku 4.5 respectively. For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples. Our results suggest that online natural language feedback can substantially improve the data efficiency of expert supervision.
中文摘要 带可验证奖励的强化学习已被用于在许多领域激发语言模型的出色表现。但是，要让AI的部署产生广泛的益处，我们可能需要训练模型在“模糊”、难以监督的领域具备强大能力。本文针对这样的模糊领域开发了对齐语言模型的方法：在这些领域中，人类专家仍能提供高质量的监督信号，但只能覆盖少量模型输出；我们的方法利用在线自然语言反馈。具体来说，我们通过以下迭代过程训练模型：针对代理奖励信号进行优化，在出现过度优化时停止，收集新的专家监督，并更新代理奖励。我们利用上下文学习（ICL）和微调，从语言模型构建代理奖励模型。我们分别在Qwen3-8B和Haiku 4.5上激发创意写作能力和对齐研究能力，以此测试我们的方法。对于Qwen3-8B，ICL方法在专家样本减少50倍的情况下最多恢复35%的性能，而微调方法在样本最多减少20倍的情况下恢复80%，在样本减少3倍的情况下恢复100%。对于Haiku 4.5，ICL方法在样本减少30倍的情况下最多恢复35%的性能，微调方法在样本减少10倍的情况下恢复100%。我们的结果表明，在线自然语言反馈可以显著提升专家监督的数据效率。

Extending Differential Temporal Difference Methods for Episodic Problems

扩展情节性问题的微分时间差分方法

Authors: Kris De Asis, Mohamed Elsayed, Jiamin He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04368
Pdf link: https://arxiv.org/pdf/2605.04368
Abstract Differential temporal difference (TD) methods are value-based reinforcement learning algorithms that have been proposed for infinite-horizon problems. They rely on reward centering, where each reward is centered by the average reward. This keeps the return bounded and removes a value function's state-independent offset. However, reward centering can alter the optimal policy in episodic problems, limiting its applicability. Motivated by recent works that emphasize the role of normalization in streaming deep reinforcement learning, we study reward centering in episodic problems and propose a generalization of differential TD. We prove that this generalization maintains the ordering of policies in the presence of termination, and thus extends differential TD to episodic problems. We show equivalence with a form of linear TD, thereby inheriting theoretical guarantees that have been shown for those algorithms. We then extend several streaming reinforcement learning algorithms to their differential counterparts. Across a range of base algorithms and environments, we empirically validate that reward centering can improve sample efficiency in episodic problems.
中文摘要 差分时间差分（TD）方法是一种基于价值的强化学习算法，已被提出用于无限视界问题。它们依赖奖励中心化，即每个奖励都以平均奖励为中心。这样可以保持返回有界，并消除值函数的状态无关偏移量。然而，奖励中心化可能会改变情节性问题中的最优策略，限制其适用范围。受近期强调归一化在流式深度强化学习中作用的研究启发，我们研究了情节问题中的奖励中心化，并提出了差分TD的推广。我们证明该推广在终止存在下保持策略的排序，从而将微分TD扩展到情节问题。我们证明了与线性TD形式的等价性，从而继承了这些算法已被证明的理论保证。随后，我们将多个流强化学习算法扩展到其差分对应算法。在多种基础算法和环境中，我们实证验证了奖励中心能提升情节问题中的样本效率。
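
Reward centering, as described in the abstract, subtracts a running average reward from each reward before the TD update. The sketch below shows the standard tabular differential TD(0) update for a continuing task. It is not the episodic generalization the paper proposes, and the step sizes `alpha` and `eta` and the two-state example are assumptions chosen for illustration.

```python
def differential_td0(transitions, num_states, alpha=0.1, eta=1.0):
    """Tabular differential TD(0) with reward centering (continuing setting).

    transitions: iterable of (state, reward, next_state) tuples.
    Returns the learned differential values and the average-reward estimate.
    """
    values = [0.0] * num_states
    avg_reward = 0.0
    for state, reward, next_state in transitions:
        # The TD error uses the centered reward: reward minus the running average.
        delta = reward - avg_reward + values[next_state] - values[state]
        values[state] += alpha * delta
        avg_reward += eta * alpha * delta
    return values, avg_reward


# Hypothetical two-state cycle: 0 -> 1 with reward 0, then 1 -> 0 with reward 2.
cycle = [(0, 0.0, 1), (1, 2.0, 0)] * 500
values, avg_reward = differential_td0(cycle, num_states=2)
```

On this cycle the average reward per step is 1, and the estimate settles there while the value difference between the two states stays bounded.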

Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV

多无人机辅助IoV的轨迹控制、资源分配和任务卸载的联合优化

Authors: Maoxin Ji, Qiong Wu, Pingyi Fan, Cui Zhang, Nan Cheng, Wen Chen, Khaled B. Letaief
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04436
Pdf link: https://arxiv.org/pdf/2605.04436
Abstract This paper investigates a multi-Unmanned Aerial Vehicle (UAV) joint base station-assisted Internet of Vehicles (IoV) task offloading system in dense urban environments. To minimize system delay and energy consumption under strict coupling constraints, the complex non-convex optimization problem is decoupled into a hierarchical execution framework. First, a sequential distributed optimization algorithm based on Second-Order Cone Programming (SOCP) is proposed to optimize the 3D flight trajectory of each UAV, ensuring adaptive network coverage. Second, a novel hybrid resource scheduling paradigm synergizing Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) is developed. Within this framework, the DRL agent dictates the initial resource allocation, while the LLM acts as a semantic macro-scheduler to rectify long-tail allocation imbalances for failed and surplus tasks. Crucially, a reward decoupling mechanism is introduced to isolate DRL training from external LLM interventions, thereby ensuring policy convergence. Finally, the task offloading ratios are precisely determined via Linear Programming (LP) within an alternating optimization loop. Simulation results demonstrate that the proposed method significantly outperforms traditional multi-agent reinforcement learning baselines in terms of task success rate and system efficiency.
中文摘要 本文研究了密集城市环境中的多无人机（UAV）联合基站辅助车联网（IoV）任务卸载系统。为了在严格耦合约束下最小化系统延迟和能耗，复杂的非凸优化问题被解耦进一个层级执行框架。首先，提出了基于二阶锥体规划（SOCP）的顺序分布式优化算法，以优化每架无人机的三维飞行轨迹，确保自适应网络覆盖。其次，开发了一种新型混合资源调度范式，协同深度强化学习（DRL）和大型语言模型（LLM）。在此框架下，DRL代理决定初始资源分配，而LLM作为语义宏调度器，纠正失败和剩余任务的长尾分配不平衡。关键是引入了奖励解耦机制，将DRL训练与外部LLM干预隔离开来，从而确保策略趋同。最后，任务卸载比率通过线性规划（LP）在交替优化循环中精确确定。模拟结果表明，所提方法在任务成功率和系统效率方面显著优于传统多智能体强化学习基线。

Queue-Aware and Resilient Routing in LEO Satellite Networks Using Multi-Agent Reinforcement Learning

利用多智能体强化学习实现低地轨道卫星网络中的队列感知和弹性路由

Authors: Mudassar Liaq, Mahyar Tajeri, Peng Hu
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.04448
Pdf link: https://arxiv.org/pdf/2605.04448
Abstract The rapid growth in data demand and the stringent latency requirements of modern applications have driven significant interest in Low Earth Orbit (LEO) satellite constellations as an emerging solution for global Internet coverage. However, routing in LEO networks remains a fundamental challenge due to highly dynamic topologies, time-varying traffic conditions, and susceptibility to link failures. Conventional routing algorithms typically assume static link metrics and fail to account for queue backlogs or real-time system variations, making them less effective in such environments. We propose a queue-aware multi-agent deep reinforcement learning (MA-DRL) framework for routing in LEO satellite networks. Each satellite is modeled as an independent agent responsible for making local routing decisions, enabling a distributed and scalable solution. The proposed framework formulates a latency-aware optimization problem that incorporates background traffic, queue dynamics at each satellite, and a resilience score to improve robustness. We evaluate the proposed approach against the state-action-reward-state-action (SARSA) and Dijkstra algorithms. While Dijkstra achieves the lowest end-to-end latency under ideal conditions, its computational and signaling overhead becomes a significant bottleneck as the network scales. In contrast, our proposed approach incurs significantly lower overhead (approximately 50% of Dijkstra at a 5 s recalculation interval), scales efficiently with network size, and effectively manages queue backlogs and resilience under increasing traffic load, demonstrating enhanced robustness and scalability in LEO satellite networks while maintaining competitive latency and resilience scores.
中文摘要 随着数据需求的快速增长和现代应用严格的延迟要求，低地球轨道（LEO）卫星星座作为全球互联网覆盖新兴解决方案引起了极大兴趣。然而，由于高度动态的拓扑结构、时间变化的交通状况以及链路故障的易发生性，LEO网络中的路由仍是一个根本性挑战。传统路由算法通常假设静态链路度量，未能考虑队列积压或实时系统变化，因此在此类环境中效果较差。我们提出了一个队列感知的多智能体深度强化学习（MA-DRL）框架，用于LEO卫星网络中的路由。每个卫星都被建模为独立代理，负责本地路由决策，实现分布式且可扩展的解决方案。该框架提出了一个延迟感知优化问题，结合了后台流量、每个卫星的队列动态和韧性评分，以提升鲁棒性。我们针对状态-动作-奖励-状态-动作（SARSA）和Dijkstra算法进行了评估。虽然Dijkstra在理想条件下实现了最低的端到端延迟，但随着网络规模的扩大，其计算和信令开销成为一个重要的瓶颈。相比之下，我们提出的方法显著降低开销（约为5秒重算间隔时Dijkstra的50%），能高效地随网络规模扩展，并在流量负载增加下有效管理队列积压和韧性，展现了LEO卫星网络更强的鲁棒性和可扩展性，同时保持了具有竞争力的延迟和韧性评分。
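
The abstract contrasts static link metrics with queue-aware routing. The sketch below shows how a Dijkstra baseline changes its route once per-node queueing delay is added to the link cost. The additive cost model, the node names, and the delay values are all assumptions for illustration and are not taken from the paper.

```python
import heapq


def shortest_delay(graph, queue_delay, source, target):
    """Dijkstra over link propagation delay plus queueing delay at the next hop.

    graph: dict mapping node -> {neighbor: propagation_delay}.
    queue_delay: dict mapping node -> current queueing delay (missing means 0).
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, propagation in graph.get(node, {}).items():
            candidate = dist + propagation + queue_delay.get(neighbor, 0.0)
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


# Hypothetical four-satellite topology with delays in arbitrary units.
graph = {"A": {"B": 1.0, "C": 2.0}, "B": {"D": 1.0}, "C": {"D": 2.0}}
static_cost = shortest_delay(graph, {}, "A", "D")            # route via B
congested_cost = shortest_delay(graph, {"B": 5.0}, "A", "D")  # route via C
```

A centralized recomputation like this needs up-to-date queue state from every satellite, which relates to the signaling overhead the abstract attributes to Dijkstra at scale.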

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

面向在线人类反馈强化学习的数据依赖探索

Authors: Zhen-Yu Zhang, Yuting Tang, Jiandong Zhang, Lanjihong Ma, Masashi Sugiyama
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.04477
Pdf link: https://arxiv.org/pdf/2605.04477
Abstract Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this setting is exploration, which requires algorithms that enable the LLMs to generate informative comparisons that improve sample-efficiency in online RLHF. Existing exploration strategies often derive bonuses via on-policy expectations, which are difficult to estimate reliably from the limited historical preference data available during training; as a result, the policy can prematurely down-weight under-explored regions that may contain high-value behaviors. In this paper, we propose data-dependent exploration for preference optimization (DEPO), a simple and scalable method that leverages historical data to construct an extra uncertainty bonus for high-uncertainty regions, encouraging exploration toward potentially high-value data. Theoretically, we provide a data-dependent regret bound for the proposed algorithm, showing that it adapts to the hardness of the learning task itself and can be tighter than worst-case bounds in practice. Empirically, the proposed method consistently outperforms strong baselines across benchmarks, demonstrating improved sample efficiency.
中文摘要 在线人类反馈强化学习（RLHF）通过在训练过程中不断收集新的偏好反馈，已成为对齐大型语言模型（LLM）的一种有前景的范式。这一设定中的一个基础性挑战是探索，它要求算法能让LLM生成信息量大的比较，从而提升在线RLHF的样本效率。现有的探索策略通常通过同策略（on-policy）期望来推导奖励加成，而这些期望难以从训练期间有限的历史偏好数据中可靠估计；因此，策略可能过早地降低那些探索不足但可能包含高价值行为的区域的权重。本文提出了面向偏好优化的数据依赖探索（DEPO），这是一种简单且可扩展的方法，利用历史数据为高不确定性区域构建额外的不确定性加成，鼓励向潜在的高价值数据探索。理论上，我们为所提算法给出了一个数据依赖的遗憾界，表明它能适应学习任务本身的难度，并且在实践中可以比最坏情况界更紧。实证上，所提方法在各基准上始终优于强基线，显示出样本效率的提升。

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

迈向一般偏好对齐：纳什均衡下的扩散模型

Authors: Jiaming Hu, Jiamu Bai, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.04494
Pdf link: https://arxiv.org/pdf/2605.04494
Abstract Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley-Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self-improvement and better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.
中文摘要 基于人类反馈的强化学习（RLHF）在将文本到图像（T2I）扩散模型与人类偏好对齐方面非常受欢迎。作为RLHF的主流分支，直接偏好优化（DPO）提供了一种计算效率高的替代方案，避免了显式奖励建模，并已被广泛应用于扩散比对。然而，现有基于偏好的扩散比对方法仍依赖于奖励诱导的偏好信号，通常假设人类偏好可以通过Bradley-Terry（BT）模型充分模拟，但该模型可能无法完全反映人类偏好的复杂性。本文从博弈论视角提出了扩散比对。我们提出了扩散纳什偏好优化（Diffusion Nash Preference Optimization，简称Diff.-NPO），这是一个直观的通用扩散比对偏好框架。Diff.-NPO鼓励现行政策自我对抗，以实现自我提升并实现更好的对齐。通过实证，我们通过多种指标证明了Diff.-NPO在文本生成任务中的有效性。Diff.-NPO 持续优于现有基于偏好的扩散比对方法。

Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

Pen-Strategist：渗透测试策略制定与分析的推理框架

Authors: Yasod Ginige, Pasindu Marasinghe, Sajal Jain, Suranga Seneviratne
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04499
Pdf link: https://arxiv.org/pdf/2605.04499
Abstract Cyber threats are rapidly increasing, expanding their impact from large-scale enterprises to government services and individual users, making robust security systems increasingly essential. However, a significant shortage of skilled cybersecurity professionals exacerbates this challenge. While recent research has explored automating tasks such as penetration testing using LLM-based agents, existing frameworks often perform poorly due to limited capability in strategy formulation, domain-specific reasoning, and accurate action and tool selection. To overcome these limitations, we propose the Pen-Strategist framework, consisting of a novel domain-specific reasoning model that derives pentesting strategies via logical reasoning and a classifier that converts the strategies into actionable steps. First, we construct a reasoning dataset containing logical explanations for both strategy derivation and step selection in pentesting scenarios. We then fine-tune a Qwen-3-14B model for strategy generation using reinforcement learning. Evaluation on the test split of the dataset demonstrates an 87% improvement in strategy derivation performance compared to the baseline. Furthermore, we integrate the fine-tuned Pen-Strategist model into existing automated pentesting frameworks, such as PentestGPT, and evaluate its performance on vulnerable machines, achieving a 47.5% improvement in subtask completion while surpassing the baseline GPT-5. Further experiments on the CTFKnow benchmark show an 18% performance gain over the base model. For step prediction, we train a semantic-based CNN classifier, which outperforms commercial LLMs by 28% and enhances execution stability. Finally, we conduct a user study to qualitatively assess the generated strategies, and Pen-Strategist demonstrates superior performance compared to Claude-4.6-Sonnet.
中文摘要 网络威胁迅速增加，其影响从大型企业扩展到政府服务和个人用户，使得强大的安全系统变得日益必要。然而，专业网络安全专业人员的严重短缺加剧了这一挑战。尽管近期研究探讨了利用基于LLM的代理自动化渗透测试等任务，但现有框架由于策略制定能力有限、领域特定推理以及准确的动作和工具选择，表现往往较差。为克服这些局限，我们提出了笔墨策略师框架，该框架由一个新颖的领域特定推理模型组成，该模型通过逻辑推理推导渗透测试策略，以及一个将策略转化为可操作步骤的分类器。首先，我们构建了一个推理数据集，包含渗透测试场景中策略推导和步骤选择的逻辑解释。随后，我们对Qwen-3-14B模型进行了微调，用于强化学习的策略生成。对数据集测试拆分的评估显示，策略推导性能相比基线提升了87%。此外，我们将经过精细优化的Pen-Strategist模型集成到现有的自动化渗透测试框架中，如PentestGPT，并评估其在易受攻击机器上的表现，实现了子任务完成率提升47.5%，同时超越了基线GPT-5。CTFKnow 基准测试显示，性能比基础模型提升了18%。在步进预测方面，我们训练了一个基于语义的CNN分类器，其性能比商业大型语言模型高出28%，并提升了执行稳定性。最后，我们进行了用户研究以定性评估生成策略，Pen-Strategist表现优于Claude-4.6-Sonnet。

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

幂分布连接采样、自奖励强化学习与自蒸馏

Authors: Akiyoshi Tomihari, Issei Sato
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.04542
Pdf link: https://arxiv.org/pdf/2605.04542
Abstract Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.
中文摘要 近期分析质疑强化学习（RL）是否对大型语言模型（LLMs）中的强推理负有责任。与此同时，蒸馏和推断时间采样（包括幂采样）已成为提升LLM性能的有效方法。然而，强化学习、蒸馏和采样之间的关系仍不明确。本研究重点关注功率分布、功率采样的目标分布，并展示了功率分布连接采样、自我奖励KL正则化强化学习和自蒸馏。从采样的角度来看，我们表明，廉价的局部近似无法在没有可能后缀信息的情况下重现序列级的功率。从强化学习的角度来看，当模型的序列级对数概率作为奖励时，幂分布是KL正则化RL的闭式优化器。这种识别导致了功率自蒸馏，这是一种离线蒸馏替代品，共享相同的目标分布，并将功率采样的成本摊销到教师样本的监督培训中。我们进一步证明，功率自我蒸馏可以实现自我奖励锐利化，而下游真奖励的改进则受功率分布下真实奖励与自我奖励之间的协差性支配。推理任务的实验支持我们的分析：功率抽样能提升自我奖励，真正的回报收益依赖于与自我奖励的一致性，而功率自我蒸馏可以以更低的推断成本匹敌甚至超越功率抽样的性能。
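
The abstract's claim that the power distribution is the closed-form optimizer of KL-regularized RL with log-probability rewards can be checked numerically on a toy distribution. The sketch assumes the objective E[r] - beta * KL(pi || p), whose optimum is proportional to p * exp(r / beta); with r = log p this becomes p raised to the power 1 + 1/beta. The objective convention, `beta`, and the three-outcome distribution are assumptions for illustration, not values from the paper.

```python
import math


def power_distribution(probs, alpha):
    """Normalize probs ** alpha; alpha > 1 sharpens the distribution."""
    weights = [p ** alpha for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || ref): ref * exp(r / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical base distribution over three complete sequences.
base = [0.5, 0.3, 0.2]
beta = 0.5
self_reward = [math.log(p) for p in base]  # reward = the model's own log-probability

rl_solution = kl_regularized_optimum(base, self_reward, beta)
power_solution = power_distribution(base, 1.0 + 1.0 / beta)
```

Both routes yield the same sharpened distribution, which puts more mass on the sequence the base model already prefers.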

Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models

Counter-Dyna：基于强化学习的数据高效HVAC控制，利用反事实建筑模型

Authors: Jan Marco Ruiz de Vargas, Fabian Raisch, Zoltan Nagy, Pierre Pinson, Christoph Goebel
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.04555
Pdf link: https://arxiv.org/pdf/2605.04555
Abstract Model-based reinforcement learning (MBRL) offers a promising approach for data-efficient energy management in buildings, combining the strengths of predictive modeling and reinforcement learning. While previous MBRL methods applied to HVAC control have reduced training data requirements, they still require several months of interaction with the building to learn a satisfactory control policy. A key reason is that existing surrogate models attempt to predict the entire state-space, including weather and electricity prices that are unaffected by control actions, or completely ignore these variables. Addressing these issues, we propose Counter-Dyna, a method that enhances the data-efficiency of Dyna, an MBRL method. We create data-efficient counterfactual surrogate models (CSM) by leveraging invariances in the state-space. Using a CSM in Dyna speeds up RL training, measured in environment interaction data, compared to previous results. In comparison with the previous state of the art, which used 6-12 months of environment interactions, our method needs only 5 weeks. We evaluate our method in a large simulation study using the literature-standard BOPTEST framework and proximal policy optimization (PPO) as the RL algorithm. Our results show cost-saving potentials of 5.3% to 17.0% in a hypothetical deployment scenario. Our work is a significant step towards making real-world deployment of RL algorithms in HVAC control practically viable.
中文摘要 基于模型的强化学习（MBRL）为建筑中数据高效的能源管理提供了一种有前景的方法，结合了预测建模和强化学习的优势。虽然之前应用于暖通空调控制的MBRL方法减少了训练数据需求，但仍需数月与建筑互动以学习满意的控制策略。一个关键原因是现有的替代模型要么试图预测整个状态空间，包括不受控制动作影响的天气和电价，要么完全忽略这些变量。针对这些问题，我们提出了Counter-Dyna方法，这是一种提升MBRL方法Dyna数据效率的方法。我们通过利用状态空间的不变性创建数据高效的反事实替代模型（CSM）。与以往结果相比，在Dyna中使用CSM可以加快以环境交互数据量衡量的强化学习训练。与之前采用6-12个月环境交互的先进方法相比，我们的方法只需5周时间。我们在一项大型模拟研究中评估了我们的方法，采用文献标准的BOPTEST框架，并以近端策略优化（PPO）作为强化学习算法。我们的结果显示，在假设部署情景下，成本节约潜力为5.3%至17.0%。我们的工作是实现强化学习算法在暖通空调控制中实际应用的重要一步。

Delay-Aware Large-Small Model Collaboration over LEO Satellite Networks

基于低轨道卫星网络的延迟感知大小模型协作

Authors: Mingyu Guo, Wen Wu, Ying Wang, Songge Zhang, Liang Li
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.04565
Pdf link: https://arxiv.org/pdf/2605.04565
Abstract In this paper, we introduce a delay-aware large-small model collaboration scheme for low Earth orbit (LEO) satellite networks, which can balance the computational load among satellites and the communication load across inter-satellite links. Specifically, computational-resource-constrained remote sensing satellites are responsible for data collection and local processing using small models, while collaborating with computing satellites that provide large model processing. To minimize the service delay, we formulate a joint optimization problem for offloading decision and routing strategy design, which is transformed into a decentralized partially observable Markov decision process. To solve the problem, we develop a multi-agent reinforcement learning (MARL)-based algorithm with offline policy training and online bisection search. The offline trained policy determines routing strategies, while online bisection search iteratively adjusts the offloading decisions. Simulation results demonstrate that the proposed scheme can reduce the service delay by up to 31.85% compared with the benchmarks.
中文摘要 本文介绍了一种针对低地球轨道（LEO）卫星网络的延迟感知大小型模型协作方案，能够平衡卫星间的计算负载和跨卫星链路的通信负载。具体来说，计算资源受限的遥感卫星负责利用小模型进行数据收集和局部处理，同时与提供大型模型处理的计算卫星协作。为最小化服务延迟，我们构建了一个用于卸载决策和路由策略设计的联合优化问题，并将其转化为去中心化的部分可观测马尔可夫决策过程。为解决该问题，我们开发了一种基于多智能体强化学习（MARL）的算法，支持离线策略训练和在线二分搜索。离线训练策略决定路由策略，而在线二分搜索则迭代调整卸载决策。模拟结果表明，与基准方案相比，该方案可将服务延迟减少多达31.85%。

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC：基于梯度的模型预测控制，利用潜在想象力

Authors: Jonathan Spieler, Sven Behnke
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.04568
Pdf link: https://arxiv.org/pdf/2605.04568
Abstract State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates a few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization, and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more.
中文摘要 最先进的基于模型的强化学习（RL）方法要么采用无梯度、基于种群的方法进行规划，要么采用学习得到的策略网络，要么将策略网络与规划相结合。将模型预测控制（MPC）与学习得到的模型和策略先验相结合、以利用两种范式优势的混合方法已显示出令人鼓舞的成果。然而，这些方法通常依赖无梯度优化方法，对于高维控制任务来说计算开销可能较大。虽然基于梯度的方法是一个有前景的替代方案，但近期研究的实证结果表明，基于梯度的方法往往表现不如无梯度方法。我们提出了Dream-MPC，这是一种新颖的方法：从策略展开中生成少量候选轨迹，并利用学习得到的世界模型通过梯度上升优化每条轨迹，同时结合不确定性正则化，以及通过重用先前优化的动作将优化迭代在时间上摊销。我们在24个连续控制任务上的结果表明，Dream-MPC能够显著提升底层策略的性能，并且能够优于无梯度MPC和最先进的基线。我们将开源我们的代码及更多内容。

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

ReflectDrive-2：面向离散扩散驾驶的强化学习对齐自编辑

Authors: Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.04647
Pdf link: https://arxiv.org/pdf/2605.04647
Abstract We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
中文摘要 我们提出ReflectDrive-2，一种带有独立动作专家的掩码离散扩散规划器，用于自动驾驶，它将规划表示为离散轨迹词元，并通过并行掩码解码生成。这种离散词元空间支持原地轨迹修正：AutoEdit 使用同一模型重写选定的词元，无需辅助的细化网络。为了训练这项能力，我们采用两阶段流程。首先，我们沿纵向行进和横向航向方向构建专家轨迹的结构感知扰动，并监督模型恢复原始专家轨迹。随后，我们通过强化学习（RL）微调完整的决策-起草-反思（decision-draft-reflect）展开过程，将终端驾驶奖励分配给编辑后的最终轨迹，并通过完整展开的状态转移传播策略梯度信用。完整展开的RL对于耦合起草与编辑至关重要：仅靠监督训练时，推理时的AutoEdit最多将PDMS提升0.3，而RL将其增益提升至1.9。我们还为决策-起草-反思流水线协同设计了高效的反思解码栈，结合了共享前缀KV重用、Alternating Step Decode和融合的设备端解除掩码。在NAVSIM上，ReflectDrive-2在仅使用摄像头输入时达到91.0 PDMS，在best-of-6的oracle设置下达到94.8 PDMS，同时在NVIDIA Thor上的平均延迟为31.8毫秒。

From Reach to Insert: Tactile-Augmented Precision Assembly under Sub-Millimeter Tolerances

从接近到插入：亚毫米公差下的触觉增强精密装配

Authors: Xinpan Meng, Siyao Huang, JingPu Yang, Muyuan Ma, Zhenghua Ma, Lijun Han, Gao Yuan, Houcheng Li, Long Cheng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.04649
Pdf link: https://arxiv.org/pdf/2605.04649
Abstract High-precision assembly frequently involves tight-tolerance insertions, where even slight pose errors can cause jamming or excessive interaction forces, making robust and safe insertion policies difficult to obtain. This paper proposes a tactile-augmented two-stage method that combines Imitation Learning (IL) and Reinforcement Learning (RL) for precision insertion tasks. In the first stage, IL learns a reaching policy with position generalization that grasps the peg and brings it to the vicinity of the target region. In the second stage, RL executes the insertion and enables recovery from failures during contact-rich interactions. To better exploit tactile feedback, we introduce tactile group sampling to increase coverage of critical contact segments during training, and design a tactile critic to more accurately evaluate policy values, improving insertion performance while maintaining low contact forces. We conduct systematic experiments across five hole geometries and three clearance settings. Results show that our method substantially improves insertion performance across all settings; under the most challenging 0.05 mm clearance, it achieves a 67% success rate while keeping contact forces low, reducing the maximum interaction force by 60% and torque by 44%, thereby validating both effectiveness and safety for precision assembly.
中文摘要 高精度组装常涉及高精度插入，即使是轻微的姿态错误也可能导致卡壳或过大相互作用力，使得获得稳健且安全的插入策略变得困难。本文提出了一种触觉增强的两阶段方法，结合了模仿学习（IL）和强化学习（RL）用于精确插入任务。在第一阶段，IL学习一种带有位置泛化的达标策略，能够抓住目标节点并将其带到目标区域附近。在第二阶段，强化学习执行插入，并使在接触丰富交互中实现故障恢复。为更好地利用触觉反馈，我们引入触觉群抽样以增加培训期间关键接触段的覆盖，并设计触觉批评器以更准确地评估政策价值，提升插入性能，同时保持低接触力。我们对五种孔体几何形状和三种净空设置进行了系统性实验。结果显示，我们的方法在所有设置下显著提升了插入性能;在最具挑战性的0.05毫米间隙下，它实现了67%的成功率，同时保持了低接触力，最大相互作用力减少了60%，扭矩减少了44%，从而验证了精密组装的有效性和安全性。

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

ELVIS：面向长时域视觉MPC的集成校准潜在想象

Authors: Yurui Du, Pinhao Song, Yutong Hu, Renaud Detry
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.04709
Pdf link: https://arxiv.org/pdf/2605.04709
Abstract A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner's long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.
中文摘要 基于模型的强化学习（RL）视觉控制的一个核心挑战是可靠的长期规划：长时间的推广与学习的潜在动态会表现出分支未来和多模态的行动价值分布。此外，视觉遮挡放大的模型错误叠加使深度想象变得脆弱。我们介绍ELVIS，一种潜在模型预测控制器（MPC），旨在实现长期规划的实用性。ELVIS采用Dreamer风格的循环状态空间模型（RSSM），并用高斯混合MPPI替代标准单峰模型预测路径积分（MPPI），后者在长视野内保持多个相干假设，避免在分支扩展下出现模态平均。与此同时，ELVIS通过共享的不确定性意识λ回报稳定了深度想象力：一组潜在批评者定义了一个上置信界（UCB）分数，限制时间变化的λ，并自适应地权衡自带法和前瞻法，以减少规划中的复利误差。相同的回报既用于从想象的展开前训练演员-批评者，也用于在GMM-MPPI中对候选轨迹进行评分，使强化学习目标与规划者的长期视野优化保持一致。在十四项DeepMind控制套件视觉任务中，ELVIS实现了与TD-MPC2和DreamerV3相比最先进的性能。最后，ELVIS将零射技术转移到现实中严重遮挡的喷砂任务中，提升了表面质量指标，并展示了超越模拟的鲁棒性。
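
The ELVIS abstract describes a lambda-return whose lambda varies over time to trade off bootstrapping against look-ahead. A minimal Python sketch of the standard backward recursion for a lambda-return with a per-step lambda is given below. The function name `lambda_returns` and its arguments are illustrative assumptions, and how ELVIS derives each lambda from the ensemble's UCB score is not stated in the abstract, so the per-step lambdas are simply taken as inputs here.

```python
def lambda_returns(rewards, values, lambdas, gamma=0.99):
    """Backward recursion for a lambda-return with a per-step lambda.

    rewards[t] and lambdas[t] are given for t = 0..T-1.
    values[t] is a critic estimate for t = 0..T; values[T] bootstraps the tail.
    Hypothetical helper: the gating that produces `lambdas` is not modelled.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1 or len(lambdas) != horizon:
        raise ValueError("need len(values) == T + 1 and len(lambdas) == T")

    returns = [0.0] * horizon
    next_return = values[horizon]  # the tail beyond the horizon is pure bootstrap
    for t in reversed(range(horizon)):
        # lambda = 0 trusts the critic one step ahead (bootstrapping);
        # lambda = 1 trusts the longer imagined rollout (look-ahead).
        mixed = (1.0 - lambdas[t]) * values[t + 1] + lambdas[t] * next_return
        returns[t] = rewards[t] + gamma * mixed
        next_return = returns[t]
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0]
    values = [0.0, 0.5, 2.0]
    print(lambda_returns(rewards, values, [0.0, 0.0], gamma=0.5))  # [1.25, 2.0]
    print(lambda_returns(rewards, values, [1.0, 1.0], gamma=0.5))  # [2.0, 2.0]
```

Setting every lambda to 0 reduces each target to a one-step TD target, while setting every lambda to 1 yields the full rollout return with a single bootstrap at the horizon.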

SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

SPHERE：缓解深度强化学习中专家混合模型的谱可塑性丧失

Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Cong Fang, Qing Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.04712
Pdf link: https://arxiv.org/pdf/2605.04712
Abstract In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.
中文摘要 在深度强化学习（DRL）中，智能体通过经验流进行训练。在持续学习设置中，这类智能体可能出现可塑性丧失：它们从新经验中学习新技能的能力会随训练而减弱。最近有报道称，专家混合（MoE）网络能够实现缩放定律并促进多样化技能的学习。然而，在持续强化学习设置中，其性能可能随着学习的进行而退化，表明可塑性丧失。为此，基于神经正切核（NTK）理论，我们将MoE策略中的可塑性丧失形式化为谱可塑性的丧失。随后我们推导出谱可塑性的一个易于处理的代理指标，它可以用各个专家的特征矩阵来表示。利用这一代理指标，我们提出了SPHERE，一种为基于MoE的策略量身定制的实用Parseval惩罚项，可缓解谱可塑性的丧失。在MetaWorld和HumanoidBench上，相比未正则化的MoE基线，SPHERE将持续强化学习下的平均成功率提升了133%和50%，同时在整个训练过程中保持更高的谱可塑性。
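
The SPHERE abstract mentions a Parseval penalty expressed in terms of individual expert feature matrices. The sketch below shows the generic Parseval-style regularizer, the squared Frobenius distance between a Gram matrix and the identity, averaged over experts. It is only an illustration under assumptions: the function names, the averaging over experts, and the choice of Gram matrix are hypothetical, and the exact penalty used by SPHERE is not given in the abstract.

```python
import numpy as np


def parseval_penalty(features):
    """Generic Parseval-style penalty ||F^T F - I||_F^2.

    `features` has shape (num_samples, feature_dim). The penalty is zero
    exactly when the columns of F are orthonormal.
    """
    gram = features.T @ features
    identity = np.eye(gram.shape[0])
    return float(np.sum((gram - identity) ** 2))


def moe_parseval_penalty(expert_features):
    """Average the penalty over a list of per-expert feature matrices.

    Averaging over experts is an assumption made for this sketch.
    """
    penalties = [parseval_penalty(f) for f in expert_features]
    return sum(penalties) / len(penalties)


if __name__ == "__main__":
    orthonormal = np.eye(2)
    scaled = 2.0 * np.eye(2)
    print(parseval_penalty(orthonormal))                 # 0.0
    print(parseval_penalty(scaled))                      # 18.0
    print(moe_parseval_penalty([orthonormal, scaled]))   # 9.0
```

In a training loop, a term like this would typically be added to the policy loss with a small coefficient; that coefficient is a tuning choice not specified in the abstract.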

Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

每一步都很重要：工具集成文本转SQL的步骤级信用分配

Authors: Yaxun Dai, Baolin Sun, Junying Wang, Pengfei Wang, Yingqi Gao, Xuemei Dong, Mengdie Chu, Xiang Qi, Pingfu Chao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.04719
Pdf link: https://arxiv.org/pdf/2605.04719
Abstract Tool-integrated Text-to-SQL parsing has emerged as a promising paradigm, framing SQL generation as a sequential decision-making process interleaved with tool execution. However, existing reinforcement learning approaches mainly rely on coarse-grained outcome supervision, resulting in a fundamental credit assignment problem: models receive the same reward for any trajectory that yields the correct answer, even when intermediate steps are redundant, inefficient, or erroneous. Consequently, models are encouraged to explore suboptimal reasoning spaces, limiting both efficiency and generalization. To address this problem, we propose FineStep, a novel framework for step-level credit assignment in tool-augmented Text-to-SQL. First, we introduce a reward design with independent process rewards to alleviate the signal sparsity of outcome supervision. Next, we present a step-level credit assignment mechanism to precisely quantify the value of each reasoning step. Finally, we develop a policy optimization method based on step-level advantages for efficient updates. Extensive experiments on BIRD benchmarks show that FineStep achieves state-of-the-art performance and reduces redundant tool interactions, with a 3.25% average EX gain over GRPO at the 4B scale.
中文摘要 工具集成的文本转SQL解析已成为一种有前景的范式，将SQL生成框架为一种顺序决策过程与工具执行交织。然而，现有的强化学习方法主要依赖粗粒度的结果监督，导致了一个根本性的学分分配问题：模型对任何得出正确答案的轨迹都获得相同奖励，即使中间步骤是冗余、低效或错误的。因此，鼓励模型探索次优推理空间，限制了效率和泛化。为解决这一问题，我们提出了FineStep，这是一种工具增强的文本转SQL中用于步骤级学分分配的新框架。首先，我们引入了带有独立过程奖励的奖励设计，以缓解结果监督的信号稀疏。接下来，我们提出了一个步骤级的学分分配机制，以精确量化每个推理步骤的价值。最后，我们基于步骤层面优势开发了一种策略优化方法，实现高效更新。BIRD基准测试的大量实验表明，FineStep实现了最先进的性能，减少了冗余工具交互，在4B尺度上平均EX提升比GRPO高3.25%。

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

沉浸式视频角色扮演的奖励分解强化学习

Authors: Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Yaduan Ruan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04733
Pdf link: https://arxiv.org/pdf/2605.04733
Abstract Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the likelihood of the reference response; (iii) answer accuracy to ensure faithfulness; and (iv) a dense format reward to enforce the desired structured output. Extensive experiments demonstrate that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, delivering simultaneous gains in visual-atmosphere consistency and character authenticity. Beyond the role-playing domain, EBM-RL also exhibits strong zero-shot generalization: without any additional fine-tuning, it consistently improves performance on out-of-domain VideoQA benchmarks. We additionally release an open-source dataset for video-grounded role-playing dialogue.
中文摘要 基于文本的角色扮演模型可以模仿角色风格，但它们往往无法反映场景的氛围和不断变化的紧张感，而这两者对于虚拟现实（VR）游戏和互动叙事等沉浸式应用至关重要。我们研究基于视频的角色扮演对话，并引入EBM-RL（眼脑口强化学习），这是一种基于GRPO的解耦框架，明确区分观察（[感知]）、推理（[思考]）和言语（[答案]）。这种结构通过促使模型先关注视觉线索，然后形成内部解读，最终生成符合情境的对话，从而促进了类似人类的感官扎根。EBM-RL集成了四项互补奖励：（i）基于CLIP的场景-文本对齐以改善氛围和情感;（ii）一种感知-认知奖励，鼓励[感知]和[思考]过程，从而增加参照反应的可能性;（iii）回答准确以确保忠实;以及（iv）密集格式奖励以强制实现期望的结构化输出。大量实验表明，EBM-RL在我们的沉浸式角色扮演基准中，显著优于纯文本角色扮演基线和更大规模视觉语言模型，同时在视觉-氛围一致性和角色真实性方面获得提升。在角色扮演领域之外，EBM-RL还表现出强烈的零拍摄泛化能力：无需额外微调，它在域外视频质量保证基准测试中持续提升性能。我们还发布了一个基于视频的角色扮演对话开源数据集。

Hierarchical Multiagent Reinforcement Learning for Multi-Group Tax Game

面向多群体税收博弈的分层多智能体强化学习

Authors: Honglei Guo, Yuhan Zhao, Yexin Li
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.04741
Pdf link: https://arxiv.org/pdf/2605.04741
Abstract Reinforcement learning has increasingly been used to study economic decision-making, such as taxation, public spending, and labour supply. However, most existing RL-based economic models focus on a single government-household group, thereby overlooking the strategic interactions that arise when multiple governments compete while managing their own populations. In practice, many economic systems (e.g., taxation) exhibit a multi-group structure, where each government must optimize its fiscal policy in response not only to household behaviour within its jurisdiction, but also to the policies of other competing governments. To capture this structure, we formulate taxation as a hierarchical multi-group game. Within each group, the interaction between the government and households is modelled as a leader-follower game; across groups, governments are modelled as players in a competitive game. This results in a hybrid hierarchical game that is difficult to solve using standard multi-agent reinforcement learning algorithms. We therefore propose a bi-level training framework built on multi-agent reinforcement learning, together with Curriculum Learning and a Closed-Loop Sequential Update strategy, to stabilize training and promote convergence. We instantiate this framework in a taxation game simulation environment grounded in classical economic models. The environment supports the evaluation of different taxation algorithms and provides multiple economic indicators for assessing policy performance. Experiments show that our approach can learn stable tax policies that benefit all participating groups. Compared with a two-group baseline without the proposed update mechanisms, our method avoids premature game collapse, extends the effective game duration by 60.92%, produces more sustainable and robust tax policies, and reduces GDP disparities among governments by 44.12%.
中文摘要 强化学习越来越多地被用于研究经济决策，如税收、公共支出和劳动力供应。然而，大多数现有基于强化学习的经济模型都聚焦于单一政府——家庭群体，因此忽视了当多个政府在管理自身人口的同时竞争时所产生的战略互动。实际上，许多经济体系（如税收）呈现出多群体结构，每个政府必须优化财政政策，不仅要根据其管辖内的家庭行为，还要应对其他竞争政府的政策。为了捕捉这种结构，我们将税收表述为一个层级多群体博弈。在每个群体内，政府与家庭之间的互动被建模为领导者-追随者游戏;跨群体，政府被建模为竞争博弈中的参与者。这导致了一个混合层级博弈，使用标准多智能体强化学习算法难以求解。因此，我们提出一个基于多智能体强化学习的双级训练框架，结合 \textit{ 课程学习}和 \textit{闭环顺序更新}策略，以稳定训练并促进融合。我们将这一框架实例化为基于经典经济模型的税务游戏模拟环境。该环境支持评估不同税收算法，并提供多种经济指标以评估政策绩效。实验表明，我们的方法能够学习出有利于所有参与群体的稳定税收政策。与没有所提更新机制的两组基线相比，我们的方法避免了游戏过早崩溃，将有效游戏时长延长了60.92%，制定了更可持续、更健全的税收政策，并将政府间的GDP差距减少了44.12%。

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

VTAgent：面向证据感知视频TextVQA的智能体式关键帧锚定

Authors: Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.04870
Pdf link: https://arxiv.org/pdf/2605.04870
Abstract Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
中文摘要 基于视频文本的视觉问答（Video TextVQA）旨在通过推理视频中出现的视觉文本内容来回答问题。尽管近期视频大型语言模型具备强大的多模态视频理解能力，但其在现有视频文本VQA基准测试中的表现仍然有限。为了更好地理解这一差距，我们通过逐帧问题回答进行上界分析，如果任何一帧都给出正确答案，则该样本被计为正确，这显著优于基于视频的直接推断，并揭示了显著的性能差距。结果表明，主要瓶颈在于关键问题相关证据的定位，而非推理能力本身。基于这一见解，我们提出了一个问题引导代理框架，明确锚定相关关键帧后再回答。该方法在无培训环境中有效运行，且持续超越直接视频推断。通过额外的监督微调（SFT）和强化学习（RL），在基准测试中平均精度提升为+12.12，ANLS提升+11.15，建立了新的最先进结果。我们的研究强调了显性关键帧锚定在推动视频文本VQA发展中的关键作用。代码将公开发布。
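
The upper-bound analysis in the VTAgent abstract counts a sample as correct if any single frame yields the right answer. A minimal Python sketch of that counting rule follows. The function names, the exact-match comparison, and the lowercase/strip normalization are assumptions made for illustration; the paper's actual matching criterion is not described in the abstract.

```python
def normalize(text):
    """Hypothetical normalization: trim whitespace and lowercase."""
    return text.strip().lower()


def any_frame_upper_bound(per_frame_answers, references):
    """Fraction of samples where at least one frame-level answer matches.

    per_frame_answers[i] is the list of answers obtained by asking the
    question on each frame of video i; references[i] is the gold answer.
    """
    if len(per_frame_answers) != len(references):
        raise ValueError("one list of frame answers is needed per reference")
    if not references:
        raise ValueError("at least one sample is required")

    correct = 0
    for answers, reference in zip(per_frame_answers, references):
        target = normalize(reference)
        if any(normalize(answer) == target for answer in answers):
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    frame_answers = [["exit", "STOP "], ["exit"], []]
    gold = ["Stop", "enter", "open"]
    # Only the first video has a frame whose answer matches the reference.
    print(any_frame_upper_bound(frame_answers, gold))
```

Because a single matching frame is enough, this score is an optimistic bound on what a system with perfect keyframe localization could reach with the same frame-level answerer.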

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和平均表述

Authors: Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04880
Pdf link: https://arxiv.org/pdf/2605.04880
Abstract Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
中文摘要 近期研究重新激发并放大了对无限视野、非片段式（持续）任务中无折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程（SMDPs）尤其值得关注。在SMDP中，离散动作随机地同时产生奖励和持续时间，目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比例来实现这一目标。然而，当奖励和持续时间在无限视野内是非平定的时，这种判断可能不正确。本文提出了一种新颖的修正谐均算符，即使在此类条件下也能正确计算奖励率。这产生了无模型的学习算法，能够与SMDPs合作，同时保持对非平稳奖励和持续时间分布的鲁棒性。我们证明了修正谐均算符的理论性质，并通过实证方式展示了其相较于现有算法的有效性。
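
To make the notion of an average reward rate in an SMDP concrete, the Python sketch below compares three quantities on a short trajectory: the ratio of total reward to total duration, the plain mean of per-step rates, and a reward-weighted harmonic mean of per-step rates. For strictly positive rewards the first and third coincide, which is a general arithmetic identity. This is not the paper's modified harmonic mean operator, whose definition is not given in the abstract; all function names are illustrative assumptions.

```python
def reward_rate(rewards, durations):
    """Ratio of total reward to total elapsed time."""
    return sum(rewards) / sum(durations)


def mean_of_step_rates(rewards, durations):
    """Unweighted arithmetic mean of per-step rates r_i / d_i."""
    rates = [r / d for r, d in zip(rewards, durations)]
    return sum(rates) / len(rates)


def reward_weighted_harmonic_mean(rewards, durations):
    """Harmonic mean of per-step rates, weighted by each step's reward.

    Assumes strictly positive rewards and durations. Since r_i / rate_i
    equals d_i, this reduces algebraically to sum(rewards) / sum(durations).
    """
    rates = [r / d for r, d in zip(rewards, durations)]
    return sum(rewards) / sum(r / rate for r, rate in zip(rewards, rates))


if __name__ == "__main__":
    rewards = [1.0, 9.0]
    durations = [1.0, 3.0]
    print(reward_rate(rewards, durations))                    # 2.5
    print(mean_of_step_rates(rewards, durations))             # 2.0
    print(reward_weighted_harmonic_mean(rewards, durations))  # 2.5
```

The gap between 2.0 and 2.5 in the example shows why naively averaging per-step rates does not recover the reward rate when step durations differ.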

A Hierarchical Agent System with Reinforcement Learning for Multivariate Time Series Data Cleaning

一个带有强化学习的分层代理系统，用于多变量时间序列数据清洗

Authors: Yuhan Shi, Yuanyuan Yao, Lu Chen, Mourad Khayati, Tianyi Li
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2605.04902
Pdf link: https://arxiv.org/pdf/2605.04902
Abstract Multivariate time series (MTS) are frequently affected by co-occurring quality issues, such as missing values, outliers, and constraint violations, which significantly undermine downstream analytics. Existing cleaning approaches fix only a limited set of such issues, making them ill-suited for scenarios where multiple quality problems arise simultaneously. Furthermore, these methods commonly depend on the availability of ground truth data or domain-specific rules, both of which are rarely accessible in real-world applications. In this paper, we introduce \sys, an agent system with reinforcement learning designed to clean multiple data quality issues in MTS. We cast the cleaning process as a joint optimization problem that simultaneously handles quality issue order and cleaning model selection, allowing efficient navigation of the large space of possible cleaning pipelines. Our framework relies on a hierarchical agent architecture, where a high-level agent determines the order in which data quality issues should be processed, while a low-level agent identifies the most suitable cleaning method for each issue. To guide the agent toward an optimal cleaning pipeline, we propose a dual-stage reward mechanism that couples upstream (cleaning) and downstream performance, enabling effective optimization without relying on ground truth. Our experimental results show that \sys consistently outperforms existing methods, achieving up to 96% improvement in data cleaning quality and 27% improvement in downstream performance.
中文摘要 多变量时间序列（MTS）经常受到同时出现的质量问题影响，如缺失值、离群值和约束违规，这些问题会显著削弱下游分析。现有的清洗方法只能修复有限的几类问题，因此不适合多个质量问题同时出现的场景。此外，这些方法通常依赖于真值数据或领域特定规则，而这两者在实际应用中很少能获得。本文介绍了\sys，一个采用强化学习的智能体系统，旨在清洗MTS中的多种数据质量问题。我们将清洗过程建模为一个联合优化问题，同时处理质量问题的处理顺序和清洗模型的选择，从而高效地探索庞大的候选清洗流水线空间。我们的框架依赖于分层智能体架构：高层智能体决定数据质量问题的处理顺序，低层智能体则为每个问题确定最合适的清洗方法。为了引导智能体找到最优清洗流水线，我们提出了一种双阶段奖励机制，将上游（清洗）性能与下游性能耦合起来，从而无需依赖真值即可实现有效优化。我们的实验结果显示，\sys 持续优于现有方法，数据清洗质量提升高达96%，下游性能提升27%。

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

Strat-Reasoner：在多智能体游戏中强化大型语言模型的战略推理

Authors: Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04906
Pdf link: https://arxiv.org/pdf/2605.04906
Abstract While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents poses significant challenges for the evaluation of the reasoning process and the credit assignment over multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges as they do not incorporate other agents in the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a novel recursive reasoning paradigm where an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate the reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves strategic abilities of underlying LLMs, achieving 22.1% average performance improvements across various multi-agent games.
中文摘要 虽然大型语言模型（LLM）在某些推理任务中表现出色，但在多智能体游戏中表现不佳，因为最终结果依赖于所有智能体的联合策略。在多智能体博弈中，其他智能体的非平稳性对推理过程和多步骤推理的信用分配给了重大挑战。现有的单智能体强化学习（RL）方法及其多智能体扩展未能解决这些挑战，因为它们未将其他智能体纳入推理过程。在本研究中，我们提出了Strat-Reasoner，一种基于强化学习的新框架，旨在提升LLM在多智能体博弈中的战略推理能力。我们引入了一种新的递归推理范式，其中一个智能体的推理还整合了其他智能体的推理过程。为中间推理序列提供有效的奖励信号，我们采用集中式思维链（CoT）比较模块来评估推理质量。最后，我们计算出准确的混合优势，并开发了群体相对强化学习方法以优化LLM策略。实验结果显示，Strat-Reasoner 显著提升了底层大型语言模型的战略能力，在多种多智能体游戏中实现了 22.1% 的平均性能提升。

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

基于结果级优化的组合泛化强化学习

Authors: Xiyan Fu, Wei Liu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.04920
Pdf link: https://arxiv.org/pdf/2605.04920
Abstract Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.
中文摘要 合成泛化指的是正确解释已知原语的新组合，这仍是一个重大挑战。现有方法通常依赖监督微调，鼓励模型模仿目标输出。这种标记级训练范式未能捕捉泛化到未见组合所需的全局组合结构。本研究探讨是否可以通过结果级强化学习改进组合泛化。我们采用群体相对策略优化，基于对模型最终输出的反馈进行优化。在此框架下，我们探讨了简单的二元结果奖励和提供额外组成反馈的复合奖励。多重组合基准测试的实验表明，强化学习相比监督微调能提升组合泛化。进一步分析显示，监督模型往往会对频繁训练的组合进行过拟合，而强化学习则通过重塑输出分布来提升组合泛化，尤其是针对更复杂的组合类型。

Modular Reinforcement Learning For Cooperative Swarms

合作群体模块化强化学习

Authors: Erel Shtossel, Gal A. Kaminka
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04939
Pdf link: https://arxiv.org/pdf/2605.04939
Abstract A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement learning have demonstrated that it is possible for robots to learn how to interact effectively with others, in a manner that is aligned with the common goal, despite each robot learning independently of others. However, this requires each robot to represent a potentially combinatorial number of interaction states, challenging the memory capabilities of the robots. This paper proposes an alternative approach for representing spatial interaction states for multi-robot reinforcement learning in swarms. A modular (decomposed) representation is used, where each feature of the state is handled by a separate learning procedure, and the results aggregated. We demonstrate the efficacy of the approach in numerous experiments with simulated robot swarms carrying out foraging.
中文摘要 协作机器人群体是一组计算能力有限、共享共同目标的机器人集体。每个机器人只能与其一小部分同类机器人互动，且不知道这如何影响集体效用。分布式多智能体强化学习的最新进展表明，尽管每个机器人独立学习，机器人仍能学会如何有效与他人互动，且与共同目标保持一致。然而，这要求每个机器人都必须代表潜在的组合数量的交互状态，这对机器人的记忆能力构成了挑战。本文提出了一种替代方法，用于群体中多机器人强化学习中的空间交互状态表示。采用模块化（分解）表示法，状态的每个特征由独立的学习过程处理，结果被聚合。我们在多次模拟机器人群体采集实验中证明了该方法的有效性。

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

EP-GRPO：熵-进展对齐的群体相对策略优化，含隐式流程指导

Authors: Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.04960
Pdf link: https://arxiv.org/pdf/2605.04960
Abstract Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.
中文摘要 带可验证奖励的强化学习（RLVR），尤其是组相对策略优化（Group Relative Policy Optimization，GRPO），推动了LLM推理的发展。然而，GRPO存在三种信用分配失效：统一的词元级粒度忽视了各词元信息价值的差异，统一的极性会惩罚正确步骤并奖励错误步骤，零方差坍缩则会抹去由结果驱动的梯度。我们系统地量化了这些失效，揭示出高度不均匀的词元信息量、普遍存在的步骤级极性错位以及大量的训练浪费。为解决这些局限，我们提出了熵-进展对齐GRPO（EP-GRPO）框架，该框架挖掘模型内在的信息流，以提供密集的自监督引导。EP-GRPO集成了三个部分：熵门控调制，用于优先处理高熵决策枢纽；由策略散度得到并锚定于结果优势的隐式过程信号，可在不依赖外部奖励模型的情况下提供有方向的词元级反馈；以及累积熵映射，用于实现与进展对齐的优势归一化，从而在奖励方差为零时自然保持梯度流。在数学推理基准上的大量实验表明，EP-GRPO在准确率和效率上均优于GRPO及其变体。代码将会公布。

Graph-SND: Sparse Aggregation for Behavioral Diversity in Multi-Agent Reinforcement Learning

图-SND：多智能体强化学习中行为多样性的稀疏聚合

Authors: Shawn Ray
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.05020
Pdf link: https://arxiv.org/pdf/2605.05020
Abstract System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-SND, which replaces this complete-graph average with a weighted average over the edges of an arbitrary graph $G$. Three regimes follow: $G=K_n$ recovers SND exactly; a fixed sparse $G$ defines a localized diversity measure at $O(|E|)$ cost; and random edge samples yield an unbiased Horvitz-Thompson estimator and a normalized sample mean with $O(1/\sqrt{m})$ concentration in the sampled edge count $m$. For fixed sparse graphs we prove forwarding-index distortion bounds for expanders and a spectral refinement under low-rank distance structure; for random $d$-regular graphs we prove an unconditional probabilistic $\widetilde{\mathcal{O}}(D_{\max}/\sqrt{n})$ bound. On VMAS we verify recovery, unbiasedness, concentration, and wall-clock scaling, with a PettingZoo TVD panel checking non-Gaussian transfer. In a 500-iteration $n=100$ PPO run, Bernoulli-$0.1$ Graph-SND tracks full SND while reducing per-call metric time by about $10\times$, and frozen-policy GPU timing up to $n=500$ follows the predicted $\binom{n}{2}/|E|$ speedup. Random $d$-regular expanders empirically achieve $\mathrm{SND}_{G}^{\mathrm{u}}/\mathrm{SND} \in [0.9987, 1.0013]$ at $\Theta(n \log n)$ edges. In DiCo diversity control at $n=50$, Bernoulli-$0.1$ Graph-SND preserves set-point tracking with paired reward differences indistinguishable from zero across nine matched cells while cutting per-call metric cost by ${\sim}9.5\times$. Together, these results show that the SND aggregation bottleneck can be removed without changing the metric's semantics, yielding a drop-in sparse alternative that scales beyond complete-graph SND and supports both passive measurement and closed-loop diversity control.
中文摘要 系统神经多样性（SND）通过对所有 $\binom{n}{2}$ 个智能体对的成对距离取平均来衡量多智能体强化学习中的行为异质性，因此每次调用的开销随团队规模呈二次增长。我们引入Graph-SND，用任意图 $G$ 的边上的加权平均替代这一完全图平均。由此得到三种情形：$G=K_n$ 时精确恢复SND；固定的稀疏图 $G$ 以 $O(|E|)$ 的代价定义了一种局部化的多样性度量；随机边采样则给出无偏的Horvitz-Thompson估计量，以及在采样边数 $m$ 上具有 $O(1/\sqrt{m})$ 集中性的归一化样本均值。对于固定稀疏图，我们证明了扩展图的转发指数（forwarding index）失真界，以及低秩距离结构下的谱细化；对于随机 $d$-正则图，我们证明了一个无条件的概率性 $\widetilde{\mathcal{O}}(D_{\max}/\sqrt{n})$ 界。在VMAS上，我们验证了恢复性、无偏性、集中性和挂钟时间的扩展规律，并用PettingZoo TVD面板检验向非高斯情形的迁移。在一次500次迭代、$n=100$ 的PPO运行中，Bernoulli-$0.1$ Graph-SND能够跟踪完整SND，同时将每次调用的度量计算时间减少约 $10\times$；冻结策略的GPU计时在 $n$ 最高达 $500$ 时符合预测的 $\binom{n}{2}/|E|$ 加速比。随机 $d$-正则扩展图在 $\Theta(n \log n)$ 条边下经验上达到 $\mathrm{SND}_{G}^{\mathrm{u}}/\mathrm{SND} \in [0.9987, 1.0013]$。在 $n=50$ 的DiCo多样性控制中，Bernoulli-$0.1$ Graph-SND保持了设定点跟踪，配对奖励差异在九个匹配单元上与零无法区分，同时将每次调用的度量成本降低 ${\sim}9.5\times$。综合来看，这些结果表明，无需改变度量的语义即可消除SND的聚合瓶颈，从而得到一种可直接替换的稀疏方案，其可扩展性超越完全图SND，并同时支持被动测量和闭环多样性控制。
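
The aggregation step described in this abstract can be illustrated with a small sketch. It assumes a precomputed symmetric matrix `dist` of pairwise behavioral distances (the abstract does not say how distances are computed), and all function names here are hypothetical rather than taken from the paper's code. It shows the complete-graph average, the edge-restricted average, and a Horvitz-Thompson estimate under independent Bernoulli edge sampling with inclusion probability `p`.

```python
import numpy as np


def full_snd(dist):
    """Average pairwise distance over all n*(n-1)/2 unordered agent pairs."""
    n = dist.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    return float(dist[rows, cols].mean())


def graph_snd(dist, edges, weights=None):
    """Weighted average of distances over the given edges (shape (m, 2)).

    Raises ZeroDivisionError if `edges` is empty and no weights are given.
    """
    edges = np.asarray(edges)
    d = dist[edges[:, 0], edges[:, 1]]
    return float(np.average(d, weights=weights))


def bernoulli_edges(n, p, rng):
    """Keep each unordered pair independently with probability p."""
    rows, cols = np.triu_indices(n, k=1)
    keep = rng.random(rows.shape[0]) < p
    return np.stack([rows[keep], cols[keep]], axis=1)


def horvitz_thompson_snd(dist, edges, p):
    """Inverse-probability-weighted estimate of the complete-graph average.

    Each sampled distance is divided by its inclusion probability p, and the
    sum is normalized by the total number of pairs, so the expectation over
    the Bernoulli sampling equals full_snd(dist).
    """
    n = dist.shape[0]
    total_pairs = n * (n - 1) // 2
    d = dist[edges[:, 0], edges[:, 1]]
    return float(d.sum() / (p * total_pairs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((100, 100))
    dist = (x + x.T) / 2.0          # toy symmetric distances, not real policies
    np.fill_diagonal(dist, 0.0)
    sampled = bernoulli_edges(100, 0.1, rng)
    print("full SND:        ", full_snd(dist))
    print("sample-mean SND: ", graph_snd(dist, sampled))
    print("Horvitz-Thompson:", horvitz_thompson_snd(dist, sampled, 0.1))
```

With `p = 1.0` every pair is kept, so both estimators reduce to the complete-graph average, which mirrors the $G=K_n$ recovery case mentioned in the abstract.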

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

基于偏好的自蒸馏：通过奖励正则化超越KL匹配

Authors: Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, Qinzhen Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05040
Pdf link: https://arxiv.org/pdf/2605.05040
Abstract On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet, existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy self-distillation through a reward-regularized perspective. Instead of directly matching the teacher distribution, we derive a reward-regularized objective whose analytic optimum is a reward-reweighted teacher distribution, yielding a target policy provably superior to the original teacher under this objective. Practically, PBSD optimizes preference gaps between teacher and student samples while maintaining on-policy student sampling. We support this framework with a statistical analysis of the induced preference-learning problem, formally establishing when on-policy self-distillation is preferable to learning from an external teacher in our setting. Experiments on mathematical reasoning and tool-use benchmarks across multiple model scales demonstrate that PBSD consistently achieves the strongest average performance among comparable baselines, showing improved training stability over prior self-distillation baselines while preserving token efficiency.
中文摘要 同策略（on-policy）蒸馏是强化学习的一种高效替代方案，可提供密集的词元级训练信号。然而，它对更强外部教师的依赖推动了近期关于同策略自蒸馏的研究，即同一模型在不同提示上下文下同时充当教师和学生。不过，现有的自蒸馏方法在很大程度上把学习简化为向上下文增强的教师模型做KL匹配。这种做法常常训练不稳定，并可能随时间推移降低推理性能。此外，借助提示增强从同一模型进行自蒸馏，缺乏真正的外部教师所能提供的探索多样性。为解决这些局限，我们超越固定教师的KL匹配，提出基于偏好的自蒸馏（Preference-Based Self-Distillation，PBSD），从奖励正则化的视角重新审视同策略自蒸馏。我们不直接匹配教师分布，而是推导出一个奖励正则化目标，其解析最优解是按奖励重加权的教师分布，从而得到在该目标下可证明优于原教师的目标策略。在实践中，PBSD在保持学生同策略采样的同时，优化教师样本与学生样本之间的偏好差距。我们通过对所导出的偏好学习问题的统计分析来支撑该框架，形式化地给出了在我们的设定下同策略自蒸馏何时优于向外部教师学习。在多个模型规模上的数学推理和工具使用基准实验表明，PBSD在可比基线中始终取得最强的平均性能，在保持词元效率的同时，训练稳定性优于已有的自蒸馏基线。

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

多臂老虎机与强化学习中分布遗憾的统一框架

Authors: Harin Lee, Min-hwan Oh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.05102
Pdf link: https://arxiv.org/pdf/2605.05102
Abstract We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $\delta \in (0,1]$, thereby characterizing the regret distribution across the full range of $\delta$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/\delta))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.
中文摘要 我们通过一个统一框架研究随机多臂老虎机和回合式强化学习中遗憾的分布。我们将分布遗憾界形式化为一种对所有置信水平 $\delta \in (0,1]$ 一致成立的概率保证，从而刻画整个 $\delta$ 范围内的遗憾分布。我们提出了一个简单的UCBVI风格算法，其探索奖励（exploration bonus）为 $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$，其中 $N$ 表示访问次数，$(c_{1,k},c_{2,k})$ 为用户指定的参数。对于任意参数序列，我们推导出一般的与间隔无关和与间隔相关的分布遗憾界，从而有原则地刻画了这些参数如何控制期望性能、尾部风险和实例相关行为之间的权衡。特别地，我们的界在极小极大和实例相关两种情形下都实现了期望遗憾与分布遗憾之间的最优权衡。作为特例，对于具有 $A$ 个臂、时间跨度为 $T$ 的多臂老虎机，我们得到阶为 $\mathcal{O}(\sqrt{AT}\log(1/\delta))$ 的分布遗憾界，首次证实了Lattimore和Szepesvári（2020，第17.1节）的猜想。
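
The exploration bonus quoted in this abstract, $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, is simple enough to evaluate directly. The sketch below only computes that expression for a single pair of parameters; the function names are hypothetical, the handling of unvisited states ($N = 0$) is not specified in the abstract and is excluded here, and nothing about how the paper chooses $(c_{1,k}, c_{2,k})$ is implied.

```python
import math


def exploration_bonus(visits, c1, c2):
    """Return min(c1 / N, c2 / sqrt(N)) for a visit count N >= 1.

    N = 0 is rejected because the abstract does not say how it is treated.
    """
    if visits < 1:
        raise ValueError("this sketch assumes a visit count of at least 1")
    return min(c1 / visits, c2 / math.sqrt(visits))


def crossover_visits(c1, c2):
    """Visit count at which both branches coincide: c1/N = c2/sqrt(N)."""
    return (c1 / c2) ** 2


if __name__ == "__main__":
    c1, c2 = 4.0, 1.0  # illustrative values only
    for n in (1, 4, 16, 64):
        print(n, exploration_bonus(n, c1, c2))
    print("branches cross at N =", crossover_visits(c1, c2))
```

For positive parameters, the $c_2/\sqrt{N}$ branch is the smaller one while $N < (c_1/c_2)^2$, and the faster-decaying $c_1/N$ branch takes over beyond that point.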

LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts

LineRides：自行车机器人特技的线引导强化学习

Authors: Seungeun Rho, Shamel Fahmi, Jeonghwan Kim, Arianna Ilvonen, Sehoon Ha, Gabriel Nelson
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05110
Pdf link: https://arxiv.org/pdf/2605.05110
Abstract Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.
中文摘要 在强化学习中为敏捷的机器人动作设计奖励函数仍然很困难，而基于演示的方法往往需要参考动作，这些参考动作对于新型平台或极限特技来说并不存在。我们提出LineRides，一种线引导学习框架，使定制的自行车机器人能够仅凭用户提供的空间引导线和稀疏的关键朝向，在无需演示或显式时序的情况下，习得多样且可按指令触发的特技行为。LineRides通过允许受控偏离的跟踪裕度来处理物理上不可行的引导线，通过以沿引导线的行进距离度量进度来消除时间上的歧义，并通过基于位置和基于顺序的关键朝向来消除运动细节上的歧义。我们在Ultra Mobility Vehicle（UMV）上评估了LineRides，结果表明，用我们的方法训练的策略支持正常行驶与特技执行之间的无缝切换，可按指令完成五种不同的特技：MiniHop、LargeHop、ThreePointTurn、Backflip和DriftTurn。

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Rollout通过率控制：引导二元奖励强化学习走向信息量最大的区间

Authors: Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu, Dawei Yin, Dou Shen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05112
Pdf link: https://arxiv.org/pdf/2605.05112
Abstract SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success--failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.
中文摘要 SWE-bench风格的智能体强化学习依赖昂贵的有状态轨迹，然而大量计算被浪费在通过率偏斜的采样rollout组上，在这些组中二元奖励只能提供很弱的对比信号。我们将这种低效表述为一个通过率控制问题，并证明50%的通过率是信息量最大的工作点：它最大化了奖励熵、通过组过滤的概率、GRPO下的RLOO优势能量以及成功-失败的对比结构。基于这一原则，我们提出了前缀采样（Prefix Sampling，PS），它通过重放轨迹前缀把偏斜的组引向这一区间：成功的前缀为大多失败的组提供领先起点，失败的前缀则为大多通过的组设置障碍。在有状态的智能体环境中，前缀状态通过重放来重建，而被重放的词元不计入损失，从而将优化限制在当前策略生成的后续部分上。在SWE-bench风格的智能体强化学习中，PS在Qwen3-14B上带来2.01倍、在Qwen3-32B上带来1.55倍的端到端挂钟时间加速，同时保持或提升了最终的验证性能。对于14B模型，SWE-bench Verified的峰值从基线的0.273提升到PS下的0.295。在AIME 2025上的额外数学推理实验显示出同样的通过率控制规律，并将收益分解为重放、双向覆盖和自适应控制三部分。
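
The claim that a 50% pass rate is the most informative operating point can be checked numerically for a simplified model. The sketch below assumes each rollout in a group passes independently with the same probability `p`, and it interprets "surviving group filtering" as the group containing at least one pass and one fail; both are assumptions made for illustration, and the function names are hypothetical. It does not reproduce the paper's RLOO advantage-energy analysis.

```python
import math


def reward_entropy(p):
    """Entropy (in nats) of a binary reward with pass probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def reward_variance(p):
    """Variance of a binary reward with pass probability p."""
    return p * (1.0 - p)


def prob_mixed_group(p, group_size):
    """Probability that a group is neither all-pass nor all-fail."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


if __name__ == "__main__":
    grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    group_size = 8                          # illustrative group size
    print("entropy peak: ", max(grid, key=reward_entropy))
    print("variance peak:", max(grid, key=reward_variance))
    print("mixed-group peak:",
          max(grid, key=lambda p: prob_mixed_group(p, group_size)))
```

All three quantities are symmetric in `p` and `1 - p` and peak at `p = 0.5`, which is consistent with the operating point the abstract argues for.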

Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

互动预算下的自适应策略选择与微调，用于离线到在线强化学习

Authors: Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, Miroslav Pajic, Yuichi Motai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05123
Pdf link: https://arxiv.org/pdf/2605.05123
Abstract In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline, candidate policies trained with offline RL are evaluated via either off-policy evaluation (OPE) or online evaluation (OE). The policy with the highest estimated value is then deployed and continually fine-tuned. However, this setup has two main issues. First, OPE can be unreliable, making it risky to deploy a policy based solely on those estimates, whereas OE may identify a viable policy with substantial online interaction, which could have been used for fine-tuning. Second--and more importantly--it is also often not possible to determine a priori whether a pretrained policy will improve with post-deployment fine-tuning, especially in non-stationary environments. As a result, procedures committing to a single deployed policy are impractical in many real-world settings. Moreover, a naive remedy that exhaustively fine-tunes all candidates would violate interaction budget constraints and is likewise infeasible. In this paper, we propose a novel adaptive approach for policy selection and fine-tuning under online interaction budgets in O2O-RL. Following the standard pipeline, we first train a set of candidate policies with different offline RL algorithms and hyperparameters; we then perform OPE to obtain initial performance estimates. We next adaptively select and fine-tune the policies based on their predicted performance via an upper-confidence-bound approach thereby making efficient use of online interactions. We demonstrate that our approach improves upon O2O-RL baselines with various benchmarks.
中文摘要 在离线到在线强化学习（O2O-RL）中，策略先利用此前收集的数据集以安全的方式离线训练，然后通过有限的在线交互针对任务进一步微调。在典型的O2O-RL流程中，用离线强化学习训练得到的候选策略通过离策略评估（OPE）或在线评估（OE）进行评估，随后部署估计价值最高的策略并持续微调。然而，这种设置存在两个主要问题。首先，OPE可能不可靠，仅凭这些估计部署策略存在风险；而OE虽然可能找出可行的策略，却要消耗大量本可用于微调的在线交互。其次，也是更重要的一点，通常无法事先判断一个预训练策略能否通过部署后的微调得到改进，在非平稳环境中尤其如此。因此，只押注于单一部署策略的做法在许多现实场景中并不可行。此外，对所有候选策略逐一充分微调的朴素补救办法会违反交互预算约束，同样不可行。本文提出了一种新的自适应方法，用于在O2O-RL的在线交互预算下进行策略选择和微调。按照标准流程，我们首先用不同的离线强化学习算法和超参数训练一组候选策略，然后进行OPE以获得初始性能估计。接着，我们通过一种置信上界方法，根据预测性能自适应地选择并微调这些策略，从而高效利用在线交互。我们在多种基准上证明了该方法优于O2O-RL基线。

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

当生活给你BC时，制作Q函数：从行为克隆中提取Q值以实现机器人强化学习

Authors: Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05172
Pdf link: https://arxiv.org/pdf/2605.05172
Abstract Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at this https URL
中文摘要 行为克隆（BC）已成为机器人学习中一种非常有效的范式。然而，在演示数据收集完成之后，BC缺乏自我引导的在线改进机制。现有的离线到在线学习方法常因离线数据与在线学习之间的分布不匹配，导致策略把先前学到的良好动作替换掉。在本研究中，我们提出Q2RL（Q-Estimation and Q-Gating from BC for Reinforcement Learning），一种高效的离线到在线学习算法。我们的方法包括两部分：（1）Q估计（Q-Estimation）利用与环境的少量交互步骤从BC策略中提取Q函数；随后进行在线强化学习，其中（2）Q门控（Q-Gating）根据BC策略动作和RL策略动作各自的Q值在二者之间切换，以收集用于RL策略训练的样本。在D4RL和robomimic基准的操作任务上，Q2RL在成功率和收敛时间上优于SOTA的离线到在线学习基线。Q2RL足够高效，可用于在真实机器人上进行强化学习，在1-2小时的在线交互内学到管道装配和配套分拣（kitting）等接触丰富、高精度操作任务的稳健策略，成功率最高达100%，相对原始BC策略的提升最高达3.75倍。代码和视频可在此 https URL 获取

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

OpenSearch-VL：前沿多模态搜索代理的开放配方

Authors: Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.05185
Pdf link: https://arxiv.org/pdf/2605.05185
Abstract Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
中文摘要 深度搜索已成为前沿多模态智能体的关键能力，使模型能够通过主动搜索、证据验证和多步推理解决复杂问题。尽管进展迅速，顶级多模态搜索智能体仍难以复现，主要原因是缺乏开放的高质量训练数据、透明的轨迹合成流程或详细的训练配方。为此，我们提出OpenSearch-VL，一套完全开源的配方，用于通过智能体强化学习训练前沿多模态深度搜索智能体。首先，我们设计了一条专门的流程，通过维基百科路径采样、模糊实体改写和源锚点视觉定位来构建高质量训练数据，这些手段共同减少了捷径和一步检索坍缩。基于该流程，我们整理了两个训练数据集：用于SFT的SearchVL-SFT-36k和用于RL的SearchVL-RL-8k。此外，我们设计了一个多样化的工具环境，统一了文本搜索、图像搜索、OCR、裁剪、锐化、超分辨率和透视校正，使智能体能够将主动感知与外部知识获取结合起来。最后，我们提出了一种多轮、可感知致命错误的GRPO训练算法，它通过掩蔽失败之后的词元来处理级联的工具故障，同时通过单侧优势截断保留失败之前有用的推理。基于这一配方，OpenSearch-VL取得了显著的性能提升，在七个基准上平均提升超过10分，并在若干任务上达到了与专有商业模型相当的结果。我们将发布全部数据、代码和模型，以支持多模态深度搜索智能体的开放研究。

Keyword: diffusion policy

There is no result