Arxiv Papers of Today

Generated: 2026-05-12 18:39:26 (UTC+8); arXiv release time: 2026-05-12 20:00 EDT (2026-05-13 08:00 UTC+8)

122 relevant papers today

Keyword: reinforcement learning

Reinforcement learning for inverse structural design and rapid laser cutting of kirigami prototypes

Authors: Milad Yazdani, Shahriar Shalileh, Dena Shahriari
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08098
Pdf link: https://arxiv.org/pdf/2605.08098
Abstract Kirigami is an increasingly useful fabrication method to produce shape-programmable metamaterial structures. However, inverse design remains difficult because deployment is nonlinear, and feasible cut layouts must satisfy discrete compatibility rules, avoid overlap, and map one target shape to valid designs. We present RL-Kirigami, an inverse design framework that combines optimal-transport conditional flow matching (OT-CFM) with reinforcement learning to generate compatible ratio fields for compact reconfigurable parallelogram quad kirigami. A marching decoder enforces global geometric compatibility, and Group Relative Policy Optimization (GRPO) aligns the generator with nondifferentiable rewards for silhouette matching, feasibility, and ratio-field regularity. Across procedurally generated target shape instances, a single sample from the pretrained OT-CFM prior reached $94.2\%$ sIoU and outperformed solver baselines while reducing forward simulator evaluations from hundreds to 1. GRPO improved accuracy to $94.91\%$ sIoU and, with regularity included, reduced $\mathrm{TV}(\mathbf{x})$ from 0.95 to 0.81 while maintaining $94.83\%$ sIoU. Generated layouts were exported to DXF and laser-cut in $50~\mu\mathrm{m}$ polymeric sheets to produce deployable prototypes in $8.0 \pm 1.0$ minutes per part. These results support a manufacturing-aware inverse design workflow for deployable kirigami metamaterials under hard geometric feasibility constraints.
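As a rough illustration of the group-relative step in GRPO mentioned in this abstract (not the RL-Kirigami implementation), the sketch below standardizes the rewards of one sampled group against the group's own mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same target."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation of the group (an assumed convention).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards (e.g. a silhouette score) for four layouts of one target.
advantages = group_relative_advantages([0.90, 0.94, 0.95, 0.97])
print(advantages)
```

Samples that score above the group average get a positive advantage and the rest get a negative one, so no separate value network is needed to form a baseline.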

Distributional Reinforcement Learning via the Cramér Distance

Authors: Vanya Aziz, Ivo Nowak, E.M.T Hendrix
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08104
Pdf link: https://arxiv.org/pdf/2605.08104
Abstract This paper explores the application of the Soft Actor-Critic (SAC) algorithm within a Distributional Reinforcement Learning setting and introduces an implementation of this algorithm, named Cramér-based Distributional Soft Actor-Critic (C-DSAC). The novel approach employs distributional reinforcement learning to represent state-action values, and minimizes the squared Cramér distance for learning the distribution. Empirical results across various robotic benchmarks indicate that our algorithm surpasses the performance of baseline SAC and contemporary distributional methods, with the performance advantage becoming increasingly pronounced in high-complexity environments. To explain the efficiency of the new approach, we conduct an analysis showing that its superior performance is partly due to \textit{confidence-driven} Q-value updates: high-variance target distributions (low confidence in target) lead to more conservative model updates, thereby attenuating the impact of overestimated values. This work deepens the understanding of distributional reinforcement learning, offering insights into the algorithmic mechanisms governing convergence and value estimation.
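For intuition, the squared Cramér distance between two distributions is the integral of the squared difference of their cumulative distribution functions. The sketch below evaluates it for two categorical distributions on a shared, sorted set of atoms. The function name and the example atoms are illustrative assumptions, and this is not the C-DSAC loss as implemented by the authors.

```python
def squared_cramer_distance(atoms, p, q):
    """Integral of (F_p - F_q)^2 for two categorical distributions on sorted atoms."""
    assert len(atoms) == len(p) == len(q)
    total = 0.0
    cdf_p = cdf_q = 0.0
    for i in range(len(atoms) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        # Both CDFs are constant on [atoms[i], atoms[i + 1]).
        total += (cdf_p - cdf_q) ** 2 * (atoms[i + 1] - atoms[i])
    return total

# Hypothetical return atoms: all mass on the lowest atom vs. all mass on the highest.
atoms = [0.0, 1.0, 2.0]
print(squared_cramer_distance(atoms, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))
```

Because the distance is built from CDF differences, it accounts for how far apart the atoms are, unlike a per-atom comparison of probabilities.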

Interactive Inverse Reinforcement Learning of Interaction Scenarios via Bi-level Optimization

Authors: Yue Mao, Shicheng Liu, Siyuan Xu, Minghui Zhu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08131
Pdf link: https://arxiv.org/pdf/2605.08131
Abstract Inverse reinforcement learning (IRL) learns a reward function and a corresponding policy that best fit the demonstration data of an expert. However, in the current IRL setting, the learner is isolated from the expert and can only passively observe the expert demonstrations. This limits the applicability of IRL to interactive settings, where the learner actively interacts with the expert and needs to infer the expert's reward function from the interactions. To bridge the gap, this paper studies interactive IRL (IIRL) where a learner aims to learn the reward function of an expert and a policy to interact with the expert during its interactions with the expert. We formulate IIRL as a stochastic bi-level optimization problem where the lower level learns a reward function to explain the behaviors of the expert, and the upper level learns a policy to interact with the expert. We develop a double-loop algorithm, Bi-level Interactive Scenarios Inverse Reinforcement Learning (BISIRL), which solves the lower-level problem in the inner loop and the upper-level problem in the outer loop. We formally guarantee that BISIRL converges and validate our algorithm through extensive experiments.
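The double-loop structure described in this abstract can be pictured with a deterministic toy problem. The sketch below is not BISIRL and involves no rewards or policies: it only shows an inner loop that approximately solves a lower-level problem, followed by an outer gradient step on an upper-level objective. The quadratic objectives, step sizes, and the function name `solve_bilevel` are assumptions chosen so the answer is easy to check by hand (the paper's formulation is stochastic).

```python
def solve_bilevel(outer_steps=200, inner_steps=20, lr_inner=0.5, lr_outer=0.1):
    """Toy double-loop solver for min_x 0.5*(y*(x) - 4)^2 with y*(x) = argmin_y 0.5*(y - 2x)^2."""
    x, y = 0.0, 0.0
    for _ in range(outer_steps):
        # Inner loop: gradient descent on the lower-level objective 0.5 * (y - 2x)^2.
        for _ in range(inner_steps):
            y -= lr_inner * (y - 2.0 * x)
        # Outer step: chain rule through the lower-level solution, where dy*/dx = 2.
        x -= lr_outer * (y - 4.0) * 2.0
    return x, y

x_opt, y_opt = solve_bilevel()
print(x_opt, y_opt)
```

The inner variable is warm-started from the previous outer iteration, a common choice that keeps the inner loop short once the outer variable stops moving much.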

Quantile Geometry Regularization for Distributional Reinforcement Learning

Authors: Zhaofan Zhang, Minghao Yang, Rufeng Chen, Sihong Xie, Hui Xiong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08182
Pdf link: https://arxiv.org/pdf/2605.08182
Abstract Quantile-based distributional reinforcement learning methods learn return distributions through sampled quantile regression, but their bootstrapped target quantiles may induce distorted or degenerate distribution estimates. We propose Robust Quantile-based Implicit Quantile Networks (RQIQN), a lightweight Wasserstein distributionally robust enhancement boosted from a quantile estimation perspective. We first reinterpret a snapshot of IQN loss as a collection of local empirical quantile estimation problems over sampled current fractions. We then robustify each local slot with a Wasserstein distributionally robust quantile estimation formulation, yielding a closed-form, fraction-dependent correction to the Bellman target. This correction directly addresses distributional degeneration: its median antisymmetry preserves the risk-neutral quantile average, while its monotonicity enlarges upper-lower quantile gaps and counteracts collapsed distributional spread. RQIQN thus regularizes quantile geometry without changing the underlying value objective or requiring additional sample set reconstruction. Finally, we empirically show that the proposed RQIQN outperforms other existing quantile-based distributional reinforcement learning algorithms in risk-sensitive navigation and Atari games.
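The two properties named in this abstract, median antisymmetry and monotonicity, can be checked on a small example. The sketch below uses a hypothetical linear correction `epsilon * (2 * tau - 1)`, chosen only because it is antisymmetric around the median and increasing in the fraction. It is not the closed-form correction derived in the paper, and the midpoint fractions are likewise an assumption.

```python
def apply_antisymmetric_correction(quantiles, epsilon):
    """Shift quantile estimates by a hypothetical correction epsilon * (2 * tau - 1)."""
    n = len(quantiles)
    # Midpoint fractions tau_i = (i + 0.5) / n, symmetric around the median 0.5.
    taus = [(i + 0.5) / n for i in range(n)]
    return [q + epsilon * (2.0 * tau - 1.0) for q, tau in zip(quantiles, taus)]

collapsed = [1.0, 1.0, 1.0, 1.0]  # degenerate estimate with no spread
corrected = apply_antisymmetric_correction(collapsed, epsilon=0.5)
print(corrected)
print(sum(corrected) / len(corrected))
```

Corrections at fractions mirrored around 0.5 cancel, so the average of the quantiles (the risk-neutral value) is unchanged, while the increasing shift pulls upper and lower quantiles apart.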

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

Authors: Qingjun Wang, Hongtu Zhou, Hang Yu, Junqiao Zhao, Yanping Zhao, Chen Ye, Ziqiao Wang, Guang Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08202
Pdf link: https://arxiv.org/pdf/2605.08202
Abstract Offline reinforcement learning (RL) faces a critical challenge of overestimating the value of out-of-distribution (OOD) actions. Existing methods mitigate this issue by penalizing unseen samples, yet they fail to accurately identify OOD actions and may suppress beneficial exploration beyond the behavioral support. Although several methods have been proposed to differentiate OOD samples with distinct properties, they typically rely on restrictive assumptions about the data distribution and remain limited in discrimination ability. To address this problem, we propose DOSER (Diffusion-based OOD Detection and Selective Regularization), a novel framework that goes beyond uniform penalization. DOSER trains two diffusion models to capture the behavior policy and state distribution, using single-step denoising reconstruction error as a reliable OOD indicator. During policy optimization, it further distinguishes between beneficial and detrimental OOD actions by evaluating predicted transitions, selectively suppressing risky actions while encouraging exploration of high-potential ones. Theoretically, we prove that DOSER is a $\gamma$-contraction and therefore admits a unique fixed point with bounded value estimates. We further provide an asymptotic performance guarantee relative to the optimal policy under model approximation and OOD detection errors. Across extensive offline RL benchmarks, DOSER consistently attains superior performance to prior methods, especially on suboptimal datasets.

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

Authors: Boyang Xu, Qing Zou, Siqin Yang, Hao Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08253
Pdf link: https://arxiv.org/pdf/2605.08253
Abstract Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from \emph{boundary mismatch} at the flow source or from \emph{high-variance} bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using \textbf{source-consistent Bellman-coupled paths}: the current path starts from the required base prior at $t{=}0$, reaches the Bellman target at $t{=}1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda{=}0$ recovers an unbiased sample Bellman target, while $\lambda{>}0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.

Insider Attacks in Multi-Agent LLM Consensus Systems

Authors: Xiaolin Sun, Zixuan Liu, Yibin Hu, Zizhan Zheng
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08268
Pdf link: https://arxiv.org/pdf/2605.08268
Abstract Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language-based multi-agent systems.

Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation

Authors: Haoxuan Wu, Sishen Yuan, Haitao Gao, Zhen Li, Xiuli Zuo, Hongliang Ren
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.08269
Pdf link: https://arxiv.org/pdf/2605.08269
Abstract Wireless capsule endoscopy (WCE) enables painless visualization of the gastrointestinal tract, but its diagnostic potential is limited by incomplete mucosal coverage and poor transferability of existing navigation methods across patient anatomies. We propose a transferable, anatomical landmark-guided deep reinforcement learning (AL-DRL) framework for autonomous gastric navigation. Leveraging a lightweight edge-contour-depth fusion module, our policy operates on stable, low-dimensional landmark coordinates rather than high-dimensional video streams, effectively bridging the sim-to-real gap. In simulations across eight patient-derived models, the method achieves over 97% coverage within 50 seconds, significantly outperforming vanilla PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline with an adaptive dynamic programming controller actively mitigates physical disturbances. Ex-vivo experiments demonstrate a mean coverage of 87% and a 53% reduction in procedure time compared with expert manual control.

LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

Authors: Qixin Xiao, Maani Ghaffari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08279
Pdf link: https://arxiv.org/pdf/2605.08279
Abstract Learning predictive world models from visual observations is a core problem in embodied AI, with applications to model-based reinforcement learning and robotic planning. Existing latent world models typically generate future states with unconstrained neural transition functions, while modern video generation systems often prioritize perceptual plausibility or introduce physical structure through auxiliary losses, external guidance, or separate dynamics modules. As a result, long-horizon rollouts can remain weakly grounded in the physical principles that govern real dynamics, leading to compounding error, energy drift, and physically inconsistent futures. We propose Least Action World Models (LaWM), a latent world-modeling framework that operationalizes the Principle of Least Action in learned visual latent space: future rollouts are governed by a learned Lagrangian action functional rather than produced only by an unconstrained transition predictor. Our main technical realization is a latent variational integrator: LaWM encodes observations into learned generalized coordinates, learns a latent discrete Lagrangian over consecutive latent states, constructs a discrete action functional, and advances prediction by solving the corresponding discrete integration condition. Thus, physical structure is not merely used to score, regularize, or constrain a completed trajectory; it defines the latent transition rule itself. Because the transition is induced by a discrete variational principle, LaWM provides a structure-preserving bias for long-horizon visual prediction. Across physics-clean synthetic dynamics and embodied robot interaction benchmarks, LaWM improves physical invariance, background consistency, motion smoothness, and appearance and geometric prediction metrics over video-generation and world-model baselines.
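To make the idea of a transition rule induced by a discrete variational principle concrete, the sketch below uses a hand-written (not learned) discrete Lagrangian for a unit-mass particle, L_d(q_k, q_{k+1}) = h * [0.5 * ((q_{k+1} - q_k) / h)^2 - V(q_k)]. Setting D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) = 0 gives the update q_{k+1} = 2 q_k - q_{k-1} - h^2 V'(q_k). The harmonic potential, step size, and function names are assumptions for illustration. LaWM itself works in a learned latent space with a learned Lagrangian, which this toy does not attempt to reproduce.

```python
def rollout_discrete_euler_lagrange(q0, q1, h, steps, grad_potential):
    """Advance q_{k+1} = 2 q_k - q_{k-1} - h^2 * V'(q_k) for a unit-mass particle."""
    traj = [q0, q1]
    for _ in range(steps):
        q_prev, q_curr = traj[-2], traj[-1]
        traj.append(2.0 * q_curr - q_prev - h * h * grad_potential(q_curr))
    return traj

def energies(traj, h, potential):
    """Energy at interior points, using a central-difference velocity estimate."""
    out = []
    for k in range(1, len(traj) - 1):
        v = (traj[k + 1] - traj[k - 1]) / (2.0 * h)
        out.append(0.5 * v * v + potential(traj[k]))
    return out

h = 0.1
q0 = 1.0
q1 = q0 - 0.5 * h * h * q0  # first step for a start from rest (zero initial velocity)
traj = rollout_discrete_euler_lagrange(q0, q1, h, 10_000, lambda q: q)
e = energies(traj, h, lambda q: 0.5 * q * q)
print(min(e), max(e))
```

Over ten thousand steps the measured energy stays in a narrow band around its initial value instead of drifting, which is the kind of structure-preserving behavior the abstract refers to for long-horizon rollouts.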

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

Authors: Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.08283
Pdf link: https://arxiv.org/pdf/2605.08283
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. In Chain-of-Thought (CoT) reasoning, however, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide-and-conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token's expected functionality. In this way, HTPO can achieve a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be at this https URL.

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

Authors: Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.08334
Pdf link: https://arxiv.org/pdf/2605.08334
Abstract We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity and overdisclosure of criteria across personas compared to human conversations. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

Authors: Yuhao Li, Shengchao Liu
Subjects: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08368
Pdf link: https://arxiv.org/pdf/2605.08368
Abstract Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
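The reweighting picture in this abstract can be illustrated with a Boltzmann-style update, p(x) proportional to p_ref(x) * exp(r(x) / beta), which is the familiar solution of a KL-regularized objective. That this exact form matches the paper's free-energy formulation is an assumption, as are the function name, the three hypothetical behaviors, and the numbers below.

```python
import math

def reweight(reference_probs, rewards, beta):
    """Return p(x) proportional to p_ref(x) * exp(reward(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(reference_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical behaviors: the third has zero probability under the reference model.
p_ref = [0.7, 0.3, 0.0]
rewards = [0.0, 1.0, 5.0]
print(reweight(p_ref, rewards, beta=0.5))
```

However large its reward, the third behavior keeps zero probability because the update only rescales mass the reference model already assigns. In the abstract's terms this is elicitation within the existing support, not a change to the support itself.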

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

Authors: Guangchen Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.08378
Pdf link: https://arxiv.org/pdf/2605.08378
Abstract Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

Authors: Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08391
Pdf link: https://arxiv.org/pdf/2605.08391
Abstract Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose \textit{structured agent coordination via holistic information integration}, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

AIPO: Learning to Reason from Active Interaction

Authors: Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08401
Pdf link: https://arxiv.org/pdf/2605.08401
Abstract Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose $\textbf{AIPO}$, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, $\textit{Verify Agent}$, $\textit{Knowledge Agent}$, and $\textit{Reasoning Agent}$, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
中文摘要 大型语言模型（LLMs）的最新进展展示了显著的推理能力，主要得益于可验证奖励强化学习（RLVR）。然而，现有的强化学习算法面临一个根本性局限：其探索在很大程度上仍受限于策略模型固有的能力边界。尽管近期方法引入了外部专家的演示来扩展这一边界，但它们通常依赖于完整的轨迹级指导，这种指导采样效率低、信息稀疏，且可能将探索限制在静态的指导空间内。受多智能体系统潜力启发，我们提出了$\textbf{AIPO}$，一种增强型强化学习框架，通过探索过程中的主动多智能体交互提升LLM推理能力。具体来说，AIPO使策略模型能够在遇到推理瓶颈时主动咨询三个功能协作代理：$\textit{Verify Agent}$、$\textit{Knowledge Agent}$和$\textit{Reasoning Agent}$，从而在训练过程中获得细粒度和有针对性的指导，主动扩展能力边界。我们还进一步引入了定制的重要性抽样系数及剪裁策略，以减轻从代理反馈学习时出现的偏向偏差和梯度消失问题。训练完成后，策略模型独立进行推理，不依赖协作代理。在包括AIME、MATH500、GPQA-Diamond和LiveCodeBench等多种推理基准测试上的广泛实验表明，AIPO能够持续提升推理性能，能够在不同策略模型和RLVR算法间进行稳健推广，并有效扩展了策略模型的推理能力边界。
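
The abstract mentions an importance sampling coefficient combined with clipping but does not give its form. As a rough illustration only, the sketch below uses the standard PPO-style clipped ratio as a stand-in; the function name `clipped_surrogate` and the default `eps=0.2` are assumptions, not AIPO's actual coefficient.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_behavior, advantages, eps=0.2):
    """PPO-style clipped objective (a stand-in, not AIPO's tailored coefficient).

    logp_new:      log-probabilities under the policy being trained
    logp_behavior: log-probabilities under whatever produced the tokens
                   (for example, text inserted by a helper agent)
    """
    ratio = np.exp(logp_new - logp_behavior)            # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum keeps the update pessimistic when the ratio drifts.
    return np.minimum(unclipped, clipped).mean()

# Hypothetical numbers: the policy assigns twice the behavior probability.
example = clipped_surrogate(np.log(np.array([0.8])),
                            np.log(np.array([0.4])),
                            np.array([1.0]))
```

With a positive advantage and a ratio of 2, the clipped term caps the contribution at `1 + eps`, which is the usual way such objectives limit off-policy updates.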

Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

两时间尺度近似分布鲁棒强化学习的中心极限定理

Authors: Shengbo Wang, Zexi Zhang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.08417
Pdf link: https://arxiv.org/pdf/2605.08417
Abstract Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback--Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical $n^{-1/2}$ scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.
中文摘要 设计分布鲁棒强化学习（DRRL）的无模型算法面临根本性挑战。稳健的贝尔曼算子在转移核中是非线性的，这使得单采样贝尔曼更新存在偏置，而鲁棒性背后的对抗性优化使得稳健的评估计算量过高。为解决这些困难，我们考虑了Kullback--Leibler模糊性集合下的自然小歧义区，并提出了基于相关稳健泛函一阶展开的近似DRRL框架。这得到一个近似的稳健贝尔曼方程，去除了对抗优化，同时在歧义半径内保持一阶准确。为了了解该近似方程的不动点，我们提出了平均方差随机近似（MVSA）算法，这是一种仅使用单样本更新的无模型算法。这通过提升随机近似动力学和两时间尺度设计实现。随后我们证明了收敛性和MVSA的中心极限定理：其主要迭代满足一个典型$n^{-1/2}$尺度的中心极限定理，且具有明确刻画的渐近协方差。最后，我们通过数值实验验证了理论发现。
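
To make the small-ambiguity idea concrete, the sketch below compares the KL-constrained worst-case mean, computed through its standard dual, with the well-known leading-order approximation "mean minus sqrt(2*delta) times standard deviation". Whether this is exactly the expansion used in the paper is an assumption; the function names and the toy value vector are hypothetical.

```python
import numpy as np

def kl_robust_exact(values, probs, delta, lambdas=None):
    """Worst-case mean over {Q : KL(Q || P) <= delta} via the dual
    sup_{lam > 0} -lam * log E_P[exp(-V / lam)] - lam * delta,
    maximized on a grid of lam values."""
    if lambdas is None:
        lambdas = np.logspace(-2, 4, 2000)
    best = -np.inf
    for lam in lambdas:
        z = -values / lam
        m = z.max()                                   # stabilize the log-sum-exp
        log_mgf = m + np.log(np.sum(probs * np.exp(z - m)))
        best = max(best, -lam * log_mgf - lam * delta)
    return best

def kl_robust_leading_order(values, probs, delta):
    """Leading-order small-delta approximation: mean - sqrt(2 * delta * variance)."""
    mean = np.sum(probs * values)
    var = np.sum(probs * (values - mean) ** 2)
    return mean - np.sqrt(2.0 * delta * var)

values = np.array([0.0, 1.0, 2.0, 3.0])   # toy next-state values
probs = np.full(4, 0.25)                  # nominal transition probabilities
delta = 1e-3                              # small ambiguity radius
exact = kl_robust_exact(values, probs, delta)
approx = kl_robust_leading_order(values, probs, delta)
```

The approximation needs only a mean and a variance, both of which can be estimated from samples, which is consistent with the "Mean-Variance" name in the abstract.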

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

DUET：面向可验证奖励强化学习的 token 预算分配优化

Authors: Haoyu Hu, Xuandong Zhao, Xuhai "Orson" Xu, Nori Jacoby
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08441
Pdf link: https://arxiv.org/pdf/2605.08441
Abstract Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop them. On Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the other three budget-aware baseline methods. DUET's advantage further generalizes to other benchmarks across math and coding, and is on par with the best baseline on the scientific Q&A domain, while also achieving a $1.62\times$ wall-clock speedup. More notably, using only 50% of the token budget, DUET still outperforms all baseline methods at their full budget, achieving an even higher $2.51\times$ speedup over full-budget GRPO. We verify the high performance of DUET on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. Notably, the gap between DUET and the strongest baseline widens as the budget tightens, contrary to the usual pattern in which efficient methods trade off quality as compute decreases. More broadly, these results suggest that DUET's budget-aware control strategies are valuable not only for accelerating training, but also for improving the quality of the learning signal.
中文摘要 带有可验证奖励的强化学习（RLVR）每个训练步骤会生成数十万个 token，其中 rollout 生成主导了计算成本。整体 token 预算可以从两个主要维度控制：（i）决定把 rollout 分配给哪些提示，以及（ii）决定每个 rollout 的长度。以往的工作通常一次只控制其中一个维度。我们证明，在共享计算预算下联合调整这两个决策，既能提升推理质量，也能缩短实际训练时间。我们将这一思路实例化为 DUal-controlled tokEn allocaTion（DUET），这是一个构建在 GRPO 之上的计算高效层：它利用一个轻量级的、在 rollout 之前计算的提示信息量代理指标来设置每个提示获得的 rollout 数量，并通过带重要性重加权的标记门控中止规则来决定何时停止 rollout。在以 MATH 训练的 Qwen3-1.7B 上，DUET 优于全预算 GRPO 以及其他三种预算感知基线方法。DUET 的优势进一步推广到数学和编程领域的其他基准，在科学问答领域与最佳基线持平，同时实现了 $1.62\times$ 的实际运行时间加速。更值得注意的是，仅使用 50% 的 token 预算时，DUET 仍优于使用全预算的所有基线方法，并相对全预算 GRPO 实现了更高的 $2.51\times$ 加速。我们在其他骨干大型语言模型（包括 Qwen3-4B 和 Llama-3.2-3B-Instruct）上验证了 DUET 的高性能。值得注意的是，随着预算收紧，DUET 与最强基线之间的差距反而扩大，这与高效方法通常随计算减少而牺牲质量的模式相反。更广泛地说，这些结果表明，DUET 的预算感知控制策略不仅有助于加速训练，还能提升学习信号的质量。
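
One half of the idea, splitting a fixed rollout budget across prompts according to an informativeness score, can be sketched as follows. The score `p * (1 - p)` over an estimated pass rate `p` is a hypothetical surrogate chosen for illustration (prompts that are always solved or never solved give no group-relative signal); the abstract does not specify DUET's actual surrogate, and `allocate_rollouts` is an invented name.

```python
def allocate_rollouts(pass_rates, total_rollouts, min_per_prompt=1):
    """Split a rollout budget across prompts in proportion to p * (1 - p).

    pass_rates: estimated probability that a rollout solves each prompt
                (hypothetical pre-rollout estimate).
    """
    n = len(pass_rates)
    if total_rollouts < n * min_per_prompt:
        raise ValueError("budget too small for the per-prompt minimum")
    scores = [p * (1.0 - p) for p in pass_rates]
    spare = total_rollouts - n * min_per_prompt
    total_score = sum(scores)
    if total_score == 0:
        shares = [spare / n] * n                      # nothing informative: split evenly
    else:
        shares = [spare * s / total_score for s in scores]
    counts = [min_per_prompt + int(s) for s in shares]
    # Largest-remainder rounding so the counts sum exactly to the budget.
    leftover = total_rollouts - sum(counts)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts

counts = allocate_rollouts([0.0, 0.5, 0.9, 1.0], total_rollouts=16)
```

The second half described in the abstract, stopping individual rollouts early with importance reweighting, is not modeled here.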

PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

PYTHALAB-MERA：面向冻结大型语言模型编码代理的基于验证的记忆、检索与验收控制

Authors: Mehmet Iscan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08468
Pdf link: https://arxiv.org/pdf/2605.08468
Abstract Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
中文摘要 基于本地LLM的编码代理越来越多地工作在通过执行反馈、持久状态和有界修复获得正确性，而非单一流畅答案的环境中工作。静态检索、长上下文提示、自我精炼、执行反馈修复和基于模型权重的强化学习均针对该设定的部分，但它们并未共同提供基于验证的情节记忆、自适应检索-动作选择、延迟学分分配以及围绕冻结局部模型的结构性技能重用。我们介绍了PYTHALAB-MERA，一款用于本地验证条件代码生成的轻量级外部控制器。冻结语言模型提出完整的源文件;控制器决定哪些记忆记录和AST衍生技能应进入下一个提示，通过失败快速的流水线验证每个候选人，将验证结果转换为有界形状的奖励，并通过TD（lambda）风格的资格追踪传播延迟学分。我们将该实现作为强化学习编码任务的本地CLI工件进行评估，且带有严格的验证门。在严格的硬强化学习环境下，包含三项任务、三次重复和三次尝试预算，PYTHALAB-MERA 通过了 8/9 的严格验证;自精化基线和研究的GRACE扩展均通过0/9。这些结果支持了一个刻意有界的主张：在该记录环境中，外部内存与检索控制器提高了验证成功率。它们不建立通用代码综合、最先进的性能、形式化程序的正确性或形式安全性。
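
The delayed credit assignment mentioned above can be illustrated with textbook tabular TD(lambda) using accumulating eligibility traces. This is a generic sketch, not the controller's implementation; the state names (`attempt1` and so on), the reward values, and the hyperparameters are all assumptions.

```python
def td_lambda_update(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """One pass of tabular TD(lambda) with accumulating traces.

    values:  dict mapping state -> current value estimate (updated in place)
    episode: list of (state, reward, next_state); next_state is None at the end
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0                      # mark the visited state
        for s in values:
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam              # older decisions fade
    return values

# Hypothetical repair episode: only the third attempt passes validation.
values = {"attempt1": 0.0, "attempt2": 0.0, "attempt3": 0.0}
episode = [("attempt1", 0.0, "attempt2"),
           ("attempt2", 0.0, "attempt3"),
           ("attempt3", 1.0, None)]
td_lambda_update(values, episode)
```

Because the traces decay geometrically, the final reward still reaches the earlier attempts, with less credit the further back they are.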

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

中期训练中用自生成数据提升语言模型中的强化学习

Authors: Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, Chitta Baral
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08472
Pdf link: https://arxiv.org/pdf/2605.08472
Abstract The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.
中文摘要 强化学习（RL）在大型语言模型（LLMs）中的有效性取决于在强化学习前及过程中所用数据的性质和多样性。特别是，推理问题通常可以用多种方式来处理，依赖不同的推理形式，而训练数据中接触的此类方法范围有限，可能会限制强化学习的有效性。基于此，我们研究在训练中期使用多样化的自生成数据作为强化学习训练的中间步骤。具体来说，我们采用了基于George Polya问题解决方法的自助数据生成框架，为训练数据中的每个问题生成多样正确答案，然后进行微调。我们首先从理论角度讲述了在此类数据中进行训练如何改善强化学习，并解释策略梯度更新如何激励多种方法的结合。随后，我们实证证明，使用我们中期训练数据初始化的强化学习模型在各种数学推理基准测试及其他外部任务（如代码生成和叙事推理）中取得了持续的提升。总体而言，我们的调查研究表明，通过自生成数据学习多种问题解决方法的语言模型，有助于后续强化学习。

Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

分布强化学习中的分位数耦合流匹配

Authors: Michael Groom, Victor-Alexandru Darvariu, Lars Kunze, James Wilson, Nick Hawes
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.08515
Pdf link: https://arxiv.org/pdf/2605.08515
Abstract Unlike standard expected-return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better-suited for uncertainty-aware and risk-sensitive decision-making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi-modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the $p$-Wasserstein distance, yet existing CFM critics are trained with arbitrary source-target couplings, so their flow-matching losses are not Wasserstein-aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile-aligned flow paths. We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return-distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: this https URL.
中文摘要 与标准的预期收益强化学习（RL）不同，分布式强化学习（DRL）建模了完整的回报分布，使其更适合不确定性和风险敏感的决策。条件流匹配（CFM）批评者最近因建模连续多模态回报分布而受到关注。尽管如此，度量不匹配仍存在显著：DRL理论依赖分布Bellman算子在$p$-Wasserstein距离内是收缩的，但现有CFM批评者则用任意源-目标耦合训练，因此其流量匹配损耗并非匹配Bellman目标返回分布的Wasserstein对齐替代。在本研究中，我们提出了FlowIQN，一种CFM批评方法，通过在每个小批次中对源样本和Bellman目标样本进行排序，近似单调最优传输耦合，用分位数对齐的流路径替代任意配对来解决这一不匹配。我们证明了分位数耦合CFM批判者的丧失得到一个与DRL基础兼容的Wasserstein对齐近似投影。据我们所知，FlowIQN 是首个具有明确 Wasserstein 对齐投影保证的流匹配分布批评器。我们进一步扩展了FlowIQN，增加了捷径模型以实现高效的推理。实证结果表明，FlowIQN相比其他CFM批评者更能提升Wasserstein的回报分布准确性。它还在多种策略提取方法的离线强化学习基准测试中表现出竞争力，提供了一个理论基础且易于兼容DRL流水线的CFM批评者。代码：这个 https URL。
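
The sorting step described above can be sketched in a few lines: in one dimension, pairing the i-th smallest source sample with the i-th smallest target sample gives the monotone coupling, which minimizes the mean squared displacement among all pairings of the batch. The sample distributions below are arbitrary stand-ins, and the names are illustrative rather than taken from the FlowIQN code.

```python
import numpy as np

def quantile_couple(source, target):
    """Pair the i-th smallest source sample with the i-th smallest target sample."""
    return np.sort(source), np.sort(target)

rng = np.random.default_rng(0)
source = rng.standard_normal(256)            # base noise samples
target = rng.normal(3.0, 2.0, size=256)      # stand-in for Bellman target returns

# Cost of the arbitrary (as-drawn) pairing versus the sorted pairing.
random_cost = np.mean((source - target) ** 2)
s_sorted, t_sorted = quantile_couple(source, target)
sorted_cost = np.mean((s_sorted - t_sorted) ** 2)   # squared empirical W2 distance

# Straight-line flow-matching training pairs under the sorted coupling.
t = rng.uniform(size=256)
x_t = (1.0 - t) * s_sorted + t * t_sorted    # point on the path at time t
velocity = t_sorted - s_sorted               # regression target for the velocity field
```

Since the sorted maps never cross, each flow path carries a source quantile to the same target quantile, which is the quantile alignment the abstract refers to.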

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

OracleTSC：交通信号控制的Oracle知情奖励障碍与不确定性规范化

Authors: Darryl Jacob, Xinyu Liu, Muchao Ye, Xiaoyong Yuan, Pan He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08516
Pdf link: https://arxiv.org/pdf/2605.08516
Abstract Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement finetuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional finetuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
中文摘要 透明的决策对于交通信号控制（TSC）系统赢得公众信任至关重要。然而，传统的基于强化学习的TSC方法则是具有有限解释性的黑箱。尽管大型语言模型（LLMs）可以提供自然语言推理，但TSC的强化微调仍然不稳定，因为反馈稀疏且延迟，大多数操作仅产生拥塞度量的边际变化。我们介绍了OracleTSC，它通过两种机制稳定基于LLM的TSC：（1）通过从环境奖励中扣除校准阈值来过滤弱学习信号的奖励障碍机制;（2）不确定性正则化，最大化所选反应的概率，从而鼓励在采样输出间做出一致的决策。LibSignal基准测试的实验表明，OracleTSC使紧凑的LLaMA3-8B模型能够显著提升流量效率，较预训练基线缩短75%的行程时间和67%的队列长度，同时保持自然语言解释的可理解性。OracleTSC还展示了强有力的交叉交叉推广：一个基于一个交叉口训练的策略，无需额外微调即可转移到结构截然不同的交叉口，行车时间减少17%，排队长度缩短39%。这些结果表明，不确定性意识的奖励塑造可以提升TSC强化微调的稳定性和效果。
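
The two mechanisms can be written down in a minimal form: subtract a threshold from the environment reward, and penalize low probability of the selected response. How the threshold is calibrated and how the two terms are weighted is not stated in the abstract, so `hurdle`, `beta`, and all function names here are assumptions.

```python
import math

def hurdle_reward(env_reward, hurdle):
    """Reward after subtracting the hurdle; marginal improvements become negative."""
    return env_reward - hurdle

def uncertainty_penalty(selected_prob):
    """Negative log-probability of the selected response (zero when fully confident)."""
    return -math.log(selected_prob)

def shaped_objective(env_reward, hurdle, selected_prob, beta=0.1):
    """Hypothetical combination of the two terms with weight beta."""
    return hurdle_reward(env_reward, hurdle) - beta * uncertainty_penalty(selected_prob)

weak_signal = shaped_objective(env_reward=0.05, hurdle=0.1, selected_prob=1.0)
strong_signal = shaped_objective(env_reward=0.60, hurdle=0.1, selected_prob=0.5)
```

An action that barely changes congestion ends up with a negative shaped reward, while a clearly beneficial action stays positive even after the uncertainty term.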

MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service

MARLaaS：多租户异步强化学习即服务

Authors: Timothy Tin Long Yu, Gursimran Singh, Ge Shi, Hanieh Sadri, Yong Zhang, Zhenan Fan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08527
Pdf link: https://arxiv.org/pdf/2605.08527
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs), particularly in multi-turn agentic settings involving environment interaction like tool use. However, fine-tuning such models remains prohibitively expensive due to high computational requirements, limiting accessibility. We propose MARLaaS (Multi-tenant Asynchronous RL as a Service), a system for concurrent RL fine-tuning across multiple users and tasks. Our approach is based on two key ideas: (1) sharing a base model across tenants using lightweight LoRA adapters, and (2) a disaggregated asynchronous architecture that decouples rollout generation, environment interaction, and policy training into independently scheduled stages. This design enables tasks to progress through the RL pipeline at their own pace in an event-driven manner, reducing cross-task interference, idle time, and end-to-end latency. In multi-task settings (we report up to 32 concurrent tasks), MARLaaS achieves single-task state-of-the-art performance while improving accelerator utilization by up to 4.3x and reducing end-to-end training time by 85%.
中文摘要 可验证奖励强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理能力，尤其是在涉及环境交互的多回合代理环境中，如工具使用。然而，由于计算需求高，微调此类模型仍然成本高昂，限制了可及性。我们提出了MARLaaS（多租户异步RL即服务）系统，这是一个跨多个用户和任务进行并发强化学习微调的系统。我们的方法基于两个核心理念：（1）使用轻量级LoRA适配器在租户间共享基础模型，（2）将部署生成、环境交互和策略培训拆分成独立调度阶段的异步架构。该设计使任务能够以事件驱动的方式以自身节奏在强化学习流水线中推进，减少跨任务干扰、空闲时间和端到端延迟。在多任务环境中（我们报告多达32个并发任务），MARLaaS实现单任务的先进性能，同时将加速器利用率提升4.3倍，端到端培训时间减少85%。
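
The decoupled, event-driven staging can be sketched with `asyncio` queues, where each stage consumes work as soon as it arrives instead of waiting for a global step. This is a toy single-process model; the stage names, tenant labels, and sentinel-based shutdown are assumptions, and the shared base model with per-tenant LoRA adapters is not modeled.

```python
import asyncio

async def stage(name, inbox, outbox, log):
    """Consume items from inbox, record them, and forward them downstream."""
    while True:
        item = await inbox.get()
        if item is None:                     # shutdown sentinel
            if outbox is not None:
                await outbox.put(None)
            break
        await asyncio.sleep(0)               # yield control, standing in for real work
        log.append((name, item))
        if outbox is not None:
            await outbox.put(item)

async def run_pipeline(tasks):
    rollout_q, env_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    log = []
    workers = [
        asyncio.create_task(stage("rollout", rollout_q, env_q, log)),
        asyncio.create_task(stage("environment", env_q, train_q, log)),
        asyncio.create_task(stage("train", train_q, None, log)),
    ]
    for task in tasks:
        await rollout_q.put(task)
    await rollout_q.put(None)
    await asyncio.gather(*workers)
    return log

# Hypothetical work items from two tenants.
log = asyncio.run(run_pipeline(["tenant-a/step-0", "tenant-b/step-0"]))
```

Each work item moves to the next stage as soon as the previous one finishes with it, so one tenant's slow environment step does not force the others to idle at a barrier.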

Technical Report: A Hierarchical Dynamically Weighting Deep Reinforcement Learning Method for Multi-UAV Multi-Task Coordination

技术报告：一种用于多无人机多任务协调的层级动态加权深度强化学习方法

Authors: Xindi Wang, Haining Li, Tao Ding, Bolin Cai
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.08623
Pdf link: https://arxiv.org/pdf/2605.08623
Abstract This paper investigates the multi-UAV multi-task coordination problem in infrastructure-less emergency scenarios, where UAVs are required to jointly perform aerial image acquisition and ground-user communication. To tackle the challenge of balancing heterogeneous tasks within dynamic environments, we propose a hierarchical dynamic weighting Deep Reinforcement Learning (DRL) framework. Specifically, an episode-level module is introduced to capture global task preferences, while a step-level module adaptively adjusts the objective weights according to real-time system conditions. By integrating global and instantaneous weights, the proposed framework improves decision stability and responsiveness during task execution. Simulation results demonstrate that the proposed method achieves faster convergence, more stable training, and higher task completion efficiency than conventional methods.
中文摘要 本文探讨了无基础设施紧急场景下的多无人机多任务协调问题，这些场景中无人机需要协同执行空中图像采集和地面用户通信。为了解决在动态环境中平衡异构任务的挑战，我们提出了一个层级动态加权深度强化学习（DRL）框架。具体来说，引入了章节级模块以捕捉全局任务偏好，而步进级模块则根据实时系统条件自适应调整目标权重。通过整合全局权重和瞬时权重，所提框架提升了任务执行中的决策稳定性和响应性。模拟结果表明，所提方法比传统方法实现更快的收敛、更稳定的训练和更高的任务完成效率。

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

ReLibra：路由-重放-引导负载平衡，用于强化学习中的MoE训练

Authors: Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08639
Pdf link: https://arxiv.org/pdf/2605.08639
Abstract Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6$\times$ over Megatron-LM and by up to 1.2$\times$ over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.
中文摘要 负载不平衡是专家混合（Mixture-of-Experts，MoE）训练中的长期挑战，在大型语言模型（LLM）的强化学习（RL）中尤为严重，因为热门专家可能在微批次之间频繁变化。现有的 MoE 训练系统依赖历史负载预测未来的专家需求，因此在剧烈波动下效果较差。我们提出了 ReLibra，这是一套 MoE 强化学习训练系统，它利用 RL 的 rollout-训练流程中一个独特的机会，即路由重放（routing replay），在微批次粒度上实现细粒度负载均衡。由于 rollout 和训练使用相同的 MoE 参数处理相同的 token，token 到专家的路由决策在训练开始前就已知。利用这些信息，ReLibra 在批次间和批次内两个时间尺度上部署了两种 MoE 负载均衡机制，并使它们的通信模式与分层网络带宽相匹配。在批次间时间尺度上，ReLibra 进行专家重排序，重新分配专家以实现批次级的跨节点均衡；在批次内时间尺度上，它在节点内动态执行专家复制，以吸收微批次级别的负载波动。在多种 MoE 大型语言模型和强化学习工作负载上的实验显示，ReLibra 的训练吞吐量比 Megatron-LM 最高提升 1.6$\times$，比 EPLB 最高提升 1.2$\times$，即使 EPLB 获得了 oracle 负载也是如此。此外，ReLibra 的吞吐量与理想化均衡基线的差距保持在 6%-10% 以内。
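
Because the routing decisions are known before training starts, per-expert token counts can be treated as exact inputs to a placement step. The sketch below uses greedy longest-processing-time placement as a simple stand-in for the batch-level rebalancing; it is not ReLibra's algorithm and does not model expert replication. The expert names and token counts are made up.

```python
import heapq

def place_experts(expert_loads, num_devices):
    """Greedy placement: heaviest expert first, onto the least-loaded device."""
    heap = [(0, d) for d in range(num_devices)]      # (current load, device id)
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    ordered = sorted(expert_loads.items(), key=lambda kv: kv[1], reverse=True)
    for expert, load in ordered:
        total, device = heapq.heappop(heap)
        placement[device].append(expert)
        heapq.heappush(heap, (total + load, device))
    device_loads = {d: sum(expert_loads[e] for e in experts)
                    for d, experts in placement.items()}
    return placement, device_loads

# Hypothetical token counts per expert, as recorded during rollout.
routed_tokens = {"e0": 900, "e1": 100, "e2": 450, "e3": 400, "e4": 80, "e5": 70}
placement, device_loads = place_experts(routed_tokens, num_devices=2)
```

With predicted rather than known loads, the same placement could be badly off whenever the hot experts shift, which is the weakness of history-based balancing that the abstract points out.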

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

结构化循环混合器用于大规模并行序列生成

Authors: Benjamin L. Badger
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08696
Pdf link: https://arxiv.org/pdf/2605.08696
Abstract Over the last two decades, language modeling has experienced a shift from predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer (SRM), an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers served with vLLM, increases that also characterize PyTorch implementations and result in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
中文摘要 在过去二十年里，语言建模经历了一次转变：从在训练和推理时都按顺序处理 token 的循环架构为主，转向在训练时并行处理序列元素的非循环模型，这带来了更高的训练效率和稳定性，但代价是推理吞吐量降低。这里我们介绍结构化循环混合器（Structured Recurrent Mixer，SRM），这种架构允许在训练时的序列并行表示与推理时的循环表示之间进行代数转换，而且无需专用内核或设备特定的内存管理。我们通过实验表明，与其他线性复杂度模型相比，这种双重表示带来更高的训练效率、更高的输入信息容量以及更大的推理吞吐量和并发性。我们推测，对于语言这类信息丰富的输入，循环模型不适合扩展序列长度，但由于每个样本的内存恒定，它们非常适合在样本（批次）维度上扩展。我们提供了 SRM 的 Mojo/MAX 推理实现，其吞吐量是在 vLLM 上推理的同等能力 Transformer 的 12 倍，并发量是其 170 倍；PyTorch 实现也表现出类似的提升，使计算量恒定条件下的 GSM8k Pass@k 提升了 30%。最后，我们证明 SRM 是有效的强化学习训练候选模型。
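
The general idea of one model having both a parallel and a recurrent form can be shown with the simplest linear recurrence, h_t = decay * h_{t-1} + x_t. This is a generic illustration under that assumed recurrence, not the SRM architecture, whose mixing structure the abstract does not describe.

```python
import numpy as np

def recurrent_form(x, decay):
    """Sequential evaluation: constant memory per step, as used at inference."""
    h = np.zeros_like(x[0])
    outputs = []
    for x_t in x:
        h = decay * h + x_t
        outputs.append(h)
    return np.stack(outputs)

def parallel_form(x, decay):
    """Same outputs from one matrix product: h_t = sum_{s <= t} decay**(t - s) * x_s."""
    steps = np.arange(x.shape[0])
    exponents = steps[:, None] - steps[None, :]
    # Lower-triangular weights; entries above the diagonal (future steps) are zero.
    weights = np.where(exponents >= 0, decay ** np.maximum(exponents, 0), 0.0)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))              # (sequence length, feature dimension)
same = np.allclose(recurrent_form(x, 0.9), parallel_form(x, 0.9))
```

The parallel form processes all positions at once during training, while the recurrent form only needs the previous state, which is why memory per sample stays constant at inference.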

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

REAP：基于高斯喷溅模拟器实现Real2Sim2Real传输的端到端自主停车强化学习

Authors: Changze Li, Zhe Chen, Shaoyu Chen, Lisen Mu, Yijian Li, Yuelong Yu, Qian Zhang, Qing Su, Ming Yang, Tong Qin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08713
Pdf link: https://arxiv.org/pdf/2605.08713
Abstract In recent years, autonomous parking has made significant advances, yet parking tasks still face challenges in extreme scenarios such as mechanical and dead-end parking slots, often resulting in failures. This is mainly due to traditional parking methods adopting a multistage approach, lacking the ability to optimize the parking problem as a whole. End-to-end methods enable joint optimization across perception and planning modules to eliminate the accumulation of errors, enhancing algorithm performance in extreme scenarios. Although several end-to-end parking methods use imitation or reinforcement learning, the former is limited by data cost and distribution coverage, while the latter suffers from inefficient exploration. To address these challenges, we propose a Reinforcement learning End-to-end Autonomous Parking method (REAP). REAP employs Soft Actor-Critic (SAC) within an asymmetric reinforcement learning framework to improve training efficiency and inference performance. To accelerate model convergence, we distill the capabilities of a rule-based planner into the end-to-end network through behavior cloning. We further introduce a soft predictive collision penalty mechanism to reduce collision rates by penalizing obstacle-approaching actions. To ensure that the trained reinforcement learning network can directly transfer to real-world scenarios, we have established a Real2Sim2Real simulator. In the Real2Sim step, we use 3D Gaussian Splatting (3DGS) to transform real-world scenes into digital scenes. In the Sim2Real step, we deploy the end-to-end model onto the vehicle to bridge the Sim2Real gap. Trained in the 3DGS simulator and deployed on physical vehicles, REAP successfully parks in various types of parking spaces, especially demonstrating the feasibility of end-to-end RL parking in extremely narrow mechanical slots.
中文摘要 近年来，自动停车取得了显著进步，但停车任务仍面临机械停车位和死胡同等极端场景的挑战，常常导致故障。这主要是因为传统停车方法采用多阶段方法，缺乏优化整体停车问题的能力。端到端方法实现了感知与规划模块之间的联合优化，消除误差的累积，提升了极端场景下的算法性能。尽管多种端到端停车方法采用模仿或强化学习，但前者受限于数据成本和分发覆盖范围，后者则存在探索效率低下的问题。为应对这些挑战，我们提出了一种强化学习端到端自主停车方法（REAP）。REAP在非对称强化学习框架内采用软演员-批评者（SAC）以提升训练效率和推理性能。为了加速模型融合，我们通过行为克隆将基于规则的规划器能力提炼到端到端网络中。我们还引入了软性预测碰撞惩罚机制，通过惩罚接近障碍物的行为来降低碰撞率。为了确保训练有素的强化学习网络能够直接迁移到现实场景，我们建立了Real2Sim2Real模拟器。在Real2Sim步骤中，我们使用3D高斯喷溅（3DGS）将现实场景转换为数字场景。在Sim2Real步骤中，我们将端到端模型部署到车辆上，以弥合Sim2Real的差距。REAP在3DGS模拟器中训练并部署于实体车辆上，成功地将多种停车位停放，特别是展示了在极窄机械空间内端到端强化车停车的可行性。

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

AgentForesight：多智能体系统中早期故障预测的在线审计

Authors: Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.08715
Pdf link: https://arxiv.org/pdf/2605.08715
Abstract LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or raise an alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on this, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, shifting from post-hoc failure detection to deployment-time intervention. Project page: this https URL
中文摘要 基于LLM的多智能体系统越来越多地部署在长程任务中，但单个决定性错误常被下游智能体接受，并级联成轨迹级失败。现有研究将其视为事后失败归因（post-hoc failure attribution），即在轨迹结束后诊断负责的智能体和步骤。然而，这种范式放弃了在轨迹仍在展开时进行干预的任何机会。在本研究中，我们介绍了 AgentForesight，这一框架将该问题重新定义为在线审计：在展开中的轨迹的每一步，审计器只能观察当前前缀，必须选择继续运行，或在最早的决定性错误处报警，且无法访问后续步骤。为此，我们构建了 AFTraj-2K，这是一个涵盖编码、数学和智能体领域的智能体轨迹语料库，其中安全轨迹经严格的筛选流程保留，不安全轨迹则通过多个 LLM 评审的共识在其决定性错误所在的步骤上标注。在此基础上，我们开发了 AgentForesight-7B，这是一个紧凑的在线审计器，采用由粗到细的强化学习方案训练：首先在相邻的安全/不安全前缀对上，使其在失败边界处具备风险预判先验，然后在一个三轴奖励下将该先验细化为精确的步骤级定位，该奖励联合针对审计结论的“是什么”、“在哪里”和“是谁”。在 AFTraj-2K 和外部的 Who&When 基准上，AgentForesight-7B 的表现优于包括 GPT-4.1 和 DeepSeek-V4-Pro 在内的领先专有模型，实现最高 +19.9% 的性能提升和 3$\times$ 更低的步骤定位误差，从事后失败检测转向部署时干预。项目页面：此 https URL
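
The online auditing protocol itself is easy to state in code: at each step the auditor sees only the prefix so far and either lets the run continue or raises an alarm. The sketch below shows just that loop; `toy_auditor` is a trivial keyword rule standing in for a trained model such as AgentForesight-7B, and the step strings are invented.

```python
from typing import Callable, Optional, Sequence

def audit_online(steps: Sequence[str],
                 auditor: Callable[[Sequence[str]], bool]) -> Optional[int]:
    """Return the index of the first step at which the auditor alarms, else None.

    The auditor is only ever shown steps[0 : i + 1], never future steps.
    """
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        if auditor(prefix):
            return i                          # alarm: stop the run here
    return None                               # trajectory judged safe throughout

def toy_auditor(prefix: Sequence[str]) -> bool:
    """Placeholder rule: alarm when the latest step contains the word ERROR."""
    return "ERROR" in prefix[-1]

trajectory = ["plan task", "call tool", "ERROR: edited wrong file", "write report"]
alarm_step = audit_online(trajectory, toy_auditor)
```

Post-hoc attribution would receive the full `trajectory` at once; the online setting differs only in that the decision at step `i` cannot depend on anything after it.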

Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

打破僵局：面向社交语言智能体的双尺度进化策略训练

Authors: Minzheng Wang, Run Luo, Yanbo Wang, Zichen Liu, Yuqiao Tan, Tao Tan, Xu Nan, Yinhe Zheng, Wenji Mao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.08721
Pdf link: https://arxiv.org/pdf/2605.08721
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
中文摘要 虽然带可验证奖励的强化学习（RLVR）已被证明对封闭式任务有效，但将其推广到自玩的开放式社交语言游戏中，揭示了一个关键问题：进化僵局。由于战略空间广阔，语言代理经常趋同于同质化行为，导致确定性匹配结果，消除了策略演化所需的梯度信号。为解决这一问题，我们提出了针对社会语言游戏的双尺度进化政策培训（DEPT）。DEPT引入了一种时间尺度的进化感知机制，通过量化双尺度值基线偏差与匹配熵来检测僵局。一旦感知到崩溃，它就会激活非对称优势重塑，动态调节干预的优化环境。因此，我们的方法有效恢复梯度信号，并推动持续的战略探索。多款社会语言博弈的广泛实验表明，DEPT优于强基线，避免了政策退化，并推动社会语言代理的持续进化。
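
One of the two impasse signals, match entropy, can be sketched as the Shannon entropy of recent match outcomes: when self-play results become deterministic, the entropy falls to zero. The window contents and the `threshold` value are assumptions, and the dual-scale value baseline divergence mentioned in the abstract is not modeled.

```python
import math
from collections import Counter

def match_entropy(outcomes):
    """Shannon entropy (in nats) of the empirical outcome distribution."""
    n = len(outcomes)
    counts = Counter(outcomes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def is_impasse(outcomes, threshold=0.3):
    """Flag an impasse when recent outcomes are nearly deterministic."""
    return match_entropy(outcomes) < threshold

# Hypothetical windows of recent self-play results.
healthy_window = ["win", "loss"] * 5
collapsed_window = ["win"] * 10
```

A window where one side always wins carries no information about which strategies are better, which is the lost gradient signal the abstract describes.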

Generative Actor-Critic with Soft Bridge Policies

带软桥策略的生成式演员-评论家

Authors: Ke He, Le He, Shunpu Tang, Yafei Wang, Lisheng Fan
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2605.08733
Pdf link: https://arxiv.org/pdf/2605.08733
Abstract Expressive generative policies such as diffusion and flow models are appealing for MaxEnt online reinforcement learning because of their ability to model multimodal and highly non-Gaussian action distributions. However, training effective soft generative policies faces two obstacles that often arise together. First, marginal action densities are often unavailable, so existing methods typically rely on entropy bounds, heuristic proxies or approximations. Second, iterative shared-parameter samplers raise inference cost and require backpropagation through time over repeated network evaluations, increasing memory cost and destabilizing policy optimization. These obstacles motivate us to seek a generative policy that exposes a tractable MaxEnt objective while requiring only a single sampled actor forward pass for action generation. To this end, we propose soft generative actor-critic (SoftGAC), whose actor defines a stochastic bridge from a fixed base latent to a terminal action latent in pre-tanh space. This structured bridge allows us to lift the MaxEnt objective as an analytically tractable path-wise relative-entropy objective against a high-entropy reference process. In practical finite-step implementation, this relative entropy reduces exactly to sampled transition control energy and thus provides principled soft regularization. Moreover, we keep the single-pass actor lightweight by using small step-specific bridge transitions, each evaluated only once per sampled action, while maintaining a parameter budget comparable to strong actor baselines. Extensive experiments on challenging continuous-control benchmarks show that SoftGAC attains higher or competitive returns than strong generative policy baselines, including diffusion and flow-matching policies, while staying in the low-latency regime of one-pass actors and showing considerable improvements in the compute-return tradeoff.
中文摘要 表现型生成策略如扩散和流模型对MaxEnt在线强化学习具有吸引力，因为它们能够模拟多模态且高度非高斯作用分布。然而，培训有效的软生成策略面临两个常常同时出现的障碍。首先，边际作用密度通常不可得，因此现有方法通常依赖熵界限、启发式代理或近似。其次，迭代共享参数采样器提高了推理成本，并要求通过网络反向评估时间进行反向传播，增加内存成本并破坏策略优化。这些障碍促使我们寻求一种生成策略，既能暴露一个可操作的 MaxEnt 目标，同时只需对单个采样演员的前向传递进行动作生成。为此，我们提出了软生成演员-批判者（SoftGAC），其演员定义了从固定基潜伏到预坦空间中潜在终端动作的随机桥。该结构化桥梁使我们能够将MaxEnt目标作为一个可解析的路径相对熵目标，提升到高熵参考过程。在实际的有限步实现中，该相对熵精确归还为采样的跃迁控制能量，从而提供了原则性的软正则化。此外，我们通过使用小的步特异桥接转移（每个采样动作仅评估一次），保持单遍演员的轻量级，同时保持与强演员基线相当的参数预算。对挑战性连续控制基准的广泛实验表明，SoftGAC在保持低延迟的单次操作行为（如扩散和流量匹配策略）中获得更高或有竞争力的回报，同时保持在单次执行者的低延迟区间，并在计算与回报权衡方面表现出显著改善。

Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

价值分解强化学习框架，用于滑行道路由，具有层级冲突感知观察

Authors: Shizhong Zhou, Haifeng Liu, Zheng Zhang, Shiyu Zhang, Bo Yang, Yi Lin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08754
Pdf link: https://arxiv.org/pdf/2605.08754
Abstract Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety-efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.
中文摘要 滑行道路由与地面冲突避免是机场地面运营中安全关键决策问题的结合。现有的规划和优化方法常常受限于在线计算成本，而强化学习方法则可能难以表现下游流量冲突并平衡多个目标。本文介绍了冲突感知滑行道路由（CaTR），这是一种用于实时多机滑行道路由的强化学习框架。CaTR构建了一个基于网格的机场地面环境，采用动作掩蔽，引入分层的前瞻性交通表示以编码当前及后续冲突相关的交通状况，并采用价值分解强化学习策略，优先处理稀疏但安全关键的目标。实验基于长沙黄花国际机场，在多个交通密度层级下的真实环境中进行。结果显示，CaTR在安全性和效率权衡上优于代表性规划、优化和强化学习基线，同时保持了实用运行时间。
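
The abstract mentions action masking on a grid-based environment. As a rough illustration only, the sketch below shows the general technique of restricting a greedy choice to valid actions. The function name, the Q-value array, and the mask layout are assumptions made for this example and are not taken from CaTR.

```python
import numpy as np

def masked_greedy_action(q_values, valid_mask):
    """Pick the highest-valued action among those marked valid.

    q_values and valid_mask are hypothetical inputs: one estimated value
    and one boolean flag per candidate move (e.g. neighbouring grid cells).
    """
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(valid_mask, dtype=bool)
    if not mask.any():
        raise ValueError("no valid action available")
    # Invalid actions get -inf so they can never win the argmax.
    masked_q = np.where(mask, q, -np.inf)
    return int(np.argmax(masked_q))

# The best raw value (index 1) is masked out, so index 2 is chosen.
action = masked_greedy_action([1.0, 5.0, 3.0], [True, False, True])
```

Masking before the argmax keeps infeasible moves out of the policy's choices without needing a penalty term in the reward.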

AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

AHD 代理：自动化启发式设计的智能强化学习

Authors: Haoze Lv, Ning Lu, Ziang Zhou, Shengcai Liu
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2605.08756
Pdf link: https://arxiv.org/pdf/2605.08756
Abstract Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that large language models (LLMs), when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.
中文摘要 自动启发式设计（AHD）已成为解决NP难组合优化问题（COPs）的有前景范式。最新研究表明，大型语言模型（LLMs）在被整合进设计良好的框架（即LLM-AHD）时，能够自主发现高效能的启发式。然而，现有的LLM-AHD框架通常将LLM视为固定工作流中的被动生成器，模型从手动设计的有限上下文中生成启发式。此类上下文可能无法捕捉依赖状态的信息（例如特定的故障模式），导致试错探索效率低下。为克服这些局限，我们提出了AHD代理，这是一种新型工具集成、多回合框架，使大型语言模型能够主动决定是生成启发式算法还是调用工具，从解决环境中检索定向证据。为了有效训练这样一个动态决策代理，我们引入了一种智能强化学习（RL）系统，利用一种新颖的环境综合流水线优化紧凑模型的可推广AHD能力。跨八个不同领域的实验，包括四个未完成任务，表明我们的4B参数代理能够匹配甚至超过使用更大模型的最先进基线，同时所需的评估次数显著减少。模型和推断尺度分析进一步表明，AHD代理为实现真正自主的启发式设计提供了有效路径。

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

基于学习的全尺度顺序决策框架，用于托特搬运机器人系统的订单履行

Authors: Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08758
Pdf link: https://arxiv.org/pdf/2605.08758
Abstract Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework provides efficient, unified order fulfillment decision-making for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.
中文摘要 受电子商务和小批量生产快速扩展推动，成品、半成品和原材料的内物流装载单元规模正在稳步缩小。托特箱正逐渐取代托盘，成为主要的处理和储存容器。这一转变推动了托特搬运机器人系统成为自动化订单履行中心的前沿。托特处理机器人系统的订单履行决策具有共同的订单-托特-机器人顺序决策特性。现有研究主要聚焦于针对特定系统量身定制的决策机制，这使得将其推广或转移到其他语境变得困难。我们提出了一种全尺度基于学习的顺序决策框架，用于托特搬运机器人系统的订单实现（OLSF-TRS），这是一种通用且可扩展的顺序决策框架，结合结构化组合优化与多智能体强化学习，协调订单、托特和机器人决策。在小型托盘搬运机器人系统中，OLSF-TRS在两种不同系统配置下实现了接近最佳性能，平均最优差距低于3.5%。在大规模场景下，OLSF-TRS在两种不同系统类型中始终优于启发式基线，将总托特箱移动减少8-12%，与基于SOTA规则的方法相比减少了30%以上，同时保持实时响应性。这些改进带来了切实的运营效益，包括降低成本、降低能耗和提升吞吐量稳定性。该框架为广泛部署的托特搬运机器人系统提供高效统一的订单履行决策框架，支持电子商务和工业物流领域的高质量订单履约。

Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

并非所有回合都重要：多回合越狱的功劳分配

Authors: Zhida He, Xiaoyu Wen, Han Qi, Ziyuan Zhou, Peng Yu, Xingcheng Xu, Dongrui Liu, Xia Hu, Chaochao Lu, Qiaosheng Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.08778
Pdf link: https://arxiv.org/pdf/2605.08778
Abstract Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.
中文摘要 在多回合对话中使用大型语言模型，有助于越狱攻击，将有害意图分散到看似无害的回合。近期基于训练的多回合越狱方法通过交互反馈学习长视野攻击策略，但通常依赖于均匀广播到每一回合的粗略轨迹级结果信号。然而，我们发现多回合越狱中的回合级贡献是非均匀的、相位依赖且目标特定。这种粗糙的结果监督会引发信用分配问题，导致成功轨迹中冗余的转向过度奖励，而失败轨迹中有用的中间转向被低估。为此，我们提出了TRACE，一种基于强化学习（RL）的多回合越狱的轮回感知学分分配框架。对于成功的轨迹，TRACE通过保留一转弯语义掩蔽估计回合级贡献;对于失败的案件，TRACE根据及时的危害性和语义相关性施加惩罚，并额外施加本地拒绝意识惩罚。此外，我们会重复使用攻击侧信用信号进行多回合防御对齐。对开源和闭源目标的广泛实验表明，TRACE在整体效能、可转移性和效率方面表现出色，攻击成功率相较最强强的强强化学习基线提升约25%，同时在防御对齐中重用时也改善了安全与效用的平衡。

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

CoLVR：通过对比优化增强探索性潜在视觉推理

Authors: Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.08802
Pdf link: https://arxiv.org/pdf/2605.08802
Abstract Due to the potential of Latent Visual Reasoning for exploratory reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thereby fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at this https URL.
中文摘要 由于潜在视觉推理具有探索性推理的潜力，近期研究倾向于使多模态大型语言模型（MLLM）能够通过传播连续的隐藏状态来实现视觉推理，而不是将中间步骤解码为离散的标记。然而，现有研究通常依赖硬对齐目标，强制潜在表征与预定义的视觉特征相匹配，从而严重限制了潜在推理过程的探索。为解决这个问题，我们提出了CoLVR（潜在视觉推理的对比优化）。为了获得更具探索性的视觉推理，CoLVR引入了潜在对比训练框架。首先，CoLVR学习多样且探索性的表征，并以基于角度的扰动引导的潜在对比目标，扩展语义潜在空间，避免过度约束嵌入。随后，CoLVR采用潜在轨迹对比奖励，用于强化学习（RL）训练后，以实现对潜在视觉推理过程的细粒度优化，从而促进多样化的推理行为。实验表明，CoLVR显著增强了潜在表征的探索能力，在VSP上平均提升了5.83%，在Jigsaw上分别提升了8.00%，同时在域外基准测试中也优于现有潜在模型，MMStar提升了3.40%。数据、代码和模型均在此 https URL 发布。

Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion

高保真和多功能四足行走的约束感知扩散先导

Authors: Jianhui Chen, Ruixin Zhan, Liu Liu, Yang Cai, Ziqiao Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.08804
Pdf link: https://arxiv.org/pdf/2605.08804
Abstract Reinforcement learning combined with imitation learning has significantly advanced biomimetic quadrupedal locomotion. However, scaling these frameworks to massive, multi-source datasets exposes fundamental bottlenecks. First, traditional GAN-based discriminators are prone to mode collapse, struggling to capture diverse motion distributions from uncurated datasets. Second, existing kinematic priors suffer from out-of-distribution (OOD) tracking conflicts, leading to severe unintended heading drifts during complex maneuvers. Furthermore, deploying unconstrained priors to physical hardware poses critical safety risks by disregarding actuator dynamics. To overcome these challenges, we propose Diff-CAST (Diffusion-guided Constraint-Aware Symmetric Tracking), a novel motion prior framework leveraging the multi-modal distribution modeling capabilities of diffusion models for stylistic rewards. Diff-CAST effectively replaces traditional GAN discriminators, unlocking robust data scaling on heterogeneous collections. To ensure high-fidelity intent execution and reliable real-world deployment, we introduce a comprehensive Sim2Real architecture integrating Symmetric Augmented Command Conditioning (SACC) for drift-free tracking, and Constrained RL for hardware safety. Experiments on a quadruped demonstrate that Diff-CAST mitigates mode collapse, enables seamless transitions between diverse skills, and ensures robust, hardware-compliant locomotion.
中文摘要 强化学习与模仿学习的结合显著推动了仿生四足行走的发展。然而，将这些框架扩展到庞大、多源数据集时，会暴露出根本性的瓶颈。首先，传统的基于GAN的判别器容易出现模式崩溃，难以从未经策划的数据集中捕捉多样化的运动分布。其次，现有的运动学先验存在分布外（OOD）跟踪冲突，导致复杂机动中严重的非预期航向漂移。此外，将不受限制的先验部署到物理硬件，存在严重的安全风险，因为它忽视了执行器动力学。为克服这些挑战，我们提出了Diff-CAST（扩散引导约束感知对称跟踪）这一新型运动先验框架，利用扩散模型的多模分布建模能力实现风格奖励。Diff-CAST 有效取代了传统的 GAN 判别器，解锁了对异构集合的数据稳健扩展能力。为确保高精度意图执行和可靠的实际部署，我们引入了全面的Sim2Re架构，集成了对称增强指令条件（SACC）实现无漂移跟踪，以及受限强化学习（Constrained RL）实现硬件安全。四足动物的实验表明，Diff-CAST能够减少模式崩溃，实现不同技能之间的无缝过渡，并确保行走稳健且符合硬件标准。

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

《你如何开始就是你的推理：通过前缀调优先验驱动RLVR中的探索》

Authors: Yifan Xu, Junren Chen, Yifan Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08817
Pdf link: https://arxiv.org/pdf/2605.08817
Abstract Reinforcement learning with verifiable rewards (RLVR) has recently thrived in large language model (LLM) reasoning tasks. However, reward sparsity and the long reasoning horizon make effective exploration challenging. In practice, this challenge manifests as the entropy collapse phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage of successful reasoning trajectories. Passive exploration techniques like entropy regularization tend to neglect generation quality, resulting in noisy rollouts. In response to this issue, we propose an Information-Maximizing Augmented eXploration (IMAX) framework to train a pool of soft prefixes that reshape the base model's prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout distribution from the same backbone model. To encourage discovery of diverse and task-relevant reasoning behaviors, we derive an Information Maximization (InfoMax) reward to complement the verifiable rewards for RL training. IMAX is in general algorithm-agnostic and can be seamlessly integrated into existing RLVR pipelines. Experimental results show that across three backbone scales, IMAX consistently improves reasoning performance over standard RLVR, with gains up to 11.60% in Pass@4 and 10.57% in Avg@4.
中文摘要 带有可验证奖励的强化学习（RLVR）近年来在大型语言模型（LLM）推理任务中蓬勃发展。然而，奖励稀少和推理时间长使得有效的探索变得具有挑战性。在实际操作中，这一挑战表现为\emph{熵坍缩}现象，即RLVR提高了单次推广的准确性，但未能扩大成功推理轨迹的覆盖范围。像熵正则化这样的被动探索技术往往忽略生成质量，导致滚动噪声大。针对这一问题，我们提出了一个信息最大化增强探索（IMAX）框架，用于训练一组软前缀，重塑基础模型的先验推理轨迹。每个前缀不再依赖强化学习来激励在基础模型上探索，而是作为可训练的控制旋钮，从同一骨干模型中诱导出不同的推广分布。为了鼓励发现多样且与任务相关的推理行为，我们推导出信息最大化（InfoMax）奖励，以补充强化学习训练的可验证奖励。IMAX总体上不依赖算法，可以无缝集成到现有的RLVR流水线中。实验结果显示，在三个骨干量表上，IMAX的推理性能优于标准RLVR，Pass@4提升最高11.60%，Avg@4提升10.57%。

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

BubbleSpec：将长尾气泡转化为同步强化学习的推测性推广草稿

Authors: Yuhang Xu, Kaibin Tian, Yang Tian, Zhice Yang, Yifeng Yu, Yan Li, Shengzhong Liu, Fan Wu, Guihai Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08862
Pdf link: https://arxiv.org/pdf/2605.08862
Abstract Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase constitutes a significant efficiency bottleneck, mainly arising from the long-tail bubbles across data parallel ranks, particularly in long-context scenarios where faster GPUs remain idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm's strict synchronous nature. In contrast, we propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly preserving mathematical exactness. Rather than attempting to eliminate bubbles, BubbleSpec exploits them: it uses the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, which then serve as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by 50% and increases rollout throughput by up to 1.8x. Critically, BubbleSpec is seamlessly compatible with various RL frameworks and strategies as it sustains the strict synchronous property of RL algorithms.
中文摘要 强化学习（RL）已成为提升大型语言模型（LLM）性能的基石。然而，其推广阶段构成了显著的效率瓶颈，主要源于数据并行队列间的长尾气泡，尤其是在较长上下文场景中，更快的GPU闲置等待落队者。现有解决方案，如部分展开或异步强化学习，通过破坏算法严格同步性来缓解这些气泡。相反，我们提出了BubbleSpec，一种新颖框架，可以在严格保持数学精确性的同时加快强化学习的推广。BubbleSpec 没有试图消除气泡，而是利用它们。我们利用更快排位的闲置时间窗口预先生成后续步骤的推展结果，作为推测解码的草稿。与以往依赖历史纪元相似性和热身的推测方法不同，BubbleSpec 不受数据集大小影响，并且从训练开始就能立即加速。广泛评估表明，BubbleSpec 将解码步骤减少了 50%，并将推广吞吐量提升高达 1.8 倍。关键是，BubbleSpec 与多种强化学习框架和策略无缝兼容，因为它保持了强化学习算法严格的同步性质。
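
The abstract describes pre-generated rollouts used as drafts for speculative decoding. The sketch below illustrates the general idea of draft verification under a simplifying assumption of greedy (deterministic) decoding. It is not BubbleSpec's implementation: `verify_draft` and `toy_next_token` are hypothetical names, and a real system would score all draft positions in one batched forward pass and would need a rejection-sampling rule for stochastic decoding.

```python
def verify_draft(prefix, draft, next_token):
    """Accept the longest draft prefix that matches greedy decoding.

    next_token(context) is a hypothetical callable returning the target
    model's greedy token for a context. The result is identical to what
    plain greedy decoding would emit, so exactness is preserved.
    """
    accepted = []
    context = list(prefix)
    for tok in draft:
        target = next_token(context)
        if tok != target:
            # First mismatch: keep the target's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Whole draft accepted: the verification pass yields one extra token.
    accepted.append(next_token(context))
    return accepted

def toy_next_token(context):
    """Toy stand-in for a model: next token is last token plus one, mod 10."""
    return (context[-1] + 1) % 10

# The draft agrees on two tokens, then diverges; the third is corrected.
tokens = verify_draft([0], [1, 2, 7, 8], toy_next_token)
```

When drafts agree with the target model, several tokens are confirmed per verification step, which is where the reduction in decoding steps comes from.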

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

通过保守SFT保持流匹配VLA的基础能力

Authors: Tianyi Zhang, Shaopeng Zhai, Haoran Zhang, Fuxian Huang, Qi Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.08879
Pdf link: https://arxiv.org/pdf/2605.08879
Abstract Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.
中文摘要 对流匹配视觉-语言-动作（VLA）模型进行无限制微调会导致参数覆盖密集，降低预训练能力。我们提出了保守监督微调（ConSFT），这是一种优化目标，能够适应目标分布，同时减少灾难性遗忘，无需任何先前数据或架构开销。通过基于模型置信度动态扩展学习信号，ConSFT抑制低置信样本中的过度梯度，防止参数不成比例的更新，从而限制了内在参数中断的风险。该表述受强化学习信任区域裁剪启发，建立了渐进式学习动态，以确保目标收敛和先前能力保留，保持稀疏参数更新，而无需依赖显式正则化所需的并行引用网络。我们在LIBERO和RoboTwin基准测试中评估ConSFT，涵盖最先进的流量匹配VLA（$\pi_0$、$\pi_{0.5}$和GR00T-N1.6-3B）。该方法在能力保留率上平均绝对优势超过20%，与数据量丰富的经验回放在无数据环境中的效果相当。现实世界的机器人部署证实，ConSFT在下游适应过程中避免了空间过拟合，保持了预先训练好的物理技能，同时获得连续的目标任务。

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Forge：用于LLM中NP难优化的质量感知强化学习

Authors: Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08905
Pdf link: https://arxiv.org/pdf/2605.08905
Abstract Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
中文摘要 大型语言模型（LLM）通过可验证奖励强化学习（RLVR）在推理基准测试中取得了显著成功，在数学、编码、逻辑和谜题等任务中表现出色。然而，现有基准测试仅评估正确性，忽视了最优性，即在约束条件下找到最佳解的能力。我们提出了OPT-BENCH，这是首个通过质量感知RLVR训练和评估NP难优化问题的大型模型的综合框架。OPT-BENCH提供三个关键组成部分：可扩展的培训基础设施，配备实例生成器、质量验证器和10个任务的最佳基线;一个包含1000个实例的严谨基准，评估可行性（以成功率衡量）和质量（以质量比率衡量）;以及质量意识的奖励，使得持续改进超越二元正确。在Qwen2.5-7B-Instruct-1M上训练，包含15K样本，实现了93.1%的SR和46.6%的QR，显著优于GPT-4o的29.6%SR和14.6%的QR。除了优化外，OPT-BENCH的训练还能转化为多种任务，包括数学（+2.2%）、逻辑（+1.2%）、知识（+4.1%）和指令跟随（+6.1%）。我们的分析显示，质量感知奖励比二元奖励改进了28.8%，任务多样性比数据量更能推动泛化，为复杂推理下的RLVR尺度提供了洞见。
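
The abstract contrasts binary correctness rewards with quality-aware rewards. One plausible form of such a reward is sketched below. The exact Quality Ratio definition used by OPT-BENCH is not given in the abstract, so the formula, the function name, and the assumption of strictly positive objective values are all illustrative.

```python
def quality_aware_reward(feasible, value, optimal_value, maximize=True):
    """Return 0 for infeasible solutions, else a ratio in [0, 1].

    Assumes strictly positive objective values (an assumption of this
    sketch). For maximization the ratio is value / optimal; for
    minimization it is optimal / value.
    """
    if not feasible:
        return 0.0
    if value <= 0 or optimal_value <= 0:
        raise ValueError("this sketch assumes positive objective values")
    ratio = value / optimal_value if maximize else optimal_value / value
    # Clip in case the reference "optimum" is only a strong baseline.
    return max(0.0, min(1.0, ratio))

# A feasible knapsack packing worth 80 when the optimum is 100.
reward = quality_aware_reward(True, 80, 100)
```

Unlike a 0/1 reward, this signal still separates a feasible but mediocre solution from a near-optimal one, which gives the policy a gradient toward better solutions.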

Internalizing Safety Understanding in Large Reasoning Models via Verification

通过验证在大型推理模型中内化安全理解

Authors: Yi Zhang, Yuxin Chen, Leheng Sheng, Dongcheng Zhang, Chaochao Lu, Xiang Wang, An Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08930
Pdf link: https://arxiv.org/pdf/2605.08930
Abstract While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at this https URL
中文摘要 虽然显式思维链（CoT）赋能大型推理模型（LRM），但它也使得出更具风险的最终答案成为可能。当前的对齐范式主要依赖外部强制合规，优化模型以检测恶意提示，而非评估自身输出的安全性。我们认为这种方法在很大程度上仍是行为层面：我们的实证分析显示，表面上对齐的模型缺乏内在安全理解，常常未能验证自身的响应安全性，且容易受到对抗性越狱的威胁。为解决这一根本局限，我们提出了安全内部（Safety Internal Thisternal，简称SInternal）框架，通过专门培训LRM（安全验证任务）来内化安全规范，并利用专家推理轨迹批判自己生成的答案。我们证明，学习验证能强推广响应安全性，显著增强对域外越狱的鲁棒性。此外，结合强化学习，SInternal作为优越初始化，优于标准监督微调，表明内化安全理解比单纯模仿安全行为更为坚实的对齐基础。我们的代码可在此 https URL 获取

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

自我重置：学会从不安全的推理轨迹中自我恢复

Authors: Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08936
Pdf link: https://arxiv.org/pdf/2605.08936
Abstract Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces; as a result, the model hardly covers its vast generation space and does not learn to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and efficient data utilization. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our codes and data are available at this https URL.
中文摘要 大型推理模型在一般领域具有卓越的自我纠正能力;然而，在对抗性攻击下，他们经常难以从不安全的推理轨迹中恢复。现有的比对方法试图通过基于专家数据（包括反射迹或对抗前缀）微调模型来缓解这一漏洞。关键是，这些方法常常受到静态训练数据的阻碍，这些数据不可避免地偏离了模型动态、策略内的推理轨迹，导致模型几乎无法覆盖其庞大的世代空间，并学会从自身的失败中恢复。为弥合这一差距，我们提出了Self-ReSET，一种纯强化学习框架，旨在赋予LRM自身从自身安全错误轨迹中恢复的内在能力，这些误差随后被用作强化学习的初始状态。在各种LRM（长距离移动模型）和基准测试中的大量实验表明，Self-ReSET显著增强了对抗敌对性攻击（尤其是分发外（OOD）越狱提示的鲁棒性，同时保持了通用效用和高效的数据利用。进一步分析显示，我们的方法有效促进了自我恢复模式，使模型能够更好地识别并从不安全的中间错误状态恢复到良性路径。我们的代码和数据可在此 https URL 获取。

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

用于学习帕累托覆盖集的单一深度偏好条件策略

Authors: Akihiro Kubo, Kosuke Nakanishi, Shin Ishii
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08946
Pdf link: https://arxiv.org/pdf/2605.08946
Abstract Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.
中文摘要 偏好条件多目标强化学习旨在学习一个能够捕捉偏好权衡的单一策略，但在非线性标量化下，偏好与解决方案对应的唯一性和连续性仍然不明确。我们在表格多目标马尔可夫决策过程（MDP）中研究该问题，使用平滑切比谢夫标量化作为单调效用。在偏好集的温和内部条件下，我们证明每个偏好都诱导出唯一的帕累托最优回归矢量，且该向量对偏好进行利普希茨连续依赖，为偏好向密集帕累托前缘覆盖扫荡提供了原则基础。为计算这些目标，我们对占用率指标提出问题，并推导出凹面镜下降政策迭代（CMDPI），实现$O（1/k）$的客观次优率。我们进一步证明，每次更新等价于求解一个以前一个政策为参考的Kullback-Leibler正则化MDP，从而得到策略迭代解释和有限次次政策连续性。我们将更新实例化为一个深度演员-批评算法，保持先前策略正则化。在八项MO-Gymnasium任务中，它在近期基线中取得了最佳的平均超量排名和强劲的预期效用表现。连续控制实验显示，离散作用设置之外还能获得收益。
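
The abstract uses smooth Tchebycheff scalarization as a monotone utility. A minimal sketch of one common log-sum-exp form is given below. The sign convention (a utility to maximize over return vectors), the ideal point `ideal`, and the smoothing parameter `mu` are assumptions of this example, and the paper's exact definition may differ.

```python
import numpy as np

def smooth_tchebycheff_utility(returns, weights, ideal, mu=0.1):
    """Smooth lower bound on -max_i w_i * (z_i - v_i).

    returns (v), weights (w) and ideal (z) are per-objective vectors.
    As mu shrinks, the value approaches the hard Tchebycheff utility.
    """
    v = np.asarray(returns, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = np.asarray(ideal, dtype=float)
    scaled = w * (z - v) / mu
    # Numerically stable log-sum-exp.
    m = scaled.max()
    lse = m + np.log(np.exp(scaled - m).sum())
    return float(-mu * lse)

# Weighted gaps are [1.0, 0.5]; with small mu the utility is close to -1.0.
u = smooth_tchebycheff_utility([1.0, 2.0], [0.5, 0.5], [3.0, 3.0], mu=0.01)
```

Because the log-sum-exp is strictly increasing in every weighted gap, improving any single objective raises the utility, which is the monotonicity property the abstract relies on.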

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

学习探索：通过探索感知策略优化扩展代理推理

Authors: Xingyuan Hua, Sheng Yue, Ju Ren
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08978
Pdf link: https://arxiv.org/pdf/2605.08978
Abstract Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at this https URL and models are available at this https URL.
中文摘要 代理测试时间尺度的最新进展使模型能够在最终行动前收集环境反馈。现有方法的一个主要局限是通常采用无差异化的探索策略，缺乏适应性区分何时真正需要探索的能力。本文提出了一种探索感知型强化学习框架，使LLM代理仅在不确定性较高时进行自适应探索。我们的方法通过变分推断引入了细粒度奖励函数，通过估计探索性行为对未来决策的潜力进行显式评估，同时结合一种探索感知的分组机制，在优化过程中将探索性行为与任务完成性行为分离。通过针对信息空白，该设计使智能体能够有选择地探索，并在任务上下文明确后立即转向执行。通过实证，我们证明我们的方法在一系列具有挑战性的基于文本和图形界面的智能体基准测试中实现了持续的改进。代码可在 \url{this https URL} 获取，模型则可在此 https URL 获取。

ParityFuzz: Finding Inconsistencies across Solidity Compilers via Fine-Grained Mutation and Differential Analysis

ParityFuzz：通过细粒度突变和微分分析发现固体编译器间的不一致

Authors: Bowei Su, Mingxi Ye, Yuhong Na, Peilin Zheng, Zibin Zheng
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.09051
Pdf link: https://arxiv.org/pdf/2605.09051
Abstract The Solidity smart contract ecosystem has rapidly grown, leading to multiple compilers targeting different blockchain platforms or improving compilation efficiency. Although many compilers aim to be compatible with the primary Solidity compiler (Solc), significant inconsistencies in compilation and execution remain. These inconsistencies hinder contract migration, mislead developers during debugging, and may introduce exploitable vulnerabilities, causing financial losses. Existing testing techniques mainly focus on bugs within a single compiler or perform differential testing in the same execution environment. However, they are insufficient for detecting cross-compiler inconsistencies, as they lack mechanisms to explore triggering conditions and compare bytecode across environments. We propose ParityFuzz, a cross-compiler differential testing framework for Solidity. It operates in three stages. First, it derives mutation rules, including syntax- and boundary-oriented rules, by analyzing compilers and execution environments. Second, it uses reinforcement learning to select effective mutation rules for test generation. Third, it compiles and executes programs across multiple compilers, then normalizes and compares results to detect inconsistencies. Our evaluation shows ParityFuzz is efficient and effective. It achieves up to 18x higher compilation success rate and 1.8x higher code coverage than state-of-the-art fuzzers. It uncovers 64 previously unknown inconsistencies across six compilers. Notably, 11 issues have been fixed, and our findings received a bounty from the Polkadot community.
中文摘要 Solidity智能合约生态系统迅速发展，促使多个编译器针对不同的区块链平台或提升编译效率。尽管许多编译器力求与主 Solidity 编译器（Solc）兼容，但在编译和执行上仍存在显著不一致之处。这些不一致阻碍了合同迁移，误导开发者调试，并可能引入可被利用的漏洞，导致财务损失。现有的测试技术主要集中在单个编译器内的缺陷，或在同一执行环境中进行差分测试。然而，它们不足以检测跨编译器的不一致，因为它们缺乏探索触发条件和跨环境字节码比较的机制。我们提出了 ParityFuzz，这是一个用于 Solidity 的交叉编译器差分测试框架。它分三个阶段运作。首先，它通过分析编译器和执行环境，推导出包括语法和边界导向规则在内的突变规则。其次，它利用强化学习选择有效的突变规则进行测试生成。第三，它在多个编译器上编译并执行程序，然后对结果进行规范化和比较，以检测不一致之处。我们的评估显示，ParityFuzz 既高效又有效。它的编译成功率是最先进颤音器的18倍，代码覆盖率也高出1.8倍。它揭示了六个编译器中64个此前未知的不一致之处。值得注意的是，已有11个问题被修复，我们的发现还获得了Polkadot社区的奖励。
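
The third stage described above (compile and execute across compilers, then normalize and compare) can be illustrated with a minimal sketch. This is not ParityFuzz's implementation: the `Compiler` callables, the `normalize` rule, and the three toy compilers are hypothetical stand-ins, assuming each compiler can be wrapped as a function from source text to an execution result.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in: a "compiler" maps source text to an execution result string.
Compiler = Callable[[str], str]


def normalize(result: str) -> str:
    """Remove environment-specific noise (here only letter case and whitespace)."""
    return " ".join(result.lower().split())


def find_inconsistency(source: str, compilers: Dict[str, Compiler]) -> Dict[str, List[str]]:
    """Group compilers by normalized outcome; return the groups only if they disagree."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, run in compilers.items():
        try:
            outcome = normalize(run(source))
        except Exception as exc:  # a failure to compile or run is also an observable outcome
            outcome = f"error:{type(exc).__name__}"
        groups[outcome].append(name)
    return dict(groups) if len(groups) > 1 else {}


# Toy compilers for illustration only.
def reference(src: str) -> str:
    return "RETURN 255"


def variant_a(src: str) -> str:
    return "return   255"  # same behaviour, different formatting


def variant_b(src: str) -> str:
    return "return 0"  # behaves differently on this input


report = find_inconsistency(
    "contract C {}",
    {"reference": reference, "variant_a": variant_a, "variant_b": variant_b},
)
```

Grouping by normalized outcome, rather than comparing every compiler against one reference, keeps the report useful when the reference itself is the outlier.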

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

BoostAPR：通过执行基础强化学习和双奖励模型提升自动程序修复

Authors: Yuanhao Li, Hongbo Wang, Xiaotang Shang, Xunzhu Tang, Yiming Cao, Xuhong Chen
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.09134
Pdf link: https://arxiv.org/pdf/2605.09134
Abstract Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
中文摘要 程序修复的强化学习受限于执行反馈稀疏和粗糙的序列级奖励，这些奖励掩盖了哪些编辑真正修复了错误。我们介绍BoostAPR，一个三阶段框架，解决这些挑战：（1）对执行验证演示的监督微调，带推理痕迹;（2）从执行结果训练双重奖励模型——序列级评估器和线级学分分配器，以及（3）PPO优化，线级模型将奖励重新分配到关键编辑区域。这种线级信用分配在中间细度下运行，非常适合代码变更。BoostAPR在SWE-Gym训练并基于四个基准测试测试中，在SWE-bench Verified（基础模型+22.9pp）上取得了40.7%的成绩，Defects4J（Python到Java传输）为24.8%，在HumanEval-Java上为84.5%，在QuixBugs上达到了95.0%，在具有强大跨语言泛化能力的开源模型中取得了竞争性。

Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation

超越自我游戏：闭环交通模拟中连续运动的层级推理

Authors: Weifan Zhang, Xiaofeng Zhao, Adel Bazzi, Mingrui Li, Yifan Wei, Dengfeng Sun
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09153
Pdf link: https://arxiv.org/pdf/2605.09153
Abstract Closed-loop traffic simulation requires agents that are both scalable and behaviorally realistic. Recent self-play reinforcement learning approaches demonstrate strong scalability, but their equilibrium strategies fail to capture the socially aware behaviors of real human drivers. We propose a hierarchical architecture that goes beyond self-play by combining high-level multi-agent interaction reasoning with low-level continuous trajectory realization. Specifically, a Stackelberg-style Multi-Agent Reinforcement Learning (MARL) module generates interaction-aware intention commands. These commands condition a low-level continuous motion module, translating the strategic intent into physically consistent, scene-responsive control sequences. To mitigate distribution shift in closed-loop deployment, we introduce a hybrid co-training scheme combining MARL with auxiliary recovery supervision. Experiments on a SUMO-based urban network demonstrate that the proposed framework achieves superior control smoothness and safety compared to self-play and passive imitation baselines, while maintaining competitive traffic efficiency.
中文摘要 闭环交通模拟需要既可扩展又行为真实的代理。近期的自我游戏强化学习方法展现了强大的可扩展性，但其均衡策略未能捕捉真实人类驱动者的社会意识行为。我们提出了一种超越自我游戏的分层架构，结合了高层次多智能体交互推理与低层连续轨迹实现。具体来说，一个Stackelberg风格的多智能体强化学习（MARL）模块生成交互感知的意图命令。这些指令会对低级连续运动模块进行条件，将战略意图转化为物理上一致、场景响应的控制序列。为减轻闭环部署中的分布转移，我们引入了结合MARL与辅助恢复监督的混合共训方案。基于SUMO的城市网络实验表明，所提框架相比自玩和被动模仿基线，在保持交通效率的同时，实现了更优越的控制流畅性和安全性。

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

在熵正则化的行为者-批评者中重新审视混合策略

Authors: Jiamin He, Samuel Neumann, Jincheng Mei, Adam White, Martha White
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09157
Pdf link: https://arxiv.org/pdf/2605.09157
Abstract Mixture policies theoretically offer greater flexibility than unimodal policies in continuous action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and reach parity with (and sometimes exceed) their Gaussian counterparts. In addition, we do find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.
中文摘要 理论上，混合策略在连续动作强化学习中比单模策略提供了更大的灵活性，但这种复杂性的实际益处仍然难以实现。大多数最先进的算法中明显缺失混合策略，这引发了一个根本性问题：额外的表示开销有用吗？我们证明，提高柔韧性理论上可以提升溶液质量和熵的鲁棒性。然而，像SAC这样的标准算法并未充分利用这些优势。一个核心问题是缺乏低方差重参数化技巧，而高斯策略则享有这种优势。我们提出了一个边际重参数化（MRP）估计器来解决这个问题，证明其方差低于标准似然比（LR）方法。我们在Gym MuJoCo、DeepMind Control Suite和MetaWorld上的实验显示，MRP混合策略显著优于其LR策略，甚至与高斯对应方案达到均衡（有时更好）。此外，我们还发现若干MRP混合政策在实证上具有明显优势的案例。本文更清晰地阐述了其中的权衡，将MRP混合政策从理论上的好奇心提升为实用工具。

Data-Driven Inverse Reinforcement Learning of Linear Systems with Model Uncertainty: A Convex Optimization View

基于数据驱动的线性系统逆强化学习模型不确定性：凸优化视图

Authors: Duc Cuong Nguyen, Phuong Nam Dao
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.09164
Pdf link: https://arxiv.org/pdf/2605.09164
Abstract Inverse reinforcement learning (IRL) for linear systems seeks a cost function whose optimal controller reproduces an expert policy from data. Existing data-driven methods for discrete-time linear systems are largely built on iterative policy/value updates, repeated matrix inversions, and, in some cases, an initial stabilizing controller, which can limit numerical robustness and practical applicability. This paper develops a convex-optimization framework for data-driven inverse reinforcement learning of discrete-time linear systems with model uncertainty. For nominal systems, we derive a semidefinite characterization of inverse optimality and a relaxed formulation that recovers an equivalent state-cost matrix together with a stabilizing controller from expert trajectories. We then obtain a model-free, off-policy reformulation by replacing the unknown system matrices with a regressed kernel matrix identified from local input--state data. For uncertain local systems, we show that a standard LQR cost is generally insufficient to represent every stabilizing target gain and therefore introduce a generalized LQR cost with a state--input cross term. Based on this model, we develop a convex data-driven inverse-RL method and extend it to robust cost design over a population of perturbations via differentiable semidefinite programming and stochastic approximation. Simulations on a discrete-time power-system example show accurate recovery of expert behavior, improved robustness to gain-estimation error and model mismatch, and a simpler computational pipeline than classical iterative inverse-RL schemes.
中文摘要 线性系统的逆强化学习（IRL）寻求一个成本函数，其最优控制器能够从数据中重现专家策略。现有的数据驱动离散时间线性系统方法主要基于迭代策略/值更新、反复矩阵反演，以及在某些情况下的初始稳定控制器，这可能限制数值鲁棒性和实用性。本文开发了一个凸优化框架，用于数据驱动的逆强化学习，针对具有模型不确定性的离散时间线性系统。对于名义系统，我们推导出逆最优性的半定表征和一个宽松表述，该表述从专家轨迹中恢复等效的状态-成本矩阵和稳定控制器。随后，我们将未知系统矩阵替换为从本地输入-状态数据识别的回归核矩阵，获得无模型、非策略的重述。对于不确定的局部系统，我们证明标准LQR成本通常不足以表示每一个稳定目标增益，因此引入一个带有状态-输入交叉项的广义LQR成本。基于该模型，我们开发了一种凸数据驱动的逆强化学习方法，并通过可微的半正定规划和随机近似将其扩展到对一组扰动的鲁棒成本设计。离散时间功率系统示例的模拟显示出专家行为的准确恢复，增强了对增益估计误差和模型不匹配的鲁棒性，以及比经典迭代逆强化学习方案更简单的计算流水线。
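
For orientation, the difference between a standard LQR cost and a generalized cost with a state-input cross term can be written out as follows. The symbols $Q$, $R$, $N$, $A$, $B$, $P$ and $K$ are assumed textbook notation, not necessarily the paper's, and the gain formula is the classical model-based result for known dynamics $x_{k+1} = A x_k + B u_k$ rather than the paper's data-driven formulation.

```latex
\begin{align}
J_{\mathrm{LQR}} &= \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + u_k^\top R u_k \right), \\
J_{\mathrm{gen}} &= \sum_{k=0}^{\infty}
  \begin{bmatrix} x_k \\ u_k \end{bmatrix}^\top
  \begin{bmatrix} Q & N \\ N^\top & R \end{bmatrix}
  \begin{bmatrix} x_k \\ u_k \end{bmatrix}
  = \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + 2\, x_k^\top N u_k + u_k^\top R u_k \right), \\
u_k &= -K x_k, \qquad K = \left( R + B^\top P B \right)^{-1} \left( B^\top P A + N^\top \right),
\end{align}
```

Here $P$ solves the discrete-time Riccati equation associated with $J_{\mathrm{gen}}$. Setting $N = 0$ recovers the standard case; the extra matrix $N$ is the additional freedom the abstract refers to when it says a standard LQR cost cannot represent every stabilizing target gain.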

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

DARE：难度自适应强化学习与共进化难度估计

Authors: Yang Zhou, Can Jin, Zihan Dong, Zhepeng Wang, Yanting Yang, Shiyu Zhao, Lei Li, Runxue Bao, Yaochen Xie, Dimitris N. Metaxas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09188
Pdf link: https://arxiv.org/pdf/2605.09188
Abstract Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose Dare, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that Dare consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at this https URL.
中文摘要 强化学习提升了大型语言模型的推理能力，但由于许多推广方式提供的学习信号较弱，成本高且样本效率低下。难度感知型数据选择方法试图通过优先考虑中等难度提示来解决这个问题，但我们的分析揭示了三个局限性：在策略漂移下难度估计会变得不准确，单靠数据选择带来的最终性能提升有限，推理效率基本保持不变。这些发现表明，高效且有效的强化学习不仅需要按难度过滤：政策应学会解决困难任务，同时为简单任务提供简洁的回应。为此，我们提出了Dare，一个统一框架，通过自我归一化重要性抽样共同演化难度估计与策略，通过对称的Beta采样分布保持多样化难度覆盖，并通过自适应计算分配在不同难度层级应用定制化训练策略。跨多个模型和领域的大量实验表明，Dare在训练效率、最终效果和推理效率方面始终优于现有方法，在简单任务中能给出更简洁的回答，同时在困难任务中提升正确性。代码可在此 https URL 访问。
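
Two ingredients named in the abstract, self-normalized importance sampling and a symmetric Beta sampling distribution, can be sketched generically. How DARE combines them is not specified in the abstract; the function names, the use of success rate as a difficulty proxy, and the idea of sampling "target success rates" are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def snis_success_rate(
    rewards: Sequence[float],
    logp_new: Sequence[float],
    logp_old: Sequence[float],
) -> float:
    """Self-normalized importance sampling estimate of the mean reward under the
    current policy, using rollouts that were drawn from an older policy."""
    log_w = [n - o for n, o in zip(logp_new, logp_old)]
    shift = max(log_w)  # subtract the max for numerical stability; it cancels in the ratio
    weights = [math.exp(lw - shift) for lw in log_w]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def sample_target_success_rates(k: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw k values from the symmetric Beta(alpha, alpha) distribution on [0, 1].
    alpha > 1 concentrates mass near 0.5; alpha < 1 pushes it toward 0 and 1."""
    return [rng.betavariate(alpha, alpha) for _ in range(k)]


# Hypothetical example: two stale rollouts, the first now three times more likely.
estimate = snis_success_rate([1.0, 0.0], [math.log(3.0), 0.0], [0.0, 0.0])
targets = sample_target_success_rates(5, 2.0, random.Random(0))
```

Because the weights are normalized by their own sum, the estimator needs only likelihood ratios up to a constant, at the price of a small bias for small rollout groups.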

Rethinking Ratio-Based Trust Regions for Policy Optimization in Multi-Agent Reinforcement Learning

重新思考基于比率的信任区域在多智能体强化学习中的策略优化

Authors: Chulabhaya Wijesundara, Andrea Baisero, Zhongheng Li, Gregory Castañón, Alan Carlin, Christopher Amato
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09212
Pdf link: https://arxiv.org/pdf/2605.09212
Abstract Centralized training with decentralized execution (CTDE) is a standard framework for cooperative multi-agent policy-gradient reinforcement learning, allowing agents to learn from joint information while acting from local observations. Ratio-based trust-region methods such as Multi-Agent Proximal Policy Optimization (MAPPO) and Multi-Agent Simple Policy Optimization (MASPO) update decentralized actors using per-agent probability ratios weighted by joint advantage estimates. Teammate non-stationarity increases the variance of these advantages, which in turn increases the variance in the local ratio updates. This exposes two method-specific failure modes: MAPPO's additive clipping removes gradients for outlier samples and weakens recovery from policy drift, while MASPO's soft quadratic penalty can allow probability collapse. We introduce Multi-Agent Ratio Symmetry (MARS), a novel policy optimization objective that replaces these additive ratio-based trust-region mechanisms with a multiplicatively symmetric geometric barrier. MARS preserves corrective gradients while assigning unbounded cost as probability ratios approach zero. Across 47 tasks spanning eight multi-agent environments, including novel JAX benchmarks PaxMen and AeroJAX, MARS matches or exceeds MAPPO and MASPO in aggregate environment-level performance. Ablations show that these gains arise from the geometry of the symmetric barrier rather than from flexible trust-region boundaries alone.
中文摘要 去中心化执行集中训练（CTDE）是一种合作式多智能体策略梯度强化学习的标准框架，允许智能体在根据局部观察进行行动的同时，从联合信息中学习。基于比率的信任区域方法，如多代理近端策略优化（MAPPO）和多代理简单策略优化（MASPO），通过按联合优势估计加权的每个代理概率比率更新去中心化的参与者。队友非平稳性增加了这些优势的方差，进而增加局部比率更新的方差。这揭示了两种方法特有的失效模式：MAPPO的加法裁剪去除了离群值样本的梯度，削弱了策略漂移的恢复，而MASPO的软二次惩罚则允许概率崩溃。我们引入了多代理比率对称性（MARS），这是一种新颖的策略优化目标，用乘对称几何障碍取代了基于加法比率的信任区域机制。MARS保持修正梯度，同时在概率比趋近于零时赋予无界成本。在涵盖八个多智能体环境的47项任务中，包括新颖的JAX基准PaxMen和AeroJAX，MARS在环境级总性能上与MAPPO和MASPO相当甚至超越。消融表明，这些收益来自对称屏障的几何形状，而非仅仅来自信任区域边界的灵活。
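
The remark that additive clipping removes gradients for some samples can be checked on the standard single-sample PPO clipped surrogate, which MAPPO builds on. The sketch below covers only that surrogate; it does not implement MASPO's penalty or the MARS barrier, whose exact forms are not given in the abstract. The finite-difference helper is an illustrative device, not part of any of these algorithms.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def surrogate_grad(ratio: float, advantage: float, eps: float = 0.2, h: float = 1e-6) -> float:
    """Central finite difference of the surrogate with respect to the probability ratio."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


# Ratio already past the clip boundary in the direction the advantage favours:
# the objective is flat, so the sample contributes no gradient.
flat_positive = surrogate_grad(1.5, +1.0)
flat_negative = surrogate_grad(0.5, -1.0)

# Ratio on the "wrong" side of the boundary: the unclipped term is selected
# and the gradient equals the advantage.
live_negative = surrogate_grad(1.5, -1.0)
live_positive = surrogate_grad(0.5, +1.0)
```

When joint advantage estimates are noisy, which samples fall into the flat region depends on an unreliable sign, which is one way to read the abstract's concern about variance in local ratio updates.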

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

在单策略集中下采用前向KL正则化的离线上下文强盗快速速率

Authors: Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Quanquan Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.09214
Pdf link: https://arxiv.org/pdf/2605.09214
Abstract Kullback-Leibler (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of reverse or forward KL. Recent studies have demonstrated $\epsilon^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $\epsilon^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(\epsilon^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline contextual bandits (CBs). We give the first $\tilde{O}(\epsilon^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of single-policy concentrability. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.
中文摘要 \emph{Kullback-Leibler} （KL）正则化在强化学习算法中无处不在，表现为 \emph{reverse} 或 \emph{forward} KL。最新研究表明，在逆KL正则化下，决策可实现$\epsilon^{-1}$型快速率，这与标准的$\epsilon^{-2}$型样本复杂度形成对比。然而，对于前向KL正则化目标，现有统计分析要么不适用，要么导致$\tilde{O}（\epsilon^{-2}）$慢速速率。我们通过对前向KL正则化离线CB的简化分析，迈出解决这一问题的第一步。我们在表格和一般函数近似设置中给出第一个 $\tilde{O}（\epsilon^{-1}）$ 上界，均基于 \emph{单策略向心}的概念。特别是，我们的凸分析流水线通过新颖地利用悲观主义原理统一了这些设定，并完全绕过了基于均值定理的先前工作中的证明程序，后者可能具有独立的兴趣。此外，我们提供了速率最优的下界，以统计速率的形式体现了上界的紧密性。我们的下界还表明，前向KL正则化样本复杂度在低正则化区间恢复了未正则化的慢速率，类似于逆KL正则化。
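
As a reading aid, the two regularized objectives contrasted in the abstract can be written as below. The notation is assumed rather than taken from the paper: $\rho$ is the context distribution, $r$ the reward, $\pi_{\mathrm{ref}}$ a reference policy and $\beta > 0$ the regularization strength, and the naming follows the common convention in which "reverse" places the learned policy in the first argument of the divergence.

```latex
\begin{align}
J_{\mathrm{rev}}(\pi) &= \mathbb{E}_{x \sim \rho,\; a \sim \pi(\cdot \mid x)} \big[ r(x, a) \big]
  - \beta\, \mathbb{E}_{x \sim \rho} \Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big], \\
J_{\mathrm{fwd}}(\pi) &= \mathbb{E}_{x \sim \rho,\; a \sim \pi(\cdot \mid x)} \big[ r(x, a) \big]
  - \beta\, \mathbb{E}_{x \sim \rho} \Big[ \mathrm{KL}\big( \pi_{\mathrm{ref}}(\cdot \mid x) \,\|\, \pi(\cdot \mid x) \big) \Big].
\end{align}
```

In both cases an $\epsilon$-optimal policy is one whose objective value is within $\epsilon$ of the maximum, and the rates quoted in the abstract count the offline samples needed to find one.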

Learning the Preferences of a Learning Agent

学习智能体的偏好

Authors: Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, Stuart Russell
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.09217
Pdf link: https://arxiv.org/pdf/2605.09217
Abstract For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.
中文摘要 要让人工智能系统对人类有用，它们必须理解并按照我们的价值观和偏好行动。由于指定偏好是一项艰巨的任务，逆向强化学习（IRL）旨在开发能够从观察到的行为中推断偏好的方法。然而，现实中假设人类是近似最优的。这在人类自己可能正在学习如何在环境中表现最佳时，是一个很大的限制。本文形式化了学习主体偏好的问题：预测变量观察学习者在线行为，试图推断学习者（最初不优化）的潜在奖励函数。我们将学习者建模为无悔者，或随着时间趋向最优玻尔兹曼策略。在这些环境中，我们为各种偏好学习算法建立理论保证，或者证明这些保证是不可能实现的。

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

基石还是绊脚石？在政策提炼中解码岩石代币

Authors: Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09253
Pdf link: https://arxiv.org/pdf/2605.09253
Abstract While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
中文摘要 尽管近期在可验证奖励强化学习（RLVR）中的研究显示，少数关键代币不成比例地推动推理收益，但对策略提纯（OPD）的类似代币层面理解仍大多未被深入探讨。本研究研究高损耗代币类型，作为OPD每代币KL目标下师生不匹配的最直接信号，根据现有研究，随着培训趋近，这种类型应逐渐减少;然而，我们的实证分析显示事实并非如此。即使OPD训练达到表面饱和，仍有相当一部分代币持续表现出高损耗;这些代币，我们称之为岩石代币，它们在生成输出中最多可占代币的18%。我们的调查揭示了两个令人震惊的悖论。首先，尽管它们的高出现频率提供了不成比例的整体梯度规范，但岩石代币在培训过程中始终停滞不前，难以接受教师的纠正。其次，通过因果干预，我们发现这些代币对模型实际推理表现的功能贡献微乎其微。这些发现表明，大量优化带宽被用于结构性和话语残差，而学生模型无法或无需内化这些残差。通过拆解这些动态，我们证明了策略性地绕过这些“绊脚石”可以显著简化对齐过程，挑战统一代币权重的必要性，并为大规模模型提炼提供更高效的范式。

Reinforcing Multimodal Reasoning Against Visual Degradation

强化多模态推理以反对视觉退化

Authors: Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.09262
Pdf link: https://arxiv.org/pdf/2605.09262
Abstract Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
中文摘要 强化学习显著提升了多模态大型语言模型（MLLM）的推理能力，但其策略对现实视觉劣化（如模糊、压缩伪影和低分辨率扫描）仍然脆弱。以往视觉和深度强化学习的鲁棒性技术依赖静态数据增强或基于值的正则化，但这些方法都无法顺利过渡到无批评的强化学习自回归MLLM微调。强化反对此类腐败的推理并非简单：在推出过程中天真地注入退化视图会导致奖励中毒，感知遮挡触发幻觉轨迹并破坏优化。我们提出了ROMA，这是一种强化学习的微调框架，通过修改优化动态，在保持干净输入性能的同时，强化对视觉劣化的推理。双向前向传递策略利用教师强制评估损坏视图与干净图像轨迹，避免在退化输入上重新部署。为了分布一致性，我们对最坏情况的增强施加代币级替代KL惩罚;为防止正则化下的政策崩溃，基于清晰图像优势的辅助策略梯度损失保持了可靠的奖励信号;为避免系统性错误的不变性，正确性条件正则化限制执行仅限于成功的轨迹。在Qwen3-VL 4B/8B中，跨越七个多模态推理基准测试，我们的方法在GRPO上可视数据中提升了+2.4%，在未可见损坏时提升了+2.3%，同时保持了干净的准确性。

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

DeltaRubric：通过联合规划与验证实现生成多模态奖励建模

Authors: Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.09269
Pdf link: https://arxiv.org/pdf/2605.09269
Abstract Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
中文摘要 对齐多模态大型语言模型（MLLM）需要可靠的奖励模型，但现有的单步评估者可能存在判断懒惰，利用语言先验而非细粒度的视觉验证。虽然基于评分标准的评估在纯文本环境中减轻了这些偏差，但将其推广到多模态任务时，视觉推理的复杂性会成为瓶颈。反应之间的关键差异通常取决于具体的视觉细节。稳健评估需要动态综合评分标准，以隔离空间和事实上的差异。为此，我们引入了$\textbf{DeltaRubric}$，这是一种将多模态偏好评估重新表述为单一MLLM内的计划与执行过程的方法。DeltaRubric 的工作分为两个步骤：首先作为 $\textit{分歧规划器}$，模型生成一个中立的、针对实例的验证清单。切换到$\textit{Checklist Verifier}$，它会对图片和问题进行自我生成的检查，最终做出有根据的判断。我们将DeltaRubric制定为一个多功能强化学习问题，共同优化规划与验证能力。在Qwen3-VL 4B和8B Instruct模型上验证后，DeltaRubric取得了扎实的实证成果。例如，在VL-RewardBench上，它将基础模型整体准确率提升了$\textbf{+22.6}$（4B）和$\textbf{+18.8}$（8B）点，远超标准无评分基线。结果表明，将评估分解为结构化、可验证的步骤，能够实现更可靠且可推广的多模态奖励建模。

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

PiCA：基于转折的学分作业用于搜索能动强化学习

Authors: Dongyi Liu, Yifan Niu, Qinwen Wang, Han Xiao, Jia Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09287
Pdf link: https://arxiv.org/pdf/2605.09287
Abstract Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved the performance of knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential-Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models. The consistent performance gains across various models show PiCA's robust generalization. The code is available at this https URL.
中文摘要 基于大型语言模型（LLM）的搜索代理通过强化学习（RL）训练，显著提升了知识密集型任务的性能。然而，现有方法在长期视角的学分分配中面临关键挑战：（i）奖励稀疏性，模型仅获得结果反馈，缺乏步骤级指导以区分行动质量;（ii）孤立信用，即独立赋予步骤，未能捕捉顺序依赖关系;以及（iii）分布转移，即在偏离模型自然生成分布的模板上估算奖励。为解决这些问题，我们提出了基于转折的学分分配（Pivot-Based Credit Assignment，简称PiCA），这是一种新型的阶梯奖励机制，将搜索轨迹重新表述为一个连续的累积搜索进展过程。与之前的孤立阶级奖励不同，PiCA将过程奖励定义为基于潜在奖励塑造（PBRS）的历史背景下的成功概率。该方法识别了枢轴步骤，即基于历史轨迹的目标黄金子查询和子答案，作为显著提高正确最终答案概率的信息峰值。通过将这些步级奖励锚定于最终任务目标，PiCA提供了密集、感知枢轴且依赖轨迹的指导，同时保持分布一致性。大量实验显示，PiCA在七个知识密集型质量保证基准中优于现有强劲基线，分别在3B和7B模型中分别提升了15.2%和2.2%。各模型间持续的性能提升显示了PiCA的强健泛化能力。代码可在该 https URL 访问。
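
Potential-Based Reward Shaping, which the abstract names as the basis for PiCA's process rewards, defines a step reward as the discounted change in a potential function. The sketch below shows only that generic rule. Treating the potential as an estimated probability of a correct final answer after each search step, the example values, and the `pivot_step` helper are assumptions for illustration, not PiCA's actual estimator or pivot definition.

```python
from typing import List, Sequence


def shaped_step_rewards(potentials: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Potential-based shaping: F_t = gamma * Phi(s_{t+1}) - Phi(s_t) for each transition."""
    return [gamma * potentials[t + 1] - potentials[t] for t in range(len(potentials) - 1)]


def pivot_step(step_rewards: Sequence[float]) -> int:
    """Index of the transition with the largest gain in potential (hypothetical helper)."""
    return max(range(len(step_rewards)), key=lambda t: step_rewards[t])


# Hypothetical success-probability estimates before the first step and after each of four steps.
potentials = [0.10, 0.15, 0.60, 0.65, 0.90]
rewards = shaped_step_rewards(potentials)
```

With `gamma = 1` the step rewards telescope to the total change in potential, so the dense signal stays tied to the final outcome instead of rewarding each step in isolation.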

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

dFlowGRPO：离散流模型的速率感知策略优化

Authors: Zhengyan Wan, Yidong Ouyang, Panwen Hu, Qiang Sun
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2605.09291
Pdf link: https://arxiv.org/pdf/2605.09291
Abstract Discrete flow models (DFMs) are a class of flexible generative models for generating discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored reinforcement learning into dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.
中文摘要 离散流模型（DFMs）是一类用于生成离散数据的灵活生成模型，扩散大型语言模型（dLLM）可视为一种特例，具有特定的混合路径选择和掩蔽的源分布。尽管近期有几项研究探讨了将强化学习应用于数字大型语言模型（dLLM），但其在更通用的离散流模型中的应用仍未被充分探索。在本研究中，我们提出了离散流-GRPO（dFlowGRPO），这是一个统一的离散流模型强化学习框架，支持广泛的概率路径和非掩码源分布。我们推导DFMs的完整轨迹概率，并将去噪化表述为马尔可夫决策过程，使dFlowGRPO能够在强化学习过程中整合相关条件转移速率和后验模型的信息。我们将dFlowGRPO应用于FUDOKI——一个近期的多模离散流模型，并在图像生成和多模态理解任务中进行评估。实证结果表明，dFlowGRPO在文本到图像生成任务中优于现有的GRPO类型dLLM方法，性能可与使用FlowGRPO训练的连续流模型媲美，同时在任务理解方面展现出强大能力。

Functional Graphs for Predicting and Explaining Goal Failure in Sparse Goal-Conditioned RL

稀疏目标条件强化学习中预测和解释目标失败的功能图

Authors: Shalley Dash
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09335
Pdf link: https://arxiv.org/pdf/2605.09335
Abstract Sparse goal-conditioned reinforcement learning can produce policies whose failures are hidden by aggregate success rates. We analyze trained goal-conditioned value policies through the deterministic functional graphs induced by greedy evaluation: for each goal, every state maps to a single successor, decomposing behavior into attractors and basins. This reveals a local-to-global structure in learned policies. We define local goal support (LGS), a one-step statistic measuring the fraction of valid neighboring states whose greedy successor is the goal. In deterministic sparse GridWorlds, zero LGS exactly precludes goal entry from non-goal starts. Empirically, weak LGS is a strong diagnostic of goal-level failure across update rules, curricula, larger grids, and bottleneck geometries: the fixed rule LGS <= 0.5 identifies low-success goals with precision 0.921, recall 0.929, and F1 0.925 in the main 8x8 TD setting, with similar performance across variants. However, local support is not sufficient for global success: some supported goals still fail because distant states are captured by competing attractors or fragmented basin structure. We therefore introduce a compact post-hoc taxonomy of policy-induced graphs -- goal-dominant, competitor-dominated, partial/contested, and fragmented -- to characterize residual failure modes beyond local support. These results show that sparse GCRL failures can be understood as structured policy-induced dynamics, and that local one-step policy structure provides a cheap post-training diagnostic for goal-level failure.
中文摘要 稀疏的目标条件化强化学习可能产生失败被整体成功率掩盖的政策。我们通过贪婪评估诱导的确定性函数图分析训练有素的目标条件值策略：对于每个目标，每个状态映射到一个后继者，将行为分解为吸引子和盆地。这揭示了学习政策中从地方到全球的结构。我们定义了本地目标支持（LGS），这是一个一步统计，衡量有效邻国中其“贪婪继承者”的比例。在确定性稀疏的GridWorld中，零LGS正好排除了非目标起始时的目标进入。从经验上看，弱LGS是预测更新规则、课程、更大网格和瓶颈几何中目标层级失败的有力诊断：固定规则LGS<= 0.5在主8x8 TD设置中以0.921的精度识别低成功目标，召回精度为0.929，F1的0.925，且各变体表现相似。然而，局部支持不足以实现全球成功：一些支持的目标仍然失败，因为遥远的状态被竞争的吸引子或分散的盆地结构所捕获。因此，我们引入了一个紧凑的事后分类法，涵盖政策引发的图——目标主导、竞争者主导、部分/争议图和碎片图——以描述超出局部支持的残余失效模式。这些结果表明，稀疏的GCRL失效可以理解为结构化的策略驱动动态，而局部一步策略结构为目标级失效提供了廉价的训练后诊断。
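
The local goal support (LGS) statistic defined in the abstract is simple enough to sketch directly. The version below assumes a square grid with 4-connected neighbours and a greedy policy given as a dictionary from each state to its single successor for a fixed goal; the grid size, the "always move right" policy, and the helper names are illustrative choices, not the paper's setup.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]


def neighbors(state: State, size: int) -> List[State]:
    """Valid 4-connected neighbours of a cell in a size x size grid (size >= 2 assumed)."""
    r, c = state
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < size and 0 <= j < size]


def local_goal_support(goal: State, successor: Dict[State, State], size: int) -> float:
    """Fraction of the goal's valid neighbours whose greedy successor is the goal."""
    nbrs = neighbors(goal, size)
    return sum(1 for s in nbrs if successor[s] == goal) / len(nbrs)


def reaches_goal(start: State, goal: State, successor: Dict[State, State]) -> bool:
    """Follow the functional graph from start until the goal or a repeated state is hit."""
    seen = set()
    state = start
    while state not in seen:
        if state == goal:
            return True
        seen.add(state)
        state = successor[state]
    return False


SIZE = 3
# Illustrative greedy policy: always move right, staying put at the right edge.
move_right = {(r, c): (r, min(c + 1, SIZE - 1)) for r in range(SIZE) for c in range(SIZE)}

lgs_center = local_goal_support((1, 1), move_right, SIZE)  # only (1, 0) steps into the goal
lgs_corner = local_goal_support((0, 0), move_right, SIZE)  # no neighbour steps into the goal
```

In this toy policy the corner goal has zero LGS and is unreachable from every non-goal start, matching the abstract's statement for deterministic sparse GridWorlds, while the centre goal has LGS 0.25 yet is reached only from the cell to its left.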

Skill-R1: Agent Skill Evolution via Reinforcement Learning

技能-R1：通过强化学习实现代理技能进化

Authors: Yash Vishe, Rohan Surana, Xunyi Jiang, Zihan Huang, Xintong Li, Nikki Lijing Kuang, Tong Yu, Ryan A. Rossi, Jingbo Shang, Julian McAuley, Junda Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09359
Pdf link: https://arxiv.org/pdf/2605.09359
Abstract Agentic large language models often rely on skills, reusable natural language procedures that guide planning, action, and tool use. In practice, skills are typically improved through prompt engineering or by aligning the task LLM itself, which is costly, model-specific, and often infeasible for closed-source models. Skill optimization is not a one-step problem but a recurrent process with two coupled levels of credit assignment: a useful skill must improve rollout quality under current conditioning, while a useful revision must turn observed outcomes into a better skill for the next round. We propose Skill-R1, a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. Rather than updating the task LLM, Skill-R1 trains a lightweight skill generator that conditions on the task context, prior rollouts, and their verified outcomes to produce skills that steer a frozen task LLM. This preserves black-box compatibility with both open- and closed-source models while making adaptation substantially cheaper than model-level updates. Skill-R1 proceeds over multiple generations: at each step, the current skill induces rollouts whose verified outcomes are fed back to produce the next revision. To optimize this recurrent process, we introduce a bi-level group-relative policy optimization objective combining intra-generation and inter-generation advantages. The intra-generation term compares rollouts under shared skill conditioning, while the inter-generation term rewards revisions that improve behavior across successive generations. Together, these provide a principled objective for directional skill evolution rather than one-shot self-refinement. Empirically, Skill-R1 achieves consistent gains over no-skill baselines and standard GRPO across benchmarks with verifiable rewards, with particularly strong improvements on complex, multi-step tasks.
中文摘要 代理大型语言模型通常依赖技能、可重用的自然语言程序来指导规划、行动和工具使用。实际上，技能通常通过提示工程或对任务LLM本身进行对齐来提升，这对应成本高昂、模型特定，且对闭源模型往往不可行。技能优化不是一步步的问题，而是一个反复出现的过程，具有两个耦合等级的学分分配：一项有用的技能必须在当前条件下提升推广质量，而一项有用的修订则必须将观察到的结果转化为下一轮的更好技能。我们提出了Skill-R1，一种基于可验证奖励进行实例级重复技能优化的强化学习框架。Skill-R1 不更新任务 LLM，而是训练一个轻量级技能生成器，基于任务上下文、先前的推广及其验证结果，生成引导冻结任务 LLM 的技能。这既保持了对开源和闭源模型的黑箱兼容性，又使得适配成本远低于模型级更新。技能R1跨越多代进行：每一步，当前技能会诱导推出，经过验证的结果会反馈以产生下一次修订。为优化这一重复过程，我们引入了结合世代内和代际优势的双层次群体相对策略优化目标。代际术语比较共享技能条件下的推广，而代际术语则奖励能在后代中改善行为的修订。这些元素共同提供了有原则的技能进化目标，而非一次性自我精炼。实证上，Skill-R1在基准测试中相较于无技能基线和标准GRPO实现了持续的提升，且有可验证的奖励，尤其在复杂多步骤任务上表现显著。
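The bi-level group-relative objective described in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: the additive combination, the weight `beta`, and the use of mean-reward improvement as the inter-generation term are hypothetical, not the authors' definition.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def bilevel_advantages(generations, beta=0.5):
    """generations[g] holds the verified rewards of rollouts under skill g.

    Hypothetical combination: a per-rollout intra-generation term plus
    beta times an inter-generation term shared by all rollouts of g.
    """
    means = [mean(g) for g in generations]
    # Inter-generation signal: change in mean reward relative to the previous
    # generation (0 for the first generation, which has no predecessor).
    deltas = [0.0] + [means[i] - means[i - 1] for i in range(1, len(means))]
    out = []
    for rewards, delta in zip(generations, deltas):
        intra = group_relative(rewards)
        out.append([a + beta * delta for a in intra])
    return out

# Two generations of four rollouts each, with binary verified rewards.
advantages = bilevel_advantages([[0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 1.0]])
```

In this toy setup the second generation's advantages are shifted upward because its skill revision raised the mean reward, which is the directional signal the abstract attributes to the inter-generation term.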

Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

目标条件强化学习的多尺度预测表征

Authors: Valliappan Chidambaram Adaikkappan, David Meger, Sai Rajeswar, Pietro Mazzaglia
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09364
Pdf link: https://arxiv.org/pdf/2605.09364
Abstract This paper investigates robust representation learning in offline goal-conditioned reinforcement learning (GCRL). Particularly in sparse reward scenarios, learning representations that align state and goal latents is a challenge that frequently culminates in representation divergence where the encoder drifts toward a low-dimensional, goal-agnostic subspace that destabilizes policy learning. We address this issue by showing that an agent must acquire a fundamental understanding of its environment across multiple scales, from local physical dynamics to long-horizon goal-directed structure. Building on this insight, we propose this http URL, a framework that leverages multi-scale predictive supervision to enforce goal-directed alignment within the latent space. We demonstrate that this http URL leads to improved representation quality and strong performance on both vision and state-based tasks. Furthermore, we show that our approach is exceptionally resilient under realistic, challenging data regimes, maintaining state-of-the-art performance across a wide variety of tasks, trajectory stitching scenarios, and extreme noise conditions.
中文摘要 本文探讨了离线目标条件强化学习（GCRL）中的强韧表征学习。尤其是在奖励稀疏场景中，学习能够对齐状态和目标潜在变量的表征是一个挑战，常常导致表征发散，编码器会漂移到低维、目标无关的子空间，从而破坏策略学习的稳定性。我们通过展示代理必须在多个尺度上获得对环境的基本理解来解决这个问题，从局部物理动力学到长期目标导向结构。基于这一见解，我们提出了这个http URL，这是一个利用多尺度预测监督，在潜在空间内强制目标导向一致的框架。我们证明了该 http URL 能够提升表示质量，并在基于视觉和状态的任务中表现出色。此外，我们证明了我们的方法在现实且具有挑战性的数据环境下表现出极高的韧性，能够在多种任务、轨迹拼接场景和极端噪声条件下保持最先进的性能。

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

从被动再利用到主动推理：为神经符号体验重放奠定大型语言模型基础

Authors: Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09419
Pdf link: https://arxiv.org/pdf/2605.09419
Abstract While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights into differentiable first-order logic representations, and utilizes the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistently superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.
中文摘要 虽然经验回放对于强化学习（RL）中的数据效率至关重要，但标准方法将回放缓冲区视为被动记忆系统，基于数值预测误差而非语义意义优先排序样本。这种方法与人类学习形成对比，后者通过主动将碎片化的经验抽象为行为规则，加速掌握。为弥合这一差距，我们提出了神经符号体验回放（NSER）框架，将经验回放从被动的样本再利用机制转变为知识构建的主动引擎。具体来说，NSER通过一种新的神经符号基础流水线解决了语言推理与数值优化之间的不兼容问题。它利用大型语言模型（LLMs）以零样本方式从累积轨迹中诱导候选行为规则，将这些洞见置于可微的一阶逻辑表示中，并利用所得的符号结构动态重权重回放分布。通过让抽象知识直接塑造策略优化，NSER在反应式、基于规则和程序化的基准测试中实现了持续优越的样本效率和收敛速度。
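As a rough illustration of reweighting a replay distribution with symbolic structure, the sketch below scales a standard TD-error priority by a per-sample rule score. The multiplicative form, the weight `lam`, and the `rule_scores` input (imagined as the degree to which a transition satisfies an induced rule, in [0, 1]) are assumptions for this example; the abstract does not specify how NSER computes its weights.

```python
def replay_probabilities(td_errors, rule_scores, lam=1.0, eps=1e-6):
    """Sampling probabilities for a replay buffer.

    Each sample's priority is |TD error| (plus eps so no sample has zero
    mass), multiplied by (1 + lam * rule_score). With lam = 0 this reduces
    to plain proportional prioritization.
    """
    weights = [(abs(d) + eps) * (1.0 + lam * s)
               for d, s in zip(td_errors, rule_scores)]
    total = sum(weights)
    return [w / total for w in weights]

# Two samples with equal TD error; only the second matches an induced rule.
probs = replay_probabilities(td_errors=[1.0, 1.0], rule_scores=[0.0, 1.0])
```

With equal numerical error, the rule-matching sample is drawn twice as often, which is one simple way semantic significance could shift the replay distribution.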

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

无互动的感知：剖析LMM中的因果发现缺失

Authors: Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.09422
Pdf link: https://arxiv.org/pdf/2605.09422
Abstract Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
中文摘要 尽管大型多模模型（LMM）在一般视频理解方面表现出色，但其在因果发现过程中易受文本先验捷径的影响被认为是一个关键缺陷。这一现象的根本机制尚未完全明了，因为现有基准仅衡量反应准确性，并未揭示赤字的来源和范围。我们介绍了ProCauEval，一种基于扰动的评估协议，从结果评估转向机制诊断，通过五种受控配置系统地操作视觉和文本模态，分析其对模型行为的贡献并剖析失效模式，深入探讨因果发现。在评估17个主流LMM时，我们发现模型忠实感知视频内容，但在因果推理中系统性地低估了其价值。我们还观察到，更强的后期训练会放大而非减轻文本先验依赖，且更高的基线性能与在扰动下更脆弱的相关性。为解决这些问题，我们提出了反蒸馏政策优化（ADPO），这是一种基于负向教师对齐的强化学习框架，通过明确推动政策远离由视觉腐败诱导的先前仅反事实教师的行为来增强GRPO。具体来说，ADPO最大化了基于原始输入与视觉损坏输入的政策分布之间的分歧，从而迫使模型以视觉证据为基础推理，而非文本捷径。大量实验表明，ADPO在不牺牲基本理解的前提下提升了视觉参与度，从而为可靠因果发现提供了初步步骤。
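The divergence term described in this abstract can be sketched for a single decoding step. The abstract does not say which divergence ADPO uses, so the KL divergence, the function names, and the toy logits below are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def visual_grounding_bonus(logits_original, logits_corrupted):
    """Hypothetical bonus: divergence between the next-token distribution
    given the original video and the one given a visually corrupted video.
    A larger value means the prediction depends more on the visual input."""
    return kl_divergence(softmax(logits_original), softmax(logits_corrupted))

# Toy next-token logits over a 3-token vocabulary.
same = visual_grounding_bonus([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
different = visual_grounding_bonus([2.0, 0.5, 0.1], [0.1, 0.5, 2.0])
```

A policy that ignores the video produces identical distributions in both conditions and earns no bonus, so adding such a term to the GRPO objective would push the model away from text-only shortcuts.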

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

跨语言政策自我提炼以实现多语言推理

Authors: Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, Hinrich Schütze
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.09548
Pdf link: https://arxiv.org/pdf/2605.09548
Abstract Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: this https URL.
中文摘要 大型语言模型（LLMs）在数学推理方面取得了显著进步，但这种能力在各语言间并不平等。尤其是低资源语言的推理表现明显较低。为此，我们提出了跨语言策略自蒸馏（COPSD），将模型自身的高资源推理行为转移到低资源语言中。COPSD采用与学生和教师相同的模型：学生只看到低资源问题，而教师则获得特权的跨语言上下文，包括问题翻译和参考解决方案。训练最大限度地减少了学生自身推广时的全分发令牌级偏差，提供密集的监督，同时避免了仅结果强化学习（RL）的稀疏性和不稳定性。对17种低资源非洲语言的实验显示，COPSD在不同模型规模下持续提升低资源数学推理能力，且显著优于群体相对策略优化（GRPO）。进一步分析显示，COPSD提高了答案格式的遵循性，强化了测试时间的缩放，并推广到更难的多语言推理基准，尤其在资源较低的语言中取得了显著提升。我们将代码和数据公开于：https URL。

Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

四足行走控制的神经形态强化学习

Authors: Zhuangyu Han, Abhronil Sengupta
Subjects: Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.09595
Pdf link: https://arxiv.org/pdf/2605.09595
Abstract Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by $4.3\times$ compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
中文摘要 强化学习（RL）使得在复杂地形上实现了稳健的四足行走，但大多数已学会的控制器是在离线中通过大规模并行模拟的反向传播训练，并作为固定策略部署，限制了对地形变化、有效载荷变化、执行器磨损及其他机载功率限制下的实际环境的适应性。局部学习通过用由局部神经状态驱动的更新替代全局反向传播图，为实现能量感知机器人适应提供了潜在路径，使学习规则更兼容神经形态和内存计算基底。本研究提出了基于均衡传播（EP）的近端策略优化（PPO）框架，用于不平整地形的四足行走。控制器结合了仿生中央模式生成器（CPG）策略和残余姿势调整策略，同时用EP驱动的局部学习取代了传统的反向传播训练策略和价值网络。为了用EP训练随机连续控制策略，我们推导出一个EP兼容的PPO输出助推信号，并引入了双侧比率裁剪机制，在松弛期间稳定策略更新。在一台12自由度（12-DoF）的A1四足机器人上的实验表明，所提出的控制器在两阶段的不平整地形运动任务中实现了稳定的策略收敛。其运动性能在成功率、速度追踪、执行器功率和身体稳定性方面与反向传播训练的PPO基线相当，同时相比随时间反向传播（BPTT）将GPU内存效率提高了 $4.3\times$。这些结果表明，基于局部均衡的学习可以支持高维的具身运动，并为低功耗的机器人适应和微调提供算法基础。

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

任何带有知识转移的3D扩散模型：放疗规划研究

Authors: Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09622
Pdf link: https://arxiv.org/pdf/2605.09622
Abstract Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.
中文摘要 体素剂量预测是实际放疗（RT）规划中一项关键且具有挑战性的任务，因为从零开始训练的定制模型常常难以在不同临床环境中推广。与此同时，基于视觉领域数十亿级数据集训练的生成模型取得了令人印象深刻的表现。本文提出DiffKT3D，一个统一的Any2Any三维扩散框架，利用预训练视频扩散模型的先验知识，实现高效且临床意义深厚的剂量预测。为了实现多种临床模态（CT、解剖结构、身体、光束设置等）的灵活条件反射，我们引入了Any2Any条件范式，利用特定模态嵌入且无交叉注意力开销。此外，我们设计了一种新型强化学习（RL）训练后机制，基于临床知情的评分卡，明确针对机构治疗偏好量身定制。与GDP-HMM挑战获胜者相比，DiffKT3D通过将体素级MAE从2.07降至1.93，实现了剂量预测的新水平。此外，DiffKT3D实现了更优越的图像质量和偏好匹配。这些结果表明，通过模式意识条件反射和临床对齐的强化学习训练后转移扩散先验，可以为多种临床场景下的RT规划提供稳健且可推广的解决方案。

Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Plan2Cleanse：通过蒙特卡洛规划在深度强化学习中的测试时间后门防御

Authors: Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09638
Pdf link: https://arxiv.org/pdf/2605.09638
Abstract Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at this https URL.
中文摘要 确保强化学习（RL）模型的安全性至关重要，尤其是在这些模型由第三方训练并部署于现实世界系统时。攻击者可以在这些模型中植入后门，使其在典型条件下表现正常，但在特定触发器被激活时会执行恶意行为。在本研究中，我们提出了Plan2Cleanse，一种测试时检测和缓解框架，能够优化蒙特卡洛树搜索，高效识别和中和强化学习后门攻击，无需模型重新训练。我们的方法将后门检测重新定位为一个规划问题，使得系统性地探索时间延长的触发序列，同时保持对目标策略的黑箱访问。通过利用检测结果，Plan2Cleanse 还可以通过树木搜索预防性重新规划实现高效的缓解。我们在竞争激烈的MuJoCo环境、模拟的O-RAN无线网络和雅达利游戏中评估了我们的方法。Plan2Cleanse 实现了显著改进，在隐形 O-RAN 场景中将触发检测成功率提高了超过 61.4 个百分点，并在竞争性类人生物环境中将胜率从 35% 提升至 53%。这些结果展示了我们测试时防御方法的有效性，并凸显了在强化学习部署中主动防御后门威胁的重要性。我们的实现可在此 https URL 公开获取。

Adaptive Data Harvesting for Efficient Neural Network Learning with Universal Constraints

自适应数据采集，实现高效神经网络学习，具有普遍约束

Authors: Siteng Kang, Xinhua Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09707
Pdf link: https://arxiv.org/pdf/2605.09707
Abstract Training neural networks to satisfy universal constraints over continuous domains poses unique challenges. Common examples include Lyapunov Neural Networks (Lyapunov NNs) and Physics-Informed Neural Networks (PINNs), where analytical solutions are generally either unavailable or overly restrictive. Sample-based methods are therefore commonly used to enforce these constraints, and the choice of samples has a substantial impact on convergence speed, stability, and solution quality. Most existing methods rely on fixed heuristics or handcrafted rules, and are suboptimal in practice. In this paper, we aim to improve upon them by learning, from data and experience, how to dynamically and iteratively adjust the samples in response to the model's evolving learning performance. Trained by reinforcement learning, the learned policy improves empirical constraint satisfaction on test problems while significantly improving efficiency. We validate the approach on both Lyapunov NNs and PINNs, and demonstrate its broader applicability to domains where adaptive input selection is essential for effective training.
中文摘要 训练神经网络以满足连续域上的普遍约束存在独特挑战。常见的例子包括李雅普诺夫神经网络（Lyapunov NN）和物理知情神经网络（PIN），这些方法通常缺乏或过于限制性。因此，基于样本的方法常被用于强制执行这些约束，样本的选择对收敛速度、稳定性和解质量有显著影响。大多数现有方法依赖固定启发式或手工制定的规则，实际上并不理想。本文旨在通过从数据和经验中学习如何动态且迭代地调整样本，以应对模型不断演变的学习表现，从而改进这些方法。通过强化学习训练，所学策略提升了测试问题的经验约束满足度，同时显著提升了效率。我们在李雅普诺夫神经网络和PINN上验证了该方法，并展示了其在适应性输入选择对有效训练至关重要的领域中的更广泛适用性。

On-Policy Distillation with Best-of-N Teacher Rollout Selection

配合最佳教师推广选择的政策提炼

Authors: Ke Zhang, Yunjie Tian, DongDi Zhao, Yijiang Li, Yuanye Liu, Vishal M Patel, Di Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.09725
Pdf link: https://arxiv.org/pdf/2605.09725
Abstract On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at this https URL.
中文摘要 策略提纯（OPD）通过监督学生自身抽样轨迹，已成为一种高效的训练后方法，旨在改善推理能力，同时避免强化学习的奖励依赖和标准监督微调中常见的灾难性遗忘。然而，标准OPD通常在学生生成的嘈杂环境中计算教师监督，且通常依赖每个提示单一随机教师展开。因此，监督信号可能存在高方差：抽样教师的轨迹可能不正确、信息不足，或与学生当前的推理行为不匹配。为解决这一限制，我们提出了BRTS，即N最佳推广教师选拔框架，用于政策提炼。BRTS通过基于精心策划的教师轨迹构建的教师环境监督分支，补充标准的学生语境门诊。BRTS不从首次抽样教师推广中提炼，而是从一小部分教师轨迹中抽样，并通过一个简单的优先规则选择辅助轨迹：正确性优先，学生对齐度其次。当有多个正确的教师轨迹时，BRTS会选择最符合学生当前行为的那条;当未条件教师样本在较难的提示中失败时，它会触发一个基于实情条件的恢复步骤，以引发自然的推导。选定的轨迹随后用于在OPD环路内提供可靠的教师情境监督，辅以教师轨迹的辅助损耗。AIME 2024、AIME 2025和AMC 2023的实验显示，BRTS在具有挑战性的推理基准测试中优于标准OPD，且在较难的数据集上提升最大。我们的代码可在此 https URL 访问。
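The priority rule stated in this abstract (correctness first, student alignment second) can be sketched as a small selection function. The `TeacherRollout` fields and the alignment score (imagined here as something like the student's mean log-probability of the teacher trajectory) are hypothetical names introduced for illustration; how BRTS actually measures alignment is not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TeacherRollout:
    text: str
    is_correct: bool          # verified against the reference answer
    student_alignment: float  # hypothetical score; higher = closer to student

def select_teacher_rollout(pool: List[TeacherRollout]) -> Optional[TeacherRollout]:
    """Pick the correct rollout most aligned with the student.

    Returns None when no sampled rollout is correct; per the abstract, the
    method would then fall back to a ground-truth-conditioned recovery step,
    which is not modeled here.
    """
    correct = [r for r in pool if r.is_correct]
    if not correct:
        return None
    return max(correct, key=lambda r: r.student_alignment)

pool = [
    TeacherRollout("derivation A", is_correct=False, student_alignment=-0.2),
    TeacherRollout("derivation B", is_correct=True, student_alignment=-1.5),
    TeacherRollout("derivation C", is_correct=True, student_alignment=-0.9),
]
chosen = select_teacher_rollout(pool)
```

Note that rollout A has the best alignment score but is never chosen, because correctness filters the pool before alignment is compared.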

One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

一人为所有人：非线性变换器可以实现跨域推广，实现上下文强化学习

Authors: Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09727
Pdf link: https://arxiv.org/pdf/2605.09727
Abstract A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.
中文摘要 强化学习（RL）的一个核心挑战是学习能够超越训练任务范围的模型，这一目标传统上通过多任务和元强化学习实现。近年来，变换器架构作为一种有前景的方法出现，使得通过上下文学习适应新任务，而无需显式参数更新。从功能角度看，变换器可以被视为将上下文映射到任务特定功能的函数算子。因此，理解和设计该算子以支持更强的强化强化是基础。在本研究中，我们通过建立非线性变换器与基于核的时间差分学习之间的联系，从核视角探讨这一推广问题。通过将变换器解释为在重现核希尔伯特空间（RKHS）中进行回归，我们证明了来自不同域的值函数只要位于同一RKHS内，可以用共享的权重集合表示。多个元世界领域的实验支持这一解释，证明了时间差分目标的收敛性。

Operationalizing Cybersecurity Governance for Mitigation Planning with Attack-Path Modeling and Reinforcement Learning

通过攻击路径建模和强化学习，实现网络安全治理以实现缓解规划

Authors: Philip Huff, Dakota Dale, Harshith Guduru, Rohan Singh, Qinghua Li
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2605.09792
Pdf link: https://arxiv.org/pdf/2605.09792
Abstract We address a fundamental challenge in cybersecurity operations: translating governance frameworks into actionable mitigation decisions under realistic resource constraints. Frameworks such as the NIST Cybersecurity Framework (CSF) provide widely adopted measures of organizational maturity, but do not directly support the selection and prioritization of defensive strategies against adversarial behavior. We present a system that operationalizes governance frameworks by mapping CSF maturity assessments into MITRE ATT&CK mitigation capabilities, which enables direct integration of organizational security posture with adversary-informed defensive planning. To manage adversary complexity, we employ a Variable-Order Markov Model (VOMM) trained on observed ATT&CK technique sequences to enable scalable adversary simulation within a Deep Reinforcement Learning (DRL) environment. We reconstruct likely attack paths and defensive responses using beam search, and then jointly optimize mitigation selection under explicit budget constraints. Our environment supports concurrent adversaries and realistic mitigation costs. Across multiple reward formulations and configurations, we show that the approach produces stable policies, meaningful cost-risk trade-offs, and interpretable mitigation plans aligned with organizational maturity. These results demonstrate that adversary-aware DRL can generate practical, resource-constrained defense strategies grounded in real-world frameworks and threat behavior.
中文摘要 我们解决了网络安全运营中一个根本性的挑战：在现实资源限制下，将治理框架转化为可执行的缓解决策。如NIST网络安全框架（CSF）等框架提供了广泛采用的组织成熟度衡量标准，但并未直接支持针对对抗性行为的防御策略的选择和优先排序。我们提出了一个系统，通过将CSF成熟度评估映射到MITRE ATT&CK缓解能力中，实现治理框架的操作化，从而实现组织安全态势与对手知情防御规划的直接整合。为管理对手复杂性，我们采用基于观察到的ATT&CK技术序列训练的变量阶马尔可夫模型（VOMM），实现深度强化学习（DRL）环境中可扩展的对手模拟。我们利用束流搜索重建可能的攻击路径和防御反应，并在明确的预算约束下共同优化缓解选择。我们的环境支持着同时存在的对手和现实的缓解成本。通过多种奖励表述和配置，我们证明该方法产生了稳定的政策、有意义的成本-风险权衡，以及与组织成熟度相匹配的可解释缓解方案。这些结果表明，具备对手意识的DRL能够基于现实世界框架和威胁行为，生成实用且资源有限的防御策略。

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

LEAD：大型语言模型的长度高效自适应与动态推理

Authors: Songtao Wei, Yi Li, Zhikai Li, Xu Hu, Yuede Ji, Guanpeng Li, Feng Chen, Carl Yang, Zhichun Guo, Bingzhe Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09806
Pdf link: https://arxiv.org/pdf/2605.09806
Abstract Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model's own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.
中文摘要 大型推理模型，如OpenAI o1和DeepSeek-R1，随着推理能力的提升，往往变得越来越冗长。这些膨胀的思维链（Chain-of-Think，CoT）轨迹往往超出了底层问题所需的水平，浪费了计算量、延迟和上下文预算。虽然在强化学习中引入基于长度的效率奖励是一种自然的解决办法，但现有方法面临两个根本挑战：正确性与效率的最佳平衡在整个训练过程中是不稳定的，且内在推理预算因问题而差异巨大。依赖静态奖励权重和全局长度约束，必然会在精度下降和压缩未实现之间做出妥协。为克服这些局限，我们提出了LEAD（长度高效自适应与动态推理）的方法，用在线自适应机制取代静态启发式。LEAD 利用潜在尺度不稳定性动态校准每一步的正确性与效率权衡，将优化能力引导至最具信息量的学习信号。此外，它基于模型自身的正确展开，估算一个自适应的每个问题在线目标长度，并施加对称效率奖励，惩罚过度思考和过度压缩。在五个数学推理基准测试中，LEAD在强化学习训练的高效推理方法中实现了最高的准确性和准确性-效率评分，同时输出明显短于基础模型。
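The per-problem target length and symmetric efficiency reward described in this abstract can be sketched as follows. The abstract gives no formulas, so both choices below are assumptions for illustration: the target is taken as the mean length of the correct rollouts, and the reward falls off linearly with relative deviation from the target.

```python
from statistics import mean

def target_length(rollouts):
    """rollouts: list of (num_tokens, is_correct) pairs for one problem.

    Assumed estimator: mean token count over the correct rollouts.
    Returns None when no rollout is correct, so no target is available.
    """
    correct = [n for n, ok in rollouts if ok]
    return mean(correct) if correct else None

def symmetric_efficiency_reward(num_tokens, target):
    """Reward of 1.0 at the target length, decreasing linearly with the
    relative deviation on either side and floored at 0.0, so both
    overthinking and over-compression are penalized."""
    return max(0.0, 1.0 - abs(num_tokens - target) / target)

# Three rollouts for one problem; only the first two are correct.
target = target_length([(100, True), (300, True), (900, False)])
rewards = [symmetric_efficiency_reward(n, target) for n in (100, 200, 300, 500)]
```

Here the target is 200 tokens, and responses of 100 and 300 tokens receive the same reward, which is what makes the penalty symmetric rather than a plain length penalty.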

Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

量化用户模拟器在构建协作式LLM助手中的实用性

Authors: Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.09808
Pdf link: https://arxiv.org/pdf/2605.09808
Abstract User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human-AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against the fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
中文摘要 用户模拟器越来越多地被用来构建交互式AI助手，但如何衡量这些模拟器的质量仍是一个悬而未决的问题。在本研究中，我们展示了模拟器质量如何通过其下游效用来量化：即使用该用户模拟器训练的LLM助手在与真人互动时的实际表现。在一个仅用户模拟器变化的受控实验中，我们通过强化学习对多种模拟器进行训练，从被提示角色扮演用户的LLM到通过WildChat微调的人类话语。作为评估，我们在一项有283名参与者的用户研究和基于真实人类与人工智能对话的基准WildBench上，测量了成对获胜率。在我们用户研究中，使用角色扮演LLM训练时，助手的胜率与初始助手在统计上无异（胜率为51%），而在精细调优的模拟器中训练则有显著提升（比初始助手提升58%，比针对角色扮演训练的助手提升57%）。仔细观察还发现了三个模式：使角色扮演大型语言模型更真实的方法（例如人格条件反射）提升了受训助手，但并未弥合与精细模拟器的差距;放大模型尺寸对精细调优的模拟器有利，但对角色扮演模拟器没有任何收益;而在角色扮演模拟器中训练的助手在测试时与其他模拟器搭配时无法泛化，而与精细调优模拟器训练的助手则可以。这些结果共同支持将用户模拟器扎根于真实人类行为，并通过其对真实用户的下游影响来衡量其质量。

Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management

学习压缩控制时间：慢性病管理的强化学习框架

Authors: Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09818
Pdf link: https://arxiv.org/pdf/2605.09818
Abstract Reinforcement learning (RL) in healthcare has had mixed results, with reward sparsity, unreliable off-policy evaluation, and deployment-simulation gap as recurring failure modes. We argue that chronic disease management is structurally a more tractable RL setting than the acute-care problems the field has primarily studied, but only if the problem is formalized to exploit chronic care's properties. We propose such a formalization. The agent's objective is to compress time-to-control (TTC) under a tiered reward calibrated to the CMS ACCESS Model. Two quantities from our companion preference-learning paper [Singh et al. 2026] enter as load-bearing structural elements: the execution intensity $\epsilon$ bounds action availability under a constrained Markov Decision Process, and the clinician capability $\kappa$ weights offline-data transitions during RL training. Together they couple preference learning and RL into a two-loop architecture. We present simulation results on synthetic state machines for hypertension and type 2 diabetes. Capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D TTC; the uniform-weighted formulation (the standard in existing healthcare RL) underperforms even the heterogeneous behavior policy. $\epsilon$-aware policies generalize across deployment regimes while $\epsilon$-naive policies do not.
中文摘要 医疗领域的强化学习（RL）效果参差不齐，奖励稀疏、不可靠的非策略评估以及部署与模拟差距是反复出现的失败模式。我们认为，慢性病管理在结构上比该领域主要研究的急性护理问题更为可处理，但前提是问题被形式化以利用慢性护理的特性。我们提出这样的形式化。该代理的目标是在根据CMS ACCESS模型校准的分级奖励下压缩控制时间（TTC）。我们配套的偏好学习论文[Singh 等，2026]中的两个量作为承载结构元素进入：执行强度 $\epsilon$ 在受约束马尔可夫决策过程下限定可用动作，临床医生能力 $\kappa$ 则在强化学习训练期间对离线数据转移进行加权。它们将偏好学习和强化学习结合成一个双环路架构。我们展示了高血压和2型糖尿病合成状态机的模拟结果。能力加权离线RL在T2D TTC中比统一加权离线RL和行为策略高出15个百分点；统一加权表述（现有医疗强化学习的标准）甚至不及异质行为策略。$\epsilon$ 感知（$\epsilon$-aware）策略能够跨部署场景泛化，而 $\epsilon$ 无感知（$\epsilon$-naive）策略则不能。
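The idea of weighting offline-data transitions by clinician capability $\kappa$ can be sketched with tabular Q-learning. This is a simplified stand-in and not the paper's algorithm: the tabular setting, the multiplicative use of $\kappa$ on the step size, and the toy state and action names are all assumptions for illustration, and the $\epsilon$ constraint on action availability is not modeled.

```python
from collections import defaultdict

def weighted_q_update(q, transitions, alpha=0.1, gamma=0.99):
    """One sweep of tabular Q-learning over logged transitions.

    Each transition is (state, action, reward, next_state,
    next_actions, kappa), where kappa in [0, 1] is the assumed capability
    of the clinician who generated it. The TD update is scaled by kappa,
    so data from more capable clinicians moves the estimate further.
    """
    for s, a, r, s_next, next_actions, kappa in transitions:
        # Value of the best next action; 0.0 for terminal states.
        best_next = max((q[(s_next, b)] for b in next_actions), default=0.0)
        td_error = r + gamma * best_next - q[(s, a)]
        q[(s, a)] += alpha * kappa * td_error
    return q

q = defaultdict(float)
logged = [
    ("uncontrolled", "titrate", 1.0, "controlled", [], 1.0),  # high-capability
    ("uncontrolled", "wait", 1.0, "controlled", [], 0.2),     # low-capability
]
weighted_q_update(q, logged)
```

Setting every `kappa` to 1.0 recovers the uniform-weighted formulation that the abstract uses as its baseline.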

Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy

几何帕累托控制：通过李群同伦实现的黎曼梯度能量函数流

Authors: Tong Wu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.09824
Pdf link: https://arxiv.org/pdf/2605.09824
Abstract We propose Geometric Pareto Control (GPC), a framework overcoming barriers of reinforcement learning in cyber-physical systems where governing physics is known. Reinforcement learning confronts barriers in safety-critical applications: sample complexity grows with action-space dimension, retraining is required when objectives or conditions shift, goals such as safety recovery and economic dispatch demand brittle switching logic, and unsafe exploration persists under constrained RL formulations. GPC resolves these barriers through a two-stage geometric approach. Offline, the supported family of Pareto-optimal solutions (i.e., solutions recoverable by weighted scalarization) is embedded as a submanifold within a Lie group. Exponential map closure preserves membership in the ambient Lie group; drift and reset assumptions keep online latent states within a bounded neighbourhood of the Pareto submanifold, and a training-time feasibility margin guarantees decoded actions remain feasible without post-hoc projection, constructing a "map" of the solution landscape. Online, a closed-form proximal navigator traverses this submanifold via a unified Riemannian gradient flow driven by a singular perturbation potential field, inducing dual-timescale dynamics that prioritize constraint restoration over performance optimization. The homeomorphic structure of the submanifold guarantees that varying system parameters and objective weights produce continuous control actions, enabling deployment under unseen conditions without retraining. Validated on a nonconvex control task and real-time multi-objective optimal power flow, GPC achieves 100% feasibility, 0.30% oracle suboptimality, and 12.3 ms decisions while shifting from constraint recovery to economic dispatch. Under branch-admittance uncertainty, it remains 100% feasible without retraining, whereas model-free baselines produce no feasible dispatches.
中文摘要 我们提出了几何帕累托控制（GPC），这是一个克服网络物理系统中强化学习障碍的框架，这些系统已知支配物理。强化学习在安全关键应用中面临障碍：样本复杂度随动作空间维度增长，目标或条件变化时需要再训练，安全恢复和经济调度等目标需要脆弱切换逻辑，且在受限强化学习表述下不安全探索依然存在。GPC通过两阶段几何方法解决这些障碍。离线时，支持的帕累托最优解族（即通过加权标量可恢复的解）被嵌入为李群中的子流形。指数映射闭包保持在环境李群中的成员资格;漂移和重置假设将在线潜在态保持在帕累托子流形的有界邻域内，训练时间可行性裕度保证解码后的动作在无需事后投影的情况下依然可行，构建了解图的“地图”。在线上，闭式近端导航器通过由奇异微扰势场驱动的统一黎曼梯度流遍历该子流形，从而引入对偶时间尺度动力学，优先考虑约束恢复而非性能优化。子流形的同胚结构保证了变化的系统参数和目标权重产生连续的控制动作，从而在不可见条件下部署而无需重新训练。经过非凸控制任务和实时多目标最优功率流验证，GPC实现了100%可行性、0.30%的预言机次优率和12.3毫秒决策，同时从约束恢复转向经济派遣。在分支准入不确定性下，不需再训练即可100%可行，而无模型基线则不产生可行的派遣。

Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

探索驱动的测试时大型语言模型推理优化

Authors: Changhao Li, Yuchen Zhuang, Chenxiao Gao, Haotian Sun, Rushi Qiang, Chao Zhang, Bo Dai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.09853
Pdf link: https://arxiv.org/pdf/2605.09853
Abstract Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3% improvement over the strongest baselines, and delivers an additional 1.5% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.
中文摘要 后训练技术结合推理时间尺度，显著增强了大型语言模型（LLM）的推理和对齐能力。然而，存在一个根本的张力：推断时间方法受益于从相对扁平的概率分布中进行多样化采样，而基于强化学习（RL）的后训练则固有地使这些分布更加清晰。为此，我们提出了探索驱动优化（EDO），将奖励偏向式探索目标扩展到迭代训练后，并将其整合进标准强化学习目标，鼓励采样解的多样性，同时促进更有效的推理时间计算。我们将EDO纳入迭代直接偏好优化（iDPO）和群体相对政策优化（GRPO），产生两种变体：ED-iDPO和ED-GRPO。大量实验表明，ED-iDPO和ED-GRPO在与测试时计算技术如自洽性结合时，表现出更高的解多样性和更强的推理能力。在三个分布内推理基准测试中，EDO相较最强基线提升了1.0-1.3%，并在五个分布外任务中额外提升了1.5%的平均收益。除了准确性，EDO还保持模型熵并稳定强化学习训练动态，突出其防止过度优化崩溃的有效性。综合来看，这些结果确立了EDO作为一个实用框架，用于平衡LLM推理中的探索与利用，尤其是在依赖测试时间扩展的环境中。

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

先分离，后融合：通过模态特定思维链缓解视听大型语言模型推理中的跨模态干扰

Authors: Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2605.09906
Pdf link: https://arxiv.org/pdf/2605.09906
Abstract Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating their evidence to answer. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
中文摘要 音频和视觉为视听问答提供了互补证据，但当前视听大型语言模型可能存在跨模态干扰：一种模态的信息误导了另一种模态的解读，从而引发幻觉。我们将此问题归因于中间推理过程中无法控制的跨模态交互。为缓解这一问题，我们提出了“先分离，后融合”（SFFL），这是一种旨在减少跨模态干扰的视听推理框架。SFFL执行特定模态的思维链推理，分别生成音频和视觉推理痕迹，并整合证据以供回答。我们通过数据管道在不同模态输入设置下构建模态偏好标签。我们将这些标签作为强化学习中的辅助奖励，鼓励在回答时对模态线索产生实例依赖性的偏好。我们进一步引入了一种特定于模态的推理机制，在分离推理阶段保持模态隔离，同时在证据融合阶段实现对跨模态信息的全面访问。实验显示，准确性和稳健性均有持续提升，在一般AVQA基准测试中平均相对增益为5.16%，在跨模态幻觉基准中为11.17%。

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

通过内在梯度-范数奖励实现无验证器的大型语言模型强化学习

Authors: Xuexiang Wen, Hang Yu, Linchao Zhu, Gaoang Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09920
Pdf link: https://arxiv.org/pdf/2605.09920
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $\ell_2$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a $\sqrt{T}$ scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at this https URL.
中文摘要 虽然带可验证奖励的强化学习（RLVR）最近已成为大型语言模型（LLM）一种有前景的训练后范式，但其对金标签或领域特定验证器的依赖限制了其扩展性，仅限于新的任务和领域。在本研究中，我们提出了无验证器的内在梯度-范数奖励（VIGOR），这是一种仅使用策略模型本身的简单奖励。给定提示时，VIGOR对一组完成任务进行采样，并对在当前参数下诱导教师强制负对数似然梯度的$/ell_2$范数更小的输出分配更高的组内奖励。直观上，较低梯度范数表明完备性更符合当前策略，作为策略优化的内在偏好信号。为了使这一内在信号在强化学习中实用，我们纠正了平均代币级梯度的系统性长度偏差，且其缩放为$\sqrt{T}$，并应用群体级排序以稳定各提示的奖励尺度。在数学推理基准测试中，VIGOR优于最先进的内部反馈强化学习（RLIF）基线，并且仅用数学数据训练时，还表现出跨域迁移到代码基准测试的能力。例如，在QWEN2.5-7B基础上，经过MATH 后期训练，VIGOR 在该基线上平均数学准确率提升了+3.31%，平均代码准确率提升了+1.91%，同时表现出更稳定的训练动态。代码可在该 https URL 访问。
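
The reward shaping described in the VIGOR abstract above (smaller gradient norm earns a higher within-group reward, with a $\sqrt{T}$ length correction and group-wise rank shaping) can be illustrated with a minimal sketch. It assumes the per-completion gradient norms have already been computed by some autograd framework. The function name `vigor_style_rewards`, the choice to multiply the norm by $\sqrt{T}$, and the linear rank-to-reward mapping onto [0, 1] are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def vigor_style_rewards(grad_norms, lengths):
    """Rank-shaped rewards for one prompt's group of completions.

    grad_norms: l2 norms of the length-averaged NLL gradient, one per completion
                (assumed precomputed elsewhere).
    lengths:    completion lengths T in tokens.
    Returns rewards in [0, 1]; the smallest length-corrected norm gets 1.0.
    """
    if len(grad_norms) != len(lengths):
        raise ValueError("grad_norms and lengths must have the same size")
    # Assumed correction: averaging over T tokens shrinks the norm roughly like
    # 1/sqrt(T), so multiplying by sqrt(T) puts different lengths on one scale.
    scores = [g * math.sqrt(t) for g, t in zip(grad_norms, lengths)]
    # Rank within the group (ascending score = best first). Using ranks rather
    # than raw scores keeps the reward scale identical across prompts.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = 1.0 if n == 1 else 1.0 - rank / (n - 1)
    return rewards


# Three hypothetical completions for one prompt.
example_rewards = vigor_style_rewards([0.5, 0.1, 0.3], [16, 100, 25])
print(example_rewards)
```

In a GRPO-style loop these rewards would take the place of a verifier score before group-relative advantages are computed; that wiring is not shown here.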

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

EXPO：通过自适应的基层调律和高斯课程抽样进行探索优先政策优化

Authors: Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09923
Pdf link: https://arxiv.org/pdf/2605.09923
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model's learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model's exploration boundary under a fixed inference cost budget.
中文摘要 带可验证奖励的强化学习（RLVR）已成为大型语言模型数学推理的标准范式，其中群体相对策略优化（GRPO）成为主流算法。我们指出GRPO中存在两个鲜为人知的低效问题。首先，固定的 KL 惩罚系数在模型需要显著偏离参考政策的阶段过度限制了策略探索。其次，对训练问题的均匀抽样忽略了中等难度问题提供了最有价值的梯度信号以促进优化。我们提出了探索优先政策优化（EXPO），配备两个轻量级插件模块。准确性条件KL标度（AKL）通过批量平均精度的平滑非线性函数动态调整KL正则化强度，当模型表现不佳时放宽惩罚，当模型取得良好结果时加强惩罚。高斯课程抽样（GCS）为以中等精度约0.5为中心的高斯分布问题分配抽样权重，重点训练模型的学习前沿。我们在六个数学推理基准测试中对DeepSeek-R1-Distill-Qwen-1.5B和Qwen3-8B-Base进行了大量实验。结果显示EXPO稳步超过原版GRPO。在AIME 2025 pass@32中，绝对提升13.34%，从63.33%升至76.67%，8B模型的平均pass@32提升为2.66。相比pass@1，pass@32 上的性能提升大幅提升表明，EXPO 在固定推断成本预算下有效扩大了模型的勘探边界。
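
The two EXPO modules described in the abstract above can be sketched in a few lines. The abstract only states that GCS uses a Gaussian centered near accuracy 0.5 and that AKL is a smooth nonlinear function of batch accuracy that relaxes the penalty at low accuracy. The width `sigma`, the sigmoid form, the bounds `beta_min` and `beta_max`, and both function names are assumptions chosen for illustration, not values taken from the paper.

```python
import math


def gcs_weights(accuracies, center=0.5, sigma=0.2):
    """Gaussian-shaped sampling weights over per-question accuracies.

    Questions whose current accuracy is near `center` get the most weight.
    The weights are normalized to sum to 1. `sigma` is an assumed width.
    """
    raw = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2)) for a in accuracies]
    total = sum(raw)
    return [r / total for r in raw]


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.05,
                    midpoint=0.5, steepness=10.0):
    """KL penalty coefficient as a smooth increasing function of batch accuracy.

    A sigmoid is one possible 'smooth nonlinear function' (an assumption):
    low accuracy gives a value near beta_min (more room to explore), high
    accuracy gives a value near beta_max (stay close to the reference policy).
    """
    gate = 1.0 / (1.0 + math.exp(-steepness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


# Hypothetical per-question accuracies from recent rollouts.
weights = gcs_weights([0.0, 0.25, 0.5, 0.9, 1.0])
print([round(w, 3) for w in weights])
print(akl_coefficient(0.1), akl_coefficient(0.9))
```

Questions that are always solved or never solved receive little sampling weight under this scheme, which matches the abstract's point that moderately difficult problems carry the most informative gradient signal.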

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

TRACER：多模态工具使用代理的可验证生成来源

Authors: Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.09934
Pdf link: https://arxiv.org/pdf/2605.09934
Abstract Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
中文摘要 多模态大型语言模型越来越多地通过调用外部工具进行视觉检查、OCR、检索、计算和多步推理来解决以视觉为中心的任务。当前的工具使用代理通常会公开执行的工具轨迹和最终答案，但很少具体说明支持每个生成主张的工具观察。我们称这种缺失的索赔级别依赖结构为来源差距。这一差距使得工具使用难以验证和优化，因为有用的证据、冗余探索和无根据的推理混合在同一轨迹中。我们介绍了TRACER，这是一个用于多模态工具使用代理中可验证生成来源的框架。TRACER不在生成后添加引用，而是生成每个答案句子，并附带结构化的来源记录，识别支持工具转向、证据单元和语义支持关系。其关系空间包含引用、压缩和推理，涵盖直接重用、忠实凝聚和基导。TRACER通过模式检查、工具转向对齐、源真实性和关系理性来验证每条记录，然后将验证的来源转换为可追溯性约束和来源衍生的本地信用，用于强化学习。我们进一步构建了TRACE-Bench，这是基于粗多模态工具轨迹进行句子级来源重建的基准测试。在 TRACE-Bench 中，简单添加工具往往会产生噪声。借助Qwen3-VL-8B，TRACER的答案准确率达到78.23%，摘要准确率达到95.72%，比最强的闭源工具增强基线高出23.80个百分点。与仅工具监督的微调相比，它还将测试集工具调用总数从4949次减少到3486次。这些结果表明，可靠的多模态工具推理依赖于对来源意识的观察使用，而非单纯依赖工具调用次数。

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

HAGE：通过强化学习驱动的加权图演化利用能动记忆

Authors: Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.09942
Pdf link: https://arxiv.org/pdf/2605.09942
Abstract Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at this https URL.
中文摘要 在智能大型语言模型（LLM）系统中，内存检索通常被视为静态查找问题，依赖于平面向量搜索或固定的二元关系图。然而，固定图结构无法捕捉事件之间关系强度、置信度和查询相关性的变化。本文提出了HAGE，这是一种加权多关系记忆框架，将检索重新概念化为在统一关系记忆图上的顺序、查询条件的遍历。内存组织为基于共享内存节点的关系特异图视图，每条边对应一个可训练关系特征向量，编码多个关系信号。给定查询时，基于LLM的分类器识别关系意图，路由网络动态调制边缘嵌入的相应维度。遍历分数是通过学习中语义相似性和这些查询条件边表示的结合来计算的。这使得内存遍历能够优先考虑高效用关系路径，同时温和地抑制噪声或相关性较弱的连接。除了自适应遍历，HAGE还引入了基于强化学习的训练框架，结合下游任务优化路由行为和边缘表示。最后，实证结果表明，与最先进的智能记忆系统相比，长视野推理准确性提升，且在准确率与效率的权衡上更为有利。我们的代码可在此 https URL 访问。

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

面向通才游戏玩家：游戏多元宇宙中基础模型的研究

Authors: Kuan Zhang, Dongchen Liu, Qiyue Zhao, Tianyu Xin, Yue Su, Haisheng Wang, Han Yin, Hongbo Ma, Peize Li, Tianjun Gu, Xiangnan Wu, Xinran Zhang, Yongxuan Li, Zirong Chen, Yiming Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.09965
Pdf link: https://arxiv.org/pdf/2605.09965
Abstract The real world unfolds along a single set of physical laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence (AGI), the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
中文摘要 现实世界按照一套物理定律展开，但人类智能展现出惊人的能力，能够将这单一物理存在的体验推广到多元宇宙中，每个游戏都由完全不同的规则、美学、物理和目标支配。这种全现实适应能力是一般智能的标志。随着人工智能向通用人工智能的推进，游戏的多元宇宙已从单纯的娱乐演变成了训练和评估通用人工智能的终极领域。这种普遍性的追求跨越了四个时代：从环境特定的符号和强化学习代理，到当前作为通用玩家的大型基础模型，再到未来创造阶段，代理既创造新的游戏世界，又在其中不断演进。我们沿着四个相互依赖的支柱追踪了通用型游戏玩家的整个生命周期：数据集、模型、驾驭和基准。每一次跨越这些柱子的进展，都可以被解读为试图打破目前限制整个体系的五大基本权衡之一。基于这一端到端视角，我们绘制了一个五层级的路线图，从单一游戏的精通逐步推进到最终的创造者阶段，在此阶段代理在理论游戏多元宇宙中同时创造和进化。综合来看，我们的工作提供了一个统一的视角，洞察快速变化的领域，并为能够无缝掌握游戏多元宇宙中任何挑战的全能通才提供了原则性路径，从而为通用人工智能铺平了道路。

Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

巩固-扩展算子力学：自适应学习的统一框架

Authors: Debashis Guha
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.09968
Pdf link: https://arxiv.org/pdf/2605.09968
Abstract Every adaptive learning system must alternate between two operations: consolidating what it already knows and expanding into new evidence. We propose Consolidation-Expansion Operator Mechanics (OpMech), a framework that makes this structure precise. The central object is the order-gap $\Ogap(\theta; e)$, the degree to which a consolidation operator $Q$ and an expansion operator $P_e$ fail to commute at a given knowledge state. Because the order-gap is computable from the system's own trajectory, it serves as a real-time control signal: large values indicate that the system is still sensitive to the ordering of consolidation and expansion; once the order-gap falls and stays small, further processing is unlikely to change the outcome. Three results give the signal precise meaning: the order-gap decays along convergent trajectories; a persistently large order-gap implies the system is far from its settled state; and an order-gap-based stopping rule terminates with provable guarantees in both noiseless and bounded-noise settings. The framework applies across five domains: bandits, reinforcement learning, stochastic optimization, continual learning, and recursive language models. We give conditions under which the order-gap reliably tracks convergence in three representative cases. We develop the recursive language model application in detail, showing how OpMech replaces heuristic stopping rules and fixed recursion budgets with principled, evidence-driven alternatives.
中文摘要 每个自适应学习系统都必须在两项操作之间交替：巩固已有知识和扩展到新的证据。我们提出了\emph{巩固-展开算子力学}（OpMech），这是一个使该结构变得精确的框架。中心对象是\emph{序间隙} $\Ogap（\theta; e）$，表示在给定知识状态下，巩固算符~$Q$和展开算符~$P_e$无法交换的程度。由于阶差可从系统自身轨迹计算，因此它作为实时控制信号：值大表示系统仍对巩固和膨胀的顺序敏感;一旦订单缺口缩小并保持较小，进一步处理不太可能改变结果。三个结果赋予信号精确含义：阶隙沿收敛轨迹衰减;持续较大的序间隙意味着系统远离其稳定状态;基于阶隙的停止规则在无噪声和有界噪声条件下均以可证明的保证终止。该框架适用于五个领域：盗贼、强化学习、随机优化、持续学习和递归语言模型。我们给出了序隙在三种代表性情况下可靠追踪收敛的条件。我们详细开发了递归语言模型应用，展示了OpMech如何用有原则、以证据为驱动的替代方案取代启发式停止规则和固定递归预算。
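
To make the order-gap idea in the OpMech abstract above concrete, here is a toy sketch. It assumes the gap is measured as the Euclidean distance between "expand then consolidate" and "consolidate then expand"; the abstract does not state the exact definition. The operators (`clip_to_unit_box` as consolidation, `step_toward` as expansion), the tolerance, and the patience-based stopping rule are hypothetical stand-ins, not the paper's constructions.

```python
import math


def clip_to_unit_box(theta):
    """Toy consolidation operator Q: project each coordinate onto [-1, 1]."""
    return [max(-1.0, min(1.0, x)) for x in theta]


def step_toward(theta, evidence, rate=0.5):
    """Toy expansion operator P_e: move part of the way toward the evidence e."""
    return [x + rate * (e - x) for x, e in zip(theta, evidence)]


def order_gap(theta, consolidate, expand, evidence):
    """Assumed gap: || Q(P_e(theta)) - P_e(Q(theta)) ||_2."""
    a = consolidate(expand(theta, evidence))
    b = expand(consolidate(theta), evidence)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_until_settled(theta, evidence_stream, tol=1e-6, patience=3):
    """Stop once the gap has stayed at or below `tol` for `patience` steps.

    Returns (final_theta, steps_taken), or (final_theta, None) if the
    evidence stream runs out before the rule fires.
    """
    calm = 0
    for steps, e in enumerate(evidence_stream, start=1):
        gap = order_gap(theta, clip_to_unit_box, step_toward, e)
        theta = clip_to_unit_box(step_toward(theta, e))
        calm = calm + 1 if gap <= tol else 0
        if calm >= patience:
            return theta, steps
    return theta, None


print(order_gap([3.0], clip_to_unit_box, step_toward, [0.0]))  # clipping active
print(order_gap([0.4], clip_to_unit_box, step_toward, [0.0]))  # operators commute
print(run_until_settled([3.0], [[0.0]] * 10))
```

In this toy the gap only reports whether the order of the two operators still matters at the current state. The convergence and stopping guarantees mentioned in the abstract depend on conditions that the sketch does not reproduce.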

Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

超越自我游戏与规模化：自动驾驶泛化的行为基准

Authors: Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler, Daniel Jost, Oliver Scheel, Raghu Rajan, Daphne Cornelisse, Eugene Vinitsky, Joschka Boedecker
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10034
Pdf link: https://arxiv.org/pdf/2605.10034
Abstract Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
中文摘要 近期的自动驾驶（AD）项目如GigaFlow和PufferDrive，将强化学习（RL）大规模应用于驾驶政策的培训策略。然而，这些政策仍然与既定基准脱节，导致大规模强化学习在标准化评估下驾驶的表现尚不明朗。我们介绍BehaviorBench——一套综合测试套件，在评估、复杂性和行为多样性三个方面缩小了这一差距。在评估方面，我们提供了一个连接PufferDrive与nuPlan的接口，首次使通过强化学习大规模训练的政策能够基于既定的自动驾驶规划基准进行评估。此外，我们还提供一个评估框架，允许规划者在PufferDrive模拟中直接进行基准测试，时间缩短至极短。关于复杂度，我们观察到如今的标准化基准非常简单，通过直线车道跟随和碰撞检查即可获得接近完美的分数。我们从Waymo开放运动数据集（WOMD）中提取了一个有意义且交互丰富的分割，在该数据上，没有多智能体推理，无法实现强性能。最后，我们讨论行为多样性。现有基准通常以单一规则驱动的交通模型——智能驾驶员模型（IDM）来评估规划者。我们提供多样化的交互式流量代理套件，用于在异构行为下对策略进行压力测试，而不仅仅是使用IDM。总体而言，我们的基准分析揭示了以下见解：尽管以涌现方式学习互动行为，但通过标准奖励函数纯自玩训练的策略对其训练对手过于拟合，且未能推广到其他交通代理行为。基于这一观察，我们提出了一种混合规划器，结合了PPO政策和基于规则的规划器。

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

通过多块Q值估计实现自适应动作分块

Authors: Yongjae Shin, Jongseong Chae, Seongmin Kim, Jongeui Park, Youngchul Sung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10044
Pdf link: https://arxiv.org/pdf/2605.10044
Abstract Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
中文摘要 动作分块作为模仿学习中的关键技术出现，使策略能够预测连贯的动作序列，而非单一动作。近年来，这一方法已扩展到强化学习（RL），增强了行为一致性并减少价值函数估计中的自助错误。然而，现有方法依赖固定的分块长度，导致性能瓶颈，因为不同状态和任务之间最优长度不同。本文提出了自适应动作块化（ACH），一种新型离线到在线的强化学习算法，可在训练和推理过程中动态调制块长度。为了找到动态变化当前状态的最佳分块长度，我们采用基于Transformer的架构，在一次前向传递中同时估计所有候选分块长度的动作值。我们的机制允许代理根据当前状态自适应地选择最有效的区块长度。在34项具有挑战性的任务中，ACH始终优于固定长度基线，在复杂环境中展现出更优越的泛化能力和学习效率。
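
The selection step described in the ACH abstract above (estimate a value for every candidate chunk length, then commit to the best one) can be sketched as a simple argmax. The sketch assumes the Q-values for all candidate lengths are already available from a single forward pass of some critic and that they are directly comparable across lengths. The function name `select_chunk`, the dictionary interface, and the tie-breaking rule (prefer the shorter chunk) are assumptions for illustration only.

```python
def select_chunk(q_by_length, planned_actions):
    """Pick the chunk length with the highest estimated value.

    q_by_length:     dict mapping candidate chunk length -> estimated Q-value
                     (assumed to come from one forward pass of a critic).
    planned_actions: the policy's predicted action sequence for this state.
    Returns (chosen_length, actions_to_execute). Ties go to the shorter chunk,
    which lets the agent re-plan sooner.
    """
    if not q_by_length:
        raise ValueError("need at least one candidate chunk length")
    best_len = max(q_by_length, key=lambda k: (q_by_length[k], -k))
    if best_len > len(planned_actions):
        raise ValueError("chosen chunk is longer than the planned sequence")
    return best_len, planned_actions[:best_len]


# Hypothetical values for candidate lengths 1, 4, and 8 in the current state.
length, actions = select_chunk({1: 0.2, 4: 0.9, 8: 0.9}, list("abcdefgh"))
print(length, actions)
```

After executing the selected actions the agent would query the critic again in the new state, so the chunk length can change from one decision point to the next.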

EFGCL: Learning Dynamic Motion through Spotting-Inspired External Force Guided Curriculum Learning

EFGCL：通过观察启发的外部力引导课程学习动态运动

Authors: Keita Yoneda, Kento Kawaharazuka, Kei Okada
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10063
Pdf link: https://arxiv.org/pdf/2605.10063
Abstract Learning dynamic whole-body motions for legged robots through reinforcement learning (RL) remains challenging due to the high risk of failure, which makes efficient exploration difficult and often leads to unstable learning. In this paper, we propose External Force Guided Curriculum Learning (EFGCL), a guided RL approach based on the principle of physical guidance, in which external assistive forces are introduced during training. Inspired by spotting in artistic gymnastics, EFGCL enables agents to physically experience successful motion executions without relying on task-specific reward shaping or reference trajectories. Experiments on a quadrupedal robot performing Jump, Backflip, and Lateral-Flip tasks demonstrate that EFGCL accelerates learning of the Jump task by approximately a factor of two and enables the acquisition of complex whole-body motions that conventional RL methods fail to learn. We further show that the learned policies can be deployed on a real robot, reproducing motions consistent with those observed in simulation. These results indicate that physically guided exploration, which allows agents to experience success early in training, is an effective and general strategy for improving learning efficiency in dynamic whole-body motion tasks.
中文摘要 通过强化学习（RL）学习腿部机器人的动态全身运动仍然具有挑战性，因为失败风险高，使得高效探索变得困难，且常导致学习不稳定。本文提出了外部力量引导课程学习（EFGCL），这是一种基于物理指导原则的引导强化学习方法，在训练过程中引入外部辅助力量。EFGCL受艺术体操中的定位启发，使智能体能够亲身体验成功的动作执行，而无需依赖特定任务的奖励塑造或参考轨迹。在执行跳跃、后空翻和侧翻任务的四足机器人实验表明，EFGCL将跳跃任务的学习速度加快约两倍，并使得传统强化学习方法无法掌握的复杂全身动作成为可能。我们还进一步证明，所学策略可以部署在真实机器人上，重现与仿真中观察到的运动一致。这些结果表明，物理引导探索——让智能体在训练早期就能取得成功——是一种有效且通用的策略，能够提升动态全身运动任务中的学习效率。

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

在沙盒中规划，在开放世界中导航：学习基于物理的抽象体验以实现具身导航

Authors: Zhixuan Shen, Jiawei Du, Ziyu Guo, Han Luo, Lilan Peng, Joey Tianyi Zhou, Haonan Luo, Tianrui Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10118
Pdf link: https://arxiv.org/pdf/2605.10118
Abstract Vision-Language Models (VLMs) have demonstrated exceptional general reasoning capabilities. However, their performance in embodied navigation remains hindered by a scarcity of aligned open-world vision and robot control data. Although simulators provide a cost-effective alternative for data collection, the inherent reliance on photorealistic simulations often limits the transferability of learned policies. To this end, we propose Sandbox-Abstracted Grounded Experience (SAGE), a framework that enables agents to learn within a physics-grounded semantic abstraction rather than a photorealistic simulation, mimicking the human capacity for mental simulation where plans are rehearsed in simplified physics abstractions before execution. The SAGE system operates via three synergistic phases: (1) Genesis: constructing diverse, physics-constrained semantic environments to bootstrap experience; (2) Evolution: distilling experiences through Reinforcement Learning (RL), utilizing a novel asymmetric adaptive clipping mechanism to stabilize updates; (3) Navigation: bridging the abstract policy to open-world control. We demonstrate that SAGE significantly improves planner-assisted embodied navigation, achieving a 53.21% LLM-Match Success Rate on A-EQA (+9.7% over baseline), while showing encouraging transfer to physical indoor robot deployment.
中文摘要 视觉语言模型（VLM）展现出卓越的通用推理能力。然而，它们在具象导航中的表现仍受限于对齐开放世界视觉和机器人控制数据的稀缺。尽管模拟器提供了一种经济高效的数据收集替代方案，但对照片级真实模拟的固有依赖往往限制了所学策略的可迁移性。为此，我们提出了 \textit{\textbf{S}andbox-\textbf{A}bstracted \textbf{G}rounded \textbf{E}xperience} （\textbf{\textit{SAGE}}），这是一个框架，使智能体能够在基于物理的语义抽象中学习，而非写实模拟，模拟人类在心理模拟中通过简化的物理抽象进行计划的演练。\textit{SAGE} 系统通过三个协同阶段运行：（1） \textit{Genesis}：构建多样且受物理约束的语义环境以引导经验;（2） \textit{Evolution}：通过强化学习（RL）提炼体验，利用一种新颖的非对称自适应剪裁机制稳定更新;（3） \textit{导航}：将抽象政策与开放世界控制连接起来。我们证明 \textit{SAGE} 显著提升了规划辅助的具身导航，在 A-EQA 上实现了 53.21% 的 LLM 匹配成功率（比基线高出 +9.7%），同时显示出积极的室内机器人部署应用。

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

FormalRewardBench：证明奖励模型的形式定理基准

Authors: Zeynel A. Uluşan, Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10141
Pdf link: https://arxiv.org/pdf/2605.10141
Abstract Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap, scalable, and free of reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.
中文摘要 最新的神经定理证明器采用了带可验证奖励的强化学习（RLVR），其中证明助手提供二元正确性信号。虽然可验证的奖励廉价且可扩展且没有奖励黑客问题，但它们存在稀疏的信用分配问题：模型无法接收到部分进展未获奖励的复杂问题的学习信号。这激发了能够评估超越二元验证的证明质量的学习奖励模型。然而，比较奖励模型具有挑战性，因为通常需要昂贵的强化学习训练消融。为此，我们引入了 \textbf{FormalRewardBench}，这是用精益4证明的形式定理中评估奖励模型的第一个基准。我们的基准测试包含250对偏好对，正确证明与错误变体通过五种专家精心策划的错误注入策略生成：强制错误、最小单点变异、冗长错误证明、自然语言辩解和Python代码注入。我们评估了前沿大型语言模型（如Claude Opus 4.5）、判定型大型语言模型（如CompassJudger-1-14B）、通用大型语言模型（如Qwen2.5-72B-Instruct）以及专门的定理证明模型（如DeepSeek-Prover-V2-7B）。我们的结果显示，前沿大型语言模型的表现最高（59.8%），而专门的定理证明者表现最差（24.4%），表明定理证明能力并未转化为证明评估。我们进一步介绍了各种错误注入机制，凸显了大多数注入机制的复杂性。我们公开发布 \textbf{FormalRewardBench}，以鼓励更多关于形式数学奖励模型开发的研究。
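
A preference-pair benchmark of this kind is usually scored by checking whether the reward model ranks the correct proof above its incorrect variant. The sketch below assumes that metric; the abstract does not state how the reported percentages are computed. The field names (`chosen`, `rejected`, `strategy`), the toy proofs, and the length-based scorer are hypothetical illustrations, not the benchmark's data or API.

```python
from collections import defaultdict

def pairwise_accuracy(pairs, score):
    """Fraction of pairs where the correct proof gets a strictly higher score.

    `pairs` is a list of dicts with hypothetical keys "chosen" (correct proof),
    "rejected" (injected-error variant) and "strategy" (injection strategy).
    Ties count as misses.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair in pairs:
        totals[pair["strategy"]] += 1
        if score(pair["chosen"]) > score(pair["rejected"]):
            hits[pair["strategy"]] += 1
    overall = sum(hits.values()) / sum(totals.values())
    per_strategy = {name: hits[name] / totals[name] for name in totals}
    return overall, per_strategy

def toy_score(proof):
    """Toy stand-in for a reward model: it simply prefers shorter proofs."""
    return -len(proof)

# Two made-up pairs, labeled with strategy names taken from the abstract.
toy_pairs = [
    {"strategy": "verbose incorrect proofs",
     "chosen": "by simp", "rejected": "by simp; simp; simp; exact foo"},
    {"strategy": "minimal single-point variations",
     "chosen": "by omega", "rejected": "by simpa"},
]

overall, per_strategy = pairwise_accuracy(toy_pairs, toy_score)
```

Reporting accuracy per injection strategy, as in `per_strategy`, makes it visible when a scorer handles one kind of error and fails on another; here the length heuristic separates the verbose variant but ties on the single-point variation.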

Is DRL-based MAC Ready for Underwater Acoustic Networks? Exploring Its Practicality in Real Field Experiments

基于深度强化学习（DRL）的MAC是否适合水下声学网络？探索其在真实外场实验中的实用性

Authors: Jiani Guo, Bingwen Huangfu, Shanshan Song, Nan Sun, Miao Pan, Guangjie Han
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.10144
Pdf link: https://arxiv.org/pdf/2605.10144
Abstract Medium Access Control (MAC) protocols rely on neighbor and environment information to design collision-free access rules for Underwater Acoustic Networks (UANs). Acquiring this information suffers from high communication overhead due to the unique underwater acoustic channel characteristics, such as long propagation delay, spatiotemporal variations in communication quality, and high attenuation. Deep Reinforcement Learning (DRL) is promising to circumvent the UANs' physical constraints and provide a low-overhead solution for underwater MAC protocols, since it can decide access rules based on real-time observation without extra information exchange. However, the unique underwater acoustic channel characteristics impose significant challenges on observation acquisition, training time, and the balance of multiple reward factors for DRL-based MAC protocols. Most existing methods remain at the theoretical level: (1) they design partial intelligent agents failing to achieve fully autonomous access; (2) they assume unreasonable simulation scenarios, weakening the effects of underwater acoustic channel characteristics on MAC protocols. To enhance the practicality of DRL-based MAC protocols, we first analyze the application challenges of DRL in UANs through real field experiments. Based on the above challenges, we propose a DRL-based MAC protocol that considers observation loss and balances multiple reward factors to achieve efficient Entire Autonomous access in the UAN (EA-MAC). To further explore the feasibility of DRL-based MAC protocols, we implement EA-MAC and other state-of-the-art protocols on underwater acoustic modems and evaluate their performance in real field experiments. Experimental results demonstrate that EA-MAC can adaptively determine the scheduling sequence for each node, enabling high-throughput and fair communication in a straightforward manner for UANs.
中文摘要 介质访问控制（MAC）协议依赖邻居和环境信息来设计水下声学网络（UAN）的无碰撞访问规则。由于水下声道的独特特性，如传播延迟较长、通信质量时空变化和高衰减，获取这些信息存在较高的通信开销。深度强化学习（DRL）有望绕过UAN的物理限制，为水下MAC协议提供低开销解决方案，因为它可以基于实时观察决定访问规则，无需额外信息交换。然而，独特的水下声学通道特性对基于日程学习的MAC协议的观测获取、训练时间及多重奖励因子的平衡带来了重大挑战。大多数现有方法仍停留在理论层面：（1）设计部分智能体，未能实现完全自主的访问;（2）它们假设了不合理的模拟场景，削弱了水下声学通道特性对MAC协议的影响。为了提升基于DRL的MAC协议的实用性，我们首先通过实地实验分析了DRL在UAN中的应用挑战。基于上述挑战，我们提出了一种基于DRL的MAC协议，考虑观测损失并平衡多重奖励因子，以实现UAN内高效的全自主访问（EA-MAC）。为了进一步探讨基于DRL的MAC协议的可行性，我们将EA-MAC及其他最先进协议应用于水下声学调制解调器，并在实地实验中评估其性能。实验结果表明，EA-MAC能够自适应地确定每个节点的调度顺序，从而实现高吞吐量和公平的UAN通信，且以直接的方式实现。

Unsupervised Process Reward Models

无监督过程奖励模型

Authors: Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10158
Pdf link: https://arxiv.org/pdf/2605.10158
Abstract Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
中文摘要 过程奖励模型（PRMs）是一种强大的机制，通过提供细粒度的、步骤级的监督，引导大型语言模型推理。然而，这种有效性也带来了显著代价：PRM每一步推理都需要专家注释，成本高昂且难以扩展。在此，我们提出了一种无需人工监督的无监督PRM（uPRM）训练方法，无论是在逐步注释层面，也无需通过对最终答案的真实验证进行。我们方法的核心思想是定义一个评分函数，该函数源自LLM的下一标记概率，能够联合评估一批推理轨迹中首个错误步骤的候选位置。我们展示了uPRM在多种场景下的有效性：（i） uPRM在识别ProcessBench数据集中首次错误步骤时，比LLM作为法官的绝对准确率提升了高达15%;（ii）作为测试时间缩放的验证器，uPRM的表现与监督PRM相当，且比多数投票基线高出多达6.9%;（iii）作为强化学习中的奖励信号使用时，uPRM相比使用真实标签训练的监督PRM实现了更稳健的策略优化。总体而言，我们的结果为复杂推理任务的可扩展奖励建模开辟了一条道路。

Balancing Efficiency and Fairness in Traffic Light Control through Deep Reinforcement Learning

通过深度强化学习平衡交通信号灯控制的效率与公平性

Authors: Matteo Cederle, Giacomo Scatto, Gian Antonio Susto
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10170
Pdf link: https://arxiv.org/pdf/2605.10170
Abstract Urban traffic congestion presents a significant challenge for modern cities, which impacts mobility and sustainability. Traditional traffic light control systems often fail to adapt to dynamic conditions, leading to inefficiencies. This paper proposes a novel deep reinforcement learning agent for traffic light control that addresses this limitation by explicitly integrating fairness considerations for both vehicular and pedestrian traffic. Unlike prior work, our approach dynamically balances these flows based on real-time demand, moving beyond systems focused solely on vehicles. Experimental results demonstrate that our agent effectively reduces congestion while ensuring equitable service for both the categories of road users. This research contributes to a practical and adaptable solution for intelligent traffic management within the framework of smart cities, paving the way for more efficient and inclusive urban mobility.
中文摘要 城市交通拥堵对现代城市构成重大挑战，影响出行性和可持续性。传统的交通信号灯控制系统常常无法适应动态环境，导致效率低下。本文提出了一种新型深度强化学习代理，用于交通信号灯控制，通过明确整合车辆和行人交通的公平性考虑来解决这一限制。与以往不同，我们的方法基于实时需求动态平衡这些流量，超越了仅聚焦车辆的系统。实验结果表明，我们的代理有效减少拥堵，同时确保两类道路使用者均享有公平服务。这项研究为智慧城市框架下的智能交通管理提供了实用且可适应的解决方案，为更高效、更包容的城市出行铺平了道路。

MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

MTA-RL：通过基于多模态Transformer的三维可供性与强化学习实现稳健的城市驾驶

Authors: Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, Ostap Okhrin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10177
Pdf link: https://arxiv.org/pdf/2605.10177
Abstract Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
中文摘要 稳健的城市自动驾驶需要可靠的三维场景理解和在密集交互下的稳定决策。然而，现有的端到端模型缺乏可解释性，而模块化流水线则存在脆弱接口间的错误传播问题。本文提出了MTA-RL，这是首个通过多模态变换器基础的三维赋能与强化学习（RL）桥接感知与控制的框架。与以往直接回归作用的融合模型不同，RGB图像和LiDAR点云通过变换器架构融合，预测显式的几何感知性表现。这些结构化表示作为紧凑的观察空间，使强化学习策略能够纯粹基于预测驱动语义运作，显著提升了样本效率和稳定性。在CARLA Town01-03中，针对不同密度（20-60辆背景车辆）的广泛评估显示，MTA-RL持续优于最先进的基线。我们仅基于Town03训练，在未见城镇中展示了优异的零射击泛化能力，实现了路线完成率最高提升9.0%，总距离提升11.0%，每次违规距离提升83.7%。此外，消融研究证实我们的多模融合和奖励塑造至关重要，显著优于仅图像和未成形的变体，证明MTA-RL在实现稳健城市自动驾驶方面的有效性。

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

TRACE：通过令牌路由的自策略对齐，在关键部位提炼

Authors: Jiaxuan Wang, Xuan Ouyang, Zhiyu Chen, Yulan Hu, Zheng Pan, Xin Li, Lan-Zhe Guo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10194
Pdf link: https://arxiv.org/pdf/2605.10194
Abstract On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
中文摘要 策略自提纯（self-OPD）通过让策略在特权上下文下自我学习，使强化学习的可验证奖励（RLVR）更加密集。我们发现，当该指导覆盖整个响应时，全标记KL在大多数冗余位置上花费梯度，并放大特权信息泄漏，导致熵上升、推理缩短以及长期数学训练中的分布外退化。我们提出了关键rEasoning的令牌路由对齐（TRACE），仅在标注者标记的关键区间进行精炼：对正确展开的关键区间进行前向KL，局部错误区间可选的反向KL，剩余所有令牌为GRPO，KL通道在短暂预热后退火。我们的分析通过两种效应解释了TRACE：前向基层层为学生分配不足的教师支持代币提供非消失的提升，而跨度掩蔽和衰减则保持累积特权梯度暴露有限。在四个保留的数学基准测试加上GPQA-Diamond测试中，TRACE平均相较GRPO提升2.76个百分点，并且在GPQA-Diamond上保持了Qwen3-8B的基础OOD分数，而GRPO和全代币自OPD基线则有所下降。在线自标注下，增长依然存在（+1.90个百分点，约为强API增益的69%），减少了TRACE仅仅导入外部标注能力的担忧。在不同尺度上，最佳路由作用依赖于基数：在Qwen3-8B上，关键跨度为正向KL，而在Qwen3-1.7B上，错误跨度时向反向KL。
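
The routing described in the abstract can be pictured as a per-token choice of loss. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: spans are half-open `(start, end)` index pairs, error spans are only routed on incorrect rollouts, forward KL is taken as KL(teacher || student), and the KL weight decays linearly after warm-up. The abstract only says the KL channel is "annealed away", so the schedule shape and all function names are hypothetical.

```python
import math

def forward_kl(teacher, student):
    """KL(teacher || student) over one token's distribution."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher) over one token's distribution."""
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)

def route_tokens(num_tokens, key_spans, error_spans, rollout_correct):
    """Assign each token position to one loss channel.

    Key spans of correct rollouts go to forward KL, error spans of
    incorrect rollouts go to reverse KL, everything else stays on GRPO.
    """
    routes = ["grpo"] * num_tokens
    spans = key_spans if rollout_correct else error_spans
    label = "forward_kl" if rollout_correct else "reverse_kl"
    for start, end in spans:
        for position in range(start, end):
            routes[position] = label
    return routes

def kl_weight(step, warmup_steps, decay_steps, base=1.0):
    """Hold the KL weight during warm-up, then decay it linearly to zero."""
    if step < warmup_steps:
        return base
    progress = (step - warmup_steps) / decay_steps
    return max(0.0, base * (1.0 - progress))

routes = route_tokens(6, key_spans=[(1, 3)], error_spans=[], rollout_correct=True)
```

Because the weight reaches zero after `warmup_steps + decay_steps`, the total distillation signal applied to any routed token stays bounded, which is one way to read the abstract's claim that cumulative privileged-gradient exposure remains finite.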

Relative Score Policy Optimization for Diffusion Language Models

扩散语言模型的相对评分策略优化

Authors: Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.10218
Pdf link: https://arxiv.org/pdf/2605.10218
Abstract Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.
中文摘要 扩散大型语言模型（dLLMs）提供了一条有前景的并行高效文本生成路径，但提升推理能力需要有效的后期训练。带有可验证奖励的强化学习（RLVR）是此目的的自然选择，但其在dLLM中的应用受限于缺乏可处理的序列级对数比，而这些对数比是标准策略优化的核心。缺乏可处理的序列级对数比迫使现有方法依赖高方差的ELBO近似，而高验证者奖励可能放大不准确的得分估计并破坏强化学习训练。为克服此问题，我们提出了 \textbf{R}elative \textbf{S}core \textbf{P}olicy \textbf{O}ptimization （RSPO），这是一种简单的 RLVR 方法，利用可验证的奖励来校准 dLLM 中的噪声似然估计。我们算法的核心依赖于一个关键观察：奖励优势不仅可以解释为更新方向，还可以作为当前策略与参考策略相对对数比的目标。因此，RSPO通过比较其奖励优势与奖励隐含目标相对对数比来校准该噪声相对对数比估计，并根据当前估计值与目标之间的差距而非仅仅基于原始优势来更新政策。数学推理和规划基准的实验表明，RSPO在规划任务和竞争性数学推理表现上取得了特别显著的提升。

When Does Non-Uniform Replay Matter in Reinforcement Learning?

非均匀回放在强化学习中什么时候重要？

Authors: Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10236
Pdf link: https://arxiv.org/pdf/2605.10236
Abstract Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
中文摘要 现代非策略强化学习算法通常依赖简单的均匀回放抽样，且何时以及为何非均匀重放会在这一强基线上有所改善仍不明确。在不同的强化学习环境中，我们表明非均匀回放的有效性受三个因素控制：重放量，即每个环境步的重放转换次数;预期的近期性，即采样转变的近期程度;以及重放采样分布的熵。我们的主要贡献是澄清何时非均匀回放有益，并为现代非策略强化学习中的重放设计提供实用指导。也就是说，我们发现非均匀回放在回放量较低时最为有利，且即使在可比的预期新近性下，高熵采样也很重要。基于这些发现，我们采用了简单的截断几何重放方法，该方法在保持高熵且计算开销极小的情况下，将抽样偏向近期经验。在大规模并行模拟、单任务和多任务环境下，包括基于五个强化学习基准测试套件评估的三种现代算法，这种回放采样策略在低量条件下提高了样本效率，同时在回放量较大时保持竞争力。
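
The recency-biased sampling idea can be sketched as follows. This is one plausible reading of a "Truncated Geometric" replay, not the paper's exact parameterization: a transition of age `k` (0 is the newest) in a buffer of `n` transitions is drawn with probability proportional to `(1 - p)**k`. The function names and the parameter `p` are assumptions made for illustration.

```python
import math
import random

def truncated_geometric_probs(n, p):
    """Probability of sampling each age 0..n-1, proportional to (1 - p)**age."""
    weights = [(1.0 - p) ** age for age in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_age(n, p, rng):
    """Draw one age by inverting the truncated geometric CDF in O(1)."""
    q = 1.0 - p
    u = rng.random()
    age = int(math.log(1.0 - u * (1.0 - q ** n)) / math.log(q))
    return min(age, n - 1)  # guard against floating-point edge cases

def expected_age(probs):
    """Mean age of a sampled transition (lower means more recent)."""
    return sum(age * prob for age, prob in enumerate(probs))

def entropy(probs):
    """Shannon entropy of the sampling distribution, in nats."""
    return -sum(prob * math.log(prob) for prob in probs if prob > 0)

probs = truncated_geometric_probs(1000, 0.01)
rng = random.Random(0)
ages = [sample_age(1000, 0.01, rng) for _ in range(5000)]
```

The helpers `expected_age` and `entropy` correspond to two of the three factors named in the abstract (expected recency and sampling entropy). A smaller `p` moves the distribution toward uniform replay, raising entropy and expected age together.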

Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem

迈向自主铁路运营：一种半分层式深度强化学习方法解决车辆调度问题

Authors: Alberto Castagna, Stefan Zahlner, Adrian Egli, Christian Eichenberger, Daniel Boos, Manuel Meyer, Anton Fuxjager
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10257
Pdf link: https://arxiv.org/pdf/2605.10257
Abstract Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.
中文摘要 管理铁路交通管理中的中断是一项重大挑战。交通密度和基础设施限制的增加增加了复杂性，使得车辆路由与调度问题（VRSP）难以可靠且实时地解决。虽然运筹学（OR）方法被广泛使用，但由于问题的指数级组合复杂性，大多数调度仍依赖人工专业知识。强化学习（RL）因其在多智能体协调中的潜力而受到关注，但现有的强化学习方法往往表现不及OR方法，且难以在密集的铁路网络中实现规模化。本文通过引入一种半分层式强化学习表述，针对铁路运营约束，从机器学习角度解决这一空白。该方法通过专用动作和观察空间将调度与路由分离，使策略能够专注于不同的决策范围，并解决罕见调度决策与频繁路由更新之间的不平衡问题。该方法在Flatland-RL模拟器上评估，涵盖五个难度等级和50个随机种子，涵盖7至80列火车。结果显示，与启发式基线和单一强化学习相比，协调、资源利用和鲁棒性显著提升，几乎使到达目的地的列车数量翻倍，同时将死锁率控制在5%以下，并在严重拥堵时可自适应地安排、延误或取消列车。

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

MemReread：通过记忆引导重读增强能动的长语境推理能力

Authors: Baibei Ji, Xiaoyang Weng, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10268
Pdf link: https://arxiv.org/pdf/2605.10268
Abstract To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.
中文摘要 为了解决没有标准注意力机制平方复杂度的长上下文推理任务，基于代理记忆的方法出现了，通常在线性处理文档块时保持动态更新的内存。为了减少这种边读边记忆范式中潜在证据的丢失，近期工作集成了检索模块，使智能体能够回忆在记忆覆盖过程中被丢弃的信息。然而，基于检索的回忆在记忆形成过程中会丢失证据，并且由于无效查询会干扰。为了克服这些限制，我们提出了MemReread。基于流式阅读，MemReread 绕过了中间检索。当最终记忆不足时，它会触发问题分解和重读，从而恢复那些过早被遗弃的间接事实。该设计支持非线性推理，同时保持文档理解的固有逻辑流。为了进一步提升实用性，我们引入了一个强化学习框架，增强了长度外推能力，同时根据任务复杂度动态确定重读次数，从而灵活控制计算开销。大量实验表明，MemReread 在长上下文推理任务中始终优于基线框架，同时保持了上下文长度的线性时间复杂度。

Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

稳健的概率屏蔽，用于安全离线强化学习

Authors: Maris F. L. Galesloot, Thomas Rhemrev, Nils Jansen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10293
Pdf link: https://arxiv.org/pdf/2605.10293
Abstract In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
中文摘要 在离线强化学习（RL）中，我们从固定数据集中学习策略，无需环境交互。主要挑战在于对（1）最终保单的性能和（2）安全性提供保证。一种称为安全策略改进（SPI）的技术提供性能保证：新策略在高概率下优于假设安全的基线策略。在安全强化学习的背景下，盾牌通过限制作用空间到那些相对于给定安全相关模型下可证明安全的动作，从而提供安全保障。我们通过将屏蔽扩展到离线强化学习，完全依赖现有数据集和对安全与不安全状态的了解，将这些范式整合起来。然后，我们会保护保单改进步骤，保证保单安全。实验结果表明，屏蔽SPI优于无屏蔽SPI，提升了平均和最坏情况，尤其是在低数据环境中。
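
The shielding step can be illustrated with a small action filter. The sketch assumes that, for the current state, an upper bound on each action's probability of reaching an unsafe state has already been estimated from the dataset; how the paper derives such bounds is not given in the abstract. The names `risk_upper`, `threshold`, and the fallback to the least risky action are hypothetical choices.

```python
def shielded_actions(risk_upper, threshold):
    """Return the actions whose upper-bounded risk is within the threshold.

    `risk_upper` maps each action to an assumed upper bound on the
    probability of reaching an unsafe state. If no action qualifies,
    fall back to the single least risky action so the agent can still act.
    """
    allowed = [action for action, risk in risk_upper.items() if risk <= threshold]
    if not allowed:
        allowed = [min(risk_upper, key=risk_upper.get)]
    return allowed

def shielded_greedy(q_values, risk_upper, threshold):
    """Pick the highest-value action among those the shield permits."""
    allowed = shielded_actions(risk_upper, threshold)
    return max(allowed, key=lambda action: q_values[action])

# Toy state: "right" has the best estimated value but is too risky.
q_values = {"left": 1.0, "right": 5.0, "stay": 2.0}
risk_upper = {"left": 0.01, "right": 0.40, "stay": 0.05}

choice = shielded_greedy(q_values, risk_upper, threshold=0.1)
```

Using an upper bound rather than a point estimate makes the filter pessimistic: with little data the bounds are loose and fewer actions pass, which matches the abstract's emphasis on low-data regimes.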

Verifiable Process Rewards for Agentic Reasoning

代理推理的可验证过程奖励

Authors: Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10325
Pdf link: https://arxiv.org/pdf/2605.10325
Abstract Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
中文摘要 可验证奖励强化学习（RLVR）提升了大型语言模型（LLMs）的推理能力，但大多数现有方法依赖于结果层级的稀疏反馈。这种稀疏性在长期代理推理中带来了信用分配的挑战：一条轨迹可能失败，尽管包含许多正确的中间决策，也可能成功，尽管存在缺陷。在本研究中，我们研究一类密集可验证的代理推理问题，其中中间动作可以通过符号或算法预言机客观检查。我们提出了可验证过程奖励（VPR）框架，将此类预言机转化为密集的回合级监督用于强化学习，并在三种代表性环境中实现：基于搜索的动态演绎验证、基于约束的逻辑推理验证以及基于概率推理的后验验证。我们还提供了理论分析，表明密集的基于验证者的奖励可以通过提供更本地化的学习信号来改善长期学分分配，其益处取决于验证者的可靠性。从实证角度看，VPR在受控环境中优于结果级奖励和基于推广的流程奖励基线，更重要的是，它能转化为通用推理基准和代理推理基准，表明可验证的过程监督能够培养适用于培训环境之外的通用推理技能。我们的结果表明，只要有可靠的中间验证，VPR是增强LLM代理的有前景方法，同时也凸显了其对oracle质量的依赖以及将VPR扩展到结构化较少、开放环境的挑战。
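
The contrast between outcome-level and turn-level rewards can be made concrete with a toy task. In the sketch below an agent reports running sums of a list of numbers, and a simple arithmetic check plays the role of the oracle. The task, the 0/1 reward values, and the function names are illustrative assumptions and do not come from the paper.

```python
def outcome_rewards(claims, numbers):
    """Sparse feedback: a single reward on the last turn for the final answer."""
    success = claims[-1] == sum(numbers)
    return [0.0] * (len(claims) - 1) + [1.0 if success else 0.0]

def process_rewards(claims, numbers):
    """Dense feedback: an oracle checks every turn against the previous claim."""
    rewards = []
    previous = 0
    for claim, number in zip(claims, numbers):
        rewards.append(1.0 if claim == previous + number else 0.0)
        previous = claim
    return rewards

numbers = [3, 4, 5]
claims = [3, 8, 13]  # the second turn adds incorrectly; the third is locally correct

sparse = outcome_rewards(claims, numbers)
dense = process_rewards(claims, numbers)
```

Here the outcome reward is zero on every turn, while the dense reward isolates the one faulty step. This is the credit assignment situation the abstract describes, where a trajectory fails despite containing correct intermediate decisions.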

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

TMAS：通过多智能体协同扩展测试时间计算

Authors: George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10344
Pdf link: https://arxiv.org/pdf/2605.10344
Abstract Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at this https URL.
中文摘要 测试时间缩放已成为提升大型语言模型推理能力的有效范式，通过在推理过程中分配额外的计算量。近期的结构化方法通过组织推理跨多个轨迹、精细轮次和基于验证的反馈，进一步推动了这一范式的发展。然而，现有的结构化测试时间尺度方法要么只能弱化并行推理轨迹，要么依赖噪杂的历史信息，却未明确决定应保留和重用哪些信息，从而限制了探索与利用的平衡能力。在本研究中，我们提出了TMAS，一种通过多智能体协同来扩展测试时间计算的框架。TMAS将推理组织为专业代理之间的协作过程，实现跨代理、轨迹和细化迭代的结构化信息流动。为支持有效的交叉轨迹协作，TMAS引入了层级记忆：经验库重用低层次可靠的中间结论和局部反馈，而指南库记录此前探索了高层策略，引导后续推广避免重复推理模式。此外，我们设计了一套针对TMAS的混合奖励强化学习方案，共同保留了基本的推理能力，提升经验利用率，并鼓励超越以往尝试的解决方案探索。对挑战性推理基准的大量实验表明，TMAS比现有测试时的扩展基线实现更强的迭代扩展，而混合奖励训练则进一步提升了跨迭代的扩展效果和稳定性。代码和数据可在此 https URL 获取。

PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

PC3D：通过个性化上下文蒸馏实现可变团队规模下的零样本协作

Authors: Ahmet Onur Akman, Rafał Kucharski
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.10377
Pdf link: https://arxiv.org/pdf/2605.10377
Abstract Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.
中文摘要 合作式多智能体强化学习通常假设执行团队固定，但许多去中心化系统在部署时必须使用不同数量的活跃智能体。我们以分集阵容变体方式研究这一设定：每集由一组同质特工执行，团队规模在不同剧集中有所不同。代理仅根据本地历史行动，没有执行时的沟通、特权协调员或在线再培训。因此，有效的合作要求每个代理恢复有关活跃团队的相关背景，并相应调整其行为。为此，我们提出了PC3D（个性化中央协调上下文蒸馏）的方法，用于训练去中心化策略，从局部交互历史中恢复并利用个性化协调上下文。在培训过程中，有固定结构的集中教师将活跃团队压缩为协调令牌，并将其个性化到代理特定情境中，这些情境被提炼成去中心化的策略。执行时，每个代理从本地历史中预测自己的背景，并自适应地利用这些背景来条件决策。在三个合作MARL基准测试中，PC3D在可见和未观察的名单规模下均获得高于评估基线的回报，消融将这些收益归因于上下文提炼和自适应上下文的使用。

Causal Explanations from the Geometric Properties of ReLU Neural Networks

ReLU神经网络几何属性的因果解释

Authors: Hector Woods, Philippa Ryan, Rob Alexander
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2605.10396
Pdf link: https://arxiv.org/pdf/2605.10396
Abstract Neural networks have proved an effective means of learning control policies for autonomous systems, but these learned policies are difficult to understand due to the black-box nature of neural networks. This lack of interpretability makes safety assurance for such autonomous systems challenging. The fields of eXplainable Artificial Intelligence (XAI) and eXplainable Reinforcement Learning (XRL) aim to interpret the decision making processes of neural networks and autonomous agents, respectively. In particular, work on causal explanations aims to provide "why" and "why not" explanations for why a model made a given decision. However, most of the work on explainability to date utilises a distilled version of the original model. While this distilled policy is interpretable, it necessarily degrades in performance significantly when compared to the original model, and is not guaranteed to be an accurate reflection of the decision making processes in the original model and as such cannot be used to guarantee its safety. Recent work on understanding the geometry of ReLU neural networks shows that a ReLU network corresponds to a piecewise linear function divided into regions defined by an n-dimensional convex polytope. Through this lens, a neural network can be understood as dividing the input space into distinct regions which apply a single linear function for each output neuron. We show that this geometric representation can be used to generate causal explanations for the network's behaviour similar to previous work, but which extracts rules directly from the geometry of Neural Networks with the ReLU activation function, and is therefore an accurate reflection of the network's behaviour.
中文摘要 神经网络已被证明是学习自主系统控制策略的有效手段，但由于神经网络的黑箱特性，这些学习策略难以理解。这种可解释性缺失使得对此类自主系统的安全保障具有挑战性。可解释人工智能（XAI）和可解释强化学习（XRL）这两个领域分别旨在解释神经网络和自主智能体的决策过程。特别是，因果解释的研究旨在为模型做出某一决策提供“为什么”和“为什么不”的解释。然而，迄今为止，大多数关于可解释性的研究都采用了原始模型的精炼版本。虽然这种提炼后的政策可以解释，但与原始模型相比，其性能必然会显著下降，且不能保证准确反映原始模型的决策过程，因此无法用来保证其安全性。近期对ReLU神经网络几何结构的理解表明，ReLU网络对应于一个分段线性函数，分为由n维凸多胞体定义的区域。从这个角度来看，神经网络可以理解为将输入空间划分为不同的区域，每个输出神经元都应用一个线性函数。我们证明，这种几何表示可以用来生成网络行为的因果解释，类似于以往的工作，但通过ReLU激活函数直接从神经网络的几何中提取规则，因此准确反映了网络行为。
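
The piecewise-linear view of a ReLU network can be checked on a tiny example. The sketch below uses a made-up network with one hidden layer and arbitrary weights. It reads off the activation pattern at an input, then builds the single affine map that the network applies throughout the region sharing that pattern. The weights and helper names are assumptions for illustration; the paper's rule-extraction procedure is not reproduced here.

```python
def preactivations(x, W1, b1):
    """Hidden-layer values before the ReLU is applied."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]

def forward(x, W1, b1, W2, b2):
    """Ordinary forward pass of a one-hidden-layer ReLU network."""
    hidden = [max(0.0, z) for z in preactivations(x, W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def local_affine_map(x, W1, b1, W2, b2):
    """Activation pattern at x and the affine map valid in its linear region."""
    pattern = [1 if z > 0 else 0 for z in preactivations(x, W1, b1)]
    hidden_units = range(len(W1))
    W_eff = [[sum(W2[o][h] * pattern[h] * W1[h][i] for h in hidden_units)
              for i in range(len(x))] for o in range(len(W2))]
    b_eff = [sum(W2[o][h] * pattern[h] * b1[h] for h in hidden_units) + b2[o]
             for o in range(len(W2))]
    return pattern, W_eff, b_eff

# Arbitrary toy weights: 2 inputs, 3 hidden ReLU units, 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.1]

x = [2.0, 0.5]
pattern, W_eff, b_eff = local_affine_map(x, W1, b1, W2, b2)
affine_out = W_eff[0][0] * x[0] + W_eff[0][1] * x[1] + b_eff[0]
```

At this input the third hidden unit is inactive, and the effective weight on the first input is zero. Within this region the output therefore does not change with the first input, which is the kind of statement that can be read directly from the geometry. The region itself is the convex polytope on which the sign of every preactivation matches `pattern`.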

HiRL: Hierarchical Reinforcement Learning for Coordinated Resource Management in Heterogeneous Edge Computing

HiRL：异构边缘计算中协调资源管理的分层强化学习

Authors: Jianyong Zhu, Hao Chen, Juan Zhang, Fangda Guo, Albert Y. Zomaya, Renyu Yang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.10443
Pdf link: https://arxiv.org/pdf/2605.10443
Abstract Edge computing faces unprecedented resource orchestration challenges from multi-dimensional heterogeneity across device architectures, diverse task requirements (CPU-intensive, GPU-intensive, and I/O-intensive), and dynamic network conditions. Edge environments demand real-time task processing within strict energy budgets, yet conventional approaches struggle with mixed continuous-discrete optimization while meeting deadline and energy constraints. This paper presents HiRL, a hierarchical reinforcement learning framework that decomposes complex resource orchestration into coordinated power control and task allocation decisions. Our approach separates continuous power management using the Twin Delayed Deep Deterministic Policy Gradient (TD3) and discrete task placement using Double Deep Q-Network (DDQN), unified through a coordination engine with five-dimensional queue state representation. We propose a heterogeneous assessment of resource compatibility with deadline-oriented prioritization and failure-penalized adaptive sampling to enhance decision quality under resource constraints. To improve practical applicability, the framework models comprehensive system dynamics including device mobility, queue congestion patterns, infrastructure heterogeneity, and priority-sensitive scheduling demands. Experimental results show that HiRL achieves effective latency-energy trade-offs with 28% latency reduction compared to Single-DDQN and maintains nearly 100% task completion rates under all load conditions. Compared to baseline algorithms, HiRL reduces energy consumption by up to 51% under low load while achieving 24% better latency performance than static optimization approaches under high load, establishing effective resource orchestration in heterogeneous edge environments.
中文摘要 边缘计算面临前所未有的资源编排挑战，这些挑战源于设备架构的多维异构性、CPU密集型、GPU密集型和I/O密集型等多样化任务需求，以及动态变化的网络条件。边缘环境要求在严格的能耗预算内进行实时任务处理，而传统方法在满足截止时间和能耗约束的同时，难以处理连续与离散混合的优化问题。本文提出HiRL，一种分层强化学习框架，将复杂的资源编排分解为相互协调的功率控制与任务分配决策。我们的方法使用双延迟深度确定性策略梯度（TD3）进行连续功率管理，使用双深度Q网络（DDQN）进行离散任务放置，二者通过采用五维队列状态表示的协调引擎统一起来。我们提出了结合面向截止时间的优先级排序的异构资源兼容性评估，以及带失败惩罚的自适应采样，以提升资源受限条件下的决策质量。为提升实际适用性，该框架对全面的系统动态进行建模，包括设备移动性、队列拥塞模式、基础设施异构性和优先级敏感的调度需求。实验结果表明，HiRL实现了有效的延迟-能耗权衡，与Single-DDQN相比延迟降低28%，并在所有负载条件下保持接近100%的任务完成率。与基线算法相比，HiRL在低负载下最多可降低51%的能耗，在高负载下的延迟表现比静态优化方法好24%，从而在异构边缘环境中实现了有效的资源编排。

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Uni-Synergy：通过合作强化学习桥接理解与个性化推理的生成

Authors: Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang, Ming Lu, Renrui Zhang, Wentao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.10445
Pdf link: https://arxiv.org/pdf/2605.10445
Abstract Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: this https URL.
中文摘要 统一多模态模型（UMM）在通用任务中表现出色，但在弥合个性化理解与生成之间的差距方面遇到困难。以往的研究大多依赖于通过监督微调实现的隐性标记层级对齐，但这未能完全体现理解与创作之间的潜在协同效应。在本研究中，我们提出了Sync-R1，一种端到端强化学习框架，能够在单一的显式推理循环内共同优化个性化理解和生成。通过这一统一的反馈过程，Sync-R1 实现了个性化理解以指导内容创作，同时生成质量在集成奖励环境中相互完善理解。为了高效协调这种双任务协同效应，我们引入了Sync-GRPO，一种利用集合奖励系统的强化学习方法。此外，我们提出了动态群标度（DGS），该方法自适应地过滤低势轨迹，以降低梯度方差并加速收敛。为了更好地反映现实世界的复杂性，我们引入了 UnifyBench++，具备更密集的文本描述和更丰富的用户上下文。实验结果表明，Sync-R1实现了最先进的性能，展现了卓越的跨任务推理和稳健的个性化，而无需复杂的冷启动程序。代码和 UnifyBench++ 数据集将发布于：此 https URL。

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

必须保持安全的多智能体行为，而不仅仅是断言：基于LLM的多智能体系统中的约束漂移

Authors: Tianxiao Li, Yixing Ma, Haiquan Wen, Zhenglin Huang, Qianyu Zhou, Zeyu Fu, Guangliang Cheng
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.10481
Pdf link: https://arxiv.org/pdf/2605.10481
Abstract Modern LLM-based agents are no longer passive text generators. They read repositories, call tools, browse the web, execute code, maintain memory, communicate with other agents, and act through long-horizon workflows. This shift moves the unit of safety. A system may produce a compliant final answer while leaking private information through an internal message, delegating authority beyond its original scope, calling an external tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed. We argue that many emerging failures in LLM-based multi-agent systems share a common structure: safety-critical constraints do not remain operative throughout the trajectory. We call this phenomenon constraint drift: the loss, distortion, weakening, or relaxation of constraints as they pass through memory, delegation, communication, tool use, audit, and optimization. The position taken here is that safe multi-agent behavior must be maintained, not merely asserted. Prompts, guardrails, tool schemas, access control, and final output checks are necessary, but they are insufficient unless constraints remain fresh, inherited, enforceable, and auditable across execution. We propose Constraint State Governance as a research paradigm for LLM-based multi-agent systems. In this paradigm, safety-critical constraints are maintained as explicit execution state, while constraint-native reinforcement learning improves utility only within maintained safety boundaries. The goal is not to freeze agentic systems under rigid rules, but to make safety operational across the trajectories through which modern agents actually act.
中文摘要 现代基于LLM的代理不再是被动文本生成器。他们读取仓库、调用工具、浏览网页、执行代码、维护内存、与其他代理沟通，并执行长远的工作流程。这种换位移动了安全单位。系统可能在通过内部消息泄露私人信息、授权超出原有范围、调用具有敏感上下文的外部工具，或丢失重建为何允许某项行动所需的证据时，可能会给出合规的最终答复。我们认为，基于LLM的多智能体系统中许多新出现的失败存在一个共同结构：安全关键约束不会在整个过程中始终生效。我们称这种现象为约束漂移：约束在经历内存、委托、通信、工具使用、审计和优化过程中的丢失、扭曲、削弱或松弛。这里的立场是，安全的多智能体行为必须被维持，而不仅仅是断言。提示符、护栏、工具模式、访问控制和最终输出检查是必要的，但除非约束在执行过程中保持新鲜、继承、可执行和可审计，否则这些都不够。我们提出约束状态治理作为基于LLM的多智能体系统研究范式。在该范式中，安全关键约束作为显式执行状态被维护，而约束本地强化学习仅在维护的安全边界内提升效用。目标不是冻结智能系统以严格规则，而是使安全在现代智能体实际行动的轨迹上实现。

Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning

通过强化学习实现去中心化多智能体系统中的优先级驱动控制与通信

Authors: Qingyun Guo, Junyi Shi, Tomasz Piotr Kucner, Dominik Baumann
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10482
Pdf link: https://arxiv.org/pdf/2605.10482
Abstract Event-triggered control provides a mechanism for avoiding excessive use of constrained communication bandwidth in networked multi-agent systems. However, most existing methods rely on accurate system models, which may be unavailable in practice. In this work, we propose a model-free, priority-driven reinforcement learning algorithm that learns communication priorities and control policies jointly from data in decentralized multi-agent systems. By learning communication priorities, we circumvent the hybrid action space typical in event-triggered control with binary communication decisions. We evaluate our algorithm on benchmark tasks and demonstrate that it outperforms the baseline method.
中文摘要 事件触发控制为避免网络多智能体系统中过度使用受限通信带宽提供了一种机制。然而，大多数现有方法依赖于准确的系统模型，而这些模型在实际中可能无法获得。本研究提出一种无模型、优先级驱动的强化学习算法，能够在去中心化多智能体系统中从数据中共同学习通信优先级并控制策略。通过学习沟通优先级，我们绕过了事件触发控制中典型的混合行动空间，采用二元沟通决策。我们在基准测试任务中评估算法，并证明其性能优于基线方法。

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

DeepRefine：通过强化学习实现智能体编译知识精炼

Authors: Haoyu Huang, Jiaxin Bai, Shujie Liu, Yang Wei, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Yangqiu Song
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10488
Pdf link: https://arxiv.org/pdf/2605.10488
Abstract Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM-based reasoning model for agent-compiled knowledge refinement that improves the quality of any pre-constructed knowledge base with user queries to make it more suitable for downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.
中文摘要 智能体编译的知识库为大型语言模型（LLM）智能体在开放式、知识密集型的下游任务中提供持久的外部知识。然而，它们的质量系统性地受到不完整、不正确和冗余的限制，表现为缺失的证据或跨文档链接、低置信度或不精确的论断，以及歧义或共指消解问题。这些缺陷在迭代使用中不断累积，降低了检索保真度和下游任务性能。我们提出DeepRefine，一种基于LLM的通用推理模型，用于智能体编译知识的精炼（agent-compiled knowledge refinement），它借助用户查询提升任意预构建知识库的质量，使其更适合下游任务。DeepRefine与知识库进行多轮交互，对交互历史进行溯因诊断，定位可能的缺陷，并执行有针对性的精炼动作以实现增量式知识库更新。为了在没有标准参考答案的情况下优化DeepRefine的精炼策略，我们引入了Gain-Beyond-Draft（GBD）奖励，并通过强化学习对推理过程进行端到端训练。大量实验表明，该方法相对于强基线取得了一致的下游提升。

Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

更高分辨率，更优的泛化：在深度强化学习中解锁视觉缩放

Authors: Raphael Trumpp, Ömer Veysel Çağatan, Barış Akgün, Marco Caccamo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10546
Pdf link: https://arxiv.org/pdf/2605.10546
Abstract Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths: at their respective best conditions, visual scaling unlocks a 28% performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: this https URL.
中文摘要 基于像素的深度强化学习代理通常基于大量下采样的视觉观察进行训练，这一惯例源自早期基准测试，而非基于原则设计。本研究表明，观测分辨率是政策学习中一个关键但被忽视的变量：只要网络架构能够有效处理高分辨率输入，可以显著提升性能和泛化能力。我们发现，广泛使用的Impala编码器将空间特征扁平化为矢量，但随着分辨率提升，参数呈二次增长，且未能充分利用额外的视觉细节。用全局平均池（如 Impoola 架构）取代此操作，参数计数与分辨率解耦，带来分辨率和网络宽度间的一致提升——在最佳条件下，视觉扩展可为 Impoola 相较 Impala 提升 28% 的性能。这些增益在需要精确感知小或远物体的环境中最为显著，梯度显著性分析证实其潜在机制是政策在更高分辨率下更具空间定位性的视觉关注。我们的结果挑战了当前激进输入降采样和位置分辨率无关架构的做法，认为这是通往可扩展视觉深度强化学习的简单有效路径。为了促进未来深度强化学习分辨率扩展的研究，我们公开发布了Procgen-HD基准测试的开源代码：这个https URL。
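
The parameter-growth argument in this abstract can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not the paper's code: the channel count (32), total downsampling factor (8), and hidden width (256) are hypothetical defaults, and only the first dense layer after the convolutional stack is counted.

```python
def flatten_head_params(resolution: int, channels: int = 32,
                        downsample: int = 8, hidden: int = 256) -> int:
    """Weights + biases of a dense layer fed by flattened C x H' x W' features."""
    side = resolution // downsample          # spatial side length after the conv stack
    return channels * side * side * hidden + hidden


def gap_head_params(resolution: int, channels: int = 32, hidden: int = 256) -> int:
    """Same dense layer fed by globally average-pooled features (C values)."""
    # The resolution argument is unused on purpose: pooling removes the dependence.
    return channels * hidden + hidden


if __name__ == "__main__":
    for res in (64, 128, 256):
        print(res, flatten_head_params(res), gap_head_params(res))
```

Doubling the input side length quadruples the flattened layer's weight count, while the pooled variant stays constant, which is the decoupling the abstract describes.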

PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

PhysEDA：适用于曼哈顿距离衰减的高效物理感知学习框架

Authors: Zetao Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10547
Pdf link: https://arxiv.org/pdf/2605.10547
Abstract Electronic design automation (EDA) addresses placement, routing, timing analysis, and power-integrity verification for integrated circuits. Learning methods -- attention (Transformer) and reinforcement learning (RL) -- have recently emerged on EDA tasks, yet face two common bottlenecks: vanilla attention's quadratic complexity limits scaling, and data-scarce models overfit statistical noise and amplify weak long-range correlations against the underlying physics. We observe that EDA tasks share a physical prior -- pairwise electrical and routing interactions decay exponentially along Manhattan distance -- and integrate it as a unified inductive bias into both architecture and training. We propose PhysEDA, comprising two components: Physics-Structured Linear Attention (PSLA), which folds the separable Manhattan decay into the linear-attention kernel as a multiplicative bias, reducing complexity from quadratic to linear; and Potential-Based Reward Shaping (PBRS), which constructs a physical potential from the same kernel, providing dense reward signal under sparse RL while preserving the optimal policy via the policy-invariance theorem. Across three EDA scenarios -- decoupling-capacitor placement, macro placement, and IR-drop prediction -- PhysEDA improves zero-shot cross-scale transfer by 56.8% and achieves 14x inference speedup with 98.5% memory savings on 100x100 grids; PBRS adds another 10.8% in sparse-reward DPP.
中文摘要 电子设计自动化（EDA）负责集成电路的布置、布线、时序分析和功率完整性验证。学习方法——注意力（Transformer）和强化学习（RL）——最近在EDA任务中出现，但面临两个常见瓶颈：普通注意力的二次复杂度限制了缩放性，数据稀缺模型则会过度拟合统计噪声，放大与底层物理的弱长期相关性。我们观察到EDA任务共享物理先验——成对电气和路由交互沿曼哈顿距离呈指数衰减——并将其作为统一的归纳偏置整合到架构和训练中。我们提出由两个组成部分组成的PhysEDA：物理结构线性注意力（PSLA）将可分离的曼哈顿衰变折叠为线性注意力核，作为乘法偏压，将复杂度从二次降至线性;基于势的奖励整形（PBRS）从同一核构造物理势，在稀疏强化学习下提供密集的奖励信号，同时通过策略-不变性定理保持最优策略。在三种EDA场景——解耦电容布置、宏观布置和红外降预测——PhysEDA将零次跨尺度传输提升56.8%，在100x100网格上实现14倍推断加速和98.5%内存节约;PBRS在稀疏奖励DPP中又增加了10.8%。
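
Potential-based reward shaping, as named in this abstract, adds F(s, s') = gamma * Phi(s') - Phi(s) to each reward. The sketch below shows why that leaves policy rankings intact: the shaping terms telescope, so the shaped return differs from the original only by a quantity fixed by the start and end states. The potential used here (a sum of exponentials in Manhattan distance to some `sources`, with a `decay_length`) is a hypothetical stand-in, not PhysEDA's actual kernel.

```python
import math
from typing import Callable, List, Sequence, Tuple

Cell = Tuple[int, int]


def manhattan_potential(cell: Cell, sources: Sequence[Cell],
                        decay_length: float = 2.0) -> float:
    """Hypothetical potential: exponential decay along Manhattan distance."""
    x, y = cell
    return sum(math.exp(-(abs(x - sx) + abs(y - sy)) / decay_length)
               for sx, sy in sources)


def shaped_rewards(states: Sequence[Cell], rewards: Sequence[float],
                   potential: Callable[[Cell], float], gamma: float) -> List[float]:
    """r_t + gamma * Phi(s_{t+1}) - Phi(s_t); states has one more entry than rewards."""
    return [r + gamma * potential(s_next) - potential(s)
            for r, s, s_next in zip(rewards, states, states[1:])]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    return sum(gamma ** t * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    gamma = 0.95
    phi = lambda cell: manhattan_potential(cell, sources=[(2, 2)])
    states = [(0, 0), (1, 0), (1, 1), (2, 1)]
    sparse = [0.0, 0.0, 1.0]                  # reward only at the end
    dense = shaped_rewards(states, sparse, phi, gamma)
    print("shaped rewards:", dense)
    print("return gap:", discounted_return(dense, gamma) - discounted_return(sparse, gamma))
```

The printed gap equals gamma^T * Phi(s_T) - Phi(s_0) for an episode of length T, independent of the actions taken in between.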

Controllability in preference-conditioned multi-objective reinforcement learning

偏好条件多目标强化学习中的可控性

Authors: Pau de las Heras Molins, Beyazit Yalcinkaya, Lasse Peters, David Fridovich-Keil, Georgios Bakirtzis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10585
Pdf link: https://arxiv.org/pdf/2605.10585
Abstract Multi-objective reinforcement learning (MORL) allows a user to express preference over outcomes in terms of the relative importance of the objectives, but standard metrics cannot capture whether changes in preference reliably change the agent's behavior in the intended way, a property termed controllability. As a result, preference-conditioned agents can score well on standard MORL metrics while being insensitive to the preference input. If the ability to control agents cannot be reliably assessed, the symbolic interface that MORL provides between user intent and agent behavior is broken. Mainstream MORL metrics alone fail to measure the controllability of preference-conditioned agents, motivating a complementary metric specifically designed to that end. We hope the results spur discussion in the community on existing evaluation protocols to consolidate advances in preference adaptation in MORL to larger and more complex problems.
中文摘要 多目标强化学习（MORL）允许用户根据目标的相对重要性表达对结果的偏好，但标准指标无法反映偏好变化是否可靠地以预期方式改变了代理行为，这一特性称为可控性。因此，偏好条件化的代理人在标准MORL指标上得分较高，但对偏好输入不敏感。如果无法可靠评估对智能体的控制能力，MORL在用户意图与智能体行为之间提供的符号接口就被破坏了。主流MORL指标单独无法衡量偏好条件作用剂的可控性，因此专门设计了一个互补指标。我们希望这些结果能激发社区对现有评估方案的讨论，以整合MORL偏好适应的进展，以应对更大更复杂的问题。

Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

揭开深度强化学习的神秘面纱：一个用于可解释开放RAN自动化的神经符号框架

Authors: Jie Lu, Peihao Yan, Pang-Ning Tan, Y. Thomas Hou, Huacheng Zeng
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.10648
Pdf link: https://arxiv.org/pdf/2605.10648
Abstract Open Radio Access Networks (O-RAN) are increasingly adopting data-driven control through Deep Reinforcement Learning (DRL) to optimize complex tasks such as network slicing and mobility management. However, the deployment of DRL in carrier-grade networks is hindered by its inherent opacity and stochastic execution, which limit operator trust, auditability, and safe deployment. Existing explainable AI (XAI) approaches primarily provide post-hoc insights and fail to produce executable, interpretable policies suitable for operational environments. In this paper, we present DeRAN, a neuro-symbolic framework that bridges the gap between DRL performance and operational transparency by distilling black-box DRL policies into human-readable symbolic representations. DeRAN introduces a concept-driven abstraction layer that transforms high-dimensional network telemetry into a compact set of semantically meaningful features, enabling interpretable policy learning. Building on the semantically grounded concepts, DeRAN synthesizes symbolic policies using deep symbolic regression (DSR) for continuous control and neurally guided differentiable logic (NUDGE) for discrete decision-making. We implement DeRAN on a live 5G O-RAN testbed and evaluate it on two representative use cases. Experimental results demonstrate that DeRAN achieves 78% and 87% of DRL's cumulative rewards in the two use cases, while offering interpretability and auditability by design. Source code is available at this https URL.
中文摘要 开放无线接入网络（O-RAN）正日益采用基于深度强化学习（DRL）的数据驱动控制，以优化网络切片和移动性管理等复杂任务。然而，DRL在运营商级网络中的部署受到其固有的不透明性和随机执行的阻碍，这限制了运营商的信任、可审计性和安全部署。现有的可解释人工智能（XAI）方法主要提供事后洞察，未能生成适合运营环境的可执行、可解释的策略。本文介绍了DeRAN，一种神经符号框架，通过将黑箱DRL策略蒸馏为人类可读的符号表示，弥合DRL性能与运营透明度之间的差距。DeRAN引入了概念驱动的抽象层，将高维网络遥测转化为一组紧凑且具有语义意义的特征，从而实现可解释的策略学习。基于这些具有语义基础的概念，DeRAN使用深度符号回归（DSR）为连续控制合成符号策略，并使用神经引导的可微逻辑（NUDGE）为离散决策合成符号策略。我们在一个实时的5G O-RAN测试平台上实现了DeRAN，并在两个具有代表性的用例上进行评估。实验结果显示，DeRAN在这两个用例中分别达到了DRL累计奖励的78%和87%，同时在设计上提供了可解释性和可审计性。源代码可在此 https URL 获取。

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

Evolving-RL：代理内体验驱动自我演化能力的端到端优化

Authors: Zhiyuan Fan, Wenwei Jin, Feng Zhang, Bin Li, Yihong Dong, Yao Hu, Jiawei Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10663
Pdf link: https://arxiv.org/pdf/2605.10663
Abstract Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model's capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs' ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.
中文摘要 以经验为驱动的自我进化代理旨在通过从过去交互中提炼可复用的经验，克服大型语言模型的静态特性，从而在部署时适应新颖任务。这一过程对基础模型在抽象、泛化和上下文学习方面的能力提出了重大要求。然而，大多数现有研究主要关注系统层设计选择，如经验的表现和管理，忽视了底层模型的固有能力。虽然一些近期工作开始通过强化学习优化体验利用阶段，但它们仍未将自我演化视为一个需要共同优化的统一过程。为此，我们提出了Evolving-RL，一个高效的算法框架，共同提升自我进化所需的经验提取和利用能力。具体来说，我们将学习过程聚焦于经验提取和评估，利用评估中得出的两种监督信号分别优化提取器和求解器，从而实现它们协调的共演化。在ALFWorld和Mind2Web上的实验显示，Evolving-RL有效增强了LLM提取和复用经验的能力，导致在非分布任务中显著提升性能（ALFWorld未见任务相比GRPO基线提升高达98.7%，Mind2Web提升35.8%），而这些提升仅通过经验提取和利用的协调共进化才能完全释放。此外，Evolving-RL 本质上是一种体验增强强化学习算法。通过将可重复使用的体验模式直接内化到模型参数中，即使在没有测试时间经验积累的情况下，它在可见和未见任务上都取得了显著的性能提升。

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

自然策略梯度作为双平滑策略迭代：一个贝尔曼-操作员框架

Authors: Phalguni Nanda, Zaiwei Chen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.10671
Pdf link: https://arxiv.org/pdf/2605.10671
Abstract In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
中文摘要 在本研究中，我们证明了强化学习中的核心算法自然策略梯度（natural policy gradient）可以精确表述为策略迭代的一种平滑且平均的形式。具体来说，我们引入了双平滑策略迭代（DSPI），这是一种贝尔曼算子框架，其中每个策略都通过对过去$Q$函数的加权平均应用正则化贪婪步获得。DSPI包括策略迭代、对偶平均策略迭代、自然策略梯度以及更一般的策略对偶平均方法作为特例。仅利用平滑贝尔曼算子的单调性和压缩性，我们证明了DSPI与分布无关的全局几何收敛性。因此，标准自然策略梯度和策略对偶平均在计算$\epsilon$最优策略时的迭代复杂度为$\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$，且无需修改MDP、添加超出更新中固有镜像映射的正则化，或使用自适应的、依赖轨迹的步长。对于非正则化的贪婪情况，即对偶平均策略迭代，我们还证明了有限步终止。同样的贝尔曼算子框架还可进一步扩展到带线性函数近似的折现MDP和随机最短路径问题。
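
The link between natural policy gradient and an averaged, smoothed greedy step can be seen in the well-known tabular softmax case, sketched below. This is not the paper's DSPI algorithm: it only checks numerically that, starting from a uniform policy, the multiplicative update pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s, a)) coincides with a softmax over the running sum of past Q-functions. The two-state MDP (`P`, `R`, `GAMMA`) and the step size `eta` are hypothetical.

```python
import math
from typing import List

GAMMA = 0.9
# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 0.1], [0.5, 1.0]]
N_S, N_A = 2, 2

Policy = List[List[float]]


def evaluate_q(policy: Policy, iters: int = 2000) -> List[List[float]]:
    """Iterative policy evaluation of Q^pi (converges since GAMMA < 1)."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        v = [sum(policy[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
    return q


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]


def npg_multiplicative(eta: float, steps: int) -> Policy:
    """Tabular softmax NPG written as a multiplicative-weights update."""
    policy = [[1.0 / N_A] * N_A for _ in range(N_S)]
    for _ in range(steps):
        q = evaluate_q(policy)
        policy = [softmax([math.log(policy[s][a]) + eta * q[s][a] for a in range(N_A)])
                  for s in range(N_S)]
    return policy


def smoothed_greedy_on_q_sum(eta: float, steps: int) -> Policy:
    """Same iterates, written as a softmax (smoothed greedy) over summed Q-functions."""
    policy = [[1.0 / N_A] * N_A for _ in range(N_S)]
    q_sum = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(steps):
        q = evaluate_q(policy)
        q_sum = [[q_sum[s][a] + q[s][a] for a in range(N_A)] for s in range(N_S)]
        policy = [softmax([eta * q_sum[s][a] for a in range(N_A)]) for s in range(N_S)]
    return policy


if __name__ == "__main__":
    print(npg_multiplicative(eta=1.0, steps=10))
    print(smoothed_greedy_on_q_sum(eta=1.0, steps=10))
```

Both functions print the same policy up to floating-point error, since the per-state normalisers cancel when the multiplicative update is unrolled.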

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

XQCfD：利用先前数据和策略加速快速演员-批评算法

Authors: Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, Danica Kragic, Jan Peters
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10734
Pdf link: https://arxiv.org/pdf/2605.10734
Abstract For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy like prior works. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic, and MimicGen benchmarks -- notably with a low update-to-data ratio and no ensemble networks.
中文摘要 对于现实世界中的强化学习来说，在线探索成本高昂。机器人强化学习中的一种常见做法是加入额外数据以提高样本效率。专家演示数据对于解决奖励稀疏的困难探索任务往往至关重要。虽然先验数据已被用于扩充经验和预训练模型，但我们表明，由于未能有效利用预训练策略，现有算法的设计无法达到该设定下可能实现的样本效率。我们提出了XQCfD，它扩展了样本高效的XQC演员-评论家算法，通过增强的回放缓冲区、预训练策略和平稳策略架构从演示中学习，这些设计旨在避免像以往工作那样迅速遗忘强初始策略。我们表明，由于熵更高的预测，我们的平稳网络架构比标准网络架构更能在分布外实现策略改进。XQCfD在流行的Adroit、Robomimic和MimicGen基准中一系列奖励稀疏的复杂操作任务上取得了最先进的性能，尤其是在更新与数据比例较低且不使用集成网络的情况下。

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

数学文本续写的似然评分：带有捷径漏洞测试的自监督基准

Authors: Daniel Ranard
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10810
Pdf link: https://arxiv.org/pdf/2605.10810
Abstract We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.
中文摘要 我们引入了一个自动生成的基准，用于预测技术论文中的隐藏文本。一篇论文提供可见的上下文$X$和隐藏的续写$Y$；被评估的模型写出一个辅助预测字符串$Z$，另一个独立的评分器分别在以$Z$为条件和不以$Z$为条件的情况下为$Y$赋予下一个标记的概率。这提供了一种无需标签的检验，用于判断$Z$是否传递了关于续写的信息，并与$Z$为近期上下文而非预测的对照组进行比较。我们的主要测试平台是方程后缀预测：预测器先看到上下文和一个行间方程的前半部分，然后预测其余部分。该任务结合了表层的arXiv/TeX文本建模与对推理敏感的推断；后缀只是众多大致等价的续写之一，因此该基准应从统计意义上解读，而非逐项解读。在来自138篇近期物理和数学论文的1363个方程续写上，GPT-5.5、Opus 4.7和GPT-5.4 nano的预测在Qwen3-8B和Kimi K2.6两种评分器下均相对于上下文对照提升了截断似然，无需人工标签即可区分模型族和推理强度设置。为了模拟$Z$只是进一步引导评分器而非做出有用预测的捷径，我们还在仅含上下文的提示上微调评分器，并将其应用于留出的论文，作为更强的对照。GPT-5.5的预测仍然优于这一微调后的对照；GPT-5.4 nano的预测则不然。较长的散文/TeX续写相对于对照显示出正向但噪声更大的提升，且集中在目标开头附近。这些结果支持将跨模型似然评分作为静态基准，并作为在应用强化学习或模型选择优化之前探查捷径漏洞的设置。
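
The scoring idea in this abstract, comparing the scorer's log-probability of the hidden continuation Y with the forecast Z in context against a control, can be sketched as follows. The abstract does not say how the likelihood is clipped, so the per-token floor used here (`floor=-10.0`) is purely an assumption, as are the function names and the toy numbers.

```python
from typing import Sequence


def mean_clipped_logprob(token_logps: Sequence[float], floor: float = -10.0) -> float:
    """Average per-token log-probability, with each token floored at `floor`.

    Flooring (an assumed clipping rule) keeps one badly predicted token
    from dominating the score of the whole continuation.
    """
    if not token_logps:
        raise ValueError("need at least one token log-probability")
    return sum(max(lp, floor) for lp in token_logps) / len(token_logps)


def forecast_lift(logps_given_forecast: Sequence[float],
                  logps_given_control: Sequence[float],
                  floor: float = -10.0) -> float:
    """Positive when conditioning on Z makes the hidden text Y more likely."""
    return (mean_clipped_logprob(logps_given_forecast, floor)
            - mean_clipped_logprob(logps_given_control, floor))


if __name__ == "__main__":
    # Toy per-token log-probs of Y from a scorer, with and without the forecast.
    with_forecast = [-0.5, -1.0, -0.2, -3.0]
    with_control = [-1.5, -2.0, -0.4, -25.0]
    print("lift (nats/token):", forecast_lift(with_forecast, with_control))
```

In the benchmark's setting, the two lists would come from the same scorer evaluated on the same continuation, so the difference isolates what the forecast string contributes.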

Policy Gradient Methods for Non-Markovian Reinforcement Learning

非马尔可夫强化学习的策略梯度方法

Authors: Avik Kar, Siddharth Chandak, Rahul Singh, Soumitra Sinhahajari, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10816
Pdf link: https://arxiv.org/pdf/2605.10816
Abstract We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
中文摘要 我们研究非马尔可夫决策过程（NMDPs）中强化学习的策略梯度方法，其中观察和奖励依赖于整个交互历史。为应对这种依赖性，代理维护一个内部状态，递归更新以提供过去观察和动作的紧凑总结。与将智能体状态动态视为固定或通过预测目标学习的方法不同，我们提出了一种以奖励为中心的表述，联合优化智能体状态动态和控制策略，以最大化预期的累计奖励。为此，我们考虑了一类代理状态-马尔可夫（ASM）策略，包括代理状态动态和将代理状态映射到动作的控制策略。我们建立了ASM策略的新策略梯度定理，将马尔可夫设定的经典策略梯度结果扩展到情节和无限视野折现NMDP。基于该梯度表达式，我们提出了代理状态-马尔可夫策略梯度（ASMPG）算法，利用代理状态动态的递归结构实现高效优化。我们建立了有限时间且几乎确定的收敛保证，并通过实证证明，在一系列非马尔可夫任务中，ASMPG优于通过预测目标学习状态表示的基线。

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

统一噪声引导，实现高效的人控VLA适配

Authors: Junjie Lu, Xinyao Qin, Yuhua Jiang, Kaixin Wang, Chuheng Zhang, Bin Liang, Jun Yang, Min Xu, Li Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.10821
Pdf link: https://arxiv.org/pdf/2605.10821
Abstract Diffusion-based vision-language-action (VLA) models have emerged as strong priors for robotic manipulation, yet adapting them to real-world distributions remains challenging. In particular, on-robot reinforcement learning (RL) is expensive and time-consuming, so effective adaptation depends on efficient policy improvement within a limited budget of real-world interactions. Noise-space RL lowers the cost by keeping the pretrained VLA fixed as a denoising generator while updating only a lightweight actor that predicts the noise. However, its performance is still limited due to inefficient autonomous exploration. Human corrective interventions can reduce this exploration burden, but they are naturally provided in action space, whereas noise-space finetuning requires supervision over noise variables. To address these challenges, we propose UniSteer, a Unified Noise Steering framework that combines human corrective guidance with noise-space RL through approximate action-to-noise inversion. Given a human corrective action, UniSteer inverts the frozen flow-matching decoder to recover a noise target, which provides supervised guidance for the same noise actor that is simultaneously optimized via reinforcement learning. Real-world experiments on diverse manipulation tasks show that UniSteer adapts more efficiently than strong noise-space RL and action-space human-in-the-loop baselines, improving the success rate from 20% to 90% in 66 minutes on average across four real-world adaptation tasks.
中文摘要 基于扩散的视觉-语言-动作（VLA）模型已成为机器人操作的有力先验，但将其适配到真实世界的数据分布仍具挑战性。特别是，在真实机器人上进行强化学习（RL）成本高且耗时，因此有效的适配依赖于在有限的真实世界交互预算内高效地改进策略。噪声空间强化学习将预训练的VLA固定为去噪生成器，只更新一个预测噪声的轻量级actor，从而降低成本。然而，由于自主探索效率低下，其性能仍然有限。人工纠正干预可以减轻这种探索负担，但它们天然是在动作空间中给出的，而噪声空间微调需要对噪声变量的监督。为应对这些挑战，我们提出了UniSteer，一种统一噪声引导（Unified Noise Steering）框架，通过近似的动作到噪声反演，将人工纠正指导与噪声空间强化学习结合起来。给定一个人工纠正动作，UniSteer对冻结的流匹配解码器进行反演以恢复噪声目标，从而为同一个噪声actor提供监督指导，而该actor同时通过强化学习进行优化。在多种操作任务上的真实世界实验表明，UniSteer比强噪声空间强化学习基线和动作空间人在回路基线适配得更高效，在四个真实世界适配任务上平均用66分钟将成功率从20%提升至90%。
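
To make the idea of approximate action-to-noise inversion concrete, here is a minimal sketch under loud assumptions. It is not UniSteer's implementation: the abstract does not say how the inversion is computed, so this toy uses a one-dimensional, hand-written velocity field (`velocity`), Euler integration for decoding, and a reversed Euler pass for inversion. The names `velocity`, `decode`, `invert`, and `cond` are all hypothetical.

```python
def velocity(x, t, cond):
    """Toy stand-in (assumption) for a frozen flow-matching decoder's velocity field."""
    return cond - x


def decode(noise, cond, steps=100):
    """Integrate the flow forward from t=0 to t=1 with Euler steps: noise -> action."""
    x, h = noise, 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h, cond)
    return x


def invert(action, cond, steps=100):
    """Approximate inversion: run the same Euler steps backward from t=1 to t=0."""
    x, h = action, 1.0 / steps
    for k in range(steps, 0, -1):
        x = x - h * velocity(x, k * h, cond)
    return x


noise = 1.5                               # noise a hypothetical actor proposed
action = decode(noise, cond=0.2)          # action the frozen decoder produces
recovered = invert(action, cond=0.2)      # noise target recovered from the action
print(abs(recovered - noise))             # small but nonzero: inversion is approximate
```

In a setup like the one the abstract describes, the input to `invert` would be a human corrective action, and the recovered value would serve as a supervised target for the noise actor. The reversed Euler pass is only one possible choice, and its error shrinks as `steps` grows.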

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

迈向面向视觉原生多模态深度搜索智能体的在策略（on-policy）数据演化

Authors: Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.10832
Pdf link: https://arxiv.org/pdf/2605.10832
Abstract Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
中文摘要 多模态深度搜索要求智能体在不断演变的文本和视觉上下文上，将搜索、工具使用和视觉推理串联起来，以解决开放世界问题。两个瓶颈限制了当前的系统。首先，现有的工具使用框架（harness）将搜索、浏览或变换返回的图像视为临时输出，因此中间的视觉证据无法被后续工具再次使用。其次，训练数据通常由固定的构建流程生成，无法跟上目标智能体不断演变的能力。为应对这些挑战，我们首先引入了以图像库引用协议为核心的视觉原生智能体框架，它将每张工具返回的图像注册为可寻址的引用，使中间的视觉证据可被后续工具重用。在该框架之上，在策略数据演化（On-policy Data Evolution，ODE）运行一个闭环数据生成器，它根据正在训练的策略的rollout逐轮自我改进。这种逐轮改进使每一轮的数据都针对当前策略仍需学习的内容。同一框架既支持多样化的监督微调数据构建，也支持策略感知的强化学习数据构建，覆盖目标智能体的完整训练生命周期。在8个多模态深度搜索基准上，ODE将Qwen3-VL-8B智能体的平均得分从24.9%提升至39.0%，超过了标准智能体工作流设置下的Gemini-2.5 Pro（37.9%）。在30B规模上，ODE将平均得分从30.6%提升到41.5%。进一步的分析验证了图像库重用的有效性，尤其是在需要迭代式视觉细化的复杂任务上；同时，与静态合成相比，基于rollout反馈的演化能产生更有依据的SFT轨迹和与策略更匹配的强化学习任务。

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

BenchCAD：程序化CAD的全面、行业标准基准

Authors: Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.10865
Pdf link: https://arxiv.org/pdf/2605.10865
Abstract Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
中文摘要 工业计算机辅助设计（CAD）代码生成要求模型根据视觉或文本输入生成可执行的参数化程序。除了识别零件的外形外，这项任务还需要理解其三维结构、推断工程参数，并选择能反映零件设计和制造方式的CAD操作。尽管多模态大语言模型（MLLM）在此任务上前景可期，但很少有评估检验这些能力在真实的工业CAD场景中是否同时成立。我们提出BenchCAD，一个面向工业CAD推理的统一基准。BenchCAD包含17,900个经过执行验证的CadQuery程序，涵盖106个工业零件系列，包括锥齿轮、压缩弹簧、麻花钻及其他可复用的工程设计。它通过视觉问答、代码问答、图像到代码生成和指令引导的代码编辑来评估模型，从而支持对感知、参数化抽象和可执行程序合成的细粒度分析。在10多个前沿模型上，BenchCAD显示当前系统往往能恢复粗略的外部几何形状，但无法生成忠实的参数化CAD程序。常见的失败包括遗漏精细的三维结构、误解工业设计参数，以及用更简单的“草图加拉伸”模式替代扫掠（sweep）、放样（loft）和扭转拉伸（twist-extrude）等关键操作。微调和强化学习能提升分布内性能，但对未见过的零件系列的泛化仍然有限。这些结果使BenchCAD成为衡量和提升多模态CAD自动化工业就绪度的基准。

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

RubricEM：超越可验证奖励、基于评分标准引导策略分解的元强化学习

Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.10899
Pdf link: https://arxiv.org/pdf/2605.10899
Abstract Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
中文摘要 训练深度研究智能体，即能够规划、搜索、评估证据并撰写长篇报告的系统，将强化学习推向了可验证奖励范围之外。它们的输出没有标准答案，轨迹跨越许多工具增强的决策，而标准的后训练几乎没有提供将过去的尝试转化为可复用经验的机制。在本工作中，我们主张评分标准（rubric）不应仅作为最终答案的评估器，而应作为组织策略执行、评判反馈和智能体记忆的共享接口。基于这一观点，我们提出了RubricEM，一个评分标准引导的强化学习框架，它将分阶段的策略分解与基于反思的元策略演化结合起来。RubricEM首先让规划、证据收集、审阅和综合都以自生成的评分标准为条件，从而使研究轨迹具有阶段感知能力。随后，它通过阶段结构化GRPO（Stage-Structured GRPO）进行信用分配，利用分阶段的评分标准评判，为长时程优化提供更密集的语义反馈。与此同时，RubricEM训练一个共享主干的反思元策略，将经过评判的轨迹提炼为可复用的、基于评分标准的指导，供未来的尝试使用。最终得到的RubricEM-8B在四个长篇研究基准上表现出色，优于同类开放模型，并接近专有的深度研究系统。除了最终性能之外，我们还进行了详尽的分析，以理解RubricEM的关键要素。

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

面向智能体强化学习的动态技能生命周期管理

Authors: Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.10923
Pdf link: https://arxiv.org/pdf/2605.10923
Abstract Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
中文摘要 大语言模型智能体越来越依赖外部技能来解决复杂任务，技能作为模块化单元，将其能力扩展到仅靠参数化记忆所能支持的范围之外。现有方法假设外部技能要么作为持久的指导不断积累，要么被内化到策略中，最终实现零技能推理。我们认为这一假设过于严格，因为在参数容量有限、各技能边际贡献不均的情况下，最优的活跃技能集是非单调的，并且依赖于任务和阶段。在本工作中，我们提出了SLIM，一个面向智能体强化学习（RL）的动态技能生命周期管理（Skill LIfecycle Management）框架，它将活跃的外部技能集视为与策略学习联合更新的动态优化变量。具体来说，SLIM通过留一技能（leave-one-skill-out）验证来估计每个活跃技能的边际外部贡献，然后执行三种生命周期操作：保留高价值技能；淘汰在充分使用后贡献变得可以忽略的技能；以及在持续的失败暴露出能力覆盖缺失时扩展技能库。实验表明，SLIM在ALFWorld和SearchQA上平均比最佳基线高出7.1个百分点。结果进一步表明，策略学习与外部技能保留并不互斥：部分技能被吸收进策略，另一些技能则继续提供外部价值，这支持将SLIM作为基于技能的智能体强化学习的一种更通用的范式。
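
The leave-one-skill-out estimate mentioned in this abstract can be illustrated with a small sketch. This is an assumption-laden toy, not SLIM's code: `evaluate` stands for any validation routine that scores a skill set, `toy_evaluate` and the skill names are made up, and the `retire_below` threshold is hypothetical. The "sufficient exposure" condition and the skill-bank expansion step are not modeled.

```python
def marginal_contributions(active_skills, evaluate):
    """Score drop when each skill is removed from the active set, one at a time."""
    full = frozenset(active_skills)
    full_score = evaluate(full)
    return {s: full_score - evaluate(full - {s}) for s in full}


def lifecycle_step(active_skills, evaluate, retire_below=0.01):
    """Retain skills whose contribution meets the threshold; retire the rest."""
    contrib = marginal_contributions(active_skills, evaluate)
    retained = {s for s, c in contrib.items() if c >= retire_below}
    retired = set(active_skills) - retained
    return retained, retired, contrib


def toy_evaluate(skills):
    """Made-up validation score: only 'open_drawer' still helps the policy."""
    return 0.5 + (0.25 if "open_drawer" in skills else 0.0)


retained, retired, contrib = lifecycle_step(
    {"open_drawer", "rewrite_query"}, toy_evaluate
)
print(retained, retired)
```

Here `rewrite_query` is retired because removing it does not change the toy validation score, which mimics a skill the policy no longer needs as external guidance.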

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

基于超线性优势塑形的文本到图像模型 Power 强化后训练

Authors: Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.10937
Pdf link: https://arxiv.org/pdf/2605.10937
Abstract Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
中文摘要 近来，基于强化学习的后训练方法，尤其是群体相对策略优化（Group Relative Policy Optimization，GRPO），已成为进一步提升文本到图像（T2I）模型的稳健范式。然而，这些方法往往容易出现奖励黑客（reward hacking）问题，即模型利用不完美的奖励函数中的偏差，而不是带来真正的性能提升。在本工作中，我们发现归一化可能导致校准失准，而直接去除提示级标准差项会得到一个关于优势呈线性的最优策略上升方向，但这仍然限制了真实信号与噪声的分离。为缓解上述问题，我们从信息几何的角度重新审视函数式更新，提出了超线性优势塑形（SLAS）。通过为Fisher-Rao信息度量引入依赖于优势的加权，SLAS引入了一种非线性几何结构，重塑了局部策略空间。该设计放宽了高优势方向上的约束以放大有信息量的更新，同时收紧低优势区域的约束以抑制虚假梯度。此外，还采用批次级归一化，以在不同奖励尺度下稳定训练。大量评估表明，SLAS在多个主干模型和基准上始终优于DanceGRPO基线。特别是，它带来了更快的训练动态，提升了在GenEval和UniGenBench++上的域外性能，增强了对模型规模扩展的鲁棒性，同时缓解了奖励黑客问题，并保持了生成结果的语义和组合保真度。
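
As a rough illustration of the normalization issue this abstract raises, the sketch below contrasts the usual GRPO advantage (centered per prompt and divided by the per-prompt standard deviation) with a variant that drops the per-prompt standard deviation and rescales by a single batch-level standard deviation. It does not reproduce SLAS: the super-linear shaping and the Fisher-Rao weighting are omitted, and the function names, the use of the population standard deviation, and the toy rewards are assumptions for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(groups, eps=1e-8):
    """Per-prompt centering and per-prompt std normalization (standard GRPO form)."""
    out = []
    for rewards in groups:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out


def batch_scaled_advantages(groups, eps=1e-8):
    """Hypothetical variant: center per prompt, then divide by one batch-level std."""
    centered = [[r - mean(rewards) for r in rewards] for rewards in groups]
    sigma = pstdev([c for group in centered for c in group])
    return [[c / (sigma + eps) for c in group] for group in centered]


# Made-up rewards: prompt 0 has clear winners, prompt 1 differs only by noise.
rewards = [[0.1, 0.9, 0.2, 0.8],
           [0.50, 0.51, 0.49, 0.50]]
per_prompt = grpo_advantages(rewards)
batch = batch_scaled_advantages(rewards)
print(max(abs(a) for a in per_prompt[1]), max(abs(a) for a in batch[1]))
```

With per-prompt normalization, the near-noise prompt ends up with advantages as large as those of the informative prompt. With batch-level scaling, its advantages stay small, which is one way to read the miscalibration concern.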

Keyword: diffusion policy

HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

HeteroGenManip：面向异构物体交互的可泛化操作

Authors: Zhenhao Shen, Zeming Yang, Yue Chen, Yuran Wang, Shengqiang Xu, Mingleyang Li, Hao Dong, Ruihai Wu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10201
Pdf link: https://arxiv.org/pdf/2605.10201
Abstract Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple initial grasp from complex interaction execution. First, Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly-variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks with broad type setting, alongside a 36.7% gain across four real-world tasks with different interaction types.
中文摘要 涉及跨类型物体交互的可泛化操作是机器人学中一项关键但具有挑战性的能力。为了可靠地完成此类任务，机器人必须解决两个基本挑战：“在哪里操作”（接触点定位）和“如何操作”（后续交互轨迹规划）。现有基于基础模型的方法通常采用端到端学习，模糊了这两个阶段的区别，加剧了长时程任务中的误差累积。此外，它们通常依赖单一的统一模型，无法捕捉异构物体所需的多样化、类别特有的特征。为克服这些限制，我们提出了HeteroGenManip，一个任务条件化的两阶段框架，旨在将初始抓取与复杂的交互执行解耦。首先，基础模型对应引导抓取（Foundation-Correspondence-Guided Grasp）模块利用结构先验对齐初始接触状态，从而显著降低抓取的位姿不确定性。随后，多基础模型扩散策略（MFMDP）将物体路由到类别专用的基础模型，通过双流交叉注意力机制将细粒度几何信息与高度可变的部件特征融合。实验评估表明，HeteroGenManip实现了稳健的类别内形状和位姿泛化。该框架在类型设置广泛的仿真任务中平均性能提升31%，并在四个具有不同交互类型的真实世界任务中提升36.7%。