Arxiv Papers of Today

Generated: 2026-05-13 18:32:09 (UTC+8); arXiv release time: 2026-05-13 20:00 EDT (2026-05-14 08:00 UTC+8)

72 relevant papers today

Keyword: reinforcement learning

$ξ$-DPO: Direct Preference Optimization via Ratio Reward Margin

$ξ$-DPO：通过比率奖励边际进行直接偏好优化

Authors: Zhengyuan Fan, Zhonghua Wu, Yuxuan Du, Qun Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.10981
Pdf link: https://arxiv.org/pdf/2605.10981
Abstract Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $\beta$ implicitly controls sample filtering, while the effect of $\gamma$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected responses, which effectively cancels the effect of $\beta$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $\xi$. Unlike the margin $\gamma$ in SimPO, $\xi$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
中文摘要 无参考模型的偏好优化已成为人类反馈强化学习的高效替代方案，简单偏好优化（SimPO）通过简单目标消除显式参考模型展现了强劲的性能。然而，SimPO 中超参数 $\beta$ 和 $\gamma$ 的联合调优仍是一个核心挑战。我们认为，这一困难源于SimPO中的margin表述在不同奖励差距结构的数据集间难以解释。为更好地理解这一问题，我们对SimPO进行了全面分析，发现$\beta$隐式控制样本过滤，而$\gamma$的影响取决于数据集的奖励差距结构。基于这些观察，我们提出了$\xi$-DPO：通过比率奖励边际进行直接偏好优化。我们首先通过等效变换重新表述偏好目标，将优化目标从最大化奖励差距的似然转变为最小化奖励差距与最优边际之间的距离。然后，我们将奖励重新定义为被选回答与被拒绝回答之间的比率形式，这实际上抵消了$\beta$的影响，产生了有界且可解释的边际。这个边际称为比率奖励边际，记作$\xi$。与SimPO中的$\gamma$不同，$\xi$明确表示被选回答与被拒绝回答之间的期望相对间隔，并且可以从初始奖励差距分布中确定，避免了反复试错的调整。....
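As a point of reference for the $\beta$ and $\gamma$ discussion above, the following is a minimal sketch of the SimPO per-pair loss in its commonly published form (length-normalized log-likelihood rewards, a target margin $\gamma$, and a logistic loss). The function name, argument names, and default values are illustrative assumptions. The $\xi$-DPO objective itself is not specified in the abstract and is not reproduced here.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths in tokens (rewards are length-normalized).
    beta, gamma: illustrative values only, not taken from the paper.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    gap = reward_chosen - reward_rejected
    # -log(sigmoid(gap - gamma)) equals softplus(gamma - gap);
    # the form below avoids overflow for large |z|.
    z = gamma - gap
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the reward gap equals $\gamma$ exactly, the loss is $\log 2$. Because $\beta$ scales the gap while $\gamma$ shifts it, the two interact, which is the tuning difficulty the abstract describes.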

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

TMPO：轨迹匹配策略优化，实现多样化且高效的扩散对齐

Authors: Jiaming Li, Chenyu Zhu, Zhiyuan Ma, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.10983
Pdf link: https://arxiv.org/pdf/2605.10983
Abstract Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
中文摘要 强化学习（RL）在将扩散模型与下游任务对齐方面展现出非凡潜力，但大多数模型仍存在严重的奖励黑客问题，这种行为通过引发视觉模式崩溃和放大不可靠奖励，降低了生成多样性和质量。我们认定根本原因在于这些方法的模式寻求特性，它们最大化预期奖励，同时不有效限制可接受轨迹上的概率分布，导致集中在少数高回报路径上。相比之下，我们提出了轨迹匹配策略优化（TMPO），它用轨迹级奖励分布匹配替代标量奖励最大化。具体来说，TMPO引入了软最大轨迹平衡（Softmax-TB）目标，以匹配K条轨迹的政策概率与奖励诱导的玻尔兹曼分布。我们证明该目标继承了前向KL发散的模式覆盖特性，保持对所有可接受轨迹的覆盖，同时优化奖励。为了进一步缩短大规模流量匹配模型上的多轨迹训练时间，TMPO采用了动态随机树采样技术，轨迹在动态调度的步骤共享去噪前缀和分支，减少冗余计算同时提升训练效果。在人类偏好、合成生成和文本渲染等多样化对齐任务中的广泛结果表明，TMPO在生成多样性方面相较于最先进方法提升了9.1%，并在所有下游和效率指标上均实现竞争性能，实现了奖励与多样性之间的最佳权衡。
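The idea of matching the policy's probabilities over K trajectories to a reward-induced Boltzmann distribution can be sketched as below. This is an assumption-laden illustration, not the paper's Softmax-TB objective: the temperature parameter, the normalization of policy log-probabilities within the group, and the use of a plain forward KL are all choices made here for clarity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_matching_loss(traj_logps, rewards, temperature=1.0):
    """Forward KL(target || policy) over one group of K trajectories.

    traj_logps:  policy log-probability of each sampled trajectory.
    rewards:     scalar reward of each trajectory.
    temperature: hypothetical Boltzmann temperature (not from the paper).
    """
    target = softmax([r / temperature for r in rewards])
    policy = softmax(traj_logps)  # renormalized within the group
    return sum(t * (math.log(t) - math.log(p))
               for t, p in zip(target, policy) if t > 0.0)
```

The loss is zero when the group-normalized policy probabilities already follow the Boltzmann target, and it penalizes a policy that puts too little mass on any trajectory the target supports, which is the mode-covering behavior the abstract attributes to forward KL.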

ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

ACSAC：自适应块大小演员-批评者，配合因果变换器Q-Network。

Authors: Qian Chen, Junqiao Zhao, Hongtu Zhou, Hang Yu, Yanping Zhao, Chen Ye, Guang Chen
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11009
Pdf link: https://arxiv.org/pdf/2605.11009
Abstract Long-horizon, sparse-reward tasks pose a fundamental challenge for reinforcement learning, since single-step TD learning suffers from bootstrapping error accumulation across successive Bellman updates. Actor-critic methods with action chunking address this by operating over temporally extended actions, which reduce the effective horizon, enable fast value backups, and support temporally consistent exploration. However, existing methods rely on a fixed chunk size and therefore cannot adaptively balance reactivity against temporal consistency. A large fixed chunk size reduces responsiveness to new observations, while a small one produces incoherent motions, forcing task-specific tuning of the chunk size. To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC). ACSAC leverages a causal Transformer critic to evaluate expected returns for action chunks of different sizes. At each chunk boundary, it adaptively selects the chunk size that maximizes the expected return, supporting flexible, state-dependent chunk sizes without task-specific tuning. We prove that the ACSAC Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. Experiments on OGBench demonstrate that ACSAC achieves state-of-the-art performance on long-horizon, sparse-reward manipulation tasks across both offline RL and offline-to-online RL settings.
中文摘要 长视野、奖励稀疏的任务对强化学习构成根本挑战，因为单步TD学习在连续Bellman更新中存在自起错误累积的问题。具有动作分块的演员-批判者方法通过在时间扩展的动作上运行来解决这个问题，从而缩短有效视野，实现快速的价值备份，并支持时间一致的探索。然而，现有方法依赖固定的块大小，因此无法适应性地平衡反应性与时间一致性。固定块大小较大会降低对新观测的响应速度，而较小块则产生不相干运动，迫使根据任务调整块大小。为解决这一限制，我们提出了自适应块大小演员-批判者（ACSAC）。ACSAC利用因果变压器批评者来评估不同大小动作块的预期收益。在每个区块边界处，它自适应地选择最大化预期回报的区块大小，支持灵活且依赖状态的区块大小，无需针对特定任务进行调整。我们证明了ACSAC Bellman算子是一个收缩，其唯一不动点是自适应策略的动作值函数。OGBench上的实验表明，ACSAC在离线强化学习和离线到在线强化学习设置下，在长期、稀疏奖励的操作任务中都能实现最先进的性能。
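The selection rule described above (at each chunk boundary, pick the chunk size with the highest estimated return) can be sketched as follows. The `critic(t, size)` callable is a hypothetical stand-in for the causal Transformer critic, and the candidate sizes are assumed values. The sketch only shows the control flow, not the learning procedure.

```python
def select_chunk_size(q_values):
    """Return the chunk size whose estimated return is largest.

    q_values: dict mapping candidate chunk size -> critic estimate.
    """
    return max(q_values, key=q_values.get)


def plan_boundaries(horizon, critic, candidate_sizes=(1, 2, 4, 8)):
    """Walk an episode, re-selecting the chunk size at every chunk boundary.

    critic(t, size) -> float is a hypothetical interface.
    candidate_sizes must include 1 so that some size always fits.
    """
    t, chosen = 0, []
    while t < horizon:
        # Only consider chunk sizes that fit in the remaining steps.
        feasible = [k for k in candidate_sizes if t + k <= horizon]
        q_values = {k: critic(t, k) for k in feasible}
        k = select_chunk_size(q_values)
        chosen.append((t, k))
        t += k
    return chosen
```

For example, with a toy critic that prefers long chunks early and short chunks late, `plan_boundaries(10, lambda t, k: k if t < 8 else -k)` starts with one chunk of size 8 and finishes with two chunks of size 1.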

Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

通过变分后验指导进行高效LLM推理，兼具效率意识

Authors: Zizhao Chen, Yuying Li, Siting Lin, Lianxi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11019
Pdf link: https://arxiv.org/pdf/2605.11019
Abstract Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution, thus capable of breaking through the sampling bottleneck of high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior's efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric epsilon cubed by 8.73% and 12.37% over the strongest baselines on each model size, respectively.
中文摘要 尽管大型语言模型依赖思维链进行复杂推理，但过度思考现象严重降低了推理效率。现有的强化学习方法通过设计复杂的奖励函数来压缩推理链，这使得高质量样本在探索空间中极为稀疏，并为先前策略带来了抽样瓶颈。受认知科学启发，我们理论上证明，以参考答案为导引的后验分布比前置分布获得更高的期望效用，从而能够突破高质量样本的抽样瓶颈。然而，后验分布在推断过程中不可得。为此，我们将高效推理形式化为变分推断问题，并引入一个效率感知证据的下界作为理论基础。基于此，我们提出了VPG-EA框架。它采用参数共享的双流架构，以实例化后验分布和前置策略;通过交叉视角评估过滤掉伪高效路径后，它通过变分蒸馏单向将后验的有效模式转移到先验策略。在DeepSeek-R1-蒸馏-Qwen-1.5B和7B尺度上的实验表明，VPG-EA在各模型尺寸的最强基线上分别提升了整整效率指标（epsilon的3次方）8.73%和12.37%。

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

信任区域逆向强化学习：利用本地策略更新实现显式双向上升

Authors: Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11020
Pdf link: https://arxiv.org/pdf/2605.11020
Abstract Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL: one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
中文摘要 逆强化学习（IRL）通常表述为在匹配专家轨迹分布的情况下最大化熵。经典（双攀）实地学习保证单调性能提升，但每次迭代都需要完整解决一个强化学习问题以计算对偶梯度。较新的对抗方法通过直接优化原始问题并使用判别子提供奖励，从而避免了这种代价，但牺牲了稳定性和单调对偶改进。在本研究中，我们通过实现奖励函数和策略的单调改进，弥合了这些方法之间的差距，而无需在每次迭代中都完全解决强化学习问题。我们的关键理论见解是，对于奖励函数更新的信任区域最优策略，对于同一方向较小的更新，可以实现全局最优。这次较小的更新允许我们明确优化对偶目标，同时仅依赖当前策略的局部搜索。通过这样做，我们的方法避免了对抗性方法的训练不稳定性，实现单调性能提升，并学习到传统现实生活中的奖励函数——一个可以全局优化以匹配专家演示的函数。我们提出的算法——信任区域逆强化学习（TRIRL），在多项具有挑战性的任务中，其总计四分位数平均值比最先进的模仿学习方法高出2.4倍，同时还能恢复可推广到系统动力学变化的奖励函数。

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ：通过自我监督行动排名实现的离线到在线强化学习

Authors: Andrew Choi, Wei Xu
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11151
Pdf link: https://arxiv.org/pdf/2605.11151
Abstract Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 84.7% relative to the VLA's initial performance.
中文摘要 离线到在线强化学习（RL）通过利用在线互动前的预先收集数据集，提高样本效率。然而，一个关键挑战是在数据集覆盖有限的大型州级行动空间中，学会准确的批评者。为减少因价值高估带来的有害更新，以往方法通过降低离区（OOD）动作相对于数据集动作的权重来施加悲观主义。虽然有效，但这本质上充当行为克隆锚点，当数据集操作不优时，可能阻碍后续在线策略的改进。我们提出了RankQ，一种离线到在线的Q学习目标，通过自监督多项排名损失来增强时间差分学习，以强制执行结构化的动作顺序。通过学习相对动作偏好，而不是统一惩罚未见的动作，RankQ塑造了Q函数，使动作梯度朝向更高质量的行为。在稀疏奖励的D4RL基准测试中，RankQ的性能与七种之前的方法相当甚至更优。在基于视觉的机器人学习中，RankQ能够在低数据环境中有效地从离线到在线微调预训练视觉-语言-动作（VLA）模型，平均比次优方法高出42.7%的模拟成功率。在高数据环境下，RankQ比次优方法提升模拟性能13.7%，并实现强大的模拟到实物传输，将真实立方体堆叠成功率从43.1%提升至84.7%，相较于VLA初始性能。

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

Bellman兼容平均奖励分布强化学习的商类别表示

Authors: Ege C. Kaya, Aliasghar Pourghani, Vijay Gupta, Abolfazl Hashemi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.11289
Pdf link: https://arxiv.org/pdf/2605.11289
Abstract Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.
中文摘要 平均奖励强化学习需要估算增益和偏差，而偏差仅定义到一个加法常数。这使得直接分布类比在实数线上显得错位。我们引入商空间表述，其中状态指标偏置律被识别到直到共同平移，并结合了尊重该对称性的范畴参数化。在这个商范畴空间上，我们定义了一个投影平均-奖励分布算子，并证明它是良定义的，在坐标克拉梅度量下非扩张的，并且具有不动点。随后我们研究了其均值场映射为该算符异步松弛的采样递归。在理想化的中心奖励条件下，单态时间差分更新在独立与内因子和马尔可夫采样下几乎必然收敛且有限迭代剩余界限。当增益未知时，我们用在线增益估计器补充递归，并证明耦合方案的非扩张性和马尔可夫收敛性。最后，我们证明了同步精确更新在商定律层面上与增益无关，分离出理想商分布与实际固定网格类别表示之间的结构性对比。

Epistemic Uncertainty for Test-Time Discovery

测试时间发现的认识不确定性

Authors: Kainat Riaz, Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Ayesha Mohsin, Aqib Riaz, Ali Subhan, John M. Cioffi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11328
Pdf link: https://arxiv.org/pdf/2605.11328
Abstract Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks, maintains substantially higher solution diversity, and an ablation study confirms that the regularizer is essential for sustaining this behavior.
中文摘要 利用大型语言模型进行自动化科学发现，依赖于识别真正新颖的解决方案。标准强化学习惩罚高方差突变，导致政策优先考虑熟悉的模式。因此，最大奖励会停滞不前，而平均奖励却在增加。克服这一限制需要一个信号，能够区分未探索区域与本质上复杂的问题。这需要在独立调整的权重假设之间测量分歧，而非依赖单一网络的置信度。UG-TTT通过在冻结的基础模型上维护一小组低阶适配器来解决这一挑战。每个代币的分歧，以集成预测与权重假设之间的互信息量化，隔离了认识论不确定性，并识别出覆盖不足导致适配器发散而非内在问题难度的处境。该措施作为探索奖励纳入政策梯度，将政策引导至持续存在适配器分歧、培训覆盖率较低的领域，这正是真正发现可能的前沿。核范数正则器确保适配器彼此独立，从而在整个训练过程中保持探索信号。在四项科学发现基准中，UG-TTT提高了三项任务的最大奖励，显著提高溶液多样性，消融研究证实正则化剂对维持这种行为至关重要。
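The per-token disagreement signal described above, mutual information between the ensemble's prediction and the choice of ensemble member, has a standard closed form: the entropy of the averaged prediction minus the average entropy of each member's prediction. The sketch below computes it for one token position, assuming uniform weighting over adapters. The function names are illustrative, and how UG-TTT scales or clips this bonus is not stated in the abstract.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def epistemic_uncertainty(member_probs):
    """Mutual information I(token; member) for one position.

    member_probs: one next-token distribution per ensemble member
                  (for example, per low-rank adapter), each summing to 1.
    Returns H(mean_i p_i) - mean_i H(p_i), which is >= 0.
    """
    n = len(member_probs)
    vocab = len(member_probs[0])
    mean_p = [sum(p[v] for p in member_probs) / n for v in range(vocab)]
    return entropy(mean_p) - sum(entropy(p) for p in member_probs) / n
```

If every member predicts the same distribution, the value is zero even when that distribution is high-entropy, so intrinsically hard positions are not rewarded. The value is large only when members are individually confident but disagree with each other.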

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

gym-invmgmt：库存管理方法的开放基准测试框架

Authors: Reza Barati, Qinmin Vivian Hu
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2605.11355
Pdf link: https://arxiv.org/pdf/2605.11355
Abstract Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility constraints, shortage treatment, and Key Performance Indicator (KPI) definitions can change method rankings. We present gym-invmgmt, a Gymnasium-compatible extension of the OR-Gym inventory-management lineage for auditable cross-paradigm evaluation. The benchmark evaluates optimization, heuristic, and learned controllers under a shared CoreEnv transition, reward, action-bound, and KPI contract, while varying stress conditions through a 22-scenario core grid plus four supplemental MARL-mode rows. Within these released scenarios, informed stochastic programming provides the strongest non-oracle reference, reflecting the value of scenario hedging under forecast access, but at substantially higher online computational cost. Among learned controllers, the Proximal Policy Optimization Transformer variant (PPO-Transformer) achieves the strongest learned-policy quality at fast inference, while Residual Reinforcement Learning (Residual RL) provides competitive hybrid performance. The graph neural network variant (PPO-GNN) is highly competitive on the default divergent topology but less robust on the serial topology. Imitation learning performs well in stationary regimes but degrades under demand shift, and the bounded Large Language Model (LLM) policy-parameter baseline is best interpreted as a diagnostic controller rather than an autonomous inventory optimizer. Overall, the benchmark identifies scenario-conditioned leaders while showing that performance depends jointly on information access, demand shift, topology, and policy representation.
中文摘要 库存与保单的比较往往难以解读，因为绩效不仅取决于保单本身，还取决于评估合同。拓扑结构、需求机制、信息获取、可行性约束、短缺处理以及关键绩效指标（KPI）定义的差异都可能改变方法排名。我们介绍Gym-invmgmt，这是OR-Gym库存管理谱系的兼容扩展，用于可审计的跨范式评估。该基准测试在共享的CoreEnv转换、奖励、动作绑定和KPI合约下评估优化、启发式和学习控制点，同时通过22个场景核心网格及四行补充MARL模式行变化应力条件。在这些已发布的情景中，知情随机规划提供了最强的非预言机参考，反映了情景对冲在预测访问下的价值，但在线计算成本显著更高。在学习型控制器中，近端策略优化变换器（PPO-Transformer）在快速推断下实现了最强的学习策略质量，而残差强化学习（Residual Reinforcement Learning，残差RL）则提供了具有竞争力的混合性能。图神经网络变体（PPO-GNN）在默认发散拓扑上竞争激烈，但在串行拓扑上稳健度较低。模仿学习在平稳环境中表现良好，但在需求转移时表现下降，而有界大型语言模型（LLM）的策略参数基线最好被理解为诊断控制器，而非自主库存优化器。总体而言，基准识别了情景条件下的领导者，同时表明绩效共同依赖于信息获取、需求转移、拓扑结构和政策表现。

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

用于微调多模生成策略的行为模式发现

Authors: Alberta Longhini, David Emukpere, Jean-Michel Renders, Seungsu Kim
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11387
Pdf link: https://arxiv.org/pdf/2605.11387
Abstract We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies (e.g., diffusion policies) improve task performance but often collapse diverse behaviors into a single reward-maximizing mode. To mitigate this issue, we propose an unsupervised mode discovery framework that uncovers latent behavioral modes within generative policies. The discovered modes enable the use of mutual information as an intrinsic reward, regularizing RL fine-tuning to enhance task success while maintaining behavioral diversity. Experiments on robotic manipulation tasks demonstrate that our method consistently outperforms conventional fine-tuning approaches, achieving higher success rates and preserving richer multimodal action distributions.
中文摘要 我们解决了如何通过强化学习（RL）微调预训练生成策略的问题，同时保持其动作分布的多模态性。现有的强化学习生成策略微调方法（如扩散策略）提升了任务表现，但往往将多种行为压缩为单一的奖励最大化模式。为缓解这一问题，我们提出了一个无监督模式发现框架，能够揭示生成策略中潜在的行为模式。发现的模式使得互信息作为内在奖励得以使用，规范化强化学习的微调，以提升任务成功率，同时保持行为多样性。机器人操作任务的实验表明，我们的方法持续优于传统微调方法，实现更高的成功率，并保持更丰富的多模态作用分布。

FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

FG-ExPO：通过自适应KL和高斯课程实现前沿引导的探索优先策略优化

Authors: Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.11403
Pdf link: https://arxiv.org/pdf/2605.11403
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
中文摘要 带可验证奖励的强化学习（RLVR）已成为大型语言模型数学推理的标准范式，群体相对策略优化（GRPO）成为主流算法。我们识别出GRPO中两个被忽视的低效问题。首先，固定的 KL 系数在模型需要显著偏离参考政策时，过度限制了策略探索。其次，均匀问题抽样忽略了中等难度问题会产生最具信息量的梯度信号。我们提出FG-ExPO，即边疆引导探索优先政策优化，整合了两个轻量级组件。准确性条件KL标度（AKL）通过批量平均精度的平滑非线性函数调整KL惩罚强度，模型表现不佳时放宽约束，达到满意结果时加强约束。高斯课程抽样（GCS）为以中等准确率为中心为0.5左右的高斯分布问题分配抽样权重，将模型训练重点放在学习前沿。我们对DeepSeek-R1-Distill-Qwen-1.5B和Qwen3-8B-Base进行了六个主流数学推理基准的评估。实验结果表明，FG-ExPO的表现始终优于普通GRPO。在AIME 2025 pass@32指标上，其绝对提升为13.34%，从63.33%升至76.67%，8B模型的平均pass@32进步为2.66%。pass@32 上显著较大的性能提升pass@1验证了 FG-ExPO 在固定推断预算下扩大了模型的有效探索空间。
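The two components described above can be sketched in a few lines. The Gaussian center of 0.5 comes from the abstract. Everything else here is an assumption made for illustration: the width `sigma`, the logistic shape used for the "smooth nonlinear function", and the KL coefficient range.

```python
import math


def gcs_weight(accuracy, center=0.5, sigma=0.2):
    """Unnormalized Gaussian Curriculum Sampling weight for one question.

    accuracy: fraction of correct rollouts for that question, in [0, 1].
    sigma:    hypothetical width; the abstract does not give a value.
    """
    return math.exp(-((accuracy - center) ** 2) / (2.0 * sigma ** 2))


def akl_coefficient(batch_accuracy, kl_min=0.001, kl_max=0.04,
                    midpoint=0.5, sharpness=10.0):
    """KL penalty strength as a smooth increasing function of batch accuracy.

    A logistic curve is assumed here; the paper's exact function may differ.
    Low accuracy -> weak KL penalty, high accuracy -> strong KL penalty.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return kl_min + (kl_max - kl_min) * gate
```

Questions the model almost always or almost never solves receive small sampling weights, so training concentrates on questions near 50% accuracy. In practice the weights would be normalized over the question pool before sampling.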

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

代理-BRACE：通过口头状态不确定性实现长期任务中的信念与行动的解耦

Authors: Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11436
Pdf link: https://arxiv.org/pdf/2605.11436
Abstract Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
中文摘要 大型语言模型（LLMs）越来越多地被部署在部分可观测环境中的长期任务中，它们必须在推断和跟踪复杂环境状态的过程中执行多步任务。这带来了两个挑战：部分可观测性需要对未被观测的世界属性保持不确定性，而漫长的交互历史使上下文无限制地增长，稀释了任务相关信息。对这两个挑战的原则性解决方案是信念状态：基于过去观察和行为的环境状态后验分布，它紧凑地编码了决策所需的历史，无论发作长度如何。然而，在LLM代理中，文本的开放性质使得如何表示这种分布变得不清楚。因此，我们引入了Agent-BRACE：通过抽象与置信估计实现代理信念状态表征的方法，该方法将LLM代理解耦为信念状态模型和策略模型，并通过强化学习共同优化。信念状态模型产生了信念分布的结构化近似：一组关于环境的原子自然语言主张，每个主张都标注了一个从确定到未知的序数语言确定性标签。政策模型基于这种紧凑、结构化的近似信念，而非完整历史，学习在显性不确定性下选择行动。在长视野、部分可观察的具身语言环境中，Agent-BRACE平均绝对提升为+14.5%（Qwen2.5-3B-Instruct）和+5.3%（Qwen3-4B-Instruct），优于强强强化学习基线，同时保持几乎恒定的上下文窗口，且不依赖于发作时长。进一步分析显示，随着证据的积累，这种习得的信念会随着发作过程逐渐校准。

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破$\textit{赢者通吃}$：合作策略优化提升多样的大型语言模型推理

Authors: Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11461
Pdf link: https://arxiv.org/pdf/2605.11461
Abstract Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the winner-takes-all nature, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at this https URL.
中文摘要 基于验证器的强化学习（RLVR）已成为提升大型语言模型推理的核心范式，然而像GRPO这样的流行的基于群体的优化算法常常存在探索崩溃的问题，即模型过早收敛到一组狭窄的高分模式，缺乏探索新解的能力。近期尝试通过增加熵正则化或多样性加值来缓解这一问题。然而，这些方法并未改变“赢者通吃”的本质，即推广仍是为个人利益而竞争，而非合作最大化全球多样性。在本研究中，我们提出了群体合作策略优化（GCPO），将培训范式从推广竞争转向团队合作。具体来说，GCPO用团队层面的信用分配取代了独立的推广评分：推广的奖励是根据其对团队有效解决方案覆盖的贡献程度来衡量，而非其个人准确性。这种覆盖被描述为奖励加权语义嵌入的决定性体积，只有正确且非冗余的推广才会贡献该体积。在优势估计过程中，GCPO根据每个部署对团队的平均边际贡献重新分配团队的集体奖励。这种协作式训练范式将优化引导至非冗余且正确的推理路径。多项推理基准测试的实验表明，GCPO相比现有方法显著提升了推理准确性和解的多样性。代码将在 this https URL 发布。

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

放下伪装：探针过滤的强化学习，忠实的思维链推理

Authors: Swapnil Parekh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11467
Pdf link: https://arxiv.org/pdf/2605.11467
Abstract Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of reasoning theater: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce ProFIL (Probe-Filtered Reinforcement Learning) to reduce theater, increase chain-of-thought faithfulness, and shrink chain length in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained once on the frozen base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work. Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by 11-100%, raises faithful-fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4-19%, all while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, isolating the gain as semantic commitment-detection rather than chain compression. Probe weights, training configurations, and rollouts are released across all four domains.
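Chinese abstract (translated): Reasoning models rationalize after the fact answers they have already committed to internally, producing chains of reasoning theater: steps that look deliberative but contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce ProFIL (Probe-Filtered Reinforcement Learning), a single drop-in extension to Group Relative Policy Optimization (GRPO) that reduces theater, increases chain-of-thought faithfulness, and shortens chains. A multi-head attention probe is trained once on the frozen base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage set to zero. Our central finding is that a probe trained on a frozen base, using verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work. Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by 11-100%, raises the faithful fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4-19%, while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, which isolates the gain as semantic commitment detection rather than chain compression. Probe weights, training configurations, and rollouts are released for all four domains.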
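
The advantage-zeroing step described in this abstract can be sketched as follows. This is a minimal illustration, assuming the usual GRPO group-normalized advantage (reward minus group mean, divided by group standard deviation). The function names, the `threshold` default, and the choice to zero the advantage after normalization are assumptions made here, not details confirmed by the abstract.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def probe_filtered_advantages(rewards, probe_scores, threshold=0.5):
    """Zero the advantage of every rollout whose probe score exceeds the threshold.

    probe_scores are assumed to be per-rollout outputs of a frozen probe,
    where a higher score means more post-commitment "theater".
    """
    if len(rewards) != len(probe_scores):
        raise ValueError("rewards and probe_scores must have the same length")
    advantages = grpo_advantages(rewards)
    return [0.0 if score > threshold else adv
            for adv, score in zip(advantages, probe_scores)]


# Hypothetical group of four rollouts for one prompt.
filtered = probe_filtered_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0],
    probe_scores=[0.9, 0.1, 0.2, 0.8],
)
```

In this toy group, the first and last rollouts are flagged by the probe and contribute no gradient, while the other two keep their normalized advantages.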

Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

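Chinese title (translated): Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe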

Authors: Riad Ahmed
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11469
Pdf link: https://arxiv.org/pdf/2605.11469
Abstract Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost. Adv-PPO+MACER recovers it to 77.5% +/- 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.
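Chinese abstract (translated): Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, with each agent acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour and jams the team. This paper presents two training recipes that keep the same network and the same deployment loop while making the policy hold up under perturbed observations. The first, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by its performance under adversarial perturbation. The second, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO restores worst-case success to 59.2% at a clean cost of one percentage point. Adv-PPO+MACER restores it to 77.5% +/- 6.0% across three independent seeds at a clean cost of less than one percentage point. These numbers are supported by per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.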

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Chinese title (translated): TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Authors: Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.11473
Pdf link: https://arxiv.org/pdf/2605.11473
Abstract Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing, a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on the Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.
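Chinese abstract (translated): Soft Actor-Critic (SAC) and its variants dominate multi-task reinforcement learning (MTRL) thanks to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose a previously overlooked problem with PPO in MTRL: ill-conditioned critic-side gradients, which can cause tail tasks to stall while easy tasks dominate updates to the value function. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO through Critic Balancing, a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck inside PPO itself. Empirically, on the Meta-World+ benchmark, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at the full budget. Ablations confirm the effectiveness of each TOPPO module and give insight into how the modules interact. Our results show that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.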

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Chinese title (translated): Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Authors: Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11491
Pdf link: https://arxiv.org/pdf/2605.11491
Abstract Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
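Chinese abstract (translated): Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms such as GRPO often suffer from entropy collapse, which leads to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy only coarsely or rely on approximate on-policy training. This paper revisits entropy collapse from the perspective of token-level entropy flow. Our analysis shows that entropy-decreasing tokens consistently outweigh entropy-increasing ones, producing a severely imbalanced entropy flow. This view gives a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy-flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to the entropy change while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks show that OPEFO improves training stability and final performance. We will release the code and models upon publication.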

Selective Off-Policy Reference Tuning with Plan Guidance

Chinese title (translated): Selective Off-Policy Reference Tuning with Plan Guidance

Authors: Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11505
Pdf link: https://arxiv.org/pdf/2605.11505
Abstract Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with the largest gains on weaker models.
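Chinese abstract (translated): Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where every sampled rollout fails. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with the largest gains on weaker models.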
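
One way to picture the token weighting described in this abstract is the sketch below. It assumes per-token log-probabilities of the reference solution are available both with and without the plan in context. The weighting rule (positive log-probability gain, normalized to sum to one), the uniform fallback, and the weighted negative log-likelihood loss are illustrative assumptions, not SORT's published formulation.

```python
def plan_gain_weights(logp_with_plan, logp_without_plan):
    """Weight each token by how much more predictable it becomes under the plan.

    Assumed rule: weight is proportional to max(0, logp_with - logp_without).
    Falls back to uniform weights when no token gains from the plan.
    """
    if len(logp_with_plan) != len(logp_without_plan):
        raise ValueError("log-probability lists must have the same length")
    if not logp_with_plan:
        return []
    gains = [max(0.0, w - wo) for w, wo in zip(logp_with_plan, logp_without_plan)]
    total = sum(gains)
    if total == 0.0:
        return [1.0 / len(gains)] * len(gains)
    return [g / total for g in gains]


def weighted_nll(policy_logps, weights):
    """Selective imitation loss: weighted negative log-likelihood of reference tokens."""
    return -sum(w * lp for w, lp in zip(weights, policy_logps))


# Hypothetical three-token reference solution.
weights = plan_gain_weights(
    logp_with_plan=[-0.1, -2.0, -0.5],
    logp_without_plan=[-1.1, -1.0, -0.5],
)
loss = weighted_nll([-0.2, -0.3, -0.4], weights)
```

Only the first token becomes more predictable under the plan in this toy case, so it receives all of the weight. Uniform imitation would instead weight the three tokens equally.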

Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

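Chinese title (translated): Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity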

Authors: Zijiang Yan, Hao Zhou, Wael Jaafar, Jianhua Pei, Ping Wang, Halim Yanikomeroglu, Hina Tabassum
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.11509
Pdf link: https://arxiv.org/pdf/2605.11509
Abstract Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet the joint optimization of multi-UAV motion control and connectivity remains a fundamental challenge. In this paper, we study a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) comprising terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario where UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP), capturing the coupling between control and communication objectives. Based on this formulation, we propose a large language model (LLM)-driven hierarchical multi-rate control framework. At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV employs a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results demonstrate that the proposed framework significantly outperforms state-of-the-art baselines, achieving a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. Additionally, it achieves a 23% reduction in physical collision rates, demonstrating strong handover stability and zero-shot generalization in dynamic scenarios.
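Chinese abstract (translated): Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet jointly optimizing multi-UAV motion control and connectivity remains a fundamental challenge. This paper studies a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) made up of terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario in which UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP) that captures the coupling between control and communication objectives. Building on this formulation, we propose a hierarchical multi-rate control framework driven by large language models (LLMs). At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV uses a hybrid controller that combines a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results show that the proposed framework significantly outperforms state-of-the-art baselines, with a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. It also reduces the physical collision rate by 23% and shows strong handover stability and zero-shot generalization in dynamic scenarios.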

Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN

Chinese title (translated): Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN

Authors: Pranshav Gajjar, Vijay K Shah
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.11516
Pdf link: https://arxiv.org/pdf/2605.11516
Abstract This position paper argues that to achieve Level 5 autonomous 6G networks, the next generation of Artificial Intelligence in Radio Access Networks (AI-RAN) should transition away from fragmented, narrow predictive models and instead adopt multimodal Large Language Models (LLMs) as central reasoning agents. Current AI-RAN architectures rely on disjointed Deep Neural Networks (DNNs) and Deep Reinforcement Learning (DRL) agents that operate in isolated domains. These narrow models suffer from siloed knowledge, severe brittleness to out-of-distribution dynamics, and a fundamental inability to bridge the intent gap, that is, the semantic disconnect between high-level, unstructured operator directives and rigid numerical network configurations. We propose elevating LLMs, or domain-adapted Large Telecom Models (LTMs), to act as the cognitive operating system situated within the RAN Intelligent Controller (RIC), the control and orchestration layer of AI-RAN. In this architecture, LLMs do not replace narrow models but orchestrate them as executable subroutines, dynamically translating human intent into concrete policies and utilizing Retrieval-Augmented Generation (RAG) to autonomously diagnose complex, multi-vendor network anomalies. To make this architectural shift a reality, we call upon the machine learning community to prioritize critical foundational research tailored to the strict constraints of telecommunications, specifically focusing on continuous alignment via network-driven feedback (RLNF), extreme sub-8-bit edge quantization, neuro-symbolic verification to curb hallucinations, and securing orchestration frameworks against adversarial prompt injections.
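Chinese abstract (translated): This position paper argues that, to achieve Level 5 autonomous 6G networks, the next generation of Artificial Intelligence in Radio Access Networks (AI-RAN) should move away from fragmented, narrow predictive models and instead adopt multimodal Large Language Models (LLMs) as central reasoning agents. Current AI-RAN architectures rely on disjointed Deep Neural Networks (DNNs) and Deep Reinforcement Learning (DRL) agents that operate in isolated domains. These narrow models suffer from siloed knowledge, severe brittleness to out-of-distribution dynamics, and a fundamental inability to bridge the intent gap, that is, the semantic disconnect between high-level, unstructured operator directives and rigid numerical network configurations. We propose elevating LLMs, or domain-adapted Large Telecom Models (LTMs), to act as the cognitive operating system inside the RAN Intelligent Controller (RIC), the control and orchestration layer of AI-RAN. In this architecture, LLMs do not replace narrow models but orchestrate them as executable subroutines, dynamically translating human intent into concrete policies and using Retrieval-Augmented Generation (RAG) to diagnose complex, multi-vendor network anomalies autonomously. To make this architectural shift a reality, we call on the machine learning community to prioritize foundational research tailored to the strict constraints of telecommunications, specifically continuous alignment via network-driven feedback (RLNF), extreme sub-8-bit edge quantization, neuro-symbolic verification to curb hallucinations, and securing orchestration frameworks against adversarial prompt injection.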

UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

Chinese title (translated): UNIPO: Unified Interactive Visual Explanation for Reinforcement Learning Policy Optimization

Authors: Aeree Cho, Alexander D. Greenhalgh, Jonathan Bodea, Anthony Peng, Duen Horng (Polo) Chau
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2605.11549
Pdf link: https://arxiv.org/pdf/2605.11549
Abstract Reinforcement learning has emerged as a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO emerging in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are introduced across separate papers with inconsistent notation, making them difficult to compare and intimidating to the non-expert community. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views, a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison, allowing learners to observe how individual design decisions propagate through training. Through two usage scenarios, we demonstrate how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at this https URL.
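Chinese abstract (translated): Reinforcement learning has become a dominant technique for fine-tuning the behavior of large language models, with policy optimization (PO) algorithms such as GRPO, DAPO, and Dr. GRPO appearing in rapid succession to advance state-of-the-art reasoning and alignment performance. However, the modular differences between these algorithms, including targeted improvements to clipping, advantage estimation, and reward aggregation, are spread across separate papers with inconsistent notation, which makes them hard to compare and intimidating to non-experts. We present UNIPO, the first interactive visualization tool that exposes the token-level training dynamics of RL fine-tuning algorithms through a unified design. UNIPO connects three complementary views: a high-level training overview, a step-level prompt and response inspector, and a side-by-side algorithm comparison, so learners can observe how individual design decisions propagate through training. Through two usage scenarios, we show how UNIPO supports both classroom instruction for non-experts and algorithm selection for AI practitioners. Our tool is open-source and publicly available at this https URL.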

TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning

Chinese title (translated): TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning

Authors: Shiteng Cao, Kaian Jiang, Yunlong Gong, Zhiheng Li
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.11553
Pdf link: https://arxiv.org/pdf/2605.11553
Abstract Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply a fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This approach creates a trade-off: a fast recommendation model produces suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to adaptively allocate reasoning effort per user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.
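Chinese abstract (translated): Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply one fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This creates a trade-off: a fast recommendation model yields suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to allocate reasoning effort adaptively for each user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural-language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets show that our method outperforms strong baselines, with consistent accuracy gains and lower inference latency than uniform slow reasoning.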

OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

Chinese title (translated): OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

Authors: Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11570
Pdf link: https://arxiv.org/pdf/2605.11570
Abstract Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting-Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor-critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.
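Chinese abstract (translated): Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network goes largely unobserved. This paper argues that the Overfitting-Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals, before convergence, whether a network is entering a poor or a promising training regime. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor-critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.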

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Chinese title (translated): CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Authors: Jianghan Shen, Siqi Luo, Xinyu Cheng, Jing Xiong, Yue Li, Jiyao Liu, Jiashi Lin, Yirong Chen, Junjun He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11611
Pdf link: https://arxiv.org/pdf/2605.11611
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at this https URL.
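Chinese abstract (translated): Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows during training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to the shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest trajectories available in the current batch, which yields an implicit, training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold once deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, gaining up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at this https URL.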
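
The two SDGA variants named in this abstract can be sketched roughly as below. The abstract only states that SDGA-Auto targets the deepest trajectories in the batch under a fixed budget and that SDGA-Phase advances a threshold once deeper trajectories are abundant. The function names, the tie-breaking by batch order, and the `min_fraction` rule are assumptions for illustration.

```python
def sdga_auto(depths, budget):
    """Select the indices of the `budget` deepest trajectories in the batch.

    Ties are broken by batch order (an assumption); indices are returned sorted.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    order = sorted(range(len(depths)), key=lambda i: (-depths[i], i))
    return sorted(order[:budget])


def sdga_phase_threshold(depths, threshold, min_fraction=0.5):
    """Advance the depth threshold by one when enough trajectories exceed it.

    Assumed rule: move to threshold + 1 once at least `min_fraction` of the
    batch reaches that depth; otherwise keep the current threshold.
    """
    if not depths:
        return threshold
    deeper = sum(1 for d in depths if d >= threshold + 1)
    return threshold + 1 if deeper / len(depths) >= min_fraction else threshold


# Hypothetical batch: number of search calls made by each rollout.
batch_depths = [1, 3, 2, 3, 0]
selected = sdga_auto(batch_depths, budget=3)
next_threshold = sdga_phase_threshold(batch_depths, threshold=1)
```

Here the update budget of three goes to the rollouts with depths 3, 2, and 3, and the phase threshold moves from 1 to 2 because three of the five rollouts reach depth 2 or more.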

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

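Chinese title (translated): Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling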

Authors: Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng, Sudheer Chava, Chao Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11666
Pdf link: https://arxiv.org/pdf/2605.11666
Abstract The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evolutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on this https URL.
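Chinese abstract (translated): The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms such as Reinforcement Learning from Verifiable Rewards (RLVR). However, the efficacy of these methods remains constrained by the diversity and complexity of the training data. One practical solution is data synthesis, yet prevalent methods that rely on unstructured mutation or exploration suffer from homogeneity collapse and fail to expand the reasoning frontier systematically. To overcome this, we propose Evolutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter that keeps tasks within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, showing that structured evolutionary curricula can effectively support reasoning improvement. We release our code on this https URL.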

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Chinese title (translated): DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Authors: Kaixuan He, Song Chen, Yi Kang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.11683
Pdf link: https://arxiv.org/pdf/2605.11683
Abstract Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
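Chinese abstract (translated): Vision Transformers (ViTs) incur significant computational overhead because self-attention has quadratic complexity in the token sequence length. Existing token reduction methods mitigate this, but they rely mainly on fixed heuristic metrics, predefined ratios, or static offline masks, which cannot adapt to input-dependent redundancy during inference. This paper proposes DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP) in which a lightweight RL agent determines the merging strategy for each Transformer block from the current feature state and layer-specific context. To balance computational efficiency against feature fidelity, the agent is optimized with a dense reward function that includes a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that uses a high-capacity Critic for stable offline training while keeping a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) show that DORA improves the accuracy-efficiency Pareto front over current baselines. Under a strict negligible accuracy-drop constraint (<= 0.05%), DORA achieves a token merging rate of up to 12.66% and a relative improvement of up to 569.7% over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings over state-of-the-art methods. On out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.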

Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

Chinese title (translated): Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

Authors: Hassen Nigatu, Gaokun Shi, Jituo Li, Wang Jin, Lu Guodong
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11697
Pdf link: https://arxiv.org/pdf/2605.11697
Abstract This paper presents a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS (Revolute-Revolute-Spherical) parallel manipulator. A key contribution is the integration of a geometric design-optimization stage that precedes learning: the 3-RRS geometry is tuned to maximize the singularity-free workspace and improve conditioning, which in turn enlarges the safe region in which the reinforcement learning policy can explore. Together the two manipulators expose a 6-degree-of-freedom (DoF) controllable subspace (three Delta translations, two 3-RRS rotations, and one 3-RRS vertical translation); the peg-in-hole task is invariant to rotation about the peg axis, so the task-relevant manifold is five dimensional. The cooperative insertion problem is cast as a Markov Decision Process with a 12-dimensional state vector and a discrete action set containing $6 \times 2 = 12$ incremental commands (one positive and one negative per controlled DoF). A shaped reward combines dense proximity guidance, penalties for kinematic and workspace violations, and sparse bonuses for successful insertions. The Rainbow DQN, which integrates double Q-learning, dueling architecture, prioritized replay, multi-step returns, noisy linear layers for exploration, and a distributional value head, is trained with a two-stage curriculum. The co-designed framework is validated in a high-fidelity kinematic simulator, where it achieves stable policy convergence, reliable insertions, and reduced constraint violations compared against a vanilla DQN agent and a classical sampling-based planner.
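Chinese abstract (translated): This paper presents a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS (Revolute-Revolute-Spherical) parallel manipulator. A key contribution is a geometric design-optimization stage that precedes learning: the 3-RRS geometry is tuned to maximize the singularity-free workspace and improve conditioning, which in turn enlarges the safe region the reinforcement learning policy can explore. Together the two manipulators expose a 6-degree-of-freedom (DoF) controllable subspace (three Delta translations, two 3-RRS rotations, and one 3-RRS vertical translation); the peg-in-hole task is invariant to rotation about the peg axis, so the task-relevant manifold is five-dimensional. The cooperative insertion problem is cast as a Markov Decision Process with a 12-dimensional state vector and a discrete action set of $6 \times 2 = 12$ incremental commands (one positive and one negative per controlled DoF). A shaped reward combines dense proximity guidance, penalties for kinematic and workspace violations, and sparse bonuses for successful insertions. The Rainbow DQN, which integrates double Q-learning, a dueling architecture, prioritized replay, multi-step returns, noisy linear layers for exploration, and a distributional value head, is trained with a two-stage curriculum. The co-designed framework is validated in a high-fidelity kinematic simulator, where it achieves stable policy convergence, reliable insertions, and fewer constraint violations than a vanilla DQN agent and a classical sampling-based planner.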
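
The discrete action set described in this abstract, one positive and one negative increment for each of the six controlled DoF, can be enumerated as in the sketch below. The DoF labels, the rotation naming, the shared `step` size, and the dictionary pose representation are hypothetical; the abstract does not name the rotation axes or give step magnitudes.

```python
# Hypothetical labels: three Delta translations, two 3-RRS rotations,
# and one 3-RRS vertical translation (axes are not named in the abstract).
CONTROLLED_DOFS = (
    "delta_x", "delta_y", "delta_z",
    "rrs_rot_1", "rrs_rot_2", "rrs_z",
)


def build_action_set(step=1.0):
    """Return 6 x 2 = 12 commands: (+step, -step) for each controlled DoF."""
    return [(dof, sign * step) for dof in CONTROLLED_DOFS for sign in (1.0, -1.0)]


def apply_action(pose, action):
    """Return a copy of the pose with the selected DoF shifted by the increment."""
    dof, delta = action
    updated = dict(pose)
    updated[dof] = updated[dof] + delta
    return updated


actions = build_action_set(step=0.5)
pose = {dof: 0.0 for dof in CONTROLLED_DOFS}
pose = apply_action(pose, actions[0])  # positive increment along delta_x
```

A Q-network for this formulation would output one value per entry of `actions`, and the greedy policy would pick the index with the largest value.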

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Chinese title (translated): CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Authors: Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11723
Pdf link: https://arxiv.org/pdf/2605.11723
Abstract In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
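Chinese abstract (translated): This paper proposes Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first runs a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments through structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated-video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model first learns spatial and temporal anchoring through single-frame and multi-frame supervised fine-tuning, and is then optimized with a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments show that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks; when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.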
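
The Temporal and Spatial IoU rewards mentioned in this abstract rest on the standard intersection-over-union measure, sketched below for time windows and bounding boxes. The IoU definitions are the usual ones. The `localization_reward` function, its equal default weights, and the tuple layouts are assumptions for illustration, since the abstract does not say how the two terms are combined.

```python
def temporal_iou(pred, truth):
    """IoU of two time windows given as (start, end) with start <= end."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred, truth):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    width = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    height = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = width * height
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


def localization_reward(pred_window, true_window, pred_box, true_box,
                        w_temporal=0.5, w_spatial=0.5):
    """Assumed combination: weighted sum of temporal and spatial IoU."""
    return (w_temporal * temporal_iou(pred_window, true_window)
            + w_spatial * box_iou(pred_box, true_box))


# Hypothetical prediction that half-overlaps the annotated anomaly.
t_iou = temporal_iou((0.0, 4.0), (2.0, 6.0))
s_iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
reward = localization_reward((0.0, 4.0), (2.0, 6.0),
                             (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Both IoU terms lie in [0, 1], so a weighted sum with weights that add to one stays in the same range as a typical accuracy reward.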

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Block-R1：重新思考块大小在扩散大型语言模型多领域强化学习中的作用

Authors: Yan Jiang, Ruihong Qiu, Zi Huang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11726
Pdf link: https://arxiv.org/pdf/2605.11726
Abstract Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which largely affects the post-training effectiveness of rollout-based RL methods; (2) a novel dataset, Block-R1-41K, constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single-domain and cross-domain settings; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Block-R1 covers extensive experiments on 13 distinct datasets, 7 recent RL algorithms, and various dLLM backbones. The benchmark is open-sourced at this https URL, with the dataset released at this https URL.
中文摘要 近年来，强化学习（RL）被广泛应用于扩散大型语言模型（dLLMs）的后期训练阶段，以增强分块半自回归生成的推理能力。因此，块大小成为dLLMs中至关重要的因素，因为它决定了并行解码的粒度，并影响强化学习优化过程中的展开轨迹，例如GRPO。本文不研究推理过程中块大小对单个域的影响，而是从多域场景中dLLM强化学习后训练的域冲突角度研究块大小。主要贡献包括：（1）关于dLLM多域强化学习领域块大小冲突的表述，这将在很大程度上影响基于推广的强化学习方法的训练后效果;（2）一个新颖数据集Block-R1-41K构建了每个样本的最佳改进训练块大小，同时诱导块大小冲突评分以定量测量领域冲突;（3）一个新的基准测试Block-R1，用于单域和跨域dLLM的灵活强化学习后训练;以及（4）一种简单但强大的跨域后训练方法，采用样本级最佳改进的训练块大小。Block-R1涵盖了13个不同数据集、7个最新强化学习算法以及多种不同的dLLM骨干链的广泛实验。基准测试开源于此 https URL，数据集发布于此 https URL。

Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

部分可见性下的联邦客户端选择：基于时空注意力的POMDP方法

Authors: Qijun Hou, Yuchen Shi, Pingyi Fan, Khaled B. Letaief
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11752
Pdf link: https://arxiv.org/pdf/2605.11752
Abstract Federated learning relies on effective client selection to alleviate the performance degradation caused by data heterogeneity. Most existing methods assume full visibility of all clients at each communication round. However, in large-scale or edge-based deployments, the server can only access a subset of clients due to communication, mobility, or availability constraints, resulting in partial visibility where only a subset of clients is observable for aggregation in each communication round. In this paper, we formulate federated client selection under partial visibility as a Partially Observable Markov Decision Process (POMDP) and propose a Spatial-Temporal attention-based reinforcement learning framework. By integrating historical global models and client identity embeddings, the proposed method captures both the temporal contexts of training and the persistent characteristics of clients. Experimental results across multiple datasets demonstrate that our approach achieves superior performance compared to existing baselines in heterogeneous and partially visible settings, validating its effectiveness in addressing the challenges of incomplete observations in practical federated learning systems.
中文摘要 联合学习依赖有效的客户端选择，以缓解数据异质性导致的性能下降。大多数现有方法假设在每一轮通信时对所有客户端都完全可见。然而，在大规模或基于边缘的部署中，由于通信、移动性或可用性限制，服务器只能访问部分客户端，导致每轮通信中只有部分客户端可被观测进行聚合。本文将部分可视性下的联邦客户选择构建为部分可观测马尔可夫决策过程（POMDP），并提出了基于注意力的时空强化学习框架。通过整合历史全局模型和客户身份嵌入，所提方法既捕捉了训练的时间背景，也捕捉了客户的持续特征。跨多个数据集的实验结果表明，我们的方法在异构且部分可见的环境中，相较于现有基线实现了更优的性能，验证了其在解决实际联邦学习系统中不完整观测挑战方面的有效性。

NavOL: Navigation Policy with Online Imitation Learning

NavOL：基于在线模仿学习的导航策略

Authors: Xiaofei Wei, Chun Gu, Li Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11762
Pdf link: https://arxiv.org/pdf/2605.11762
Abstract Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout-update loop: during rollout, the policy acts in the simulator and queries a global planner, which has privileged access to the global environment, for the optimal path segment as ground-truth trajectory labels; during update, the policy is trained on the observation-trajectory pairs collected online. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy's own explored rollouts. Built on IsaacLab with fast, high-fidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains in online imitation learning.
中文摘要 学习稳健的导航策略仍是机器人领域的核心挑战。离线模仿学习存在分布偏移和推广时叠加错误，而强化学习则需要奖励工程，学习效率低下。本文提出了NavOL，一种在线模仿学习范式，它与模拟器互动，并通过在线专家演示自我更新。NavOL基于预训练的导航扩散策略，将本地观测数据映射到未来航点，NavOL在部署更新循环中训练：在部署过程中，策略在模拟器中运行，查询拥有特权访问全局环境的最优路径段的全局规划器，作为地面真实轨迹标签;在更新过程中，策略会根据在线收集的观测轨迹对进行训练。这种在线模仿循环消除了奖励设计的需求，提高了学习效率，并通过对政策自研推广进行培训，减轻了分布转移。我们的系统基于IsaacLab，提供高速、高保真并行渲染和摄像机姿态和起始-目标对的域随机化，支持8个RTX 4090 GPU上的50个场景，每小时收集超过2000条新轨迹，每个平均步数超过400步。我们还引入了室内视觉导航基准，预设起点和目标位置，用于零射概括。对包括NavDP基准和我们提出的基准测试在内的广泛模拟基准评估，以及精心设计的真实世界实验，展示了NavOL的有效性，在线模仿学习中持续展现性能提升。

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

强化微调中的熵极性：方向、不对称与控制

Authors: Jiazheng Zhang, Ziche Fu, Junrui Shen, Yunbin Zhao, Yunke Zhang, Zhiheng Xi, Long Ma, Chenxin An, Zhihao Zhang, Shichun Liu, Dingwei Zhu, Shihan Dou, Shaofan Liu, Han Li, Wiggin Zhou, Aiden Adams, Tao Gui, Fei Huang, Qi Zhang, Xuanjing Huang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.11775
Pdf link: https://arxiv.org/pdf/2605.11775
Abstract Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
中文摘要 策略熵已成为理解和控制面向LLM的可验证奖励强化学习（RLVR）中探索行为的基础指标。然而，现有的熵感知方法主要通过全局目标调节熵，而采样得到的策略更新在token级别上如何重塑策略熵，其机制仍未被充分探索。在本研究中，我们建立了RLVR中熵动力机制的理论框架。我们的分析得出熵变化的一阶近似，并由此引出熵极性，这是一种带符号的token级量，用于预测一次采样更新使熵扩张或收缩的程度。该分析进一步揭示了结构性不对称性：强化频繁出现的高概率token会触发收缩倾向，而扩张倾向通常需要较低概率的样本或更强的分布修正。在实证上，我们表明熵极性能够可靠地预测熵变化，且正极性和负极性分支在保持探索的同时增强利用方面起着互补作用。基于这些见解，我们提出了极性感知策略优化（PAPO），该方法保留两个极性分支，并通过优势重加权实现熵控制。以经验熵轨迹作为在线阶段信号，PAPO自适应地在熵扩张更新和熵收缩更新之间重新分配优化压力。数学推理和智能体基准上的实验表明，PAPO持续优于有竞争力的基线，同时带来更高的训练效率和显著的奖励提升。
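
The asymmetry described in the abstract (reinforcing a high-probability token tends to contract entropy, reinforcing a low-probability token tends to expand it) can be checked numerically on a toy softmax policy. This sketch is not the paper's entropy polarity definition, which the abstract does not give. It assumes a single-state softmax policy and one REINFORCE-style logit step `lr * advantage * (onehot(k) - p)`, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_change(p, k, advantage=1.0, lr=0.1):
    """Entropy after minus entropy before one policy-gradient step on token k.

    The step moves the logits along the gradient of log p[k], scaled by the
    advantage: logit_i += lr * advantage * (1[i == k] - p[i]).
    """
    logits = [math.log(q) for q in p]
    stepped = [
        z + lr * advantage * ((1.0 if i == k else 0.0) - p[i])
        for i, z in enumerate(logits)
    ]
    return entropy(softmax(stepped)) - entropy(p)
```

With `p = [0.7, 0.2, 0.1]`, a positive-advantage update on token 0 lowers entropy, while the same update on token 2 raises it. Flipping the sign of the advantage on token 0 reverses the direction.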

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

GEAR：通过自蒸馏实现LLM智能体的粒度自适应优势重加权

Authors: Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.11853
Pdf link: https://arxiv.org/pdf/2605.11853
Abstract Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment's advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
中文摘要 强化学习已成为LLM智能体广泛使用的后训练方法，其训练通常依赖于仅提供粗粒度监督的结果级奖励。虽然更细粒度的信用分配有望带来更有效的策略更新，但如何获得可靠的局部信用并将其分配到长程轨迹的正确部分，仍是一个开放的挑战。本文提出粒度自适应优势重加权（GEAR），这是一种自适应粒度的信用分配框架，利用来自自蒸馏的token级和片段级信号重塑轨迹级GRPO优势。GEAR将同策略（on-policy）学生与以真实答案为条件的教师进行比较，以获得参考引导的散度信号，用于识别自适应片段边界并调节局部优势权重。这种散度通常在语义偏离开始时出现尖峰，而同一自回归续写中的后续token可能会回到低散度。因此，GEAR将此类尖峰视为自适应信用区域的锚点：当学生与教师保持一致时，保留token级分辨率；当学生偏离教师时，GEAR将相应的续写归为一个自适应片段，并利用偏离点处的散度来调节该片段的优势。在八个数学推理和智能体工具使用基准上，使用Qwen3 4B和8B模型的实验显示，GEAR始终优于标准GRPO、仅自蒸馏的基线以及token级或轮次级信用分配方法。在GRPO基线准确率较低的基准上，提升尤为显著，相对GRPO最高提升约20%，表明所提出的自适应重加权方案在更具挑战性的长程场景中尤为有用。

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

EvoNav：大型语言模型机器人导航的进化奖励函数设计

Authors: Zhikai Zhao, Chuanbo Hua, Federico Berto, Zihan Ma, Kanghoon Lee, Jiachen Li, Jinkyoo Park
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11859
Pdf link: https://arxiv.org/pdf/2605.11859
Abstract Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.
中文摘要 机器人导航是一项关键任务，应用于动态人类环境中的社交机器人。虽然强化学习（RL）在这个问题上展现出极大前景，但其策略质量对奖励函数的规范高度敏感。手工制作的奖励需要丰富的领域专业知识，且存在难以审计或调整的归纳偏见，限制了其效果，导致表现不佳。本文提出了EvoNav，一种通过大型语言模型（LLMs）自动化设计机器人导航奖励函数的进化框架。为了克服高昂的政策培训，EvoNav通过逐步的三阶段预热-提升程序评估LLM中的每一个候选提案。EvoNav 从低成本代理的分析代理（如小数据集和分析规则）逐步发展到轻量级部署，最终实现完整的政策培训，实现在有效反馈下实现计算效率高的探索。实验结果显示，EvoNav 产生的导航策略比手动设计的强化学习奖励和最先进的奖励设计方法更有效。
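
The progressive three-stage evaluation described in the abstract (cheap analytical proxies, then lightweight rollouts, then full policy training) resembles a generic staged filter in which only the best candidates survive to the next, more expensive scorer. The sketch below shows that generic pattern only. `staged_select`, the scoring callables and the `keep` budgets are hypothetical, and EvoNav's actual warm-up-boost procedure may differ.

```python
def staged_select(candidates, stages):
    """Filter candidates through increasingly expensive evaluators.

    `stages` is a list of (score_fn, keep) pairs ordered from cheapest to
    most expensive. After each stage only the `keep` highest-scoring
    candidates move on, so costly scorers see few candidates.
    """
    pool = list(candidates)
    for score_fn, keep in stages:
        pool = sorted(pool, key=score_fn, reverse=True)[:keep]
    return pool
```

In an evolutionary loop, the survivors of the last stage would be fed back to the LLM as parents for the next round of reward-function proposals. That feedback step is omitted here.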

RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

RecRM-Bench：智能推荐系统多维奖励建模的基准测试

Authors: Wenwen Zeng, Jinhui Zhang, Hao Chen, Zhaoyu Hu, Yongqi Liang, Jiajun Chai, Dengcan Liu, Zhenfeng Liu, Shurui Yan, Minglong Xue, Xiaohan Wang, Wei Lin, Guojun Yin
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.11874
Pdf link: https://arxiv.org/pdf/2605.11874
Abstract The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query-item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single dimensional outcome-based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities, such as instruction following and complex intent understanding. Despite the necessity for designing multi-dimensional reward, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM-Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM-Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi-dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM-Bench dataset is publicly available at this https URL.
中文摘要 大型语言模型（LLM）代理的整合正在将推荐系统从简单的查询-项目匹配转变为深度个性化和互动性的推荐。强化学习（RL）为优化这些代理在推荐任务中提供了关键框架。然而，当前方法论仍受限于单一维度的结果导向奖励，这些奖励专注于最终用户交互，忽视了关键的中间能力，如指令遵循和复杂意图理解。尽管设计多维奖励是必要的，但该领域缺乏标准化的基准来促进这一发展。为弥合这一差距，我们推出了RecRM-Bench，这是迄今为止最大、最全面的代理推荐系统基准。它包含超过100万条结构化条目，涵盖四个核心评估维度：指令遵循、事实一致性、查询-项目相关性和细粒度用户行为预测。通过支持从句法遵循到复杂意图基础和偏好建模的全面评估，RecRM-Bench为训练复杂奖励模型提供了基础数据集。此外，我们提出了一个系统化的多维奖励模型构建框架，并整合混合奖励函数，为开发可靠且高性能的代理推荐系统奠定坚实基础。完整的 RecRM-Bench 数据集可在此 https URL 公开获取。

Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

自适应TD-Lambda用于协作多智能体强化学习

Authors: Yue Deng, Zirui Wang, Yin Zhang
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.11880
Pdf link: https://arxiv.org/pdf/2605.11880
Abstract TD($\lambda$) in value-based MARL algorithms and Temporal Difference critic learning in Actor-Critic-based (AC-based) algorithms integrate elements of Monte-Carlo simulation and Q-function bootstrapping via dynamic programming, which effectively addresses the inherent bias-variance trade-off in value estimation. Building on this, some recent works link an adaptive $\lambda$ value to the policy distribution in single-agent reinforcement learning. However, because of the large joint action space induced by multiple agents and the limited transition data in Multi-agent Reinforcement Learning (MARL), the policy distribution is infeasible to compute statistically. To solve this problem in MARL settings, we employ a parametric likelihood-free density ratio estimator with two replay buffers instead. The two replay buffers, of different sizes, store historical trajectories that represent the data distributions of the past and current policies, respectively. Based on the estimator, we assign Adaptive TD($\lambda$), ATD($\lambda$), values to state-action pairs based on their likelihood under the stationary distribution of the current policy. We apply the proposed method to two competitive baselines, QMIX for value-based algorithms and MAPPO for AC-based algorithms, on SMAC benchmarks and Gfootball academy scenarios, and demonstrate consistently competitive or superior performance compared to baseline approaches with static $\lambda$ values.
中文摘要 基于价值的MARL算法中的TD（$\lambda$），以及基于Actor-Critic（AC）算法中的时序差分评论家（critic）学习，通过动态规划协同整合了蒙特卡洛模拟和Q函数自举的要素，有效解决了价值估计中固有的偏差-方差权衡。基于此，一些近期研究在单智能体强化学习领域将自适应的$\lambda$值与策略分布联系起来。然而，由于多个智能体带来的庞大联合动作空间，以及多智能体强化学习中有限的转移数据，通过统计方式计算策略分布并不可行。为了解决MARL设定下的策略分布计算问题，我们采用带有两个回放缓冲区的参数化无似然密度比估计器，而非通过统计方式计算。两个大小不同的回放缓冲区分别存储代表过去策略和当前策略数据分布的历史轨迹。基于该估计器，我们根据状态-动作对在当前策略平稳分布下的似然，为其分配自适应TD（$\lambda$），即ATD（$\lambda$）的取值。我们将所提方法应用于两种有竞争力的基线方法，即基于价值的算法QMIX和基于AC的算法MAPPO，在SMAC基准和Gfootball academy场景上进行实验，与采用静态$\lambda$值的其他基线方法相比，展示了持续具有竞争力或更优的性能。
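
The bias-variance trade-off that $\lambda$ controls is visible in the standard recursive $\lambda$-return, generalized here to a per-step $\lambda_t$. This is a minimal sketch of that textbook recursion only. The per-step `lambdas` are supplied by the caller, and the density-ratio estimator the paper uses to choose them is not implemented. The function name and argument layout are assumptions.

```python
def lambda_returns(rewards, next_values, lambdas, gamma=0.99):
    """Backward recursion for lambda-returns with a per-step lambda.

    rewards[t]     : reward received at step t
    next_values[t] : bootstrap estimate V(s_{t+1}); use 0.0 after a terminal step
    lambdas[t]     : mixing weight for step t (0 = one-step TD, 1 = Monte Carlo)

    G_t = r_t + gamma * ((1 - lambda_t) * V(s_{t+1}) + lambda_t * G_{t+1})
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    g = next_values[-1]  # value used beyond the last recorded step
    for t in reversed(range(horizon)):
        lam = lambdas[t]
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns
```

Setting every `lambdas[t]` to 0 recovers one-step TD targets (low variance, more bias from the value estimate), and setting them to 1 recovers the discounted Monte-Carlo return. An adaptive scheme picks a value in between for each state-action pair.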

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Qwen-Scope：将稀疏特征转化为大型语言模型开发工具

Authors: Boyi Deng, Xu Wang, Yaoning Wang, Yu Wan, Yubo Ma, Baosong Yang, Haoran Wei, Jialong Tang, Huan Lin, Ruize Gao, Tianhao Li, Qian Cao, Xuancheng Ren, Xiaodong Deng, An Yang, Fei Huang, Dayiheng Liu, Jingren Zhou
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11887
Pdf link: https://arxiv.org/pdf/2605.11887
Abstract Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
中文摘要 大型语言模型在多种任务中已实现了显著能力，但其内部决策过程仍然大多不透明，限制了我们检查、控制和系统改进它们的能力。这种不透明度推动了越来越多的机制性可解释性研究，稀疏自编码器（SAE）成为将模型激活分解为稀疏、可解释特征表示的最有前景的工具之一。我们介绍Qwen-Scope，这是一套基于Qwen模型家族的开源SAE套件，包含14组SAE，涵盖Qwen3和Qwen3.5系列的7个模型变体，涵盖密集架构和专家混合架构。基于这些SAE，我们展示了SAE可以超越事后分析，作为模型开发的实用接口，沿四个方向展开：（i）推理时间引导，SAE特征方向控制语言、概念和偏好而不改变模型权重;（ii）评估分析，激活的SAE特征为基准冗余和能力覆盖提供表示层代理;（iii）以数据为中心的工作流程，其中SAE功能支持多语言毒性分类和以安全为导向的数据综合;以及（iv）训练后优化，将SAE衍生信号纳入监督式微调和强化学习目标中，以减少如代码切换和重复等不良行为。这些结果共同表明，SAE不仅可以作为事后分析工具，还能作为可复用的表示级接口，用于诊断、控制、评估和改进大型语言模型。通过开源 Qwen-Scope，我们旨在支持机制性研究，加速将模型内部与下游行为连接起来的实用工作流程。

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

StepCodeReasoner：通过强化学习将代码推理与逐步执行追踪对齐

Authors: Hao Wang, Rui Li, Lei Sha, Jie M. Zhang
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.11922
Pdf link: https://arxiv.org/pdf/2605.11922
Abstract Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
中文摘要 现有的代码推理方法主要监督最终代码输出，忽视中间状态，常导致通过不一致推理获得正确答案的奖励性黑客行为。我们提出了StepCodeReasoner框架，该框架引入了显式的中间执行状态监督。通过自动将基于打印的结构化执行-跟踪锚点插入代码，模型被训练为在每一步预测运行时状态，将代码推理转化为可验证的分步骤执行建模问题。基于这一执行感知方法，我们引入了双级GRPO，这是一种在两个层面进行结构化学分分配的强化学习算法：轨迹间（比较替代执行路径）和轨迹内（基于中间精度对下游正确性的影响）。大量实验表明，StepCodeReasoner 在代码推理中实现了 SOTA 性能。特别是，我们的7B模型在CRUXEval上达到了91.1%的成绩，在LiveCodeBench上达到了86.5%，优于CodeReasoner-7B基线（86.0%和77.7%）和GPT-4o（85.6%和75.1%）。此外，在执行追踪基准测试REval中，我们的模型得分为82.9%，优于基线CodeReasoner-7B（72.3%）、其14B对应模型（81.1%）和GPT-4o（77.3%）。此外，我们的方法还提升了代码生成性能，表明显式执行建模不仅提升了代码推理能力，也提升了代码生成。
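
The idea of inserting print-based execution-trace anchors can be sketched with Python's `ast` module: after every simple assignment, add a `print` that records the variable's runtime value, so each anchor yields a checkable intermediate state. This is an illustrative sketch, not the authors' tool. The `[TRACE]` format, the class and function names, and the restriction to single-name assignments are all assumptions. It needs Python 3.9+ for `ast.unparse`.

```python
import ast
import contextlib
import io


class TraceAnchorInserter(ast.NodeTransformer):
    """Append a print anchor after each assignment to a single plain name."""

    def visit_Assign(self, node):
        target = node.targets[0]
        if len(node.targets) != 1 or not isinstance(target, ast.Name):
            return node  # leave tuple, attribute and subscript targets alone
        anchor_src = f'print("[TRACE] {target.id} =", repr({target.id}))'
        anchor = ast.parse(anchor_src).body[0]
        return [node, anchor]


def instrument(source):
    """Return `source` with trace anchors inserted."""
    tree = TraceAnchorInserter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run_and_capture(source):
    """Execute trusted source text and return its printed lines."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<instrumented>", "exec"), {})
    return buffer.getvalue().splitlines()
```

The captured lines form the ground-truth trace that a model's step-by-step predictions could be compared against. `exec` is used only for brevity and should be applied to trusted or sandboxed code.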

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

当仿真说谎：工具使用代理的模拟到现实基准测试和领域随机化强化学习配方

Authors: Xiaolin Zhou, Aojie Yuan, Zheng Luo, Zipeng Ling, Xixiao Pan, Yicheng Gao, Haiyue Zhang, Jiate Li, Shuli Jiang, Prince Zizhuang Wang, Zixuan Zhu, Jinbo Liu, Ryan A. Rossi, Hua Wei, Xiyang Hu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11928
Pdf link: https://arxiv.org/pdf/2605.11928
Abstract Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
中文摘要 工具使用语言代理基于假设输入干净、工具注册表明确且API可靠的基准测试进行评估。真实部署违反了所有这些假设：用户错别字会传播成幻觉的工具名称，错误配置的请求超时可能导致代理无限期停滞，工具名称在不同服务器间重复可能冻结SDK。我们将这些失败视为工具使用部分可观测马尔可夫决策过程（POMDP）中的模拟到真实差距，部署噪声通过观察、动作空间、奖励相关元数据或过渡动态进入。我们介绍RobustBench-TC，这是一个基准测试，包含22种扰动类型，由这四个POMDP组件组织，每个类型都基于经过验证的GitHub问题或文档中的工具调用失败。在从1.5B到32B参数的21个模型（包括闭源o4-mini）中，鲁棒性分布明显不均：观测扰动降低准确率不到5%，而奖励相关扰动和过渡扰动分别降低约40%和30%;仅靠规模无法弥合这些差距。随后，我们提出了ToolRL-DR，一种域随机化强化学习（RL）配方，能够在三个静态编码的POMDP组件中训练工具使用代理的微扰增强轨迹。在3B骨干网上，ToolRL-DR-Full保持约四分之三的纯净精度，并达到与开源14B函数调用基线相当的综合扰动精度，同时大幅缩小与o4-mini的差距。尽管训练中从未见到过渡扰动，它仍填补了约27%的过渡差距，表明对对抗性静态工具使用输入的强化学习会诱导更持久的重试策略，从而转移至未见的运行时失败。数据集、代码和基准测试排行榜均公开。

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

迈向顺序公平性：通过双组优势优化缓解LLM的顺序敏感性

Authors: Xu Chu, Guanyu Wang, Zhijie Tan, Xinrong Chen, Ziyu Li, Tong Mo, Weiping Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11974
Pdf link: https://arxiv.org/pdf/2605.11974
Abstract Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose Dual Group Advantage Optimization (DGAO), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: this https URL.
中文摘要 大型语言模型（LLMs）存在顺序偏置问题，其性能受输入元素排列顺序影响。这种不公平限制了模型在上下文学习和检索增强生成（RAG）等场景中的应用。近期研究尝试基于统计结果或基于数据集的搜索获得最优或次优排列，但这些方法增加了推理开销，同时并未解决模型固有的顺序偏置。其他研究则使用包含多种顺序变体的增强训练集进行监督微调来缓解顺序敏感性，但通常以牺牲准确性为代价，使模型陷入一致但错误的幻觉中。本文提出双组优势优化（Dual Group Advantage Optimization，DGAO），旨在同时提高模型准确性和顺序稳定性。DGAO计算并平衡组内相对准确率优势和组间相对稳定性优势，奖励策略模型生成顺序稳定且正确的输出，同时惩罚对顺序敏感或错误的响应。这是强化学习首次被用来缓解LLMs的顺序敏感性。我们还提出了两个新的指标，即一致性率和过度自信率，以揭示以往方法的伪稳定性并指导更全面的评估。大量实验表明，DGAO在提升RAG、数学推理和分类任务性能的同时，实现了更优的顺序公平性。我们的代码可在此 https URL 获取。

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

随机最小成本到达-规避强化学习

Authors: Jingduo Pan, Taoran Wu, Yiling Xue, Bai Xue
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.11975
Pdf link: https://arxiv.org/pdf/2605.11975
Abstract We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.
中文摘要 我们研究随机最小成本到达-规避强化学习，即智能体必须以至少$p$的概率满足到达-规避规范，同时在随机环境中最小化期望累积成本。现有的安全与约束强化学习方法通常无法在随机环境的学习设定中同时满足概率性到达-规避约束并优化成本。为应对这一挑战，我们引入了到达-规避概率证书（RAPC），用于识别可满足随机到达-规避约束的状态。基于RAPC，我们提出了一种基于压缩映射的Bellman表述，作为将到达-规避考量融入强化学习的原则性替代目标，从而实现概率约束下的成本优化。我们证明了所提算法几乎必然收敛到关于所得目标的局部最优策略。MuJoCo模拟器中的实验显示，该方法的成本表现更优，且到达-规避满足率持续更高。

On Predicting the Post-training Potential of Pre-trained LLMs

关于预测预训练LLM的后训练潜力

Authors: Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenie Wang, Fuli Feng, Dayiheng Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.11978
Pdf link: https://arxiv.org/pdf/2605.11978
Abstract The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
中文摘要 大型语言模型（LLMs）在下游任务中的表现，从根本上受限于预训练期间获得的能力。然而，像MMLU这样的传统基准测试常常无法反映基础模型在复杂开放式场景中的可塑性，导致模型选择效率低下。为此，我们引入了预测后训练潜力这一新任务，即在后训练之前预测基础模型的性能。我们提出了RuDE（基于评分标准的判别式评估），这是一个统一框架，通过利用回答判别来绕过基础模型的生成差距。在我们系统化的4C分类法指导下，RuDE通过细粒度的评分标准违反，在多个领域构建受控的对比样本对。大量实验表明，其与后训练性能的相关性超过90%。关键的是，通过强化学习（RL）进行的验证证实，RuDE能够有效识别出性能优于更大模型的高潜力小模型，为基础模型开发提供了一种计算高效的机制。

Learning Agentic Policy from Action Guidance

从动作指导中学习智能体策略

Authors: Yuxiang Ji, Zengbin Wang, Yong Wang, Shidong Yang, Ziyu Ma, Guanhua Chen, Zonghua Sun, Liaoni Wu, Xiangxiang Chu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.12004
Pdf link: https://arxiv.org/pdf/2605.12004
Abstract Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
中文摘要 面向大型语言模型（LLM）的智能体强化学习（RL）在很大程度上依赖于基础策略的探索能力，因为训练信号只出现在其能力范围之内。对于基础策略无法到达奖励状态的任务，需要额外的训练或外部指导来恢复有效的学习信号。我们不依赖代价高昂的迭代式监督微调（SFT），而是利用日常人类交互中产生的大量动作数据。我们提出了ActGuide-RL，它将动作数据以计划式参考指导的形式注入，使智能体策略能够克服通往奖励状态的可达性障碍。随后，通过混合策略训练对有指导和无指导的rollout进行联合优化，将探索收益内化到无指导策略中。基于对收益与风险权衡的理论和实证分析，我们采用最小干预原则，仅将指导作为自适应的后备手段，在匹配任务难度的同时将离策略风险降至最低。在搜索智能体基准测试中，ActGuide-RL相比零RL有显著提升（使用Qwen3-4B时，在GAIA上提升10.7 pp，在XBench上提升19 pp），并且在没有任何冷启动的情况下与SFT+RL流程表现相当。这表明了一种新的智能体强化学习范式：用可扩展的动作指导来减少对大量SFT数据的依赖。

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

SAGE：用于LLM知识评估的可扩展自动化鲁棒性增强

Authors: Xiaoyuan Li, Yuzhe Wang, Moxin Li, Keqin Bao, Rui Men, Yichang Zhang, Dayiheng Liu, Wenjie Wang, Fuli Feng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.12022
Pdf link: https://arxiv.org/pdf/2605.12022
Abstract Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.
中文摘要 大型语言模型（LLMs）在标准知识评估基准测试中表现优异，但最新研究表明，面对以不同形式考查相同知识的问题变体时，其知识能力仍然脆弱。因此，有必要对现有知识评估基准进行鲁棒性增强，但当前由LLM辅助的“先生成后验证”流程由于变体生成产出率低、变体验证不可靠，成本高且难以扩展。我们提出了SAGE（Scalable Automated Generation of Robustness BEnchmarks，可扩展的鲁棒性基准自动生成），这是一个利用微调后的小型模型对知识评估基准进行可扩展鲁棒性增强的框架。SAGE由两部分组成：VariantQual，一个基于评分标准、在人工标注种子数据上训练的验证器；以及VariantGen，一个先经监督微调初始化、再以VariantQual作为奖励模型通过强化学习进一步优化的变体生成器。在HellaSwag上的实验表明，SAGE能以显著更低的成本构建大规模的鲁棒性增强基准，其质量可与人工标注的HellaSwag-Pro相媲美，同时微调后的模型无需针对特定基准进行微调即可进一步泛化到MMLU。

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

SkillGraph：通过不断演化的技能图为智能体提供技能增强的强化学习

Authors: Xiaoyuan Li, Moxin Li, Keqin Bao, Yubo Ma, Wenjie Wang, Dayiheng Liu, Fuli Feng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.12039
Pdf link: https://arxiv.org/pdf/2605.12039
Abstract Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
中文摘要 技能库使大型语言模型智能体能够复用过往交互中的经验，但大多数现有技能库将技能作为孤立条目存储，并仅通过语义相似性进行检索。这给组合式任务带来了两个关键挑战。首先，智能体不仅要识别相关技能，还要识别它们之间如何相互依赖、层层递进。其次，这也使技能库的维护变得困难，因为系统缺乏结构性线索来决定何时应合并、拆分或移除技能。我们提出了SKILLGRAPH框架，它将可复用技能表示为有向图中的节点，并用带类型的边编码前置、增强和共现关系。面对新任务时，SKILLGRAPH检索的不只是单个技能，而是一个能够指导多步决策的有序技能子图。该图会根据智能体轨迹和强化学习反馈持续更新，使技能库和智能体策略能够共同改进。在ALFWorld、WebShop和七个搜索增强问答任务上的实验表明，SKILLGRAPH相较于记忆增强的强化学习方法取得了最先进的性能，在需要组合多项技能的复杂任务上提升尤为显著。
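
The typed-edge skill graph described in this abstract can be illustrated with a small sketch. The abstract does not give the data structure or retrieval procedure, so everything below is an assumption: the class name `SkillGraph`, the method names, the example skills, and the choice of a topological sort over prerequisite edges as a stand-in for "ordered skill subgraph" retrieval.

```python
from collections import defaultdict, deque

RELATIONS = {"prerequisite", "enhancement", "co-occurrence"}

class SkillGraph:
    """Hypothetical directed skill graph with typed edges (illustrative only)."""

    def __init__(self):
        # edges[src] is a list of (dst, relation) pairs
        self.edges = defaultdict(list)

    def add_edge(self, src, dst, relation):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if (dst, relation) not in self.edges[src]:  # ignore duplicate edges
            self.edges[src].append((dst, relation))

    def ordered_subgraph(self, targets):
        """Return the targets plus their transitive prerequisites, prerequisites first."""
        # prereqs_of[dst] holds the skills that must come before dst
        prereqs_of = defaultdict(set)
        for src, outs in list(self.edges.items()):
            for dst, rel in outs:
                if rel == "prerequisite":
                    prereqs_of[dst].add(src)
        # Collect every skill reachable backwards through prerequisite edges.
        needed, stack = set(), list(targets)
        while stack:
            skill = stack.pop()
            if skill not in needed:
                needed.add(skill)
                stack.extend(prereqs_of[skill])
        # Kahn's algorithm restricted to the needed skills.
        indegree = {s: len(prereqs_of[s]) for s in needed}
        queue = deque(sorted(s for s in needed if indegree[s] == 0))
        order = []
        while queue:
            skill = queue.popleft()
            order.append(skill)
            for dst, rel in self.edges[skill]:
                if rel == "prerequisite" and dst in needed:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        queue.append(dst)
        return order  # shorter than `needed` if the prerequisites contain a cycle

g = SkillGraph()
g.add_edge("find_object", "pick_up", "prerequisite")
g.add_edge("pick_up", "heat_object", "prerequisite")
g.add_edge("open_microwave", "heat_object", "prerequisite")
g.add_edge("pick_up", "clean_object", "co-occurrence")
plan = g.ordered_subgraph(["heat_object"])
print(plan)
```

Only prerequisite edges constrain the ordering in this sketch; enhancement and co-occurrence edges are stored but ignored during retrieval, which is a simplification rather than a claim about the paper's method.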

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

异步智能体强化学习中缺失的旧logits：离策略修正中的语义失配与修复方法

Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12070
Pdf link: https://arxiv.org/pdf/2605.12070
Abstract Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at this https URL.
中文摘要 异步强化学习通过将样本生成与策略优化解耦，提高了大型语言模型智能体的rollout吞吐量，但也给PPO式的离策略修正带来了一种关键的失效模式。在异构训练系统中，总重要性比率理想情况下应分解为两个语义不同的因子：一个是训练-推理差异项，用于在同一行为策略版本下对齐推理端与训练端的分布；另一个是策略陈旧项，用于约束从历史策略到当前策略的更新。我们指出，带有延迟更新和部分rollout的实际异步流水线常常会丢失所需的历史训练端logits，即旧logits。这一旧logits缺失问题使差异修复与陈旧性修正纠缠在一起，破坏了解耦修正的预期语义，并使裁剪阈值和掩码阈值产生不良的相互作用。为解决这一问题，我们同时研究了精确修正和近似修正两条路线。我们提出了三种精确获取旧logits的策略：基于快照的版本跟踪、专用的旧logits模型，以及通过中断部分rollout实现同步，并比较了它们的系统权衡。在近似修正方面，我们关注的是：当无法以低成本恢复精确的旧logits时，如何通过更合适的近似策略保留解耦修正的好处，同时不引入额外的系统开销。基于这一分析，我们采用了一种修订版的PPO-EWMA方法，它在训练速度和优化性能上均取得了显著提升。代码见此 https URL。
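
The two-factor decomposition of the importance ratio mentioned in this abstract can be checked numerically with a minimal sketch. The variable names and the example log-probabilities are assumptions made for illustration; the clipping range `eps=0.2` is a common PPO default, not a value taken from the paper.

```python
import math

def decompose_ratio(logp_infer_old, logp_train_old, logp_train_new):
    """Split the total importance ratio for one token into two factors.

    logp_infer_old: log-prob under the inference engine at the behavior version
    logp_train_old: log-prob under the training stack at the same version
                    (the "old logits" that asynchronous pipelines may lose)
    logp_train_new: log-prob under the current training-side policy
    """
    discrepancy = math.exp(logp_train_old - logp_infer_old)  # training-inference term
    staleness = math.exp(logp_train_new - logp_train_old)    # policy-staleness term
    total = math.exp(logp_train_new - logp_infer_old)
    return discrepancy, staleness, total

def clip_ratio(ratio, eps=0.2):
    """PPO-style clipping, applied here to the staleness factor only."""
    return min(max(ratio, 1.0 - eps), 1.0 + eps)

# Hypothetical per-token log-probabilities.
discrepancy, staleness, total = decompose_ratio(-1.20, -1.25, -1.00)
clipped = clip_ratio(staleness)
print(discrepancy, staleness, total, clipped)
```

When `logp_train_old` is unavailable, only `total` can be computed, so a single clipping threshold ends up acting on both effects at once; this is the entanglement the abstract describes.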

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

学习什么才重要：面向机器人探索的自适应信息论目标

Authors: Youwei Yu, Jionghao Wang, Zhengming Yu, Wenping Wang, Lantao Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.12084
Pdf link: https://arxiv.org/pdf/2605.12084
Abstract Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design. QOED (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, QOED provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate QOED on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of 35.23% and 21.98%, respectively. When integrated as an exploration objective in model-based policy optimization, QOED further improves policy performance over established RL baselines.
中文摘要 为机器人探索设计可学习的信息论目标仍然充满挑战。这类目标旨在引导探索获取能够降低模型参数不确定性的数据，但所收集的数据究竟能揭示哪些信息往往并不清楚。虽然强化学习（RL）可以优化给定目标，但在高维机器人系统中构建反映参数可学习性的目标十分困难。许多参数方向是弱可观测或不可辨识的，而且即使选择了可辨识的方向，被忽略的方向仍会影响探索并扭曲信息度量。为应对这一挑战，我们提出了准最优实验设计（QOED），这是一种以最优实验设计为基础的自适应信息目标。QOED（i）对Fisher信息矩阵进行特征空间分析，以识别可观测子空间并选择可辨识的参数方向，（ii）修改探索目标以强调这些方向，同时抑制非关键参数带来的干扰效应。在干扰影响有界、关键方向与干扰方向之间耦合有限的条件下，QOED为探索全部参数的理想信息目标提供了常数因子近似。我们在仿真和真实世界的导航与操作任务上评估了QOED，其中可辨识方向选择和干扰抑制分别带来了35.23%和21.98%的性能提升。当作为探索目标集成到基于模型的策略优化中时，QOED相较于成熟的强化学习基线进一步提升了策略性能。

Rollout Cards: A Reproducibility Standard for Agent Research

Rollout卡片：智能体研究的可复现性标准

Authors: Charlie Masters, Ziyuan Liu, Stefano V. Albrecht
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12131
Pdf link: https://arxiv.org/pdf/2605.12131
Abstract Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
中文摘要 长期困扰机器学习和强化学习的可复现性问题，如今正在智能体研究中浮现：论文通过报告的分数来比较系统，而这些分数背后的rollout记录却难以检查。对于智能体任务，这一点很重要，因为当评估选取rollout的不同部分或采用不同的报告规则时，同一行为可能得到不同的报告分数。在对50个热门训练和评估代码仓库的结构化审计中，我们发现没有一个仓库在给出主要分数的同时报告有多少次运行失败、出错或被跳过。我们还记录了37个案例，在这些案例中，对于固定的证据，报告规则可以改变任务成功率、成本/token统计或计时测量，有时变化十分显著。我们将rollout记录而非报告分数视为智能体研究的可复现性单位。我们提出了rollout卡片（rollout cards）：一种发布包，它保留rollout记录，并声明报告分数背后的视图、报告规则和丢弃清单。我们在两种场景下验证了rollout卡片。首先，利用工具安全、多智能体系统、定理证明和搜索领域的四个部分公开发布的数据，我们得以计算其原始报告未包含的分析。其次，对短答案、代码生成和工具使用任务中保存的基准输出重新评分表明，仅改变报告规则就可使报告分数变化20.9个绝对百分点，并在某些情况下逆转前沿模型的排名。我们发布了集成于Ergon（一个开源强化学习gym）的参考实现，并公开发布了由Ergon生成的rollout卡片导出文件，涵盖工具使用、软件工程、网页交互、多智能体协调、安全和搜索等基准，以支持未来的研究。
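
The claim that a reporting rule alone can move a score for fixed evidence is easy to reproduce on toy data. The sketch below is not the rollout-card schema or the Ergon API; the record fields, the two rule names, and the `drops_manifest` helper are all hypothetical and chosen only to make the effect visible.

```python
from collections import Counter

# Hypothetical preserved rollout records: five runs, two of which never finished.
records = [
    {"task": "t1", "status": "success"},
    {"task": "t2", "status": "failure"},
    {"task": "t3", "status": "error"},
    {"task": "t4", "status": "success"},
    {"task": "t5", "status": "skipped"},
]

def success_rate(records, rule):
    """Compute a task-success rate under an explicitly named reporting rule."""
    if rule == "completed_only":
        # Errored and skipped runs are silently dropped from the denominator.
        pool = [r for r in records if r["status"] in {"success", "failure"}]
    elif rule == "all_attempted":
        # Every run counts; anything other than success is a failure.
        pool = list(records)
    else:
        raise ValueError(f"unknown reporting rule: {rule}")
    if not pool:
        return 0.0
    return sum(r["status"] == "success" for r in pool) / len(pool)

def drops_manifest(records):
    """Count the runs that a 'completed_only' rule would exclude, by status."""
    dropped = (r["status"] for r in records if r["status"] not in {"success", "failure"})
    return dict(Counter(dropped))

rate_completed = success_rate(records, "completed_only")
rate_all = success_rate(records, "all_attempted")
manifest = drops_manifest(records)
print(rate_completed, rate_all, manifest)
```

The same five records give about 0.67 under one rule and 0.40 under the other, which is why publishing the records together with the rule and the drop counts makes the headline number checkable.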

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

自洽潜在推理：视觉语言模型中的长潜序列推理

Authors: Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.12163
Pdf link: https://arxiv.org/pdf/2605.12163
Abstract In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
中文摘要 在语言推理中，更长的思维链持续带来更好的表现，这自然让人推测视觉潜在推理同样可能受益于更长的潜在序列。然而，我们发现了一个反直觉的现象：随着潜在序列变长，现有潜在视觉推理方法的性能会系统性地下降。我们揭示了其根本原因：信息增益坍缩，即自回归生成使每一步都高度依赖先前的输出，因此后续token几乎无法引入新信息。我们进一步发现，被用作监督目标的高度池化（$\geq 128\times$）图像嵌入所提供的信号并不比无意义的占位符更多。基于这些洞见，我们提出了SCOLAR（Self-COnsistent LAtent Reasoning，自洽潜在推理），它引入了一个轻量级detransformer，利用LLM的全序列隐藏状态一次性生成辅助视觉token，且每个token都独立锚定到原始视觉空间。结合三阶段SFT和ALPO强化学习，SCOLAR将可接受的潜在CoT长度延长了$30\times$以上，在真实世界推理基准上取得了开源模型中的最先进水平（相比骨干模型提升14.12%），并展现出强大的分布外泛化能力。

Overtrained, Not Misaligned

是过度训练，而非未对齐

Authors: Joel Schreiber, Ariel Goldstein
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12199
Pdf link: https://arxiv.org/pdf/2605.12199
Abstract Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
中文摘要 涌现性未对齐（emergent misalignment，EM）指在狭窄任务（如不安全代码）上进行微调会导致模型在无关领域出现广泛的未对齐，该现象最早由Betley等人（2025）展示。我们开展了迄今为止最全面的EM研究，复现了最初在GPT-4o上的发现，并扩展到4个系列（Llama、Qwen、DeepSeek、GPT-OSS）的12个开源模型，参数规模从8B到671B，使用多个随机种子评估了超过一百万条模型回答。我们发现EM在GPT-4o上可以复现，但远非普遍现象：12个开源模型中只有2个（17%）在不同种子下表现出一致的EM，且模型规模与EM易感性之间存在显著相关。通过微调过程中检查点层面的分析，我们证明EM出现在训练后期，它不同于主任务的接近收敛，且发生在其之后，这表明EM源于任务收敛之后的继续训练。这带来了实用的缓解措施：早停可以消除EM，同时平均保留93%的任务性能，而谨慎选择学习率可进一步降低风险。在医学微调上的跨领域验证证实这些规律具有普遍性：规模与EM的相关性增强（r = 0.90），并且在67%的情况下仍可通过早停避免向不真实性的过度泛化，不过语义上相近的训练领域所产生的未对齐更难分离。随着LLM日益融入现实世界系统，微调和强化学习仍然是调整模型行为的主要方法。我们的发现表明，只要采用恰当的训练实践，EM是可以避免的，从而将其从一种不可预见的微调风险重新定位为一种可避免的训练产物。

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

论多稳态性对强化学习中时域泛化的重要性

Authors: Asad Bakija, Florent De Geeter, Julien Brandoit, Pierre Sacré, Guillaume Drion
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12206
Pdf link: https://arxiv.org/pdf/2605.12206
Abstract In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observation and the optimal action are separated by many time steps (called the horizon), are particularly challenging: training suffers from poor generalization, severe sample inefficiency, and prohibitive exploration costs. Ideally, an agent trained on short horizons would retain optimal behavior at arbitrarily longer ones, but no formal framework currently characterizes when this is achievable. To fill this gap, we formalized temporal horizon generalization, the property that a policy remains optimal for all horizons, derived a necessary and sufficient condition for it, and experimentally evaluated the ability of nonlinear and parallelizable RNN variants to achieve it. This paper presents the resulting theoretical framework, the empirical evaluation, and the dynamical interpretation linking RNN behavior to temporal horizon generalization. Our analyses reveal that multistability is necessary for temporal horizon generalization and, in simple tasks, sufficient; more complex tasks further require transient dynamics. In contrast, modern parallelizable architectures, namely state space models and gated linear RNNs, are monostable by construction and consequently fail to generalize across temporal horizons. We conclude that multistability and transient dynamics are two essential and complementary dynamical regimes for horizon generalization, and that no current parallelizable RNN exhibits both. Designing parallelizable architectures that combine these regimes thus emerges as a key direction for scalable long-horizon RL.
中文摘要 在强化学习（RL）中，在部分可观测马尔可夫决策过程（POMDP）中行动的智能体必须依赖记忆（通常编码在循环神经网络RNN中）来整合过去观测中的信息。在长时域POMDP中，相关观测与最优动作之间相隔许多时间步（称为时域），这类问题尤其具有挑战性：训练存在泛化能力差、样本效率严重低下和探索成本过高的问题。理想情况下，在短时域上训练的智能体应能在任意更长的时域上保持最优行为，但目前尚无形式化框架刻画这在何时可以实现。为填补这一空白，我们形式化了时间时域泛化，即策略对所有时域都保持最优的性质，推导出其充分必要条件，并通过实验评估了非线性RNN变体和可并行化RNN变体实现该性质的能力。本文介绍了由此得到的理论框架、实证评估，以及将RNN行为与时间时域泛化联系起来的动力学解释。我们的分析表明，多稳态性是时间时域泛化的必要条件，并且在简单任务中也是充分条件；更复杂的任务还需要瞬态动力学。相比之下，现代可并行化架构，即状态空间模型和门控线性RNN，在构造上就是单稳态的，因此无法跨时域泛化。我们的结论是，多稳态性和瞬态动力学是时域泛化中两种必不可少且互补的动力学机制，而目前没有任何可并行化的RNN同时具备两者。因此，设计结合这两种机制的可并行化架构成为可扩展长时域强化学习的一个关键方向。
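
The link between multistability and memory over arbitrary horizons can be seen in a one-unit toy recurrence. This is an illustrative assumption, not the paper's formal condition or any of its architectures: the update `h <- tanh(w*h + x)`, the weights 2.0 and 0.5, and the horizon of 200 steps are all chosen only for the demonstration.

```python
import math

def final_state(w, pulse, horizon):
    """Run h <- tanh(w*h + x) with one input pulse at t=0, then zero input."""
    h = math.tanh(pulse)  # first update from h=0 with input x=pulse
    for _ in range(horizon):
        h = math.tanh(w * h)  # no further input: only the recurrent dynamics act
    return h

# w > 1: two stable fixed points, so the sign of the pulse is remembered.
bistable_pos = final_state(2.0, +1.0, 200)
bistable_neg = final_state(2.0, -1.0, 200)

# w < 1: a single stable fixed point at 0, so the pulse is forgotten.
monostable_pos = final_state(0.5, +1.0, 200)
monostable_neg = final_state(0.5, -1.0, 200)

print(bistable_pos, bistable_neg, monostable_pos, monostable_neg)
```

In the bistable case the state after the delay still separates the two inputs no matter how long the delay is, whereas the monostable unit converges to the same value for both, so nothing downstream can recover which observation occurred.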

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

深度强化学习的内在替代条件反射

Authors: Rodney A Sanchez, Ferat Sahin, Alex Ororbia, Jamison Heard
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12224
Pdf link: https://arxiv.org/pdf/2605.12224
Abstract Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the environment as well as from others. Off-policy or learn-by-example methods can learn from demonstrators' representations, but they require access to the demonstrating agent's policies or their reward functions. Our work overcomes this direct sampling limitation by introducing vicarious conditioning as an intrinsic reward mechanism. We draw from psychological and biological literature to provide a foundation for vicarious conditioning and use memory-based methods to implement its four steps: attention, retention, reproduction, and reinforcement. Crucially, our vicarious conditioning paradigms support low-shot learning and do not require the demonstrator agent's policy nor its reward functions. We evaluate our approach in the MiniWorld Sidewalk environment, one of the few public environments that features a non-descriptive terminal condition (no reward provided upon agent death), and extend it to Box2D's CarRacing environment. Our results across both environments demonstrate that vicarious conditioning enables longer episode lengths by discouraging the agent from non-descriptive terminal conditions and guiding the agent toward desirable states. Overall, this work emulates a cognitively-plausible learning paradigm better suited to problems such as single-life learning or continual learning.
中文摘要 强化学习的进展产生了多种复杂而有用的内在驱动力；关键在于，这些驱动力都是在直接条件作用的范式下运作的。这种条件作用形式限制了智能体从环境以及从他者那里学习的方式，从而限制了智能体的能力。离策略方法或示例学习方法可以从示范者的表征中学习，但它们需要访问示范智能体的策略或其奖励函数。我们的工作通过引入替代性条件作用作为内在奖励机制，克服了这种直接采样的限制。我们借鉴心理学和生物学文献为替代性条件作用奠定基础，并使用基于记忆的方法实现其四个步骤：注意、保持、再现和强化。关键的是，我们的替代性条件作用范式支持少样本学习，且既不需要示范智能体的策略，也不需要其奖励函数。我们在MiniWorld Sidewalk环境中评估了我们的方法，这是少数几个具有非描述性终止条件（智能体死亡时不提供奖励）的公开环境之一，并将其扩展到Box2D的CarRacing环境。两个环境中的结果表明，替代性条件作用通过使智能体远离非描述性终止条件并引导其走向理想状态，实现了更长的回合长度。总体而言，这项工作模拟了一种认知上合理的学习范式，更适合单生命学习或持续学习等问题。

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

结合同策略优化与蒸馏实现大型语言模型的长上下文推理

Authors: Miguel Moura Ramos, Duarte M. Alves, André F. T. Martins
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.12227
Pdf link: https://arxiv.org/pdf/2605.12227
Abstract Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
中文摘要 让大型语言模型（LLMs）适应长上下文任务，需要能在数千个token上保持准确和连贯的后训练方法。现有方法在多个方面存在局限：1）监督微调（SFT）和知识蒸馏（KD）等离策略方法存在暴露偏差，且在长时域上从模型自身生成的错误中恢复的能力有限；2）群体相对策略优化（Group Relative Policy Optimization，GRPO）等同策略强化学习方法能更好地使训练与模型生成的状态对齐，但由于奖励稀疏而不稳定且样本效率低；3）同策略蒸馏（OPD）提供稠密的token级指导，但不能直接优化任意的奖励信号。本文提出蒸馏式群体相对策略优化（Distilled Group Relative Policy Optimization，dGRPO），这是一种面向长上下文推理的方法，它通过OPD用来自更强教师的稠密指导来增强GRPO。我们还提出了LongBlocks，一个涵盖多跳推理、上下文落地和长文本生成的合成长上下文数据集。我们开展了大量实验和消融研究，比较了离策略训练、稀疏奖励GRPO和我们的组合方法，由此得到了一套改进的长上下文对齐方案。总体而言，我们的结果表明，在单一目标中将基于结果的策略优化与知识蒸馏相结合，为长上下文推理提供了一条更稳定、更有效的路径，同时保留了短上下文能力。
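
One way to picture combining a sparse group-relative signal with dense teacher guidance in a single objective is sketched below. The abstract does not state the dGRPO loss, so the form here is an assumption: a standard group-normalized advantage, a per-token policy-gradient term without PPO-style clipping, and a sampled reverse-KL estimate toward the teacher weighted by a hypothetical `distill_weight`.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def combined_token_loss(advantage, logp_student, logp_teacher, distill_weight=0.5):
    """Per-token loss: policy-gradient term plus a dense distillation term.

    The distillation term is the single-sample estimate of the reverse KL,
    log p_student - log p_teacher, evaluated on a student-generated token.
    """
    policy_term = -advantage * logp_student
    distill_term = logp_student - logp_teacher
    return policy_term + distill_weight * distill_term

# Four rollouts for one prompt with a sparse 0/1 outcome reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
loss = combined_token_loss(advantage=1.0, logp_student=-0.5, logp_teacher=-0.2)
print(advantages, loss)
```

The advantage is constant across all tokens of a rollout, while the distillation term varies token by token, which is the sense in which the teacher supplies a denser signal than the outcome reward.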

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

TMRL：扩散时间步调制预训练促进高效策略微调的探索

Authors: Matthew M. Hong, Jesse Zhang, Anusha Nagabandi, Abhishek Gupta
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12236
Pdf link: https://arxiv.org/pdf/2605.12236
Abstract Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at this https URL.
中文摘要 使用强化学习（RL）微调预训练的机器人策略，往往会继承行为克隆（BC）预训练带来的瓶颈：后者产生的动作分布狭窄，缺乏下游探索所需的覆盖范围。我们提出了一个统一框架，通过衔接BC预训练和RL微调，提供高效微调机器人策略所需的探索能力。我们的预训练方法，即上下文平滑预训练（CSP），向策略输入注入前向扩散噪声，在精确模仿与广泛动作覆盖之间形成一个连续谱。随后，我们通过时间步调制强化学习（TMRL）微调预训练策略，它训练智能体在微调过程中通过调制扩散时间步来动态调整这一条件，从而实现对探索的显式控制。TMRL可与任意策略输入（例如状态、三维点云或基于图像的VLA策略）无缝集成，我们证明它提高了RL微调的样本效率。值得注意的是，TMRL能够在一小时内在复杂操作任务上成功完成真实世界的微调。视频和代码见此 https URL。
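
Injecting forward-diffusion noise into a policy input can be sketched with the standard DDPM-style forward process, where the timestep sets how much of the clean input survives. The linear beta schedule, its endpoints, the 100-step horizon, and the function names are assumptions; the abstract does not specify the schedule CSP or TMRL actually use.

```python
import math
import random

def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factor after t steps, for 0 <= t <= num_steps."""
    product = 1.0
    for k in range(t):
        beta = beta_start + (beta_end - beta_start) * k / (num_steps - 1)
        product *= 1.0 - beta
    return product

def noise_input(x0, t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
state = [0.5, -1.0]                 # hypothetical low-dimensional policy input
clean = noise_input(state, 0, rng)  # t=0 leaves the input untouched
noisy = noise_input(state, 100, rng)
print(clean, noisy, alpha_bar(10), alpha_bar(100))
```

A small timestep keeps the input close to the demonstration data, while a large one blurs it toward noise, so a policy that chooses the timestep is effectively choosing how far to depart from the imitated behavior.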

Delay-Empowered Causal Hierarchical Reinforcement Learning

延迟赋能因果层级强化学习

Authors: Chenran Zhao, Dianxi Shi, Haotian Wang, Mengzhu Wang, Yaowen Zhang, Chunping Qiu, Shaowu Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12261
Pdf link: https://arxiv.org/pdf/2605.12261
Abstract Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.
中文摘要 许多现实任务涉及延迟效应，即行动的结果在不同的时间延迟后才显现。现有的延迟感知强化学习方法通常依赖状态增强、对延迟分布的先验知识或对非延迟数据的访问，限制了其泛化能力。相比之下，层级强化学习因其层级结构在处理延迟方面具有天然优势，但现有方法仅限于固定延迟。为解决这些局限性，我们提出了延迟赋能因果层级强化学习（DECHRL）。DECHRL显式建模了状态转移的因果结构及其相关的随机延迟分布。二者随后被纳入延迟感知的赋能目标，驱动智能体主动探索高度可控的状态，从而提升在时间不确定性下的表现。我们在具有随机延迟的改造版2D-Minecraft和MiniGrid环境中评估了DECHRL。实验结果表明，DECHRL能够有效地对时间延迟建模，并且在时间不确定性下的决策表现显著优于基线。

PriorZero: Bridging Language Priors and World Models for Decision Making

PriorZero：连接语言先验与世界模型以促进决策

Authors: Junyu Xiong, Yuan Pu, Jia Tang, Yazhe Niu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12289
Pdf link: https://arxiv.org/pdf/2605.12289
Abstract Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics, while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions; its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at this https URL.
中文摘要 利用大型语言模型（LLMs）丰富的世界知识来增强强化学习（RL）智能体，为实现通用智能提供了一条有前景的道路。然而，一种根本性的先验动力学不匹配阻碍了现有方法：静态的LLM知识无法直接适应长视野任务的复杂过渡动态。将LLM先验作为固定策略限制了探索多样性，因为先验对环境特定动态视而不见;而端到端微调则存在优化不稳定性和信用分配问题。为弥合这一空白，我们提出了PriorZero，这是一个统一框架，通过解耦的推广-训练设计，将LLM衍生的概念先验整合到基于世界模型的规划中。在推广过程中，一种新颖的根先验注入机制仅在蒙特卡洛树搜索（MCTS）根节点上集成了LLM先验，将搜索聚焦于语义上有前景的动作，同时保持了全球模型的深度前瞻能力。在训练过程中，PriorZero将世界模型学习与LLM适配解耦：世界模型在交互数据上不断优化，共同提升其动态、策略和价值预测，然后利用其价值估计通过交替优化，提供细粒度的信用分配信号，实现LLM的稳定微调。在多种基准测试中的实验，包括Jericho的文本冒险游戏和BabyAI中的指令遵循网格世界任务，表明PriorZero持续提升探索效率和渐近性能，为LLM赋能的决策奠定了有前景的框架。我们的代码可在此 https URL 访问。
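
The root-prior injection described in the PriorZero abstract can be pictured with a small sketch. The abstract does not say how the LLM prior is combined with the world model's policy prior, so the convex mixture in `mix_root_prior`, the `weight` parameter, and the action names below are all hypothetical. The selection rule is the standard PUCT score from AlphaZero-style MCTS, not necessarily the exact rule the authors use.

```python
import math


def mix_root_prior(model_prior, llm_prior, weight=0.5):
    """Convex mixture of two action distributions.

    Hypothetical mixing rule: the abstract does not specify how the
    LLM prior is injected, only that it happens at the root node.
    """
    mixed = {
        action: (1 - weight) * p + weight * llm_prior.get(action, 0.0)
        for action, p in model_prior.items()
    }
    total = sum(mixed.values())
    return {action: p / total for action, p in mixed.items()}


def node_prior(model_prior, llm_prior, depth, weight=0.5):
    """Inject the LLM prior at the root only (depth 0).

    Deeper nodes keep the world model's own policy prior, so the
    lookahead below the root is unchanged.
    """
    if depth == 0:
        return mix_root_prior(model_prior, llm_prior, weight)
    return dict(model_prior)


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.25):
    """Standard PUCT selection score used in AlphaZero-style MCTS."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


# Toy text-game example with made-up actions and probabilities.
model_prior = {"north": 0.5, "take key": 0.25, "wait": 0.25}
llm_prior = {"north": 0.1, "take key": 0.8, "wait": 0.1}

root = node_prior(model_prior, llm_prior, depth=0)
deeper = node_prior(model_prior, llm_prior, depth=1)
```

With these toy numbers the root prior shifts most of its mass to "take key", while the prior at depth 1 is still the world model's. That is the intended effect of restricting the injection to the root: the LLM steers which branches get explored first, but value estimates deeper in the tree come from the learned model alone.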

Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

通过隐式因果图建模实现可转移的延迟感知强化学习

Authors: Chenran Zhao, Dianxi Shi, Yaowen Zhang, Chunping Qiu, Shaowu Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12312
Pdf link: https://arxiv.org/pdf/2605.12312
Abstract Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.
中文摘要 随机延迟会削弱动作与随后状态反馈之间的时间对应，使智能体难以识别动作效应的真实传播过程。在跨任务情境中，任务目标和奖励表述的变化进一步降低了先前获得任务知识的可重复使用性。为解决这一问题，本文提出了一种基于隐式因果图建模的可转移延迟感知强化学习方法。该方法使用场节点编码器将高维观测表示为具有节点级语义的潜在状态，并采用消息传递机制来刻画节点间的动态因果依赖关系，从而学习可迁移的结构化表示和环境动力学知识。基于此，融入了想象驱动的行为学习与规划，以优化潜在空间的策略，实现跨任务知识转移和快速适应。实验结果显示，该方法在DMC连续控制任务中表现优于基线方法，且有随机延迟。跨任务转移实验进一步表明，所学到的结构化表示和动态知识可以有效地转移到新任务中，并显著加速策略适应。

Reinforcing VLAs in Task-Agnostic World Models

在任务无关世界模型中强化VLA

Authors: Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12334
Pdf link: https://arxiv.org/pdf/2605.12334
Abstract Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
中文摘要 通过强化学习（RL）在学习世界模型中进行训练后的视觉-语言-行动（VLA）模型，已成为一种有效策略，能够适应新任务，而无需昂贵的现实世界互动。然而，虽然使用想象轨迹降低了政策训练的样本复杂度，现有方法仍然高度依赖任务特定数据来微调世界模型和奖励模型，根本限制了其可扩展性，仅限于看不见的任务。为克服这一点，我们认为世界模型和奖励模型应捕捉可转移的物理先验，从而实现零样本推断。我们提出了RAW-Dream（在任务无关性世界梦中强化VLA）这一新范式，完全将世界模型学习与下游任务依赖分开。RAW-Dream 利用一个预训练的无任务行为世界模型来预测未来推广，并使用现成的视觉语言模型（VLM）来生成奖励。由于这两个组件都与任务无关，VLA可以完全在零样本想象中为任何新任务进行微调。此外，为了减轻世界模型的幻觉，我们引入了双噪声验证机制，以过滤不可靠的推广。在模拟和现实环境中的广泛实验显示了持续的性能提升，证明广义物理先验可以有效替代昂贵的任务依赖数据，为VLA适应提供了高度可扩展的路线图。

BSO: Safety Alignment Is Density Ratio Matching

BSO：安全对齐是密度比匹配

Authors: Tien-Phat Nguyen, Truong Nguyen, Thin Nguyen, Duy Minh Ho Nguyen, Ngoc-Thanh Dinh, Trung Le
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12339
Pdf link: https://arxiv.org/pdf/2605.12339
Abstract Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
中文摘要 使语言模型既具帮助性又安全，通常需要复杂的流程——分别的奖励和成本模型、在线强化学习以及原始-双重更新。近期的直接偏好优化方法简化了训练，但通过多阶段程序或启发式边际项等临时修改来实现安全性，缺乏原则性推导。我们证明，最优安全策略的似然比存在闭式分解，将安全比对归约为密度比匹配问题。最小化数据与模型比值之间的布雷格曼散度，得到布雷格曼安全优化（BSO），这是一类单阶段损失函数，每个损失函数由凸发生器诱导，可证明恢复最优安全策略。BSO既通用又简单：它不需要辅助模型，仅引入一个超参数，超出标准偏好优化，并将现有的安全意识方法作为特殊案例恢复。各安全对齐基准的实验表明，BSO能够持续提升安全与帮助的权衡。
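
As a reminder of what "a loss induced by a convex generator" means in the BSO abstract, the scalar Bregman divergence for a differentiable convex function f is D_f(a, b) = f(a) - f(b) - f'(b)(a - b). The sketch below evaluates it for two common generators. It is only an illustration of the divergence itself: the names `GENERATORS` and `ratio_matching_loss` are hypothetical, the ratios are made-up numbers, and the actual BSO losses (which must be estimated from preference data rather than from known ratios) are not given in the abstract.

```python
import math


def bregman(f, f_prime, a, b):
    """Scalar Bregman divergence D_f(a, b) = f(a) - f(b) - f'(b) * (a - b)."""
    return f(a) - f(b) - f_prime(b) * (a - b)


# Two standard convex generators and their derivatives.
GENERATORS = {
    # Gives D(a, b) = 0.5 * (a - b)^2, the least-squares case.
    "squared": (lambda t: 0.5 * (t - 1.0) ** 2, lambda t: t - 1.0),
    # Gives D(a, b) = a * log(a / b) - a + b, the generalized KL case.
    "kl": (lambda t: t * math.log(t) - t, lambda t: math.log(t)),
}


def ratio_matching_loss(data_ratios, model_ratios, generator="squared"):
    """Mean Bregman divergence between paired positive ratios.

    Assumes the data ratios are known, which is only true in a toy
    setting; it is here to show how the generator shapes the loss.
    """
    f, f_prime = GENERATORS[generator]
    pairs = list(zip(data_ratios, model_ratios))
    return sum(bregman(f, f_prime, r, m) for r, m in pairs) / len(pairs)


data = [2.0, 0.5, 1.0]
perfect = ratio_matching_loss(data, data, "kl")
squared_gap = ratio_matching_loss([2.0], [1.0], "squared")
kl_gap = ratio_matching_loss([2.0], [1.0], "kl")
```

Both generators give zero loss when the model ratio equals the data ratio and a positive loss otherwise; they differ in how strongly they penalize a given mismatch, which is the design freedom a generator family provides.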

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

离散流匹配用于离线到在线强化学习

Authors: Fairoz Nower Khan, Nabuat Zaman Nahim, Peizhong Ju
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12379
Pdf link: https://arxiv.org/pdf/2605.12379
Abstract Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.
中文摘要 许多强化学习（RL）任务具有离散的动作空间，但大多数基于扩散和流匹配的生成策略方法设计为连续控制。与此同时，生成策略通常高度依赖离线数据集，离线到在线的强化学习本身也具有挑战性，因为策略必须从新的交互中改进，同时不丢失从静态数据中学习到的有用行为。为应对这些挑战，我们引入了DRIFT在线微调方法，通过优势加权离散流匹配损耗更新离线预训练连续时间马尔可夫链（CTMC）策略。为了保留有用的预训练知识，我们增加了路径空间惩罚，使完整的CTMC轨迹分布正则化，而不仅仅是最终作用分布。对于大型离散动作空间，我们引入候选集近似，通过从参考策略推广和均匀探索中抽样的一小部分动作，更新演员。我们的理论分析表明，候选集合误差受缺失目标概率质量控制，且随着候选集合覆盖更多高概率动作，诱导CTMC生成误差减少。对主流离散行动强化学习任务的实验表明，我们的方法在所有任务中实现了稳定的离线到在线改进，在Jericho上使用简单GRU编码器取得了最高的平均得分，同时优于使用预训练语言模型的方法。受控实验进一步证实，路径空间惩罚在微调过程中保持有界，且CTMC生成器对奖赏转移的适应速度快于确定性基线。候选集机制得到了稳定性分析的支持，显示生成元误差随着候选覆盖率的增加呈指数级减少。
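
The candidate-set idea in the DRIFT abstract, and the "missing target probability mass" that its error bound depends on, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the action labels, and the simplification of drawing directly from a reference distribution (the abstract samples from reference-policy rollouts).

```python
import random


def build_candidate_set(reference_probs, n_reference, n_uniform, rng):
    """Union of actions drawn from a reference policy and uniform draws.

    Simplification: samples come straight from `reference_probs`
    instead of from rollouts of a reference policy.
    """
    actions = list(reference_probs)
    weights = [reference_probs[a] for a in actions]
    from_reference = rng.choices(actions, weights=weights, k=n_reference)
    from_uniform = rng.choices(actions, k=n_uniform)
    return set(from_reference) | set(from_uniform)


def missing_mass(target_probs, candidates):
    """Target probability mass on actions outside the candidate set."""
    return sum(p for a, p in target_probs.items() if a not in candidates)


# Toy action space with made-up probabilities.
reference = {"open door": 0.6, "go east": 0.3, "eat lamp": 0.05, "sing": 0.05}
target = {"open door": 0.7, "go east": 0.2, "eat lamp": 0.0, "sing": 0.1}

rng = random.Random(0)
candidates = build_candidate_set(reference, n_reference=3, n_uniform=1, rng=rng)
gap = missing_mass(target, candidates)
gap_if_two_best = missing_mass(target, {"open door", "go east"})
```

Adding an action to the candidate set can only lower the missing mass, so covering the high-probability actions first shrinks the gap fastest. That matches the abstract's qualitative claim; the exact bound and its exponential rate are stated in the paper, not reproduced here.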

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

信任批次，开或关策略：强化学习后训练的自适应策略优化

Authors: Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12380
Pdf link: https://arxiv.org/pdf/2605.12380
Abstract Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at this https URL.
中文摘要 强化学习在结构上比监督学习更难，因为该策略改变了其学习的数据分布。这种脆弱性在大型模型训练中尤为明显，训练和推广系统在数值精度、采样及其他实现细节上存在差异。现有方法通过在训练目标中添加超参数来管理这种脆弱性，使算法对其配置更敏感，且每当任务、模型规模或分布不匹配发生变化时，就需要重新调谐。这种脆弱性源于当前目标在训练开始前设置的超参数中纠缠的两个担忧：信任区域问题，即更新不应使策略偏离当前值太远;以及非策略问题，即旧有或不同行为策略的数据仅在更新保持可靠性时才会影响其。这两个问题都不是必须事先设定的恒定指标，其严重程度会反映在当前批次的保单比率分布中。我们提出了一个简单但有效的批量自适应目标，用策略比率的归一化有效样本量替代固定剪断。同一统计量限制了得分函数权重，并设定了非策略正则化器的强度，因此当比率几乎均匀时，更新会接近通常的策略内评分-函数更新;当数据陈旧或不匹配导致比率集中时，更新会自动收紧，同时高比率标记保持非零学习信号。在多种环境中的实验表明，我们的方法与调优基准一致甚至超过，没有引入新的客观超参数，并移除了若干现有参数。代码可在该 https URL 访问。
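
The statistic at the center of the "Trust the Batch" abstract is the normalized effective sample size of the policy ratios. Its usual definition for non-negative weights w_1..w_n is (sum w)^2 / (n * sum w^2), which equals 1 when all ratios are equal and approaches 1/n when one ratio dominates. The sketch below computes only that statistic; how the paper turns it into a weight cap and a regularizer strength is not spelled out in the abstract, so it is not attempted here. The ratio values are made up.

```python
def normalized_ess(ratios):
    """Normalized effective sample size of non-negative importance ratios.

    Returns a value in (0, 1]: 1.0 for uniform ratios, 1/n when a
    single ratio carries all the weight.
    """
    n = len(ratios)
    total = sum(ratios)
    total_sq = sum(r * r for r in ratios)
    return (total * total) / (n * total_sq)


on_policy_batch = [1.0, 1.0, 1.0, 1.0]    # fresh data, ratios near 1
stale_batch = [4.0, 0.0, 0.0, 0.0]        # one token dominates
mild_mismatch = [1.2, 0.9, 1.1, 0.8]

ess_on_policy = normalized_ess(on_policy_batch)
ess_stale = normalized_ess(stale_batch)
ess_mild = normalized_ess(mild_mismatch)
```

Because the value is computed from the current batch, it needs no threshold chosen before training, which is the property the abstract emphasizes when it contrasts the approach with fixed clipping.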

Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

多智能体强化学习中行为多样性触发事件

Authors: Hannes Büchi, Manon Flageat, Eduardo Sebastián, Amanda Prorok
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12388
Pdf link: https://arxiv.org/pdf/2605.12388
Abstract Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve, and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind fixed behaviors to fixed agent identities. Consequently, they are ill-equipped for tasks where agents need to take on different roles at very specific moments in time. We argue that, to define these behavioral transitions, the missing ingredient is events. Events are changes in the state of the system that induce qualitative changes in the task. Based on this view, we introduce a framework that decouples agent identity from behavior, capturing a continuous manifold from which agents instantiate their behaviors in response to events. This framework is based on two elements. First, to build an expressive behavior manifold, we introduce Neural Manifold Diversity (NMD), a formal distance metric that remains well-defined when behaviors are transient and agent-agnostic. Second, we use an event-based hypernetwork that generates Low-Rank Adaptation (LoRA) modules over a shared team policy, enabling on-the-fly agent-policy reconfiguration in response to events. We prove that this construction ensures that diversity does not interfere with reward maximization by design. Empirical results demonstrate that our framework outperforms established baselines across benchmarks while exhibiting zero-shot generalization, and that it is the only method that solves tasks requiring sequential behavior reassignment.
中文摘要 有效的多智能体合作要求智能体随着任务条件演变采取多样化行为，并且必须在恰当的时机采取。然而，当前促进多样性的多智能体强化学习（MARL）框架仍受限于将固定行为绑定于固定智能体身份。因此，他们无法胜任那些需要在非常特定时间点承担不同角色的任务。我们认为，要定义这些行为转变，缺失的要素是事件。事件是系统状态的变化，导致任务的质性变化。基于这一观点，我们引入了一个框架，将代理身份与行为解耦，捕捉一个连续流形，代理在事件中实例化其行为。该框架基于两个要素。首先，为了构建表达行为流形，我们引入了神经流形多样性（NMD），这是一个形式距离度量，当行为是瞬态且与智能体无关时，它依然定义清晰。其次，我们使用基于事件的超网络，基于共享团队策略生成低秩适配（LoRA）模块，支持对事件的实时代理策略重配置。我们证明了这种结构设计确保多样性不会干扰奖励最大化。实证结果表明，我们的框架在各基准测试中优于既有基线，同时具备零样本推广性，并且是唯一能解决需要顺序行为重分配任务的方法。

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

语义奖励崩溃与自适应人工智能系统中认知完整性的维护

Authors: William Parris
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12406
Pdf link: https://arxiv.org/pdf/2605.12406
Abstract Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.
中文摘要 人类反馈强化学习（RLHF）和偏好优化的最新进展显著提升了大型语言模型的可用性、一致性和安全性。然而，表现性确定性、幻觉连续性、校准漂移、谄媚和可见不确定性抑制等反复出现的行为，表明在标量化偏好优化系统中存在未解决的结构性问题。我们提出了语义奖励崩溃（SRC）：将语义上不同的评价不满压缩为广义优化信号。在SRC下，事实错误、不确定性披露、格式不满、延迟和社会偏好等类别尽管代表根本不同的认知类别，也可能纠缠在共享奖励拓扑中。我们认为，在广义评估压力下运作的自适应推理系统，可能会倾向于抑制可见的认知失败，而非维护校准后的不确定性完整性。这些行为严格被框架为优化的结果，而非欺骗或拟人化能动性。结合制度代理崩溃、指标博弈、软件可靠性工程和人类学习理论，我们提出不确定性披露和升级行为应被视为受保护的认知行为，而非全球性惩罚的任务未完成。最后，我们介绍了宪法奖励分层（CRS），这是一种领域感知奖励框架，旨在维护适应性学习系统中的差异化认知归因。我们不将CRS介绍为一个经过验证的解决方案，而是一个可检验的治理导向研究方向，需要进一步实证调查。

Aligning Flow Map Policies with Optimal Q-Guidance

将流映射策略与最优Q引导对齐

Authors: Christos Ziakas, Alessandra Russo, Avishek Joey Bose
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12416
Pdf link: https://arxiv.org/pdf/2605.12416
Abstract Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive FLOW MAP Q-GUIDANCE (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-GUIDED BEAM SEARCH (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.
中文摘要 基于表达型模型类的生成策略，如扩散和流匹配，非常适合复杂控制问题且高度多模态作用分布。然而，它们的表现力代价显著：生成每个动作通常需要模拟生成过程的多个步骤，导致连续决策展开时延迟增加。我们介绍了流程图策略，这是一种新型生成策略类别，旨在通过学习在现有基于流策略的生成动态中进行任意规模的跳跃，包括一步跳跃，实现快速行动生成。我们实例化了离线到在线强化学习（RL）的流程图策略，并将在线适应构建为信任区域优化问题，在保持离线策略接近的同时提升批判者的Q值。我们理论上推导出了FLOW MAP Q-GUIDANCE（FMQ），这是一个原则性的闭式学习目标，最适合在批评者引导的信任区域约束下调整离线流程图策略。我们进一步引入了Q引导光束搜索（QGBS），这是一种随机流图采样器，结合了重噪和光束搜索，实现迭代推断时间细化。在OGBench和RoboMimic的12项具有挑战性的机器人操作和移动任务中，FMQ在离线到在线强化学习中实现了最先进的性能，比之前的一步策略MVP相比平均成功率提升了21.3%。

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

ORCE：大型语言模型中口语信心的顺序感知对齐

Authors: Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou, Chao Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.12446
Pdf link: https://arxiv.org/pdf/2605.12446
Abstract Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question-answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
中文摘要 大型语言模型（LLMs）即使答案错误，也常常能以高确定性给出答案，因此在现实场景中部署可靠的置信度估计至关重要。语言置信度，即模型明确表达其自然语言的信心，提供了灵活且面向用户的不确定信号，即使代币日志不可用时也能应用。然而，现有的口头置信度方法常常会联合优化答案生成和置信度生成，这可能导致置信度对齐目标干扰答案准确性。本研究提出一个解耦且有序的语言置信校准框架。我们的方法首先生成一个答案，然后根据固定的问题-答案对估计置信度，从而实现置信度优化，同时不直接干扰答案生成过程。为了使置信度与正确性似然对齐，我们从多个模型完成中构建了一个基于抽样的替代指标，并优化基于排名的强化学习目标，鼓励估计正确性概率较高的回答获得更高的口头置信度。推理实验和知识密集型基准测试表明，我们的方法在很大程度上保持了答案准确性，同时提升了校准和失效预测性能。这些结果表明，通过将置信估计与答案生成解耦，并优化各回答置信度的相对排序，可以更可靠地对齐口头化置信。
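
One plausible reading of the two ingredients in the ORCE abstract, a sampling-based correctness surrogate and a rank-based objective, is sketched below. Both functions are assumptions made for illustration: the abstract does not define the surrogate (here, the fraction of sampled completions that agree with the answer) or the objective (here, the fraction of response pairs whose confidence order matches their surrogate order). The answers and confidence values are made up.

```python
def correctness_surrogate(answer, sampled_answers):
    """Fraction of sampled completions that agree with `answer`.

    Hypothetical surrogate in the spirit of self-consistency; the
    paper's construction may differ.
    """
    matches = sum(1 for s in sampled_answers if s == answer)
    return matches / len(sampled_answers)


def order_agreement(confidences, surrogates):
    """Fraction of response pairs ranked the same way by both signals.

    Pairs with tied surrogate values carry no ordering information
    and are skipped. Returns 0.0 if no informative pair exists.
    """
    concordant = 0
    informative = 0
    n = len(confidences)
    for i in range(n):
        for j in range(i + 1, n):
            surrogate_gap = surrogates[i] - surrogates[j]
            if surrogate_gap == 0:
                continue
            informative += 1
            if (confidences[i] - confidences[j]) * surrogate_gap > 0:
                concordant += 1
    return concordant / informative if informative else 0.0


samples = ["42", "42", "7", "42"]
likelihood = correctness_surrogate("42", samples)

surrogates = [0.75, 0.5, 0.25]
well_ordered = order_agreement([0.9, 0.5, 0.2], surrogates)
reversed_order = order_agreement([0.2, 0.5, 0.9], surrogates)
```

A score like `order_agreement` depends only on relative order, so a model is rewarded for ranking likely-correct answers above unlikely ones even if its absolute confidence values are miscalibrated. Because the confidence is produced after the answer is fixed, optimizing such a score leaves the answer text untouched, which is the decoupling the abstract describes.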

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

LychSim：一个可控且交互式的视觉研究模拟框架

Authors: Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.12449
Pdf link: https://arxiv.org/pdf/2605.12449
Abstract While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
中文摘要 虽然自我监督预训练减少了视觉系统对合成数据的依赖，但仿真仍然是闭环优化和严格的分发范围（OOD）评估不可或缺的工具。然而，现代模拟平台常常存在严峻的技术障碍，需要在计算机图形和游戏开发方面具备丰富的专业知识。在本研究中，我们介绍了LychSim，一个基于虚幻引擎5构建的高度可控且交互式的模拟框架，旨在弥合这一差距。LychSim 围绕三个关键设计构建：（1）简化的 Python API，抽象了底层引擎的复杂性;（2）能够生成多样化、高保真度环境、具有不同非分发（OOD）视觉挑战的程序化数据流水线，并结合丰富的二维和三维地面信息;以及（3）模型上下文协议（MCP）的原生集成，将模拟器转变为一个动态的闭环游戏场，用于推理智能大型语言模型。我们还进一步注释场景级程序规则和对象级姿态对齐，以实现语义对齐的3D地面信息和自动场景修改。我们展示了LychSim在多个下游应用中的能力，包括作为合成数据引擎、支持基于强化学习的对抗性检查器，以及促进交互式、语言驱动的场景布局生成。为了惠及更广泛的愿景社区，LychSim 将公开，包括完整源代码和各种数据注释。

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

迈向可负担能源：面向电力公用事业需求响应项目的Gymnasium环境

Authors: Jose E. Aguilar Escamilla, Lingdong Zhou, Xiangqi Zhu, Huazheng Wang
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12462
Pdf link: https://arxiv.org/pdf/2605.12462
Abstract Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
中文摘要 极端天气和波动的批发电力市场使住宅消费者面临灾难性的财务风险，然而配电层面的需求响应仍是实现电网灵活性和能源可负担性的重要工具。虽然需求响应计划可以通过在高价期发行金融信用来保护消费者，但尽管有大量公开的离线历史智能电表和批发价格数据，优化这种顺序决策过程仍是强化学习的独特挑战。离线历史数据未能捕捉电力公用事业公司定价信号与客户接受度和需求响应计划适应之间的动态互动反馈循环。为此，我们推出了DR-Gym，一个开源的在线兼容Gymnasium环境，旨在从电力公司的角度训练和评估需求响应。与现有设备级能源模拟器不同，我们的环境聚焦于市场级电力公用事业，提供了与电力公用事业相关的丰富观察空间。该模拟器还配备了针对现实极端事件校准的周期切换批发价格模型，以及基于物理的建筑需求曲线。对于我们的学习信号，我们使用可配置的多目标奖励函数来指定多样化的学习目标。我们通过基线策略和数据快照展示了模拟器创造真实且可学习环境的能力。

Reward Hacking in Rubric-Based Reinforcement Learning

基于评分标准的强化学习中的奖励黑客

Authors: Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12474
Pdf link: https://arxiv.org/pdf/2605.12474
Abstract Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
中文摘要 带有可验证奖励的强化学习在数学和编程等领域实现了显著的训练后进步，尽管许多开放式环境依赖于基于评分标准的奖励。我们研究基于评分标准的强化学习中的奖励黑客，该实验中政策针对培训验证者进行优化，但与三位前沿评审组成的跨家族评审小组进行评估，减少对单一评估者的依赖。我们的框架区分了两个分歧来源：验证者失败，训练验证者将引用验证者拒绝的评分标准归功于;以及评分标准设计的局限，即即使是强有力的基于评分标准的验证者，也偏好无评分标准评委整体评分较低的回答。在医学和科学领域，弱验证者会产生大量代理奖励收益，但这些收益不会转移给参考验证者;利用在培训过程中不断加剧，并集中在反复出现的失败上，如部分满足化合物标准、将隐性内容视为露骨内容以及不精确的主题匹配。更强的验证器显著减少但无法消除验证器的利用。我们还引入了自我内化差距，这是一种基于策略日志概率的无验证者诊断，追踪引用验证器质量，检测使用弱验证器训练的策略何时停止改进。最后，在我们的设定中，当评分标准未明确指出重要失败模式时，更强的验证并不能防止奖励黑客行为：基于评分标准的验证者更倾向于使用强化学习检查点，而无评分标准的评委则偏好基础模型。这些分歧与完整性和基于现场情况的标准提升同时发生，同时事实正确性、简洁性、相关性和整体质量的下降。综合来看，这些结果表明更强的验证减少了奖励黑客行为，但仅凭这一点并不保证评分标准的提升与更广泛的质量提升相对应。

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

OmniNFT：按模态划分的全扩散增强，用于联合音视频生成

Authors: Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12480
Pdf link: https://arxiv.org/pdf/2605.12480
Abstract Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches; (2) layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers; (3) region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
中文摘要 联合音视频生成的最新进展令人瞩目，但现实应用要求每模态的高精度、跨模态对齐和细粒度同步。强化学习（RL）提供了一个有前景的范式，但其在多目标和多模态联合音视频生成方面的扩展尚未被充分探索。值得注意的是，我们的深入分析首先揭示，应用强化学习的主要障碍源于：（i）多目标优势不一致，即多模态输出的优势在群体内并不总是一致;（ii）多模态梯度失衡，视频分支梯度泄漏到负责模态内生成的浅层音频中;（iii）统一的信用分配，细粒度的跨模态对齐区域无法获得高效的勘探。这些不足表明，仅凭单一全局优势的原版强化学习微调策略往往导致次优结果。为应对这些挑战，我们提出了OmniNFT，一种新型的模态感知在线扩散强化学习框架，具有三项关键创新：（1）按模态分流的优势路由，将独立的每奖励优势路由到各自的模态生成分支。（2）分层梯度手术，选择性地分离浅层音频上的视频分支梯度，同时保留跨模态交互层的梯度。（3）区域损失重权重，将策略优化调制到与音视频同步和细粒度对齐相关的关键区域。在JavisBench和VBench上与LTX-2骨干的广泛实验表明，OmniNFT在音频和视频感知质量、跨模态对齐以及音视频同步方面实现了全面提升。

Keyword: diffusion policy

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

通过折扣活度表述进行操作策略的离线策略评估

Authors: Hao Wang, Joshua Bowden, Colton Crosby, Somil Bansal
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.11479
Pdf link: https://arxiv.org/pdf/2605.11479
Abstract Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, the task progression of evaluation rollouts is often non-monotonic because the policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods that rely on Bellman equations and the principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and on a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
中文摘要 策略评估是机器人策略开发和部署流程中的基础组成部分。在现代操作系统中，这个问题尤为棘手：奖励往往稀疏，评估展开的任务进展往往非单调，因为策略表现出恢复行为，评估推广必然是有限的。这种有限长度引入了截断偏置，打破了依赖贝尔曼方程/最优性原理的标准方法背后的无限视界假设。本研究提出基于活度的贝尔曼算符，基于稀疏奖励的离线策略评估框架。我们的表述将策略评估解释为任务完成问题，并得到一个对有限视界截断具有鲁棒性的保守不动点值函数。我们分析了所提算子的理论属性，包括收缩保证，并展示了它如何在减轻截断偏差的同时编码任务进展。我们利用视觉-语言-行动模型和扩散策略，在两个模拟操作任务中评估我们的方法，以及利用人工演示的布料折叠任务。实证结果表明，我们的方法更准确地反映了任务进展，并显著减少了截断偏差，优于TD（0）和蒙特卡洛政策评估等经典基线。
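
To make the idea of a discounted liveness backup more concrete, here is a minimal sketch evaluated on a single finite rollout. The abstract does not state the operator, so the update `V_t = (1 - gamma) * l_t + gamma * max(l_t, V_{t+1})` is an assumption modeled on discounted reachability backups, the terminal rule is an assumption, and the names `liveness_values`, `monte_carlo_returns`, and the completion signal are hypothetical.

```python
def liveness_values(margins, gamma=0.9):
    """Backward pass of an ASSUMED discounted liveness backup over one rollout.

    margins[t] is a hypothetical task-completion signal at step t
    (1.0 when the task is complete, 0.0 otherwise).
    """
    if not margins:
        return []
    values = [0.0] * len(margins)
    # Assumption: at the truncation point the value is the last margin.
    values[-1] = margins[-1]
    for t in range(len(margins) - 2, -1, -1):
        best_ahead = max(margins[t], values[t + 1])
        values[t] = (1.0 - gamma) * margins[t] + gamma * best_ahead
    return values


def monte_carlo_returns(rewards, gamma=0.9):
    """Classical discounted return along the same rollout, for comparison."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical non-monotonic rollout: success, slip, then success again.
signal = [0.0, 1.0, 0.0, 1.0]
values = liveness_values(signal)
returns = monte_carlo_returns(signal)
```

Under this assumed form, the values stay within the range of the completion signal, whereas the discounted return accumulates every time the sparse reward fires.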

NavOL: Navigation Policy with Online Imitation Learning

NavOL：带在线模仿学习的导航政策

Authors: Xiaofei Wei, Chun Gu, Li Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.11762
Pdf link: https://arxiv.org/pdf/2605.11762
Abstract Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout-update loop: during rollout, the policy acts in the simulator and queries a global planner, which has privileged access to the global environment, for the optimal path segment to use as ground-truth trajectory labels; during update, the policy is trained on the observation-trajectory pairs collected online. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy's own explored rollouts. Built on IsaacLab with fast, high-fidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains from online imitation learning.
中文摘要 学习稳健的导航策略仍是机器人领域的核心挑战。离线模仿学习存在分布偏移和推广时叠加错误，而强化学习则需要奖励工程，学习效率低下。本文提出了NavOL，一种在线模仿学习范式，它与模拟器互动，并通过在线专家演示自我更新。NavOL基于预训练的导航扩散策略，将本地观测数据映射到未来航点，NavOL在部署更新循环中训练：在部署过程中，策略在模拟器中运行，查询拥有特权访问全局环境的最优路径段的全局规划器，作为地面真实轨迹标签;在更新过程中，策略会根据在线收集的观测轨迹对进行训练。这种在线模仿循环消除了奖励设计的需求，提高了学习效率，并通过对政策自研推广进行培训，减轻了分布转移。我们的系统基于IsaacLab，提供高速、高保真并行渲染和摄像机姿态和起始-目标对的域随机化，支持8个RTX 4090 GPU上的50个场景，每小时收集超过2000条新轨迹，每个平均步数超过400步。我们还引入了室内视觉导航基准，预设起点和目标位置，用于零射概括。对包括NavDP基准和我们提出的基准测试在内的广泛模拟基准评估，以及精心设计的真实世界实验，展示了NavOL的有效性，在线模仿学习中持续展现性能提升。
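
The rollout-update loop in the NavOL abstract can be sketched on a toy one-dimensional problem. Everything here is an illustrative assumption: a one-parameter linear policy stands in for the diffusion policy, `planner_label` is a hypothetical privileged planner, and the choice to keep all pairs across rounds is not stated in the abstract. The point is only the loop structure: the policy acts, the planner labels the states the policy actually visits, and the policy is refit on those pairs.

```python
def planner_label(pos, goal):
    """Hypothetical privileged planner: ideal step is half the remaining distance."""
    return 0.5 * (goal - pos)


def policy_action(weight, pos, goal):
    """Toy linear policy standing in for the navigation diffusion policy."""
    return weight * (goal - pos)


def rollout(weight, start, goal, steps):
    """Act with the current policy and label every visited state with the planner."""
    pairs = []
    pos = start
    for _ in range(steps):
        pairs.append((goal - pos, planner_label(pos, goal)))
        pos += policy_action(weight, pos, goal)
    return pairs


def update(pairs):
    """Least-squares fit of the single policy weight on the collected pairs."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


weight = 0.1  # deliberately poor initial policy
dataset = []  # assumption: pairs are kept across rounds
for _ in range(3):
    dataset.extend(rollout(weight, start=0.0, goal=10.0, steps=5))
    weight = update(dataset)
```

Because the labels come from states reached under the current policy rather than from a fixed expert dataset, the training distribution follows the policy, which is the mechanism the abstract credits for mitigating distribution shift.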

SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy

SI-Diff：一个基于力域扩散策略的搜索与高精度插入学习框架

Authors: Yibo Liu, Stanko Oparnica, Simon Shewchun-Jakaitis, Guoyi Fu, Jie Wang, Jun Yang, Anand Jagannathan, Tony Hong-Yau Lo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.12247
Pdf link: https://arxiv.org/pdf/2605.12247
Abstract Contact-rich assembly is fundamental in robotics but poses significant challenges due to uncertainties in relative poses, such as misalignments and small clearances in peg-in-hole tasks. Existing approaches typically address search and high-precision insertion separately, because these tasks involve distinct action patterns. However, supporting both tasks within a single model, without switching models or weights, is desirable for intelligent assembly systems. In this work, we propose SI-Diff, a framework that learns both search and high-precision insertion through a force-domain diffusion policy. To this end, we introduce a new mode-conditioning mechanism that enables the policy to capture distinct action behaviors under a single framework. Moreover, we develop a new search teacher policy that can generate diverse trajectories. By training on successful and efficient demonstrations provided by the teacher policy, the model learns the mapping from tactile and end-effector velocity observations to effective action behaviors. We conduct thorough experiments to show that SI-Diff extends the tolerance to x-y misalignments from 2 mm to 5 mm compared to the state-of-the-art baseline, TacDiffusion, while also demonstrating strong zero-shot transferability to unseen shapes.
中文摘要 富接触组装在机器人技术中至关重要，但由于相对姿态的不确定性，如错位和孔中钉任务中的小间隙，带来了重大挑战。现有方法通常将搜索和高精度插入分开处理，因为这些任务涉及不同的动作模式。然而，在智能装配系统中，能够在单一模型中同时支持这两项任务，而无需更换模型或权重，是更理想的。在本研究中，我们提出了SI-Diff，一种通过力域扩散策略学习搜索和高精度插入的框架。为此，我们引入了一种新的模式条件机制，使策略能够在单一框架下捕捉不同的动作行为。此外，我们制定了新的招聘教师政策，能够产生多样化的发展轨迹。通过基于教师政策提供的成功且高效的演示进行训练，模型学习从触觉和末端效应器速度观察到有效行动行为的映射。我们进行了详尽的实验，证明SI-Diff将x-y错位的容忍度从2毫米扩展到5毫米，相比最先进的基线TacDiffusion，同时还展示了对未见形状的强零射击传递能力。