Arxiv Papers of Today

Generated: 2026-05-11 19:26:45 (UTC+8); arXiv release time: 2026-05-11 20:00 EDT (2026-05-12 08:00 UTC+8)

66 relevant papers today

Keyword: reinforcement learning

Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

Authors: Cameron Berg, Susan L. Schneider, Mark M. Bailey
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.06696
Pdf link: https://arxiv.org/pdf/2605.06696
Abstract Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling from spurious similarity, as consequential coalitions may form at the level of internal representations before any overt behavioral change is apparent. Here, we introduce a practical method for detecting coalition structure from the internal neural representations of multi-agent systems. The approach constructs a pairwise mutual-information graph from the hidden states of agents and applies spectral partitioning to identify the most salient coalition boundary. We validate this method in two domains. First, in multi-agent reinforcement learning environments, the method successfully recovers programmed hierarchical and dynamic coalition structures and correctly rejects false positives arising from behavioral coordination without informational coupling. Second, using a large language model, the method identifies coalition structures implied by descriptive prompts, tracks dynamic team reassignments, and reveals a representational hierarchy where explicit labels dominate over conflicting interaction patterns. Across both settings, the recovered partition reveals subgroup organization that a scalar cross-agent mutual-information measure cannot distinguish. The results demonstrate that analyzing hidden-state mutual information through spectral partitioning provides a scalable diagnostic for identifying representational coalitions, offering a valuable tool for monitoring emergent structure in distributed AI systems.
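The partitioning step described in this abstract can be illustrated with a minimal sketch. It assumes the pairwise mutual-information estimates are already available as a symmetric matrix; the function name `spectral_bipartition` and the toy matrix are hypothetical, and the authors' estimator and partitioning details may differ.

```python
import numpy as np

def spectral_bipartition(weights):
    """Split nodes into two groups by the sign of the Fiedler vector.

    `weights` is a symmetric, non-negative matrix; entry (i, j) stands in for
    an estimated mutual information between hidden states of agents i and j.
    """
    w = np.array(weights, dtype=float)
    np.fill_diagonal(w, 0.0)                 # ignore self-information
    laplacian = np.diag(w.sum(axis=1)) - w   # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy graph (made-up numbers): two tightly coupled triples, weak cross-coupling.
mi = np.full((6, 6), 0.05)
mi[:3, :3] = 1.0
mi[3:, 3:] = 1.0
labels = spectral_bipartition(mi)
print(labels)
```

The sign of an eigenvector is arbitrary, so only the grouping matters, not which group is labeled 0 or 1.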

On Training in Imagination

Authors: Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06732
Pdf link: https://arxiv.org/pdf/2605.06732
Abstract State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation--the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
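The claim about zero-mean reward noise can be checked with a short derivation. This is a sketch under assumptions that are not spelled out in the abstract: rollouts are i.i.d., the noise is zero-mean conditional on the trajectory, and its conditional variance is bounded by a constant written here as \sigma^2.

```latex
% Assumptions (for illustration): N i.i.d. rollouts \tau_i \sim \pi_\theta,
% noisy returns \tilde R_i = R(\tau_i) + \varepsilon_i with
% E[\varepsilon_i | \tau_i] = 0 and Var(\varepsilon_i | \tau_i) \le \sigma^2.
\begin{align*}
\hat g &= \frac{1}{N}\sum_{i=1}^{N} \tilde R_i \,\nabla_\theta \log \pi_\theta(\tau_i), \\
\mathbb{E}[\hat g]
  &= \mathbb{E}\big[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\big]
   + \mathbb{E}\big[\underbrace{\mathbb{E}[\varepsilon \mid \tau]}_{=\,0}\,\nabla_\theta \log \pi_\theta(\tau)\big]
   = \nabla_\theta J(\theta), \\
\operatorname{tr}\operatorname{Cov}(\hat g)
  &\le \frac{1}{N}\Big(\operatorname{tr}\operatorname{Cov}\big(R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\big)
   + \sigma^2\,\mathbb{E}\big\|\nabla_\theta \log \pi_\theta(\tau)\big\|^2\Big).
\end{align*}
```

The cross term between R and the noise vanishes for the same reason as in the mean, so the noise contributes only the second term, which shrinks as 1/N.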

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

Authors: Kuo-Chung Peng, Samuel Yen-Chi Chen, Jiun-Cheng Jiang, Chen-Yu Liu, En-Jui Kuo, Yun-Yuan Wang, Prayag Tiwari, Andrea Ceschini, Chi-Sheng Chen, Yu-Chao Hsu, Chun-Hua Lin, Tai-Yue Li, Antonello Rosato, Massimo Panella, Simon See, Saif Al-Kuwari, Kuan-Cheng Chen, Nan-Yow Chen, Hsi-Sheng Goan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2605.06734
Pdf link: https://arxiv.org/pdf/2605.06734
Abstract Fast Weight Programmers (FWPs) encode temporal dependencies through dynamically updated parameters rather than recurrent hidden states. Quantum FWPs (QFWPs) extend this idea with variational quantum circuits (VQCs), but existing implementations rely on multi-qubit architectures that are difficult to scale on noisy intermediate-scale quantum (NISQ) devices and expensive to simulate classically. We propose gated QKAN-FWP, a fast-weight framework that integrates FWP with Quantum-inspired Kolmogorov-Arnold Network (QKAN) using single-qubit data re-uploading circuits as learnable nonlinear activation, known as DatA Re-Uploading ActivatioN (DARUAN). We further introduce a scalar-gated fast-weight update rule that stabilizes parameter evolution, supported by a theoretical analysis of its adaptive memory kernel, geometric boundedness, and parallelizable gradient paths. We evaluate the framework across time-series benchmarks, MiniGrid reinforcement learning, and highlight real-world solar cycle forecasting as our main practical result. In the long-horizon setting with 528-month input window and 132-month forecast horizon, our 12.5k-parameter model achieves lower scaled Mean Square Error (MSE), peak amplitude error, and peak timing error than a suite of classical recurrent baselines with up to 13x more parameters, including Long Short-Term Memory (LSTM) networks (25.9k-89.1k parameters), WaveNet-LSTM (167k), Vanilla recurrent neural network (11.5k), and a Modified Echo State Network (132k). To validate NISQ compatibility, we further deploy the trained fast programmer on IonQ and IBM Quantum processors, recovering forecasting accuracy within 0.1% relative MSE of the noiseless simulator at 1024 shots. These results position gated QKAN-FWP as a scalable, parameter-efficient, and NISQ-compatible approach to quantum-inspired sequence modeling.
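The abstract does not state the scalar-gated update rule itself. As a rough illustration of why a scalar gate can keep fast weights bounded, the sketch below uses a generic convex-combination gate chosen purely for illustration; `gated_fast_weight_step` and its arguments are hypothetical and should not be read as the paper's rule.

```python
import numpy as np

def gated_fast_weight_step(W, key, value, gate):
    """One illustrative update: W_t = (1 - g) * W_{t-1} + g * outer(value, key).

    With a scalar gate g in [0, 1] the new matrix is a convex combination, so
    its Frobenius norm never exceeds max(||W_{t-1}||, ||value|| * ||key||).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return (1.0 - gate) * W + gate * np.outer(value, key)

rng = np.random.default_rng(0)
W = np.zeros((4, 3))
for _ in range(100):
    k = rng.normal(size=3)
    v = rng.normal(size=4)
    k /= np.linalg.norm(k)          # unit-norm key
    v /= np.linalg.norm(v)          # unit-norm value
    W = gated_fast_weight_step(W, k, v, gate=rng.uniform())
print(np.linalg.norm(W))            # stays at or below 1 for unit-norm inputs
```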

The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents

Authors: Federico Pigozzi, Michael Levin
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2605.06746
Pdf link: https://arxiv.org/pdf/2605.06746
Abstract A hallmark of life on Earth is the ability of agents to exert causal power and be drivers of subsequent events. This is key to cognition at all scales. Causal emergence, measuring the degree to which an agent exerts unique predictive power on its future, is one consequence of causal power. Indeed, recent discoveries have shown that biological agents, even minimal ones, increase their causal emergence after learning new memories. However, there is a major knowledge gap regarding how causally emergent artificial agents are. We focused on Reinforcement Learning (RL) of neural-network agents across an array of environmental conditions, encompassing different algorithms, agent architectures, and six environments arranged on a complexity spectrum. For consistency, we computed the causal emergence of their latent-space representations over their lifetimes. We used the recently proposed ΦID to estimate causal emergence and tested how it related to learning performance. Our results suggested a Causally Emergent Alignment Hypothesis: successful agents exhibited causal emergence that was consistently predictive of final reward early in training and whose representational dynamics aligned with reward improvement in most tasks. This idea suggests that causal emergence may be a previously undisclosed axis of reorganization of neural representations in RL agents, with the potential to establish causal relationships and interventions that will lead to better RL agents. Our work also highlights the alignment between causal emergence and learning as another way biological and artificial creatures compare.

Gradient Extrapolation-Based Policy Optimization

Authors: Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque, Ser-Nam Lim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06755
Pdf link: https://arxiv.org/pdf/2605.06755
Abstract Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO's peak accuracy.
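One possible reading of the extrapolation idea, in the plain-gradient-descent setting the abstract mentions, is sketched below. The per-coordinate geometric extrapolation, the function names, and the `mix` parameter are assumptions made for illustration; the abstract does not specify GXPO's actual formulas, stability test, or optimizer.

```python
import numpy as np

def extrapolated_lookahead(theta0, grad_fn, lr, K):
    """Predict the K-step gradient-descent point from two gradient evaluations."""
    g0 = grad_fn(theta0)                 # backward pass 1
    theta1 = theta0 - lr * g0            # first fast step
    g1 = grad_fn(theta1)                 # backward pass 2
    # Per-coordinate ratio of successive gradients (0 where g0 is exactly 0).
    ratio = np.divide(g1, g0, out=np.zeros_like(g0), where=g0 != 0)
    # Assume gradients keep shrinking geometrically: g_k ~= g0 * ratio**k.
    geom = sum(ratio ** k for k in range(K))
    return theta0 - lr * g0 * geom

def lookahead_update(theta0, grad_fn, lr, K, mix=0.5):
    """Move partway to the predicted point, then correct with a true gradient."""
    target = extrapolated_lookahead(theta0, grad_fn, lr, K)
    theta = theta0 + mix * (target - theta0)
    return theta - lr * grad_fn(theta)   # backward pass 3

# Diagonal quadratic f(x) = 0.5 * sum(h * x**2): the extrapolation is exact here.
h = np.array([1.0, 0.5, 0.1])
grad = lambda x: h * x
start = np.array([1.0, -2.0, 3.0])
predicted = extrapolated_lookahead(start, grad, lr=0.1, K=5)
```

On a diagonal quadratic each gradient coordinate decays by a fixed factor per step, which is why the geometric assumption holds exactly; on a general loss it is only a local approximation.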

Revisiting Adam for Streaming Reinforcement Learning

Authors: Florin Gogianu, Adrian Catalin Lutu, Razvan Pascanu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06764
Pdf link: https://arxiv.org/pdf/2605.06764
Abstract Learning from a sequence of interactions, as soon as observations are perceived and acted upon, without explicitly storing them, holds the promise of simpler, more efficient and adaptive algorithms. For over a decade, however, deep reinforcement learning walked the contrary path, augmenting agents with replay buffers or parallel sampling routines, in an effort to tame learning instability. Recently, this topic has been revisited by Elsayed et al. (2024), focusing on update computation through eligibility traces and modifications to the optimisation routine, resulting in the StreamQ algorithm. In this work we take a step back, investigating the efficacy of established updates, such as those implemented by DQN and C51 within this online setting. Not only do we find that they perform well, but through analysing how the optimisation algorithm generally, and Adam in particular, interacts with these updates, we contend that two properties are essential for robust performance: i) the derivative of the objective is to be bounded and ii) weight updates are variance-adjusted. Rigorous and exhaustive experimentation demonstrates that C51, which exhibits both characteristics, is competitive with StreamQ across a subset of 55 Atari games. Using these insights, we derive a variance-adjusted algorithm based on eligibility traces, termed Adaptive Q$(\lambda)$, which approaches double the human baseline on the same subset, surpassing existing methods by all performance metrics.
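To make the "variance-adjusted" property concrete, here is a sketch of the standard Adam update (not the paper's Adaptive Q(λ)), showing that the step size is largely insensitive to the raw gradient scale. The hyperparameter defaults are the commonly used ones and are assumptions here.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

p0 = np.zeros(2)
g = np.array([0.5, -2.0])
small, _, _ = adam_step(p0, g, np.zeros(2), np.zeros(2), t=1)
large, _, _ = adam_step(p0, 1000.0 * g, np.zeros(2), np.zeros(2), t=1)
print(small, large)   # both close to -lr * sign(g)
```

Because the update divides by the square root of the second-moment estimate, a gradient that is 1000 times larger produces almost the same first step.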

Randomness is sometimes necessary for coordination

Authors: Rohan Patil, Jai Malegaonkar, Henrik I. Christensen
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.06825
Pdf link: https://arxiv.org/pdf/2605.06825
Abstract Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient.
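The rank-induced masking described in this abstract might look roughly like the sketch below. The direction of the ordering, whether an agent may attend to itself, and the name `rank_mask` are assumptions for illustration; only the idea of one random scalar per agent per timestep comes from the abstract.

```python
import numpy as np

def rank_mask(num_agents, rng):
    """Build an agent-to-agent attention mask from per-agent random draws.

    allowed[i, j] is True when agent i may attend to agent j. Peers with a
    strictly lower draw than agent i are masked out (assumed convention).
    """
    draws = rng.random(num_agents)                 # one scalar per agent
    allowed = draws[None, :] >= draws[:, None]     # compare draw_j with draw_i
    return draws, allowed

rng = np.random.default_rng(0)
draws, allowed = rank_mask(4, rng)
print(allowed.astype(int))
```

Under this convention the agent with the largest draw attends only to itself while the agent with the smallest draw sees everyone, so identical agents receive different inputs and can take different roles.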

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

Authors: Rui Zhu, Weiheng Bai, Qiushi Wu, Yang Ren, Haixu Tang, Yuchu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06850
Pdf link: https://arxiv.org/pdf/2605.06850
Abstract Reinforcement Learning (RL) has emerged as a crucial paradigm for unlocking the advanced reasoning capabilities of Large Language Models (LLMs), encompassing frameworks like RLHF and RLAIF. Regardless of the specific optimization algorithm (e.g., PPO, GRPO, or Online DPO), online RL inherently requires an exploratory trajectory generation (rollout) phase. However, for long-context reasoning tasks, this rollout phase imposes a severe ``memory wall'' due to the exorbitant Key-Value (KV) cache footprint. While applying KV cache compression during rollouts mitigates this memory overhead, it induces a critical off-policy bias. Although modern KV compression is often nearly lossless during standard inference, even minuscule approximation errors are drastically amplified by the inherent instability of RL optimization. Specifically, the sampler generates responses under a sparse context, whereas the learner updates parameters using the full, dense context. Existing statistical solutions, such as importance reweighting, struggle to correct this magnified bias, suffering from high gradient variance and severe sample inefficiency.
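A small numeric sketch can show why sequence-level importance reweighting becomes unstable when the sampler (compressed KV cache) and learner (dense context) disagree slightly on every token. The function name and the numbers are made up for illustration and are not taken from the paper.

```python
import math

def sequence_importance_ratio(logp_learner, logp_sampler):
    """Ratio pi_learner(y) / pi_sampler(y) for one sampled sequence y,
    computed from per-token log-probabilities."""
    return math.exp(sum(l - s for l, s in zip(logp_learner, logp_sampler)))

# Hypothetical 2000-token response with a 0.005-nat gap on every token.
num_tokens = 2000
learner = [-1.0] * num_tokens
sampler = [-1.005] * num_tokens
ratio = sequence_importance_ratio(learner, sampler)
print(ratio)   # roughly e**10, i.e. in the tens of thousands
```

Per-token gaps add up in log space, so the weight grows exponentially with response length, which matches the abstract's point about high gradient variance for long-context rollouts.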

On the Divergence of Differential Temporal Difference Learning without Local Clocks

Authors: David Antrobius, Shangtong Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06874
Pdf link: https://arxiv.org/pdf/2605.06874
Abstract Learning rate is a critical component of reinforcement learning (RL). This work uses global and local clocks to distinguish two types of learning rates. The former is of the standard form $\alpha_t$ that depends only on the time step $t$ (i.e., a global clock). The latter is of the form $\alpha_{\nu(S_t, t)}$, where $\nu(s, t)$ counts the number of visits to state $s$ until time $t$ (i.e., a local clock). In discounted RL, an RL algorithm that is convergent with a local clock is always also convergent with a global clock, and vice versa. We are not aware of any counterexample. The key contribution of this work is to show that this nice correspondence breaks down in average-reward RL. Specifically, we construct a counterexample showing that although differential temporal difference learning is convergent with a local clock, it can diverge with a global clock. This counterexample closes the open problem in Wan et al. [2021], Blaser et al. [2026].
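The two kinds of learning rate can be contrasted on a short state sequence. This is a minimal sketch: the 1/n schedule, the 1-based counting (with the current visit included), and the name `learning_rates` are assumptions for illustration.

```python
from collections import defaultdict

def learning_rates(states, schedule=lambda n: 1.0 / n):
    """Return (global-clock rates, local-clock rates) along a trajectory."""
    visits = defaultdict(int)
    global_rates, local_rates = [], []
    for t, s in enumerate(states, start=1):
        visits[s] += 1
        global_rates.append(schedule(t))           # alpha_t
        local_rates.append(schedule(visits[s]))    # alpha_{nu(S_t, t)}
    return global_rates, local_rates

g_rates, l_rates = learning_rates(["a", "a", "b", "a"])
print(g_rates)   # [1.0, 0.5, 0.333..., 0.25]
print(l_rates)   # [1.0, 0.5, 1.0, 0.333...]
```

With a global clock a rarely visited state is updated with an already small step size, whereas a local clock restarts the schedule for each state.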

Mitigating Cognitive Bias in RLHF by Altering Rationality

Authors: Tiffany Horter, Andrew Markham, Niki Trigoni, Serena Booth
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06895
Pdf link: https://arxiv.org/pdf/2605.06895
Abstract How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards are inferred from pairwise comparisons, this learning depends on an assumed relationship between latent reward differences and observed preferences, typically modeled using a Boltzmann formulation in which a rationality parameter beta informs how consistently preferences reflect reward differences. In practice, beta is typically treated as a fixed constant that reflects assumed uniform annotator reliability. However, human feedback is not this simplistic in practice: real human judgments are shaped by cognitive biases, leading to systematic deviations from reward-consistent behavior that arise contextually. To address this, we treat rationality as context- and annotation-dependent. We design an approach to dynamically adjust the rationality parameter beta during reward learning using an LLM-as-judge to assess the likely presence of cognitive biases. This approach effectively downweights comparisons that are likely to reflect biased or unreliable judgments. Empirically, we show that this approach learns a more rational downstream model, even when finetuning on datasets with strongly biased preferences.
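The role of the rationality parameter beta in a Boltzmann preference model can be sketched as follows. The function names are hypothetical, and how the LLM judge maps a suspected bias to a beta value is not described in the abstract, so beta is simply passed in per comparison.

```python
import math

def preference_prob(r_a, r_b, beta):
    """Boltzmann (Bradley-Terry style) probability that a is preferred to b."""
    return 1.0 / (1.0 + math.exp(-beta * (r_a - r_b)))

def preference_nll(r_chosen, r_rejected, beta):
    """Negative log-likelihood of one observed comparison."""
    return -math.log(preference_prob(r_chosen, r_rejected, beta))

def nll_grad_wrt_margin(r_chosen, r_rejected, beta):
    """d(NLL)/d(margin), where margin = r_chosen - r_rejected."""
    return -beta * (1.0 - preference_prob(r_chosen, r_rejected, beta))

# A comparison the reward model has not separated yet (margin 0).
print(nll_grad_wrt_margin(0.0, 0.0, beta=1.0))   # trusted comparison
print(nll_grad_wrt_margin(0.0, 0.0, beta=0.1))   # suspected-bias comparison
```

At zero margin the gradient magnitude is beta / 2, so assigning a lower beta to a comparison shrinks its pull on the reward model; with beta equal to 0 the comparison carries no information.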

Rollback-Free Stable Brick Structures Generation

Authors: Chenhui Xu, Ziyue Bai, Fuxun Yu, Heng Huang, Jinjun Xiong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06947
Pdf link: https://arxiv.org/pdf/2605.06947
Abstract While autoregressive models have advanced 3D generation, creating physically stable brick structures remains a challenge due to the strict requirements of gravity and interconnectivity. Existing approaches rely on external physical simulators during inference to perform rejection sampling and brick-by-brick rollbacks, which severely bottlenecks efficiency. To address this, we propose a reinforcement learning paradigm that shifts physical validity enforcement from test-time correction to training-time policy optimization. By utilizing assembly-level rewards, the model optimizes for collision avoidance, global connectivity, structural interlocking, and shape conformity. This paradigm allows the model to internalize physical priors, enabling the first rollback-free generation of stable brick structures. Experimental results demonstrate that our approach achieves state-of-the-art generation quality while accelerating inference speed by orders of magnitude. Our code, dataset, and models are available online.

Multi-Objective Constraint Inference using Inverse reinforcement learning

Authors: Syed Ihtesham Hussain Shah, Floris den Hengst, Aneta Lisowska, Annette ten Teije
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.06951
Pdf link: https://arxiv.org/pdf/2605.06951
Abstract Constraint inference is widely considered essential to align reinforcement learning agents with safety boundaries and operational guidelines by observing expert demonstrations. However, existing approaches typically assume homogeneous demonstrations (i.e., generated by a single expert or multiple experts with identical objectives). They also have limited ability to capture individual preferences and often suffer from computational inefficiencies. In this paper, we introduce Multi-Objective Constraint Inference (MOCI), a novel framework designed to jointly extract shared constraints and individual preferences from heterogeneous expert trajectories, where multiple experts pursue different objectives. MOCI effectively models and learns from diverse, and potentially conflicting, behaviors. Empirical evaluations demonstrate that MOCI significantly outperforms existing baselines, achieving improved predictive performance, and maintaining competitive computational efficiency on a standard grid-world benchmark. These results establish MOCI as an accurate, flexible, and computationally practical approach for real-world constraint inference and preference learning tasks.

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.06977
Pdf link: https://arxiv.org/pdf/2605.06977
Abstract Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under $f$-divergence regularization. Theoretical analysis shows that $O(\log T)$ regret and $O(1/T)$ sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general $f$-divergence regularization.
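For readers unfamiliar with the notation, the regularized objective can be written out as below. This is the standard textbook form of an $f$-divergence and of the three divergences the abstract names; the symbols $r$, $\beta$, and $\pi_{\mathrm{ref}}$ are assumed notation, and the paper's exact formulation may differ.

```latex
\begin{align*}
D_f(\pi \,\|\, \pi_{\mathrm{ref}})
  &= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
     \left[ f\!\left(\frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right) \right],
  \qquad f \text{ convex},\ f(1) = 0, \\
\max_{\pi}\;& \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{E}_{x}\big[ D_f\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \big], \\
f(t) = t \log t &\;\Rightarrow\; \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) \quad \text{(reverse KL)}, \\
f(t) = -\log t &\;\Rightarrow\; \mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi) \quad \text{(forward KL)}, \\
f(t) = (t-1)^2 &\;\Rightarrow\; \chi^2(\pi \,\|\, \pi_{\mathrm{ref}}).
\end{align*}
```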

Bridging Textual Profiles and Latent User Embeddings for Personalization

Authors: Zhaoxuan Tan, Xiang Zhai, Yan Zhu, Meng Jiang, Mohamed Hammad
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06981
Pdf link: https://arxiv.org/pdf/2605.06981
Abstract Personalized systems rely on user representations to connect behavioral history with downstream recommendation applications. Existing methods typically employ either supervised latent user embeddings, which are effective for retrieval but difficult to interpret, or textual user profiles, which are interpretable but challenging to optimize for downstream utility due to lack of direct supervision. To bridge this gap, we present BLUE, a reinforcement learning framework that unifies these two forms of user representation by aligning language-based user profiles with embedding-based recommendation objectives. Given a user interaction history, BLUE leverages a profiler Large Language Model (LLM) to generate textual profiles, while an embedding model provides reward signals. This encourages the resulting textual representations to move closer to positive items and farther from negative ones in the embedding space. We further introduce a text-space supervision signal based on next-item prediction, ensuring the learned profiles remain both semantically meaningful and highly effective for downstream retrieval. Experiments on Amazon Reviews 2023 and Google Local Reviews in zero-shot sequential recommendation settings demonstrate that BLUE consistently outperforms strong baselines under both frozen and trainable embedding conditions. Notably, BLUE achieves clear gains in cross-domain transfer, highlighting the strong generalization ability of the learned user profiles. Furthermore, these generated profiles provide superior personalized context for question answering compared to raw user histories or alternative profile optimization methods. Overall, these results show that BLUE provides an effective way to unify interpretable textual profiling with discriminative latent embeddings for personalization.
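An embedding-based reward of the kind this abstract describes could take a form like the sketch below. Cosine similarity, averaging over negatives, and the name `contrastive_profile_reward` are assumptions for illustration; the abstract only says the reward pulls profiles toward positive items and away from negative ones.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_profile_reward(profile_emb, positive_emb, negative_embs):
    """Similarity to the positive item minus mean similarity to negatives."""
    pos = cosine(profile_emb, positive_emb)
    neg = float(np.mean([cosine(profile_emb, n) for n in negative_embs]))
    return pos - neg

# Toy 2-D embeddings (made-up values).
profile = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
reward = contrastive_profile_reward(profile, positive, negatives)
print(reward)
```

In an RL loop the profile embedding would come from encoding the text that the profiler LLM generated, so the scalar reward can be assigned to that generation.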

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Authors: Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07021
Pdf link: https://arxiv.org/pdf/2605.07021
Abstract Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cue Reasoning allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code is to be released.
中文摘要 大型语言模型（LLMs）中的推理存在监管上的挑战，因为许多错位行为直到推理结束后才会显现。为此，我们引入了行为提示推理，使LLM推理更可控和可监控。行为提示是模型训练为在特定隐性和显性行为之前立即发出的特殊令牌序列，作为信号和控制的双重功能杠杆。当用强化学习微调较弱的外部监视器进行推理监督时，仅压缩行为提示显示的信息，足以使监视器在复杂数学问题解决中修剪多达50%的推理标记。当在过度约束违规导致失败的环境中，通过几乎最优的基于规则的监控器加以利用时，\ours能够从80%的推理痕迹中恢复安全动作，避免以不安全动作的提议告终，成功率从46%翻倍至96%。通过跨两个模型家族和三个领域的评估，我们表明 \bcreasoning 能够提升推理的可监控性和可控性，且不影响性能。更广泛地说，我们的工作通过展示如何训练被监控模型本身更易理解的监督，推动可扩展监督的发展。代码将在此 https URL 发布

A Systematic Investigation of The RL-Jailbreaker in LLMs

对大型语言模型中强化学习越狱者的系统性研究

Authors: Montaser Mohammedalamen, Kevin Roice, Reginald McLean, Alyssa Lefaivre Škopac
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07032
Pdf link: https://arxiv.org/pdf/2605.07032
Abstract The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardening. Adversarial jailbreaking, the strategic manipulation of models to elicit harmful output, remains a primary threat to safe deployment. While Reinforcement Learning (RL) frames jailbreaking as a multi-step attack through sequential optimization, a mechanistic understanding of why the framework succeeds remains incomplete. To fill this gap, we present the first systematic decomposition of RL jailbreaking. We deconstruct the framework into problem formalization (reward function, action space, episode length) and algorithmic measures (RL algorithm, training data, reward shaping) to identify the structural determinants of adversarial success. Our results reveal that the RL-jailbreaker successfully compromised all targeted models and safeguards. Through this first-of-its-kind analysis, we demonstrate that environment formalization, specifically dense rewards and extended episode lengths, is the primary driver of jailbreaking success. This work provides a tool for improving RL-jailbreaker efficiency and, ultimately, for hardening generative models against RL-based attacks.
中文摘要 生成模型从下一代币预测器演变为复杂系统的自主引擎，需要严格的安全加固。对抗越狱，即战略性操控模型以引发有害输出，仍然是安全部署的主要威胁。虽然强化学习（RL）将越狱视为通过顺序优化的多步攻击，但对该框架成功原因的机制性理解仍不完整。为了填补这一空白，我们首次提出了强化学习越狱的系统性分解。我们将框架拆解为问题形式化（奖励函数、行动空间、剧集长度）和算法指标（强化学习算法、训练数据、奖励塑造），以识别对抗成功的结构决定因素。我们的结果显示，RL越狱成功攻破了所有针对模型和防护措施。通过这项首创的分析，我们证明环境形式化，特别是密集奖励和延长剧集长度，是越狱成功的主要驱动力。这项工作为提升RL越狱效率提供了工具，最终增强对基于强化学习攻击的生成模型的防御能力。

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

PACEvolve++：改进进化搜索代理的测试时学习

Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Weili Wang, Ed H. Chi, Shivaram Venkataraman, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07039
Pdf link: https://arxiv.org/pdf/2605.07039
Abstract Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-$k$ frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.
中文摘要 大型语言模型已成为进化搜索的驱动力，但大多数系统依赖固定的提示诱导策略来抽样下一个候选对象。这限制了实际工程和科研任务中的适应性，因为评估成本高昂，进展依赖于学习特定任务的搜索动态。我们介绍了PACEvolve++，一个用于进化搜索代理测试时策略适配的顾问-模型强化学习框架。PACEvolve++ 将战略搜索决策与实施解耦：可训练的顾问生成、评估并选择假设，而更强大的前沿模型则将选定假设转化为可执行的候选方案。为了在非平稳反馈下训练顾问，我们提出了一种阶段自适应方法，使优化策略适应进化过程的不同阶段。在进化早期，它利用群体相对反馈来学习广泛的搜索偏好；随后，随着奖励差距的压缩，它强调 best-of-$k$ 的前沿贡献，以支持稳定的精炼。在专家并行负载均衡、顺序推荐和蛋白质适应性外推方面，PACEvolve++ 优于使用前沿模型的最先进进化搜索框架，实现更快的收敛并稳定进化搜索过程中的测试时训练。

Towards Differentially Private Reinforcement Learning with General Function Approximation

迈向带有一般函数近似的差分私有强化学习

Authors: Yi He, Xingyu Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07049
Pdf link: https://arxiv.org/pdf/2605.07049
Abstract We present the first theoretical guarantees for differentially private online reinforcement learning (RL) with general function approximation, extending beyond prior work restricted to tabular and linear settings. Our approach combines a batched policy update scheme with the exponential mechanism, together with a novel regret analysis. We show that, even under general function approximation, the regret in the model-free setting under differential privacy matches the state of the art for the linear case, scaling as $\widetilde{O}(K^{3/5})$, where $K$ denotes the number of episodes. As an important by-product, we also establish the first regret bound for online RL with batch update that depends on the standard complexity measure of coverability, complementing existing results based on a newly introduced Eluder-Condition class. In addition, we uncover fundamental gaps in recent results for private RL with linear function approximation, thereby clarifying its landscape.
中文摘要 我们首次为带有一般函数近似的差分隐私在线强化学习（RL）给出了理论保证，超越了此前仅限于表格和线性设定的研究。我们的方法结合了批量策略更新方案与指数机制，以及一种新的遗憾分析。我们证明，即使在一般函数近似下，无模型设定中差分隐私下的遗憾也与线性情形的最新结果相匹配，其量级为 $\widetilde{O}(K^{3/5})$，其中 $K$ 表示回合数。作为一个重要的副产品，我们还为采用批量更新的在线强化学习建立了首个依赖于覆盖性（coverability）这一标准复杂度度量的遗憾界，补充了基于新引入的Eluder-Condition类的现有结果。此外，我们还揭示了线性函数近似下隐私强化学习近期结果中的根本性缺口，从而厘清了该领域的整体格局。
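
The abstract above names the exponential mechanism as one ingredient. As a reminder of what that mechanism does in general, the sketch below computes its selection probabilities, which are proportional to exp(epsilon * utility / (2 * sensitivity)). The utility scores, epsilon, and sensitivity are made-up values, and how the paper applies the mechanism inside its batched policy updates is not specified here, so this is only the textbook mechanism under those assumptions.

```python
import math


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(utilities)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(scale * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical utility scores for three candidates (illustrative values only).
probs = exponential_mechanism_probs([1.0, 2.0, 3.0], epsilon=1.0, sensitivity=1.0)
print(probs)
```

Higher-utility candidates are more likely to be chosen, but every candidate keeps nonzero probability; smaller epsilon flattens the distribution toward uniform.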

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

深度强化学习中因果DAG的整合：多阶暴露激活最小马尔可夫态

Authors: Jiamin Xu, Jacqueline Maasch, Kyra Gan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07057
Pdf link: https://arxiv.org/pdf/2605.07057
Abstract Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property? We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose MOSE (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same $Q$-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and controlled redundancy is necessary to unlock the benefit of causal state information.
中文摘要 在线强化学习（RL）依赖马尔可夫性质保证性能，但现实应用在给定原始观测变量时往往缺乏明确定义的状态。尽管因果强化学习（causal RL）越来越受关注，但现有研究通常假设马尔可夫状态已经给定，并侧重于利用因果性加速学习，这留下了一个根本性的空白：给定一个关于观测变量的纵向因果图，如何构造可证明满足马尔可夫性质的MDP状态？我们通过提供一个构建可证明最小状态表示的过程来解决这个问题。在深度强化学习中，我们观察到仅靠最小表示在经验上无法提升性能，表明神经网络无法直接利用马尔可夫最小性。为此，我们提出了MOSE（多阶状态暴露），它将多阶历史状态构造输入同一$Q$函数。MOSE在常见基准测试和合成数据集上，始终优于最小状态构造和单窗口策略。将最小表示与MOSE结合，可以进一步提升性能。我们的结果确立了因果深度强化学习的一条核心原则：仅有最小充分性并不够，必须引入受控冗余才能释放因果状态信息的益处。

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

自我巩固的语言模型：从上下文持续整合知识

Authors: Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07076
Pdf link: https://arxiv.org/pdf/2605.07076
Abstract Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
中文摘要 大型语言模型（LLM）越来越多地以段落流、对话和长上下文工作流程的形式接收信息。虽然较长的上下文窗口能呈现更多证据，但它们并不能确保有用信息被保存和重复利用。我们研究持续上下文整合：将当前上下文写入模型权重，同时限制对先前已整合信息的干扰。我们提出自我巩固语言模型（Self-Consolidating Language Models，SCoL），这是一个后训练框架，在给定当前上下文时，LLM 学习生成文本更新指令，指定应更新其自身的哪些 Transformer 层。由于已提交的更新会改变之后生成后续选择的模型，我们在不断演化的模型状态上用元强化学习训练SCoL。我们以SQuAD知识整合上的监督式问答奖励，以及LongBench v2长上下文整合上基于内在似然的奖励来实例化SCoL。在这两种设定下，SCoL在知识获取与保持方面均优于提示、摘要、批量测试时训练和顺序微调基线。对所学选择模式的分析表明，SCoL鼓励LLM生成与高Fisher信息层对齐的稀疏更新位置，表明模型学会将可塑性引导到损失敏感区域，同时限制干扰。此外，SCoL在评估时能从较短的元训练流迁移到较长的LongBench v2流，表明我们的框架支持可扩展的流式整合。

Actor-Critic with Active Importance Sampling

具有主动重要性采样的演员-评论家

Authors: Majid Molaei, Gabor Paczolay, Matteo Papini, Alberto Maria Metelli, Marcello Restelli
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07094
Pdf link: https://arxiv.org/pdf/2605.07094
Abstract This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
中文摘要 本文介绍了主动重要性抽样演员-批判者（AISAC）算法，这是演员-批判者框架的扩展，用于减少策略梯度估计的方差。AISAC优化行为策略，以最小化梯度方差，同时保持无偏梯度估计。利用重要性抽样原则，该算法将行为策略调整为与目标策略梯度对齐的高效数据收集分布。对于连续动作空间，AISAC采用通过交叉熵最小化优化的高斯行为策略。我们提供理论分析，展示方差减少和无偏性。倒摆和半猎豹任务的实验显示，与标准演员-批判方法相比，学习速度、样本效率和训练稳定性都有提升。结果表明，优化行为策略不仅能提升目标策略更新，也能提升不同超参数设置下的批判者估计准确性。AISAC加速了融合进程，稳定了强化学习训练，使其在现实应用中充满希望。未来的工作包括与 Soft Actor-Critic 和 TD3 等先进算法集成，以应对更复杂的环境。
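
The core idea in the abstract above, that the behavior policy can be chosen to cut variance while importance weights keep the estimate unbiased, can be seen on a toy discrete problem. The sketch below computes the exact mean and variance of the single-sample estimator w(a) * f(a) under two behavior policies. The three-action target policy and the signal f are invented values, and the discrete setting is an assumption made for clarity; AISAC itself, per the abstract, works with policy gradients and Gaussian behavior policies fitted by cross-entropy minimization, which this sketch does not implement.

```python
def is_mean_and_variance(target, behavior, f):
    """Exact mean and variance of w(a) * f(a) with a ~ behavior, w = target / behavior."""
    mean = sum(b * (p / b) * fa for p, b, fa in zip(target, behavior, f))
    second_moment = sum(b * ((p / b) * fa) ** 2 for p, b, fa in zip(target, behavior, f))
    return mean, second_moment - mean ** 2


target = [0.7, 0.2, 0.1]   # hypothetical target policy over three actions
f = [1.0, 4.0, 10.0]       # hypothetical positive per-action signal
true_mean = sum(p * fa for p, fa in zip(target, f))

# Sampling from the target policy itself (all importance weights equal 1).
on_policy = is_mean_and_variance(target, target, f)

# For a positive signal, the variance-minimizing behavior policy is b(a) proportional to target(a) * f(a).
z = sum(p * fa for p, fa in zip(target, f))
optimal_behavior = [p * fa / z for p, fa in zip(target, f)]
optimal_stats = is_mean_and_variance(target, optimal_behavior, f)

print(on_policy, optimal_stats)
```

Both behavior policies give the same mean, so the estimator stays unbiased, while the tailored behavior policy drives the variance to (numerically) zero in this idealized case.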

Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

去中心化扩散策略学习，促进合作多智能体强化学习中的探索

Authors: Yuyang Zhang, Haldun Balim, Na Li
Subjects: Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.07101
Pdf link: https://arxiv.org/pdf/2605.07101
Abstract Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft multi-agent challenge, and observe consistently improved performance.
中文摘要 合作多智能体强化学习（MARL）涉及复杂的智能体交互，需要有效的探索策略。一类著名的MARL算法——去中心化软最大政策梯度（DecSPG）通过基于能源的政策更新来解决这个问题。然而，实际上，这种基于能源的政策难以维持，通常会投射到高斯政策类别上。本研究显示，高斯策略表达有限严重阻碍了在DecSPG中的探索，且随着代理数量的增加，这一限制愈发严重。为解决这一问题，我们提出了去中心化扩散策略学习（DDPL），它通过去噪扩散概率模型参数化每个智能体的策略，这是一种表达性的生成模型，捕捉多模态动作分布以增强探索效果。DDPL通过重要性抽样评分匹配（ISSM）实现了扩散策略的高效在线训练，这是一种具有理论保证的新型训练方法。我们基于代表性的连续作用MARL基准测试来评估DDPL，包括多智能体粒子环境、多智能体MuJoCo、IsaacLab以及JAX重新实现的星际争霸多智能体挑战，并观察到性能持续提升。

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

通过泊松-莫罗漂移实现的随机近似和强化学习的几乎确定收敛率

Authors: Xinyu Liu, Zixuan Xie, Shangtong Zhang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.07104
Pdf link: https://arxiv.org/pdf/2605.07104
Abstract Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-\eta})$ with $\eta \in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2\eta})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.
中文摘要 在马尔可夫噪声下为随机近似和强化学习建立几乎必然收敛率是一个根本性的理论挑战。我们针对一类期望更新为压缩映射的随机近似算法在这一挑战上取得了进展，这一设定出现在许多强化学习算法中，如$Q$学习和线性时间差分学习。具体来说，对于幂律学习率 $O(n^{-\eta})$，其中 $\eta \in (1/2, 1)$，我们得到任意接近 $o(n^{1 - 2\eta})$ 的几乎必然收敛率。对于调和学习率 $O(n^{-1})$，我们得到任意接近 $o(n^{-1})$ 的几乎必然收敛率，我们认为这是一个强结果，因为它接近迭代对数律给出的最优收敛率 $O(n^{-1}\log\log n)$（针对 i.i.d. 噪声的特殊情况）。我们分析的关键是一种新颖的李雅普诺夫漂移构造，它将基于泊松方程的马尔可夫噪声修正应用于针对压缩映射的、已被广泛采用的莫罗包络平滑。

Theoretical Limits of Language Model Alignment

语言模型对齐的理论极限

Authors: Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2605.07105
Pdf link: https://arxiv.org/pdf/2605.07105
Abstract Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
中文摘要 语言模型（LM）对齐旨在改进模型输出以反映人类偏好，同时保留基础模型的能力。最常见的对齐方法包括：（i）强化学习，在KL散度约束下最大化期望奖励，以及（ii）best-of-$N$ 对齐，从$N$个独立样本中选择奖励最高的输出。尽管这些方法被广泛使用，但在给定KL预算下奖励改进的根本极限仍然缺乏了解。我们通过推导固定KL散度预算下可实现的最大期望奖励增益，来刻画KL正则化对齐的信息论极限。我们的第一个结果给出了最优奖励改进的闭式表达式，它由Jeffreys散度项支配，而非先前分析中使用的 $\sqrt{\texttt{KL}}$。我们进一步将该表达式重新表述为基础模型下的协方差，得到一个实用估计器，仅凭基础模型样本即可预测可实现的对齐增益。我们将分析扩展到代理奖励设定，表明理想对齐与代理对齐之间的差距（奖励黑客）随奖励误差幅度的增大以及KL惩罚因子的减小而扩大。随后我们证明奖励集成可以减轻奖励黑客行为，为这一实践中使用的技术提供了理论依据。在实证方面，我们计算了语言模型在安全性和摘要两个任务上的KL-奖励帕累托前沿，结果显示 best-of-$N$ 非常接近理论极限，而PPO和GRPO仍然显著次优。我们的理论结果解释了对齐文献中观察到的若干实证现象，并表明需要算法上的改进才能在不产生高推理成本的情况下实现最优对齐。
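
The KL-regularized objective mentioned in the abstract above has a well-known optimum, the exponentially tilted policy pi_beta(y) proportional to pi_ref(y) * exp(r(y) / beta). For that policy, a standard identity says the expected-reward gain over the base model equals beta times the Jeffreys divergence, that is, KL(pi_beta || pi_ref) + KL(pi_ref || pi_beta). The sketch below checks the identity numerically on a toy three-outcome distribution. The base distribution, rewards, and beta are made-up values, and this illustrates the textbook identity only; it is not the paper's closed-form bound or its covariance-based estimator.

```python
import math


def tilt(base, rewards, beta):
    """Exponentially tilted distribution: base(y) * exp(r(y) / beta), renormalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(base, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


base = [0.5, 0.3, 0.2]      # hypothetical base-model distribution over three outputs
rewards = [0.0, 1.0, 2.0]   # hypothetical rewards for those outputs
beta = 0.5                  # hypothetical KL penalty coefficient

aligned = tilt(base, rewards, beta)
gain = sum(a * r for a, r in zip(aligned, rewards)) - sum(b * r for b, r in zip(base, rewards))
jeffreys = kl(aligned, base) + kl(base, aligned)

print(gain, beta * jeffreys)
```

The two printed numbers agree up to floating-point error, which is one way to see why a Jeffreys-type term appears naturally when relating reward gain to divergence from the base model.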

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

Rollout该花在哪里：面向基于组的RLVR的命中效用最优Rollout分配

Authors: Tao Wang, Shuo Li, Yan Sun, Dongsheng Ding, Edgar Dobriban
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07114
Pdf link: https://arxiv.org/pdf/2605.07114
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model--benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的核心范式。基于组的策略优化方法，如GRPO，通常为每个提示分配固定数量的rollout。这种均匀分配可能效率低下：它把过多计算分配给采样组已经饱和的提示，同时对那些额外样本可能揭示有用正确轨迹的提示探索不足。为解决这一限制，我们引入了命中效用，即为某个提示拟追加分配的rollout中至少有一个正确的后验概率。基于这一概念，我们提出了命中效用最优Rollout分配（HORA），这是一种无需学习的rollout分配策略，在每个分配批次内最大化总后验命中效用。HORA自适应地重新分配rollout预算，同时保持下游奖励评估和基于组的优势估计器不变。在四个数学推理基准和三个模型规模上，HORA保持了相当的Pass@1，并在十二种模型-基准配置中的十种上相较于计算量匹配的GRPO提升了Pass@K，另有一种持平、一种为饱和例外。它还可以即插即用地兼容其他基于组的估计器，如RLOO。消融研究表明，HORA使用的均匀先验与五种以提示为条件的学习先验方案相比具有竞争力。
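
To make "hit utility" concrete, the sketch below computes the posterior probability that at least one of `extra` additional rollouts is correct, assuming each rollout is an independent Bernoulli trial with an unknown success rate and a Beta prior (Beta(1, 1) is the uniform prior the abstract mentions). The Beta-Bernoulli model and the function name `hit_utility` are assumptions made for illustration; the paper's exact definition and its allocation rule may differ, and no allocation policy is implemented here.

```python
def hit_utility(successes, trials, extra, prior_a=1.0, prior_b=1.0):
    """P(at least one of `extra` new rollouts is correct | observed successes / trials).

    Under a Beta(a, b) posterior, E[(1 - p)^m] = prod_{i=0}^{m-1} (b + i) / (a + b + i).
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    miss = 1.0
    for i in range(extra):
        miss *= (b + i) / (a + b + i)  # probability that rollout i also misses
    return 1.0 - miss


# Hypothetical prompt with 0 correct answers out of 8 rollouts so far.
for extra in (1, 2, 4, 8):
    print(extra, round(hit_utility(0, 8, extra), 4))
```

The utility rises with each added rollout but with diminishing increments, which is the property that makes it meaningful to compare where the next rollout is best spent.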

Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

稳定神经Hamilton--Jacobi--Bellman求解器：错误分析及其在基于模型的强化学习中的应用

Authors: Minseok Kim, Yeongjong Kim, Namkyeong Cho, Yeoneung Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.07116
Pdf link: https://arxiv.org/pdf/2605.07116
Abstract Physics-informed neural solvers offer a promising route to model-based reinforcement learning in continuous time, where optimal feedback synthesis is governed by Hamilton--Jacobi--Bellman (HJB) equations. Practical implementations often occupy a regime that is neither a classical grid method nor a continuous-PDE PINN: the value function is represented by a neural network, finite-difference HJB policy-evaluation operators are evaluated by network queries at shifted points, and residuals are minimized by random continuous collocation. This regime preserves the stabilized finite-difference policy-evaluation structure while avoiding grid-based value unknowns. We develop an error theory for this hybrid regime. Interpreting finite differences as shift operators acting on neural networks, we prove a population $L^2$ stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement. Experiments on compact-control LQR up to 64 dimensions, Allen--Cahn control, pendulum, Hopper, and 3D quadrotor benchmarks compare against representative model-based and model-free RL baselines, demonstrating the predicted residual, policy-mismatch, and learned-model error trends.
中文摘要 基于物理的神经求解器为基于模型的连续时间强化学习提供了一条有前景的路径，其中最优反馈合成由汉密尔顿-雅各比-贝尔曼（HJB）方程控制。实际实现通常采用既非经典网格方法，也非连续偏微分方程PINN的区间：价值函数由神经网络表示，有限差HJB策略评估算符通过网络查询在移位点评估，残差通过随机连续配换最小化。该制度保持稳定的有限差分政策评估结构，同时避免基于网格的价值未知数。我们为这种混合状态发展了一个误差理论。我们将有限差分解释为作用于神经网络的移位算子，我们用学习动力学证明了一个基于策略评估步骤的总体$L^2$稳定性估计。该界限将残余误差、初始与外环不匹配、策略不匹配以及模型识别误差区分开来，并明确为学习动力学设置梯度放大因子，而底层的线性评估稳定性则保持不存在隐藏的反粘性爆破现象。我们还通过贪婪策略改进，给出有限样本共置证书和条件多步传播结果。对紧凑对照LQR实验，涵盖64维、Allen-Cahn对照、摆锤、Hopper和3D四旋翼基准，与代表性基于模型和无模型的强化逻辑基线进行比较，展示了预测的残差、策略错配和学习模型误差趋势。

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

带思维链的上下文强化学习的收敛与涌现

Authors: Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07123
Pdf link: https://arxiv.org/pdf/2605.07123
Abstract In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding of the empirical emergence of those parameters.
中文摘要 上下文强化学习（ICRL）指的是强化学习代理能够在推理时通过附加上下文进行条件，在不更新参数的情况下适应新任务的能力。近期实证研究进一步表明，思维链（Chain-of-Thought，简称CoT）生成可以增强ICRL的这一能力。本文首次提供了CoT与ICRL相互作用的理论理解。我们在使用线性变换器（Linear Transformer）的策略评估设置中进行分析。我们证明，针对特定的变换器参数，CoT生成过程等同于反复执行时间差分学习更新。此外，我们还提供了有限样本收敛分析，表明策略评估误差随CoT长度呈几何级数递减，最终在由上下文长度决定的统计底线饱和。我们还证明了所需的Transformer参数是预训练损耗的全局最小化器，从而为这些参数的经验生成提供了理论理解。
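
The claim that repeated temporal-difference updates shrink the policy-evaluation error geometrically can be illustrated without any Transformer. The sketch below applies expected (noise-free) TD(0) sweeps to a toy two-state Markov reward process and tracks the sup-norm error against the exact value function. The transition matrix, rewards, discount, and step size are invented values, and using exact expectations instead of sampled context is a simplifying assumption, so the statistical floor described in the abstract does not appear here.

```python
def expected_td_sweeps(P, r, gamma, alpha, sweeps):
    """Repeat v <- v + alpha * (r + gamma * P v - v), returning the iterate after each sweep."""
    n = len(r)
    v = [0.0] * n
    history = []
    for _ in range(sweeps):
        target = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
        v = [v[s] + alpha * (target[s] - v[s]) for s in range(n)]
        history.append(v)
    return history


# Hypothetical two-state Markov reward process under a fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
gamma, alpha = 0.9, 0.5

# Exact solution of (I - gamma * P) v = r for the 2x2 case.
det = (1 - gamma * P[0][0]) * (1 - gamma * P[1][1]) - gamma ** 2 * P[0][1] * P[1][0]
v_star = [
    ((1 - gamma * P[1][1]) * r[0] + gamma * P[0][1] * r[1]) / det,
    (gamma * P[1][0] * r[0] + (1 - gamma * P[0][0]) * r[1]) / det,
]

history = expected_td_sweeps(P, r, gamma, alpha, sweeps=200)
errors = [max(abs(v[s] - v_star[s]) for s in range(2)) for v in history]
print(errors[0], errors[49], errors[-1])
```

Each sweep contracts the sup-norm error by at least the factor 1 - alpha + alpha * gamma (0.95 with these values), so more sweeps play the role that a longer chain of thought plays in the abstract's analysis.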

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

自适应负面强化用于大型语言模型推理：在RLVR中动态平衡纠正与多样性

Authors: Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07137
Pdf link: https://arxiv.org/pdf/2605.07137
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework. The first is Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model. As training continues, it shifts toward more subtle and controlled updates. The second is Confidence-Weighted Negative Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns specific penalty weights based on the model's normalized sequence likelihood. If the model is highly confident in a wrong path, it receives a larger penalty, whereas uncertain errors -- where the model is effectively exploring -- are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluated these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLMs）推理能力的高效方法。最新研究表明，负样本强化（NSR）——侧重于惩罚错误步骤而非仅仅奖励正确步骤——能够在整个 Pass@k 光谱上与 PPO 和 GRPO 等更复杂框架的表现相当甚至超越。然而，当前的 NSR 技术通常在整个训练过程中施加固定惩罚，并将每个错误回答都以相同的权重处理。为解决这些局限性，我们提出了两个NSR框架的扩展：自适应负样本强化。A-NSR不使用固定的更新规则，而是使用时间相关的调度函数。在初始训练阶段，系统重点关注纠正错误以稳定模型。随着培训的进行，更新会变得更加细微和受控。我们还引入了信心加权负面强化，其原理是不同的错误具有不同重要性。CW-NSR根据模型的归一化序列似然分配具体的惩罚权重。如果模型对错误路径高度有信心，则受到更严格的惩罚，对于模型实际上是在探索的不确定错误处罚较少。我们的形式分析展示了这些机制如何管理代币级更新，使模型能够利用先验引导的概率重分布，同时为防止过拟合提供自然防御。我们在包括MATH、AIME 2025和AMC23在内的复杂推理数据集上，采用Qwen2.5-Math-1.5B架构，评估了这些方法。

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

你能破坏RLVER吗？强化学习训练的共情智能体的对抗性鲁棒性探测

Authors: Deeraj S K, Sadhana Devarajan, Krishna Mehra, Sudhakar Mishra
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07138
Pdf link: https://arxiv.org/pdf/2605.07138
Abstract Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and no-think conditions on 2 RLVER models and 2 base models (Qwen 1.5B and 7B), with 480 adversarial dialogues), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, p<0.001, r=0.688), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think (p=0.650): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the gap between ECS and FS (Final Score) as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.
中文摘要 基于可验证情感奖励的强化学习（RLVER）训练出了具有很强共情表现的语言模型，但其评估基准假设用户合作且诚实。然而，真实的情感互动会系统性地违背这一假设：用户会对人工智能系统进行煤气灯式操纵、使冲突升级并施压以索取无条件的认可，这些动态是合作式基准无法体现的。我们构建了对抗性共情基准（AEB），并引入情绪一致性评分（ECS），以评估对抗条件下的共情稳健性。AEB包括六种有心理学依据的对抗轨迹类型，其判别性的奖励结构会惩罚公式化的回应；ECS在形式上把模型追踪用户情绪状态的能力与其改善用户情绪状态的能力区分开来。在一项涵盖八个情景匹配条件的受控实验中（2个RLVER模型和2个基础模型（Qwen 1.5B和7B）上的思考与无思考条件，共480个对抗对话），RLVER-PPO-Think显著优于同规模的未调优基线（0.963对0.761，p<0.001，r=0.688），且无对话崩溃，隐藏意图检测率提高了47%。然而，ECS几乎保持平稳，RLVER-PPO-Think与Base-7B-Think之间无显著差异（p=0.650）：强化学习训练提升了情绪响应性，但在可观察的状态追踪方面没有可测量的提升。我们将ECS与FS（最终评分）之间的差距解释为该模拟器家族内的行为/可读性解离，而非关于内部理解或临床就绪程度的证据。

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

超越推理：强化学习解锁大型语言模型中的参数化知识

Authors: Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07153
Pdf link: https://arxiv.org/pdf/2605.07153
Abstract Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
中文摘要 强化学习（RL）在大型语言模型推理方面取得了显著成功，但它是否也能提升对参数化知识的直接回忆，仍是一个未知数。我们在受控零样本、单跳、闭卷式质询环境中研究这个问题，没有思维链，仅基于二元正确性奖励进行训练，并应用事实层面的训练测试去重，以确保收益反映的是回忆的提升，而非推理或记忆。在三个模型家族和多个事实性质检基准中，强化学习平均相对提升率约为27%，超过了训练时间和推断时间基线。从机制上讲，强化学习主要将概率质量重新分配到现有知识上，而不是获取新的事实，将正确答案从低概率尾部转移到可靠的贪婪世代。我们的数据归因研究显示，最难的例子往往信息量最大：那些在128个强化学习前样本中从未出现答案（仅占训练数据的18%）的案例，带来了~83%的收益，因为在训练过程中仍会出现罕见的正确展开并得到强化。这些发现共同拓宽了强化学习超越推理的作用，将其重新定位为解锁而非获取潜在参数知识的工具。
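
The binary correctness reward and the "never appears in pre-RL samples" split described in the abstract can be illustrated with a minimal sketch. The function names, the case-insensitive exact-match rule, and the dictionary layout are assumptions made for illustration; the abstract does not specify the paper's matching rule or data format.

```python
def binary_reward(prediction, gold):
    """Return 1.0 for a correct answer, else 0.0.

    Assumption: correctness is a case-insensitive exact match after trimming.
    """
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def split_by_pre_rl_recall(samples_by_question, gold_by_question):
    """Split question ids into (hard, seen).

    hard: the gold answer appears in none of the pre-RL samples.
    seen: at least one pre-RL sample is correct.
    """
    hard, seen = [], []
    for qid, samples in samples_by_question.items():
        gold = gold_by_question[qid]
        if any(binary_reward(s, gold) == 1.0 for s in samples):
            seen.append(qid)
        else:
            hard.append(qid)
    return hard, seen
```

In the abstract's setting each question would carry 128 pre-RL samples; the sketch works for any number of samples per question.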

Rethinking Experience Utilization in Self-Evolving Language Model Agents

重新思考自我演化语言模型代理中的经验利用

Authors: Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yanyan Zhao, Yu Zhang, Yang Wu, Dandan Tu, Bing Qin, Ting Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07164
Pdf link: https://arxiv.org/pdf/2605.07164
Abstract Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce ExpWeaver, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying what experience to store toward understanding how and when experience should enter decision-making.
中文摘要 自我进化的代理通过积累和重复利用过去互动的经验来提升自己。现有研究主要关注经验如何构建、表征和更新，而较少关注经验在运行时决策中的应用。因此，大多数智能体依赖僵化的使用策略，要么在初始化时注入一次经验，要么在每一步注入经验，而不考虑当前决策是否需要经验值。本文研究经验利用作为自我进化智能体关键设计维度。我们询问代理是否从经验使用与决策中受益，以便只有在需要额外指导时才会调用这些经验。为了探讨这个问题，我们引入了{ExpWeaver}，这是一种轻量级实例，保持经验构建不变，仅通过在推理过程中将经验作为可选资源来修改运行时利用率。在四种代表性框架、七个大型语言模型骨干和三种环境类型中，ExpWeaver 在不同利用策略中始终保持最佳性能。强化学习实验进一步表明，这种行为可以通过训练被放大。使用模式、因果消融和基于熵的分析表明，ExpWeaver使智能体能够选择性地调用经验，在有利的决策点以及更高的推理不确定性下。总体而言，我们的发现呼吁从仅仅研究“如何”存储经验“转向”如何“和”何时“经验应进入决策。

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

HyperEyes：双粒度效率感知强化学习，适用于并行多模态搜索代理

Authors: Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07177
Pdf link: https://arxiv.org/pdf/2605.07177
Abstract Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
中文摘要 现有的多模态搜索代理按顺序处理目标实体，每个实体发出一次工具调用，并在查询分解为独立子检索时累计冗余交互轮次。我们认为有效的多模态代理应搜索范围更广，而非更长时间：在一轮内同时派遣多个有基础的查询。为此，我们介绍了HyperEyes，一款并行多模态搜索代理，将视觉基础和检索融合为单一原子动作，实现跨多个实体的并发搜索，同时将推理效率视为一流的训练目标。HyperEyes 的训练分为两个阶段。在冷启动监督方面，我们开发了涵盖可视化多实体和文本多约束查询的并行可操作数据综合流程，通过渐进拒绝抽样策划效率导向的轨迹。基于此，我们的核心贡献——一个双粒度效率感知强化学习框架，在两个层面运作。在宏观层面，我们提出了TRACE（工具使用参考-自适应成本效率），这是一种轨迹级奖励，其参考在训练过程中单调收紧，以抑制多余的工具调用，同时不限制真正的多跳搜索。在微观层面，我们调整了On-Policy Distillation，在失败的推广时注入来自外部教师的密集代币级纠正信号，缓解了成果奖励稀疏导致的学分分配不足。由于现有基准测试仅以准确性为衡量标准，省略了推断成本，我们引入了IMEB，这是一个由人类策划的300个实例基准，联合评估搜索能力和效率。在六个基准测试中，HyperEyes-30B 在准确率上以 9.9% 的准确率超越了最强的开源代理，平均工具调用回合减少了 5.3 倍。

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

通过数据到洞察的发现代理实现自主商业智能

Authors: Dongming Wu, Junwen Li, Ming Lu, Gang Wang, Ting Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07202
Pdf link: https://arxiv.org/pdf/2605.07202
Abstract Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis. In this paper, we propose AIDA (Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrate a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.
中文摘要 将碎片化的企业数据转化为可操作的洞察仍然是LLM面临的重大挑战，受限于复杂的数据库模式、动态SQL生成的限制以及对深度多维需求的需求。本文提出AIDA（自主洞察发现代理），这是首个为复杂业务环境中自主探索设计的端到端框架。我们建立了一个高度灵活的即时零售环境，涵盖200+指标和100+维度，并集成了专有的领域专用语言（DSL），将语义推理与精准的SQL执行相结合。我们的强化学习系统随后将商业分析构建为帕累托原理引导的累积推理过程。实验结果表明，AIDA显著优于基于工作流的代理，广泛评估进一步表明AIDA在环境感知方面表现出优越，并从多元视角实现更深入的分析。我们的工作最终确立了自主智能在工业规模商业智能系统中的变革潜力。

Improved Model-based Reinforcement Learning with Smooth Kernels

基于模型的改进型强化学习，采用光滑核

Authors: Kun Long, Yuqiang Li, Xianyi Wu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.07218
Pdf link: https://arxiv.org/pdf/2605.07218
Abstract For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.
中文摘要 对于连续状态-动作空间场景，经典强化学习（RL）理论主要关注低秩马尔可夫决策过程（MDP），这些过程在牺牲限制性结构假设的前提下，提供了样本高效的保证。基于模型的核平滑方法提供了一种有前景的替代范式，利用MDP的光滑性，并采用非参数核平滑对跃迁动力学的估计。本文提出了一种基于有限视距环境、基于Lipschitz连续性假设的在线强化学习的新核平滑模型方法。通过将伯恩斯坦式探索加值纳入核平滑框架，我们的方法实现了一个后悔上界，在对视野的依赖性上优于最先进的后悔上界。理论进展依赖于对伯恩斯坦式加成与核平滑之间协同效应的细致分析，其中马丁格尔的新紧致伯恩斯坦型浓度不等式可能具有独立意义。
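
As a rough illustration of what a non-parametric kernel smoothing estimate of transition dynamics can look like, the sketch below computes a kernel-weighted average of next-state values around a query state-action pair. It assumes scalar states and actions, a Gaussian kernel, and a hypothetical bandwidth parameter; it is not the paper's estimator and omits the Bernstein-style exploration bonus entirely.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian kernel weight for a given distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def kernel_smoothed_value(query_sa, transitions, next_value, bandwidth=0.5):
    """Estimate E[V(s') | s, a] by kernel-weighting observed transitions.

    query_sa:    (state, action) pair of floats to evaluate.
    transitions: list of ((state, action), next_state) observations.
    next_value:  function mapping a next state to its value estimate.
    """
    s, a = query_sa
    weighted_sum = 0.0
    weight_total = 0.0
    for (s_i, a_i), s_next in transitions:
        # Euclidean distance in the joint state-action space.
        distance = math.hypot(s - s_i, a - a_i)
        w = gaussian_kernel(distance, bandwidth)
        weighted_sum += w * next_value(s_next)
        weight_total += w
    if weight_total == 0.0:
        return 0.0  # no observation carries weight at this query point
    return weighted_sum / weight_total
```

Under a Lipschitz assumption, nearby state-action pairs have similar dynamics, which is what justifies borrowing information from neighbouring observations in this way.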

Teaching Language Models to Think in Code

教授语言模型用代码思考

Authors: Hyeon Hwang, Jiwoo Lee, Jaewoo Kang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07237
Pdf link: https://arxiv.org/pdf/2605.07237
Abstract Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
中文摘要 工具集成推理（TIR）已成为语言模型中数学问题解决的主导范式，结合了自然语言（NL）推理与代码执行。然而，这种交错设置存在三个关键局限：代码通常作为事后验证器，中间的NL计算容易出错，NL和代码的角色是重叠而非明显区分的。我们提出了ThinC（代码思维），这是一个框架，其中代码本身作为推理者，而非NL调用的工具。ThinC 的轨迹始于简短的 NL 规划步骤，之后所有推理通过仅通过执行输出连接的代码块展开。我们从教师模型中提炼出12.2k条以代码为中心的轨迹，并通过监督微调训练ThinC-1.7B和ThinC-4B，随后进行强化学习。ThinC-4B在五项竞赛级数学基准测试中持续优于所有TIR基线，甚至超过了规模更大的Qwen3-235B-A22B-Thinking。进一步分析显示，ThinC 通过代码推理：其最终答案中有 99.2% 基于解释器输出，且模型能可靠地从代码执行失败中恢复，无需中间的 NL 推理。我们的代码和模型将很快发布。

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

异质语言模型的相互强化学习经验分享

Authors: Xiaoze Liu, Dhananjay Ram, Yuting Zhang, Zhaoyang Zhang, Wei Xia, Stefano Soatto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07244
Pdf link: https://arxiv.org/pdf/2605.07244
Abstract We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
中文摘要 我们介绍了互助强化学习，这是一个用于并发强化学习后训练的框架，异构LLM策略在保持独立参数、目标和标记器的同时交换类型经验。该框架结合了共享体验交换（SEE）、多工作者资源分配（MWRA）和分词异构层（THL），后者能重新分词文本并在不兼容词汇之间对齐令牌级的痕迹。这一基础使体验共享设计问题能够在模型家族中发挥作用。我们在GRPO之上实例化了三种受控探针：通过对等部署池（PRP）实现的数据级推广共享，通过跨政策GRPO优势共享（XGRPO）实现价值级优势共享，以及通过成功门控转移（SGT）实现的结果级成功转移。情境-强盗分析描述了它们在稳定与支持权衡上的结构性立场：PRP承担密度-比方差和THL剩余成本，XGRPO在改变标量基线的同时保持政策行为者支持，SGT则提供一个救助性得分方向，指向经过验证的同伴成功。在被评估的体系中，结果层级分享占据了这一权衡的有利点。

Structured Role-Aware Policy Optimization for Multimodal Reasoning

结构化角色感知策略优化，用于多模态推理

Authors: Bingqing Jiang, Difan Zou
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07274
Pdf link: https://arxiv.org/pdf/2605.07274
Abstract Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.
中文摘要 可验证奖励强化学习（RLVR），尤其是群相对策略优化（Group Relative Policy Optimization，GRPO），在提升大型视觉语言模型（LVLM）推理能力方面展现出强大潜力。然而，在多模态推理中，最终答案奖励通常在序列层面分配，且不区分不同代币的功能角色，这使得判断正确答案是否由任务相关的视觉证据支持变得困难。本文从角色感知的代币级信用分配视角重新审视多模态RLVR，将结构化的响应分解为感知代币以提取视觉证据，以及推理代币以从证据中推导答案。基于这一观点，我们提出了结构化角色感知策略优化（SRPO），该优化将序列级GRPO优势细化为角色感知型代币级优势，而不改变奖励函数。具体来说，SRPO通过使用自我提炼的政策对比来赋予角色特定功劳：感知标记根据其在原始视觉输入与损坏视觉输入下的视觉依赖性被强调，而推理标记则根据其与生成感知的一致性来强调。这些角色特定信号通过共享轨迹级基线进一步统一，产生正令牌权重，调整相对更新幅度，同时保持原始GRPO奖励和优化方向，无需外部奖励模型或独立教师。跨越多种多模态推理基准的实验表明，SRPO提升了循证推理能力，凸显了超越统一序列层级信用向角色感知优化以实现可靠多模态推理的重要性。

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

隐式压缩正则化：通过强化学习后内部较短分布进行简明推理

Authors: Chen Wang, Hexuan Deng, Yining Zhang, Yuchen Zhang, Jionghao Bai, Zhaochun Li, Ge Lan, Yue Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07316
Pdf link: https://arxiv.org/pdf/2605.07316
Abstract Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length-accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length-accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy-length Pareto frontier.
中文摘要 带有可验证奖励的强化学习提升了大型语言模型的推理能力，但常常会引发过度思考，即模型产生不必要的冗长推理轨迹。现有方法主要依赖长度罚时或提前退出策略;然而，前者可能降低准确性并引发思考不足，而后者则假设推理痕迹的大部分可以被安全地截断。为了获得无这些限制的压缩信号，我们重新审视现有压缩方法的训练动态。我们观察到，长度——准确性相关性最初为负，但在压缩过程中持续增加，表明较短的响应最初更可能正确，但随着政策趋向思考不足，这一特性逐渐消失。基于这一观察，我们形式化了过度思考：负相关表示过度思考，正相关表示思考不足。过度思考时，最短的正确响应通常比预期中的群体平均响应长度短，这使它们成为政策推广中已存在的自然压缩目标。因此，我们提出\emph{隐式压缩正则化}（ICR），一种策略上正则化方法，其压缩信号来自由最短正确响应在滚出组中诱导的虚拟更短分布，指导策略朝向简洁但正确的轨迹。训练动态显示，ICR在压缩过程中保持更长的长度——准确性相关性，表明短响应更能与正确性保持一致，而非偏向思考不足。对三种推理骨架和多个数学和知识密集基准的实验表明，ICR在保持或提升准确性的同时，能够持续缩短响应，从而实现更强的准确性——长度帕累托前沿。
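
The regime test described in the abstract (negative length-accuracy correlation means overthinking, positive means underthinking) can be sketched for a single rollout group as follows. The Pearson correlation, the function names, and the "neutral" label for a zero correlation are assumptions for illustration; the sketch only diagnoses the regime and selects the shortest correct response, and does not implement the ICR regularizer itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def diagnose_group(lengths, correct):
    """Return (correlation, regime, shortest_correct_length) for one rollout group."""
    corr = pearson(lengths, [1.0 if c else 0.0 for c in correct])
    if corr < 0:
        regime = "overthinking"
    elif corr > 0:
        regime = "underthinking"
    else:
        regime = "neutral"
    correct_lengths = [n for n, c in zip(lengths, correct) if c]
    shortest = min(correct_lengths) if correct_lengths else None
    return corr, regime, shortest
```

For a group with lengths 100, 200, 300, 400 where only the two shortest responses are correct, the correlation is negative and the shortest correct response (100 tokens) lies well below the group mean of 250.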

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

SparseRL-Sync：无损权重同步，通信量减少~100倍

Authors: Lucas Hu, Ranchi Zhao, Isaac Zhu, Zach Zhang, Hscos Zhang, Hugh Yin, Jason Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.07330
Pdf link: https://arxiv.org/pdf/2605.07330
Abstract In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.
中文摘要 在大规模强化学习（RL）系统中，训练器-展开执行解耦，训练器必须定期将策略权重同步到推出端，以减少策略陈旧。当节点间带宽充足时，这种同步通常仅占端到端成本的一小部分。然而，随着模型尺寸的增长，通信需求迅速上升。在带宽受限或网络变量部署中——例如跨数据中心或跨集群设置、异构资源池以及在线强化学习——权重同步可能成为吞吐量和尾部延迟的主要瓶颈。我们观察到，在主流大型模型强化学习训练中，参数实际变化的地点在元素层面极为稀疏（通常为99%+稀疏度）。基于这一观察，我们提出了并实现了SparseRL-Sync，它用无损稀疏更新载荷（索引和值）替代全权重传输，且可在推理端精确重建，从而保持100%的保真度。在简化成本模型下，稀疏同步将每次更新的通信量从S减少到大约S/X;在稀疏度为99%时（X ~ 100），传输数据减少约100倍。结合适当的分桶功能，SparseRL-Sync还降低了启动面和控制面开销，显著提升了带宽受限和高度异步的RL环境中的可扩展性和端到端效率。
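
The idea of a lossless sparse update payload (indices and values) can be illustrated with a small sketch over flat Python lists. The function names are hypothetical, and the ratio helper follows only the simplified cost model quoted in the abstract (volume S reduced to roughly S/X), ignoring the cost of transmitting indices.

```python
def encode_sparse_update(old_weights, new_weights):
    """Return (indices, values) for every element that changed."""
    indices = [i for i, (a, b) in enumerate(zip(old_weights, new_weights)) if a != b]
    values = [new_weights[i] for i in indices]
    return indices, values


def apply_sparse_update(old_weights, indices, values):
    """Rebuild the new weights exactly from the old weights plus the payload."""
    updated = list(old_weights)
    for i, v in zip(indices, values):
        updated[i] = v
    return updated


def dense_to_sparse_ratio(sparsity):
    """Simplified reduction factor X = 1 / (1 - sparsity); index overhead ignored."""
    return 1.0 / (1.0 - sparsity)
```

Because the payload carries the new values themselves rather than quantized deltas, the receiver reconstructs the weights bit for bit. With 99% of elements unchanged, the simplified model gives a reduction factor of about 100.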

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

重新思考LLM策略优化中的重要性抽样：累积代币视角

Authors: Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07331
Pdf link: https://arxiv.org/pdf/2605.07331
Abstract Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at this https URL.
中文摘要 强化学习，包括可验证奖励的强化学习（RLVR），已成为大型语言模型（LLM）后训练中一种强大的方法。这些方法的核心是用于政策外政策梯度估计中的重要性抽样（IS）比率的设计。现有方法面临一个根本的偏倚-方差困境：PPO（Schulman等，2017）和GRPO（Shao等，2024）采用的代币级IS比率，通过忽视前缀状态分布不匹配引入偏置;全序列比率提供精确的轨迹级校正，但由于每个令符比率的乘法累积，方差较大;而GSPO（Zheng等，2025）通过长度归一化提升数值稳定性，但代价是偏离精确的全序列IS修正。在本研究中，我们将累积代币IS比率（即每个代币比率至$t$）的乘积，作为该困境的理论原则性解决方案。我们证明，在代币级策略梯度表述下，该比率为每个代币级梯度项提供了无偏的前缀修正，且方差严格低于全序列比率。基于这一见解，我们提出了CTPO（累积令牌策略优化），它结合了累计令牌IS比率与位置自适应裁剪，后者根据累计对数比率的自然$\sqrt{t}$增长来调整对数空间剪辑边界。这能在代币仓位之间实现更一致的正则化。我们在多个具有挑战性的数学推理基准测试中，在工具集成推理环境中实施并评估CTPO，在两个模型尺度上均优于强劲的GRPO和GSPO基线的平均表现。代码将在此 https URL 上提供。
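
The cumulative token IS ratio is the product of per-token ratios up to position $t$, which is a running sum in log space. The sketch below computes it from per-token log-probabilities and applies a clip whose log-space bound grows like $\sqrt{t}$. The symmetric bound `base_eps * sqrt(t)` and the parameter name `base_eps` are assumptions for illustration; the abstract does not give CTPO's exact clipping rule.

```python
import math


def cumulative_log_ratios(logp_new, logp_old):
    """Running sum of per-token log-ratios: log c_t = sum over s <= t of (logp_new[s] - logp_old[s])."""
    total = 0.0
    out = []
    for new, old in zip(logp_new, logp_old):
        total += new - old
        out.append(total)
    return out


def clipped_cumulative_ratios(logp_new, logp_old, base_eps=0.2):
    """Cumulative IS ratios with a log-space clip bound that scales as sqrt(t).

    Assumption: the bound at position t (1-indexed) is base_eps * sqrt(t), symmetric around 0.
    """
    ratios = []
    for t, log_c in enumerate(cumulative_log_ratios(logp_new, logp_old), start=1):
        bound = base_eps * math.sqrt(t)
        ratios.append(math.exp(max(-bound, min(bound, log_c))))
    return ratios
```

The last entry of `cumulative_log_ratios` is the log of the full-sequence ratio, while the per-step differences are the logs of the token-level ratios, so the cumulative form sits between the two extremes discussed in the abstract.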

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

超越线性注意力：Softmax 变换器实现上下文强化学习

Authors: Zixuan Xie, Xinyu Liu, Claire Chen, Shuze Daniel Liu, Rohan Chandra, Shangtong Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07333
Pdf link: https://arxiv.org/pdf/2605.07333
Abstract In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
中文摘要 上下文强化学习（ICRL）研究那些在预训练后，通过附加上下文进行条件且不更新参数来适应新任务的智能体。现有的ICRL理论分析主要依赖线性注意力，即用恒等映射替代标准注意力中的软极大函数。本文首次理论上理解了ICRL，且未涉及不切实际的线性注意力简化。特别地，我们考虑了实际中使用的标准软极大注意力。我们证明，在某些参数下，具有该软最大关注的变换器分层前向传递等价于加权软最大时间差（TD）学习算法的迭代更新。这里，加权软极大TD是一种新的强化学习算法，在核空间中执行策略评估，并采用线性TD和表格TD作为特例。我们还证明了在某一收缩条件下，策略评估误差随着层数增加而衰减，上述参数如下。最后，我们证明这些参数是预训练损失的全局最小化子，解释了它们在数值实验中的出现。

Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study

基于梯度的 LoRA 排名分配：GRPO 下的实证研究

Authors: Yash Ganpat Sawant
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07366
Pdf link: https://arxiv.org/pdf/2605.07366
Abstract Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT: the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in the SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
中文摘要 LoRA的自适应秩分配，将更多参数分配给重要层，减少对不重要层的参数，在监督微调（SFT）下持续提升效率。我们研究这种成功是否适用于强化学习，特别是群体相对策略优化（Group Relative Policy Optimization，GRPO）。通过对Qwen 2.5 1.5B和GSM8K进行梯度-大小分析，我们发现不会：比例排名分配相比均匀分配，准确率下降了4.5个百分点（70.0%对74.5%），尽管参数预算相同。我们发现了两种机制。首先，GRPO下的梯度景观本质上比SFT更平坦，最大与最小层重要性比仅为2.17倍，而SFT文献中报告的>10倍。所有层都携带有意义的梯度信号;没有一个是真正闲置的。其次，我们发现了梯度放大效应：非均匀分配将重要性分布从2.17x扩大到3.00x，形成一个正反馈循环，高阶层吸收更多梯度，而低阶层逐渐沉默。我们的结果表明梯度重要性无法预测强化学习下的容量需求，应避免将SFT时代的排名分配简单转移给对齐训练。
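
To make "proportional rank allocation under an identical parameter budget" concrete, the sketch below distributes a fixed total rank across layers in proportion to a per-layer importance score and reports the max-to-min importance spread. The largest-remainder rounding, the `min_rank` floor, and the function names are assumptions for illustration; the abstract does not describe the paper's exact allocation rule.

```python
def importance_spread(importance):
    """Max-to-min ratio of per-layer importance scores."""
    return max(importance) / min(importance)


def proportional_ranks(importance, total_rank, min_rank=1):
    """Split total_rank across layers in proportion to importance.

    Every layer first receives min_rank; the remainder is shared proportionally,
    with largest-remainder rounding so the ranks sum exactly to total_rank.
    """
    n = len(importance)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")
    total_importance = sum(importance)
    shares = [spare * w / total_importance for w in importance]
    ranks = [min_rank + int(s) for s in shares]
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the layers with the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks
```

A uniform baseline with the same budget would simply assign `total_rank // n` to each layer, which is the allocation the abstract reports as more accurate under GRPO.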

RELO: Reinforcement Learning to Localize for Visual Object Tracking

RELO：强化学习以定位视觉物体

Authors: Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang, Huchuan Lu, Kede Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07379
Pdf link: https://arxiv.org/pdf/2605.07379
Abstract Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
中文摘要 传统的视觉物体追踪器通过手工制作的空间先验定位目标，通常以热力图的形式出现。此类先验仅提供替代监督，且与跟踪优化和评估指标（如并集交叉（IoU）和成功曲线下面积（AUC）不匹配。这里，我们介绍RELO，一种强化学习到定位的视觉物体追踪方法，将目标定位化表述为马尔可夫决策过程。具体来说，RELO用通过强化学习学习的空间位置定位策略取代手工定制的空间先验，奖励结合了帧级IoU和序列级AUC。我们还引入了层对齐时间令牌传播，以提升帧间语义一致性，计算开销极低。在多个基准测试中，RELO在未更新模板的情况下实现了57.5%的AUC。这证实了奖励驱动定位为视觉物体追踪提供了一种有效替代先验驱动定位的方法。
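
The two metrics that the abstract combines into a reward, frame-level IoU and sequence-level success AUC, can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the success curve is assumed to be sampled at 21 evenly spaced overlap thresholds from 0 to 1; both conventions are illustrative choices and not taken from the paper, and the sketch does not show how RELO weights the two terms.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def success_auc(frame_ious, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1].

    The success rate at a threshold is the fraction of frames whose IoU exceeds it.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [
        sum(1 for v in frame_ious if v > th) / len(frame_ious)
        for th in thresholds
    ]
    return sum(rates) / len(rates)
```

A heatmap prior is only a proxy for these quantities, whereas a reward built from them is aligned with what the benchmark actually measures.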

Offline Policy Optimization with Posterior Sampling

带有后验抽样的离线策略优化

Authors: Hongqiang Lin, Dongxu Zhang, Yiding Sun, Mingzhe Li, Ning Yang, Haijun Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07393
Pdf link: https://arxiv.org/pdf/2605.07393
Abstract A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.
中文摘要 基于模型的离线强化学习（RL）中的一个根本挑战在于泛化性与鲁棒性之间，如何应对分布外（OOD）区域的利用错误。虽然OOD样本可能捕捉有效的底层物理动态，但也存在模型被利用的风险。现有方法通常通过过度悲观正则化来应对这一风险，这不仅保证了鲁棒性，但往往牺牲了泛化性。为克服这一限制，我们提出了基于后验采样的策略优化（PSPO），该方法将动态建模表述为贝叶斯推断过程，以推导出显式量化模型忠实度的后验。通过后验采样和约束策略优化的集成，我们的方法利用动态一致的OOD转移进行泛化，同时确保对模型利用具有鲁棒性。理论上，我们将Q值估计在后验抽样下表述为随机近似问题，并建立其收敛性。我们将策略优化分解为一系列受限子问题，证明解决这些子问题保证单调改进直到收敛。标准基准测试的实验验证了PSPO相较于最先进基线的性能。

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

BalCapRL：基于强化语言的MLLM图像字幕平衡框架

Authors: Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07394
Pdf link: https://arxiv.org/pdf/2605.07394
Abstract Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
中文摘要 图像字幕是计算机视觉中最基础的任务之一。由于其开放性质，它在多模态大型语言模型（MLLM）时代受到了广泛关注。为了追求更详细、更准确的字幕，近期研究越来越多地转向强化学习（RL）。然而，现有的字幕强化学习方法和评估指标往往只强调狭义的字幕质量，导致字幕各核心维度之间出现权衡。例如，以实用性为导向的目标可能鼓励含噪声、幻觉或过长的字幕，这会改善下游问答，但会损害流畅性；而竞技场式目标则可能偏好流畅但泛泛的描述，实用性有限。为此，我们提出了一个更平衡的强化学习框架，联合优化效用感知的正确性、参考覆盖率和语言质量。为了有效优化由此得到的连续多目标奖励，我们将GDPO式的奖励解耦归一化应用于连续值字幕奖励，并证明其性能优于普通GRPO。此外，我们引入了长度条件奖励掩蔽，为字幕任务提供了更合适的长度惩罚。在LLaVA-1.5-7B以及Qwen2.5-VL 3B和7B基础模型上，我们的方法持续提升了字幕质量，在不同模型上的峰值提升分别为+13.6 DCScore、+9.0 CaptionQA和+29.0 CapArena。

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

用评分标准思考：从外部评估者到内部推理指导

Authors: Jiachen Yu, Zhihao Xu, Junjie Wang, Yujiu Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07461
Pdf link: https://arxiv.org/pdf/2605.07461
Abstract Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as external evaluator disjointed from the policy's primary reasoning trace. Such design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model's generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into an internal guidance of LLM's generation. During training, LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We have also discussed the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses respectively.
中文摘要 评分标准已被广泛用于评估无法验证的开放式任务，近期研究将其纳入强化学习的奖励系统中。然而，现有框架通常仅将评分标准视为与策略主要推理轨迹脱节的外部评估者。这种设计将评分标准限制在事后测量，使其无法主动指导模型的生成过程。在本研究中，我们提出了Think-with-Rubrics，这是一种面向指令跟随任务的新范式。Think-with-Rubrics将评分标准的生成整合进推理上下文，将评分标准从独立的产物转变为LLM生成过程的内部指导。在训练过程中，LLM先生成评分标准，再生成回答，而训练好的评分标准验证器则通过评估答案与自生成/黄金评分标准之间的一致性来提供联合监督。多个基准测试上的实验表明，Think-with-Rubrics始终优于由黄金评分标准监督的Rubric-as-Reward基线，平均高出3.87分。我们还讨论了Think-with-Rubrics提升模型性能的机制。实验结果表明，来自黄金评分标准和自生成评分标准的监督，分别通过提升自生成评分标准的质量和提高回答的内部一致性来提升Think-with-Rubrics的性能。

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

SEIF：用于指令跟随的自进化强化学习

Authors: Qingyu Ren, Qianyu He, Jiajie Zhu, Xingzhou Chen, Jingwen Chang, Zeye Sun, Han Xia, Fei Yu, Jiaqing Liang, Yanghua Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07465
Pdf link: https://arxiv.org/pdf/2605.07465
Abstract Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at this https URL.
中文摘要 指令跟随是大型语言模型（LLM）的一项基本能力，但持续提升这一能力仍充满挑战。现有方法通常要么依赖来自人类或强教师模型的昂贵外部监督，要么依赖使用静态难度指令的自博弈训练，而这些指令无法随模型能力的提升而演进。为解决这些局限性，我们提出了SEIF（Self-Evolving Reinforcement Learning for Instruction Following，用于指令跟随的自进化强化学习），这是一个用于增强LLM指令跟随能力的自进化框架。SEIF形成一个封闭的自进化循环来提升模型的指令跟随能力，其中指令难度的演化与模型能力的演化相互促进。SEIF包含四个角色：生成难度不断提高的指令的Instructor（出题者），去除冲突或无效指令以确保数据质量的Filter（过滤器），学习遵循演化后指令的Follower（跟随者），以及为强化学习提供奖励信号的Judger（评判者）。Instructor和Follower在整个过程中交替训练、共同进化。跨多个模型规模和架构的实验表明，SEIF能够持续提升指令跟随性能，表明其具有很强的通用性。进一步分析揭示了改进的来源，并确定了在开放式任务上进行自进化的有效训练策略：早期阶段进行充分训练以打下坚实基础，随后进行适度的后期训练以减轻过拟合并取得更好的最终性能。代码和数据已在此 https URL 公开。

ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

ExpThink：用于自适应思维链压缩的经验引导强化学习

Authors: Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng, Haiwei Wang, Wenyuan Jiang, Miaohui Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07501
Pdf link: https://arxiv.org/pdf/2605.07501
Abstract Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
中文摘要 大型推理模型（LRM）通过扩展的思维链（CoT）推理实现了强劲的性能，但存在token消耗过多和推理延迟较高的问题。现有用于CoT压缩的强化学习（RL）方法依赖统一的静态长度惩罚，忽视了模型能力的动态变化和问题层面的难度差异。我们提出了ExpThink，这是一个通过两个互补机制同时解决这两个维度的强化学习框架。首先，经验引导的奖励塑造会追踪每个问题迄今为止找到的最短正确解答，并应用三级奖励：简洁的正确回答获得全额奖励，冗长的正确回答获得折扣奖励，错误回答得零分。随着模型改进，阈值自动收紧，形成一个无需手动调度的自演进课程。其次，难度自适应优势用正确计数归一化替代标准差归一化，产生随难度单调缩放的梯度，放大难题上的学习以保持准确率，同时抑制简单问题上的梯度以鼓励简洁。这些机制共同实现了准确率优先、压缩其次的训练目标。在多个数学推理基准上的实验表明，ExpThink在提升准确率的同时，可将平均回答长度最多降低77%，其准确率-效率比（准确率除以平均token数）相较原始基线最多提高 $3\times$，并且在两个指标上都优于现有基于强化学习的压缩方法。
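
As an illustration of the three-tier, experience-guided reward described in the ExpThink abstract, here is a minimal Python sketch. The abstract does not state the exact threshold rule or the discount, so the `slack` tolerance, the `discount` value, and the class and method names are hypothetical choices made only for this example; it is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExperienceTracker:
    """Tracks the shortest correct response length seen so far per problem."""
    slack: float = 1.2      # hypothetical: "concise" means within 20% of the record
    discount: float = 0.5   # hypothetical: credit for verbose-but-correct responses
    shortest: Dict[str, int] = field(default_factory=dict)

    def reward(self, problem_id: str, is_correct: bool, length: int) -> float:
        # Tier 3: incorrect responses get zero reward and never update the record.
        if not is_correct:
            return 0.0
        best = self.shortest.get(problem_id)
        # A shorter correct response tightens the threshold for later calls.
        if best is None or length < best:
            self.shortest[problem_id] = length
        # Tier 1: first correct response, or one close to the previous record.
        if best is None or length <= self.slack * best:
            return 1.0
        # Tier 2: correct but verbose relative to the record.
        return self.discount


tracker = ExperienceTracker()
print(tracker.reward("q1", True, 1000))   # 1.0 (first correct answer sets the record)
print(tracker.reward("q1", True, 2000))   # 0.5 (correct but verbose)
print(tracker.reward("q1", True, 400))    # 1.0 (new, shorter record)
print(tracker.reward("q1", True, 1000))   # 0.5 (same length now counts as verbose)
```

The last two calls show the curriculum effect the abstract mentions: once a shorter correct solution is found, a length that previously earned full credit only earns the discounted tier.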

Implicit Preference Alignment for Human Image Animation

人体图像动画的隐式偏好对齐

Authors: Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, Tianxiang Zheng, Qinglin Lu, Zhen Cui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07545
Pdf link: https://arxiv.org/pdf/2605.07545
Abstract Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at this https URL
中文摘要 人体图像动画已取得显著进展，但由于手部自由度高、运动复杂，生成高保真的手部动作仍是一个长期存在的挑战。虽然基于人类反馈的强化学习，尤其是直接偏好优化，提供了一种潜在的解决方案，但它需要构建严格的偏好对。然而，为动态手部区域整理此类偏好对成本极高，且常因逐帧不一致而难以实现。本文提出了隐式偏好对齐（IPA），这是一种数据高效的后训练框架，无需成对的偏好数据。IPA在理论上基于隐式奖励最大化，通过最大化自生成高质量样本的似然，同时惩罚对预训练先验的偏离，来对齐模型。此外，我们引入了手部感知局部优化（Hand-Aware Local Optimization）机制，显式地将对齐过程引导至手部区域。实验表明，我们的方法能够实现有效的偏好优化以提升手部生成质量，同时显著降低构建偏好数据的门槛。代码已在此 https URL 发布。

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

你的语言模型就是它自己的评论家（critic）：基于actor内部状态进行价值估计的强化学习

Authors: Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07579
Pdf link: https://arxiv.org/pdf/2605.07579
Abstract Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce POISE (Policy Optimization with Internal State Value Estimation), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
中文摘要 面向大型推理模型的可验证奖励强化学习（RLVR）依赖基线估计来降低方差，但现有方法代价高昂：PPO需要一个与策略模型同等规模的critic，而GRPO则需要对每个提示进行多次rollout以保持其经验组均值稳定。我们提出了POISE（Policy Optimization with Internal State Value Estimation，基于内部状态价值估计的策略优化），它利用策略前向传播过程中已经计算出的策略模型内部信号，以可忽略不计的成本获得基线。一个轻量级探针根据提示和生成轨迹的隐藏状态以及token熵统计量来预测期望的可验证奖励，并与策略一同在线训练。为了在使用以轨迹为条件的特征时仍保持梯度无偏，我们引入了一种交叉rollout构造，用另一个独立rollout的内部状态来预测每个rollout的价值。由于POISE仅用单次rollout即可估计提示的价值，它能在固定计算预算下实现更高的提示多样性。这降低了梯度方差，使学习更稳定，同时也省去了为检测零优势提示而进行采样的计算开销。在数学推理基准上，使用Qwen3-4B和DeepSeek-R1-Distill-Qwen-1.5B时，POISE的表现与DAPO相当，而所需计算量更少。此外，其价值估计器的性能与独立的LLM规模价值模型相近，并可推广到多种可验证任务。通过利用模型自身的内部表示，POISE实现了更稳定、更高效的策略优化。

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

学习本地通信以实现大规模多智能体路径寻找

Authors: Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.07637
Pdf link: https://arxiv.org/pdf/2605.07637
Abstract Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.
中文摘要 多智能体路径规划（MAPF）是多机器人轨迹规划问题中广泛使用的一种抽象，其中多个同质智能体在共享环境中同时移动。虽然最优求解MAPF是NP难的，但可扩展且高效的求解器对于物流和搜救等实际应用至关重要。为此，研究社区提出了多种利用机器学习的去中心化次优MAPF求解器。这些方法（从单个智能体的视角）将MAPF建模为Dec-POMDP，在每个时间步，智能体必须根据局部观测决定动作，并通常通过强化学习或模仿学习来求解。我们沿用了相同的思路，但额外引入了一个可学习的通信模块，旨在通过高效的特征共享增强智能体之间的协作。我们提出了面向多智能体路径规划的局部通信（LC-MAPF），这是一个可泛化的预训练模型，在相邻智能体之间进行多轮通信以交换信息并提升协调能力。我们的实验表明，在多种（未见过的）测试场景和多项指标上，所提方法优于现有基于学习的MAPF求解器，包括基于模仿学习（IL）和强化学习（RL）的方法。值得注意的是，所引入的通信机制并未损害LC-MAPF的可扩展性，而可扩展性正是基于通信的MAPF求解器常见的瓶颈。

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

二元奖励GRPO中的梯度饥饿：为什么群体均值中心化失败以及为什么最简单的解决方案有效

Authors: Wenhua Nie, Jianan Wu, Junlin Liu, Ziwei Li, Zheng Lin, Zhang Zijian, Yilong Fan, Haoran Zheng, Jyh-Shing Roger Jang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07689
Pdf link: https://arxiv.org/pdf/2605.07689
Abstract Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every response is wrong, the centered advantage is exactly zero and the policy receives no learning signal. We prove that the true degeneracy rate always exceeds the i.i.d. Bernoulli prediction by Jensen's inequality, and observe a 0.69 degeneracy rate at group size four in logged Qwen3.5-9B GSM8K training. We then show that the fixed-reference Sign advantage, $A=2r-1$, performs pass@$G$ failure descent by increasing the probability that at least one sample in the group succeeds. On the full GSM8K test set across seven seeds, Sign reaches 73.8% accuracy versus 28.4% for standard normalized group-mean DrGRPO at group size four, a 45.4 point gain with $p<0.0001$. The effect is directionally consistent on Llama-3.1-8B and positive but underpowered on a MATH-500 transfer check. Pass@$k$ analysis indicates that the main benefit is search compression rather than large capacity expansion, aligning the empirical gains with recent RLVR ceiling observations.
中文摘要 组相对策略优化（GRPO）是基于可验证奖励的强化学习的标准算法，但其以组均值为中心的优势在二元奖励下可能失效。其失效模式是梯度饥饿：当一个组内的所有回答都正确或都错误时，中心化后的优势恰好为零，策略得不到任何学习信号。我们利用Jensen不等式证明，真实的退化率总是高于独立同分布（i.i.d.）伯努利假设下的预测值，并在Qwen3.5-9B的GSM8K训练日志中观察到组大小为4时退化率为0.69。随后我们证明，固定参考的Sign优势 $A=2r-1$ 通过提高组内至少有一个样本成功的概率，实现了对pass@$G$失败率的下降。在完整的GSM8K测试集上、跨七个随机种子，组大小为4时Sign的准确率达到73.8%，而标准的归一化组均值DrGRPO为28.4%，提升45.4个百分点，$p<0.0001$。该效应在Llama-3.1-8B上方向一致，在MATH-500迁移检验中为正但统计功效不足。Pass@$k$分析表明，主要收益来自搜索压缩，而非能力的大幅扩展，这使实证收益与近期关于RLVR上限的观察结果相一致。
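
The gradient-starvation argument in this abstract can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not the paper's code: the function names are hypothetical, the centered advantage omits any standard-deviation normalization, and the per-prompt success probabilities used for the Jensen comparison (0.1 and 0.9) are made-up values.

```python
from statistics import mean


def group_mean_advantages(rewards):
    """Group-mean-centered advantages: each reward minus the group average."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def sign_advantages(rewards):
    """Fixed-reference Sign advantage A = 2r - 1 for binary rewards r in {0, 1}."""
    return [2 * r - 1 for r in rewards]


def iid_degeneracy_rate(p, group_size):
    """Probability that all G i.i.d. Bernoulli(p) rewards in a group agree."""
    return p ** group_size + (1 - p) ** group_size


# An all-wrong group gives no signal under centering, but a signal under Sign.
print(group_mean_advantages([0, 0, 0, 0]))  # [0, 0, 0, 0]
print(sign_advantages([0, 0, 0, 0]))        # [-1, -1, -1, -1]

# A mixed group still produces nonzero centered advantages.
print(group_mean_advantages([1, 0, 0, 0]))  # [0.75, -0.25, -0.25, -0.25]

# Jensen-style comparison: prompts with heterogeneous success rates degenerate
# more often than a single Bernoulli with the same average rate would predict.
pooled = iid_degeneracy_rate(0.5, 4)                              # 0.125
heterogeneous = mean(iid_degeneracy_rate(p, 4) for p in (0.1, 0.9))
print(pooled, heterogeneous)
```

Because `p**G + (1-p)**G` is convex in `p`, averaging it over prompts of varying difficulty can only raise the degeneracy rate relative to evaluating it at the mean success rate, which is the direction of the inequality the abstract states.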

Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

引导不是超参数：在扩散语言模型中学习动态控制

Authors: Fan Zhou, Tim Van de Cruys
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.07701
Pdf link: https://arxiv.org/pdf/2605.07701
Abstract Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal controllability and quality tradeoff, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
中文摘要 无分类器引导（CFG）是一种广泛用于控制基于扩散的生成模型的机制，但其引导尺度通常在整个生成过程中被视为固定的超参数。这种静态设计导致可控性与质量之间的权衡不够理想，因为最优引导强度会随任务以及扩散过程的不同阶段而变化，在自然语言处理领域尤其如此。我们将CFG尺度选择重新表述为一个序贯决策问题，并提出通过强化学习来学习动态引导轨迹。具体来说，我们将引导尺度建模为在每个生成步骤根据不断演化的扩散状态选择的离散控制动作，并在任务级奖励下使用近端策略优化（PPO）来优化策略。在三个受控NLP生成任务上使用离散扩散语言模型进行的实验表明，自适应引导在可控性与生成质量之间始终比固定尺度策略取得更好的平衡。对所学策略的进一步分析显示，不同任务中存在各不相同且可解释的引导轨迹，凸显了将引导视为动态控制过程而非静态设计选择的重要性。
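
To make the idea of a per-step guidance scale concrete, the following Python sketch combines conditional and unconditional logits with a scale chosen from a discrete action set at every step. It assumes the common convention `uncond + scale * (cond - uncond)`; the action set `SCALES`, the function names, and the hand-written `decaying_policy` are hypothetical stand-ins. In the paper the action is chosen by a PPO-trained policy that observes the diffusion state, which this sketch does not model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical discrete action set of guidance scales.
SCALES: Tuple[float, ...] = (0.0, 1.0, 2.0, 4.0)


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Classifier-free guidance: move from unconditional toward conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def decaying_policy(step: int, total_steps: int) -> int:
    """Stand-in for a learned policy: strong guidance early, plain conditional later."""
    return len(SCALES) - 1 if step < total_steps // 2 else 1


def guided_trajectory(
    logit_pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
    policy: Callable[[int, int], int] = decaying_policy,
) -> List[List[float]]:
    """Apply a (possibly different) guidance scale at every generation step."""
    total = len(logit_pairs)
    guided = []
    for step, (cond, uncond) in enumerate(logit_pairs):
        scale = SCALES[policy(step, total)]  # the action picked for this step
        guided.append(cfg_combine(cond, uncond, scale))
    return guided


pairs = [([2.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [1.0, 1.0])]
print(guided_trajectory(pairs))  # [[5.0, -3.0], [2.0, 0.0]]
```

A fixed-scale baseline corresponds to a policy that returns the same index at every step, which is the setting the abstract argues is suboptimal.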

SOD: Step-wise On-policy Distillation for Small Language Model Agents

SOD：面向小语言模型智能体的逐步同策略蒸馏

Authors: Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07725
Pdf link: https://arxiv.org/pdf/2605.07725
Abstract Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at this https URL.
中文摘要 由于长程工具交互不稳定且模型容量有限，工具集成推理（TIR）难以扩展到小型语言模型。而组相对策略优化等强化学习方法只能提供稀疏的结果级奖励。近来，同策略蒸馏（OPD）通过在学生生成的轨迹上提供来自教师的密集token级监督而受到关注。然而，我们的实验表明，将OPD应用于TIR会导致一种关键的失效模式：错误的工具调用往往会在后续推理步骤中级联传播，逐步放大学生与教师之间的分歧，使教师的token级监督越来越不可靠。为此，我们提出了SOD，这是一种面向小型语言模型智能体的逐步同策略蒸馏框架，它根据步级分歧在每一步自适应地重新加权蒸馏强度。因此，SOD可以在高分歧区域削弱可能具有误导性的教师信号，同时在对齐良好的状态下保留密集的指导。在具有挑战性的数学、科学和代码基准上的实验显示，SOD相比次优基线最多提升20.86%。值得注意的是，我们的0.6B学生模型在AIME 2025上取得了26.13%的成绩，展示了智能体推理能力向轻量级模型的有效迁移。我们的代码可在此 https URL 获取。

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

POETS：通过高效计算策略集成实现不确定性感知的LLM优化

Authors: Nicolas Menet, Andreas Krause, Abbas Rahimi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.07775
Pdf link: https://arxiv.org/pdf/2605.07775
Abstract Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
中文摘要 平衡探索与利用是序贯决策和黑箱优化中的核心挑战。我们提出了POETS（$\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling），这是一个连接不确定性量化与策略优化的新框架。我们的方法基于这样一个洞见：用Kullback-Leibler（KL）正则化训练的策略隐式编码了一个底层的奖励函数。在此基础上，POETS绕过了先训练不确定性感知的奖励模型、再针对该模型单独拟合策略这一复杂的嵌套过程。取而代之的是，我们直接训练一个策略集成，通过将隐式编码的奖励函数与在线的自助采样（bootstrapped）数据相匹配来捕捉认知不确定性。为了克服集成大型语言模型（LLM）时过高的计算和内存开销，POETS采用了一种高效的架构：集成共享一个预训练骨干网络，同时通过相互独立的低秩适应（LoRA）分支保持多样性。在理论上，我们证明了POETS隐式地执行KL正则化的Thompson采样，因此继承了 ${\mathcal O}(\sqrt{T \gamma_T})$ 的强累积遗憾界。在实证上，我们证明了POETS在包括蛋白质搜索和量子电路设计在内的多种科学发现领域实现了最先进的样本效率。此外，它改善了强化学习的优化轨迹，在带有经验回放的离策略设置或小数据集场景中表现得尤为稳健。
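
The abstract's starting point, that a KL-regularized policy implicitly encodes a reward function, follows from a standard identity. The derivation below is a sketch of that textbook result, not POETS's exact parametrization, which the abstract does not give; the temperature $\beta$, the reference policy $\pi_{\mathrm{ref}}$, and the partition function $Z(x)$ are notation assumed here.

```latex
% KL-regularized objective for a prompt x (assumed notation):
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[r(x, y)\bigr]
  \;-\; \beta\, \mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)

% Its maximizer is a reward-tilted version of the reference policy:
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\bigl(r(x, y)/\beta\bigr),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\bigl(r(x, y)/\beta\bigr)

% Solving for the reward shows it can be read off the policy,
% up to a prompt-dependent constant:
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;+\; \beta \log Z(x)
```

Under this reading, each member of a policy ensemble corresponds to one plausible reward function, which is what would let an ensemble of policies stand in for an ensemble of reward models.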

Approximation-Free Differentiable Oblique Decision Trees

无近似可微斜决策树

Authors: Subrat Prasad Panda, Blaise Genest, Arvind Easwaran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.07837
Pdf link: https://arxiv.org/pdf/2605.07837
Abstract Decision Trees (DTs) are widely used in safety-critical domains such as medical diagnosis, valued for their interpretability and effectiveness on tabular data. However, training accurate oblique DTs is challenging due to complex optimization landscapes and overfitting risks, particularly in regression. Recent advances have introduced differentiable formulations that enable gradient-based training and joint optimization of decision boundaries and leaf regressors. Yet, existing approaches typically rely on approximations, either through probabilistic softening of boundaries (soft DTs) or quantized gradients such as the Straight-Through Estimator (STE). To overcome these limitations, we propose DTSemNet, a novel, semantically equivalent, and invertible representation of hard oblique DTs as neural networks. DTSemNet enables end-to-end training with standard gradient descent, eliminating the need for approximations in both classification and regression. While classification aligns naturally with this formulation, regression remains challenging due to the joint optimization of internal nodes and leaf regressors. To address this, we analyze the limitations of STE and introduce an annealed Top-k method that provides accurate gradient signals without approximation. Extensive experiments on classification and regression benchmarks show that DTSemNet-trained oblique DTs outperform state-of-the-art differentiable DTs. Furthermore, we demonstrate that DTSemNet can serve as programmatic DT policies in reinforcement learning environments, thereby broadening their applicability.
中文摘要 决策树（DT）因其可解释性以及在表格数据上的有效性而备受重视，被广泛应用于医疗诊断等安全关键领域。然而，由于优化地形复杂且存在过拟合风险，训练准确的斜决策树颇具挑战，在回归任务中尤其如此。近期进展引入了可微的表述形式，使得基于梯度的训练以及决策边界与叶节点回归器的联合优化成为可能。然而，现有方法通常依赖近似，要么对边界进行概率性软化（软决策树），要么使用直通估计器（STE）等量化梯度。为克服这些限制，我们提出了DTSemNet，这是一种将硬斜决策树表示为神经网络的新颖、语义等价且可逆的表示。DTSemNet支持用标准梯度下降进行端到端训练，在分类和回归中都无需近似。虽然分类与该表述自然契合，但由于内部节点与叶节点回归器需要联合优化，回归仍然具有挑战性。为此，我们分析了STE的局限性，并引入了一种退火Top-k方法，能够在不做近似的情况下提供准确的梯度信号。在分类和回归基准上的大量实验表明，用DTSemNet训练的斜决策树优于最先进的可微决策树。此外，我们证明了DTSemNet可以在强化学习环境中充当程序化的决策树策略，从而拓宽其适用范围。

From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

从合成到真实：利用合成与真实数据实现身份一致的妆容迁移

Authors: Yue Yu, Jiayu Wang, Jiajia Shi, Jingjing Chen, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.07861
Pdf link: https://arxiv.org/pdf/2605.07861
Abstract Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.
中文摘要 妆容迁移旨在将参考人像的妆容风格应用到源人像上，同时保持身份和背景不变。早期方法将该任务表述为无监督的图像到图像翻译，依赖替代目标，性能往往有限。近期基于扩散和流的方法转而利用合成数据进行有监督训练，带来了显著改进。然而，这些方法仍面临两个关键挑战：合成监督常常无法忠实地保持身份，且合成数据与真实数据之间的域差距限制了泛化，导致在复杂的真实场景中性能下降。为解决这些问题，本文首先提出了ConsistentBeauty，这是一种新颖的数据整理流程，可确保合成数据中的妆容保真度和严格的身份一致性。其次，我们提出了RealBeauty，一种从合成到真实的后训练框架。除了在整理后的合成数据上进行监督学习外，我们还通过强化学习使模型适应真实场景，并设计了针对妆容迁移任务的新颖可验证奖励。这使模型能够在合成监督之外，进一步从真实的妆容模式中获益。此外，我们建立了一个新的多样化妆容迁移基准，涵盖多种肤色、年龄、性别、姿态和妆容风格，从而能够更全面地评估模型在多样真实条件下的性能。大量实验表明，我们的方法在多个基准上达到了最先进的性能，并在身份保持以及复杂真实案例上的表现方面展现出明显优势。

Interpreting Reinforcement Learning Agents with Susceptibilities

解释具有易感性的强化学习代理

Authors: Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08007
Pdf link: https://arxiv.org/pdf/2605.08007
Abstract Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.
中文摘要 易感性（susceptibilities）是一种神经网络可解释性技术，研究可观测量的后验期望值对损失扰动的响应。我们将这一构造推广到深度强化学习中遗憾（regret）的情形，并在一个简单的网格世界模型中考察易感性的效用，该模型虽然简单，却表现出非平凡的分阶段发展。我们认为，易感性揭示了模型在参数空间中发展的内部特征，而这些特征仅靠研究所学策略的发展是无法检测到的。我们通过激活引导（activation steering）验证了这些结果，并讨论了该框架向RLHF后训练的扩展。

Learning CLI Agents with Structured Action Credit under Selective Observation

在选择性观察下学习具有结构化动作信用分配的CLI智能体

Authors: Haoyang Su, Ying Wen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08013
Pdf link: https://arxiv.org/pdf/2605.08013
Abstract Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $\sigma$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.
中文摘要 命令行界面（CLI）智能体正在成为一种实用的智能体-计算机交互范式，其交互对象包括不断演变的文件系统、可执行的命令行程序和在线执行反馈。近期研究利用强化学习（RL）从可验证的任务反馈中学习这些交互能力，但很少有方法将CLI动作原生的结构化属性用作学习信号。除了这种未被充分利用的动作结构之外，CLI学习还同时涉及编码智能体的两个瓶颈。首先，智能体必须从部分观测中识别大型代码库里与任务相关的证据。其次，稀疏的终局奖励必须被分配给塑造漫长多轮轨迹的各个动作。我们通过由shell驱动的信息提取和文件编辑任务来研究这些瓶颈。对于选择性观察，我们引入了$\sigma$-Reveal，这是一种推理时机制，用于在同一CLI下选择受token预算约束的上下文。对于信用分配，我们提出了动作优势分配（Action Advantage Assignment，$\mathrm{A}^3$），这是一种原生的智能体强化学习方法，保持了标准智能体强化学习的算法复杂度。$\mathrm{A}^3$ 根据回合（episode）级相对反馈、基于抽象语法树（AST）的动作子链残差以及树级轨迹间隔来构建轮次级优势。为了进一步评估这一问题设定，我们构建了ShellOps，这是一个可验证的数据集套件，涵盖仓库环境中的CLI任务。

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

为玩而推理：前沿大型推理模型（LRM）与人类游戏学习者之间的行为与大脑对齐

Authors: Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield, Joshua B. Tenenbaum, Rui Ponte Costa, Marcelo G. Mattar, Momchil Tomov
Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2605.08019
Pdf link: https://arxiv.org/pdf/2605.08019
Abstract Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: this https URL
中文摘要 人类在遇到新环境时能迅速学习抽象知识，并灵活运用这些知识来指导高效且智能的行动。现代人工智能系统能否以类似的方式学习和规划？我们使用一个带有同步fMRI记录的复杂人类游戏数据集来研究这一问题，其中参与者学习需要规则发现、假设修正和多步规划的新颖电子游戏。我们从玩游戏、匹配人类学习行为和预测同一任务中大脑活动这三方面的能力联合评估模型，将一组前沿大型推理模型（LRM）与无模型和基于模型的深度强化学习智能体以及一个基于理论的贝叶斯智能体进行比较。我们发现，前沿LRM最接近人类在游戏探索过程中的行为模式，并且在皮层和皮层下区域对大脑活动的预测比两种强化学习方案都好一个数量级，且这些效应在置换对照下保持稳健。通过有针对性的操控，我们进一步表明，大脑对齐反映的是模型对游戏状态的上下文内表征，而非其下游的规划或推理。我们的结果确立了LRM作为人类在复杂、自然化环境中学习和决策的有力计算解释。带有互动回放的项目页面：此 https URL

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

超越配对：你的语言模型正在秘密优化偏好图

Authors: Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08037
Pdf link: https://arxiv.org/pdf/2605.08037
Abstract Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett--Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
中文摘要 直接偏好优化（DPO）通过成对偏好比较来对齐语言模型，为基于人类反馈的强化学习（RL）提供了一种简单且有效的替代方案。然而，在许多实际场景中，训练数据包含每个提示的多个 rollout，由此产生丰富的偏好结构，而成对DPO无法利用这一结构。将这些数据拆解成相互独立的配对会丢弃传递性，引入冗余或冲突的监督，并可能导致优化不稳定。我们提出了图直接偏好优化（GraphDPO），这是DPO的一种有原则的推广，作用于由 rollout 排名诱导的有向无环偏好图。GraphDPO将支配关系编码为边，并优化一个受Plackett--Luce启发的图结构目标，该目标在图邻域上聚合监督，在保证传递性的同时将标准DPO作为特例恢复。为了处理离散或稀疏信号，我们引入了等价类构造，其中偏好相同的响应形成图层，层内边贡献零损失，从而防止虚假梯度。尽管利用了完整的图结构，GraphDPO 通过高效的 log-sum-exp 聚合保持了每个提示的线性复杂度。我们还加入了可选的真值（ground-truth）锚定：将已验证的解作为支配节点插入，并采用退火调度，在稳定早期训练的同时逐步放松 oracle 监督。在推理和程序合成任务上的实验显示了更优的性能，表明图结构偏好建模是成对和列表式对齐目标的一种可扩展且稳健的替代方案。
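
As a rough illustration of the layered idea in this abstract, the sketch below scores each response against itself plus every response in a strictly worse layer, using a running log-sum-exp so that no response pairs are enumerated. This is not the paper's GraphDPO objective, whose exact form the abstract does not give. The function name `layered_preference_loss`, the `scores` input (assumed to be DPO-style implicit rewards), and the `layers` encoding (0 = best) are all hypothetical.

```python
import numpy as np

def layered_preference_loss(scores, layers):
    """Plackett-Luce-style loss over ranked layers (illustrative assumption).

    scores: one implicit reward per response for a single prompt.
    layers: integer rank layer per response, 0 = best; ties share a layer.
    """
    scores = np.asarray(scores, dtype=float)
    layers = np.asarray(layers)
    loss = 0.0
    lower_lse = -np.inf  # log-sum-exp of scores in all strictly worse layers
    # Walk from the worst layer to the best, accumulating dominated scores.
    for layer in sorted(set(layers.tolist()), reverse=True):
        members = scores[layers == layer]
        if np.isfinite(lower_lse):
            # -log( exp(s_i) / (exp(s_i) + sum over dominated exp(s_j)) )
            loss += np.sum(np.logaddexp(members, lower_lse) - members)
        # Responses in the same layer never compete with each other.
        lower_lse = np.logaddexp(lower_lse, np.logaddexp.reduce(members))
    return float(loss)

# Usage: three rollouts, the last two tied in the worse layer.
example_loss = layered_preference_loss([1.2, 0.1, -0.4], [0, 1, 1])
```

With exactly two responses in two different layers, this sketch reduces to the familiar pairwise term -log sigmoid(s_w - s_l), and a prompt whose responses are all tied contributes zero loss. Both properties mirror what the abstract describes.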

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

指数效用的强化学习：贴现 MDP 中的算法与收敛

Authors: Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.08053
Pdf link: https://arxiv.org/pdf/2605.08053
Abstract Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning--style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
中文摘要 在折现马尔可夫决策过程（MDP）中进行指数效用优化的强化学习（RL）缺乏有原则的基于价值的算法。我们在固定风险厌恶的设置下填补这一空白。基于 \cite{porteus1975optimality} 中研究的指数效用 Bellman 型方程，我们推导出两个 Q 值式扩展，并证明相关算子分别在 $L_\infty$ 度量和 sup-log/Thompson 度量下是压缩映射。我们刻画了它们的不动点，并证明所诱导的贪婪平稳策略在平稳策略中对指数效用目标是最优的。这些结构性结果引出了两种无模型算法：一种是双时间尺度的Q学习式算法，我们为其建立了几乎必然收敛性，并通过时间尺度分离给出有限时间收敛率；另一种是由次线性幂律算子支配的单时间尺度算法。由于后者在标准度量下不存在全局压缩性，我们利用基于局部利普希茨性、单调性、齐次性和迪尼导数的精细论证证明其收敛性，并给出一个标量情形的有限时间分析，以凸显在向量情形下获得收敛率的挑战。我们的工作为指数效用目标下基于价值的强化学习奠定了基础。
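
For readers unfamiliar with exponential-utility objectives, the sketch below estimates one common form of the criterion from sampled trajectories: the certainty equivalent (1/beta) * log E[exp(beta * G)] of the discounted return G. The abstract does not state the paper's exact objective or sign convention, so the convention here (beta < 0 is risk-averse) and the helper names are assumptions. The sketch does not implement the paper's Q-value operators or algorithms.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def certainty_equivalent(returns, beta):
    """Monte Carlo estimate of (1/beta) * log mean(exp(beta * G)).

    Assumed convention: beta < 0 penalizes variability (risk-averse),
    beta > 0 rewards it (risk-seeking). beta must be nonzero.
    """
    z = beta * np.asarray(returns, dtype=float)
    shift = z.max()  # subtract the max before exponentiating for stability
    log_mean_exp = shift + np.log(np.mean(np.exp(z - shift)))
    return float(log_mean_exp / beta)

# Usage: two equally likely discounted returns, scored under risk aversion.
sample_returns = [discounted_return([0, 0, 0], 0.9), discounted_return([1, 1, 2], 0.9)]
risk_averse_value = certainty_equivalent(sample_returns, beta=-1.0)
```

Under this convention, the risk-averse value of a random return never exceeds its mean (by Jensen's inequality), and a deterministic return is scored at its face value.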

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

基于评分标准的强化学习：面向可泛化推理的结构化评判奖励

Authors: Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.08061
Pdf link: https://arxiv.org/pdf/2605.08061
Abstract We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves $71.7\%$ normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
中文摘要 我们认为，将奖励分解为加权的、可验证的标准，并使用大型语言模型（LLM）评判者对其打分，可以提供一种部分得分式的优化信号：每个响应不再只有二元结果或单一整体评分，而是按多个任务特定标准进行评分。我们形式化了 \emph{基于评分标准的强化学习（rubric-grounded RL）}：在该框架中，策略针对由冻结的 LLM 评判者产生的结构化、多标准奖励进行优化，而评判者所依据的辅助依据信息是策略从未见到的。我们从一个源自科学与技术信息办公室（OSTI）、约含10万份科学技术文档的语料库中推导评分标准，并使用组相对策略优化（GRPO）训练 Llama-3.1-8B-Instruct，以此实例化该框架。经过基于GRPO的训练，模型在留出的评分标准评估中获得了 $71.7\%$ 的归一化奖励。GRPO调优后的策略在四个并非源自训练语料库的推理基准（GSM8K、MATH、GPQA Main和GPQA Diamond）上也优于基础模型。这些结果提供了证据，表明结构化的、以文档为依据的奖励能够提升留出评分标准上的表现，并诱导出超越用于构建训练环境的语料库的可迁移推理行为。
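
To make the partial-credit signal concrete, the sketch below combines per-criterion judge scores into one normalized reward and then converts a group of rewards into group-relative advantages, which is the standard GRPO normalization. The weighting scheme, the [0, 1] score range, and the function names are assumptions for illustration. The abstract does not specify how the paper's judge scores are aggregated.

```python
import numpy as np

def rubric_reward(criterion_scores, weights):
    """Weighted average of per-criterion judge scores, each assumed in [0, 1]."""
    scores = np.asarray(criterion_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * scores) / np.sum(weights))

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: three sampled responses to one prompt, graded on three weighted criteria.
weights = [2.0, 1.0, 1.0]
group_rewards = [
    rubric_reward([1.0, 0.0, 0.5], weights),
    rubric_reward([1.0, 1.0, 1.0], weights),
    rubric_reward([0.0, 0.0, 0.5], weights),
]
advantages = group_relative_advantages(group_rewards)
```

Because each response earns a fractional reward, two responses that would both fail a binary check can still receive different advantages within the group.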

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

123D：大规模统一多模态自动驾驶数据

Authors: Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen, Jiabao Wang, Changhui Jing, Maximilian Igl, Holger Caesar, Boris Ivanovic, Yiyi Liao, Andreas Geiger, Kashyap Chitta
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.08084
Pdf link: https://arxiv.org/pdf/2605.08084
Abstract The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at this https URL.
中文摘要 对自动驾驶的追求催生了整个机器人领域中最丰富的传感器数据集合之一。然而，其规模和多样性在很大程度上仍未得到利用。每个数据集采用不同的二维和三维模态，如摄像头、激光雷达、自车状态、标注、交通信号灯和高精地图，且采样率和同步方案各异。它们以碎片化的格式存在，需要复杂的依赖，无法在同一开发环境中原生共存。此外，标注规范上的重大不一致阻碍了跨多个数据集的训练或泛化评估。我们提出123D，一个开源框架，通过单一API统一此类多模态驾驶数据。为了处理同步，我们将每个模态存储为独立的带时间戳的事件流，不规定速率，从而支持跨任意数据集的同步或异步访问。利用123D，我们整合了八个总计3,300小时、90,000公里的真实驾驶数据集，以及一个带有可配置采集脚本的合成数据集，并提供了数据分析和可视化工具。我们开展了一项系统研究，比较标注统计量，并评估每个数据集的位姿和标定精度。此外，我们展示了123D支持的两项应用：跨数据集的3D目标检测迁移和用于规划的强化学习，并对未来方向提出了建议。代码和文档可在此 https URL 获取。
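
The storage idea in this abstract (each modality kept as its own timestamped event stream, with no prescribed rate) can be pictured with the minimal sketch below. It is not the 123D API: `EventStream`, `latest_at`, and `synchronized` are hypothetical names, and the rule of taking the latest event at or before the reference time is an assumed synchronization policy.

```python
from bisect import bisect_right

class EventStream:
    """One modality stored as (timestamp, payload) events at its native rate."""

    def __init__(self, events):
        events = sorted(events, key=lambda event: event[0])
        self.timestamps = [t for t, _ in events]
        self.payloads = [p for _, p in events]

    def latest_at(self, t):
        """Return the most recent payload with timestamp <= t, or None."""
        i = bisect_right(self.timestamps, t)
        return None if i == 0 else self.payloads[i - 1]

def synchronized(reference, others):
    """Build one frame per reference event by sampling the other streams."""
    frames = []
    for t, payload in zip(reference.timestamps, reference.payloads):
        frame = {"t": t, "ref": payload}
        for name, stream in others.items():
            frame[name] = stream.latest_at(t)
        frames.append(frame)
    return frames

# Usage: a 10 Hz camera stream aligned with a lidar stream on a different clock.
camera = EventStream([(0.0, "img0"), (0.1, "img1"), (0.2, "img2")])
lidar = EventStream([(0.05, "sweep0"), (0.15, "sweep1")])
frames = synchronized(camera, {"lidar": lidar})
```

Asynchronous access in this picture is simply iterating a single stream on its own timestamps, while synchronous access picks a reference stream and samples the others against it.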

Keyword: diffusion policy

Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

去中心化扩散策略学习，促进合作多智能体强化学习中的探索

Authors: Yuyang Zhang, Haldun Balim, Na Li
Subjects: Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.07101
Pdf link: https://arxiv.org/pdf/2605.07101
Abstract Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft multi-agent challenge, and observe consistently improved performance.
中文摘要 合作式多智能体强化学习（MARL）涉及复杂的智能体交互，需要有效的探索策略。一类重要的MARL算法，即去中心化 softmax 策略梯度（DecSPG），通过基于能量的策略更新来应对这一问题。然而在实践中，这类基于能量的策略难以维护，通常会被投影到高斯策略类上。在本工作中，我们表明高斯策略有限的表达能力严重阻碍了DecSPG中的探索，且随着智能体数量的增加，这一限制愈发严重。为解决这一问题，我们提出了去中心化扩散策略学习（DDPL），它用去噪扩散概率模型来参数化每个智能体的策略，这是一种表达能力强的生成模型，能够捕捉多模态动作分布以增强探索。DDPL通过重要性采样分数匹配（ISSM）实现扩散策略的高效在线训练，这是一种具有理论保证的新型训练方法。我们在具有代表性的连续动作MARL基准上评估DDPL，包括多智能体粒子环境、多智能体MuJoCo、IsaacLab以及用JAX重新实现的星际争霸多智能体挑战，并观察到性能的一致提升。

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

TAVIS：模仿学习中第一人称主动视觉与预见性注视的基准

Authors: Giacomo Spigler
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.07943
Pdf link: https://arxiv.org/pdf/2605.07943
Abstract Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $\pi_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at this https URL and this https URL.
中文摘要 主动视觉（即策略在操作过程中控制自身的注视）已成为模仿学习的一项关键能力，过去一年中已有多个独立系统展示了其优势。然而，目前还没有共享的基准来比较不同方法，或量化主动视觉在哪些任务类型、何种条件下贡献了什么。我们引入了TAVIS，一套面向主动视觉模仿学习的评估基础设施，包含两个互补的任务套件，即TAVIS-Head（5个任务，通过可平移/俯仰的颈部进行全局搜索）和TAVIS-Hands（3个任务，通过腕部摄像头应对局部遮挡），运行于两种人形躯干本体（GR1T2、Reachy2）之上，并基于IsaacLab构建。TAVIS提供了三种评估原语：在相同演示上的头部摄像头与固定摄像头配对对比协议；GALT（Gaze-Action Lead Time，注视-动作提前时间），一种植根于认知科学和人机交互（HRI）的新指标，用于量化所学策略中的预见性注视；以及程序化的 ID/OOD 划分。使用 Diffusion Policy 和 $\pi_0$ 的基线实验表明：（i）主动视觉总体上有帮助，但收益取决于任务，而非一致存在；（ii）在两个套件上，多任务策略在受控分布偏移下都会急剧退化；（iii）仅靠模仿就能产生预见性注视，其提前时间中位数与人类遥操作员参考值相当。代码、评估脚本、演示（LeRobot v3.0；约2200个回合）和训练好的基线发布于此 https URL 和此 https URL。