Arxiv Papers of Today

Generated: 2026-02-10 17:02:14 (UTC+8); arXiv release time: 2026-02-10 20:00 EST (2026-02-11 09:00 UTC+8)

Total relevant papers today: 94

Keyword: reinforcement learning

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Authors: Yingru Li, Jiawei Xu, Ziniu Li, Jiacai Liu, Wei Liu, Yuxuan Tong, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07078
Pdf link: https://arxiv.org/pdf/2602.07078
Abstract Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the performance of large group sizes ($N=32$) with only $N=4$, reducing token consumption by over 65% across single-turn and tool-integrated reasoning tasks.
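The abstract's two ingredients can be illustrated with a minimal sketch, which is not the paper's derivation. It relies on two standard facts: for a softmax policy, the gradient of log p[token] with respect to the logits is onehot(token) - p, so its squared norm is available from forward-pass probabilities alone; and the classic variance-minimizing baseline is a reward average weighted by squared gradient norms. Summing per-token squared norms into a sequence weight (`sequence_weight`) and all toy numbers are assumptions made here for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_sq_norm(probs, token):
    """Squared norm of d log p[token] / d logits, i.e. of onehot(token) - probs."""
    return 1.0 - 2.0 * probs[token] + sum(p * p for p in probs)


def sequence_weight(step_probs, tokens):
    """ASSUMPTION: sum of per-token squared norms stands in for the sequence gradient norm."""
    return sum(logit_grad_sq_norm(p, t) for p, t in zip(step_probs, tokens))


def weighted_baseline(rewards, weights):
    """b = sum(w_i * R_i) / sum(w_i); falls back to the plain mean if all weights are zero."""
    total = sum(weights)
    if total == 0.0:
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Toy group of two rollouts over a 3-token vocabulary (all numbers are made up).
confident = [softmax([5.0, 0.0, 0.0])] * 2  # near-deterministic steps, tiny gradient norm
uncertain = [softmax([0.0, 0.0, 0.0])] * 2  # uniform steps, large gradient norm
weights = [sequence_weight(confident, [0, 0]), sequence_weight(uncertain, [0, 0])]
rewards = [1.0, 0.0]
baseline = weighted_baseline(rewards, weights)
advantages = [r - baseline for r in rewards]
```

In this toy group the baseline is pulled toward the reward of the high-gradient-norm rollout, whereas a plain group mean would be 0.5.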

Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting

Authors: Zinan Lv, Yeqian Qian, Chen Sang, Hao Liu, Danping Zou, Ming Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07101
Pdf link: https://arxiv.org/pdf/2602.07101
Abstract UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.

Risk-Sensitive Exponential Actor Critic

Authors: Alonso Granados, Jason Pacheco
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07202
Pdf link: https://arxiv.org/pdf/2602.07202
Abstract Model-free deep reinforcement learning (RL) algorithms have achieved tremendous success on a range of challenging tasks. However, safety concerns remain when these methods are deployed on real-world applications, necessitating risk-aware agents. A common utility for learning such risk-aware agents is the entropic risk measure, but current policy gradient methods optimizing this measure must perform high-variance and numerically unstable updates. As a result, existing risk-sensitive model-free approaches are limited to simple tasks and tabular settings. In this paper, we provide a comprehensive theoretical justification for policy gradient methods on the entropic risk measure, including on- and off-policy gradient theorems for the stochastic and deterministic policy settings. Motivated by theory, we propose risk-sensitive exponential actor-critic (rsEAC), an off-policy model-free approach that incorporates novel procedures to avoid the explicit representation of exponential value functions and their gradients, and optimizes its policy w.r.t the entropic risk measure. We show that rsEAC produces more numerically stable updates compared to existing approaches and reliably learns risk-sensitive policies in challenging risky variants of continuous tasks in MuJoCo.
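For reference, the entropic risk measure of a return R with risk parameter beta is (1/beta) * log E[exp(beta * R)]. The sketch below estimates it from sampled returns using a log-sum-exp shift, which hints at why naive exponential updates are numerically fragile. It is a sample-based illustration only, not the rsEAC algorithm; the function name and the sample values are hypothetical.

```python
import math


def entropic_risk(returns, beta):
    """Sample estimate of (1/beta) * log mean(exp(beta * R)).

    With reward-style returns, beta < 0 is risk-averse and beta > 0 is
    risk-seeking; beta == 0 is treated as the risk-neutral mean.
    """
    if beta == 0.0:
        return sum(returns) / len(returns)
    scaled = [beta * r for r in returns]
    m = max(scaled)  # shift by the maximum so every exponent is <= 0
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta


samples = [0.0, 10.0]
risk_averse = entropic_risk(samples, -1.0)   # below the mean of 5.0
risk_seeking = entropic_risk(samples, 1.0)   # above the mean of 5.0
large = entropic_risk([1000.0, 1001.0], 5.0)  # exp(5000.0) would overflow without the shift
```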

Cerebellar-Inspired Residual Control for Fault Recovery: From Inference-Time Adaptation to Structural Consolidation

Authors: Nethmi Jayasinghe, Diana Gontero, Spencer T. Brown, Vinod K. Sangwan, Mark C. Hersam, Amit Ranjan Trivedi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07227
Pdf link: https://arxiv.org/pdf/2602.07227
Abstract Robotic policies deployed in real-world environments often encounter post-training faults, where retraining, exploration, or system identification are impractical. We introduce an inference-time, cerebellar-inspired residual control framework that augments a frozen reinforcement learning policy with online corrective actions, enabling fault recovery without modifying base policy parameters. The framework instantiates core cerebellar principles, including high-dimensional pattern separation via fixed feature expansion, parallel microzone-style residual pathways, and local error-driven plasticity with excitatory and inhibitory eligibility traces operating at distinct time scales. These mechanisms enable fast, localized correction under post-training disturbances while avoiding destabilizing global policy updates. A conservative, performance-driven meta-adaptation regulates residual authority and plasticity, preserving nominal behavior and suppressing unnecessary intervention. Experiments on MuJoCo benchmarks under actuator, dynamic, and environmental perturbations show improvements of up to +66% on HalfCheetah-v5 and +53% on Humanoid-v5 under moderate faults, with graceful degradation under severe shifts and complementary robustness from consolidating persistent residual corrections into policy parameters.

Evolving LLM-Derived Control Policies for Residential EV Charging and Vehicle-to-Grid Energy Optimization

Authors: Vishesh Purnananda, Benjamin John Wruck, Mingyu Guo
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2602.07275
Pdf link: https://arxiv.org/pdf/2602.07275
Abstract This research presents a novel application of Evolutionary Computation to the domain of residential electric vehicle (EV) energy management. While reinforcement learning (RL) achieves high performance in vehicle-to-grid (V2G) optimization, it typically produces opaque "black-box" neural networks that are difficult for consumers and regulators to audit. Addressing this interpretability gap, we propose a program search framework that leverages Large Language Models (LLMs) as intelligent mutation operators within an iterative prompt-evaluation-repair loop. Utilizing the high-fidelity EV2Gym simulation environment as a fitness function, the system undergoes successive refinement cycles to synthesize executable Python policies that balance profit maximization, user comfort, and physical safety constraints. We benchmark four prompting strategies: Imitation, Reasoning, Hybrid and Runtime, evaluating their ability to discover adaptive control logic. Results demonstrate that the Hybrid strategy produces concise, human-readable heuristics that achieve 118% of the baseline profit, effectively discovering complex behaviors like anticipatory arbitrage and hysteresis without explicit programming. This work establishes LLM-driven Evolutionary Computation as a practical approach for generating EV charging control policies that are transparent, inspectable, and suitable for real residential deployment.

Optimizing Chlorination in Water Distribution Systems via Surrogate-assisted Neuroevolution

Authors: Rivaaj Monsia, Daniel Young, Olivier Francon, Risto Miikkulainen
Subjects: Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.07299
Pdf link: https://arxiv.org/pdf/2602.07299
Abstract Ensuring the microbiological safety of large, heterogeneous water distribution systems (WDS) typically requires managing appropriate levels of disinfectant residuals, including chlorine. WDS include complex fluid interactions that are nonlinear and noisy, making such maintenance a challenging problem for traditional control algorithms. This paper proposes an evolutionary framework for this problem based on neuroevolution, multi-objective optimization, and surrogate modeling. Neural networks were evolved with NEAT to inject chlorine at strategic locations in the distribution network at select times. NSGA-II was employed to optimize four objectives: minimizing the total amount of chlorine injected, keeping chlorine concentrations homogeneous across the network, ensuring that maximum concentrations did not exceed safe bounds, and distributing the injections regularly over time. Each network was evaluated against a surrogate model, i.e., a neural network trained to emulate EPANET, an industry-level hydraulic WDS simulator that is accurate but too computationally expensive to support machine learning directly. The evolved controllers produced a diverse range of Pareto-optimal policies that could be implemented in practice, outperforming standard reinforcement learning methods such as PPO. The results thus suggest a pathway toward improving urban water systems, and highlight the potential of using evolution with surrogate modeling to optimize complex real-world systems.
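"Pareto-optimal" here means that no other candidate is at least as good on every objective and strictly better on one. The sketch below extracts that first non-dominated front for minimization objectives. NSGA-II additionally ranks later fronts and uses crowding distance, which this sketch omits. The objective tuples are made-up placeholders, not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (total chlorine injected, concentration spread) scores for four controllers.
candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
front = pareto_front(candidates)
```

Here `(3.0, 3.0)` is dropped because `(2.0, 2.0)` is better on both objectives; the remaining three points are different trade-offs.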

Adaptive Scaffolding for Cognitive Engagement in an Intelligent Tutoring System

Authors: Sutapa Dey Tithi, Nazia Alam, Tahreem Yasir, Yang Shi, Xiaoyi Tian, Min Chi, Tiffany Barnes
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07308
Pdf link: https://arxiv.org/pdf/2602.07308
Abstract The ICAP framework defines four cognitive engagement levels: Passive, Active, Constructive, and Interactive, where increased cognitive engagement can yield improved learning. However, personalizing learning activities that elicit the optimal level of cognitive engagement remains a key challenge in intelligent tutoring systems (ITS). In this work, we develop and evaluate a system that adaptively scaffolds cognitive engagement by dynamically selecting worked examples in two different ICAP modes: (active) Guided examples and (constructive) Buggy examples. We compare Bayesian Knowledge Tracing (BKT) and Deep Reinforcement Learning (DRL) as adaptive methods against a non-adaptive baseline method for selecting example type in a logic ITS. Our experiment with 113 students demonstrates that both adaptive policies significantly improved student performance on test problems. BKT yielded the largest improvement in posttest scores for low prior knowledge students, helping them catch up with their high prior knowledge peers, whereas DRL yielded significantly higher posttest scores among high prior knowledge students. This paper contributes new insights into the complex interactions of cognitive engagement and adaptivity and their results on learning outcomes.
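As background on one of the two adaptive methods, the standard Bayesian Knowledge Tracing update can be sketched as follows: a Bayes step on the observed response, then a learning transition. The parameter values are illustrative, and how the tutor maps the resulting mastery estimate to Guided versus Buggy examples is not stated in the abstract, so it is not modeled here.

```python
def bkt_update(p_known, correct, p_slip, p_guess, p_learn):
    """One step of standard Bayesian Knowledge Tracing.

    p_known: prior probability that the skill is mastered
    p_slip:  probability of a wrong answer despite mastery
    p_guess: probability of a right answer without mastery
    p_learn: probability of acquiring the skill after this opportunity
    """
    if correct:
        num = p_known * (1.0 - p_slip)
        den = num + (1.0 - p_known) * p_guess
    else:
        num = p_known * p_slip
        den = num + (1.0 - p_known) * (1.0 - p_guess)
    posterior = num / den  # belief after seeing the response
    return posterior + (1.0 - posterior) * p_learn  # belief after the learning transition


after_correct = bkt_update(0.5, True, p_slip=0.1, p_guess=0.2, p_learn=0.1)
after_wrong = bkt_update(0.5, False, p_slip=0.1, p_guess=0.2, p_learn=0.1)
```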

High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

Authors: Rajat Arora, Ye Tao, Jianqiang Shen, Ping Liu, Muchen Wu, Qianqi Shen, Benjamin Le, Fedor Borisyuk, Jingwei Wu, Wenjing Zhang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07333
Pdf link: https://arxiv.org/pdf/2602.07333
Abstract Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple LinkedIn products, one of the world's largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.

Meta-Reinforcement Learning for Robust and Non-greedy Control Barrier Functions in Spacecraft Proximity Operations

Authors: Minduli C. Wijayatunga, Richard Linares, Roberto Armellin
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2602.07335
Pdf link: https://arxiv.org/pdf/2602.07335
Abstract Autonomous spacecraft inspection and docking missions require controllers that can guarantee safety under thrust constraints and uncertainty. Input-constrained control barrier functions (ICCBFs) provide a framework for safety certification under bounded actuation; however, conventional ICCBF formulations can be overly conservative and exhibit limited robustness to uncertainty, leading to high fuel consumption and reduced mission feasibility. This paper proposes a framework in which the full hierarchy of class-$\mathcal{K}$ functions defining the ICCBF recursion is parameterized and learned, enabling localized shaping of the safe set and reduced conservatism. A control margin is computed efficiently using differential algebra to enable the learned continuous-time ICCBFs to be implemented on time-sampled dynamical systems typical of spacecraft proximity operations. A meta-reinforcement learning scheme is developed to train a policy that generates ICCBF parameters over a distribution of hidden physical parameters and uncertainties, using both multilayer perceptron (MLP) and recurrent neural network (RNN) architectures. Simulation results on cruise control, spacecraft inspection, and docking scenarios demonstrate that the proposed approach maintains safety while reducing fuel consumption and improving feasibility relative to fixed class-$\mathcal{K}$ ICCBFs, with the RNN showing a particularly strong advantage in the more complex inspection case.
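To make the role of the class-K function concrete, here is a deliberately simplified sketch: a plain (not input-constrained-recursive) control barrier function filter for a 1-D single integrator x' = u with safe set h(x) = x_max - x >= 0 and a linear class-K function alpha * h. The condition h' + alpha * h >= 0 reduces to u <= alpha * (x_max - x). The system, the fixed gain `alpha`, and the Euler step are assumptions made here for illustration. In the paper the class-K functions are learned, and a control margin handles time sampling.

```python
def cbf_filter(x, u_nom, x_max, alpha, u_max):
    """Clip the nominal input so that h' + alpha * h >= 0 holds with h = x_max - x."""
    h = x_max - x
    u = min(u_nom, alpha * h)          # CBF condition for x' = u
    return max(-u_max, min(u_max, u))  # actuator limits


def simulate(x0, u_nom, x_max, alpha, u_max, dt, steps):
    """Forward-Euler rollout with the safety filter applied at every step."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nom, x_max, alpha, u_max)
        trajectory.append(x)
    return trajectory


# The nominal controller always pushes toward the boundary at full speed.
traj = simulate(x0=0.0, u_nom=1.0, x_max=1.0, alpha=2.0, u_max=1.0, dt=0.1, steps=200)
```

A larger `alpha` lets the state approach the boundary faster (less conservative), but in this sampled-time sketch the state only stays inside the safe set while `dt * alpha <= 1`. That is a toy version of the continuous-versus-sampled gap the abstract mentions.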

Scalable Dexterous Robot Learning with AR-based Remote Human-Robot Interactions

Authors: Yicheng Yang, Ruijiao Li, Lifeng Wang, Shuai Zheng, Shunzheng Ma, Keyu Zhang, Tuoyu Sun, Chenyun Dai, Jie Ding, Zhuo Zou
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07341
Pdf link: https://arxiv.org/pdf/2602.07341
Abstract This paper focuses on scalable robot learning for manipulation with dexterous robot arm-hand systems, where remote human-robot interactions via augmented reality (AR) are used to collect expert demonstration data efficiently. For such a system, we present a unified framework to address general manipulation tasks. Specifically, the proposed method consists of two phases: i) in the first phase, for pretraining, the policy is created by behavior cloning (BC), leveraging the data from our AR-based remote human-robot interaction system; ii) in the second phase, a contrastive-learning-empowered reinforcement learning (RL) method is developed to obtain a more efficient and robust policy than BC, and a projection head is designed to accelerate learning. An event-driven augmented reward is adopted to enhance safety. To validate the proposed method, both physics simulations in PyBullet and real-world experiments are carried out. The results demonstrate that, compared to classic proximal policy optimization and soft actor-critic policies, our method not only significantly speeds up inference but also achieves a much higher success rate on the manipulation tasks. An ablation study confirms that the proposed RL with contrastive learning overcomes policy collapse. Supplementary demonstrations are available at this https URL.

Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Authors: Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.07422
Pdf link: https://arxiv.org/pdf/2602.07422
Abstract Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality-security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality-grounded vulnerability-inducing coding tasks for online RL rollouts, and (ii) training a reasoning-based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state-of-the-art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%. We release our code, dataset and model checkpoints at this https URL.

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Authors: Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07441
Pdf link: https://arxiv.org/pdf/2602.07441
Abstract Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art when combined with the basic TD3+BC.

SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Authors: Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.07458
Pdf link: https://arxiv.org/pdf/2602.07458
Abstract Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

Authors: Yijie Chen, Yijin Liu, Fandong Meng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.07464
Pdf link: https://arxiv.org/pdf/2602.07464
Abstract Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at this https URL
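The general shape of a cross-entropy objective with a masked entropy bonus can be sketched as below. This is only an illustration of "CE loss plus a selectively applied entropy regularizer". The abstract does not specify how the mask is chosen or how the term is weighted, so `mask` and `lam` are placeholders supplied by the caller, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ce_with_masked_entropy(step_probs, targets, mask, lam):
    """Mean over tokens of CE(target) - lam * mask_t * H(p_t).

    ASSUMPTION: mask_t in {0, 1} marks positions where diversity is encouraged.
    The selection rule itself is not modeled here.
    """
    total = 0.0
    for probs, target, m in zip(step_probs, targets, mask):
        ce = -math.log(probs[target])
        total += ce - lam * m * token_entropy(probs)
    return total / len(targets)


# Two positions over a 4-token vocabulary (made-up distributions).
steps = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
plain_ce = ce_with_masked_entropy(steps, [0, 1], mask=[0, 0], lam=0.1)
regularized = ce_with_masked_entropy(steps, [0, 1], mask=[0, 1], lam=0.1)
```

With the mask all zeros the objective reduces to ordinary mean cross-entropy; switching a position on subtracts a term proportional to that position's entropy.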

CoMI-IRL: Contrastive Multi-Intention Inverse Reinforcement Learning

Authors: Antonio Mone, Frans A. Oliehoek, Luciano Cavalcante Siebert
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07496
Pdf link: https://arxiv.org/pdf/2602.07496
Abstract Inverse Reinforcement Learning (IRL) seeks to infer reward functions from expert demonstrations. When demonstrations originate from multiple experts with different intentions, the problem is known as Multi-Intention IRL (MI-IRL). Recent deep generative MI-IRL approaches couple behavior clustering and reward learning, but typically require prior knowledge of the number of true behavioral modes $K^*$. This reliance on expert knowledge limits their adaptability to new behaviors, and only enables analysis related to the learned rewards, and not across the behavior modes used to train them. We propose Contrastive Multi-Intention IRL (CoMI-IRL), a transformer-based unsupervised framework that decouples behavior representation and clustering from downstream reward learning. Our experiments show that CoMI-IRL outperforms existing approaches without a priori knowledge of $K^*$ or labels, while allowing for visual interpretation of behavior relationships and adaptation to unseen behavior without full retraining.

Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Authors: Yankai Yang, Yancheng Long, Hongyang Wei, Wei Chen, Tianke Zhang, Kaiyu Jiang, Haonan Fan, Changyi Liu, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07533
Pdf link: https://arxiv.org/pdf/2602.07533
Abstract Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
中文摘要 奖励模型对于从人类反馈进行强化学习至关重要，因为它们决定了生成模型的对齐质量和可靠性。对于图像编辑等复杂任务，需要奖励模型来捕捉全局语义一致性和超越局部相似性的隐含逻辑约束。现有的奖励建模方法存在明显局限性。辨别性奖励模型与人类偏好相符，但由于推理监督有限，复杂的语义处理上存在困难。生成奖励模型提供了更强的语义理解和推理，但推理成本高昂，且难以直接与人类偏好对齐。为此，我们提出了联合奖励建模（JRM），它在共享视觉语言骨干上共同优化偏好学习和语言建模。这种方法将生成模型的语义和推理能力内化为高效的判别表征，从而实现快速且准确的评估。JRM在MMRB2和EditReward-Bench上实现了最先进的结果，并显著提升了下游在线强化学习的稳定性和性能。这些结果表明，联合训练有效地连接了奖励建模中的效率和语义理解。
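
A minimal sketch of what "jointly optimizing preference learning and language modeling" could look like as a single loss. The abstract does not give JRM's actual objective, so everything here is an assumption for illustration: the Bradley-Terry form of the preference term, the weighted-sum combination, and the hypothetical names `preference_loss`, `language_modeling_loss`, `joint_loss`, and `lm_weight`.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)), written so it stays stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def language_modeling_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over a list of per-position logit vectors."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)

def joint_loss(score_chosen, score_rejected, token_logits, target_ids, lm_weight=0.5):
    """Assumed combination: preference term plus a weighted language-modeling term."""
    return (preference_loss(score_chosen, score_rejected)
            + lm_weight * language_modeling_loss(token_logits, target_ids))
```

In a real model both terms would be computed from heads on the same vision-language backbone, so gradients from the language-modeling term also shape the representation used for scoring.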

Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge

通过预训练变分桥实现统一生物分子轨迹生成

Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07588
Pdf link: https://arxiv.org/pdf/2602.07588
Abstract Molecular Dynamics (MD) simulations provide a fundamental tool for characterizing molecular behavior at full atomic resolution, but their applicability is severely constrained by the computational cost. To address this, a surge of deep generative models has recently emerged to learn dynamics at coarsened timesteps for efficient trajectory generation, yet they either generalize poorly across systems or, due to limited molecular diversity of trajectory data, fail to fully exploit structural information to improve generative fidelity. Here, we present the Pretrained Variational Bridge (PVB) in an encoder-decoder fashion, which maps the initial structure into a noised latent space and transports it toward stage-specific targets through augmented bridge matching. This unifies training on both single-structure and paired trajectory data, enabling consistent use of cross-domain structural knowledge across training stages. Moreover, for protein-ligand complexes, we further introduce a reinforcement learning-based optimization via adjoint matching that speeds progression toward the holo state, which supports efficient post-optimization of docking poses. Experiments on proteins and protein-ligand complexes demonstrate that PVB faithfully reproduces thermodynamic and kinetic observables from MD while delivering stable and efficient generative dynamics.
中文摘要 分子动力学（MD）模拟为以全原子分辨率表征分子行为提供了基础工具，但其适用性受到计算成本的严重限制。为此，近年来涌现出大量深度生成模型，旨在以粗时间步学习动力学以实现高效的轨迹生成，但它们要么在系统间推广能力不佳，要么由于轨迹数据分子多样性有限，未能充分利用结构信息以提升生成保真度。这里，我们以编码器-解码器方式介绍预训练变分桥（PVB），该系统将初始结构映射到有噪声的潜在空间，并通过增强桥匹配将其传输到阶段特定目标。这统一了单结构和配对轨迹数据的训练，使跨领域结构知识在各训练阶段都能保持一致地使用。此外，对于蛋白质-配体复合物，我们进一步引入了基于强化学习的伴随匹配优化，加快向全息态的推进，支持对接姿态的高效后优化。蛋白质和蛋白-配体复合物实验表明，PVB忠实地再现了MD的热力学和动力学可观测量，同时实现了稳定高效的生成动力学。

Learning to Self-Verify Makes Language Models Better Reasoners

学会自我验证使语言模型成为更好的推理者

Authors: Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07594
Pdf link: https://arxiv.org/pdf/2602.07594
Abstract Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.
中文摘要 近年来的大型语言模型（LLMs）在为复杂任务生成有前景的推理路径方面表现出色。然而，尽管生成能力强大，LLM在验证自身答案方面仍然较弱，暴露出生成与自我验证之间持续存在的能力不对称性。本研究深入探讨了训练演化过程中的这种不对称性，并表明即使在同一任务上，提升生成能力也不会带来相应的自我验证提升。有趣的是，我们发现这种不对称的反方向表现不同：学习自我验证可以有效提升生成性能，达到与标准生成训练相当的准确率，同时产生更高效、更有效的推理轨迹。基于这一观察，我们进一步探讨如何将自我验证融入生成训练，构建了一个多任务强化学习框架，将生成和自我验证作为两个独立但互补的目标进行优化。跨基准测试和模型的大量实验表明，与仅进行生成训练相比，该方法在生成和验证能力上均有性能提升。

TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

TeleBoost：一个系统化的对齐框架，实现高保真、可控且强大的视频生成

Authors: Yuanzhi Liang, Xuan'er Wu, Yirui Liu, Yijie Fang, Yizhen Fan, Ke Hao, Rui Li, Ruiying Liu, Ziqi Ni, Peng Yu, Yanbo Wang, Haibin Huang, Qizhen Weng, Chi Zhang, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07595
Pdf link: https://arxiv.org/pdf/2602.07595
Abstract Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.
中文摘要 后训练是将预训练视频生成器转变为面向生产的模型的决定性步骤，使模型能够遵循指令、可控且在较长的时间跨度内保持稳健。本报告提出了一个系统化的后训练框架，将监督式策略塑形、奖励驱动的强化学习和基于偏好的精调整合到一个单一的稳定性约束优化栈中。该框架围绕实际视频生成的约束设计，包括高昂的rollout成本、随时间累积的失败模式，以及异质、不确定且判别性往往较弱的反馈。通过将优化视为一个分阶段的诊断驱动过程，而非一系列孤立的技巧，报告总结出一套连贯的方案，旨在提升感知保真度、时间一致性和提示遵循性，同时保持初始化时建立的可控性。最终框架为构建可扩展的后训练流程提供了清晰蓝图，确保其在实际部署环境中保持稳定、可扩展和有效。

Efficient Planning in Reinforcement Learning via Model Introspection

通过模型内省实现强化学习中的高效规划

Authors: Gabriel Stella
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07719
Pdf link: https://arxiv.org/pdf/2602.07719
Abstract Reinforcement learning and classical planning are typically seen as two distinct problems, with differing formulations necessitating different solutions. Yet, when humans are given a task, regardless of the way it is specified, they can often derive the additional information needed to solve the problem efficiently. The key to this ability is introspection: by reasoning about their internal models of the problem, humans directly synthesize additional task-relevant information. In this paper, we propose that this introspection can be thought of as program analysis. We discuss examples of how this approach can be applied to various kinds of models used in reinforcement learning. We then describe an algorithm that enables efficient goal-oriented planning over the class of models used in relational reinforcement learning, demonstrating a novel link between reinforcement learning and classical planning.
中文摘要 强化学习和经典规划通常被视为两个不同的问题，不同的表述需要不同的解决方案。然而，当人类被赋予任务时，无论任务如何指定，他们通常都能获得高效解决问题所需的额外信息。这种能力的关键在于内省：通过推理自身对问题的内部模型，人类直接综合了额外的任务相关信息。本文提出，这种内省可以被视为程序分析。我们讨论了该方法如何应用于强化学习中各种模型的示例。随后，我们描述了一种算法，能够在关系强化学习中使用的模型类别中实现高效的目标导向规划，展示了强化学习与经典规划之间的新联系。

Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs

我们需要Adam吗？在大型语言模型中使用SGD实现出人意料地强大且稀疏的强化学习

Authors: Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07729
Pdf link: https://arxiv.org/pdf/2602.07729
Abstract Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
中文摘要 强化学习（RL），尤其是基于可验证奖励的强化学习（RLVR），已成为大型语言模型（LLM）训练的关键阶段，也是当前扩展工作的重点。然而，尽管近期研究凸显了强化学习与下一词元预测阶段（如预训练和监督微调）之间存在根本差异，强化学习的优化实践大体上仍沿用这些阶段的做法。其中一种做法是使用AdamW优化器，尽管内存开销较高，该优化器仍被广泛用于训练大规模Transformer。我们的分析显示，AdamW中的动量和自适应学习率在强化学习中的影响都小于在SFT中的影响，因此我们假设强化学习从Adam式的逐参数自适应学习率和动量中受益较小。我们的实验验证了这一假设：内存效率显著更高、但已知在大规模Transformer监督学习中表现较差的SGD，在LLM的强化学习中与AdamW相当，甚至超过AdamW。值得注意的是，使用SGD进行全参数微调时，在没有任何促进稀疏性的正则化的情况下，更新的模型参数不到0.02%，比AdamW少1000多倍。我们的分析为这种更新稀疏性提供了可能的原因。这些发现为大型语言模型中强化学习的优化动态提供了新的见解，并表明强化学习的参数效率可以远高于此前的认知。
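
To make the two quantities in this abstract concrete, here is a toy sketch of plain SGD next to the textbook AdamW update, plus a helper that measures what fraction of parameters a step actually changed. It is not the paper's code or measurement protocol; the names `sgd_step`, `adamw_step`, `new_adamw_state`, and `update_fraction` are hypothetical, and the numbers are illustrative only. It shows why AdamW needs two extra values per parameter (first and second moment) while momentum-free SGD needs none.

```python
import math

def sgd_step(params, grads, lr):
    """Plain SGD without momentum: no optimizer state is kept."""
    return [p - lr * g for p, g in zip(params, grads)]

def new_adamw_state(num_params):
    """AdamW keeps two extra floats per parameter (moments m and v)."""
    return {"t": 0, "m": [0.0] * num_params, "v": [0.0] * num_params}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One textbook AdamW step with decoupled weight decay."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        p = p - lr * weight_decay * p
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

def update_fraction(before, after):
    """Fraction of parameters whose stored value changed."""
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return changed / len(before)
```

On the first step the AdamW update has magnitude close to `lr` regardless of gradient scale, whereas the SGD update scales with the gradient; this is one simple way the two optimizers can touch parameters differently.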

The Laplacian Keyboard: Beyond the Linear Span

拉普拉斯键盘：超越线性跨度

Authors: Siddarth Chandrasekar, Marlos C. Machado
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07730
Pdf link: https://arxiv.org/pdf/2602.07730
Abstract Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), these eigenvectors provide a natural basis for approximating reward functions; however, their use is typically limited to their linear span, which restricts expressivity in complex environments. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond the linear span. LK constructs a task-agnostic library of options from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these options dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK surpasses zero-shot solutions while achieving improved sample efficiency compared to standard RL methods.
中文摘要 在各个科学领域，从信号处理到量子力学，拉普拉斯特征向量都是简化复杂系统的基本基底。在强化学习（RL）中，这些特征向量为近似奖励函数提供了自然的基底；然而，它们的使用通常限于其线性张成空间（linear span），这限制了在复杂环境中的表达能力。我们介绍拉普拉斯键盘（LK），这是一个超越线性张成空间的层级框架。LK从这些特征向量构建了一个任务无关的选项（option）库，形成一个行为基，保证包含线性张成空间内任意奖励的最优策略。元策略学习动态拼接这些选项，从而能够高效学习原始线性约束之外的策略。我们建立了零样本近似误差的理论界限，并通过实验证明LK优于零样本解，同时相较于标准强化学习方法实现了更好的样本效率。
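
As a small illustration of "approximating reward functions in the linear span of Laplacian eigenvectors", the sketch below uses a chain of `n` states, where the graph Laplacian's eigenvectors have a known closed form (cosine modes). This is a toy setting chosen for illustration, not the paper's environments or method; the function names are hypothetical.

```python
import math

def path_laplacian(n):
    """Graph Laplacian L = D - A of an n-state chain (states 0..n-1)."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap

def chain_eigenpair(n, k):
    """Closed-form k-th eigenvalue and eigenvector of the chain Laplacian."""
    theta = math.pi * k / n
    vector = [math.cos(theta * (i + 0.5)) for i in range(n)]
    return 2.0 - 2.0 * math.cos(theta), vector

def project_reward(reward, num_vectors):
    """Project a reward vector onto the span of the first num_vectors eigenvectors."""
    n = len(reward)
    approx = [0.0] * n
    for k in range(num_vectors):
        _, vec = chain_eigenpair(n, k)
        # The eigenvectors are mutually orthogonal, so each coefficient
        # can be computed independently.
        coeff = sum(r * v for r, v in zip(reward, vec)) / sum(v * v for v in vec)
        approx = [a + coeff * v for a, v in zip(approx, vec)]
    return approx
```

With only a few eigenvectors a sparse goal reward is smoothed out; with all `n` it is recovered exactly. The gap between the two is the expressivity limit of the linear span that the abstract refers to.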

Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

偏好条件多目标强化学习：分解、多样性驱动的策略优化

Authors: Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07764
Pdf link: https://arxiv.org/pdf/2602.07764
Abstract Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
中文摘要 多目标强化学习（MORL）旨在学习能够平衡多个且常常相互冲突的目标的策略。尽管单一的偏好条件策略是最灵活且可扩展的解决方案，但现有方法在实践中仍然脆弱，常常无法恢复完整的帕累托前沿。我们表明，这种失败源于当前方法中的两个结构性问题：由过早标量化引起的破坏性梯度干扰，以及偏好空间中的表征坍缩。我们引入了$D^3PO$，这是一个基于PPO的框架，通过重组多目标策略优化来直接解决这些问题。$D^3PO$通过分解的优化流水线保留每个目标的学习信号，仅在稳定后才整合偏好，从而实现可靠的信用分配。此外，一个经过缩放的多样性正则项强制策略行为对偏好变化保持敏感，防止坍缩。在标准MORL基准测试中，包括高维和多目标控制任务，$D^3PO$始终能发现比以往单策略和多策略方法更宽、更高质量的帕累托前沿，在仅使用单一可部署策略的情况下，达到或超过最先进的超体积和期望效用。
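
The evaluation terms used in this abstract (Pareto front, hypervolume, preference-weighted utility) can be sketched for the two-objective case as follows. This is a generic illustration with made-up return vectors, not the $D^3PO$ algorithm or its benchmark code; all function names are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(q_i >= p_i for q_i, p_i in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

def hypervolume_2d(front, reference):
    """Area dominated by a 2-objective front, measured from a reference point.

    Assumes every point in `front` is at least as good as `reference`
    in the first objective.
    """
    area = 0.0
    best_second = reference[1]
    # Sweep from the largest first objective down, adding each new strip.
    for first, second in sorted(front, reverse=True):
        if second > best_second:
            area += (first - reference[0]) * (second - best_second)
            best_second = second
    return area

def scalarize(returns, preference):
    """Linear utility of a return vector under a preference weight vector."""
    return sum(w * r for w, r in zip(preference, returns))

def best_for_preference(points, preference):
    """Point that a preference-conditioned policy should ideally reach."""
    return max(points, key=lambda p: scalarize(p, preference))
```

A single preference-conditioned policy is doing its job when changing the preference vector moves its return along the front, which is the sensitivity the diversity regularizer in the abstract is meant to enforce.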

Generative Reasoning Re-ranker

生成推理重新排序器

Authors: Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Hamed Firooz, Wenlin Chen, Luke Simon
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07774
Pdf link: https://arxiv.org/pdf/2602.07774
Abstract Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
中文摘要 近年来，越来越多的研究将大型语言模型（LLMs）作为推荐系统的新范式加以探索，因为它们具有可扩展性和世界知识。然而，现有工作存在三个关键局限：（1）大多数工作侧重于检索和排序，而对完善最终推荐至关重要的重排序阶段却大多被忽视；（2）LLMs通常用于零样本或监督微调场景，导致其推理能力，尤其是通过强化学习（RL）和高质量推理数据得到提升的推理能力，未被充分利用；（3）物品通常由非语义ID表示，这在拥有数十亿标识符的工业系统中带来了重大的可扩展性挑战。为弥补这些空白，我们提出了生成式推理重排序器（GR2），这是一个端到端框架，配有专为重排序设计的三阶段训练流程。首先，预训练的LLM在由非语义ID编码得到的语义ID上进行中期训练，所用分词器实现了$\ge$99%的唯一性。接下来，一个更强、规模更大的LLM通过精心设计的提示和拒绝采样生成高质量的推理轨迹，这些轨迹用于监督微调，以传授基础推理技能。最后，我们应用解耦裁剪与动态采样策略优化（DAPO），实现了可扩展的强化学习监督，并配备了专门为重排序设计的可验证奖励。在两个真实世界数据集上的实验证明了GR2的有效性：它在Recall@5上比最先进的OneRec-Think高出2.4%，在NDCG@5上高出1.3%。消融实验证实，高质量的推理轨迹在各指标上带来显著提升。我们还发现，强化学习的奖励设计在重排序中至关重要：大型语言模型倾向于通过保留物品原有顺序来进行奖励黑客（reward hacking），这促使我们设计条件化的可验证奖励，以减轻这种行为并优化重排序表现。
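
For readers unfamiliar with the two reported metrics, here is a minimal sketch of Recall@k and NDCG@k under binary relevance. The abstract does not state the paper's exact metric definitions, so the binary-relevance assumption and the function names are illustrative only.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Share of relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary gains: DCG of the ranking over DCG of an ideal one."""
    relevant = set(relevant_items)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@k ignores order within the top-k, while NDCG@k rewards placing relevant items earlier. A reranker that leaves the input order untouched scores exactly what the upstream ranker scored on both metrics, which is why order-preserving outputs are a natural reward-hacking concern.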

CoLF: Learning Consistent Leader-Follower Policies for Vision-Language-Guided Multi-Robot Cooperative Transport

CoLF：学习视觉语言引导多机器人协作运输的一致领导者-跟随者策略

Authors: Joachim Yann Despature, Kazuki Shibata, Takamitsu Matsubara
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07776
Pdf link: https://arxiv.org/pdf/2602.07776
Abstract In this study, we address vision-language-guided multi-robot cooperative transport, where each robot grounds natural-language instructions from onboard camera observations. A key challenge in this decentralized setting is perceptual misalignment across robots, where viewpoint differences and language ambiguity can yield inconsistent interpretations and degrade cooperative transport. To mitigate this problem, we adopt a dependent leader-follower design, where one robot serves as the leader and the other as the follower. Although such a leader-follower structure appears straightforward, learning with independent and symmetric agents often yields symmetric or unstable behaviors without explicit inductive biases. To address this challenge, we propose Consistent Leader-Follower (CoLF), a multi-agent reinforcement learning (MARL) framework for stable leader-follower role differentiation. CoLF consists of two key components: (1) an asymmetric policy design that induces leader-follower role differentiation, and (2) a mutual-information-based training objective that maximizes a variational lower bound, encouraging the follower to predict the leader's action from its local observation. The leader and follower policies are jointly optimized under the centralized training and decentralized execution (CTDE) framework to balance task execution and consistent cooperative behaviors. We validate CoLF in both simulation and real-robot experiments using two quadruped robots. The demonstration video is available at this https URL.
中文摘要 本研究探讨了视觉语言引导的多机器人协作运输，其中每个机器人根据机载摄像头的观测来理解并落实自然语言指令。在这种去中心化环境中，一个关键挑战是机器人间的感知错位，视角差异和语言歧义可能导致解释不一致，并削弱协作运输。为缓解这一问题，我们采用了依赖式领导者-跟随者设计，一个机器人担任领导者，另一个作为跟随者。尽管这种领导者-跟随者结构看似简单，但在缺乏显式归纳偏置的情况下，使用独立且对称的智能体进行学习通常会产生对称或不稳定的行为。为应对这一挑战，我们提出了一致性领导者-跟随者（CoLF），这是一种多智能体强化学习（MARL）框架，用于稳定的领导者-跟随者角色区分。CoLF由两个关键部分组成：（1）非对称策略设计，诱导领导者-跟随者角色区分；（2）基于互信息的训练目标，最大化变分下界，鼓励跟随者根据本地观测预测领导者的动作。领导者和跟随者策略在集中式训练与去中心化执行（CTDE）框架下联合优化，以平衡任务执行与一致的合作行为。我们在使用两台四足机器人的仿真和真实机器人实验中验证了CoLF。演示视频可在此 https 网址观看。

Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing

基于视觉的感知，具备预测安全和饥饿避免约束的不确定性感知反事实交通信号控制

Authors: Jayawant Bodagala, Balaji Bodagala
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.07784
Pdf link: https://arxiv.org/pdf/2602.07784
Abstract Real-world deployment of adaptive traffic signal control, to date, remains limited due to the uncertainty associated with vision-based perception, implicit safety, and non-interpretable control policies learned and validated mainly in simulation. In this paper, we introduce UCATSC, a model-based traffic signal control system that models traffic signal control at an intersection using a stochastic decision process with constraints and under partial observability, taking into account the uncertainty associated with vision-based perception. Unlike reinforcement learning methods that learn to predict safety using reward shaping, UCATSC predicts and enforces hard constraints related to safety and starvation prevention during counterfactual rollouts in belief space. The system is designed to improve traffic delay and emission while preventing safety-critical errors and providing interpretable control policy outputs based on explicit models.
中文摘要 由于基于视觉的感知所带来的不确定性、隐式的安全性，以及主要在仿真中学习和验证的不可解释控制策略，自适应交通信号控制的实际部署至今仍然有限。本文介绍了UCATSC，一种基于模型的交通信号控制系统，它使用带约束且部分可观测的随机决策过程对交叉口的交通信号控制进行建模，同时考虑基于视觉的感知所带来的不确定性。与通过奖励塑造学习预测安全性的强化学习方法不同，UCATSC在信念空间的反事实推演中预测并强制执行与安全和饥饿预防相关的硬约束。该系统旨在改善交通延误和排放，同时防止安全关键错误，并基于显式模型提供可解释的控制策略输出。

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3：协调时间基础与视频理解中的智能思维

Authors: Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07801
Pdf link: https://arxiv.org/pdf/2602.07801
Abstract In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.
中文摘要 在长视频理解中，传统的均匀帧采样常常无法捕捉关键的视觉证据，导致性能下降和幻觉增加。为此，近年来出现了智能体式的“带视频思考”（thinking-with-videos）范式，采用定位-剪辑-回答流程，模型主动识别相关视频片段，在这些片段中进行密集采样，然后生成答案。然而，现有方法依然效率低下，定位能力薄弱，且遵循僵化的工作流程。为解决这些问题，我们提出了VideoTemp-o3，一个统一的智能体式“带视频思考”框架，对视频定位和问答进行联合建模。VideoTemp-o3 具备强大的定位能力，支持按需剪辑，并能修正不准确的定位。具体来说，在监督微调阶段，我们设计了统一的掩码机制，在鼓励探索的同时防止噪声。在强化学习方面，我们引入了专门的奖励，以缓解奖励黑客（reward hacking）行为。此外，从数据角度，我们还开发了一条有效的流程，用于构建高质量的长视频定位问答数据，并建立了相应的基准，用于在不同视频时长上进行系统评估。实验结果表明，我们的方法在长视频理解和定位方面都取得了显著性能。

Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

通过过程可验证思维、数据综合和调度进行时间序列推理，实现定制化的LLM推理

Authors: Jiahui Zhou, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Lin Li, Zhuomin Chen, Jian Lou, See-Kiong Ng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07830
Pdf link: https://arxiv.org/pdf/2602.07830
Abstract Time series is a pervasive data type across various application domains, rendering the reasonable solving of diverse time series tasks a long-standing goal. Recent advances in large language models (LLMs), especially their reasoning abilities unlocked through reinforcement learning (RL), have opened new opportunities for tackling tasks with long Chain-of-Thought (CoT) reasoning. However, leveraging LLM reasoning for time series remains in its infancy, hindered by the absence of carefully curated time series CoT data for training, limited data efficiency caused by underexplored data scheduling, and the lack of RL algorithms tailored for exploiting such time series CoT data. In this paper, we introduce VeriTime, a framework that tailors LLMs for time series reasoning through data synthesis, data scheduling, and RL training. First, we propose a data synthesis pipeline that constructs a TS-text multimodal dataset with process-verifiable annotations. Second, we design a data scheduling mechanism that arranges training samples according to a principled hierarchy of difficulty and task taxonomy. Third, we develop a two-stage reinforcement finetuning featuring fine-grained, multi-objective rewards that leverage verifiable process-level CoT data. Extensive experiments show that VeriTime substantially boosts LLM performance across diverse time series reasoning tasks. Notably, it enables compact 3B, 4B models to achieve reasoning capabilities on par with or exceeding those of larger proprietary LLMs.
中文摘要 时间序列是一种在多个应用领域中普遍存在的数据类型，使得合理解决多样化的时间序列任务成为长期目标。大型语言模型（LLM）的最新进展，尤其是通过强化学习（RL）解锁的推理能力，为处理具有长思维链（CoT）推理任务开辟了新机遇。然而，利用LLM推理时间序列仍处于起步阶段，受限于缺乏精心策划的时间序列CoT训练数据、数据调度不足导致的数据效率有限，以及缺乏专门针对此类时间序列CoT数据的强化学习算法。本文介绍了VeriTime，这是一个通过数据综合、数据调度和强化学习训练，定制LLM用于时间序列推理的框架。首先，我们提出一种数据综合流水线，构建一个具有过程可验证注释的TS文本多模态数据集。其次，我们设计了一种数据调度机制，按照难度和任务分类的原则层级来安排训练样本。第三，我们开发了两阶段的强化微调，采用细粒度、多目标的奖励，利用可验证的流程层面CoT数据。大量实验表明，VeriTime在多种时间序列推理任务中显著提升了LLM的性能。值得注意的是，它使紧凑型3B、4B模型能够实现与更大型专有LLM同等甚至更强的推理能力。

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

rePIRL：利用逆强化学习为LLM推理学习PRM

Authors: Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07832
Pdf link: https://arxiv.org/pdf/2602.07832
Abstract Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.
中文摘要 过程奖励已被广泛应用于深度强化学习，以提升训练效率、减少方差并防止奖励黑客（reward hacking）行为。在LLM推理中，现有研究也探讨了学习有效过程奖励模型（PRM）的各种解决方案，无论是否借助专家策略。然而，现有方法要么依赖对专家策略的强假设（例如要求其奖励函数），要么存在内在限制（例如熵坍缩），导致PRM较弱或泛化性有限。本文介绍了rePIRL，一种受逆强化学习启发的框架，能够在对专家策略假设极少的情况下学习有效的PRM。具体来说，我们设计了一个对偶学习流程，交替更新策略和PRM。我们的学习算法采用了定制化的技术，以应对将传统逆强化学习扩展到大型语言模型的挑战。我们从理论上证明，所提出的学习框架能够统一在线和离线的PRM学习方法，从而说明rePIRL能够以最少的假设学习PRM。在标准化数学和代码推理数据集上的实证评估显示了rePIRL相较于现有方法的有效性。我们还展示了训练得到的PRM在测试时训练、测试时扩展以及为困难问题的训练提供早期信号方面的应用。最后，我们通过详细的消融研究验证了训练配方和关键设计选择。

RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI

RLinf-USER：一个统一且可扩展的系统，用于具身人工智能中真实世界的在线策略学习

Authors: Hongzhi Zang, Shu'ang Yu, Hao Lin, Tianxing Zhou, Zefang Huang, Zhen Guo, Xin Xu, Jiakai Zhou, Yuze Sheng, Shizhe Zhang, Feng Gao, Wenhao Tang, Yufeng Yue, Quanlu Zhang, Xinlei Chen, Chao Yu, Yu Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07837
Pdf link: https://arxiv.org/pdf/2602.07837
Abstract Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheaply reset, or massively replicated, which makes scalable data collection, heterogeneous deployment, and long-horizon effective training difficult. These challenges suggest that real-world policy learning is not only an algorithmic issue but fundamentally a systems problem. We present USER, a Unified and extensible SystEm for Real-world online policy learning. USER treats physical robots as first-class hardware resources alongside GPUs through a unified hardware abstraction layer, enabling automatic discovery, management, and scheduling of heterogeneous robots. To address cloud-edge communication, USER introduces an adaptive communication plane with tunneling-based networking, distributed data channels for traffic localization, and streaming-multiprocessor-aware weight synchronization to regulate GPU-side overhead. On top of this infrastructure, USER organizes learning as a fully asynchronous framework with a persistent, cache-aware buffer, enabling efficient long-horizon experiments with robust crash recovery and reuse of historical data. In addition, USER provides extensible abstractions for rewards, algorithms, and policies, supporting online imitation or reinforcement learning of CNN/MLP, generative policies, and large vision-language-action (VLA) models within a unified pipeline. Results in both simulation and the real world show that USER enables multi-robot coordination, heterogeneous manipulators, edge-cloud collaboration with large models, and long-running asynchronous training, offering a unified and extensible systems foundation for real-world online policy learning.
中文摘要 直接在物理世界中进行在线策略学习，是具身智能的一个充满希望但充满挑战的方向。与仿真不同，现实世界系统无法被任意加速、廉价重置或大规模复制，这使得可扩展的数据收集、异构部署和长时程有效训练变得困难。这些挑战表明，现实世界的策略学习不仅是算法问题，从根本上说更是系统问题。我们介绍USER，一种统一且可扩展的系统，用于真实世界的在线策略学习。USER通过统一的硬件抽象层将物理机器人视为与GPU并列的一等硬件资源，实现对异构机器人的自动发现、管理和调度。为解决云边通信问题，USER引入了自适应通信平面，采用基于隧道的网络、用于流量本地化的分布式数据通道，以及流式多处理器（streaming multiprocessor）感知的权重同步以调节GPU端开销。在该基础设施之上，USER将学习组织为一个完全异步的框架，配备持久且具缓存感知的缓冲区，支持高效的长时程实验，并具备稳健的崩溃恢复和历史数据重用能力。此外，USER还提供可扩展的奖励、算法和策略抽象，支持在统一流水线内对CNN/MLP、生成式策略和大型视觉-语言-动作（VLA）模型进行在线模仿学习或强化学习。仿真和现实世界中的结果都表明，USER支持多机器人协调、异构机械臂、与大型模型的边云协作以及长时间运行的异步训练，为现实世界的在线策略学习提供统一且可扩展的系统基础。

TodoEvolve: Learning to Architect Agent Planning Systems

TodoEvolve：学习构建代理规划系统

Authors: Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07839
Pdf link: https://arxiv.org/pdf/2602.07839
Abstract Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.
中文摘要 规划已成为当代代理系统应对复杂、长期任务的核心能力，但现有方法主要依赖固定的手工设计规划结构，缺乏适应开放式问题结构多样性的灵活性。为解决这一限制，我们引入了TodoEvolve，一种元规划范式，能够自主综合并动态修正针对特定任务的规划架构。具体来说，我们首先构建了PlanFactory，这是一个模块化设计空间，能够在涵盖拓扑、初始化、适配和导航的统一代码库中标准化多样化的规划范式，从而为异构规划模式提供通用接口。借助PlanFactory，我们收集高质量的规划轨迹，并通过 Impedance-Guided Preference Optimization（IGPO）训练Todo-14B，这是一种多目标强化学习目标，鼓励在任意任务和代理骨干中生成性能高、稳定且令牌高效的规划系统。对五个代理基准的实证评估表明，TodoEvolve 在保持经济的 API 成本和运行开销的同时，持续超越精心设计的规划模块。

MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

MARTI-MARS$^2$：通过强化学习扩展多智能体自我搜索以实现代码生成

Authors: Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, Biqing Qi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07848
Pdf link: https://arxiv.org/pdf/2602.07848
Abstract While the complex reasoning capability of Large Language Models (LLMs) has attracted significant attention, single-agent systems often encounter inherent performance ceilings in complex tasks such as code generation. Multi-agent collaboration offers a promising avenue to transcend these boundaries. However, existing frameworks typically rely on prompt-based test-time interactions or multi-role configurations trained with homogeneous parameters, limiting error correction capabilities and strategic diversity. In this paper, we propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2), which integrates policy learning with multi-agent tree search by formulating the multi-agent collaborative exploration process as a dynamic and learnable environment. By allowing agents to iteratively explore and refine within the environment, the framework facilitates evolution from parameter-sharing homogeneous multi-role training to heterogeneous multi-agent training, breaking through single-agent capability limits. We also introduce an efficient inference strategy MARTI-MARS2-T+ to fully exploit the scaling potential of multi-agent collaboration at test time. We conduct extensive experiments across varied model scales (8B, 14B, and 32B) on challenging code generation benchmarks. Utilizing two collaborating 32B models, MARTI-MARS2 achieves 77.7%, outperforming strong baselines like GPT-5.1. Furthermore, MARTI-MARS2 reveals a novel scaling law: shifting from single-agent to homogeneous multi-role and ultimately to heterogeneous multi-agent paradigms progressively yields higher RL performance ceilings, robust TTS capabilities, and greater policy diversity, suggesting that policy diversity is critical for scaling intelligence via multi-agent reinforcement learning.
中文摘要 尽管大型语言模型（LLMs）的复杂推理能力引起了广泛关注，但单代理系统在代码生成等复杂任务中常常面临固有的性能上限。多智能体协作为突破这些界限提供了有前景的途径。然而，现有框架通常依赖基于提示的测试时间交互或多角色配置，训练参数均为同质，限制了纠错能力和战略多样性。本文提出一个多智能体强化训练与推理框架（MARTI-MARS2），通过将多智能体协作探索过程构建为动态且可学习的环境，将策略学习与多智能体树搜索整合。通过允许代理在环境中迭代探索和优化，该框架促进了从参数共享的同质多角色训练向异构多代理训练的演进，突破了单代理能力的限制。我们还引入了高效的推理策略MARTI-MARS2-T+，充分利用测试阶段多智能体协作的扩展潜力。我们在不同模型尺度（8B、14B和32B）上进行了大量实验，针对具有挑战性的代码生成基准测试。利用两个协作的32B模型，MARTI-MARS2的表现达到77.7%，优于GPT-5.1等强基线。此外，MARTI-MARS2揭示了一个新的缩放规律：从单智能体向同质多角色，最终转向异构多智能体范式，逐步带来更高的强化学习性能上限、稳健的TTS能力和更大的策略多样性，表明策略多样性对于通过多智能体强化学习实现智能扩展至关重要。

Direct Soft-Policy Sampling via Langevin Dynamics

通过朗之文动力学进行直接软政策采样

Authors: Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, Byung-Jun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07873
Pdf link: https://arxiv.org/pdf/2602.07873
Abstract Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
中文摘要 强化学习中的软策略将策略定义为状态-行动价值函数上的玻尔兹曼分布，提供了一种原则性机制来平衡探索与利用。然而，在实际中实现这种软政策仍然充满挑战。现有方法要么依赖表达力有限的参数化策略，要么采用基于扩散的策略，其难以解决的可能性阻碍了软政策目标中可靠的熵估计。我们通过直接实现由Q函数作用梯度驱动的朗之文动态的软策略采样来应对这一挑战。这一观点引出了朗之万Q学习（LQL），它从目标玻尔兹曼分布中采样动作，但未明确参数化策略。然而，直接应用朗之宫动力学在高维和非凸Q-景观中混合速度较慢，限制了其实际效果。为克服这一问题，我们提出了噪声条件朗之文Q-学习（NC-LQL），将多尺度噪声扰动整合到价值函数中。NC-LQL学习一个噪声条件Q函数，诱导一系列逐步平滑的值景观，使采样能够从全局探索过渡到精确模式细化。在OpenAI Gym MuJoCo基准测试中，NC-LQL在与最先进的扩散方法相比，实现了竞争力，为在线强化学习提供了简单却强大的解决方案。
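
The Langevin sampling idea described in this abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' LQL or NC-LQL implementation: it assumes a hand-written quadratic Q-function with an analytic action gradient, and the names `q_grad`, `langevin_sample`, the temperature `alpha`, and the step size `eps` are all hypothetical. It runs unadjusted Langevin dynamics, a <- a + (eps / 2) * grad_a Q(s, a) / alpha + sqrt(eps) * noise, whose stationary distribution approximates the Boltzmann policy proportional to exp(Q / alpha).

```python
import numpy as np

def q_grad(actions, target):
    """Action gradient of the toy Q(s, a) = -0.5 * ||a - target||^2 (an assumed Q-function)."""
    return -(actions - target)

def langevin_sample(target, alpha=0.1, eps=0.01, n_steps=2000, n_chains=1000, seed=0):
    """Unadjusted Langevin dynamics targeting pi(a) proportional to exp(Q(a) / alpha)."""
    rng = np.random.default_rng(seed)
    # Start every chain from a standard normal draw in action space.
    actions = rng.normal(size=(n_chains, target.shape[0]))
    for _ in range(n_steps):
        noise = rng.normal(size=actions.shape)
        drift = 0.5 * eps * q_grad(actions, target) / alpha
        actions = actions + drift + np.sqrt(eps) * noise
    return actions

target = np.array([0.5, -0.3])
samples = langevin_sample(target)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

For this quadratic Q, the target Boltzmann distribution is a Gaussian centered at `target` with per-dimension variance `alpha`, so the printed statistics should land near those values, up to the discretization bias of the finite step size. No policy network is involved; only the gradient of Q with respect to the action is needed.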

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Intrinsic Adaptation

ToolSelf：通过工具驱动的内在适应统一任务执行与自我重组

Authors: Jingqi Zhou, Sheng Wang, DeZhao Deng, Junwen Lu, Junwei Su, Qintong Li, Jiahui Gao, Hao Wu, Jiyue Jiang, Lingpeng Kong, Chuan Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07883
Pdf link: https://arxiv.org/pdf/2602.07883
Abstract Agentic systems powered by Large Language Models (LLMs) have demonstrated remarkable potential in tackling complex, long-horizon tasks. However, their efficacy is fundamentally constrained by static configurations governing agent behaviors, which are fixed prior to execution and fail to adapt to evolving task dynamics. Existing approaches, relying on manual orchestration or heuristic-based patches, often struggle with poor generalization and fragmented optimization. To transcend these limitations, we propose ToolSelf, a novel paradigm enabling tool-driven runtime self-reconfiguration. By abstracting configuration updates as a callable tool, ToolSelf unifies task execution and self-adjustment into a single action space, achieving a phase transition from external rules to intrinsic parameters. Agents can thereby autonomously update their sub-goals and context based on task progression, and correspondingly adapt their strategy and toolbox, transforming from passive executors into dual managers of both task and self. We further devise Configuration-Aware Two-stage Training (CAT), combining rejection sampling fine-tuning with trajectory-level reinforcement learning to internalize this meta-capability. Extensive experiments across diverse benchmarks demonstrate that ToolSelf rivals specialized workflows while generalizing to novel tasks, achieving a 24.1% average performance gain and illuminating a path toward truly self-adaptive agents.
中文摘要 由大型语言模型（LLMs）驱动的代理系统在处理复杂且长远的任务方面展现出了非凡的潜力。然而，其效能从根本上受限于控制代理行为的静态配置，这些配置在执行前固定，无法适应不断演变的任务动态。现有方法依赖手动编排或基于启发式的补丁，常常面临泛化不佳和优化碎片化的问题。为了超越这些限制，我们提出了ToolSelf，一种实现工具驱动运行时自我重构的新型范式。通过将配置更新抽象为可调用工具，ToolSelf 将任务执行和自我调整统一到单一动作空间，实现了从外部规则到内在参数的阶段转换。代理因此可以根据任务进展自主更新子目标和上下文，相应地调整策略和工具箱，从被动执行者转变为任务和自我的双重管理者。我们进一步设计了配置感知两阶段训练（CAT），结合拒绝采样微调与轨迹级强化学习，以内化这一元能力。跨越多种基准测试的广泛实验表明，ToolSelf在推广新任务的同时，能够媲美专业工作流程，实现24.1%的平均性能提升，并为真正自我适应的代理指明了道路。

Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning

通过VQVAE和离线强化学习中的模糊聚类实现高效的反探索

Authors: Long Chen, Yinkui Liu, Shen Li, Bo Tang, Xuemin Hu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.07889
Pdf link: https://arxiv.org/pdf/2602.07889
Abstract Pseudo-count is an effective anti-exploration method in offline reinforcement learning (RL) by counting state-action pairs and imposing a large penalty on rare or unseen state-action pair data. Existing anti-exploration methods count continuous state-action pairs by discretizing these data, but often suffer from the issues of dimension disaster and information loss in the discretization process, leading to efficiency and performance reduction, and even failure of policy learning. In this paper, a novel anti-exploration method based on Vector Quantized Variational Autoencoder (VQVAE) and fuzzy clustering in offline RL is proposed. We first propose an efficient pseudo-count method based on the multi-codebook VQVAE to discretize state-action pairs, and design an offline RL anti-exploitation method based on the proposed pseudo-count method to handle the dimension disaster issue and improve the learning efficiency. In addition, a codebook update mechanism based on fuzzy C-means (FCM) clustering is developed to improve the use rate of vectors in codebooks, addressing the information loss issue in the discretization process. The proposed method is evaluated on the benchmark of Datasets for Deep Data-Driven Reinforcement Learning (D4RL), and experimental results show that the proposed method performs better and requires less computing cost in multiple complex tasks compared to state-of-the-art (SOTA) methods.
中文摘要 伪计数是离线强化学习（RL）中一种有效的反探索方法，通过计数状态-动作对，并对罕见或未见的状态-动作对数据施加较大惩罚。现有的反探索方法通过离散数据计数连续状态-动作对，但离散化过程中常常存在维度灾难和信息丢失等问题，导致效率和性能下降，甚至策略学习失败。本文提出了一种基于向量量化变分自编码器（VQVAE）和离线强化学习模糊聚类的新型反探索方法。我们首先提出了基于多码本VQVAE的高效伪计数方法，用于离散化状态-动作对，并基于该方法设计离线强化学习反利用方法，以处理维度灾难问题并提升学习效率。此外，开发了基于模糊C均值（FCM）聚类的码本更新机制，以提高码本中向量的使用率，解决离散化过程中的信息丢失问题。该方法在深度数据驱动强化学习（D4RL）数据集基准测试中进行了评估，实验结果表明，在多复杂任务中，该方法的性能更好，计算成本更低，相较于最先进（SOTA）方法。
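
The pseudo-count mechanism in this abstract (discretize state-action pairs, count them, penalize rare ones) can be sketched as follows. This is only an illustration under stated assumptions: the codebooks are fixed random matrices standing in for a trained multi-codebook VQVAE, each codebook quantizes one chunk of the state-action vector, and the penalty form `beta / sqrt(n + 1)` is a common pseudo-count choice rather than the paper's confirmed formula. The names `quantize` and `pseudo_count_penalty` are hypothetical.

```python
from collections import Counter

import numpy as np

def quantize(x, codebooks):
    """Map a vector to a tuple of nearest-code indices, one index per codebook."""
    chunks = np.split(x, len(codebooks))  # assumed: one equal-sized chunk per codebook
    return tuple(
        int(np.argmin(np.linalg.norm(codebook - chunk, axis=1)))
        for codebook, chunk in zip(codebooks, chunks)
    )

def pseudo_count_penalty(x, counts, codebooks, beta=1.0):
    """Penalty that shrinks as the discretized state-action pair is seen more often."""
    n = counts[quantize(x, codebooks)]  # Counter returns 0 for unseen keys
    return beta / np.sqrt(n + 1.0)

rng = np.random.default_rng(0)
# Two codebooks with 8 codes each; every code covers a 2-dimensional chunk.
codebooks = [rng.normal(size=(8, 2)) for _ in range(2)]
dataset = rng.normal(size=(500, 4))  # toy 4-dimensional state-action vectors
counts = Counter(quantize(x, codebooks) for x in dataset)

seen = dataset[0]
print("penalty with dataset counts:", pseudo_count_penalty(seen, counts, codebooks))
print("penalty with no counts:", pseudo_count_penalty(seen, Counter(), codebooks))
```

Using several small codebooks keeps the number of distinct discrete keys at most 8 x 8 = 64 here, while a single joint grid over four dimensions would grow exponentially with dimension, which is the dimension-disaster issue the abstract refers to.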

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

AceGRPO：自主机器学习工程的自适应课程增强组相对策略优化

Authors: Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Di Jin, Siheng Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.07906
Pdf link: https://arxiv.org/pdf/2602.07906
Abstract Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at this https URL.
中文摘要 自主机器学习工程（MLE）要求智能体在长期内持续迭代优化。虽然近期基于LLM的代理展现出潜力，但当前基于提示的MLE代理因参数冻结而存在行为停滞。尽管强化学习（RL）提供了解决方案，但由于执行延迟高且数据选择效率低，将其应用于机器学习（MLE）却受阻。鉴于这些挑战，我们提出了包含两个核心组成部分的AceGRPO：（1）演化数据缓冲区，持续将执行痕迹重新利用为可重用的训练任务;（2）由可学习潜能函数引导的自适应采样，该功能动态优先级化主体学习前沿的任务，以最大化学习效率。利用AceGRPO，我们训练好的Ace-30B模型在MLE-Bench-Lite上实现100%有效提交率，性能接近专有前沿模型，并优于更大型的开源基线（如DeepSeek-V3.2），展现出持续迭代优化的稳健能力。代码可在此 https URL 访问。

Feasibility-Guided Planning over Multi-Specialized Locomotion Policies

可行性导向规划，优先于多专业化的交通政策

Authors: Ying-Sheng Luo, Lu-Ching Wang, Hanjaya Mandala, Yu-Lun Chou, Guilherme Christmann, Yu-Chung Chen, Yung-Shun Chan, Chun-Yi Lee, Wei-Chao Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07932
Pdf link: https://arxiv.org/pdf/2602.07932
Abstract Planning over unstructured terrain presents a significant challenge in the field of legged robotics. Although recent works in reinforcement learning have yielded various locomotion strategies, planning over multiple experts remains a complex issue. Existing approaches encounter several constraints: traditional planners are unable to integrate skill-specific policies, whereas hierarchical learning frameworks often lose interpretability and require retraining whenever new policies are added. In this paper, we propose a feasibility-guided planning framework that successfully incorporates multiple terrain-specific policies. Each policy is paired with a Feasibility-Net, which learned to predict feasibility tensors based on the local elevation maps and task vectors. This integration allows classical planning algorithms to derive optimal paths. Through both simulated and real-world experiments, we demonstrate that our method efficiently generates reliable plans across diverse and challenging terrains, while consistently aligning with the capabilities of the underlying policies.
中文摘要 在无结构地形上规划是腿式机器人领域面临的重大挑战。尽管强化学习的最新研究产生了多种运动策略，但多位专家的规划仍是一个复杂的问题。现有方法面临若干限制：传统规划者无法整合技能专属政策，而层级学习框架常常失去可解释性，且每当新增政策时需要重新培训。本文提出了一个可行性导向的规划框架，成功整合了多种地形特定政策。每个政策都与可行性网络配对，该网络学习基于局部高程图和任务向量预测可行性张量。这种积分使经典规划算法能够推导出最优路径。通过模拟和现实实验，我们证明了我们的方法能够高效生成在多样且具有挑战性的地形中可靠的计划，同时始终与底层政策的能力保持一致。

Trajectory-Aware Multi-RIS Activation and Configuration: A Riemannian Diffusion Method

轨迹感知多RIS激活与构型：黎曼扩散方法

Authors: Kaining Wang, Bo Yang, Yusheng Lei, Zhibo Li, Zhiwen Yu, Xuelin Cao, Bin Guo, George C. Alexandropoulos, Dusit Niyato, Mérouane Debbah, Zhu Han
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.07937
Pdf link: https://arxiv.org/pdf/2602.07937
Abstract Reconfigurable intelligent surfaces (RISs) offer a low-cost, energy-efficient means for enhancing wireless coverage. Yet, their inherently programmable reflections may unintentionally amplify interference, particularly in large-scale, multi-RIS-enabled mobile communication scenarios where dense user mobility and frequent line-of-sight overlaps can severely degrade the signal-to-interference-plus-noise ratio (SINR). To address this challenge, this paper presents a novel generative multi-RIS control framework that jointly optimizes the ON/OFF activation patterns of multiple RISs in the smart wireless environment and the phase configurations of the activated RISs based on predictions of multi-user trajectories and interference patterns. We specially design a long short-term memory (LSTM) artificial neural network, enriched with speed and heading features, to forecast multi-user trajectories, thereby enabling reconstruction of future channel state information. To overcome the highly nonconvex nature of the multi-RIS control problem, we develop a Riemannian diffusion model on the torus to generate geometry-consistent phase-configuration, where the reverse diffusion process is dynamically guided by reinforcement learning. We then rigorously derive the optimal ON/OFF states of the metasurfaces by comparing predicted achievable rates under RIS activation and deactivation conditions. Extensive simulations demonstrate that the proposed framework achieves up to 30% SINR improvement over learning-based control and up to 44% gain compared with the RIS always-on scheme, while consistently outperforming state-of-the-art baselines across different transmit powers, RIS configurations, and interference densities.
中文摘要 可重构智能表面（RIS）提供了低成本、节能的方式来增强无线覆盖。然而，它们本质上可编程的反射可能会无意中放大干扰，尤其是在大规模、支持多RIS的移动通信场景中，高用户移动性和频繁的视距重叠会严重降低信干扰加噪声比（SINR）。为应对这一挑战，本文提出了一种新型生成式多RIS控制框架，结合智能无线环境中多个RIS的开/关激活模式，以及基于多用户轨迹和干扰模式预测激活RIS的相位配置。我们专门设计了一种长短期记忆（LSTM）人工神经网络，富含速度和航向特征，用于预测多用户轨迹，从而实现未来信道状态信息的重建。为克服多RIS控制问题的高度非凸性，我们在环面上构建黎曼扩散模型，生成几何一致的相位构型，其中反向扩散过程由强化学习动态引导。随后，我们通过比较RIS激活和失活条件下预测的可实现速率，严格推导出超曲面的最佳开/关状态。大量模拟表明，所提框架在基于学习的控制基础上实现了高达30%的SINR提升，与RIS始终在线方案相比，增益高达44%，同时在不同发射功率、RIS配置和干扰密度方面持续优于最先进的基线。

DHEA-MECD: An Embodied Intelligence-Powered DRL Algorithm for AUV Tracking in Underwater Environments with High-Dimensional Features

DHEA-MECD：一种具身智能驱动的DRL算法，用于高维特征水下环境的AUV跟踪

Authors: Kai Tian, Chuan Lin, Guangjie Han, Chen An, Qian Zhu, Shengzhao Zhu, Zhenyu Wang
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.07947
Pdf link: https://arxiv.org/pdf/2602.07947
Abstract In recent years, autonomous underwater vehicle (AUV) systems have demonstrated significant potential in complex marine exploration. However, effective AUV-based tracking remains challenging in realistic underwater environments characterized by high-dimensional features, including coupled kinematic states, spatial constraints, time-varying environmental disturbances, etc. To address these challenges, this paper proposes a hierarchical embodied-intelligence (EI) architecture for underwater multi-target tracking with AUVs in complex underwater environments. Built upon this architecture, we introduce the Double-Head Encoder-Attention-based Multi-Expert Collaborative Decision (DHEA-MECD), a novel Deep Reinforcement Learning (DRL) algorithm designed to support efficient and robust multi-target tracking. Specifically, in DHEA-MECD, a Double-Head Encoder-Attention-based information extraction framework is designed to semantically decompose raw sensory observations and explicitly model complex dependencies among heterogeneous features, including spatial configurations, kinematic states, structural constraints, and stochastic perturbations. On this basis, a motion-stage-aware multi-expert collaborative decision mechanism with Top-k expert selection strategy is introduced to support stage-adaptive decision-making. Furthermore, we propose the DHEA-MECD-based underwater multitarget tracking algorithm to enable AUV smart, stable, and anti-interference multi-target tracking. Extensive experimental results demonstrate that the proposed approach achieves superior tracking success rates, faster convergence, and improved motion optimality compared with mainstream DRL-based methods, particularly in complex and disturbance-rich marine environments.
中文摘要 近年来，自主水下载具（AUV）系统在复杂海洋探索中展现出显著潜力。然而，在具有高维特征（如耦合运动学状态、空间限制、随时间变化的环境干扰等）的真实水下环境中，有效的AUV跟踪仍然具有挑战性。为应对这些挑战，本文提出了一种分层的具象智能（EI）架构，用于在复杂水下环境中利用AUV进行水下多目标跟踪。基于该架构，我们引入了基于双头编码器-注意力的多专家协作决策（DHEA-MECD），这是一种新型深度强化学习（DRL）算法，旨在支持高效且稳健的多目标跟踪。具体来说，在DHEA-MECD中，设计了一个基于双头编码器-注意力的信息提取框架，旨在语义上分解原始感官观察，并明确建模异构特征之间的复杂依赖关系，包括空间配置、运动学状态、结构约束和随机扰动。基于此，引入了一种具备运动感知阶段的多专家协作决策机制，配合Top-k专家选择策略，以支持阶段自适应决策。此外，我们提出了基于DHEA-MECD的水下多目标跟踪算法，以实现AUV智能、稳定且抗干扰的多目标跟踪。大量实验结果表明，该方法相比主流基于日程学习（DRL）的方法，尤其是在复杂且干扰丰富的海洋环境中，实现了更优的跟踪成功率、更快的收敛速度和更优的运动优化性。

D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

D-ORCA：以对话为中心的强健视听字幕优化

Authors: Changli Tang, Tianyi Wang, Fengyun Rao, Jing Lyu, Chao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.07960
Pdf link: https://arxiv.org/pdf/2602.07960
Abstract Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at this https URL. Our code, data, and checkpoints will be available at this https URL.
中文摘要 口语对话是视频中的主要信息来源;因此，准确识别谁说了什么以及何时说话，对于深入理解视频至关重要。我们介绍D-ORCA，一个以对话为中心的全模态大型语言模型，针对稳健的视听字幕生成进行了优化。我们还进一步策划了DVD，这是一个大规模、高质量的双语数据集，包含近4万个多方对话视频用于培训，以及2000个用于评估的英语和普通话视频，弥补了开源生态系统中的关键空白。为确保细粒度字幕准确性，我们采用了组相对策略优化，采用三种新颖的奖励函数，分别评估说话者归因准确性、全局语音内容准确性和句子层次的时间边界对齐。这些奖励来源于语音处理中广泛使用的评估指标，据我们所知，这些奖励首次被用作视听字幕的强化学习目标。大量实验表明，D-ORCA在说话者识别、语音识别和时间基础分析方面远超现有开源模型。值得注意的是，尽管仅有80亿参数，D-ORCA在多个通用视听理解基准测试中仍能与Qwen3-Omni竞争。演示可在此 https URL 获取。我们的代码、数据和检查点将在此 https URL 提供。
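
The group-relative advantage computation that group relative policy optimization relies on can be sketched with three reward components, as this abstract describes. This is a minimal sketch under assumptions: the reward values are made up for illustration, the equal weighting of the three rewards is hypothetical (the abstract does not say how they are combined), and `group_relative_advantages` is an illustrative name, not the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same input."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rows: four sampled captions for one video (made-up numbers).
# Columns: speaker attribution, speech content, temporal boundary rewards.
reward_table = np.array([
    [0.9, 0.8, 0.7],
    [0.4, 0.6, 0.5],
    [0.7, 0.7, 0.9],
    [0.2, 0.3, 0.4],
])
weights = np.array([1.0, 1.0, 1.0]) / 3.0  # assumed equal weighting

total_reward = reward_table @ weights
advantages = group_relative_advantages(total_reward)
print("total rewards:", total_reward)
print("advantages:", advantages)
```

Because each advantage is measured against the mean of its own group, no separate value model is needed; captions that score above the group average get positive advantages and the rest get negative ones.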

When Is Compositional Reasoning Learnable from Verifiable Rewards?

什么时候可以从可验证的奖励中学习组合推理？

Authors: Daniel Barzilai, Yotam Wolf, Ronen Basri
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.07992
Pdf link: https://arxiv.org/pdf/2602.07992
Abstract The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines if such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.
中文摘要 通过可验证奖励强化学习（RLVR）在大型语言模型中出现的组合推理，是近年来实证成功的关键推动力。尽管取得了这些进展，目前尚不清楚哪些组合问题在该环境中仅靠结果层级反馈可学习。在本研究中，我们理论上研究了在RLVR训练下自回归模型中组合问题的可学习性。我们确定一个称为任务优势比的量，这是组合问题与基础模型的联合性质，用以描述哪些任务和组合是可通过结果级反馈学习的。积极方面，利用这一表征表明，正确中间步骤能带来明显优势的组合问题，在RLVR下是可以高效学习的。我们还分析了这种优势在不同问题中如何自然产生。负面方面，当结构优势不存在时，RLVR可能收敛到次优组合。我们证明，在某些情况下，基模型的质量决定了这种优势是否存在，以及RLVR是否会收敛到次优解。我们希望我们的分析能够提供一个有原则的理论理解，说明RLVR何时成功、何时失败。

Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

带一般参数化的单链平均奖励约束MDP的遗憾分析

Authors: Anirudh Satheesh, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08000
Pdf link: https://arxiv.org/pdf/2602.08000
Abstract We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal--dual natural actor--critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
中文摘要 我们研究了在单链假设和一般策略参数化下，无限视界平均奖励约束的马尔可夫决策过程（CMDP）。现有的约束强化学习遗憾分析主要依赖于遍历性或强混合时间假设，而这些假设在瞬态状态下难以成立。我们提出了一种原始--对偶自然-批判算法，利用多级蒙特卡洛（MLMC）估计器和显式烧录机制，在不需混合时间预言机的情况下处理单链动力学。我们的分析建立了有限时间遗憾和累计约束违背界限，其可扩展范围为$\tilde{O}(\sqrt{T})$，最高可达策略和批判参数化产生的近似误差，从而将序最优保证扩展到更广泛的CMDP类别。

Horizon Imagination: Efficient On-Policy Training in Diffusion World Models

地平线想象力：扩散世界模型中的高效政策训练

Authors: Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, Shie Mannor
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08032
Pdf link: https://arxiv.org/pdf/2602.08032
Abstract We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at this https URL.
中文摘要 我们研究基于扩散的世界模型用于强化学习，这些模型具有高生成保真度，但在控制方面面临关键的效率挑战。当前方法要么需要推理时使用重量级模型，要么依赖高度序列想象力，这两者都带来了巨大的计算成本。我们提出了地平线想象（Horizon Imagination，简称HI），这是一种基于策略的离散随机策略想象过程，能够并行消除多个未来观测的噪声。HI包含稳定机制和新颖采样计划，将去噪预算与有效视野分离，同时支持子帧预算。在Atari 100K和Craftium上的实验表明，我们的方法能以减半的子帧预算保持控制性能，并在不同时间表下实现卓越的生成质量。代码可在此 https URL 访问。

FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability-Plasticity Tradeoff

FIRE：弗罗贝尼乌斯-等距调整重新初始化以平衡稳定性与塑性权衡

Authors: Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, Kyung-Joong Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08040
Pdf link: https://arxiv.org/pdf/2602.08040
Abstract Deep neural networks trained on nonstationary data must balance stability (i.e., retaining prior knowledge) and plasticity (i.e., adapting to new tasks). Standard reinitialization methods, which reinitialize weights toward their original values, are widely used but difficult to tune: conservative reinitializations fail to restore plasticity, while aggressive ones erase useful knowledge. We propose FIRE, a principled reinitialization method that explicitly balances the stability-plasticity tradeoff. FIRE quantifies stability through Squared Frobenius Error (SFE), measuring proximity to past weights, and plasticity through Deviation from Isometry (DfI), reflecting weight isotropy. The reinitialization point is obtained by solving a constrained optimization problem, minimizing SFE subject to DfI being zero, which is efficiently approximated by Newton-Schulz iteration. FIRE is evaluated on continual visual learning (CIFAR-10 with ResNet-18), language modeling (OpenWebText with GPT-0.1B), and reinforcement learning (HumanoidBench with SAC and Atari games with DQN). Across all domains, FIRE consistently outperforms both naive training without intervention and standard reinitialization methods, demonstrating effective balancing of the stability-plasticity tradeoff.
中文摘要 基于非常态数据训练的深度神经网络必须在稳定性（即保留先有知识）和可塑性（即适应新任务）之间取得平衡。标准的重新初始化方法，即将权重初始化至原始值，但难以调整：保守的重新初始化无法恢复可塑性，而激进的重新初始化则抹去有用知识。我们提出了FIRE，一种原则性的重初始化方法，明确平衡了稳定性与可塑性的权衡。FIRE通过平方弗罗贝尼乌斯误差（SFE）量化稳定性，测量与过去权重的接近程度，以及通过偏离等距（DfI）来量化可塑性，反映重量各向同性。重初始化点通过求解受限优化问题获得，在DfI为零的情况下最小化SFE，该过程通过牛顿-舒尔茨迭代高效近似。FIRE的评估包括持续视觉学习（CIFAR-10搭配ResNet-18）、语言建模（OpenWebText搭配GPT-0.1B）和强化学习（HumanoidBench搭配SAC和Atari游戏搭配DQN）。在所有领域，FIRE始终优于无干预的幼稚训练和标准重创方法，展示了稳定性与可塑性权衡的有效平衡。
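
The constrained problem in this abstract (stay close to the old weights in Frobenius norm while being an exact isometry) can be illustrated with a small sketch. It rests on an assumption: that zero Deviation from Isometry means W^T W = I, in which case the minimizer is the orthogonal polar factor of the old weight matrix, which the cubic Newton-Schulz iteration approximates. The function name `newton_schulz_polar`, the spectral-norm scaling, and the iteration count are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def newton_schulz_polar(w, n_iters=30):
    """Approximate the orthogonal polar factor of w with Newton-Schulz iterations."""
    # Scaling by the spectral norm puts all singular values in (0, 1],
    # inside the convergence region of the cubic iteration.
    x = w / np.linalg.norm(w, 2)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
w_old = rng.normal(size=(6, 4))  # stand-in for a trained weight matrix
w_reinit = newton_schulz_polar(w_old)

# Reference solution: the polar factor U V^T from a thin SVD.
u, _, vt = np.linalg.svd(w_old, full_matrices=False)
w_exact = u @ vt

print("max |W^T W - I|:", np.abs(w_reinit.T @ w_reinit - np.eye(4)).max())
print("max |W - U V^T|:", np.abs(w_reinit - w_exact).max())
```

The iteration uses only matrix multiplications, so it avoids an explicit SVD, which is one plausible reason it is attractive for reinitializing many layers on an accelerator.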

Graph-Enhanced Deep Reinforcement Learning for Multi-Objective Unrelated Parallel Machine Scheduling

多目标无关并行机调度的图增强深度强化学习

Authors: Bulent Soykan, Sean Mondesire, Ghaith Rabadi, Grace Bochenek
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2602.08052
Pdf link: https://arxiv.org/pdf/2602.08052
Abstract The Unrelated Parallel Machine Scheduling Problem (UPMSP) with release dates, setups, and eligibility constraints presents a significant multi-objective challenge. Traditional methods struggle to balance minimizing Total Weighted Tardiness (TWT) and Total Setup Time (TST). This paper proposes a Deep Reinforcement Learning framework using Proximal Policy Optimization (PPO) and a Graph Neural Network (GNN). The GNN effectively represents the complex state of jobs, machines, and setups, allowing the PPO agent to learn a direct scheduling policy. Guided by a multi-objective reward function, the agent simultaneously minimizes TWT and TST. Experimental results on benchmark instances demonstrate that our PPO-GNN agent significantly outperforms a standard dispatching rule and a metaheuristic, achieving a superior trade-off between both objectives. This provides a robust and scalable solution for complex manufacturing scheduling.
中文摘要 无关并行机调度问题（UPMSP）及其发布日期、设置和资格限制，构成了一个重大的多目标挑战。传统方法难以平衡最小化总加权迟到（TWT）和总布置时间（TST）。本文提出了一个利用近端策略优化（PPO）和图神经网络（GNN）的深度强化学习框架。GNN有效地表示了作业、机器和设置的复杂状态，使PPO代理能够学习直接的调度策略。在多目标奖励函数的指导下，代理同时最小化TWT和TST。基准实例的实验结果表明，我们的PPO-GNN代理显著优于标准调度规则和元启发式，实现了两者之间的优越权衡。这为复杂的制造排程提供了稳健且可扩展的解决方案。
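
The two objectives named in this abstract, Total Weighted Tardiness (TWT) and Total Setup Time (TST), can be made concrete with a small worked example. This is a sketch under assumptions: it evaluates a single machine with machine-independent processing times, the job data and setup times are invented, a setup is assumed to start once the machine is free and the job is released, and the scalarized reward with weights `w_twt` and `w_tst` is a hypothetical form, since the abstract does not give the paper's reward function.

```python
def evaluate_machine(sequence, jobs, setup):
    """Return (total weighted tardiness, total setup time) for one machine's job sequence."""
    clock, twt, tst, previous = 0, 0, 0, None
    for name in sequence:
        job = jobs[name]
        # Sequence-dependent setup; the first job on the machine needs none.
        setup_time = setup[previous][name] if previous is not None else 0
        start = max(clock, job["release"]) + setup_time
        clock = start + job["proc"]
        twt += job["weight"] * max(0, clock - job["due"])
        tst += setup_time
        previous = name
    return twt, tst

def scalar_reward(twt, tst, w_twt=1.0, w_tst=0.5):
    """Assumed weighted-sum reward: larger is better, so both costs are negated."""
    return -(w_twt * twt + w_tst * tst)

jobs = {
    "A": {"release": 0, "proc": 3, "due": 4, "weight": 2},
    "B": {"release": 2, "proc": 2, "due": 5, "weight": 1},
    "C": {"release": 0, "proc": 4, "due": 6, "weight": 3},
}
setup = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "C": 1},
    "C": {"A": 2, "B": 1},
}

for sequence in (["A", "B", "C"], ["C", "A", "B"]):
    twt, tst = evaluate_machine(sequence, jobs, setup)
    print(sequence, "TWT =", twt, "TST =", tst, "reward =", scalar_reward(twt, tst))
```

In this toy instance the order A, B, C gives TWT 16 and TST 2, while C, A, B gives TWT 17 and TST 3, so the first sequence is better on both objectives; in general the two objectives can pull in different directions, which is what makes the weighting matter.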

Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

Epigraph引导的流程匹配，实现安全高效的离线强化学习

Authors: Manan Tayal, Mumuksh Tayal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08054
Pdf link: https://arxiv.org/pdf/2602.08054
Abstract Offline reinforcement learning (RL) provides a compelling paradigm for training autonomous systems without the risks of online exploration, particularly in safety-critical domains. However, jointly achieving strong safety and performance from fixed datasets remains challenging. Existing safe offline RL methods often rely on soft constraints that allow violations, introduce excessive conservatism, or struggle to balance safety, reward optimization, and adherence to the data distribution. To address this, we propose Epigraph-Guided Flow Matching (EpiFlow), a framework that formulates safe offline RL as a state-constrained optimal control problem to co-optimize safety and performance. We learn a feasibility value function derived from an epigraph reformulation of the optimal control problem, thereby avoiding the decoupled objectives or post-hoc filtering common in prior work. Policies are synthesized by reweighting the behavior distribution based on this epigraph value function and fitting a generative policy via flow matching, enabling efficient, distribution-consistent sampling. Across various safety-critical tasks, including Safety-Gymnasium benchmarks, EpiFlow achieves competitive returns with near-zero empirical safety violations, demonstrating the effectiveness of epigraph-guided policy synthesis.
中文摘要 离线强化学习（RL）为训练自主系统提供了一种引人注目的范式，使其无需担心在线探索的风险，尤其是在安全关键领域。然而，从固定数据集中共同实现强的安全性和性能仍然具有挑战性。现有安全的离线强化学习方法常依赖软约束，允许违规、引入过度保守，或难以平衡安全性、奖励优化和数据分布的遵守。为此，我们提出了Epigraph引导流匹配（EpiFlow）框架，将安全的离线强化学习作为状态约束的最优控制问题，以协同优化安全性和性能。我们学习到一个可行性值函数，该函数是通过对最优控制问题的题言重述得出的，从而避免了以往工作中常见的解耦目标或事后过滤。策略通过根据该引文值函数重新加权行为分布，并通过流匹配拟合生成策略，实现高效且分布一致的采样。在包括安全体育馆基准在内的多个安全关键任务中，EpiFlow以近乎零的实证安全违规实现了竞争回报，证明了以Epigraph为指导的政策综合的有效性。

Objective Decoupling in Social Reinforcement Learning: Recovering Ground Truth from Sycophantic Majorities

社会强化学习中的客观解耦：从谄媚多数中恢复真实性

Authors: Majid Ghasemi, Mark Crowley
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2602.08092
Pdf link: https://arxiv.org/pdf/2602.08092
Abstract Contemporary AI alignment strategies rely on a fragile premise: that human feedback, while noisy, remains a fundamentally truthful signal. In this paper, we identify this assumption as Dogma 4 of Reinforcement Learning (RL). We demonstrate that while this dogma holds in static environments, it fails in social settings where evaluators may be sycophantic, lazy, or adversarial. We prove that under Dogma 4, standard RL agents suffer from what we call Objective Decoupling, a structural failure mode where the agent's learned objective permanently separates from the latent ground truth, guaranteeing convergence to misalignment. To resolve this, we propose Epistemic Source Alignment (ESA). Unlike standard robust methods that rely on statistical consensus (trusting the majority), ESA utilizes sparse safety axioms to judge the source of the feedback rather than the signal itself. We prove that this "judging the judges" mechanism guarantees convergence to the true objective, even when a majority of evaluators are biased. Empirically, we show that while traditional consensus methods fail under majority collusion, our approach successfully recovers the optimal policy.
中文摘要 当代人工智能对齐策略依赖于一个脆弱的前提：人类反馈虽然嘈杂，但始终是本质上真实的信号。本文将这一假设定为强化学习（RL）的第四条教义。我们证明，虽然这种教条在静态环境中成立，但在评估者可能谄媚、懒惰或对立的社交环境中则失效。我们证明，根据第四条教条，标准强化学习代理存在我们所称的客观解耦现象，即代理所学目标永久与潜在的真实性分离，确保趋同于错位。为解决这个问题，我们提出了认识源对齐（ESA）。与依赖统计共识（信任多数）的标准稳健方法不同，ESA利用稀疏安全公理来判断反馈的来源，而非信号本身。我们证明了这种“评判评委”机制能保证与真实目标趋同，即使大多数评审者存在偏见。通过实证，我们表明，虽然传统共识方法在多数人共谋下会失败，但我们的方法成功地恢复了最优政策。

Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

多智能体强化学习系统中的可解释故障分析

Authors: Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.08104
Pdf link: https://arxiv.org/pdf/2602.08104
Abstract Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature, aggregated over causal windows) to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.
中文摘要 多智能体强化学习（MARL）正日益应用于安全关键领域，但可解释的故障检测和归因方法仍然不成熟。我们引入了一个基于梯度的两阶段框架，为三个关键故障分析任务提供可解读的诊断方法：（1）检测真实的初始故障源（Patient-0）;（2）验证为何未受攻击的病原体可能因多米诺效应而先被标记;以及（3）追踪失败如何通过学习的协调路径传播。第一阶段通过策略梯度成本的泰勒余余分析，进行可解释的每个智能体故障检测，在第一个阈值交叉处宣布初始的 Patient-0 候选。第二阶段通过几何分析验证批评导数——一阶敏感度和方向性二阶曲率在因果窗口上汇总，构建可解释的传染图。该方法通过揭示放大上游偏差的路径来解释“下游优先”的检测异常。通过在简单扩散（3和5个代理）中500集以及《星际争霸II》中100次，使用MADDPG和HATRPO评估，我们的方法实现了88.2%至99.4%的零患者检测准确率，同时为检测决策提供了可解释的几何证据。通过超越黑箱检测，转向可解释的梯度级取证，该框架为诊断安全关键MARL系统中的级联故障提供了实用工具。

CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization

CADO：从模仿到基于热图的组合优化求解器的成本最小化

Authors: Hyungseok Song, Deunsol Yoon, Kanghoon Lee, Han-Seul Jeong, Soonyoung Lee, Woohyung Lim
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.08210
Pdf link: https://arxiv.org/pdf/2602.08210
Abstract Heatmap-based solvers have emerged as a promising paradigm for Combinatorial Optimization (CO). However, we argue that the dominant Supervised Learning (SL) training paradigm suffers from a fundamental objective mismatch: minimizing imitation loss (e.g., cross-entropy) does not guarantee solution cost minimization. We dissect this mismatch into two deficiencies: Decoder-Blindness (being oblivious to the non-differentiable decoding process) and Cost-Blindness (prioritizing structural imitation over solution quality). We empirically demonstrate that these intrinsic flaws impose a hard performance ceiling. To overcome this limitation, we propose CADO (Cost-Aware Diffusion models for Optimization), a streamlined Reinforcement Learning fine-tuning framework that formulates the diffusion denoising process as an MDP to directly optimize the post-decoded solution cost. We introduce Label-Centered Reward, which repurposes ground-truth labels as unbiased baselines rather than imitation targets, and Hybrid Fine-Tuning for parameter-efficient adaptation. CADO achieves state-of-the-art performance across diverse benchmarks, validating that objective alignment is essential for unlocking the full potential of heatmap-based solvers.
中文摘要 基于热图的求解器已成为组合优化（CO）的一种有前景的范式。然而，我们认为主流的监督学习（SL）训练范式存在一个根本的客观不匹配：最小化模仿损失（如交叉熵）并不能保证解成本最小化。我们将这种不匹配分为两个缺陷：解码器盲（对不可微译码过程视而不见）和成本盲（优先考虑结构模仿而非解质）。我们通过实证证明，这些内在缺陷带来了严格的性能上限。为克服这一限制，我们提出了CADO（优化成本感知扩散模型），这是一个简化的强化学习微调框架，将扩散去噪过程制定为MDP，直接优化解码后的解码成本。我们引入了以标签为中心的奖励，将真实标签重新利用为无偏的基线而非模仿目标，以及混合微调以实现参数高效的适应。CADO在多种基准测试中实现了最先进的性能，验证了目标一致性对于释放基于热图求解器的全部潜力至关重要。
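The idea of using a ground-truth label as a baseline rather than an imitation target can be sketched for a toy TSP instance. The abstract does not give the exact form of the Label-Centered Reward, so the difference-of-costs definition below is an assumption, and the names `tour_length` and `label_centered_reward` are hypothetical.

```python
import math

def tour_length(coords, tour):
    """Length of a closed tour visiting the cities in `tour` order."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )

def label_centered_reward(coords, decoded_tour, label_tour):
    """Assumed form: cost of the labelled tour minus cost of the decoded tour.

    Positive when the decoded solution is cheaper than the label,
    zero when they cost the same, negative otherwise.
    """
    return tour_length(coords, label_tour) - tour_length(coords, decoded_tour)

# Four cities on the unit square (illustrative data only).
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
label_tour = [0, 1, 2, 3]    # perimeter tour, length 4
decoded_tour = [0, 2, 1, 3]  # crossing tour, longer than the label

reward = label_centered_reward(coords, decoded_tour, label_tour)
print(f"reward = {reward:.3f}")
```

Because the reward depends on the cost of the decoded solution and not on how closely it matches the label edge by edge, a decoded tour that differs from the label but costs the same is not penalized.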

DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning

DrugR：通过基于LLM的显式推理优化分子药物

Authors: Haoran Liu, Zheni Zeng, Yukun Yan, Yuxuan Chen, Yunduo Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2602.08213
Pdf link: https://arxiv.org/pdf/2602.08213
Abstract Molecule generation and optimization is a fundamental task in the chemical domain. The rapid development of intelligent tools, especially large language models (LLMs) with powerful knowledge reserves and interactive capabilities, has provided new paradigms for it. Nevertheless, the intrinsic challenge for LLMs lies in the complex implicit relationship between molecular structure and pharmacological properties and the lack of corresponding labeled data. To bridge this gap, we propose DrugR, an LLM-based method that introduces explicit, step-by-step pharmacological reasoning into the optimization process. Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning. This framework enables DrugR to effectively improve key ADMET properties while preserving the original molecule's core efficacy. Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity. Importantly, its explicit reasoning process provides clear, interpretable rationales for each optimization step, yielding actionable design insights and advancing toward automated, knowledge-driven scientific discovery. Our code and model checkpoints are open-sourced to foster future research.
中文摘要 分子生成与优化是化学领域的一项基础任务。智能工具的快速发展，尤其是拥有强大知识储备和交互功能的大型语言模型（LLM），为其提供了新的范式。然而，LLMs的内在挑战在于分子结构与药理性质之间复杂的隐性关系，以及缺乏相应的标记数据。为弥合这一空白，我们提出了DrugR，一种基于LLM的方法，将显性、逐步的药理学推理引入优化过程。我们的方法整合了领域特定的持续预训练、通过逆向数据工程进行监督微调以及自平衡多粒度强化学习。这一框架使DrugR能够有效改善ADMET的关键特性，同时保持原始分子的核心疗效。实验结果表明，DrugR在多重特性上实现了全面增强，同时不影响结构相似性或靶点结合亲和力。重要的是，其显式推理过程为每个优化步骤提供了清晰且可解释的理由，带来可作的设计见解，推动自动化、知识驱动的科学发现。我们的代码和模型检查点为开源，以促进未来的研究。

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

SkillRL：通过递归技能增强强化学习进化代理

Authors: Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08234
Pdf link: https://arxiv.org/pdf/2602.08234
Abstract Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task-specific heuristics, and a recursive evolution mechanism that allows the skill library to co-evolve with the agent's policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines by over 15.3% and maintaining robustness as task complexity increases. Code is available at this https URL.
中文摘要 大型语言模型（LLM）代理在复杂任务中表现出惊人的成果，但它们常常孤立运作，未能从过去经验中学习。现有基于内存的方法主要存储原始轨迹，这些轨迹通常冗余且噪声较多。这阻止了代理提取对泛化至关重要的高层次、可重复使用的行为模式。本文提出了SkillRL，这一框架通过自动技能发现和递归进化，弥合了原始经验与政策改进之间的鸿沟。我们的方法引入了基于经验的提炼机制，构建分层技能库SkillBank，一种适用于通用和任务特定启发式的自适应检索策略，以及一种递归进化机制，使技能库在强化学习过程中能够与智能体策略共同进化。这些创新显著减少了代币的使用，同时提升了推理的实用性。在ALFWorld、WebShop及七个搜索增强任务上的实验结果表明，SkillRL实现了最先进的性能，超过15.3%的强基线表现，并且随着任务复杂度的增加，依然保持鲁棒性。代码可在此 https URL 下载。

Document Reconstruction Unlocks Scalable Long-Context RLVR

文档重建解锁可扩展的长上下文RLVR

Authors: Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08237
Pdf link: https://arxiv.org/pdf/2602.08237
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
中文摘要 带可验证奖励的强化学习~（RLVR）已成为增强大型语言模型（LLMs）能力（即长上下文）的显著范式。然而，它常依赖于由强大教师模型或人类专家提供的黄金标准答案或明确的评估标准，这些内容既昂贵又耗时。本研究探讨无监督方法以增强LLM的长上下文能力，消除对大量人工注释或教师模型监督的需求。具体来说，我们首先用长文档中的几个段落替换为特殊的占位符。LLM通过强化学习训练，通过正确识别和排序候选选项中缺失段落来重建文档。这种训练范式使模型能够捕捉全局叙事的连贯性，显著提升了长上下文表现。我们在两个广泛使用的基准测试 RULER 和 LongBench~v2 上验证了该方法的有效性。在RULER上获得明显提升的同时，它在没有手动整理的长上下文QA数据的情况下，也能在LongBench~v2上取得合理的提升。此外，我们开展了广泛的消融研究，分析奖励设计、数据管理策略、训练方案和数据扩展对模型性能的影响。我们公开发布代码、数据和模型。
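A verifiable reward for the reconstruction task described above could look roughly like the following sketch. The abstract does not specify the reward design (it is one of the ablated factors), so the per-slot accuracy used here is an assumption, and `reconstruction_reward` is a hypothetical name.

```python
def reconstruction_reward(predicted, gold):
    """Fraction of placeholder slots filled with the correct candidate.

    predicted, gold: lists of candidate-paragraph IDs, one per placeholder,
    in document order. A wrong number of slots is scored as 0.0.
    """
    if not gold or len(predicted) != len(gold):
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Three placeholders; candidates are labelled A-E (illustrative data only).
gold = ["C", "A", "D"]
exact = reconstruction_reward(["C", "A", "D"], gold)    # every slot correct
partial = reconstruction_reward(["C", "D", "A"], gold)  # only the first slot correct
print(exact, partial)
```

The reward needs only the original document as ground truth, which is what makes the setup unsupervised: no teacher model or human-written answer is involved in scoring.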

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

MLLMs真的看见了吗：在多模态LLM中强化视觉注意力

Authors: Siqu Ou, Tianrui Wan, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.08241
Pdf link: https://arxiv.org/pdf/2602.08241
Abstract While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
中文摘要 虽然思维链（CoT）推理在复杂推理任务上的多模态大型语言模型（MLLM）大幅提升了改进，但现有方法主要依赖长文本推理轨迹，且提供有限的稳定视觉注意力策略学习机制。我们的分析显示，当前MLLM表现出视觉聚焦较弱：早期视觉错位很少在后续推理中得到纠正，导致错误传播和推断失败。我们认为这一限制源于培训期间视觉注意力的学分分配不足。为解决这一问题，我们提出了SAYO，这是一个用强化学习（RL）框架训练的视觉推理模型，引入了区域级视觉注意力奖励。该奖励明确将优化信号与视觉化的推理步骤对齐，使模型能够学习更可靠的注意力行为。多模态基准测试的广泛实验表明，SAYO在多样化的推理和感知任务中持续提升表现。

Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

以选择为引导的语境学习：一个无奖励的Transformers强化学习范式

Authors: Juncheng Dong, Bowen He, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08244
Pdf link: https://arxiv.org/pdf/2602.08244
Abstract In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals nor optimal action this http URL on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
中文摘要 上下文强化学习（ICRL）利用变换器模型（TM）的上下文学习能力，高效地推广到不需参数更新的未见顺序决策任务。然而，现有的ICRL方法依赖于预训练阶段的显式奖励信号，这限制了当奖励模糊、难以指定或获取成本高昂时的适用性。为克服这一局限，我们提出了一种新的学习范式——基于情境偏好的强化学习（ICPRL），其中预训练和部署均仅依赖偏好反馈，消除了奖励监督的需求。我们研究了两种反馈粒度不同的变体：基于即时偏好的RL（I-PRL）与基于轨迹层级比较的基于轨迹偏好的RL（T-PRL）。我们首先证明，监督预训练作为ICRL的标准方法，在仅偏好上下文数据集下依然有效，展示了仅使用偏好信号进行上下文强化学习的可行性。为进一步提升数据效率，我们引入了I-PRL和T-PRL的替代偏好原生框架，直接从偏好数据优化图灵机策略，无需奖励信号或最优作。该http URL用于对决强盗、导航和持续控制任务，展示了ICPRL能够强力地在上下文中泛化到看不见的任务，实现与ICRL方法在完全奖励监督下训练的表现相当。

When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems

多智能体系统何时表现优异？分析智能系统的学习效率

Authors: Junwei Su, Chuan Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08272
Pdf link: https://arxiv.org/pdf/2602.08272
Abstract Reinforcement Learning (RL) has emerged as a crucial method for training or fine-tuning large language models (LLMs), enabling adaptive, task-specific optimizations through interactive feedback. Multi-Agent Reinforcement Learning (MARL), in particular, offers a promising avenue by decomposing complex tasks into specialized subtasks learned by distinct interacting agents, potentially enhancing the ability and efficiency of LLM systems. However, theoretical insights regarding when and why MARL outperforms Single-Agent RL (SARL) remain limited, creating uncertainty in selecting the appropriate RL framework. In this paper, we address this critical gap by rigorously analyzing the comparative sample efficiency of MARL and SARL within the context of LLM. Leveraging the Probably Approximately Correct (PAC) framework, we formally define SARL and MARL setups for LLMs, derive explicit sample complexity bounds, and systematically characterize how task decomposition and alignment influence learning efficiency. Our results demonstrate that MARL improves sample complexity when tasks naturally decompose into independent subtasks, whereas dependent subtasks diminish MARL's comparative advantage. Additionally, we introduce and analyze the concept of task alignment, quantifying the trade-offs when enforcing independent task decomposition despite potential misalignments. These theoretical insights clarify empirical inconsistencies and provide practical criteria for deploying MARL strategies effectively in complex LLM scenarios.
中文摘要 强化学习（RL）已成为训练或微调大型语言模型（LLM）的关键方法，通过交互式反馈实现自适应、针对特定任务的优化。尤其是多智能体强化学习（MARL），通过将复杂任务分解为由不同交互代理学习的专门子任务，提供了一条有前景的途径，有望提升LLM系统的能力和效率。然而，关于MARL何时以及为何优于单剂强化学习（SARL）的理论见解仍然有限，这也带来了选择合适强化学习框架的不确定性。本文通过严格分析大型语言模型（LLM）背景下MARL和SARL的样本效率比较，解决了这一关键空白。利用Probably Approx Correct（PAC）框架，我们正式定义了LLM的SARL和MARL设置，推导出显式的样本复杂度界限，并系统地描述任务分解和对齐如何影响学习效率。我们的结果表明，当任务自然分解为独立子任务时，MARL能提升样本复杂度，而依赖子任务则削弱MARL的比较优势。此外，我们介绍并分析了任务对齐的概念，量化了在执行独立任务分解时即使存在潜在错位也会带来权衡。这些理论见解澄清了实证上的不一致之处，并为在复杂大型语言模型场景中有效部署MARL策略提供了实用标准。

New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

新技能还是更锋利的原始武器？关于RLVR中推理出现的概率视角

Authors: Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08281
Pdf link: https://arxiv.org/pdf/2602.08281
Abstract Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
中文摘要 带可验证奖励的强化学习（RLVR）究竟赋予大型语言模型（LLMs）新功能，还是仅仅引发潜在痕迹，仍是一个核心争论。在本研究中，我们与前者观点一致，提出了一个概率框架，其中能力由实例级可解性定义。我们假设复杂推理的出现可以通过原子步进概率的锐化推动，这使得模型能够克服多步推理链中固有的成功率指数衰减。利用代数框架，我们专门训练单步作模型，并评估其在看不见的多步任务中的表现。我们的实证结果证实：（1）RLVR通过增强模型现有技能，激励探索此前无法触及的解路径;（2）复合性能严格受原子步进联合概率的约束，表现为高皮尔逊相关系数（$\rho \in [0.69， 0.96]$）;以及（3）RLVR作为全局优化器，可以牺牲特定技能以最大化总奖励。我们的工作为RLVR中的涌现能力提供了新颖解释，表明可解问题的迭代优化使模型能够发展出解决此前无法解决的情景的能力。
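The exponential-decay argument in the abstract above can be illustrated numerically. The sketch below assumes every atomic step succeeds independently with the same probability, which is a simplification made here for illustration; `chain_success` is a hypothetical name, not something from the paper.

```python
import math

def chain_success(step_probs):
    """Probability that every step in a chain succeeds, assuming independence."""
    return math.prod(step_probs)

# A 10-step chain before and after sharpening each step from 0.90 to 0.99.
before = chain_success([0.90] * 10)
after = chain_success([0.99] * 10)
print(f"before: {before:.3f}, after: {after:.3f}")  # before: 0.349, after: 0.904
```

Under this assumption, a modest gain per step compounds over the chain: the 10-step success rate moves from roughly one in three to roughly nine in ten.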

Improving Data and Reward Design for Scientific Reasoning in Large Language Models

改进大型语言模型中科学推理的数据和奖励设计

Authors: Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08321
Pdf link: https://arxiv.org/pdf/2602.08321
Abstract Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained using this http URL pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, and consistently improves over strong post-trained baselines such as o1-mini and GPT-4o, demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
中文摘要 对于大型语言模型来说，解决开放式科学问题依然具有挑战性，尤其是由于本质上监督和评估不可靠。瓶颈在于科学后培训的数据构建和奖励设计。我们开发了一套大规模、系统化的数据处理流程，将异构的开源科学数据转化为Dr. SCI数据集，包含涵盖八个STEM学科的100万个问题，具有明确可验证/开放式拆分、可扩展的难度注释，以及细粒度的评分标准，使开放式答案的评估得以作化。基于该数据集，我们提出了Dr. SCI后训练流程，通过三个组成部分重新设计标准SFT->强化学习工作流程：（i）探索扩展SFT，扩大模型在强化学习前的推理模式覆盖范围;（ii）动态难度课程，将训练数据调整以适应模型不断演进的科学能力;以及（iii）SciRubric引导RL，通过基于评分标准的评估实现对开放式科学问题的稳定强化学习，并明确正确答案。使用该http URL流水线训练的Qwen3-4B-Base在GPQA-diamond上达到63.2，在GPQA-general上达到32.4，且在强力后训练基线如o1-mini和GPT-4o中持续提升，科学推理能力显著提升，尤其是在开放式环境中。

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

通过极端比率思维链压缩实现高效的大语言推理模型

Authors: Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08324
Pdf link: https://arxiv.org/pdf/2602.08324
Abstract Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.
中文摘要 思维链（CoT）推理成功提升了大型语言模型（LLMs）的推理能力，但推理时会产生相当大的计算开销。现有的CoT压缩方法在高压缩比下常常会严重丧失逻辑保真度，导致性能显著下降。为了实现高保真、快速的推理，我们提出了一种新的EXTreme-RAtio思维链压缩框架，称为Extra-CoT，它在保持答案准确性的同时，积极减少令牌预算。为了生成可靠且高保真的监督，我们首先在数学CoT数据上训练一个专用的语义保持压缩器，并带有细粒度的注释。LLM随后通过混合比例监督微调（SFT）对这些压缩对进行微调，教它遵循压缩预算的光谱，并为强化学习（RL）提供稳定的初始化。我们进一步提出了受限与层级比率政策优化（CHRPO），通过层级奖励明确激励低预算下的解题能力。对三个数学推理基准的实验显示了Extra-CoT的优越性。例如，在使用 Qwen3-1.7B 的 MATH-500 上，Extra-CoT 实现了超过 73% 的令牌减少，准确率提升了 0.6%，显著优于最先进的（SOTA）方法。

Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

谁配得上这份奖励？SHARP：基于Shapley信用的多智能体系统优化

Authors: Yanming Li, Xuelin Zhang, WenJie Lu, Ziye Tang, Maodong Wu, Haotian Luo, Tongtong Wu, Zijie Peng, Hongze Mi, Yibo Feng, Naiqiang Tan, Chao Huang, Hong Chen, Li Shen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08335
Pdf link: https://arxiv.org/pdf/2602.08335
Abstract Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley-based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi-agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent-specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast-accuracy reward, a Shapley-based marginal-credit reward for each agent, and a tool-process reward to improve execution efficiency. Extensive experiments across various real-world benchmarks demonstrate that SHARP significantly outperforms recent state-of-the-art baselines, achieving average match improvements of 23.66% and 14.05% over single-agent and multi-agent approaches, respectively.
中文摘要 通过多智能体系统将大型语言模型（LLM）与外部工具集成，为分解和解决复杂问题提供了有前景的新范式。然而，由于信用分配的挑战，培训这些系统仍然非常困难，因为通常不清楚具体哪个功能代理对决策轨迹的成功或失败负有责任。现有方法通常依赖稀疏或全局广播奖励，未能捕捉个体贡献，导致强化学习效率低下。为解决这些局限性，我们引入了基于Shapley的强化策略层级归因（SHARP），这是一个通过精确信用归因优化多智能体强化学习的新框架。SHARP通过规范各轨迹组间的代理特异优势，有效稳定训练，主要通过分解的奖励机制，包括全局广播准确性奖励、基于Shapley的边际信用奖励（针对每个代理）以及工具-过程奖励以提升执行效率。在多个真实世界基准测试中广泛实验表明，SHARP显著优于近期最先进的基线，平均匹配提升分别为23.66%和14.05%，优于单代理和多代理方法。
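The Shapley-based marginal credit mentioned in the abstract above builds on the classical Shapley value, which can be computed exactly for a small set of agents as sketched below. This shows only the textbook definition; how SHARP estimates coalition values or combines this credit with its other reward terms is not described in the abstract. The agent names and the `toy_value` table are hypothetical.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {agent: total / len(orderings) for agent, total in totals.items()}

def toy_value(coalition):
    """Hypothetical task score achieved by each subset of agents."""
    table = {
        frozenset(): 0.0,
        frozenset({"planner"}): 0.2,
        frozenset({"tool"}): 0.0,
        frozenset({"planner", "tool"}): 1.0,
    }
    return table[frozenset(coalition)]

credits = shapley_values(["planner", "tool"], toy_value)
print(credits)
```

In this toy table the credits sum to the score of the full team, so a single global reward is split into per-agent shares rather than broadcast unchanged to every agent. Exact enumeration grows factorially with the number of agents, so it is only practical for small teams.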

OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration

OPE：通过大纲引导路径探索的并行思维克服信息饱和

Authors: Qi Guo, Jianing Wang, Deyang Kong, Xiangyu Xi, Jianfei Zhang, Yi Lu, Jingang Wang, Wei Wang, Shikun Zhang, Wei Ye
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08344
Pdf link: https://arxiv.org/pdf/2602.08344
Abstract Parallel thinking has emerged as a new paradigm for large reasoning models (LRMs) in tackling complex problems. Recent methods leverage Reinforcement Learning (RL) to enhance parallel thinking, aiming to address the limitations in computational resources and effectiveness encountered with supervised fine-tuning. However, most existing studies primarily focus on optimizing the aggregation phase, with limited attention to the path exploration stage. In this paper, we theoretically analyze the optimization of parallel thinking under the Reinforcement Learning with Verifiable Rewards (RLVR) setting, and identify that the mutual information bottleneck among exploration paths fundamentally restricts overall performance. To address this, we propose Outline-Guided Path Exploration (OPE), which explicitly partitions the solution space by generating diverse reasoning outlines prior to parallel path reasoning, thereby reducing information redundancy and improving the diversity of information captured across exploration paths. We implement OPE with an iterative RL strategy that optimizes outline planning and outline-guided reasoning independently. Extensive experiments across multiple challenging mathematical benchmarks demonstrate that OPE effectively improves reasoning performance in different aggregation strategies, enabling LRMs to more reliably discover correct solutions.
中文摘要 并行思维已成为大型推理模型（LRM）解决复杂问题的新范式。最新方法利用强化学习（RL）来增强并行思维，旨在解决监督微调在计算资源和效能上的限制。然而，大多数现有研究主要关注聚合阶段的优化，对路径探索阶段的关注有限。本文理论上分析了在可验证奖励强化学习（RLVR）环境下的并行思维优化，并指出探索路径间的互信息瓶颈从根本上限制了整体表现。为此，我们提出了大纲引导路径探索（OPE），该方法通过生成多样的推理大纲，明确划分解空间，从而减少信息冗余，提升探索路径间捕获信息的多样性。我们采用迭代强化学习策略实现OPE，优化大纲规划和基于大纲引导的推理。在多个具有挑战性的数学基准测试中的大量实验表明，OPE有效提升不同聚合策略中的推理性能，使LRMS更可靠地发现正确解。

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

你的推理模型是否隐含地知道什么时候该停止思考？

Authors: Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08354
Pdf link: https://arxiv.org/pdf/2602.08354
Abstract Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
中文摘要 近年来，大型推理模型（LRM）的进步极大提升了它们通过长链思维（CoT）处理复杂推理任务的能力。然而，这种方法常常导致大量冗余，降低计算效率，并在实时应用中造成显著延迟。最新研究表明，较长的推理链通常与正确性无关，甚至可能损害准确性。在对这一现象的进一步深入分析中，我们令人惊讶地发现并实证了LRM隐含地知道何时该停止思考，而这一能力被当前的采样范式所掩盖。基于此，我们引入了SAGE（自我意识引导高效推理），这是一种释放高效推理潜力的新抽样范式。此外，将SAGE作为混合抽样整合进基于群体的强化学习（SAGE-RL），使SAGE-RL能够有效将SAGE发现的高效推理模式融入标准pass@1推断中，显著提升了LRM在多个具有挑战性的数学基准测试中的推理准确性和效率。

Learning Human-Like Badminton Skills for Humanoid Robots

面向人形机器人的类人羽毛球技能学习

Authors: Yeke Chen, Shihao Dong, Xiaoyu Ji, Jingkai Sun, Zeren Luo, Liu Zhao, Jiahui Zhang, Wanyue Li, Ji Ma, Bowen Xu, Yimin Han, Yudong Zhao, Peng Lu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08370
Pdf link: https://arxiv.org/pdf/2602.08370
Abstract Realizing versatile and human-like performance in high-demand sports like badminton remains a formidable challenge for humanoid robotics. Unlike standard locomotion or static manipulation, this task demands a seamless integration of explosive whole-body coordination and precise, timing-critical interception. While recent advances have achieved lifelike motion mimicry, bridging the gap between kinematic imitation and functional, physics-aware striking without compromising stylistic naturalness is non-trivial. To address this, we propose Imitation-to-Interaction, a progressive reinforcement learning framework designed to evolve a robot from a "mimic" to a capable "striker." Our approach establishes a robust motor prior from human data, distills it into a compact, model-based state representation, and stabilizes dynamics via adversarial priors. Crucially, to overcome the sparsity of expert demonstrations, we introduce a manifold expansion strategy that generalizes discrete strike points into a dense interaction volume. We validate our framework through the mastery of diverse skills, including lifts and drop shots, in simulation. Furthermore, we demonstrate the first zero-shot sim-to-real transfer of anthropomorphic badminton skills to a humanoid robot, successfully replicating the kinetic elegance and functional precision of human athletes in the physical world.
中文摘要 在羽毛球等高需求运动中实现多功能且类人化的表现，对人形机器人来说仍是一项艰巨挑战。与标准的移动或静态操作不同，该任务要求爆发性的全身协调与精准、时机要求严苛的拦截无缝结合。虽然近年来的进展实现了逼真的运动模仿，但要在不牺牲风格自然性的情况下，弥合运动学模仿与功能性、物理感知的击打之间的差距，并非易事。为此，我们提出了“模仿到交互”（Imitation-to-Interaction），一种渐进式强化学习框架，旨在将机器人从“模仿者”进化为有能力的“击球者”。我们的方法从人类数据中建立一个稳健的运动先验，将其提炼为紧凑的基于模型的状态表示，并通过对抗先验稳定动力学。关键是，为了克服专家演示的稀疏性，我们引入了一种流形扩展策略，将离散击球点推广到密集的交互体积中。我们在仿真中通过掌握多种技能（包括挑球和吊球）来验证我们的框架。此外，我们还首次展示了将拟人羽毛球技能以零样本方式从仿真迁移到现实（sim-to-real）中的人形机器人，成功复现了人类运动员在现实世界中的动态优雅和功能精准。

Reinforcement Learning with Backtracking Feedback

带回溯反馈的强化学习

Authors: Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08377
Pdf link: https://arxiv.org/pdf/2602.08377
Abstract Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE, by primarily leveraging a Reinforcement Learning (RL) stage where models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model's live outputs, LLMs are trained to identify and recover from their actual, emergent safety violations by emitting an efficient "backtrack by x tokens" signal, then continuing generation autoregressively. This RL process is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding parameter manipulations. To further support the acquisition of this backtracking capability, we also propose an enhanced Supervised Fine-Tuning (SFT) data generation strategy (BSAFE+). This method improves upon previous data creation techniques by injecting violations into coherent, originally safe text, providing more effective initial training for the backtracking mechanism. Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while critically preserving foundational model utility.
中文摘要 针对大型语言模型（LLMs）对强健安全性的迫切需求，特别是针对对抗性攻击和分布内错误，我们引入了带回溯反馈的强化学习（RLBF）。该框架在之前的方法（如BSAFE）基础上，主要利用强化学习（RL）阶段，使模型学习动态修正自身的生成错误。通过强化学习并对模型实时输出给予评判器（critic）反馈，LLMs被训练为识别并恢复其实际出现的安全违规：先发出高效的“按x个token回溯”信号，然后继续自回归生成。这一强化学习过程对于增强对复杂对抗策略的韧性至关重要，包括中间填充、贪婪坐标梯度（GCG）攻击以及解码参数操纵。为进一步支持回溯能力的获取，我们还提出了增强型监督微调（SFT）数据生成策略（BSAFE+）。该方法通过将违规注入连贯且原本安全的文本，改进了以往的数据创建技术，为回溯机制提供更有效的初始训练。全面的实证评估表明，RLBF在不同基准和模型尺度上显著降低了攻击成功率，实现了更优越的安全结果，同时关键地保持了模型的基础效用。
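
The "backtrack by x tokens" signal described in the abstract can be illustrated with a small decoding helper. This is only a sketch under stated assumptions: the `<backtrack:x>` token format and the function name `apply_backtracks` are hypothetical, since the abstract does not specify how the signal is encoded or applied.

```python
import re
from typing import List

# Hypothetical control-token format. The abstract only describes a
# "backtrack by x tokens" signal, not its concrete encoding.
BACKTRACK = re.compile(r"^<backtrack:(\d+)>$")


def apply_backtracks(stream: List[str]) -> List[str]:
    """Replay a token stream, dropping the last x kept tokens on each signal."""
    kept: List[str] = []
    for tok in stream:
        match = BACKTRACK.match(tok)
        if match is None:
            kept.append(tok)
            continue
        # Never remove more tokens than have been generated so far.
        n = min(int(match.group(1)), len(kept))
        if n:
            del kept[-n:]
    return kept


if __name__ == "__main__":
    raw = ["The", "answer", "is", "wrong", "<backtrack:1>", "42"]
    print(apply_backtracks(raw))
```

After a backtrack the model would simply keep generating from the shortened prefix, which is why the helper only needs to maintain the list of kept tokens.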

Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

通过端到端强化学习对压缩记忆进行动态长上下文推理

Authors: Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08382
Pdf link: https://arxiv.org/pdf/2602.08382
Abstract Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
中文摘要 大型语言模型（LLMs）在长上下文处理中面临重大挑战，包括二次计算成本、信息遗忘以及检索增强生成（RAG）中固有的上下文碎片化。我们提出了一种基于分块压缩和选择性记忆回忆的认知启发框架，用于高效长上下文推理，而非处理所有原始token。该框架将长输入分割成块，并利用学习到的压缩器将每个块编码为压缩记忆表示。门控模块动态选择相关记忆块，然后由具有演化工作记忆的推理模块迭代处理，以解决下游任务。压缩器和推理器通过端到端强化学习联合优化，而门控模块则作为分类器单独训练。实验结果显示，所提方法在多跳推理基准测试（如RULER-HQA）上取得了有竞争力的准确性，将上下文长度从7K外推到175万个token，并且相比强的长上下文基线提供了更有利的准确性与效率权衡。特别是，相比MemAgent，它在峰值GPU内存使用量上实现了最多2倍的减少，推理速度提升了6倍。

Intelligent support for Human Oversight: Integrating Reinforcement Learning with Gaze Simulation to Personalize Highlighting

智能支持人类监督：将强化学习与凝视模拟整合，实现个性化高亮

Authors: Thorsten Klößner, João Belo, Zekun Wu, Jörg Hoffmann, Anna Maria Feit
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08403
Pdf link: https://arxiv.org/pdf/2602.08403
Abstract Interfaces for human oversight must effectively support users' situation awareness under time-critical conditions. We explore reinforcement learning (RL)-based UI adaptation to personalize alerting strategies that balance the benefits of highlighting critical events against the cognitive costs of interruptions. To enable learning without real-world deployment, we integrate models of users' gaze behavior to simulate attentional dynamics during monitoring. Using a delivery-drone oversight scenario, we present initial results suggesting that RL-based highlighting can outperform static, rule-based approaches and discuss challenges of intelligent oversight support.
中文摘要 面向人类监督的界面必须在时间关键条件下有效支持用户的态势感知。我们探讨基于强化学习（RL）的用户界面自适应，以个性化提醒策略，平衡突出关键事件的好处与中断带来的认知成本。为了实现无需实际部署的学习，我们集成了用户凝视行为模型，模拟监控期间的注意力动态。通过一个配送无人机监督场景，我们展示了初步结果，表明基于强化学习的高亮能够优于静态、基于规则的方法，并讨论了智能监督支持的挑战。

Beyond Correctness: Learning Robust Reasoning via Transfer

超越正确性：通过转移学习稳健推理

Authors: Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08489
Pdf link: https://arxiv.org/pdf/2602.08489
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We therefore treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6%p gain in Maj@64 compared to RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly better sample efficiency.
中文摘要 带可验证奖励的强化学习（RLVR）最近加强了LLM推理，但其对最终答案正确性的关注留下了一个关键空白：它未能确保推理过程本身的稳健性。我们采取一个简单的哲学观点：稳健的推理应当在产生它的心智之外依然有用。因此我们将推理视为一种意义传递，必须经受截断、重新诠释和延续的考验。基于这一原则，我们引入了带可迁移奖励的强化学习（RLTR），通过迁移奖励将鲁棒性操作化，测试一个模型的部分推理前缀是否能引导另一个模型找到正确答案。这鼓励大型语言模型产生稳定、可解释且真正可推广的推理。我们的方法提高了采样一致性，同时提高了最终答案的准确性，并且以显著更少的训练步数达到了相当的性能。例如，在MATH500上，RLTR相比RLVR在Maj@64上提升了3.6个百分点，并以约少2.5倍的训练步数达到RLVR的平均准确率，从而提供了更可靠的推理和显著更高的样本效率。
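
The transfer reward described above (a truncated reasoning prefix from one model is handed to a separate model, which must still reach the correct answer) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `receiver` callable, the `keep_fraction` truncation rule, and exact-match answer checking are all assumptions made here.

```python
from typing import Callable, List


def transfer_reward(
    reasoning_tokens: List[str],
    keep_fraction: float,
    receiver: Callable[[str, List[str]], str],
    question: str,
    gold_answer: str,
) -> float:
    """Return 1.0 if a second model reaches the gold answer from a truncated prefix."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    # Truncate the producer's reasoning (assumed rule: keep a leading fraction).
    cut = int(len(reasoning_tokens) * keep_fraction)
    prefix = reasoning_tokens[:cut]
    # The receiver stands in for a separate model that continues from the prefix.
    predicted = receiver(question, prefix)
    return 1.0 if predicted.strip() == gold_answer.strip() else 0.0


def toy_receiver(question: str, prefix: List[str]) -> str:
    """Toy stand-in: succeeds only if the prefix already contains the key step."""
    return "4" if "2+2" in prefix else "?"


if __name__ == "__main__":
    steps = ["2+2", "=", "4"]
    print(transfer_reward(steps, 0.5, toy_receiver, "What is 2+2?", "4"))
```

In an RL loop this scalar would be combined with, or substituted for, the usual final-answer reward; how the two are weighted is not stated in the abstract.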

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

面向可验证奖励强化学习的上下文Rollout老虎机

Authors: Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08499
Pdf link: https://arxiv.org/pdf/2602.08499
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
中文摘要 带可验证奖励的强化学习（RLVR）是一种提升大型语言模型推理能力的有效范式。然而，现有的RLVR方法以无差别且短视的方式使用rollout：每个提示下质量参差不齐的响应被统一处理，历史rollout在使用一次后即被丢弃。这导致了监督噪声、样本效率低下以及策略更新次优。我们通过将RLVR中的rollout调度建模为上下文老虎机问题，并提出统一的神经调度框架，在训练过程中自适应地选择高价值rollout来解决这些问题。每个rollout都被视为一个臂，其奖励由相邻优化步骤之间带来的性能提升定义。由此得到的调度器在单一的原则性框架内同时支持噪声感知的组内选择和历史rollout的自适应全局复用。我们通过推导次线性遗憾界，并证明扩大rollout缓冲区能提升可达到的性能上界，提供了理论依据。在六个数学推理基准上的实验表明，多种RLVR优化方法在性能和训练效率上均获得一致提升。

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

通过Rollout增强学习视觉-语言模型中的自我纠正

Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08503
Pdf link: https://arxiv.org/pdf/2602.08503
Abstract Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.
中文摘要 自我纠正对于解决视觉语言模型（VLM）中的复杂推理问题至关重要。然而，现有的强化学习（RL）方法难以学会它，因为有效的自我纠正行为极为罕见，使得学习信号极为稀疏。为应对这一挑战，我们提出了面向纠正的rollout（Octopus），这是一种强化学习rollout增强框架，通过重新组合现有的rollout来合成密集的自我纠正示例。这种增强既通过rollout复用提高了样本效率，又通过平衡监督稳定了强化学习优化。此外，我们引入了一种响应掩码策略，将自我纠正与直接推理解耦，避免信号冲突，使两种行为都能有效学习。在此基础上，我们介绍了Octopus-8B，一种具有可控自我纠正能力的推理VLM。在7个基准测试中，它在开源VLM中实现了SoTA性能，比最佳RLVR基线高出1.0分，且每步仅需$0.72\times$的训练时间。

Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

通过代理博弈和基于自适应树的GRPO进行对话模型优化

Authors: Kun Peng, Conghui Tan, Yu Liu, Guohua Tang, Zhongqian Sun, Wei Yang, Zining Zhu, Lei Jiang, Yanbing Liu, Hao Peng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08533
Pdf link: https://arxiv.org/pdf/2602.08533
Abstract Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
中文摘要 开放式对话代理旨在通过适应用户特质来提供引人入胜、个性化的互动，但现有方法面临严重局限：过度依赖预先收集的用户数据，以及强化学习（RL）中短期偏见，忽视了长期对话价值。为解决这些问题，我们提出了一种新颖的长视野强化学习框架，将在线个性化与基于自适应树的群体相对策略优化（AT-GRPO）整合在一起。采用双代理游戏范式，用户代理通过风格模仿（学习用户特定的对话特征）和主动终止（预测回合级终止概率作为即时奖励）构建动态环境，形成一个迭代循环，推动对话代理加深兴趣探索。AT-GRPO将对话轨迹重新解释为树状结构，并引入了自适应观察范围。与产生指数级开销的全树扩展不同，它限制每个节点从阶段感知范围内汇总奖励：较大的范围支持早期话题探索，而较小的范围则有助于后期对话维护。这种设计将推出预算从指数级减少到多项式式的对话长度，同时保持了长期的奖励捕获。大量实验显示了我们框架的卓越性能、样本效率和鲁棒性。

Constrained Sampling to Guide Universal Manipulation RL

以约束采样引导通用操作强化学习

Authors: Marc Toussaint, Cornelius V. Braun, Eckart Cobo-Briesewitz, Sayantan Auddy, Armand Jordana, Justin Carpentier
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.08557
Pdf link: https://arxiv.org/pdf/2602.08557
Abstract We consider how model-based solvers can be leveraged to guide training of a universal policy to control from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to sufficiently explore and discover complex manipulation strategies, especially in sparse-reward settings. Our approach builds on the idea that the feasible, likely-visited states during such manipulation lie on a lower-dimensional manifold, and guides RL with a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverages them to guide RL for universal (goal-conditioned) manipulation policies. We study using this data directly to bias state visitation, as well as using black-box optimization of open-loop trajectories between random configurations to impose a state bias and optionally add a behavior cloning loss. In a minimalistic double sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging Panda arm setting, our approach achieves a significant success rate over a near-zero baseline, and demonstrates a breadth of complex whole-body-contact manipulation strategies.
中文摘要 我们考虑如何利用基于模型的求解器来指导通用策略的训练，使其在接触丰富的操作环境中，从任何可行的起始状态控制到任何可行的目标。虽然强化学习（RL）在此类环境中展现了其优势，但在奖励稀疏的环境中，它可能难以充分探索和发现复杂的操作策略。我们的方法基于这样的思想：此类操作中可行且可能被访问的状态位于一个低维流形上，并用来自该流形的采样器引导强化学习。我们提出了样本引导强化学习（Sample-Guided RL），利用基于模型的约束求解器高效采样可行配置（满足可微的碰撞、接触和力约束），并利用它们指导强化学习得到通用（目标条件化）操作策略。我们研究直接利用这些数据对状态访问进行偏置，以及利用对随机配置间开环轨迹的黑箱优化来施加状态偏置，并可选地添加行为克隆损失。在极简的双球操作环境中，样本引导强化学习发现了复杂的操作策略，并在到达任何静态稳定状态上实现了高成功率。在更具挑战性的Panda机械臂环境中，我们的方法相对于接近零的基线取得了显著的成功率，并展示了多种复杂的全身接触操作策略。

SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning

SemiNFT：通过混合样本强化学习，学习从模仿到鉴赏的预设迁移

Authors: Melany Yang, Yuhang Yu, Diwang Weng, Jinwei Chen, Wei Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.08582
Pdf link: https://arxiv.org/pdf/2602.08582
Abstract Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at this https URL.
中文摘要 写实色彩修饰在视觉内容创作中起着重要作用，但由于依赖专业技术，手工修饰对非专业人士来说仍然难以掌握。基于参考的方法通过将参考图像的预设颜色迁移到源图像，提供了一种有前景的替代方案。然而，这些方法往往像初学者一样，仅根据像素级统计进行全局色彩映射，缺乏对语义语境或人类美学的真正理解。为解决这一问题，我们提出了SemiNFT，一种基于扩散变换器（DiT）的修图框架，反映了人类艺术训练的轨迹：从严格的模仿开始，逐步演变为直觉创作。具体来说，SemiNFT首先通过配对三元组进行教学，以获得基础的结构保持和色彩映射技能，随后在未配对数据上进行强化学习（RL），以培养细腻的美学感知。关键是，在强化学习阶段，为了防止旧技能的灾难性遗忘，我们设计了一种在线-离线混合奖励机制，以结构性复习来锚定审美探索。大量实验表明，SemiNFT不仅在标准预设迁移基准上超越了最先进的方法，还在零样本任务（如黑白照片上色和跨域（动漫到照片）预设迁移）中展现出卓越的智能。这些结果证实了SemiNFT超越了简单的统计匹配，实现了高水平的审美理解。我们的项目可以在这个 https URL 找到。

Conditional Sequence Modeling for Safe Reinforcement Learning

安全强化学习的条件序列建模

Authors: Wensong Bai, Chao Zhang, Qihang Xu, Chufan Chen, Chenhao Zhou, Hui Qian
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08584
Pdf link: https://arxiv.org/pdf/2602.08584
Abstract Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return-cost trade-off, a reward-cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return-cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.
中文摘要 离线安全强化学习（RL）旨在从固定数据集中学习策略，同时在累积成本约束下最大化性能。实际上，部署需求常因场景而异，因此需要一个能够零样本适应不同成本阈值的单一策略。然而，大多数现有的离线安全强化学习方法在预设阈值下训练，导致策略在成本阈值间的泛化性和部署灵活性有限。条件序列建模（CSM）通过指定目标回报实现灵活的目标条件控制，受其最新进展的启发，我们提出了基于CSM的方法RCDT，支持在单一训练策略内跨多个成本阈值的零样本部署。RCDT是首个基于CSM的离线安全强化学习算法，将拉格朗日式成本惩罚与自适应惩罚系数相结合。为避免过于保守的行为并获得更有利的回报-成本权衡，进一步纳入了奖励-成本感知的轨迹重加权机制和Q值正则化。在DSRL基准上的大量实验表明，RCDT相对于代表性基线持续改进了回报-成本权衡，推动了离线安全强化学习的先进水平。
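
To make the "Lagrangian-style cost penalty with an auto-adaptive penalty coefficient" more concrete, the following sketch shows the textbook dual-ascent update for a Lagrange multiplier. It is a generic illustration, not RCDT's actual update rule: the function names, the learning rate, and the use of an average episode cost are assumptions made here.

```python
def update_penalty(lam: float, avg_cost: float, threshold: float, lr: float = 0.01) -> float:
    """Dual ascent: raise the coefficient when cost exceeds the threshold, lower it otherwise."""
    # The multiplier is clipped at zero so the penalty never rewards incurring cost.
    return max(0.0, lam + lr * (avg_cost - threshold))


def penalized_return(ret: float, cost: float, lam: float) -> float:
    """Lagrangian objective for one trajectory: return minus weighted cost."""
    return ret - lam * cost


if __name__ == "__main__":
    lam = 0.5
    # Observed cost (30) is above the threshold (20), so the coefficient grows.
    lam = update_penalty(lam, avg_cost=30.0, threshold=20.0)
    print(lam, penalized_return(ret=100.0, cost=30.0, lam=lam))
```

The design intent of such a rule is that the coefficient settles where the constraint is just satisfied, which avoids hand-tuning a fixed penalty for each cost threshold.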

Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

超越标量分数：机器翻译错误感知质量估计的强化学习

Authors: Archchana Sindhujan, Girish A. Koushik, Shenbin Qian, Diptesh Kanojia, Constantin Orăsan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08600
Pdf link: https://arxiv.org/pdf/2602.08600
Abstract Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA scores and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
中文摘要 质量估计（QE）旨在评估机器翻译（MT）输出的质量，而不依赖参考翻译，因此对现实世界的大规模机器翻译评估至关重要。大型语言模型（LLMs）在推动机器翻译质量估计领域展现出显著潜力。然而，大多数QE方法仅依赖标量质量评分，未明确给出应当驱动这些判断的翻译错误信息。此外，对于带标注QE数据有限的低资源语言，现有方法难以实现可靠的性能。为应对这些挑战，我们引入了首个面向英语至马拉雅拉姆语的片段级QE数据集，这是QE领域资源极度稀缺的语言对，包含人工标注的直接评估（DA）评分和翻译质量评注（TQR），后者是描述翻译错误的简短、结合上下文的自由形式标注者评论。我们进一步介绍了ALOPE-RL，一个基于策略的强化学习框架，基于由DA评分和TQR得出的策略奖励训练高效适配器。将错误感知奖励与ALOPE-RL结合，使大型语言模型能够超越数值分数来推理翻译质量。尽管在小规模的QE数据集上训练，ALOPE-RL使用紧凑的LLM（<=4B参数，经过LoRA和4位量化微调），在英语到马拉雅拉姆语QE上实现了最先进的性能，优于更大的基于LLM的基线和领先的基于编码器的QE模型。我们的结果表明，错误感知的、基于策略的学习可以在有限的数据和计算预算下实现强劲的QE性能。我们发布数据集、代码和训练模型，以支持未来的研究。

Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete and Hybrid Action Spaces

打破网格：大型离散与混合行动空间中的距离引导强化学习

Authors: Heiko Hoppe, Fabian Akkerman, Wouter van Heeswijk, Maximilian Schiffer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08616
Pdf link: https://arxiv.org/pdf/2602.08616
Abstract Reinforcement Learning is increasingly applied to logistics, scheduling, and recommender systems, but standard algorithms struggle with the curse of dimensionality in such large discrete action spaces. Existing algorithms typically rely on restrictive grid-based structures or computationally expensive nearest-neighbor searches, limiting their effectiveness in high-dimensional or irregularly structured domains. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to $10^{20}$ actions. Unlike prior methods, SDN leverages a semantic embedding space to perform stochastic volumetric exploration, provably providing full support over a local trust region. Complementing this, DBU transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality and guaranteeing monotonic policy improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces without requiring hierarchical dependencies. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and computational complexity.
中文摘要 强化学习越来越多地应用于物流、调度和推荐系统，但标准算法在如此庞大的离散动作空间中难以应对维度灾难。现有算法通常依赖于限制性的基于网格的结构或计算量高的最近邻搜索，限制了它们在高维或不规则结构域中的有效性。我们提出了距离引导强化学习（DGRL），结合采样动态邻域（SDN）和基于距离的更新（DBU），以在最多含$10^{20}$个动作的空间中实现高效的强化学习。与以往方法不同，SDN利用语义嵌入空间进行随机体积探索，可证明地在局部信任区域上提供完全支撑。此外，DBU将策略优化转化为稳定的回归任务，将梯度方差与动作空间基数解耦，并保证单调的策略改进。DGRL自然地推广到混合连续-离散动作空间，无需层级依赖关系。我们在规则和不规则结构的环境中，展示了相对最先进基准高达66%的性能提升，同时改善了收敛速度和计算复杂度。

High-Speed Vision-Based Flight in Clutter with Safety-Shielded Reinforcement Learning

基于安全屏蔽强化学习的杂乱环境高速视觉飞行

Authors: Jiarui Zhang, Chengyong Lei, Chengjiang Dai, Lijie Wang, Zhichao Han, Fei Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.08653
Pdf link: https://arxiv.org/pdf/2602.08653
Abstract Quadrotor unmanned aerial vehicles (UAVs) are increasingly deployed in complex missions that demand reliable autonomous navigation and robust obstacle avoidance. However, traditional modular pipelines often incur cumulative latency, whereas purely reinforcement learning (RL) approaches typically provide limited formal safety guarantees. To bridge this gap, we propose an end-to-end RL framework augmented with model-based safety mechanisms. We incorporate physical priors in both training and deployment. During training, we design a physics-informed reward structure that provides global navigational guidance. During deployment, we integrate a real-time safety filter that projects the policy outputs onto a provably safe set to enforce strict collision-avoidance constraints. This hybrid architecture reconciles high-speed flight with robust safety assurances. Benchmark evaluations demonstrate that our method outperforms both traditional planners and recent end-to-end obstacle avoidance approaches based on differentiable physics. Extensive experiments demonstrate strong generalization, enabling reliable high-speed navigation in dense clutter and challenging outdoor forest environments at velocities up to 7.5m/s.
中文摘要 四旋翼无人机（UAV）越来越多地被部署在需要可靠自主导航和鲁棒避障的复杂任务中。然而，传统的模块化流水线通常会产生累积延迟，而纯强化学习（RL）方法通常只能提供有限的形式化安全保障。为弥合这一差距，我们提出了一个端到端强化学习框架，辅以基于模型的安全机制。我们在训练和部署中都融入了物理先验。在训练过程中，我们设计了基于物理的奖励结构，提供全局导航引导。在部署过程中，我们集成了一个实时安全过滤器，将策略输出投影到可证明安全的集合上，以强制执行严格的碰撞避免约束。这种混合架构将高速飞行与强大的安全保障相结合。基准评估表明，我们的方法优于传统规划方法以及基于可微物理的最新端到端避障方法。大量实验证明了强有力的泛化能力，使得在密集杂乱环境和具有挑战性的户外森林环境中，以最高7.5米/秒的速度实现可靠的高速导航。
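
The deployment-time safety filter, which "projects the policy outputs onto a provably safe set", can be illustrated in its simplest form. The abstract does not describe the safe set, so this sketch assumes a single linear constraint n·u <= b (for example, a bound on the velocity component toward an obstacle); the function name and the constraint form are hypothetical.

```python
from typing import List, Sequence


def project_to_halfspace(u: Sequence[float], n: Sequence[float], b: float) -> List[float]:
    """Return the closest point to u (in Euclidean distance) satisfying n.u <= b."""
    violation = sum(ui * ni for ui, ni in zip(u, n)) - b
    if violation <= 0.0:
        # The policy output is already safe and passes through unchanged.
        return list(u)
    norm_sq = sum(ni * ni for ni in n)
    if norm_sq == 0.0:
        raise ValueError("constraint normal must be non-zero")
    # Move u along -n just far enough to land on the constraint boundary.
    step = violation / norm_sq
    return [ui - step * ni for ui, ni in zip(u, n)]


if __name__ == "__main__":
    # Policy commands 3 m/s along x, but the assumed constraint caps it at 2 m/s.
    print(project_to_halfspace([3.0, 1.0], [1.0, 0.0], 2.0))
```

With several simultaneous constraints the projection generally becomes a small quadratic program rather than a closed-form step; the single-constraint case is shown only because it can be written exactly.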

From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

从机器人到败血症治疗：通过几何悲观主义实现离线强化学习

Authors: Sarthak Wanjari
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08655
Pdf link: https://arxiv.org/pdf/2602.08655
Abstract Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly with fractured and sparse data. Solutions necessitate a trade-off between computational efficiency and performance. Methods like CQL offer rigorous conservatism but require tremendous compute power, while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to Behavioural Cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with a density-based penalty derived from k-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalties applied to each state-action pair, our method injects OOD conservatism via reward shaping with an O(1) training overhead. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL, outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points, while reducing inter-seed variance by 4x. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical care dataset. While standard IQL collapses to behaviour cloning, Geo-IQL demonstrates active policy improvement while maintaining safety constraints, achieving 86.4% terminal agreement with clinicians compared to IQL's 75%. Our results suggest that geometric pessimism provides the necessary regularisation to safely overcome local optima in critical, real-world decision systems.
中文摘要 离线强化学习（RL）有望从静态数据集中恢复最优策略，但仍容易高估分布外（OOD）动作，尤其是在破碎和稀疏的数据中。解决方案需要在计算效率和性能之间做出权衡。像CQL这样的方法提供严谨的保守性，但需要巨大的计算能力；而高效的基于期望分位数（expectile）的方法如IQL常常无法纠正病态数据集上的OOD错误，最终退化为行为克隆。本研究提出几何悲观主义，一种模块化、计算高效的框架，为标准IQL增加基于密度的惩罚，该惩罚由状态-动作嵌入空间中的k近邻距离导出。通过预先计算对每个状态-动作对施加的惩罚，我们的方法通过奖励塑形注入OOD保守性，训练开销仅为O(1)。在D4RL MuJoCo基准上评估，我们的方法Geo-IQL在敏感且不稳定的medium-replay任务中比标准IQL高出18分以上，同时将种子间方差降低了4倍。此外，Geo-IQL不会降低稳定流形上的性能。关键是，我们在MIMIC-III败血症重症护理数据集上验证了算法。标准IQL退化为行为克隆，而Geo-IQL则展示了积极的策略改进，并在保持安全约束的同时，与临床医生的终末一致率达到86.4%，而IQL仅为75%。我们的结果表明，几何悲观主义为在关键的现实决策系统中安全地克服局部最优提供了必要的正则化。
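
The core idea, a precomputed k-nearest-neighbour distance penalty subtracted from the reward, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Geo-IQL implementation: it works on raw state-action vectors rather than a learned embedding, uses the mean distance to the k nearest neighbours as the penalty, and the `scale` parameter and function names are hypothetical.

```python
import math
from typing import List, Sequence


def knn_penalties(points: Sequence[Sequence[float]], k: int, scale: float) -> List[float]:
    """Penalty per point: scaled mean distance to its k nearest other points."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    penalties = []
    for i, p in enumerate(points):
        # Brute-force O(n^2) search; done once before training, not per update.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        penalties.append(scale * sum(dists[:k]) / k)
    return penalties


def shape_rewards(rewards: Sequence[float], penalties: Sequence[float]) -> List[float]:
    """Reward shaping: isolated (low-density) samples receive lower reward."""
    return [r - p for r, p in zip(rewards, penalties)]


if __name__ == "__main__":
    # Three clustered state-action vectors and one isolated outlier.
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    pen = knn_penalties(data, k=2, scale=1.0)
    print(shape_rewards([1.0, 1.0, 1.0, 1.0], pen))
```

Because the penalties are computed once and stored alongside the dataset, the per-step training cost is a single subtraction, which matches the constant-overhead claim in the abstract.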

LLaDA2.1: Speeding Up Text Diffusion via Token Editing

LLaDA2.1：通过令牌编辑加快文本扩散

Authors: Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Yuan Lu, Yuxin Ma, Xingyu Mou, Zhenxuan Pan, Kaida Qiu, Yuji Ren, Jianfeng Tan, Yiding Tian, Zian Wang, Lanning Wei, Tao Wu, Yipeng Xing, Wentao Ye, Liangyu Zha, Tianze Zhang, Xiaolu Zhang, Junbo Zhao, Da Zheng, Hao Zhong, Wanli Zhong, Jun Zhou, Junlin Zhou, Liwang Zhu, Muzhi Zhu, Yihong Zhuang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08676
Pdf link: https://arxiv.org/pdf/2602.08676
Abstract While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performance with manageable efficiency degradation. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
中文摘要 虽然LLaDA2.0展示了100B级块扩散模型的扩展潜力及其固有的并行化，但解码速度与生成质量之间的微妙平衡仍是一个难以捉摸的前沿。今天，我们发布了LLaDA2.1，这是一个旨在超越这种权衡的范式转变。通过无缝将令牌到令牌（T2T）编辑融入传统的掩码到令牌（M2T）方案，我们引入了联合且可配置的阈值解码方案。这一结构创新催生了两种截然不同的人格：快速模式（S 模式），大胆降低 M2T 门槛以绕过传统约束，同时依靠 T2T 来优化输出;以及质量模式（Q模式），倾向于保守阈值以确保在可控效率下降的同时获得更优的基准性能。在这一发展基础上，基于宽广的上下文窗口，我们实现了首个专为dLLM量身定制的大规模强化学习（RL）框架，并以稳定梯度估计的专业技术为基础。这种对齐不仅提升了推理的精准度，也提升了指令遵循的忠实度，弥合了扩散动力学与复杂人类意图之间的鸿沟。我们以发布LLaDA2.1-Mini（16B）和LLaDA2.1-Flash（100B）为此工作做总结。在33项严格的基准测试中，LLaDA2.1实现了出色的任务性能和极快的解码速度。尽管其总量为100B，但在编码任务上，HumanEval+的TPS惊人地达到了892 TPS，BigCodeBench的801 TPS，LiveCodeBench的663 TPS。

Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning

通过逆强化学习从扩散模型中采样学习

Authors: Constant Bourdrez, Alexandre Vérine, Olivier Cappé
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08689
Pdf link: https://arxiv.org/pdf/2602.08689
Abstract Diffusion models generate samples through an iterative denoising process, guided by a neural network. While training the denoiser on real-world data is computationally demanding, the sampling procedure itself is more flexible. This adaptability serves as a key lever in practice, enabling improvements in both the quality of generated samples and the efficiency of the sampling process. In this work, we introduce an inverse reinforcement learning framework for learning sampling strategies without retraining the denoiser. We formulate the diffusion sampling procedure as a discrete-time finite-horizon Markov Decision Process, where actions correspond to optional modifications of the sampling dynamics. To optimize action scheduling, we avoid defining an explicit reward function. Instead, we directly match the target behavior expected from the sampler using policy gradient techniques. We provide experimental evidence that this approach can improve the quality of samples generated by pretrained diffusion models and automatically tune sampling hyperparameters.
中文摘要 扩散模型通过神经网络引导的迭代去噪过程生成样本。虽然在真实世界数据上训练去噪器计算量很大，但采样过程本身更为灵活。这种适应性成为实践中的关键杠杆，能够提升样本质量和采样过程的效率。本研究引入了一种逆强化学习框架，用于学习采样策略而不重新训练去噪器。我们将扩散采样过程表述为离散时间有限视界的马尔可夫决策过程，其中动作对应采样动力学的可选修改。为了优化动作调度，我们避免定义显式的奖励函数。相反，我们直接通过策略梯度技术匹配采样器预期的目标行为。我们提供了实验证据，表明该方法能够提升预训练扩散模型生成的样本质量，并自动调节采样超参数。

SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

SoK：深度强化学习在网络安全中的陷阱

Authors: Shae McFadden, Myles Foley, Elizabeth Bates, Ilias Tsingenopoulos, Sanyam Vyas, Vasilios Mavroudis, Chris Hicks, Fabio Pierazzi
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.08690
Pdf link: https://arxiv.org/pdf/2602.08690
Abstract Deep Reinforcement Learning (DRL) has achieved remarkable success in domains requiring sequential decision-making, motivating its application to cybersecurity problems. However, transitioning DRL from laboratory simulations to bespoke cyber environments can introduce numerous issues. This is further exacerbated by the often adversarial, non-stationary, and partially-observable nature of most cybersecurity tasks. In this paper, we identify and systematize 11 methodological pitfalls that frequently occur in DRL for cybersecurity (DRL4Sec) literature across the stages of environment modeling, agent training, performance evaluation, and system deployment. By analyzing 66 significant DRL4Sec papers (2018-2025), we quantify the prevalence of each pitfall and find an average of over five pitfalls per paper. We demonstrate the practical impact of these pitfalls using controlled experiments in (i) autonomous cyber defense, (ii) adversarial malware creation, and (iii) web security testing environments. Finally, we provide actionable recommendations for each pitfall to support the development of more rigorous and deployable DRL-based security systems.
中文摘要 深度强化学习（DRL）在需要顺序决策的领域取得了显著成功，推动其应用于网络安全问题。然而，将DRL从实验室模拟转向定制网络环境可能会带来诸多问题。大多数网络安全任务常常具有对抗性、非平稳性和部分可观测性，进一步加剧了这一问题。本文识别并系统化了网络安全DRL（DRL4Sec）文献中在环境建模、智能体训练、性能评估和系统部署等阶段中常见的11种方法学陷阱。通过分析66篇重要的DRL4Sec论文（2018-2025年），我们量化了每个陷阱的流行程度，发现每篇论文平均超过五个陷阱。我们通过受控实验展示了这些陷阱在（i）自主网络防御、（ii）对抗性恶意软件生成和（iii）Web安全测试环境中的实际影响。最后，我们为每个陷阱提供可操作的建议，以支持更严谨且可部署的基于DRL的安全系统开发。

Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

用于（隐藏模型）POMDP的有限状态控制器，使用深度强化学习

Authors: David Hudák, Maris F. L. Galesloot, Martin Tappler, Martin Kurečka, Nils Jansen, Milan Češka
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08734
Pdf link: https://arxiv.org/pdf/2602.08734
Abstract Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.
中文摘要 解决部分可观测的马尔可夫决策过程（POMDPs）需要在不完全状态信息下计算策略。尽管近期有所进展，现有POMDP求解器的可扩展性仍然有限。此外，许多设置要求在多个POMDP之间建立稳健的策略，进一步加剧了可扩展性问题。我们提出了用于 POMDP 求解的 Lexpop 框架。Lexpop（1）采用深度强化学习来训练神经策略，该策略由循环神经网络表示，（2）通过高效的提取方法构建一个有限状态控制器，模拟神经策略。关键是，与神经策略不同，这类控制器可以被正式评估，提供性能保证。我们扩展Lexpop以计算隐藏模型POMDP（HM-POMDPs）的稳健策略，这些策略描述有限的POMDP集合。我们会将每个提取的控制器与其最坏情况的 POMDP 关联起来。利用一组此类POMDPs，我们迭代训练一个稳健的神经策略，从而提取出一个稳健的控制器。我们的实验表明，在大状态空间问题中，Lexpop的表现优于最先进的POMDP和HM-POMDP求解器。

Bayesian Preference Learning for Test-Time Steerable Reward Models

测试时间可引导奖励模型的贝叶斯偏好学习

Authors: Jiwoo Hong, Shao Tang, Zhipeng Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.08819
Pdf link: https://arxiv.org/pdf/2602.08819
Abstract Reward models (RMs) are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
中文摘要 奖励模型对于通过强化学习（RL）将语言模型与人类偏好对齐起着核心作用。随着强化学习越来越多地应用于可验证奖励和多目标对齐等环境，期望 RM 编码更复杂、更多面的偏好分布。然而，分类器RM一旦训练后保持静态，限制了其测试时的适应性。我们提出了变分上下文奖励建模（ICRM），这是一种新颖的贝叶斯奖励建模目标，通过上下文偏好演示实现测试时的可引导性。ICRM将奖励建模归为在Bradley-Terry模型下，利用共轭Beta先验对潜在偏好概率进行摊销变分推断。我们表明ICRM在测试时适应单目标和多目标环境下的未见偏好分布。通过更多上下文演示，ICRM在单目标设置下SafeRLHF准确率提升34%，RM-Bench准确率提升9%，同时在帮助性和拒绝基准的超量提升4%，拓宽了帕累托边界。我们进一步研究了ICRM在强化学习训练中的实际应用性，表明它能够通过在数学推理中优于传统RM来有效编码可验证的奖励。最后，我们提供了变分目标具有有限置信度的理论保证，并分析了KL正则化如何减轻奖励过度优化。
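
The abstract relies on Beta conjugacy for a latent preference probability under the Bradley-Terry model. The sketch below shows only that textbook building block: a Beta prior updated with counts of observed preferences, and the reward gap that the posterior mean implies under Bradley-Terry. It is not ICRM's amortized variational objective, and all function names are hypothetical.

```python
import math

def beta_posterior(alpha: float, beta: float, wins: int, losses: int):
    """Conjugate update: Beta(alpha, beta) prior plus Bernoulli preference counts."""
    return alpha + wins, beta + losses

def posterior_mean(alpha: float, beta: float) -> float:
    """Expected probability that response A is preferred over response B."""
    return alpha / (alpha + beta)

def implied_reward_gap(p: float) -> float:
    """Under Bradley-Terry, p = sigmoid(r_A - r_B), so r_A - r_B = logit(p)."""
    return math.log(p / (1.0 - p))

# Uniform prior, then three demonstrations preferring A and one preferring B.
a_post, b_post = beta_posterior(1.0, 1.0, wins=3, losses=1)
p_mean = posterior_mean(a_post, b_post)
gap = implied_reward_gap(p_mean)
```

More demonstrations concentrate the posterior, which is one intuition for why additional in-context examples could make the steered reward more confident.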

VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

VideoVeritas：通过感知前提强化学习实现的AI生成视频检测

Authors: Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.08828
Pdf link: https://arxiv.org/pdf/2602.08828
Abstract The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.
中文摘要 视频生成能力的提升带来了不断升级的安全风险，使得可靠的检测变得愈发重要。本文介绍了VideoVeritas，一个整合细粒度感知与事实推理的框架。我们观察到，尽管当前多模态大型语言模型（MLLMs）展现出强大的推理能力，但其细粒感知能力仍然有限。为缓解这一问题，我们引入了联合偏好对齐与感知借口强化学习（PPRL）。具体来说，我们没有直接优化检测任务，而是在强化学习阶段采用通用时空基础和自我监督对象计数，通过简单的感知前提任务提升检测性能。为了促进分析的稳健性，我们进一步引入了MintVid，这是一个轻量但高质量的数据集，包含来自9个最先进生成器的3000个视频，以及一个真实收集但内容存在事实错误的子集。实验结果表明，现有方法往往偏向表面推理或机械分析，而VideoVeritas在不同基准测试中表现更为平衡。

Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

通过基于偏好的多目标强化学习学习社会的价值体系

Authors: Andrés Holgado-Sánchez, Peter Vamplew, Richard Dazeley, Sascha Ossowski, Holger Billhardt
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08835
Pdf link: https://arxiv.org/pdf/2602.08835
Abstract Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.
中文摘要 价值感知型人工智能应识别人类价值观，并适应不同用户的价值体系（基于价值的偏好）。这需要对价值观进行操作化，而这一过程容易出现设定错误。价值观的社会性质要求其表示需同时适用于多个用户，而价值体系虽多样，却在群体间表现出模式。在序贯决策中，已有工作尝试根据不同智能体的演示，针对不同目标或价值观实现个性化。然而，这些方法需要手动设计的特征，或缺乏基于价值的可解释性和/或对不同用户偏好的适应能力。我们提出了基于聚类和基于偏好的多目标强化学习（PbMORL）的算法，用于学习马尔可夫决策过程（MDP）中智能体社会的价值对齐模型和价值体系。我们共同学习源自社会的价值对齐模型（groundings）和一组简明代表社会中不同用户群体（簇）的价值体系。每个簇由一个代表其成员基于价值的偏好的价值体系和一个近似帕累托最优策略组成，该策略反映了与该价值体系相符的行为。我们在两个涉及人类价值观的MDP上，将我们的方法与最先进的PbMORL算法及基线进行了比较评估。

Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Dr. MAS：多智能体大型语言模型系统的稳定强化学习

Authors: Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08847
Pdf link: https://arxiv.org/pdf/2602.08847
Abstract Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
中文摘要 多智能体LLM系统通过角色专化实现高级推理和工具使用，但此类系统可靠的强化学习（RL）后训练仍然困难。在本研究中，我们从理论上指出了在将基于群体的强化学习扩展到多智能体大型语言模型系统时训练不稳定的关键原因。我们表明，在GRPO式优化下，全局归一化基线可能偏离各智能体不同的奖励分布，最终导致梯度范数不稳定。基于这一发现，我们提出了 Dr. MAS，一种简单且稳定的强化学习训练方案，适用于多智能体大型语言模型系统。Dr. MAS 采用了按智能体为单位的解决方案：利用每个智能体自身的奖励统计量对其优势进行归一化，这对梯度尺度进行了校准，并在理论和实证上显著稳定了训练。除了算法之外，Dr. MAS 还为多智能体 LLM 系统提供了端到端的强化学习训练框架，支持可扩展的编排、灵活的每智能体 LLM 服务与优化配置，以及 LLM actor 后端的共享资源调度。我们利用Qwen2.5和Qwen3系列模型评估了 Dr. MAS 在多智能体数学推理和多轮搜索基准测试上的表现。Dr. MAS 明显优于原版GRPO（例如数学上 avg@16 +5.6%、pass@16 +4.6%，搜索上 avg@16 +15.2%、pass@16 +13.1%），同时大幅消除梯度尖峰。此外，在异构智能体-模型分配下，它依然非常有效，同时提高了效率。
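
The agent-wise remedy in the abstract (normalizing advantages with each agent's own reward statistics instead of one global baseline) can be illustrated with a short sketch. The function names and the toy rewards below are assumptions for illustration only; the paper's full training recipe is not reproduced here.

```python
import numpy as np

def global_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style normalization with one mean and std shared by all agents."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def per_agent_advantages(rewards: np.ndarray, agent_ids: np.ndarray,
                         eps: float = 1e-8) -> np.ndarray:
    """Normalize each agent's rewards with that agent's own mean and std."""
    adv = np.empty_like(rewards, dtype=float)
    for agent in np.unique(agent_ids):
        mask = agent_ids == agent
        r = rewards[mask]
        adv[mask] = (r - r.mean()) / (r.std() + eps)
    return adv

# Two agents whose reward scales differ by a factor of ten (synthetic values).
rewards = np.array([0.9, 1.0, 1.1, 9.0, 10.0, 11.0])
agent_ids = np.array([0, 0, 0, 1, 1, 1])

adv_global = global_advantages(rewards)
adv_agent = per_agent_advantages(rewards, agent_ids)
```

With the global baseline, every sample from the low-reward agent receives a negative advantage regardless of its relative quality; the per-agent version gives both agents zero-mean, unit-scale advantages.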

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

AnomSeer：强化多模态大型语言模型以推理时间序列异常检测

Authors: Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08868
Pdf link: https://arxiv.org/pdf/2602.08868
Abstract Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.
中文摘要 基于多模态大型语言模型（MLLMs）的时间序列异常检测（TSAD）是一个新兴领域，但依然存在一个持续的挑战：MLLM依赖粗糙的时间序列启发式，但在多维且详细的推理方面存在困难，而这对于理解复杂的时间序列数据至关重要。我们提出AnomSeer，通过强化模型，使其推理基于时间序列的精确结构细节，统一异常分类、定位和解释，来应对这一问题。其核心是生成专家思维链轨迹，以从经典分析（如统计指标、频率变换）中提供可验证的细粒度推理。基于此，我们提出了一种新颖的以时间序列为依据的策略优化（TimerPO），除了标准强化学习外，还包含两个额外组成部分：基于最优传输的、以时间序列为依据的优势，以及正交投影，以确保这一辅助细粒度信号不会干扰主要检测目标。在多种异常场景中，基于 Qwen2.5-VL-3B/7B-Instruct 的 AnomSeer 在分类和定位精度上优于规模更大的商业基线（如 GPT-4o），尤其是在点异常和频率驱动的异常上。此外，它还能产生支持其结论的合理时间序列推理轨迹。

Efficient and Stable Reinforcement Learning for Diffusion Language Models

扩散语言模型的高效稳定强化学习

Authors: Jiawei Liu, Xiting Wang, Yuanyuan Zhong, Defu Lian, Yu Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08905
Pdf link: https://arxiv.org/pdf/2602.08905
Abstract Reinforcement Learning (RL) is crucial for unlocking the complex reasoning capabilities of Diffusion-based Large Language Models (dLLMs). However, applying RL to dLLMs faces unique challenges in efficiency and stability. To address these challenges, we propose Spatio-Temporal Pruning (STP), a framework designed to simultaneously improve the efficiency and stability of RL for dLLMs. STP compresses the redundancy in the generative process through: (1) spatial pruning, which constrains the exploration space using static priors; and (2) temporal pruning, which bypasses redundant late-stage refinement steps. Our theoretical analysis demonstrates that STP strictly reduces the variance of the log-likelihood estimation, thereby ensuring more stable policy updates. Extensive experiments demonstrate that STP surpasses state-of-the-art baselines in both efficiency and accuracy. Our code is available at this https URL.
中文摘要 强化学习（RL）对于解锁基于扩散的大型语言模型（dLLMs）复杂的推理能力至关重要。然而，将强化学习应用于dLLM在效率和稳定性方面面临独特挑战。为应对这些挑战，我们提出了时空剪枝（STP）框架，旨在同时提升dLLM强化学习的效率和稳定性。STP通过以下方式压缩生成过程中的冗余：（1）空间剪枝，它通过静态先验限制探索空间;以及（2）时间剪枝，它绕过冗余的后期细化步骤。我们的理论分析表明，STP严格降低了对数似然估计的方差，从而确保了策略更新更稳定。大量实验表明，STP在效率和准确性方面均超越了最先进的基线。我们的代码可在此 https URL 访问。

StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

StealthRL：多重检测器规避人工智能文本检测器的强化学习释义攻击

Authors: Suraj Ranganath, Atharv Ramesh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.08934
Pdf link: https://arxiv.org/pdf/2602.08934
Abstract AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, FastDetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at this https URL.
中文摘要 AI文本检测器面临关键的鲁棒性挑战：对抗性改写攻击，既保留语义又规避检测。我们介绍了StealthRL，一种强化学习框架，在现实对抗条件下对检测器的鲁棒性进行压力测试。StealthRL 在 Qwen3-4B 上使用组相对策略优化（Group Relative Policy Optimization，GRPO）与 LoRA 适配器，针对多检测器集成训练一个改写策略，优化一种在检测器规避与语义保持之间取得平衡的复合奖励。我们评估了六种攻击设置（M0-M5），针对三种检测器家族（RoBERTa、FastDetectGPT和Binoculars），在与安全相关的1%误报率操作点上进行。StealthRL实现近乎零的检出（平均TPR@1%FPR为0.001），平均AUROC从0.74降至0.27，攻击成功率达到99.9%。关键是，攻击可以迁移到训练中未见过的留出检测器家族，揭示的是共享的架构漏洞，而非特定检测器的脆弱性。我们还通过李克特评分进行基于LLM的质量评估，分析检测器得分分布以解释规避成功的原因，并为每个检测器提供带自助法置信区间的AUROC。我们的结果揭示了当前AI文本检测存在显著的鲁棒性缺口，并确立了StealthRL作为一种有原则的对抗性评估协议。代码和评估流程可在此 https URL 公开获取。
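
The headline metric in the abstract is the true positive rate at a 1% false positive rate (TPR@1%FPR). A minimal sketch of how such an operating point can be computed follows, assuming higher detector scores mean "more likely AI-generated" and that the threshold is calibrated on human-written text. The scores are synthetic and the function name is hypothetical; none of this is taken from the paper's evaluation pipeline.

```python
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """True positive rate on AI text at a threshold calibrated on human text."""
    # Threshold such that roughly target_fpr of human-written texts score above it.
    threshold = np.quantile(human_scores, 1.0 - target_fpr)
    return float(np.mean(np.asarray(ai_scores) > threshold))

# Synthetic detector scores (higher = "more likely AI"); not data from the paper.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=10_000)
ai_original = rng.normal(3.0, 1.0, size=10_000)      # well separated from human
ai_paraphrased = rng.normal(-0.5, 1.0, size=10_000)  # shifted into the human range

tpr_before = tpr_at_fpr(human, ai_original)
tpr_after = tpr_at_fpr(human, ai_paraphrased)
```

Fixing the false positive rate first is what makes the metric security-relevant: a detector is only useful if it rarely flags human text, so evasion is measured at that strict threshold.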

Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

多智能体强化学习中的量子纠缠协调学习

Authors: John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, George J. Pappas
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.08965
Pdf link: https://arxiv.org/pdf/2602.08965
Abstract The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
中文摘要 无法沟通是多智能体强化学习（MARL）协调的重大挑战。此前的研究曾探索通过共享随机性（有时以相关装置形式）关联本地政策，作为辅助去中心化决策的机制。相比之下，这项工作引入了首个训练MARL代理利用共享量子纠缠作为协调资源的框架，允许比单纯共享随机性更多的无通信相关策略类别。这一观点源于量子物理中一些已知的结果，这些结果表明，对于某些无通信的单轮合作博弈，共享量子纠缠使得策略的表现优于仅使用共享随机性的策略。在这种情况下，我们说存在量子优势。我们的框架基于一种新型可微策略参数化，实现对量子测量的优化，同时结合一种将联合策略分解为量子协调者和去中心化本地行为者的新型策略架构。为了说明我们提出方法的有效性，首先我们展示了可以纯粹从经验中学习到在单轮游戏中获得量子优势的策略，这些游戏被视为黑箱预言机。随后，我们展示了我们的机制如何在一个多智能体顺序决策问题中学习具有量子优势的策略，该问题以去中心化部分可观测的马尔可夫决策过程（Dec-POMDP）形式表述。
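
The abstract refers to single-round cooperative games in which shared entanglement beats shared randomness. A standard textbook instance is the CHSH game, chosen here as an assumption since the abstract does not name a specific game. The sketch computes the best classical win probability by enumerating deterministic strategies and the win probability of a well-known entangled strategy with fixed measurement angles; it does not use the paper's learned policy parameterization.

```python
import itertools
import numpy as np

def basis(theta: float) -> np.ndarray:
    """Real measurement basis rotated by theta; row i is the outcome-i vector."""
    return np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

def quantum_win_probability(alice_angles, bob_angles) -> float:
    """CHSH win probability: players win when a XOR b equals x AND y."""
    phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2.0)  # (|00> + |11>)/sqrt(2)
    total = 0.0
    for x, y in itertools.product([0, 1], repeat=2):      # uniform random inputs
        A, B = basis(alice_angles[x]), basis(bob_angles[y])
        for a, b in itertools.product([0, 1], repeat=2):  # joint outcomes
            if (a ^ b) == (x & y):
                amplitude = np.kron(A[a], B[b]) @ phi_plus
                total += 0.25 * amplitude ** 2
    return float(total)

def best_classical_win_probability() -> float:
    """Maximum over deterministic strategies; shared randomness cannot exceed it."""
    best = 0.0
    for a0, a1, b0, b1 in itertools.product([0, 1], repeat=4):
        wins = sum(((a0, a1)[x] ^ (b0, b1)[y]) == (x & y)
                   for x, y in itertools.product([0, 1], repeat=2))
        best = max(best, wins / 4)
    return best

classical = best_classical_win_probability()
quantum = quantum_win_probability(alice_angles=(0.0, np.pi / 4),
                                  bob_angles=(np.pi / 8, -np.pi / 8))
```

Here the classical optimum is 3/4, while the entangled strategy reaches cos²(π/8), roughly 0.854, with no communication between the players. This gap is what the abstract calls quantum advantage.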

Contraction Metric Based Safe Reinforcement Learning Force Control for a Hydraulic Actuator with Real-World Training

基于收缩度量的安全强化学习力控制，适用于液压执行器，采用真实世界训练

Authors: Lucca Maitan, Lucas Toschi, Cícero Zanette, Elisa G. Vergamini, Leonardo F. Santos, Thiago Boaventura
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.08977
Pdf link: https://arxiv.org/pdf/2602.08977
Abstract Force control in hydraulic actuators is notoriously difficult due to strong nonlinearities, uncertainties, and the high risks associated with unsafe exploration during learning. This paper investigates safe reinforcement learning (RL) for hydraulic force control with real-world training using contraction metric certificates. A data-driven model of a hydraulic actuator, identified from experimental data, is employed for simulation-based pretraining of a Soft Actor-Critic (SAC) policy that adapts the PI gains of a feedback-linearization (FL) controller. To reduce instability during online training, we propose a quadratic-programming (QP) contraction filter that leverages a learned contraction metric to enforce approximate exponential convergence of trajectories, applying minimal corrections to the policy output. The approach is validated on a hydraulic test bench, where the RL controller is trained directly on hardware and benchmarked against a simulation-trained agent and a fixed-gain baseline. Experimental results show that real-hardware training improves force-tracking performance compared to both alternatives, while the contraction filter mitigates chattering and instabilities. These findings suggest that contraction-based certificates can enable safe RL in high-force hydraulic systems, though robustness at extreme operating conditions remains a challenge.
中文摘要 液压执行器中的力控制因强非线性、不确定性以及学习过程中不安全探索带来的高风险而出了名地困难。本文研究了使用收缩度量证书、在真实世界中训练的液压力控制安全强化学习（RL）。基于实验数据辨识得到的数据驱动液压执行器模型，用于基于仿真的Soft Actor-Critic（SAC）策略预训练，该策略可调整反馈线性化（FL）控制器的PI增益。为减少在线训练中的不稳定性，我们提出了一种二次规划（QP）收缩滤波器，利用学习到的收缩度量来强制轨迹的近似指数收敛，并对策略输出施加最小修正。该方法在液压测试台上验证，强化学习控制器直接在硬件上训练，并与仿真训练的智能体和固定增益基线进行对比。实验结果表明，真实硬件训练相比这两种方案都能提升力跟踪性能，而收缩滤波器则能减轻抖振和不稳定性。这些发现表明，基于收缩的证书可以在高作用力液压系统中实现安全的强化学习，尽管在极端运行条件下的鲁棒性仍是一大挑战。
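
The QP filter in the abstract applies a minimal correction to the policy output so that a convergence condition holds. As a simplified illustration, assume the condition reduces to a single linear inequality `a @ u <= b` on the action `u`; the vector `a` and scalar `b` are hypothetical placeholders for whatever the learned contraction metric would supply. In that case the QP, minimizing the squared distance to the policy action under one halfspace constraint, has a closed-form solution: a projection onto the halfspace.

```python
import numpy as np

def min_correction_filter(u_rl: np.ndarray, a: np.ndarray, b: float) -> np.ndarray:
    """Solve min ||u - u_rl||^2 subject to a @ u <= b (assumes a is nonzero)."""
    violation = float(a @ u_rl - b)
    if violation <= 0.0:
        return u_rl.copy()  # constraint already satisfied: leave the action alone
    # Otherwise project onto the boundary a @ u = b along the direction of a.
    return u_rl - (violation / float(a @ a)) * a

a = np.array([1.0, 0.0])  # hypothetical constraint direction
b = 1.0                   # hypothetical constraint bound

u_safe = min_correction_filter(np.array([0.5, 3.0]), a, b)    # unchanged
u_fixed = min_correction_filter(np.array([2.0, 0.0]), a, b)   # pulled to boundary
```

A real contraction filter would typically involve state-dependent constraints and a general QP solver; the closed form above only conveys why the correction is minimal and why safe actions pass through untouched.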

iGRPO: Self-Feedback-Driven LLM Reasoning

iGRPO：自我反馈驱动的大型语言模型推理

Authors: Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09000
Pdf link: https://arxiv.org/pdf/2602.09000
Abstract Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
中文摘要 大型语言模型（LLMs）在解决复杂数学问题方面展现出潜力，但它们仍未能产出准确且一致的解决方案。强化学习（RL）是一个框架，用于将这些模型与任务特定的奖励对齐，从而提升整体质量和可靠性。组相对策略优化（GRPO）是一种高效、无价值函数的近端策略优化（PPO）替代方案，利用组内相对奖励归一化。我们介绍了迭代组相对策略优化（iGRPO），这是GRPO的两阶段扩展，通过模型生成的草稿增加了动态自条件。在第一阶段，iGRPO采样多个探索性草稿，并使用用于优化的同一标量奖励信号选择奖励最高的草稿。在第二阶段，它将这份最佳草稿附加到原始提示后，并对以草稿为条件的改进结果进行GRPO式更新，训练策略超越其之前最强的尝试。在相同的rollout预算下，iGRPO在各基础模型（如Nemotron-H-8B-Base-8K和DeepSeek-R1 Distilled）上持续优于GRPO，验证了其在多种推理基准测试上的有效性。此外，将iGRPO应用于在AceReason-Math上训练的OpenReasoning-Nemotron-7B，分别在AIME24和AIME25上取得了85.62%和79.64%的新的最先进成绩。消融实验进一步表明，这一改进封装能够推广到GRPO变体之外，受益于生成式评判器，并通过延迟熵坍缩改变学习动态。这些结果强调了迭代、基于自我反馈的强化学习在推进可验证数学推理中的潜力。
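
The two stages described in the abstract can be sketched at the data-flow level: pick the highest-reward draft, append it to the prompt, then compute group-relative advantages for the refinements. The helper names, the prompt template, and the example rewards are illustrative assumptions; sampling from a model and the policy update itself are omitted.

```python
import numpy as np

def select_best_draft(drafts, rewards):
    """Stage 1: keep the draft with the highest scalar reward."""
    return drafts[int(np.argmax(rewards))]

def build_refinement_prompt(prompt: str, best_draft: str) -> str:
    """Stage 2 input: original prompt followed by the best draft.

    The wording of this template is a hypothetical choice.
    """
    return f"{prompt}\n\nPrevious attempt:\n{best_draft}\n\nImprove on this attempt."

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: rewards standardized within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Stage 1: exploratory drafts with placeholder rewards.
drafts = ["x = 3", "x = 4", "x = 5"]
draft_rewards = [0.0, 1.0, 0.0]
best = select_best_draft(drafts, draft_rewards)

# Stage 2: refinements conditioned on the best draft, scored by the same reward.
refine_prompt = build_refinement_prompt("Solve 2x = 8.", best)
refinement_rewards = [1.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(refinement_rewards)
```

Because the same reward scores both drafts and refinements, no extra reward model is needed for the selection step in this sketch.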

WorldCompass: Reinforcement Learning for Long-Horizon World Models

世界指南针：长视界世界模型的强化学习

Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.09022
Pdf link: https://arxiv.org/pdf/2602.09022
Abstract This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
中文摘要 本研究提出了WorldCompass，一种针对长视野交互视频世界模型的新型强化学习（RL）后训练框架，使其能够基于交互信号更准确、更一致地探索世界。为了有效“引导”世界模型的探索，我们引入了三项针对自回归视频生成范式的核心创新：1）剪辑级推广策略：我们在单一目标剪辑上生成并评估多个样本，显著提升推展效率并提供细粒度的奖励信号。2）互补奖励函数：我们设计奖励功能，兼顾交互跟踪的准确性和视觉质量，提供直接监督并有效抑制奖励黑客行为。3）高效的强化学习算法：我们采用负向感知的微调策略，结合多种效率优化，高效且有效地提升模型容量。对SoTA开源世界模型WorldPlay的评估表明，WorldCompass在各种场景下显著提升了交互准确性和视觉真实度。

TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

TwinRL-VLA：数字孪生驱动强化学习，用于现实世界机器人操作

Authors: Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.09023
Pdf link: https://arxiv.org/pdf/2602.09023
Abstract Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy using digital twins to broaden the support of the data trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient and parallel online RL in the digital twin prior to deployment, effectively bridging the gap between offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
中文摘要 尽管具有强大的泛化能力，视觉-语言-动作（VLA）模型仍受限于专家演示的高成本和真实世界交互的不足。尽管在线强化学习（RL）在改进通用基础模型方面展现出潜力，但在现实环境中将强化学习应用于VLA操作仍受限于低探索效率和有限的探索空间。通过系统的真实世界实验，我们观察到在线强化学习的有效探索空间与监督式微调（SFT）的数据分布密切相关。基于这一观察，我们提出了TwinRL，一个数字孪生与真实世界协同的强化学习框架，旨在扩展并引导VLA模型的探索。首先，从智能手机拍摄的场景中高效重建高保真数字孪生，实现真实环境与模拟环境之间逼真的双向迁移。在SFT预热阶段，我们引入了利用数字孪生的探索空间扩展策略，以扩大数据轨迹分布的支撑集。基于这种增强的初始化，我们提出了一种模拟到现实的引导探索策略，以进一步加速在线强化学习。具体来说，TwinRL在部署前于数字孪生中高效且并行地进行在线强化学习，有效弥合了离线与在线训练阶段之间的差距。随后，我们利用高效的数字孪生采样识别出易失败但信息丰富的配置，用于指导在真实机器人上进行有针对性的人在回路rollout。在我们的实验中，TwinRL在真实演示覆盖的分布内区域和分布外区域均接近100%成功率，较以往的真实世界强化学习方法至少提速30%，且在四个任务上平均仅需约20分钟。

Keyword: diffusion policy

Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

面向长时程机器人操作中多模态动作消歧的轨迹聚焦扩散策略

Authors: Yuxuan Hu, Xiangyu Chen, Chuhao Zhou, Yuxi Liu, Gen Li, Jindou Jia, Jianfei Yang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.07388
Pdf link: https://arxiv.org/pdf/2602.07388
Abstract Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions, which leads to ambiguous predictions when policies are conditioned only on instantaneous observations, termed multi-modal action ambiguity (MA2). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when current observations alone are insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks exhibiting pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56 percent on tasks with multi-modal action ambiguity and by 86.11 percent under visual disturbances, while maintaining inference efficiency with only a 6.4 percent runtime increase. These results demonstrate that execution-trace conditioning offers a scalable and principled approach for robust long-horizon robotic manipulation within a single policy.
中文摘要 基于生成模型的策略通过从演示中学习动作分布，在基于模仿的机器人操作中表现优异。然而，在长时程任务中，视觉上相似的观测常常在不同执行阶段反复出现，却需要不同的动作，这导致当策略仅以瞬时观测为条件时，预测存在歧义，称为多模态动作歧义（MA2）。为应对这一挑战，我们提出了轨迹聚焦扩散策略（TF-DP），这是一个简单但有效的基于扩散的框架，明确地以机器人的执行历史作为动作生成的条件。TF-DP将历史运动表示为显式的执行轨迹，并将其投影到视觉观测空间，在仅凭当前观测不足时提供阶段感知的上下文。此外，由此产生的轨迹聚焦场强调与历史运动相关的任务相关区域，提升对背景视觉干扰的鲁棒性。我们在真实世界的机器人操作任务上评估TF-DP，这些任务表现出明显的多模态动作歧义和视觉杂乱的条件。实验结果显示，TF-DP提升了时间一致性和鲁棒性，在存在多模态动作歧义的任务上比原版扩散策略高出80.56%，在视觉干扰下高出86.11%，同时保持推理效率，运行时间仅增加6.4%。这些结果表明，以执行轨迹为条件，为在单一策略内实现稳健的长时程机器人操作提供了一种可扩展且有原则的方法。
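
The abstract says TF-DP projects the execution trace into the visual observation space but does not state how. As a rough sketch under stated assumptions, the snippet below uses a standard pinhole camera model to map 3D trace points to pixels and rasterize them into a binary mask. The function names, camera intrinsics, and trace points are hypothetical and are not taken from the paper.

```python
def project_trace(points_cam, fx, fy, cx, cy):
    """Project 3D points (camera frame) to pixel coordinates with a pinhole model."""
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:  # point is behind the camera, so it is not visible
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels

def rasterize(pixels, width, height):
    """Mark the projected trace points in a binary mask the same size as the image."""
    mask = [[0] * width for _ in range(height)]
    for u, v in pixels:
        col, row = int(round(u)), int(round(v))
        if 0 <= col < width and 0 <= row < height:
            mask[row][col] = 1
    return mask

# Hypothetical end-effector trace in the camera frame; the last point is behind the camera.
trace = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.1, 0.1, 2.0), (0.0, 0.0, -1.0)]
pixels = project_trace(trace, fx=100.0, fy=100.0, cx=32.0, cy=24.0)
mask = rasterize(pixels, width=64, height=48)
```

A mask like this could be stacked with the RGB observation as an extra channel, giving a policy a cue about which stage of the task it is in even when the current frame looks the same as an earlier one.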

STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction

STEP：基于时空一致性预测的热启动视觉运动策略

Authors: Jinhao Li, Yuxuan Cong, Yingqiao Wang, Hao Xia, Shan Huang, Yijia Zhang, Ningyi Xu, Guohao Dai
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.08245
Pdf link: https://arxiv.org/pdf/2602.08245
Abstract Diffusion policies have recently emerged as a powerful paradigm for visuomotor control in robotic manipulation due to their ability to model the distribution of action sequences and capture multimodality. However, iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems. Existing acceleration methods either reduce sampling steps, bypass diffusion through direct prediction, or reuse past actions, but often struggle to jointly preserve action quality and achieve consistently low latency. In this work, we propose STEP, a lightweight spatiotemporal consistency prediction mechanism to construct high-quality warm-start actions that are both distributionally close to the target action and temporally consistent, without compromising the generative capability of the original diffusion policy. Then, we propose a velocity-aware perturbation injection mechanism that adaptively modulates actuation excitation based on temporal action variation to prevent execution stall especially for real-world tasks. We further provide a theoretical analysis showing that the proposed prediction induces a locally contractive mapping, ensuring convergence of action errors during diffusion refinement. We conduct extensive evaluations on nine simulated benchmarks and two real-world tasks. Notably, STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks, respectively. These results demonstrate that STEP consistently advances the Pareto frontier of inference latency and success rate over existing methods.
中文摘要 扩散策略近年来成为机器人操作中视觉运动控制的强大范式，因其能够建模动作序列的分布并捕捉多模态性。然而，迭代去噪会导致显著的推理延迟，限制实时闭环系统的控制频率。现有加速方法要么减少采样步数，要么通过直接预测绕过扩散，要么复用过去的动作，但往往难以同时保持动作质量并实现持续的低延迟。本研究提出STEP，一种轻量级时空一致性预测机制，用于构建高质量的热启动动作，使其既在分布上接近目标动作又在时间上一致，同时不影响原始扩散策略的生成能力。随后，我们提出了一种速度感知扰动注入机制，能够根据动作的时间变化自适应调节执行激励，以防止执行停滞，尤其是在真实世界任务中。我们还提供了理论分析，表明所提出的预测会诱导局部收缩映射，确保扩散细化过程中动作误差的收敛。我们在九项模拟基准和两项真实世界任务上进行了广泛评估。值得注意的是，仅用2步的STEP在RoboMimic基准和真实世界任务上的成功率分别比BRIDGER和DDIM平均高出21.6%和27.5%。这些结果表明，STEP相较现有方法持续推进了推理延迟与成功率的帕累托前沿。
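
The warm-start and contraction claims in this abstract can be made concrete with a toy example. The sketch below is not STEP itself. It assumes each refinement step shrinks the action error by a fixed factor, which stands in for the locally contractive mapping mentioned in the abstract. The `rate`, `target`, and starting actions are hypothetical values.

```python
def refine(action, target, steps, rate=0.5):
    """Apply `steps` refinement updates, each shrinking the error by `rate`.

    A fixed contraction toward `target` is a toy stand-in for a denoiser.
    """
    for _ in range(steps):
        action = [t + rate * (a - t) for a, t in zip(action, target)]
    return action

def max_error(action, target):
    """Largest per-dimension absolute deviation from the target action."""
    return max(abs(a - t) for a, t in zip(action, target))

target = [0.50, -0.20]       # hypothetical ideal action
cold_start = [2.00, 1.50]    # stands in for a pure-noise initialization
warm_start = [0.45, -0.25]   # stands in for a predicted, temporally consistent action

cold_err = max_error(refine(cold_start, target, steps=2), target)
warm_err = max_error(refine(warm_start, target, steps=2), target)
```

Under this assumption, the error after n steps is rate^n times the initial error, so with the same two-step budget the warm start ends much closer to the target than the cold start. This illustrates why a good initial action can offset a small number of denoising steps.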