Arxiv Papers of Today

Generated: 2026-02-06 16:46:32 (UTC+8); arXiv announcement time: 2026-02-06 20:00 EST (2026-02-07 09:00 UTC+8)

52 relevant papers today

Keyword: reinforcement learning

Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog

逐步压缩大型语言模型，像煮青蛙一样推理

Authors: Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04919
Pdf link: https://arxiv.org/pdf/2602.04919
Abstract Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in model performance in reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the compression process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss. Experimental results show that PTL can compress LLMs to nearly half their original size with only lightweight post-training, while maintaining performance comparable to the original model on reasoning tasks. Moreover, PTL is flexible and can be applied to various pruning strategies, such as neuron pruning and layer pruning, as well as different post-training methods, including continual pre-training and reinforcement learning. Additionally, experimental results confirm the effectiveness of PTL on a variety of tasks beyond mathematical reasoning, such as code generation, demonstrating its broad applicability.
中文摘要 大型语言模型（LLM）已展现出令人印象深刻的推理能力，但其庞大的体积往往需要大量计算资源。为了减少资源消耗并加快推断，必须在不牺牲性能的情况下消除冗余参数。然而，直接去除这些参数的传统剪枝方法常常导致推理任务中的模型性能大幅下降，需要大量后期训练以恢复失去的能力。在本研究中，我们提出了一种渐进式压缩方法，将压缩过程划分为多个细粒度迭代，在每个阶段应用修剪-调优环（PTL），以逐步缩小模型规模，同时通过微调恢复性能。这种迭代方法类似于“沸青蛙效应”，使模型能够逐步压缩而不会突然失去性能。实验结果表明，PTL仅需轻量化后训练即可将LLM压缩到原始大小的近一半，同时在推理任务中保持与原始模型相当的性能。此外，PTL具有灵活性，可以应用于多种剪枝策略，如神经元修剪和层次修剪，以及不同的训练后方法，包括持续的预训练和强化学习。此外，实验结果也证实了PTL在数学推理之外的多种任务（如代码生成）上的有效性，展示了其广泛适用性。
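
The Prune-Tune Loop in the abstract above alternates small pruning steps with recovery finetuning. The following is a minimal sketch of that schedule on a toy weight list, not the authors' implementation. It assumes magnitude-based pruning and an even per-stage schedule, neither of which the abstract specifies; `prune_smallest`, `finetune`, and `prune_tune_loop` are hypothetical names, and `finetune` is only a placeholder for a real training step.

```python
def prune_smallest(weights, mask, n_remove):
    """Deactivate the n_remove smallest-magnitude weights that are still active."""
    active = [i for i, keep in enumerate(mask) if keep]
    active.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in active[:n_remove]:
        new_mask[i] = False
    return new_mask


def finetune(weights, mask):
    """Placeholder recovery step (assumption): a real loop would train here.

    It only zeroes the pruned weights so the mask is visible in the output.
    """
    return [w if keep else 0.0 for w, keep in zip(weights, mask)]


def prune_tune_loop(weights, target_sparsity, n_stages):
    """Reach target_sparsity in n_stages small prune-then-tune steps."""
    mask = [True] * len(weights)
    total_remove = int(len(weights) * target_sparsity)
    removed = 0
    history = []  # number of active weights after each stage
    for stage in range(1, n_stages + 1):
        # Cumulative removal goal for this stage (even schedule, an assumption).
        goal = total_remove * stage // n_stages
        mask = prune_smallest(weights, mask, goal - removed)
        removed = goal
        weights = finetune(weights, mask)
        history.append(sum(mask))
    return weights, mask, history
```

Calling `prune_tune_loop(weights, 0.5, 4)` removes half of the weights over four stages instead of in one shot, which is the gradual behaviour the abstract contrasts with direct pruning.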

Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics

上低音号：通过过程奖励梯度引导随机动力学进行引导视频流匹配

Authors: Ruizhe Zhong, Jiesong Lian, Xiaoyue Mi, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Junchi Yan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.04928
Pdf link: https://arxiv.org/pdf/2602.04928
Abstract While online Reinforcement Learning has emerged as a crucial technique for aligning flow matching models with human preferences, current approaches are hindered by inefficient exploration during training rollouts. Relying on undirected stochasticity and sparse outcome rewards, these methods struggle to discover high-reward samples, resulting in data-inefficient and slow optimization. To address these limitations, we propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled Stochastic Differential Equation that explicitly incorporates the gradient of a Process Reward Model into the flow drift. This design enables dense, step-by-step steering toward high-reward regions, advancing beyond the unguided exploration in prior works, and theoretically encompasses existing sampling methods (e.g., Flow-GRPO, DanceGRPO) as special cases. We further derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model. We instantiate this framework with a Dual-Reward Group Relative Policy Optimization algorithm, combining latent process rewards for efficient credit assignment with pixel-level outcome rewards for final visual fidelity. Experiments on text-to-video generation show that Euphonium achieves better alignment compared to existing methods while accelerating training convergence by 1.66x.
中文摘要 尽管在线强化学习已成为将流程匹配模型与人类偏好对齐的关键技术，但当前方法因培训推广过程中探索效率低下而受阻。这些方法依赖无向随机性和稀疏的结果奖励，难以发现高奖励样本，导致数据效率低下且优化缓慢。为解决这些局限性，我们提出了Euphonium，一种通过过程奖励梯度引导动态引导生成的新框架。我们的核心见解是将采样过程表述为一个理论原理的随机微分方程，明确将过程奖励模型的梯度纳入流漂移中。该设计允许密集的逐步引导，朝向高回报区域，超越以往工作的无引导探索，理论上涵盖现有采样方法（如Flow-GRPO、DanceGRPO）作为特殊情况。我们进一步推导出一个蒸馏目标，将引导信号内化到流网络中，消除对奖励模型的推断时间依赖。我们用双重奖励组相对策略优化算法实现该框架，结合潜在过程奖励以高效分配信用，与像素级结果奖励以实现最终视觉真实度。文本转视频生成的实验表明，Euphonium 比现有方法实现更好的对齐，同时将训练收敛速度加速了 1.66 倍。
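
The key idea in the abstract above is a sampling SDE whose drift includes the gradient of a process reward. The sketch below illustrates that idea with an Euler-Maruyama integrator on a one-dimensional toy problem. Everything concrete here is an assumption for illustration: the velocity field `velocity`, the quadratic reward behind `reward_grad`, the guidance weight, and the noise scale are hypothetical and are not taken from the paper.

```python
import math
import random


def velocity(x, t):
    """Toy stand-in for a learned flow velocity field (hypothetical)."""
    return -x


def reward_grad(x, target=2.0):
    """Gradient of the toy process reward r(x) = -(x - target)**2."""
    return -2.0 * (x - target)


def guided_sde_sample(x0, guidance, sigma, n_steps=100, dt=0.05, seed=0):
    """Euler-Maruyama integration of dx = (v + guidance * grad r) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    for k in range(n_steps):
        t = k * dt
        drift = velocity(x, t) + guidance * reward_grad(x)
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift * dt + noise
    return x
```

With `guidance=0.0` the sampler reduces to the unguided stochastic dynamics, which loosely mirrors the abstract's remark that existing samplers arise as special cases. With a positive guidance weight, each step is pushed toward the higher-reward region.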

Privileged Information Distillation for Language Models

语言模型的特权信息蒸馏

Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.04942
Pdf link: https://arxiv.org/pdf/2602.04942
Abstract Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. We also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperform industry standard practices (supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.
中文摘要 训练时间特权信息（PI）能够使语言模型在原本可能失败的任务中取得成功，使其成为在艰难、长期视野环境中强化学习的强大工具。然而，将通过PI学习的能力转移到必须在推理时无需PI作用的策略仍然是根本性挑战。我们在多回合代理环境中提炼前沿模型的背景下研究该问题，封闭源系统通常隐藏其内部推理，只暴露行动轨迹。这破坏了标准的蒸馏流程，因为成功行为是可观察的，但推理过程却不可见。为此，我们引入了π-Distill，这是一个师生联合目标，同时使用同一模型培训一名PI条件教师和一名非条件学生。此外，我们还引入了策略自提炼（OPSD），这是一种采用强化学习（RL）训练的替代方法，学生与PI条件教师之间存在反向的KL惩罚。我们展示了这两种算法都有效利用仅动作PI提炼了前沿代理。具体来说，我们发现π-Distill以及在某些情况下的OPSD表现优于行业标准实践（监督式微调后是强化学习），后者假设在多个代理基准、模型和PI形式中获得完整的Chain-of-Thought监督。我们通过广泛的分析补充结果，重点分析促成PI有效学习的因素，主要聚焦于π-Distill，并评估OPSD的竞争力。
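
The OPSD description above mentions RL with a reverse KL penalty between the student and the PI-conditioned teacher. The sketch below shows one way such a penalty could be computed for discrete action distributions. It assumes "reverse KL" means KL(student || teacher), with the expectation taken under the student, and it assumes the penalty is subtracted from the task reward with a weight `beta`; the abstract states neither detail. `reverse_kl` and `penalized_reward` are hypothetical names.

```python
import math


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for two discrete distributions over the same actions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(student, teacher)
        if p > 0.0
    )


def penalized_reward(task_reward, student, teacher, beta=0.1):
    """Task reward minus a KL penalty (the combination rule is an assumption)."""
    return task_reward - beta * reverse_kl(student, teacher)
```

Because the expectation is under the student, the penalty is large when the student puts mass on actions the teacher considers unlikely, which tends to make the student mode-seeking rather than mass-covering.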

Laplacian Representations for Decision-Time Planning

决策时间规划中的拉普拉斯表示

Authors: Dikshant Shehmar, Matthew Schlegel, Matthew E. Taylor, Marlos C. Machado
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05031
Pdf link: https://arxiv.org/pdf/2602.05031
Abstract Planning with a learned model remains a key challenge in model-based reinforcement learning (RL). In decision-time planning, state representations are critical as they must support local cost computation while preserving long-horizon structure. In this paper, we show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales. This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons. Building on these properties, we introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.
中文摘要 基于模型强化学习（RL）的规划仍然是关键挑战。在决策时规划中，状态表示至关重要，因为它们必须支持局部成本计算，同时保持长视角结构。本文展示了拉普拉斯表示通过捕捉多时间尺度的状态-空间距离，为规划提供了有效的潜在空间。这种表示法保留了有意义的距离，并自然地将长视界问题分解为子目标，同时也减轻了在长预测视野上产生的复合误差。基于这些特性，我们介绍了ALPS，一种分层规划算法，并证明其在OGBench中一系列离线目标条件强化学习任务中表现优于常用基线，该基准此前被无模型方法主导。

ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

ReFORM：通过噪声操控实现支撑集内离线强化学习的反射流

Authors: Songyuan Zhang, Oswin So, H. M. Sabbir Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, Chuchu Fan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05051
Pdf link: https://arxiv.org/pdf/2602.05051
Abstract Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow while keeping the support, to maximize the performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.
中文摘要 离线强化学习（RL）旨在从由行为策略生成的固定数据集中学习最优策略，而无需额外的环境交互。在这种情况下，一个常见的挑战是分布外（out-of-distribution，OOD）错误，即策略离开训练分布时发生。以往的方法惩罚统计距离项以保持策略接近行为策略，但这限制了策略改进，可能无法完全阻止OOD动作。另一个挑战是，最优策略分布可能多模态且难以表示。近期研究应用了扩散或流策略来解决这个问题，但目前尚不清楚如何在保持策略表达性的同时避免OOD错误。我们提出了ReFORM，这是一种基于流策略的离线强化学习方法，在构造上强制满足限制更少的支撑集约束。ReFORM 学习带有有界源分布的行为克隆（BC）流策略以捕捉动作分布的支撑集，然后优化反射流，为BC流生成有界噪声并保持支撑集不变，以最大化性能。在来自OGBench基准的40项具有挑战性的任务中，使用质量不一的数据集，且所有任务均使用同一组超参数，ReFORM在性能剖面曲线上全面优于所有经过手工调参的基线。

Optimizing Mission Planning for Multi-Debris Rendezvous Using Reinforcement Learning with Refueling and Adaptive Collision Avoidance

利用加油和自适应碰撞避免的强化学习优化多碎片会合任务规划

Authors: Agni Bandyopadhyay, Gunther Waxenegger-Wilfing
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Space Physics (physics.space-ph)
Arxiv link: https://arxiv.org/abs/2602.05075
Pdf link: https://arxiv.org/pdf/2602.05075
Abstract As the orbital environment around Earth becomes increasingly crowded with debris, active debris removal (ADR) missions face significant challenges in ensuring safe operations while minimizing the risk of in-orbit collisions. This study presents a reinforcement learning (RL) based framework to enhance adaptive collision avoidance in ADR missions, specifically for multi-debris removal using small satellites. Small satellites are increasingly adopted due to their flexibility, cost effectiveness, and maneuverability, making them well suited for dynamic missions such as ADR. Building on existing work in multi-debris rendezvous, the framework integrates refueling strategies, efficient mission planning, and adaptive collision avoidance to optimize spacecraft rendezvous operations. The proposed approach employs a masked Proximal Policy Optimization (PPO) algorithm, enabling the RL agent to dynamically adjust maneuvers in response to real-time orbital conditions. Key considerations include fuel efficiency, avoidance of active collision zones, and optimization of dynamic orbital parameters. The RL agent learns to determine efficient sequences for rendezvousing with multiple debris targets, optimizing fuel usage and mission time while incorporating necessary refueling stops. Simulated ADR scenarios derived from the Iridium 33 debris dataset are used for evaluation, covering diverse orbital configurations and debris distributions to demonstrate robustness and adaptability. Results show that the proposed RL framework reduces collision risk while improving mission efficiency compared to traditional heuristic approaches. This work provides a scalable solution for planning complex multi-debris ADR missions and is applicable to other multi-target rendezvous problems in autonomous space mission planning.
中文摘要 随着地球轨道环境日益充斥碎片，主动碎片清除（ADR）任务面临重大挑战，以确保安全运行同时降低轨道碰撞风险。本研究提出了基于强化学习（RL）的框架，旨在增强ADR任务中的自适应碰撞避免能力，特别是针对利用小型卫星进行多碎片清除。由于其灵活性、成本效益高且机动性强，小型卫星越来越被广泛采用，非常适合动态任务如ADR。该框架基于现有的多碎片会合工作，整合了加注燃料策略、高效任务规划和自适应碰撞避免，以优化航天器会合作。该方法采用了掩蔽的近端策略优化（PPO）算法，使强化学习代理能够根据实时轨道条件动态调整机动。关键考虑因素包括燃油效率、避免主动碰撞区以及动态轨道参数的优化。RL代理学习如何确定与多个碎片目标会合的高效顺序，优化燃料使用和任务时间，同时纳入必要的加油停靠。用于评估的基于Iridium 33碎片数据集的模拟ADR场景，涵盖多样轨道构型和碎片分布，以展示其稳健性和适应性。结果显示，所提出的强化学习框架相比传统启发式方法降低碰撞风险，同时提升任务效率。该工作为复杂的多碎片ADR任务规划提供了可扩展的解决方案，并适用于自主航天任务规划中的其他多目标交会问题。
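
The abstract above names a masked Proximal Policy Optimization algorithm but does not describe how the mask is applied. A common realization of action masking, shown here as an assumption rather than the paper's method, gives invalid actions zero probability before sampling. `masked_softmax` is a hypothetical helper, and the example of an invalid action (say, a debris target that was already visited) is illustrative only.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over policy logits where invalid actions get probability zero."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `masked_softmax([1.0, 2.0, 3.0], [True, False, True])` renormalizes over the first and third actions only, so the agent can never sample the masked one.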

Reinforcement Learning Enhancement Using Vector Semantic Representation and Symbolic Reasoning for Human-Centered Autonomous Emergency Braking

利用矢量语义表示和符号推理实现以人为中心的自主紧急制动的强化学习增强

Authors: Vinal Asodia, Iman Sharifi, Saber Fallah
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05079
Pdf link: https://arxiv.org/pdf/2602.05079
Abstract The problem with existing camera-based Deep Reinforcement Learning approaches is twofold: they rarely integrate high-level scene context into the feature representation, and they rely on rigid, fixed reward functions. To address these challenges, this paper proposes a novel pipeline that produces a neuro-symbolic feature representation that encompasses semantic, spatial, and shape information, as well as spatially boosted features of dynamic entities in the scene, with an emphasis on safety-critical road users. It also proposes a Soft First-Order Logic (SFOL) reward function that balances human values via a symbolic reasoning module. Here, semantic and spatial predicates are extracted from segmentation maps and applied to linguistic rules to obtain reward weights. Quantitative experiments in the CARLA simulation environment show that the proposed neuro-symbolic representation and SFOL reward function improved policy robustness and safety-related performance metrics compared to baseline representations and reward formulations across varying traffic densities and occlusion levels. The findings demonstrate that integrating holistic representations and soft reasoning into Reinforcement Learning can support more context-aware and value-aligned decision-making for autonomous driving.
中文摘要 现有基于摄像头的深度强化学习方法存在两个问题：它们很少将高层场景上下文整合到特征表示中，且依赖于僵化、固定的奖励函数。为应对这些挑战，本文提出了一种新颖的流程，能够生成一种涵盖语义、空间和形状信息的神经符号特征表示，以及场景中动态实体的空间增强特征，重点关注安全关键的道路使用者。它还提出了一种软一阶逻辑（SFOL）奖励函数，通过符号推理模块平衡人类价值观。在这里，从切段映射中提取语义和空间谓词，并应用于语言规则以获得奖励权重。CARLA模拟环境中的定量实验表明，所提出的神经符号表示和SFOL奖励函数相比基线表示和奖励表述，在不同交通密度和阻塞水平下，提升了策略的鲁棒性和安全相关性能指标。研究结果表明，将整体表征和软推理融入强化学习，可以支持更具情境意识和价值契合的自动驾驶决策。

Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning

警惕不可信模拟器——强化学习中的无奖励后门攻击

Authors: Ethan Rathbun, Wo Wei Lin, Alina Oprea, Christopher Amato
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05089
Pdf link: https://arxiv.org/pdf/2602.05089
Abstract Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited in their strong threat models, assuming the adversary has near full control over an agent's training pipeline, enabling them to both alter and observe agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack "Daze" which is able to reliably and stealthily implant backdoors into RL agents trained for real world tasks without altering or even observing their rewards. We provide formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real, robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.
中文摘要 模拟环境是强化学习（RL）成功的关键环节，使从业者和研究人员能够在无需在真实硬件上进行昂贵实验的情况下训练决策代理。然而，模拟器依然是安全盲区，使得对抗性开发者能够为恶意目的改变其已发布模拟器的动态。因此，本文强调了一个新威胁，展示了如何利用模拟器动态，悄无声息地在强化学习代理中植入动作级后门。后门允许对手在观察到预设“触发”后可靠地激活智能体的定向动作，从而引发潜在的危险后果。传统的后门攻击在强威胁模型上有限，假设对手几乎完全控制了智能体的训练流程，从而能够改变并观察智能体的奖励。由于这些假设在模拟器中难以实现，我们提出了一种新攻击“Daze”，能够可靠且隐蔽地在训练现实任务的强化智能体中植入后门，而无需改变或观察其奖励。我们正式证明了Daze在一般强化学习任务中保证攻击成功的有效性，并对离散和连续动作空间域进行了大量实证评估。我们还提供了首个强化学习后门攻击转移到真实机器人硬件的例子。这些进展推动了进一步研究，保护强化学习培训流程的所有组成部分，以防止恶意攻击。

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

EBPO：经验贝叶斯缩减以稳定群体相对政策优化

Authors: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05165
Pdf link: https://arxiv.org/pdf/2602.05165
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明能有效提升大型语言模型（LLMs）的推理能力。然而，像群相对策略优化（GRPO）这样的主流方法面临关键的稳定性挑战：它们在计算约束下（组规模小）下估计量方差较高，以及在饱和失败状态下梯度信号消失，所有响应均为零奖励。为此，我们提出了经验贝叶斯政策优化（EBPO），这是一种新颖框架，通过借用政策累积的全球统计数据来规范基于局部群体的基线。EBPO不单独估计基线，而是采用缩减估计器，动态平衡局部群体统计数据与通过Welford在线算法更新的全局先验数据。理论上，我们证明EBPO在失效场景下保证了均方误差（MSE）、有界熵衰减和惩罚信号的严格降低，相较GRPO更低。从实证来看，EBPO在包括AIME和OlympiadBench在内的多项基准指标中持续优于GRPO及其他既定基线。值得注意的是，EBPO展现出卓越的训练稳定性，即使在小团体规模下也能实现高绩效提升，并且显著受益于难度分层课程学习。
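
The EBPO abstract above combines a local group baseline with a global prior tracked by Welford's online algorithm. The sketch below shows the general shape of that idea. The Welford update is standard; the shrinkage rule, a pseudo-count weight `kappa`, is an assumption chosen for illustration and is not the estimator from the paper. `RunningStats` and `shrunk_advantages` are hypothetical names.

```python
class RunningStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.count if self.count else 0.0


def shrunk_advantages(group_rewards, global_stats, kappa=4.0):
    """Advantages against a baseline shrunk toward the global mean.

    kappa acts as a pseudo-count for the prior (an illustrative assumption).
    """
    n = len(group_rewards)
    local_mean = sum(group_rewards) / n
    baseline = (n * local_mean + kappa * global_stats.mean) / (n + kappa)
    return [r - baseline for r in group_rewards]
```

With a purely local baseline, a group whose responses all score zero yields all-zero advantages. Under this sketch the same group gets negative advantages whenever the global mean is positive, which matches the non-vanishing penalty signal the abstract describes for failure regimes.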

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

基于LLM的数据可解释性，用于基于LLM的多智能体强化学习

Authors: John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05183
Pdf link: https://arxiv.org/pdf/2602.05183
Abstract Large language models (LLMs) are increasingly trained in complex Reinforcement Learning, multi-agent environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs, alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, along with most LLM generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses are predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent's system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM-summarizer provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
中文摘要 大型语言模型（LLM）越来越多地在复杂的强化学习多代理环境中训练，这使得理解行为如何随着训练而变化变得困难。稀疏自编码器（SAE）最近被证明对数据中心的解释性具有实用性。本研究通过应用预训练的SAE和LLM总结方法，分析了从Full-Press Diplomacy复杂环境中进行的大规模强化学习训练运行。我们介绍了Meta-Autointerp，这是一种将SAE特征归类为训练动力学可解释假设的方法。我们发现了包括角色扮演模式、退化输出、语言切换，以及高级战略行为和环境特定漏洞在内的细粒度行为。通过自动评估，我们验证了90%发现的SAE元特征具有显著性，并发现了令人惊讶的奖励黑客行为。然而，通过两项用户研究，我们发现即使是主观上有趣且看似有用的SAE特性，对人类来说可能比无用还要糟糕，而且大多数大型语言模型生成的假设也是如此。然而，SAE衍生的部分假设对后续任务具有预测性价值。我们还通过增强未受过训练的智能体的系统提示，进一步提升了验证，使得分提升了+14.2%。总体而言，我们表明SAE和LLM总结器提供了对代理行为的互补视角，我们的框架共同构成了未来以数据为中心的可解释性工作，确保LLM在整个训练过程中行为可信度的实用起点。

MobileManiBench: Simplifying Model Verification for Mobile Manipulation

MobileManiBench：简化移动操作的模型验证

Authors: Wenbo Wang, Fangyun Wei, QiXiu Li, Xi Chen, Yaobo Liang, Chang Xu, Jiaolong Yang, Baining Guo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05233
Pdf link: https://arxiv.org/pdf/2602.05233
Abstract Vision-language-action models have advanced robotic manipulation but remain constrained by reliance on the large, teleoperation-collected datasets dominated by the static, tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile-based robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, 5 skills (open, close, pull, push, pick) with over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.
中文摘要 视觉-语言-动作模型推动了机器人操作的发展，但仍受限于对通过遥操作采集的大型数据集的依赖，这些数据集以静态桌面场景为主。我们提出了一个以仿真为先的框架，用于在实际部署前验证VLA架构，并推出MobileManiBench，这是一个针对移动机器人操作的大规模基准测试。我们的流水线基于NVIDIA Isaac Sim，并由强化学习驱动，自主生成多样的操作轨迹，并附有丰富的注释（语言指令、多视角RGB-深度-分割图像、同步的对象/机器人状态和动作）。MobileManiBench 拥有 2 个移动平台（平行夹爪机器人和灵巧手机器人）、2 个同步摄像头（头部和右手腕）、20 个类别的 630 个物体、5 个技能（开、闭、拉、推、拣选），在 100 个真实场景中执行超过 100 项任务，累计 30 万条轨迹。该设计支持对机器人本体形态、感知模态和策略架构的受控、可扩展研究，加速数据效率和泛化的研究。我们对代表性的VLA模型进行基准测试，并报告关于复杂模拟环境中感知、推理和控制的见解。

RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation

RFM-Pose：用于快速类别级6D姿态估计的强化引导流匹配

Authors: Diya He, Qingchen Liu, Cong Zhang, Jiahu Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05257
Pdf link: https://arxiv.org/pdf/2602.05257
Abstract Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score based generative models have to some extent solved the rotational symmetry ambiguity problem in category level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
中文摘要 物体姿态估计是计算机视觉中的一个基本问题，在虚拟现实和具身智能中起着关键作用，在这些领域，代理必须理解并与三维空间中的物体互动。近年来，基于分数的生成模型在一定程度上解决了类别级姿态估计中的旋转对称性歧义问题，但其效率仍受限于基于分数扩散的高采样成本。在本研究中，我们提出了一个新框架RFM-Pose，在积极评估采样假设的同时，加速类别级的6D对象姿态生成。为提高采样效率，我们采用流匹配生成模型，并从简单前置态分布沿最优传输路径生成姿态候选。为了进一步优化这些候选方案，我们将流匹配抽样过程定义为马尔可夫决策过程，并应用近端策略优化以微调抽样策略。特别是，我们将流域解释为可学习的策略，并将估计量映射到价值网络，从而实现在强化学习框架内对态生成和假设评分的联合优化。REAL275基准测试的实验表明，RFM-Pose在显著降低计算成本的同时，实现了良好的性能。此外，与以往类似，我们的方法可轻易应用于物体姿态追踪，并在该环境中取得竞争性成果。

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

长度无偏序列策略优化：揭示和控制RLVR中反应长度变化

Authors: Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05261
Pdf link: https://arxiv.org/pdf/2602.05261
Abstract Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
中文摘要 近年来，带有可验证奖励的强化学习（RLVR）在大型语言模型（LLM）和视觉语言模型（VLMs）上的应用，在提升复杂任务的推理能力方面取得了显著成功。在RLVR训练中，反应长度的增加通常被视为推动推理能力增长的关键因素。然而，在训练过程中，不同RLVR算法的响应长度变化模式有显著差异。为提供这些变异的基本解释，本文深入分析了主流RLVR算法的组成部分。我们对影响反应长度的因素进行了理论分析，并通过大量实验验证了我们的理论。基于这些理论发现，我们提出了长度无偏序列策略优化（LUSPO）算法。具体来说，我们纠正了组序列策略优化（GSPO）固有的长度偏差，使其损失函数对响应长度保持无偏，从而解决响应长度崩溃的问题。我们在数学推理基准和多模态推理场景中进行了大量实验，LUSPO始终表现出色。实证结果表明，与现有方法如GRPO和GSPO相比，LUSPO代表了一种新颖、最先进的优化策略。

Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

回归基础：通过生成概率重新探讨强化学习中的LLM推理

Authors: Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05281
Pdf link: https://arxiv.org/pdf/2602.05281
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
中文摘要 带可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLMs）推理能力的不可或缺范式。然而，标准策略优化方法，如群相对策略优化（GRPO），常常趋向低熵策略，导致严重的模式崩溃和输出多样性受限。我们从抽样概率动力学的角度分析该问题，发现标准目标不成比例地强化了最高似然路径，从而抑制了有效的替代推理链。为此，我们提出了一种新的优势重加权机制（ARM），旨在平衡所有正确回答的置信水平。通过将提示困惑度和答案信心纳入优势估计，我们的方法动态重塑奖励信号，以减弱过度自信推理路径的梯度更新，同时将概率质量重新分配到未被充分探索的正确解中。实证结果表明，我们的方法显著提升了生成多样性和响应熵，同时保持了竞争精度，在推理任务中实现了探索与利用之间的优越权衡。在数学和编码基准测试中，Qwen2.5和DeepSeek模型的实证结果表明，ProGRPO显著减轻了熵坍缩。具体来说，在Qwen2.5-7B上，我们的方法在Pass@1中比GRPO高出5.7%，在Pass@32中高出13.9%，凸显了其在生成多样正确推理路径方面的优越能力。
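
The results above are reported as Pass@1 and Pass@32. The abstract does not say how these are computed; a widely used choice is the unbiased combinatorial estimator from code-generation evaluation, sketched below as an assumption. It takes `n` sampled responses per problem, of which `c` are correct, and estimates the probability that at least one of `k` randomly drawn responses is correct.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

A large gap between Pass@1 and Pass@32 gains, as reported in the abstract, is what one would expect from a method that spreads probability over more distinct correct solutions rather than sharpening a single one.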

Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier Certificates

认证稳健神经里雅普诺夫障碍证书的形式综合

Authors: Chengxiao Wang, Haoze Wu, Gagandeep Singh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.05311
Pdf link: https://arxiv.org/pdf/2602.05311
Abstract Neural Lyapunov and barrier certificates have recently been used as powerful tools for verifying the safety and stability properties of deep reinforcement learning (RL) controllers. However, existing methods offer guarantees only under fixed, ideal, unperturbed dynamics, limiting their reliability in real-world applications where dynamics may deviate due to uncertainties. In this work, we study the problem of synthesizing robust neural Lyapunov barrier certificates that maintain their guarantees under perturbations in system dynamics. We formally define a robust Lyapunov barrier function and specify sufficient conditions based on Lipschitz continuity that ensure robustness against bounded perturbations. We propose practical training objectives that enforce these conditions via adversarial training, a Lipschitz neighborhood bound, and global Lipschitz regularization. We validate our approach in two practically relevant environments, Inverted Pendulum and 2D Docking. The former is a widely studied benchmark, while the latter is a safety-critical task in autonomous systems. We show that our methods significantly improve both certified robustness bounds (up to 4.6 times) and empirical success rates under strong perturbations (up to 2.4 times) compared to the baseline. Our results demonstrate the effectiveness of training robust neural certificates for safe RL under perturbations in dynamics.
中文摘要 神经里雅普诺夫和屏障证书最近被用作验证深度强化学习（RL）控制器安全性和稳定性特性的强大工具。然而，现有方法仅在固定的理想无扰动力学下提供保证，限制了其在实际应用中因不确定性导致动力学偏差的可靠性。本研究研究如何综合\emph{稳健神经里雅普诺夫障碍证书}，以维持其在系统动力学扰动下保证。我们正式定义了一个稳健的李雅普诺夫势垒函数，并基于利普希茨连续性指定了足够的条件，以确保对有界扰动的鲁棒性。我们提出了通过对抗训练、利普希茨邻域界限和全局利普希茨正则化来强制执行这些条件的实用训练目标。我们在两个实际相关的环境中验证了我们的方法：倒摆和二维对接。前者是一个广泛研究的基准，而后者则是自主系统中至关重要的安全任务。我们证明，我们的方法在强扰动下显著提升了认证鲁棒性界限（最高可达4.6美元倍）和实证成功率（最高可达2.4美元倍），相较于基线。我们的结果证明，在动态扰动下，训练稳健神经证书以实现安全强化学习的有效性。
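
The role of Lipschitz continuity in the robustness argument above can be sketched with a short derivation. This is only an illustration under assumed notation: the discrete-time dynamics $f$, the certificate $V$, the perturbation $d$, its bound $\delta$, the Lipschitz constant $L$, and the nominal decrease margin $\epsilon$ are symbols introduced here, and the paper's actual conditions may differ.

```latex
% Assumptions (illustrative, not the paper's notation):
%   nominal dynamics   x' = f(x)
%   perturbed dynamics x' = f(x) + d,  with \|d\| \le \delta
%   V is L-Lipschitz, and V(f(x)) - V(x) \le -\epsilon holds nominally
\begin{align*}
V(f(x) + d) - V(x)
  &= \bigl[ V(f(x) + d) - V(f(x)) \bigr] + \bigl[ V(f(x)) - V(x) \bigr] \\
  &\le L \, \|d\| + \bigl[ V(f(x)) - V(x) \bigr] \\
  &\le L \, \delta - \epsilon .
\end{align*}
```

Under these assumptions the decrease condition survives every perturbation of size at most $\delta$ whenever $\epsilon > L\delta$, which suggests why a smaller Lipschitz constant or a larger nominal margin would translate into a larger certified perturbation bound.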

GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL

GAS：增强生成模型辅助离线安全强化学习的奖励与成本平衡

Authors: Zifan Liu, Xinran Li, Shibo Chen, Jun Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05323
Pdf link: https://arxiv.org/pdf/2602.05323
Abstract Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to "stitch" optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction compared to human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
中文摘要 离线安全强化学习（OSRL）旨在学习一套策略，在满足约束条件的同时实现序列决策的高效能，仅使用预先收集的数据集。近期工作受生成模型（GM）强大能力的启发，将OSRL中的决策重新表述为一种条件生成过程，GM根据预定义的奖励和成本价值生成理想行为。然而，GM辅助方法在OSRL中面临两大挑战：（1）缺乏从数据集中次优轨迹“拼接”最优转变的能力，（2）在奖励目标与成本目标之间难以平衡，尤其是在冲突时。为解决这些问题，我们提出了目标辅助拼接（GAS）算法，这是一种新颖算法，旨在增强缝合能力，同时有效平衡奖励最大化与约束满足。为了增强拼接能力，GAS首先在过渡层对数据集进行了补充和重新标记，从而能够从次优轨迹构建出高质量轨迹。GAS还引入了新的目标函数，能够从数据集中估算最优可实现的奖励和成本目标。这些目标函数通过在重新标记和增强后的数据集上进行期望回归训练，使GAS能够容纳更广泛的奖品-成本回报对，并在奖励最大化与约束满足度之间实现比人工指定值更好的权衡。估计目标随后指导政策培训，确保在受限条件下的稳健表现。此外，为了提升训练稳定性和效率，我们重新设计数据集，实现更均匀的回报-成本分布。实证结果验证了GAS的有效性，显示其在平衡奖励最大化与约束满足方面优于现有方法。
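
The abstract says the goal functions are trained with expectile regression. As a rough illustration of that loss only (not of GAS itself), the sketch below fits a single scalar to a handful of returns. The function names, the toy `returns` list, and the learning-rate settings are all assumptions made for this example.

```python
def expectile_loss(residuals, tau):
    """Mean asymmetric squared loss |tau - 1[u < 0]| * u**2 over residuals u."""
    total = 0.0
    for u in residuals:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(residuals)


def fit_expectile(targets, tau, lr=0.1, steps=2000):
    """Fit one scalar to `targets` by gradient descent on the expectile loss."""
    estimate = sum(targets) / len(targets)
    for _ in range(steps):
        grad = 0.0
        for y in targets:
            u = y - estimate
            weight = tau if u >= 0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * u**2 w.r.t. estimate
        estimate -= lr * grad / len(targets)
    return estimate


# Hypothetical returns of four trajectories (illustrative data only).
returns = [1.0, 2.0, 3.0, 10.0]
mean_like = fit_expectile(returns, tau=0.5)   # tau = 0.5 recovers the mean
optimistic = fit_expectile(returns, tau=0.9)  # tau near 1 leans toward the best returns
```

With `tau` above 0.5 the estimate is pulled toward the upper end of the observed returns without needing an explicit maximum, which is the usual motivation for using expectiles to approximate "best achievable" values from offline data.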

Imagine a City: CityGenAgent for Procedural 3D City Generation

想象一座城市：用于程序化3D城市生成的CityGenAgent

Authors: Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, Zhengzhe Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.05362
Pdf link: https://arxiv.org/pdf/2602.05362
Abstract The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models' generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
中文摘要 交互式3D城市的自动化生成是一个关键挑战，在自动驾驶、虚拟现实和具身智能等领域有广泛应用。尽管生成模型和程序化技术的最新进展提升了城市生成的真实性，但现有方法在高保真资产创建、可控性和作方面常常存在困难。在本研究中，我们介绍了CityGenAgent，一个基于自然语言的结构式高质量三维城市程序生成框架。我们的方法将城市生成分解为两个可解释的组成部分：街区规划和建设计划。为确保结构正确性和语义对齐，我们采用两阶段学习策略：（1）监督微调（SFT）。我们训练BlockGen和BuildingGen生成符合模式约束的有效程序，包括非自交多边形和完整字段;（2）强化学习（RL）。我们设计空间对齐奖励以增强空间推理能力，并设计视觉一致性奖励以弥合文本描述与视觉模态之间的差距。借助这些程序和模型的泛化，CityGenAgent 支持自然语言编辑和作。综合评估显示，与现有方法相比，语义对齐、视觉质量和可控性更优，为可扩展的3D城市生成奠定了坚实基础。

Rich-Media Re-Ranker: A User Satisfaction-Driven LLM Re-ranking Framework for Rich-Media Search

Rich-Media Re-Ranker：一个以用户满意度为驱动的 Rich-Media 搜索大型语言模型重新排序框架

Authors: Zihao Guo, Ligang Zhou, Zeyang Tang, Feicheng Li, Ying Nie, Zhiming Peng, Qingyun Sun, Jianxin Li
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.05408
Pdf link: https://arxiv.org/pdf/2602.05408
Abstract Re-ranking plays a crucial role in modern information search systems by refining the ranking of initial search results to better satisfy user information needs. However, existing methods show two notable limitations in improving user search satisfaction: inadequate modeling of multifaceted user intents and neglect of rich side information such as visual perception signals. To address these challenges, we propose the Rich-Media Re-Ranker framework, which aims to enhance user search satisfaction through multi-dimensional and fine-grained modeling. Our approach begins with a Query Planner that analyzes the sequence of query refinements within a session to capture genuine search intents, decomposing the query into clear and complementary sub-queries to enable broader coverage of users' potential intents. Subsequently, moving beyond primary text content, we integrate richer side information of candidate results, including signals modeling visual content generated by the VLM-based evaluator. These comprehensive signals are then processed alongside carefully designed re-ranking principles that consider multiple facets, including content relevance and quality, information gain, information novelty, and the visual presentation of cover images. Then, the LLM-based re-ranker performs a holistic evaluation based on these principles and integrated signals. To improve the scenario adaptability of the VLM-based evaluator and the LLM-based re-ranker, we further strengthen their capabilities through multi-task reinforcement learning. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines. Notably, the proposed framework has been deployed in a large-scale industrial search system, yielding substantial improvements in online user engagement rates and satisfaction metrics.
中文摘要 重新排序在现代信息搜索系统中发挥着关键作用，它通过优化初始搜索结果的排名，更好地满足用户信息需求。然而，现有方法在提升用户搜索满意度方面存在两个显著局限：多元用户意图建模不足，以及忽视了丰富的副信息，如视觉感知信号。为应对这些挑战，我们提出了Rich-Media Re-Ranker框架，旨在通过多维度和细粒度建模提升用户搜索满意度。我们的方法从查询规划器开始，分析会话中查询细化的顺序，捕捉真实的搜索意图，将查询分解为清晰且互补的子查询，从而更广泛地覆盖用户潜在意图。随后，超越主要文本内容，我们整合了候选结果更丰富的侧面信息，包括基于VLM的评估器生成的视觉内容的信号建模。这些综合信号随后与精心设计的重新排序原则一同处理，该原则考虑了内容相关性和质量、信息获取、信息新颖性以及封面图片的视觉呈现等多个方面。然后，基于LLM的重新排序工具基于这些原则和整合信号进行整体评估。为了增强基于VLM的评估器和基于LLM的重排序器的场景适应性，我们通过多任务强化学习进一步增强了它们的能力。大量实验表明，我们的方法显著优于最先进的基线。值得注意的是，该框架已部署在大型工业搜索系统中，显著提升了在线用户参与率和满意度指标。

DistillER: Knowledge Distillation in Entity Resolution with Large Language Models

DistillER：利用大型语言模型进行实体解析中的知识蒸馏

Authors: Alexandros Zeakis, George Papadakis, Dimitrios Skoutas, Manolis Koubarakis
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2602.05452
Pdf link: https://arxiv.org/pdf/2602.05452
Abstract Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches operate either in unsupervised settings and rely on very large and costly models, or in supervised settings and require ground-truth annotations, leaving a critical gap between time efficiency and effectiveness. To make LLM-powered ER more practical, we investigate Knowledge Distillation (KD) as a means to transfer knowledge from large, effective models (Teachers) to smaller, more efficient models (Students) without requiring gold labels. We introduce DistillER, the first framework that systematically bridges this gap across three dimensions: (i) Data Selection, where we study strategies for identifying informative subsets of data; (ii) Knowledge Elicitation, where we compare single- and multi-teacher settings across LLMs and smaller language models (SLMs); and (iii) Distillation Algorithms, where we evaluate supervised fine-tuning and reinforcement learning approaches. Our experiments reveal that supervised fine-tuning of Students on noisy labels generated by LLM Teachers consistently outperforms alternative KD strategies, while also enabling high-quality explanation generation. Finally, we benchmark DistillER against established supervised and unsupervised ER methods based on LLMs and SLMs, demonstrating significant improvements in both effectiveness and efficiency.
中文摘要 实体解析（ER）的最新进展利用了大型语言模型（LLMs），实现了强大的性能，但代价是大量计算资源或高昂的财务开销。现有基于LLM的ER方法要么运行在无监督环境中，依赖非常庞大且成本高昂的模型，要么在监督环境中运行，需要实地注释，导致时间效率与效果之间存在关键差距。为了使基于LLM驱动的ER更实用，我们研究知识蒸馏（KD）作为一种将知识从大型有效模型（教师）转移到更小、更高效的模型（学生）而无需金牌的手段。我们介绍了DistillER，这是首个系统性地跨三个维度弥合这一差距的框架：（i）数据选择，我们研究识别信息性数据子集的策略;（ii）知识引导，比较单教师和多教师在大型语言模型（LLM）与较小语言模型（SLMs）中的环境;以及（iii）蒸馏算法，我们评估监督式微调和强化学习方法。我们的实验表明，对学生在噪声标签上进行监督微调，由LLM教师生成，持续优于其他知识驱动策略，同时实现高质量解释生成。最后，我们将DistillER与基于大型语言模型（LLM）和SLMs的既有监督和无监督ER方法进行基准测试，显示出在效果和效率上的显著提升。

When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL

什么时候强化超参数是良性的？离线目标条件强化学习研究

Authors: Jan Malte Töpperwien, Aditya Mohan, Marius Lindauer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05459
Pdf link: https://arxiv.org/pdf/2602.05459
Abstract Hyperparameter sensitivity in Deep Reinforcement Learning (RL) is often accepted as unavoidable. However, it remains unclear whether it is intrinsic to the RL problem or exacerbated by specific training mechanisms. We investigate this question in offline goal-conditioned RL, where data distributions are fixed, and non-stationarity can be explicitly controlled via scheduled shifts in data quality. Additionally, we study varying data qualities under both stationary and non-stationary regimes, and cover two representative algorithms: HIQL (bootstrapped TD-learning) and QRL (quasimetric representation learning). Overall, we observe substantially greater robustness to changes in hyperparameter configurations than commonly reported for online RL, even under controlled non-stationarity. Once modest expert data is present (approximately 20%), QRL maintains broad, stable near-optimal regions, while HIQL exhibits sharp optima that drift significantly across training phases. To explain this divergence, we introduce an inter-goal gradient alignment diagnostic. We find that bootstrapped objectives exhibit stronger destructive gradient interference, which coincides directly with hyperparameter sensitivity. These results suggest that high sensitivity to changes in hyperparameter configurations during training is not inevitable in RL, but is amplified by the dynamics of bootstrapping, offering a pathway toward more robust algorithmic objective design.
中文摘要 深度强化学习（RL）中的超参数敏感性通常被视为不可避免。然而，目前尚不清楚这是强化学习问题的内在原因，还是由特定训练机制加剧。我们在离线目标条件强化学习中探讨了这个问题，其中数据分布是固定的，非平稳性可以通过数据质量的计划调整来显式控制。此外，我们还研究了在平稳和非平稳状态下不同数据质量，并涵盖了两种代表性算法：HIQL（自助TD学习）和QRL（准度量表示学习）。总体而言，即使在受控非平稳性条件下，我们观察到对在线强化学习（LL）中超参数配置变化的鲁棒性显著高于常见报告。一旦有适度的专家数据（$\约20%）存在，QRL就能维持宽广、稳定的近最优区域，而HIQL则表现出显著的最优值，且在训练阶段间有显著漂移。为解释这种分歧，我们引入了目标间梯度对齐诊断。我们发现自举物镜表现出更强的破坏性梯度干涉，这与超参数敏感性直接一致。这些结果表明，在强化学习中，对训练过程中超参数配置变化的高敏感性并非不可避免，但自举动态会放大这种敏感性，为更稳健的算法目标设计提供了一条路径。
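
The abstract does not define the inter-goal gradient alignment diagnostic. One plausible reading, sketched here purely as an assumption, is the average cosine similarity between loss gradients computed for different goals, where negative values indicate destructive interference. The helper names and the toy gradient vectors below are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def mean_pairwise_alignment(gradients):
    """Average cosine similarity over all unordered pairs of per-goal gradients."""
    pairs = [
        (i, j)
        for i in range(len(gradients))
        for j in range(i + 1, len(gradients))
    ]
    total = sum(cosine_similarity(gradients[i], gradients[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical per-goal gradients, flattened to short vectors for illustration.
per_goal_gradients = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alignment = mean_pairwise_alignment(per_goal_gradients)
```

In this toy case one pair of goals pulls in exactly opposite directions, so the mean alignment is negative; tracking such a number over training is one way a diagnostic of this kind could be related to hyperparameter sensitivity.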

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

活着：通过对抗性学习和教学性言语评估觉醒LLM推理

Authors: Yiwen Duan, Jing Ye, Xinpei Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05472
Pdf link: https://arxiv.org/pdf/2602.05472
Abstract The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent reward bottleneck: traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.
中文摘要 在大型语言模型（LLMs）中追求专家级推理的努力，一直受到一个持续存在的\textit{reward bottleneck}阻碍：传统的强化学习（RL）依赖标量奖励，这些奖励在扩展时成本高昂，跨域时将\textbf{脆弱}，且对解决方案的底层逻辑视而不见。这种对外部、贫乏信号的依赖阻碍了模型对推理原则的深刻、自成体系的理解。我们引入了 \textbf{ALIVE}（\emph{Adversarial Learning with Inguitive Verbal Evaluation}），这是一个免手动对齐框架，超越标量奖励优化，更深入内在推理习得。基于\emph{认知协同}原则，ALIVE将问题提出、解决和判断统一在单一政策模型中，内化正确的逻辑。通过将对抗性学习与具启发性的言语反馈结合，ALIVE 使模型能够直接从原始语料库中内化评价标准，有效地将外部批评转化为内生的推理能力。数学推理、代码生成和一般逻辑推理基准的实证评估表明，ALIVE 始终有效缓解了奖励信号的限制。在相同的数据和计算条件下，它实现了准确率的提升、显著提升的跨域泛化能力以及更高的自我修正率。这些结果表明，推理三位一体促进了能力的自我维持发展轨迹，使ALIVE成为无需人工监督的通用推理对齐的可扩展基础。

A Unified Framework for Rethinking Policy Divergence Measures in GRPO

重新思考GRPO政策分歧措施的统一框架

Authors: Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05494
Pdf link: https://arxiv.org/pdf/2602.05494
Abstract Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
中文摘要 带验证奖励的强化学习（RLVR）已成为推动大型语言模型（LLM）推理能力提升的关键范式。大多数现有RLVR方法，如GRPO及其变体，通过裁剪似然比来约束政策偏差，确保更新的稳定。本文提出了一个统一的剪辑框架，通过政策发散的一般概念来刻画现有方法，涵盖似然比和Kullback-Leibler（KL）发散，并扩展到替代度量。该框架为系统分析不同政策分歧措施如何影响勘探和绩效提供了原则性基础。我们还进一步确定KL3估计量，这是一种KL背度的方差约简蒙特卡洛估计器，作为关键的策略背度约束。我们理论上证明，基于KL3的约束在数学上等价于基于非对称比率的裁剪，后者将概率质量重新分配到高置信度动作，促进更强的探索，同时保持GRPO式方法的简洁性。数学推理基准的实证结果表明，将KL3估计器纳入GRPO不仅提升了训练稳定性，也提升了最终性能，凸显了原则性政策分歧约束在策略优化中的重要性。
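
Assuming "KL3" refers to the estimator commonly written as k3 = (r - 1) - log r, with r the ratio of reference to sampling probability, the sketch below checks its basic properties on a small discrete distribution. The distributions `q` and `p` are made-up numbers, and the expectations are computed by exact enumeration rather than sampling so the result is deterministic. How the paper turns this quantity into a clipping constraint is not shown here.

```python
import math


def kl_exact(q, p):
    """Exact KL(q || p) for two discrete distributions over the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def k1(ratio):
    """Naive per-sample estimator: -log r, unbiased but can be negative."""
    return -math.log(ratio)


def k3(ratio):
    """Per-sample estimator (r - 1) - log r: unbiased and never negative."""
    return (ratio - 1.0) - math.log(ratio)


def mean_and_variance(values, weights):
    """Mean and variance of `values` under the probability weights `weights`."""
    mean = sum(w * v for v, w in zip(values, weights))
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    return mean, var


# Hypothetical sampling distribution q and reference distribution p.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
ratios = [pi / qi for pi, qi in zip(p, q)]

true_kl = kl_exact(q, p)
k1_mean, k1_var = mean_and_variance([k1(r) for r in ratios], q)
k3_mean, k3_var = mean_and_variance([k3(r) for r in ratios], q)
```

Both estimators have expectation equal to the true KL under `q`, because the extra term r - 1 has zero mean; in this example k3 also has a much smaller variance, which matches the "variance-reduced" description in the abstract.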

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

揭示隐性优势对称性：为什么GRPO在探索和难度适应方面遇到困难

Authors: Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05548
Pdf link: https://arxiv.org/pdf/2602.05548
Abstract Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition, prioritizing simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
中文摘要 带可验证奖励的强化学习（RLVR），特别是GRPO，已成为引发LLM推理的标准。然而，其探索效率和难度适应仍是一个开放的挑战。在本研究中，我们认为这些瓶颈源于群体相对优势估计（GRAE）中隐含的优势对称性。这种对称性带来了两个关键局限：（i）在群层面，正确与错误轨迹间权重的严格对称使未采样作用对数保持不变，从而阻碍了新颖正确解的探索。（ii）在样本层面，算法隐含优先考虑中等难度样本，对难度聚焦的非平稳需求保持中立。通过受控实验，我们发现这种对称性质是次优的，并得出两个关键见解：（i）非对称地抑制正确轨迹的优势，有助于进行必要的探索。（ii）通过类似课程的过渡式教学最大化，优先学习较简单的样本，随后逐步转向复杂的样本。基于这些发现，我们提出了非对称GRAE（A-GRAE），该方法动态调节探索激励和样本难度聚焦。七个基准测试的实验表明，A-GRAE在LLM和MLLM中持续提升GRPO及其变体。
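
The symmetry described above can be seen in the standard group-relative advantage, where each reward is centered by the group mean and scaled by the group standard deviation. The sketch below is a minimal illustration, assuming binary rewards, a group size of 8, and the population standard deviation; it does not implement A-GRAE, and the helper names are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards by the group mean and scale by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def total_update_magnitude(num_correct, group_size=8):
    """Sum of |advantage| for a group with `num_correct` correct rollouts."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in group_relative_advantages(rewards))


# One correct rollout out of four: positive and negative weights cancel exactly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
advantage_sum = sum(advantages)

# Total weight per prompt as a function of how many rollouts were correct.
magnitudes = {k: total_update_magnitude(k) for k in range(0, 9)}
```

The advantages always sum to zero, so the weight given to correct rollouts is mirrored by the weight taken from incorrect ones. Under these assumptions the total magnitude also peaks when half the group is correct and vanishes when all or none are, which is one way to read the claim that medium-difficulty samples are implicitly prioritized.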

TOLEBI: Learning Fault-Tolerant Bipedal Locomotion via Online Status Estimation and Fallibility Rewards

TOLEBI：通过在线状态估计和错误奖励学习容错双足行走

Authors: Hokyun Lee, Woo-Jeong Baek, Junhyeok Cha, Jaeheung Park
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05596
Pdf link: https://arxiv.org/pdf/2602.05596
Abstract With the growing employment of learning algorithms in robotic applications, research on reinforcement learning for bipedal locomotion has become a central topic for humanoid robotics. While recently published contributions achieve high success rates in locomotion tasks, scarce attention has been devoted to the development of methods that can handle hardware faults that may occur during the locomotion process. However, in real-world settings, environmental disturbances or sudden occurrences of hardware faults might yield severe consequences. To address these issues, this paper presents TOLEBI (A faulT-tOlerant Learning framEwork for Bipedal locomotIon) that handles faults on the robot during operation. Specifically, joint locking, power loss and external disturbances are injected in simulation to learn fault-tolerant locomotion strategies. In addition to transferring the learned policy to the real robot via sim-to-real transfer, an online joint status module is incorporated. This module classifies joint conditions by referring to the actual observations at runtime under real-world conditions. The validation experiments conducted both in the real world and in simulation with the humanoid robot TOCABI highlight the applicability of the proposed approach. To our knowledge, this manuscript provides the first learning-based fault-tolerant framework for bipedal locomotion, thereby fostering the development of efficient learning methods in this field.
中文摘要 随着学习算法在机器人应用中的日益普及，关于双足行走强化学习的研究已成为类人机器人学的核心课题。尽管近期发表的贡献在移动任务中取得了较高的成功率，但开发能够处理移动过程中可能发生的硬件故障的方法却鲜有关注。然而，在现实环境中，环境干扰或硬件故障的突发发生可能带来严重后果。为解决这些问题，本文提出了TOLEBI（适用于双足机车的通用学习框架工作），该模型在机器人运行过程中处理故障。具体来说，在仿真中注入了关节锁定、功率损失和外部扰动，以学习容错运动策略。除了通过模拟到现实传输将所学策略转移到真实机器人外，还集成了一个在线联合状态模块。该模块通过参考运行时在实际条件下的观测数据，能够对联合状况进行分类。在现实世界和模拟中，使用人形机器人TOCABI进行的验证实验凸显了该方法的适用性。据我们所知，本手稿提供了首个基于学习的容错双足行走框架，从而促进了该领域高效学习方法的发展。

HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments

HiCrowd：密集人类环境下的层级人群流动对齐

Authors: Yufei Zhu, Shih-Min Yang, Martin Magnusson, Allan Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05608
Pdf link: https://arxiv.org/pdf/2602.05608
Abstract Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short-horizon planning. The method combines long-term crowd-aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in an offline setting (replaying recorded human trajectories) and an online setting (human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms the baselines in navigation efficiency and safety, while reducing freezing behaviors. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds.
中文摘要 在密集的人群中导航仍然是移动机器人面临的重大挑战。一个关键问题是机器人冻结问题，机器人难以找到安全的动作，结果被困在人群中。为此，我们提出了HiCrowd，一个将强化学习（RL）与模型预测控制（MPC）集成的分层框架。HiCrowd利用周围行人的运动作为引导，使机器人能够与兼容的人群流动对齐。高层次强化学习策略生成跟踪点，使机器人与合适的行人群体对齐，而低层次MPC则通过短视野规划安全地跟踪该引导。该方法结合了长期的群体意识决策与安全的短期执行。我们在离线环境（重播录制的人类轨迹）和在线环境（模拟中更新人体轨迹以响应机器人）中，结合反应性和基于学习的基线来评估HiCrowd。在真实世界数据集和合成人群数据集上的实验显示，我们的方法在导航效率和安全性方面表现优于我们，同时减少了冻结行为。我们的结果表明，利用人体运动作为引导，而不仅仅是将人类视为动态障碍，为在人群中安全高效地导航机器人提供了有力的原则。

Mode-Dependent Rectification for Stable PPO Training

模式依赖整流以实现PPO稳定训练

Authors: Mohamad Mohamad, Francesco Ponzio, Xavier Descombes
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05619
Pdf link: https://arxiv.org/pdf/2602.05619
Abstract Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
中文摘要 依赖模式的架构组件（在训练和评估过程中行为不同的层，如批归一化或退出）常用于视觉强化学习，但可能会破坏策略优化。我们表明，在近端策略优化（PPO）中，批次归一化引发的训练与评估行为差异导致策略不匹配、分布漂移和奖励崩溃。我们提出了模式依赖整流（MDR），一种轻量级双相训练程序，可在无架构变更的情况下稳定相模相关层下的PPO。在程序生成游戏和现实补丁本地化任务中的实验表明，MDR持续提升稳定性和性能，并自然扩展到其他依赖模式的层级。
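
To make the train/eval discrepancy concrete, the sketch below implements a stripped-down, single-feature batch normalization with no learnable scale or shift. The class `TinyBatchNorm` and its defaults are assumptions for illustration (it uses the biased batch variance for the running estimate, a simplification), and it does not reproduce MDR.

```python
import math


class TinyBatchNorm:
    """Single-feature batch norm without affine parameters (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training mode: normalize with statistics of the current batch
            # and update the running estimates.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1.0 - m) * self.running_mean + m * mean
            self.running_var = (1.0 - m) * self.running_var + m * var
        else:
            # Evaluation mode: normalize with the stored running estimates.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


layer = TinyBatchNorm()
observations = [2.0, 4.0, 6.0]

train_output = layer(observations)  # uses batch statistics
layer.training = False
eval_output = layer(observations)   # uses running statistics
```

The same inputs produce different activations in the two modes. If rollouts are collected in one mode and the policy is re-evaluated in the other during the update, the action probabilities can disagree before any parameter has changed, which is the kind of mismatch the abstract attributes to mode-dependent layers.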

Rewards as Labels: Revisiting RLVR from a Classification Perspective

奖励作为标签：从分类视角重新审视RLVR

Authors: Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05630
Pdf link: https://arxiv.org/pdf/2602.05630
Abstract Reinforcement Learning with Verifiable Rewards has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains carry over to the 7B model, where REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.
中文摘要 带可验证奖励的强化学习最近通过提供显式基于规则的监督，提升了大型语言模型在复杂推理任务中的能力。在RLVR方法中，GRPO及其变体取得了强劲的实证表现。尽管它们取得了成功，我们发现它们存在正值梯度错误分配和负值梯度支配的问题，导致政策更新效率低下且不理想。为解决这些问题，我们提出了“奖励即标签”（REAL），这是一个新颖框架，将可验证奖励重新审视为类别标签而非标量权重，从而将策略优化重新表述为分类问题。在此基础上，我们进一步引入锚点日志以增强政策学习。我们的分析显示，REAL能够诱导单调且有界的梯度加权，实现各部署间的均衡梯度分配，有效缓解已识别的不匹配。大量数学推理基准测试显示，REAL提升训练稳定性，并持续优于GRPO及强变体如DAPO。在15亿模型中，REAL平均Pass@1比DAPO提升了6.7%。这些涨幅进一步扩展到7B模型，REAL继续以6.2%和1.7%的优势优于DAPO和GSPO。值得注意的是，即使存在普通二元交叉熵，REAL依然保持稳定，平均比DAPO高出4.5%。
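
The abstract mentions a vanilla binary cross-entropy variant and a monotonic, bounded gradient weighting. The sketch below only illustrates the generic property of binary cross-entropy that makes such a weighting possible: its gradient with respect to the logit is sigmoid(score) minus the label. What score REAL actually feeds into the loss, and how anchor logits enter, is not stated in the abstract; `score` here is a hypothetical scalar per rollout.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(score, label):
    """Binary cross-entropy on a logit: softplus(score) - label * score."""
    softplus = max(score, 0.0) + math.log1p(math.exp(-abs(score)))
    return softplus - label * score


def bce_gradient(score, label):
    """Derivative of bce_loss with respect to score."""
    return sigmoid(score) - label


# Hypothetical per-rollout scores; label 1 = verified correct, 0 = incorrect.
scores = [-4.0, -2.0, 0.0, 2.0, 4.0]
grads_for_correct = [bce_gradient(s, 1.0) for s in scores]
grads_for_incorrect = [bce_gradient(s, 0.0) for s in scores]
```

For a correct rollout the gradient magnitude shrinks smoothly as the score grows, and for an incorrect one it grows smoothly, in both cases staying within [-1, 1]. That is the sense in which a classification loss yields a monotonic and bounded per-rollout weight, in contrast to an unbounded scalar advantage.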

UAV Trajectory Optimization via Improved Noisy Deep Q-Network

通过改进的噪声深Q网络实现无人机轨迹优化

Authors: Zhang Hengyu, Maryam Cheraghy, Liu Wei, Armin Farhadi, Meysam Soltanpour, Zhong Zhuoqing
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05644
Pdf link: https://arxiv.org/pdf/2602.05644
Abstract This paper proposes an Improved Noisy Deep Q-Network (Noisy DQN) to enhance the exploration and stability of Unmanned Aerial Vehicle (UAV) trajectory optimization when applying deep reinforcement learning in simulated environments. This method enhances the exploration ability by combining the residual NoisyLinear layer with an adaptive noise scheduling mechanism, while improving training stability through a smooth loss and soft target network updates. Experiments show that the proposed model achieves faster convergence and up to 40 higher reward than standard DQN, and quickly reaches the minimum number of steps required for the task (28) in the 15 × 15 grid navigation environment. The results show that our comprehensive improvements to the network structure of NoisyNet, exploration control, and training stability contribute to enhancing the efficiency and reliability of deep Q-learning.
中文摘要 本文提出了一种改进型噪声深度Q网络（噪声DQN），以增强无人机（UAV）在模拟环境中深度强化学习应用中的探索和稳定性。该方法通过将残差噪声线性层与自适应噪声调度机制结合，增强了探索能力，同时通过平滑损耗和软目标网络更新提升训练稳定性。实验显示，所提模型比标准DQN实现更快的收敛速度和高达$+40$的奖励，并且在15×15网格导航环境中，能迅速达到任务28所需的最低步数。结果显示，我们对 NoisyNet 网络结构、探索控制和训练稳定性的全面改进，有助于提升深度 Q 学习的效率和可靠性。
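
Soft target network updates are commonly implemented as an exponential moving average of the online network's parameters. The sketch below shows that generic rule on plain lists of floats; the mixing rate `tau` and the toy parameter values are assumptions, and the paper's actual settings are not given in the abstract.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction `tau` toward the online parameter."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


# Hypothetical flattened parameters of the target and online Q-networks.
target = [0.0, 0.0]
online = [1.0, 2.0]

after_one_step = soft_update(target, online, tau=0.1)

# Repeated updates against a fixed online network converge toward it.
converged = target
for _ in range(200):
    converged = soft_update(converged, online, tau=0.1)
```

Compared with copying the online weights into the target network every fixed number of steps, this rule changes the bootstrapping target gradually, which is the usual argument for its stabilizing effect.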

Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

锚定策略优化：通过支持限制的纠正缓解探索崩溃

Authors: Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, Guanhua Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05717
Pdf link: https://arxiv.org/pdf/2602.05717
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
中文摘要 带可验证奖励的强化学习（RLVR）越来越被视为一种树修剪机制。然而，我们发现了一种系统性病理，称为递归空间收缩（RSC），这是一种由正锐化和负压缩动态共同驱动的不可逆塌陷，有效替代方案的抽样概率为零。虽然Kullback-Leibler（KL）正则化旨在缓解这一问题，但它施加了严格的形状匹配约束，迫使策略模拟参考模型的全密度，从而与正确性所需的锐化造成梯度冲突。我们提出了锚定策略优化（APO），将范式从全局形状匹配转向支持覆盖。通过基于参考模型的高置信度支持定义安全歧管，APO允许在纠错过程中选择性地调用修复力以防止塌陷，从而实现高效锐化。我们理论上推断，APO作为梯度对齐机制，最大化支持覆盖，实现弹性恢复，重新膨胀有效分支。数学基准的实证评估表明，APO打破了准确性与多样性的权衡，显著提升了Pass@1，同时恢复了标准政策梯度方法通常丢失的Pass@K多样性。

Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification

通过细粒度知识验证缓解金融检索增强生成中的幻觉

Authors: Taoye Yin, Haoyuan Hu, Yaxin Fan, Xinhao Chen, Xinya Wu, Kai Deng, Kezun Zhang, Feng Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05723
Pdf link: https://arxiv.org/pdf/2602.05723
Abstract In financial Retrieval-Augmented Generation (RAG) systems, models frequently rely on retrieved documents to generate accurate responses due to the time-sensitive nature of the financial domain. While retrieved documents help address knowledge gaps, model-generated responses still suffer from hallucinations that contradict the retrieved information. To mitigate this inconsistency, we propose a Reinforcement Learning framework enhanced with Fine-grained Knowledge Verification (RLFKV). Our method decomposes financial responses into atomic knowledge units and assesses the correctness of each unit to compute the fine-grained faithful reward. This reward offers more precise optimization signals, thereby improving alignment with the retrieved documents. Additionally, to prevent reward hacking (e.g., overly concise replies), we incorporate an informativeness reward that encourages the policy model to retain at least as many knowledge units as the base model. Experiments conducted on the public Financial Data Description (FDD) task and our newly proposed FDD-ANT dataset demonstrate consistent improvements, confirming the effectiveness of our approach.
中文摘要 在金融检索增强生成（RAG）系统中，由于金融领域的时间敏感性，模型常依赖检索到的文档来生成准确的响应。虽然检索到的文档有助于弥补知识空白，但模型生成的响应仍然存在与检索信息相矛盾的幻觉。为缓解这种不一致，我们提出了一个通过细粒度知识验证（RLFKV）增强的强化学习框架。我们的方法将财务响应分解为原子知识单元，并评估每个单元的正确性，以计算细粒度的忠实奖励。这种奖励提供了更精确的优化信号，从而提升与检索文档的对齐度。此外，为防止奖励黑客行为（例如回复过于简洁），我们加入了信息量奖励，鼓励策略模型保留至少与基础模型相当的知识单元。在公共金融数据描述（FDD）任务和我们新提出的FDD-ANT数据集上进行的实验显示出持续的改进，证实了我们方法的有效性。
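
The reward design described above can be sketched as follows. The abstract does not give the exact formulas, so both functions are assumptions: faithfulness is taken as the fraction of atomic knowledge units judged consistent with the retrieved documents, and informativeness as a capped ratio of unit counts between the policy and the base model. The verification step that produces the per-unit judgments is outside this sketch.

```python
def faithfulness_reward(unit_is_supported):
    """Fraction of atomic knowledge units supported by the retrieved documents.

    `unit_is_supported` is a list of booleans, one per unit (hypothetical
    output of a verifier that is not shown here).
    """
    if not unit_is_supported:
        return 0.0
    return sum(unit_is_supported) / len(unit_is_supported)


def informativeness_reward(n_policy_units, n_base_units):
    """Full reward once the policy keeps at least as many units as the base model."""
    if n_base_units == 0:
        return 1.0
    return min(1.0, n_policy_units / n_base_units)
```

With only the first reward, a one-sentence answer containing a single correct unit would score 1.0. The second term is what makes that shortcut unattractive.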

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

学习注入：通过强化学习实现自动提示注入

Authors: Xin Chen, Jie Zhang, Florian Tramer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05746
Pdf link: https://arxiv.org/pdf/2602.05746
Abstract Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
中文摘要 提示注入是LLM代理中最关键的漏洞之一;然而，从优化角度来看，有效的自动化攻击仍然大多未被充分探索。现有方法高度依赖人工红队成员和手工制作的提示，限制了其可扩展性和适应性。我们提出了AutoInject，一种强化学习框架，能够生成通用的、可转移的对抗后缀，同时共同优化攻击成功率和良性任务的效用保留。我们的黑箱方法支持基于查询的优化和对未见模型和任务的转移攻击。仅使用1.5B参数的对抗后缀生成器，我们成功攻破了包括GPT 5 Nano、Claude Sonnet 3.5和Gemini 2.5 Flash在内的前沿系统，在AgentDojo基准测试上建立了更强的自动提示注入研究基线。

LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards

LongR：通过强化学习释放长上下文推理，并结合密集效用奖励

Authors: Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05758
Pdf link: https://arxiv.org/pdf/2602.05758
Abstract Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to guide complex long-context reasoning effectively. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
中文摘要 强化学习已成为大型语言模型推理的关键驱动力。这一能力在长上下文场景中同样至关重要——比如长时间对话理解和结构化数据分析，挑战不仅仅是消耗代币，还体现在执行严格推理。虽然现有努力主要集中在数据综合或架构变革，但最新研究指出，仅依赖稀疏、仅结果的奖励收益有限，因为粗糙信号往往不足以有效指导复杂的长上下文推理。为此，我们提出了LongR，一个统一框架，通过集成动态的“思考与阅读”机制，将推理与文档咨询交错结合，并基于相对信息获得的上下文密度奖励来量化相关文档的效用，从而提升长上下文性能。从实证来看，LongR在LongBench v2上实现了9%的提升，并且在RULER和InfiniteBench上持续提升，展现出在广泛上下文导航上的强劲高效。此外，LongR持续提升了多种强化学习算法（如DAPO、GSPO）的性能。最后，我们进行了深入分析，探讨链条长度推理对效率及模型对干扰因素的鲁棒性的影响。

RL-VLA$^3$: Reinforcement Learning VLA Accelerating via Full Asynchronism

RL-VLA$^3$：强化学习VLA通过完全异步加速

Authors: Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Chen Zhou, Yucheng Guo, Qiming Yang, Wanting Xu, Wen Huang, Yunxuan Ma, Hongke Zhao, Likang Wu, Xiaotie Deng, Xi Xiao, Sheng Wen, Yicheng Gong, Junwu Xiong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05765
Pdf link: https://arxiv.org/pdf/2602.05765
Abstract In recent years, Vision-Language-Action (VLA) models have emerged as a crucial pathway towards general embodied intelligence, yet their training efficiency has become a key bottleneck. Although existing reinforcement learning (RL)-based training frameworks like RLinf can enhance model generalization, they still rely on synchronous execution, leading to severe resource underutilization and throughput limitations during the environment interaction, policy generation (rollout), and model update (actor) phases. To overcome this challenge, this paper, for the first time, proposes and implements a fully asynchronous policy training framework encompassing the entire pipeline from environment interaction and rollout generation to actor policy updates. Systematically drawing inspiration from asynchronous optimization ideas in large model RL, our framework adopts a multi-level decoupled architecture. This includes asynchronous parallelization of environment interaction and trajectory collection, streaming execution for policy generation, and decoupled scheduling for training updates. We validated the effectiveness of our method across diverse VLA models and environments. On the LIBERO benchmark, the framework achieves throughput improvements of up to 59.25% compared to existing synchronous strategies. When deeply optimizing separation strategies, throughput can be increased by as much as 126.67%. We verified the effectiveness of each asynchronous component via ablation studies. Scaling law validation across 8 to 256 GPUs demonstrates our method's excellent scalability under most conditions.
中文摘要 近年来，视觉-语言-行动（VLA）模型已成为通向通用具象智能的重要途径，但其训练效率已成为关键瓶颈。尽管现有基于强化学习（RL）的训练框架如RLinf可以增强模型泛化，但它们仍依赖同步执行，导致资源严重利用不足和环境交互、策略生成（展开）和模型更新阶段（actor）吞吐量受限。为克服这一挑战，本文首次提出并实施了涵盖从环境交互、推广生成到参与者策略更新的全流程的全异步政策培训框架。我们的框架系统地汲取大型模型强化学习中的异步优化理念，设计了一个多层级解耦架构。这包括环境交互和轨迹收集的异步并行化、策略生成的流式执行，以及训练更新的解耦调度。我们验证了该方法在不同VLA模型和环境中的有效性。在LIBERO基准测试中，该框架相比现有同步策略实现了最高59.25%的吞吐量提升。在深度优化分离策略时，吞吐量可提升多达126.67%。我们通过消融研究验证了每个异步成分的有效性。在8到256块GPU上进行定律验证，展示了我们方法在大多数条件下的卓越可扩展性。

Cross-Domain Offline Policy Adaptation via Selective Transition Correction

通过选择性转换纠正实现跨域离线策略适配

Authors: Mengbei Yan, Jiafei Lyu, Shengjie Sun, Zhongjian Qiao, Jingwen Yang, Zichuan Lin, Deheng Ye, Xiu Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05776
Pdf link: https://arxiv.org/pdf/2602.05776
Abstract It remains a critical challenge to adapt policies across domains with mismatched dynamics in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from another similar source domain can be accessed to enhance policy learning upon a target domain dataset. Directly merging the two datasets may lead to suboptimal performance due to potential dynamics mismatches. Existing approaches typically mitigate this issue through source domain transition filtering or reward modification, which, however, may lead to insufficient exploitation of the valuable source domain data. Instead, we propose to modify the source domain data into the target domain data. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly achieving alignment with the target dynamics. Since limited data may result in inaccurate model training, we further employ a forward dynamics model to retain corrected samples that better match the target dynamics than the original transitions. Consequently, we propose the Selective Transition Correction (STC) algorithm, which enables reliable usage of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC achieves superior performance against existing baselines.
中文摘要 在强化学习（RL）中，跨领域动态不匹配的策略调整仍是一个关键挑战。本文研究跨域离线强化学习，即可访问来自其他相似源域的离线数据集，以增强目标域数据集的策略学习。直接合并这两个数据集可能导致性能不优，因为潜在的动态不匹配。现有方法通常通过源域转换过滤或奖励修改来缓解这一问题，但这可能导致宝贵源域数据的充分利用。相反，我们建议将源域数据修改为目标域数据。为此，我们利用逆策略模型和奖励模型纠正源转移的行为和奖励，明确实现与目标动态的对齐。由于数据有限可能导致模型训练不准确，我们进一步采用前向动力学模型，保留更符合目标动力学的校正样本，而非原始转变。因此，我们提出了选择性转换修正（STC）算法，能够可靠地利用源域数据进行策略调整。在各种具有动态变化的环境中的实验表明，STC相较于现有基线在性能上更为优越。
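
The correct-then-select step described in the abstract above can be sketched for a single transition. The three models are passed in as plain callables, and states and actions are scalars so that the example stays self-contained. These simplifications, the function name, and the choice to return `None` for rejected corrections are assumptions. The abstract does not say what happens to transitions whose correction is rejected, so that decision is left to the caller.

```python
def selectively_correct(transition, inverse_model, reward_model, forward_model):
    """Correct a source-domain transition and keep it only if it fits better.

    transition: (state, action, reward, next_state) from the source domain.
    inverse_model(s, s_next) -> action that explains s -> s_next in the target domain.
    reward_model(s, a)       -> reward for the corrected action.
    forward_model(s, a)      -> predicted next state under target dynamics.
    Returns the corrected transition, or None if the correction does not
    match the target dynamics better than the original action did.
    """
    s, a, r, s_next = transition
    a_corr = inverse_model(s, s_next)
    r_corr = reward_model(s, a_corr)
    err_orig = abs(forward_model(s, a) - s_next)
    err_corr = abs(forward_model(s, a_corr) - s_next)
    if err_corr < err_orig:
        return (s, a_corr, r_corr, s_next)
    return None
```

In a toy setting where the target dynamics are `s + 2 * a` and the source dynamics are `s + a`, an accurate inverse model halves the source action, and the forward-model check accepts the correction. An inaccurate inverse model is caught by the same check.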

Distributional Reinforcement Learning with Diffusion Bridge Critics

基于扩散桥评论家的分布强化学习

Authors: Shutong Ding, Yimiao Zhou, Ke Hu, Mokai Pan, Shan Zhong, Yanwei Fu, Jingya Wang, Ye Shi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05783
Pdf link: https://arxiv.org/pdf/2602.05783
Abstract Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results in a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving the diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, it has been confirmed that the critic is more appropriately depicted with a distributional model. Motivated by these points, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value. This allows us to accurately capture the value distribution and prevents it from collapsing into a trivial Gaussian distribution owing to the strong distribution-matching capability of the diffusion bridge. Moreover, we further derive an analytic integral formula to address discretization errors in DBC, which is essential in value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC compared with previous distributional critic models.
中文摘要 基于扩散的强化学习（RL）方法的最新进展在广泛的连续控制任务中展现出有希望的成果。然而，该领域的现有研究主要关注扩散政策的应用，而未深入探讨扩散批评者。事实上，由于政策优化根本依赖于批评者，准确的价值估计远比政策表达性更为重要。此外，鉴于大多数强化学习任务的随机性，已确认批评者更适合用分布模型来描绘。基于这些观点，我们提出了一种带有扩散桥评（DBC）的新型分布强化学习方法。DBC直接建模Q值的逆累积分布函数（CDF）。这使得我们能够准确捕捉值分布，并防止由于扩散桥强大的分布匹配能力而崩溃成平凡的高斯分布。此外，我们还推导了一个解析积分公式，用于解决DBC中的离散化误差，这在价值估计中至关重要。据我们所知，DBC是首个采用扩散桥模型作为批判的研究。值得注意的是，DBC也是一个即插即用组件，可以集成到大多数现有的强化学习框架中。MuJoCo机器人控制基准测试的实验结果显示，DBC相较于以往的分布批判模型具有优越性。
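
The abstract above says that the critic models the inverse CDF of the Q value and that discretization error matters for value estimation. The expected value is the integral of the inverse CDF over [0, 1], so any finite quadrature introduces some error. The sketch below illustrates this with a midpoint rule on a distribution whose mean is known exactly (an exponential with mean 1). It is a generic numerical illustration under these assumptions, not the paper's analytic formula.

```python
import math


def mean_from_quantile_fn(quantile_fn, n_points):
    """Midpoint-rule estimate of E[Z], the integral of F^{-1}(tau) over [0, 1]."""
    return sum(quantile_fn((i + 0.5) / n_points) for i in range(n_points)) / n_points


def exp_quantile(tau):
    """Inverse CDF of an Exponential(1) variable, whose true mean is 1."""
    return -math.log(1.0 - tau)
```

With few quantile points the estimate is visibly biased, because the heavy upper tail is undersampled. The bias shrinks as the number of points grows, which is one reason a closed-form integral is attractive for a quantile-based critic.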

TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning

TKG-Thinker：通过智能体强化学习迈向时间知识图谱上的动态推理

Authors: Zihao Jiang, Miao Peng, Zhenyan Shan, Wenjie Xu, Ben Liu, Gong Chen, Ziqi Gao, Min Peng
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2602.05818
Pdf link: https://arxiv.org/pdf/2602.05818
Abstract Temporal knowledge graph question answering (TKGQA) aims to answer time-sensitive questions by leveraging temporal knowledge bases. While Large Language Models (LLMs) demonstrate significant potential in TKGQA, current prompting strategies constrain their efficacy in two primary ways. First, they are prone to reasoning hallucinations under complex temporal constraints. Second, static prompting limits model autonomy and generalization, as it lacks optimization through dynamic interaction with temporal knowledge graph (TKG) environments. To address these limitations, we propose TKG-Thinker, a novel agent equipped with autonomous planning and adaptive retrieval capabilities for reasoning over TKGs. Specifically, TKG-Thinker performs in-depth temporal reasoning through dynamic multi-turn interactions with TKGs via a dual-training strategy. We first apply Supervised Fine-Tuning (SFT) with chain-of-thought data to instill core planning capabilities, followed by a Reinforcement Learning (RL) stage that leverages multi-dimensional rewards to refine reasoning policies under intricate temporal constraints. Experimental results on benchmark datasets with three open-source LLMs show that TKG-Thinker achieves state-of-the-art performance and exhibits strong generalization across complex TKGQA settings.
中文摘要 时间知识图解答（TKGQA）旨在通过利用时间知识库来回答时间敏感的问题。虽然大型语言模型（LLMs）在TKGQA中展现出显著潜力，但当前的提示策略在两个主要方面限制了其效能。首先，他们在复杂的时间限制下容易推理幻觉。其次，静态提示限制了模型的自主性和泛化性，因为它缺乏通过与时间知识图谱（TKGs）环境动态交互进行优化。为解决这些局限性，我们提出了 TKG-Thinker，一款具备自主规划和自适应检索能力的新型代理，用于对 TKG 进行推理。具体来说，TKG-Thinker 通过双重训练策略，通过动态多回合交互与 TKG 进行深入的时间推理。我们首先应用监督微调（SFT）与链式思维数据，灌输核心规划能力，随后进入强化学习（RL）阶段，利用多维奖励在复杂的时间约束下优化推理策略。在基准数据集上使用三个开源大型语言模型的实验结果显示，TKG-Thinker 实现了最先进的性能，并在复杂的 TKGQA 环境中展现出强烈的泛化能力。

Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning

Weaver：视频交错推理的端到端智能系统培训

Authors: Yudi Shi, Shangzhe Di, Qirui Chen, Qinian Wang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.05829
Pdf link: https://arxiv.org/pdf/2602.05829
Abstract Video reasoning constitutes a comprehensive assessment of a model's capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
中文摘要 视频推理是对模型能力的全面评估，因为它要求强大的感知和解释能力，从而探索模型性能的边界。虽然近期研究利用以文本为中心的思维链推理来增强这些能力，但此类方法常常存在表征不匹配且受限于感知敏锐度。为解决这些局限性，我们提出了Weaver，一种新颖的端到端可训练多模态推理智能体系统。Weaver赋能其政策模型，在推理过程中动态调用多样化工具，实现关键视觉线索的渐进获取和真实多模态推理轨迹的构建。此外，我们还集成了强化学习算法，使系统能够自由探索利用和结合这些工具与无轨迹数据的策略。大量实验表明，我们的系统Weaver在多个复杂的视频推理基准测试中提升了性能，尤其是涉及长视频的测试。

UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

UI-Mem：用于移动图形界面代理在线强化学习的自我演化体验记忆

Authors: Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.05832
Pdf link: https://arxiv.org/pdf/2602.05832
Abstract Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: this https URL
中文摘要 在线强化学习（RL）为通过直接环境交互增强图形界面代理提供了有前景的范式。然而，长期任务中学分分配效率低下以及因缺乏经验转移导致任务间重复错误严重影响了其有效性。为应对这些挑战，我们提出了UI-Mem，一个新颖框架，通过层级体验记忆增强了在线GUI强化学习。与传统的回放缓冲不同，我们的记忆积累了结构化的知识，包括高层次工作流程、子任务技能和失败模式。这些体验以参数化模板形式存储，支持跨任务和跨应用的转移。为了有效将记忆指导融入在线强化学习，我们引入了分层群体抽样，该方法在每个推广组的轨迹中注入不同层次的指导，以保持结果多样性，推动无指导政策内化引导行为。此外，自我演化循环会持续抽象新策略和错误，以保持内存与代理演化策略的对齐。在线图形界面基准测试的实验表明，UI-Mem 远远优于传统强化学习基线和静态重用策略，并且对未见应用具有强烈的泛化性。项目页面：此 https URL

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Dr. Kernel：面向 Triton 内核生成的正确强化学习之道

Authors: Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05885
Pdf link: https://arxiv.org/pdf/2602.05885
Abstract High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in this https URL.
中文摘要 高质量内核对于可扩展的人工智能系统至关重要，使大型语言模型能够生成此类代码将推动人工智能的发展。然而，训练LLM完成这项任务需要足够的数据和稳健的环境，且该过程常常容易出现奖励黑客行为和懒惰优化。在这些情况下，模型可能会钻训练奖励的空子，优先考虑平凡的正确性而非有意义的加速。本文系统地研究了内核生成中的强化学习（RL）。我们首先设计了KernelGYM，一个稳健的分布式GPU环境，支持奖励黑客行为检查、多回合交互数据收集以及长期强化学习训练。基于KernelGYM，我们研究了有效的多回合强化学习方法，并识别了GRPO中因自我包含导致的有偏策略梯度问题。为此，我们提出了回合级Reinforce留一法（TRLOO），为多回合强化学习提供无偏的优势估计。为缓解懒惰优化，我们加入了不匹配校正以提升训练稳定性，并引入基于性能分析的奖励（PR）和基于性能分析的拒绝采样（PRS）来克服这一问题。训练得到的模型 Dr.Kernel-14B 在 KernelBench 上的性能可与 Claude-4.5-Sonnet 媲美。最后，我们研究了 Dr.Kernel-14B 的顺序测试时扩展。在 KernelBench Level-2 子集上，31.6% 的生成内核相比 Torch 参考实现至少获得 1.2 倍加速，超过了 Claude-4.5-Sonnet（26.7%）和 GPT-5（28.6%）。在所有回合中选择最佳候选时，这一1.2倍加速率进一步提升至47.8%。所有资源，包括环境、训练代码、模型和数据集，都包含在这个 https URL。
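
The self-inclusion issue mentioned in the abstract above can be illustrated with the generic leave-one-out baseline. A group-mean baseline includes each sample's own reward, while a leave-one-out baseline uses only the other samples in the group. The sketch below compares the two for one group of rewards at a single turn. How the paper applies this per turn, and GRPO's standard-deviation normalization, are omitted here. The function names are illustrative assumptions.

```python
def group_mean_advantages(rewards):
    """Advantage with a baseline that includes the sample's own reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """Advantage with a baseline computed from the other samples only."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

For a group of size n, the leave-one-out advantage equals the group-mean advantage multiplied by n / (n - 1), so the two differ most for small groups.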

DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

DFPO：通过分布流向鲁棒且可推广的后训练LLM进行价值建模

Authors: Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05890
Pdf link: https://arxiv.org/pdf/2602.05890
Abstract Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
中文摘要 由于监督噪声和域外（OOD）泛化能力差，尤其是在大型语言模型（LLM）后训练中，训练强化学习（RL）系统在现实环境中依然具有挑战性。最新的分布强化学习方法通过建模具有多个分位点的值来提升鲁棒性，但它们仍然独立学习每个分位数作为标量。这导致粗粒度值表示缺乏对状态信息的细粒度条件，在复杂且面向外的条件下难以实现。我们提出了DFPO（分布价值流策略优化，带条件风险和一致性控制），这是一个稳健的分布强化学习框架，将价值建模为跨时间步的连续流。通过学习价值流域而非孤立的分位数预测来扩展价值建模，DFPO捕捉了更丰富的状态信息，从而实现更准确的优势估计。为了在噪声反馈下稳定训练，DFPO进一步沿价值流轨迹整合条件性风险控制和一致性约束。对话、数学推理和科学任务的实验显示，DFPO在噪声监督下优于PPO、FlowRL及其他稳健基线，实现了训练稳定性和泛化性的提升。

Residual Reinforcement Learning for Waste-Container Lifting Using Large-Scale Cranes with Underactuated Tools

使用带欠驱动工具的大型起重机进行废物容器提升的残差强化学习

Authors: Qi Li, Karsten Berns
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.05895
Pdf link: https://arxiv.org/pdf/2602.05895
Abstract This paper studies the container lifting phase of a waste-container recycling task in urban environments, performed by a hydraulic loader crane equipped with an underactuated discharge unit, and proposes a residual reinforcement learning (RRL) approach that combines a nominal Cartesian controller with a learned residual policy. All experiments are conducted in simulation, where the task is characterized by tight geometric tolerances between the discharge-unit hooks and the container rings relative to the overall crane scale, making precise trajectory tracking and swing suppression essential. The nominal controller uses admittance control for trajectory tracking and pendulum-aware swing damping, followed by damped least-squares inverse kinematics with a nullspace posture term to generate joint velocity commands. A PPO-trained residual policy in Isaac Lab compensates for unmodeled dynamics and parameter variations, improving precision and robustness without requiring end-to-end learning from scratch. We further employ randomized episode initialization and domain randomization over payload properties, actuator gains, and passive joint parameters to enhance generalization. Simulation results demonstrate improved tracking accuracy, reduced oscillations, and higher lifting success rates compared to the nominal controller alone.
中文摘要 本文研究了城市环境中废弃物容器回收任务的集装箱提升阶段，该任务由配备欠驱动卸载单元的液压装载起重机执行，并提出了一种残余强化学习（RRL）方法，结合了名义上的笛卡尔控制器与已学习的残差策略。所有实验均在模拟中进行，任务特点是卸货单元钩与容器环相对于整体起重机尺度之间的几何公差非常严格，因此精确的轨迹跟踪和摆动抑制至关重要。标称控制器使用导纳控制进行轨迹跟踪和摆动感知阻尼，随后采用阻尼最小二乘反运动学和零空间姿态项来生成联合速度指令。Isaac实验室中采用PPO训练的残差策略，可以补偿未建模的动力学和参数变化，提升精度和鲁棒性，无需从零学习。我们进一步利用随机化事件初始化和域随机化，应用于有效载荷特性、执行器增益和被动关节参数，以增强泛化。模拟结果显示，跟踪精度提升，振荡减少，提升成功率，均优于标称控制器。
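
The residual structure described above, a nominal controller plus a learned correction, can be written as one line of arithmetic. The sketch below assumes that the residual is scaled and that the sum is clipped to joint-velocity limits. The names `scale` and `limit` and their values are illustrative assumptions, not taken from the paper.

```python
def residual_action(nominal, residual, scale=0.1, limit=1.0):
    """Combine nominal joint-velocity commands with a scaled learned residual.

    nominal:  commands from the model-based controller (list of floats).
    residual: raw policy output, one value per joint.
    The result is clipped elementwise to [-limit, limit].
    """
    return [max(-limit, min(limit, u + scale * d))
            for u, d in zip(nominal, residual)]
```

Keeping `scale` small means the learned policy can only nudge the nominal command, which is the usual argument for why residual learning is easier to train than end-to-end control.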

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

停止奖励幻觉步骤：忠实感知的步骤级强化学习针对小推理模型

Authors: Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.05897
Pdf link: https://arxiv.org/pdf/2602.05897
Abstract As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at this https URL.
中文摘要 随着大型语言模型变得更小更高效，小推理模型（SRM）对于在资源受限环境中实现思维链（CoT）推理至关重要。然而，他们在中级推理阶段更容易出现忠实幻觉。基于在线强化学习的现有缓解方法依赖基于结果的奖励或粗粒度的CoT评估，当最终答案正确时，这些方法可能无意中强化不忠实的推理。为解决这些局限性，我们提出了忠诚感知阶级强化学习（FaithRL），通过过程奖励模型中的显式忠实奖励引入阶级监督，并结合隐式截断重采样策略，从忠实前缀生成对比信号。多个SRM和开卷本QA基准测试的实验表明，FaithRL在CoT和最终答案中持续减少幻觉，从而实现更忠实可靠的推理。代码可在此 https URL 访问。

Quantum Reinforcement Learning with Transformers for the Capacitated Vehicle Routing Problem

结合 Transformer 的量子强化学习求解带容量约束的车辆路径问题

Authors: Eva Andrés
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2602.05920
Pdf link: https://arxiv.org/pdf/2602.05920
Abstract This paper addresses the Capacitated Vehicle Routing Problem (CVRP) by comparing classical and quantum Reinforcement Learning (RL) approaches. An Advantage Actor-Critic (A2C) agent is implemented in classical, full quantum, and hybrid variants, integrating transformer architectures to capture the relationships between vehicles, clients, and the depot through self- and cross-attention mechanisms. The experiments focus on multi-vehicle scenarios with capacity constraints, considering 20 clients and 4 vehicles, and are conducted over ten independent runs. Performance is assessed using routing distance, route compactness, and route overlap. The results show that all three approaches are capable of learning effective routing policies. However, quantum-enhanced models outperform the classical baseline and produce more robust route organization, with the hybrid architecture achieving the best overall performance across distance, compactness, and route overlap. In addition to quantitative improvements, qualitative visualizations reveal that quantum-based models generate more structured and coherent routing solutions. These findings highlight the potential of hybrid quantum-classical reinforcement learning models for addressing complex combinatorial optimization problems such as the CVRP.
中文摘要 本文通过比较经典与量子强化学习（RL）方法，解决了电容车辆路由问题（CVRP）。优势演员-批评者（A2C）代理以经典、全量子和混合变体实现，集成变压器架构，通过自关注和交叉关注机制捕捉车辆、客户端和仓库之间的关系。实验聚焦于多车场景，考虑20个客户和4个车辆，分十次独立运行。性能通过路由距离、路由紧凑性和路由重叠来评估。结果表明，这三种方法都能学习有效的路由策略。然而，量子增强模型优于经典基线，实现更稳健的路由组织，混合架构在距离、紧凑性和路线重叠方面实现了最佳的整体性能。除了定量改进外，定性可视化还显示基于量子的模型能生成更有结构性和连贯的路由解。这些发现凸显了混合量子-经典强化学习模型在解决复杂组合优化问题（如CVRP）中的潜力。

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

策略镜像下降中对数配分函数的近似为LLM后训练带来隐式正则化

Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05933
Pdf link: https://arxiv.org/pdf/2602.05933
Abstract Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL--$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at this https URL.
中文摘要 策略镜像下降（PMD）通过迭代解决KL正则化策略改进子问题，为强化学习（RL）提供了一个有原则的框架。虽然这种方法已被用于训练高级大型语言模型，如Kimi K1.5/K2，但理想的封闭形式PMD更新需要可靠的划分函数估计，这在有限的大型语言模型动作空间中是一大挑战。我们研究一种实用算法，称为PMD均值，它将对数划分项与抽样策略下的均值奖励近似，并在对数策略空间中进行回归。具体来说，我们刻画了PMD均值的总体解，并证明它隐式优化了自适应混合KL-$\chi^2$正则化子问题的镜像下降子问题。这种额外的$\chi^2$正则化限制了较大的概率变化，在期望奖励较低时产生更保守的更新，并增强了对有限样本估计误差的鲁棒性。数学推理任务的实验表明，PMD均值在稳定性和时间效率上表现更优。这些发现深化了我们对PMD均值的理解，并揭示了实现强化学习算法在LLM中实现原则性改进的路径。代码可在此 https 网址获取。
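
The approximation described in the abstract above can be written out as a short derivation. The notation is an assumption introduced here for illustration: $\pi_k$ is the sampling policy, $r$ the reward, $\tau$ the KL regularization strength, and $Z_k$ the partition function. The exact scaling used in the paper may differ.

```latex
% Closed-form solution of the KL-regularized policy improvement subproblem
\pi_{k+1}(y \mid x) = \frac{\pi_k(y \mid x)\, \exp\!\big(r(x,y)/\tau\big)}{Z_k(x)},
\qquad
Z_k(x) = \mathbb{E}_{y \sim \pi_k(\cdot \mid x)}\!\left[\exp\!\big(r(x,y)/\tau\big)\right]

% The same update in log-policy space
\log \pi_{k+1}(y \mid x) = \log \pi_k(y \mid x) + \frac{r(x,y)}{\tau} - \log Z_k(x)

% PMD-mean: replace the log-partition term by the mean reward under \pi_k
\log Z_k(x) \approx \frac{\bar r(x)}{\tau},
\qquad
\bar r(x) = \mathbb{E}_{y \sim \pi_k(\cdot \mid x)}\!\left[r(x,y)\right]

% Resulting regression objective in log-policy space
\min_\theta \; \mathbb{E}_{y \sim \pi_k(\cdot \mid x)}
\left[\Big(\log \pi_\theta(y \mid x) - \log \pi_k(y \mid x) - \frac{r(x,y) - \bar r(x)}{\tau}\Big)^{2}\right]
```

By Jensen's inequality, $\log Z_k(x) \ge \bar r(x)/\tau$, so the mean-reward term is a lower bound on the log-partition term rather than an unbiased estimate of it. This gap is consistent with the abstract's statement that the approximation acts as an additional implicit regularizer.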

$f$-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

$f$-GRPO及其后：基于发散的强化学习算法用于通用LLM对齐

Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.05946
Pdf link: https://arxiv.org/pdf/2602.05946
Abstract Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy reinforcement learning, and $f$-Hybrid Alignment Loss ($f$-HAL), a hybrid on/off policy objectives, for general LLM alignment based on variational representation of $f$-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
中文摘要 最新研究表明，偏好对齐（PA）目标可视为对齐（被选中）与未对齐（被拒绝）响应分布之间的散度估计器。在本研究中，我们将这一基于散度的视角扩展到一般的对齐场景，例如仅有环境奖励可用的带可验证奖励的强化学习（RLVR）。在这一统一框架下，我们基于$f$-散度的变分表示，提出了用于通用LLM对齐的$f$-群组相对策略优化（$f$-GRPO，一类同策略强化学习目标）和$f$-混合对齐损失（$f$-HAL，一类混合同策略/异策略目标）。我们从理论上保证，这些目标类别在对齐后能够提升平均奖励。在实证方面，我们在RLVR（数学推理）和PA（安全对齐）任务上验证了该框架，结果显示其性能和灵活性均优于现有方法。
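
For reference, the variational representation mentioned in this abstract is the standard convex-conjugate form of an $f$-divergence. The identity below is the general textbook statement (it holds under the usual regularity conditions) and not the paper's specific objective; $T$ denotes an arbitrary critic function and $f^*$ the convex conjugate of $f$, both notation assumed here. In the abstract's setting, $P$ and $Q$ would presumably play the roles of the aligned and unaligned response distributions.

```latex
D_f(P \,\|\, Q)
  = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(x) \right) \right]
  = \sup_{T} \Big\{ \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\big[ f^*(T(x)) \big] \Big\},
\qquad
f^*(t) = \sup_{u} \{\, ut - f(u) \,\}.
```

As a worked instance, the KL divergence corresponds to $f(u) = u \log u$, whose conjugate is $f^*(t) = e^{t-1}$.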

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

学会分享：高效并行智能系统的选择性记忆

Authors: Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.05965
Pdf link: https://arxiv.org/pdf/2602.05965
Abstract Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: this https URL
中文摘要 智能系统通过协调多个代理来解决复杂任务，这些代理通过迭代推理、调用工具并交换中间结果。为了提高鲁棒性和解决方案质量，最新方法部署多个代理团队并行运行，探索不同的推理轨迹。然而，并行执行会带来显著的计算成本：当不同团队独立推理相似的子问题或执行类似步骤时，他们会反复执行大量重叠的计算。为解决这些局限性，本文提出了学习共享（LTS），这是一种针对并行代理框架的学习共享记忆机制，能够在控制上下文增长的同时实现跨团队信息的选择性重用。LTS 引入了全局内存库，所有团队均可访问，并配备了轻量级控制器，决定是否将中间代理步骤添加到内存中。控制器通过逐步强化学习和使用意识的信用分配进行训练，使其能够识别在并行执行中全局有用的信息。AssistantBench和GAIA基准测试的实验表明，LTS显著缩短了整体运行时间，同时与无内存并行基线匹配或提升任务表现，证明学习记忆录入是提升并行智能系统效率的有效策略。项目页面：此 https URL
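
The mechanism described in this abstract, a global memory bank plus a controller that decides whether each intermediate step is admitted, can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, and the learned reinforcement-learning controller is replaced here by a simple duplicate check purely so that the example runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    team: str      # which parallel team produced the step
    content: str   # intermediate result the team wants to share

@dataclass
class SharedMemoryBank:
    # admit(step, current_entries) -> bool; stands in for the learned controller
    admit: Callable[[Step, List[Step]], bool]
    entries: List[Step] = field(default_factory=list)

    def offer(self, step: Step) -> bool:
        """Ask the controller whether to store the step; return its decision."""
        if self.admit(step, self.entries):
            self.entries.append(step)
            return True
        return False

    def read(self) -> List[Step]:
        """Every team sees the same global memory."""
        return list(self.entries)

def novelty_controller(step: Step, entries: List[Step]) -> bool:
    """Hypothetical stand-in policy: admit only content not already stored."""
    return all(step.content != e.content for e in entries)

bank = SharedMemoryBank(admit=novelty_controller)
admitted = [
    bank.offer(Step("team_a", "result of sub-problem X")),
    bank.offer(Step("team_b", "result of sub-problem X")),  # overlapping work
    bank.offer(Step("team_b", "result of sub-problem Y")),
]
```

Keeping admission behind a single callable means a trained policy could be swapped in for the heuristic without changing how teams read from or write to the bank.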

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

VisRefiner：从视觉差异中学习截图到代码生成

Authors: Jie Deng, Kaichun Yao, Libo Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.05998
Pdf link: https://arxiv.org/pdf/2602.05998
Abstract Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
中文摘要 截图转代码生成旨在将用户界面截图转化为可执行的前端代码，忠实还原目标布局和风格。现有的多模态大型语言模型直接从截图进行映射，但训练时未观察其生成代码的视觉结果。相比之下，人类开发者通过迭代渲染实现，与设计进行比较，学习视觉差异如何与代码变化相关。受这一过程启发，我们提出了VisRefiner，这是一个训练框架，使模型能够从渲染的预测与参考设计之间的视觉差异中学习。我们构建了差异对齐监督，将视觉差异与相应的代码编辑联系起来，使模型能够理解实现变更导致的外观差异。在此基础上，我们引入了自我精炼的强化学习阶段，模型通过观察渲染输出和目标设计，识别它们的视觉差异，并相应更新代码，从而改进其生成代码。实验显示，VisRefiner 显著提升了单步生成的质量和布局精度，同时赋予模型强大的自我精炼能力。这些结果展示了从视觉差异中学习在推进截图到代码生成方面的有效性。

On Computation and Reinforcement Learning

关于计算与强化学习

Authors: Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.05999
Pdf link: https://arxiv.org/pdf/2602.05999
Abstract How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture (1) achieves stronger performance simply by using more compute, and (2) generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks using up to 5 times more parameters.
中文摘要 强化学习（RL）策略可用的计算量如何影响其学习？使用固定参数数量的策略还能从额外的计算中受益吗？标准的强化学习框架没有提供正式回答这些问题的语言。从经验上看，深度强化学习策略通常被参数化为具有静态架构的神经网络，从而混淆了计算量和参数数量。本文形式化了计算受限策略，并证明使用更多计算的策略能够解决计算量较少的策略无法解决的问题，并泛化到后者无法处理的更长时域任务。基于算法学习和无模型规划方面的已有工作，我们提出了一种可以使用可变计算量的最小架构。我们的实验与理论相互印证。在一组涵盖在线和离线强化学习的31个不同任务上，我们表明该架构（1）仅通过使用更多计算即可获得更强的性能，（2）在更长时域的测试任务上，比参数量多至5倍的标准前馈网络或深度残差网络具有更强的泛化能力。

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

学习运行时代理内存的查询感知预算层路由

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06025
Pdf link: https://arxiv.org/pdf/2602.06025
Abstract Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
中文摘要 对于在单一上下文窗口之外运行的大型语言模型（LLM）代理而言，内存正变得越来越核心，但大多数现有系统依赖离线的、与查询无关的内存构建，这种方式可能效率低下，且可能丢弃对查询至关重要的信息。虽然运行时内存利用是一个自然的替代方案，但先前的工作通常会产生较大的开销，并且对性能-成本权衡的显式控制有限。在本研究中，我们提出了BudgetMem，一个运行时代理内存框架，用于显式的、查询感知的性能-成本控制。BudgetMem将内存处理结构化为一组内存模块，每个模块提供三个预算层级（即Low/Mid/High）。一个轻量级路由器在各模块间执行预算层级路由，以平衡任务性能和内存构建成本；该路由器实现为一个通过强化学习训练的紧凑神经策略。以BudgetMem为统一测试平台，我们研究了实现预算层级的三种互补策略：实现（方法复杂度）、推理（推理行为）和容量（模块模型大小）。在LoCoMo、LongMemEval和HotpotQA上，当优先考虑性能（即高预算设置）时，BudgetMem超越了强基线，并在更紧的预算下给出了更优的准确率-成本前沿。此外，我们的分析厘清了不同分层策略的优缺点，阐明了在不同预算条件下各个维度何时能带来最有利的权衡。
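
The budget-tier routing idea in this abstract can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not BudgetMem itself: the module names, the per-query quality and cost estimates, and the `cost_weight` parameter are all hypothetical, and a greedy rule stands in for the compact neural policy trained with reinforcement learning. It only shows how a single trade-off weight shifts each module between the Low, Mid, and High tiers.

```python
TIERS = ("Low", "Mid", "High")

def route(modules, cost_weight):
    """For each module, pick the tier maximizing quality - cost_weight * cost.

    `modules` maps a module name to {tier: (estimated_quality, cost)}.
    This greedy rule is a stand-in for a learned routing policy.
    """
    plan = {}
    for name, tiers in modules.items():
        plan[name] = max(
            TIERS,
            key=lambda t: tiers[t][0] - cost_weight * tiers[t][1],
        )
    return plan

# Made-up per-query estimates for two hypothetical memory modules.
modules = {
    "filter":    {"Low": (0.60, 1.0), "Mid": (0.70, 2.0), "High": (0.75, 5.0)},
    "summarize": {"Low": (0.50, 1.0), "Mid": (0.72, 3.0), "High": (0.80, 8.0)},
}

performance_first = route(modules, cost_weight=0.0)
balanced = route(modules, cost_weight=0.05)
tight_budget = route(modules, cost_weight=0.2)
```

With these numbers, ignoring cost selects the High tier everywhere, a small cost weight moves both modules to Mid, and a larger one moves both to Low.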

Can vision language models learn intuitive physics from interaction?

视觉语言模型能从互动中学习直觉物理吗？

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06033
Pdf link: https://arxiv.org/pdf/2602.06033
Abstract Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
中文摘要 预训练的视觉语言模型对物理世界没有良好的直觉。最新研究表明，监督式微调可以提升模型在简单物理任务中的表现。然而，微调模型似乎无法学习能够推广到新情境的稳健物理规则。基于认知科学的研究，我们假设模型需要与环境互动，才能正确学习其物理动态。我们使用强化学习训练通过与环境互动来学习的模型。虽然从交互中学习可以提升模型在任务内的表现，但它无法产生具有可泛化物理直觉的模型。我们发现，无论模型是否通过交互进行训练，即使任务共享视觉统计特性和物理原理，在某一任务上训练的模型也无法可靠地泛化到相关任务。

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

V-Retrver：基于证据的代理推理用于普遍多模态检索

Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06034
Pdf link: https://arxiv.org/pdf/2602.06034
Abstract Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
中文摘要 多模态大型语言模型（MLLM）最近被应用于通用多模态检索，其中思维链（Chain-of-Thought，CoT）推理改善了候选项的重排序。然而，现有方法仍主要由语言驱动，依赖静态视觉编码，缺乏主动验证细粒度视觉证据的能力，这常常导致在视觉上模棱两可的情形中出现推测性推理。我们提出了V-Retrver，一种证据驱动的检索框架，将多模态检索重新表述为一种以视觉检查为基础的智能体推理过程。V-Retrver使MLLM能够在推理过程中通过外部视觉工具有选择地获取视觉证据，执行一种多模态交错推理过程，在假设生成和有针对性的视觉验证之间交替进行。为了训练这样一个收集证据的检索智能体，我们采用课程式学习策略，结合监督推理激活、基于拒绝的精炼，以及以证据对齐为目标的强化学习。在多个多模态检索基准上的实验表明，检索准确率（平均提升23.0%）、感知驱动推理的可靠性以及泛化能力均得到一致提升。

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

InterPrior：基于物理的人与物交互的生成控制尺度化

Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.06035
Pdf link: https://arxiv.org/pdf/2602.06035
Abstract Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
中文摘要 人类很少在显式全身动作的层面上规划与物体的全身交互。可供性等高层意图定义了目标，而协调的平衡、接触和操控则可以从底层的物理和运动先验中自然涌现。扩展这类先验，是使人形机器人能够在多样情境中组合并泛化移动操作（loco-manipulation）技能、同时保持物理上连贯的全身协调的关键。为此，我们提出了InterPrior，这是一个可扩展的框架，通过大规模模仿预训练和基于强化学习的后训练来学习统一的生成式控制器。InterPrior首先将一个全参考模仿专家蒸馏为一个通用的、以目标为条件的变分策略，该策略能够从多模态观测和高层意图中重建运动。虽然蒸馏后的策略能够重建训练行为，但由于大规模人-物交互的配置空间极其庞大，它无法可靠地泛化。为此，我们采用带物理扰动的数据增强，然后进行强化学习微调，以提升在未见过的目标和初始状态上的能力。这些步骤共同将重建的潜在技能整合为一个有效的流形，得到能够泛化到训练数据之外的运动先验，例如它可以纳入与未见过的物体交互等新行为。我们还进一步展示了其在用户交互控制方面的有效性，以及在真实机器人上部署的潜力。

Keyword: diffusion policy

There is no result