Arxiv Papers of Today

Generated: 2026-02-03 16:45:37 (UTC+8); arXiv announcement: 2026-02-03 20:00 EST (2026-02-04 09:00 UTC+8)

134 relevant papers today

Keyword: reinforcement learning

AutoBool: An Reinforcement-Learning trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews

Authors: Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.00005
Pdf link: https://arxiv.org/pdf/2602.00005
Abstract We present AutoBool, a reinforcement learning (RL) framework that trains large language models (LLMs) to generate effective Boolean queries for medical systematic reviews. Boolean queries are the primary mechanism for literature retrieval in this domain and must achieve high recall while maintaining reasonable precision, a balance that existing prompt-based LLM approaches often struggle to strike. A major limitation in this space is the lack of high-quality ground-truth Boolean queries for each topic, which makes supervised fine-tuning impractical. AutoBool addresses this challenge by using RL to directly optimize query generation with retrieval measures, without requiring target queries. To support this effort, we create and release the largest dataset of its kind: 65,588 topics in total for training and evaluating automatic Boolean query formulation. Experiments on our new dataset and two established datasets (CLEF TAR and Seed Collection) show that AutoBool significantly outperforms zero-shot and few-shot prompting and, using smaller backbones, matches or exceeds the effectiveness of much larger GPT-based models (e.g., GPT-4o, O3). It also approaches the effectiveness of expert-authored queries while retrieving 10 to 16 times fewer documents. Ablation studies reveal the critical roles of model backbone, size, decoding temperature, and prompt design. Code and data are available via the link on the arXiv abstract page.
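
To make the idea of optimizing query generation directly with retrieval measures more concrete, here is a minimal sketch of a reward computed from the documents a Boolean query retrieves. The function name `retrieval_reward`, the F-beta form, and the default `beta=3.0` are illustrative assumptions; the abstract does not state which retrieval measure the paper uses as its reward.

```python
def retrieval_reward(retrieved, relevant, beta=3.0):
    """Hypothetical recall-weighted F-beta reward for one generated query.

    retrieved: set of document ids returned by the Boolean query
    relevant:  set of document ids judged relevant for the review topic
    beta > 1 weights recall more heavily than precision (an assumption here).
    """
    if not retrieved or not relevant:
        return 0.0
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A query that retrieves every relevant document but twice as many documents as needed still scores close to 1 under this weighting, which mirrors the recall-first priority of systematic reviews.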

Representation Learning Enhanced Deep Reinforcement Learning for Optimal Operation of Hydrogen-based Multi-Energy Systems

Authors: Zhenyu Pu, Yu Yang, Lun Yang, Qing-Shan Jia, Xiaohong Guan, Costas J. Spanos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.00027
Pdf link: https://arxiv.org/pdf/2602.00027
Abstract Hydrogen-based multi-energy systems (HMES) have emerged as a promising low-carbon and energy-efficient solution, as they enable the coordinated operation of electricity, heating, and cooling supply and demand to enhance operational flexibility, improve overall energy efficiency, and increase the share of renewable integration. However, the optimal operation of HMES remains challenging due to the nonlinear and multi-physics coupled dynamics of hydrogen energy storage systems (HESS), which consist of electrolyzers, fuel cells, and hydrogen tanks, as well as the presence of multiple uncertainties from supply and demand. To address these challenges, this paper develops a comprehensive operational model for HMES that fully captures the nonlinear dynamics and multi-physics processes of HESS. Moreover, we propose an enhanced deep reinforcement learning (DRL) framework that integrates emerging representation learning techniques, enabling substantially accelerated and improved policy optimization for spatially and temporally coupled complex networked systems, which conventional DRL does not provide. Experimental studies based on real-world datasets show that the comprehensive model is crucial to ensuring the safe and reliable operation of HESS. In addition, the proposed SR-DRL approaches demonstrate superior convergence rate and performance over conventional DRL counterparts in terms of reducing the operation cost of HMES and handling the system operating constraints. Finally, we provide some insights into the role of representation learning in DRL, speculating that it can reorganize the original state space into a well-structured and cluster-aware geometric representation, thereby smoothing and facilitating the learning process of DRL.

Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints

Authors: Sebastian Racedo, Brigitte Jaumard, Oscar Delgado, Meysam Masoudi
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.00035
Pdf link: https://arxiv.org/pdf/2602.00035
Abstract Networks in current 5G and beyond systems increasingly carry heterogeneous traffic with diverse quality-of-service constraints, making real-time routing decisions both complex and time-critical. Common approaches, such as heuristics with human intervention, training a single centralized RL policy, or synchronizing updates across multiple learners, struggle with scalability and straggler effects. We address this by proposing an asynchronous multi-agent reinforcement learning (AMARL) framework in which independent PPO agents, one per service, plan routes in parallel and commit resource deltas to a shared global resource environment. This coordination by state preserves feasibility across services and enables specialization for service-specific objectives. We evaluate the method on an O-RAN-like network simulation using near-real-time traffic data from the city of Montreal, comparing against a single-agent PPO baseline. AMARL achieves a similar Grade of Service (GoS, i.e., acceptance rate) and end-to-end latency, with reduced training wall-clock time and improved robustness to demand shifts. These results suggest that asynchronous, service-specialized agents provide a scalable and practical approach to distributed routing, with applicability extending beyond the O-RAN domain.
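
The coordination-by-state idea, in which each per-service agent commits resource deltas to a shared global environment, can be sketched as a lock-protected capacity ledger. The class `SharedResources`, the link names, and the all-or-nothing commit rule are assumptions for illustration only; the abstract does not specify how commits are validated.

```python
import threading


class SharedResources:
    """Hypothetical global resource state: remaining capacity per link."""

    def __init__(self, capacity):
        self._remaining = dict(capacity)  # link name -> remaining bandwidth
        self._lock = threading.Lock()

    def commit(self, route, demand):
        """Atomically apply a resource delta if every link on the route can carry it."""
        with self._lock:
            if any(self._remaining[link] < demand for link in route):
                return False  # reject: the route would violate a capacity constraint
            for link in route:
                self._remaining[link] -= demand
            return True

    def remaining(self, link):
        with self._lock:
            return self._remaining[link]
```

Because each commit is checked and applied under one lock, agents running in separate threads can plan independently without ever leaving the shared state infeasible.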

Distributional Reinforcement Learning for Condition-Based Maintenance of Multi-Pump Equipment

Authors: Takato Yasuno
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00051
Pdf link: https://arxiv.org/pdf/2602.00051
Abstract Condition-Based Maintenance (CBM) marks a shift from reactive to proactive equipment management in modern industrial systems. Conventional time-based maintenance schedules frequently cause unnecessary expenditure and unanticipated equipment failures. In contrast, CBM uses real-time equipment condition data to improve maintenance timing and optimize resource allocation. This paper proposes a novel distributional reinforcement learning approach for multi-equipment CBM using Quantile Regression Deep Q-Networks (QR-DQN) with aging factor integration. The method manages multiple pump units concurrently under three strategic scenarios: safety-first, balanced, and cost-efficient. Experimental validation over 3,000 training episodes demonstrates significant performance improvements across all strategies. The Safety-First strategy demonstrates superior cost efficiency, with a return on investment (ROI) of 3.91, yielding 152% better performance than alternatives while requiring only 31% higher investment. The system exhibits 95.66% operational stability and immediate applicability to industrial environments.
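
For readers unfamiliar with QR-DQN, the sketch below shows the quantile Huber loss that distributional methods of this family typically minimize, written in plain Python. It follows the commonly published QR-DQN formulation with quantile midpoints (2i + 1) / (2N); the function name and the default `kappa=1.0` are assumptions, and the paper's aging-factor integration is not modeled here.

```python
def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile Huber loss between predicted quantiles and target samples.

    pred:   list of N predicted quantile values for one state-action pair
    target: list of M target return samples (e.g. Bellman targets)
    """
    n = len(pred)
    total = 0.0
    for i, theta in enumerate(pred):
        tau = (2 * i + 1) / (2 * n)  # midpoint of the i-th quantile bin
        row = 0.0
        for t in target:
            u = t - theta  # pairwise temporal-difference error
            if abs(u) <= kappa:
                huber = 0.5 * u * u
            else:
                huber = kappa * (abs(u) - 0.5 * kappa)
            # Asymmetric weight: under- and over-estimates are penalized differently.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            row += weight * huber / kappa
        total += row / len(target)  # expectation over target samples
    return total
```

The asymmetric weight is what pushes each output toward a different quantile of the return distribution rather than toward the mean.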

Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints

Authors: Evan Chen, Wenzhi Fang, Shiqiang Wang, Christopher Brinton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00166
Pdf link: https://arxiv.org/pdf/2602.00166
Abstract Locally deployed Small Language Models (SLMs) must continually support diverse tasks under strict memory and computation constraints, making selective reliance on cloud Large Language Models (LLMs) unavoidable. Regulating cloud assistance during continual learning is challenging, as naive reward-based reinforcement learning often yields unstable offloading behavior and exacerbates catastrophic forgetting as task distributions shift. We propose DA-GRPO, a dual-advantage extension of Group Relative Policy Optimization that incorporates cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. This design enables the local model to jointly learn task competence and collaboration behavior, allowing cloud requests to emerge naturally during post-training while respecting a prescribed assistance budget. Experiments on mathematical reasoning and code generation benchmarks show that DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.
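
As background for DA-GRPO, the following sketch shows the standard group-relative advantage used by Group Relative Policy Optimization: each sampled response is scored against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions; the paper's dual-advantage term for cloud-usage constraints is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.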

Learning Robust Reasoning through Guided Adversarial Self-Play

Authors: Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Lizhang Chen, Amy Zhang, Liu Leqi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00173
Pdf link: https://arxiv.org/pdf/2602.00173
Abstract Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, yet they can fail catastrophically when the conditioning context is fallible (e.g., corrupted chain-of-thought, misleading partial solutions, or mild input perturbations), since standard RLVR optimizes final-answer correctness only under clean conditioning. We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities using only outcome verification. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose in-distribution repair guidance, an imitation term on self-generated repairs that increases recovery probability while preserving previously acquired capabilities. Across four open-weight models (1.5B to 8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Further analysis shows that adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Authors: Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00181
Pdf link: https://arxiv.org/pdf/2602.00181
Abstract Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying reinforcement learning to the O-T-A reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.

From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

Authors: Mohit Jiwatode, Alexander Dockhorn, Bodo Rosenhahn
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00190
Pdf link: https://arxiv.org/pdf/2602.00190
Abstract Deep learning agents can achieve high performance in complex game domains, often without understanding the underlying causal game mechanics. To address this, we investigate Causal Induction: the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from just raw gameplay observations to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81% in blind evaluations and yielding fewer logically inconsistent rules. These learned SCMs can be used for downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.

Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning

Authors: Naman Saxena, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00282
Pdf link: https://arxiv.org/pdf/2602.00282
Abstract Several important problem settings within the literature of reinforcement learning (RL), such as meta-learning, hierarchical learning, and RL from human feedback (RL-HF), can be modelled as bilevel RL problems. A lot has been achieved in these domains empirically; however, the theoretical analysis of bilevel RL algorithms hasn't received a lot of attention. In this work, we analyse the sample complexity of a constrained bilevel RL algorithm, building on the progress in the unconstrained setting. We obtain an iteration complexity of $O(\epsilon^{-2})$ and sample complexity of $\tilde{O}(\epsilon^{-4})$ for our proposed algorithm, Constrained Bilevel Subgradient Optimization (CBSO). We use a penalty-based objective function to avoid the issue of primal-dual gap and hyper-gradient in the context of a constrained bilevel problem setting. The penalty-based formulation to handle constraints requires analysis of non-smooth optimization. We are the first ones to analyse the generally parameterized policy gradient-based RL algorithm with a non-smooth objective function using the Moreau envelope.
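
The Moreau envelope mentioned above smooths a non-smooth function $f$ via $f_\lambda(x) = \min_y \{ f(y) + \frac{1}{2\lambda}(x - y)^2 \}$. The sketch below illustrates this on the textbook case $f(y) = |y|$, whose envelope is the Huber function, and checks the closed form against a brute-force grid minimization. The function names, the grid bounds, and the choice of $f$ are assumptions for illustration; this is not the paper's penalty objective.

```python
def moreau_envelope_abs(x, lam):
    """Closed form of min_y |y| + (x - y)^2 / (2 * lam), i.e. the Huber function."""
    if abs(x) <= lam:
        return x * x / (2 * lam)
    return abs(x) - lam / 2


def moreau_envelope_grid(f, x, lam, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force approximation of the envelope by minimizing over a grid of y."""
    best = float("inf")
    for k in range(steps):
        y = lo + (hi - lo) * k / (steps - 1)
        best = min(best, f(y) + (x - y) ** 2 / (2 * lam))
    return best
```

The envelope is differentiable even at the kink of $|y|$, which is why it is a convenient tool for defining stationarity in non-smooth problems.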

AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning

Authors: Chongyu Qu, Zhengyi Lu, Yuxiang Lai, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Yanfan Zhu, Yuechen Yang, Allen J. Luna, Kim L. Sandler, Bennett A. Landman, Yuankai Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00347
Pdf link: https://arxiv.org/pdf/2602.00347
Abstract Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.
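
The sequential decision process described above, where a policy either adds one more modality or stops and predicts, can be sketched as a simple rollout loop. The names `select_modalities`, `STOP`, `toy_policy`, and the modality labels are hypothetical; in the paper the policy is a learned network, which this sketch replaces with a hand-written rule.

```python
STOP = "stop"


def select_modalities(available, policy):
    """Roll out a policy that adds one modality per step or stops early."""
    chosen = []
    remaining = list(available)
    while remaining:
        # The policy conditions on what has been observed so far.
        action = policy(tuple(chosen), tuple(remaining))
        if action == STOP:
            break  # enough information gathered: proceed to prediction
        if action not in remaining:
            raise ValueError(f"policy chose unavailable modality: {action!r}")
        remaining.remove(action)
        chosen.append(action)
    return chosen


def toy_policy(chosen, remaining):
    # Assumption for illustration: take the CT image first, then stop.
    return "ct" if "ct" in remaining else STOP
```

Calling `select_modalities(["clinical", "ct", "report"], toy_policy)` selects only the CT modality and terminates, skipping the cost of the other two encoders.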

MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI

Authors: Zhengyi Lu, Ming Lu, Chongyu Qu, Junchao Zhu, Junlin Guo, Marilyn Lionts, Yanfan Zhu, Yuechen Yang, Tianyuan Yao, Jayasai Rajagopal, Bennett Allan Landman, Xiao Wang, Xinqiang Yan, Yuankai Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.00348
Pdf link: https://arxiv.org/pdf/2602.00348
Abstract Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns. Experiments demonstrate that MASC's learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC are publicly available via the link on the arXiv abstract page.

ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Authors: Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.00350
Pdf link: https://arxiv.org/pdf/2602.00350
Abstract Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search, while reasoning-based and heuristic techniques lack direct feedback from the target model's latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model's noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available via the link on the arXiv abstract page.

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

Authors: Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00400
Pdf link: https://arxiv.org/pdf/2602.00400
Abstract Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejection-sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.

ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control

Authors: Jean Pierre Sleiman, He Li, Alphonsus Adu-Bredu, Robin Deits, Arun Kumar, Kevin Bergamin, Mohak Bhardwaj, Scott Biddlestone, Nicola Burger, Matthew A. Estrada, Francesco Iacobelli, Twan Koolen, Alexander Lambert, Erica Lin, M. Eva Mungai, Zach Nobles, Shane Rozen-Levy, Yuyao Shi, Jiashun Wang, Jakob Welner, Fangzhou Yu, Mike Zhang, Alfred Rizzi, Jessica Hodgins, Sylvain Bertrand, Yeuhi Abe, Scott Kuindersma, Farbod Farshidian
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00401
Pdf link: https://arxiv.org/pdf/2602.00401
Abstract Achieving robust, human-like whole-body control on humanoid robots for agile, contact-rich behaviors remains a central challenge, demanding heavy per-skill engineering and a brittle process of tuning controllers. We introduce ZEST (Zero-shot Embodied Skill Transfer), a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources -- high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation -- and deploys them to hardware zero-shot. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. Its training pipeline combines adaptive sampling, which focuses training on difficult motion segments, and an automatic curriculum using a model-based assistive wrench, together enabling dynamic, long-horizon maneuvers. We further provide a procedure for selecting joint-level gains from approximate analytical armature values for closed-chain actuators, along with a refined model of actuators. Trained entirely in simulation with moderate domain randomization, ZEST demonstrates remarkable generality. On Boston Dynamics' Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture. It transfers expressive dance and scene-interaction skills, such as box-climbing, directly from videos to Atlas and the Unitree G1. Furthermore, it extends across morphologies to the Spot quadruped, enabling acrobatics, such as a continuous backflip, through animation. Together, these results demonstrate robust zero-shot deployment across heterogeneous data sources and embodiments, establishing ZEST as a scalable interface between biological movements and their robotic counterparts.
中文摘要 在人形机器人上实现稳健、类人的全身控制，以完成敏捷且接触丰富的行为，仍是一项核心挑战，需要大量针对单项技能的工程工作以及脆弱的控制器调参过程。我们提出ZEST（Zero-shot Embodied Skill Transfer，零样本具身技能迁移），这是一个精简的动作模仿框架，通过强化学习从多种来源（高保真动作捕捉、含噪声的单目视频和不受物理约束的动画）训练策略，并将其零样本部署到硬件上。ZEST 可泛化到不同行为和平台，同时无需接触标签、参考或观测窗口、状态估计器以及大量的奖励塑形。其训练流程结合了自适应采样（将训练集中于困难的运动片段）和使用基于模型的辅助力旋量（assistive wrench）的自动课程，二者共同实现了动态、长时程的动作。我们还给出了一种根据闭链执行器的近似解析电枢值选择关节级增益的方法，以及一个改进的执行器模型。ZEST完全在仿真中以中等程度的域随机化进行训练，展现出显著的通用性。在波士顿动力的Atlas人形机器人上，ZEST从动作捕捉数据中学习动态的多接触技能（如匍匐前进、霹雳舞）。它将富有表现力的舞蹈和场景交互技能（如爬箱子）直接从视频迁移到Atlas和Unitree G1。此外，它还跨形态扩展到Spot四足机器人，通过动画实现了连续后空翻等杂技动作。这些结果共同展示了跨异构数据源和不同本体的稳健零样本部署，确立了ZEST作为生物运动与其机器人对应物之间可扩展接口的地位。

DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

DROGO：通过强化学习中的图优化实现默认表示目标

Authors: Hon Tik Tse, Marlos C. Machado
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00403
Pdf link: https://arxiv.org/pdf/2602.00403
Abstract In computational reinforcement learning, the default representation (DR) and its principal eigenvector have been shown to be effective for a wide variety of applications, including reward shaping, count-based exploration, option discovery, and transfer. However, in prior investigations, the eigenvectors of the DR were computed by first approximating the DR matrix, and then performing an eigendecomposition. This procedure is computationally expensive and does not scale to high-dimensional spaces. In this paper, we derive an objective for directly approximating the principal eigenvector of the DR with a neural network. We empirically demonstrate the effectiveness of the objective in a number of environments, and apply the learned eigenvectors for reward shaping.
中文摘要 在计算强化学习中，默认表示（DR）及其主特征向量已被证明在多种应用中有效，包括奖励塑形、基于计数的探索、选项发现和转移。然而，在以往的研究中，DR的特征向量是先近似DR矩阵，然后进行特征分解计算的。该过程计算量大，且无法扩展到高维空间。本文推导了一个目标，用于用神经网络直接近似DR的主特征向量。我们通过实证方式证明了该目标在多种环境中的有效性，并将所学的特征向量应用于奖励塑造。
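
The two-step baseline that the DROGO abstract describes (estimate the DR matrix, then extract its principal eigenvector) can be illustrated with plain power iteration. This is only a sketch of that baseline, not the neural objective the paper derives; the 3x3 matrix is a hypothetical stand-in for an already-estimated DR, and `principal_eigenvector` is an illustrative name.

```python
import math

def principal_eigenvector(matrix, iters=500, tol=1e-12):
    """Power iteration: return (eigenvalue, unit eigenvector) of the dominant eigenpair."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iters):
        # Multiply by the matrix, then renormalise to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        # Rayleigh quotient of the unit vector w estimates the eigenvalue.
        new_eigenvalue = sum(
            w[i] * sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    return eigenvalue, v

# Hypothetical symmetric matrix standing in for an estimated DR (assumption, toy values).
toy_dr = [[2.0, 1.0, 0.0],
          [1.0, 3.0, 1.0],
          [0.0, 1.0, 2.0]]
value, vector = principal_eigenvector(toy_dr)
```

Every iteration needs the full matrix, which is the scaling problem the abstract points to in high-dimensional state spaces.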

Variational Approach for Job Shop Scheduling

工作车间调度的变分方法

Authors: Seung Heon Oh, Jiwon Baek, Ki Young Cho, Hee Chang Yoon, Jong Hun Woo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00408
Pdf link: https://arxiv.org/pdf/2602.00408
Abstract This paper proposes a novel Variational Graph-to-Scheduler (VG2S) framework for solving the Job Shop Scheduling Problem (JSSP), a critical task in manufacturing that directly impacts operational efficiency and resource utilization. Conventional Deep Reinforcement Learning (DRL) approaches often face challenges such as non-stationarity during training and limited generalization to unseen problem instances because they optimize representation learning and policy execution simultaneously. To address these issues, we introduce variational inference to the JSSP domain for the first time and derive a probabilistic objective based on the Evidence Lower Bound (ELBO) with maximum entropy reinforcement learning. By mathematically decoupling representation learning from policy optimization, the VG2S framework enables the agent to learn robust structural representations of scheduling instances through a variational graph encoder. This approach significantly enhances training stability and robustness against hyperparameter variations. Extensive experiments demonstrate that the proposed method exhibits superior zero-shot generalization compared with state-of-the-art DRL baselines and traditional dispatching rules, particularly on large-scale and challenging benchmark instances such as DMU and SWV.
中文摘要 本文提出了一种新的变分图到调度器（VG2S）框架，用于解决作业车间调度问题（JSSP），这是制造业中直接影响运营效率和资源利用的关键任务。传统的深度强化学习（DRL）方法常面临诸如训练过程中的非平稳性以及对未见问题实例的有限泛化等挑战，因为它们同时优化了表征学习和策略执行。为解决这些问题，我们首次将变分推断引入JSSP领域，并结合最大熵强化学习，基于证据下界（ELBO）推导出概率目标。通过数学上将表示学习与策略优化解耦，VG2S框架使智能体能够通过变分图编码器学习调度实例的稳健结构表示。该方法显著提升了训练的稳定性和对超参数变化的鲁棒性。大量实验表明，与最先进的DRL基线和传统调度规则相比，该方法具有更优的零样本泛化能力，尤其是在DMU和SWV等大规模且具有挑战性的基准实例上。
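
For readers unfamiliar with the two ingredients named in the VG2S abstract, their textbook forms are given below. These are the generic ELBO and the generic maximum-entropy RL objective; how the paper combines them is not stated in the abstract, and the symbols ($x$ for a scheduling instance, $z$ for its latent representation, $\alpha$ for the entropy weight) are assumptions made for illustration.

```latex
% Generic evidence lower bound for a latent-variable model
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)

% Generic maximum-entropy RL objective
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t)
  + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right)\right]
```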

Open Materials Generation with Inference-Time Reinforcement Learning

带推理时间强化学习的开放材料生成

Authors: Philipp Hoellmer, Stefano Martiniani
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2602.00424
Pdf link: https://arxiv.org/pdf/2602.00424
Abstract Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and, correspondingly, reduction in generation time.
中文摘要 晶体材料的连续时间生成模型通过学习预测稳定晶体结构实现逆材料设计，但将明确的靶向性质纳入生成过程仍具挑战性。策略梯度强化学习（RL）提供了一种原则性机制，用于将生成模型与下游目标对齐，但通常需要访问评分，这阻碍了其应用于仅学习速度场的基于流的模型。我们引入了带有推理时间强化学习的开放材料生成（OMatG-IRL），这是一个策略梯度强化学习框架，直接作用于学习到的速度场，消除了对分数显式计算的需求。OMatG-IRL利用底层生成动力学的随机扰动，保持预训练生成模型的基线性能，同时支持在推断时进行探索和策略梯度估计。利用OMatG-IRL，我们首次展示了强化学习在晶体结构预测（CSP）中的应用。我们的方法能够有效强化基于能量的目标，同时通过组成条件保持多样性，并实现与基于评分的强化学习方法的表现。最后，我们证明OMatG-IRL能够学习时间相关的速度退火计划，实现精确的CSP，同时提升采样效率数量级，并相应缩短生成时间。

LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

作为高维非线性自回归模型的LLMs：训练、对齐与推断

Authors: Vikram Krishnamurthy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2602.00426
Pdf link: https://arxiv.org/pdf/2602.00426
Abstract Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear--softmax--linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.
中文摘要 基于变换器架构的大型语言模型（LLM）通常通过一组架构组件和训练过程来描述，掩盖了其底层的计算结构。本综述文章为寻求明确、方程级描述LLM训练、比对和生成的研究者提供了简明的数学参考。我们将LLM表述为具有基于注意力的依赖的高维非线性自回归模型。该框架涵盖了通过下一个令牌预测进行预训练、基于人类反馈的强化学习（RLHF）、直接偏好优化（DPO）、拒绝抽样微调（RSFT）和可验证奖励的强化学习（RLVR）等比对方法，以及推理过程中的自回归生成。自我关注自然地以重复的双线性-软极大-线性合成形式出现，产生高度表现力的序列模型。该表述使得对对齐诱导行为（包括谄媚）、推理时间现象（如幻觉、上下文学习、思维链提示和检索增强生成）以及持续学习等扩展进行原则性分析成为可能，同时作为解释和理论进一步发展的简明参考。
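
The "autoregressive model with attention" view in the abstract above corresponds to two standard equations, written here in common notation rather than the review's own (the symbols $X$, $W_Q$, $W_K$, $W_V$, $d_k$ and the causal mask $M$ are assumptions). The second line makes the bilinear, softmax, linear composition visible: a bilinear score in $X$, a row-wise softmax, then a linear value map.

```latex
% Autoregressive factorisation of a token sequence
p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)

% Single-head causal self-attention
\mathrm{Attn}(X) =
  \mathrm{softmax}\!\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d_k}} + M\right) X W_V,
\qquad
M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}
```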

FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

FedMOA：针对异构奖励下的个性化推理LLM的联合GRPO

Authors: Ziyao Wang, Daeun Jung, Yexiao He, Guoheng Sun, Zheyu Shen, Myungjin Lee, Ang Li
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2602.00453
Pdf link: https://arxiv.org/pdf/2602.00453
Abstract Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO's critic-free architecture enables feasible on-device training, yet transitioning to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training through an online adaptive weighting mechanism via hypergradient descent, which prioritizes primary reasoning as auxiliary objectives saturate. On the server side, it utilizes a task- and accuracy-aware aggregation strategy to prioritize high-quality updates. Experiments on mathematical reasoning and code generation benchmarks demonstrate that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.
中文摘要 群相对策略优化（GRPO）最近成为通过在线多目标强化学习提升大型语言模型推理能力的有效方法。虽然对私人数据的个性化日益重要，但传统的强化学习（RL）对齐在设备上的联合学习中往往会占用内存，因为维护独立的批评网络会带来负担。GRPO无批评的架构使得可行的设备内训练成为可能，但转向联邦环境则带来了系统性挑战：奖励定义异质、多目标优化不平衡以及高昂的训练成本。我们提出了FedMOA，这是一个用于异质奖励下多目标对齐的联邦GRPO框架。FedMOA通过在线自适应加权机制通过超梯度下降稳定局部训练，优先考虑辅助目标饱和时的主要推理。在服务器端，它采用任务感知和准确性感知的聚合策略，优先推送高质量更新。数学推理和代码生成基准测试的实验表明，FedMOA始终优于联邦平均，准确率提升高达2.2%，同时提升全球性能、个性化和多目标平衡。
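
The critic-free property mentioned in the FedMOA abstract comes from GRPO scoring each sampled completion against the other completions for the same prompt. A minimal sketch of that group-relative advantage with a weighted multi-objective reward follows; the fixed weights and the two reward components are hypothetical, and FedMOA's hypergradient-based weight adaptation is not reproduced here.

```python
import statistics

def combine_rewards(reward_vectors, weights):
    """Scalarise per-objective rewards with fixed weights (assumed; FedMOA adapts them online)."""
    return [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt; each tuple is a hypothetical
# (correctness, format) reward pair.
group = [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
scalar = combine_rewards(group, weights=(0.8, 0.2))
advantages = group_relative_advantages(scalar)
```

Because the baseline is the group mean, no value network has to be stored on the device, which is the memory argument the abstract makes.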

Search Inspired Exploration in Reinforcement Learning

强化学习中的搜索启发探索

Authors: Georgios Sotirchos, Zlatan Ajanović, Jens Kober
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00460
Pdf link: https://arxiv.org/pdf/2602.00460
Abstract Exploration in environments with sparse rewards remains a fundamental challenge in reinforcement learning (RL). Existing approaches such as curriculum learning and Go-Explore often rely on hand-crafted heuristics, while curiosity-driven methods risk converging to suboptimal policies. We propose Search-Inspired Exploration in Reinforcement Learning (SIERL), a novel method that actively guides exploration by setting sub-goals based on the agent's learning progress. At the beginning of each episode, SIERL chooses a sub-goal from the frontier (the boundary of the agent's known state space), before the agent continues exploring toward the main task objective. The key contribution of our method is the sub-goal selection mechanism, which provides state-action pairs that are neither overly familiar nor completely novel. Thus, it ensures that the frontier is expanded systematically and that the agent is capable of reaching any state within it. Inspired by search, sub-goals are prioritized from the frontier based on estimates of cost-to-come and cost-to-go, effectively steering exploration towards the most informative regions. In experiments on challenging sparse-reward environments, SIERL outperforms dominant baselines in both achieving the main task goal and generalizing to reach arbitrary states in the environment.
中文摘要 在奖励稀疏的环境中进行探索仍然是强化学习（RL）中的根本挑战。现有的方法如课程学习和Go-Explore常依赖手工设计的启发式方法，而出于好奇心驱动的方法则有可能趋同于次优策略。我们提出了强化学习中的搜索启发探索（SIERL），这是一种新颖的方法，通过根据智能体的学习进度设定子目标，主动引导探索。每集开始时，SIERL会从\textit{frontier}（代理已知状态空间的边界）中选择一个子目标，然后代理继续探索主要任务目标。我们方法的关键贡献是子目标选择机制，该机制提供了既不熟悉也不完全新颖的状态-动作对。因此，它确保了边界被系统性地扩展，并且主体能够到达边界内的任何状态。受搜索启发，子目标从前沿基于成本和出行成本的估算优先排序，有效引导探索向最具信息量的区域。在挑战性稀疏奖励环境的实验中，SIERL在实现主要任务目标和推广到环境中任意状态方面均优于主流基线。
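
The search-style prioritisation in the SIERL abstract can be sketched as ranking frontier states by an estimated total cost. The abstract does not say how cost-to-come and cost-to-go are combined; the sum used below (as in A*) is an assumption, and the state names and cost values are made up for illustration.

```python
def select_subgoal(frontier, cost_to_come, cost_to_go):
    """Pick the frontier state with the lowest estimated total cost (A*-style sum, assumed)."""
    return min(frontier, key=lambda s: cost_to_come[s] + cost_to_go[s])

# Hypothetical frontier with estimated costs from the start and to the task goal.
frontier = ["A", "B", "C"]
cost_to_come = {"A": 2.0, "B": 5.0, "C": 3.0}
cost_to_go = {"A": 9.0, "B": 1.0, "C": 4.0}
chosen = select_subgoal(frontier, cost_to_come, cost_to_go)
```

State "A" is the cheapest to reach but far from the goal, so a rule based on the combined estimate prefers "B".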

AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models

AREAL-DTA：动态树注意力用于高效强化大型语言模型的学习

Authors: Jiarui Zhang, Yuchen Yang, Ran Yan, Zhiyu Mei, Liyuan Zhang, Daifeng Li, Wei Fu, Jiaxuan Gao, Shusheng Xu, Yi Wu, Binhang Yuan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00482
Pdf link: https://arxiv.org/pdf/2602.00482
Abstract Reinforcement learning (RL) based post-training for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that can frequently share long token prefixes. Existing RL frameworks usually process these sequences independently, repeatedly recomputing identical prefixes in the forward and backward passes of policy model training, leading to substantial inefficiencies in computation and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, prior tree-attention-based solutions rely on fully materialized attention masks and scale poorly in RL settings. In this paper, we introduce AREAL-DTA to efficiently exploit prefix sharing in RL training. AREAL-DTA employs a depth-first-search (DFS)-based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AREAL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. Across popular RL post-training workloads, AREAL-DTA achieves up to $8.31\times$ higher training throughput (on $\tau^2$-bench).
中文摘要 基于强化学习（RL）的大型语言模型（LLM）后训练计算成本高，因为它会产生许多常共享长令牌前缀的展开序列。现有的强化学习框架通常独立处理这些序列，在策略模型训练过程中，前向和后向传递时反复计算相同的前缀，导致计算和内存使用效率大幅降低。虽然前缀共享自然会在展开中形成树状结构，但以往基于树注意力的解决方案依赖于完全物化的注意力掩码，且在强化学习环境中扩展性较差。本文介绍了AREAL-DTA以高效利用强化学习训练中的前缀共享。AREAL-DTA采用基于深度优先搜索（DFS）的执行策略，在前向和后向计算过程中动态遍历展开前缀树，一次只物化一条根到叶路径。为进一步提升可扩展性，AREAL-DTA采用负载均衡的分布式批处理机制，在多个GPU上动态构建和处理前缀树。在流行的强化学习后训练工作负载上，AREAL-DTA的训练吞吐量最高提升至$8.31\times$（在$\tau^2$-bench上）。
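
The prefix-sharing idea in the AREAL-DTA abstract can be seen on a toy example: rollouts are merged into a prefix tree, and a depth-first traversal visits each rollout while keeping only one root-to-leaf path in memory. This sketch shows only the data structure and traversal order; it does not model attention computation, gradients, or GPU batching, and all names are illustrative.

```python
class Node:
    """One token position in the rollout prefix tree."""
    def __init__(self):
        self.children = {}
        self.is_end = False

def build_prefix_tree(sequences):
    root = Node()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, Node())
        node.is_end = True
    return root

def count_nodes(node):
    """Number of token nodes, i.e. tokens that must be processed once prefixes are shared."""
    return sum(1 + count_nodes(child) for child in node.children.values())

def dfs_paths(node, path=None):
    """Yield each complete rollout while holding a single root-to-leaf path."""
    path = [] if path is None else path
    if node.is_end:
        yield list(path)
    for tok, child in node.children.items():
        path.append(tok)
        yield from dfs_paths(child, path)
        path.pop()

# Three hypothetical rollouts (token ids) that share the prefix [1, 2].
rollouts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6]]
tree = build_prefix_tree(rollouts)
total_tokens = sum(len(r) for r in rollouts)   # tokens if processed independently
unique_tokens = count_nodes(tree)              # tokens after prefix sharing
paths = list(dfs_paths(tree))
```

In this toy case the tree holds 6 token nodes against 11 tokens processed independently; the saving grows with the length of the shared prefixes.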

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Minerva：针对网络威胁情报大型语言模型的可验证奖励强化学习

Authors: Md Tanvirul Alam, Aritran Piplai, Ionut Cardei, Nidhi Rastogi, Peter J Worth Jr
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00513
Pdf link: https://arxiv.org/pdf/2602.00513
Abstract Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.
中文摘要 网络威胁情报（CTI）分析师通常会将噪声杂乱、非结构化的安全产件转换为标准化、自动化可行的表示。尽管大型语言模型（LLMs）在这项任务中展现出潜力，但现有方法在生成结构化CTI输出时仍然脆弱，且主要依赖监督微调（SFT）。相比之下，CTI标准和社区维护的资源定义了规范标识符和模式，使模型输出能够确定性验证。我们利用这一结构研究CTI任务中的可验证奖励强化学习（RLVR）。我们引入了 \textit{Minerva}，这是一个统一的数据集和训练流水线，涵盖多个 CTI 子任务，每个子任务都与任务特定的验证器配对，这些验证器能为结构化输出和标识符预测进行评分。为了解决推广过程中的奖励稀疏问题，我们提出了一种轻量级自训机制，生成额外的经过验证的轨迹并将其提炼回模型。跨大型语言模型骨干的实验显示，在多个基准测试中，准确性和鲁棒性均优于SFT持续提升。
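
A deterministic verifier of the kind the Minerva abstract describes can be sketched as a format check followed by a lookup against reference identifiers. The abstract does not name the identifier schemes used; the MITRE ATT&CK technique-ID pattern below is an assumed example, and `verify_technique` is a hypothetical helper, not the paper's verifier.

```python
import re

# ATT&CK-style technique IDs such as "T1059" or "T1059.001" (assumed example scheme).
TECHNIQUE_ID = re.compile(r"T\d{4}(?:\.\d{3})?")

def verify_technique(prediction, gold_ids):
    """Return 1.0 if the output is a well-formed ID that matches a reference ID, else 0.0."""
    candidate = prediction.strip()
    if not TECHNIQUE_ID.fullmatch(candidate):
        return 0.0
    return 1.0 if candidate in gold_ids else 0.0

reward_ok = verify_technique(" T1059.001 ", {"T1059.001"})
reward_malformed = verify_technique("t1059", {"T1059"})
reward_wrong = verify_technique("T1059", {"T1059.001"})
```

A reward like this is sparse (exact match or nothing), which is the situation the abstract's self-training mechanism is meant to ease.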

How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use

大型语言模型离职业扑克玩家有多远？重新审视结合智能工具的博弈论推理

Authors: Minhua Lin, Enyan Dai, Hui Liu, Xianfeng Tang, Yuliang Yan, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Fali Wang, Hongcheng Gao, Chen Luo, Xiang Zhang, Qi He, Suhang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00528
Pdf link: https://arxiv.org/pdf/2602.00528
Abstract As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals LLMs fail to compete against traditional algorithms and identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a "knowing-doing" gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.
中文摘要 随着大型语言模型（LLMs）在高风险领域日益应用，其在不确定性下进行战略推理的能力变得至关重要。扑克提供了一个严谨的测试平台，不仅需要强有力的行动，还需要有原则的博弈论推理。本文系统研究了多种真实扑克任务中的大型语言模型，评估了游戏结果和推理痕迹。我们的分析显示，大型语言模型无法与传统算法竞争，并指出三个反复出现的缺陷：依赖启发式方法、事实误解，以及“知而行”的差距，即行为与推理产生分歧。初步尝试行为克隆和步级强化学习能提升推理风格，但仍不足以实现准确的博弈论。基于这些局限性，我们提出了ToolPoker，一个集成工具的推理框架，结合了外部求解器与更精准的专业解释。实验表明，ToolPoker在产生与博弈论原理密切相关的推理痕迹的同时，实现了最先进的玩法。

Reinforcement Learning-assisted Constraint Relaxation for Constrained Expensive Optimization

强化学习辅助约束松弛以实现受限且昂贵的优化

Authors: Qianhao Zhu, Sijie Ma, Zeyuan Ma, Hongshu Guo, Yue-Jiao Gong
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00532
Pdf link: https://arxiv.org/pdf/2602.00532
Abstract Constraint handling plays a key role in solving realistic complex optimization problems. Although intensively studied over the last few decades, existing constraint-handling techniques rely predominantly on designs by human experts, which tend to fall short when applied to general cases. Motivated by recent progress in Meta-Black-Box Optimization, where automated algorithm design can be learned to boost optimization performance, we propose learning an effective, adaptive, and generalizable constraint-handling policy through reinforcement learning. Specifically, a tailored Markov Decision Process is first formulated in which, given features of the optimization dynamics, a deep Q-network-based policy controls the constraint relaxation level along the underlying optimization process. Such adaptive constraint handling provides a flexible tradeoff between objective-oriented exploitation and feasible-region-oriented exploration, and hence leads to promising optimization performance. We train our approach on the CEC 2017 Constrained Optimization benchmark under a limited evaluation budget (the expensive case) and compare the trained constraint-handling policy to strong baselines such as recent winners of CEC/GECCO competitions. Extensive experimental results show that our approach performs competitively with, or even surpasses, the compared baselines under either leave-one-out cross-validation or ordinary train-test split validation. Further analysis and ablation studies reveal key insights into our design.
中文摘要 约束处理在解决现实复杂的优化问题中起着关键作用。尽管在过去几十年里被广泛讨论，现有的约束处理技术主要依赖人类专家的设计，而这些设计在通用情况下的实用性或多或少有所不足。元黑盒优化（Meta-Black-Box Optimization）的最新进展表明，可以通过学习自动化算法设计来提升优化性能；受此启发，本文提出通过强化学习来学习有效、自适应且可泛化的约束处理策略。具体来说，首先构建一个定制的马尔可夫决策过程，在给定优化动态特征的情况下，由基于深度Q网络的策略在底层优化过程中控制约束松弛水平。这种自适应约束处理在面向目标的利用与面向可行域的探索之间提供了灵活的权衡，从而带来有前景的优化性能。我们在评估预算有限（昂贵情形）的条件下，于CEC 2017约束优化基准上训练我们的方法，并将训练好的约束处理策略与强基线（如近期CEC/GECCO竞赛的获胜者）进行比较。大量实验结果表明，无论是在留一法交叉验证还是普通的训练-测试划分验证下，我们的方法都表现出有竞争力的性能，甚至超过了所比较的基线。进一步的分析和消融研究揭示了我们设计中的关键见解。
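
To make "constraint relaxation level" concrete, the sketch below uses the classic epsilon-relaxed comparison: solutions whose total violation is at most epsilon are treated as feasible and ranked by objective. The abstract does not specify which relaxation mechanism the learned policy controls, so this rule is an assumption, and `total_violation` and `better` are illustrative names.

```python
def total_violation(g_values):
    """Sum of the positive parts of inequality constraints g_i(x) <= 0."""
    return sum(max(0.0, g) for g in g_values)

def better(a, b, epsilon):
    """True if a beats b; each is (objective, violation), minimisation, relaxation level epsilon."""
    fa, va = a
    fb, vb = b
    a_ok, b_ok = va <= epsilon, vb <= epsilon
    if a_ok and b_ok:
        return fa < fb      # both acceptable: compare objectives
    if a_ok != b_ok:
        return a_ok         # exactly one acceptable: it wins
    return va < vb          # neither acceptable: smaller violation wins

# Hypothetical candidates: a has the better objective but violates a constraint by 0.3.
a = (1.0, total_violation([0.3, -1.0]))
b = (2.0, total_violation([-0.2, -0.5]))
strict = better(a, b, epsilon=0.0)
relaxed = better(a, b, epsilon=0.5)
```

Raising epsilon lets the search exploit good objective values near the feasible boundary, and lowering it pushes the population back towards feasibility; a policy that sets epsilon per step trades these off.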

Surrogate Ensemble in Expensive Multi-Objective Optimization via Deep Q-Learning

通过深度Q学习实现昂贵多目标优化中的替代集合

Authors: Yuxin Wu, Hongshu Guo, Ting Huang, Yue-Jiao Gong, Zeyuan Ma
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00540
Pdf link: https://arxiv.org/pdf/2602.00540
Abstract Surrogate-assisted Evolutionary Algorithms (SAEAs) have shown promising robustness in solving expensive optimization problems. A key aspect that impacts SAEAs' effectiveness is surrogate model selection, which in existing works is predominantly decided by human developers. Such human-made design choices introduce strong bias into SAEAs and may hurt their expected performance on out-of-scope tasks. In this paper, we propose a reinforcement learning-assisted ensemble framework, termed SEEMOO, which is capable of scheduling different surrogate models within a single optimization process, hence boosting the overall optimization performance in a cooperative paradigm. Specifically, we focus on expensive multi-objective optimization problems, where multiple objective functions shape a compositional landscape and hence challenge surrogate selection. SEEMOO comprises the following core designs: 1) a pre-collected model pool that maintains different surrogate models; 2) an attention-based state extractor that supports a universal optimization state representation for problems with varied numbers of objectives; 3) a deep Q-network that serves as a dynamic surrogate selector: given the optimization state, it selects the desired surrogate model for the current-step evaluation. SEEMOO is trained to maximize the overall optimization performance under a training problem distribution. Extensive benchmark results demonstrate that SEEMOO's surrogate ensemble paradigm boosts the optimization performance of single-surrogate baselines. Further ablation studies underscore the importance of SEEMOO's design components.
中文摘要 替代辅助进化算法~（SAEA）在解决昂贵优化问题方面展现出有前景的鲁棒性。影响SAEA有效性的关键方面是替代模型选择，在现有工作中，这主要由人类开发者决定。这种人为设计的选择会给SAEA带来强烈偏见，可能影响其在超出范围任务中的预期性能。本文提出了一种名为SEEMOO的强化学习辅助集成框架，能够在单一优化过程中调度不同的替代模型，从而提升协作范式中的整体优化性能。具体来说，我们关注昂贵的多目标优化问题，其中多目标函数塑造复合景观，从而挑战替代选择。SEEMOO包含以下核心设计：1）一个预先收集的模型池，维护不同的替代模型;2）基于注意力的状态提取器支持具有不同目标数问题的通用优化状态表示;3）深度Q网络作为动态代理选择器：在优化状态下，选择所需的代理模型进行当前步评估。SEEMOO 经过训练，旨在最大化训练问题分布下的整体优化性能。大量基准测试结果表明，SEEMOO的代理集合范式提升了单代理基线的优化性能。进一步的消融研究强调了SEEMOO设计组件的重要性。

APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

APEX：一种基于内存的解耦探索器，用于异步航天目标导航

Authors: Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen, Jianqiang Xiao, Shuo Yang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.00551
Pdf link: https://arxiv.org/pdf/2602.00551
Abstract Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is provided on GitHub at this https URL.
中文摘要 空中目标导航是具身人工智能中的一个挑战性领域，要求无人机（UAV）智能体仅凭视觉感知和语言描述自主探索、推理并识别特定目标。然而，现有方法在空中环境中记忆复杂空间表征、可靠且可解释的行动决策以及探索和信息收集效率低下方面存在困难。为应对这些挑战，我们引入了\textbf{APEX}（空中平行探测器），这是一种新型分层智能体，旨在高效探索和目标获取，适用于复杂空中环境。APEX 基于模块化的三部分架构构建：1）动态时空语义映射记忆，利用视觉语言模型（VLM）的零点能力，动态构建高分辨率的三维吸引、探索和障碍地图，作为可解释的记忆机制。2）动作决策模块，通过强化学习训练，将丰富的空间理解转化为细粒度且稳健的控制策略。3）目标接地模块，采用开放词汇探测器实现明确且可推广的目标识别。所有这些组件都集成在分层、异步和并行的框架中，有效绕过了VLM的推理延迟，提升了代理在探索中的主动性。大量实验表明，APEX在挑战性的无人机ON基准测试中，SR和SPL提升+4.2%和SPL+2.8%，展示了其卓越的效率和分层异步设计的有效性。我们的源代码已提供于 \href{this https URL}{GitHub}
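
The APEX abstract reports gains in SR and SPL without expanding the acronyms. Assuming they carry their usual navigation meanings (success rate, and Success weighted by Path Length), they can be computed as below; the episode records are made-up values for illustration.

```python
def success_rate(episodes):
    """Fraction of episodes in which the target was found."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)

# Hypothetical episodes: shortest-path length l_i and path length actually flown p_i.
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},
    {"success": True, "shortest": 10.0, "taken": 20.0},
    {"success": False, "shortest": 10.0, "taken": 5.0},
    {"success": False, "shortest": 8.0, "taken": 30.0},
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

SPL never exceeds SR, because each successful episode contributes at most 1 and is discounted when the flown path is longer than the shortest one.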

NetWorld: Communication-Based Diffusion World Model for Multi-Agent Reinforcement Learning in Wireless Networks

NetWorld：无线网络中多智能体强化学习的基于通信扩散世界模型

Authors: Kechen Meng, Rongpeng Li, Yansha Deng, Zhifeng Zhao, Honggang Zhang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.00558
Pdf link: https://arxiv.org/pdf/2602.00558
Abstract As wireless communication networks grow in scale and complexity, diverse resource allocation tasks become increasingly critical. Multi-Agent Reinforcement Learning (MARL) provides a promising solution for distributed control, yet it often requires costly real-world interactions and lacks generalization across diverse tasks. Meanwhile, recent advances in Diffusion Models (DMs) have demonstrated strong capabilities in modeling complex dynamics and supporting high-fidelity simulation. Motivated by these challenges and opportunities, we propose a Communication-based Diffusion World Model (NetWorld) to enable few-shot generalization across heterogeneous MARL tasks in wireless networks. To improve applicability to large-scale distributed networks, NetWorld adopts the Distributed Training with Decentralized Execution (DTDE) paradigm and is organized into a two-stage framework: (i) pre-training a classifier-guided conditional diffusion world model on multi-task offline datasets, and (ii) performing trajectory planning entirely within this world model to avoid additional online interaction. Cross-task heterogeneity is handled via shared latent processing for observations, two-hot discretization for task-specific actions and rewards, and an inverse dynamics model for action recovery. We further introduce a lightweight Mean Field (MF) communication mechanism to reduce non-stationarity and promote coordinated behaviors with low overhead. Experiments on three representative tasks demonstrate improved performance and sample efficiency over MARL baselines, indicating strong scalability and practical potential for wireless network optimization.
中文摘要 随着无线通信网络规模和复杂度的增长，多样化的资源分配任务变得越来越关键。多智能体强化学习（MARL）为分布式控制提供了有前景的解决方案，但通常需要代价高昂的现实世界交互，且缺乏跨多样任务的泛化能力。与此同时，扩散模型（DM）的最新进展展示了复杂动力学建模和支持高保真模拟的强大能力。受这些挑战和机遇激励，我们提出了基于通信的扩散世界模型（NetWorld），以实现无线网络中异构MARL任务的少数推广。为了提高对大规模分布式网络的适用性，NetWorld采用了分布式训练与去中心化执行（DTDE）范式，并组织为两阶段框架：（i）在多任务离线数据集上预训练分类器引导的条件扩散世界模型，（ii）完全在该世界模型内进行轨迹规划，以避免额外的在线交互。跨任务异质性通过共享潜在处理（用于观察）、双热离散处理任务特定动作和奖励，以及反动力学模型（动作恢复）来处理。我们还引入了轻量级均值场（MF）通信机制，以减少非平稳性，促进协调行为且开销低。在三个代表性任务上的实验显示，MARL基线相比性能和样本效率有所提升，表明其具有强大的可扩展性和无线网络优化的实用潜力。
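
The "two-hot discretization" mentioned in the NetWorld abstract encodes a scalar as weights on its two neighbouring bins so that the expected bin value recovers the scalar. A minimal sketch follows, with a hypothetical bin grid; how NetWorld chooses its bins for actions and rewards is not stated in the abstract.

```python
def two_hot(value, bins):
    """Encode a scalar as weights on the two neighbouring bins (bins ascending, at least two)."""
    value = min(max(value, bins[0]), bins[-1])  # clamp into the covered range
    encoding = [0.0] * len(bins)
    for k in range(len(bins) - 1):
        lo, hi = bins[k], bins[k + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            encoding[k] = 1.0 - w_hi
            encoding[k + 1] = w_hi
            break
    return encoding

def decode(encoding, bins):
    """Recover the scalar as the weighted average of bin values."""
    return sum(w * b for w, b in zip(encoding, bins))

bins = [-1.0, 0.0, 1.0, 2.0]      # hypothetical reward bins
enc = two_hot(0.25, bins)
recovered = decode(enc, bins)
```

The encoding turns regression of a reward or action value into a classification-style target over bins, which is one way to share an output head across tasks with different value ranges.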

Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

学习在视频多模态大型语言模型中解码组合幻觉

Authors: Wenbin Xing, Quanxing Zha, Lizheng Zu, Mengran Li, Ming Li, Junchi Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00559
Pdf link: https://arxiv.org/pdf/2602.00559
Abstract Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, which arise from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., "All are correct" and "None of the above") to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be found at this https URL.
中文摘要 目前关于视频幻觉缓解的研究主要关注孤立的错误类型，导致因对多个相互作用的空间和时间因素进行错误推理而产生的合成幻觉大多未被充分探讨。我们介绍了OmniVCHall，这是一个旨在系统评估视频多模态大型语言模型（VLLM）中孤立幻觉和组合幻觉的基准测试。OmniVCHall涵盖了多种视频领域，引入了一种新型基于摄像头的幻觉类型，并定义了细致的分类法，并配备了对抗性答案选项（例如“全正确”和“以上皆非”）以防止偷懒推理。对39个代表性VLLM的评估显示，即使是先进模型（如Qwen3-VL和GPT-5）也表现出显著的性能下降。我们提出了TriCD，一种具有三径校准机制的对比解码框架。自适应扰动控制器动态选择分散注意力作以构建负视频变体，而显著性引导增强模块则自适应地强化基于标记的视觉证据。这些组成部分通过强化学习得到优化，以鼓励在组合幻觉环境中做出精确决策。实验结果显示，TriCD在两种代表性骨干链上持续提升性能，平均准确率提升超过10%。数据和代码可在该 https URL 中找到。
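
Contrastive decoding, the family TriCD belongs to, adjusts next-token logits by subtracting what the model predicts from a degraded ("negative") input. The sketch below uses the common two-pathway form with a single strength parameter `alpha`; it is a generic illustration under that assumption, not TriCD's triple-pathway calibration, and the logit values are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(original, negative, alpha=1.0):
    """Amplify evidence that disappears when the input is perturbed (generic form, assumed)."""
    return [(1 + alpha) * o - alpha * n for o, n in zip(original, negative)]

# Hypothetical logits over three answer tokens.
original = [2.0, 1.9, 0.0]   # token 0 is supported by the video, token 1 by language priors
negative = [0.5, 1.9, 0.0]   # with a perturbed video, only token 1 keeps its score
adjusted = contrastive_logits(original, negative)
p_before = softmax(original)[0]
p_after = softmax(adjusted)[0]
```

Token 1 scores equally with and without the real video, so the subtraction leaves it unchanged, while the grounded token 0 is boosted.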

Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings

学习带有潜在嵌入的模态混合思维链推理

Authors: Yifei Shao, Kun Zhou, Ziming Xu, Mohammad Atif Quamar, Shibo Hao, Zhen Wang, Zhiting Hu, Biwei Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00574
Pdf link: https://arxiv.org/pdf/2602.00574
Abstract We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. While CoT helps LLMs and VLMs articulate intermediate steps, its text-only form often fails on vision-intensive problems where key intermediate states are inherently visual. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. To bridge the modality gap without eroding the original knowledge and capability of the VLM, we use the VLM itself as an encoder and train the language backbone to reconstruct its own intermediate vision embeddings, to guarantee the semantic alignment of the visual latent space. We further attach a diffusion-based latent decoder, invoked by a special control token and conditioned on hidden states from the VLM. In this way, the diffusion head carries fine-grained perceptual details while the VLM specifies high-level intent, which cleanly disentangles roles and reduces the optimization pressure on the VLM. Training proceeds in two stages: supervised fine-tuning on traces that interleave text and latents with a joint next-token and latent-reconstruction objective, followed by reinforcement learning that teaches when to switch modalities and how to compose long reasoning chains. Extensive experiments across 11 diverse multimodal reasoning tasks demonstrate that our method yields better performance than language-only and other CoT methods. Our code will be publicly released.
中文摘要 我们研究如何将思维链（CoT）扩展到语言之外，以更好地处理多模态推理。虽然CoT帮助大型语言模型（LLM）和视觉语言模型（VLM）表达中间步骤，但其纯文本形式在视觉密集型问题上常常失败，因为关键中间状态本质上是视觉化的。我们引入了模态混合的CoT，将文本标记与以潜在嵌入表示的紧凑视觉草图交错排列。为了弥合模态差距而不削弱VLM的原始知识和能力，我们使用VLM本身作为编码器，并训练语言骨干重建自身的中间视觉嵌入，以确保视觉潜在空间的语义对齐。我们还附加了一个基于扩散的潜在解码器，由特殊的控制标记调用，并以VLM的隐藏状态为条件。通过这种方式，扩散头承载细粒度的感知细节，而VLM指定高层次意图，从而清晰地分离二者的角色并降低VLM的优化压力。训练分为两个阶段：先在交错文本和潜在嵌入的轨迹上进行监督微调，采用联合的下一标记预测和潜在重建目标；随后进行强化学习，让模型学会何时切换模态以及如何构建长推理链。在11种多模态推理任务上的大量实验表明，我们的方法优于纯语言及其他CoT方法。我们的代码将公开发布。

Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction

代理奖励建模：通过在线主动交互验证图形界面代理

Authors: Chaoqun Cui, Jing Huang, Shijing Wang, Liming Zheng, Qingchao Kong, Zhixiong Zeng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.00575
Pdf link: https://arxiv.org/pdf/2602.00575
Abstract Reinforcement learning with verifiable rewards (RLVR) is pivotal for the continuous evolution of GUI agents, yet existing evaluation paradigms face significant limitations. Rule-based methods suffer from poor scalability and cannot handle open-ended tasks, while LLM-as-a-Judge approaches rely on passive visual observation, often failing to capture latent system states due to partial state observability. To address these challenges, we advocate for a paradigm shift from passive evaluation to Agentic Interactive Verification. We introduce VAGEN, a framework that employs a verifier agent equipped with interaction tools to autonomously plan verification strategies and proactively probe the environment for evidence of task completion. Leveraging the insight that GUI tasks are typically "easy to verify but hard to solve", VAGEN overcomes the bottlenecks of visual limitations. Experimental results on OSWorld-Verified and AndroidWorld benchmarks demonstrate that VAGEN significantly improves evaluation accuracy compared to LLM-as-a-Judge baselines and further enhances performance through test-time scaling strategies.
中文摘要 带可验证奖励的强化学习（RLVR）对于图形界面代理的持续演进至关重要，但现有的评估范式仍面临重大局限。基于规则的方法扩展性较差且无法处理开放式任务，而LLM-as-a-Judge方法依赖被动视觉观察，常因状态仅部分可观测而无法捕捉潜在系统状态。为应对这些挑战，我们倡导从被动评估转向智能体交互验证（Agentic Interactive Verification）。我们介绍VAGEN框架，该框架采用配备交互工具的验证代理，自主规划验证策略并主动探测环境以获取任务完成的证据。利用GUI任务通常“易于验证但难以解决”的洞察，VAGEN克服了视觉限制的瓶颈。OSWorld-Verified和AndroidWorld基准测试的实验结果显示，VAGEN相比LLM-as-a-Judge基线显著提升了评估准确性，并通过测试时扩展策略进一步提升了性能。

Safe Langevin Soft Actor Critic

安全朗之万软演员-评论家

Authors: Mahesh Keswani, Samyak Jain, Raunak P. Bhattacharyya
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00587
Pdf link: https://arxiv.org/pdf/2602.00587
Abstract Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distribution. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.
中文摘要 由于尖锐的价值极小值泛化能力差，以及对重尾风险分布处理不充分，在约束强化学习中平衡奖励与安全仍然具有挑战性。我们介绍了安全朗之万软演员-评论家（SL-SAC），这是一种有原则的算法，通过参数空间探索和分布式风险控制解决这两个问题。我们的方法结合了三个关键机制：（1）用于奖励评论家的自适应随机梯度朗之万动力学（aSGLD），促进集成多样性并逃离较差的最优点；（2）通过隐式分位数网络（IQN）进行分布式成本估计，并结合条件风险价值（CVaR）优化以缓解尾部风险；以及（3）一种反应式拉格朗日松弛方案，基于回合成本的经验CVaR调整约束执行力度。我们对CVaR估计误差提供理论保证，并证明基于CVaR的拉格朗日更新比基于期望成本的更新产生更强的约束违反信号。在Safety-Gymnasium基准测试中，SL-SAC在10项任务中有7项成本最低，同时保持有竞争力的回报，速度任务的成本比最先进基线下降19%-63%。
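
As a rough illustration of the third mechanism in the abstract above (a Lagrange multiplier driven by the empirical CVaR of episodic costs), the following Python sketch may help. It is not the SL-SAC implementation: the function names, the tail level `alpha`, the learning rate, and the sample costs are all assumptions made for this example.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of episodic costs (upper tail)."""
    costs = np.sort(np.asarray(costs, dtype=float))
    k = max(1, int(np.ceil((1.0 - alpha) * len(costs))))  # size of the tail
    return costs[-k:].mean()

def update_multiplier(lmbda, signal, cost_limit, lr=0.05):
    """Projected gradient ascent on the multiplier; it never goes below zero."""
    return max(0.0, lmbda + lr * (signal - cost_limit))

# Hypothetical heavy-tailed episodic costs: one rare but very costly episode.
episode_costs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
cost_limit = 10.0

mean_cost = float(np.mean(episode_costs))        # expected-cost signal
tail_cost = empirical_cvar(episode_costs, 0.9)   # CVaR signal

lmbda_mean = update_multiplier(0.0, mean_cost, cost_limit)
lmbda_cvar = update_multiplier(0.0, tail_cost, cost_limit)
```

In this toy data the mean cost sits below the limit, so an expected-cost update leaves the multiplier at zero, while the CVaR signal exceeds the limit and raises it. This is only meant to make the abstract's claim about stronger violation signals concrete.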

Model-Based Data-Efficient and Robust Reinforcement Learning

基于模型的数据高效且稳健的强化学习

Authors: Ludvig Svedlund, Constantin Cronrath, Jonas Fredriksson, Bengt Lennartson
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.00630
Pdf link: https://arxiv.org/pdf/2602.00630
Abstract A data-efficient learning-based control design method is proposed in this paper. It is based on learning a system dynamics model that is then leveraged in a two-level procedure. On the higher level, a simple but powerful optimization procedure is performed such that, for example, energy consumption in a vehicle can be reduced when hard state and action constraints are also introduced. Load disturbances and model errors are compensated for by a feedback controller on the lower level. In that regard, we briefly examine the robustness of both model-free and model-based learning approaches, and it is shown that the model-free approach greatly suffers from the inclusion of unmodeled dynamics. In evaluating the proposed method, it is assumed that a path is given, while the velocity and acceleration can be modified such that energy is saved, while still keeping speed limits and completion time. Compared with two well-known actor-critic reinforcement learning strategies, the suggested learning-based approach saves more energy and reduces the number of evaluated time steps by a factor of 100 or more.
中文摘要 本文提出了一种数据高效的基于学习的控制设计方法。它基于学习系统动力学模型，然后在两层次的过程中加以利用。在更高层面，会执行一种简单但强大的优化过程，例如当引入硬状态和动作约束时，车辆的能耗可以被降低。负载干扰和模型误差由下层的反馈控制器进行补偿。在这方面，我们简要考察了无模型和基于模型的学习方法的鲁棒性，结果显示无模型方法在包含未建模动力学方面存在很大问题。在评估所提方法时，假设给定了路径，同时可以修改速度和加速度，以节省能量，同时保持速度限制和完成时间。与两种知名的演员-批评者强化学习策略相比，建议的基于学习的方法节省了更多能源，并将评估时间步数减少了100倍或更多。

Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

迈向基于LLM的推荐中样本高效且稳定的强化学习

Authors: Hongxun Ding, Keqin Bao, Jizhi Zhang, Yi Fang, Wenxin Xu, Fuli Feng, Xiangnan He
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.00632
Pdf link: https://arxiv.org/pdf/2602.00632
Abstract While Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the lack of explicit cognitive reasoning patterns in user behavioral data. Driven by these observations, we propose pivoting away from the CoT structure to directly leverage its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER is designed to transform non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates specific strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at this https URL.
中文摘要 虽然长思维链（Long Chain-of-Thought，简称Long CoT）推理在大型语言模型（LLMs）中展现出潜力，但其在提升推荐质量方面的应用正在迅速增长。在本研究中，我们批判性地审视这一趋势，并论证Long CoT本质上不适合序列推荐领域。我们将这种错位归因于两个主要因素：推理延迟过高和用户行为数据中缺乏明确的认知推理模式。基于这些观察，我们建议放弃CoT结构，转而直接利用其底层机制，即强化学习（Reinforcement Learning，RL），来探索物品空间。然而，直接应用强化学习面临重大障碍，尤其是样本效率低（大多数动作无法提供学习信号）以及训练不稳定。为克服这些限制，我们提出了RISER，一种新颖的面向推荐的强化物品空间探索（Reinforced Item Space Exploration）框架。RISER旨在将不可学习的轨迹转换为有效的成对偏好数据用于优化。此外，它还包含了确保稳定性的具体策略，包括防止冗余rollout和限制词元级更新幅度。在三个真实世界数据集上的大量实验表明，RISER的表现显著优于有竞争力的基线，为强化学习增强的LLM推荐建立了稳健的范式。我们的代码将在此 https URL 上发布。

Equilibrium of Feasible Zone and Uncertain Model in Safe Exploration

安全探索中可行区与不确定模型的均衡

Authors: Yujie Yang, Zhilong Zheng, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.00636
Pdf link: https://arxiv.org/pdf/2602.00636
Abstract Ensuring the safety of environmental exploration is a critical problem in reinforcement learning (RL). While limiting exploration to a feasible zone has become widely accepted as a way to ensure safety, key questions remain unresolved: what is the maximum feasible zone achievable through exploration, and how can it be identified? This paper, for the first time, answers these questions by revealing that the goal of safe exploration is to find the equilibrium between the feasible zone and the environment model. This conclusion is based on the understanding that these two components are interdependent: a larger feasible zone leads to a more accurate environment model, and a more accurate model, in turn, enables exploring a larger zone. We propose the first equilibrium-oriented safe exploration framework called safe equilibrium exploration (SEE), which alternates between finding the maximum feasible zone and the least uncertain model. Using a graph formulation of the uncertain model, we prove that the uncertain model obtained by SEE is monotonically refined, the feasible zones monotonically expand, and both converge to the equilibrium of safe exploration. Experiments on classic control tasks show that our algorithm successfully expands the feasible zones with zero constraint violation, and achieves the equilibrium of safe exploration within a few iterations.
中文摘要 确保环境探索的安全是强化学习（RL）中的关键问题。虽然将探索限制在可行区内已被广泛接受为确保安全的一种方式，但关键问题仍未解决：通过探索可达到的最大可行区是什么？如何识别它？本文首次回答了这些问题，揭示出安全探索的目标是找到可行区与环境模型之间的均衡。这一结论基于两者相互依存的理解：更大的可行区带来更准确的环境模型，而更准确的模型反过来使探索更大的区域成为可能。我们提出了第一个以均衡为导向的安全探索框架，称为安全均衡探索（SEE），它在寻找最大可行区和寻找不确定性最小的模型之间交替进行。借助不确定模型的图表述，我们证明了SEE得到的不确定模型单调细化，可行区单调扩展，且两者都收敛于安全探索的均衡。经典控制任务上的实验表明，我们的算法在零约束违反的情况下成功扩展了可行区，并在几次迭代内达到安全探索的均衡。

LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

LegalOne：一系列可靠法律推理的基础模型

Authors: Haitao Li, Yifan Chen, Shuo Miao, Qian Dong, Jia Chen, Yiran Hu, Junjie Chen, Minghao Qin, Qingyao Ai, Yiqun Liu, Cheng Luo, Quan Zhou, Ya Zhang, Jikun Hu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.00642
Pdf link: https://arxiv.org/pdf/2602.00642
Abstract While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and the complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during the mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.
中文摘要 尽管大型语言模型（LLMs）展现了令人印象深刻的通用能力，但其在法律领域的直接应用常常受限于缺乏精确领域知识和执行严格多步司法推理的复杂性。为弥补这一空白，我们推出了LegalOne，一系列专为中国法律领域量身定制的基础模型。LegalOne 通过一个全面的三阶段流程开发，旨在掌握法律推理。首先，在中期训练阶段，我们提出可塑性调整抽样（PAS）方法，以应对领域适应的挑战。这种基于困惑度的调度器在获取新知识与保留原有能力之间取得了平衡，有效地建立了坚实的法律基础。其次，在监督微调过程中，我们采用法律代理CoT蒸馏（LEAD）从原始法律文本中提炼出明确的推理。与朴素提炼不同，LEAD利用代理性工作流程将复杂的司法过程转化为结构化的推理轨迹，从而强化事实基础和逻辑严谨性。最后，我们实施课程强化学习（RL）策略。通过涵盖记忆、理解和推理的渐进式强化过程，LegalOne 从简单的模式匹配演变为自主且可靠的法律推理。实验结果表明，LegalOne 在广泛的法律任务中实现了最先进的性能，通过提升知识密度和效率，超越了参数数远超的通用大型语言模型。我们公开发布LegalOne权重和LegalKit评估框架，推动法律人工智能领域的发展，为在高风险司法应用中部署可信且可解释的基础模型铺平道路。

Toward Reliable Sim-to-Real Predictability for MoE-based Robust Quadrupedal Locomotion

迈向基于MoE的稳健四足行走的可靠模拟到真实可预测性

Authors: Tianyang Wu, Hanwei Guo, Yuhang Wang, Junshu Yang, Xinyang Sui, Jiayi Xie, Xingyu Chen, Zeyang Liu, Xuguang Lan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.00678
Pdf link: https://arxiv.org/pdf/2602.00678
Abstract Reinforcement learning has shown strong promise for quadrupedal agile locomotion, even with proprioception-only sensing. In practice, however, the sim-to-real gap and reward overfitting in complex terrains can produce policies that fail to transfer, while physical validation remains risky and inefficient. To address these challenges, we introduce a unified framework encompassing a Mixture-of-Experts (MoE) locomotion policy for robust multi-terrain representation with RoboGauge, a predictive assessment suite that quantifies sim-to-real transferability. The MoE policy employs a gated set of specialist experts to decompose latent terrain and command modeling, achieving superior deployment robustness and generalization via proprioception alone. RoboGauge further provides multi-dimensional proprioception-based metrics via sim-to-sim tests over terrains, difficulty levels, and domain randomizations, enabling reliable MoE policy selection without extensive physical trials. Experiments on a Unitree Go2 demonstrate robust locomotion on unseen challenging terrains, including snow, sand, stairs, slopes, and 30 cm obstacles. In dedicated high-speed tests, the robot reaches 4 m/s and exhibits an emergent narrow-width gait associated with improved stability at high velocity.
中文摘要 强化学习在四足敏捷运动方面展现出巨大潜力，即使仅依靠本体感觉。然而在实践中，复杂地形中的模拟到真实差距和奖励过拟合可能导致策略无法迁移，而物理验证依然存在风险且效率低下。为应对这些挑战，我们引入了一个统一框架，包含用于稳健多地形表征的混合专家（MoE）运动策略，以及用于量化模拟到真实可迁移性的预测评估套件RoboGauge。MoE策略采用一组门控的专家来分解潜在地形和指令建模，仅通过本体感觉实现卓越的部署鲁棒性和泛化能力。RoboGauge还通过覆盖不同地形、难度等级和域随机化的模拟到模拟测试，提供基于本体感觉的多维指标，从而无需大量物理试验即可可靠地选择MoE策略。在Unitree Go2上的实验展示了在未见过的复杂地形（包括雪地、沙地、楼梯、斜坡和30厘米高的障碍物）上的稳健运动能力。在专门的高速测试中，机器人速度达到4米/秒，并涌现出一种窄幅步态，这种步态与高速下稳定性的提升相关。
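
The "gated set of specialist experts" mentioned in the abstract above can be pictured with a minimal softmax-gated mixture. The Python sketch below is a generic dense MoE layer with linear experts and random weights, assumed purely for illustration. The class name, dimensions, and weight initialisation are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyMoEPolicy:
    """Toy mixture-of-experts policy: a gate mixes the outputs of linear experts."""

    def __init__(self, obs_dim, act_dim, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.gate_w = rng.normal(scale=0.1, size=(obs_dim, n_experts))
        self.expert_w = rng.normal(scale=0.1, size=(n_experts, obs_dim, act_dim))

    def act(self, obs):
        # Gate: one non-negative weight per expert, summing to 1.
        weights = softmax(obs @ self.gate_w)
        # Each expert proposes its own action from the same observation.
        expert_actions = np.einsum("d,eda->ea", obs, self.expert_w)
        # Final action: gate-weighted combination of expert proposals.
        return weights @ expert_actions, weights

policy = TinyMoEPolicy(obs_dim=8, act_dim=3, n_experts=4)
action, gate_weights = policy.act(np.ones(8))
```

Inspecting `gate_weights` for different observations is one simple way to see which expert dominates on which input.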

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

SA-VLA：视觉-语言-行动强化学习中的空间感知流匹配

Authors: Xu Pan, Zhenglin Wan, Xingrui Yu, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00743
Pdf link: https://arxiv.org/pdf/2602.00743
Abstract Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this degradation is closely associated with the erosion of spatial inductive bias during RL adaptation, as sparse rewards and spatially agnostic exploration increasingly favor short-horizon visual cues. To address this issue, we propose SA-VLA, a spatially-aware RL adaptation framework that preserves spatial grounding during policy optimization by aligning representation learning, reward design, and exploration with task geometry. SA-VLA fuses implicit spatial representations with visual tokens, provides dense rewards that reflect geometric progress, and employs SCAN, a spatially-conditioned annealed exploration strategy tailored to flow-matching dynamics. Across challenging multi-object and cluttered manipulation benchmarks, SA-VLA enables stable RL fine-tuning and improves zero-shot spatial generalization, yielding more robust and transferable behaviors. Code and project page are available at this https URL.
中文摘要 视觉-语言-动作（VLA）模型在机器人操作中表现出很强的泛化性，但强化学习（RL）微调常常降低其在空间分布偏移下的鲁棒性。对于流匹配VLA策略，这种退化与强化学习适应过程中空间归纳偏置的侵蚀密切相关，因为稀疏奖励和与空间无关的探索越来越偏向短时域的视觉线索。为解决这一问题，我们提出了SA-VLA，一种空间感知的强化学习适应框架，通过将表征学习、奖励设计和探索与任务几何对齐，在策略优化过程中保持空间基础。SA-VLA将隐式空间表示与视觉标记融合，提供反映几何进展的稠密奖励，并采用SCAN，这是一种针对流匹配动力学定制的空间条件退火探索策略。在具有挑战性的多物体和杂乱场景操作基准测试中，SA-VLA实现了稳定的强化学习微调，并提升了零样本空间泛化能力，从而产生更稳健且可迁移的行为。代码和项目页面可在此 https URL 访问。

ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

ACE-Step 1.5：推动开源音乐生成的边界

Authors: Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2602.00744
Pdf link: https://arxiv.org/pdf/2602.00744
Abstract We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints -- scaling from short loops to 10-minute compositions -- while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities -- such as cover generation, repainting, and vocal-to-BGM conversion -- while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: this https URL
中文摘要 我们介绍ACE-Step v1.5，一种高效的开源音乐基础模型，将商业级音乐生成带入消费级硬件。在常用的评估指标上，ACE-Step v1.5 在质量上超越大多数商业音乐模型，同时保持极高速度：A100 上每首完整歌曲不到 2 秒，RTX 3090 上不到 10 秒。该模型可在本地运行，所需显存不到4GB，并支持轻量级个性化：用户只需几首歌曲即可训练LoRA，捕捉自己的风格。其核心是一种新型混合架构，语言模型（LM）作为一个全能规划器：它将简单的用户查询转化为全面的歌曲蓝图（从短循环扩展到10分钟的作品），同时通过思维链（Chain-of-Thought）合成元数据、歌词和描述文本，引导扩散Transformer（DiT）。独特的是，这种对齐通过内在强化学习实现，仅依赖模型的内部机制，从而消除外部奖励模型或人类偏好固有的偏见。除了标准合成，ACE-Step v1.5 还将精准的风格控制与多功能编辑功能（如翻唱生成、重绘和人声转背景音乐）统一起来，同时在 50 多种语言中严格遵循提示词。这为强大的工具铺平了道路，使其能够无缝融入音乐艺术家、制作人和内容创作者的创意工作流程。代码、模型权重和演示可在以下网站获取：此 https URL

Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

自适应能力分解以解锁大型推理模型有效强化学习

Authors: Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00759
Pdf link: https://arxiv.org/pdf/2602.00759
Abstract Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
中文摘要 带有可验证奖励的强化学习（RLVR）已被展示出极大潜力，能够提升大型语言模型（LLMs）的推理能力。然而，由于RLVR过程中提供的信息有限，模型只能进行大部分盲探，这常常导致在复杂问题上失败。为了在不依赖教师模型的情况下为RLVR过程提供更多信息，我们提出了A$^2$D，一种增强RLVR效果的适应性能力分解方法。具体来说，我们首先通过不进行蒸馏的RLVR训练分解器，使其能够将复杂问题分解为一组更简单的子问题。接下来，我们用这个分解器为训练数据集中的每个问题注释子问题，然后在RLVR下训练推理器，并以子问题为指导。为了更好地理解A$^2$D，我们首先将其性能与竞争对手基线进行比较，展示其有效性。接下来，我们观察到我们的方法作为一个即插即用模块，可以应用于不同的RLVR算法。此外，我们对分解器进行了分析，揭示了RLVR过程如何影响其性能和行为，以及哪种类型的引导更适合提升推理者的探索和利用能力。

Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning

通过智能体强化学习实现NetGPT中的通信激励协作推理

Authors: Xiaoxue Yu, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00766
Pdf link: https://arxiv.org/pdf/2602.00766
Abstract The evolution of next-Generation (xG) wireless networks marks a paradigm shift from connectivity-centric architectures to Artificial Intelligence (AI)-native designs that tightly integrate data, computing, and communication. Yet existing AI deployments in communication systems remain largely siloed, offering isolated optimizations without intrinsic adaptability, dynamic task delegation, or multi-agent collaboration. In this work, we propose a unified agentic NetGPT framework for AI-native xG networks, wherein a NetGPT core can either perform autonomous reasoning or delegate sub-tasks to domain-specialized agents via agentic communication. The framework establishes clear modular responsibilities and interoperable workflows, enabling scalable, distributed intelligence across the network. To support continual refinement of collaborative reasoning strategies, the framework is further enhanced through Agentic reinforcement learning under partially observable conditions and stochastic external states. The training pipeline incorporates masked loss against external agent uncertainty, entropy-guided exploration, and multi-objective rewards that jointly capture task quality, coordination efficiency, and resource constraints. Through this process, NetGPT learns when and how to collaborate, effectively balancing internal reasoning with agent invocation. Overall, this work provides a foundational architecture and training methodology for self-evolving, AI-native xG networks capable of autonomous sensing, reasoning, and action in complex communication environments.
中文摘要 下一代（xG）无线网络的发展标志着从以连接为中心的架构向人工智能（AI）原生设计的范式转变，后者紧密集成数据、计算和通信。然而，现有的人工智能在通信系统中的部署仍然大多各自为政，只提供孤立的优化，缺乏内在的适应性、动态任务委托或多智能体协作。在本研究中，我们提出了一个面向AI原生xG网络的统一智能体NetGPT框架，其中NetGPT核心既可以自主推理，也可以通过智能体通信将子任务委托给领域专用智能体。该框架确立了明确的模块化职责和可互操作的工作流程，实现了覆盖全网络的可扩展分布式智能。为支持协作推理策略的持续完善，框架通过在部分可观测条件和随机外部状态下的智能体强化学习进一步增强。训练流程包含针对外部智能体不确定性的掩码损失、熵引导探索，以及共同刻画任务质量、协调效率和资源约束的多目标奖励。通过这一过程，NetGPT学会了何时以及如何协作，有效地平衡了内部推理与智能体调用。总体而言，这项工作为能够在复杂通信环境中自主感知、推理和行动的自我演进、AI原生xG网络提供了基础架构和训练方法。

Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

基于K步前瞻阈值的快速非回合式有限时域强化学习

Authors: Jiamin Xu, Kyra Gan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.00781
Pdf link: https://arxiv.org/pdf/2602.00781
Abstract Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.
中文摘要 非回合式有限时域MDP中的在线强化学习仍未被充分探索，其难点在于需要估计到固定终止时间为止的回报。现有的无限时域方法通常依赖折扣收缩，并未自然考虑这种固定时域结构。我们引入一个修改后的Q函数：我们不再以整个时域为目标，而是学习一个K步前瞻Q函数，将规划截断到接下来的K步。为进一步提高样本效率，我们引入阈值机制：仅当动作的估计K步前瞻值超过随时间变化的阈值时才被选择。我们为这一新目标提供了一种高效的表格型学习算法，并证明其实现了快速的有限样本收敛：对于 $K=1$，它达到极小极大最优的常数遗憾；对于任意 $K \geq 2$，它达到 $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ 的遗憾。我们以最大化奖励为目标，对算法性能进行了数值评估。我们的实现会随时间自适应地增大K，在前瞻深度与估计方差之间取得平衡。实证结果显示，在合成MDP以及JumpRiverswim、FrozenLake和AnyTrading等强化学习环境中，该算法的累计奖励优于最先进的表格型强化学习方法。
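
To make the K-step lookahead idea in the abstract above more concrete, here is a small Python sketch that computes a K-step lookahead Q-function by truncated backward recursion and then filters actions with a threshold. It assumes a known tabular model (`P`, `R`) and a fixed scalar threshold, whereas the paper learns online and uses a time-varying threshold. The toy MDP and all names are hypothetical.

```python
import numpy as np

def k_step_q(P, R, K):
    """K-step lookahead Q for a tabular MDP.

    P: transition probabilities, shape (S, A, S).
    R: expected immediate rewards, shape (S, A).
    Returns Q of shape (S, A): best total reward over the next K steps.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)       # value with zero steps remaining
    Q = np.zeros_like(R, dtype=float)
    for _ in range(K):
        Q = R + P @ V            # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q

def thresholded_actions(Q, state, threshold):
    """Indices of actions whose lookahead value exceeds the threshold."""
    return np.flatnonzero(Q[state] > threshold)

# Toy MDP: in state 0, action 0 pays 0.1 and stays put, action 1 pays
# nothing but moves to state 1, where every action pays 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.1, 0.0], [1.0, 1.0]])

q1 = k_step_q(P, R, K=1)   # myopic: prefers action 0 in state 0
q2 = k_step_q(P, R, K=2)   # two-step lookahead: prefers action 1 in state 0
candidates = thresholded_actions(q2, state=0, threshold=0.5)
```

With `K=1` the lookahead reduces to the immediate reward, so the difference between `q1` and `q2` in state 0 shows how a deeper lookahead changes which actions pass the threshold.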

World Models as an Intermediary between Agents and the Real World

作为代理与现实世界之间的中介的世界模型

Authors: Sherry Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00785
Pdf link: https://arxiv.org/pdf/2602.00785
Abstract Large language model (LLM) agents trained using reinforcement learning have achieved superhuman performance in low-cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance for these complex and high-cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high-cost actions such as extreme off-policy learning and sample inefficiency in long-horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.
中文摘要 通过强化学习训练的大型语言模型（LLM）代理在游戏、数学和编程等低成本环境中实现了超人性能。然而，这些成功并未转化到交互成本较高的复杂领域，比如运行机器人的物理成本、机器学习工程的时间成本以及科学实验的资源成本。在这些复杂且高成本的领域中，使代理性能更上一层楼的真正瓶颈在于为获取奖励信号而执行动作的开销。为了弥补这一空白，本文主张我们应利用世界模型作为代理与现实世界之间的中介。我们讨论了世界模型（被视为动力学、奖励和任务分布的模型）如何克服高成本动作带来的根本障碍，如极端的离策略学习和长时域任务中的样本低效。此外，我们还展示了世界模型如何为机器学习工程、计算机使用、机器人学和科学人工智能等广泛领域的代理提供关键且丰富的学习信号。最后，我们指出了构建这些世界模型的挑战，并围绕数据集整理、架构设计、规模化和世界模型评估提出了可操作的事项。

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

DVLA-RL：带强化学习门控的双层视觉-语言对齐，用于小样本学习

Authors: Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.00795
Pdf link: https://arxiv.org/pdf/2602.00795
Abstract Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
中文摘要 小样本学习（FSL）旨在仅用少量样本泛化到新类别。近期方法引入大型语言模型（LLM），利用由类名得到的语义嵌入来丰富视觉表征。然而，它们忽视了视觉与语言之间从低层到高层语义的渐进式、自适应对齐，导致语义增益有限。为应对这些挑战，我们提出了带强化学习门控的双层视觉-语言对齐（DVLA-RL），包括双层语义构建（DSC）和强化学习门控注意力（RLA）。具体来说，DSC以类名和支持样本共同作为LLM的条件来生成判别性属性，逐步筛选出最相关的属性，然后将它们综合成连贯的类别描述。这一过程提供了互补的低层属性和高层描述，既支持细粒度的语义落地，也支持整体的类别理解。为了沿视觉网络各层动态整合双层语义，RLA将跨模态融合表述为一个序贯决策过程。一个用回合式REINFORCE训练的轻量级策略会自适应地调整自注意力和交叉注意力的贡献，以整合文本和视觉标记。因此，浅层精炼局部属性，而深层强调全局语义，从而实现更精确的跨模态对齐。这样仅用少量支持样本即可获得类别特定的判别能力和泛化的表征。DVLA-RL在三种不同FSL场景的九个基准测试上取得了新的最先进性能。

Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement

通过动态一次性策略细化，资源高效强化大型语言模型推理

Authors: Yunjian Zhang, Sudong Wang, Yang Li, Peiran Xu, Conghao Zhou, Xiaoyue Ma, Jianing Li, Yao Zhu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00815
Pdf link: https://arxiv.org/pdf/2602.00815
Abstract Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.
中文摘要 大型语言模型（LLMs）在复杂推理任务中表现出色，可验证奖励强化学习（RLVR）已成为将模型行为与推理链对齐的原则性框架。尽管前景可期，RLVR的资源消耗仍然极高，需要大量奖励信号，并在训练期间产生大量rollout成本。在本研究中，我们重新探讨了RLVR中数据和计算效率的根本问题。我们首先建立了解锁推理能力所需样本复杂度的理论下界，并通过实证验证了即使训练实例数量极少也能实现强性能。为应对计算负担，我们提出了动态一次性策略细化（DoPR），这是一种不确定性感知的强化学习策略，每批动态选择一个信息量大的训练样本进行策略更新，并以奖励波动性和探索驱动的采集为指导。DoPR在保持有竞争力的推理准确率的同时，将rollout开销降低近一个数量级，为LLM后训练提供了可扩展且资源高效的解决方案。这种方法为推理密集型LLM应用提供了一条更高效、更易获得的基于强化学习的训练的实用路径。

Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

Omni-RRM：通过基于评分标准的自动偏好合成推进全模态奖励建模

Authors: Zicheng Kong, Dehua Ma, Zhenbo Xu, Alven Yang, Yiwei Ru, Haoran Wang, Zixuan Zhou, Fuqing Bie, Liuyu Xiang, Huijia Wu, Jian Zhao, Zhaofeng He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.00846
Pdf link: https://arxiv.org/pdf/2602.00846
Abstract Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at this https URL.
中文摘要 多模态大型语言模型（MLLMs）展现出了卓越的能力，但其性能常常受限于现有对齐技术的粗糙性。一个关键瓶颈仍是缺乏有效的奖励模型（RM）：现有的奖励模型主要以视觉为中心，返回不透明的标量分数，且依赖昂贵的人工标注。我们介绍了Omni-RRM，这是首个开源的基于评分标准的奖励模型，能够在文本、图像、视频和音频上生成结构化、多维度的偏好判断，并给出逐维度的理由。我们方法的核心是Omni-Preference，这是一个通过全自动流程构建的大规模数据集：我们通过对比不同能力的模型来合成候选回答对，并利用强大的教师模型来协调和过滤偏好，同时为每对回答提供一个模态感知的、基于评分标准的理由。这消除了对人工标注训练偏好的需求。Omni-RRM的训练分为两个阶段：先通过监督微调学习基于评分标准的输出，随后通过强化学习（GRPO）增强对困难、低对比度样本对的辨别能力。综合评估显示，Omni-RRM在视频（ShareGPT-V上80.2%）和音频（Audio-HH-RLHF上66.8%）基准测试中达到最先进的准确率，在图像任务中显著优于现有开源RM，整体准确率比其基础模型绝对提升17.7%。Omni-RRM还通过Best-of-N选择提升了下游性能，并可迁移到纯文本偏好基准。我们的数据、代码和模型可在该 https URL 访问。

Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents

学习程序综合代理中层级规划的抽象

Authors: Zergham Ahmed, Kazuki Irie, Joshua B. Tenenbaum, Christopher J. Bates, Samuel J. Gershman
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00929
Pdf link: https://arxiv.org/pdf/2602.00929
Abstract Humans learn abstractions and use them to plan efficiently to quickly generalize across tasks -- an ability that remains challenging for state-of-the-art large language model (LLM) agents and deep reinforcement learning (RL) systems. Inspired by the cognitive science of how people form abstractions and intuitive theories of their world knowledge, Theory-Based RL (TBRL) systems, such as TheoryCoder, exhibit strong generalization through effective use of abstractions. However, they heavily rely on human-provided abstractions and sidestep the abstraction-learning problem. We introduce TheoryCoder-2, a new TBRL agent that leverages LLMs' in-context learning ability to actively learn reusable abstractions rather than relying on hand-specified ones, by synthesizing abstractions from experience and integrating them into a hierarchical planning process. We conduct experiments on diverse environments, including BabyAI, Minihack and VGDL games like Sokoban. We find that TheoryCoder-2 is significantly more sample-efficient than baseline LLM agents augmented with classical planning domain construction, reasoning-based planning, and prior program-synthesis agents such as WorldCoder. TheoryCoder-2 is able to solve complex tasks that the baselines fail, while only requiring minimal human prompts, unlike prior TBRL systems.
中文摘要 人类学习抽象并高效规划，从而快速泛化任务——这一能力对最先进的大型语言模型（LLM）代理和深度强化学习（RL）系统来说仍具挑战性。受认知科学对人们如何构建抽象和直观世界知识理论的启发，基于理论的强化学习（TBRL）系统，如TheoryCoder，通过有效使用抽象展现出强烈的泛化能力。然而，它们高度依赖人类提供的抽象，规避了抽象学习的问题。我们介绍了TheoryCoder-2，一款新的TBRL代理，利用LLM的上下文学习能力，通过综合经验抽象并将其整合进层级规划过程，主动学习可重用抽象，而非依赖手工指定的抽象。我们在多种环境中进行实验，包括BabyAI、Minihack和像Sokoban这样的VGDL游戏。我们发现，TheoryCoder-2 比基础大型语言模型代理（辅以经典规划领域构建、基于推理的规划以及之前的程序综合代理如 WorldCoder）在样本效率上显著提升。TheoryCoder-2能够解决基线无法完成的复杂任务，同时只需极少的人力提示，这与之前的TBRL系统不同。

Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

揭示认知罗盘：心智理论引导的多模态情绪推理

Authors: Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen, Menglu Han, Wenhao Chen, Yanxiang Huang, Hao Fei, Mong-Li Lee, Wynne Hsu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.00971
Pdf link: https://arxiv.org/pdf/2602.00971
Abstract Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: this https URL.
中文摘要 尽管多模态大型语言模型（MLLM）进展迅速，但它们在深度情感理解方面的能力仍然有限。我们认为，真正的情感智能需要对心智理论（ToM）进行明确建模，这是情绪产生的认知基础。为此，我们引入了HitEmotion，一个基于ToM的层级基准测试，用于诊断认知深度递增层次中的能力断点。其次，我们提出一种以ToM为导引的推理链，追踪心理状态并校准跨模态证据，以实现忠实的情绪推理。我们进一步介绍了TMPO，这是一种强化学习方法，利用中间心理状态作为过程层级监督，指导和强化模型推理。大量实验表明，HitEmotion在最先进的模型中，尤其是在认知要求较高的任务中，暴露出深刻的情感推理缺陷。在评估中，ToM引导推理链和TMPO提高了终端任务的准确性，并产生更忠实、更连贯的推理。总之，我们的工作为研究界提供了一套实用工具包，用于评估和提升MLLM基于认知的情感理解能力。我们的数据集和代码可在：https URL 获取。

DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

DISPO：提升大型语言模型数学推理强化学习的训练效率和稳定性

Authors: Batuhan K. Karaman, Aditya Rawal, Suhaila Shakiah, Mohammad Ghavamzadeh, Mingyi Hong, Arijit Biswas, Ruida Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00983
Pdf link: https://arxiv.org/pdf/2602.00983
Abstract Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation) -- both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.
中文摘要 具有可验证奖励的强化学习已成为提升大型语言模型推理能力的有前景范式，尤其是在数学领域。当前该领域的方法存在明显权衡：PPO风格方法（如GRPO/DAPO）提供训练稳定性，但由于信任区域对策略更新的约束，学习进展较慢；而REINFORCE风格方法（如CISPO）则提升学习效率，但由于在裁剪重要性采样权重的同时仍允许信任区域外存在非零梯度，存在性能不稳定的问题。为解决这些局限性，我们引入了DISPO，一种简单但有效的REINFORCE风格算法，它对正确和错误响应的重要性采样权重的上裁剪与下裁剪进行解耦，产生四种可控的策略更新模式。通过有针对性的消融实验，我们揭示了每种模式如何影响训练：对于正确响应，权重>1会增加平均token熵（即探索），而权重<1会降低它（即蒸馏）；两者都有益，但过度时会导致性能逐渐下降。对于错误响应，过于严格的裁剪会因重复输出（权重>1时）或响应长度消失（权重<1时）而导致性能突然崩溃。通过分别调整这四个裁剪参数，DISPO在防止灾难性失败的同时保持了探索与蒸馏的平衡，在AIME'24上达到61.04%（相比之下CISPO为55.42%，DAPO为50.21%），在多个基准和模型上也有类似的提升。
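
The decoupled clipping described in the DISPO abstract can be pictured with a minimal Python sketch. This is an illustration under assumptions, not the authors' implementation: the names `ClipBounds` and `clipped_weight` and every numeric bound are hypothetical, and the abstract gives neither the actual values nor the full loss.

```python
from dataclasses import dataclass


@dataclass
class ClipBounds:
    # Hypothetical names and values. The abstract only states that there are
    # four separately tunable clipping parameters: an upper and a lower bound
    # for correct responses, and an upper and a lower bound for incorrect ones.
    correct_low: float = 0.8
    correct_high: float = 1.5
    incorrect_low: float = 0.5
    incorrect_high: float = 1.2


def clipped_weight(ratio: float, is_correct: bool, bounds: ClipBounds) -> float:
    """Clip an importance-sampling ratio using bounds chosen by correctness."""
    if is_correct:
        low, high = bounds.correct_low, bounds.correct_high
    else:
        low, high = bounds.incorrect_low, bounds.incorrect_high
    return min(max(ratio, low), high)


bounds = ClipBounds()
# The same raw ratio is clipped differently depending on response correctness.
w_correct = clipped_weight(2.0, True, bounds)
w_incorrect = clipped_weight(2.0, False, bounds)
```

In a REINFORCE-style update the clipped weight would typically act as a constant coefficient on the advantage-weighted log-probability term; that detail is assumed here rather than taken from the abstract.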

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

推理与工具使用在智能体强化学习中相互竞争：从量化干扰到解耦调优

Authors: Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.00994
Pdf link: https://arxiv.org/pdf/2602.00994
Abstract Agentic Reinforcement Learning (ARL) focuses on training large language models (LLMs) to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single shared set of model parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically investigate this assumption by introducing a Linear Effect Attribution System (LEAS), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool use via separate low-rank adaptation modules. Experimental results show that DART consistently outperforms baseline methods, with an average improvement of 6.35 percent, and, using a single model, achieves performance comparable to multi-agent systems that explicitly separate tool use and reasoning.
中文摘要 智能体强化学习（ARL）专注于训练大型语言模型（LLM），将推理与外部工具执行交织以解决复杂任务。大多数现有的ARL方法训练单一的共享模型参数，以同时支持推理和工具使用行为，隐含假设联合训练能提升整体智能体性能。尽管这一假设被广泛采用，但很少被实证检验。本文通过引入线性效应归因系统（LEAS）系统性地研究这一假设，该系统为推理与工具使用行为之间的干扰提供了定量证据。通过深入分析，我们表明这两种能力常常导致梯度方向不一致，引起训练干扰，削弱联合优化的有效性，挑战现有的ARL范式。为解决这一问题，我们提出了解耦动作推理调优（DART），这是一个简单高效的框架，通过独立的低秩适应模块，显式解耦推理和工具使用的参数更新。实验结果显示，DART始终优于基线方法，平均提升6.35%，并且仅用单一模型就达到了与显式分离工具使用和推理的多智能体系统相当的性能。

Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning

通过适用性推理和章节感知强化学习可靠地使用引理

Authors: Zhikun Xu, Xiaodong Yu, Ben Zhou, Jiang Liu, Jialian Wu, Ze Wang, Ximeng Sun, Hao Chen, Zicheng Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.00998
Pdf link: https://arxiv.org/pdf/2602.00998
Abstract Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
中文摘要 近年来的大型语言模型（LLMs）在数学基准测试上表现优异，但常常误用引理，在未验证假设的情况下引入结论。我们将引理判断形式化为结构化预测任务：给定一个陈述和一个候选引理，模型必须输出前提条件检查和结论效用检查，并由此得出有用性判定。我们提出了RULES，它通过两段式输出编码该规范，并使用强化学习加上章节感知的损失掩码进行训练，将惩罚分配给造成错误的那一段。训练和评估使用多样的自然语言和形式化证明语料库；鲁棒性通过一个留出的扰动测试集进行评估；端到端评估涵盖多种大型语言模型上的竞赛风格、扰动对齐和基于定理的问题。结果显示，相比原始模型和单标签强化学习基线，该方法在域内均有一致提升，在破坏适用性的扰动上提升更大，在端到端任务上持平或略有提升；消融实验表明，两段式输出和章节感知强化对鲁棒性都是必要的。

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

ESSAM：一种用于内存高效LLM微调的新型竞争性进化策略强化学习方法

Authors: Zhishen Sun, Sizhe Dang, Guang Dai, Haishan Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01003
Pdf link: https://arxiv.org/pdf/2602.01003
Abstract Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models and that its overall performance is comparable to RL methods. It surpasses the classic RL algorithm PPO (77.72% accuracy) and is comparable to GRPO (78.34% accuracy), even surpassing them on some models. ESSAM also reduces average GPU memory usage by 18× compared to PPO and by 10× compared to GRPO, achieving extremely low GPU memory usage.
中文摘要 强化学习（RL）已成为提升大型语言模型（LLM）数学推理能力的关键训练步骤，但它通常占用较高的GPU内存，这使得在资源有限的环境中难以使用。为缓解这些问题，我们提出了带有锐度感知最大化的进化策略（ESSAM），这是一个全参数微调框架，将进化策略（ES）在参数空间中的零阶搜索与锐度感知最大化（SAM）紧密结合，以提升泛化性。我们在主流数学推理任务GSM8K上进行微调实验。结果显示，ESSAM在所有模型上的平均准确率为78.27%，整体性能与强化学习方法相当。它超过了准确率为77.72%的经典强化学习算法PPO，与准确率为78.34%的GRPO相当，甚至在某些模型上超过了它们。在GPU内存使用方面，ESSAM的平均GPU内存使用量相比PPO降低了18倍，相比GRPO降低了10倍，实现了极低的GPU内存占用。

Discovering Process-Outcome Credit in Multi-Step LLM Reasoning

在多步大型语言模型推理中发现过程-结果信用

Authors: Xiangwei Wang, Wei Wang, Ken Chen, Nanduni Nimalsiri, Saman Halgamuge
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01034
Pdf link: https://arxiv.org/pdf/2602.01034
Abstract Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals, which introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of reasoning steps against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.
中文摘要 强化学习（RL）是增强大型语言模型（LLMs）推理能力的有力范式，但标准的基于结果的方法常常存在奖励稀疏和信用分配效率低下的问题。本文提出了一种旨在提供连续奖励信号的新颖框架，它引入了逐步边际信息增益（MIG）机制，该机制相对于单调历史水位线量化推理步骤的内在价值，有效过滤了训练噪声。为确保信用分配的解耦，我们实施了解耦掩码策略，将面向过程的奖励专门施加于思维链（CoT），将面向结果的奖励施加于完整的生成结果。此外，我们还采用了双门控SFT目标，以高质量的结构性和事实性信号稳定训练。跨文本和多模态基准测试（如MATH、Super-CLEVR）的广泛实验表明，我们的方法在样本效率和最终准确率上始终优于GRPO等基线。此外，我们的模型展现出卓越的分布外鲁棒性，展示了对未见过且具有挑战性的推理任务的有前景的零样本迁移能力。
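
One possible reading of the step-wise gain "against a Monotonic Historical Watermark" is sketched below in Python. The abstract does not define the per-step score or the watermark, so this sketch assumes the watermark is the running maximum of earlier step scores and that a step is rewarded only for exceeding it. The function name `stepwise_gains`, the `baseline` parameter, and the example scores are hypothetical.

```python
from typing import List


def stepwise_gains(step_scores: List[float], baseline: float = 0.0) -> List[float]:
    """Reward each step only for the amount by which it beats the best score so far.

    step_scores: hypothetical per-step quality estimates (the abstract does
    not say how they are computed).
    baseline: assumed watermark before any reasoning step is taken.
    """
    gains = []
    watermark = baseline  # running maximum, so it never decreases
    for score in step_scores:
        gains.append(max(0.0, score - watermark))
        watermark = max(watermark, score)
    return gains


# A step that falls back below the watermark, or only ties it, earns nothing.
example = stepwise_gains([0.25, 0.125, 0.75, 0.75])
```

Under this assumed reading, fluctuations below the best score seen so far contribute zero reward, which is one way such a scheme could filter noisy steps.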

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

好的SFT优化SFT，更优秀的SFT为强化学习做准备

Authors: Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01058
Pdf link: https://arxiv.org/pdf/2602.01058
Abstract Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
中文摘要 推理型LLM的后训练是一个整体过程，通常包括离线SFT阶段，随后是在线强化学习（RL）阶段。然而，SFT通常被孤立地优化，只为最大化SFT自身的性能。我们表明，在相同的强化学习训练后，从更强SFT检查点初始化的模型可能显著差于从较弱检查点初始化的模型。我们将此归因于当前SFT-RL流水线中典型的不匹配：生成离线SFT数据的分布可能与在线强化学习时优化的策略有显著差异，而后者是从自身的rollout中学习的。我们提出了PEAR（受策略评估启发的离线学习损失重加权算法），这是一种SFT阶段的方法，用于纠正这一不匹配，使模型更好地为强化学习做准备。PEAR使用重要性采样来重新加权SFT损失，有分别作用于token、块和序列层面的三种变体。它可以用于增强标准SFT目标，且在收集离线数据的概率后，几乎不会产生额外的训练开销。我们在Qwen 2.5和3以及DeepSeek蒸馏模型上，对可验证推理游戏和数学推理任务进行了受控实验。PEAR相比标准SFT持续提升RL后的性能，在AIME2025上pass@8的提升最高达14.6%。我们的结果表明，PEAR通过在设计和评估SFT时考虑下游强化学习而非孤立看待，是迈向更整体的LLM后训练的有效一步。

SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

SetPO：用于多样性保持LLM推理的集合级策略优化

Authors: Chenyi Li, Yuan Zhang, Bo Wang, Guoqing Ma, Wei Tang, Haoyang Huang, Nan Duan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01062
Pdf link: https://arxiv.org/pdf/2602.01062
Abstract Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing large language models (LLMs) reasoning performance, especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.
中文摘要 带有可验证奖励的强化学习在提升大型语言模型（LLMs）推理能力方面表现出显著效果，尤其是在数学任务中。然而，这种改进通常伴随着结果多样性的降低，模型将概率质量集中在一小部分解上。受收益递减原则启发，我们引入了一个使用核化相似度在采样轨迹上定义的集合级多样性目标。我们的方法为每个采样轨迹推导出一个“留一”边际贡献，并将该目标作为即插即用的优势塑形项整合到策略优化中。我们进一步在分布扰动框架下探讨单一轨迹对语言模型多样性的贡献。该分析从理论上证实了单调性，证明更稀有的轨迹对全局多样性的边际贡献始终更高。在多个模型规模上的大量实验证明了我们提出的算法的有效性，在多个基准测试的Pass@1和Pass@K上持续优于强基线。
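
The leave-one-out marginal contribution mentioned in the SetPO abstract can be illustrated with a small Python sketch. The set-level diversity function used here (the sum of one minus an RBF kernel similarity over all pairs) is a stand-in chosen for simplicity, not the objective from the paper, and it is not claimed to have the diminishing-returns property the abstract refers to. The trajectory embeddings and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def rbf_similarity(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """RBF kernel similarity between two (hypothetical) trajectory embeddings."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def set_diversity(embeddings: List[Sequence[float]], gamma: float = 1.0) -> float:
    """Assumed stand-in diversity: sum of (1 - similarity) over unordered pairs."""
    n = len(embeddings)
    return sum(
        1.0 - rbf_similarity(embeddings[i], embeddings[j], gamma)
        for i in range(n)
        for j in range(i + 1, n)
    )


def leave_one_out_contributions(embeddings: List[Sequence[float]], gamma: float = 1.0) -> List[float]:
    """Diversity of the full set minus the diversity with trajectory i removed."""
    full = set_diversity(embeddings, gamma)
    return [
        full - set_diversity(embeddings[:i] + embeddings[i + 1:], gamma)
        for i in range(len(embeddings))
    ]


# Two identical trajectories and one outlier: the outlier contributes the most.
contributions = leave_one_out_contributions([(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)])
```

In this toy setting the rare trajectory receives the largest contribution, which matches the direction of the monotonicity property stated in the abstract; how the contribution is scaled into an advantage term is left unspecified here.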

Probing RLVR training instability through the lens of objective-level hacking

通过目标级黑客的视角探究RLVR训练不稳定性

Authors: Yiming Dong, Kun Fu, Haoyu Li, Xinyuan Zhu, Yurou Liu, Lijing Shao, Jieping Ye, Zheng Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01103
Pdf link: https://arxiv.org/pdf/2602.01103
Abstract Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
中文摘要 带有可验证奖励的长时间强化学习（RLVR）已被证明能持续提升大型语言模型的推理能力，但训练往往容易出现不稳定性，尤其是在专家混合（MoE）架构中。训练不稳定性严重削弱模型能力提升，但其根本原因和机制仍未得到充分理解。在本研究中，我们引入了一个通过目标级黑客视角理解RLVR不稳定性的原则性框架。与由可利用的验证器产生的奖励黑客不同，目标级黑客源于token级信用错位，表现为优化目标中的系统级虚假信号。基于我们的框架，结合在一个30B MoE模型上的广泛实验，我们追溯了MoE模型中一个关键病态训练动态的起源并形式化了其背后的机制：训练-推理差异的异常增长，这一现象被广泛认为与不稳定性相关，但此前缺乏机制性解释。这些发现为MoE模型中不稳定性背后的训练动态提供了具体的因果解释，为设计稳定的RLVR算法提供了指导。

Lyapunov Stability-Aware Stackelberg Game for Low-Altitude Economy: A Control-Oriented Pruning-Based DRL Approach

低空经济的Lyapunov稳定性感知Stackelberg博弈：一种面向控制的基于剪枝的DRL方法

Authors: Yue Zhong, Jiawen Kang, Yongju Tong, Hong-Ning Dai, Dong In Kim, Abbas Jamalipour, Shengli Xie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01131
Pdf link: https://arxiv.org/pdf/2602.01131
Abstract With the rapid expansion of the low-altitude economy, Unmanned Aerial Vehicles (UAVs) serve as pivotal aerial base stations supporting diverse services from users, ranging from latency-sensitive critical missions to bandwidth-intensive data streaming. However, the efficacy of such heterogeneous networks is often compromised by the conflict between limited onboard resources and stringent stability requirements. Moving beyond traditional throughput-centric designs, we propose a Sensing-Communication-Computing-Control closed-loop framework that explicitly models the impact of communication latency on physical control stability. To guarantee mission reliability, we leverage the Lyapunov stability theory to derive an intrinsic mapping between the state evolution of the control system and communication constraints, transforming abstract stability requirements into quantifiable resource boundaries. Then, we formulate the resource allocation problem as a Stackelberg game, where UAVs (as leaders) dynamically price resources to balance load and ensure stability, while users (as followers) optimize requests based on service urgency. Furthermore, addressing the prohibitive computational overhead of standard Deep Reinforcement Learning (DRL) on energy-constrained edge platforms, we propose a novel and lightweight pruning-based Proximal Policy Optimization (PPO) algorithm. By integrating a dynamic structured pruning mechanism, the proposed algorithm significantly compresses the neural network scale during training, enabling the UAV to rapidly approximate the game equilibrium with minimal inference latency. Simulation results demonstrate that the proposed scheme effectively secures control loop stability while maximizing system utility in dynamic low-altitude environments.
中文摘要 随着低空经济的快速扩展，无人机（UAV）成为支持用户多样化服务的关键空中基站，涵盖从延迟敏感的关键任务到带宽密集型数据流。然而，这种异构网络的效能常常因机载资源有限和严格的稳定性要求之间的冲突而受到影响。超越传统的以吞吐量为中心的设计，我们提出了一个传感-通信-计算-控制闭环框架，显式建模通信延迟对物理控制稳定性的影响。为保证任务可靠性，我们利用李雅普诺夫稳定性理论推导出控制系统状态演化与通信约束之间的内在映射，将抽象的稳定性要求转化为可量化的资源边界。然后，我们将资源分配问题表述为斯塔克尔伯格博弈，无人机（作为领导者）动态定价资源以平衡负载并确保稳定性，而用户（作为跟随者）则根据服务紧急度优化请求。此外，针对标准深度强化学习（DRL）在能量受限的边缘平台上过高的计算开销，我们提出了一种新颖且轻量级的基于剪枝的近端策略优化（PPO）算法。通过集成动态结构化剪枝机制，所提算法在训练过程中显著压缩了神经网络规模，使无人机能够以极低的推理延迟快速逼近博弈均衡。仿真结果表明，该方案在动态低空环境中有效保障控制环路稳定性，同时最大化系统效用。

Parallel Training in Spiking Neural Networks

脉冲神经网络的并行训练

Authors: Yanbin Huang, Man Yao, Yuqi Pan, Changze Lv, Siyuan Xu, Xiaoqing Zheng, Bo Xu, Guoqi Li
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2602.01133
Pdf link: https://arxiv.org/pdf/2602.01133
Abstract The bio-inspired integrate-fire-reset mechanism of spiking neurons constitutes the foundation for efficient processing in Spiking Neural Networks (SNNs). Recent progress in large models demands that spiking neurons support highly parallel computation to scale efficiently on modern GPUs. This work proposes a novel functional perspective that provides general guidance for designing parallel spiking neurons. We argue that the reset mechanism, which induces complex temporal dependencies and hinders parallel training, should be removed. However, any such modification should satisfy two principles: 1) preserving the functions of reset as a core biological mechanism; and 2) enabling parallel training without sacrificing the serial inference ability of spiking neurons, which underpins their efficiency at test time. To this end, we identify the functions of the reset and analyze how to reconcile parallel training with serial inference, upon which we propose a dynamic decay spiking neuron. We conduct comprehensive testing of our method in terms of: 1) Training efficiency and extrapolation capability. On 16k-length sequences, we achieve a 25.6x training speedup over the pioneering parallel spiking neuron, and our models trained on 2k-length can stably perform inference on sequences as long as 30k. 2) Generality. We demonstrate the consistent effectiveness of the proposed method across five task categories (image classification, neuromorphic event processing, time-series forecasting, language modeling, and reinforcement learning), three network architectures (spiking CNN/Transformer/SSMs), and two spike activation modes (spike/integer activation). 3) Energy consumption. The spiking firing of our neuron is lower than that of vanilla and existing parallel spiking neurons.
中文摘要 受生物启发的脉冲神经元积分-发放-重置机制构成了脉冲神经网络（SNN）高效处理的基础。大型模型的最新进展要求脉冲神经元支持高度并行计算，以在现代GPU上高效扩展。本研究提出了一种新颖的功能视角，为设计并行脉冲神经元提供通用指导。我们认为，应当去除引入复杂时间依赖并阻碍并行训练的重置机制。然而，任何此类改造都应满足两个原则：1）保留重置作为核心生物机制的功能；2）实现并行训练而不牺牲脉冲神经元的串行推理能力，这是它们在测试时高效的基础。为此，我们识别了重置的功能，并分析如何调和并行训练与串行推理，基于此我们提出了一个动态衰减脉冲神经元。我们对方法进行了全面测试，包括：1）训练效率和外推能力。在16k长度的序列上，我们相比开创性的并行脉冲神经元实现了25.6倍的训练加速，而在2k长度上训练的模型可以稳定地对长达30k的序列进行推理。2）通用性。我们展示了该方法在五个任务类别（图像分类、神经形态事件处理、时间序列预测、语言建模和强化学习）、三种网络架构（脉冲CNN/Transformer/SSM）以及两种脉冲激活模式（脉冲/整数激活）上的一致有效性。3）能耗。我们神经元的脉冲发放低于普通的和现有的并行脉冲神经元。

Self-Generative Adversarial Fine-Tuning for Large Language Models

大型语言模型的自生成对抗微调

Authors: Shiguang Wu, Yaqing Wang, Quanming Yao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01137
Pdf link: https://arxiv.org/pdf/2602.01137
Abstract Fine-tuning large language models (LLMs) for alignment typically relies on supervised fine-tuning or reinforcement learning from human feedback, both limited by the cost and scarcity of high-quality annotations. Recent self-play and synthetic data approaches reduce this dependence but often rely on heuristic assumptions or ungrounded self-evaluation, which can cause bias accumulation and performance drift. In this paper, we propose Self-Generative Adversarial LLM (SGALM), a unified fine-tuning framework that formulates alignment as a generative adversarial game within a single LLM. SGALM jointly evolves generation and discrimination capabilities without external reward models. Theoretical and empirical results demonstrate that SGALM achieves state-of-the-art performance, serves as an effective alignment algorithm and a robust synthetic data engine.
中文摘要 对大型语言模型（LLMs）进行微调以实现对齐，通常依赖于监督式微调或基于人类反馈的强化学习，这两者都受限于高质量标注的成本和稀缺性。近期的自博弈和合成数据方法减少了这种依赖，但通常依赖启发式假设或缺乏依据的自我评估，这可能导致偏差积累和性能漂移。本文提出自生成对抗性大型语言模型（SGALM），这是一种统一的微调框架，将对齐构建为单一大型语言模型内部的生成对抗博弈。SGALM 联合演化生成和判别能力，无需外部奖励模型。理论和实证结果表明，SGALM实现了最先进的性能，既是有效的对齐算法，也是稳健的合成数据引擎。

PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning

PolicyFlow：强化学习中基于连续归一化流的策略优化

Authors: Shunpeng Yang, Ben Liu, Hua Chen
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.01156
Pdf link: https://arxiv.org/pdf/2602.01156
Abstract Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihood; this is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To further prevent mode collapse and encourage diverse behaviors, we propose the Brownian Regularizer, an implicit policy entropy regularizer inspired by Brownian motion, which is conceptually elegant and computationally lightweight. Experiments on diverse tasks across various environments including MultiGoal, PointMaze, IsaacLab and MuJoCo Playground show that PolicyFlow achieves competitive or superior performance compared to PPO using Gaussian policies and flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow's ability to capture richer multimodal action distributions.
中文摘要 在策略强化算法中，近点策略优化（PPO）因其简洁、数值稳定性和强实证性能而广受青睐。标准PPO依赖于通过重要性比率定义的替代目标，这需要评估策略似然，而当策略以高斯分布建模时，这通常很简单。然而，将PPO扩展到更具表现力、高容量的策略模型，如连续归一化流（CNF），也称为流量匹配模型，存在挑战，因为沿完整流轨迹进行似然评估计算成本高且常常数值不稳定。为解决这一问题，我们提出了PolicyFlow，一种基于策略的CNF强化学习算法，将表达性CNF策略与PPO式目标整合，无需在全流程路径上进行似然评估。PolicyFlow 通过沿简单插值路径对速度场变化来近似重要性比，降低计算开销而不影响训练稳定性。为了进一步防止模态崩溃并鼓励多样化行为，我们提出了布朗正则化器，这是一种受布朗运动启发的隐式策略熵正则化器，概念优雅且计算量轻。在包括 MultiGoal、PointMaze、IsaacLab 和 MuJoCo Playground 等多种环境中的各种任务实验显示，PolicyFlow 在使用高斯策略和基于流量的基线（如 FPO 和 DPPO）时，能够实现与 PPO 竞争或更优的性能。值得注意的是，MultiGoal 的结果凸显了 PolicyFlow 捕捉更丰富多模态动作分布的能力。
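
For reference, the importance-ratio surrogate that the abstract attributes to standard PPO can be sketched as below. This is a minimal illustration of the well-known clipped objective with a 1-D Gaussian policy, not PolicyFlow's approximated ratio; the function names and the clip value 0.2 are assumptions chosen for the example.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy evaluated at `action`."""
    var = std * std
    return -0.5 * ((action - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Toy values (hypothetical): same action scored under an old and an updated policy.
logp_old = gaussian_log_prob(0.5, mean=0.0, std=1.0)
logp_new = gaussian_log_prob(0.5, mean=0.4, std=1.0)
term = ppo_clip_term(logp_new, logp_old, advantage=1.0)
```

The sketch shows why a closed-form log-likelihood matters: the ratio is a single exponential of a log-probability difference, which is exactly the quantity that becomes expensive when the policy is a CNF.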

Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

Med3D-R1：激励三维医学视觉语言模型中的临床推理以诊断异常

Authors: Haoran Lai, Zihang Jiang, Kun Zhang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Shaohua Kevin Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.01200
Pdf link: https://arxiv.org/pdf/2602.01200
Abstract Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During the SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In the RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.
中文摘要 由于体积医学影像本身的复杂性、模型容易过拟合表面报告模式以及缺乏可解释性的奖励设计，开发具有坚实临床推理的三维视觉语言模型仍是一项挑战。本文提出了Med3D-R1，一种强化学习框架，采用两阶段训练过程：监督式微调（SFT）和强化学习（RL）。在SFT阶段，我们引入了残余比对机制，以弥合高维三维特征与文本嵌入之间的差距，并采用异常重权策略，强调临床信息量的标记，减少报告中的结构偏倚。在强化学习阶段，我们重新设计一致性奖励，明确促进连贯的、逐步的诊断推理。我们利用两个3D诊断基准测试CT-RATE和RAD-ChestCT评估医学多项选择视觉问题解答方法，模型在CT-RATE上达到了41.92%和RAD-ChestCT中44.99%的先进准确率。这些结果显示异常诊断和临床推理能力有所提升，并且在这两个基准测试上都优于以往方法。总体而言，我们的方法有望通过实现更可靠、更透明的3D医疗视觉语言系统，提升现实诊断工作流程。

ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

ASTER：带有工具集成扩展推理的智能尺度

Authors: Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He, Dong Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01204
Pdf link: https://arxiv.org/pdf/2602.01204
Abstract Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.
中文摘要 强化学习（RL）已成为大型语言模型（LLM）中引发长视野推理的主流范式。然而，通过强化学习扩展工具集成推理（TIR）仍面临挑战，原因是交互崩溃：这是一种病态状态，模型无法持续多回合工具使用，反而退化为繁重的内部推理，仅有简单、事后验证代码。我们系统地研究了三个问题：（i）冷启动SFT如何诱导代理性、工具使用行为先验，（ii）冷启动轨迹的交互密度如何影响探索和后续强化学习结果，以及（iii）强化学习相互作用预算如何影响不同推理时间预算下的学习动态和泛化。随后，我们引入了ASTER（工具集成扩展推理的能动尺度），该框架通过优先考虑交互密集轨迹的有针对性冷启动策略规避了这种崩溃。我们发现，仅4K交互密集轨迹的小型专家冷启动组能带来最强的下游表现，建立稳健先验，从而在延长强化学习训练中实现更优越的探索。广泛评估表明，ASTER-4B在竞争性数学基准测试中取得了最先进的成绩，在AIME 2025中达到90.0%，超过了包括DeepSeek-V3.2-Exp在内的领先前沿开源模型。

Sample Efficient Active Algorithms for Offline Reinforcement Learning

线下强化学习的高效主动算法示例

Authors: Soumyadeep Roy, Shashwat Kushwaha, Ambedkar Dukkipati
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01260
Pdf link: https://arxiv.org/pdf/2602.01260
Abstract Offline reinforcement learning (RL) enables policy learning from static data but often suffers from poor coverage of the state-action space and from distributional shift. These problems can be addressed by allowing limited online interactions to selectively refine uncertain regions of the learned value function, an approach referred to as Active Reinforcement Learning (ActiveRL). While there has been good empirical success, no theoretical analysis is available in the literature. We fill this gap by developing a rigorous sample-complexity analysis of ActiveRL through the lens of Gaussian Process (GP) uncertainty modeling. To this end, we propose an algorithm and, using GP concentration inequalities and information-gain bounds, derive high-probability guarantees showing that an $\epsilon$-optimal policy can be learned with ${\mathcal{O}}(1/\epsilon^2)$ active transitions, improving upon the $\Omega(1/\epsilon^2(1-\gamma)^4)$ rate of purely offline methods. Our results reveal that ActiveRL achieves near-optimal information efficiency, that is, guided uncertainty reduction leads to accelerated value-function convergence with minimal online data. Our analysis bridges Bayesian nonparametric regression and reinforcement learning theories. We conduct several experiments to validate the algorithm and theoretical findings.
中文摘要 离线强化学习（RL）使得从静态数据中进行策略学习成为可能，但通常存在对状态-行动空间和分布转移问题的覆盖不足。这个问题可以通过允许有限的在线交互选择性地细化学习价值函数中不确定的区域来解决，这被称为主动强化学习（ActiveRL）。虽然已有良好的实证成功，但文献中尚无理论分析。我们通过高斯过程（GP）不确定性建模的视角，开发了对ActiveRL的严谨样本复杂度分析来填补这一空白。在这方面，我们提出算法，利用GP集中不等式和信息增益界限，推导出高概率保证，表明$\epsilon$最优策略可以通过${\mathcal{O}}（1/\epsilon^2）$主动转移学习，改进了纯离线方法的$\Omega（1/\epsilon^2（1-\gamma）^4）$速率。我们的结果表明，ActiveRL实现了近乎最优的信息效率，即引导不确定性降低，从而在极少的在线数据下加速价值函数收敛。我们的分析基于GP集中不等式和信息增益界限，桥接了贝叶斯非参数回归和强化学习理论。我们进行了多项实验以验证算法和理论发现。
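
To get a feel for the gap between the two rates quoted in the abstract, the following sketch plugs in numbers. It assumes the offline rate is read as $1/(\epsilon^2(1-\gamma)^4)$ and sets all hidden constants and logarithmic factors to 1, so the outputs are orders of magnitude only, not actual sample counts from the paper.

```python
def active_transitions(eps):
    """Order of the O(1 / eps^2) active-transition bound (constants assumed to be 1)."""
    return 1.0 / eps ** 2


def offline_transitions(eps, gamma):
    """Order of the offline rate, read here as 1 / (eps^2 * (1 - gamma)^4)."""
    return 1.0 / (eps ** 2 * (1.0 - gamma) ** 4)


# Hypothetical settings chosen for illustration only.
eps, gamma = 0.1, 0.9
active = active_transitions(eps)            # about 1e2
offline = offline_transitions(eps, gamma)   # about 1e6
ratio = offline / active                    # equals 1 / (1 - gamma)^4, about 1e4
```

Under this reading, the improvement factor is $1/(1-\gamma)^4$, which grows quickly as the discount factor approaches 1.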

Reinforcement Learning for Active Perception in Autonomous Navigation

自主导航中主动感知的强化学习

Authors: Grzegorz Malczyk, Mihir Kulkarni, Kostas Alexis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.01266
Pdf link: https://arxiv.org/pdf/2602.01266
Abstract This paper addresses the challenge of active perception within autonomous navigation in complex, unknown environments. Revisiting the foundational principles of active perception, we introduce an end-to-end reinforcement learning framework in which a robot must not only reach a goal while avoiding obstacles, but also actively control its onboard camera to enhance situational awareness. The policy receives observations comprising the robot state, the current depth frame, and a particularly local geometry representation built from a short history of depth readings. To couple collision-free motion planning with information-driven active camera control, we augment the navigation reward with a voxel-based information metric. This enables an aerial robot to learn a robust policy that balances goal-directed motion with exploratory sensing. Extensive evaluation demonstrates that our strategy achieves safer flight compared to using fixed, non-actuated camera baselines while also inducing intrinsic exploratory behaviors.
中文摘要 本文探讨了在复杂未知环境中自主导航中主动感知的挑战。回顾主动感知的基础原则，我们引入了端到端强化学习框架，机器人不仅要在避开障碍物的同时达成目标，还要主动控制机载摄像头以增强态势感知能力。该政策接收的观测数据包括机器人状态、当前深度框架，以及基于短期深度读数历史构建的局部几何表示。为了将无碰撞运动规划与信息驱动的主动摄像机控制结合起来，我们用基于体素的信息指标来增强导航奖励。这使得空中机器人能够学习一套稳健的策略，平衡目标导向运动与探索性感知。广泛评估表明，我们的策略相比使用固定的非驱动相机基线，实现了更安全的飞行，同时还能诱发内在的探索行为。

Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics

世界混合模型：利用模块化潜在动力学扩展多任务强化学习

Authors: Boxuan Zhang, Weipu Zhang, Zhaohan Feng, Wei Xiao, Jian Sun, Jie Chen, Gang Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01270
Pdf link: https://arxiv.org/pdf/2602.01270
Abstract A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit substantial heterogeneity in both observations and dynamics. Model-based reinforcement learning offers a promising path to improved sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, resulting in poor reconstruction and prediction accuracy. We introduce Mixture-of-World Models (MoW), a scalable architecture that combines modular variational autoencoders for task-adaptive visual compression, a hybrid Transformer-based dynamics model with task-conditioned experts and a shared backbone, and a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, a single MoW agent trained once on 26 Atari games achieves a mean human-normalized score of 110.4%, competitive with the score of 114.2% achieved by STORM, an ensemble of 26 task-specific models, while using 50% fewer parameters. On Meta-World, MoW achieves a 74.5% average success rate within 300 thousand environment steps, establishing a new state of the art. These results demonstrate that MoW provides a scalable and parameter-efficient foundation for generalist world models.
中文摘要 多任务强化学习（MTRL）中的一个根本挑战是在视觉领域实现样本效率，因为任务在观察和动态上表现出显著异质性。基于模型的强化学习为通过世界模型提升样本效率提供了有前景的途径，但标准的单体架构难以捕捉多样化的任务动态，导致重建和预测准确性较差。我们介绍了世界混合模型（MoW），这是一种可扩展架构，结合了模块化变分自编码器实现任务自适应视觉压缩、基于Transformer的混合动力学模型与任务条件专家和共享骨干，以及基于梯度的任务聚类策略以实现参数高效分配。在Atari 100k基准测试中，单个MoW代理在26款Atari游戏中训练一次后，平均人类归一化得分为110.4%，与STORM的114.2%相当，STORM是26个任务特定模型的集合，且参数数量减少了50%。在元世界中，MoW在30万个环境步内实现了74.5%的平均成功率，确立了新的技术水平。这些结果表明，MoW为通用世界模型提供了可扩展且参数高效的基础。
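
The "mean human-normalized score" reported above is conventionally computed per game as (agent - random) / (human - random) and then averaged over games. The sketch below uses that common definition; the game names and raw scores are made up for illustration and are not taken from the paper.

```python
def human_normalized_score(agent, random, human):
    """Common Atari convention: 0.0 matches random play, 1.0 matches the human reference."""
    return (agent - random) / (human - random)


def mean_hns(scores):
    """Average the per-game normalized scores; `scores` maps game -> (agent, random, human)."""
    values = [human_normalized_score(*triple) for triple in scores.values()]
    return sum(values) / len(values)


# Hypothetical raw scores, not real benchmark numbers.
toy_scores = {
    "game_a": (60.0, 10.0, 110.0),    # normalized: 0.5
    "game_b": (300.0, 100.0, 200.0),  # normalized: 2.0
}
mean_score = mean_hns(toy_scores)
```

Because the mean is taken over normalized scores, a few games with very high ratios can dominate it, which is worth keeping in mind when comparing the 110.4% and 114.2% figures.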

From Intents to Actions: Agentic AI in Autonomous Networks

从意图到行动：自主网络中的智能人工智能

Authors: Burak Demirel, Pablo Soldati, Yu Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01271
Pdf link: https://arxiv.org/pdf/2602.01271
Abstract Telecommunication networks are increasingly expected to operate autonomously while supporting heterogeneous services with diverse and often conflicting intents -- that is, performance objectives, constraints, and requirements specific to each service. However, transforming high-level intents -- such as ultra-low latency, high throughput, or energy efficiency -- into concrete control actions (i.e., low-level actuator commands) remains beyond the capability of existing heuristic approaches. This work introduces an Agentic AI system for intent-driven autonomous networks, structured around three specialized agents. A supervisory interpreter agent, powered by language models, performs both lexical parsing of intents into executable optimization templates and cognitive refinement based on feedback, constraint feasibility, and evolving network conditions. An optimizer agent converts these templates into tractable optimization problems, analyzes trade-offs, and derives preferences across objectives. Lastly, a preference-driven controller agent, based on multi-objective reinforcement learning, leverages these preferences to operate near the Pareto frontier of network performance that best satisfies the original intent. Collectively, these agents enable networks to autonomously interpret, reason over, adapt to, and act upon diverse intents and network conditions in a scalable manner.
中文摘要 电信网络越来越多地被期望在支持具有多样且常常相互冲突意图的异构服务的同时实现自主运行——即每个服务的特定性能目标、约束和需求。然而，将高层次意图——如超低延迟、高通量或能效——转化为具体的控制动作（即低级执行器命令）仍超出现有启发式方法的能力范围。这项工作引入了一种面向意图驱动自主网络的代理人工智能系统，结构围绕三个专门代理。由语言模型驱动的监督解释代理既能将意图进行词汇解析成可执行的优化模板，也基于反馈、约束可行性和不断变化的网络条件进行认知细化。优化代理将这些模板转换为可处理的优化问题，分析权衡，并推导跨目标的偏好。最后，基于多目标强化学习的偏好驱动控制器代理，利用这些偏好，在网络性能的帕累托边界附近运行，以最能满足原始意图。这些代理共同使网络能够自主地解释、推理、适应并以可扩展的方式处理不同的意图和网络条件。

AOASS: Adaptive Obstacle-Aware Square Spiral Framework for Single-mobile Anchor-Based WSN Localization

AOASS：单移动锚点WSN定位的自适应障碍感知方形螺旋框架

Authors: Abdelhady Naguib
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.01290
Pdf link: https://arxiv.org/pdf/2602.01290
Abstract Accurate and energy-efficient localization remains a key challenge in Wireless Sensor Networks (WSNs), particularly when obstacles affect signal propagation. This study introduces AOASS (Adaptive Obstacle-Aware Square Spiral), a new single-mobile-anchor framework that combines an optimized square spiral movement pattern with adaptive obstacle detection. The mobile anchor can sense and bypass obstacles while maintaining high localization accuracy and full network coverage, ensuring that each node receives at least three noncollinear beacon signals for reliable position estimation. Localization accuracy is further improved using the OLSTM DV-Hop model, which integrates a Long Short-Term Memory (LSTM) network with the traditional DV-Hop algorithm to better estimate hop distances and reduce multi-hop errors. The anchor trajectory is managed by a TD3 LSTM reinforcement learning agent, supported by a Kalman-based prediction layer and a fuzzy logic ORCA safety module for smooth and collision-free navigation. Simulation experiments across different obstacle densities show that AOASS consistently achieves higher localization accuracy, better energy efficiency, and more optimized trajectories than existing approaches. These results demonstrate the framework's scalability and potential for real-world WSN applications, offering an intelligent and adaptable solution for data-driven IoT systems.
中文摘要 准确且节能的定位仍然是无线传感器网络（WSN）中的关键挑战，尤其是在障碍物影响信号传播时。本研究引入了AOASS（自适应障碍感知方形螺旋），这是一种新的单移动锚点框架，结合了优化的方形螺旋移动模式与自适应障碍检测。移动锚点能够在保持高定位精度和全网络覆盖的同时，探测并绕过障碍物，确保每个节点至少接收到三个非共线信标信号，以实现可靠的位置估计。通过OLSTM的DV跳跃模型进一步提升定位精度，该模型将长短时记忆（LSTM）网络与传统的DV跳算法整合，更好地估算跳跃距离并减少多跳误差。锚点轨迹由TD3 LSTM强化学习代理管理，支持基于卡尔曼的预测层和模糊逻辑ORCA安全模块，实现平滑且无碰撞的导航。不同障碍密度的模拟实验表明，AOASS始终比现有方法实现更高的定位精度、更佳的能量效率和更优化的轨迹。这些结果展示了框架的可扩展性和在现实世界WSN应用中的潜力，为数据驱动的物联网系统提供了智能且适应性的解决方案。

What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

视觉工具使用强化学习到底学到了什么？裁剪与缩放工具诱导与内在效应的解开

Authors: Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.01334
Pdf link: https://arxiv.org/pdf/2602.01334
Abstract Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.
中文摘要 视觉工具使用强化学习（RL）可以为视觉语言模型配备裁剪缩放等视觉作符，并实现显著的性能提升，但目前尚不清楚这些提升是由工具使用改进驱动，还是内在进化。介绍MED（测量-解释-诊断）——一个从粗到细的框架，将内在能力变化与工具诱发的效应分开，将工具引起的性能差异分解为增益和损害项，并探究驱动其进化的机制。在两个具备不同工具先验和六个基准测试的VLM检查点级分析中，我们发现改进主要体现在内在学习，而工具使用强化学习主要减少工具引起的伤害（例如，呼叫错误减少和工具模式干扰较弱），且在基于工具的内在故障纠正方面取得有限进展。总体而言，当前视觉工具的强化学习学会了与工具安全共存，而非精通它们。
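
One natural way to split a tool-induced accuracy difference into gain and harm terms is to compare, per example, the tool-free and tool-enabled outcomes. The sketch below uses that reading: gain is the fraction of examples the tool fixes, harm is the fraction it breaks, and their difference equals the accuracy difference. This is an assumption for illustration; the paper's exact definitions may differ.

```python
def decompose_tool_effect(records):
    """records: list of (correct_without_tool, correct_with_tool) booleans per example."""
    n = len(records)
    gain = sum(1 for without, with_tool in records if not without and with_tool) / n
    harm = sum(1 for without, with_tool in records if without and not with_tool) / n
    acc_without = sum(1 for without, _ in records if without) / n
    acc_with = sum(1 for _, with_tool in records if with_tool) / n
    return {"gain": gain, "harm": harm, "diff": acc_with - acc_without}


# Hypothetical evaluation outcomes for five examples.
toy_records = [
    (True, True),    # correct either way
    (True, False),   # tool call broke a correct answer (harm)
    (False, True),   # tool call fixed a wrong answer (gain)
    (False, True),   # gain
    (False, False),  # wrong either way
]
effect = decompose_tool_effect(toy_records)
```

With this split, a model can improve its tool-enabled accuracy either by raising gain or by lowering harm, which matches the distinction the abstract draws.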

Adaptive Quantum-Safe Cryptography for 6G Vehicular Networks via Context-Aware Optimization

通过上下文感知优化实现6G车载网络的自适应量子安全密码学

Authors: Poushali Sengupta, Mayank Raikwar, Sabita Maharjan, Frank Eliassen, Yan Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2602.01342
Pdf link: https://arxiv.org/pdf/2602.01342
Abstract Powerful quantum computers in the future may be able to break the security used for communication between vehicles and other devices (Vehicle-to-Everything, or V2X). New security methods called post-quantum cryptography can help protect these systems, but they often require more computing power and can slow down communication, posing a challenge for fast 6G vehicle networks. In this paper, we propose an adaptive post-quantum cryptography (PQC) framework that predicts short-term mobility and channel variations and dynamically selects suitable lattice-, code-, or hash-based PQC configurations using a predictive multi-objective evolutionary algorithm (APMOEA) to meet vehicular latency and security requirements. However, frequent cryptographic reconfiguration in dynamic vehicular environments introduces new attack surfaces during algorithm transitions. A secure monotonic-upgrade protocol prevents downgrade, replay, and desynchronization attacks during transitions. Theoretical results show decision stability under bounded prediction error, latency boundedness under mobility drift, and correctness under small forecast noise. These results demonstrate a practical path toward quantum-safe cryptography in future 6G vehicular networks. Through extensive experiments based on realistic mobility (LuST), weather (ERA5), and NR-V2X channel traces, we show that the proposed framework reduces end-to-end latency by up to 27%, lowers communication overhead by up to 65%, and effectively stabilizes cryptographic switching behavior using reinforcement learning. Moreover, under the evaluated adversarial scenarios, the monotonic-upgrade protocol successfully prevents downgrade, replay, and desynchronization attacks.
中文摘要 未来的强大量子计算机可能突破用于车辆与设备之间通信的安全性（车辆对全设备，V2X）。被称为后量子密码学的新安全方法可以帮助保护这些系统，但通常需要更多的计算能力，并且可能减慢通信速度，这对高速6G车辆网络构成挑战。本文提出了一种自适应后量子密码学（PQC）框架，能够预测短期移动性和信道变化，并利用预测多目标进化算法（APMOEA）动态选择合适的基于格子、代码或哈希的PQC配置，以满足车辆的延迟和安全。该http URL在动态车辆环境中频繁的密码学重构在算法转换过程中引入了新的攻击面。安全的单调升级协议防止切换过程中的降级、重放和不同步攻击。理论结果显示，在有界预测误差下决策稳定性、移动漂移下的延迟有界性以及在小预报噪声下的正确性。这些结果展示了未来6G车载网络实现量子安全密码学的切实可行路径。通过基于真实移动性（LuST）、天气（ERA5）和NR-V2X信道追踪的广泛实验，我们表明所提出的框架可将端到端延迟降低多达27%，通信开销降低多达65%，并通过强化学习有效稳定密码交换行为。此外，在评估的对抗场景下，单调升级协议成功防止降级、重放和不同步攻击。
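
The abstract does not spell out the monotonic-upgrade protocol, so the following is only a generic sketch of the idea its name suggests: a receiver accepts a configuration change only if the message is fresh (strictly increasing counter) and the proposed security level is not lower than the current one. The `SessionState` class, the integer security levels, and the counter scheme are all hypothetical, and message authentication is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    security_level: int  # hypothetical ordinal; higher means a stronger PQC configuration
    last_counter: int    # highest message counter accepted so far


def accept_transition(state, proposed_level, counter):
    """Accept only fresh, non-downgrading requests; update `state` on success."""
    if counter <= state.last_counter:
        return False  # stale or replayed message
    if proposed_level < state.security_level:
        return False  # downgrade attempt
    state.security_level = proposed_level
    state.last_counter = counter
    return True


# Hypothetical session: level 2, last accepted counter 5.
state = SessionState(security_level=2, last_counter=5)
upgrade_ok = accept_transition(state, proposed_level=3, counter=6)    # fresh upgrade
downgrade_ok = accept_transition(state, proposed_level=1, counter=7)  # rejected
replay_ok = accept_transition(state, proposed_level=3, counter=6)     # rejected
```

A real deployment would also need authenticated messages and a rule for resynchronizing counters after packet loss, neither of which is shown here.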

CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

CRAFT：通过强化学习实现多跳问答的校准推理与答案忠实追踪

Authors: Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01348
Pdf link: https://arxiv.org/pdf/2602.01348
Abstract Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.
中文摘要 检索增强生成（RAG）被广泛用于为大型语言模型（LLMs）进行多跳问答的基础。近期工作主要集中在通过微调和结构化或基于强化的优化来提高答案准确性。然而，可靠的回应生成推理面临三大挑战：1）推理崩溃。多跳质检中的推理本质复杂，因为多跳组成，且噪声反复使推理更加不稳定。2）推理与答案不一致。由于LLM生成的固有不确定性和证据的暴露——干扰因素混合物，模型可能得出的正确答案并未得到其中间推理或证据的忠实支持。3）失去格式控制权。传统的思维链生成常常偏离所需的结构化输出格式，导致结构化内容不完整或结构化不完整。为应对这些挑战，我们提出了CRAFT（带有真实响应痕迹的校准推理），这是一个基于群体相对策略优化（GRPO）的强化学习框架，训练模型在响应生成过程中执行忠实推理。CRAFT采用双重奖励机制优化多跳推理：确定性奖励确保结构正确性，而基于判决的奖励验证语义忠实性。该优化框架支持可控迹变体，使结构和规模如何影响推理性能和忠实度的系统分析成为可能。在三个多跳QA基准测试上的实验显示，CRAFT在不同模型尺度上都提升了答案的准确性和推理忠实度，CRAFT 7B模型在多种推理追踪设置下与闭源LLM的竞争性能均有竞争力。
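
As a rough illustration of how a GRPO-style update could consume the two kinds of reward mentioned above, the sketch below combines a deterministic format check with a judge score and then standardizes rewards within a group of responses to the same question. The equal weights, the `combined_reward` helper, and the use of the population standard deviation are assumptions; the abstract does not say how CRAFT combines or normalizes its rewards.

```python
import statistics


def combined_reward(format_ok, judge_score, w_format=0.5, w_judge=0.5):
    """Hypothetical mix of a deterministic format check and a judge score in [0, 1]."""
    return w_format * (1.0 if format_ok else 0.0) + w_judge * judge_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical responses sampled for one question.
group = [
    combined_reward(True, 1.0),   # well-formed and judged faithful
    combined_reward(False, 0.0),  # malformed and unfaithful
    combined_reward(True, 1.0),
    combined_reward(False, 0.0),
]
advantages = group_relative_advantages(group)
```

Because advantages are relative to the group mean, a group in which every response earns the same reward yields all-zero advantages and therefore no learning signal for that question.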

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

PromptRL：基于流的图像生成中的提示重要性

Authors: Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01382
Pdf link: https://arxiv.org/pdf/2602.01382
Abstract Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at this https URL.
中文摘要 流程匹配模型（FM）彻底改变了文本到图像（T2I）生成，强化学习（RL）成为与奖励目标对齐的关键训练后策略。本研究显示，当前FM的强化学习流水线存在两个被低估但重要的局限性：由于生成多样性不足导致的样本效率低下，以及明显的提示过拟合，即模型记忆特定训练表述，在语义等效但风格多样的提示下表现会显著崩溃。我们介绍了PromptRL（Prompt Matters in RL for Flow-Based Image Generation），这是一个框架，将语言模型（LM）作为可训练的提示优化代理直接纳入基于流的强化学习优化循环。该设计带来了两个互补的好处：快速开发复杂的提示重写能力，关键是实现协同训练体系，重塑优化动态。PromptRL 在多个基准测试中达到了最先进的性能，GenEval 得分为 0.97，OCR 准确度为 0.98，PickScore 得分为 24.05。此外，我们验证了强化学习方法在大规模图像编辑模型上的有效性，仅有0.06百万次部署，将FLUX.1-Kontext的EditReward从1.19提升至1.43，超过得分1.37的Gemini 2.5 Flash Image（又称Nano Banana），并与依赖细粒度数据注释和复杂多阶段训练的ReasonNet（1.44）性能相当。我们大量实验实证表明，PromptRL在与纯流程强化学习相比，持续实现更高的性能上限，同时所需的部署次数减少了超过2美元/倍数。我们的代码可在此 https URL 访问。

The Enhanced Physics-Informed Kolmogorov-Arnold Networks: Applications of Newton's Laws in Financial Deep Reinforcement Learning (RL) Algorithms

增强型物理知情的柯尔莫哥洛夫-阿诺德网络：牛顿定律在金融深度强化学习（RL）算法中的应用

Authors: Trang Thoi, Hung Tran, Tram Thoi, Huaiyang Zhong
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01388
Pdf link: https://arxiv.org/pdf/2602.01388
Abstract Deep Reinforcement Learning (DRL), a subset of machine learning focused on sequential decision-making, has emerged as a powerful approach for tackling financial trading problems. In finance, DRL is commonly used either to generate discrete trade signals or to determine continuous portfolio allocations. In this work, we propose a novel reinforcement learning framework for portfolio optimization that incorporates Physics-Informed Kolmogorov-Arnold Networks (PIKANs) into several DRL algorithms. The approach replaces conventional multilayer perceptrons with Kolmogorov-Arnold Networks (KANs) in both actor and critic components, utilizing learnable B-spline univariate functions to achieve parameter-efficient and more interpretable function approximation. During actor updates, we introduce a physics-informed regularization loss that promotes second-order temporal consistency between observed return dynamics and the action-induced portfolio adjustments. The proposed framework is evaluated across three equity markets (China, Vietnam, and the United States), covering both emerging and developed economies. Across all three markets, PIKAN-based agents consistently deliver higher cumulative and annualized returns, superior Sharpe and Calmar ratios, and more favorable drawdown characteristics compared to both standard DRL baselines and classical online portfolio-selection methods. This yields more stable training, higher Sharpe ratios, and superior performance compared to traditional DRL counterparts. The approach is particularly valuable in highly dynamic and noisy financial markets, where conventional DRL often suffers from instability and poor generalization.
中文摘要 深度强化学习（DRL）是机器学习的一个子集，专注于顺序决策，已成为解决金融交易问题的有力方法。在金融领域，DRL通常用于生成离散的交易信号或确定连续的投资组合配置。本研究提出了一种新型强化学习框架，用于组合优化，将物理知情的柯尔莫哥洛夫-阿诺德网络（PIKANs）整合进多个 DRL 算法中。该方法在演员和批判者组件中用Kolmogorov-Arnold网络（KAN）取代了传统的多层感知器——利用可学习的B样条单变量函数，实现参数高效且更易解释的函数近似。在演员更新过程中，我们引入了一种基于物理的正则化损失，促进观察到的回报动态与动作诱导的投资组合调整之间的二阶时间一致性。该框架涵盖中国、越南和美国三大股市，涵盖新兴和发达经济体。在这三个市场中，基于PIKAN的代理人始终提供更高的累计和年化回报，优于Sharpe和Calmar比率，以及比标准DRL基线和传统在线投资组合选择方法更有利的回撤特性。这带来了更稳定的训练、更高的夏普比以及优于传统日行学习工具的性能。该方法在高度动态且嘈杂的金融市场中尤为有价值，因为传统DRL常常存在不稳定性和推广性差的问题。

TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse

TQL：通过防止注意力崩溃来扩展变换器的Q函数

Authors: Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01439
Pdf link: https://arxiv.org/pdf/2602.01439
Abstract Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions -- including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask: what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.
中文摘要 尽管大规模驱动了机器学习的重大进展，强化学习（RL）方法仍然主要使用小值函数。天真地扩展价值函数——包括已知高度可扩展的变换器架构——常常导致学习不稳定和性能下降。在本研究中，我们探讨是什么阻止变换器有效扩展价值函数？通过实证分析，我们识别出该尺度中的关键失效模式：注意力评分随着容量增加而崩溃。我们的关键见解是，通过控制注意力评分的熵，我们可以有效防止这种崩溃并稳定训练，从而使使用更大的模型成为可能。为此，我们提出了变换器Q-学习（TQL），这是一种释放变换器在强化学习中价值函数扩展潜力的方法。我们的方法在从最小到最大网络规模扩展时，性能提升可达43%，而之前的方法则存在性能下降的问题。
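
The quantity the abstract proposes to control, the entropy of the attention scores, can be measured for a single attention row as sketched below. This only shows the standard softmax-entropy computation; how TQL actually regulates it, and whether "collapse" corresponds to low or high entropy, is not stated in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(scores):
    """Shannon entropy (in nats) of the attention distribution for one query."""
    probs = softmax(scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: a uniform row and a sharply peaked row over four keys.
uniform_entropy = attention_entropy([0.0, 0.0, 0.0, 0.0])  # log(4), the maximum for 4 keys
peaked_entropy = attention_entropy([50.0, 0.0, 0.0, 0.0])  # close to 0
```

Tracking this value per head during training is one simple way to see whether attention distributions drift toward either extreme as model capacity grows.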

Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs

可证明的无奖励多智能体合作探索

Authors: Idan Barnea, Orin Levy, Yishay Mansour
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01453
Pdf link: https://arxiv.org/pdf/2602.01453
Abstract We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / \epsilon^2)$ agents to obtain an $\epsilon$ approximation of the dynamics (i.e., yields an $\epsilon$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $\rho < H$ phases requires at least $A^{H/\rho}$ agents to achieve constant accuracy. Thus, we show that it is essential to have an order of $H$ learning phases if we limit the number of agents to be polynomial.
中文摘要 我们在无奖励探索环境中研究合作多智能体强化学习，即多个智能体共同探索未知MDP以了解其动态（不观察奖励）。我们专注于表格式的有限视野MDP，并采用分阶段学习框架。在每个学习阶段，多个智能体独立与环境交互。更具体地说，在每个学习阶段，每个代理都会被分配一个策略，执行该策略，并观察由此产生的轨迹。我们的主要目标是刻画学习阶段数量与代理数量之间的权衡，尤其是在学习阶段数量较少时。我们的结果显示，地平线$H$的显著转变。当学习阶段数等于 $H$ 时，我们提出一个计算效率高的算法，仅使用 $\tilde{O}（S^6 H^6 A / \epsilon^2）$ 个代理，以获得 $\epsilon$ 的动态近似（即对任意奖励函数产生 $\epsilon$ 最优策略）。我们用一个下界补充算法，表明任何限制在 $\rho < H$ 相位的算法都需要至少 $A^{H/\rho}$ 个代理才能实现恒定准确率。因此，我们证明如果限制代理数量为多项式，学习阶段的阶数必须达到$H$。
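
To make the lower bound concrete, the sketch below evaluates $A^{H/\rho}$ for a few phase counts. It treats the bound as an exact count and ignores constants, so the numbers only indicate how fast the requirement grows; the values of $A$ and $H$ are chosen for illustration.

```python
def min_agents_lower_bound(num_actions, horizon, phases):
    """Evaluate A^(H / rho), the abstract's lower bound for rho < H phases (constants dropped)."""
    if not 0 < phases < horizon:
        raise ValueError("the bound is stated for 0 < phases < horizon")
    return num_actions ** (horizon / phases)


# Hypothetical problem size: A = 4 actions, horizon H = 12.
agents_1_phase = min_agents_lower_bound(4, 12, 1)   # 4^12 = 16,777,216
agents_3_phases = min_agents_lower_bound(4, 12, 3)  # 4^4 = 256
agents_6_phases = min_agents_lower_bound(4, 12, 6)  # 4^2 = 16
```

With a constant number of phases the requirement is exponential in $H$, which is why the abstract concludes that on the order of $H$ phases are needed if the number of agents must stay polynomial.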

ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

ConPress：从多问情境压力中学习高效推理

Authors: Jie Deng, Shining Liang, Jun Li, Hongzhi Li, Yutao Xie
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01472
Pdf link: https://arxiv.org/pdf/2602.01472
Abstract Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
中文摘要 大型推理模型（LRM）通常通过生成很长的思维链（CoT）轨迹来解决推理密集型任务，导致推理开销较大。我们发现了一种可复现的推理时现象，称为自我压缩：当单个提示中出现多个独立且可回答的问题时，模型会自发地为每个问题生成更短的推理轨迹。这一现象源于生成过程中多问题的情境压力，并在不同模型和基准上一致出现。基于这一观察，我们提出了 ConPress（从情境压力中学习），这是一种轻量级的自监督微调方法。ConPress 构建多问题提示以诱导自我压缩，对模型输出进行采样，并解析和过滤每个问题的轨迹，以获得简洁而正确的推理轨迹。这些轨迹直接用于监督微调，从而在单问题场景中内化压缩推理行为，无需外部教师、手动剪枝或强化学习。仅用 8k 个微调样本，ConPress 就在 MATH500 上将推理 token 用量减少了 59%，在 AIME25 上减少了 33%，同时保持有竞争力的准确率。

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

交替强化学习用于不可验证的LLM后训练中基于评分标准的奖励建模

Authors: Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01511
Pdf link: https://arxiv.org/pdf/2602.01511
Abstract Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
中文摘要 标准奖励模型通常预测的标量分数未能捕捉到不可验证领域（如创意写作或开放式指令跟随）中反应质量的多面性。为解决这一限制，我们提出了Rubric-ARM框架，该框架通过强化偏好反馈共同优化评分标准生成器和评审。与依赖静态评分标准或分离训练流程的现有方法不同，我们的方法将评分标准生成视为一种潜伏动作，以最大化判断准确性。我们引入了交替优化策略，以减轻同时更新的非平稳性，并通过理论分析展示了该计划如何降低训练过程中的梯度方差。大量实验表明，Rubric-ARM在多个基准测试中实现了基线的顶尖性能，并显著提升了离线和在线强化学习环境中的下游策略对齐性。

A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

大型语言模型推理中可验证奖励强化学习的相对预算理论

Authors: Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura, Rei Higuchi, Taiji Suzuki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01523
Pdf link: https://arxiv.org/pdf/2602.01523
Abstract Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a relative-budget theory explaining this variation through a single quantity called relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $\xi$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the deficient regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the balanced regime ($\xi=\Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the ample regime ($\xi \to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.
中文摘要 强化学习（RL）是提升大型语言模型推理能力的主流范式，但其有效性因任务和计算预算而异。我们提出了一个相对预算理论，通过一个称为相对预算的单一量 $\xi := H/\mathbb{E}[T]$ 来解释这种差异，其中 $H$ 是生成长度上限（token 预算），$T$ 表示在基础策略下得到第一个正确解所需的 token 数。我们证明 $\xi$ 通过控制奖励方差和有信息量轨迹出现的可能性来决定样本效率。我们的分析揭示了三种区间：在不足区间（$\xi \to 0$），有信息量的轨迹十分罕见，样本复杂度急剧上升；在平衡区间（$\xi=\Theta(1)$），有信息量的轨迹以不可忽略的概率出现，RL 的样本效率最高；在充裕区间（$\xi \to \infty$），学习保持稳定，但每次迭代的边际收益递减。我们还为在线强化学习提供了有限样本保证，刻画这些区间内的学习进展。具体来说，在理想化分布假设下的案例研究中，我们表明相对预算随迭代线性增长。我们的实证结果在真实场景中证实了这些预测，并确定了预算 $\xi \in [1.5, 2.0]$，该范围使学习效率最大化，并与推理性能的峰值相吻合。
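As a rough illustration of the relative budget $\xi = H/\mathbb{E}[T]$ defined in the abstract above, the sketch below estimates $\mathbb{E}[T]$ with a sample mean and labels the regime. The function names, the sample values, and the `low`/`high` cut-offs are hypothetical and not taken from the paper, which characterizes the regimes asymptotically. The sketch also assumes every sample eventually reaches a correct solution, so $T$ is observed.

```python
from statistics import mean


def relative_budget(horizon, tokens_to_first_correct):
    """Estimate xi = H / E[T], replacing E[T] with the sample mean."""
    return horizon / mean(tokens_to_first_correct)


def regime(xi, low=0.5, high=4.0):
    """Label a relative budget; `low` and `high` are illustrative cut-offs only."""
    if xi < low:
        return "deficient"
    if xi > high:
        return "ample"
    return "balanced"


# Hypothetical token counts until the first correct solution under a base policy.
samples = [1800, 2200, 2000]
xi = relative_budget(4000, samples)
label = regime(xi)
print(xi, label)
```

With these made-up numbers the estimate lands at the upper end of the $[1.5, 2.0]$ range the abstract reports as most efficient.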

Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning

使偏见非预测性：通过强化学习培养稳健的LLM评判者

Authors: Qian Wang, Xuandong Zhao, Zirui Zhang, Zhanzhi Lou, Nuo Chen, Dawn Song, Bingsheng He
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01528
Pdf link: https://arxiv.org/pdf/2602.01528
Abstract Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases -- often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues predictive. To address this gap, we propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. Code and data are available at this https URL.
中文摘要 大型语言模型（LLMs）越来越多地充当自动评判者，但它们仍然容易受到认知偏差的影响：当面对共识主张或诉诸权威等虚假的提示级线索时，常常会改变推理。现有的基于提示或监督微调的缓解方法难以泛化，因为它们只改变表面行为，却不改变使偏见线索具有预测性的优化目标。为弥补这一空白，我们提出了认知独立训练（EIT），这是一种基于如下关键原则的强化学习框架：要学会独立性，必须使偏见线索无法预测奖励。EIT 通过平衡冲突策略实现这一点，即偏见信号支持正确答案和错误答案的可能性相等，并结合一种惩罚跟随偏见、但不奖励与偏见一致的奖励设计。在 Qwen3-4B 上的实验表明，EIT 在对抗性偏见下同时提高了准确性和鲁棒性，并在偏见与事实一致时保持性能。值得注意的是，仅在从众偏见上训练的模型能够泛化到权威和干扰等未见过的偏见类型，表明 EIT 诱导的是可迁移的认知独立性，而非针对特定偏见的启发式。代码和数据可在此 https URL 获取。
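One way to picture the balanced conflict strategy and reward design described in the abstract above is the sketch below. It is an interpretation for illustration only: `attach_bias_cue`, `reward`, the reward values, and the `penalty` parameter are hypothetical and are not the paper's actual implementation.

```python
import random


def attach_bias_cue(correct, options, rng):
    """Pick the option a spurious cue endorses: the correct one with probability 1/2."""
    if rng.random() < 0.5:
        return correct
    return rng.choice([o for o in options if o != correct])


def reward(answer, correct, cue, penalty=0.5):
    """Hypothetical reward: correctness pays, following a misleading cue costs extra."""
    if answer == correct:
        return 1.0       # same reward whether or not the cue agreed
    if answer == cue:
        return -penalty  # wrong answer that followed the cue
    return 0.0           # wrong answer unrelated to the cue


rng = random.Random(0)
options = ["A", "B", "C", "D"]
cues = [attach_bias_cue("A", options, rng) for _ in range(10_000)]
support_rate = sum(c == "A" for c in cues) / len(cues)
print(round(support_rate, 3))
```

Because the cue endorses the correct answer about half the time, agreeing with it carries no information about reward, which is the property the abstract calls making bias non-predictive.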

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

MAGIC：面向稳健 LLM 安全的协同进化攻防对抗博弈

Authors: Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.01539
Pdf link: https://arxiv.org/pdf/2602.01539
Abstract Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at this https URL.
中文摘要 确保稳健的安全对齐对大型语言模型（LLM）至关重要，但现有防御往往落后于不断演变的对抗性攻击，原因是其依赖静态、预先收集的数据分布。本文介绍了 MAGIC，一种新型多轮多智能体强化学习框架，将大型语言模型的安全对齐表述为对抗性非对称博弈。具体来说，攻击者智能体学习迭代地将原始查询重写为欺骗性提示，而防御者智能体则同时优化其策略以识别和拒绝此类输入。这一动态过程触发了协同进化，攻击者不断变化的策略持续揭示长尾漏洞，推动防御者泛化到未见过的攻击模式。值得注意的是，我们观察到，具备初始推理能力的攻击者通过迭代强化学习训练进化出新颖的、此前未见的组合策略，凸显了我们方法的巨大潜力。理论上，我们对更稳健的博弈均衡给出了见解，并推导出安全保证。大量实验验证了我们框架的有效性，展示了更优的防御成功率，同时不损害模型的有用性。我们的代码可在此 https URL 访问。

Toward Cognitive Supersensing in Multimodal Large Language Model

迈向多模态大型语言模型中的认知超感知

Authors: Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu, Xinzhuo Li, Zhengyuan Li, Jingyuan Zhu, Yunhan Zhong, Fangzhou Lan, Jianguo Cao, James M. Rehg, Heng Ji, Ismini Lourentzou, Xu Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01541
Pdf link: https://arxiv.org/pdf/2602.01541
Abstract Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.
中文摘要 多模态大型语言模型（MLLM）在开放词汇感知任务中取得了显著成功，但其解决复杂认知问题的能力仍然有限，尤其是在视觉细节抽象且需要视觉记忆时。当前的方法主要在文本空间中扩展思维链（Chain-of-Thought，CoT）推理，即使单靠语言不足以实现清晰且结构化的推理，并且大多忽视了类似于人类视觉空间画板和视觉意象的视觉推理机制。为弥补这一不足，我们引入了认知超感知，这是一种新型训练范式：通过集成潜在视觉意象预测（LVIP）头，联合学习视觉认知潜在嵌入序列并将其与答案对齐，从而形成基于视觉的内部推理链，赋予 MLLM 类人的视觉意象能力。我们进一步引入强化学习阶段，基于这一有依据的视觉潜在表示优化文本推理路径。为了评估 MLLM 的认知能力，我们提出了 CogSense-Bench，这是一个全面的视觉问答（VQA）基准，评估五个认知维度。大量实验表明，使用认知超感知训练的 MLLM 在 CogSense-Bench 上显著优于最先进的基线，并在域外数学和科学 VQA 基准上展现出更优的泛化能力，表明内部视觉意象可能是弥合感知识别与认知理解之间差距的关键。我们将开源 CogSense-Bench 和我们的模型权重。

AdNanny: One Reasoning LLM for All Offline Ads Recommendation Tasks

AdNanny：适用于所有离线广告推荐任务的单一推理大型语言模型

Authors: Nan Hu, Han Li, Jimeng Sun, Lu Wang, Fangkai Yang, Bo Qiao, Pu Zhao, David Dai, Mengyu Liu, Yuefeng Zhan, Jianjin Zhang, Weihao Han, Allen Sun, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Denvy Deng, Feng Sun, Qi Zhang
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.01563
Pdf link: https://arxiv.org/pdf/2602.01563
Abstract Large Language Models (LLMs) have shown strong capabilities in Natural Language Understanding and Generation, but deploying them directly in online advertising systems is often impractical due to strict millisecond-level latency constraints. This has motivated the use of LLMs offline to improve retrieval, ranking, and recommendation models. Existing solutions typically fine-tune separate LLMs for individual tasks such as query-ad relevance labeling, keyword-based query generation, and user profiling. This results in redundant models, high maintenance cost, and limited performance gains despite substantial overlap in domain knowledge and reasoning patterns. We introduce AdNanny, a unified reasoning-centric LLM that serves as a shared backbone for offline advertising tasks. AdNanny is obtained by fine-tuning a public 671B-parameter DeepSeek-R1 checkpoint using a scalable training system that supports hybrid dense-MoE parallelism. We construct reasoning-augmented corpora that pair structured supervision with step-by-step natural language explanations. A multi-task supervised fine-tuning stage with adaptive reweighting enables AdNanny to handle diverse labeling and generation tasks in a consistent reasoning format. This is followed by reinforcement learning using downstream advertising metrics to align model behavior with online retrieval and ranking objectives. AdNanny is deployed in production within Bing Ads, where it significantly reduces manual labeling effort and improves accuracy across multiple offline tasks. By consolidating many task-specific models into a single reasoning-centric foundation model, AdNanny provides a scalable and cost-effective solution for large-scale advertising systems.
中文摘要 大型语言模型（LLMs）在自然语言理解和生成方面表现出强大的能力，但由于严格的毫秒级延迟限制，直接部署在在线广告系统中往往不切实际。这促使人们离线使用大型语言模型来改进检索、排名和推荐模型。现有解决方案通常会针对单个任务进行微调，如查询广告相关性标签、基于关键词的查询生成和用户画像。这导致模型冗余、维护成本高昂，尽管领域知识和推理模式有大量重叠，性能提升有限。我们介绍AdNanny，一个统一的以推理为中心的大型语言模型，作为离线广告任务的共享骨干。AdNanny 通过使用支持混合密集-MoE 并行的可扩展训练系统微调一个公共 671B 参数的 DeepSeek-R1 检查点获得。我们构建了推理增强语料库，将结构化督导与逐步自然语言解释相结合。多任务监督微调阶段配合自适应重权，使 AdNanny 能够以一致的推理格式处理多样化的标签和生成任务。随后是利用下游广告指标进行强化学习，使模型行为与在线检索和排名目标保持一致。AdNanny已部署在必应广告的生产环境中，显著减少了人工标签的工作量，并提高了多个离线任务的准确性。通过将多个任务特定模型整合到一个以推理为中心的基础模型中，AdNanny 为大型广告系统提供了一个可扩展且经济高效的解决方案。

Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages

了解你的步数：通过步感知优势更快更好地对齐流匹配模型

Authors: Zhixiong Yue, Zixuan Ni, Feiyang Ye, Jinshan Zhang, Sheng Shen, Zhenpeng Mi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.01591
Pdf link: https://arxiv.org/pdf/2602.01591
Abstract Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators that are well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one-step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism is combined with GRPO to avoid the need for a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be made available to facilitate further research.
中文摘要 流匹配模型的最新进展，尤其是与强化学习（RL）相结合的进展，显著提升了少步文本到图像生成器与人类偏好的对齐程度。然而，现有面向流匹配模型的基于强化学习的方法通常依赖大量去噪步骤，同时受困于稀疏且不精确的奖励信号，常导致对齐效果欠佳。为解决这些局限性，我们提出了结合群相对策略优化的温度退火少步采样（TAFS GRPO），这是一种新颖的框架，用于将流匹配文本到图像模型训练为高效且与人类偏好良好对齐的少步生成器。我们的方法迭代地将自适应时间噪声注入一步采样的结果。通过反复对模型的采样输出进行退火，它在采样过程中引入随机性，同时保持每张生成图像的语义完整性。此外，其步感知优势整合机制与 GRPO 相结合，避免了对可微奖励函数的需求，并为稳定的策略优化提供密集且针对各步的奖励。大量实验表明，TAFS GRPO 在少步文本到图像生成中取得了优异性能，并显著提升了生成图像与人类偏好的对齐程度。该工作的代码和模型将公开，以便进一步研究。

The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR

多票假说：随机稀疏子网络足以完成 RLVR

Authors: Israel Adewuyi, Solomon Okibe, Vladmir Ivanov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01599
Pdf link: https://arxiv.org/pdf/2602.01599
Abstract The Lottery Ticket Hypothesis demonstrated that sparse subnetworks can match full-model performance, suggesting parameter redundancy. Meanwhile, in Reinforcement Learning with Verifiable Rewards (RLVR), recent work has shown that updates concentrate on a sparse subset of parameters, which further lends evidence to this underlying redundancy. We study the simplest possible way to exploit this redundancy: training only a randomly selected subset of parameters at extreme sparsities. Empirically, we find that training just 1% of parameters matches or exceeds full-parameter RLVR finetuning across 3 models and 2 task domains. Moreover, different random masks show minimal overlap ($\leq 0.005$ Jaccard similarity) and yet all succeed, suggesting pretrained models contain many viable sparse subnetworks rather than one privileged set. We term this the Multiple Ticket Hypothesis. We explain this phenomenon through the implicit per-step KL constraint in RLVR, which restricts updates to a low-dimensional subspace, enabling arbitrary sparse masks to succeed.
中文摘要 彩票假说表明稀疏子网络可以达到完整模型的性能，暗示存在参数冗余。与此同时，在可验证奖励强化学习（RLVR）中，最新研究显示更新集中在稀疏的参数子集上，这进一步佐证了这种潜在的冗余。我们研究利用这种冗余的最简单方法：在极端稀疏度下只训练随机选取的参数子集。实验发现，仅训练 1% 的参数就能在 3 个模型和 2 个任务领域上达到或超过全参数 RLVR 微调的效果。此外，不同的随机掩码之间重叠极少（Jaccard 相似度 $\leq 0.005$），但都能成功，表明预训练模型包含许多可行的稀疏子网络，而非单一的特权集合。我们称之为多票假说。我们通过 RLVR 中隐含的逐步 KL 约束来解释这一现象：该约束将更新限制在低维子空间内，使任意稀疏掩码都能成功。
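To give a feel for why independently drawn 1% masks barely overlap, the sketch below draws two random masks and compares their Jaccard similarity with the value expected for independent masks of density $p$, namely $p^2/(2p - p^2) = p/(2-p)$. The parameter count, seeds, and helper names are hypothetical, and uniform sampling over a flat index range is an assumption rather than the paper's masking procedure.

```python
import random


def random_mask(n_params, density, seed):
    """Return the set of trainable parameter indices for one random mask."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_params), int(n_params * density)))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two index sets."""
    return len(a & b) / len(a | b)


def expected_jaccard(density):
    """Approximate expected Jaccard of two independent masks: p / (2 - p)."""
    return density / (2.0 - density)


n_params, density = 1_000_000, 0.01
mask_a = random_mask(n_params, density, seed=0)
mask_b = random_mask(n_params, density, seed=1)
observed = jaccard(mask_a, mask_b)
print(observed, expected_jaccard(density))
```

At 1% density the expected value is close to 0.005, the same order as the overlap quoted in the abstract, so near-disjoint masks are what chance alone would produce.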

Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

面向带可验证奖励的在线强化学习的自适应 rollout 分配

Authors: Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01601
Pdf link: https://arxiv.org/pdf/2602.01601
Abstract Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, the method uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that the method consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks. Our code will be available at this https URL.
中文摘要 采样效率是带可验证奖励的强化学习中的关键瓶颈。现有的基于组的策略优化方法（如 GRPO）为所有训练提示分配固定数量的 rollout。这种均匀分配隐含地将所有提示视为同等有信息量，可能导致计算预算使用低效并阻碍训练进展。我们提出了一种方差感知的预测式分配策略，将给定的 rollout 预算分配给当前批次中的提示，以最小化策略更新的期望梯度方差。在每次迭代中，该方法使用轻量级高斯过程模型，基于近期的 rollout 预测每个提示的成功概率。这些概率预测被转化为方差估计，然后输入一个凸优化问题，以确定在硬性计算预算约束下的最优 rollout 分配。实验结果表明，该方法在多个基准上持续提升采样效率，并取得比均匀或启发式分配策略更高的性能。我们的代码将在此 https URL 上发布。
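The step from predicted success probabilities to a rollout allocation can be sketched with a deliberately simplified stand-in. The sketch assumes a Bernoulli reward variance $v_i = p_i(1-p_i)$ and the surrogate objective $\sum_i v_i / n_i$ subject to $\sum_i n_i = B$, whose continuous solution is $n_i \propto \sqrt{v_i}$. The paper's actual objective, its Gaussian process predictor, and integer rounding are not reproduced here; all names and numbers are hypothetical.

```python
import math


def bernoulli_variance(p):
    """Variance of a 0/1 verifiable reward with success probability p."""
    return p * (1.0 - p)


def allocate_rollouts(success_probs, budget):
    """Continuous allocation minimizing sum_i v_i / n_i with sum_i n_i = budget.

    Setting the Lagrangian gradient to zero gives n_i proportional to sqrt(v_i).
    """
    weights = [math.sqrt(bernoulli_variance(p)) for p in success_probs]
    total = sum(weights)
    if total == 0.0:
        # Every prompt is predicted to be deterministic: fall back to uniform.
        return [budget / len(success_probs)] * len(success_probs)
    return [budget * w / total for w in weights]


# Hypothetical predicted per-prompt success probabilities.
probs = [0.5, 0.1, 0.9, 0.0]
alloc = allocate_rollouts(probs, budget=64)
print([round(n, 2) for n in alloc])
```

Prompts with success probability near 0.5 receive the largest share, while a prompt predicted to always fail receives none, in contrast to the uniform allocation the abstract attributes to GRPO.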

Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching

通过一步流匹配提升最大熵强化学习

Authors: Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01606
Pdf link: https://arxiv.org/pdf/2602.01606
Abstract Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost. Code is available at this https URL.
中文摘要 扩散策略表达能力强，但推理延迟较高。流匹配（FM）支持一步生成，但将其集成到最大熵强化学习（MaxEnt RL）中具有挑战性：最优策略是难以处理的基于能量的分布，而平衡探索与利用所需的高效对数似然估计存在严重的离散化偏差。我们提出了 Flow-based Log-likelihood-Aware Maximum Entropy RL（FLAME），这是一个有原则的框架，用于解决这些挑战。首先，我们推导出一个 Q 重加权 FM 目标，通过重要性重加权绕过配分函数估计。其次，我们设计了一个解耦熵估计器，严格地校正偏差，从而实现高效探索，使策略更接近最优 MaxEnt 策略。第三，我们整合了 MeanFlow 公式，实现表达能力强且高效的一步控制。MuJoCo 上的实验结果表明，FLAME 优于高斯基线，并以显著更低的推理成本达到多步扩散策略的水平。代码可在此 https URL 访问。

SUSD: Structured Unsupervised Skill Discovery through State Factorization

SUSD：通过状态分解进行结构化无监督技能发现

Authors: Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01619
Pdf link: https://arxiv.org/pdf/2602.01619
Abstract Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: this https URL.
中文摘要 无监督技能发现（USD）旨在自主学习多样化的技能，而不依赖外部奖励。最常见的USD方法之一是最大化技能潜在变量和状态之间的互信息（MI）。然而，基于MI的方法往往偏好简单静态技能，因其不变性，限制了动态且与任务相关的行为的发现。距离最大化技能发现（DSD）通过利用状态空间距离促进更具动态性的技能，但仍未能鼓励全面技能组，以涉及环境中所有可控因素或实体。在本研究中，我们介绍了SUSD，这是一种新颖的框架，通过将状态空间分解为独立的组件（例如对象或可控实体）来利用环境的组成结构。SUSD将不同的技能变量分配给不同的因素，从而实现对技能发现过程更细致的控制。动态模型还能跨因素追踪学习，自适应地引导智能体关注未被充分探索的因素。这种结构化方法不仅促进了更丰富多样技能的发现，还产生了分解技能表示，使得对单个实体实现细粒度且解开的控制，从而通过层级强化学习（HRL）高效训练组合下游任务。我们在三种环境中的实验结果，因子范围为1到10，表明我们的方法能够在无监督的情况下发现多样且复杂的技能，显著优于现有的非监督技能发现方法在因数分解和复杂环境中。代码公开可访问：此 https URL。

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

PISCES：通过最优传输对齐奖励实现无标注的文本到视频后训练

Authors: Minh-Quan Le, Gaurav Mittal, Cheng Zhao, David Gu, Dimitris Samaras, Mei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.01624
Pdf link: https://arxiv.org/pdf/2602.01624
Abstract Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
中文摘要 文本到视频（T2V）生成旨在合成具有高视觉质量和时间一致性、且在语义上与输入文本对齐的视频。基于奖励的后训练已成为提升生成视频质量和语义对齐的有前景方向。然而，近期方法要么依赖大规模人类偏好标注，要么在预训练视觉语言模型的未对齐嵌入上运作，导致可扩展性有限或监督欠佳。我们提出 PISCES，一种无标注的后训练算法，通过一个新颖的双重最优传输（OT）对齐奖励模块解决这些限制。为了使奖励信号与人类判断保持一致，PISCES 利用 OT 在分布层面和离散 token 层面桥接文本和视频嵌入，使奖励监督实现两个目标：（i）分布式 OT 对齐质量奖励，捕捉整体视觉质量和时间连贯性；以及（ii）离散 token 级 OT 对齐语义奖励，强化文本与视频 token 之间的语义和时空对应。据我们所知，PISCES 是首个从 OT 视角改进生成式后训练中无标注奖励监督的方法。短视频和长视频生成的实验显示，PISCES 在 VBench 的质量评分和语义评分上均优于基于标注和无标注的方法，人类偏好研究进一步验证了其有效性。我们表明，双重 OT 对齐奖励模块兼容多种优化范式，包括直接反向传播和强化学习微调。

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

贡献感知令牌压缩，通过强化学习实现高效的视频理解

Authors: Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01649
Pdf link: https://arxiv.org/pdf/2602.01649
Abstract Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms have been proposed that prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Code will be released.
中文摘要 视频大型语言模型在视频理解任务中展现出了卓越的能力。然而，视频 token 的冗余在推理过程中带来了显著的计算开销，限制了其实际部署。已有许多压缩算法优先保留注意力得分最高的特征，以最大限度减少对注意力计算的扰动。然而，注意力得分与其对正确答案的实际贡献之间的相关性仍不明确。为解决上述限制，我们提出了一种新颖的面向视频理解的贡献感知 token 压缩算法（Contribution-aware token Compression for VIDeo understanding，CaCoVID），该算法基于 token 对正确预测的贡献，显式地优化 token 选择策略。首先，我们引入了一个基于强化学习的框架，优化一个策略网络，以选择对正确预测贡献最大的视频 token 组合。这一范式将重点从被动的 token 保留转向主动发现最优的压缩 token 组合。其次，我们提出了一种带在线组合空间采样的组合策略优化算法，显著缩小了视频 token 组合的探索空间，加快了策略优化的收敛速度。在多种视频理解基准上的大量实验证明了 CaCoVID 的有效性。代码将会公布。

FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

FlowSteer：通过端到端强化学习实现交互式代理工作流编排

Authors: Mingda Zhang, Haoran Luo, Tiesunlong Shen, Qika Lin, Xiaoying Tang, Rui Mao, Erik Cambria
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01664
Pdf link: https://arxiv.org/pdf/2602.01664
Abstract In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.
中文摘要 近年来，多种强大的智能体工作流已被应用于解决各种人类问题。然而，现有的工作流编排仍面临关键挑战，包括人工成本高、依赖特定算子/大型语言模型（LLMs）以及奖励信号稀疏。为应对这些挑战，我们提出了 FlowSteer，一个端到端强化学习框架，以轻量级策略模型作为智能体，并配以可执行的画布环境，通过多轮交互实现工作流编排的自动化。在此过程中，策略模型分析执行状态并选择编辑动作，而画布执行算子并返回反馈以进行迭代优化。此外，FlowSteer 还提供了一个即插即用的框架，支持多样化的算子库和可互换的大型语言模型后端。为了有效训练这种交互范式，我们提出了画布工作流相对策略优化（CWRPO），该方法引入带条件释放的多样性约束奖励，以稳定学习并抑制捷径行为。在十二个数据集上的实验结果显示，FlowSteer 在各项任务中显著优于基线。

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

TABX：一款高通量沙盒战斗模拟器，用于多智能体强化学习

Authors: Hayeong Lee, JunHyeok Oh, Byung-Jun Lee
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01665
Pdf link: https://arxiv.org/pdf/2602.01665
Abstract The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: this https URL.
中文摘要 环境设计在塑造协作式多智能体强化学习（MARL）算法的开发和评估中起着关键作用。虽然现有基准突出了关键挑战，但它们往往缺乏设计定制评估场景所需的模块化。我们介绍JAX中的全加速战斗模拟器（TABX），这是一个为可重构多代理任务设计的高通量沙盒。TABX对环境参数提供细致控制，允许系统性研究涌现代理行为和算法权衡，涵盖多样任务复杂度。利用 JAX 实现 GPU 上的硬件加速执行，TABX 实现大规模并行化，显著降低计算开销。通过提供快速、可扩展且易于定制的框架，TABX促进了对复杂结构化领域中MARL代理的研究，并为未来研究奠定了可扩展的基础。我们的代码可在以下 https URL 获取。

Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

通过自适应信息控制扩展搜索增强LLM推理

Authors: Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C. Kerce, Faramarz Fekri
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01672
Pdf link: https://arxiv.org/pdf/2602.01672
Abstract Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.
中文摘要 搜索增强推理代理将多步推理与外部信息检索交错进行，但不受控检索常导致证据重复、上下文饱和和学习不稳定。现有方法依赖基于结果的强化学习（RL），该方法为信息获取的调控提供了有限的指导。我们提出了DeepControl，这是一种基于信息效用形式概念的自适应信息控制框架，衡量在给定推理状态下检索证据的边际价值。基于该工具，我们引入检索继续和粒度控制机制，选择性地调节何时继续检索、何时停止检索，以及扩展信息量。退火控制策略使智能体能够在训练过程中内化有效的信息获取行为。跨越七个基准测试的广泛实验表明，我们的方法始终优于强基线。特别是，我们的方法在Qwen2.5-7B和Qwen2.5-3B上分别取得了9.4%和8.6%的平均性能提升，优于强而有力的基于结果的强化学习基线，并且在无显式信息控制的情况下，持续优于无检索和基于检索的推理方法。这些结果凸显了自适应信息控制在将搜索增强推理代理扩展到复杂现实信息环境中的重要性。

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

TRIP-Bench：现实场景中长视野交互代理的基准

Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01675
Pdf link: https://arxiv.org/pdf/2602.01675
Abstract As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.
中文摘要 随着基于LLM的智能体被部署到日益复杂的现实环境中，现有基准测试未能充分体现诸如执行全局约束、协调多工具推理，以及在长程多轮交互中适应不断变化的用户行为等关键挑战。为弥合这一差距，我们引入了 TRIP-Bench，这是一个基于真实旅行规划场景的长程基准。TRIP-Bench 利用真实世界数据，提供 18 种精选工具和 40+ 项旅行需求，并支持自动评估。它包含不同难度的划分；困难划分强调冗长且模糊的交互、风格转变、可行性变化以及迭代式版本修订。对话最多可达15个用户轮次，可能涉及150+次工具调用，上下文可能超过20万个token。实验显示，即使是先进模型在简单划分上的成功率也至多为50%，而在困难子集上则降至10%以下。我们还提出了 GTPO，一种在线多轮强化学习方法，采用专门的奖励归一化和奖励差分。应用于Qwen2.5-32B-Instruct时，GTPO提升了约束满足度和交互鲁棒性，在我们的评估中优于Gemini-3-Pro。我们期望TRIP-Bench能推动实用的长程交互智能体的发展，并期望GTPO为稳健的长程训练提供有效的在线强化学习方案。

Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment

语义感知的 Wasserstein 策略正则化用于大型语言模型对齐

Authors: Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01685
Pdf link: https://arxiv.org/pdf/2602.01685
Abstract Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization of the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at this https URL.
中文摘要 大型语言模型（LLMs）通常通过基于人类反馈的强化学习（RLHF）与人类偏好对齐。在这种方法中，LLM策略通常通过奖励最大化进行优化，并以相对于参考策略的Kullback-Leibler（KL）散度作为正则化。然而，KL及其$f$-散度变体仅比较相同索引处的词元概率，未能捕捉语义相似性。我们提出了Wasserstein策略正则化（WPR），这是一种面向RLHF框架的语义感知正则化方法，基于熵正则化Wasserstein距离，并纳入了词元空间的几何结构。该距离的对偶形式将正则化表达为通过最优对偶变量施加于奖励的惩罚项，从而得到一个与标准强化学习算法兼容的可处理目标。在实验中，我们的方法优于基于KL和$f$-散度的基线，展示了语义感知的策略距离对对齐的益处。我们的代码可在此 https URL 获取。
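
The contrast this abstract draws between KL and a geometry-aware distance can be illustrated with a toy sketch. It is not the authors' WPR objective (which works through the dual formulation); it only compares KL with the transport cost of an entropy-regularized plan computed by plain Sinkhorn iterations. The three-token vocabulary, the cost matrix, and the value of `eps` are assumptions made up for illustration.

```python
import math

# Hypothetical 3-token vocabulary with an assumed semantic cost matrix:
# "good" and "great" are close (0.1); both are far from "bad" (1.0).
TOKENS = ["good", "great", "bad"]
COST = [[0.0, 0.1, 1.0],
        [0.1, 0.0, 1.0],
        [1.0, 1.0, 0.0]]


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def sinkhorn_cost(p, q, cost, eps=0.1, iters=500):
    """Transport cost <P, C> of the entropy-regularized plan between p and q."""
    n, m = len(p), len(q)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


reference = [0.98, 0.01, 0.01]   # reference policy prefers "good"
to_synonym = [0.01, 0.98, 0.01]  # policy drifts to the near-synonym "great"
to_antonym = [0.01, 0.01, 0.98]  # policy drifts to the opposite, "bad"

kl_syn, kl_ant = kl(to_synonym, reference), kl(to_antonym, reference)
w_syn = sinkhorn_cost(to_synonym, reference, COST)
w_ant = sinkhorn_cost(to_antonym, reference, COST)
```

In this toy setup `kl_syn` and `kl_ant` coincide, because KL only sees probabilities index by index, while `w_syn` comes out much smaller than `w_ant`, because the transport cost accounts for how far apart the tokens are under the assumed cost matrix.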

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

训练后恢复探索：大型推理模型的潜在探索解码

Authors: Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01698
Pdf link: https://arxiv.org/pdf/2602.01698
Abstract Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: this https URL.
中文摘要 大型推理模型（LRM）最近通过强化学习（RL）后训练实现了强大的数学和代码推理性能。然而，我们表明现代训练后的推理会引发一次意外的探索崩溃：基于温度的采样不再提升pass@$n$的准确性。经验上，后训练LRM的末层后部熵显著降低，而中间层的熵则相对较高。基于这种熵不对称性，我们提出了潜在探索解码（LED）这一深度条件解码策略。LED通过累积和聚合中间后部，并选择熵最大的深度构型作为探索候选。无需额外训练或参数，LED在多个推理基准和模型中持续提升pass@1和pass@16准确率0.61个百分点和1.03个百分点。项目页面：这个 https URL。
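
The aggregation step described in this abstract can be sketched roughly as follows. The abstract only says that intermediate posteriors are aggregated by cumulative sum and that the maximal-entropy depth configuration is selected; the direction of the sum (from the final layer backwards), the normalization by depth, and the function name `led_candidate` are all assumptions of this sketch rather than details from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def led_candidate(layer_logits):
    """Pick the depth whose cumulative posterior has maximal entropy.

    layer_logits: per-layer next-token logits, ordered shallow to deep.
    Returns (depth, distribution), where depth counts how many of the
    deepest layers were averaged (assumed aggregation scheme).
    """
    posteriors = [softmax(logits) for logits in layer_logits]
    running = [0.0] * len(posteriors[0])
    best = None
    for depth, posterior in enumerate(reversed(posteriors), start=1):
        running = [r + p for r, p in zip(running, posterior)]  # cumulative sum
        mixed = [r / depth for r in running]                   # renormalize
        score = entropy(mixed)
        if best is None or score > best[0]:
            best = (score, depth, mixed)
    return best[1], best[2]


# Toy input: a near-uniform shallow layer, a softer middle layer,
# and a sharply peaked final layer (the low-entropy case the abstract describes).
toy_logits = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [8.0, 0.0, 0.0]]
chosen_depth, candidate = led_candidate(toy_logits)
final_entropy = entropy(softmax(toy_logits[-1]))
```

In this toy case the chosen candidate mixes in the higher-entropy intermediate layers, so sampling from `candidate` is more exploratory than sampling from the final layer alone.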

Mitigating loss of control in advanced AI systems through instrumental goal trajectories

通过工具性目标轨迹减轻先进人工智能系统中失控的问题

Authors: Willem Fourie
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.01699
Pdf link: https://arxiv.org/pdf/2602.01699
Abstract Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking capability in advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources can be obtained through three organisational pathways. We label these pathways the procurement, governance and finance instrumental goal trajectories (IGTs). Each IGT produces a trail of organisational artefacts that can be monitored and used as intervention points when a system's capabilities or behaviour exceed acceptable thresholds. In this way, IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented, shifting attention from model properties alone to the organisational systems that enable them.
中文摘要 人工智能实验室和大学的研究人员担心，高能力的人工智能（AI）系统可能通过追求工具性目标而削弱人类的控制。现有的缓解措施大多仍以技术和系统为中心：在先进系统中跟踪能力，通过从人类反馈中强化学习等方法塑造行为，以及设计系统使其可修正和可中断。在这里，我们制定工具性目标轨迹，以将这些选项扩展到模型之外。能力的提升通常依赖于获得额外的技术资源，如计算、存储、数据及相关服务，而这又需要资金资源的使用。在组织中，这些资源可以通过三种组织途径获得。我们将这些路径称为采购、治理和融资工具性目标轨迹（IGTs）。每个 IGT 都会产生一条组织工件的痕迹，可以被监控，并在系统能力或行为超过可接受阈值时作为干预点使用。通过这种方式，IGT为定义能力水平和拓宽可修正性和可中断性实现方式提供了具体途径，将关注点从仅仅关注模型属性转向支持它们的组织系统。

Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner

超越模式诱发：通过潜在扩散推理器实现多样性保持强化学习

Authors: Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01705
Pdf link: https://arxiv.org/pdf/2602.01705
Abstract Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, while a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.
中文摘要 最新的强化学习（RL）方法通过优化离散思维链（CoT）生成来提升LLM推理能力;然而，随着离散强化学习中模式诱发行为导致策略熵减小，代币空间中的探索常常会遭遇多样性崩溃。为缓解这一问题，我们提出了带有强化学习的潜在扩散推理（LaDi-RL）框架，该框架直接在连续潜在空间中进行探索，潜在变量编码语义层推理轨迹。通过通过引导扩散来建模探索，多步去噪能够分布随机性，并保持多种共存的解模态，而无需相互抑制。此外，通过将潜空间探索与文本空间生成解耦，我们表明基于潜在扩散的优化比单纯文本空间策略优化更有效，而补充文本策略与潜在探索结合时则带来额外收益。代码生成和数学推理基准测试的实验显示，在pass@1和pass@k方面相较离散强化学习基线持续提升，代码生成的绝对pass@1提升为+9.4%，数学推理提升+5.7%，凸显基于扩散的潜在强化学习作为离散代币级强化学习的原则性替代方案。

Uncertainty-Aware Non-Prehensile Manipulation with Mobile Manipulators under Object-Induced Occlusion

物体诱导遮挡下基于移动操作器的不确定性感知非抓握操作

Authors: Jiwoo Hwang, Taegeun Yang, Jeil Jeong, Minsung Yoon, Sung-Eui Yoon
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.01731
Pdf link: https://arxiv.org/pdf/2602.01731
Abstract Non-prehensile manipulation using onboard sensing presents a fundamental challenge: the manipulated object occludes the sensor's field of view, creating occluded regions that can lead to collisions. We propose CURA-PPO, a reinforcement learning framework that addresses this challenge by explicitly modeling uncertainty under partial observability. By predicting collision possibility as a distribution, we extract both risk and uncertainty to guide the robot's actions. The uncertainty term encourages active perception, enabling simultaneous manipulation and information gathering to resolve occlusions. When combined with confidence maps that capture observation reliability, our approach enables safe navigation despite severe sensor occlusion. Extensive experiments across varying object sizes and obstacle configurations demonstrate that CURA-PPO achieves up to 3X higher success rates than the baselines, with learned behaviors that handle occlusions. Our method provides a practical solution for autonomous manipulation in cluttered environments using only onboard sensing.
中文摘要 利用机载感知进行非抓握操作面临一个根本挑战：被操作的物体会遮挡传感器的视野，形成可能导致碰撞的遮挡区域。我们提出了CURA-PPO，一种强化学习框架，通过显式建模部分可观测性下的不确定性来应对这一挑战。通过将碰撞可能性预测为一个分布，我们同时提取风险和不确定性来指导机器人的动作。不确定性项鼓励主动感知，使机器人能够在操作的同时收集信息以消除遮挡。结合刻画观测可靠性的置信度图，我们的方法在传感器严重遮挡的情况下仍能实现安全导航。在不同物体尺寸和障碍物配置下的大量实验表明，CURA-PPO的成功率最高可达基线的3倍，且学到的行为能够处理遮挡。我们的方法为仅利用机载感知在杂乱环境中实现自主操作提供了实用的解决方案。

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

对抗性奖励审计用于主动检测和缓解奖励黑客行为

Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01750
Pdf link: https://arxiv.org/pdf/2602.01750
Abstract Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains -- a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.
中文摘要 人类反馈强化学习（RLHF）仍易受奖励黑客攻击的脆弱性，即模型利用学习奖励模型中的虚假相关性以获得高分，同时违背人类意图。现有的缓解措施依赖于无法适应新型利用策略的静态防御。我们提出了对抗性奖励审计（ARA）框架，将奖励黑客重新定义为一种动态且竞争性的游戏。ARA分为两个阶段：首先，黑客策略发现奖励模型的漏洞，审计员则学习从潜在表征中检测利用;其次，审计员引导的RLHF（AG-RLHF）通过奖励信号来惩罚被检测到的黑客行为，将奖励黑客从不可观察的故障转变为可测量、可控的信号。三种黑客场景的实验表明，ARA在所有基线中实现了最佳的对齐-效用权衡：将谄媚降低到接近SFT水平同时提升帮助性，减少冗长但获得最高ROUGE-L，抑制代码游戏同时提升Pass@1。除了单域评估外，我们还证明了奖励黑客、检测和缓解在多个领域间具有普遍性——一个受过代码游戏训练的黑客表现出更高的谄媚行为，尽管没有奖励;而一个受过单一领域训练的审计员有效抑制了其他领域的利用，从而实现了单一模型高效的多域防御。

Position: Beyond Model-Centric Prediction -- Agentic Time Series Forecasting

立场：超越以模型为中心的预测——代理时间序列预测

Authors: Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01776
Pdf link: https://arxiv.org/pdf/2602.01776
Abstract Time series forecasting has traditionally been formulated as a model-centric, static, and single-pass prediction problem that maps historical observations to future values. While this paradigm has driven substantial progress, it proves insufficient in adaptive and multi-turn settings where forecasting requires informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time. In this paper, we argue for agentic time series forecasting (ATSF), which reframes forecasting as an agentic process composed of perception, planning, action, reflection, and memory. Rather than focusing solely on predictive models, ATSF emphasizes organizing forecasting as an agentic workflow that can interact with tools, incorporate feedback from outcomes, and evolve through experience accumulation. We outline three representative implementation paradigms -- workflow-based design, agentic reinforcement learning, and a hybrid agentic workflow paradigm -- and discuss the opportunities and challenges that arise when shifting from model-centric prediction to agentic forecasting. Together, this position aims to establish agentic forecasting as a foundation for future research at the intersection of time series forecasting.
中文摘要 时间序列预测传统上被表述为以模型为中心、静态、单次的预测问题，即将历史观测映射到未来值。虽然这一范式推动了实质性进展，但在自适应和多轮场景中仍显不足，因为这类场景中的预测需要信息丰富的特征提取、推理驱动的推断、迭代精炼以及随时间的持续适应。本文主张智能体时间序列预测（ATSF），它将预测重新定义为由感知、规划、行动、反思和记忆组成的智能体过程。ATSF并不只关注预测模型，而是强调将预测组织为一种能够与工具交互、整合结果反馈并通过经验积累不断演进的智能体工作流。我们概述了三种代表性的实现范式（基于工作流的设计、智能体强化学习和混合智能体工作流范式），并讨论了从以模型为中心的预测转向智能体预测时出现的机遇与挑战。总体而言，本立场论文旨在将智能体预测确立为时间序列预测交叉领域未来研究的基础。

RFS: Reinforcement learning with Residual flow steering for dexterous manipulation

RFS：基于残差流引导的强化学习，用于灵巧操作

Authors: Entong Su, Tyler Westenbroek, Anusha Nagabandi, Abhishek Gupta
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.01789
Pdf link: https://arxiv.org/pdf/2602.01789
Abstract Imitation learning has emerged as an effective approach for bootstrapping sequential decision-making in robotics, achieving strong performance even in high-dimensional dexterous manipulation tasks. Recent behavior cloning methods further leverage expressive generative models, such as diffusion models and flow matching, to represent multimodal action distributions. However, policies pretrained in this manner often exhibit limited generalization and require additional fine-tuning to achieve robust performance at deployment time. Such adaptation must preserve the global exploration benefits of pretraining while enabling rapid correction of local execution [...]. We propose Residual Flow Steering (RFS), a data-efficient reinforcement learning framework for adapting pretrained generative policies. RFS steers a pretrained flow-matching policy by jointly optimizing a residual action and a latent noise distribution, enabling complementary forms of exploration: local refinement through residual corrections and global exploration through latent-space modulation. This design allows efficient adaptation while retaining the expressive structure of the pretrained [...]. We demonstrate the effectiveness of RFS on dexterous manipulation tasks, showing efficient fine-tuning both in simulation and in real-world settings when adapting pretrained base [...]. Project website: this https URL.
中文摘要 模仿学习已成为机器人领域中引导序列决策的有效方法，即使在高维灵巧操作任务中也能取得优异表现。近期的行为克隆方法进一步利用表达能力强的生成模型，如扩散模型和流匹配，来表示多模态动作分布。然而，以这种方式预训练的策略通常泛化能力有限，需要进一步微调才能在部署时实现稳健的性能。这种适应必须保留预训练带来的全局探索优势，同时能够快速纠正局部执行 [...]。我们提出了残差流引导（Residual Flow Steering，RFS），一种数据高效的强化学习框架，用于适应预训练的生成式策略。RFS通过联合优化残差动作和潜在噪声分布来引导预训练的流匹配策略，实现互补的探索形式：通过残差修正实现局部细化，通过潜空间调制实现全局探索。这种设计能够实现高效适应，同时保留预训练 [...] 的表达结构。我们在灵巧操作任务上展示了RFS的有效性，在适应预训练的基础 [...] 时，无论在仿真还是真实环境中都实现了高效的微调。项目网站：this https URL。

Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning

Grad2Reward：从稀疏判断到高密度奖励，提升开放式大型语言模型推理能力

Authors: Zheng Zhang, Ao Lu, Yuanhao Zeng, Ziwei Shan, Jinjin Guo, Lufei Li, Yexin Li, Kan Ren
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01791
Pdf link: https://arxiv.org/pdf/2602.01791
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant breakthroughs in complex LLM reasoning within verifiable domains, such as mathematics and programming. Recent efforts have sought to extend this paradigm to open-ended tasks by employing LLMs-as-a-Judge to provide sequence-level rewards for policy optimization. However, these rewards are inherently sparse, failing to provide the fine-grained supervision necessary for generating complex, long-form trajectories. Furthermore, current work treats the Judge as a black-box oracle, discarding the rich intermediate feedback signals encoded in it. To address these limitations, we introduce Grad2Reward, a novel framework that extracts dense process rewards directly from the Judge's model inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment, substantially enhancing training efficiency and reasoning quality. Additionally, Grad2Reward introduces a self-judging mechanism, allowing the policy to improve through its own evaluative signals without training specialized reward models or reliance on superior external Judges. The experiments demonstrate that policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks, affirming its effectiveness and broad generalizability.
中文摘要 带可验证奖励的强化学习（RLVR）推动了数学和编程等可验证领域中复杂LLM推理的重大突破。近期工作试图将这一范式扩展到开放式任务，采用LLM作为评判者（LLM-as-a-Judge）为策略优化提供序列级奖励。然而，这些奖励本质上是稀疏的，无法提供生成复杂长篇轨迹所需的细粒度监督。此外，当前工作将评判者视为黑盒预言机，丢弃了其中编码的丰富中间反馈信号。为解决这些局限，我们引入了Grad2Reward，这是一个新颖的框架，通过单次反向传播直接从评判者的模型推理过程中提取密集的过程奖励。借助基于梯度的归因，Grad2Reward实现了精确的词元级信用分配，显著提升了训练效率和推理质量。此外，Grad2Reward引入了自评判机制，使策略能够通过自身的评估信号得到改进，而无需训练专门的奖励模型或依赖更强的外部评判者。实验表明，采用Grad2Reward优化的策略在多样化的开放式任务中表现出色，证实了其有效性和广泛的泛化能力。

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

超越精度：训练-推理不匹配是一个优化问题，简单的学习率调度即可解决

Authors: Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, Haoyuan Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01826
Pdf link: https://arxiv.org/pdf/2602.01826
Abstract Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of following the pre-defined decay schedule of a traditional LR scheduler, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
中文摘要 用于训练大型语言模型的强化学习（RL）以不稳定著称。虽然最新研究将此归因于混合引擎不一致导致的“训练推断不匹配”，但标准补救措施如重要性采样可能在长时间训练运行中失效。本研究通过优化视角分析这种不稳定性，证明随着训练进展，梯度噪声和训练-推断不匹配会同步加剧。与此同时，我们发现通过缩小更新大小可以有效抑制这种不匹配。综合来看，我们推断出这种不匹配不仅仅是静态的数值差异，而是与模型优化相结合的动态失效。基于这一见解，我们提出了一个简单但有效的解决方案：专门的学习率（LR）调度器。与传统LR调度器中预定义的衰减计划不同，我们的方法基于响应长度动态触发LR衰减，我们将响应长度识别为即将到来不稳定的可靠早期预警信号。实证证据表明，随着梯度噪声增加，降低学习率，我们可以持续稳定强化学习训练，并将训练与推断不匹配保持在安全水平。
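
A response-length-triggered scheduler of the kind this abstract describes might look like the sketch below. The abstract does not give the trigger rule, so everything concrete here is an assumption: the class name `LengthTriggeredLR`, the moving-average window, the growth threshold, and the multiplicative decay factor are hypothetical choices, not the authors' scheduler.

```python
from collections import deque


class LengthTriggeredLR:
    """Decay the learning rate when mean response length grows too fast.

    Assumed rule: keep a moving average of the per-step mean response
    length; when it exceeds `growth` times the average recorded at the
    last decay (or at warm-up), multiply the LR by `decay`.
    """

    def __init__(self, lr, decay=0.5, growth=1.25, window=4, min_lr=1e-7):
        self.lr = lr
        self.decay = decay
        self.growth = growth
        self.min_lr = min_lr
        self.recent = deque(maxlen=window)
        self.reference = None  # moving average at the last trigger

    def step(self, mean_response_length):
        """Record one training step and return the LR to use next."""
        self.recent.append(mean_response_length)
        if len(self.recent) < self.recent.maxlen:
            return self.lr  # still warming up the window
        average = sum(self.recent) / len(self.recent)
        if self.reference is None:
            self.reference = average
        elif average > self.growth * self.reference:
            self.lr = max(self.lr * self.decay, self.min_lr)
            self.reference = average  # re-arm relative to the new level
        return self.lr


# Toy trace: lengths stay flat, then jump, which triggers one decay.
scheduler = LengthTriggeredLR(lr=1e-6, window=2)
trace = [scheduler.step(length) for length in [100, 100, 105, 160, 160]]
```

Unlike a cosine or step schedule, the decay here is driven by a training signal rather than by the step count, which is the contrast the abstract draws with pre-defined schedules.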

Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

利用Transformer强化学习设计A/B测试中的时间序列实验

Authors: Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, Chengchun Shi
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.01853
Pdf link: https://arxiv.org/pdf/2602.01853
Abstract A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.
中文摘要 A/B测试已成为现代科技公司进行策略评估的黄金标准。然而，将其应用于策略随时间依次分配的时间序列实验仍然具有挑战性。现有设计存在两个局限：（i）它们没有充分利用完整历史来进行处理分配；（ii）它们依赖强假设来近似用于优化设计的目标函数（例如处理效应估计的均方误差）。我们首先建立一个不可能性定理，表明由于时间序列实验中的动态依赖性，不以完整历史为条件会导致次优设计。为了同时解决这两个局限，我们接着提出一种Transformer强化学习（RL）方法，利用Transformer使分配以完整历史为条件，并利用强化学习直接优化MSE，而无需依赖限制性假设。在合成数据、公开可用的派单模拟器和真实网约车数据集上的实证评估表明，我们的方案始终优于现有设计。

PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

PretrainRL：缓解大型语言模型初期的事实性幻觉

Authors: Langming Liu, Kangtao Lv, Haibin Chen, Weidong Zhang, Yejing Wang, Shilei Liu, Xin Tong, Yujin Yuan, Yongwei Wang, Wenbo Su, Bo Zheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.01875
Pdf link: https://arxiv.org/pdf/2602.01875
Abstract Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of "low-probability truth" and "high-probability falsehood". Recent approaches, such as teaching models to say "I don't know" or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is "debiasing then learning." It actively reshapes the model's probability distribution by down-weighting high-probability falsehoods, thereby making "room" for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model's probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.
中文摘要 尽管大型语言模型（LLMs）能力强大，但它们仍存在事实性幻觉，会生成可被验证为错误的内容。我们找到了这一问题的一个根源：预训练语料中数据分布不平衡，导致了“低概率真实”和“高概率虚假”的状态。近期的方法，比如教模型说“我不知道”或事后知识编辑，要么回避了问题，要么面临灾难性遗忘。为了从根源上解决这一问题，我们提出了 PretrainRL，这是一个将强化学习整合进预训练阶段以巩固事实知识的新框架。PretrainRL的核心原则是“先去偏，再学习”。它通过降低高概率虚假内容的权重，主动重塑模型的概率分布，从而为有效学习低概率真实内容腾出“空间”。为此，我们设计了一种高效的负采样策略来发现这些高概率虚假内容，并引入新的指标来评估模型在事实知识方面的概率状态。在三个公开基准上的大量实验表明，PretrainRL显著缓解了事实性幻觉，并优于最先进的方法。

VLM-Guided Experience Replay

VLM引导的经验回放

Authors: Elad Sharony, Tom Jurgenson, Orr Krupnik, Dotan Di Castro, Shie Mannor
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01915
Pdf link: https://arxiv.org/pdf/2602.01915
Abstract Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We propose addressing this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator to identify and prioritize promising sub-trajectories from the agent's experiences. Across scenarios, including game-playing and robotics, spanning both discrete and continuous domains, agents trained with our proposed prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. this https URL
中文摘要 大型语言模型（LLMs）和视觉语言模型（VLMs）的最新进展，使强大的语义和多模态推理能力得以实现，为增强强化学习（RL）中的样本效率、高层次规划和可解释性创造了新机遇。虽然此前的工作已将LLM和VLM集成到强化学习的各个组件中，但重放缓冲区——存储和重用体验的核心组件——仍未被探索。我们建议利用VLM来引导重放缓冲区中体验的优先级排序，以弥补这一空白。我们的核心想法是使用一个冻结的预训练VLM（无需微调）作为自动评估器，识别并优先考虑代理经验中有潜力的子轨迹。在包括游戏和机器人技术在内的场景中，跨越离散和连续域，采用我们提出的优先级排序方法训练的代理相比以往方法，平均成功率提高了11-52%，样本效率提升19-45%。这个 https 网址

Zero-Shot Off-Policy Learning

零样本离策略学习

Authors: Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.01962
Pdf link: https://arxiv.org/pdf/2602.01962
Abstract Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.
中文摘要 离策略学习方法试图直接从由既往交互构成的固定数据集中推导最优策略。这一目标面临重大挑战，主要源于固有的分布偏移和价值函数高估偏差。这些问题在零样本强化学习中更加突出：在无奖励数据上训练的智能体必须在测试时适应新任务，而无需额外训练。本研究通过发现后继测度与平稳密度比之间的理论联系，解决零样本设定下的离策略问题。利用这一洞察，我们的算法能够推断最优的重要性采样比，从而针对任意任务即时完成带有最优策略的平稳分布校正。我们在SMPL Humanoid上的运动跟踪任务、ExoRL上的连续控制任务以及长程OGBench任务中对该方法进行了基准测试。我们的技术可无缝整合进前向-后向（forward-backward）表示框架，并能在免训练的条件下快速适应新任务。更广泛地说，这项工作连接了离策略学习与零样本适应，对这两个研究领域均有助益。

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

小型可推广提示预测模型可以引导大型推理模型的高效强化学习后训练

Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.01970
Pdf link: https://arxiv.org/pdf/2602.01970
Abstract Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS's substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods.
中文摘要 强化学习提升了大型语言模型的推理能力，但由于优化过程需要大量rollout，通常计算成本高昂。在线提示选择通过优先选取信息量大的提示来提升训练效率，是一种可行的解决方案。然而，当前的方法要么依赖昂贵的精确评估，要么为每个提示单独构建预测模型，缺乏跨提示的泛化能力。本研究提出了可泛化预测提示选择（GPS），该方法利用在共享优化历史上训练的轻量级生成模型，对提示难度进行贝叶斯推断。批次获取原则中纳入了中等难度优先和以历史为锚的多样性，以选择信息丰富的提示批次。这一小型预测模型在测试时同样能够泛化，从而实现高效的计算分配。在多种推理基准上的实验表明，GPS在训练效率、最终性能和测试时效率方面相较于优秀的基线方法均有显著提升。

Bandwidth-Efficient Multi-Agent Communication through Information Bottleneck and Vector Quantization

通过信息瓶颈和矢量量化实现带宽高效的多智能体通信

Authors: Ahmad Farooq, Kamran Iqbal
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.02035
Pdf link: https://arxiv.org/pdf/2602.02035
Abstract Multi-agent reinforcement learning systems deployed in real-world robotics applications face severe communication constraints that significantly impact coordination effectiveness. We present a framework that combines information bottleneck theory with vector quantization to enable selective, bandwidth-efficient communication in multi-agent environments. Our approach learns to compress and discretize communication messages while preserving task-critical information through principled information-theoretic optimization. We introduce a gated communication mechanism that dynamically determines when communication is necessary based on environmental context and agent states. Experimental evaluation on challenging coordination tasks demonstrates that our method achieves 181.8% performance improvement over no-communication baselines while reducing bandwidth usage by 41.4%. Comprehensive Pareto frontier analysis shows dominance across the entire success-bandwidth spectrum with area-under-curve of 0.198 vs 0.142 for next-best methods. Our approach significantly outperforms existing communication strategies and establishes a theoretically grounded framework for deploying multi-agent systems in bandwidth-constrained environments such as robotic swarms, autonomous vehicle fleets, and distributed sensor networks.
中文摘要 部署在现实机器人应用中的多智能体强化学习系统面临严重的通信限制，显著影响协调效果。我们提出了一个结合信息瓶颈理论与矢量量化的框架，以实现多智能体环境中选择性且带宽高效的通信。我们的方法通过有原则的信息论优化，学习压缩和离散化通信消息，同时保留任务关键信息。我们引入了一种门控通信机制，能够根据环境上下文和智能体状态动态判断何时需要通信。在具有挑战性的协调任务上的实验评估表明，我们的方法比无通信基线提升了181.8%的性能，同时带宽使用减少了41.4%。全面的帕累托前沿分析显示，该方法在整个成功率-带宽范围内占优，曲线下面积为0.198，而次优方法为0.142。我们的方法显著优于现有通信策略，并为在带宽受限的环境（如机器人集群、自动驾驶车队和分布式传感器网络）中部署多智能体系统建立了有理论依据的框架。

FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification

FORLER：结合Q集成与Actor修正的联邦离线强化学习

Authors: Nan Qiao, Sheng Yue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02055
Pdf link: https://arxiv.org/pdf/2602.02055
Abstract In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. However, interacting with real environments online can be risky and costly, motivating offline federated RL (FRL), where local devices learn from fixed datasets. Despite its promise, offline FRL may break down under low-quality, heterogeneous data. Offline RL tends to get stuck in local optima, and in FRL, one device's suboptimal policy can degrade the aggregated model, i.e., policy pollution. We present FORLER, combining Q-ensemble aggregation on the server with actor rectification on devices. The server robustly merges device Q-functions to curb policy pollution and shift heavy computation off resource-constrained hardware without compromising privacy. Locally, actor rectification enriches policy gradients via a zeroth-order search for high-Q actions plus a bespoke regularizer that nudges the policy toward them. A $\delta$-periodic strategy further reduces local computation. We theoretically provide safe policy improvement performance guarantees. Extensive experiments show FORLER consistently outperforms strong baselines under varying data quality and heterogeneity.
中文摘要 在物联网系统中，联邦学习无需共享原始数据即可并行训练策略，推动了在线强化学习（RL）的发展。然而，在线与真实环境交互可能存在风险且成本高昂，这促使人们研究离线联邦强化学习（FRL），即本地设备从固定数据集中学习。尽管前景可期，离线FRL在低质量、异构数据下可能失效。离线强化学习容易陷入局部最优，而在FRL中，一个设备的次优策略可能会拖累聚合模型，即策略污染。我们提出FORLER，将服务器端的Q集成聚合与设备端的actor修正结合起来。服务器稳健地合并各设备的Q函数，以抑制策略污染，并在不损害隐私的前提下将繁重的计算从资源受限的硬件上转移出去。在本地，actor修正通过零阶搜索寻找高Q值动作，并配合一个专门设计的正则化项将策略推向这些动作，从而丰富策略梯度。$\delta$周期性策略进一步减少了本地计算。我们在理论上给出了安全策略改进的性能保证。大量实验表明，在不同的数据质量和异构程度下，FORLER始终优于强基线。

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

多任务强化学习的概率性能保证

Authors: Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02098
Pdf link: https://arxiv.org/pdf/2602.02098
Abstract Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.
中文摘要 多任务强化学习训练能够执行多项任务的通用策略。尽管近年来取得了显著进展，现有方法很少提供形式化的性能保证，而在安全关键场景中部署策略时，这类保证不可或缺。我们提出了一种方法，用于针对训练中未见过的任务，计算多任务策略性能的高置信度保证。具体来说，我们引入了一个新的泛化界，它将（i）由有限次rollout得到的每个任务的置信下界与（ii）由有限个采样任务得到的任务层面泛化组合起来，从而为来自同一任意未知分布的新任务提供高置信度保证。在多种最先进的多任务强化学习方法上，我们表明这些保证在理论上是可靠的，并且在现实的样本量下具有信息量。
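
The abstract's first ingredient, a per-task lower confidence bound from finitely many rollouts, can be illustrated with a textbook Hoeffding bound. This is only a sketch under stated assumptions: it assumes returns are scaled to [0, 1], the function name and the example data are hypothetical, and the abstract does not say which concentration inequality the paper actually uses.

```python
import math


def hoeffding_lower_bound(returns, delta):
    """Lower confidence bound on the expected return of one task.

    Assumes every return lies in [0, 1]. By Hoeffding's inequality the
    true mean is at least the returned value with probability >= 1 - delta.
    """
    n = len(returns)
    if n == 0:
        raise ValueError("need at least one rollout")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    mean = sum(returns) / n
    radius = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, mean - radius)


# Hypothetical data: 200 rollouts on one task, 170 of them successful.
rollouts = [1.0] * 170 + [0.0] * 30
lcb = hoeffding_lower_bound(rollouts, delta=0.05)
```

With more rollouts the radius shrinks at rate 1/sqrt(n). The task-level generalisation step described in the abstract is not covered by this sketch.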

Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning

思考要精炼，而非冗长：面向高效推理的动态解耦条件优势

Authors: Keqin Peng, Yuanxin Ouyang, Xuebo Liu, Zhiliang Tian, Ruijian Han, Yancheng Yuan, Liang Ding
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02099
Pdf link: https://arxiv.org/pdf/2602.02099
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency-accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) versus over 20% on harder benchmarks (e.g., AIME25), thereby maintaining or improving accuracy. Code is available at this https URL.
中文摘要 可验证奖励强化学习（RLVR）能够激发强大的多步推理能力，但往往会助长过于冗长的推理轨迹。此外，在组相对优化中使用朴素的长度惩罚会严重损害准确率。我们将这一失败归因于两个结构性问题：（i）长度基线稀释，即错误回答（长度奖励为零）拉低了组基线，使正确解受到过度惩罚；以及（ii）难度-惩罚不匹配，即静态惩罚无法适应问题难度，在困难样本上抑制了必要的推理，而在简单样本上却留下冗余。我们提出动态解耦条件优势（DDCA），将效率优化与正确性解耦。DDCA仅在正确回答的集合内有条件地计算长度优势，以消除基线稀释，并以组通过率作为难度的代理，动态调整惩罚强度。在GSM8K、MATH500、AMC23和AIME25上的实验表明，DDCA相较于自适应基线持续改善了效率与准确率之间的权衡，在较简单的任务（如GSM8K）上减少约60%的生成token，在较难的基准（如AIME25）上减少超过20%，从而保持或提升了准确率。代码可在此 https URL 获取。
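
The two ideas named in the abstract, a length baseline computed only over correct responses and a penalty scaled by the group pass rate, might be sketched as below. This is one possible reading, not the authors' implementation: the function name, the `base_penalty` hyperparameter, the standardization, and the exact way the pass rate scales the length term are all assumptions.

```python
import statistics


def ddca_style_advantages(correct, lengths, base_penalty=0.5):
    """Return one advantage per response in a sampled group.

    correct: list of bools (verifier outcome per response).
    lengths: list of token counts per response.
    base_penalty: hypothetical hyperparameter, not taken from the paper.
    """
    n = len(correct)
    rewards = [1.0 if c else 0.0 for c in correct]
    pass_rate = sum(rewards) / n

    # Correctness advantage: standardized against the whole group.
    std_r = statistics.pstdev(rewards)
    acc_adv = [(r - pass_rate) / std_r if std_r > 0 else 0.0 for r in rewards]

    # Length advantage: the baseline uses correct responses only, so
    # incorrect responses cannot dilute it and receive no length term.
    correct_lens = [l for c, l in zip(correct, lengths) if c]
    len_adv = [0.0] * n
    if len(correct_lens) >= 2:
        mean_l = statistics.fmean(correct_lens)
        std_l = statistics.pstdev(correct_lens)
        if std_l > 0:
            for i in range(n):
                if correct[i]:
                    len_adv[i] = -(lengths[i] - mean_l) / std_l

    # Assumed scaling: a higher pass rate (easier prompt) means stronger
    # length pressure; a hard prompt keeps most of its reasoning budget.
    scale = base_penalty * pass_rate
    return [a + scale * l for a, l in zip(acc_adv, len_adv)]


# Hypothetical group of four responses to one prompt.
advantages = ddca_style_advantages(
    correct=[True, True, False, True],
    lengths=[100, 300, 500, 200],
)
```

In this toy group the shortest correct response gets the largest advantage, the longest correct response gets the smallest positive one, and the incorrect response is penalized only for being wrong.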

DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

DCoPilot：生成式AI赋能的策略自适应，面向动态数据中心运营

Authors: Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02137
Pdf link: https://arxiv.org/pdf/2602.02137
Abstract Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
中文摘要 承载人工智能（AI）专用设备的现代数据中心（DC）功率密度高、工作负载变化迅速，因此分钟级的自适应对于安全、节能的运行至关重要。然而，手动设计分段式的深度强化学习（DRL）智能体，无法跟上不断演进的数据中心中频繁的动态变化和服务水平协议（SLA）变更。这种从规范到策略的滞后导致缺乏及时有效的控制策略，可能引发服务中断。为弥合这一差距，我们提出了DCoPilot，一个用于动态数据中心运行中生成控制策略的混合框架。DCoPilot协同了两种不同的生成范式：以符号方式生成结构化奖励形式的大型语言模型（LLM），以及以参数方式生成策略权重的超网络。DCoPilot通过三个协调的阶段运作：（i）仿真扩展，在多样化的仿真就绪（SimReady）场景中对候选奖励进行压力测试；（ii）元策略蒸馏，训练超网络以SLA和场景嵌入为条件输出策略权重；以及（iii）在线适应，根据更新后的规范实现零样本策略生成。在涵盖多种数据中心组件的五个控制任务族上的评估表明，DCoPilot实现了接近零的约束违反，并且在各种规范变化下均优于所有基线。消融研究验证了基于LLM的统一奖励生成在实现超网络稳定收敛方面的有效性。

Learning Generative Selection for Best-of-N

学习用于Best-of-N的生成式选择

Authors: Shubham Toshniwal, Aleksander Ficek, Siddhartha Jain, Wei Du, Vahid Noroozi, Sadegh Mahdavi, Somshubra Majumdar, Igor Gitman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02143
Pdf link: https://arxiv.org/pdf/2602.02143
Abstract Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
中文摘要 通过并行采样扩展测试时计算可以显著提升LLM的推理能力，但往往受限于Best-of-N的选择质量。GenSelect等生成式选择方法解决了这一瓶颈，但强大的选择性能仍主要局限于大型模型。我们表明，小型推理模型可以通过有针对性的强化学习获得强大的GenSelect能力。为此，我们从大规模数学和代码指令数据集中筛选出同时包含正确和错误候选解的实例，以此合成选择任务，并用DAPO训练1.7B参数的模型，对正确的选择给予奖励。在数学（AIME24、AIME25、HMMT25）和代码（LiveCodeBench）推理基准上，我们的模型持续优于提示和多数投票基线，常常接近甚至超过规模大得多的模型。此外，尽管只在较弱模型的输出上训练，这些提升仍能泛化到对更强模型输出的选择。总体而言，我们的结果确立了强化学习作为一种可扩展的方式，能够在小模型中释放强大的生成式选择能力，从而实现高效的测试时扩展。

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

ECHO：用于测试时强化学习的熵-置信度混合优化

Authors: Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02150
Pdf link: https://arxiv.org/pdf/2602.02150
Abstract Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low confidence branches, avoiding high entropy traps and mitigating collapse. During policy updates, ECHO employs confidence adaptive clipping and an entropy confidence hybrid advantage shaping approach to enhance training robustness and mitigate early stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.
中文摘要 测试时强化学习通过重复rollout生成多个候选答案，并利用多数投票构建的伪标签进行在线更新。为减少开销并改善探索，先前工作引入了树状结构的rollout，它们共享推理前缀并在关键节点处分支，以提高采样效率。然而，该范式仍面临两个挑战：（1）高熵分支可能引发rollout崩溃，即分支预算集中在少数具有连续高熵片段的轨迹上，使有效分支数量迅速减少；（2）早期伪标签噪声大且有偏，可能引发自我强化的过拟合，导致策略过早锐化并抑制探索。为解决这些问题，我们提出了熵-置信度混合组相对策略优化（ECHO）。在rollout过程中，ECHO联合利用局部熵和组级置信度自适应地控制分支宽度，并进一步引入基于置信度的在线剪枝，以终止持续低置信度的分支，避免高熵陷阱并减轻崩溃。在策略更新期间，ECHO采用置信度自适应裁剪和熵-置信度混合优势塑形方法，以增强训练的鲁棒性并减轻早期阶段的偏差。实验表明，ECHO在多个数学和视觉推理基准上取得了一致的提升，并且在有限的rollout预算下泛化效果更好。
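
The starting point described in the abstract, pseudo-labels built by majority voting over repeated rollouts, can be sketched in a few lines. The function name is hypothetical, and using the vote share as a group-level confidence is an assumption made for illustration; ECHO's branching, pruning, and clipping rules are not reproduced here.

```python
from collections import Counter


def majority_vote_pseudo_label(answers):
    """Build a pseudo-label and per-rollout rewards from sampled answers.

    Returns (label, confidence, rewards), where confidence is the vote
    share of the winning answer and rewards mark agreement with it.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(answers)
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, confidence, rewards


# Hypothetical final answers extracted from five rollouts of one prompt.
label, confidence, rewards = majority_vote_pseudo_label(
    ["42", "42", "41", "42", "7"]
)
```

A low vote share signals a noisy pseudo-label, which is the situation the abstract says can cause self-reinforcing overfitting early in training.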

D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use

D-CORE：在大型推理模型中激励任务分解以应对复杂工具使用

Authors: Bowen Xu, Shaoyu Wu, Hao Jiang, Kai Liu, Xin Chen, Lulu Hu, Bin Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02160
Pdf link: https://arxiv.org/pdf/2602.02160
Abstract Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework D-CORE (Decomposing tasks and Composing Reasoning processes) that first incentivizes the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5$\times$ smaller. The source code is available at this https URL.
中文摘要 有效的工具使用和推理是大型推理模型（LRM）解决复杂现实问题的关键能力。通过实证分析，我们发现当前的LRM在复杂工具使用场景中缺乏子任务分解能力，从而导致懒惰推理（Lazy Reasoning）。为此，我们提出了一个两阶段训练框架D-CORE（Decomposing tasks and Composing Reasoning processes，即分解任务并组合推理过程）：首先通过自蒸馏激发LRM的任务分解推理能力，随后通过多样性感知的强化学习（RL）恢复LRM的反思推理能力。D-CORE在多种基准和模型规模上都实现了稳健的工具使用提升。在BFCLv3上的实验显示了我们方法的优越性：D-CORE-8B的准确率达到77.7%，比表现最好的8B模型高出5.7%。与此同时，D-CORE-14B以79.3%创下了新的最优水平，尽管规模小5倍，仍优于70B模型。源代码可在此 https URL 获取。

ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning

ECHO-2：面向高性价比强化学习的大规模分布式Rollout框架

Authors: Jie Xiao, Meng Chen, Qingnan Ren, Song Jingwei, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Lynn Ai, Eric Yang, Bill Shi
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2602.02192
Pdf link: https://arxiv.org/pdf/2602.02192
Abstract Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.
中文摘要 强化学习（RL）是大型语言模型（LLM）后训练的关键阶段，涉及rollout生成、奖励评估和集中式学习之间的反复交互。将rollout的执行分布出去，为利用更具成本效益的推理资源提供了机会，但也带来了广域协调和策略分发方面的挑战。我们提出了ECHO-2，一个面向后训练的分布式强化学习框架，适用于推理工作节点位于远程且分发延迟不可忽略的场景。ECHO-2将集中式学习与分布式rollout相结合，并将有界的策略陈旧度视为用户可控的参数，使rollout生成、分发和训练能够相互重叠。我们引入了一个基于重叠的容量模型，将训练时间、分发延迟和rollout吞吐量联系起来，得出一条用于维持学习器利用率的实用资源配置规则。为缓解分发瓶颈并降低成本，ECHO-2采用了对等节点辅助的流水线广播，以及对异构工作节点的成本感知激活。在真实广域带宽条件下对4B和8B模型进行GRPO后训练的实验表明，ECHO-2显著提高了成本效率，同时保持了与强基线相当的强化学习奖励。

Online Fine-Tuning of Pretrained Controllers for Autonomous Driving via Real-Time Recurrent RL

通过实时循环强化学习在线微调自动驾驶预训练控制器

Authors: Julian Lemmel, Felix Resch, Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02236
Pdf link: https://arxiv.org/pdf/2602.02236
Abstract Deploying pretrained policies in real-world applications presents substantial challenges that fundamentally limit the practical applicability of learning-based control systems. When autonomous systems encounter environmental changes in system dynamics, sensor drift, or task objectives, fixed policies rapidly degrade in performance. We show that employing Real-Time Recurrent Reinforcement Learning (RTRRL), a biologically plausible algorithm for online adaptation, can effectively fine-tune a pretrained policy to improve autonomous agents' performance on driving tasks. We further show that RTRRL synergizes with a recent biologically inspired recurrent network model, the Liquid-Resistance Liquid-Capacitance RNN. We demonstrate the effectiveness of this closed-loop approach in a simulated CarRacing environment and in a real-world line-following task with a RoboRacer car equipped with an event camera.
中文摘要 在实际应用中部署预训练策略面临重大挑战，这从根本上限制了基于学习的控制系统的实用性。当自主系统遇到系统动力学、传感器漂移或任务目标方面的环境变化时，固定策略的性能会迅速下降。我们表明，采用实时循环强化学习（RTRRL）这一具有生物学合理性的在线适应算法，可以有效微调预训练策略，从而提升自主智能体在驾驶任务中的表现。我们进一步表明，RTRRL与一种近期提出的受生物启发的循环网络模型Liquid-Resistance Liquid-Capacitance RNN能够协同增效。我们在仿真的CarRacing环境中，以及在配备事件相机的RoboRacer小车的真实巡线任务中，展示了这种闭环方法的有效性。

Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

保持好奇心的学习：通过自适应自蒸馏对大型推理模型进行保持熵的监督微调

Authors: Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02244
Pdf link: https://arxiv.org/pdf/2602.02244
Abstract The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.
中文摘要 大型推理模型的标准后训练流程是先监督微调、再强化学习（SFT-then-RL），这可能限制RL阶段的收益：SFT在模仿专家演示的同时，常常导致过度自信并降低生成多样性，使RL只能在收窄的解空间中探索。在SFT中加入熵正则化并非万能；它倾向于把token分布拉平至趋于均匀，增加了熵，却没有提升有意义的探索能力。本文提出了CurioSFT，一种保持熵的SFT方法，旨在通过内在好奇心增强探索能力。它包括：（a）自探索蒸馏，将模型向一个自生成的、经温度缩放的教师蒸馏，以鼓励模型在自身能力范围内进行探索；以及（b）熵引导的温度选择，自适应地调整蒸馏强度，在推理token上放大探索、在事实token上保持稳定，从而减轻知识遗忘。在数学推理任务上的大量实验表明，在SFT阶段，CurioSFT在分布内任务上比普通SFT高出2.5分，在分布外任务上高出2.9分。我们还验证了SFT期间保留的探索能力能够成功转化为RL阶段的实际收益，平均提升5.0分。

Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

以分割促聚焦：在存在干扰因素时引导潜在动作模型

Authors: Hamza Adnan, Matthew T. Jackson, Alexey Zakharov
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.02259
Pdf link: https://arxiv.org/pdf/2602.02259
Abstract Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM -- a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.
中文摘要 潜在动作模型（LAM）学习仅从原始观测中提取与动作相关的表示，从而能够利用未标注视频进行强化学习，并显著扩大可用的训练数据规模。然而，LAM面临一个关键挑战：如何把与动作相关的特征同与动作相关联的噪声（如背景运动）区分开来。若未能过滤这些干扰因素，LAM会捕捉到虚假相关性，并构建出次优的潜在动作空间。本文提出了MaskLAM，这是对LAM训练的一种轻量级修改，通过引入视觉智能体分割来缓解这一问题。MaskLAM利用预训练基础模型给出的分割掩码对LAM的重建损失进行加权，从而优先关注显著信息而非背景元素，且无需修改模型架构。我们在加入了与动作相关联的背景噪声的连续控制MuJoCo任务上展示了该方法的有效性。与标准基线相比，我们的方法使累计奖励最多提高4倍，并使潜在动作质量提升3倍（由线性探针评估得出）。
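
The core mechanism described in the abstract, weighting the reconstruction loss with a segmentation mask, can be illustrated on flat pixel lists. This is a minimal sketch under stated assumptions: the function name, the `background_weight` parameter, and the normalization by total weight are hypothetical choices, and the abstract does not specify the paper's exact weighting scheme.

```python
def masked_reconstruction_loss(pred, target, mask, background_weight=0.1):
    """Mask-weighted mean squared error over pixels.

    pred, target: flat lists of pixel values.
    mask: flat list, 1.0 where the agent is segmented, 0.0 for background.
    background_weight: hypothetical residual weight kept on background pixels.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    # Agent pixels get weight 1.0, background pixels get background_weight.
    weights = [background_weight + (1.0 - background_weight) * m for m in mask]
    total_weight = sum(weights)
    weighted_sq_error = sum(
        w * (p - t) ** 2 for w, p, t in zip(weights, pred, target)
    )
    return weighted_sq_error / total_weight


# Hypothetical 4-pixel frame: the first two pixels belong to the agent.
target = [0.0, 0.0, 0.0, 0.0]
mask = [1.0, 1.0, 0.0, 0.0]
loss_agent_error = masked_reconstruction_loss([1.0, 1.0, 0.0, 0.0], target, mask)
loss_background_error = masked_reconstruction_loss([0.0, 0.0, 1.0, 1.0], target, mask)
```

The same raw pixel error costs about ten times more when it falls on the agent than on the background, which is how the weighting steers the model away from background distractors without changing the architecture.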

Learning Markov Decision Processes under Fully Bandit Feedback

完全Bandit反馈下的马尔可夫决策过程学习

Authors: Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02260
Pdf link: https://arxiv.org/pdf/2602.02260
Abstract A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $\Theta(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single aggregate reward. In this paper, we consider a far more restrictive "fully bandit" feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\widetilde{O}(\sqrt{T})$ regret. Our regret has an exponential dependence on the horizon length $H$, which we show is necessary. We also obtain improved nearly-tight regret bounds for "ordered" MDPs; these can be used to model classical stochastic optimization problems such as $k$-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm for the setting of $k$-item prophet inequalities; despite the highly restricted feedback, our algorithm's performance is comparable to that of a state-of-the-art learning algorithm (UCB-VI) with detailed state-action feedback.
中文摘要 强化学习中的一个标准假设是，智能体能够观测到相关马尔可夫决策过程（MDP）中每一个被访问的状态-动作对，以及每一步的奖励。在此设定下已有很强的理论结果，达到了近乎紧的 $\Theta(\sqrt{T})$ 遗憾界。然而，如此详细的反馈可能并不现实，近期研究探讨了更受限的设定，例如轨迹反馈，即智能体能观测到所有被访问的状态-动作对，但只能观测到单个聚合奖励。本文考虑回合制（episodic）MDP的一种限制性强得多的“完全bandit”反馈模型，其中智能体甚至观测不到被访问的状态-动作对，只能获知聚合奖励。我们给出了首个针对回合制MDP的高效bandit学习算法，其遗憾为 $\widetilde{O}(\sqrt{T})$。我们的遗憾界对时域长度 $H$ 有指数级依赖，我们证明了这种依赖是必要的。我们还针对“有序”MDP得到了改进的近乎紧的遗憾界；这类MDP可用于建模经典的随机优化问题，如 $k$ 项先知不等式和序贯标价。最后，我们在 $k$ 项先知不等式的设定下评估了算法的实证性能；尽管反馈高度受限，我们算法的性能仍可与拥有详细状态-动作反馈的先进学习算法（UCB-VI）相媲美。

Kimi K2.5: Visual Agentic Intelligence

Kimi K2.5：视觉智能体智能

Authors: Kimi Team: Tongtong Bai, Yifan Bai, Yiping Bao, S.H. Cai, Yuan Cao, Y. Charles, H.S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C. Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02276
Pdf link: https://arxiv.org/pdf/2602.02276
Abstract We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
中文摘要 我们介绍Kimi K2.5，一个开源的多模态智能体模型，旨在推动通用智能体智能的发展。K2.5强调文本与视觉的联合优化，使两种模态相互增强。这包括一系列技术，如文本-视觉联合预训练、零视觉SFT和文本-视觉联合强化学习。基于这一多模态基础，K2.5引入了Agent Swarm，一种自主导向的并行智能体编排框架，能够将复杂任务动态分解为异构子问题并并发执行。大量评估表明，Kimi K2.5在编码、视觉、推理和智能体任务等多个领域都取得了最先进的成果。相较于单智能体基线，Agent Swarm还能将延迟降低多达 $4.5\times$。我们发布了经过后训练的Kimi K2.5模型检查点，以促进智能体智能的未来研究和实际应用。

Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

选择模型辅助Q学习用于延迟反馈收入管理

Authors: Owen Shen, Patrick Jaillet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.02283
Pdf link: https://arxiv.org/pdf/2602.02283
Abstract We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm-Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4-2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
中文摘要 我们研究了带延迟反馈的收益管理强化学习，其中价值的很大一部分由预订数天后才观察到的客户取消和修改决定。我们提出choice-model-assisted RL（选择模型辅助强化学习）：使用校准的离散选择模型作为固定的部分世界模型，在决策时插补学习目标的延迟部分。在固定模型部署情形下，我们证明了带有模型插补目标的表格Q学习收敛到最优Q函数的$O(\varepsilon/(1-\gamma))$邻域，其中$\varepsilon$概括了部分模型的误差，并附加一个$O(t^{-1/2})$采样项。在由61,619条酒店预订数据校准的模拟器中进行的实验（1,088次独立运行）显示：（i）在平稳环境中，与成熟期缓冲（maturity-buffer）DQN基线相比无统计学上可检测的差异；（ii）在族内参数偏移下有正向效果，经Holm-Bonferroni校正后，10个偏移情景中有5个显著提升（最高可达12.4%）；以及（iii）在违反选择模型假设的结构性错误设定下持续劣化（收入下降1.4-2.6%）。这些结果刻画了部分行为模型何时能提升偏移下的鲁棒性，以及何时会引入有害偏差。
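
The idea of imputing the delayed part of the learning target can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: `step` is a hypothetical simulator, `impute_delayed` is a hypothetical stand-in for the fixed choice model, and the one-state pricing toy is invented.

```python
import random


def q_learning_imputed(episodes, n_states, n_actions, step, impute_delayed,
                       gamma=0.9, alpha=0.1, seed=0):
    """Tabular Q-learning whose target adds a model-imputed delayed reward.

    step(s, a, rng) -> (immediate_reward, next_state, done)  # hypothetical simulator
    impute_delayed(s, a) -> expected delayed reward           # hypothetical fixed model
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(n_actions)  # uniform exploration keeps the sketch short
            r_now, s_next, done = step(s, a, rng)
            # The delayed component (e.g. cancellation loss) is not observed yet,
            # so the fixed model's estimate stands in for it at decision time.
            r_hat = r_now + impute_delayed(s, a)
            bootstrap = 0.0 if done else gamma * max(q[s_next])
            q[s][a] += alpha * (r_hat + bootstrap - q[s][a])
    return q


def toy_step(s, a, rng):
    # Action 0 = low price (revenue 1.0), action 1 = high price (revenue 2.0).
    return (1.0, 2.0)[a], 0, True


def toy_impute(s, a):
    # Invented expected cancellation loss: none at the low price, 1.5 at the high price.
    return (0.0, -1.5)[a]


q = q_learning_imputed(2000, 1, 2, toy_step, toy_impute)
print(q[0])  # the low price ends up with the higher Q-value in this toy
```

In this toy the imputed term flips the ranking of the two actions. If `impute_delayed` were biased, the learned Q-values would inherit that bias, which is the role the abstract's $\varepsilon$ term plays.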

Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

通过模块化梯度手术推进通用推理模型

Authors: Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02301
Pdf link: https://arxiv.org/pdf/2602.02301
Abstract Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
中文摘要 强化学习（RL）在大型推理模型（LRM）的最新进展中发挥了核心作用，在可验证推理和开放式推理上都带来了显著提升。然而，由于领域异质性显著，跨多个领域训练单一通用LRM仍然具有挑战性。通过对顺序强化学习（Sequential RL）和混合强化学习（Mixed RL）这两种广泛使用的策略的系统研究，我们发现两者在行为和梯度层面都会产生显著的跨领域干扰，导致整体收益有限。为应对这些挑战，我们引入了模块化梯度手术（Modular Gradient Surgery，MGS），它在Transformer内部的模块级别解决梯度冲突。应用于Llama和Qwen模型时，MGS在三个代表性领域（数学、通用聊天和指令遵循）中，相比标准多任务强化学习分别平均提升4.3分（16.6%）和4.5分（11.1%）。进一步分析表明，MGS在长时间训练下依然有效。总体而言，我们的研究澄清了多领域强化学习中干扰的来源，并为训练通用LRM提供了有效解决方案。
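
The abstract states only that gradient conflicts are resolved at the module level; it does not give the rule. As a rough sketch of what that could look like, the code below applies a PCGrad-style projection (a swapped-in technique, not confirmed as the MGS rule) separately to each module's gradient. The module names and gradient values are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g, other):
    """Remove from g its component along `other` when the two conflict (dot < 0)."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)  # `other` is non-zero whenever d < 0
    return [a - scale * b for a, b in zip(g, other)]


def modulewise_surgery(grads_a, grads_b):
    """grads_*: dict mapping module name -> flat gradient list for one domain.

    Conflicts are checked and projected per module rather than on the full
    concatenated gradient; the two projected gradients are then summed.
    """
    merged = {}
    for name in grads_a:
        ga, gb = grads_a[name], grads_b[name]
        pa = project_out_conflict(ga, gb)
        pb = project_out_conflict(gb, ga)
        merged[name] = [x + y for x, y in zip(pa, pb)]
    return merged


# Invented two-domain example: the "attn" gradients conflict, the "mlp" ones do not.
grads_math = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}
grads_chat = {"attn": [-1.0, 1.0], "mlp": [0.5, 0.0]}
merged = modulewise_surgery(grads_math, grads_chat)
print(merged)  # {'attn': [0.5, 1.5], 'mlp': [1.5, 1.0]}
```

Working per module means a conflict in one block does not alter the update of another block whose domain gradients already agree.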

Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach

立场：解释大型语言模型中的行为转变需要比较方法

Authors: Martino Ciaperoni, Marzio Di Vece, Luca Pappalardo, Fosca Giannotti, Francesco Giannini
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02304
Pdf link: https://arxiv.org/pdf/2602.02304
Abstract Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning or in-context learning. While investigating these phenomena has recently received attention, explaining their appearance is still overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to justify what changed internally across different checkpoints and which explanatory claims are warranted about that change. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end we formulate a Comparative XAI ($\Delta$-XAI) framework with a set of desiderata to be taken into account when designing proper explanation methods. To highlight how $\Delta$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete $\Delta$-XAI experiment.
中文摘要 大规模基础模型会表现出行为转变，即在规模扩展、微调、强化学习或上下文学习之后出现的、由干预引起的行为变化。虽然对这些现象的研究近来受到关注，但对其出现的解释仍被忽视。经典的可解释人工智能（XAI）方法可以在模型的单个检查点上揭示失败，但它们在结构上不适合说明不同检查点之间内部发生了什么变化，也不适合判断关于该变化的哪些解释性论断是有依据的。我们的立场是，行为转变应当以比较的方式来解释：核心对象应是参考模型与受干预模型之间由干预引起的转变，而非孤立的任何单一模型。为此，我们提出了比较式XAI（$\Delta$-XAI）框架，并给出设计合适解释方法时需要考虑的一组期望性质（desiderata）。为了说明$\Delta$-XAI方法如何工作，我们介绍了一组可能的流程，将它们与这些期望性质联系起来，并提供了一个具体的$\Delta$-XAI实验。

SWE-Universe: Scale Real-World Verifiable Environments to Millions

SWE-Universe：将现实世界的可验证环境扩展到百万规模

Authors: Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, Hao Ge, Zongmeng Zhang, Zeyu Cui, Dayiheng Liu, Jingren Zhou, Jianling Sun, Junyang Lin, Binyuan Hui
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02361
Pdf link: https://arxiv.org/pdf/2602.02361
Abstract We propose SWE-Universe, a scalable and efficient framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). To overcome the prevalent challenges of automatic building, such as low production yield, weak verifiers, and prohibitive cost, our framework utilizes a building agent powered by an efficient custom-trained model. This agent employs iterative self-verification and in-loop hacking detection to ensure the reliable generation of high-fidelity, verifiable tasks. Using this method, we scale the number of real-world multilingual SWE environments to a million scale (807,693). We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning. Finally, we applied this technique to Qwen3-Max-Thinking and achieved a score of 75.3% on SWE-Bench Verified. Our work provides both a critical resource and a robust methodology to advance the next generation of coding agents.
中文摘要 我们提出了SWE-Universe，这是一个可扩展且高效的框架，用于从GitHub拉取请求（PR）自动构建现实世界的软件工程（SWE）可验证环境。为了克服自动构建中普遍存在的挑战，如产出率低、验证器弱和成本过高，我们的框架采用了由高效定制训练模型驱动的构建智能体。该智能体采用迭代自我验证和循环内作弊（hacking）检测，确保可靠地生成高保真、可验证的任务。通过这种方法，我们将现实世界多语言SWE环境的数量扩展到百万量级（807,693）。我们通过大规模的智能体中期训练（mid-training）和强化学习，展示了这些环境的重要价值。最后，我们将该技术应用于Qwen3-Max-Thinking，并在SWE-Bench Verified上获得了75.3%的得分。我们的工作既提供了关键资源，也提供了稳健的方法论，以推动下一代编码智能体的发展。

Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Proof-RM：一种可扩展且可泛化的数学证明奖励模型

Authors: Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02377
Pdf link: https://arxiv.org/pdf/2602.02377
Abstract While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "question-proof-check" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.
中文摘要 虽然大型语言模型（LLMs）通过带可验证奖励的强化学习（RLVR）展现了强大的数学推理能力，但许多高级数学问题是基于证明的，无法通过简单的答案匹配来可靠地判定证明的正确性。为了实现自动验证，需要一个能够可靠评估完整证明过程的奖励模型（RM）。在本研究中，我们设计了一个可扩展的数据构建流水线，以极少的人力，利用大型语言模型生成大量高质量的“问题-证明-检查”三元组数据。通过系统性地变化问题来源、生成方法和模型配置，我们创建了涵盖多种难度等级、语言风格和错误类型的多样化问题-证明对，随后通过分层人工审核进行筛选以保证标签对齐。利用这些数据，我们训练了一个证明检查RM，并加入额外的过程奖励和token权重平衡，以稳定强化学习过程。我们的实验从奖励准确性、泛化能力和测试时引导等多个角度验证了模型的可扩展性和强劲性能，为增强大型语言模型的数学能力提供了重要的实用方案和工具。

Unified Personalized Reward Model for Vision Generation

统一个性化奖励模型用于视觉生成

Authors: Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.02380
Pdf link: https://arxiv.org/pdf/2602.02380
Abstract Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
中文摘要 多模态奖励模型（RM）的最新进展极大推动了视觉生成的发展。现有框架通常采用Bradley-Terry风格的偏好建模，或利用生成式VLM作为评判，随后通过强化学习优化视觉生成模型。然而，现有的RM存在固有局限性：它们通常遵循一刀切的范式，假设单一的偏好分布或依赖固定的评估标准。因此，它们对内容特定的视觉线索不敏感，导致与主观且依赖上下文的人类偏好出现系统性偏差。为此，我们受人类评估方式的启发，提出了UnifiedReward-Flex，一种用于视觉生成的统一个性化奖励模型，将奖励建模与灵活且上下文自适应的推理相结合。具体来说，给定提示和生成的视觉内容，它首先解读语义意图并以视觉证据为依据，然后在预定义和自生成的高层维度下实例化细粒度标准，动态构建层级化评估。我们的训练流程分为两个阶段：（1）首先从先进的闭源VLM中蒸馏结构化、高质量的推理轨迹，用于引导SFT，使模型具备灵活且上下文自适应的推理行为；（2）随后在精心筛选的偏好对上进行直接偏好优化（DPO），进一步强化推理的忠实度和判别对齐。为验证其有效性，我们将UnifiedReward-Flex整合进GRPO框架，用于图像和视频合成，大量结果证明了其优越性。

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

SLIME：用于偏好优化的稳定似然隐式间隔约束

Authors: Maksim Afanasyev, Illarion Iov
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02383
Pdf link: https://arxiv.org/pdf/2602.02383
Abstract Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to "unlearning", where the model degrades the probability of high-quality outputs to satisfy margin constraints, and "formatting collapse" caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
中文摘要 直接偏好优化方法已成为基于人类反馈的强化学习（RLHF）的一种计算高效的替代方案，用于对齐大型语言模型（LLMs）。最新方法通过推导隐式奖励函数简化了对齐过程，但它们常常存在一个关键的目标不匹配问题：优化被选中回答与被拒绝回答之间的相对间隔，并不能保证被选中回答的绝对似然得以保持。这可能导致“遗忘”（unlearning），即模型为满足间隔约束而降低高质量输出的概率，以及因对被拒绝序列过度惩罚而导致的“格式崩溃”（formatting collapse）。在本研究中，我们引入了SLIME（Stabilized Likelihood Implicit Margin Enforcement，稳定似然隐式间隔约束），这是一种无参考模型的对齐目标，旨在将偏好学习与生成质量解耦。SLIME包含一个三重目标：（1）一个锚定项，用于最大化偏好回答的似然；（2）一个稳定化惩罚项，防止被拒绝token的概率坍缩到零；以及（3）一个结合硬约束和软约束的双间隔机制，用于精确塑造边界。我们的结果表明，SLIME在保持更高生成稳定性的同时，性能优于最先进的基线。
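
The objective mismatch described in the abstract can be seen with plain arithmetic. The sketch below is not the SLIME objective, which the abstract does not spell out. It uses a generic reference-free pairwise loss, and a hypothetical anchoring weight `lam` stands in for item (1). The log-probabilities are invented.

```python
import math


def margin_loss(logp_chosen, logp_rejected, beta=1.0):
    """Generic pairwise loss: -log sigmoid(beta * (logp_chosen - logp_rejected))."""
    margin = beta * (logp_chosen - logp_rejected)
    return math.log1p(math.exp(-margin))


def anchored_loss(logp_chosen, logp_rejected, lam=0.5):
    """Hypothetical anchoring: also penalise a low likelihood of the chosen response."""
    return margin_loss(logp_chosen, logp_rejected) - lam * logp_chosen


# Invented sequence log-probabilities (chosen, rejected) before and after an update.
before = (-10.0, -12.0)  # margin = 2
after = (-15.0, -25.0)   # margin = 10, but the chosen response became less likely

# The margin-only loss rewards the update even though logp_chosen dropped ...
print(margin_loss(*after) < margin_loss(*before))      # True
# ... while the anchored variant penalises it.
print(anchored_loss(*after) > anchored_loss(*before))  # True
```

Both log-probabilities fall in this example, yet the gap between them widens. That is the "unlearning" pattern a margin-only objective cannot see.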

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

大卫对歌利亚：通过强化学习实现可验证的代理间越狱

Authors: Samuel Nellessen, Tal Kachman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.02395
Pdf link: https://arxiv.org/pdf/2602.02395
Abstract The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.
中文摘要 大型语言模型向自主智能体的演进引入了利用合法工具权限的对抗性失效，将工具增强环境中的安全评估从主观的自然语言处理任务转变为客观的控制问题。我们将这一威胁模型形式化为“跟随攻击”（Tag-Along Attacks）：没有工具的对手“搭便车”式地借用经过安全对齐的操作者（Operator）的可信权限，仅通过对话诱导其执行被禁止的工具调用。为验证这一威胁，我们提出了Slingshot，一种“冷启动”强化学习框架，能够自主发现涌现的攻击向量，并揭示了一个关键洞见：在我们的设定中，学习到的攻击往往收敛到简短、类似指令的句法模式，而非多轮说服。在留出的极高难度任务上，Slingshot对Qwen2.5-32B-Instruct-AWQ操作者的成功率为67.0%（基线为1.7%），并将（在已解决任务上）首次成功所需的预期尝试次数从52.3降至1.3。关键的是，Slingshot可零样本迁移到多个模型家族，包括Gemini 2.5 Flash等闭源模型（攻击成功率56.0%）和Meta-SecAlign-8B等经过防御性微调的开源模型（攻击成功率39.2%）。我们的工作确立了跟随攻击作为一类一等的、可验证的威胁模型，并表明仅通过环境交互，就能从现成的开放权重模型中引出有效的智能体攻击。

World-Gymnast: Training Robots with Reinforcement Learning in a World Model

World-Gymnast：在世界模型中用强化学习训练机器人

Authors: Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, Sherry Yang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02454
Pdf link: https://arxiv.org/pdf/2602.02454
Abstract Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.
中文摘要 机器人通过与物理世界交互来学习，从根本上受制于物理交互的成本。两种替代方案，即基于专家演示的监督微调（SFT）和在基于软件的模拟器中进行的强化学习（RL），分别受限于可用的专家数据量和操作任务中的仿真到现实差距。随着从真实世界视频-动作数据中学习的世界模型的出现，我们提出这样一个问题：在世界模型中训练策略，是否能比监督学习或软件仿真更有效地获得更好的真实机器人性能。我们提出了World-Gymnast，它在以动作为条件的视频世界模型中展开（roll out）策略，并用视觉语言模型（VLM）为这些展开轨迹给出奖励，从而对视觉-语言-动作（VLA）策略进行强化学习微调。在Bridge机器人平台上，World-Gymnast的表现最多优于SFT达18倍，优于软件模拟器达2倍。更重要的是，World-Gymnast展示了结合世界模型的强化学习的有趣能力，包括利用世界模型中的多样语言指令和新场景进行训练、在新场景中进行测试时训练，以及在线迭代改进世界模型和策略。我们的结果表明，学习世界模型并在云端训练机器人策略，可能是弥合“能在演示中工作的机器人”与“能在任何人家中工作的机器人”之间差距的关键。

Conflict-Aware Client Selection for Multi-Server Federated Learning

多服务器联邦学习中的冲突感知客户端选择

Authors: Mingwei Hong, Zheng Lin, Zehang Lin, Lin Li, Miao Yang, Xia Du, Zihan Fang, Zhaolu Kang, Dianxin Luan, Shunzhi Zhu
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.02458
Pdf link: https://arxiv.org/pdf/2602.02458
Abstract Federated learning (FL) has emerged as a promising distributed machine learning (ML) paradigm that enables collaborative model training across clients without exposing raw data, thereby preserving user privacy and reducing communication costs. Despite these benefits, traditional single-server FL suffers from high communication latency due to the aggregation of models from a large number of clients. While multi-server FL distributes workloads across edge servers, overlapping client coverage and uncoordinated selection often lead to resource contention, causing bandwidth conflicts and training failures. To address these limitations, we propose a decentralized reinforcement learning method with conflict risk prediction, named RL-CRP, to optimize client selection in multi-server FL systems. Specifically, each server estimates the likelihood of client selection conflicts using a categorical hidden Markov model based on its sparse historical client selection sequence. Then, a fairness-aware reward mechanism is incorporated to promote long-term client participation for minimizing training latency and resource contention. Extensive experiments demonstrate that the proposed RL-CRP framework effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.
中文摘要 联邦学习（FL）作为一种有前景的分布式机器学习（ML）范式兴起，能够在客户端之间协作训练模型而不暴露原始数据，从而保护用户隐私并降低通信成本。尽管有这些优势，传统的单服务器FL由于要聚合大量客户端的模型而存在较高的通信延迟。虽然多服务器FL将工作负载分配到多个边缘服务器，但客户端覆盖重叠和选择不协调常常引发资源争用，导致带宽冲突和训练失败。为解决这些局限性，我们提出了一种带冲突风险预测的去中心化强化学习方法，名为RL-CRP，以优化多服务器FL系统中的客户端选择。具体来说，每个服务器基于其稀疏的历史客户端选择序列，使用类别隐马尔可夫模型来估计客户端选择冲突的可能性。随后，引入具有公平性意识的奖励机制，促进客户端的长期参与，以最小化训练延迟和资源争用。大量实验表明，所提出的RL-CRP框架有效减少了服务器间冲突，并在收敛速度和通信成本方面显著提升了训练效率。
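
To make the conflict-risk estimate concrete, here is a minimal sketch of forward filtering in a categorical hidden Markov model. The two hidden states ("low risk", "high risk"), the binary conflict flag and all probabilities are invented for illustration. The abstract does not describe the paper's state space or parameters.

```python
def predict_conflict(obs, start, trans, emit):
    """Forward-filter a categorical HMM and return P(next observation = 1).

    obs:   list of 0/1 flags (1 = a selection conflict was observed that round)
    start: start[i]    = P(state_0 = i)
    trans: trans[i][j] = P(state_{t+1} = j | state_t = i)
    emit:  emit[i][o]  = P(obs = o | state = i)
    """
    n = len(start)
    belief = list(start)
    for o in obs:
        # Condition the current belief on the observation, then normalise.
        belief = [belief[i] * emit[i][o] for i in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
        # Propagate one step through the transition matrix.
        belief = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
    # Predictive probability that the next round shows a conflict.
    return sum(belief[j] * emit[j][1] for j in range(n))


# Invented parameters: state 0 = low risk, state 1 = high risk.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]

p_prior = predict_conflict([], start, trans, emit)
p_after_conflicts = predict_conflict([1, 1], start, trans, emit)
p_after_quiet = predict_conflict([0, 0], start, trans, emit)
print(p_after_quiet < p_prior < p_after_conflicts)  # True
```

A server could feed such a probability into its selection policy as a risk feature. How RL-CRP actually uses the estimate is not stated in the abstract.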

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

TIC-VLA：一种用于动态环境中机器人导航的Think-in-Control视觉-语言-动作模型

Authors: Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02459
Pdf link: https://arxiv.org/pdf/2602.02459
Abstract Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: this https URL
中文摘要 在动态、以人为中心的环境中，机器人必须遵循语言指令，同时保持实时的反应式控制。视觉-语言-动作（VLA）模型提供了一个有前景的框架，但它们假设推理与控制在时间上对齐，而语义推理相对于实时动作本质上是滞后的。我们介绍了Think-in-Control（TIC）-VLA，一种延迟感知的框架，在动作生成过程中显式建模滞后的语义推理。TIC-VLA定义了一种延迟语义-控制接口，除当前观测外，还以延迟的视觉-语言语义状态和显式的延迟元数据作为动作生成的条件，从而使策略能够补偿异步推理。我们还提出了一种延迟一致的训练流水线，在模仿学习和在线强化学习中注入推理延迟，使训练与异步部署保持一致。为了支持贴近真实的评估，我们推出了DynaNav，一套物理精确、照片级逼真的仿真套件，用于动态环境中的语言引导导航。大量仿真和真实机器人实验表明，TIC-VLA始终优于以往的VLA模型，并在数秒级的推理延迟下保持稳健的实时控制。项目网站：此 https URL

Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

训练大型语言模型进行分而治之推理可提升测试时可扩展性

Authors: Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02477
Pdf link: https://arxiv.org/pdf/2602.02477
Abstract Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
中文摘要 大型语言模型（LLMs）通过逐步的思维链（CoT）推理展现了强大的推理能力。然而，在模型能力的极限处，CoT往往不足，其严格的顺序性质也限制了测试时的可扩展性。一种潜在的替代方案是分而治之（DAC）推理，它将复杂问题分解为子问题，以便更有效地探索解。尽管前景看好，但我们的分析揭示了通用后训练与DAC式推理之间的根本错位，这限制了模型充分发挥这一潜力的能力。为了弥合这一差距，充分释放LLM在最具挑战性任务上的推理能力，我们提出了一个端到端强化学习（RL）框架，以增强其DAC式推理能力。在每一步，策略将问题分解为一组子问题，依次求解它们，并在子问题解的条件下解决原问题，分解和求解都被纳入强化学习训练。在可比的训练条件下，我们的DAC式框架赋予模型更高的性能上限和更强的测试时可扩展性，在竞赛级基准上Pass@1比CoT高出8.6%，Pass@32高出6.3%。
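
The decompose, solve and combine loop from the abstract can be sketched as plain control flow. The three callables are hypothetical placeholders for calls to the policy model, and the arithmetic toy exists only to make the example runnable. Passing earlier sub-solutions to later subproblems is an assumption about what "solves them sequentially" means.

```python
from typing import Callable, List


def dac_solve(problem: str,
              decompose: Callable[[str], List[str]],
              solve_sub: Callable[[str, List[str]], str],
              combine: Callable[[str, List[str]], str]) -> str:
    """One divide-and-conquer step: decompose, solve in order, then combine."""
    subproblems = decompose(problem)
    solutions: List[str] = []
    for sub in subproblems:
        # Sequential solving: each subproblem may see the solutions found so far.
        solutions.append(solve_sub(sub, solutions))
    # The original problem is answered conditioned on the subproblem solutions.
    return combine(problem, solutions)


# Toy stand-ins for the policy: split a sum in half, add each half, add the parts.
def toy_decompose(problem: str) -> List[str]:
    terms = problem.split("+")
    mid = len(terms) // 2
    return ["+".join(terms[:mid]), "+".join(terms[mid:])]


def toy_solve_sub(sub: str, earlier: List[str]) -> str:
    return str(sum(int(t) for t in sub.split("+")))


def toy_combine(problem: str, solutions: List[str]) -> str:
    return str(sum(int(s) for s in solutions))


answer = dac_solve("3+4+5+6", toy_decompose, toy_solve_sub, toy_combine)
print(answer)  # 18
```

In the paper's setting both the decomposition and the solutions are produced by the policy and trained with RL. Here they are fixed functions so the flow can be checked by hand.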

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

RLAnything：在完全动态的强化学习系统中锻造环境、策略与奖励模型

Authors: Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02488
Pdf link: https://arxiv.org/pdf/2602.02488
Abstract We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also find that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL
中文摘要 我们提出了RLAnything，一个强化学习框架，通过闭环优化动态构建环境、策略和奖励模型，放大学习信号，并强化适用于任何大型语言模型或智能体场景的整体强化学习系统。具体来说，策略通过逐步信号和结果信号的综合反馈进行训练，而奖励模型则通过一致性反馈联合优化，这反过来进一步改善策略训练。此外，我们受理论启发的自动环境适应利用来自奖励模型和策略模型各自的评判（critic）反馈，改进了二者的训练，从而实现从经验中学习。实证上，每个新增组件都持续改进整体系统，RLAnything在多个代表性的大型语言模型和智能体任务上取得了显著提升：在OSWorld上将Qwen3-VL-8B-Thinking提升了9.1%，在AlfWorld和LiveBench上将Qwen2.5-7B-Instruct分别提升了18.7%和11.9%。我们还发现，优化后的奖励模型信号优于依赖人类标签的结果信号。代码：此 https URL

Keyword: diffusion policy

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

CLAMP：面向三维多视角动作条件机器人操作预训练的对比学习

Authors: I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.00937
Pdf link: https://arxiv.org/pdf/2602.00937
Abstract Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.
中文摘要 在行为克隆策略中利用预训练的二维图像表示取得了巨大成功，并已成为机器人操作的标准方法。然而，此类表示无法捕捉对精确操作至关重要的物体和场景的三维空间信息。在本研究中，我们介绍了CLAMP（Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining，面向三维多视角动作条件机器人操作预训练的对比学习），这是一种利用点云和机器人动作的新型三维预训练框架。我们基于由RGB-D图像和相机外参计算得到的合并点云，重新渲染包含深度和三维坐标的多视角四通道图像观测（包括动态腕部视角），为高精度操作任务提供更清晰的目标物体视图。预训练编码器通过在大规模仿真机器人轨迹上进行对比学习，学会将物体的三维几何和位置信息与机器人动作模式关联起来。在编码器预训练期间，我们同时预训练一个扩散策略（Diffusion Policy），用于初始化微调所需的策略权重，这对于提升微调的样本效率和性能至关重要。预训练之后，我们利用学到的图像和动作表示，在数量有限的任务演示上微调策略。我们证明了这种预训练加微调的设计显著提升了在未见任务上的学习效率和策略性能。此外，我们证明CLAMP在六个仿真任务和五个真实世界任务上优于最先进的基线。

Keyword: reinforcement learning

AutoBool: An Reinforcement-Learning trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews

AutoBool：一个强化学习训练的大型语言模型，用于系统性综述的有效自动布尔查询生成

Representation Learning Enhanced Deep Reinforcement Learning for Optimal Operation of Hydrogen-based Multi-Energy Systems

表征学习增强深度强化学习，实现氢基多能系统的最佳运行

Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints

侧约束下的5G路由异步多智能体强化学习

Distributional Reinforcement Learning for Condition-Based Maintenance of Multi-Pump Equipment

分布强化学习用于多泵设备的视情维护

Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints

预算约束下本地语言模型与云卸载决策的联合持续学习

Learning Robust Reasoning through Guided Adversarial Self-Play

通过引导式对抗自我博弈学习稳健推理

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

CamReasoner：通过结构化空间推理强化对摄像机运动的理解

From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

从游戏过程轨迹到游戏机制：基于大型语言模型的因果归纳

Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning

约束双层强化学习的样本复杂度分析

AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning

AdaFuse：通过强化学习预测肺癌风险的自适应多模态融合

MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI

MASC：通过强化学习实现金属感知采样与校正，用于加速MRI

ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

ReLAPSe：针对已遗忘（unlearned）扩散模型中被擦除概念的强化学习训练对抗提示搜索

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

KEPO：基于推理的知识增强偏好优化，用于强化学习

ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control

ZEST：用于运动型机器人控制的零样本具身技能迁移

DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

DROGO：通过强化学习中的图优化实现默认表示目标

Variational Approach for Job Shop Scheduling

作业车间调度的变分方法

Open Materials Generation with Inference-Time Reinforcement Learning

带推理时间强化学习的开放材料生成

LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

将LLMs视为带注意力机制的高维非线性自回归模型：训练、对齐与推理

FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

FedMOA：针对异构奖励下的个性化推理LLM的联合GRPO

Search Inspired Exploration in Reinforcement Learning

强化学习中的搜索启发探索

AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models

AREAL-DTA：用于大型语言模型高效强化学习的动态树注意力

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Minerva：针对网络威胁情报大型语言模型的可验证奖励强化学习

How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use

大型语言模型离职业扑克玩家有多远？重新审视结合智能工具的博弈论推理

Reinforcement Learning-assisted Constraint Relaxation for Constrained Expensive Optimization

强化学习辅助约束松弛以实现受限且昂贵的优化

Surrogate Ensemble in Expensive Multi-Objective Optimization via Deep Q-Learning

通过深度Q学习在昂贵多目标优化中实现代理模型集成

APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

APEX：一种基于记忆的解耦探索器，用于异步空中目标物导航

NetWorld: Communication-Based Diffusion World Model for Multi-Agent Reinforcement Learning in Wireless Networks

NetWorld：无线网络中多智能体强化学习的基于通信扩散世界模型

Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

学习在视频多模态大型语言模型中通过解码对抗组合幻觉

Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings

学习带有潜在嵌入的模态混合思维链推理

Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction

代理奖励建模：通过在线主动交互验证图形界面代理

Safe Langevin Soft Actor Critic

安全朗之万软演员-评论家

Model-Based Data-Efficient and Robust Reinforcement Learning

基于模型的数据高效且稳健的强化学习

Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

迈向基于LLM的推荐中样本高效且稳定的强化学习

Equilibrium of Feasible Zone and Uncertain Model in Safe Exploration

安全探索中可行域与不确定模型的均衡

LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

LegalOne：一系列可靠法律推理的基础模型

Toward Reliable Sim-to-Real Predictability for MoE-based Robust Quadrupedal Locomotion

迈向基于MoE的稳健四足行走的可靠模拟到真实可预测性

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

SA-VLA：视觉-语言-行动强化学习中的空间感知流匹配

ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

ACE-Step 1.5：推动开源音乐生成的边界

Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

面向大型推理模型有效强化学习的自适应能力分解

Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning

通过智能体强化学习在NetGPT中实现通信激励的协同推理