Arxiv Papers of Today

Generated: 2026-02-11 16:53:50 (UTC+8); arXiv announcement time: 2026-02-11 20:00 EST (2026-02-12 09:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

UI-Venus-1.5 Technical Report

Authors: Veuns-Team: Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09082
Pdf link: https://arxiv.org/pdf/2602.09082
Abstract GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code and model links are available on the arXiv abstract page.

An Actor-Critic-Identifier Control Design for Increasing Energy Efficiency of Automated Electric Vehicles

Authors: Hamed Faghihian, Arman Sargolzaei
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.09140
Pdf link: https://arxiv.org/pdf/2602.09140
Abstract Electric vehicles (EVs) are increasingly deployed, yet range limitations remain a key barrier. Improving energy efficiency via advanced control is therefore essential, and emerging vehicle automation offers a promising avenue. However, many existing strategies rely on indirect surrogates because linking power consumption to control inputs is difficult. We propose a neural-network (NN) identifier that learns this mapping online and couples it with an actor-critic reinforcement learning (RL) framework to generate optimal control commands. The resulting actor-critic-identifier architecture removes dependence on explicit models relating total power, recovered energy, and inputs, while maintaining accurate speed tracking and maximizing efficiency. Update laws are derived using Lyapunov stability analysis, and performance is validated in simulation. Compared to a traditional controller, the method increases total energy recovery by 12.84%, indicating strong potential for improving EV energy efficiency.

Boltzmann Reinforcement Learning for Noise resilience in Analog Ising Machines

Authors: Aditya Choudhary, Saaketh Desai, Prasad Iyer
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2602.09162
Pdf link: https://arxiv.org/pdf/2602.09162
Abstract Analog Ising machines (AIMs) have emerged as a promising paradigm for combinatorial optimization, utilizing physical dynamics to solve Ising problems with high energy efficiency. However, the performance of traditional optimization and sampling algorithms on these platforms is often limited by inherent measurement noise. We introduce BRAIN (Boltzmann Reinforcement for Analog Ising Networks), a distribution learning framework that utilizes variational reinforcement learning to approximate the Boltzmann distribution. By shifting from state-by-state sampling to aggregating information across multiple noisy measurements, BRAIN is resilient to Gaussian noise characteristic of AIMs. We evaluate BRAIN across diverse combinatorial topologies, including the Curie-Weiss and 2D nearest-neighbor Ising systems. We find that under realistic 3% Gaussian measurement noise, BRAIN maintains 98% ground state fidelity, whereas Markov Chain Monte Carlo (MCMC) methods degrade to 51% fidelity. Furthermore, BRAIN reaches the MCMC-equivalent solution up to 192x faster under these conditions. BRAIN exhibits $\mathcal{O}(N^{1.55})$ scaling up to 65,536 spins and maintains robustness against severe measurement uncertainty up to 40%. Beyond ground state optimization, BRAIN accurately captures thermodynamic phase transitions and metastable states, providing a scalable and noise-resilient method for utilizing analog computing architectures in complex optimizations.
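
As a point of reference for the BRAIN entry above, the following minimal sketch enumerates the exact Boltzmann distribution of a tiny Curie-Weiss (fully connected) Ising system. It is not the paper's method: the coupling `J`, temperature `T`, system size `n`, and the function names are illustrative assumptions, and brute-force enumeration is only feasible for very small `n`.

```python
import itertools
import math


def curie_weiss_energy(spins, J=1.0):
    """Energy E(s) = -(J / N) * sum_{i<j} s_i * s_j for spins in {-1, +1}."""
    n = len(spins)
    pair_sum = sum(spins[i] * spins[j] for i in range(n) for j in range(i + 1, n))
    return -(J / n) * pair_sum


def boltzmann_distribution(n, T=1.0, J=1.0):
    """Return {state: probability} with p(s) proportional to exp(-E(s) / T)."""
    states = list(itertools.product((-1, 1), repeat=n))
    weights = [math.exp(-curie_weiss_energy(s, J) / T) for s in states]
    z = sum(weights)  # partition function (normalizing constant)
    return {s: w / z for s, w in zip(states, weights)}


# Hypothetical toy setting: 4 spins at temperature 0.5.
dist = boltzmann_distribution(n=4, T=0.5)
top_p = max(dist.values())
ground_states = [s for s, p in dist.items() if math.isclose(p, top_p)]
print(ground_states)
```

In this toy setting the two most probable states are the fully aligned configurations, which are the ground states of a ferromagnetic Curie-Weiss model.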

$n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models

Authors: Ryozo Masukawa, Sanggeon Yun, Hyunwoo Oh, SuhgHeon Jeong, Raheeb Hassa, Hanning Chen, Wenjun Huang, Mahdi Imani, Pietro Mercati, Nathaniel D. Bastian, Mohsen Imani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09173
Pdf link: https://arxiv.org/pdf/2602.09173
Abstract Recent progress in reinforcement learning with verifiable rewards (RLVR) shows that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. We introduce soft hidden-state collaboration, where multiple heterogeneous frozen SLM experts are integrated through their internal representations via a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Ablations further reveal a dual mechanism of expert utilization: for simpler arithmetic domains, performance gains can largely be explained by static expert preferences, whereas more challenging settings induce increasingly concentrated and structured expert attention over training, indicating emergent specialization in how the router connects to relevant experts. Overall, hidden-state collaboration provides a compact mechanism for leveraging frozen experts, while offering an observational window into expert utilization patterns and their evolution under RLVR.

EExApp: GNN-Based Reinforcement Learning for Radio Unit Energy Optimization in 5G O-RAN

Authors: Jie Lu, Peihao Yan, Huacheng Zeng
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09206
Pdf link: https://arxiv.org/pdf/2602.09206
Abstract With over 3.5 million 5G base stations deployed globally, their collective energy consumption (projected to exceed 131 TWh annually) raises significant concerns over both operational costs and environmental impacts. In this paper, we present EExAPP, a deep reinforcement learning (DRL)-based xApp for 5G Open Radio Access Network (O-RAN) that jointly optimizes radio unit (RU) sleep scheduling and distributed unit (DU) resource slicing. EExAPP uses a dual-actor-dual-critic Proximal Policy Optimization (PPO) architecture, with dedicated actor-critic pairs targeting energy efficiency and quality-of-service (QoS) compliance. A transformer-based encoder enables scalable handling of variable user equipment (UE) populations by encoding all-UE observations into fixed-dimensional representations. To coordinate the two optimization objectives, a bipartite Graph Attention Network (GAT) is used to modulate actor updates based on both critic outputs, enabling adaptive tradeoffs between power savings and QoS. We have implemented EExAPP and deployed it on a real-world 5G O-RAN testbed with live traffic, commercial RU and smartphones. Extensive over-the-air experiments and ablation studies confirm that EExAPP significantly outperforms existing methods in reducing the energy consumption of RU while maintaining QoS.

CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

Authors: Xiaofeng Xiao, Xiao Hu, Yang Ye, Xubo Yue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09207
Pdf link: https://arxiv.org/pdf/2602.09207
Abstract Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns. In this paper, we propose Causality-guided Diffusion Policy (CausalGDP), a unified framework that integrates causal reasoning into diffusion-based RL. CausalGDP first learns a base diffusion policy and an initial causal dynamical model from offline data, capturing causal dependencies among states, actions, and rewards. During real-time interaction, the causal information is continuously updated and incorporated as a guidance signal to steer the diffusion process toward actions that causally influence future states and rewards. By explicitly considering causality beyond association, CausalGDP focuses policy optimization on action components that genuinely drive performance improvements. Experimental results demonstrate that CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.

Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk

Authors: Sumedh Gupte, Shrey Rakeshkumar Patel, Soumen Pachal, Prashanth L. A., Sanjay P. Bhat
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09300
Pdf link: https://arxiv.org/pdf/2602.09300
Abstract We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures, namely expectiles, utility-based shortfall risk and optimized certainty equivalent risk. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. Second, we propose estimators of the risk-sensitive policy gradient for each of the aforementioned risk measures, and establish $\mathcal{O}\left(1/m\right)$ mean-squared error bounds for our estimators, where $m$ is the number of trajectories. Further, under standard assumptions for policy gradient-type algorithms, we establish smoothness of the risk-sensitive objective, in turn leading to stationary convergence rate bounds for the overall risk-sensitive policy gradient algorithm that we propose. Finally, we conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
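
Expectiles are the first of the three risk measures named in the entry above. As a rough illustration only, the sketch below computes the standard sample expectile by iteratively reweighted averaging. It is not the paper's policy-gradient estimator; the function name `sample_expectile` and the toy returns are assumptions made for this example.

```python
def sample_expectile(xs, tau, iters=100, tol=1e-12):
    """Sample expectile: minimizer of sum_i |tau - 1[x_i <= e]| * (x_i - e)^2."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    e = sum(xs) / len(xs)  # start from the mean (the tau = 0.5 expectile)
    for _ in range(iters):
        # Asymmetric weights: tau above the current estimate, 1 - tau at or below it.
        weights = [tau if x > e else 1.0 - tau for x in xs]
        new_e = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


# Hypothetical episode returns.
returns = [-2.0, 0.0, 1.0, 3.0, 8.0]
low = sample_expectile(returns, 0.1)
mid = sample_expectile(returns, 0.5)
high = sample_expectile(returns, 0.9)
print(low, mid, high)
```

At `tau = 0.5` the expectile coincides with the mean; smaller `tau` weights low returns more heavily, which is what makes it usable as a risk-sensitive objective.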

Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Authors: Pei-Chi Pan, Yingbin Liang, Sen Lin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09305
Pdf link: https://arxiv.org/pdf/2602.09305
Abstract Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)-based fine-tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM challenges--such as evaluation bias, hallucination, distribution shift, and efficient learning--remains poorly understood. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment, shaping what models learn, how they generalize, and whether their outputs can be trusted. We introduce Reasoning-Aligned Reinforcement Learning (RARL), a unifying framework that systematizes diverse reward paradigms for multi-step reasoning. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges ranging from inference-time scaling to hallucination mitigation. We further critically evaluate existing benchmarks, highlighting vulnerabilities such as data contamination and reward misalignment, and outline directions for more robust evaluation. By integrating fragmented research threads and clarifying the interplay between reward design and fundamental reasoning capabilities, this work provides a foundational roadmap for building reasoning models that are robust, verifiable, and trustworthy.

CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments

Authors: Jinghan Yang, Jingyi Hou, Xinbo Yu, Wei He, Yifan Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.09367
Pdf link: https://arxiv.org/pdf/2602.09367
Abstract Robotic assistance in scientific laboratories requires procedurally correct long-horizon manipulation, reliable execution under limited supervision, and robustness in low-demonstration regimes. Such conditions greatly challenge end-to-end vision-language-action (VLA) models, whose assumptions of recoverable errors and data-driven policy learning often break down in protocol-sensitive experiments. We propose CAPER, a framework for Constrained And ProcEdural Reasoning for robotic scientific experiments, which explicitly restricts where learning and reasoning occur in the planning and control pipeline. Rather than strengthening end-to-end policies, CAPER enforces a responsibility-separated structure: task-level reasoning generates procedurally valid action sequences under explicit constraints, mid-level multimodal grounding realizes subtasks without delegating spatial decision-making to large language models, and low-level control adapts to physical uncertainty via reinforcement learning with minimal demonstrations. By encoding procedural commitments through interpretable intermediate representations, CAPER prevents execution-time violations of experimental logic, improving controllability, robustness, and data efficiency. Experiments on a scientific workflow benchmark and a public long-horizon manipulation dataset demonstrate consistent improvements in success rate and procedural correctness, particularly in low-data and long-horizon settings.

Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

Authors: Nilaksh, Antoine Clavaud, Mathieu Reymond, François Rivest, Sarath Chandar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09396
Pdf link: https://arxiv.org/pdf/2602.09396
Abstract In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
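
The entry above mentions orthogonal gradient updates but the abstract does not spell out the construction. One common way to make an update orthogonal to a reference direction is vector rejection, sketched below. Treat this purely as an assumption for illustration: the function `project_orthogonal`, the choice of reference direction, and the toy vectors are hypothetical and may differ from what the paper does.

```python
def project_orthogonal(grad, reference, eps=1e-12):
    """Return grad minus its component along reference (vector rejection)."""
    ref_norm_sq = sum(r * r for r in reference)
    if ref_norm_sq < eps:
        return list(grad)  # no meaningful direction to remove
    scale = sum(g * r for g, r in zip(grad, reference)) / ref_norm_sq
    return [g - scale * r for g, r in zip(grad, reference)]


# Hypothetical auxiliary-loss gradient and reference direction.
aux_grad = [1.0, 2.0, 3.0]
reference_dir = [0.0, 1.0, 1.0]
orth = project_orthogonal(aux_grad, reference_dir)
dot = sum(o * r for o, r in zip(orth, reference_dir))
print(orth, dot)
```

After the projection, the update has no component along the reference direction, so applying it cannot move the parameters along that direction to first order.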

SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Authors: Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.09432
Pdf link: https://arxiv.org/pdf/2602.09432
Abstract Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Authors: Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09443
Pdf link: https://arxiv.org/pdf/2602.09443
Abstract The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No. 2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence to better align visual perceptions with abstract physical laws for machine scientific discovery.

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Authors: Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09463
Pdf link: https://arxiv.org/pdf/2602.09463
Abstract Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.

Online Learning in MDPs with Partially Adversarial Transitions and Losses

Authors: Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09474
Pdf link: https://arxiv.org/pdf/2602.09474
Abstract We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce conditioned occupancy measures, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $\Lambda$ adversarial steps. We also characterize the regret of adversarial MDPs in the fully adversarial setting ($\Lambda=H-1$) for both full-information and bandit feedback, and provide almost matching upper and lower bounds, slightly strengthening existing lower bounds and clarifying how different feedback structures affect the hardness of learning.
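
To make the improved dependence on $S$ in the entry above concrete, the two quoted regret bounds can be divided term by term. This is only a back-of-the-envelope comparison: it ignores the logarithmic factors hidden by $\tilde{O}$ and assumes comparable constants, neither of which the abstract specifies.

```latex
\[
\frac{H S^{\Lambda}\sqrt{K S A^{\Lambda+1}}}{H\sqrt{K S^{3} A^{\Lambda+1}}}
= \frac{S^{\Lambda}\, S^{1/2}}{S^{3/2}}
= S^{\Lambda-1}.
\]
```

Under these assumptions the bound for consecutive adversarial steps is smaller by a factor of $S^{\Lambda-1}$ whenever $\Lambda > 1$, and the two bounds coincide at $\Lambda = 1$.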

Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

Authors: Yizhi Wang, Linan Yue, Min-Ling Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09485
Pdf link: https://arxiv.org/pdf/2602.09485
Abstract Long chains of thought (Long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these Long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing these Long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC can effectively shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also explains its compression decisions, validating its effectiveness.

Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models

Authors: Hikaru Asano, Tadashi Kozuno, Kuniaki Saito, Yukino Baba
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.09501
Pdf link: https://arxiv.org/pdf/2602.09501
Abstract Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
中文摘要 掩码扩散语言模型（MDLM）通过迭代填充被掩码的词元来生成文本，每一步需要做出两个相互耦合的决策：解除哪些位置的掩码（where-to-unmask）以及填入哪些词元（what-to-unmask）。标准的MDLM训练直接优化词元预测（what-to-unmask），而推理时的解掩码顺序（where-to-unmask）通常由启发式置信度指标决定，或通过需要高成本同策略（on-policy）采样的强化学习来训练。为此，我们提出Gt-Margin，这是一种由真实词元导出的逐位置得分，定义为正确词元与其最强竞争词元之间的概率差。Gt-Margin给出一种神谕（oracle）解掩码顺序，在每个部分掩码状态下优先处理较容易的位置。我们证明，利用该神谕顺序可显著提升最终生成质量，在逻辑推理基准上尤为明显。基于这一发现，我们通过排序学习（learning-to-rank）训练了一个有监督的解掩码规划器，使其根据掩码上下文模仿神谕排序。所得规划器可集成到标准MDLM采样中以选择解掩码位置，在不修改词元预测模型的情况下提高推理准确率。
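
The Gt-Margin score described above can be illustrated with a minimal sketch. It assumes per-position probability vectors are already available for one partially masked state; the function names and the toy numbers are hypothetical, not taken from the paper. In the setting the abstract describes, the ranking would be recomputed for each new partially masked state.

```python
def gt_margin(probs, gold):
    """Probability margin between the ground-truth token and its strongest alternative."""
    best_other = max(p for tok, p in enumerate(probs) if tok != gold)
    return probs[gold] - best_other


def oracle_order(position_probs, gold_tokens):
    """Rank masked positions from easiest (largest margin) to hardest."""
    margins = {
        pos: gt_margin(probs, gold_tokens[pos])
        for pos, probs in position_probs.items()
    }
    return sorted(margins, key=margins.get, reverse=True)


# Toy state (hypothetical values): three masked positions, vocabulary of size 3.
position_probs = {
    0: [0.5, 0.3, 0.2],  # gold token 1 is not the top prediction: negative margin
    1: [0.1, 0.8, 0.1],  # gold token 1 is clearly preferred: large margin
    2: [0.3, 0.3, 0.4],  # gold token 2 wins narrowly: small margin
}
gold_tokens = {0: 1, 1: 1, 2: 2}
order = oracle_order(position_probs, gold_tokens)
```

Because the score needs ground-truth tokens, it is only available at training time, which is why the abstract trains a separate planner to imitate the resulting order.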

Training deep physical neural networks with local physical information bottleneck

利用局部物理信息瓶颈训练深度物理神经网络

Authors: Hao Wang, Ziao Wang, Xiangpeng Liang, Han Zhao, Jianqi Hu, Junjie Jiang, Xing Fu, Jianshi Tang, Huaqiang Wu, Sylvain Gigan, Qiang Liu
Subjects: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Arxiv link: https://arxiv.org/abs/2602.09569
Pdf link: https://arxiv.org/pdf/2602.09569
Abstract Deep learning has revolutionized modern society but faces growing energy and latency constraints. Deep physical neural networks (PNNs) are interconnected computing systems that directly exploit analog dynamics for energy-efficient, ultrafast AI execution. Realizing this potential, however, requires universal training methods tailored to physical intricacies. Here, we present the Physical Information Bottleneck (PIB), a general and efficient framework that integrates information theory and local learning, enabling deep PNNs to learn under arbitrary physical dynamics. By allocating matrix-based information bottlenecks to each unit, we demonstrate supervised, unsupervised, and reinforcement learning across electronic memristive chips and optical computing platforms. PIB also adapts to severe hardware faults and allows for parallel training via geographically distributed resources. Bypassing auxiliary digital models and contrastive measurements, PIB recasts PNN training as an intrinsic, scalable information-theoretic process compatible with diverse physical substrates.
中文摘要 深度学习已经彻底改变了现代社会，但也面临日益严峻的能耗和延迟限制。深度物理神经网络（PNN）是相互连接的计算系统，直接利用模拟动力学实现高能效、超高速的人工智能计算。然而，要实现这一潜力，需要针对物理系统的复杂特性量身定制的通用训练方法。本文提出物理信息瓶颈（PIB），这是一个通用且高效的框架，融合了信息论与局部学习，使深度PNN能够在任意物理动力学下学习。通过为每个单元分配基于矩阵的信息瓶颈，我们在电子忆阻器芯片和光学计算平台上演示了监督学习、无监督学习和强化学习。PIB还能适应严重的硬件故障，并允许利用地理上分布的资源进行并行训练。PIB无需辅助数字模型和对比测量，将PNN训练重新表述为一种内在且可扩展的信息论过程，兼容多种物理基底。

Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

面向基于LLM的高效多智能体强化学习的Rollout-训练协同设计

Authors: Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, Qiong Chen, Xinyu Liu, Fangming Liu, Ai Han, Zhen Chen, Ke Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09578
Pdf link: https://arxiv.org/pdf/2602.09578
Abstract Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon the experience store, a novel micro-batch driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter/intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified and location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
中文摘要 尽管多智能体强化学习（MARL）在算法层面已有诸多创新，支撑大规模MARL训练的底层网络化基础设施仍未得到充分探索。现有训练框架主要针对单智能体场景进行优化，未能解决MARL特有的系统层面挑战，包括rollout与训练之间的同步屏障、rollout负载不均衡以及训练资源利用不足。为弥合这一差距，我们提出了FlexMARL，这是首个对大规模基于LLM的MARL的rollout、训练及其编排进行整体优化的端到端训练框架。具体来说，FlexMARL引入联合编排器，在rollout与训练分离的架构下管理数据流。基于经验存储，一种新型的微批次驱动异步流水线消除了同步屏障，同时提供强一致性保证。rollout引擎采用并行采样方案并结合分层负载均衡，以适应智能体间和智能体内倾斜的请求模式。训练引擎通过以智能体为中心的资源分配实现按需硬件绑定。不同智能体的训练状态通过统一且与位置无关的通信进行交换。在大规模生产集群上的实证结果表明，与现有框架相比，FlexMARL最高可实现7.3倍的加速，并将硬件利用率最高提升5.6倍。

On the Optimal Reasoning Length for RL-Trained Language Models

关于强化学习训练语言模型的最佳推理长度

Authors: Daisuke Nohara, Taishi Nakamura, Rio Yokota
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.09591
Pdf link: https://arxiv.org/pdf/2602.09591
Abstract Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain of thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL trained policies, we identify two failure modes, 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
中文摘要 强化学习显著提升了大型语言模型的推理能力，但也往往会拉长思维链输出，并增加训练和推理阶段的计算成本。尽管已有长度控制方法被提出，但能够平衡效率与性能的最佳输出长度仍不明确。本研究在Qwen3-1.7B Base和DeepSeek-R1-Distill-Qwen-1.5B两个模型上比较了多种长度控制方法。我们的结果表明，长度惩罚可能阻碍推理能力的习得，而经过适当调节的长度控制则能提高已具备较强先验推理能力的模型的效率。通过将先前工作扩展到强化学习训练的策略，我们识别出两种失败模式：1）过长的输出会增大离散度，2）过短的输出会导致思考不足。
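
The abstract does not specify which length control methods are compared. As a purely illustrative sketch, one common family is a reward with a penalty for exceeding a token budget; the function name, the `budget` and `coeff` parameters, and their default values below are assumptions, not the paper's formulation.

```python
def length_penalized_reward(correct, num_tokens, budget=1024, coeff=0.5):
    """Correctness reward minus a linear penalty on tokens beyond the budget.

    Hypothetical shaping: outputs within the budget are not penalized, and
    the penalty grows with the relative overflow past the budget.
    """
    base = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - budget) / budget
    return base - coeff * overflow


within_budget = length_penalized_reward(True, 512)    # no penalty applied
twice_budget = length_penalized_reward(True, 2048)    # 100% overflow
wrong_and_long = length_penalized_reward(False, 2048)
```

A penalty of this kind makes the trade-off in the abstract concrete: a large `coeff` pushes toward short outputs and risks under-thinking, while a small one leaves long outputs largely unconstrained.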

Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

从不可恢复的错误中学习：面向工具集成LLM推理的错误定位策略优化

Authors: Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen, Bo Zheng, Sheng Guo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.09598
Pdf link: https://arxiv.org/pdf/2602.09598
Abstract Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
中文摘要 工具集成推理（TIR）使LLM智能体能够通过规划、工具使用和迭代修订来解决任务，但在这一设定下，仅依赖结果的强化学习存在奖励稀疏、延迟以及步骤级信用分配薄弱的问题。在长程TIR轨迹中，一个早期的不可恢复错误就可能决定成败，因此关键在于定位第一个不可恢复的步骤，并利用它进行细粒度的信用分配。我们提出了错误定位策略优化（ELPO），该方法在固定的rollout预算下通过二分搜索rollout树定位第一个不可恢复步骤，通过层级优势归因将所得的树转换为稳定的学习信号，并应用错误定位的自适应裁剪来加强对关键步骤及其后续部分的纠正性更新。在数学、科学问答和代码执行等TIR基准上，ELPO在可比的采样预算下持续优于强大的Agentic RL基线，并在Pass@K和Major@K扩展性、rollout排序质量和工具调用效率方面取得额外提升。我们的代码将很快公开发布。
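
The idea of locating the first irrecoverable step by binary search can be sketched as follows. This is not ELPO's actual procedure: it assumes a hypothetical oracle `can_recover(k)` that reports whether the task can still be solved after keeping the first `k` steps (in practice this would presumably be estimated from a limited number of rollouts), and it assumes recoverability is monotone, so that once a prefix is irrecoverable every longer prefix is too.

```python
def first_irrecoverable_step(num_steps, can_recover):
    """Return the smallest k in 1..num_steps whose length-k prefix cannot be recovered.

    Assumes the full trajectory failed (can_recover(num_steps) is False) and
    that can_recover is monotone: True for short prefixes, then False.
    """
    lo, hi = 1, num_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if can_recover(mid):
            lo = mid + 1   # the mistake happens after step mid
        else:
            hi = mid       # the mistake is at step mid or earlier
    return lo


# Toy oracle (hypothetical): step 6 of a 10-step trajectory is the fatal one.
probes = []

def toy_oracle(k):
    probes.append(k)
    return k < 6

critical_step = first_irrecoverable_step(10, toy_oracle)
```

The search needs about log2(n) oracle calls rather than one per step, which is the reason a binary search fits a fixed rollout budget.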

Directed Information: Estimation, Optimization and Applications in Communications and Causality

定向信息：估计、优化及通信与因果关系中的应用

Authors: Dor Tsur, Oron Sabag, Navin Kashyap, Haim Permuter, Gerhard Kramer
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2602.09711
Pdf link: https://arxiv.org/pdf/2602.09711
Abstract Directed information (DI) is an information measure that attempts to capture directionality in the flow of information from one random process to another. It is closely related to other causal influence measures, such as transfer entropy, Granger causality, and Pearl's causal framework. This monograph provides an overview of DI and its main application in information theory, namely, characterizing the capacity of channels with feedback and memory. We begin by reviewing the definitions of DI, its basic properties, and its relation to Shannon's mutual information. Next, we provide a survey of DI estimation techniques, ranging from classic plug-in estimators to modern neural-network-based estimators. Considering the application of channel capacity estimation, we describe how such estimators numerically optimize DI rate over a class of joint distributions on input and output processes. A significant part of the monograph is devoted to techniques to compute the feedback capacity of finite-state channels (FSCs). The feedback capacity of a strongly connected FSC involves the maximization of the DI rate from the channel input process to the output process. This maximization is performed over the class of causal conditioned probability input distributions. When the FSC is also unifilar, i.e., the next state is given by a time-invariant function of the current state and the new input-output symbol pair, the feedback capacity is the optimal average reward of an appropriately formulated Markov decision process (MDP). This MDP formulation has been exploited to develop several methods to compute exactly, or at least estimate closely, the feedback capacity of a unifilar FSC. This monograph describes these methods, starting from the value iteration algorithm, to Q-graph methods, and reinforcement learning algorithms that can handle large input and output alphabets.
中文摘要 定向信息（DI）是一种信息度量，试图刻画信息从一个随机过程流向另一个随机过程的方向性。它与其他因果影响度量密切相关，如转移熵、格兰杰因果性和Pearl的因果框架。本专著概述了DI及其在信息论中的主要应用，即刻画带反馈和记忆的信道的容量。我们首先回顾DI的定义、基本性质及其与香农互信息的关系。接下来，我们综述DI估计技术，涵盖从经典的插入式（plug-in）估计器到现代基于神经网络的估计器。针对信道容量估计这一应用，我们描述了此类估计器如何在输入和输出过程的一类联合分布上对DI率进行数值优化。专著中很大一部分内容介绍计算有限状态信道（FSC）反馈容量的技术。强连通FSC的反馈容量涉及最大化从信道输入过程到输出过程的DI率，这一最大化是在因果条件输入概率分布类上进行的。当FSC同时是单线（unifilar）的，即下一状态由当前状态和新的输入-输出符号对的时不变函数给出时，反馈容量就是一个适当构造的马尔可夫决策过程（MDP）的最优平均奖励。这一MDP表述已被用于开发多种方法，以精确计算或至少较精确地估计单线FSC的反馈容量。本专著介绍了这些方法，从值迭代算法到Q图方法，再到能够处理大规模输入和输出字母表的强化学习算法。
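
For reference, the quantities the abstract mentions are commonly written as below. The notation is an assumption (the monograph's own symbols may differ): $X^i$ denotes $(X_1, \dots, X_i)$, and $P(x^n \| y^{n-1})$ denotes a causally conditioned input distribution. The only difference from mutual information is that the $i$-th term conditions on the input up to time $i$ rather than on the whole input sequence.

```latex
% Directed information from X^n to Y^n (Massey's definition)
I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i ; Y_i \mid Y^{i-1})

% Shannon mutual information expanded by the chain rule, for comparison
I(X^n ; Y^n) = \sum_{i=1}^{n} I(X^n ; Y_i \mid Y^{i-1})

% Feedback capacity as a maximized DI rate (form usually stated for
% strongly connected finite-state channels)
C_{\mathrm{fb}} = \lim_{n \to \infty} \frac{1}{n}
    \max_{P(x^n \,\|\, y^{n-1})} I(X^n \to Y^n)
```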

ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

ExO-PPO：一种扩展的离策略近端策略优化算法

Authors: Hanyong Wang, Menglong Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09726
Pdf link: https://arxiv.org/pdf/2602.09726
Abstract Deep reinforcement learning has solved various tasks successfully; however, due to the construction of the policy gradient and its training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. On the other hand, off-policy methods make fuller use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, in this paper we propose a new PPO variant that combines the stability guarantee of conservative on-policy iteration with more efficient off-policy data utilization. Specifically, we first derive an extended off-policy improvement from an expectation form of the generalized policy improvement lower bound. Second, we extend the clipping mechanism with segmented exponential functions to obtain a suitable surrogate objective function. Third, the trajectories generated by the past $M$ policies are organized in the replay buffer for off-policy training. We refer to this method as Extended Off-policy Proximal Policy Optimization (ExO-PPO). Compared with PPO and some other state-of-the-art variants, ExO-PPO demonstrates improved performance with balanced sample efficiency and stability on varied tasks in our empirical experiments.
中文摘要 深度强化学习已成功解决多种任务，但由于策略梯度的构造方式和训练动态，深度强化学习模型的调优仍具挑战性。作为最成功的深度强化学习算法之一，近端策略优化算法（PPO）在保守的同策略（on-policy）更新中对策略梯度进行裁剪，确保了可靠而稳定的策略改进。然而，这种训练模式可能会牺牲样本效率。另一方面，离策略（off-policy）方法通过样本复用更充分地利用数据，但代价是估计方差和偏差增大。为了兼得两者的优势，本文提出了一种新的PPO变体，将保守同策略迭代的稳定性保证与更高效的离策略数据利用相结合。具体来说，我们首先从广义策略改进下界的期望形式推导出扩展的离策略改进。然后，我们用分段指数函数扩展裁剪机制，以获得合适的替代目标函数。第三，过去$M$个策略生成的轨迹被组织在回放缓冲区中，用于离策略训练。我们称这种方法为扩展离策略近端策略优化（ExO-PPO）。与PPO及其他一些先进变体相比，实证实验表明ExO-PPO在多种任务上性能更优，并在样本效率与稳定性之间取得了平衡。
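
For context, the baseline mechanism that ExO-PPO extends is the standard PPO clipped surrogate, sketched below for a single sample. This is the well-known PPO objective, not the segmented-exponential variant proposed in the paper, whose exact form the abstract does not give; `eps=0.2` is a conventional default, used here as an assumption.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); the objective is to be maximized.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
capped_gain = ppo_clip_objective(1.5, 1.0)
# Negative advantage: the pessimistic (smaller) value is kept below 1 - eps.
pessimistic = ppo_clip_objective(0.5, -1.0)
# Ratio inside the trust region: the objective is just ratio * advantage.
unclipped = ppo_clip_objective(1.0, 2.0)
```

With data from older policies the ratio drifts far from 1 more often, so a hard clip discards much of the signal. This is one plausible motivation for reshaping the clipping function when reusing off-policy data.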

DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation

DiffuReason：为序列推荐桥接潜在推理与生成式精炼

Authors: Jie Jiang, Yang Wu, Qian Li, Yuling Xiong, Yihang Su, Junbang Huo, Longfei Lu, Jun Zhang, Huan Yu
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.09744
Pdf link: https://arxiv.org/pdf/2602.09744
Abstract Latent reasoning has emerged as a promising paradigm for sequential recommendation, enabling models to capture complex user intent through multi-step deliberation. Yet existing approaches often rely on deterministic latent chains that accumulate noise and overlook the uncertainty inherent in user intent, and they are typically trained in staged pipelines that hinder joint optimization and exploration. To address these challenges, we propose DiffuReason, a unified "Think-then-Diffuse" framework for sequential recommendation. It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization (GRPO) alignment to optimize for ranking performance. In the Think stage, the model generates Thinking Tokens that reason over user history to form an initial intent hypothesis. In the Diffuse stage, rather than treating this hypothesis as the final output, we refine it through a diffusion process that models user intent as a probabilistic distribution, providing iterative denoising against reasoning noise. Finally, GRPO-based reinforcement learning enables the reasoning and refinement modules to co-evolve throughout training, without the constraints of staged optimization. Extensive experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures. Online A/B tests on a large-scale industrial platform further validate its practical effectiveness.
中文摘要 潜在推理已成为序列推荐中一种有前景的范式，使模型能够通过多步思考捕捉复杂的用户意图。然而，现有方法往往依赖确定性的潜在推理链，这些链会累积噪声并忽视用户意图中固有的不确定性，而且通常在分阶段的流水线中训练，阻碍了联合优化和探索。为应对这些挑战，我们提出了DiffuReason，一个统一的“先思考后扩散”（Think-then-Diffuse）序列推荐框架。它集成了用于潜在推理的多步思考词元（Thinking Tokens）、用于对中间表示去噪的基于扩散的精炼，以及用于优化排序性能的端到端群体相对策略优化（GRPO）对齐。在思考阶段，模型生成思考词元，基于用户历史进行推理，形成初始意图假设。在扩散阶段，我们不将该假设视为最终输出，而是通过一个将用户意图建模为概率分布的扩散过程对其进行精炼，针对推理噪声提供迭代去噪。最后，基于GRPO的强化学习使推理模块和精炼模块能够在整个训练过程中协同演化，不受分阶段优化的限制。在四个基准上的大量实验表明，DiffuReason能够持续提升多种骨干架构的性能。在大型工业平台上进行的在线A/B测试进一步验证了其实际有效性。
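
The core of GRPO is that each sampled output is scored relative to the other outputs drawn for the same input, with no learned value function. A minimal sketch of that group-relative normalization follows. How DiffuReason defines its ranking reward is not stated in the abstract, so the reward values are hypothetical, and the use of the population standard deviation is an implementation choice made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidates sampled for the same user history.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every candidate gets the same reward, no candidate is preferred.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```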

Grounding LTL Tasks in Sub-Symbolic RL Environments for Zero-Shot Generalization

在亚符号强化学习环境中为LTL任务建立符号接地以实现零样本泛化

Authors: Matteo Pannacci, Andrea Fanti, Elena Umili, Roberto Capobianco
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09761
Pdf link: https://arxiv.org/pdf/2602.09761
Abstract In this work we address the problem of training a Reinforcement Learning agent to follow multiple temporally-extended instructions expressed in Linear Temporal Logic in sub-symbolic environments. Previous multi-task work has mostly relied on knowledge of the mapping between raw observations and symbols appearing in the formulae. We drop this unrealistic assumption by jointly training a multi-task policy and a symbol grounder with the same experience. The symbol grounder is trained only from raw observations and sparse rewards via Neural Reward Machines in a semi-supervised fashion. Experiments on vision-based environments show that our method achieves performance comparable to using the true symbol grounding and significantly outperforms state-of-the-art methods for sub-symbolic environments.
中文摘要 本研究探讨如何在亚符号环境中训练强化学习智能体，使其遵循以线性时序逻辑（LTL）表达的多条时间扩展指令。以往的多任务工作大多依赖于已知原始观测与公式中符号之间的映射关系。我们摒弃这一不切实际的假设，利用同一批经验联合训练多任务策略和符号接地器（symbol grounder）。符号接地器仅依靠原始观测和稀疏奖励，通过神经奖励机（Neural Reward Machines）以半监督方式训练。在基于视觉的环境上的实验表明，我们的方法取得了与使用真实符号接地相当的性能，并显著优于面向亚符号环境的最先进方法。

Diverse Skill Discovery for Quadruped Robots via Unsupervised Learning

通过无监督学习实现四足机器人的多样化技能发现

Authors: Ruopeng Cui, Yifei Bi, Haojie Luo, Wei Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.09767
Pdf link: https://arxiv.org/pdf/2602.09767
Abstract Reinforcement learning necessitates meticulous reward shaping by specialists to elicit target behaviors, while imitation learning relies on costly task-specific data. In contrast, unsupervised skill discovery can potentially reduce these burdens by learning a diverse repertoire of useful skills driven by intrinsic motivation. However, existing methods exhibit two key limitations: they typically rely on a single policy to master a versatile repertoire of behaviors without modeling the shared structure or distinctions among them, which results in low learning efficiency; moreover, they are susceptible to reward hacking, where the reward signal increases and converges rapidly while the learned skills display insufficient actual diversity. In this work, we introduce an Orthogonal Mixture-of-Experts (OMoE) architecture that prevents diverse behaviors from collapsing into overlapping representations, enabling a single policy to master a wide spectrum of locomotion skills. In addition, we design a multi-discriminator framework in which different discriminators operate on distinct observation spaces, effectively mitigating reward hacking. We evaluated our method on the 12-DOF Unitree A1 quadruped robot, demonstrating a diverse set of locomotion skills. Our experiments demonstrate that the proposed framework boosts training efficiency and yields an 18.3% expansion in state-space coverage compared to the baseline.
中文摘要 强化学习需要专家细致地进行奖励塑形以诱导目标行为，而模仿学习则依赖于昂贵的任务特定数据。相比之下，无监督技能发现通过学习由内在动机驱动的多样化有用技能，有望减轻这些负担。然而，现有方法存在两个关键局限：它们通常依赖单一策略来掌握多样化的行为库，却没有对这些行为之间的共享结构或差异进行建模，导致学习效率较低；此外，它们容易出现奖励欺骗（reward hacking），即奖励信号快速上升并收敛，而所学技能的实际多样性不足。在本研究中，我们引入了一种正交专家混合（OMoE）架构，防止多样化行为坍缩为相互重叠的表征，使单一策略能够掌握广泛的运动技能。此外，我们设计了一个多判别器框架，让不同的判别器作用于不同的观测空间，有效缓解了奖励欺骗。我们在12自由度的Unitree A1四足机器人上评估了该方法，展示了多样化的运动技能。实验表明，所提框架提高了训练效率，并使状态空间覆盖率相比基线扩大了18.3%。

Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

从梯度保持视角看RLVR中的灵活熵控制

Authors: Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.09782
Pdf link: https://arxiv.org/pdf/2602.09782
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLM）推理能力的关键方法。然而，持续训练常导致策略熵崩溃，表现为熵迅速衰减，导致过早过度自信、输出多样性降低以及梯度范数消失，阻碍学习。保持梯度裁剪是影响这些动态的主要因素，但现有的缓解策略大多是静态的，缺乏将剪裁机制与精确熵控制连接起来的框架。本文提出了从梯度保持剪裁角度重塑强化学习中的熵控制。我们首先理论和实证地验证了特定重要性抽样比区域对熵增长和减少的贡献。基于这些发现，我们引入了一种利用动态截断阈值精确管理熵的新型调控机制。此外，我们设计并评估动态熵控制策略，包括增加后减少、减少-增加-减少和振荡衰减。实验结果表明，这些策略有效减轻熵坍缩，并在多个基准测试中实现更优表现。
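
The quantity being controlled here is the Shannon entropy of the policy's next-token distribution. The sketch below only illustrates what "entropy collapse" means numerically; it does not reproduce the paper's dynamic clipping mechanism, and the two distributions are made-up examples.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Early in training (hypothetical): the policy spreads mass over four tokens.
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])
# After collapse (hypothetical): nearly all mass sits on one token.
collapsed_entropy = policy_entropy([0.97, 0.01, 0.01, 0.01])
```

A uniform distribution over four tokens attains the maximum, ln 4, while the peaked distribution is far lower. A steady drift from the first regime to the second is the premature overconfidence the abstract describes.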

A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer

跨环境迁移下Double DQN与Dueling DQN的对照研究

Authors: Azka Nasir, Fatima Dossa, Muhammad Ahmed Atif, Mohammad Ahmed Atif
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09810
Pdf link: https://arxiv.org/pdf/2602.09810
Abstract Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
中文摘要 深度强化学习中的迁移学习通常以提升稳定性和降低训练成本为动机，但在显著的领域偏移下也可能失败。本文通过一项对照实证研究，考察双深度Q网络（DDQN）与Dueling DQN之间的架构差异如何影响跨环境的迁移行为。以CartPole为源任务、以结构上不同的LunarLander为目标任务，我们在相同的超参数和训练条件下评估了一种固定的逐层表征迁移协议，并以从零训练的基线智能体作为参照来衡量迁移效果。实证结果表明，在所考察的设置下，DDQN始终避免了负迁移，并保持了与目标环境中基线性能相当的学习动态。相比之下，Dueling DQN在相同条件下始终表现出负迁移，表现为奖励下降和优化行为不稳定。跨多个随机种子的统计分析证实，迁移条件下存在显著的性能差距。这些发现表明，在所考察的迁移协议下，对于基于价值的深度强化学习，架构归纳偏置与跨环境迁移的鲁棒性密切相关。
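
The two methods differ in kind. Double DQN changes how the bootstrap target is computed, while Dueling DQN changes the network head. A minimal sketch of both standard definitions is below, written over plain lists of Q-values; the function names and the toy numbers are illustrative, and nothing here reproduces the paper's transfer protocol.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# The online net prefers action 1, so the target net's value for action 1 is used.
target = double_dqn_target(1.0, [0.2, 0.9], [0.5, 0.3], gamma=0.5)
q_values = dueling_q_values(2.0, [1.0, -1.0])
```

Since the dueling head splits the final layers into separate value and advantage streams, a layer-wise transfer across tasks with different action spaces carries over a differently structured representation. That may be relevant to the architectural effect the abstract reports, although the abstract itself does not state a mechanism.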

Code2World: A GUI World Model via Renderable Code Generation

Code2World：通过可渲染代码生成的图形界面世界模型

Authors: Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.09856
Pdf link: https://arxiv.org/pdf/2602.09856
Abstract Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
中文摘要 自主GUI智能体通过感知界面和执行动作与环境交互。作为虚拟沙盒，GUI世界模型通过支持以动作为条件的预测，赋予智能体类人的预见能力。然而，现有基于文本和基于像素的方法难以同时实现高视觉保真度和细粒度的结构可控性。为此，我们提出了Code2World，一种视觉语言代码生成模型，通过生成可渲染的代码来模拟下一个视觉状态。具体来说，为了解决数据稀缺问题，我们将GUI轨迹转换为高保真HTML，并通过视觉反馈修订机制优化合成代码，从而构建了AndroidCode，得到一个包含超过8万个高质量屏幕-动作对的语料库。为了让现有VLM适应代码预测，我们首先进行SFT作为冷启动，使模型学会遵循格式和布局，然后进一步应用渲染感知强化学习，该方法以渲染结果作为奖励信号，强制保证视觉语义保真度和动作一致性。大量实验表明，Code2World-8B在下一UI预测上取得了最佳表现，可与具有竞争力的GPT-5和Gemini-3-Pro-Image相媲美。值得注意的是，Code2World能够以灵活的方式显著提升下游导航成功率，使Gemini-2.5-Flash在AndroidWorld导航上提升了9.5%。代码可在该 https URL 访问。

QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search

QP-OneModel：一个统一生成式大型语言模型，用于小红书搜索中的多任务查询理解

Authors: Jianzhao Huang, Xiaorui Huang, Fei Zhao, Yunpeng Liu, Hui Zhang, Fangcheng Shi, Congfeng Li, Zechen Sun, Yi Wu, Yao Hu, Yunhan Bai, Shaosheng Cao
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.09901
Pdf link: https://arxiv.org/pdf/2602.09901
Abstract Query Processing (QP) bridges user intent and content supply in large-scale Social Network Service (SNS) search engines. Traditional QP systems rely on pipelines of isolated discriminative models (e.g., BERT), suffering from limited semantic understanding and high maintenance overhead. While Large Language Models (LLMs) offer a potential solution, existing approaches often optimize sub-tasks in isolation, neglecting intrinsic semantic synergy and necessitating independent iterations. Moreover, standard generative methods often lack grounding in SNS scenarios, failing to bridge the gap between open-domain corpora and informal SNS linguistic patterns, while struggling to adhere to rigorous business definitions. We present QP-OneModel, a Unified Generative LLM for Multi-Task Query Understanding in the SNS domain. We reformulate heterogeneous sub-tasks into a unified sequence generation paradigm, adopting a progressive three-stage alignment strategy culminating in multi-reward Reinforcement Learning. Furthermore, QP-OneModel generates intent descriptions as a novel high-fidelity semantic signal, effectively augmenting downstream tasks such as query rewriting and ranking. Offline evaluations show QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with significant F1 boosts in NER (+9.01%) and Term Weighting (+9.31%). It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks. Fully deployed at Xiaohongshu, online A/B tests confirm its industrial value, optimizing retrieval relevance (DCG) by 0.21% and lifting user retention by 0.044%.
中文摘要 查询处理（QP）在大规模社交网络服务（SNS）搜索引擎中连接用户意图与内容供给。传统的QP系统依赖由相互孤立的判别式模型（如BERT）组成的流水线，语义理解能力有限且维护开销较高。虽然大型语言模型（LLM）提供了潜在的解决方案，但现有方法往往孤立地优化各个子任务，忽视了任务间内在的语义协同，并且需要各自独立迭代。此外，标准的生成式方法往往缺乏针对SNS场景的适配，未能弥合开放域语料与非正式SNS语言模式之间的鸿沟，同时难以遵循严格的业务定义。我们提出QP-OneModel，一个用于SNS领域多任务查询理解的统一生成式大型语言模型。我们将异构子任务重新表述为统一的序列生成范式，采用渐进式三阶段对齐策略，并以多奖励强化学习收尾。此外，QP-OneModel生成意图描述，作为一种新颖的高保真语义信号，有效增强了查询改写和排序等下游任务。离线评估显示，QP-OneModel相比判别式基线整体提升7.35%，在NER（+9.01%）和词权重（Term Weighting，+9.31%）上的F1显著提升。它还表现出更强的泛化能力，在未见过的任务上准确率比一个32B模型高出7.60%。该模型已在小红书全量部署，在线A/B测试证实了其工业价值：检索相关性（DCG）提升0.21%，用户留存率提升0.044%。

ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

ATTNPO：面向高效推理的注意力引导过程监督

Authors: Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Tingwen Liu, Weichong Yin, Yu Sun, Hua Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.09953
Pdf link: https://arxiv.org/pdf/2602.09953
Abstract Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
中文摘要 采用可验证奖励强化学习（RLVR）训练的大型推理模型在复杂推理任务中表现出色，但常常过度思考，产生冗余推理而没有带来性能提升。现有的轨迹级长度惩罚往往无法有效缩短推理长度，还会降低准确率，因为它们对所有推理步骤一视同仁，缺乏区分冗余与必要步骤的细粒度信号。与此同时，过程监督方法通常资源消耗大，且信用分配不准确。为解决这些问题，我们提出了ATTNPO，一种低开销的过程监督强化学习框架，利用模型内在的注意力信号进行步骤级信用分配。我们首先识别出一组特殊的注意力头，它们会自然地聚焦于关键步骤，同时抑制冗余步骤。利用这些注意力头的注意力分数，我们采用两个子策略：通过抑制冗余步骤来缓解过度思考，同时通过减轻对关键步骤的惩罚来保持准确率。实验结果显示，ATTNPO在9个基准上大幅缩短了推理长度，同时显著提升了性能。

SCOPE: A Training-Free Online 3D Deployment for UAV-BSs with Theoretical Analysis and Comparative Study

SCOPE：面向UAV-BS的免训练在线三维部署及其理论分析与比较研究

Authors: Chuan-Chi Lai
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.09971
Pdf link: https://arxiv.org/pdf/2602.09971
Abstract Unmanned Aerial Vehicle (UAV)-mounted Base Stations (UAV-BSs) offer a flexible solution for serving ground users in temporary hotspot scenarios. However, efficiently deploying UAV-BSs to satisfy heterogeneous user distributions remains a challenging optimization problem. While recent data-driven approaches, particularly Deep Reinforcement Learning (DRL), have shown promise in dynamic environments, they often suffer from prohibitive training overhead, poor generalization to topology changes, and high computational complexity. To address these limitations, this paper proposes Satisfaction-driven Coverage Optimization via Perimeter Extraction (SCOPE), a training-free and online 3D deployment framework. Unlike heuristic baselines that rely on fixed-altitude assumptions, SCOPE integrates a perimeter extraction mechanism with the Smallest Enclosing Circle (SEC) algorithm to dynamically optimize 3D UAV positions. Theoretically, we provide a rigorous convergence proof of the proposed algorithm and derive its polynomial time complexity of $O(N^2 \log N)$. Experimentally, we conduct a comprehensive comparative study against state-of-the-art DRL baselines (e.g., PPO). Simulation results demonstrate that SCOPE achieves comparable user satisfaction to DRL methods but significantly lower computational latency (milliseconds vs. hours of training) and superior energy efficiency, making it an ideal solution for real-time, on-demand emergency deployment.
中文摘要 搭载于无人机（UAV）的基站（UAV-BS）为在临时热点场景中服务地面用户提供了灵活的解决方案。然而，如何高效部署UAV-BS以满足异构的用户分布仍是一个具有挑战性的优化问题。虽然近期的数据驱动方法，尤其是深度强化学习（DRL），在动态环境中展现出潜力，但它们往往存在训练开销过高、对拓扑变化的泛化能力差以及计算复杂度高的问题。为解决这些局限性，本文提出了基于周界提取的满意度驱动覆盖优化（SCOPE），这是一种免训练的在线三维部署框架。与依赖固定高度假设的启发式基线不同，SCOPE将周界提取机制与最小包围圆（SEC）算法相结合，动态优化无人机的三维位置。在理论方面，我们给出了该算法的严格收敛性证明，并推导出其多项式时间复杂度为$O(N^2 \log N)$。在实验方面，我们与最先进的DRL基线（如PPO）进行了全面的比较研究。仿真结果表明，SCOPE取得了与DRL方法相当的用户满意度，但计算延迟显著更低（毫秒级，对比数小时的训练），且能效更优，是实时、按需应急部署的理想解决方案。

ORCHID: Fairness-Aware Orchestration in Mission-Critical Air-Ground Integrated Networks

ORCHID：关键任务空地综合网络中的公平意识编排

Authors: Chuan-Chi Lai, Chi Jai Choy
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.09994
Pdf link: https://arxiv.org/pdf/2602.09994
Abstract In the era of 6G Air-Ground Integrated Networks (AGINs), Unmanned Aerial Vehicles (UAVs) are pivotal for providing on-demand wireless coverage in mission-critical environments, such as post-disaster rescue operations. However, traditional Deep Reinforcement Learning (DRL) approaches for multi-UAV orchestration often face critical challenges: instability due to the non-stationarity of multi-agent environments and the difficulty of balancing energy efficiency with service equity. To address these issues, this paper proposes ORCHID (Orchestration of Resilient Coverage via Hybrid Intelligent Deployment), a novel stability-enhanced two-stage learning framework. First, ORCHID leverages a GBS-aware topology partitioning strategy to mitigate the exploration cold-start problem. Second, we introduce a Reset-and-Finetune (R\&F) mechanism within the MAPPO architecture that stabilizes the learning process via synchronized learning rate decay and optimizer state resetting. This mechanism effectively suppresses gradient variance to prevent policy degradation, thereby ensuring algorithmic resilience in dynamic environments. Furthermore, we uncover a counter-intuitive efficiency-fairness synergy: contrary to the conventional trade-off, our results demonstrate that the proposed Max-Min Fairness (MMF) design not only guarantees service for cell-edge users but also achieves superior energy efficiency compared to Proportional Fairness (PF), which tends to converge to suboptimal greedy equilibria. Extensive experiments confirm that ORCHID occupies a superior Pareto-dominant position compared to state-of-the-art baselines, ensuring robust convergence and resilient connectivity in mission-critical scenarios.
中文摘要 在6G空地一体化网络（AGIN）时代，无人机（UAV）对于在关键任务环境（如灾后救援行动）中提供按需无线覆盖至关重要。然而，传统的深度强化学习（DRL）多无人机编排方法常面临关键挑战：多智能体环境的非平稳性带来的不稳定性，以及在能源效率与服务公平之间取得平衡的困难。为解决这些问题，本文提出了ORCHID（Orchestration of Resilient Coverage via Hybrid Intelligent Deployment，通过混合智能部署编排弹性覆盖），这是一种新型的稳定性增强两阶段学习框架。首先，ORCHID利用GBS感知的拓扑分区策略来缓解探索冷启动问题。其次，我们在MAPPO架构中引入了重置与微调（R\&F）机制，通过同步学习率衰减和优化器状态重置来稳定学习过程。该机制有效抑制梯度方差，防止策略退化，从而确保算法在动态环境中的韧性。此外，我们发现了一个反直觉的效率-公平协同效应：与传统权衡相反，我们的结果表明，所提出的最大最小公平（MMF）设计不仅保证了对小区边缘用户的服务，还相较于容易收敛到次优贪婪均衡的比例公平（PF）实现了更优的能效。大量实验证实，ORCHID相较于最先进的基线处于帕累托占优地位，确保了关键任务场景下的稳健收敛和韧性连接。

ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

ESTAR：面向高效推理的早停词元感知推理

Authors: Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10004
Pdf link: https://arxiv.org/pdf/2602.10004
Abstract Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated signals, and (iii) -aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
中文摘要 大型推理模型（LRM）通过生成长思维链实现了最先进的性能，但在已经得到正确答案之后，往往会在冗余推理上浪费计算。我们提出了面向词元感知推理的早停方法（ESTAR），通过检测并减少此类推理冗余，在不牺牲准确性的前提下提升效率。我们的方法结合了（i）基于轨迹的分类器，用于识别何时可以安全地停止推理，（ii）监督微调，用于教会LRM提出自生成的信号，以及（iii）-aware强化学习，在自生成的停止点截断rollout，并采用计算感知的奖励。在四个推理数据集上的实验显示，ESTAR将推理长度缩短约3.7倍（从4,799降至1,290），同时保持准确率（74.9%对74.2%），且具有很强的跨域泛化能力。这些结果凸显了早停是提升LRM推理效率的一种简单而强大的机制。

Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

先回答，后推理：通过模式平衡强化学习对齐搜索相关性

Authors: Shijie Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang, Kevin Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10006
Pdf link: https://arxiv.org/pdf/2602.10006
Abstract Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel \textbf{Answer-First, Reason Later (AFRL)} paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to \textbf{mode collapse} in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective: RL inherently minimizes the \textbf{Reverse KL divergence}, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." On the other hand, SFT minimizes the \textbf{Forward KL divergence}, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a \textbf{Mode-Balanced Optimization} strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.
中文摘要 构建一个兼具低延迟和高性能的搜索相关性模型，是搜索行业长期面临的挑战。为了满足在线系统的毫秒级响应要求，同时保留大型语言模型（LLM）可解释的推理痕迹，我们提出了一种新的 \textbf{先回答，后推理（AFRL）}范式。该范式要求模型在第一个词元中输出确定的相关性分数，随后给出结构化的逻辑解释。受到推理模型成功的启发，我们采用了“监督微调（SFT）+ 强化学习（RL）”流程以实现AFRL。然而，直接应用现有强化学习训练常常导致搜索相关性任务中的 \textbf{模式崩溃}，模型在追求高奖励时遗忘复杂的长尾规则。从信息论角度看：强化学习本质上最小化\textbf{反向KL散度}，后者倾向于寻求概率峰值（模式寻求），且容易出现“奖励黑客”现象。另一方面，SFT最小化\textbf{前向KL散度}，迫使模型覆盖数据分布（模式覆盖），并有效锚定专家规则。基于这一见解，我们提出了一种\textbf{模式平衡优化}策略，将SFT辅助损失纳入Stepwise-GRPO训练中，以平衡这两种特性。此外，我们构建了自动化指令演化系统和多阶段课程，以确保数据质量达到专家水平。大量实验表明，我们的32B教师模型实现了最先进的性能。此外，AFRL架构实现了高效的知识蒸馏，成功将专家级逻辑迁移到0.6B模型中，从而兼顾推理深度与部署延迟。
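
The abstract above contrasts the mode-seeking behaviour of the reverse KL divergence with the mode-covering behaviour of the forward KL divergence. The toy calculation below illustrates that contrast on a three-state distribution. All numbers are made up for the example and have no connection to the paper's data: `target` stands for a bimodal data distribution, `covering` for a model that spreads mass everywhere, and `seeking` for a model that commits to one mode.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Assumed toy distributions over three outcomes.
target = [0.49, 0.02, 0.49]    # two modes separated by a low-probability state
covering = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over every outcome
seeking = [0.96, 0.02, 0.02]   # concentrates on a single mode

candidates = {"covering": covering, "seeking": seeking}

# Forward KL(target || model): heavily penalises a model that assigns
# little mass where the data has a lot, so it favours the covering model.
forward = {name: kl(target, q) for name, q in candidates.items()}

# Reverse KL(model || target): penalises a model that puts mass where the
# data has little, so it favours the model that locks onto one mode.
reverse = {name: kl(q, target) for name, q in candidates.items()}

best_forward = min(forward, key=forward.get)
best_reverse = min(reverse, key=reverse.get)
print(f"forward KL prefers: {best_forward}, reverse KL prefers: {best_reverse}")
```

On these toy numbers the forward divergence ranks the covering model first and the reverse divergence ranks the single-mode model first, which is the asymmetry the abstract uses to motivate combining an SFT loss with RL training.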

A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging

用于拥堵匝道汇入场景中CAV安全高效变道的协作式安全护盾

Authors: Bharathkumar Hegde, Melanie Bouroche
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.10007
Pdf link: https://arxiv.org/pdf/2602.10007
Abstract Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi-agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state-of-the-art Multi-Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL-MASS, and evaluate it in a congested on-ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL-MASS effectively balances the trade-off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL-MASS is available with an open-source licence at this https URL
中文摘要 在密集交通中变道是网联自动驾驶车辆（CAV）面临的重大挑战。现有的变道控制器主要要么确保安全，要么协同提升交通效率，但并未同时考虑这两个相互冲突的目标。为此，我们提出了多智能体安全护盾（MASS），它基于控制障碍函数（CBF）设计，以实现安全且协作的变道。MASS利用一种简单算法将交互拓扑构建为图，以此捕捉CAV之间的多智能体交互，从而实现协作。此外，我们通过集成MASS来确保安全，并定义定制的奖励函数以优先提升效率，从而扩展了一种最先进的多智能体强化学习（MARL）变道控制器。由此，我们提出了一种称为MARL-MASS的变道控制器，并在拥堵的匝道汇入仿真中对其进行评估。结果表明，MASS通过严格遵守安全约束，实现了具有安全保证的协同变道。此外，所提出的定制奖励函数提升了使用安全护盾训练的MARL策略的稳定性。总体而言，通过在遵守安全约束的同时鼓励探索协作式变道策略，MARL-MASS在拥堵交通中有效平衡了保障安全与提升交通效率之间的权衡。MARL-MASS 的代码以开源许可证形式发布于 this https URL

ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

ADORA：基于强化学习的动态优势估计训练推理模型

Authors: Qingnan Ren, Shiting Huang, Zhen Fang, Zehui Chen, Lin Chen, Lijun Li, Feng Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10019
Pdf link: https://arxiv.org/pdf/2602.10019
Abstract Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce \textbf{ADORA} (\textbf{A}dvantage \textbf{D}ynamics via \textbf{O}nline \textbf{R}ollout \textbf{A}daptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function's weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be seamlessly integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.
中文摘要 强化学习已成为在复杂任务中开发推理模型的基石技术，涵盖从数学问题求解到想象推理等领域。这些模型的优化通常依赖于策略梯度方法，其有效性取决于优势函数的准确估计。然而，主流方法通常采用静态优势估计，这种做法忽视了训练样本随时间变化的动态效用，导致信用分配效率低下。这一限制导致策略更新欠佳，进而表现为收敛速度变慢和学习不稳定性增加，因为模型无法有效适应不断演变的样本效用。为解决这个问题，我们引入了 \textbf{ADORA}（\textbf{A}dvantage \textbf{D}ynamics via \textbf{O}nline \textbf{R}ollout \textbf{A}daptation），这是一种用于策略优化的新框架。ADORA根据训练数据在在线模型rollout过程中不断演变的效用，自适应地将其划分为暂时有利和暂时不利的样本，从而动态调整优势函数的权重。这种量身定制的数据区分策略使ADORA能够无缝集成到现有策略优化算法中，无需重大架构修改，使策略能够优先从信息量更大的经验中学习，从而实现更高效的策略更新。在不同模型家族和不同数据规模上的广泛评估表明，ADORA是一个稳健高效的框架。它显著提升了几何和数学任务中的长推理能力，持续实现显著的性能提升，且无需敏感的超参数调优。

Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure

节点故障下动态三维无人机网络的弹性拓扑感知协调

Authors: Chuan-Chi Lai
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.10029
Pdf link: https://arxiv.org/pdf/2602.10029
Abstract In 3D Aerial-Ground Integrated Networks (AGINs), ensuring continuous service coverage under unexpected hardware failures is critical for mission-critical applications. While Multi-Agent Reinforcement Learning (MARL) has shown promise in autonomous coordination, its resilience under sudden node failures remains a challenge due to dynamic topology deformation. This paper proposes a Topology-Aware Graph MAPPO (TAG-MAPPO) framework designed to enhance system survivability through autonomous 3D spatial reconfiguration. Our framework incorporates graph-based feature aggregation with a residual ego-state fusion mechanism to capture intricate inter-agent dependencies. This architecture enables the surviving swarm to rapidly adapt its topology compared to conventional Multi-Layer Perceptron (MLP) based approaches. Extensive simulations across heterogeneous environments, ranging from interference-limited Crowded Urban to sparse Rural areas, validate the proposed approach. The results demonstrate that TAG-MAPPO consistently outperforms baselines in both stability and efficiency; specifically, it reduces redundant handoffs by up to 50 percent while maintaining a lead in energy efficiency. Most notably, the framework exhibits exceptional self-healing capabilities following a catastrophic node failure. TAG-MAPPO restores over 90 percent of the pre-failure service coverage within 15 time steps, exhibiting a significantly faster V-shaped recovery trajectory than MLP baselines. Furthermore, in dense urban scenarios, the framework achieves a post-failure Jain's Fairness Index that even surpasses its original four-UAV configuration by effectively resolving service overlaps. These findings suggest that topology-aware coordination is essential for the realization of resilient 6G aerial networks and provides a robust foundation for adaptive deployments in volatile environments.
中文摘要 在三维空地一体化网络（AGIN）中，确保在意外硬件故障下持续提供服务覆盖对于关键任务应用至关重要。尽管多智能体强化学习（MARL）在自主协调方面展现出潜力，但由于动态拓扑变形，其在突发节点故障下的韧性仍是一大挑战。本文提出了一种拓扑感知图MAPPO（TAG-MAPPO）框架，旨在通过自主三维空间重构提升系统生存能力。我们的框架结合了基于图的特征聚合和残差自我状态融合机制，以捕捉复杂的智能体间依赖关系。与传统的基于多层感知器（MLP）的方法相比，该架构使幸存的无人机群能够快速调整其拓扑结构。在异构环境中进行的大量仿真，涵盖从干扰受限的拥挤城市到稀疏的农村地区，验证了该方法。结果表明，TAG-MAPPO在稳定性和效率方面始终优于基线；具体来说，它在能效方面保持领先优势，同时将冗余切换减少多达50%。最显著的是，该框架在节点灾难性故障后展现出卓越的自愈能力。TAG-MAPPO在15个时间步内恢复了超过90%的故障前服务覆盖，其V形恢复轨迹明显快于MLP基线。此外，在密集城市场景中，该框架通过有效消除服务重叠，实现的故障后Jain公平指数甚至超过了最初的四无人机配置。这些发现表明，拓扑感知协调对于实现韧性6G空中网络至关重要，并为在多变环境中的自适应部署奠定了坚实基础。
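
The abstract above reports results in terms of Jain's Fairness Index. For readers unfamiliar with the metric, the sketch below computes it from its standard definition, J(x) = (Σ x_i)² / (n · Σ x_i²), which ranges from 1/n (one user gets everything) to 1 (all users get the same). The sample per-user rates are invented for illustration and are not taken from the paper.

```python
def jain_index(allocations):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    values = list(allocations)
    if not values:
        raise ValueError("at least one allocation is required")
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    if sum_of_squares == 0:
        raise ValueError("index is undefined when every allocation is zero")
    return (total * total) / (len(values) * sum_of_squares)


# Hypothetical per-user throughputs (arbitrary units).
equal_share = [1.0, 1.0, 1.0, 1.0]   # perfectly even service
single_user = [4.0, 0.0, 0.0, 0.0]   # one user receives everything

print(jain_index(equal_share))   # 1.0
print(jain_index(single_user))   # 0.25, i.e. 1/n for n = 4
```

Because the index depends only on how evenly the allocations are spread, removing overlapping coverage can raise it even when fewer UAVs remain, which is consistent with the post-failure result the abstract describes.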

Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

Fake-HR1：重新思考合成图像检测中视觉语言模型的推理

Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10042
Pdf link: https://arxiv.org/pdf/2602.10042
Abstract Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
中文摘要 最新研究表明，将思维链（CoT）推理融入检测过程可以增强模型检测合成图像的能力。然而，过长的推理会带来大量资源开销，包括词元消耗和延迟，而在处理明显由生成模型伪造的图像时，这些开销尤其多余。为解决这个问题，我们提出了Fake-HR1，一种大规模混合推理模型，据我们所知，它是首个能够根据生成检测任务特性自适应判断推理是否必要的模型。为此，我们设计了一个两阶段训练框架：首先进行混合微调（HFT）以完成冷启动初始化，随后通过混合推理分组策略优化（HGRPO）进行在线强化学习，隐式学习何时选择合适的推理模式。实验结果显示，Fake-HR1能够在不同类型的查询中自适应地进行推理，在推理能力和生成检测性能上均超越现有LLM，同时显著提升了响应效率。

Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

乐观世界模型：基于模型的深度强化学习中的高效探索

Authors: Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin, Dileep Kalathil, P. R. Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.10044
Pdf link: https://arxiv.org/pdf/2602.10044
Abstract Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
中文摘要 高效的探索仍然是强化学习（RL）中的核心挑战，尤其是在奖励稀疏的环境中。我们引入乐观世界模型（OWMs），这是一个原则性且可扩展的乐观探索框架，将经典的奖励偏倚最大似然估计（RBMLE）从自适应控制带入深度强化学习。与上置信界（UCB）式探索方法不同，OWMs通过增强乐观动态，直接将乐观性融入模型学习，从而偏向想象中向更高回报结果的转变。这种完全基于梯度的损耗既不需要不确定性估计，也不需要约束优化。我们的方法是即插即用，结合现有的世界模型框架，保持可扩展性，同时只需对标准训练程序进行最小修改。我们将OWM实例化为两种最先进的世界模型架构，最终推出了Optimistic DreamerV3和Optimistic STORM，这些模型在样本效率和累计回报方面相较基线版本显著提升。

Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

通过细粒度群策略优化实现长链思维压缩

Authors: Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10048
Pdf link: https://arxiv.org/pdf/2602.10048
Abstract Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose \textbf{F}ine-grained \textbf{G}roup policy \textbf{O}ptimization (\textbf{FGO}), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of the GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
中文摘要 大型语言模型（LLM）常常产生不必要的冗长思维链（CoT）推理，增加计算成本和延迟，但性能却无法成比例提升。本文提出 \textbf{F}ine-grained \textbf{G}roup policy \textbf{O}ptimization（\textbf{FGO}，细粒度群策略优化），这是一种强化学习（RL）算法，通过细分组内响应并根据长度和熵分配适当权重来优化这些响应，从而实现有效的 CoT 压缩。与此同时，作为群相对策略优化（GRPO）的增强变体，FGO成功解决了GRPO的两个主要局限：低效的数据利用和熵坍缩。我们在多个推理大型语言模型和基准测试上评估FGO，包括MATH500、AIME24、AMC23和Minerva。实验结果表明，FGO在不降低性能的情况下实现了高效的CoT压缩，同时解决了GRPO的关键局限。

Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

特征即奖励：通过可解释性实现开放式任务的可扩展监督

Authors: Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10067
Pdf link: https://arxiv.org/pdf/2602.10067
Abstract Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
中文摘要 经过大规模数据集训练的语言模型已被证明能够学习编码抽象概念（如事实性或意图）的特征。这些特征传统上用于测试时的监控或引导。我们提出了另一种用途：将特征作为开放式任务的可扩展监督信号。我们将减少幻觉视为一种理想但开放式的行为，设计了一个名为RLFR（基于特征奖励的强化学习）的强化学习（RL）流程，该流程将特征用作奖励函数。基于一种能够识别候选幻觉陈述的新颖探测框架，我们的流程教会模型在不确定其补全内容的事实性时进行干预并加以纠正。此外，该流程支持可扩展的测试时计算，同样由我们的奖励特征引导。在Gemma-3-12B-IT上实现的这一端到端流程所得到的策略，与原始模型相比出现幻觉的可能性降低了58%，同时在标准基准测试上保持了性能。综上所述，通过将监督建立在特征的语言之上，本文为利用可解释性学习开放式任务引入了一种新范式。

Anagent For Enhancing Scientific Table & Figure Analysis

Anagent：增强科学表格与图表分析

Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10081
Pdf link: https://arxiv.org/pdf/2602.10081
Abstract In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table \& figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring $63,178$ instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table \& figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to $\uparrow 13.43\%$ in training-free settings and $\uparrow 42.12\%$ with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table \& figure analysis. Our project page: this https URL.
中文摘要 在科学研究中，分析需要准确解读复杂的多模态知识，整合来自不同来源的证据，并基于特定领域的知识做出推断。然而，当前的人工智能（AI）系统难以稳定地展现这些能力。科学表格和图表的复杂性和多变性，加上结构异构和长上下文需求，构成了科学表格与图表分析的根本障碍。为了量化这些挑战，我们引入了AnaBench，这是一个大规模基准，包含来自九个科学领域的$63,178$个实例，并按七个复杂度维度进行了系统分类。为应对这些挑战，我们提出了Anagent，这是一个多智能体框架，通过四个专业智能体增强科学表格与图表分析：Planner将任务分解为可执行的子任务，Expert通过有针对性的工具执行获取任务特定信息，Solver综合信息生成连贯的分析，Critic通过五维质量评估进行迭代优化。我们进一步开发了模块化训练策略，利用监督微调和专门的强化学习，优化各智能体的能力，同时保持有效协作。涵盖170个子领域的全面评估表明，Anagent实现了显著提升，在无训练设置下最高可达$\uparrow 13.43\%$，微调后最高可达$\uparrow 42.12\%$，同时表明任务导向推理和上下文感知的问题解决对于高质量的科学表格与图表分析至关重要。我们的项目页面：this https URL。

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

CODE-SHARP：以层级奖励程序形式持续开放式地发现与演化技能

Authors: Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10085
Pdf link: https://arxiv.org/pdf/2602.10085
Abstract Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over $134$% on average. We will open-source our code and provide additional videos $\href{this https URL}{here}$.
中文摘要 开发能够开放式地发现和学习新技能的智能体，是人工智能领域的一项重大挑战。虽然强化学习为训练智能体掌握复杂技能提供了强大的框架，但它通常依赖于手工设计的奖励函数。对于开放式技能发现来说，这不可行，因为有意义的技能集合无法先验获知。尽管近期方法在自动化奖励函数设计方面取得了有前景的成果，但它们仍局限于针对预定义任务优化奖励。为解决这一限制，我们引入了作为层级奖励程序的持续开放式技能发现与演化（CODE-SHARP），这是一个利用基础模型（FM）以开放式方式扩展和完善层级技能档案库的新框架，该档案库被组织为由代码形式的可执行奖励函数构成的有向图。我们展示了，一个仅使用所发现的SHARP技能生成的奖励进行训练的目标条件智能体，能够在Craftax环境中学会解决视野越来越长的目标。当由基于FM的高层规划器进行组合时，所发现的技能使单一的目标条件智能体能够解决复杂的长视野任务，平均比预训练智能体和任务专属专家策略高出超过$134$%。我们将开源代码并提供更多视频 $\href{this https URL}{here}$。

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

代理世界模型：用于代理强化学习的无限合成环境

Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10090
Pdf link: https://arxiv.org/pdf/2602.10090
Abstract Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at this https URL.
中文摘要 大型语言模型（LLM）的最新进展使自主智能体能够执行需要多回合与工具和环境交互的复杂任务。然而，由于缺乏多样且可靠的环境，这种智能体培训的规模化受到限制。本文提出了代理世界模型（Agent World Model，AWM），这是一个完全合成的环境生成流水线。利用该流水线，我们扩展到覆盖日常场景的1000个环境，代理可以与丰富的工具集（平均每个环境35个工具）交互，并获得高质量的观测。值得注意的是，这些环境由代码驱动并由数据库支持，提供比大型语言模型模拟环境更可靠、更一致的状态转换。此外，它们比从现实环境中收集轨迹更高效地与代理互动。为证明该资源的有效性，我们对多回合工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态，我们还能设计出可靠的奖励函数。三个基准测试的实验表明，仅在合成环境中训练，而非基准特有环境，能产生强烈的分布外泛化。代码可在该 https URL 访问。

Keyword: diffusion policy

CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

CausalGDP：用于强化学习的因果引导扩散策略

Authors: Xiaofeng Xiao, Xiao Hu, Yang Ye, Xubo Yue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.09207
Pdf link: https://arxiv.org/pdf/2602.09207
Abstract Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns. In this paper, we propose Causality-guided Diffusion Policy (CausalGDP), a unified framework that integrates causal reasoning into diffusion-based RL. CausalGDP first learns a base diffusion policy and an initial causal dynamical model from offline data, capturing causal dependencies among states, actions, and rewards. During real-time interaction, the causal information is continuously updated and incorporated as a guidance signal to steer the diffusion process toward actions that causally influence future states and rewards. By explicitly considering causality beyond association, CausalGDP focuses policy optimization on action components that genuinely drive performance improvements. Experimental results demonstrate that CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.
中文摘要 强化学习（RL）在多种序列决策问题中取得了显著成功。近期基于扩散的策略通过建模复杂、高维的动作分布，进一步提升了强化学习。然而，现有的扩散策略主要依赖统计关联，未能明确考虑状态、动作和奖励之间的因果关系，限制了它们识别哪些动作分量真正带来高回报的能力。本文提出了因果引导扩散策略（CausalGDP），这是一个将因果推理整合进基于扩散的强化学习的统一框架。CausalGDP首先从离线数据中学习基础扩散策略和初始因果动力学模型，捕捉状态、动作和奖励之间的因果依赖关系。在实时交互过程中，因果信息会不断更新并作为引导信号，引导扩散过程朝向对未来状态和奖励产生因果影响的动作。通过明确考虑超越关联的因果关系，CausalGDP将策略优化的重点放在真正推动性能提升的动作分量上。实验结果表明，与最先进的基于扩散的方法和离线强化学习方法相比，CausalGDP始终取得相当或更优的性能，尤其是在复杂的高维控制任务中。

Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation

面向可变形物体操作的偏好对齐视觉运动扩散策略

Authors: Marco Moletta, Michael C. Welle, Danica Kragic
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.09583
Pdf link: https://arxiv.org/pdf/2602.09583
Abstract Humans naturally develop preferences for how manipulation tasks should be performed, which are often subtle, personal, and difficult to articulate. Although it is important for robots to account for these preferences to increase personalization and user satisfaction, they remain largely underexplored in robotic manipulation, particularly in the context of deformable objects like garments and fabrics. In this work, we study how to adapt pretrained visuomotor diffusion policies to reflect preferred behaviors using limited demonstrations. We introduce RKO, a novel preference-alignment method that combines the benefits of two recent frameworks: RPO and KTO. We evaluate RKO against common preference learning frameworks, including these two, as well as a baseline vanilla diffusion policy, on real-world cloth-folding tasks spanning multiple garments and preference settings. We show that preference-aligned policies (particularly RKO) achieve superior performance and sample efficiency compared to standard diffusion policy fine-tuning. These results highlight the importance and feasibility of structured preference learning for scaling personalized robot behavior in complex deformable object manipulation tasks.
中文摘要 人类自然会对操作任务的执行方式产生偏好，这些偏好往往微妙、个人化且难以表达。尽管让机器人考虑这些偏好对于提升个性化和用户满意度很重要，但在机器人操作中，尤其是在服装和织物等可变形物体的场景下，这些偏好在很大程度上仍未得到充分探索。在这项工作中，我们研究如何利用有限的演示来调整预训练的视觉运动扩散策略，以反映偏好的行为。我们提出了RKO，一种新颖的偏好对齐方法，它结合了两个最新框架RPO和KTO的优势。我们在涵盖多种服装和偏好设置的真实布料折叠任务上，将RKO与常见的偏好学习框架（包括上述两种）以及一个基础的原始扩散策略基线进行了比较评估。我们证明，偏好对齐策略（尤其是RKO）相较于标准的扩散策略微调，在性能和样本效率上更优。这些结果凸显了结构化偏好学习对于在复杂的可变形物体操作任务中扩展个性化机器人行为的重要性和可行性。