Arxiv Papers of Today

Generated: 2026-02-17 16:53:24 (UTC+8); Arxiv release time: 2026-02-17 20:00 EST (2026-02-18 09:00 UTC+8)

77 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning-Enabled Dynamic Code Assignment for Ultra-Dense IoT Networks: A NOMA-Based Approach to Massive Device Connectivity

基于NOMA的大规模设备连接方法强化学习驱动动态代码赋值：基于NOMA的大规模设备连接方法

Authors: Sumita Majhi, Kishan Thakkar, Pinaki Mitra
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.13205
Pdf link: https://arxiv.org/pdf/2602.13205
Abstract Ultra-dense IoT networks require an effective non-orthogonal multiple access (NOMA) scheme, yet fixed code assignment exposes them to intense interference. We propose a reinforcement learning (RL) model for dynamic Gold code assignment in IoT-NOMA networks. Our IoT-aware Markov Decision Process jointly optimizes throughput, energy efficiency, and fairness. We develop two RL algorithms: Natural Policy Gradient (NPG), which learns stable discrete actions, and Deep Deterministic Policy Gradient (DDPG) with continuous code embedding. Under smart city conditions, NPG improves throughput by 11.6% and energy efficiency by 15.8% over static allocation. However, performance is worse in organized industrial settings, and reliability is minimal (0-2%), which indicates that dynamic code assignment alone is not sufficient for ultra-reliable IoT and needs to be supplemented by power control or retransmission schemes. The work offers a basis for RL-based resource allocation in massive IoT networks.
中文摘要 超高密度物联网网络需要有效的非正交多重访问（NOMA）方案，但由于固定编码分配，它们会受到强烈干扰。我们建议在物联网-NOMA网络中采用强化学习（RL）模型来进行动态金码分配。我们的马尔可夫决策流程是物联网感知的联合优化，涵盖吞吐量、能源效率和公平性。创建了两种强化学习算法，包括用于学习稳定离散动作的自然策略梯度（NPG）和带有连续代码嵌入的深度确定性策略梯度（DDPG）。在智慧城市条件下，NPG可实现11.6%的吞吐量和15.8%的能效，同时优于静态分配时的性能。然而，在有组织的工业环境中性能较差，可靠性也极低（0-2%），这表明动态码分配不足以衡量超可靠的物联网，需要通过功率控制或重传方案来补充。该工作为大规模物联网网络中基于强化学习的资源分配提供了基础。
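
The abstract above describes a joint objective over throughput, energy efficiency, and fairness but does not give its form. As a rough illustration only, the sketch below scalarizes the three terms with a weighted sum and measures fairness with Jain's index. The weighted-sum form, the choice of Jain's index, and the names `jain_fairness` and `scalar_reward` are all assumptions made for this example, not the authors' formulation.

```python
def jain_fairness(rates):
    """Jain's fairness index: 1.0 for equal rates, 1/n when one device gets everything."""
    total = sum(rates)
    squares = sum(r * r for r in rates)
    if squares == 0:
        return 0.0  # no device transmitted anything; treat as unfair
    return total * total / (len(rates) * squares)


def scalar_reward(throughput, energy_efficiency, rates, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted-sum reward over the three objectives (weights are assumed)."""
    w_t, w_e, w_f = weights
    return w_t * throughput + w_e * energy_efficiency + w_f * jain_fairness(rates)
```

With equal per-device rates the fairness term is at its maximum of 1, so under this assumed reward a code assignment that starves some devices is penalized even if total throughput is unchanged.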

A Safety-Constrained Reinforcement Learning Framework for Reliable Wireless Autonomy

一个安全受限的强化学习框架，实现可靠的无线自主

Authors: Abdikarim Mohamed Ibrahim, Rosdiadee Nordin
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13207
Pdf link: https://arxiv.org/pdf/2602.13207
Abstract Artificial intelligence (AI) and reinforcement learning (RL) have shown significant promise in wireless systems, enabling dynamic spectrum allocation, traffic management, and large-scale Internet of Things (IoT) coordination. However, their deployment in mission-critical applications introduces the risk of unsafe emergent behaviors, such as UAV collisions, denial-of-service events, or instability in vehicular networks. Existing safety mechanisms are predominantly reactive, relying on anomaly detection or fallback controllers that intervene only after unsafe actions occur, which cannot guarantee reliability in ultra-reliable low-latency communication (URLLC) settings. In this work, we propose a proactive safety-constrained RL framework that integrates proof-carrying control (PCC) with empowerment-budgeted (EB) enforcement. Each agent action is verified through lightweight mathematical certificates to ensure compliance with interference constraints, while empowerment budgets regulate the frequency of safety overrides to balance safety and autonomy. We implement this framework on a wireless uplink scheduling task using Proximal Policy Optimization (PPO). Simulation results demonstrate that the proposed PCC+EB controller eliminates unsafe transmissions while preserving system throughput and predictable autonomy. Compared with unconstrained and reactive baselines, our method achieves provable safety guarantees with minimal performance degradation. These results highlight the potential of proactive safety constrained RL to enable trustworthy wireless autonomy in future 6G networks.
中文摘要 人工智能（AI）和强化学习（RL）在无线系统中展现出显著潜力，支持动态频谱分配、流量管理和大规模物联网（IoT）协调。然而，在关键任务应用中的部署带来了不安全突发行为的风险，如无人机碰撞、拒绝服务事件或车载网络不稳定。现有的安全机制主要是被动的，依赖异常检测或后备控制器，后者仅在发生不安全作后介入，无法保证在超可靠低延迟通信（URLLC）环境中的可靠性。在本研究中，我们提出了一个主动且受安全限制的强化学习框架，将证明携带控制（PCC）与赋权预算（EB）执法整合。每个智能体的行为都通过轻量级数学证书进行验证，以确保符合干扰约束，同时赋权预算调节安全覆盖的频率，以平衡安全与自主性。我们将该框架应用于无线上行调度任务，使用近端策略优化（PPO）。仿真结果表明，所提议的PCC+EB控制器消除了不安全的传输，同时保持了系统吞吐量和可预测的自主性。与无约束和反应基线相比，我们的方法实现了可验证的安全保证，且性能下降极小。这些结果凸显了前瞻性安全受限强化学习在未来6G网络中实现可信无线自治的潜力。

Large Language Model (LLM)-enabled Reinforcement Learning for Wireless Network Optimization

大型语言模型（LLM）支持的强化学习用于无线网络优化

Authors: Jie Zheng, Ruichen Zhang, Dusit Niyato, Haijun Zhang, Jiacheng Wang, Hongyang Du, Jiawen Kang, Zehui Xiong
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13210
Pdf link: https://arxiv.org/pdf/2602.13210
Abstract Enhancing future wireless networks presents a significant challenge for networking systems due to diverse user demands and the emergence of 6G technology. While reinforcement learning (RL) is a powerful framework, it often encounters difficulties with high-dimensional state spaces and complex environments, leading to substantial computational demands, distributed intelligence, and potentially inconsistent outcomes. Large language models (LLMs), with their extensive pretrained knowledge and advanced reasoning capabilities, offer promising tools to enhance RL in optimizing 6G wireless networks. We explore RL models augmented by LLMs, emphasizing their roles and the potential benefits of their synergy in wireless network optimization. We then examine LLM-enabled RL across various protocol layers: physical, data link, network, transport, and application layers. Additionally, we propose an LLM-assisted state representation and semantic extraction to enhance the multi-agent reinforcement learning (MARL) framework. This approach is applied to service migration and request routing, as well as topology graph generation in unmanned aerial vehicle (UAV)-satellite networks. Through case studies, we demonstrate that our framework effectively optimizes wireless networks. Finally, we outline prospective research directions for LLM-enabled RL in wireless network optimization.
中文摘要 由于用户需求多样和6G技术的兴起，增强未来无线网络对网络系统构成了重大挑战。虽然强化学习（RL）是一个强大的框架，但它常常在高维状态空间和复杂环境中遇到困难，导致计算需求巨大、分布式智能以及可能出现不一致的结果。大型语言模型（LLMs）凭借其丰富的预训练知识和高级推理能力，为优化6G无线网络提供了有前景的工具。我们探讨了由大型语言模型（LLM）辅助的强化学习模型，强调它们在无线网络优化中的作用及其协同效应的潜在益处。随后，我们考察了支持LLM的强化学习在物理层、数据链路层、网络层、传输层和应用层的多个协议层。此外，我们提出了一种LLM辅助的状态表示和语义提取，以增强多智能体强化学习（MARL）框架。该方法应用于服务迁移和请求路由，以及无人机（UAV）卫星网络中的拓扑图生成。通过案例研究，我们证明了我们的框架能够有效优化无线网络。最后，我们概述了基于LLM的强化学习在无线网络优化中的前瞻性研究方向。

An Overlay Multicast Routing Method Based on Network Situational Awareness and Hierarchical Multi-Agent Reinforcement Learning

一种基于网络态势感知和分层多智能体强化学习的叠加多播路由方法

Authors: Miao Ye, Yanye Chen, Yong Wang, Cheng Zhu, Qiuxiang Jiang, Gai Huang, Feng Ding
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13211
Pdf link: https://arxiv.org/pdf/2602.13211
Abstract Compared with IP multicast, Overlay Multicast (OM) offers better compatibility and flexible deployment in heterogeneous, cross-domain networks. However, traditional OM struggles to adapt to dynamic traffic due to unawareness of physical resource states, and existing reinforcement learning methods fail to decouple OM's tightly coupled multi-objective nature, leading to high complexity, slow convergence, and instability. To address this, we propose MA-DHRL-OM, a multi-agent deep hierarchical reinforcement learning approach. Using SDN's global view, it builds a traffic-aware model for OM path planning. The method decomposes OM tree construction into two stages via hierarchical agents, reducing action space and improving convergence stability. Multi-agent collaboration balances multi-objective optimization while enhancing scalability and adaptability. Experiments show MA-DHRL-OM outperforms existing methods in delay, bandwidth utilization, and packet loss, with more stable convergence and flexible routing.
中文摘要 与IP多播相比，叠加多播（OM）在异构跨域网络中提供了更好的兼容性和灵活部署。然而，传统OM由于对物理资源状态的无知，难以适应动态流量，现有强化学习方法未能解耦OM紧耦的多目标特性，导致复杂度高、收敛缓慢和不稳定。为此，我们提出了MA-DHRL-OM，一种多智能体深度层级强化学习方法。利用 SDN 的全局视图，它构建了一个交通感知模型，用于 OM 路径规划。该方法通过分层代理将OM树构建分解为两个阶段，减少动作空间并提升收敛稳定性。多智能体协作在提升可扩展性和适应性的同时，平衡了多目标优化。实验显示，MA-DHRL-OM在延迟、带宽利用率和丢包方面优于现有方法，收敛更稳定，路由更灵活。

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

缩放逻辑的尺度化：逻辑推理的能动元综合

Authors: Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2602.13218
Pdf link: https://arxiv.org/pdf/2602.13218
Abstract Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator--Validator program pairs in a closed Generate--Validate--Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
中文摘要 可验证训练信号的规模化仍然是可验证奖励强化学习（RLVR）的关键瓶颈。逻辑推理是一种自然的基础：约束是形式化的，答案则可通过程序验证。然而，以往的综合流程要么依赖专家编写的代码，要么运行在固定的模板/骨架内，这在很大程度上限制了增长在实例层面的扰动。我们提出了SSLogic，一种能动元综合框架，通过迭代合成和修复可执行的生成器-验证器程序对，在封闭的生成-验证-修复循环中实现任务族层面的扩展，实现可控难度的连续家族演化。为确保可靠性，我们引入了多门验证协议，结合多策略一致性检查与对抗盲评，独立代理需编写并执行代码以过滤歧义或不当任务来解决实例。从400个种子家族开始，经过两轮演化，扩展到953个家族和21,389个可验证实例（从5,718个减少）。在SSLogic进化数据上训练时，相较种子基线在匹配训练步长上获得持续提升，SynLogic提升+5.2，BBEH提升+1.4，AIME25提升+3.0，Brumo25提升+3.7。
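
The closed Generate-Validate-Repair loop mentioned in the abstract above can be pictured with a small control-flow sketch. This is only an illustration of the loop shape: the callables `generate`, `validate`, and `repair`, the `max_repairs` limit, and the function name `evolve_family` are hypothetical and are not taken from SSLogic.

```python
def evolve_family(generate, validate, repair, max_repairs=3):
    """Return a candidate that passes validation, or None if repairs run out.

    generate() -> candidate
    validate(candidate) -> (ok: bool, feedback)
    repair(candidate, feedback) -> new candidate
    """
    candidate = generate()
    for attempt in range(max_repairs + 1):
        ok, feedback = validate(candidate)
        if ok:
            return candidate
        if attempt < max_repairs:
            # Feed the validator's feedback back into the repair step.
            candidate = repair(candidate, feedback)
    return None  # rejected: still failing after the allowed repairs
```

In a real pipeline the candidate would be a generator-validator program pair and validation would involve several gates; here each piece is reduced to a plain callable so that the loop itself stays visible.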

Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Lang2Act：通过自发语言工具链实现细粒度视觉推理

Authors: Yuqi Xiong, Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.13235
Pdf link: https://arxiv.org/pdf/2602.13235
Abstract Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at this https URL.
中文摘要 视觉检索增强生成（VRAG）通过整合外部视觉文档来应对特定查询，增强视觉语言模型（VLM）。现有的VRAG框架通常依赖僵化的预定义外部工具来扩展VLM的感知能力，通常通过明确将视觉感知与后续推理过程分离。然而，这种解耦设计可能导致不必要的视觉信息丢失，尤其是在应用基于图像的作如裁剪时。本文提出了Lang2Act，通过自发语言工具链实现细粒度的视觉感知和推理。Lang2Act 不依赖固定的外部引擎，而是收集自发动作作为语言工具，并利用它们提升 VLM 的视觉感知能力。为支持这一机制，我们设计了一个基于强化学习（RL）的两阶段培训框架。具体来说，第一阶段优化VLMs自我探索高质量动作，构建可重复使用的语言工具箱，第二阶段进一步优化VLMs，有效利用这些语言工具进行下游推理。实验结果表明，Lang2Act在显著提升VLM的视觉感知能力方面有效，性能提升超过4%。所有代码和数据均在此 https 网址上获取。

Securing SIM-Assisted Wireless Networks via Quantum Reinforcement Learning

通过量子强化学习保护SIM辅助无线网络

Authors: Le-Hung Hoang, Quang-Trung Luu, Dinh Thai Hoang, Diep N. Nguyen, Van-Dinh Nguyen
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13238
Pdf link: https://arxiv.org/pdf/2602.13238
Abstract Stacked intelligent metasurfaces (SIMs) have recently emerged as a powerful wave-domain technology that enables multi-stage manipulation of electromagnetic signals through multilayer programmable architectures. While SIMs offer unprecedented degrees of freedom for enhancing physical-layer security, their extremely large number of meta-atoms leads to a high-dimensional and strongly coupled optimization space, making conventional design approaches inefficient and difficult to scale. Moreover, existing deep reinforcement learning (DRL) techniques suffer from slow convergence and performance degradation in dynamic wireless environments with imperfect knowledge of passive eavesdroppers. To overcome these challenges, we propose a hybrid quantum proximal policy optimization (Q-PPO) framework for SIM-assisted secure communications, which jointly optimizes transmit power allocation and SIM phase shifts to maximize the average secrecy rate under power and quality-of-service constraints. Specifically, a parameterized quantum circuit is embedded into the actor network, forming a hybrid classical-quantum policy architecture that enhances policy representation capability and exploration efficiency in high-dimensional continuous action spaces. Extensive simulations demonstrate that the proposed Q-PPO scheme consistently outperforms DRL baselines, achieving approximately 15% higher secrecy rates and 30% faster convergence under imperfect eavesdropper channel state information. These results establish Q-PPO as a powerful optimization paradigm for SIM-enabled secure wireless networks.
中文摘要 堆叠智能元曲面（SIM）近年来作为一种强大的波域技术出现，能够通过多层可编程架构对电磁信号进行多级作。虽然SIM为提升物理层安全性提供了前所未有的自由度，但其极大量的元原子数量导致优化空间高度且强耦合，使传统设计方法效率低下且难以扩展。此外，现有的深度强化学习（DRL）技术在动态无线环境中收敛缓慢且性能下降，且对被动窃听器了解不完全。为克服这些挑战，我们提出了一种混合量子近端策略优化（Q-PPO）框架，用于SIM辅助安全通信，该框架联合优化发射功率分配和SIM相位偏移，以最大化在功率和服务质量约束下的平均保密率。具体来说，参数化的量子电路嵌入演员网络中，形成了经典-量子策略混合架构，提升了高维连续动作空间中的策略表示能力和探索效率。大量模拟表明，所提Q-PPO方案始终优于DRL基线，在窃听信道状态信息不完美时，保密率提升约15%，收敛速度加快30%。这些结果确立了Q-PPO作为支持SIM的安全无线网络的强大优化范式。

General learned delegation by clones

克隆人普遍的学术委派

Authors: Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.13262
Pdf link: https://arxiv.org/pdf/2602.13262
Abstract Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.
中文摘要 前沿语言模型通过增加测试时间计算会有所提升，但在固定推理预算下，串行推理或非协调并行采样可能计算效率低下。我们提出了SELFCEST，它赋予基础模型通过智能强化学习在不同并行环境中生成同权重克隆的能力。训练在全局任务奖励下端到端进行，共享参数推广，产生一个学习到的控制器，在各分支之间分配生成和上下文预算。在具有挑战性的数学推理基准和长上下文多跳质量保证中，SELFCEST 在匹配推断预算下提升了相对于单一基线的准确率成本前沿，并在这两个领域都表现出分布外的泛化。

Cooperative Edge Caching with Large Language Model in Wireless Networks

无线网络中的协作边缘缓存与大型语言模型

Authors: Ning Yang, Wentao Wang, Lingtao Ouyang, Haijun Zhang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.13307
Pdf link: https://arxiv.org/pdf/2602.13307
Abstract Cooperative edge caching in overlapping zones creates intricate coupling among Base Station (BS) decisions, making content replacement highly sensitive to topology and temporal reuse. While heuristics are often myopic and Deep Reinforcement Learning lacks robustness under dynamics, this paper proposes a Large Language Model (LLM)-based multi-BS orchestrator. The LLM acts as the sole autonomous engine, interacting with the environment via a validated text-to-action interface. Each time slot, the system renders environmental states -- including cache inventories and frequency statistics -- into prompts, parsing LLM-generated decisions against strict feasibility constraints. We align the model through a two-stage paradigm: Supervised Fine-Tuning on oracle trajectories for syntax and initialization, followed by Group Relative Policy Optimization. The latter employs an "opportunity-aware" reward that prioritizes multi-step cooperative gains relative to a No-Operation baseline. Evaluated on identical request traces, the orchestrator approaches exhaustive-search performance (0.610 vs. 0.617 in a 5-BS scenario), outperforms classical baselines (e.g., +4.1% over least-frequently used), and demonstrates robust zero-shot transfer across varying cache capacities, library sizes, and user densities.
中文摘要 在重叠区域中的协作边缘缓存在基站（BS）决策之间产生了复杂的耦合，使内容替换对拓扑和时间重用极为敏感。虽然启发式往往目光短浅，深度强化学习在动态下缺乏稳健性，但本文提出了基于大型语言模型（LLM）的多重强化编排器。LLM作为唯一的自主引擎，通过经过验证的文本转动作界面与环境交互。每个时隙，系统将环境状态——包括缓存库存和频率统计——渲染为提示，解析LLM生成的决策，结合严格的可行性约束进行解析。我们通过两阶段范式对齐模型：对预言机轨迹进行监督微调，用于语法和初始化，随后是组相对策略优化。后者采用“机会意识”奖励，优先考虑多步合作收益，相较于无行动基线。在相同的请求跟踪上评估时，编排器接近穷尽搜索性能（5-BS场景中为0.610对0.617），优于经典基线（例如，对比最少使用率+4.1\%），并展示了跨不同缓存容量、库大小和用户密度的稳健零时段传输能力。
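
To make the Group Relative Policy Optimization step in the abstract above more concrete, the sketch below shows the group-relative advantage normalization commonly used in GRPO, applied to a reward measured against a No-Operation baseline. The single-step reward `opportunity_reward` is a deliberate simplification assumed for this example (the paper's reward covers multi-step cooperative gains), and both function names are hypothetical.

```python
import statistics


def opportunity_reward(hit_rate_action, hit_rate_noop):
    """Assumed simplification: gain in hit rate over doing nothing in the same slot."""
    return hit_rate_action - hit_rate_noop


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled decision's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are centered within the group, a decision is reinforced only when it beats the other decisions sampled for the same prompt, which removes the need for a separate learned value function.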

Adaptive Value Decomposition: Coordinating a Varying Number of Agents in Urban Systems

自适应价值分解：协调城市系统中不同数量的代理

Authors: Yexin Li, Jinjin Guo, Haoyu Zhang, Yuhan Zhao, Yiwen Sun, Zihao Jiao
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13309
Pdf link: https://arxiv.org/pdf/2602.13309
Abstract Multi-agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi-agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies over time, and actions may have heterogeneous durations, resulting in a semi-MARL setting. Moreover, while sharing policy parameters among agents is commonly adopted to improve learning efficiency, it can lead to highly homogeneous actions when a subset of agents make decisions concurrently under similar observations, potentially degrading coordination quality. To address these challenges, we propose Adaptive Value Decomposition (AVD), a cooperative MARL framework that adapts to a dynamically changing agent population. AVD further incorporates a lightweight mechanism to mitigate action homogenization induced by shared policies, thereby encouraging behavioral diversity and maintaining effective cooperation among agents. In addition, we design a training-execution strategy tailored to the semi-MARL setting that accommodates asynchronous decision-making when some agents act at different times. Experiments on real-world bike-sharing redistribution tasks in two major cities, London and Washington, D.C., demonstrate that AVD outperforms state-of-the-art baselines, confirming its effectiveness and generalizability.
中文摘要 多智能体强化学习（MARL）为协调多智能体系统（MAS）提供了一种有前景的范式。然而，大多数现有方法依赖于限制性假设，如固定数量的代理和完全同步动作执行。这些假设在城市系统中常被打破，因为活跃主体数量随时间变化，且行动持续时间可能异质，导致半MARL环境。此外，虽然在智能体之间共享策略参数通常被采用以提高学习效率，但当部分智能体在类似观察下同时做出决策时，可能导致行为高度同质化，从而降低协调质量。为应对这些挑战，我们提出了适应性价值分解（AVD），这是一种配合动态变化的代理群体的合作MARL框架。AVD还包含了轻量级机制，以减轻共享政策引起的行动同质化，从而促进行为多样性并维持代理间的有效合作。此外，我们设计了一种针对半MARL环境的训练-执行策略，能够适应不同智能体在不同时间行动时的异步决策。在伦敦和华盛顿特区这两个主要城市的真实共享自行车再分配任务实验显示，AVD的表现优于最先进的基线，证实了其有效性和普遍性。

FireRed-Image-Edit-1.0 Technical Report

FireRed-Image-Edit-1.0 技术报告

Authors: Super Intelligence Team: Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2602.13344
Pdf link: https://arxiv.org/pdf/2602.13344
Abstract We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. We release code, models, and the benchmark suite to support future research.
中文摘要 我们介绍FireRed-Image-Edit，一款基于指令的扩散变换器，通过系统优化数据整理、训练方法和评估设计，实现了最先进的性能。我们构建了一个1.6B样本的训练语料库，包含来自不同来源的9亿文本转图像和7亿对图像编辑。经过严格的清洗、分层、自动标记和两阶段过滤，我们保留了超过1亿份高质量样本，在生成和编辑之间保持平衡，确保语义覆盖和指令一致性。我们的多阶段培训流程通过预训练、监督微调和强化学习逐步提升编辑能力。为提高数据效率，我们引入了多条件感知桶采样器，支持可变分辨率批处理和随机指令对齐，并支持动态提示重新索引。为了稳定优化并增强可控性，我们提出了针对DPO的非对称梯度优化，DiffusionNFT配备布局感知的OCR奖励用于文本编辑，以及可微分的一致性丢失以保护身份。我们还建立了REDEdit-Bench，这是一个涵盖15个编辑类别的综合基准，包括新引入的美化和低层次增强任务。在REDEdit-Bench和公开基准测试（ImgEdit和GEdit）上的广泛实验显示，它在开源和专有系统中均具竞争力或优于性能。我们发布代码、模型和基准测试套件，以支持未来的研究。

Robust Mean-Field Games with Risk Aversion and Bounded Rationality

具有风险厌恶和有界理性的稳健平均场博弈

Authors: Bhavini Jeloka, Yue Guan, Panagiotis Tsiotras
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2602.13353
Pdf link: https://arxiv.org/pdf/2602.13353
Abstract Recent advances in mean-field game literature enable the reduction of large-scale multi-agent problems to tractable interactions between a representative agent and a population distribution. However, existing approaches typically assume a fixed initial population distribution and fully rational agents, limiting robustness under distributional uncertainty and cognitive constraints. We address these limitations by introducing risk aversion with respect to the initial population distribution and by incorporating bounded rationality to model deviations from fully rational decision-making agents. The combination of these two elements yields a new and more general equilibrium concept, which we term the mean-field risk-averse quantal response equilibrium (MF-RQE). We establish existence results and prove convergence of fixed-point iteration and fictitious play to MF-RQE. Building on these insights, we develop a scalable reinforcement learning algorithm for scenarios with large state-action spaces. Numerical experiments demonstrate that MF-RQE policies achieve improved robustness relative to classical mean-field approaches that optimize expected cumulative rewards under a fixed initial distribution and are restricted to entropy-based regularizers.
中文摘要 近期平均场博弈文献的进展使得大规模多智能体问题能够简化为代表性代理与总体分布之间的可处理相互作用。然而，现有方法通常假设初始种群分布固定且主体完全理性，限制了在分布不确定性和认知约束下的鲁棒性。我们通过引入初始总体分布的风险规避，并引入有界理性来建模偏离完全理性决策主体的行为，来解决这些局限性。这两种元素的结合产生了一个新的、更通用的均衡概念，我们称之为平均场风险厌恶量化反应均衡（MF-RQE）。我们建立存在性结果，并证明了不动点迭代与虚构游玩的收敛性至MF-RQE。基于这些见解，我们开发了适用于大状态-动作空间场景的可扩展强化学习算法。数值实验表明，MF-RQE策略相较于在固定初始分布下优化预期累计奖励的经典平均场方法，且仅限于基于熵的正则化方法，其鲁棒性有所提升。
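
Bounded rationality in a quantal response equilibrium is often modeled by letting agents choose actions with probability increasing in their value instead of always picking the best one. The sketch below shows the classic logit (softmax) quantal response, which corresponds to the entropy-regularized special case. The abstract indicates that MF-RQE is more general than this, so the code is only an assumed illustrative instance; the name `quantal_response` and the `rationality` parameter are hypothetical.

```python
import math


def quantal_response(q_values, rationality):
    """Logit quantal response: P(a) proportional to exp(rationality * Q(a)).

    rationality = 0 gives uniformly random play; large values approach
    the fully rational argmax choice.
    """
    shift = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(rationality * (q - shift)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Sweeping `rationality` from 0 upward interpolates between uniformly random behavior and fully rational best responses, which is the deviation from full rationality that the abstract refers to.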

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

南贝哥4.1-3B：一个推理、对齐并行动的小型通用模型

Authors: Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.13367
Pdf link: https://arxiv.org/pdf/2602.13367
Abstract We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.
中文摘要 我们提出了南贝格4.1-3B，一种统一的通用语言模型，仅用3B参数即可同时实现强烈的代理行为、代码生成和一般推理。据我们所知，它是首个在单一模型中实现如此多功能性的开源小语言模型（SLM）。为了改善推理和偏好对齐，我们结合了点数和配对奖励建模，确保高质量、符合人类的回答。在代码生成方面，我们设计了具有复杂性感知的奖励，优化了正确性和效率。在深度搜索中，我们进行复杂数据综合，并在培训期间引入轮流级监督。这使得长视距工具相互作用变得稳定，使南贝阁4.1-3B能够可靠执行多达600次工具调用，以解决复杂问题。大量实验结果显示，南贝哥4.1-3B显著优于同等规模的前代型号，如南贝哥4-3B-2511和Qwen3-4B，甚至比规模更大的型号如Qwen3-30B-A3B表现更优。我们的结果表明，小型模型可以同时实现广泛的能力和强烈的专业化，重新定义了3B参数模型的潜力。

On-Policy Supervised Fine-Tuning for Efficient Reasoning

政策监督微调以实现高效推理

Authors: Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13407
Pdf link: https://arxiv.org/pdf/2602.13407
Abstract Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80% while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at this https URL.
中文摘要 大型推理模型（LRM）通常通过强化学习（RL）训练，以探索长链思维推理，在高计算成本下实现强劲表现。近期方法增加了多奖励目标，以共同优化正确性和简洁性，但这些复杂的扩展常常破坏训练稳定性，产生不理想的权衡。我们重新审视这一目标，并挑战如此复杂化的必要性。通过原则性分析，我们识别出该范式中的根本错位：当正确性和长度可直接验证时，基层正则化失去其预期作用，且在多重奖励信号下，组间归一化变得模糊。通过去除这两个项目并将奖励简化为基于截断的长度惩罚，我们证明优化问题简化为对自生成数据进行监督微调，并对正确性和简洁性进行过滤。我们将这种简化培训策略称为政策SFT的简化培训策略。尽管结构简单，按策略的SFT始终定义了准确率与效率的帕累托边界。它在保持原始准确性的情况下，将CoT长度缩短了最多80，超过了基于更复杂的强化学习方法（基于强化学习）的方法，涵盖五个基准测试。此外，它显著提升了训练效率，将GPU内存使用减少了50%，并加速了70%的收敛进程。我们的代码可在此 https URL 访问。
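
The data-selection step implied by the abstract above (self-generated samples filtered for correctness and conciseness) can be sketched as a simple filter. This is a minimal illustration under assumed names: the `Sample` record, its fields, and the `max_tokens` truncation threshold are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str      # sampled from the current policy itself (on-policy)
    num_tokens: int    # length of the response in tokens
    is_correct: bool   # outcome of a verifiable answer check


def select_sft_data(samples, max_tokens):
    """Keep only self-generated responses that are correct and within the length limit."""
    return [s for s in samples if s.is_correct and s.num_tokens <= max_tokens]
```

The surviving samples would then be used as ordinary supervised fine-tuning targets, which is why no KL term or group-wise normalization appears in this sketch.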

OpAgent: Operator Agent for Web Navigation

OpAgent：用于网页导航的作代理

Authors: Yuyu Guo, Wenjie Yang, Siyuan Yang, Ziyang Liu, Cheng Chen, Yuan Wei, Yun Hu, Yang Huang, Guoliang Hao, Dongsheng Yuan, Jianming Wang, Xin Chen, Hang Yu, Lei Lei, Peng Di
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13559
Pdf link: https://arxiv.org/pdf/2602.13559
Abstract To fulfill user instructions, autonomous web agents must contend with the inherent complexity and volatile nature of real-world websites. Conventional paradigms predominantly rely on Supervised Fine-Tuning (SFT) or Offline Reinforcement Learning (RL) using static datasets. However, these methods suffer from severe distributional shifts, as offline trajectories fail to capture the stochastic state transitions and real-time feedback of unconstrained wide web environments. In this paper, we propose a robust Online Reinforcement Learning WebAgent, designed to optimize its policy through direct, iterative interactions with unconstrained wide websites. Our approach comprises three core innovations: 1) Hierarchical Multi-Task Fine-tuning: We curate a comprehensive mixture of datasets categorized by functional primitives -- Planning, Acting, and Grounding -- establishing a Vision-Language Model (VLM) with strong instruction-following capabilities for Web GUI tasks. 2) Online Agentic RL in the Wild: We develop an online interaction environment and fine-tune the VLM using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. This system effectively mitigates the credit assignment challenge in long-horizon navigation. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines. 3) Operator Agent: We introduce a modular agentic framework, namely OpAgent, orchestrating a Planner, Grounder, Reflector, and Summarizer. This synergy enables robust error recovery and self-correction, elevating the agent's performance to a new State-of-the-Art (SOTA) success rate of 71.6%.
中文摘要 为了满足用户指令，自主网络代理必须应对现实世界网站固有的复杂性和易变性。传统范式主要依赖监督式微调（SFT）或离线强化学习（RL），使用静态数据集。然而，这些方法存在严重的分布转变，离线轨迹无法捕捉无约束宽网环境的随机状态转变和实时反馈。本文提出一个强大的在线强化学习WebAgent，旨在通过与无限制的宽网站进行直接、迭代的交互来优化其策略。我们的方法包括三项核心创新：1）分层多任务微调：我们策划了一系列按功能原语分类的数据集——规划、行动和基础——建立具有强大指令跟随能力的视觉语言模型（VLM），用于Web图形界面任务。2）野生在线智能强化学习：我们开发在线交互环境，并通过专门的强化学习流水线微调VLM。我们引入了一种混合奖励机制，结合了基于规则的决策树（RDT）用于整体结果评估，实现整体结果评估。该系统有效缓解了长期视野导航中的信用分配挑战。值得注意的是，我们的强化增强模型在WebArena上实现了38.1%的成功率（pass@5），超过了所有现有的单一基准。3）作代理：我们引入了一个模块化代理框架，即\textbf{OpAgent}，负责协调规划器、地面人、反射器和摘要器。这种协同效应实现了稳健的错误恢复和自我纠正，使代理的性能提升到新的最先进（SOTA）成功率 \textbf{71.6\%}。

Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

通过Agentic-Q估计和逐步策略优化构建自主GUI导航

Authors: Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.13653
Pdf link: https://arxiv.org/pdf/2602.13653
Abstract Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interface (GUI). Nevertheless, in real-world applications, GUI agents are often faced with non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former one aims to optimize a Q-model that can generate step-wise values to evaluate the contribution of a given action to task completion. The latter one takes step-wise samples from the state-action trajectory as inputs, and optimizes the policy via reinforcement learning with our agentic-Q model. It should be noticed that (i) all state-action trajectories are produced by the policy itself, so that the data collection costs are manageable; (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performances on GUI navigation and grounding benchmarks and even surpassing contenders with larger scales.
中文摘要 多模态大型语言模型（MLLM）的最新进展极大推动了图形用户界面（GUI）自主智能体的发展。然而，在现实应用中，GUI智能体常面临非平稳环境，导致数据整理和策略优化的计算成本较高。本报告介绍了一个以MLLM为中心的GUI智能体新框架，包含两个组成部分：agentic-Q估计和逐步策略优化。前者旨在优化一个Q模型，该模型能够生成逐步的价值，以评估给定动作对任务完成的贡献。后者以状态-动作轨迹中的逐步样本作为输入，并借助我们的agentic-Q模型通过强化学习优化策略。需要注意的是，（i）所有状态-动作轨迹均由策略本身生成，因此数据收集成本是可控的；（ii）策略更新与环境解耦，确保优化稳定高效。实证评估显示，我们的框架赋予Ovis2.5-9B强大的GUI交互能力，在GUI导航和定位基准测试中表现出色，甚至超越了规模更大的竞争者。

AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

AuTAgent：工具增强音频推理的强化学习框架

Authors: Siqian Tong, Xuan Li, Yiwei Wang, Baolong Bi, Yujun Cai, Shenghua Liu, Yuchen He, Chengpeng Hao
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13685
Pdf link: https://arxiv.org/pdf/2602.13685
Abstract Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we propose AuTAgent (Audio Tool Agent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and invokes external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent complements the representation bottleneck of LALMs by providing verifiable acoustic evidence. It improves accuracy by 4.20% / 6.20% and 9.80% / 8.00% for open-source and closed-source backbones on the MMAU Test-mini and the MMAR benchmarks, respectively. In addition, further experiments demonstrate exceptional transferability. We highlight the complementary role of external tools in augmenting audio model reasoning.
中文摘要 大型音频语言模型（LALMs）在感知方面表现出色，但在复杂的推理上需要精确的声学测量时表现不佳。虽然外部工具可以提取精确的速度或音高等细致特征，但有效整合依然充满挑战：天真地使用所有工具会导致信息过载，而基于提示的选择则无法评估上下文依赖的实用性。为此，我们提出了AuTAgent（音频工具代理），这是一个强化学习框架，可以学习何时以及调用哪些工具。通过采用稀疏反馈训练策略和新颖的差分奖励机制，代理学会过滤无关工具，只有在外部辅助能带来比基础模型净性能提升时才调用。实验结果证实，AuTAgent 通过提供可验证的声学证据，补充了 LALM 的表征瓶颈。它分别在MMAU Test-mini和MMAR基准测试中，开源和闭源骨干网的准确率提升了4.20% / 6.20%和9.80% / 8.00%。此外，后续实验证明了其卓越的可转移性。我们强调外部工具在增强音频模型推理中的作用。

Skeleton2Stage: Reward-Guided Fine-Tuning for Physically Plausible Dance Generation

Skeleton2Stage：奖励引导的微调，实现物理上合理的舞蹈生成

Authors: Jidong Jia, Youjian Zhang, Huan Fu, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.13778
Pdf link: https://arxiv.org/pdf/2602.13778
Abstract Despite advances in dance generation, most methods are trained in the skeletal domain and ignore mesh-level physical constraints. As a result, motions that look plausible as joint trajectories often exhibit body self-penetration and Foot-Ground Contact (FGC) anomalies when visualized with a human body mesh, reducing the aesthetic appeal of generated dances and limiting their real-world applications. We address this skeleton-to-mesh gap by deriving physics-based rewards from the body mesh and applying Reinforcement Learning Fine-Tuning (RLFT) to steer the diffusion model toward physically plausible motion synthesis under mesh visualization. Our reward design combines (i) an imitation reward that measures a motion's general plausibility by its imitability in a physical simulator (penalizing penetration and foot skating), and (ii) a Foot-Ground Deviation (FGD) reward with test-time FGD guidance to better capture the dynamic foot-ground interaction in dance. However, we find that the physics-based rewards tend to push the model to generate freezing motions for fewer physical anomalies and better imitability. To mitigate it, we propose an anti-freezing reward to preserve motion dynamics while maintaining physical plausibility. Experiments on multiple dance datasets consistently demonstrate that our method can significantly improve the physical plausibility of generated motions, yielding more realistic and aesthetically pleasing dances. The project page is available at: this https URL
中文摘要 尽管舞蹈生成技术有所进步，大多数方法仍在骨骼域中训练，忽视了网格级物理约束。因此，作为关节轨迹看似合理的动作，在用人体网格可视化时常常表现出身体自穿透和足地接触（FGC）异常，降低了生成舞蹈的美学吸引力，限制了其现实应用。我们从身体网格中推导基于物理的奖励，并应用强化学习微调（RLFT）来弥合骨架到网格之间的差距，引导扩散模型在网格可视化下合成物理上合理的动作。我们的奖励设计结合了（i）模仿奖励，通过动作在物理模拟器中的可模仿性来衡量其整体合理性（惩罚穿透和脚部滑动），以及（ii）配合测试时FGD引导的足地偏差（FGD）奖励，以更好地捕捉舞蹈中脚与地面的动态交互。然而，我们发现基于物理的奖励往往促使模型生成静止不动的动作，以减少物理异常并提升可模仿性。为缓解这一问题，我们提出一种抗冻结奖励，在保持物理合理性的同时保留运动的动态性。在多个舞蹈数据集上的实验一致表明，我们的方法能显著提升生成动作的物理合理性，从而产生更真实、更美观的舞蹈。项目页面可访问：此 https URL

Cast-R1: Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting

Cast-R1：学习工具增强的顺序决策策略以实现时间序列预测

Authors: Xiaoyu Tao, Mingyue Cheng, Chuang Jiang, Tian Gao, Huanjian Zhang, Yaguo Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13802
Pdf link: https://arxiv.org/pdf/2602.13802
Abstract Time series forecasting has long been dominated by model-centric approaches that formulate prediction as a single-pass mapping from historical observations to future values. Despite recent progress, such formulations often struggle in complex and evolving settings, largely because most forecasting models lack the ability to autonomously acquire informative evidence, reason about potential future changes, or revise predictions through iterative decision processes. In this work, we propose Cast-R1, a learned time series forecasting framework that reformulates forecasting as a sequential decision-making problem. Cast-R1 introduces a memory-based state management mechanism that maintains decision-relevant information across interaction steps, enabling the accumulation of contextual evidence to support long-horizon reasoning. Building on this formulation, forecasting is carried out through a tool-augmented agentic workflow, in which the agent autonomously interacts with a modular toolkit to extract statistical features, invoke lightweight forecasting models for decision support, perform reasoning-based prediction, and iteratively refine forecasts through self-reflection. To train Cast-R1, we adopt a two-stage learning strategy that combines supervised fine-tuning with multi-turn reinforcement learning, together with a curriculum learning scheme that progressively increases task difficulty to improve policy learning. Extensive experiments on multiple real-world time series datasets demonstrate the effectiveness of Cast-R1. We hope this work provides a practical step towards further exploration of agentic paradigms for time series modeling. Our code is available at this https URL.
中文摘要 时间序列预测长期以来一直由以模型为中心的方法主导，这些方法将预测表述为从历史观测到未来值的单次映射。尽管近期取得了进展，这类表述在复杂且不断演变的环境中常常遇到困难，主要原因是大多数预测模型缺乏自主获取信息证据、推理潜在未来变化或通过迭代决策过程修正预测的能力。在本研究中，我们提出了Cast-R1，一种学习式时间序列预测框架，将预测重新表述为序列决策问题。Cast-R1引入了基于记忆的状态管理机制，能够在交互步骤之间维护与决策相关的信息，从而积累支持长程推理的上下文证据。基于这一表述，预测通过工具增强的智能体工作流进行：智能体自主地与模块化工具包交互，提取统计特征，调用轻量级预测模型辅助决策，基于推理进行预测，并通过自我反思迭代优化预测。为训练Cast-R1，我们采用两阶段学习策略，将监督微调与多轮强化学习相结合，并辅以逐步提升任务难度的课程学习方案，以改善策略学习。在多个真实世界时间序列数据集上的广泛实验证明了Cast-R1的有效性。我们希望这项工作为进一步探索时间序列建模的智能体范式迈出切实的一步。我们的代码可在此 https URL 访问。

AnomaMind: Agentic Time Series Anomaly Detection with Tool-Augmented Reasoning

AnomaMind：基于工具增强推理的代理时间序列异常检测

Authors: Xiaoyu Tao, Yuchong Wu, Mingyue Cheng, Ze Guo, Tian Gao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13807
Pdf link: https://arxiv.org/pdf/2602.13807
Abstract Time series anomaly detection is critical in many real-world applications, where effective solutions must localize anomalous regions and support reliable decision-making under complex settings. However, most existing methods frame anomaly detection as a purely discriminative prediction task with fixed feature inputs, rather than an evidence-driven diagnostic process. As a result, they often struggle when anomalies exhibit strong context dependence or diverse patterns. We argue that these limitations stem from the lack of adaptive feature preparation, reasoning-aware detection, and iterative refinement during inference. To address these challenges, we propose AnomaMind, an agentic time series anomaly detection framework that reformulates anomaly detection as a sequential decision-making process. AnomaMind operates through a structured workflow that progressively localizes anomalous intervals in a coarse-to-fine manner, augments detection through multi-turn tool interactions for adaptive feature preparation, and refines anomaly decisions via self-reflection. The workflow is supported by a set of reusable tool engines, enabling context-aware diagnostic analysis. A key design of AnomaMind is an explicitly designed hybrid inference mechanism for tool-augmented anomaly detection. In this mechanism, general-purpose models are responsible for autonomous tool interaction and self-reflective refinement, while core anomaly detection decisions are learned through reinforcement learning under verifiable workflow-level feedback, enabling task-specific optimization within a flexible reasoning framework. Extensive experiments across diverse settings demonstrate that AnomaMind consistently improves anomaly detection performance. The code is available at this https URL.
中文摘要 时间序列异常检测在许多实际应用中至关重要，有效的解决方案必须定位异常区域，并支持复杂环境下的可靠决策。然而，大多数现有方法将异常检测视为具有固定特征输入的纯判别式预测任务，而非基于证据的诊断过程。因此，当异常表现出强烈的上下文依赖性或多样的模式时，这些方法往往表现不佳。我们认为这些局限性源于推理阶段缺乏自适应特征准备、具备推理意识的检测以及迭代精炼。为应对这些挑战，我们提出了AnomaMind，一种智能体式时间序列异常检测框架，将异常检测重新表述为序列决策过程。AnomaMind通过结构化工作流程，以由粗到细的方式逐步定位异常区间，通过多轮工具交互进行自适应特征准备以增强检测，并通过自我反思修正异常判定。该工作流程由一组可复用的工具引擎支撑，从而实现上下文感知的诊断分析。AnomaMind的一个关键设计是专门为工具增强异常检测设计的混合推理机制。在这种机制中，通用模型负责自主工具交互和自我反思式修正，而核心异常检测决策则在可验证的工作流程级反馈下通过强化学习习得，从而在灵活的推理框架内实现针对任务的优化。在多种设置下进行的大量实验表明，AnomaMind能够持续提升异常检测性能。代码可在此 https URL 访问。

Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation

带有瞬时速度约束的单步动作生成平均流策略

Authors: Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yiheng Li, Yuxin Chen, Masayoshi Tomizuka, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13810
Pdf link: https://arxiv.org/pdf/2602.13810
Abstract Learning expressive and efficient policy functions is a promising direction in reinforcement learning (RL). While flow-based policies have recently proven effective in modeling complex action distributions with a fast deterministic sampling process, they still face a trade-off between expressiveness and computational burden, which is typically controlled by the number of flow steps. In this work, we propose mean velocity policy (MVP), a new generative policy function that models the mean velocity field to achieve the fastest one-step action generation. To ensure its high expressiveness, an instantaneous velocity constraint (IVC) is introduced on the mean velocity field during training. We theoretically prove that this design explicitly serves as a crucial boundary condition, thereby improving learning accuracy and enhancing policy expressiveness. Empirically, our MVP achieves state-of-the-art success rates across several challenging robotic manipulation tasks from Robomimic and OGBench. It also delivers substantial improvements in training and inference speed over existing flow-based policy baselines.
中文摘要 学习表达能力强且高效的策略函数是强化学习（RL）中一个有前景的方向。尽管基于流的策略最近已被证明能通过快速的确定性采样过程有效建模复杂的动作分布，但它们仍面临表达能力与计算负担之间的权衡，而这一权衡通常由流的步数控制。在本研究中，我们提出了平均速度策略（MVP），这是一种新的生成式策略函数，通过建模平均速度场实现最快的单步动作生成。为了确保其高表达能力，我们在训练期间对平均速度场引入瞬时速度约束（IVC）。我们从理论上证明该设计明确地充当了关键的边界条件，从而提高学习精度并增强策略表达能力。在实验中，我们的MVP在Robomimic和OGBench的多项具有挑战性的机器人操作任务中取得了最先进的成功率。与现有基于流的策略基线相比，它还在训练和推理速度上实现了显著提升。

Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

嵌入强化学习：推理驱动多模态嵌入的强化学习

Authors: Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.13823
Pdf link: https://arxiv.org/pdf/2602.13823
Abstract Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
中文摘要 利用多模态大型语言模型（MLLM）已成为推动通用多模态嵌入（UME）解决多样化跨模态任务的关键。最新研究表明，与判别式方法相比，引入生成式思维链（CoT）推理能显著提升面向特定任务的表征。然而，现有生成式嵌入方法生成的推理CoT仅限于对查询的文本分析，与目标的检索无关。为解决这些局限性，我们提出了一个推理驱动的UME框架，引入嵌入器引导的强化学习（EG-RL），优化推理器以产生有证据支撑的可追溯性CoT（T-CoT）。我们的主要贡献有三方面：（1）我们设计了一个EG-RL框架，由嵌入器为推理器提供显式监督，确保生成的CoT轨迹与嵌入任务对齐。（2）我们引入了T-CoT，它提取关键的多模态线索以聚焦于与检索相关的元素，并为嵌入器提供多模态输入。（3）在有限的计算资源下，我们的框架在MMEB-V2和UVRB基准测试中均优于开创性的嵌入模型。在结构化推理中融入多模态证据，并配合面向检索的对齐，有效增强了跨模态语义一致性，提升了模型的细粒度匹配能力以及在复杂场景中的泛化能力。我们的研究表明，有针对性的推理优化能显著提升多模态嵌入质量，为推理驱动的UME开发提供了实用高效的解决方案。

Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

超越言语：通过心智理论评估并弥合用户-代理互动中的认识论分歧

Authors: Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li, Yang Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.13832
Pdf link: https://arxiv.org/pdf/2602.13832
Abstract Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user beliefs and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation in identifying the underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.
中文摘要 大型语言模型（LLM）发展迅速，被广泛应用于通用和专业任务以协助人类用户。然而，当意图和指令传达不够准确时，它们仍然难以理解和回应用户的真实需求，导致用户的主观信念与真实环境状态之间出现分歧。解决这种认知分歧需要心智理论（ToM），但现有针对LLM的ToM评估主要关注孤立的信念推断，忽视了其在现实世界交互中的功能性效用。为此，我们将LLM的ToM形式化为一种认知分歧检测与消解机制，并提出了一个基准测试\benchname，用于评估模型在实践中如何调和用户信念与用户画像。11个领先模型的结果表明，它们在识别阻碍任务成功的潜在认知差距方面存在显著局限。为弥合这一差距，我们进一步构建了一个基于轨迹的ToM数据集，将信念追踪与任务相关的状态推断联系起来。通过强化学习在该数据上训练的模型在推理用户心理状态方面表现出持续的提升，从而提升了下游性能。我们的研究强调了ToM作为一种关键的交互层机制（而非独立的推理技能）的实际价值。

Enabling Option Learning in Sparse Rewards with Hindsight Experience Replay

在稀疏奖励下借助事后经验回放实现选项学习

Authors: Gabriel Romio, Mateus Begnini Melchiades, Bruno Castro da Silva, Gabriel de Oliveira Ramos
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.13865
Pdf link: https://arxiv.org/pdf/2602.13865
Abstract Hierarchical Reinforcement Learning (HRL) frameworks like Option-Critic (OC) and Multi-updates Option Critic (MOC) have introduced significant advancements in learning reusable options. However, these methods underperform in multi-goal environments with sparse rewards, where actions must be linked to temporally distant outcomes. To address this limitation, we first propose MOC-HER, which integrates the Hindsight Experience Replay (HER) mechanism into the MOC framework. By relabeling goals from achieved outcomes, MOC-HER can solve sparse reward environments that are intractable for the original MOC. However, this approach is insufficient for object manipulation tasks, where the reward depends on the object reaching the goal rather than on the agent's direct interaction. This makes it extremely difficult for HRL agents to discover how to interact with these objects. To overcome this issue, we introduce Dual Objectives Hindsight Experience Replay (2HER), a novel extension that creates two sets of virtual goals. In addition to relabeling goals based on the object's final state (standard HER), 2HER also generates goals from the agent's effector positions, rewarding the agent for both interacting with the object and completing the task. Experimental results in robotic manipulation environments show that MOC-2HER achieves success rates of up to 90%, compared to less than 11% for both MOC and MOC-HER. These results highlight the effectiveness of our dual objective relabeling strategy in sparse reward, multi-goal tasks.
中文摘要 分层强化学习（HRL）框架，如Option-Critic（OC）和Multi-updates Option Critic（MOC），在学习可复用选项方面带来了重大进展。然而，这些方法在奖励稀疏的多目标环境中表现不佳，因为在这类环境中动作必须与时间上相隔较远的结果关联起来。为解决这一限制，我们首先提出了MOC-HER，它将事后经验回放（HER）机制整合进MOC框架。通过根据已达成的结果重新标记目标，MOC-HER能够解决原始MOC难以处理的稀疏奖励环境。然而，这种方法对于物体操作任务来说并不充分，因为奖励取决于物体是否到达目标，而不是智能体的直接交互。这使得HRL智能体极难发现如何与这些物体交互。为解决这一问题，我们引入了双目标事后经验回放（2HER），这是一种新颖的扩展，会创建两组虚拟目标。除了根据物体的最终状态重新标记目标（标准HER）外，2HER还会根据智能体执行器的位置生成目标，对智能体与物体交互以及完成任务这两种行为都给予奖励。机器人操作环境中的实验结果显示，MOC-2HER的成功率高达90%，而MOC和MOC-HER均低于11%。这些结果凸显了我们的双目标重新标记策略在稀疏奖励、多目标任务中的有效性。
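
Illustrative sketch (not from the paper): the hindsight relabeling idea in the abstract above can be pictured as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 2-D positions, the distance tolerance, and the choice of the final effector position as the second virtual goal are all hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def sparse_reward(achieved: Position, goal: Position, tol: float = 0.05) -> float:
    """Return 0.0 when the goal is reached and -1.0 otherwise (a common sparse convention)."""
    dist = ((achieved[0] - goal[0]) ** 2 + (achieved[1] - goal[1]) ** 2) ** 0.5
    return 0.0 if dist <= tol else -1.0


def relabel_dual(object_path: List[Position],
                 effector_path: List[Position]) -> List[Dict[str, float]]:
    """Relabel one episode with two hindsight goals (hypothetical illustration).

    - object goal: the object's final position (standard HER "final" relabeling)
    - effector goal: the effector's final position (assumed second virtual goal)
    """
    object_goal = object_path[-1]
    effector_goal = effector_path[-1]
    relabeled = []
    for obj_pos, eff_pos in zip(object_path, effector_path):
        relabeled.append({
            "task_reward": sparse_reward(obj_pos, object_goal),
            "interaction_reward": sparse_reward(eff_pos, effector_goal),
        })
    return relabeled


# Toy episode: the object only moves on the last step.
episode = relabel_dual(
    object_path=[(0.0, 0.0), (0.0, 0.0), (0.5, 0.5)],
    effector_path=[(1.0, 1.0), (0.0, 0.0), (0.5, 0.5)],
)
```

After relabeling, the last transition of every episode carries a success signal for both goals, which is what makes otherwise reward-free episodes usable for learning.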

Probabilistic Reachability Analysis of Multi-scale Voltage Dynamics Using Reinforcement Learning

利用强化学习对多尺度电压动力学进行概率可达性分析

Authors: Naoki Hashima, Hikaru Hoshino, Luis David Pabón Ospina, Eiko Furutani
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.13896
Pdf link: https://arxiv.org/pdf/2602.13896
Abstract Voltage stability in modern power systems involves coupled dynamics across multiple time scales. Conventional methods based on time-scale separation or static stability margins may overlook instabilities caused by the coupling of slow and fast transients. Uncertainty in operating conditions further complicates stability assessment, and the high computational cost of Monte Carlo simulations limits their applicability to multi-scale dynamics. This paper presents a deep reinforcement learning-based framework for probabilistic reachability analysis of multi-scale voltage dynamics. By formulating each instability mechanism as a distinct absorbing state and introducing a multi-critic architecture for mechanism-specific learning, the proposed method enables consistent learning of risk probabilities associated with multiple instability types within a unified framework. The approach is demonstrated on a four-bus system with load tap changers and over-excitation limiters, illustrating the effectiveness of the proposed learning-based reachability analysis in identifying and quantifying the mechanisms leading to voltage collapse.
中文摘要 现代电力系统中的电压稳定性涉及跨多个时间尺度的耦合动力学。基于时间尺度分离或静态稳定裕度的传统方法可能忽视由慢速与快速暂态耦合引起的不稳定性。运行条件的不确定性使稳定性评估进一步复杂化，而蒙特卡洛模拟的高计算成本限制了其在多尺度动力学中的适用性。本文提出了一个基于深度强化学习的多尺度电压动力学概率可达性分析框架。通过将每种失稳机制表述为一个独立的吸收态，并引入多评论家（multi-critic）架构以实现针对各机制的学习，所提方法能够在统一框架内一致地学习与多种失稳类型相关的风险概率。该方法在一个带有有载分接开关和过励磁限制器的四母线系统上进行了演示，展示了所提基于学习的可达性分析在识别和量化导致电压崩溃的机制方面的有效性。

From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

从像素到策略：在内容感知布局设计中强化语言模型中的空间推理

Authors: Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2602.13912
Pdf link: https://arxiv.org/pdf/2602.13912
Abstract We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs' limited spatial reasoning and the lack of transparency in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. The layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.
中文摘要 我们介绍LaySPA，一个强化学习框架，为大型语言模型（LLM）提供显式且可解释的空间推理能力，用于内容感知的图形布局设计。LaySPA解决了两个关键挑战：LLM有限的空间推理能力，以及设计决策缺乏透明度。我们不在像素层面进行操作，而是将布局设计重新表述为结构化文本空间环境上的策略学习问题，该环境显式编码画布几何、元素属性和元素间关系。LaySPA生成包含可解释推理轨迹和结构化布局规范的双层输出，实现透明且可控的设计决策。布局设计策略通过多目标空间评判进行优化，该评判将布局质量分解为几何有效性、关系连贯性和美学一致性，并使用相对组优化进行训练，以稳定开放式设计空间中的学习。实验表明，LaySPA提升了结构有效性和视觉质量，优于规模更大的专有LLM，并达到与专用SOTA布局生成器相当的性能，同时所需标注样本更少、延迟更低。

Why Code, Why Now: Learnability, Computability, and the Real Limits of Machine Learning

为什么是代码，为什么是现在：可学习性、可计算性以及机器学习的真正局限性

Authors: Zhimin Zhao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.13934
Pdf link: https://arxiv.org/pdf/2602.13934
Abstract Code generation has progressed more reliably than reinforcement learning, largely because code has an information structure that makes it learnable. Code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that the ceiling on ML progress depends less on model size than on whether a task is learnable at all. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.
中文摘要 代码生成的进展比强化学习更为可靠，主要因为代码具有使其可学习的信息结构。代码在每个词元（token）上都提供密集、局部且可验证的反馈，而大多数强化学习问题则不然。这种反馈质量的差异不是二元的，而是分级的。我们提出了基于信息结构的五级可学习性层级，并认为机器学习进展的上限更多取决于任务是否可学习，而非模型规模。该层级建立在对计算问题三种属性（可表达性、可计算性和可学习性）的形式化区分之上。我们确立了它们两两之间的关系，包括蕴含关系在何处成立、在何处失效，并提出了一个统一的模板，使结构性差异变得明确。该分析有助于解释为何在代码上的监督学习能够可预测地扩展而强化学习却不能，以及为何“仅靠扩大规模就能解决剩余机器学习难题”这一常见假设值得审视。

You Can Learn Tokenization End-to-End with Reinforcement Learning

你可以通过强化学习端到端地学习分词

Authors: Sam Dauncey, Roger Wattenhofer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13940
Pdf link: https://arxiv.org/pdf/2602.13940
Abstract Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the 100 million parameter scale.
中文摘要 分词（tokenization）是一个硬编码的压缩步骤，尽管模型架构总体上日益趋向端到端，它仍然保留在大型语言模型（LLM）的训练流程中。已有工作在大规模实验中取得了有前景的结果：一类方法借助启发式规则划分词元边界，将这一压缩步骤纳入LLM架构内部；另一类方法尝试用直通估计（straight-through estimates）来学习词元边界，即把划分离散词元边界的问题当作连续问题处理。我们表明，这些词元边界可以改用得分函数估计（score function estimates）来学习，由于它直接优化划分离散词元边界以最小化损失这一问题，因而具有更严格的理论保证。我们观察到，必须借助强化学习中的技术（例如时间折扣），才能将该得分函数估计的方差降低到实际可行的程度。我们证明，在1亿参数规模上，所得方法在定性和定量上均优于先前提出的直通估计。
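
Illustrative sketch (not from the paper): a score function (REINFORCE-style) estimate for a single discrete boundary decision can be written as below. This is a toy under stated assumptions: one Bernoulli "draw a boundary or not" choice with logit `theta` and two made-up loss values. The variance reduction shown is a constant baseline, swapped in here for simplicity; it is not the time discounting the abstract mentions. All names are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def exact_gradient(theta: float, loss_boundary: float, loss_none: float) -> float:
    """d/dtheta of E[loss] when P(boundary) = sigmoid(theta)."""
    p = sigmoid(theta)
    return (loss_boundary - loss_none) * p * (1.0 - p)


def score_function_gradient(theta: float, loss_boundary: float, loss_none: float,
                            n_samples: int, baseline: float = 0.0,
                            seed: int = 0) -> float:
    """Monte Carlo estimate: mean of (loss - baseline) * d/dtheta log P(sample)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        boundary = 1.0 if rng.random() < p else 0.0
        loss = loss_boundary if boundary else loss_none
        # For a Bernoulli with p = sigmoid(theta): d/dtheta log P(b) = b - p.
        total += (loss - baseline) * (boundary - p)
    return total / n_samples


exact = exact_gradient(0.0, loss_boundary=1.0, loss_none=3.0)
plain = score_function_gradient(0.0, 1.0, 3.0, n_samples=200_000)
with_baseline = score_function_gradient(0.0, 1.0, 3.0, n_samples=200_000, baseline=2.0)
```

The discrete sample is never differentiated through; only its log-probability is. In this toy the baseline equals the mean loss, so every sample contributes the same value and the estimate matches the exact gradient, which shows why variance reduction matters for making such estimators practical.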

Experiential Reinforcement Learning

体验式强化学习

Authors: Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, Jieyu Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13949
Pdf link: https://arxiv.org/pdf/2602.13949
Abstract Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
中文摘要 强化学习已成为语言模型（LM）从环境奖励或反馈中学习的核心方法。在实践中，环境反馈通常是稀疏且延迟的。从这类信号中学习具有挑战性，因为LM必须隐式地推断观察到的失败应如何转化为后续迭代中的行为变化。我们引入了体验式强化学习（ERL），这是一种在强化学习过程中嵌入显式“经验-反思-巩固”循环的训练范式。给定一个任务，模型生成一次初始尝试，接收环境反馈，并产生一段反思来指导改进后的第二次尝试，其成功会被强化并内化到基础策略中。该过程将反馈转化为结构化的行为修正，改善探索并稳定优化，同时在部署时保留收益且不增加额外的推理成本。在稀疏奖励控制环境和智能体推理基准测试中，ERL相较于强大的强化学习基线持续提升学习效率和最终表现，在复杂多步环境中实现高达+81%的提升，在工具使用推理任务中实现高达+11%的提升。这些结果表明，将显式的自我反思融入策略训练，为把反馈转化为持久的行为改进提供了一种实用机制。

QuRL: Efficient Reinforcement Learning with Quantized Rollout

QuRL：基于量化Rollout的高效强化学习

Authors: Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, Brucek Khailany
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13953
Pdf link: https://arxiv.org/pdf/2602.13953
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR) that dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor, which is essential for mitigating long-term training collapse. Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively. We mitigate this problem through the invariant scaling technique that reduces quantization noise and increases weight update. We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, and achieve 20% to 80% faster rollout during training.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为训练推理型大型语言模型（LLM）的流行范式。然而，由于LLM的自回归解码特性，rollout过程成为强化学习训练的效率瓶颈，最多可占总训练时间的70%。在本研究中，我们提出了量化强化学习（QuRL），利用量化的actor来加速rollout。我们解决了QuRL中的两个挑战。首先，我们提出了自适应裁剪范围（ACR），根据全精度actor与量化actor之间的策略比率动态调整裁剪比率，这对于缓解长期训练崩溃至关重要。其次，我们指出了权重更新问题，即强化学习各步之间的权重变化极小，使量化操作难以有效捕捉这些变化。我们通过不变缩放技术来缓解这个问题，该技术降低了量化噪声并增大了权重更新。我们在DeepScaleR和DAPO上通过INT8和FP8量化实验评估了我们的方法，训练期间的rollout速度提升了20%到80%。
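
Illustrative sketch (not from the paper): the "weight update problem" described above can be seen in a toy symmetric INT8 quantizer. This is a minimal sketch under stated assumptions: the per-tensor scale of 1/127, the weight value, and the update sizes are hypothetical, and the paper's invariant scaling remedy is not implemented here.

```python
def quantize_int8(weight: float, scale: float) -> int:
    """Symmetric round-to-nearest INT8 quantization, clamped to [-127, 127]."""
    q = round(weight / scale)
    return max(-127, min(127, q))


# Hypothetical per-tensor scale: the largest absolute weight is assumed to be 1.0.
SCALE = 1.0 / 127

before = quantize_int8(0.3, SCALE)
after_small_step = quantize_int8(0.3 + 1e-4, SCALE)  # a tiny RL-style update
after_large_step = quantize_int8(0.3 + 1e-2, SCALE)  # a much larger update

# The tiny update is invisible to the quantized actor; the larger one is not.
small_update_lost = before == after_small_step
large_update_seen = before != after_large_step
```

One quantization bin here is about 0.0079 wide, so an update of 1e-4 usually stays inside the same bin. The quantized actor then keeps sampling from effectively stale weights even though the full-precision weights have moved.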

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

WoVR：世界模型作为利用强化学习对VLA策略进行后训练的可靠模拟器

Authors: Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, Dongbin Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13977
Pdf link: https://arxiv.org/pdf/2602.13977
Abstract Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
中文摘要 强化学习（RL）有望为视觉-语言-动作（VLA）模型解锁超越模仿学习的能力，但其对大规模真实世界交互的需求，使其难以直接部署在实体机器人上。近期研究尝试将学习得到的世界模型用作策略优化的模拟器，但闭环想象rollout不可避免地存在幻觉和长时程误差累积。这些误差不仅会降低视觉保真度；它们还会破坏优化信号，促使策略利用模型的不准确之处，而非取得真实的任务进展。我们提出了WoVR，一个可靠的、基于世界模型的强化学习框架，用于VLA策略的后训练。WoVR并不假设世界模型是忠实的，而是显式地调控强化学习与不完美的想象动态之间的交互方式。它通过可控的动作条件视频世界模型提升rollout稳定性，通过关键帧初始化rollout（Keyframe-Initialized Rollouts）重塑想象交互以降低有效误差深度，并通过世界模型-策略协同进化保持策略与模拟器的对齐。在LIBERO基准和真实机器人操作上的大量实验表明，WoVR能够实现稳定的长时程想象rollout和有效的策略优化，使LIBERO的平均成功率从39.95%提升至69.2%（+29.3个百分点），真实机器人成功率从61.7%提升至91.7%（+30.0个百分点）。这些结果表明，当幻觉被显式控制时，学习到的世界模型可以作为强化学习的实用模拟器。

BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks

BRAIN：通过主动推理实现移动网络中智能与具身智能的贝叶斯推理

Authors: Osman Tugay Basaran, Martin Maier, Falko Dressler
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2602.14033
Pdf link: https://arxiv.org/pdf/2602.14033
Abstract Future sixth-generation (6G) mobile networks will demand artificial intelligence (AI) agents that are not only autonomous and efficient, but also capable of real-time adaptation in dynamic environments and transparent in their decision-making. However, prevailing agentic AI approaches in networking exhibit significant shortcomings in this regard. Conventional deep reinforcement learning (DRL)-based agents lack explainability and often suffer from brittle adaptation, including catastrophic forgetting of past knowledge under non-stationary conditions. In this paper, we propose an alternative solution to these challenges: the Bayesian reasoning via Active Inference (BRAIN) agent. BRAIN harnesses a deep generative model of the network environment and minimizes variational free energy to unify perception and action in a single closed-loop paradigm. We implement BRAIN as an O-RAN eXtended application (xApp) on a GPU-accelerated testbed and demonstrate its advantages over standard DRL baselines. In our experiments, BRAIN exhibits (i) robust causal reasoning for dynamic radio resource allocation, maintaining slice-specific quality of service (QoS) targets (throughput, latency, reliability) under varying traffic loads, (ii) superior adaptability with up to 28.3% higher robustness to sudden traffic shifts versus benchmarks (achieved without any retraining), and (iii) real-time interpretability of its decisions through human-interpretable belief state diagnostics.
中文摘要 未来的第六代（6G）移动网络需要的人工智能（AI）智能体不仅要自主高效，还要能在动态环境中实时适应，并且决策过程透明。然而，网络领域中主流的智能体式人工智能方法在这方面存在显著不足。基于深度强化学习（DRL）的传统智能体缺乏可解释性，且常存在适应脆弱的问题，包括在非平稳条件下对过去知识的灾难性遗忘。本文针对这些挑战提出了一种替代方案：基于主动推理的贝叶斯推理（BRAIN）智能体。BRAIN利用网络环境的深度生成模型，通过最小化变分自由能，将感知与行动统一在单一闭环范式中。我们将BRAIN实现为O-RAN扩展应用（xApp），部署在GPU加速的测试平台上，并展示了其相较于标准DRL基线的优势。在我们的实验中，BRAIN展现了（i）用于动态无线资源分配的稳健因果推理，在不同流量负载下维持各切片特定的服务质量（QoS）目标（吞吐量、延迟、可靠性），（ii）更强的适应性，对突发流量变化的鲁棒性比基准方法最多高出28.3%（无需任何重新训练），以及（iii）通过人类可解释的信念状态诊断实现其决策的实时可解释性。
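
Variational free energy, which the abstract says BRAIN minimizes, has a simple closed form for a discrete toy model. The sketch below is only an illustration under stated assumptions: two hypothetical hidden network states, a made-up prior and likelihood, and none of BRAIN's deep generative model. It checks the textbook property that free energy is minimized by the exact posterior, where it equals the negative log evidence.

```python
import math


def free_energy(q, prior, likelihood_o):
    """F = sum_s q(s) * (ln q(s) - ln p(o, s)), with p(o, s) = p(o|s) * p(s)."""
    return sum(
        qs * (math.log(qs) - math.log(ls * ps))
        for qs, ps, ls in zip(q, prior, likelihood_o)
        if qs > 0
    )


prior = [0.5, 0.5]          # p(s): two hypothetical hidden states
likelihood_o = [0.9, 0.2]   # p(o | s) for the single observation o (made up)

evidence = sum(l * p for l, p in zip(likelihood_o, prior))          # p(o)
posterior = [l * p / evidence for l, p in zip(likelihood_o, prior)]

f_prior = free_energy(prior, prior, likelihood_o)       # belief = prior
f_post = free_energy(posterior, prior, likelihood_o)    # belief = exact posterior

print("F at prior:    ", f_prior)
print("F at posterior:", f_post, "(equals -ln p(o))")
```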

CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

CoCoEdit：通过区域正则化强化学习实现内容一致的图像编辑

Authors: Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14068
Pdf link: https://arxiv.org/pdf/2602.14068
Abstract Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
中文摘要 随着大规模生成模型的发展，图像编辑取得了令人印象深刻的成果。然而，现有模型主要关注对预期对象和区域的编辑效果，常常导致非预期区域出现不必要的更改。我们通过区域正则化强化学习，提出了内容一致性编辑（CoCoEdit）的后期培训框架。我们首先用精炼的指令和掩码补充现有编辑数据集，从中精选出4万个多样且高质量的样本作为训练集。随后引入像素级相似度奖励，补充基于MLLM的奖励，使模型能够在编辑过程中同时确保编辑质量和内容一致性。为克服奖励的空间无关性，我们提出了基于区域的正则化器，旨在保留高奖励样本的未编辑区域，同时鼓励低奖励样本的编辑效应。为评估，我们为GEdit-Bench和ImgEdit-Bench注释编辑掩码，引入像素级相似度指标以衡量内容一致性和编辑质量。将CoCoEdit应用于Qwen-Image-Edit和FLUX-Kontext，我们不仅凭借最先进的模型实现了竞争性编辑评分，还显著提升了内容一致性，这些指标通过PSNR/SSIM指标和人类主观评分衡量。
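
The abstract measures content consistency with pixel-level metrics such as PSNR restricted by editing masks. A minimal sketch of that idea follows; it is not the CoCoEdit metric itself. It assumes flat lists of grayscale pixel values and a binary mask where 1 marks the region that was meant to be edited, and the function name `masked_psnr` is hypothetical.

```python
import math


def masked_psnr(original, edited, edit_mask, max_value=255.0):
    """PSNR computed only over pixels outside the edit mask (mask value 0)."""
    squared_errors = [
        (o - e) ** 2
        for o, e, m in zip(original, edited, edit_mask)
        if m == 0
    ]
    if not squared_errors:
        raise ValueError("the mask covers every pixel; nothing to compare")
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # untouched regions are pixel-identical
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [10, 20, 30, 40]
edited = [10, 22, 200, 40]   # third pixel is the intended edit
edit_mask = [0, 0, 1, 0]

consistency = masked_psnr(original, edited, edit_mask)
print("PSNR outside the edited region:", consistency)
```

Because the masked pixel is excluded, a large intended edit does not lower the score; only drift in regions that should have stayed unchanged does.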

Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning

策略梯度与自适应熵退火以实现持续微调

Authors: Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14078
Pdf link: https://arxiv.org/pdf/2602.14078
Abstract Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimation. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods outperform CE-based methods across diverse benchmarks and with various PEFT modules. More broadly, we evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
中文摘要 尽管取得了成功，大型预训练视觉模型在类增量设置下适应新任务时仍易发生灾难性遗忘。参数高效微调（PEFT）通过限制可训练参数来缓解这一问题，但大多数方法仍然依赖交叉熵（CE）损失（即0-1损失的替代损失）来从新数据中学习。我们重新审视这一选择，并通过强化学习的视角恢复真正的目标（0-1损失）。通过将分类表述为单步马尔可夫决策过程，我们推导出了一种期望策略梯度（EPG）方法，该方法通过低方差梯度估计直接最小化误分类误差。我们的分析表明，CE可以被解释为带有额外样本加权机制的EPG：CE通过强调低置信度样本鼓励探索，而EPG则优先考虑高置信度样本。基于这一见解，我们提出了自适应熵退火（aEPG），这是一种从探索性（类CE）学习向利用性（类EPG）学习过渡的训练策略。基于aEPG的方法在多种基准测试和各种PEFT模块上均优于基于CE的方法。更广泛地，我们评估了各种熵正则化方法，并证明降低输出预测分布的熵能增强预训练视觉模型的适应能力。
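
The relationship between CE and the expected policy gradient can be checked numerically on a softmax classifier. The sketch below is an illustration, not the paper's derivation: it assumes a reward of 1 for the correct class and 0 otherwise, so the expected reward of a one-step policy is simply the probability of the true label. Under that assumption the gradient of the expected reward equals the log-likelihood gradient scaled by that probability, which matches the abstract's remark that EPG weights high-confidence samples more and CE weights low-confidence samples more.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_likelihood(logits, label):
    """Gradient of log p[label] w.r.t. the logits (the negated CE gradient)."""
    p = softmax(logits)
    return [(1.0 if k == label else 0.0) - pk for k, pk in enumerate(p)]


def grad_expected_reward(logits, label):
    """Gradient of p[label], the expected 0/1 reward (assumed reward model)."""
    p = softmax(logits)
    return [p[label] * g for g in grad_log_likelihood(logits, label)]


confident = [4.0, 0.0, 0.0]     # hypothetical logits, true label 0
unconfident = [0.0, 2.0, 0.0]   # hypothetical logits, true label 0

for name, logits in (("confident", confident), ("unconfident", unconfident)):
    weight = softmax(logits)[0]  # EPG gradient = weight * CE-style gradient
    print(name, "sample weight under EPG relative to CE:", weight)
```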

ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization

ForgeryVCR：通过多模态大语言模型（MLLM）中高效的取证工具进行视觉中心推理，实现图像伪造检测与定位

Authors: Youqi Wang, Shen Chen, Haowei Wang, Rongxuan Peng, Taiping Yao, Shunquan Tan, Changsheng Chen, Bin Li, Shouhong Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14098
Pdf link: https://arxiv.org/pdf/2602.14098
Abstract Existing Multimodal Large Language Models (MLLMs) for image forgery detection and localization predominantly operate under a text-centric Chain-of-Thought (CoT) paradigm. However, forcing these models to textually characterize imperceptible low-level tampering traces inevitably leads to hallucinations, as linguistic modalities are insufficient to capture such fine-grained pixel-level inconsistencies. To overcome this, we propose ForgeryVCR, a framework that incorporates a forensic toolbox to materialize imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. To enable efficient tool utilization, we introduce a Strategic Tool Learning post-training paradigm, encompassing gain-driven trajectory construction for Supervised Fine-Tuning (SFT) and subsequent Reinforcement Learning (RL) optimization guided by a tool utility reward. This paradigm empowers the MLLM to act as a proactive decision-maker, learning to spontaneously invoke multi-view reasoning paths including local zoom-in for fine-grained inspection and the analysis of invisible inconsistencies in compression history, noise residuals, and frequency domains. Extensive experiments reveal that ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks, demonstrating superior generalization and robustness with minimal tool redundancy. The project page is available at this https URL.
中文摘要 现有用于图像伪造检测和定位的多模态大型语言模型（MLLM）主要采用以文本为中心的思维链（Chain-of-Thought，CoT）范式。然而，强迫这些模型在文本上描述低层次不可察觉的篡改痕迹，必然导致幻觉，因为语言学模态不足以捕捉如此细粒度的像素层次不一致。为克服这一问题，我们提出了ForgeryVCR，一种结合法医工具箱的框架，通过视觉中心推理将不可察觉的痕迹转化为显性视觉中间体。为了实现工具的高效利用，我们引入了战略工具学习后训练范式，涵盖增益驱动轨迹构建，用于监督微调（SFT）及后续的强化学习（RL）优化，并以工具效用奖励为指导。这一范式使MLLM能够作为主动决策者，学会自发调用多视角推理路径，包括局部放大以进行细粒度检查，以及分析压缩历史、噪声残差和频域中的隐形不一致。大量实验表明，ForgeryVCR在检测和定位任务中均具备最先进的（SOTA）性能，展现出卓越的泛化性和鲁棒性，且工具冗余极少。项目页面可在此 https 网址访问。

LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

LaViDa-R1：推进统一多模扩散语言模型的推理

Authors: Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14147
Pdf link: https://arxiv.org/pdf/2602.14147
Abstract Diffusion language models (dLLMs) recently emerged as a promising alternative to auto-regressive LLMs. The latest works further extended it to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.
中文摘要 扩散语言模型（dLLM）最近作为自回归LLM的一种有前景的替代方案出现。最新的工作进一步将其扩展到多模态理解和生成任务。在本研究中，我们提出了LaViDa-R1，一种多模态、通用的推理型dLLM。与通过任务特定强化学习构建推理型dLLM的现有工作不同，LaViDa-R1以统一的方式整合了多样的多模态理解和生成任务。具体而言，LaViDa-R1基于一个新颖的统一后训练框架构建，该框架无缝集成了监督微调（SFT）和多任务强化学习（RL）。它采用了多种新颖的训练技术，包括答案强制（answer-forcing）、树搜索和互补似然估计，以提升有效性和可扩展性。大量实验展示了LaViDa-R1在多种多模态任务上的强劲表现，包括视觉数学推理、推理密集型定位（grounding）和图像编辑。

Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

过程监督多代理强化学习，实现可靠的临床推理

Authors: Chaeeun Lee, T. Michael Yates, Pasquale Minervini, T. Ian Simpson
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14160
Pdf link: https://arxiv.org/pdf/2602.14160
Abstract Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world case of this is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that with outcome-only rewards, MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at this https URL.
中文摘要 临床决策需要对异质证据进行细致的推理，并给出可追溯的依据。虽然近期的LLM多智能体系统（MAS）展现出潜力，但它们大多只针对结果准确性进行优化，忽视了符合临床标准、以过程为依据的推理。一个关键的现实案例是基因-疾病有效性审编（gene-disease validity curation），专家必须通过综合多种生物医学证据来判断某个基因是否与某种疾病存在因果关联。我们为该任务引入了一个“智能体即工具”的强化学习框架，目标有两个：（i）过程级监督，确保推理遵循有效的临床路径；（ii）通过分层多智能体系统实现高效协调。我们在ClinGen数据集上的评估显示，仅使用结果奖励时，采用GRPO训练的Qwen3-4B监督智能体的MAS能将最终结果准确率从基础模型监督智能体的0.195大幅提升至0.732，但过程对齐较差（0.392 F1）。相反，采用过程+结果奖励时，采用GRPO训练的监督智能体的MAS取得了更高的结果准确率（0.750），同时将过程忠实度显著提升至0.520 F1。我们的代码可在此 https URL 访问。

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

通过枢轴驱动重采样进行深度密集探索，用于LLM强化学习

Authors: Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.14169
Pdf link: https://arxiv.org/pdf/2602.14169
Abstract Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $\textit{pivots}$: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines.
中文摘要 有效的探索是大语言模型强化学习中的一个关键挑战：在有限的采样预算内，从庞大的自然语言序列空间中发现高质量的轨迹。现有方法存在明显局限：GRPO只从根节点采样，使高概率轨迹趋于饱和，而深层且易出错的状态却未被充分探索。基于树的方法盲目地将预算分散到无关紧要或无法恢复的状态，造成采样稀释，既无法发现罕见的正确后缀，又使局部基线不稳定。为此，我们提出了深度密集探索（DDE），该策略将探索集中在“枢轴”（pivots）上，即失败轨迹中深层且可恢复的状态。我们用DEEP-GRPO实例化DDE，它引入了三项关键创新：（1）一个轻量级的数据驱动效用函数，自动平衡可恢复性和深度偏置以识别枢轴状态；（2）在每个枢轴处进行局部密集重采样，以提高发现正确后续轨迹的概率；以及（3）一个双流优化目标，将全局策略学习与局部纠正更新解耦。在数学推理基准上的实验表明，我们的方法持续优于GRPO、基于树的方法及其他强基线。
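
Local dense resampling at a pivot can be sketched in a few lines. This is a toy illustration and not DEEP-GRPO: the pivot prefix, the sampler, the reward function, and the mean-reward local baseline are all hypothetical stand-ins, and the pivot is given rather than chosen by the utility function the abstract describes.

```python
import random
import statistics


def local_dense_resample(prefix, sample_suffix, reward_fn, k, rng):
    """Sample k continuations from one pivot prefix and score each one
    against a local baseline (here: the mean reward of the k samples)."""
    completions = [prefix + sample_suffix(prefix, rng) for _ in range(k)]
    rewards = [reward_fn(c) for c in completions]
    baseline = statistics.fmean(rewards)
    advantages = [r - baseline for r in rewards]
    return completions, advantages


def toy_sampler(prefix, rng):
    # Hypothetical policy: continue with one token chosen at random.
    return [rng.choice(["right", "wrong"])]


def toy_reward(tokens):
    # Hypothetical verifiable reward: 1 if the final token is correct.
    return 1.0 if tokens[-1] == "right" else 0.0


rng = random.Random(0)
pivot_prefix = ["step1", "step2"]  # a deep state inside a failed trajectory
completions, advantages = local_dense_resample(
    pivot_prefix, toy_sampler, toy_reward, k=8, rng=rng
)
for tokens, adv in zip(completions, advantages):
    print(tokens, round(adv, 3))
```

All k samples share the same prefix, so the whole budget is spent on the suffix after the pivot instead of being spread over trajectories restarted from the root.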

UniRef-Image-Edit: Towards Scalable and Consistent Multi-Reference Image Editing

UniRef-Image-Edit：迈向可扩展且一致的多参考图像编辑

Authors: Hongyang Wei, Bin Wen, Yancheng Long, Yankai Yang, Yuhang Hu, Tianke Zhang, Wei Chen, Haonan Fan, Kaiyu Jiang, Jiankang Chen, Changyi Liu, Kaiyu Tang, Haojie Ding, Xiao Yang, Jia Sun, Huaiqing Wang, Zhenyu Yang, Xinyu Wei, Xianglong He, Yangguang Li, Fan Yang, Tingting Gao, Lei Zhang, Guorui Zhou, Han Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14186
Pdf link: https://arxiv.org/pdf/2602.14186
Abstract We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly constrained to fit within a fixed-length sequence under a global pixel-budget constraint. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, and are then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
中文摘要 我们介绍UniRef-Image-Edit，一种高性能的多模态生成系统，将单图像编辑和多图像合成统一在一个框架内。现有基于扩散的编辑方法常因参考输入之间的交互有限，难以在多个条件下保持一致性。为此，我们引入了序列扩展潜在融合（SELF），一种统一的输入表示，能够将多个参考图像动态序列化为一个连贯的潜在序列。在专门的训练阶段，所有参考图像在全局像素预算约束下被联合约束，以适应固定长度的序列。基于SELF，我们提出了一个包含监督微调（SFT）和强化学习（RL）的两阶段训练框架。在SFT阶段，我们在单图像编辑和多图像合成任务上联合训练，以建立稳健的生成先验。我们采用渐进式序列长度训练策略：所有输入图像最初被调整到总像素预算$1024^2$，随后逐步增加至$1536^2$和$2048^2$，以提升视觉保真度和跨参考一致性。这种对压缩的逐步放松使模型能够在保持各参考之间稳定对齐的同时，逐步捕捉更精细的视觉细节。在RL阶段，我们引入了多源GRPO（MSGRPO），据我们所知，这是首个专为多参考图像生成设计的强化学习框架。MSGRPO优化模型以调和相互冲突的视觉约束，显著提升了构图一致性。我们将开源代码、模型、训练数据和奖励数据，供社区研究使用。
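
The global pixel-budget constraint can be made concrete with a small calculation. The sketch below is one plausible reading, not the paper's procedure: it assumes every reference image is shrunk by the same factor so that the summed pixel count fits the budget, and the helper name `fit_to_pixel_budget` and the example sizes are hypothetical.

```python
import math


def fit_to_pixel_budget(sizes, budget):
    """Scale all (width, height) pairs by one common factor so that the
    total number of pixels does not exceed the budget."""
    total = sum(w * h for w, h in sizes)
    if total <= budget:
        return list(sizes)
    scale = math.sqrt(budget / total)
    return [
        (max(1, math.floor(w * scale)), max(1, math.floor(h * scale)))
        for w, h in sizes
    ]


references = [(2048, 2048), (1024, 1024), (1920, 1080)]  # made-up inputs

# Progressive stages named in the abstract: 1024^2, 1536^2, 2048^2.
for side in (1024, 1536, 2048):
    resized = fit_to_pixel_budget(references, side ** 2)
    used = sum(w * h for w, h in resized)
    print(f"budget {side}^2 -> {resized}, pixels used: {used}")
```

Raising the budget from stage to stage lets each reference keep more of its original resolution, which is the gradual relaxation of compression the abstract refers to.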

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

GeoEyes：按需视觉聚焦，基于证据理解超高分辨率遥感图像

Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14201
Pdf link: https://arxiv.org/pdf/2602.14201
Abstract The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
中文摘要 “用图像思考”范式使多模态大语言模型（MLLM）能够通过放大工具主动探索视觉场景。这对于超高分辨率（UHR）遥感VQA至关重要，因为任务相关线索稀少且微小。然而，我们观察到现有支持放大功能的MLLM存在一种一致的失败模式：工具使用同质化，即工具调用坍缩为与任务无关的模式，限制了有效的证据获取。为此，我们提出了GeoEyes，一个分阶段训练框架，包括（1）冷启动SFT数据集UHR Chain-of-Zoom（UHR-CoZ），涵盖多种放大模式，以及（2）一种智能体式强化学习方法AdaZoom-GRPO，显式奖励放大交互过程中的证据增益和答案改进。最终模型学会了按需放大并具备恰当的停止行为，在UHR遥感基准上取得了显著提升，在XLRS-Bench上准确率达到54.23%。

The Interspeech 2026 Audio Reasoning Challenge: Evaluating Reasoning Process Quality for Audio Reasoning Models and Agents

Interspeech 2026 音频推理挑战：评估音频推理模型和代理的推理过程质量

Authors: Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li, Jaeyeon Kim, Jin Xu, Jinyu Li, Carlos Busso, Kai Yu, Eng Siong Chng, Xie Chen
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2602.14224
Pdf link: https://arxiv.org/pdf/2602.14224
Abstract Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show that agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. Meanwhile, single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design and methodology and present a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
中文摘要 近期的大型音频语言模型（LALM）在理解方面表现出色，但往往缺乏透明的推理。为了解决这一“黑箱”局限，我们在Interspeech 2026上组织了音频推理挑战赛，这是首个专门评估音频领域思维链（Chain-of-Thought，CoT）质量的共享任务。该挑战引入了MMAR-Rubrics，这是一种新颖的实例级评估协议，用于评估推理链的事实性和逻辑性。比赛设有单模型和智能体两个赛道，吸引了来自18个国家和地区的156支队伍参赛。结果显示，智能体系统目前在推理质量方面领先，它们利用了迭代式工具编排和跨模态分析。此外，单模型正通过强化学习和精细的数据管道快速进步。我们详细介绍了挑战赛的设计、方法，并对最先进的系统进行了全面分析，为可解释的音频智能提供了新的见解。

Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

视觉之前的文本：分阶段的知识注入对超高分辨率遥感理解中的代理RLVR至关重要

Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yuhao Zhou, Di Wang, Yifan Zhang, Haoyu Wang, Haiyan Zhao, Hongda Sun, Long Lan, Jun Song, Yulin Wang, Jing Zhang, Wenlong Zhang, Bo Du
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14225
Pdf link: https://arxiv.org/pdf/2602.14225
Abstract Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms, comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on UHR RS. Controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence acquisition. Building on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state of the art.
中文摘要 超高分辨率（UHR）遥感（RS）的多模态推理通常受限于视觉证据获取：模型需要在巨大的像素空间中定位与任务相关的微小区域。虽然使用放大工具的智能体式可验证奖励强化学习（Agentic RLVR）提供了一条可行路径，但我们发现，在缺乏结构化领域先验的情况下，标准强化学习难以驾驭这些庞大的视觉空间。本文探讨了后训练范式之间的相互作用：在UHR RS上比较冷启动监督微调（SFT）、RLVR和Agentic RLVR。受控研究得出了一个反直觉的结论：高质量的地球科学纯文本问答是UHR视觉推理提升的主要驱动力。尽管不含图像，领域特定文本注入了引导视觉证据获取所需的概念、机理解释和决策规则。基于此，我们提出了一个分阶段的知识注入方案：（1）使用可扩展、经知识图谱验证的地球科学文本问答进行冷启动，以建立推理结构；以及（2）在SFT期间对同一批困难的UHR图文样本进行“预热”，以稳定并放大后续基于工具的强化学习。该方法在XLRS-Bench上取得了60.40%的Pass@1，显著优于更大的通用模型（如GPT-5.2、Gemini 3.0 Pro、Intern-S1），并确立了新的最先进水平。

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

REDSearcher：一个可扩展且成本效益高的长期搜索代理框架

Authors: Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Chenxiao Zhao, Cheng Xiang, Shengchao Hu, Dongdong Kuang, Ming Liu, Bing Qin, Xing Yu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.14234
Pdf link: https://arxiv.org/pdf/2602.14234
Abstract Large language models are transitioning from general-purpose knowledge engines to real-world problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of high-quality search trajectories and reward signals, arising from the difficulty of scalable long-horizon task construction and the high cost of interaction-heavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dual-constrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, high-quality tasks. (2) We introduce tool-augmented queries to encourage proactive tool use rather than passive recall. (3) During mid-training, we strengthen core atomic capabilities (knowledge, planning, and function calling), substantially reducing the cost of collecting high-quality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, low-cost algorithmic iteration for reinforcement learning experiments. Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance. To facilitate future research on long-horizon search agents, we will release 10K high-quality complex text search trajectories, 5K multimodal trajectories, and a 1K text RL query set, together with code and model checkpoints.
中文摘要 大语言模型正从通用知识引擎转变为现实世界的问题解决者，但针对深度搜索任务对其进行优化仍然充满挑战。核心瓶颈在于高质量搜索轨迹和奖励信号极度稀疏，这源于可扩展的长时程任务构建的困难，以及涉及外部工具调用的交互密集型rollout的高昂成本。为应对这些挑战，我们提出了REDSearcher，一个统一框架，对复杂任务合成、中期训练和后训练进行协同设计，以实现可扩展的搜索智能体优化。具体来说，REDSearcher引入了以下改进：（1）我们将任务合成表述为双约束优化，任务难度由图拓扑和证据分散度精确控制，从而能够可扩展地生成复杂且高质量的任务。（2）我们引入工具增强查询，鼓励主动使用工具而非被动回忆。（3）在中期训练阶段，我们强化知识、规划和函数调用这些核心原子能力，大幅降低了为下游训练收集高质量轨迹的成本。（4）我们构建了一个本地模拟环境，使强化学习实验能够快速、低成本地进行算法迭代。在纯文本和多模态搜索智能体基准上，我们的方法均取得了最先进的性能。为促进未来对长时程搜索智能体的研究，我们将发布1万条高质量复杂文本搜索轨迹、5千条多模态轨迹和1千条文本RL查询集，以及代码和模型检查点。

GRAIL: Goal Recognition Alignment through Imitation Learning

GRAIL：通过模仿学习实现目标识别对齐

Authors: Osher Elhadad, Felipe Meneguzzi, Reuth Mirsky
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.14252
Pdf link: https://arxiv.org/pdf/2602.14252
Abstract Understanding an agent's goals from its behavior is fundamental to aligning AI systems with human intentions. Existing goal recognition methods typically rely on an optimal goal-oriented policy representation, which may differ from the actor's true behavior and hinder the accurate recognition of their goal. To address this gap, this paper introduces Goal Recognition Alignment through Imitation Learning (GRAIL), which leverages imitation learning and inverse reinforcement learning to learn one goal-directed policy for each candidate goal directly from (potentially suboptimal) demonstration trajectories. By scoring an observed partial trajectory with each learned goal-directed policy in a single forward pass, GRAIL retains the one-shot inference capability of classical goal recognition while leveraging learned policies that can capture suboptimal and systematically biased behavior. Across the evaluated domains, GRAIL increases the F1-score by more than 0.5 under systematically biased optimal behavior, achieves gains of approximately 0.1-0.3 under suboptimal behavior, and yields improvements of up to 0.4 under noisy optimal trajectories, while remaining competitive in fully optimal settings. This work contributes toward scalable and robust models for interpreting agent goals in uncertain environments.
中文摘要 从智能体的行为中理解其目标，是将人工智能系统与人类意图对齐的基础。现有的目标识别方法通常依赖于最优的目标导向政策表征，这可能与行为者的真实行为不同，阻碍对其目标的准确识别。为弥补这一空白，本文引入了通过模仿学习实现目标识别对齐（GRAIL），该方法利用模仿学习和逆强化学习，直接从（可能不理想的）示范轨迹中为每个候选目标学习一个目标导向策略。通过在一次前传中对每个学习到的目标导向策略评分观察到的部分轨迹，GRAIL保留了经典目标识别的一次性推断能力，同时利用能够捕捉次优且系统性偏见行为的策略。在被评估的领域中，GRAIL在系统偏向最优行为下使F1得分提升超过0.5，在次优行为下获得约0.1-0.3的提升，在噪声最优轨迹下提升最多0.4分，同时在完全最优环境下保持竞争力。这项工作有助于构建可扩展且稳健的模型，用于在不确定环境中解读代理目标。
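
Scoring an observed partial trajectory against one learned policy per candidate goal can be sketched as a likelihood comparison. This is a minimal illustration, not GRAIL's scoring rule: it assumes each policy is a function from a state to a dictionary of action probabilities, and the two toy policies, the goal names, and the log-likelihood score are hypothetical choices.

```python
import math


def recognize_goal(observations, goal_policies, floor=1e-12):
    """Return the goal whose policy gives the highest log-likelihood to the
    observed (state, action) pairs, together with all scores."""
    scores = {}
    for goal, policy in goal_policies.items():
        scores[goal] = sum(
            math.log(max(policy(state).get(action, 0.0), floor))
            for state, action in observations
        )
    best_goal = max(scores, key=scores.get)
    return best_goal, scores


# Hypothetical learned policies: they need not be optimal, only to reflect
# how the demonstrator actually behaves when pursuing each goal.
goal_policies = {
    "left_exit": lambda state: {"L": 0.8, "R": 0.2},
    "right_exit": lambda state: {"L": 0.2, "R": 0.8},
}

partial_trajectory = [(0, "L"), (1, "L"), (2, "R")]  # includes one detour
goal, scores = recognize_goal(partial_trajectory, goal_policies)
print("recognized goal:", goal)
print("scores:", scores)
```

Because the policies assign non-zero probability to detours, one off-path action lowers a goal's score without ruling it out, which is how learned policies can tolerate suboptimal or biased behavior.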

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

KernelBlaster：通过记忆增强的上下文强化学习实现持续跨任务CUDA优化

Authors: Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14293
Pdf link: https://arxiv.org/pdf/2602.14293
Abstract Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based agentic flow for CUDA generation and optimization to achieve high performance across generations of GPU architectures. KernelBlaster guides LLM agents to systematically explore high-potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.
中文摘要 跨多代GPU架构优化CUDA代码具有挑战性，因为实现最高性能需要广泛探索日益复杂且硬件特定的优化领域。传统编译器受限于固定启发式，而微调大型语言模型（LLM）则可能成本高昂。然而，CUDA代码优化的代理式工作流在汇聚先前探索知识方面有限，导致抽样偏颇和解决方案次优。我们提出了KernelBlaster，这是一个内存增强上下文强化学习（MAIC-RL）框架，旨在提升基于LLM的GPU编码代理的CUDA优化搜索能力。KernelBlaster 使代理能够从经验中学习，并通过将知识积累到可检索的持久 CUDA 知识库中，系统性地做出未来任务的决策。我们提出了一种新型的配置文件引导、基于文本梯度的代理流，用于CUDA生成和优化，以实现跨世代GPU架构的高性能。KernelBlaster引导LLM代理系统性探索超越简单重写的高潜优化策略。与PyTorch基线相比，我们的方法在KernelBench第1、2和3级上分别实现了1.43倍、2.50倍和1.50倍的几何平均加速。我们将KernelBlaster作为一个开源的代理框架发布，配有测试工具、验证组件和可重复的评估流程。

Conformal Signal Temporal Logic for Robust Reinforcement Learning Control: A Case Study

稳健强化学习控制的共形信号时间逻辑：案例研究

Authors: Hani Beirami, M M Manjurul Islam
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2602.14322
Pdf link: https://arxiv.org/pdf/2602.14322
Abstract We investigate how formal temporal logic specifications can enhance the safety and robustness of reinforcement learning (RL) control in aerospace applications. Using the open source AeroBench F-16 simulation benchmark, we train a Proximal Policy Optimization (PPO) agent to regulate engine throttle and track commanded airspeed. The control objective is encoded as a Signal Temporal Logic (STL) requirement to maintain airspeed within a prescribed band during the final seconds of each maneuver. To enforce this specification at run time, we introduce a conformal STL shield that filters the RL agent's actions using online conformal prediction. We compare three settings: (i) PPO baseline, (ii) PPO with a classical rule-based STL shield, and (iii) PPO with the proposed conformal shield, under both nominal conditions and a severe stress scenario involving aerodynamic model mismatch, actuator rate limits, measurement noise, and mid-episode setpoint jumps. Experiments show that the conformal shield preserves STL satisfaction while maintaining near baseline performance and providing stronger robustness guarantees than the classical shield. These results demonstrate that combining formal specification monitoring with data driven RL control can substantially improve the reliability of autonomous flight control in challenging environments.
中文摘要 我们研究形式时序逻辑规范如何提升航空航天应用中强化学习（RL）控制的安全性和稳健性。利用开源的AeroBench F-16模拟基准测试，我们训练了一个近端策略优化（PPO）代理，用于调节发动机油门并跟踪指令空速。控制目标编码为信号时间逻辑（STL）要求，确保在每次机动的最后几秒钟内保持在规定的空速带内。为了在运行时强制执行该规范，我们引入了一个共形STL屏蔽，利用在线共形预测过滤RL代理的动作。我们比较三种设置：（i）PPO基线，（ii）基于经典规则的STL屏蔽的PPO，以及（iii）拟议的共形屏蔽，在名义条件下以及涉及空气动力模型不匹配、执行器速率限制、测量噪声和中段设定点跳跃等严重应力场景下。实验表明，贴合护盾在保持STL满足的同时，保持接近基线的性能，并提供比经典护盾更强的坚固性保证。这些结果表明，将正式规范监控与数据驱动的强化学习控制结合起来，可以显著提升自主飞行控制在复杂环境中的可靠性。
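
The STL requirement in the abstract above (keep airspeed inside a prescribed band during the final seconds of a maneuver) can be given a quantitative robustness value. The sketch below is only an illustration of that idea, not the authors' shield: the function name, the band half-width, the sampling step, and the airspeed trace are all hypothetical.

```python
def band_robustness(airspeeds, target, half_width, dt, final_seconds):
    """Robustness of 'always within the band over the final window'.

    Positive: the band is respected with that much margin.
    Negative: the band is violated by that much at the worst sample.
    """
    n_final = max(1, round(final_seconds / dt))  # samples in the final window
    window = airspeeds[-n_final:]
    # 'Always' corresponds to taking the minimum margin over the window.
    return min(half_width - abs(v - target) for v in window)


# Hypothetical airspeed trace sampled once per second (values are made up).
trace = [480.0, 495.0, 503.0, 499.0, 501.0, 500.5]
rho_short = band_robustness(trace, target=500.0, half_width=5.0, dt=1.0, final_seconds=3.0)
rho_full = band_robustness(trace, target=500.0, half_width=5.0, dt=1.0, final_seconds=6.0)
print(rho_short, rho_full)
```

A runtime shield would compare a predicted value of this margin against a threshold before accepting an action; how that prediction is calibrated is where conformal prediction would enter, and is not shown here.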

Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning

少训练，多学习：面向基于组的强化学习的自适应高效Rollout优化

Authors: Zhi Zhang, Zhen Han, Costas Mavromatis, Qi Zhu, Yunyi Zhang, Sheng Guan, Dingmin Wang, Xiong Zhou, Shuai Wang, Soji Adeshina, Vassilis Ioannidis, Huzefa Rangwala
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14338
Pdf link: https://arxiv.org/pdf/2602.14338
Abstract Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size $N$. When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves compute efficiency without sacrificing performance. Under the same total rollout budget, AERO reduces total training compute by about 48% while shortening wall-clock time per step by about 45% on average. Despite the substantial reduction in compute, AERO matches or improves Pass@8 and Avg@8 over GRPO, demonstrating a practical, scalable, and compute-efficient strategy for RL-based LLM alignment.
中文摘要 强化学习（RL）在大型语言模型（LLM）后训练中起着核心作用。在现有方法中，群体相对策略优化（Group Relative Policy Optimization，GRPO）被广泛使用，尤其是在基于可验证奖励的强化学习（RLVR）微调中。在GRPO中，每个查询都会提示LLM生成一组rollout，组大小固定为$N$。当一组内所有rollout的结果相同（全部正确或全部错误）时，组归一化优势为零，不会产生梯度信号，浪费微调计算。我们引入自适应高效Rollout优化（AERO），这是对GRPO的改进。AERO采用自适应rollout策略，通过选择性拒绝有策略地剪除rollout，并维护贝叶斯后验以防止零优势死区。在三种模型配置（Qwen2.5-Math-1.5B、Qwen2.5-7B和Qwen2.5-7B-Instruct）中，AERO在不牺牲性能的前提下提升了计算效率。在相同的总rollout预算下，AERO将总训练计算量减少约48%，同时平均每步的实际耗时缩短约45%。尽管计算量大幅减少，AERO在Pass@8和Avg@8上与GRPO持平或更优，展示了一种实用、可扩展且计算高效的基于强化学习的LLM对齐策略。
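
The zero-advantage case described in the abstract above is easy to reproduce. The sketch below assumes the common formulation in which each reward is centered by the group mean and divided by the group standard deviation; the function name, the `eps` term, and the reward values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group carries a learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout is correct yields all-zero advantages,
# so the rollouts cost compute but contribute no gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_correct)
```

This is the waste AERO targets; its adaptive rollout, selective rejection, and Bayesian posterior are not modeled here.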

Data-Driven Network LQG Mean Field Games with Heterogeneous Populations via Integral Reinforcement Learning

基于积分强化学习的异构群体数据驱动网络LQG平均场博弈

Authors: Jean Zhu, Shuang Gao
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.14339
Pdf link: https://arxiv.org/pdf/2602.14339
Abstract This paper establishes a data-driven solution for infinite horizon linear quadratic Gaussian Mean Field Games with network-coupled heterogeneous agent populations where the dynamics of the agents are unknown. The solution technique relies on Integral Reinforcement Learning and Kleinman's iteration for solving algebraic Riccati equations (ARE). The resulting algorithm uses trajectory data to generate network-coupled MFG strategies for agents and does not require parameters of agents' dynamics. Under technical conditions on the persistency of excitation and on the existence of unique stabilizing solution to the corresponding AREs, the learned network-coupled MFG strategies are shown to converge to their true values.
中文摘要 本文建立了一种基于数据的数据的解，适用于网络耦合异构代理群体中代理的无限视界线性二次高斯均场博弈，代理的动态未知。该解法依赖于积分强化学习和克莱因曼迭代来求解代数里卡蒂方程（ARE）。最终算法利用轨迹数据生成网络耦合的智能体 MFG 策略，无需智能体动态参数。在激发持久性及对应AREs存在唯一稳定解的技术条件下，所学到的网络耦合MFG策略被证明能收敛于其真实值。
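
Kleinman's iteration, mentioned in the abstract above, alternates policy evaluation (a Lyapunov equation) with policy improvement. The sketch below is the classical model-based version for a scalar system dx/dt = a*x + b*u with cost q*x^2 + r*u^2; it assumes known `a` and `b`, whereas the paper's Integral Reinforcement Learning variant replaces the evaluation step with trajectory data. All names and numbers are illustrative.

```python
import math


def kleinman_scalar(a, b, q, r, k0, iters=20):
    """Kleinman iteration for the scalar ARE 2*a*p - (b*p)**2 / r + q = 0."""
    k = k0
    if a - b * k >= 0:
        raise ValueError("initial gain must be stabilizing (a - b*k0 < 0)")
    p = None
    for _ in range(iters):
        a_cl = a - b * k  # closed-loop dynamics under u = -k*x
        # Policy evaluation: scalar Lyapunov equation 2*a_cl*p + q + r*k**2 = 0
        p = -(q + r * k * k) / (2.0 * a_cl)
        # Policy improvement
        k = b * p / r
    return p, k


def are_closed_form(a, b, q, r):
    """Stabilizing root of the scalar ARE, used only as a reference value."""
    return r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)


p, k = kleinman_scalar(a=1.0, b=1.0, q=1.0, r=1.0, k0=2.0)
print(p, are_closed_form(1.0, 1.0, 1.0, 1.0))
```

Starting from any stabilizing gain, the iterates converge to the stabilizing ARE solution, which is why the abstract's convergence result requires a unique stabilizing solution to exist.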

Zero-Shot Instruction Following in RL via Structured LTL Representations

通过结构化LTL表示实现强化学习中的零样本指令遵循

Authors: Mathias Jackermeier, Mattia Giuri, Jacques Cloete, Alessandro Abate
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14344
Pdf link: https://arxiv.org/pdf/2602.14344
Abstract We study instruction following in multi-task reinforcement learning, where an agent must zero-shot execute novel tasks not seen during training. In this setting, linear temporal logic (LTL) has recently been adopted as a powerful framework for specifying structured, temporally extended tasks. While existing approaches successfully train generalist policies, they often struggle to effectively capture the rich logical and temporal structure inherent in LTL specifications. In this work, we address these concerns with a novel approach to learn structured task representations that facilitate training and generalisation. Our method conditions the policy on sequences of Boolean formulae constructed from a finite automaton of the task. We propose a hierarchical neural architecture to encode the logical structure of these formulae, and introduce an attention mechanism that enables the policy to reason about future subgoals. Experiments in a variety of complex environments demonstrate the strong generalisation capabilities and superior performance of our approach.
中文摘要 我们研究多任务强化学习中的指令后续，该过程中代理必须零射执行训练中未见的新任务。在这种背景下，线性时间逻辑（LTL）最近被采纳为规范结构化、时间扩展任务的强大框架。虽然现有方法成功训练了通用策略，但它们常常难以有效捕捉LTL规范中固有的丰富逻辑和时间结构。本研究通过一种新颖的方法来学习结构化任务表征，促进训练和泛化，解决了这些问题。我们的方法对由任务的有限自动机构造的布尔公式序列的策略进行了条件。我们提出了一种分层神经架构，用于编码这些公式的逻辑结构，并引入一种注意力机制，使策略能够推理未来的子目标。在各种复杂环境中的实验展示了我们方法强大的泛化能力和卓越的性能。

WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control

WIMLE：不确定性感知世界模型，采用IMLE实现样本高效连续控制

Authors: Mehran Aghabozorgi, Alireza Moazeni, Yanshu Zhang, Ke Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14351
Pdf link: https://arxiv.org/pdf/2602.14351
Abstract Model-based reinforcement learning promises strong sample efficiency but often underperforms in practice due to compounding model error, unimodal world models that average over multi-modal dynamics, and overconfident predictions that bias learning. We introduce WIMLE, a model-based method that extends Implicit Maximum Likelihood Estimation (IMLE) to the model-based RL framework to learn stochastic, multi-modal world models without iterative sampling and to estimate predictive uncertainty via ensembles and latent sampling. During training, WIMLE weights each synthetic transition by its predicted confidence, preserving useful model rollouts while attenuating bias from uncertain predictions and enabling stable learning. Across 40 continuous-control tasks spanning DeepMind Control, MyoSuite, and HumanoidBench, WIMLE achieves superior sample efficiency and competitive or better asymptotic performance than strong model-free and model-based baselines. Notably, on the challenging Humanoid-run task, WIMLE improves sample efficiency by over 50% relative to the strongest competitor, and on HumanoidBench it solves 8 of 14 tasks (versus 4 for BRO and 5 for SimbaV2). These results highlight the value of IMLE-based multi-modality and uncertainty-aware weighting for stable model-based RL.
中文摘要 基于模型的强化学习有望带来良好的样本效率，但由于模型误差累积、单模态世界模型对多模态动态取平均，以及过度自信的预测导致学习偏差，实际表现常常不佳。我们介绍了WIMLE，这是一种基于模型的方法，将隐式最大似然估计（IMLE）扩展到基于模型的强化学习框架，无需迭代采样即可学习随机多模态世界模型，并通过集成和潜变量采样估计预测不确定性。在训练过程中，WIMLE 根据预测置信度对每个合成转移加权，保留有用的模型展开，同时减轻不确定预测带来的偏差，实现稳定学习。在涵盖DeepMind Control、MyoSuite和HumanoidBench的40个连续控制任务中，WIMLE取得了比强大的无模型和基于模型的基线更优的样本效率，以及相当或更优的渐近性能。值得注意的是，在具有挑战性的Humanoid-run任务中，WIMLE相较于最强的竞争者将样本效率提高了50%以上；在HumanoidBench上，它解决了14个任务中的8个（而BRO为4个，SimbaV2为5个）。这些结果凸显了基于IMLE的多模态建模和不确定性感知加权对稳定的基于模型强化学习的价值。

AdaptManip: Learning Adaptive Whole-Body Object Lifting and Delivery with Online Recurrent State Estimation

AdaptManip：通过在线循环状态估计学习自适应全身物体提起与投递

Authors: Morgan Byrd, Donghoon Baek, Kartik Garg, Hyunyoung Jung, Daesol Cho, Maks Sorokin, Robert Wright, Sehoon Ha
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14363
Pdf link: https://arxiv.org/pdf/2602.14363
Abstract This paper presents Adaptive Whole-body Loco-Manipulation, AdaptManip, a fully autonomous framework for humanoid robots to perform integrated navigation, object lifting, and delivery. Unlike prior imitation learning-based approaches that rely on human demonstrations and are often brittle to disturbances, AdaptManip aims to train a robust loco-manipulation policy via reinforcement learning without human demonstrations or teleoperation data. The proposed framework consists of three coupled components: (1) a recurrent object state estimator that tracks the manipulated object in real time under limited field-of-view and occlusions; (2) a whole-body base policy for robust locomotion with residual manipulation control for stable object lifting and delivery; and (3) a LiDAR-based robot global position estimator that provides drift-robust localization. All components are trained in simulation using reinforcement learning and deployed on real hardware in a zero-shot manner. Experimental results show that AdaptManip significantly outperforms baseline methods, including imitation learning-based approaches, in adaptability and overall success rate, while accurate object state estimation improves manipulation performance even under occlusion. We further demonstrate fully autonomous real-world navigation, object lifting, and delivery on a humanoid robot.
中文摘要 本文介绍了自适应全身移动操作（AdaptManip），这是一个全自主的框架，用于实现人形机器人的集成导航、物体提起和投递。与以往依赖人类演示且常易受干扰的模仿学习方法不同，AdaptManip旨在通过强化学习训练稳健的移动操作策略，而无需人工演示或遥操作数据。该框架由三个耦合的组成部分构成：（1）一个在有限视场和遮挡条件下实时追踪被操作物体的循环物体状态估计器；（2）一套用于稳健移动的全身基础策略，并辅以残差操作控制以实现物体的稳定提起和投递；以及（3）基于激光雷达的机器人全局位置估计器，提供抗漂移的定位。所有组件均在仿真中通过强化学习训练，并以零样本方式部署在真实硬件上。实验结果显示，AdaptManip在适应性和整体成功率方面显著优于包括模仿学习方法在内的基线方法，而准确的物体状态估计即使在遮挡下也能提升操作性能。我们还进一步展示了人形机器人上的全自主真实世界导航、物体提起和投递。

A Q-Learning Approach for Dynamic Resource Management in Three-Tier Vehicular Fog Computing

三层车载雾计算中动态资源管理的Q-学习方法

Authors: Bahar Mojtabaei Ranani, Mahmood Ahmadi, Sajad Ahmadian
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.14390
Pdf link: https://arxiv.org/pdf/2602.14390
Abstract In this paper, a method for predicting the resources required for an intelligent vehicle client using a three-layer vehicular computing architecture is proposed. This method leverages Q-Learning to optimize resource allocation and enhance overall system performance. This approach employs reinforcement learning capabilities to provide a dynamic and adaptive strategy for resource management in a fog computing environment. The key findings of this study indicate that Q-learning can effectively predict the appropriate allocation of resources by learning from past experiences and making informed decisions. Through continuous training and updating of the Q-learning agent, the system can adapt to changing conditions and make resource allocation decisions based on real-time information. The experimental results demonstrate the effectiveness of the proposed method in optimizing resource allocation. The Q-learning agent predicts the optimal values for memory, bandwidth, and processor. These predictions not only minimize resource consumption but also meet the performance requirements of the fog system. Implementations show that this method improves the average task processing time compared to the other methods evaluated in this study.
中文摘要 本文提出了一种利用三层车载计算架构预测智能车辆客户端所需资源的方法。该方法利用Q-Learning优化资源分配，提升整体系统性能。该方法利用强化学习能力，在雾计算环境中提供动态且自适应的资源管理策略。本研究的关键发现表明，Q学习可以通过从过往经验中学习并做出明智决策，有效预测资源的适当分配。通过持续训练和更新Q学习智能体，系统能够适应变化的条件，并基于实时信息做出资源分配决策。实验结果证明了该方法在优化资源分配方面的有效性。Q学习智能体预测内存、带宽和处理器的最佳值。这些预测不仅最大限度地减少了资源消耗，还满足雾系统的性能要求。实现结果显示，与本研究评估的其他方法相比，该方法改善了平均任务处理时间。
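
The abstract above does not specify the state, action, or reward design, so the following is only a generic tabular Q-learning update with hypothetical load levels as states and allocation sizes as actions. It shows the standard temporal-difference rule the method's name refers to, not the authors' implementation.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: Q += alpha * (TD target - Q)."""
    best_next = max(q[next_state].values())  # greedy value of the next state
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical states (load level) and actions (allocation size).
q_table = {
    "low_load": {"small": 0.0, "large": 0.0},
    "high_load": {"small": 0.0, "large": 0.0},
}

# Under high load, a large allocation met the deadline (reward 1.0).
first = q_update(q_table, "high_load", "large", 1.0, "low_load")
# The value learned for high load then propagates back to the low-load state.
second = q_update(q_table, "low_load", "small", 0.0, "high_load")
print(first, second)
```

In a resource-prediction setting, the greedy action in each state would be read off as the recommended memory, bandwidth, or processor allocation.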

LACONIC: Length-Aware Constrained Reinforcement Learning for LLM

LACONIC：面向LLM的长度感知约束强化学习

Authors: Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, Lin F. Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14468
Pdf link: https://arxiv.org/pdf/2602.14468
Abstract Reinforcement learning (RL) has enhanced the capabilities of large language models (LLMs) through reward-driven training. Nevertheless, this process can introduce excessively long responses, inflating inference latency and computational overhead. Prior length-control approaches typically rely on fixed heuristic reward shaping, which can misalign with the task objective and require brittle tuning. In this work, we propose LACONIC, a reinforcement learning method that enforces a target token budget during training. Specifically, we update policy models using an augmented objective that combines the task reward with a length-based cost. To balance brevity and task performance, the cost scale is adaptively adjusted throughout training. This yields robust length control while preserving task reward. We provide a theoretical guarantee that supports the method. Across mathematical reasoning models and datasets, LACONIC preserves or improves pass@1 while reducing output length by over 50%. It maintains out-of-domain performance on general knowledge and multilingual benchmarks with 44% fewer tokens. Moreover, LACONIC integrates into standard RL-tuning with no inference changes and minimal deployment overhead.
中文摘要 强化学习（RL）通过奖励驱动训练增强了大型语言模型（LLMs）的能力。然而，这一过程可能会引入过长的响应，增加推理延迟和计算开销。以往的长度控制方法通常依赖固定的启发式奖励形态，这可能与任务目标不匹配，需要对脆弱性进行调整。在本研究中，我们提出了LACONIC方法，一种在训练过程中强制执行目标令牌预算的强化学习方法。具体来说，我们通过结合任务奖励和基于长度的成本的增强目标来更新策略模型。为了平衡简洁性和任务表现，成本尺度会在培训过程中进行自适应调整。这样可以在保持任务奖励的同时实现稳健的长度控制。我们提供理论保证，支持该方法。在数学推理模型和数据集中，LACONIC 在保持或改进pass@1同时将输出长度缩短超过50%。它在通用知识和多语言基准测试中保持域外表现，代币数量减少了44%。此外，LACONIC 集成到标准强化学习调优中，无需更改推理，部署开销也极低。
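
The abstract above describes an augmented objective (task reward combined with a length-based cost) whose cost scale is adapted during training, but it does not give the formulas. The sketch below assumes one plausible reading: a relative overshoot cost and a dual-ascent style update of the cost scale. Both functions, the step size, and all numbers are assumptions for illustration only.

```python
def augmented_reward(task_reward, length, budget, cost_scale):
    """Task reward minus a scaled penalty for exceeding the token budget."""
    cost = max(0.0, (length - budget) / budget)  # assumed relative overshoot
    return task_reward - cost_scale * cost


def update_cost_scale(cost_scale, mean_length, budget, step_size=0.1):
    """Raise the scale when responses run over budget, lower it otherwise."""
    violation = (mean_length - budget) / budget
    return max(0.0, cost_scale + step_size * violation)  # keep scale non-negative


# Hypothetical batch: correct answer, 1500 tokens against a 1000-token budget.
shaped = augmented_reward(task_reward=1.0, length=1500, budget=1000, cost_scale=0.5)
raised = update_cost_scale(cost_scale=0.5, mean_length=1500, budget=1000)
relaxed = update_cost_scale(cost_scale=0.02, mean_length=500, budget=1000)
print(shaped, raised, relaxed)
```

Because only the training reward changes, a scheme of this kind needs no change at inference time, which matches the deployment claim in the abstract.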

Socially-Weighted Alignment: A Game-Theoretic Framework for Multi-Agent LLM Systems

社会加权对齐：多智能体大型语言模型系统的博弈论框架

Authors: Furkan Mumcu, Yasin Yilmaz
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14471
Pdf link: https://arxiv.org/pdf/2602.14471
Abstract Deploying large language model (LLM) agents in shared environments introduces a fundamental tension between individual alignment and collective stability: locally rational decisions can impose negative externalities that degrade system-level performance. We propose Socially-Weighted Alignment (SWA), a game-theoretic framework that modifies inference-time decision making by interpolating between an agent's private objective and an estimate of group welfare via a social weight $\lambda\in[0,1]$. In a shared-resource congestion game with $n$ agents and congestion severity $\beta$, we show that SWA induces a critical threshold $\lambda^*=(n-\beta)/(n-1)$ above which agents no longer have marginal incentive to increase demand under overload, yielding a phase transition from persistent congestion to stable operation near capacity. We further provide an inference-time algorithmic instantiation of SWA that does not require parameter updates or multi-agent reinforcement learning, and use a multi-agent simulation to empirically validate the predicted threshold behavior.
中文摘要 在共享环境中部署大型语言模型（LLM）智能体，带来了个体对齐与集体稳定性之间的根本张力：局部理性的决策可能产生负外部性，降低系统级性能。我们提出了社会加权对齐（SWA），这是一种博弈论框架，通过社会权重$\lambda\in[0,1]$在智能体的私人目标与群体福利估计之间插值，从而修改推理时的决策。在一个拥有$n$个智能体、拥塞严重度为$\beta$的共享资源拥塞博弈中，我们证明SWA会产生一个临界阈值$\lambda^*=(n-\beta)/(n-1)$，超过该阈值后智能体在过载时不再有增加需求的边际激励，从而出现从持续拥堵到接近容量稳定运行的相变。我们还给出了SWA在推理时的算法实例，无需参数更新或多智能体强化学习，并利用多智能体模拟实证验证了预测的阈值行为。
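
The threshold formula in the abstract above can be evaluated directly. The sketch below computes $\lambda^*=(n-\beta)/(n-1)$ and, as an assumption, treats the interpolation between private objective and group welfare as a convex combination; the function names and the example values of `n`, `beta`, and the utilities are hypothetical.

```python
def critical_social_weight(n_agents, beta):
    """Threshold lambda* = (n - beta) / (n - 1) from the abstract."""
    if n_agents < 2:
        raise ValueError("need at least two agents")
    return (n_agents - beta) / (n_agents - 1)


def socially_weighted_score(private_utility, group_welfare, lam):
    """Assumed convex interpolation between private and group objectives."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("social weight must lie in [0, 1]")
    return (1.0 - lam) * private_utility + lam * group_welfare


lam_star = critical_social_weight(n_agents=10, beta=4.0)
# An action that helps the agent (+1.0) but hurts the group (-2.0).
selfish_action_score = socially_weighted_score(1.0, -2.0, lam=0.5)
print(lam_star, selfish_action_score)
```

With ten agents and a congestion severity of 4, any social weight above roughly 0.67 would, per the abstract's result, remove the marginal incentive to increase demand under overload.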

Learning Transferability: A Two-Stage Reinforcement Learning Approach for Enhancing Quadruped Robots' Performance in U-Shaped Stair Climbing

学习可迁移性：提升四足机器人U形楼梯爬行性能的两阶段强化学习方法

Authors: Baixiao Huang, Baiyu Huang, Yu Hou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14473
Pdf link: https://arxiv.org/pdf/2602.14473
Abstract Quadruped robots are employed in various scenarios in building construction. However, autonomous stair climbing across different indoor staircases remains a major challenge for robot dogs to complete building construction tasks. In this project, we employed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize a robot's performance on U-shaped stairs. The training robot-dog modality, Unitree Go2, was first trained to climb stairs on Isaac Lab's pyramid-stair terrain, and then to climb a U-shaped indoor staircase using the learned policies. This project explores end-to-end RL methods that enable robot dogs to autonomously climb stairs. The results showed (1) the successful goal reached for robot dogs climbing U-shaped stairs with a stall penalty, and (2) the transferability from the policy trained on U-shaped stairs to deployment on straight, L-shaped, and spiral stair terrains, and transferability from other stair models to deployment on U-shaped terrain.
中文摘要 四足机器人在建筑施工的各种场景中被广泛应用。然而，自动爬梯跨越不同室内楼梯仍然是机器人狗完成建筑施工任务的重大挑战。在本项目中，我们采用了两阶段的端到端深度强化学习（RL）方法，以优化机器人在U形楼梯上的表现。训练机器人-狗狗模式Unitree Go2，首先被训练为在Isaac实验室的金字塔楼梯地形上爬楼梯，然后利用所学策略攀爬U形室内楼梯。本项目探索了端到端强化学习方法，使机器人狗能够自主爬楼梯。结果显示：（1）机器人狗在U形楼梯上爬行并有拖延惩罚，成功实现目标;（2）从U形楼梯训练策略可迁移到直线、L形和螺旋楼梯地形，以及从其他楼梯模型迁移到U形地形的可转移性。

TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning

TikArt：光圈引导观察，通过强化学习实现细粒度视觉推理

Authors: Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14482
Pdf link: https://arxiv.org/pdf/2602.14482
Abstract We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that couple task success with purposeful aperture use. Experiments on V*, HR-Bench-4K/8K, MME-RealWorld-Lite, MMStar, RefCOCO, and ReasonSeg show consistent gains over the backbone and yield interpretable aperture trajectories for high-resolution reasoning.
中文摘要 我们研究多模态大型语言模型（MLLMs）中的细粒度视觉推理，其中关键证据可能存在于微小物体、杂乱区域或细微标记中，这些在单一全局图像编码下会丢失。我们介绍TikArt（Thinking Aperture），这是一种光圈引导的智能体，将多步视觉语言推理表述为对感兴趣区域的决策过程。TikArt遵循Think-Aperture-Observe循环，在语言生成和两种光圈动作之间交替进行：Zoom提取矩形裁剪区域，而Segment调用SAM2为不规则目标获取基于掩码的裁剪区域。每次动作之后，模型必须给出明确的观察，将局部视觉线索转化为持久的语言记忆。TikArt基于Qwen3-VL-8B构建，通过AGRPO优化推理策略，AGRPO是一种GRPO风格的强化学习算法，采用两阶段课程：先预热分割动作，然后联合优化视觉数学、细粒度VQA和分割，所用奖励将任务成功与有目的的光圈使用结合起来。在V*、HR-Bench-4K/8K、MME-RealWorld-Lite、MMStar、RefCOCO和ReasonSeg上的实验显示，相较于骨干模型取得了一致的提升，并为高分辨率推理提供了可解释的光圈轨迹。

Formally Verifying and Explaining Sepsis Treatment Policies with COOL-MC

使用COOL-MC对败血症治疗策略进行形式化验证与解释

Authors: Dennis Gross
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14505
Pdf link: https://arxiv.org/pdf/2602.14505
Abstract Safe and interpretable sequential decision-making is critical in healthcare, yet reinforcement learning (RL) policies for sepsis treatment optimization remain opaque and difficult to verify. Standard probabilistic model checkers operate on the full state space, which becomes infeasible for larger MDPs, and cannot explain why a learned policy makes particular decisions. COOL-MC wraps the model checker Storm but adds three key capabilities: it constructs only the reachable state space induced by a trained policy, yielding a smaller discrete-time Markov chain amenable to verification even when full-MDP analysis is intractable; it automatically labels states with clinically meaningful atomic propositions; and it integrates explainability methods with probabilistic computation tree logic (PCTL) queries to reveal which features drive decisions across treatment trajectories. We demonstrate COOL-MC's capabilities on the ICU-Sepsis MDP, a benchmark derived from approximately 17,000 sepsis patient records, which serves as a case study for applying COOL-MC to the formal analysis of sepsis treatment policies. Our analysis establishes hard bounds via full MDP verification, trains a safe RL policy that achieves optimal survival probability, and analyzes its behavior via PCTL verification and explainability on the induced DTMC. This reveals, for instance, that our trained policy relies predominantly on prior dosing history rather than the patient's evolving condition, a weakness that is invisible to standard evaluation but is exposed by COOL-MC's integration of formal verification and explainability. Our results illustrate how COOL-MC could serve as a tool for clinicians to investigate and debug sepsis treatment policies before deployment.
中文摘要 安全且可解释的顺序决策在医疗领域至关重要，然而用于败血症治疗优化的强化学习（RL）政策仍然不透明且难以验证。标准的概率模型检查器在完整的状态空间上工作，这对于更大的MDP来说变得不可行，也无法解释为何一个学习过的策略会做出特定决策。COOL-MC包裹了模型检查器Storm，但增加了三个关键功能：它仅构造由训练策略诱导的可达状态空间，从而生成一个更小的离散时间马尔可夫链，即使全MDP分析难以处理也能进行验证;它自动给具有临床意义的原子命题标记状态;它将可解释性方法与概率计算树逻辑（PCTL）查询整合，揭示哪些特征驱动了跨治疗轨迹的决策。我们在ICU-败血症MDP测试中展示了COOL-MC的能力，该基准基于约17,000条败血症患者记录，作为应用COOL-MC于败血症治疗政策正式分析的案例研究。我们的分析通过完整的MDP验证建立硬界限，训练一个实现最佳生存概率的安全强化学习策略，并通过PCTL验证和对诱导DTMC的解释性进行分析其行为。例如，这表明我们训练有素的政策主要依赖于以往的用药历史，而非患者不断变化的状况，这一弱点在标准评估中难以察觉，但通过COOL-MC整合的形式验证和可解释性得以暴露。我们的结果展示了COOL-MC如何作为临床医生在部署前调查和调试败血症治疗策略的工具。
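
The first capability listed in the abstract above, building only the state space reachable under a trained policy, amounts to a graph search that follows the policy's chosen action in each state. The sketch below illustrates that idea on a toy MDP; it does not use COOL-MC's or Storm's actual interfaces, and the dictionary layout, state names, and policy are hypothetical.

```python
from collections import deque


def induced_dtmc(mdp, policy, initial_state):
    """Build the Markov chain reachable from initial_state under policy.

    mdp[state][action] is a list of (probability, next_state) pairs.
    """
    chain = {}
    frontier = deque([initial_state])
    while frontier:
        state = frontier.popleft()
        if state in chain:
            continue  # already expanded
        action = policy(state)  # a fixed policy removes all nondeterminism
        chain[state] = mdp[state][action]
        for _, nxt in chain[state]:
            if nxt not in chain:
                frontier.append(nxt)
    return chain


# Toy MDP with made-up states and probabilities.
toy_mdp = {
    "s0": {"treat": [(0.8, "s1"), (0.2, "dead")], "wait": [(1.0, "s2")]},
    "s1": {"treat": [(1.0, "s1")], "wait": [(1.0, "s2")]},
    "s2": {"treat": [(1.0, "s2")], "wait": [(1.0, "s2")]},
    "dead": {"treat": [(1.0, "dead")], "wait": [(1.0, "dead")]},
}

chain = induced_dtmc(toy_mdp, policy=lambda state: "treat", initial_state="s0")
print(sorted(chain))
```

State `s2` never appears in the induced chain because the policy never takes `wait`; on large MDPs this pruning is what keeps verification tractable.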

TWISTED-RL: Hierarchical Skilled Agents for Knot-Tying without Human Demonstrations

TWISTED-RL：无需人类演示即可打结的分层技能智能体

Authors: Guy Freund, Tom Jurgenson, Matan Sudry, Erez Karpas
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14526
Pdf link: https://arxiv.org/pdf/2602.14526
Abstract Robotic knot-tying represents a fundamental challenge in robotics due to the complex interactions between deformable objects and strict topological constraints. We present TWISTED-RL, a framework that improves upon the previous state-of-the-art in demonstration-free knot-tying (TWISTED), which smartly decomposed a single knot-tying problem into manageable subproblems, each addressed by a specialized agent. Our approach replaces TWISTED's single-step inverse model that was learned via supervised learning with a multi-step Reinforcement Learning policy conditioned on abstract topological actions rather than goal states. This change allows more delicate topological state transitions while avoiding costly and ineffective data collection protocols, thus enabling better generalization across diverse knot configurations. Experimental results demonstrate that TWISTED-RL manages to solve previously unattainable knots of higher complexity, including commonly used knots such as the Figure-8 and the Overhand. Furthermore, the increase in success rates and drop in planning time establishes TWISTED-RL as the new state-of-the-art in robotic knot-tying without human demonstrations.
中文摘要 机器人结结是机器人学中的一个根本挑战，因为可变形物体之间的复杂相互作用和严格的拓扑约束。我们介绍TWISTED-RL，这是一个改进了之前最先进的无演示结结（TWISTED）框架，后者巧妙地将单个结结问题分解为可管理的子问题，每个子问题由专门的代理处理。我们的方法用以抽象拓扑动作而非目标状态为条件的多步强化学习策略，取代了通过监督学习学习的TWISTED单步逆模型。这一变化允许更精细的拓扑状态转变，同时避免昂贵且无效的数据收集协议，从而实现在不同结配置间更为通用的应用。实验结果表明，TWISTED-RL能够解决此前无法达到的高复杂度结，包括常用的数字8结和Overhand结。此外，成功率的提升和规划时间的缩短，使TWISTED-RL成为无需人工演示即可机器人结结系的新技术。

MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

MoRL：统一运动理解与生成的强化推理

Authors: Hongpeng Wang, Zeyu Zhang, Wenhao Li, Hao Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14534
Pdf link: https://arxiv.org/pdf/2602.14534
Abstract Human motion understanding and generation are crucial for vision and robotics but remain limited in reasoning capability and test-time planning. We propose MoRL, a unified multimodal motion model trained with supervised fine-tuning and reinforcement learning with verifiable rewards. Our task-specific reward design combines semantic alignment and reasoning coherence for understanding with physical plausibility and text-motion consistency for generation, improving both logical reasoning and perceptual realism. To further enhance inference, we introduce Chain-of-Motion (CoM), a test-time reasoning method that enables step-by-step planning and reflection. We also construct two large-scale CoT datasets, MoUnd-CoT-140K and MoGen-CoT-140K, to align motion sequences with reasoning traces and action descriptions. Experiments on HumanML3D and KIT-ML show that MoRL achieves significant gains over state-of-the-art baselines. Code: this https URL. Website: this https URL.
中文摘要 人类运动的理解和生成对于视觉和机器人技术至关重要，但在推理能力和测试时间规划方面仍有限。我们提出了MoRL，一种统一的多模态运动模型，通过监督微调和强化学习训练，并提供可验证的奖励。我们的任务特定奖励设计结合了语义对齐和推理连贯性以实现理解，同时实现物理合理性和文本-运动一致性，提升逻辑推理和感知真实性。为了进一步增强推断，我们引入了运动链（Chain-of-Motion，简称CoM），这是一种测试时间推理方法，支持逐步规划和反思。我们还构建了两个大型CoT数据集MoUnd-CoT-140K和MoGen-CoT-140K，以对应运动序列与推理轨迹和动作描述。HumanML3D和KIT-ML的实验表明，MoRL相较于最先进的基线实现了显著的进步。代码：这个 https URL。网站：这个 https URL。

Fluid-Agent Reinforcement Learning

流体代理强化学习

Authors: Shishir Sharma, Doina Precup, Theodore J. Perkins
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.14559
Pdf link: https://arxiv.org/pdf/2602.14559
Abstract The primary focus of multi-agent reinforcement learning (MARL) has been to study interactions among a fixed number of agents embedded in an environment. However, in the real world, the number of agents is neither fixed nor known a priori. Moreover, an agent can decide to create other agents (for example, a cell may divide, or a company may spin off a division). In this paper, we propose a framework that allows agents to create other agents; we call this a fluid-agent environment. We present game-theoretic solution concepts for fluid-agent games and empirically evaluate the performance of several MARL algorithms within this framework. Our experiments include fluid variants of established benchmarks such as Predator-Prey and Level-Based Foraging, where agents can dynamically spawn, as well as a new environment we introduce that highlights how fluidity can unlock novel solution strategies beyond those observed in fixed-population settings. We demonstrate that this framework yields agent teams that adjust their size dynamically to match environmental demands.
中文摘要 多智能体强化学习（MARL）的主要关注点是研究嵌入环境中固定数量的智能体之间的相互作用。然而，在现实世界中，代理人的数量既不固定，也不是事先已知的。此外，代理人可以决定创建其他代理人（例如，一个单元可能分裂，或者公司可能从某个部门分拆出来）。本文提出一个框架，允许智能体创建其他智能体;我们称之为流体代理环境。我们提出了流体代理博弈的博弈论解概念，并通过实证评估了多个MARL算法在该框架内的性能。我们的实验包括对既定基准的流动变体，如捕食者-猎物和基于等级的采集，这些基准点中代理可以动态生成，以及我们引入的新环境，强调流动性如何解锁超越固定种群环境的新解决方案策略。我们证明，该框架能够生成能够动态调整规模以适应环境需求的代理团队。

Simulation-based Learning of Electrical Cabinet Assembly Using Robot Skills

基于仿真的机器人技能电气柜装配学习

Authors: Arik Laemmle, Balázs András Bálint, Philipp Tenbrock, Frank Naegele, David Traunecker, József Váncza, Marco F. Huber
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.14561
Pdf link: https://arxiv.org/pdf/2602.14561
Abstract This paper presents a simulation-driven approach for automating the force-controlled assembly of electrical terminals on DIN-rails, a task traditionally hindered by high programming effort and product variability. The proposed method integrates deep reinforcement learning (DRL) with parameterizable robot skills in a physics-based simulation environment. To realistically model the snap-fit assembly process, we develop and evaluate two types of joining models: analytical models based on beam theory and rigid-body models implemented in the MuJoCo physics engine. These models enable accurate simulation of interaction forces, essential for training DRL agents. The robot skills are structured using the pitasc framework, allowing modular, reusable control strategies. Training is conducted in simulation using Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithms. Domain randomization is applied to improve robustness. The trained policies are transferred to a physical UR10e robot system without additional tuning. Experimental results demonstrate high success rates (up to 100%) in both simulation and real-world settings, even under significant positional and rotational deviations. The system generalizes well to new terminal types and positions, significantly reducing manual programming effort. This work highlights the potential of combining simulation-based learning with modular robot skills for flexible, scalable automation in small-batch manufacturing. Future work will explore hybrid learning methods, automated environment parameterization, and further refinement of joining models for design integration.
中文摘要 本文提出了一种基于仿真的自动化方式，用于在DIN轨道上由力控制组装电气端子，这一任务传统上受限于高编程工作量和产品多样性。该方法将深度强化学习（DRL）与可参数化机器人技能在基于物理的仿真环境中集成。为了真实建模卡扣装配过程，我们开发并评估了两种类型的连接模型：基于梁理论的解析模型和在MuJoCo物理引擎中实现的刚体模型。这些模型能够精确模拟相互作用力，这对于训练DRL代理至关重要。机器人技能采用pitasc框架构建，允许模块化、可重复使用的控制策略。训练通过软演员-批判者（SAC）和双延迟深度确定性策略梯度（TD3）算法进行模拟。应用域随机化以提升鲁棒性。训练好的保单会转移到物理的UR10e机器人系统，无需额外调优。实验结果显示，在模拟和现实环境中，即使存在显著的位置和旋转偏差，成功率高达100%。该系统能够很好地推广到新的终端类型和位置，显著减少了手动编程工作量。这项工作强调了将仿真学习与模块化机器人技能结合，在小批量制造中实现灵活且可扩展自动化的潜力。未来工作将探索混合学习方法、自动化环境参数化，以及进一步完善设计集成的连接模型。

DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

DriveFine：精炼增强的掩码扩散VLA，实现精准稳健的驾驶

Authors: Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang, Guang Li, Hangjun Ye, Jie Ma, Long Chen, Yan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.14577
Pdf link: https://arxiv.org/pdf/2602.14577
Abstract Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at this https URL.
中文摘要 视觉-语言-行动（VLA）自动驾驶模型越来越多地采用通过模仿学习和强化学习训练的生成式规划器。基于扩散的规划者存在模态对齐困难、训练效率低和泛化有限的问题。基于代币的规划器存在累积因果错误和不可逆解码的问题。总之，这两种主导范式展现出互补的优势和劣势。本文提出了DriveFine，一种结合了灵活解码和自纠正能力的掩蔽扩散VLA模型。特别是，我们设计了一种新颖的即插即用模块化技术，无缝地将精细化专家融入生成专家之上。通过在推理时实现显式专家选择和训练时梯度阻断，两位专家实现了完全解耦，保留了预训练权重的基础能力和通用模式，凸显了块-MoE设计的灵活性和可扩展性。此外，我们设计了一种混合强化学习策略，鼓励高效探索精炼专家，同时保持训练稳定性。在NAVSIM v1、v2和Navhard基准测试上的大量实验表明，DriveFine展现出强大的效能和鲁棒性。代码将在此 https URL 发布。

RNM-TD3: N:M Semi-structured Sparse Reinforcement Learning From Scratch

RNM-TD3：从零开始的N:M半结构化稀疏强化学习

Authors: Isam Vrce, Andreas Kassler, Gökçe Aydos
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2602.14578
Pdf link: https://arxiv.org/pdf/2602.14578
Abstract Sparsity is a well-studied technique for compressing deep neural networks (DNNs) without compromising performance. In deep reinforcement learning (DRL), neural networks with up to 5% of their original weights can still be trained with minimal performance loss compared to their dense counterparts. However, most existing methods rely on unstructured fine-grained sparsity, which limits hardware acceleration opportunities due to irregular computation patterns. Structured coarse-grained sparsity enables hardware acceleration, yet typically degrades performance and increases pruning complexity. In this work, we present, to the best of our knowledge, the first study on N:M structured sparsity in RL, which balances compression, performance, and hardware efficiency. Our framework enforces row-wise N:M sparsity throughout training for all networks in off-policy RL (TD3), maintaining compatibility with accelerators that support N:M sparse matrix operations. Experiments on continuous-control benchmarks show that RNM-TD3, our N:M sparse agent, outperforms its dense counterpart at 50%-75% sparsity (e.g., 2:4 and 1:4), achieving up to a 14% increase in performance at 2:4 sparsity on the Ant environment. RNM-TD3 remains competitive even at 87.5% sparsity (1:8), while enabling potential training speedups.
中文摘要 稀疏化是一种经过深入研究的深度神经网络（DNN）压缩技术，可在不影响性能的前提下压缩网络。在深度强化学习（DRL）中，仅保留至多5%原始权重的神经网络仍可训练，且与稠密网络相比性能损失极小。然而，大多数现有方法依赖非结构化的细粒度稀疏性，其不规则的计算模式限制了硬件加速的机会。结构化的粗粒度稀疏性支持硬件加速，但通常会降低性能并增加剪枝复杂度。在本研究中，据我们所知，我们首次研究了强化学习中的N:M结构化稀疏性，它在压缩、性能和硬件效率之间取得平衡。我们的框架在离策略强化学习（TD3）的整个训练过程中，对所有网络强制执行按行的N:M稀疏性，并保持与支持N:M稀疏矩阵运算的加速器的兼容性。在连续控制基准上的实验表明，我们的N:M稀疏智能体RNM-TD3在50%-75%稀疏度（如2:4和1:4）下优于其稠密对应模型，在Ant环境中以2:4稀疏度实现了高达14%的性能提升。RNM-TD3即使在87.5%的稀疏度（1:8）下依然具有竞争力，同时有望实现训练加速。
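To make the row-wise N:M pattern in the RNM-TD3 abstract concrete, here is a minimal sketch of an N:M mask: within every group of M consecutive weights in a row, only N are kept. The abstract does not say how the kept weights are chosen, so the magnitude-based selection below is an assumption, and the function names (`nm_mask_row`, `apply_nm_sparsity`, `sparsity`) are hypothetical.

```python
def nm_mask_row(row, n, m):
    """Return a 0/1 mask keeping n weights in each group of m consecutive weights.

    Assumption: the kept weights are the n largest by magnitude
    (ties broken by position); the abstract does not specify this rule.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        keep = sorted(range(m), key=lambda i: (-abs(group[i]), i))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def apply_nm_sparsity(weights, n, m):
    """Zero out the masked weights of every row of a 2-D weight list."""
    return [[w * k for w, k in zip(row, nm_mask_row(row, n, m))]
            for row in weights]


def sparsity(n, m):
    """Fraction of weights removed by an N:M pattern."""
    return 1.0 - n / m


weights = [
    [0.1, -0.9, 0.3, 0.5, 0.0, 0.2, -0.7, 0.05],
    [1.0, 0.2, -0.3, 0.4, -0.6, 0.6, 0.1, 0.0],
]
sparse = apply_nm_sparsity(weights, 2, 4)
print(sparse)
print(sparsity(2, 4), sparsity(1, 4), sparsity(1, 8))
```

The `sparsity` helper reproduces the ratios quoted in the abstract: 2:4 removes 50% of the weights, 1:4 removes 75%, and 1:8 removes 87.5%.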

Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

通过哈密顿流进行解耦连续时间强化学习

Authors: Minh Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/2602.14587
Pdf link: https://arxiv.org/pdf/2602.14587
Abstract Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle $V$ and $q$ into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: $q$ is learned from diffusion generators on $V$, and $V$ is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.
中文摘要 许多现实世界的控制问题，从金融到机器人，都是在连续时间中演化的，其决策是非均匀且由事件驱动的。基于固定步长Bellman更新的标准离散时间强化学习（RL）在此环境中表现不佳：随着时间间隔缩小，$Q$函数坍缩为价值函数$V$，从而消除动作排序。现有的连续时间方法通过优势率函数$q$重新引入动作信息。然而，它们通过复杂的鞅损失或正交约束来强制最优性，而这些损失和约束对测试过程的选择非常敏感。这些方法将$V$和$q$纠缠成一个庞大复杂的优化问题，且难以可靠训练。为解决这些局限性，我们提出了一种新颖的解耦连续时间演员-评论家算法，采用交替更新：$q$ 由$V$上的扩散生成元学习得到，$V$ 通过基于哈密顿量的价值流更新，该价值流在无穷小时间步长下仍具信息量，而标准的max/softmax备份在此情形下会失效。理论上，我们通过新的概率论证证明了严格收敛性，规避了基于生成元的哈密顿量在上确界范数下缺乏贝尔曼式收缩的挑战。从实证角度看，我们的方法在具有挑战性的连续控制基准和现实交易任务中，优于以往的连续时间基线和领先的离散时间基线，单季度利润达到21%，几乎是次优方法的两倍。
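The claim that the $Q$-function collapses to $V$ as time gaps shrink can be checked numerically on a toy problem. The sketch below is not the paper's method: the value function, reward rate, dynamics (dx/dt = a) and discount rate `beta` are all hypothetical choices made only for illustration. It computes a one-step $Q$ for two actions and shows that their gap vanishes with the step size `dt`, while the gap divided by `dt` stays finite, which is the kind of rate quantity an advantage-rate function $q$ captures.

```python
import math


def value(x):
    # Hypothetical smooth value function, chosen only for illustration.
    return -x * x


def reward_rate(x, a):
    # Hypothetical running reward per unit time.
    return -(x * x) - 0.1 * a * a


def q_discrete(x, a, dt, beta=0.1):
    """One-step Q: reward accrued over dt plus the discounted next-state value."""
    x_next = x + a * dt  # Euler step of the assumed dynamics dx/dt = a
    return reward_rate(x, a) * dt + math.exp(-beta * dt) * value(x_next)


def action_gap(x, a1, a2, dt):
    """Difference in one-step Q between two actions at the same state."""
    return q_discrete(x, a1, dt) - q_discrete(x, a2, dt)


for dt in (0.1, 0.01, 0.001):
    gap = action_gap(1.0, -1.0, 1.0, dt)
    # The gap shrinks roughly in proportion to dt; gap / dt stays near 4.
    print(f"dt={dt:<6} gap={gap:.6f} gap/dt={gap / dt:.4f}")
```

In this toy setup the gap equals 4·dt·exp(-beta·dt), so a greedy ranking based on the raw $Q$ values becomes numerically fragile for small steps even though the per-unit-time difference between the actions is unchanged.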

GREAT-EER: Graph Edge Attention Network for Emergency Evacuation Responses

GREAT-EER：用于紧急疏散响应的图边注意力网络

Authors: Attila Lischka, Balázs Kulcsár
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14676
Pdf link: https://arxiv.org/pdf/2602.14676
Abstract Emergency situations that require the evacuation of urban areas can arise from man-made causes (e.g., terrorist attacks or industrial accidents) or natural disasters, the latter becoming more frequent due to climate change. As a result, effective and fast methods to develop evacuation plans are of great importance. In this work, we identify and propose the Bus Evacuation Orienteering Problem (BEOP), an NP-hard combinatorial optimization problem with the goal of evacuating as many people as possible from an affected area by bus in a short, predefined amount of time. The purpose of bus-based evacuation is to reduce the congestion and disorder that arise in purely car-focused evacuation scenarios. To solve the BEOP, we propose a deep reinforcement learning-based method utilizing graph learning, which, once trained, achieves fast inference speed and is able to create evacuation routes in fractions of a second. We can bound the optimality gap of our evacuation plans using an MILP formulation. To validate our method, we create evacuation scenarios for San Francisco using real-world road networks and travel times. We show that we achieve near-optimal solution quality and are further able to investigate how many evacuation vehicles are necessary to achieve certain bus-based evacuation quotas given a predefined evacuation time while keeping run time adequate.
中文摘要 需要疏散城市地区的紧急情况可能源于人为原因（如恐怖袭击或工业事故）或自然灾害，后者因气候变化而更频繁。因此，制定有效且快速的疏散计划方法至关重要。在本研究中，我们识别并提出了公交疏散定向问题（BEOP），这是一个NP难组合优化问题，目标是在短时间内通过公交车从受灾区域疏散尽可能多的人。公交车疏散的目的是减少纯粹以汽车为中心的疏散场景中产生的拥堵和混乱。为解决BEOP，我们提出了一种基于深度强化学习的方法，利用图学习，训练后可实现快速推理速度，并能在几分之一秒内创建撤离路线。我们可以用MILP表述来界定撤离计划的间隔。为了验证我们的方法，我们利用现实世界的道路网络和出行时间为旧金山创建疏散场景。我们证明了近乎最优解质量，并进一步研究在预定疏散时间条件下，满足基于公交车的疏散配额所需的撤离车辆数量。

Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs

进化系统提示学习可以促进LLM的强化学习

Authors: Lunjun Zhang, Ryan Chen, Bradly C. Stadie
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14697
Pdf link: https://arxiv.org/pdf/2602.14697
Abstract Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME to BeyondAIME) generalization setting, E-SPL improves RL success rate from 38.8% to 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: this https URL
中文摘要 构建能够从经验中自主自我改进的智能体系统，是人工智能长期以来的目标。如今，大型语言模型（LLM）主要通过两种机制自我改进：用于上下文更新的自我反思，以及用于权重更新的强化学习（RL）。在本研究中，我们提出了进化系统提示学习（E-SPL），这是一种联合改进模型上下文和模型权重的方法。在每次RL迭代中，E-SPL会选择多个系统提示，并用每个提示并行运行rollout。它以每个系统提示为条件对模型权重进行强化学习更新，并通过由LLM驱动的变异和交叉对系统提示种群进行进化更新。每个系统提示都有一个用于进化选择的TrueSkill评分，该评分根据每个RL迭代批次内的相对表现进行更新。E-SPL促使提示中编码的陈述性知识与权重中编码的过程性知识自然分工，从而提升推理和智能体任务的表现。例如，在由易到难（AIME到BeyondAIME）的泛化设置中，E-SPL将RL成功率从38.8%提升至45.1%，同时也优于反思式提示进化（40.0%）。总体来看，我们的结果表明，将强化学习与系统提示进化相结合，能够在样本效率和泛化方面带来一致的提升。代码：此 https URL

ManeuverNet: A Soft Actor-Critic Framework for Precise Maneuvering of Double-Ackermann-Steering Robots with Optimized Reward Functions

ManeuverNet：用于双阿克曼转向机器人精确机动的软演员-评论家框架，采用优化的奖励函数

Authors: Kohio Deflesselle, Mélodie Daniel, Aly Magassouba, Miguel Aranda, Olivier Ly
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14726
Pdf link: https://arxiv.org/pdf/2602.14726
Abstract Autonomous control of double-Ackermann-steering robots is essential in agricultural applications, where robots must execute precise and complex maneuvers within a limited space. Classical methods, such as the Timed Elastic Band (TEB) planner, can address this problem, but they rely on parameter tuning, making them highly sensitive to changes in robot configuration or environment and impractical to deploy without constant recalibration. At the same time, end-to-end deep reinforcement learning (DRL) methods often fail due to unsuitable reward functions for non-holonomic constraints, resulting in sub-optimal policies and poor generalization. To address these challenges, this paper presents ManeuverNet, a DRL framework tailored for double-Ackermann systems, combining Soft Actor-Critic with CrossQ. Furthermore, ManeuverNet introduces four specifically designed reward functions to support maneuver learning. Unlike prior work, ManeuverNet does not depend on expert data or handcrafted guidance. We extensively evaluate ManeuverNet against both state-of-the-art DRL baselines and the TEB planner. Experimental results demonstrate that our framework substantially improves maneuverability and success rates, achieving more than a 40% gain over DRL baselines. Moreover, ManeuverNet effectively mitigates the strong parameter sensitivity observed in the TEB planner. In real-world trials, ManeuverNet achieved up to a 90% increase in maneuvering trajectory efficiency, highlighting its robustness and practical applicability.
中文摘要 双阿克曼转向机器人的自主控制在农业应用中至关重要，因为机器人必须在有限空间内执行精确而复杂的机动。经典方法如定时弹性带（TEB）规划器可以解决这个问题，但它们依赖参数调优，因此对机器人配置或环境的变化极为敏感，若不持续重新校准就难以部署。与此同时，端到端深度强化学习（DRL）方法常因奖励函数不适合非完整约束而失败，导致策略次优且泛化能力差。为应对这些挑战，本文提出了ManeuverNet，这是一个专为双阿克曼系统设计的DRL框架，结合了Soft Actor-Critic和CrossQ。此外，ManeuverNet引入了四个专门设计的奖励函数以支持机动学习。与以往工作不同，ManeuverNet不依赖专家数据或手工设计的引导。我们将ManeuverNet与最先进的DRL基线和TEB规划器进行了广泛的对比评估。实验结果表明，我们的框架显著提升了机动性和成功率，比DRL基线提升超过40%。此外，ManeuverNet有效缓解了TEB规划器中观察到的强烈参数敏感性。在真实世界试验中，ManeuverNet的机动轨迹效率最高提升了90%，彰显其鲁棒性和实用性。

Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment

无交互逆向强化学习：一个以数据为中心的持久对齐框架

Authors: Elias Malomgré, Pieter Simoens
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14844
Pdf link: https://arxiv.org/pdf/2602.14844
Abstract AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw that entangles the safety objectives with the agent's policy. Methods such as Reinforcement Learning from Human Feedback and Direct Preference Optimization create opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple alignment artifact learning from policy optimization, producing an inspectable, editable, and model-agnostic reward model. Additionally, we introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. This architecture transforms safety from a disposable expense into a durable, verifiable engineering asset.
中文摘要 AI对齐的重要性日益增长，但现有方法存在一个关键的结构性缺陷，将安全目标与代理政策纠缠在一起。诸如人类反馈强化学习和直接偏好优化等方法会产生不透明的一次性对齐伪影，我们称之为对齐浪费。我们提出了无交互反强化学习，以将对齐工件学习与策略优化解耦，生成一个可检查、可编辑且模型无关的奖励模型。此外，我们还引入了“对齐飞轮”，这是一种人机生命周期，通过自动审计和优化迭代加固奖励模型。该架构将安全从一次性支出转变为耐用且可验证的工程资产。

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

金发姑娘RL：调整任务难度以逃避推理奖励稀少

Authors: Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14868
Pdf link: https://arxiv.org/pdf/2602.14868
Abstract Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
中文摘要 强化学习已成为释放大型语言模型推理能力的强大范式。然而，依赖稀疏的奖励使得这一过程极低采样效率，因为模型必须在极少的反馈下穿越庞大的搜索空间。虽然经典课程学习旨在通过根据复杂度排序数据来缓解这一问题，但具体模型的正确排序往往不够明确。为此，我们提出了“金发姑娘”（Goldilocks），一种由教师驱动的新型数据抽样策略，旨在预测每个问题对学生模型的难度。教师模型选择适合学生模型难度的题目，即既不简单也不难（金发姑娘原理），同时用GRPO训练学生。通过利用学生在实习样本上的表现，教师不断适应学生不断发展的能力。在OpenMathReasoning数据集上，Goldilocks数据抽样提升了在相同计算预算下用标准GRPO训练模型的性能。
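One way to see why "neither too easy nor too hard" matters under a group-relative objective: if every rollout for a question receives the same binary reward, the group-normalized advantages are all zero and the question contributes no gradient. The sketch below illustrates this under stated assumptions. The mean/std normalization is the commonly described GRPO-style form, while the `learning_signal` score p(1 - p) and the `pick_questions` selection rule are hypothetical stand-ins for the paper's teacher model, whose actual design is not given in the abstract.

```python
import statistics


def group_advantages(rewards):
    """Group-relative advantages: (reward - group mean) / group std.

    Returns all zeros when every rollout in the group has the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def learning_signal(pass_rate):
    """Variance of a Bernoulli(pass_rate) reward; largest at pass_rate = 0.5."""
    return pass_rate * (1.0 - pass_rate)


def pick_questions(predicted_pass_rate, k):
    """Hypothetical selection rule: keep the k questions with the highest signal."""
    ranked = sorted(
        predicted_pass_rate,
        key=lambda q: learning_signal(predicted_pass_rate[q]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted pass rates of the student on four questions.
predicted = {"q_easy": 0.98, "q_mid": 0.55, "q_hard": 0.02, "q_ok": 0.30}
print(pick_questions(predicted, 2))
print(group_advantages([1, 1, 1, 1]))  # too easy: no signal
print(group_advantages([1, 0, 0, 1]))  # mixed outcomes: non-zero advantages
```

In this sketch the questions with predicted pass rates closest to 0.5 are preferred, and the all-correct group shows the degenerate case that such sampling is meant to avoid.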

On the Learning Dynamics of RLVR at the Edge of Competence

关于RLVR在能力边缘的学习动态

Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.14872
Pdf link: https://arxiv.org/pdf/2602.14872
Abstract Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
中文摘要 带有可验证奖励的强化学习（RLVR）一直是大型推理模型近期突破的主要推动力。然而，仅基于最终结果的奖励如何帮助克服长期推理障碍仍是个谜。为理解这一点，我们发展了一套关于变换器在组合推理任务中强化学习训练动力学的理论。我们的理论描述了RLVR的有效性如何受难度谱平滑度的决定。当数据中出现难度的突兀断裂时，学习经历类似格罗金（grokking）的相变，产生较长的平台期，随后进展才会重现。相比之下，平滑难度谱会产生继电效应：对较易问题持续的梯度信号提升模型能力，使较难的问题变得可解，从而实现稳定且持续的改进。我们的理论解释了RLVR如何在能力边缘提升性能，并建议适当设计的数据组合可以带来可扩展的提升。作为技术贡献，我们的分析开发并调整了有限群傅里叶分析中的工具到我们的设定中。我们通过合成实验实证验证了预测机制。

BFS-PO: Best-First Search for Large Reasoning Models

BFS-PO：大推理模型的最佳优先搜索

Authors: Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14917
Pdf link: https://arxiv.org/pdf/2602.14917
Abstract Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance in reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational cost and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase the LRM accuracy and shorten its answers.
中文摘要 大型推理模型（LRM），如OpenAI o1和DeepSeek-R1，在使用长推理链的推理任务中表现出优异的性能。然而，这也导致计算成本显著增加，输出内容冗长，这种现象被称为过度思考。过度思考的倾向常因强化学习（RL）算法如GRPO/DAPO而加剧。本文提出了BFS-PO算法，这是一种通过最佳优先搜索探索策略缓解该问题的强化学习算法。具体来说，BFS-PO通过基于最大熵节点的回溯机制寻找最短的正确答案。通过在训练中逐步缩短响应，BFS-PO学会了生成简洁的推理链。利用不同的基准测试和基础LRM，我们证明BFS-PO可以同时提高LRM的准确性并缩短其答案。

MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design

MAC-AMP：一个用于多目标抗菌肽设计的闭环多代理协作系统

Authors: Gen Zhou, Sugitha Janarthanan, Lianghong Chen, Pingzhao Hu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.14926
Pdf link: https://arxiv.org/pdf/2602.14926
Abstract To address the global health threat of antimicrobial resistance, antimicrobial peptides (AMP) are being explored for their potent and promising ability to fight resistant pathogens. While artificial intelligence (AI) is being employed to advance AMP discovery and design, most AMP design models struggle to balance key goals like activity, toxicity, and novelty, using rigid or unclear scoring methods that make results hard to interpret and optimize. As the capabilities of Large Language Models (LLM) advance and evolve swiftly, we turn to AI multi-agent collaboration based on such models (multi-agent LLMs), which show rapidly rising potential in complex scientific design scenarios. Based on this, we introduce MAC-AMP, a closed-loop multi-agent collaboration (MAC) system for multi-objective AMP design. The system implements a fully autonomous simulated peer review-adaptive reinforcement learning framework that requires only a task description and example dataset to design novel AMPs. The novelty of our work lies in introducing a closed-loop multi-agent system for AMP design, with cross-domain transferability, that supports multi-objective optimization while remaining explainable rather than a 'black box'. Experiments show that MAC-AMP outperforms other AMP generative models by effectively optimizing AMP generation for multiple key molecular properties, demonstrating exceptional results in antibacterial activity, AMP likeliness, toxicity compliance, and structural reliability.
中文摘要 为了应对抗菌素耐药性的全球健康威胁，抗菌肽（AMP）正在被探索，其对抗耐药病原体具有强大且有前景的能力。虽然人工智能（AI）被用于推动抗菌肽的发现和设计，但大多数AMP设计模型在平衡活性、毒性和新颖性等关键目标时遇到困难，采用僵化或不明确的评分方法，使结果难以解读和优化。随着大型语言模型（LLM）能力的快速进步和演进，我们转向基于此类模型的AI多智能体协作（多智能体LLM），这些模型在复杂科学设计场景中展现出快速增长的潜力。基于此，我们引入了MAC-AMP，一种闭环多代理协作（MAC）系统，用于多目标AMP设计。该系统实现了一个完全自主的模拟同行评审自适应强化学习框架，只需任务描述和示例数据集即可设计新颖的 AMP。我们工作的创新之处在于引入了一套闭环多智能体系统，用于AMP设计，具备跨域可转移性，支持多目标优化，同时保持可解释性，而非“黑箱”。实验显示，MAC-AMP通过有效优化多种关键分子性质的抗菌肽生成，表现优于其他AMP生成模型，在抗菌活性、抗菌素可能性、毒性依从性和结构可靠性方面表现出色。

Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

通过推理和提炼学习用户兴趣，实现跨领域新闻推荐

Authors: Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.15005
Pdf link: https://arxiv.org/pdf/2602.15005
Abstract News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring users' underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions, inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
中文摘要 新闻推荐在在线新闻平台中发挥着关键作用，帮助用户发现相关内容。跨域新闻推荐还需要从异构信号中推断用户的底层信息需求，这些信号往往超出直接新闻消费范围。一个关键挑战在于超越表面行为，捕捉更深层、可重用的用户兴趣，同时保持大规模生产系统的可扩展性。本文提出了一个强化学习框架，训练大型语言模型，从跨域用户信号生成高质量的兴趣驱动新闻搜索查询列表。我们将查询列表生成作为策略优化问题，并采用带有多重奖励信号的GRPO。我们系统地研究了两个计算维度：推断时间采样和模型容量，并实证观察到计算量增加后，表现出类似缩放行为的持续改进。最后，我们进行策略内提炼，将学到的策略从大型、计算密集型教师转移到适合可扩展部署的紧凑学生模型。大量离线实验、消融研究以及在生产新闻推荐系统中的大规模在线A/B测试显示，兴趣建模质量和下游推荐表现均持续提升。

Cold-Start Personalization via Training-Free Priors from Structured World Models

通过结构化世界模型的无训练先验进行冷启动个性化

Authors: Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov, Maryam Fazel, Lin Xiao, Asli Celikyilmaz
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.15012
Pdf link: https://arxiv.org/pdf/2602.15012
Abstract Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
中文摘要 冷启动个性化需要在没有用户特定历史数据的情况下，通过交互推断用户偏好。核心挑战是路由问题：每个任务包含数十个偏好维度，但每个用户只关心少数几个，哪些重要取决于是谁在提问。问题预算有限，没有结构地提问会错过重要的维度。强化学习是自然的表述，但在多回合环境中，其最终奖励未能利用偏好数据的因式分解和标准结构，实践中学习策略会崩溃为静态问题序列，忽视用户回答。我们提出将冷启动诱导分解为离线结构学习和在线贝叶斯推断。Pep（带先验的偏好诱导）通过完整画像离线学习一个结构化的世界模型，进行偏好相关性的世界模型，然后在线进行无训练贝叶斯推断，以选择有用的问题并预测完整的偏好特征，包括未曾被提及的维度。该框架在下游求解器间模块化，只需简单的信念模型。在医学、数学、社会和常识推理中，Pep在生成的回答与用户陈述偏好之间实现了80.8%的对齐度，而强化学习仅为68.5%，互动次数减少了3-5倍。当两个用户对同一问题给出不同答案时，Pep的后续问题变化率为39%-62%，而强化学习为0-28%。它以 ~10K 参数实现，而 RL 为 8B，表明冷启动诱导的瓶颈在于利用偏好数据的分解结构。
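The online step described for Pep, choosing informative questions by Bayesian inference, can be illustrated with a small expected-information-gain calculation. This is only a sketch under strong assumptions: the belief is a distribution over two hypothetical user types, each question has a yes/no answer with a known per-type probability of "yes", and the user types, question names and likelihood tables are invented for illustration. The abstract does not describe Pep's actual world model or selection criterion in this detail.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a belief over user types."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def posterior(belief, likelihood_yes, answer):
    """Bayes update of the belief after observing a yes/no answer."""
    unnorm = {}
    for user_type, prior in belief.items():
        p_yes = likelihood_yes[user_type]
        unnorm[user_type] = prior * (p_yes if answer else 1.0 - p_yes)
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


def expected_info_gain(belief, likelihood_yes):
    """Expected reduction in entropy from asking one question."""
    p_yes = sum(belief[t] * likelihood_yes[t] for t in belief)
    gain = entropy(belief)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(belief, likelihood_yes, answer))
    return gain


# Hypothetical prior over user types and per-type probability of answering "yes".
belief = {"brief": 0.5, "detailed": 0.5}
questions = {
    "wants_citations": {"brief": 0.5, "detailed": 0.5},     # uninformative
    "wants_step_by_step": {"brief": 0.1, "detailed": 0.9},  # informative
}

best = max(questions, key=lambda q: expected_info_gain(belief, questions[q]))
print(best)
print(posterior(belief, questions[best], True))
```

Because the posterior depends on the observed answer, the next question chosen this way generally differs between users who answer differently, which is the adaptivity the abstract contrasts with static question sequences.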

Keyword: diffusion policy

HybridFlow: A Two-Step Generative Policy for Robotic Manipulation

HybridFlow：用于机器人操作的两步生成策略

Authors: Zhenchen Dong, Jinna Fu, Jiaming Wu, Shengyuan Yu, Fulin Chen, Yide Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13718
Pdf link: https://arxiv.org/pdf/2602.13718
Abstract Limited by inference latency, existing robot manipulation policies lack sufficient real-time interaction capability with the environment. Although faster generation methods such as flow matching are gradually replacing diffusion methods, researchers are pursuing even faster generation suitable for interactive robot control. MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not meet the stringent requirements of robotic manipulation. We therefore propose HybridFlow, a 3-stage method with 2-NFE: Global Jump in MeanFlow mode, ReNoise for distribution alignment, and Local Refine in ReFlow mode. This method balances inference speed and generation quality by leveraging the speed advantage of MeanFlow one-step generation while ensuring action precision with minimal generation steps. In real-world experiments, HybridFlow outperforms the 16-step Diffusion Policy by 15-25% in success rate while reducing inference time from 152ms to 19ms (8x speedup, about 52Hz); it also achieves 70.0% success on unseen-color OOD grasping and 66.3% on deformable object folding. We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.
中文摘要 受推理延迟的限制，现有机器人操作策略缺乏足够的与环境实时交互的能力。尽管流匹配等更快的生成方法正逐渐取代扩散方法，研究人员仍在追求适合交互式机器人控制的更快生成方式。作为流匹配的一步变体，MeanFlow在图像生成方面展现出强大潜力，但其动作生成的精度尚不能满足机器人操作的严格要求。因此，我们提出了HybridFlow，一种采用2-NFE的三阶段方法：MeanFlow模式下的全局跳跃（Global Jump）、用于分布对齐的ReNoise，以及ReFlow模式下的局部细化（Local Refine）。该方法利用MeanFlow一步生成的速度优势，并以最少的生成步数确保动作精度，从而平衡推理速度和生成质量。在真实世界实验中，HybridFlow的成功率比16步Diffusion Policy高15-25%，同时将推理时间从152毫秒缩短到19毫秒（8倍加速，约52Hz）；它在未见颜色的分布外（OOD）抓取上达到70.0%的成功率，在可变形物体折叠上达到66.3%。我们期望HybridFlow成为一种实用的低延迟方法，以增强机器人操作策略在真实世界中的交互能力。
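The latency figures quoted in the HybridFlow abstract are mutually consistent, which a few lines of arithmetic confirm. The sketch below uses only the two inference times from the abstract (152 ms and 19 ms); the helper names are hypothetical.

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new inference time is than the baseline."""
    return baseline_ms / new_ms


def control_rate_hz(latency_ms):
    """Maximum action-generation rate if each inference takes latency_ms."""
    return 1000.0 / latency_ms


diffusion_policy_ms = 152.0  # 16-step Diffusion Policy, from the abstract
hybridflow_ms = 19.0         # HybridFlow with 2 function evaluations, from the abstract

print(speedup(diffusion_policy_ms, hybridflow_ms))  # 152 / 19 = 8.0
print(round(control_rate_hz(hybridflow_ms), 1))     # 1000 / 19, about 52.6 Hz
```

So the "8x speedup" and "about 52Hz" figures both follow directly from the 152 ms and 19 ms measurements.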

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

用于类别级可泛化触觉工具操作的语义接触场

Authors: Kevin Yuchen Ma, Heng Zhang, Weisi Lin, Mike Zheng Shou, Yan Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.13833
Pdf link: https://arxiv.org/pdf/2602.13833
Abstract Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the high-fidelity physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning unified contact representations from diverse data, yet a fundamental barrier remains: diverse real-world tactile data are prohibitive at scale, while direct zero-shot sim-to-real transfer is challenging due to the complex dynamics of nonlinear deformation of soft sensors. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation fusing visual semantics with dense contact estimates. We enable this via a two-stage Sim-to-Real Contact Learning Pipeline: first, we pre-train on a large simulation data set to learn general contact physics; second, we fine-tune on a small set of real data, pseudo-labeled via geometric heuristics and force optimization, to align sensor characteristics. This allows physical generalization to unseen tools. We leverage SCFields as the dense observation input for a diffusion policy to enable robust execution of contact-rich tool manipulation tasks. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines.
中文摘要 推广工具作既需要语义规划，也需要精确的物理控制。现代通用机器人策略，如视觉-语言-行动（VLA）模型，通常缺乏实现接触丰富工具作所需的高精度物理基础。相反，现有利用触觉或触摸感知的接触感知策略通常具有实例特定性，无法在不同工具几何结构中推广。弥合这一差距需要从不同数据中学习统一的接触表示，但一个根本障碍依然存在：多样的真实世界触觉数据在大规模上难以实现，而由于软传感器非线性变形的复杂动力学，直接零样品模拟到实物传输也具有挑战性。为此，我们提出了语义接触场（SCFields），这是一种融合视觉语义与密集接触估计的统一三维表示。我们通过两阶段的模拟到真实接触学习流程实现这一点：首先，我们在大型仿真数据集上预训练以学习一般接触物理;其次，我们对一小部分真实数据进行微调，这些数据通过几何启发式和力优化进行伪标注，以对齐传感器特性。这使得物理泛化到看不见的工具成为可能。我们利用SCFields作为扩散策略的密集观测输入，实现接触丰富工具作任务的稳健执行。刮擦、蜡笔绘画和剥皮的实验显示出稳健的类别级泛化能力，显著优于仅视觉和原始触觉基线。

Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation

学习部件感知的稠密三维特征场，以实现可泛化的铰接物体操作

Authors: Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, Hao Dong
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.14193
Pdf link: https://arxiv.org/pdf/2602.14193
Abstract Articulated object manipulation is essential for various real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, while these features are mostly 2D-based and do not specifically consider functional parts. When lifting these 2D features to geometry-profound 3D space, challenges arise, such as long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose Part-Aware 3D Feature Field (PA3FF), a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained by 3D part proposals from a large-scale labeled dataset, via a contrastive learning formulation. Given point clouds as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations in manipulation scenarios, including CLIP, DINOv2, and Grounded-SAM. Beyond imitation learning, PA3FF enables diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation. Project page: this https URL
中文摘要 关节式物体作对于各种现实机器人任务至关重要，但跨越不同物体的泛化仍是一大挑战。推广的关键在于理解功能部件（例如门把手和门把手），这些部件指示了在不同物体类别和形状中需要作的位置和方式。以往的研究试图通过引入基础特征来实现泛化，而这些特征大多基于二维，并未专门考虑功能部件。当将这些二维特征提升到几何深邃的三维空间时，会出现诸如长时间运行、多视角不一致以及空间分辨率低且几何信息不足等挑战。为解决这些问题，我们提出了部分感知3D特征场（PA3FF），这是一种新颖的密集3D特征，具备部分感知功能，用于可泛化的关节对象作。PA3FF通过对比学习形式，从一个大规模标记数据集中获得三维零件提案进行训练。给定点云作为输入，PA3FF以前馈方式预测一个连续的三维特征场，其中点特征之间的距离反映了功能部件的接近程度：具有相似特征的点更可能属于同一部件。基于这一特性，我们介绍了部分感知扩散策略（PADP），这是一个模仿学习框架，旨在提升样本效率和机器人作的泛化性。我们在多个模拟和现实任务中评估了PADP，证明PA3FF在作场景中持续优于多种二维和三维表示方式，包括CLIP、DINOv2和Grounded-SAM。除了模仿学习，PA3FF还支持多种下游方法，包括对应学习和分割任务，使其成为机器人作的多功能基础。项目页面：此 https URL