Arxiv Papers of Today

Generated: 2026-04-13 17:58:06 (UTC+8); arXiv release time: 2026-04-13 20:00 EDT (2026-04-14 08:00 UTC+8)

34 relevant papers today

Keyword: reinforcement learning

Distributionally Robust Token Optimization in RLHF

RLHF 中的分布式鲁棒令牌优化

Authors: Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08577
Pdf link: https://arxiv.org/pdf/2604.08577
Abstract Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, leading to theoretical robustness. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.
中文摘要 大型语言模型（LLM）往往能正确响应与其训练和微调数据相符的提示。然而，措辞、格式或语言的微小变化，尤其是在多步推理问题中，可能会引发意外的巨大失败。为解决这一问题，我们提出了一种分布式强棒代币优化（DRTO）方法，结合了基于代币级的人类反馈强化学习（RLHF）和分布式鲁棒优化（DRO）。DRTO通过构造一个f-散度模糊集，在损失小批次上限制了最坏情况下的代币奖励，从而实现理论上的鲁棒性。从经验来看，DRTO在数学推理基准测试中提升了分布变化下的一致性，GSM8K提升了9.17%，MathQA提升了2.49%。
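
Illustrative sketch (not from the paper): the abstract does not say which f-divergence DRTO uses, so the snippet below assumes the KL divergence in its penalized (soft-constraint) form, where the worst-case reweighting of a uniform minibatch is a softmax over the losses. The function names, the temperature value, and the sample losses are all hypothetical.

```python
import math

def kl_tilted_weights(losses, temperature):
    """Worst-case weights for the KL-penalized problem: softmax(loss / temperature)."""
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in losses]
    total = sum(exps)
    return [e / total for e in exps]

def kl_robust_loss(losses, temperature):
    """temperature * log(mean(exp(loss / temperature))); lies between mean and max loss."""
    m = max(losses)
    mean_exp = sum(math.exp((l - m) / temperature) for l in losses) / len(losses)
    return m + temperature * math.log(mean_exp)

# Hypothetical per-token losses from one minibatch.
token_losses = [0.2, 0.5, 3.0]
weights = kl_tilted_weights(token_losses, temperature=1.0)
robust = kl_robust_loss(token_losses, temperature=1.0)
print(weights, robust)
```

A smaller temperature pushes the weights toward the hardest token, and a larger one recovers the plain average, which is the usual knob for trading robustness against average-case performance in this family of objectives.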

StructRL: Recovering Dynamic Programming Structure from Learning Dynamics in Distributional Reinforcement Learning

StructRL：从分布强化学习中的动态规划结构恢复

Authors: Ivo Nowak
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08620
Pdf link: https://arxiv.org/pdf/2604.08620
Abstract Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of information. Building on this observation, we propose StructRL, a framework that exploits these signals to guide sampling in alignment with the emerging propagation structure. Our preliminary results suggest that distributional learning dynamics provide a mechanism to recover and exploit dynamic programming-like structure without requiring an explicit model. This offers a new perspective on reinforcement learning, where learning can be interpreted as a structured propagation process rather than a purely uniform optimization procedure.
中文摘要 强化学习通常被视为一种统一的数据驱动优化过程，更新由奖励和时间差误引导，而非明确利用全局结构。相比之下，动态规划方法依赖结构化信息传播，实现高效且稳定的学习。本文提供了证据，表明这种结构可以从分布式强化学习的学习动态中恢复。通过分析返回分布的时间演化，我们识别出捕捉学习在状态空间中何时何地发生的信号。特别地，我们引入了一个时间学习指标t*（s），反映州在培训期间经历最强的学习更新。从经验上看，该信号诱导了状态上的排序，这与动态编程风格的信息传播一致。基于这一观察，我们提出了StructRL，一个利用这些信号指导采样，以符合新兴传播结构的框架。我们的初步结果表明，分布式学习动态提供了一种机制，可以在无需显式模型的情况下恢复和利用类似动态规划的结构。这为强化学习提供了新的视角，学习可以被解释为一个结构化的传播过程，而非纯粹的统一优化过程。
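
Illustrative sketch (not from the paper): one possible reading of the temporal learning indicator t*(s) is the training step at which a state's update magnitude peaks. How that magnitude is measured (for example, a distance between successive return distributions) is an assumption here, as are the function names and the toy chain of states.

```python
def temporal_learning_indicator(update_magnitudes):
    """Map each state to the training step where its update magnitude is largest.

    update_magnitudes: dict of state -> list of per-step magnitudes (hypothetical log).
    Ties resolve to the earliest step.
    """
    return {
        state: max(range(len(mags)), key=mags.__getitem__)
        for state, mags in update_magnitudes.items()
    }

def propagation_order(t_star):
    """Order states by t*(s); states that learn earliest come first."""
    return sorted(t_star, key=lambda s: (t_star[s], s))

# Toy chain s1 -> s2 -> s3 -> goal: states nearer the goal peak earlier.
history = {
    "s1": [0.0, 0.1, 0.9, 0.2],
    "s2": [0.1, 0.8, 0.2, 0.1],
    "s3": [0.9, 0.3, 0.1, 0.0],
}
t_star = temporal_learning_indicator(history)
order = propagation_order(t_star)
print(t_star, order)
```

In this toy log the resulting order runs backward from the goal, which is the dynamic programming-style propagation pattern the abstract describes.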

RAMP: Hybrid DRL for Online Learning of Numeric Action Models

RAMP：用于数值作用模型在线学习的混合日程学习

Authors: Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08685
Pdf link: https://arxiv.org/pdf/2604.08685
Abstract Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.
中文摘要 自动化规划算法需要一个动作模型，指定每个动作的前提条件和效果，但获得这样的模型通常很困难。从观察中学习动作模型是可行的，但现有的数值域算法处于离线状态，需要专家的迹迹作为输入。我们提出了强化学习、行动模型学习与规划（RAMP）策略，通过与环境的交互在线学习数值规划行动模型。RAMP同时训练深度强化学习（DRL）策略，从过去的交互中学习数值动作模型，并在可能的情况下利用该模型规划未来行动。这些组成部分形成了一个正反馈循环：强化学习策略收集数据以优化行动模型，而规划者则生成计划以继续训练强化学习策略。为了促进强化学习与数值规划的整合，我们开发了数值PDDLGym，这是一个将数值规划问题转换为健身房环境的自动化框架。在标准IPC数值域上的实验结果表明，RAMP在可解性和计划质量方面显著优于著名的DRL算法PPO。

Wireless Communication Enhanced Value Decomposition for Multi-Agent Reinforcement Learning

无线通信增强值分解用于多智能体强化学习

Authors: Diyi Hu, Bhaskar Krishnamachari
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08728
Pdf link: https://arxiv.org/pdf/2604.08728
Abstract Cooperation in multi-agent reinforcement learning (MARL) benefits from inter-agent communication, yet most approaches assume idealized channels and existing value decomposition methods ignore who successfully shared information with whom. We propose CLOVER, a cooperative MARL framework whose centralized value mixer is conditioned on the communication graph realized under a realistic wireless channel. This graph introduces a relational inductive bias into value decomposition, constraining how individual utilities are mixed based on the realized communication structure. The mixer is a GNN with node-specific weights generated by a Permutation-Equivariant Hypernetwork: multi-hop propagation along communication edges reshapes credit assignment so that different topologies induce different mixing. We prove this mixer is permutation invariant, monotonic (preserving the IGM condition), and strictly more expressive than QMIX-style mixers. To handle realistic channels, we formulate an augmented MDP isolating stochastic channel effects from the agent computation graph, and employ a stochastic receptive field encoder for variable-size message sets, enabling end-to-end differentiable training. On Predator-Prey and Lumberjacks benchmarks under p-CSMA wireless channels, CLOVER consistently improves convergence speed and final performance over VDN, QMIX, TarMAC+VDN, and TarMAC+QMIX. Behavioral analysis confirms agents learn adaptive signaling and listening strategies, and ablations isolate the communication-graph inductive bias as the key source of improvement.
中文摘要 多智能体强化学习（MARL）中的合作受益于智能体间通信，但大多数方法假设理想化的通道，现有的价值分解方法忽略了谁成功与谁共享了信息。我们提出了CLOVER，这是一个协作的MARL框架，其中心化价值混合器基于在现实无线信道下实现的通信图。该图引入了价值分解中的关系归纳偏见，限制了基于实现的通信结构如何混合各个效用。混合器是一个由置换等变超网络生成节点特定权重的GNN：沿通信边缘的多跳传播会重塑信用分配，使不同拓扑引发不同的混合。我们证明该混频器是置换不变的、单调的（保持IGM条件），并且比QMIX风格的混频器更具表现力。为处理真实信道，我们构建了一个增强MDP，将随机信道效应从代理计算图中分离出来，并采用随机感受场编码器处理可变大小消息集，实现端到端可微训练。在Predator-Prey和Lumberjacks基于p-CSMA无线信道的基准测试中，CLOVER在VDN、QMIX、TarMAC+VDN和TarMAC+QMIX上持续提升收敛速度和最终性能。行为分析证实代理学习自适应信号和倾听策略，消融法将沟通图的归纳偏见隔离为改善的关键来源。

Artifacts as Memory Beyond the Agent Boundary

工件作为代理边界之外的记忆

Authors: John D. Martin, Fraser Mince, Esra'a Saleh, Amy Pajak
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08756
Pdf link: https://arxiv.org/pdf/2604.08756
Abstract The situated view of cognition holds that intelligent behavior depends not only on internal memory, but on an agent's active use of environmental resources. Here, we begin formalizing this intuition within Reinforcement Learning (RL). We introduce a mathematical framing for how the environment can functionally serve as an agent's memory, and prove that certain observations, which we call artifacts, can reduce the information needed to represent history. We corroborate our theory with experiments showing that when agents observe spatial paths, the amount of memory required to learn a performant policy is reduced. Interestingly, this effect arises unintentionally, and implicitly through the agent's sensory stream. We discuss the implications of our findings, and show they satisfy qualitative properties previously used to ground accounts of external memory. Moving forward, we anticipate further work on this subject could reveal principled ways to exploit the environment as a substitute for explicit internal memory.
中文摘要 情境认知观点认为，智能行为不仅依赖于内部记忆，还依赖于智能体对环境资源的主动使用。在这里，我们开始在强化学习（RL）中形式化这一直觉。我们引入了环境如何作为代理记忆的功能性框架，并证明某些我们称之为人工物的观察可以减少代表历史所需的信息。我们用实验证实了我们的理论，表明当代理观察空间路径时，学习高效策略所需的记忆量会减少。有趣的是，这种效果是无意中通过主体的感官流隐含产生的。我们讨论了这些发现的含义，并证明它们满足了此前用来支撑外部记忆的定性属性。展望未来，我们预计该主题的进一步研究可能揭示利用环境作为显式内部记忆替代品的原则性方法。

Alleviating Community Fear in Disasters via Multi-Agent Actor-Critic Reinforcement Learning

通过多智能体演员-批评者强化学习，缓解灾难中的社区恐惧

Authors: Yashodhan D. Hakke, Almuatazbellah M. Boker, Lamine Mili, Michael von Spakovsky, Hoda Eldardiry
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.08802
Pdf link: https://arxiv.org/pdf/2604.08802
Abstract During disasters, cascading failures across power grids, communication networks, and social behavior amplify community fear and undermine cooperation. Existing cyber-physical-social (CPS) models simulate these coupled dynamics but lack mechanisms for active intervention. We extend the CPS resilience model of Valinejad and Mili (2023) with control channels for three agencies, communication, power, and emergency management, and formulate the resulting system as a three-player non-zero-sum differential game solved via online actor-critic reinforcement learning. Simulations based on Hurricane Harvey data show 70% mean fear reduction with improved infrastructure recovery; cross-validation in the case of Hurricane Irma (without refitting) achieves 50% fear reduction, confirming generalizability.
中文摘要 灾难期间，电网、通信网络和社会行为的连锁故障加剧了社区恐惧，削弱了合作。现有的网络-物理-社会（CPS）模型模拟了这些耦合动态，但缺乏主动干预的机制。我们将Valinejad和Mili（2023）的CPS韧性模型扩展，加入了三个机构的控制通道：通信、电力和应急管理，并将最终系统构建为一个通过在线行为者-批评者强化学习解决的三人非零和差分博弈。基于哈维飓风数据的模拟显示，平均恐惧减少率为70%，基础设施恢复有所改善;在飓风伊尔玛的情况下，交叉验证（未重新调整）可实现50%的恐惧减少，确认了普遍性。

Building Better Environments for Autonomous Cyber Defence

构建更优的自主网络防御环境

Authors: Chris Hicks, Elizabeth Bates, Shae McFadden, Isaac Symes Thompson, Myles Foley, Ed Chapman, Nickolas Espinosa Dice, Ankita Samaddar, Joshua Sylvester, Himanshu Neema, Nicholas Butts, Nate Foster, Ahmad Ridley, Zoe M, Paul Jones
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08805
Pdf link: https://arxiv.org/pdf/2604.08805
Abstract In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands-on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL-based ACD environment development and agent evaluation, based on the key findings from our workshop.
中文摘要 2025年11月，作者举办了一场关于如何打造良好强化学习（RL）环境的研讨会，适用于自主网络防御（ACD）。本文详细介绍了参与者在研讨会期间及研讨会后通过贡献分享的知识。研讨会参与者来自学术界、工业界和政府部门，拥有丰富的强化学习和网络环境设计和操作经验。虽然现在有大量文献描述强化学习中ACD的工作，但仍有大量技术、领域知识和常见危害，这些内容并未在单一资源中全面详尽描述。本工作特别聚焦于构建更好的环境，以训练和评估网络防御场景中的自主强化学习代理，包括政府和关键基础设施网络，其贡献有两个方面：（1）构建强化学习网络环境与真实系统接口的框架，以及（2）关于基于强化学习的强化学习环境开发和代理评估的最佳实践指南，基于我们研讨会的主要发现。

Simulation of Adaptive Running with Flexible Sports Prosthesis using Reinforcement Learning of Hybrid-link System

利用混合链路系统的强化学习，模拟灵活运动假肢的自适应跑步

Authors: Yuta Shimane, Ko Yamamoto
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.08882
Pdf link: https://arxiv.org/pdf/2604.08882
Abstract This study proposes a reinforcement learning-based simulation of adaptive running motion for a unilateral transtibial amputee, modeling the flexibility of a leaf-spring-type sports prosthesis with a hybrid-link system. The design and selection of sports prostheses often rely on trial and error. A comprehensive whole-body dynamics analysis that considers the interaction between human motion and prosthetic deformation could provide valuable insights for user-specific design and selection. The hybrid-link system facilitates whole-body dynamics analysis by incorporating the Piece-wise Constant Strain model to represent the flexible deformation of the prosthesis. Based on this system, the simulation methodology generates whole-body dynamic motions of a unilateral transtibial amputee through a reinforcement learning-based approach, which combines imitation learning from motion capture data with accurate prosthetic dynamics computation. We simulated running motions under different virtual prosthetic stiffness conditions and analyzed the metabolic cost of transport obtained from the simulations, suggesting that variations in stiffness influence running performance. Our findings demonstrate the potential of this approach for simulation and analysis under virtual conditions that differ from real conditions.
中文摘要 本研究提出了一种基于强化学习的自适应跑步运动模拟，适用于单侧经胫骨截肢者，采用混合链条系统，具备叶片弹簧式运动假肢的灵活性。运动义肢的设计和选择通常依赖于反复试验。全面的全身动力学分析，考虑人体运动与义肢变形的相互作用，将为用户的设计和选择提供宝贵见解。混合连杆系统通过采用分段常应变模型来表示假肢的柔性变形，便于整体动力学分析。基于该系统，模拟方法通过基于强化学习的方法生成单侧胫骨截肢者的全身动态运动，结合了从动作捕捉数据中的模拟学习与精确的假肢动力学计算。我们模拟了在不同虚拟义肢刚度条件下的跑步动作，并分析了从模拟中获得的代谢运输成本，表明刚度的变化会影响跑步表现。我们的发现展示了该方法在虚拟条件下与现实条件不同的模拟和分析潜力。

HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

HTNav：具有分层结构的城市空中视觉与语言导航混合导航框架

Authors: Chengjie Fan, Cong Pan, Zijian Liu, Ningzhong Liu, Jie Qin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08883
Pdf link: https://arxiv.org/pdf/2604.08883
Abstract Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance across all scene levels and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.
中文摘要 受通用视觉与语言导航（VLN）任务启发，空中VLN因其在物流配送和城市检查等应用中的显著实用价值而受到广泛关注。然而，现有方法在复杂的城市环境中面临诸多挑战，包括对未见场景的推广不足、长期路径规划表现不佳以及对空间连续性的理解不足。为应对这些挑战，我们提出了HTNav，一种新的协作导航框架，将模仿学习（IL）和强化学习（RL）集成在混合IL-RL框架内。该框架采用分阶段训练机制，确保基本导航策略的稳定性，同时增强其环境勘探能力。通过整合分层决策机制，实现宏观路径规划与细粒度动作控制之间的协作互动。此外，还引入了地图表示学习模块，以加深对开放域空间连续性的理解。在CityNav基准测试中，我们的方法在所有场景层级和任务难度下都实现了最先进的性能。实验结果表明，该框架在复杂城市环境中显著提升了导航精度和稳健性。

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO：稳定性增强强化策略优化

Authors: Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Ruimin Dai, Xiaoyan Han, Yanjie Fu, Dakuo Wang, Kunpeng Liu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08905
Pdf link: https://arxiv.org/pdf/2604.08905
Abstract Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, models may generate responses that are fluent and semantically relevant but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines and can enhance both final-answer accuracy and logical stability.
中文摘要 强化学习（RL）在提升大型语言模型在复杂推理任务中的准确性方面非常有效。现有的强化学习策略优化框架依赖最终答案的正确性作为反馈信号，很少捕捉推理过程的内部逻辑结构。因此，模型会产生流畅且语义相关的回答，但逻辑上不一致、结构不稳定或冗余。为此，我们提出了StaRPO，一种稳定性增强的强化学习框架，明确将推理稳定性纳入优化目标。我们的StaRPO将稳定性分解为两个可计算的轻量级指标：自相关函数（ACF）用于评估局部的步对步一致性，以及路径效率（PE）用于评估推理轨迹的全局目标导向性。这些稳定性奖励与任务奖励结合，提供互补且过程意识的反馈。我们通过展示ACF和PE奖励与两个骨干模型逻辑错误的相关性，验证了其有效性。四个推理基准测试的实验表明，StaRPO持续优于比较基线，能够提升最终答案的准确性和逻辑稳定性。
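
Illustrative sketch (not from the paper): the abstract names the two stability metrics but does not define them, so the snippet uses textbook stand-ins. The sample autocorrelation is computed on a scalar per-step signal, and path efficiency is taken as net displacement divided by total path length over step embeddings. The choice of signal, the embeddings, and the function names are assumptions.

```python
import math

def sample_acf(series, lag=1):
    """Standard sample autocorrelation of a scalar series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: autocorrelation is undefined, report 0
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

def path_efficiency(points):
    """Straight-line distance from first to last point over total distance travelled."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    total = sum(dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    return dist(points[0], points[-1]) / total if total > 0 else 1.0

# Hypothetical per-step signals and 2-D step embeddings.
smooth_acf = sample_acf([1, 2, 3, 4, 5])
erratic_acf = sample_acf([1, -1, 1, -1])
direct_pe = path_efficiency([(0, 0), (1, 0), (2, 0)])
detour_pe = path_efficiency([(0, 0), (1, 1), (2, 0)])
print(smooth_acf, erratic_acf, direct_pe, detour_pe)
```

Under these stand-in definitions, a steadily progressing trajectory scores a positive autocorrelation and a path efficiency of 1, while an oscillating or detouring one scores lower on both.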

Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning

桥接SFT与强化学习：动态策略优化以实现稳健推理

Authors: Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, Yonghong He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08926
Pdf link: https://arxiv.org/pdf/2604.08926
Abstract Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL enables exploration (low bias) but grapples with high gradient variance. Existing unified optimization strategies often employ naive loss weighting, overlooking the statistical conflict between these distinct gradient signals. In this paper, we provide a rigorous theoretical analysis of this bias-variance trade-off and propose DYPO (Dynamic Policy Optimization), a unified framework designed to structurally mitigate this conflict. DYPO integrates three core components: (1) a Group Alignment Loss (GAL) that leverages intrinsic group dynamics to significantly reduce RL gradient variance; (2) a Multi-Teacher Distillation mechanism that corrects SFT fitting bias via diverse reasoning paths; and (3) a Dynamic Exploitation-Exploration Gating mechanism that adaptively arbitrates between stable SFT and exploratory RL based on reward feedback. Theoretical analysis confirms that DYPO linearly reduces fitting bias and minimizes overall variance. Extensive experiments demonstrate that DYPO significantly outperforms traditional sequential pipelines, achieving an average improvement of 4.8% on complex reasoning benchmarks and 13.3% on out-of-distribution tasks. Our code is publicly available.
中文摘要 大型语言模型（LLM）的训练后范式，主要是监督式微调（SFT）和强化学习（RL），面临一个根本性难题：SFT提供稳定性（低方差），但存在高拟合偏差，而RL支持探索（低偏差），但面临高梯度方差。现有统一优化策略通常采用朴素的损失加权，忽视了这些不同梯度信号之间的统计冲突。本文对这种偏差-方差权衡进行了严谨的理论分析，并提出了DYPO（动态策略优化）这一统一框架，旨在结构性地缓解这一冲突。DYPO集成了三个核心组件：（1）群体对齐损失（GAL），利用内在的群体动态显著降低强化学习梯度方差；（2）多教师蒸馏机制，通过多样推理路径纠正SFT拟合偏差；以及（3）动态利用-探索门控机制，基于奖励反馈，自适应地在稳定的SFT和探索性的强化学习之间进行仲裁。理论分析证实，DYPO线性地减少拟合偏差并最小化整体方差。大量实验表明，DYPO显著优于传统的顺序流水线，在复杂推理基准测试中平均提升4.8%，在分布外任务中提升13.3%。我们的代码已公开。

WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

WOMBET：基于世界模型的经验转移，实现稳健且样本高效的强化学习

Authors: Mintae Kim, Koushil Sreenath
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.08958
Pdf link: https://arxiv.org/pdf/2604.08958
Abstract Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
中文摘要 机器人中的强化学习（RL）通常受限于数据收集的成本和风险，促使经验从源任务迁移到目标任务。离线到在线的强化学习利用了先前的数据，但通常假设给定的固定数据集，不涉及如何生成可靠的数据用于迁移。我们提出了基于世界模型的经验迁移（WOMBET）框架，该框架联合生成并利用先验数据。WOMBET在源任务中学习世界模型，并通过不确定性惩罚规划生成离线数据，随后筛选出高回报且低认知不确定性的轨迹。随后，它通过离线与在线数据之间的自适应采样在目标任务中进行在线微调，实现从先验驱动初始化向任务特定适应的稳定过渡。我们证明了不确定性惩罚目标为真实回报提供了下界，并推导出有限样本误差分解，捕捉分布不匹配和近似误差。实证结果表明，WOMBET在连续控制基准测试中相较于强基线提升了样本效率和最终性能，展示了联合优化数据生成和迁移的优势。
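
Illustrative sketch (not from the paper): the two data-generation ideas in the abstract, an uncertainty-penalized return and a filter that keeps high-return, low-uncertainty trajectories, can be written in a few lines. The `Trajectory` fields, the penalty coefficient `beta`, and the thresholds are hypothetical; how uncertainty is estimated (for example, ensemble disagreement) is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    ret: float          # return predicted by the world model
    uncertainty: float  # epistemic uncertainty estimate (assumed to be given)

def penalized_return(traj, beta):
    """Pessimistic score: predicted return minus beta times the uncertainty."""
    return traj.ret - beta * traj.uncertainty

def filter_trajectories(trajs, min_return, max_uncertainty):
    """Keep trajectories with high return and low epistemic uncertainty."""
    return [t for t in trajs if t.ret >= min_return and t.uncertainty <= max_uncertainty]

# Hypothetical model-generated rollouts.
rollouts = [Trajectory(10.0, 0.1), Trajectory(12.0, 2.0), Trajectory(3.0, 0.05)]
kept = filter_trajectories(rollouts, min_return=5.0, max_uncertainty=0.5)
score = penalized_return(rollouts[1], beta=2.0)
print(kept, score)
```

The second rollout has the highest raw return but is dropped by the filter and scores lower once penalized, which is the intended effect of pessimism toward poorly modeled regions.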

Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

高效的层级隐式流Q-learning用于离线目标条件强化学习

Authors: Zhiqiang Dong, Teng Pang, Rongjian Xu, Guoqiang Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08960
Pdf link: https://arxiv.org/pdf/2604.08960
Abstract Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, considering the insufficiency of goal representation, we introduce a LeJEPA loss that repels goal representation embeddings during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.
中文摘要 离线目标条件强化学习（GCRL）是一种实用的强化学习范式，旨在从无奖励的离线数据中学习目标条件策略。尽管近年来如HIQL等分层架构有所进步，但由于高斯策略表达有限且高级策略无法生成有效子目标，离线GCRL中的长视野控制仍然具有挑战性。为解决这些局限性，我们提出了目标条件平均流策略，该策略为离线GCRL的分层策略建模引入了平均速度场。具体来说，平均流策略通过学习到的平均速度场捕捉高层和低层策略的复杂目标分布，从而通过一步抽样实现高效的动作生成。此外，考虑到目标表示的不足，我们引入了LeJEPA损失，使其在训练过程中排斥目标表示嵌入，从而鼓励更多判别性表示并提升泛化性。实验结果显示，我们的方法在OGBench基准测试中，无论是基于状态还是基于像素的任务都取得了强劲表现。

Multi-agent Reinforcement Learning for Low-Carbon P2P Energy Trading among Self-Interested Microgrids

多智能体强化学习用于自利微电网间的低碳点对点能源交易

Authors: Junhao Ren, Honglin Gao, Lan Zhao, Qiyu Kang, Gaoxi Xiao, Yajuan Sun
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.08973
Pdf link: https://arxiv.org/pdf/2604.08973
Abstract Uncertainties in renewable generation and demand dynamics challenge day-ahead scheduling. To enhance renewable penetration and maintain intra-day balance, we develop a multi-agent reinforcement learning framework for self-interested microgrids participating in peer-to-peer (P2P) electricity trading. Each microgrid independently bids both price and quantity while optimizing its own profit via storage arbitrage under time-varying main-grid prices. A market-clearing mechanism coordinating trades and promoting incentive compatibility is proposed. Simulation results show that the learned bidding policy improves renewable utilization and reduces reliance on high-carbon electricity, while increasing community-level economic welfare, delivering a win-win situation in emission reduction and local prosperity.
中文摘要 可再生能源发电和需求动态的不确定性挑战了提前的排班。为了提升可再生能源的渗透率并保持日中平衡，我们为参与点对点（P2P）电力交易的自利微电网开发了一个多代理强化学习框架。每个微电网独立竞价价格和数量，同时通过储存套利优化自身利润，且价格与主电网价格不变。提出了一种市场清算机制，协调交易并促进激励兼容性。模拟结果显示，学习式招标政策提高了可再生能源的利用率，减少了对高碳电力的依赖，同时提升了社区层面的经济福祉，实现减排和地方繁荣的双赢局面。

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

PerMix-RLVR：在可验证奖励对齐下保持人物表达力

Authors: Jihwan Oh, Soowon Oh, Murad Aghazada, Minchan Jeong, Sungnyun Kim, Se-Young Yun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08986
Pdf link: https://arxiv.org/pdf/2604.08986
Abstract Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when needed, e.g., in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
中文摘要 人物提示已被广泛采用，用于引导大型语言模型（LLMs）的行为，并通过分配特定字符来提升其指令性能。然而，确定最佳人格非常耗时，且其对产出质量的影响仍不充分。此前的研究主要通过推理时间策略在提示层面解决这个问题，这带来了额外的计算。在本研究中，我们通过在训练中处理人物形象敏感性，避免了推断时间提示搜索，旨在训练能够适应多样角色的模型，同时保持任务表现。特别是，我们发现带有可验证奖励的强化学习（RLVR）系统性地降低了对角色提示的敏感度，但也揭示了基于结果优化的内在权衡：虽然RLVR提升了可验证目标任务的鲁棒性，但在需要时也可能削弱角色表达力，例如角色内角色扮演。为解决这一限制，我们提出了PerMix-RLVR策略，这是一种角色混合RLVR策略，该策略减轻了角色的稳健性与忠实度的权衡，保持对有害角色多样性的强韧性，同时在需要时实现忠实角色的采用。具体来说，PerMix-RLVR在MATH500上相比RLVR提升了人格稳定性评分（PSS）+21.2%，同时在PersonaGym上也提升了+11.4%。

ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

ActFER：通过主动工具增强视觉推理实现的能动面部表情识别

Authors: Shifeng Liu, Zhengye Zhang, Sirui Zhao, Xinglong Mao, Zhehan Kan, Zhixiang Wei, Shiwei Wu, Chaoyou Fu, Tong Xu, Enhong Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.08990
Pdf link: https://arxiv.org/pdf/2604.08990
Abstract Recent advances in Multimodal Large Language Models (MLLMs) have created new opportunities for facial expression recognition (FER), moving it beyond pure label prediction toward reasoning-based affect understanding. However, existing MLLM-based FER methods still follow a passive paradigm: they rely on externally prepared facial inputs and perform single-pass reasoning over fixed visual evidence, without the capability for active facial perception. To address this limitation, we propose ActFER, an agentic framework that reformulates FER as active visual evidence acquisition followed by multimodal reasoning. Specifically, ActFER dynamically invokes tools for face detection and alignment, selectively zooms into informative local regions, and reasons over facial Action Units (AUs) and emotions through a visual Chain-of-Thought. To realize such behavior, we further develop Utility-Calibrated GRPO (UC-GRPO), a reinforcement learning algorithm tailored to agentic FER. UC-GRPO uses AU-grounded multi-level verifiable rewards to densify supervision, query-conditional contrastive utility estimation to enable sample-aware dynamic credit assignment for local inspection, and emotion-aware EMA calibration to reduce noisy utility estimates while capturing emotion-wise inspection tendencies. This algorithm enables ActFER to learn both when local inspection is beneficial and how to reason over the acquired evidence. Comprehensive experiments show that ActFER trained with UC-GRPO consistently outperforms passive MLLM-based FER baselines and substantially improves AU prediction accuracy.
中文摘要 多模态大型语言模型（MLLM）的最新进展为面部表情识别（FER）创造了新机遇，使其从单纯的标签预测转向基于推理的情感理解。然而，现有基于MLLM的FER方法仍遵循被动范式：它们依赖外部准备的面部输入，并在固定视觉证据上进行单次推理，而不具备主动面部感知的能力。为解决这一局限性，我们提出了ActFER，一种能动框架，将FER重新表述为主动视觉证据获取，随后进行多模态推理。具体来说，ActFER动态调用面部检测和对齐工具，选择性地聚焦于信息丰富的局部区域，并通过视觉思维链对面部行动单元（AU）和情感进行推理。为实现此类行为，我们进一步开发了效用校准GRPO（UC-GRPO），这是一种针对代理FER量身定制的强化学习算法。UC-GRPO采用基于AU的多级可验证奖励来丰富监督，利用查询条件对比效用估计实现样本感知的动态信用分配以进行局部检查，并采用情绪感知EMA校准以减少噪声效用估计，同时捕捉情绪化的检查倾向。该算法使ActFER能够学习何时本地检查有益，以及如何对已采集的证据进行推理。综合实验表明，使用UC-GRPO训练的ActFER持续优于基于MLLM的被动FER基线，并显著提升AU预测准确性。

Hypergraph Neural Networks Accelerate MUS Enumeration

超图神经网络加速多单元枚举

Authors: Hiroya Ijima, Koichiro Yawata
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2604.09001
Pdf link: https://arxiv.org/pdf/2604.09001
Abstract Enumerating Minimal Unsatisfiable Subsets (MUSes) is a fundamental task in constraint satisfaction problems (CSPs). Its major challenge is the exponential growth of the search space, which becomes particularly severe when satisfiability checks are expensive. Recent machine learning approaches reduce this cost for Boolean satisfiability problems but rely on explicit variable-constraint relationships, limiting their application domains. This paper proposes a domain-agnostic method to accelerate MUS enumeration using Hypergraph Neural Networks (HGNNs). The proposed method incrementally builds a hypergraph with constraints as vertices and MUSes enumerated until the current step as hyperedges, and employs an HGNN-based agent trained via reinforcement learning to minimize the number of satisfiability checks required to obtain an MUS. Experimental results demonstrate the effectiveness of our approach in accelerating MUS enumeration, showing that our method can enumerate more MUSes within the same satisfiability check budget compared to conventional methods.
中文摘要 枚举最小不可满足子集（MUS）是约束满足问题（CSP）中的一项基本任务。其主要挑战是搜索空间的指数级增长，当可满足性检查代价高昂时，这一问题尤为严重。近期的机器学习方法降低了布尔可满足性问题上的这一开销，但依赖显式的变量-约束关系，限制了其应用领域。本文提出了一种领域无关的方法，利用超图神经网络（HGNN）加速 MUS 枚举。所提方法以约束为顶点、以截至当前步骤已枚举出的 MUS 为超边，增量式地构建超图，并采用通过强化学习训练的基于 HGNN 的智能体，以最小化获得一个 MUS 所需的可满足性检查次数。实验结果证明了该方法在加速 MUS 枚举方面的有效性：与传统方法相比，在相同的可满足性检查预算内可以枚举出更多的 MUS。
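
To make the "number of satisfiability checks" in the MUS entry above concrete, here is a minimal sketch of classic deletion-based MUS extraction with a check counter. This is a standard baseline, not the paper's HGNN agent; the brute-force `satisfiable` helper and the tiny integer domain are assumptions chosen only to keep the example self-contained.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Constraint = Callable[[Dict[str, int]], bool]


def satisfiable(constraints: Sequence[Constraint], variables: Sequence[str],
                domain: Sequence[int]) -> bool:
    """Brute-force check: does any assignment satisfy every constraint?"""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(c(assignment) for c in constraints):
            return True
    return False


def shrink_to_mus(constraints: Sequence[Constraint], variables: Sequence[str],
                  domain: Sequence[int]) -> Tuple[List[int], int]:
    """Deletion-based shrinking of an unsatisfiable set.

    Assumes the full constraint set is unsatisfiable.
    Returns (indices of one MUS, number of satisfiability checks used).
    """
    kept = list(range(len(constraints)))
    checks = 0
    for i in list(kept):
        trial = [j for j in kept if j != i]
        checks += 1
        if not satisfiable([constraints[j] for j in trial], variables, domain):
            kept = trial  # still unsatisfiable without i, so i is not needed
    return kept, checks
```

Each loop iteration costs one check, so the order in which constraints are tried determines how quickly the set shrinks; a learned policy of the kind the abstract describes would aim to reduce that count.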

Plasticity-Enhanced Multi-Agent Mixture of Experts for Dynamic Objective Adaptation in UAVs-Assisted Emergency Communication Networks

可塑性增强的多智能体专家混合模型，用于无人机辅助应急通信网络中的动态目标适应

Authors: Wen Qiu, Zhiqiang He, Wei Zhao, Hiroshi Masui
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.09028
Pdf link: https://arxiv.org/pdf/2604.09028
Abstract Unmanned aerial vehicles serving as aerial base stations can rapidly restore connectivity after disasters, yet abrupt changes in user mobility and traffic demands shift the quality of service trade-offs and induce strong non-stationarity. Deep reinforcement learning policies suffer from plasticity loss under such shifts, as representation collapse and neuron dormancy impair adaptation. We propose plasticity enhanced multi-agent mixture of experts (PE-MAMoE), a centralized training with decentralized execution framework built on multi-agent proximal policy optimization. PE-MAMoE equips each UAV with a sparsely gated mixture of experts actor whose router selects a single specialist per step. A non-parametric Phase Controller injects brief, expert-only stochastic perturbations after phase switches, resets the action log-standard-deviation, anneals entropy and learning rate, and schedules the router temperature, all to re-plasticize the policy without destabilizing safe behaviors. We derive a dynamic regret bound showing the tracking error scales with both environment variation and cumulative noise energy. In a phase-driven simulator with mobile users and 3GPP-style channels, PE-MAMoE improves normalized interquartile mean return by 26.3% over the best baseline, increases served-user capacity by 12.8%, and reduces collisions by approximately 75%. Diagnostics confirm persistently higher expert feature rank and periodic dormant-neuron recovery at regime switches.
中文摘要 作为空中基站的无人机可以在灾难发生后迅速恢复连接，但用户移动性和交通需求的突变会改变服务质量的权衡，并导致强烈的非固定状态。深度强化学习策略在此类转变下会失去可塑性，因为表征崩溃和神经元休眠会削弱适应能力。我们提出可塑性增强多智能体专家混合（PE-MAMoE），这是一种基于多智能体近端策略优化的中心化训练和去中心化执行框架。PE-MAMoE为每架无人机配备了稀疏的专家组合，其路由器每一步选择一名专家。非参数相位控制器在相位切换后注入短暂的专家专属随机扰动，重置动作对数标准差，退火熵和学习率，调度路由器温度，所有这些都是为了重新塑化策略，同时不破坏安全行为。我们推导出一个动态后悔界限，显示跟踪误差随环境变化和累计噪声能量的变化。在拥有移动用户和3GPP风格信道的相位驱动模拟器中，PE-MAMoE比最佳基线提升了26.3%的归一化四分位平均回报，提升了12.8%的服务用户容量，并减少了约75%的碰撞。诊断结果显示，在状态切换时，专家级和休眠神经元的周期性恢复持续提升。
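
The PE-MAMoE abstract above describes a router that selects a single specialist per step and a scheduled router temperature. A minimal sketch of that idea follows, assuming a temperature-scaled softmax with arg-max selection; the function name `route_top1` and this exact gating rule are assumptions, not taken from the paper.

```python
import math
from typing import List, Tuple


def route_top1(logits: List[float], temperature: float = 1.0) -> Tuple[int, List[float]]:
    """Return (index of the selected expert, routing probabilities)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs
```

A higher temperature flattens the routing distribution, which is one plausible reason to schedule it after a phase switch: the router becomes less committed to the previously dominant expert.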

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

优势引导扩散用于基于模型的强化学习

Authors: Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.09035
Pdf link: https://arxiv.org/pdf/2604.09035
Abstract Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
中文摘要 采用自回归世界模型的基于模型的强化学习（MBRL）存在误差累积问题，而扩散世界模型通过联合生成轨迹片段来缓解这一问题。然而，现有的扩散引导方式要么仅基于策略，丢弃了价值信息；要么基于奖励，在扩散时域较短时会变得短视。我们提出了面向 MBRL 的优势引导扩散（AGD-MBRL），利用智能体的优势估计来引导反向扩散过程，使采样集中于预期在生成窗口之外能带来更高长期回报的轨迹。我们设计了两种引导方式：（i）Sigmoid 优势引导（SAG）和（ii）指数优势引导（EAG）。我们证明，通过 SAG 或 EAG 引导的扩散模型能够对轨迹进行重加权采样，其权重随状态-动作优势递增，这意味着在标准假设下可实现策略改进。此外，我们表明，与无引导的扩散模型相比，AGD-MBRL 生成的轨迹遵循一个更优（即价值更高）的策略。AGD 通过引导状态分量、同时保持动作生成以策略为条件，可与 PolyGRAD 风格的架构无缝集成，且无需更改扩散训练目标。在 MuJoCo 控制任务（HalfCheetah、Hopper、Walker2D 和 Reacher）上，AGD-MBRL 相比 PolyGRAD、一种在线 Diffuser 式奖励引导以及无模型基线（PPO/TRPO）提高了样本效率和最终回报，在某些情况下提升幅度达 2 倍。这些结果表明，优势感知引导是解决扩散模型 MBRL 中短时域短视问题的一种简单而有效的方法。
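
The phrase "reweighted sampling of trajectories with weights increasing in state-action advantage" in the AGD-MBRL entry above can be illustrated on a handful of candidate trajectories. The sketch assumes weights of the form sigmoid(scale * A) and exp(scale * A); these forms, the `scale` parameter, and the function names are assumptions made for illustration, and the paper's guides act on the reverse diffusion process rather than on a finite candidate list.

```python
import math
from typing import List


def sigmoid_weights(advantages: List[float], scale: float = 1.0) -> List[float]:
    """Bounded weights in (0, 1), increasing in advantage (assumed SAG-like form)."""
    return [1.0 / (1.0 + math.exp(-scale * a)) for a in advantages]


def exponential_weights(advantages: List[float], scale: float = 1.0) -> List[float]:
    """Unbounded weights, increasing in advantage (assumed EAG-like form)."""
    return [math.exp(scale * a) for a in advantages]


def normalize(weights: List[float]) -> List[float]:
    """Turn non-negative weights into sampling probabilities."""
    total = sum(weights)
    return [w / total for w in weights]
```

Under these assumed forms the exponential weighting concentrates more probability on the highest-advantage trajectory than the sigmoid weighting does, because the sigmoid saturates.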

Learning Vision-Language-Action World Models for Autonomous Driving

学习视觉-语言-行动世界模型用于自动驾驶

Authors: Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, Chao Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09059
Pdf link: https://arxiv.org/pdf/2604.09059
Abstract Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the imagined future they generate. In this work, we present VLA-World, a simple yet effective VLA world model that unifies predictive imagination with reflective reasoning to improve driving foresight. VLA-World first uses an action-derived feasible trajectory to guide the generation of the next-frame image, capturing rich spatial and temporal cues that describe how the surrounding environment evolves. The model then reasons over this self-generated future imagined frame to refine the predicted trajectory, achieving higher performance and better interpretability. To support this pipeline, we curate nuScenes-GR-20K, a generative reasoning dataset derived from nuScenes, and employ a three-stage training strategy that includes pretraining, supervised fine-tuning, and reinforcement learning. Extensive experiments demonstrate that VLA-World consistently surpasses state-of-the-art VLA and world-model baselines on both planning and future-generation benchmarks. Project page: this https URL
中文摘要 视觉-语言-行动（VLA）模型最近在端到端自动驾驶方面取得了显著进展，将感知、推理和控制整合到统一的多模态框架中。然而，它们通常缺乏对时间动态和全球世界一致性的明确建模，这限制了其前瞻性和安全性。相比之下，世界模型可以模拟合理的未来场景，但通常难以推理或评估它们所生成的想象未来。在本研究中，我们提出了VLA-World，一种简单而有效的VLA世界模型，将预测想象力与反思推理相结合，以提升前瞻性驱动力。VLA-World首次使用动作衍生的可行轨迹来引导下一帧图像的生成，捕捉描述周围环境演变的丰富空间和时间线索。模型随后对这个自我生成的未来想象框架进行推理，以优化预测轨迹，实现更高的性能和更好的解释性。为支持这一流程，我们策划了nuScenes-GR-20K，这是一个源自nuScenes的生成推理数据集，并采用三阶段训练策略，包括预训练、监督微调和强化学习。大量实验表明，VLA-World在规划和未来世代基准测试上始终超越最先进的VLA和世界模型基线。项目页面：此 https URL

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

TensorHub：适用于LLM RL训练的可扩展且弹性的权重传输

Authors: Chenhao Ye, Huaizheng Zhang, Mingcong Han, Baoquan Zhong, Xiang Li, Qixiang Chen, Xinyi Zhang, Weidong Zhang, Kaihua Jiang, Wang Zhang, He Sun, Wencong Xiao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09107
Pdf link: https://arxiv.org/pdf/2604.09107
Abstract Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
中文摘要 现代LLM强化学习（RL）工作负载需要高效的权重转移系统，以跨异构计算资源进行训练的扩展。然而，现有权重转移方法要么无法提供动态扩展集群的灵活性，要么会产生基本的数据移动开销，导致性能较差。我们引入了参考导向存储（ROS），这是一种用于强化学习权重转移的新型存储抽象，利用了高度复制的模型权重。ROS给人一种错觉，认为某些版本的模型权重是被存储的，并且可以按需取用。在下面，ROS不物理存储权重的副本;相反，它追踪持有这些权重的 GPU 工作者以进行推断。ROS在请求时，直接使用它们来提供阅读。我们构建了TensorHub，这是一个生产级系统，通过拓扑优化的传输、强一致性和容错性，扩展了ROS理念。评估显示，TensorHub 完全饱和了 RDMA 带宽，并以最小的工程操作适应三种不同的部署工作负载。具体来说，TensorHub 将独立部署的总 GPU 停滞时间减少高达 6.7 倍，弹性推展的权重更新加速 4.8 倍，跨数据中心部署停滞时间缩短 19 倍。TensorHub已部署为生产环境，支持前沿的强化学习培训。
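
The Reference-Oriented Storage idea in the TensorHub entry above (track which workers hold a weight version rather than storing copies) can be sketched as a small in-memory index. Everything here is hypothetical: the class `ReferenceRegistry`, its method names, and the one-version-per-worker rule are assumptions for illustration and do not describe TensorHub's actual interface.

```python
from typing import Dict, List, Set


class ReferenceRegistry:
    """Hypothetical index: weight version -> workers currently holding it."""

    def __init__(self) -> None:
        self._holders: Dict[int, Set[str]] = {}

    def register(self, worker: str, version: int) -> None:
        # Assumption: a worker holds one version at a time, so drop its old reference.
        self.release(worker)
        self._holders.setdefault(version, set()).add(worker)

    def release(self, worker: str) -> None:
        """Forget every reference held by this worker (e.g. when it leaves)."""
        for version in list(self._holders):
            self._holders[version].discard(worker)
            if not self._holders[version]:
                del self._holders[version]

    def locate(self, version: int) -> List[str]:
        """Workers a reader could fetch this version from (sorted for determinism)."""
        return sorted(self._holders.get(version, set()))
```

A version with no remaining holders simply stops being locatable, which matches the abstract's point that the store presents an illusion of stored versions while holding no copies itself.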

Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

面向单步采样强化学习的截断整流流策略

Authors: Xubin Zhou, Yipeng Yang, Zhan Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.09159
Pdf link: https://arxiv.org/pdf/2604.09159
Abstract Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.
中文摘要 最大熵强化学习（MaxEnt RL）已成为顺序决策的标准框架，但其标准的高斯策略参数化本质上是单模态的，限制了其对复杂多模态作用分布的建模能力。这一局限促使基于扩散和流匹配的生成策略成为更具表达性的替代方案，受到越来越多的关注。然而，将此类策略纳入 MaxEnt RL 面临的挑战主要有二：连续时间生成策略的概率和熵通常难以处理，且多步采样引入了长视野反向传播的不稳定性和显著的推理延迟。为应对这些挑战，我们提出了截断整流策略（TRFP），这是一个基于混合确定性-随机架构构建的框架。该设计使熵正则化优化变得易于处理，同时支持稳定训练和通过梯度截断和流直线实现有效的单步采样。在玩具多目标环境和10个MuJoCo基准测试上的实证结果表明，TRFP能有效捕捉多模态行为，在标准抽样下大多数基准测试中表现优于强基准，并在一步抽样下保持高度竞争力。
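
Why flow straightening makes one-step sampling viable, as the TRFP entry above suggests, can be seen on a toy one-dimensional flow. The velocity field below is an assumption constructed so that every path from x0 to 2*x0 + 1 is a straight line; it is not a learned policy and not the paper's architecture.

```python
def velocity(x: float, t: float) -> float:
    """Toy velocity field whose paths x(t) = x0 * (1 + t) + t are straight lines."""
    return (x - t) / (1.0 + t) + 1.0


def euler_sample(x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0
    h = 1.0 / steps
    for k in range(steps):
        x += h * velocity(x, k * h)
    return x
```

Because the velocity is constant along each straight path, a single Euler step lands on the same endpoint as many small steps; with curved paths the one-step result would drift from the multi-step one.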

On the Role of DAG topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

关于DAG拓扑在能量感知云调度中的作用：基于GNN的深度强化学习方法

Authors: Anas Hattay, Fred Ngole Mboula, Eric Gascard, Zakaria Yahoun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09202
Pdf link: https://arxiv.org/pdf/2604.09202
Abstract Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
中文摘要 云服务提供商必须为工作流 DAG 分配异构计算资源，同时平衡完成时间、成本和能耗等相互竞争的目标。本文研究单工作流、无队列的调度场景，并考察一种基于图神经网络（GNN）的深度强化学习调度器，其目标是最小化工作流完成时间和能耗。我们识别了基于 GNN 的深度强化学习调度器会失效的特定分布外（OOD）条件，并对失效原因给出了有原则的解释。通过受控的 OOD 评估，我们证明性能下降源于训练环境与部署环境之间的结构性不匹配，这种不匹配会破坏消息传递并削弱策略的泛化能力。我们的分析揭示了当前基于 GNN 的调度器的根本局限性，并强调需要更稳健的表示，以确保在分布偏移下仍具有可靠的调度性能。

Online Intention Prediction via Control-Informed Learning

通过控制知情学习进行在线意图预测

Authors: Tianyu Zhou, Zihao Liang, Zehui Lu, Shaoshuai Mou
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.09303
Pdf link: https://arxiv.org/pdf/2604.09303
Abstract This paper presents an online intention prediction framework for estimating the goal state of autonomous systems in real time, even when intention is time-varying, and system dynamics or objectives include unknown parameters. The problem is formulated as an inverse optimal control / inverse reinforcement learning task, with the intention treated as a parameter in the objective. A shifting horizon strategy discounts outdated information, while online control-informed learning enables efficient gradient computation and online parameter updates. Simulations under varying noise levels and hardware experiments on a quadrotor drone demonstrate that the proposed approach achieves accurate, adaptive intention prediction in complex environments.
中文摘要 本文提出了一种在线意图预测框架，用于实时估计自主系统的目标状态，即使意图随时间变化，且系统动力学或目标包含未知参数。该问题被表述为一个逆最优控制/逆强化学习任务，意图被视为目标中的参数。视野转移策略减少过时信息，而在线控制知情学习则实现了高效的梯度计算和在线参数更新。在不同噪声水平下的模拟和四旋翼无人机的硬件实验表明，所提方法能够在复杂环境中实现准确且自适应的意图预测。

Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

注意空间推理与行动之间的差距！使用 Spatial-Gym 对智能体进行逐步评估

Authors: Lars Benedikt Kaesberg, Tianyu Yang, Niklas Bauer, Terry Ruas, Jan Philip Wahle, Bela Gipp
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.09338
Pdf link: https://arxiv.org/pdf/2604.09338
Abstract Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.
中文摘要 空间推理是导航和机器人技术的核心，但衡量模型在这些任务上的能力仍然困难。现有基准以一次性（one-shot）方式评估模型，要求在单次响应中生成完整解，这与人类在交互环境中逐步求解的方式不同。我们提出了 Spatial-Gym，这是一个 Gymnasium 环境，它将二维网格谜题中的寻路作为可选回溯的序贯决策任务进行测试，从而单独考察空间约束推理。我们在三种设置（一次性、逐步、带回溯的逐步）下，在 500 个回合上评估了八个模型，并与人类、随机和 A* 基线进行比较。最佳模型 GPT-OSS 120B 的求解率为 16.0%，比人类基线（98.0%）低 82 个百分点。逐步形式通过消除格式错误帮助了较弱的模型（最高 +5.4%），但因限制了全局规划而损害了较强的模型（最高 5.6%）。回溯提高了回合完成率，但仅提升了较弱模型的求解率；较强的模型很少回溯，也没有从中受益。我们的实验有三个关键发现：（1）模型未能随难度增加而相应加大推理投入；（2）视觉模型在接收空间环境图像时求解率降低 73%；（3）即使在逐步设置下，扩展的思维链推理相比标准推理仍保有 3-5 倍的准确率优势。Spatial-Gym 有助于诊断模型的局限性，并为通过强化学习提升空间推理能力提供了框架。

Visually-Guided Policy Optimization for Multimodal Reasoning

面向多模态推理的视觉引导策略优化

Authors: Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.09349
Pdf link: https://arxiv.org/pdf/2604.09349
Abstract Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
中文摘要 带有可验证奖励的强化学习（RLVR）显著提升了视觉语言模型（VLM）的推理能力。然而，VLM固有的文本主导性常导致视觉忠实度不足，表现为对视觉符号的注意力激活稀少。更重要的是，我们的实证分析显示，时间视觉遗忘在推理步骤中加剧了这一缺陷。为弥合这一差距，我们提出了视觉引导政策优化（VGPO）这一新框架，用于在政策优化过程中强化视觉聚焦。具体来说，VGPO最初引入了视觉注意力补偿机制，利用视觉相似性定位和放大视觉线索，同时在后续步骤逐步提升视觉期望，以抵消视觉遗忘。基于该机制，我们实施了双粒度优势重权策略：轨迹内层重点突出表现出较高视觉激活度的代币，而轨迹间层级优先考虑显示出优越视觉累积的轨迹。大量实验表明，VGPO在数学多模态推理和视觉依赖任务中实现了更好的视觉激活和更优异的表现。

Musculoskeletal Motion Imitation for Learning Personalized Exoskeleton Control Policy in Impaired Gait

通过肌肉骨骼运动模仿学习受损步态下的个性化外骨骼控制策略

Authors: Itak Choi, Ilseung Park, Eni Halilaj, Inseung Kang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.09431
Pdf link: https://arxiv.org/pdf/2604.09431
Abstract Designing generalizable control policies for lower-limb exoskeletons remains fundamentally constrained by exhaustive data collection or iterative optimization procedures, which limit accessibility to clinical populations. To address this challenge, we introduce a device-agnostic framework that combines physiologically plausible musculoskeletal simulation with reinforcement learning to enable scalable personalized exoskeleton assistance for both able-bodied and clinical populations. Our control policies not only generate physiologically plausible locomotion dynamics but also capture clinically observed compensatory strategies under targeted muscular deficits, providing a unified computational model of both healthy and pathological gait. Without task-specific tuning, the resulting exoskeleton control policies produce assistive torque profiles at the hip and ankle that align with state-of-the-art profiles validated in human experiments, while consistently reducing metabolic cost across walking speeds. For simulated impaired-gait models, the learned control policies yield asymmetric, deficit-specific exoskeleton assistance that improves both energetic efficiency and bilateral kinematic symmetry without explicit prescription of the target gait pattern. These results demonstrate that physiologically plausible musculoskeletal simulation via reinforcement learning can serve as a scalable foundation for personalized exoskeleton control across both able-bodied and clinical populations, eliminating the need for extensive physical trials.
中文摘要 为下肢外骨骼设计可推广的控制策略仍受限于详尽的数据收集或迭代优化程序，这些限制了临床人群的可及性。为应对这一挑战，我们引入了一个无关设备框架，结合生理上合理的肌肉骨骼模拟与强化学习，实现可扩展的个性化外骨骼辅助，适用于健全人群和临床人群。我们的控制策略不仅生成生理上合理的运动动态，还捕捉临床观察到的针对性肌肉缺陷下的代偿策略，提供了一个健康步态与病理步态的统一计算模型。在没有特定任务调校的情况下，产生的外骨骼控制政策在髋部和踝部产生与人类实验验证的最先进曲线相符的辅助扭矩曲线，同时在行走速度中持续降低代谢成本。对于模拟的受损步态模型，所学控制策略能够提供非对称、针对缺陷特异的外骨骼辅助，既能提升能量效率，也能提升双侧运动学对称性，而无需明确规定目标步态模式。这些结果表明，通过强化学习进行生理学上合理的肌肉骨骼模拟，可以作为适合健全人和临床人群个性化外骨骼控制的可扩展基础，无需大量物理试验。

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

SafeAdapt：深度强化学习中可证明安全的策略更新

Authors: Maksim Anisimov (Imperial College London), Francesco Belardinelli (Imperial College London), Matthew Wicker (Imperial College London)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09452
Pdf link: https://arxiv.org/pdf/2604.09452
Abstract Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.
中文摘要 安全保障是在安全关键任务中部署强化学习（RL）代理的前提条件。部署环境通常表现出非平稳动态，或性能目标发生变化，因此需要对已学策略进行更新。这引出了一个根本性的挑战：如何在保持之前遇到任务安全属性的同时更新强化学习策略？目前大多数方法要么不提供正式保证，要么只对政策安全性进行事后验证。我们提出了一种新颖的先验方法，通过引入罗生门集合：在策略参数空间中经过验证，能够满足演示数据分布中安全约束的区域。随后我们证明，可以通过将任意强化学习算法的更新投影到罗生门集合上，为用于更新策略的算法提供形式化且可证明的保证。通过实证，我们在网格世界导航环境（如Frozen Lake和Poisoned Apple）验证了该方法，确保源任务在下游适应过程中先验可证明的确定性安全性。相比之下，我们观察到基于正则化的基线会严重忘记安全约束，而我们的方法则能强有力适应并保证安全得以保持。
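
The SafeAdapt entry above describes projecting an arbitrary algorithm's updates onto a certified region of parameter space. A minimal sketch follows under a strong simplifying assumption: the certified region is an axis-aligned box given by per-parameter `lower` and `upper` bounds. The abstract does not state the shape of the Rashomon set, so the box, the function names, and the plain gradient step are all illustrative.

```python
from typing import List, Sequence


def project_onto_box(params: Sequence[float], lower: Sequence[float],
                     upper: Sequence[float]) -> List[float]:
    """Euclidean projection onto an axis-aligned box: clip each coordinate."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(params, lower, upper)]


def safe_update(params: Sequence[float], gradient: Sequence[float], lr: float,
                lower: Sequence[float], upper: Sequence[float]) -> List[float]:
    """Take an unconstrained gradient step, then project back into the box."""
    proposed = [p - lr * g for p, g in zip(params, gradient)]
    return project_onto_box(proposed, lower, upper)
```

The update rule that produces `proposed` is interchangeable; only the projection step enforces membership in the assumed certified region, which mirrors the abstract's claim that the guarantee holds for arbitrary RL algorithms.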

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

从推理到智能体：大型语言模型强化学习中的信用分配

Authors: Chenchen Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.09459
Pdf link: https://arxiv.org/pdf/2604.09459
Abstract Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards -- yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500--30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K--1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches -- hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations -- that have no direct precedent in reasoning RL.
中文摘要 大型语言模型（LLM）的强化学习（RL）越来越依赖稀疏的结果级奖励，但要确定长轨迹中的哪些动作导致了该结果仍然困难。这一信用分配（CA）问题体现在两种场景中：推理强化学习，信用必须在单次思维链生成（500 至 30K+ 个 token）内的 token 和步骤之间分配；以及智能体强化学习，其中多轮环境交互引入了随机转移、部分可观测性和 100 轮以上的时域（100K 至 1M 个 token），使得回合级信用的信息量越来越低。我们综述了 2024 年至 2026 年初发表的 47 种 CA 方法（41 种核心方法，6 种相邻的使能方法），并按分配粒度（token、片段、步骤、轮次、多智能体）和方法论（蒙特卡洛、时序差分、基于模型、博弈论、信息论）构建了二维分类体系。除综述本身外，我们还贡献了三项可复用的资源：（1）一份结构化、机器可读的论文清单，包含分类标签、基线族和证据等级；（2）一份面向未来 CA 论文的报告检查清单，并对照所综述的文献进行了验证，以识别系统性的方法学缺口；（3）一份基准协议规范，包含任务族、元数据要求和受控分岔任务，并附有方法选择决策树。我们的综合分析表明，从推理强化学习到智能体强化学习的转变使信用分配的格局更加复杂并得以重塑：推理 CA 正围绕过程奖励模型和无评论家（critic-free）的组内比较走向成熟，而智能体 CA 正在催生真正全新的方法，如事后反事实分析、特权非对称评论家以及轮次级 MDP 重构，这些方法在推理强化学习中没有直接先例。
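
The "critic-free group comparison" named in the credit assignment survey entry above is commonly realized by normalizing each response's reward against the other responses sampled for the same prompt, as in GRPO-style training. The sketch below assumes that reading; using the population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """One advantage per sampled response, relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Every token of a response inherits the same scalar, which is exactly the coarse, outcome-level credit that finer-grained methods in the survey try to improve on; a group with identical rewards yields all-zero advantages and therefore no learning signal.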

Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing

物理知情强化学习空间密度速度势能，用于无地图竞速

Authors: Shathushan Sivashangaran, Apoorva Khairnar, Sepideh Gohari, Vihaan Dutta, Azim Eskandarian
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.09499
Pdf link: https://arxiv.org/pdf/2604.09499
Abstract Autonomous racing without prebuilt maps is a grand challenge for embedded robotics that requires kinodynamic planning from instantaneous sensor data at the acceleration and tire friction limits. Out-Of-Distribution (OOD) generalization to various racetrack configurations utilizes Machine Learning (ML) to encode the mathematical relation between sensor data and vehicle actuation for end-to-end control, with implicit localization. These comprise Behavioral Cloning (BC) that is capped to human reaction times and Deep Reinforcement Learning (DRL) which requires large-scale collisions for comprehensive training that can be infeasible without simulation but is arduous to transfer to reality, thus exhibiting greater performance than BC in simulation, but actuation instability on hardware. This paper presents a DRL method that parameterizes nonlinear vehicle dynamics from the spectral distribution of depth measurements with a non-geometric, physics-informed reward, to infer vehicle time-optimal and overtaking racing controls with an Artificial Neural Network (ANN) that utilizes less than 1% of the computation of BC and model-based DRL. Slaloming from simulation to reality transfer and variance-induced conservatism are eliminated with the combination of a physics engine exploit-aware reward and the replacement of an explicit collision penalty with an implicit truncation of the value horizon. The policy outperforms human demonstrations by 12% in OOD tracks on proportionally scaled hardware, by maximizing the friction circle with tire dynamics that resemble an empirical Pacejka tire model. System identification illuminates a functional bifurcation where the first layer compresses spatial observations to extract digitized track features with higher resolution in corner apexes, and the second encodes nonlinear dynamics.
中文摘要 没有预设地图的自主赛车是嵌入式机器人的巨大挑战，需要从加速和轮胎摩擦极限的瞬时传感器数据中进行运动动力学规划。分布外（Out-Of-Distribution，OOD）推广到各种赛道配置，利用机器学习编码传感器数据与车辆驱动之间的数学关系，实现端到端控制，并隐含定位。这些包括行为克隆（BC），其反应时间限制在人类范围内;以及深度强化学习（DRL），需要大规模碰撞以实现全面训练，这种训练在没有仿真的情况下难以实现，但难以转化为现实，因此在仿真中表现优于BC，但在硬件上存在驱动不稳定性。本文提出了一种DRL方法，通过深度测量的光谱分布参数化非线性车辆动力学，并获得非几何、物理知情的奖励，从而通过人工神经网络（ANN）推断车辆时间最优和超车控制，该网络利用BC计算量和基于模型的DRL计算量不到1%。通过物理引擎的利用感知奖励和用隐式截断值视野取代显式碰撞惩罚，消除了从模拟到现实转移的缓慢变化和方差引起的保守性。该政策在按比例比例的硬件上进行户外轨道时，通过最大化摩擦圆，使轮胎动力学类似经验Pacejka轮胎模型，从而比人工演示高出12%。系统识别揭示了一个功能分岔，第一层压缩空间观测以提取角点更高分辨率的数字化轨迹特征，第二层编码非线性动力学。

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

VISOR：通过迭代搜索和超视距推理实现的智能视觉检索增强生成

Authors: Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09508
Pdf link: https://arxiv.org/pdf/2604.09508
Abstract Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
中文摘要 视觉检索增强生成（VRAG）赋能视觉语言模型检索并推理视觉丰富的文档。为了应对需要多步推理的复杂查询，智能VRAG系统将推理与迭代检索交错。然而，现有的智能VRAG面临两个关键瓶颈。（1）视觉证据稀疏：关键证据分散在各页中，但被单独处理，阻碍跨页推理;此外，细粒度图像内证据通常需要精确的视觉动作，而这些误用会降低检索质量;（2）长视野中的搜索漂移：检索页面中视觉标记的累积稀释了上下文，导致认知过载，导致代理偏离其搜索目标。为应对这些挑战，我们提出了VISOR（通过迭代搜索和视野推理实现可视化检索增强生成），这是一个统一的单代理框架。VISOR采用结构化证据空间，支持渐进式跨页推理，并结合可视化行动评估与纠正机制管理视觉动作。此外，我们还引入了带有滑动窗口和意图注入的动态轨迹，以减轻搜索漂移。它们锚定证据空间，同时摒弃早期的原始互动，防止视觉符号淹没上下文。我们使用基于组相对策略优化的强化学习（基于GRPO的强化学习）流水线训练VISOR，并配备状态掩蔽和信用分配，专门用于动态上下文重建。在ViDoSeek、SlideVQA和MMLongBench上的广泛实验表明，VISOR在远景视觉推理任务中以卓越的效率实现了最先进的性能。
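
The sliding window with intent injection described in the VISOR entry above can be sketched as a context builder that always keeps the search intent and the accumulated evidence but only the most recent raw interactions. The function `build_context`, the string prefixes, and the default window size are hypothetical and chosen only for illustration.

```python
from typing import List


def build_context(intent: str, evidence: List[str], interactions: List[str],
                  window: int = 2) -> List[str]:
    """Assemble the next-step context: intent, evidence, then recent turns only."""
    recent = interactions[-window:] if window > 0 else []
    header = [f"INTENT: {intent}"]  # re-injected every step to limit drift
    anchored = [f"EVIDENCE: {item}" for item in evidence]
    return header + anchored + recent
```

Older raw interactions fall out of the window while their distilled evidence stays, so the context length is bounded by the evidence size plus the window rather than by the full search history.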

RIRF: Reasoning Image Restoration Framework

RIRF：推理图像修复框架

Authors: Wending Yan, Rongkai Zhang, Kaihua Tang, Yu Cheng, Qiankun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.09511
Pdf link: https://arxiv.org/pdf/2604.09511
Abstract Universal image restoration (UIR) aims to recover clean images from diverse and unknown degradations using a unified model. Existing UIR methods primarily focus on pixel reconstruction and often lack explicit diagnostic reasoning over degradation composition, severity, and scene semantics prior to restoration. We propose Reason and Restore (R&R), a novel framework that integrates structured Chain-of-Thought (CoT) reasoning into the image restoration pipeline. R&R introduces an explicit reasoner, implemented by fine-tuning Qwen3-VL, to diagnose degradation types, quantify degradation severity, infer key degradation-related factors, and describe relevant scene and object semantics. The resulting structured reasoning provides interpretable and fine-grained diagnostic priors for the restorer. To further improve restoration quality, the quantified degradation severity produced by the reasoner is leveraged as reinforcement learning (RL) signals to guide and strengthen the restorer. Unlike existing multimodal LLM-based agentic systems that decouple reasoning from low-level vision tasks, R&R tightly couples semantic diagnostic reasoning with pixel-level restoration in a unified framework. Extensive experiments across diverse UIR benchmarks demonstrate that R&R achieves state-of-the-art performance while offering unique interpretability into the restoration process.
中文摘要 通用图像恢复（UIR）旨在通过统一模型从各种未知的劣化中恢复干净的图像。现有的UIR方法主要关注像素重建，且通常缺乏对退化组成、严重程度和场景语义的明确诊断推理。我们提出了Reason and Restore（R&R），这是一个将结构化思维链（CoT）推理整合进图像修复流程的新框架。R&R 引入了通过微调 Qwen3-VL 实现的显式推理器，用于诊断退化类型、量化退化严重程度、推断关键退化相关因素，并描述相关的场景和对象语义。由此产生的结构化推理为修复器提供了可解释且细致的诊断先验。为了进一步提升恢复质量，推理器产生的量化退化严重度被用作强化学习（RL）信号，指导和强化修复器。与现有多模态基于LLM的代理系统（将推理与低层视觉任务解耦）不同，R&R将语义诊断推理与像素级恢复紧密结合，形成统一框架。跨多个UIR基准的广泛实验表明，R&R实现了最先进的性能，同时为修复过程提供了独特的解释性。

Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

NetForge_RL中异步多代理网络防御的事件驱动时图网络

Authors: Igor Jankowski
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.09523
Pdf link: https://arxiv.org/pdf/2604.09523
Abstract The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135, a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the "scorched earth" failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.
中文摘要 多智能体强化学习（MARL）策略从模拟网络战争游戏向运营安全作战中心（SOC）的转变，根本上被Sim2Real的差距所阻碍。传统模拟器抽象了网络协议物理，依赖同步刻，并提供干净的状态矢量，而非真实且噪声较大的遥测。为解决这些局限性，我们引入了NetForge_RL：一款高保真网络操作模拟器，将网络防御重新构想为异步、连续时间的部分可观测半马尔可夫决策过程（POSMDP）。NetForge 执行零信任网络访问（ZTNA）约束，并要求防御者处理 NLP 编码的 SIEM 遥测数据。关键是，NetForge 通过双模引擎原生弥合了 Sim2Real 的差距，允许在模拟虚拟机管理程序中进行高通量 MARL 训练，并在 Docker 虚拟机监控程序中对实时漏洞进行零样本评估。为了应对这种连续时间POSMDP，我们提出了连续时间图MARL（CT-GMARL），利用固定步神经常微分方程（ODE）处理不规则采样的警报。我们以离散基线（R-MAPPO、QMIX）为基础进行评估。实证结果显示，CT-GMARL的收敛中位蓝色奖励为57,135——比R-MAPPO提升2.0倍，比QMIX提升2.1倍。关键是，CT-GMARL通过避免“焦土”失效模式——通过破坏网络效用来最小化风险——恢复了比最强基线多12倍的被攻破服务。在零机会传输到实时Docker环境时，CT-GMARL策略的中位数奖励为98,026，验证了Sim2Real桥。
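
The NetForge_RL abstract mentions fixed-step ODE integration over irregularly sampled alerts. A minimal Python sketch of that general idea follows. It is not CT-GMARL: the scalar state, the linear decay dynamics `dh/dt = -rate * h` (a stand-in for a learned network), the additive jump at each alert, and the function names `evolve` and `encode_alerts` are all assumptions made for illustration.

```python
def evolve(h, gap, step=0.1, rate=1.0):
    """Advance state h across a time gap with fixed-size Euler steps.

    Assumed dynamics: dh/dt = -rate * h. A final shorter step covers any remainder.
    """
    remaining = gap
    while remaining > 1e-12:
        dt = min(step, remaining)
        h = h + dt * (-rate * h)   # one explicit Euler update
        remaining -= dt
    return h


def encode_alerts(events, step=0.1, rate=1.0):
    """Process (timestamp, value) alerts that arrive at irregular times.

    Between alerts the state follows the ODE; at each alert the value is
    added to the state (a hypothetical jump update).
    """
    h = 0.0
    last_t = events[0][0]
    states = []
    for t, value in events:
        h = evolve(h, t - last_t, step=step, rate=rate)
        h += value
        states.append(h)
        last_t = t
    return states
```

Calling `encode_alerts([(0.0, 1.0), (0.5, 1.0), (2.0, 1.0)])` integrates across gaps of 0.5 and 1.5 time units with the same step size, so a longer silence between alerts produces more decay without requiring synchronous ticks.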

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

VL校准：大型视觉语言模型的解耦置信校准推理

Authors: Wenyi Xiao, Xinchi Xu, Leilei Gan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.09529
Pdf link: https://arxiv.org/pdf/2604.09529
Abstract Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
中文摘要 大型视觉语言模型（LVLM）具备强大的多模态推理能力，但经常表现出高确定度的幻觉和错误反应，这阻碍了其在高风险领域的应用。现有的口头置信度校准方法，主要为纯文本LLM开发，通常通过二元答案水平的正确性优化单一整体置信度评分。这种设计与LVLMs不匹配：错误预测可能源于感知失误或在正确感知下推理错误，单一置信度混淆了这些来源，而视觉不确定性往往由语言先验主导。为解决这些问题，我们提出了VL-Calibration，一种明确将信心与视觉和推理信心解耦的强化学习框架。为了在没有真实感知标签的情况下监督视觉置信，我们引入了一种内在视觉确定性估计，该估计结合了（i）通过图像扰动下的KL发散测量的视觉基础性和（ii）通过令牌熵测量的内部确定性。我们进一步提出代币层面优势重权重，以聚焦基于视觉确定性的代币优化，抑制无根据的幻觉，同时保持有效感知。对十三个基准测试的实验表明，VL校准有效提升校准效果，同时提升视觉推理准确性，并且能够推广到模型尺度和架构的分布外基准。
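
The two ingredients named in the VL-Calibration abstract, KL-divergence under image perturbations and token entropy, can be sketched in a few lines of Python. The abstract does not say how the two are combined, so the product used in `visual_certainty`, the `1 - exp(-KL)` squashing, and all function names are assumptions for illustration only, not the authors' estimator.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of p divided by log(len(p)), giving a value in [0, 1]."""
    h = -sum(pi * math.log(pi + eps) for pi in p if pi > 0)
    return h / math.log(len(p))


def visual_certainty(p_clean, p_perturbed):
    """Hypothetical per-token score built from the two signals.

    grounding: how much the token distribution changes when the image is
               perturbed (large KL suggests the token depends on the image).
    certainty: one minus the normalized entropy of the clean distribution.
    """
    grounding = 1.0 - math.exp(-kl_divergence(p_clean, p_perturbed))
    certainty = 1.0 - normalized_entropy(p_clean)
    return grounding * certainty
```

Under this sketch, a token whose distribution is identical with and without the perturbation scores zero, which matches the intuition that it is driven by language priors rather than by the image.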

Keyword: diffusion policy

No results found.