Arxiv Papers of Today

Generated: 2026-03-30 17:14:58 (UTC+8); Arxiv release time: 2026-03-30 20:00 EDT (2026-03-31 08:00 UTC+8)

There are 19 relevant papers today

Keyword: reinforcement learning

Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

Authors: Mutong Liu, Yang Liu, Jiming Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2603.25771
Pdf link: https://arxiv.org/pdf/2603.25771
Abstract Reinforcement learning (RL), owing to its adaptability to various dynamic systems in many real-world scenarios and the capability of maximizing long-term outcomes under different constraints, has been used in infectious disease control to optimize the intervention strategies for controlling infectious disease spread and responding to outbreaks in recent years. The potential of RL for assisting public health sectors in preventing and controlling infectious diseases is gradually emerging and being explored by rapidly increasing publications relevant to COVID-19 and other infectious diseases. However, few surveys exclusively discuss this topic, that is, the development and application of RL approaches for optimizing strategies of non-pharmaceutical and pharmaceutical interventions of public health. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.

Chasing Autonomy: Dynamic Retargeting and Control Guided RL for Performant and Controllable Humanoid Running

Authors: Zachary Olkin, William D. Compton, Ryan M. Bena, Aaron D. Ames
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.25902
Pdf link: https://arxiv.org/pdf/2603.25902
Abstract Humanoid robots have the promise of locomoting like humans, including fast and dynamic running. Recently, reinforcement learning (RL) controllers that can mimic human motions have become popular as they can generate very dynamic behaviors, but they are often restricted to single motion play-back which hinders their deployment in long duration and autonomous locomotion. In this paper, we present a pipeline to dynamically retarget human motions through an optimization routine with hard constraints to generate improved periodic reference libraries from a single human demonstration. We then study the effect of both the reference motion and the reward structure on the reference and commanded velocity tracking, concluding that a goal-conditioned and control-guided reward which tracks dynamically optimized human data results in the best performance. We deploy the policy on hardware, demonstrating its speed and endurance by achieving running speeds of up to 3.3 m/s on a Unitree G1 robot and traversing hundreds of meters in real-world environments. Additionally, to demonstrate the controllability of the locomotion, we use the controller in a full perception and planning autonomy stack for obstacle avoidance while running outdoors.

Reinforcing Structured Chain-of-Thought for Video Understanding

Authors: Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang, Haibin Ling, Oleksandr Obiednikov, Ning Zhou, Kah Kuen Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.25942
Pdf link: https://arxiv.org/pdf/2603.25942
Abstract Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
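
For readers unfamiliar with GRPO, which the SDRL abstract above builds on, the following is a minimal sketch of the group-relative advantage usually associated with it: each sampled response's reward is standardized against the other responses in its group. The function name, the `eps` constant, the use of the population standard deviation, and the binary rewards are illustrative assumptions. The sketch does not include SDRL's CVK or DVR terms, which the abstract does not specify in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses to one question: 1.0 = correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, so no learned value function is needed to form a baseline.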

Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

Authors: Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.25968
Pdf link: https://arxiv.org/pdf/2603.25968
Abstract Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: this https URL.

AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building-Grid Co-Simulation

Authors: Borui Zhang, Nariman Mahdavi, Subbu Sethuvenkatraman, Shuang Ao, Flora Salim
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26005
Pdf link: https://arxiv.org/pdf/2603.26005
Abstract The growing availability of building operational data motivates the use of reinforcement learning (RL), which can learn control policies directly from data and cope with the complexity and uncertainty of large-scale building clusters. However, most existing simulation environments prioritize building-side performance metrics and lack systematic evaluation of grid-level impacts, while their experimental workflows still rely heavily on manual configuration and substantial programming expertise. Therefore, this paper proposes AutoB2G, an automated building-grid co-simulation framework that completes the entire simulation workflow solely based on natural-language task descriptions. The framework extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. As LLMs lack prior knowledge of the implementation context of simulation functions, a codebase covering simulation configurations and functional modules is constructed and organized as a directed acyclic graph (DAG) to explicitly represent module dependencies and execution order, guiding the LLM to retrieve a complete executable path. Experimental results demonstrate that AutoB2G can effectively enable automated simulator implementations, coordinating B2G interactions to improve grid-side performance metrics.
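
The AutoB2G abstract describes organizing simulation modules as a directed acyclic graph so that dependencies and execution order are explicit. A minimal sketch of that idea, using Python's standard `graphlib`, is shown here. The module names are hypothetical and are not taken from the AutoB2G codebase or from CityLearn.

```python
from graphlib import TopologicalSorter

# Hypothetical module names; each key maps to the set of modules it depends on.
dependencies = {
    "load_config": set(),
    "build_environment": {"load_config"},
    "train_controller": {"build_environment"},
    "run_grid_simulation": {"build_environment", "train_controller"},
    "report_metrics": {"run_grid_simulation"},
}

# static_order() yields every module after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

An ordering like this is one plausible way to give an LLM a complete executable path; `TopologicalSorter` also raises `CycleError` if the declared dependencies are not acyclic.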

Designing Fatigue-Aware VR Interfaces via Biomechanical Models

Authors: Harshitha Voleti, Charalambos Poullis
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26031
Pdf link: https://arxiv.org/pdf/2603.26031
Abstract Prolonged mid-air interaction in virtual reality (VR) causes arm fatigue and discomfort, negatively affecting user experience. Incorporating ergonomic considerations into VR user interface (UI) design typically requires extensive human-in-the-loop evaluation. Although biomechanical models have been used to simulate human behavior in HCI tasks, their application as surrogate users for ergonomic VR UI design remains underexplored. We propose a hierarchical reinforcement learning framework that leverages biomechanical user models to evaluate and optimize VR interfaces for mid-air interaction. A motion agent is trained to perform button-press tasks in VR under sequential conditions, using realistic movement strategies and estimating muscle-level effort via a validated three-compartment control with recovery (3CC-r) fatigue model. The simulated fatigue output serves as feedback for a UI agent that optimizes UI element layout via reinforcement learning (RL) to minimize fatigue. We compare the RL-optimized layout against a manually-designed centered baseline and a Bayesian optimized baseline. Results show that fatigue trends from the biomechanical model align with human user data. Moreover, the RL-optimized layout using simulated fatigue feedback produced significantly lower perceived fatigue in a follow-up human study. We further demonstrate the framework's extensibility via a simulated case study on longer sequential tasks with non-uniform interaction frequencies. To our knowledge, this is the first work using simulated biomechanical muscle fatigue as a direct optimization signal for VR UI layout design. Our findings highlight the potential of biomechanical user models as effective surrogate tools for ergonomic VR interface design, enabling efficient early-stage iteration with less reliance on extensive human participation.

Hierarchical Control Framework Integrating LLMs with RL for Decarbonized HVAC Operation

Authors: Dianyu Zhong, Tian Xing, Kailai Sun, Xu Yang, Heye Huang, Irfan Qaisar, Tinggang Jia, Shaobo Wang, Qianchuan Zhao
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.26050
Pdf link: https://arxiv.org/pdf/2603.26050
Abstract Heating, ventilation, and air conditioning (HVAC) systems account for a substantial share of building energy consumption. Environmental uncertainty and dynamic occupancy behavior bring challenges in decarbonized HVAC control. Reinforcement learning (RL) can optimize long-horizon comfort-energy trade-offs but suffers from exponential action-space growth and inefficient exploration in multi-zone buildings. Large language models (LLMs) can encode semantic context and operational knowledge, yet when used alone they lack reliable closed-loop numerical optimization and may result in less reliable comfort-energy trade-offs. To address these limitations, we propose a hierarchical control framework in which a fine-tuned LLM, trained on historical building operation data, generates state-dependent feasible action masks that prune the combinatorial joint action space into operationally plausible subsets. A masked value-based RL agent then performs constrained optimization within this reduced space, improving exploration efficiency and training stability. Evaluated in a high-fidelity simulator calibrated with real-world sensor and occupancy data from a 7-zone office building, the proposed method achieves a mean PPD of 7.30%, corresponding to reductions of 39.1% relative to DQN, the best vanilla RL baseline in comfort, and 53.1% relative to the best vanilla LLM baseline, while reducing daily HVAC energy use to 140.90 kWh, lower than all vanilla RL baselines. The results suggest that LLM-guided action masking is a promising pathway toward efficient multi-zone HVAC control.
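
The HVAC abstract above pairs an action mask with a value-based agent. A minimal sketch of masked greedy action selection follows, assuming a flat list of Q-values and a boolean mask of the same length. The function name and the example numbers are hypothetical; in the paper the mask comes from a fine-tuned LLM, which is not modeled here.

```python
def masked_greedy_action(q_values, mask):
    """Return the index of the highest Q-value among actions the mask allows."""
    allowed = [i for i, ok in enumerate(mask) if ok]
    if not allowed:
        raise ValueError("mask must allow at least one action")
    return max(allowed, key=lambda i: q_values[i])


# Hypothetical Q-values for four joint actions; action 1 is masked out
# even though it has the highest value.
q_values = [0.2, 1.5, 0.9, -0.3]
mask = [True, False, True, True]
action = masked_greedy_action(q_values, mask)
```

Restricting the argmax to allowed indices means masked actions are never selected, which is how a mask can shrink the space the agent has to explore.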

Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

Authors: Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.26097
Pdf link: https://arxiv.org/pdf/2603.26097
Abstract Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.
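
To make the idea of discrete, variable-sized patch boundaries under a strictly enforced compression rate concrete, here is a minimal sketch. It uses a uniform random sampler as a stand-in for the learned ReinPatch policy, which the abstract does not specify. The function name and parameters are assumptions.

```python
import random


def sample_patches(seq_len, num_patches, rng):
    """Split range(seq_len) into exactly num_patches contiguous, non-empty patches."""
    if not 1 <= num_patches <= seq_len:
        raise ValueError("need 1 <= num_patches <= seq_len")
    # Choose num_patches - 1 distinct interior cut points (a discrete decision).
    cuts = sorted(rng.sample(range(1, seq_len), num_patches - 1))
    edges = [0] + cuts + [seq_len]
    # Each patch is a half-open interval [start, end).
    return list(zip(edges[:-1], edges[1:]))


# 12 time steps compressed into 4 patches: the rate 12 / 4 = 3 is fixed,
# while the individual patch lengths vary from sample to sample.
patches = sample_patches(seq_len=12, num_patches=4, rng=random.Random(0))
```

Because the number of cut points is fixed, every sample yields the same number of patches, so the downstream backbone always sees a sequence of known length.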

Rethinking Recommendation Paradigms: From Pipelines to Agentic Recommender Systems

Authors: Jinxin Hu, Hao Deng, Lingyu Mu, Hao Zhang, Shizhun Wang, Yu Zhang, Xiaoyi Zeng
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.26100
Pdf link: https://arxiv.org/pdf/2603.26100
Abstract Large-scale industrial recommenders typically use a fixed multi-stage pipeline (recall, ranking, re-ranking) and have progressed from collaborative filtering to deep and large pre-trained models. However, both multi-stage and so-called One Model designs remain essentially static: models are black boxes, and system improvement relies on manual hypotheses and engineering, which is hard to scale under heterogeneous data and multi-objective business constraints. We propose an Agentic Recommender System (AgenticRS) that reorganizes key modules as agents. Modules are promoted to agents only when they form a functionally closed loop, can be independently evaluated, and possess an evolvable decision space. For model agents, we outline two self-evolution mechanisms: reinforcement learning style optimization in well-defined action spaces, and large language model based generation and selection of new architectures and training schemes in open-ended design spaces. We further distinguish individual evolution of single agents from compositional evolution over how multiple agents are selected and connected, and use a layered inner and outer reward design to couple local optimization with global objectives. This provides a concise blueprint for turning static pipelines into self-evolving agentic recommender systems.

Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.26126
Pdf link: https://arxiv.org/pdf/2603.26126
Abstract Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

Knowledge Distillation for Efficient Transformer-Based Reinforcement Learning in Hardware-Constrained Energy Management Systems

Authors: Pascal Henrich, Jonas Sievers, Maximilian Beichter, Thomas Blank, Ralf Mikut, Veit Hagenmeyer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.26249
Pdf link: https://arxiv.org/pdf/2603.26249
Abstract Transformer-based reinforcement learning has emerged as a strong candidate for sequential control in residential energy management. In particular, the Decision Transformer can learn effective battery dispatch policies from historical data, thereby increasing photovoltaic self-consumption and reducing electricity costs. However, transformer models are typically too computationally demanding for deployment on resource-constrained residential controllers, where memory and latency constraints are critical. This paper investigates knowledge distillation to transfer the decision-making behaviour of high-capacity Decision Transformer policies to compact models that are more suitable for embedded deployment. Using the Ausgrid dataset, we train teacher models in an offline sequence-based Decision Transformer framework on heterogeneous multi-building data. We then distil smaller student models by matching the teachers' actions, thereby preserving control quality while reducing model size. Across a broad set of teacher-student configurations, distillation largely preserves control performance and even yields small improvements of up to 1%, while reducing the parameter count by up to 96%, the inference memory by up to 90%, and the inference time by up to 63%. Beyond these compression effects, comparable cost improvements are also observed when distilling into a student model of identical architectural capacity. Overall, our results show that knowledge distillation makes Decision Transformer control more applicable for residential energy management on resource-limited hardware.
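
The distillation step described above trains a student by matching the teacher's actions. One simple way to express that objective is a mean squared error between the two action sequences, sketched here for scalar actions. The loss choice, the function name, and the example values are assumptions; the abstract does not state which matching loss the authors use.

```python
def action_matching_loss(student_actions, teacher_actions):
    """Mean squared error between student and teacher actions on the same inputs."""
    if len(student_actions) != len(teacher_actions):
        raise ValueError("action sequences must have the same length")
    if not student_actions:
        raise ValueError("action sequences must not be empty")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_actions, teacher_actions)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical normalized battery dispatch actions for two time steps.
loss = action_matching_loss([0.5, -0.2], [0.5, 0.0])
```

Minimizing this quantity over the student's parameters pushes its dispatch decisions toward the teacher's without requiring the student to share the teacher's architecture.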

Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks

Authors: Shuyi Gao, Stavros Orfanoudakis, Shengren Hou, Peter Palensky, Pedro P. Vergara
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.26264
Pdf link: https://arxiv.org/pdf/2603.26264
Abstract Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time-varying conditions and possible topology changes. To support fast online decision making, we develop a topology-aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34-bus and 69-bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross-system transfer between networks with different system sizes. Results show that GNN-based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69-bus system and under reconfiguration; on the 69-bus system, TD3-GCN and TD3-TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case-dependent, and zero-shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: this https URL and this https URL.

Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Authors: Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.26365
Pdf link: https://arxiv.org/pdf/2603.26365
Abstract Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.
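
The SCORE abstract mentions inter-frame residuals as a surprise signal and a retention ratio for visual tokens. The sketch below illustrates those two ingredients with a hand-written rule: score each token position by its L1 change between consecutive frames, then keep the highest-scoring fraction. This fixed top-k rule is a stand-in, not the learned policy the paper proposes, and all names and values are hypothetical.

```python
def residual_surprise(prev_frame, curr_frame):
    """Per-token L1 distance between the same token position in consecutive frames."""
    return [
        sum(abs(c - p) for c, p in zip(curr_tok, prev_tok))
        for prev_tok, curr_tok in zip(prev_frame, curr_frame)
    ]


def keep_top_tokens(scores, retention_ratio):
    """Return sorted indices of the highest-scoring tokens, keeping at least one."""
    keep = max(1, round(len(scores) * retention_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Four hypothetical 2-dimensional token features per frame.
prev_frame = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
curr_frame = [[0.0, 0.0], [1.0, 4.0], [2.0, 2.0], [3.5, 3.0]]
scores = residual_surprise(prev_frame, curr_frame)
kept = keep_top_tokens(scores, retention_ratio=0.5)
```

Tokens that did not change between frames score zero and are dropped first, which is the intuition behind using residuals to capture motion saliency.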

120 Minutes and a Laptop: Minimalist Image-goal Navigation via Unsupervised Exploration and Offline RL

Authors: Xiaoming Liu, Borong Zhang, Qingbiao Li, Steven Morad
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.26441
Pdf link: https://arxiv.org/pdf/2603.26441
Abstract The prevailing paradigm for image-goal visual navigation often assumes access to large-scale datasets, substantial pretraining, and significant computational resources. In this work, we challenge this assumption. We show that we can collect a dataset, train an in-domain policy, and deploy it to the real world (1) in less than 120 minutes, (2) on a consumer laptop, (3) without any human intervention. Our method, MINav, formulates image-goal navigation as an offline goal-conditioned reinforcement learning problem, combining unsupervised data collection with hindsight goal relabeling and offline policy learning. Experiments in simulation and the real world show that MINav improves exploration efficiency, outperforms zero-shot navigation baselines in target environments, and scales favorably with dataset size. These results suggest that effective real-world robotic learning can be achieved with high computational efficiency, lowering the barrier to rapid policy prototyping and deployment.
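
Hindsight goal relabeling, which the MINav abstract combines with unsupervised data collection, can be sketched as follows: for each step of a recorded trajectory, a later observation from the same trajectory is treated as the goal. The data layout, the sparse reward rule, and the function name are assumptions for illustration and are not taken from the paper.

```python
import random


def hindsight_relabel(trajectory, rng):
    """Relabel each step with a goal drawn from a later observation.

    trajectory is a list of (observation, action) pairs; the last entry only
    supplies a final observation, so its action is ignored.
    """
    relabeled = []
    last = len(trajectory) - 1
    for t in range(last):
        obs, action = trajectory[t]
        goal_index = rng.randint(t + 1, last)  # inclusive on both ends
        goal_obs = trajectory[goal_index][0]
        next_obs = trajectory[t + 1][0]
        # Assumed sparse reward: 1 only when the next observation is the goal.
        reward = 1.0 if goal_index == t + 1 else 0.0
        relabeled.append((obs, action, goal_obs, next_obs, reward))
    return relabeled


# Hypothetical trajectory with image identifiers standing in for camera frames.
trajectory = [("img0", "forward"), ("img1", "left"), ("img2", "forward"), ("img3", None)]
relabeled = hindsight_relabel(trajectory, random.Random(0))
```

Every goal produced this way was actually reached in the data, so an offline learner gets goal-conditioned supervision without any human-specified targets.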

Automatic feature identification in least-squares policy iteration using the Koopman operator framework

Authors: Christian Mugisho Zagabe, Sebastian Petiz
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2603.26464
Pdf link: https://arxiv.org/pdf/2603.26464
Abstract In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm in reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic choice of features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two previous works, the classical least-squares policy iteration (LSPI) and the kernel-based least-squares policy iteration (KLSPI), using stochastic chain walk and inverted pendulum control problems as examples. Unlike previous works, no features or kernels need to be fixed a priori in our approach. Empirical results show the number of features learned by the KAE technique remains reasonable compared to those fixed in the classical LSPI algorithm. The convergence to an optimal or a near-optimal policy is also comparable to the other two methods.
中文摘要 本文提出了一种用于强化学习（RL）的基于Koopman自编码器的最小二乘策略迭代（KAE-LSPI）算法。KAE-LSPI 算法基于用扩展动态模式分解（EDMD）重新表述所谓的最小二乘不动点近似方法，从而能够通过 Koopman 自编码器（KAE）框架自动学习特征。该方法的动机在于，线性强化学习技术缺乏系统性的特征或核选择方式。我们以随机链式行走和倒立摆控制问题为例，将KAE-LSPI算法与两项已有工作进行比较：经典最小二乘策略迭代（LSPI）和基于核的最小二乘策略迭代（KLSPI）。与以往的工作不同，我们的方法无需事先固定任何特征或核。实证结果显示，与经典LSPI算法中固定的特征数量相比，KAE技术学习到的特征数量仍保持在合理范围内。收敛到最优或近似最优策略的表现也与另外两种方法相当。
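
For readers unfamiliar with the least-squares fixed-point approximation that the abstract builds on, here is a minimal sketch of classical LSPI with features fixed by hand, which is exactly the step KAE-LSPI aims to automate. It does not implement the Koopman autoencoder or EDMD. The two-state MDP, the one-hot features, and the helper names (`phi`, `solve`, `lstdq`, `lspi`) are assumptions chosen so the result can be checked by hand.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, int, float, int]  # (state, action, reward, next_state)
N_STATES, N_ACTIONS = 2, 2

# Toy MDP (assumed): action 0 stays, action 1 switches; landing in state 1 pays 1.
SAMPLES: List[Sample] = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]


def phi(state: int, action: int) -> List[float]:
    """Hand-fixed one-hot features over (state, action) pairs."""
    vec = [0.0] * (N_STATES * N_ACTIONS)
    vec[state * N_ACTIONS + action] = 1.0
    return vec


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lstdq(samples: List[Sample], policy: Dict[int, int], gamma: float) -> List[float]:
    """Solve A w = b with A = sum phi (phi - gamma * phi')^T and b = sum phi * r."""
    d = N_STATES * N_ACTIONS
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for s, a, r, s2 in samples:
        f, f2 = phi(s, a), phi(s2, policy[s2])
        for i in range(d):
            b[i] += f[i] * r
            for j in range(d):
                A[i][j] += f[i] * (f[j] - gamma * f2[j])
    return solve(A, b)


def lspi(samples: List[Sample], gamma: float, max_iter: int = 20):
    """Alternate LSTD-Q evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in range(N_STATES)}
    w = lstdq(samples, policy, gamma)
    for _ in range(max_iter):
        greedy = {
            s: max(range(N_ACTIONS), key=lambda a, s=s: w[s * N_ACTIONS + a])
            for s in range(N_STATES)
        }
        if greedy == policy:
            break
        policy = greedy
        w = lstdq(samples, policy, gamma)
    return policy, w


if __name__ == "__main__":
    print(lspi(SAMPLES, gamma=0.5))
```

With one-hot features the weight vector is simply the tabular Q-function of the current policy, so the toy problem converges to "switch in state 0, stay in state 1". In larger problems the quality of the solution depends on the choice of `phi`, which is the gap the paper's learned features are meant to address.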

CR-Eyes: A Computational Rational Model of Visual Sampling Behavior in Atari Games

CR-Eyes：Atari游戏中视觉采样行为的计算理性模型

Authors: Martin Lorenz, Niko Konzack, Alexander Lingler, Philipp Wintersberger, Patrick Ebel
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2603.26527
Pdf link: https://arxiv.org/pdf/2603.26527
Abstract Designing mobile and interactive technologies requires understanding how users sample dynamic environments to acquire information and make decisions under time pressure. However, existing computational user models either rely on hand-crafted task representations or are limited to static or non-interactive visual inputs, restricting their applicability to realistic, pixel-based environments. We present CR-Eyes, a computationally rational model that simulates visual sampling and gameplay behavior in Atari games. Trained via reinforcement learning, CR-Eyes operates under perceptual and cognitive constraints and jointly learns where to look and how to act in a time-sensitive setting. By explicitly closing the perception-action loop, the model treats eye movements as goal-directed actions rather than as isolated saliency predictions. Our evaluation shows strong alignment with human data in task performance and aggregate saliency patterns, while also revealing systematic differences in scanpaths. CR-Eyes is a step toward scalable, theory-grounded user models that support design and evaluation of interactive systems.
中文摘要 设计移动和交互技术需要理解用户如何对动态环境进行采样，以便在时间压力下获取信息并做出决策。然而，现有的计算用户模型要么依赖手工设计的任务表示，要么仅限于静态或非交互式的视觉输入，这限制了它们在真实的、基于像素的环境中的适用性。我们提出CR-Eyes，一种计算理性模型，用于模拟Atari游戏中的视觉采样和游戏行为。CR-Eyes 通过强化学习训练，在感知和认知约束下运行，并在时间敏感的环境中联合学习看向何处以及如何行动。通过显式地闭合感知-行动循环，该模型将眼动视为目标导向的动作，而非孤立的显著性预测。我们的评估显示，模型在任务表现和总体显著性模式上与人类数据高度一致，同时也揭示了扫描路径上的系统性差异。CR-Eyes 是迈向可扩展、有理论依据的用户模型的一步，这类模型可支持交互系统的设计与评估。

Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

思考轨迹：利用视频生成从蜂窝信令中重建GPS轨迹

Authors: Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang, Weifeng Lv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26610
Pdf link: https://arxiv.org/pdf/2603.26610
Abstract Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on the map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.
中文摘要 移动设备持续与蜂窝基站交互，产生海量的信令记录，为理解人类移动性提供了广泛的覆盖。然而，这类记录只能提供粗略的位置线索（例如服务小区标识符），因此难以直接用于需要高精度GPS轨迹的应用。本文研究Sig2GPS问题：从蜂窝信令中重建GPS轨迹。领域专家常把信令轨迹叠加在地图上并勾画出相应的GPS路线；受此启发，不同于依赖复杂多阶段工程流水线或直接回归坐标的传统方案，本文将Sig2GPS重新表述为直接在地图视觉域中运行的图像到视频生成任务：先在地图上渲染信令轨迹，再训练视频生成模型绘制连续的GPS路径。为支持这一范式，本文构建了配对的信令到轨迹视频数据集，用于微调开源视频模型，并引入了一种基于轨迹感知强化学习的优化方法，通过奖励提升生成保真度。在大规模真实世界数据集上的实验显示，该方法相较于强有力的工程化基线和基于学习的基线有显著提升，而下一GPS点预测的额外结果则表明其具有可扩展性和跨城市可迁移性。总体而言，这些结果表明，地图视觉视频生成能够在地图约束下直接生成并细化连续路径，为轨迹数据挖掘提供了实用的接口。

VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

VLA-OPD：通过同策略蒸馏为视觉-语言-动作模型桥接离线SFT与在线强化学习

Authors: Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, Haoang Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.26666
Pdf link: https://arxiv.org/pdf/2603.26666
Abstract Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
中文摘要 尽管预训练的视觉-语言-动作（VLA）模型在机器人操作方面表现出出色的泛化能力，但后训练对于确保部署期间的可靠性能依然至关重要。然而，标准的离线监督微调（SFT）存在分布偏移和对预训练能力的灾难性遗忘，而在线强化学习（RL）则面临奖励稀疏和样本效率低下的问题。本文提出了同策略VLA蒸馏（VLA-OPD），这是一个将SFT的效率与RL的鲁棒性相结合的框架。VLA-OPD不依赖稀疏的环境奖励，而是利用专家教师对学生自行生成的轨迹提供密集的token级监督。这使得模型能够在策略诱导的状态上主动纠错，同时通过温和的对齐保留预训练的通用能力。关键在于，我们通过反向KL（Reverse-KL）目标来构建VLA-OPD。标准的前向KL（Forward-KL）会引发模式覆盖式的熵爆炸，Hard-CE会导致过早的熵坍缩；与二者不同，我们有界的模式寻求目标通过滤除教师的认知不确定性来确保策略学习的稳定，同时保持动作多样性。在LIBERO和RoboTwin2.0基准上的实验表明，VLA-OPD相比RL显著提升了样本效率，相比SFT显著提升了鲁棒性，同时有效缓解了后训练期间的灾难性遗忘。
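
The contrast the abstract draws between Forward-KL (mode-covering) and Reverse-KL (mode-seeking) can be seen on a single token distribution. The sketch below is only a numerical illustration of that general property. It does not reproduce the paper's bounded objective, and the four-way action distributions `teacher`, `peaked`, and `uniform` are made-up numbers.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def forward_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl(teacher, student)


def reverse_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl(student, teacher)


# Hypothetical bimodal teacher over four action tokens.
teacher = [0.49, 0.01, 0.01, 0.49]
# Two candidate students: one commits to a single mode, one spreads out.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

if __name__ == "__main__":
    for name, student in (("peaked", peaked), ("uniform", uniform)):
        print(
            f"{name:8s} forward={forward_kl(teacher, student):.3f} "
            f"reverse={reverse_kl(teacher, student):.3f}"
        )
```

Under the forward direction the spread-out student scores better, because it covers both teacher modes. Under the reverse direction the peaked student scores better, because it places almost no mass on actions the teacher considers unlikely. In on-policy distillation the expectation is taken over tokens the student itself generated, which is why the reverse direction is the natural one to evaluate there.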

Keyword: diffusion policy

DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion

DiffusionAnything：面向统一导航与预抓取运动的端到端上下文扩散学习

Authors: Iana Zhura, Yara Mahmoud, Jeffrin Sam, Hung Khang Nguyen, Didar Seyidov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.26322
Pdf link: https://arxiv.org/pdf/2603.26322
Abstract Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input but demand massive computational resources and extensive training data, and fail zero-shot in novel scenes. We present a unified image-space diffusion policy handling both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, with only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models and depth sensors. Operating purely from RGB input (2.0 GB memory, 10 Hz), the model achieves robust zero-shot generalization to novel scenes while remaining suitable for onboard deployment.
中文摘要 直接从视觉高效预测运动规划仍是机器人领域的根本挑战，规划通常需要明确的目标指定和针对任务的设计。最新的视觉-语言-动作（VLA）模型直接从视觉输入推断动作，但需要大量计算资源和大量训练数据，并且在新场景中无法零样本泛化。我们提出了一种统一的图像空间扩散策略，通过多尺度特征调制同时处理米级导航和厘米级操作，每个任务仅需5分钟的自监督数据。该框架由三项关键创新驱动：（1）基于任务模式、深度尺度和空间注意力的多尺度FiLM条件化，使单一模型能够表现出与任务相适应的行为；（2）轨迹对齐的深度预测将度量三维推理集中在生成的航点沿线；（3）来自AnyTraverse的自监督注意力使得无需视觉语言模型和深度传感器即可进行目标导向的推理。该模型仅依靠RGB输入运行（2.0 GB内存，10 Hz），能够对新场景实现鲁棒的零样本泛化，同时仍适合机载部署。
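
FiLM conditioning, mentioned in innovation (1), scales and shifts each feature channel using parameters predicted from a conditioning input. The sketch below shows that mechanism for one feature vector and a one-hot task mode only. The weights are hand-picked rather than learned, the helper names `linear` and `film` are hypothetical, and the paper's multi-scale setup with depth scale and spatial attention is not reproduced.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Sequence[float], cond: Sequence[float],
         w_gamma: Sequence[Sequence[float]], b_gamma: Sequence[float],
         w_beta: Sequence[Sequence[float]], b_beta: Sequence[float]) -> List[float]:
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale
    beta = linear(w_beta, b_beta, cond)     # per-channel shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Illustrative parameters for 2 feature channels and a 2-way task mode.
W_GAMMA = [[2.0, 0.5], [1.0, 1.0]]
B_GAMMA = [0.0, 0.0]
W_BETA = [[0.0, 1.0], [0.0, -1.0]]
B_BETA = [0.0, 0.0]

NAVIGATION = [1.0, 0.0]    # assumed one-hot encoding of the task mode
MANIPULATION = [0.0, 1.0]

if __name__ == "__main__":
    feats = [1.0, 3.0]
    print(film(feats, NAVIGATION, W_GAMMA, B_GAMMA, W_BETA, B_BETA))
    print(film(feats, MANIPULATION, W_GAMMA, B_GAMMA, W_BETA, B_BETA))
```

The same backbone features produce different activations per task mode, which is the general idea behind letting one model serve both navigation and manipulation.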