Arxiv Papers of Today

Generated: 2026-03-31 17:05:37 (UTC+8); Arxiv release time: 2026-03-31 20:00 EDT (2026-04-01 08:00 UTC+8)

68 relevant papers today

Keyword: reinforcement learning

Learning Energy-Efficient Air-Ground Actuation for Hybrid Robots on Stair-Like Terrain

Authors: Jiaxing Li, Wen Tian, Xinhang Xu, Junbin Yuan, Sebastian Scherer, Muqing Cao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26687
Pdf link: https://arxiv.org/pdf/2603.26687
Abstract Hybrid aerial-ground robots offer both traversability and endurance, but stair-like discontinuities create a trade-off: wheels alone often stall at edges, while flight is energy-hungry for small height gains. We propose an energy-aware reinforcement learning framework that trains a single continuous policy to coordinate propellers, wheels, and tilt servos without predefined aerial and ground modes. We train policies from proprioception and a local height scan in Isaac Lab with parallel environments, using hardware-calibrated thrust/power models so the reward penalizes true electrical energy. The learned policy discovers thrust-assisted driving that blends aerial thrust and ground traction. In simulation it achieves about 4 times lower energy than propeller-only control. We transfer the policy to a DoubleBee prototype on an 8 cm gap-climbing task; it achieves 38% lower average power than a rule-based decoupled controller. These results show that efficient hybrid actuation can emerge from learning and deploy on hardware.
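Note: the energy-aware reward described in this abstract could look roughly like the sketch below, which subtracts an electrical-energy term from a progress term. The function names, the power-model coefficients (k_prop, p_idle), and the reward weights are hypothetical placeholders, not values from the paper, which uses hardware-calibrated models.

```python
def electrical_power(thrusts_n, wheel_torques_nm, wheel_speeds_rad_s,
                     k_prop=12.0, p_idle=2.0):
    """Hypothetical electrical power model, in watts.

    Propeller power is approximated as k_prop * thrust**1.5 (the ideal
    momentum-theory scaling); wheel power is |torque * angular speed|.
    All coefficients are placeholders, not calibrated values.
    """
    prop = sum(k_prop * max(t, 0.0) ** 1.5 for t in thrusts_n)
    wheel = sum(abs(tq * w) for tq, w in zip(wheel_torques_nm, wheel_speeds_rad_s))
    return p_idle + prop + wheel


def step_reward(progress_m, power_w, dt_s, w_progress=1.0, w_energy=0.01):
    """Weighted forward progress minus weighted energy spent in this step."""
    energy_j = power_w * dt_s
    return w_progress * progress_m - w_energy * energy_j
```

With a reward of this shape, a policy that reaches the same height with less propeller thrust scores higher, which is the kind of pressure under which thrust-assisted driving could emerge.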

Physicochemical-Neural Fusion for Semi-Closed-Circuit Respiratory Autonomy in Extreme Environments

Authors: Phillip Kingston, Nicholas Johnston
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26697
Pdf link: https://arxiv.org/pdf/2603.26697
Abstract This paper introduces Galactic Bioware's Life Support System, a semi-closed-circuit breathing apparatus designed for integration into a positive-pressure firefighting suit and governed by an AI control system. The breathing loop incorporates a soda lime CO2 scrubber, a silica gel dehumidifier, and pure O2 replenishment with finite consumables. One-way exhaust valves maintain positive pressure while creating a semi-closed system in which outward venting gradually depletes the gas inventory. Part I develops the physicochemical foundations from first principles, including state-consistent thermochemistry, stoichiometric capacity limits, adsorption isotherms, and oxygen-management constraints arising from both fire safety and toxicity. Part II introduces an AI control architecture that fuses three sensor tiers, external environmental sensing, internal suit atmosphere sensing (with triple-redundant O2 cells and median voting), and firefighter biometrics. The controller combines receding-horizon model-predictive control (MPC) with a learned metabolic model and a reinforcement learning (RL) policy advisor, with all candidate actuator commands passing through a final control-barrier-function safety filter before reaching the hardware. This architecture is intended to optimize performance under unknown mission duration and exertion profiles. In this paper we introduce an 18-state, 3-control nonlinear state-space formulation using only sensors viable in structural firefighting, with triple-redundant O2 sensing and median voting. Finally, we introduce an MPC framework with a dynamic resource scarcity multiplier, an RL policy advisor for warm-starting, and a final control-barrier-function safety filter through which all actuator commands must pass, demonstrating 18-34% endurance improvement in simulation over PID baselines while maintaining tighter physiological and fire-safety margins.
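Note: the triple-redundant O2 sensing with median voting mentioned in this abstract can be illustrated with a minimal sketch. The function name and the max_spread threshold are assumptions made for illustration; the paper's actual fault-detection logic is not given in the abstract.

```python
import statistics


def vote_o2(readings, max_spread=0.02):
    """Median-vote three O2 fraction readings.

    Returns (voted_value, suspect_indices). A cell is flagged as suspect
    when it deviates from the median by more than max_spread, which is a
    hypothetical threshold chosen only for this example.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three O2 cell readings")
    voted = statistics.median(readings)
    suspects = [i for i, r in enumerate(readings) if abs(r - voted) > max_spread]
    return voted, suspects
```

For example, vote_o2([0.21, 0.22, 0.80]) returns the middle reading 0.22 and flags cell 2, so a single failed cell cannot drag the voted value away from the two cells that agree.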

SutureAgent: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

Authors: Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26720
Pdf link: https://arxiv.org/pdf/2603.26720
Abstract Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureAgent, a goal-conditioned offline reinforcement learning framework that leverages sparse annotations to dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureAgent encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureAgent reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.
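Note: the action space described in this abstract (a discrete direction plus a continuous magnitude, applied step by step in pixel space) can be sketched as follows. The choice of 8 evenly spaced directions and the function names are assumptions for illustration; the abstract does not state how many directions the paper uses.

```python
import math

# Hypothetical discretization: 8 directions spaced 45 degrees apart.
NUM_DIRECTIONS = 8


def apply_action(pos, direction_idx, magnitude_px):
    """Move a pixel-space position by one (direction, magnitude) action."""
    angle = 2.0 * math.pi * direction_idx / NUM_DIRECTIONS
    x, y = pos
    return (x + magnitude_px * math.cos(angle),
            y + magnitude_px * math.sin(angle))


def rollout(start, actions):
    """Autoregressively unroll a list of actions into a waypoint path."""
    path = [start]
    for direction_idx, magnitude_px in actions:
        path.append(apply_action(path[-1], direction_idx, magnitude_px))
    return path
```

Because each waypoint is produced from the previous one, consecutive predictions stay connected, which is the continuity property the sequential formulation is meant to capture.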

Evolutionary Warm-Starts for Reinforcement Learning in Industrial Continuous Control

Authors: Tom Maus, Stephan Frank, Tobias Glasmachers
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.26750
Pdf link: https://arxiv.org/pdf/2603.26750
Abstract Reinforcement learning (RL) is still rarely applied in industrial control, partly due to the difficulty of training reliable agents for real-world conditions. This work investigates how evolution strategies can support RL in such settings by introducing a continuous-control adaptation of an industrial sorting benchmark. The CMA-ES algorithm is used to generate high-quality demonstrations that warm-start RL agents. Results show that CMA-ES-guided initialization significantly improves stability and performance. Furthermore, the demonstration trajectories generated with the CMA-ES provide a strong oracle reference performance level, which is of interest in its own right. The study delivers a focused proof of concept for hybrid evolutionary-RL approaches and a basis for future, more complex industrial applications.

Bitboard version of Tetris AI

Authors: Xingguo Chen, Pingshou Xiong, Zhenyu Luo, Mengfei Hu, Xinwen Li, Yongzhou Lü, Guang Yang, Chao Li, Shangdong Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26765
Pdf link: https://arxiv.org/pdf/2603.26765
Abstract The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game board and tetrominoes using bitboard representations, leveraging bitwise operations to accelerate core processes (e.g., collision detection, line clearing, and Dellacherie-Thiery Features extraction) and achieve a 53-fold speedup compared to OpenAI Gym-Tetris. Second, we introduce an afterstate-evaluating actor network that simplifies state value estimation by leveraging Tetris afterstate property, outperforming traditional action-value networks with fewer parameters. Third, we propose a buffer-optimized Proximal Policy Optimization (PPO) algorithm that balances sampling and update efficiency, achieving an average score of 3,829 on 10x10 grids within 3 minutes. Additionally, we develop a Python-Java interface compliant with the OpenAI Gym standard, enabling seamless integration with modern RL frameworks. Experimental results demonstrate that our framework enhances Tetris's utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research.
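Note: the bitboard idea in this abstract (collision detection and line clearing via bitwise operations) can be sketched in a few lines. This is a minimal illustration that stores each board row as an integer bitmask on a 10-column grid; the layout and function names are assumptions, not the paper's implementation, which is written in Java.

```python
WIDTH = 10
FULL_ROW = (1 << WIDTH) - 1  # all 10 column bits set


def collides(board_rows, piece_rows, top):
    """True if the piece overlaps filled cells or extends below the floor.

    board_rows[0] is the top row. piece_rows are row bitmasks that have
    already been shifted to the piece's horizontal position.
    """
    for i, piece_row in enumerate(piece_rows):
        r = top + i
        if r >= len(board_rows) or (board_rows[r] & piece_row):
            return True
    return False


def place(board_rows, piece_rows, top):
    """Return a new board with the piece OR-ed into place."""
    rows = list(board_rows)
    for i, piece_row in enumerate(piece_rows):
        rows[top + i] |= piece_row
    return rows


def clear_lines(board_rows):
    """Drop full rows and pad with empty rows on top; returns (board, cleared)."""
    kept = [row for row in board_rows if row != FULL_ROW]
    cleared = len(board_rows) - len(kept)
    return [0] * cleared + kept, cleared
```

Each row test is a single AND or equality check on an integer rather than a loop over cells, which is where the speedup of a bitboard representation comes from.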

LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models

Authors: Shaik Aman
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.26771
Pdf link: https://arxiv.org/pdf/2603.26771
Abstract Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model's hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model's learned representations.
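Note: the dependency-ordered scheduler described in this abstract can be sketched as a sort over predicted roles. The abstract gives the order premises, connectives, derived steps, conclusions; placing filler tokens last and breaking ties by model confidence are assumptions made for this example.

```python
# Role order from the abstract; the position of "filler" is an assumption.
ROLE_PRIORITY = {
    "premise": 0,
    "connective": 1,
    "derived": 2,
    "conclusion": 3,
    "filler": 4,
}


def unmask_order(roles, confidences):
    """Return masked positions in the order they would be unmasked.

    roles[i] is the predicted logical role of masked position i and
    confidences[i] is the base model's confidence there. Ties within a
    role go to the more confident position (a hypothetical tie-break).
    """
    positions = range(len(roles))
    return sorted(positions,
                  key=lambda i: (ROLE_PRIORITY[roles[i]], -confidences[i]))
```

Under plain confidence-based unmasking, the low-confidence connective in a sequence would be deferred to the end; here it is unmasked right after the premises regardless of its confidence.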

Learning to Select Visual In-Context Demonstrations

Authors: Eugene Lee, Yu-Chi Lin, Jiajie Diao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.26775
Pdf link: https://arxiv.org/pdf/2603.26775
Abstract Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
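Note: the kNN baseline that this abstract argues against can be sketched as a top-k search by cosine similarity. The embedding vectors and function names here are illustrative assumptions; the learned LSD agent itself is not reproduced.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_select(query_emb, pool_embs, k):
    """Indices of the k pool items most similar to the query."""
    ranked = sorted(range(len(pool_embs)),
                    key=lambda i: cosine_similarity(query_emb, pool_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Since selection depends only on similarity to the query, near-duplicate examples can fill all k slots, which is the redundancy the abstract identifies as a weakness on factual regression tasks.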

PiCSRL: Physics-Informed Contextual Spectral Reinforcement Learning

Authors: Mitra Nasr Azadani, Syed Usama Imtiaz, Nasrin Alamdari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26816
Pdf link: https://arxiv.org/pdf/2603.26816
Abstract High-dimensional low-sample-size (HDLSS) datasets constrain reliable environmental model development, where labeled data remain sparse. Reinforcement learning (RL)-based adaptive sensing methods can learn optimal sampling policies, yet their application is severely limited in HDLSS contexts. In this work, we present PiCSRL (Physics-Informed Contextual Spectral Reinforcement Learning), where embeddings are designed using domain knowledge and parsed directly into the RL state representation for improved adaptive sensing. We developed an uncertainty-aware belief model that encodes physics-informed features to improve prediction. As a representative example, we evaluated our approach on a cyanobacterial gene concentration adaptive sampling task using NASA PACE hyperspectral imagery over Lake Erie. PiCSRL achieves optimal station selection (RMSE = 0.153, 98.4% bloom detection rate), outperforming the random (0.296) and UCB (0.178) RMSE baselines. Our ablation experiments demonstrate that physics-informed features improve test generalization (0.52 R^2, +0.11 over raw bands) in semi-supervised learning. In addition, our scalability test shows that PiCSRL scales effectively to large networks (50 stations, >2M combinations) with significant improvements over baselines (p = 0.002). We posit PiCSRL as a sample-efficient adaptive sensing method across Earth observation domains for improved observation-to-target mapping.

Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

Authors: Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.26846
Pdf link: https://arxiv.org/pdf/2603.26846
Abstract As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.
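Note: one possible reading of the stability-asymmetry idea in this abstract is sketched below. The abstract does not define the stability measure, so this example assumes each stability score is the mean similarity between the unperturbed output and its perturbed counterparts, and that the regularizer subtracts the positive part of the asymmetry from the task reward. All names, the similarity inputs, and the weight lam are hypothetical.

```python
def stability(similarities):
    """Mean similarity between an original output and its perturbed versions.

    Each similarity is assumed to lie in [0, 1]; how it is computed
    (for example, embedding similarity or exact match) is left open.
    """
    return sum(similarities) / len(similarities)


def stability_asymmetry(cot_similarities, response_similarities):
    """Internal (CoT) stability minus external (response) stability."""
    return stability(cot_similarities) - stability(response_similarities)


def regularized_reward(task_reward, asymmetry, lam=0.5):
    """Penalize only cases where reasoning is more stable than the response."""
    return task_reward - lam * max(asymmetry, 0.0)
```

Under these assumptions, a model whose reasoning barely changes under perturbation while its final answers swing widely gets a large positive asymmetry and therefore a reduced reward.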

Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching

Authors: Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27044
Pdf link: https://arxiv.org/pdf/2603.27044
Abstract Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.

Dynamic resource matching in manufacturing using deep reinforcement learning

Authors: Saunak Kumar Panda, Yisha Xiang, Ruiqi Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.27066
Pdf link: https://arxiv.org/pdf/2603.27066
Abstract Matching plays an important role in the logical allocation of resources across a wide range of industries. The benefits of matching have been increasingly recognized in manufacturing industries. In particular, capacity sharing has received much attention recently. In this paper, we consider the problem of dynamically matching demand-capacity types of manufacturing resources. We formulate the multi-period, many-to-many manufacturing resource-matching problem as a sequential decision process. The formulated manufacturing resource-matching problem involves large state and action spaces, and it is not practical to accurately model the joint distribution of various types of demands. To address the curse of dimensionality and the difficulty of explicitly modeling the transition dynamics, we use a model-free deep reinforcement learning approach to find optimal matching policies. Moreover, to tackle the issue of infeasible actions and slow convergence due to initial biased estimates caused by the maximum operator in Q-learning, we introduce two penalties to the traditional Q-learning algorithm: a domain knowledge-based penalty based on a prior policy and an infeasibility penalty that conforms to the demand-supply constraints. We establish theoretical results on the convergence of our domain knowledge-informed Q-learning providing performance guarantee for small-size problems. For large-size problems, we further inject our modified approach into the deep deterministic policy gradient (DDPG) algorithm, which we refer to as domain knowledge-informed DDPG (DKDDPG). In our computational study, including small- and large-scale experiments, DKDDPG consistently outperformed traditional DDPG and other RL algorithms, yielding higher rewards and demonstrating greater efficiency in time and episodes.
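Note: the two penalties added to Q-learning in this abstract (one for deviating from a prior policy, one for infeasible actions) can be sketched as reward shaping inside a standard tabular update. The exact form of the paper's penalties is not given in the abstract, so the fixed penalty sizes, the function name, and the dictionary-based Q-table below are assumptions.

```python
def penalized_q_update(q, state, action, reward, next_state, next_actions,
                       prior_action, feasible,
                       alpha=0.1, gamma=0.95,
                       prior_penalty=1.0, infeasible_penalty=10.0):
    """One tabular Q-learning step with two hypothetical penalty terms.

    q maps (state, action) pairs to values and is updated in place.
    prior_action is what the domain-knowledge prior policy would choose
    in this state; feasible says whether the action satisfied the
    demand-supply constraints.
    """
    shaped = reward
    if action != prior_action:
        shaped -= prior_penalty        # domain-knowledge penalty
    if not feasible:
        shaped -= infeasible_penalty   # infeasibility penalty
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (shaped + gamma * best_next - old)
    return q[(state, action)]
```

Early in training, when the max over poorly initialized Q-values is unreliable, penalties of this kind steer updates toward actions that the prior policy and the constraints already favor.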

Semantic Interaction Information mediates compositional generalization in latent space

Authors: John Schwarcz
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27134
Pdf link: https://arxiv.org/pdf/2603.27134
Abstract Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents.

Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision

Authors: Yizhou Jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu, Yunhong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.27179
Pdf link: https://arxiv.org/pdf/2603.27179
Abstract Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available via the link on the arXiv abstract page.

Incentivizing Temporal-Awareness in Egocentric Video Understanding Models

Authors: Zhiyang Xu, Tian Qin, Bowen Jin, Zhengfeng Lai, Meng Cao, Lifu Huang, Peng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.27184
Pdf link: https://arxiv.org/pdf/2603.27184
Abstract Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
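Note: the ordered-versus-shuffled contrast described in this abstract might be turned into a reward along the lines of the sketch below. The abstract does not give the formula, so this example assumes the raw signal is the score gap between the two conditions and that "globally normalized" means a z-score over the whole batch. The function name and the eps constant are hypothetical.

```python
import statistics


def temporal_rewards(ordered_scores, shuffled_scores, eps=1e-8):
    """Batch-normalized reward for relying on frame order.

    ordered_scores[i] and shuffled_scores[i] are correctness scores for
    sample i when the model sees the frames in order versus shuffled.
    A large positive gap suggests the answer depended on temporal order.
    """
    gaps = [o - s for o, s in zip(ordered_scores, shuffled_scores)]
    mean = statistics.fmean(gaps)
    std = statistics.pstdev(gaps)
    return [(g - mean) / (std + eps) for g in gaps]
```

A sample answered correctly with or without frame order gets no credit under this scheme, which would discourage frame-level spatial shortcuts.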

Autonomous overtaking trajectory optimization using reinforcement learning and opponent pose estimation

利用强化学习和对抗姿态估计实现自主超车轨迹优化

Authors: Matej Rene Cihlar, Luka Šiktar, Branimir Ćaran, Marko Švaco
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.27207
Pdf link: https://arxiv.org/pdf/2603.27207
Abstract Vehicle overtaking is one of the most complex driving maneuvers for autonomous vehicles. To achieve optimal autonomous overtaking, driving systems rely on multiple sensors that enable safe trajectory optimization and overtaking efficiency. This paper presents a reinforcement learning mechanism for multi-agent autonomous racing environments, enabling overtaking trajectory optimization, based on LiDAR and depth image data. The developed reinforcement learning agent uses pre-generated raceline data and sensor inputs to compute the steering angle and linear velocity for optimal overtaking. The system uses LiDAR with a 2D detection algorithm and a depth camera with YOLO-based object detection to identify the vehicle to be overtaken and its pose. The LiDAR and the depth camera detection data are fused using a UKF for improved opponent pose estimation and trajectory optimization for overtaking in racing scenarios. The results show that the proposed algorithm successfully performs overtaking maneuvers in both simulation and real-world experiments, with pose estimation RMSE of (0.0816, 0.0531) m in (x, y).
中文摘要 车辆超车是自动驾驶车辆中最复杂的驾驶操作之一。为了实现最佳的自动超车，驾驶系统依赖多个传感器，实现安全轨迹优化和超车效率。本文提出了一种基于LiDAR和深度图像数据的多智能体自主竞速环境强化学习机制，实现超车轨迹优化。开发的强化学习代理利用预生成的线路数据和传感器输入计算转向角度和线速度，实现最佳超车。该系统利用激光雷达（LiDAR）配合二维检测算法和基于YOLO的物体检测深度相机，识别被超越车辆及其姿态。激光雷达和深度相机检测数据通过UKF融合，提升对手姿态估计和轨迹优化，以实现赛车场景中的超车。结果显示，所提出的算法在模拟和现实实验中都能成功完成超车动作，姿态的RMSE估计为（0.0816， 0.0531） m，在（x， y）中。

Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning

重新思考从简单到困难：演绎推理后培训课程学习的局限性

Authors: Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.27226
Pdf link: https://arxiv.org/pdf/2603.27226
Abstract Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.
中文摘要 课程学习（CL）源于直觉，即按难度递增的顺序学习有助于泛化，通常被广泛应用于大型语言模型（LLM）的训练前后。CL的直觉对于组合推理尤为有说服力，因为复杂问题是基于初等推理规则构建的;然而，CL对此类任务的实际影响仍然鲜有深入探讨。我们提出了一项系统性实证研究，针对大型语言模型后期训练的逻辑分析，利用合成算术和逻辑基准，难度主要由推理复杂度而非表面代理指标来决定。令人惊讶的是，在多个模型家庭和课程安排中，我们发现基于难度的测序在准确率和响应长度上均无明显优势。这些发现在监督式微调（SFT）和强化学习（RL）方法中均存在。我们的研究表明，在演绎推理的背景下，训练示例的具体排序在实现组合泛化中作用微乎其微，挑战了基于课程的培训后培训的实用性。
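
The two sequencing schemes compared in this abstract can be illustrated with a minimal sketch. It assumes each training example carries a scalar difficulty (here a hypothetical count of inference steps); the function names and toy data are illustrative and not taken from the paper.

```python
import random


def easy_to_hard(examples, difficulty):
    """Curriculum order: sort by ascending difficulty (stable for ties)."""
    return sorted(examples, key=difficulty)


def random_order(examples, seed=0):
    """Baseline: a uniformly random permutation of the same examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def steps(example):
    """Hypothetical difficulty measure: number of inference steps."""
    return example[1]


# Hypothetical toy data: (problem_id, number_of_inference_steps)
examples = [("p1", 5), ("p2", 1), ("p3", 3), ("p4", 2), ("p5", 4)]

curriculum = easy_to_hard(examples, steps)
baseline = random_order(examples, seed=0)
```

Both orderings contain exactly the same examples; only the sequence differs, which is the variable the study isolates.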

Where-to-Learn: Analytical Policy Gradient Directed Exploration for On-Policy Robotic Reinforcement Learning

学习渠道：分析政策梯度导向探索，用于政策内机器人强化学习

Authors: Leixin Chang, Xinchen Yao, Ben Liu, Liangjing Yang, Hua Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.27317
Pdf link: https://arxiv.org/pdf/2603.27317
Abstract On-policy reinforcement learning (RL) algorithms have demonstrated great potential in robotic control, where effective exploration is crucial for efficient and high-quality policy learning. However, how to encourage the agent to explore better trajectories efficiently remains a challenge. Most existing methods incentivize exploration by maximizing the policy entropy or by encouraging visits to novel states, regardless of the potential state value. We propose a new form of directed exploration that uses analytical policy gradients from a differentiable dynamics model to inject task-aware, physics-guided guidance, thereby steering the agent towards high-reward regions for accelerated and more effective policy learning.
中文摘要 策略强化学习（RL）算法在机器人控制领域展现出巨大潜力，而有效的探索对于高效且高质量的策略学习至关重要。然而，如何有效地鼓励代理探索更好的发展路径仍是一个挑战。大多数现有方法通过最大化政策熵或鼓励新颖的国家访问来激励探索，无论潜在的状态价值如何。我们提出了一种新型的定向探索方式，利用可微动力学模型中的分析策略梯度注入任务感知、物理导向的指导，从而引导智能体朝向高回报区域，从而加速且更有效的策略学习。

D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation

D-SPEAR：双流优先体验自适应回放，用于稳定强化学习机器人操作

Authors: Yu Zhang, Karl Mason
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27346
Pdf link: https://arxiv.org/pdf/2603.27346
Abstract Robotic manipulation remains challenging for reinforcement learning due to contact-rich dynamics, long horizons, and training instability. Although off-policy actor-critic algorithms such as SAC and TD3 perform well in simulation, they often suffer from policy oscillations and performance collapse in realistic settings, partly due to experience replay strategies that ignore the differing data requirements of the actor and the critic. We propose D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay, a replay framework that decouples actor and critic sampling while maintaining a shared replay buffer. The critic leverages prioritized replay for efficient value learning, whereas the actor is updated using low-error transitions to stabilize policy optimization. An adaptive anchor mechanism balances uniform and prioritized sampling based on the coefficient of variation of TD errors, and a Huber-based critic objective further improves robustness under heterogeneous reward scales. We evaluate D-SPEAR on challenging robotic manipulation tasks from the robosuite benchmark, including Block-Lifting and Door-Opening. Results demonstrate that D-SPEAR consistently outperforms strong off-policy baselines, including SAC, TD3, and DDPG, in both final performance and training stability, with ablation studies confirming the complementary roles of the actor-side and critic-side replay streams.
中文摘要 由于接触丰富的动态、长视野和训练不稳定性，机器人操作在强化学习方面依然充满挑战。尽管像SAC和TD3这样的非策略演员-批评算法在模拟中表现良好，但在现实环境中常常出现策略振荡和性能崩溃，部分原因是经验重放策略忽视了演员和批评者不同的数据需求。我们提出了D-SPEAR：双流优先体验自适应回放框架，该框架在保持共享回放缓冲区的同时，将演员和评论家采样解耦。批评者利用优先重放实现高效的价值学习，而参与者则通过低误差转换来稳定策略优化。自适应锚点机制基于TD误差的变异系数平衡均匀且优先抽样，基于Huber的批判目标进一步提升异质奖励尺度下的鲁棒性。我们评估了D-SPEAR在机器人套件基准测试中具有挑战性的机器人操作任务，包括方块搬运和门开。结果显示，D-SPEAR在最终表现和训练稳定性方面持续优于强的非策略基线，包括SAC、TD3和DDPG，消融研究证实了演员侧和批评方回放流的互补作用。
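
The abstract names three ingredients: the coefficient of variation of TD errors, a blend of uniform and prioritized sampling, and a Huber critic loss. The sketch below shows one way they could fit together. The mapping from the coefficient of variation to a mixing weight (`lam = cv / (1 + cv)`) and the values of `alpha`, `eps`, and `kappa` are assumptions made for illustration, not the paper's anchor mechanism.

```python
import math


def coefficient_of_variation(td_errors):
    """Std / mean of absolute TD errors; 0.0 when all errors are zero."""
    mags = [abs(e) for e in td_errors]
    mean = sum(mags) / len(mags)
    if mean == 0.0:
        return 0.0
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return math.sqrt(var) / mean


def mixed_sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    """Blend uniform and priority-proportional sampling probabilities."""
    n = len(td_errors)
    cv = coefficient_of_variation(td_errors)
    lam = cv / (1.0 + cv)  # hypothetical squashing of cv into [0, 1)
    prios = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(prios)
    return [(1.0 - lam) / n + lam * p / total for p in prios]


def huber(td_error, kappa=1.0):
    """Huber loss: quadratic near zero, linear beyond kappa."""
    a = abs(td_error)
    return 0.5 * a * a if a <= kappa else kappa * (a - 0.5 * kappa)


probs_equal = mixed_sampling_probs([1.0, 1.0, 1.0, 1.0])
probs_skewed = mixed_sampling_probs([0.1, 0.1, 0.1, 5.0])
```

With identical TD errors the coefficient of variation is zero and sampling falls back to uniform; as the errors spread out, more probability mass shifts to high-error transitions.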

DRASTIC: A Dynamic Resource Allocation Framework over 6G Network Slicing in Task-aware Closed-Loop Tactile Internet Applications

DRASTIC：任务感知闭环触觉互联网应用中基于6G网络切片的动态资源分配框架

Authors: Narges Golmohammadi, Madan Mohan Rayguru, Sabur Baidya
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.27364
Pdf link: https://arxiv.org/pdf/2603.27364
Abstract This work proposes a novel learning-driven bandwidth optimization framework called DRASTIC (Dynamic Resource Allocation for Slicing in Task-aware Closed-loop Tactile Internet applications). The proposed framework dynamically allocates resources among network slices supporting both enhanced Mobile Broadband (eMBB) and high reliable low latency communication (HRLLC) users. The algorithm ensures queue stability and meets delay targets with high probability under Markov-modulated Poisson traffic, exploiting a Lyapunov-guided advantage actor-critic reinforcement learning technique. The proposed network model includes an open-loop eMBB queue whose arrivals and departures are mainly driven by throughput demand, as well as a closed-loop HRLLC queue that captures feedback and task execution effects. A task-execution-dependent dexterity index adjusts the effective arrival rate, creating a feedback-aware interaction between the network and the task. A probabilistic delay constraint is incorporated into the objective via Lagrangian relaxation, yielding a min-max optimization framework that enforces latency guarantees while maximizing throughput for both types of users. Simulation results demonstrate that the proposed framework meets diverse Quality of Service (QoS) requirements, maintains queue stability under dynamic wireless and robotic task variation conditions, and outperforms other approaches.
中文摘要 本研究提出了一种新颖的学习驱动带宽优化框架，称为DRASTIC（任务感知闭环触觉互联网应用中的动态资源分配切片）。该框架动态分配资源于支持增强型移动宽带（eMBB）和高可靠低延迟通信（HRLLC）用户的网络片。该算法利用李雅普诺夫引导演员批评强化学习技术，在马尔可夫调制泊松流量下保证队列稳定性并高概率满足延迟目标。所提网络模型包括一个开环eMBB队列，其到达和离开主要由吞吐量需求驱动，以及一个闭环HRLLC队列，捕捉反馈和任务执行效果。依赖任务执行的敏捷度指数调整有效到达率，在网络与任务之间建立反馈感知的交互。通过拉格朗日松弛，目标中加入了概率延迟约束，形成了一个min_max优化框架，既能保证延迟，又能最大化两类用户的吞吐量。模拟结果表明，所提框架满足多样化的服务质量（QoS）要求，在动态无线和机器人任务变化条件下保持队列稳定性，并且优于其他方法。

Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models

大型视觉语言模型中，通过可验证奖励桥接视觉表征与强化学习

Authors: Yuhang Han, Yuyang Wu, Zhengbo Jiao, Yiyu Wang, Xuyang Liu, Shaobo Wang, Hanlin Xu, Xuming Hu, Linfeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.27375
Pdf link: https://arxiv.org/pdf/2603.27375
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (this https URL)
中文摘要 可验证奖励强化学习（RLVR）大幅提升了大型语言模型在抽象推理任务中的推理能力。然而，其在大型视觉语言模型（LVLM）中的应用仍受结构性表示瓶颈的限制。现有方法通常缺乏显式建模和视觉信息的有效利用，阻碍了视觉表现与强化学习优化过程紧密结合，从而限制了多模态推理性能的进一步提升。为解决这一限制，我们提出了KAWHI（关键区域对齐加权和谐激励），这是一种即插即用的奖励加权机制，明确将结构化可视化信息纳入统一的奖励政策优化方法（如GRPO和GSPO）。该方法通过层级几何聚合自适应地定位语义显著区域，通过结构化归因识别视觉批判性注意力头，并执行段落级的署名重新分配，使空间视觉证据与语义决定性推理步骤对齐。对多种推理基准的大量实证评估证明KAWHI作为通用增强模块，持续提升了多种统一奖励优化方法的性能。项目页面：KAWHI（此 https URL）

Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring

通过基于预测的违规评分诊断强化学习中的非马尔可夫观察

Authors: Naveen Mysore
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.27389
Pdf link: https://arxiv.org/pdf/2603.27389
Abstract Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction-based scoring method that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is provided at this https URL.
中文摘要 强化学习算法假设观测满足马尔可夫性，但现实中的传感器常常通过相关的噪声、延迟或部分可观测性来违反这一假设。标准性能指标将马尔可夫分解与其他次优因素混为一谈，导致从业者缺乏此类违规的诊断工具。本文介绍了一种基于预测的评分方法，用于量化观测轨迹中的非马尔可夫结构。随机森林首先去除非线性马尔可夫兼容的动力学;脊回归随后检验历史观测是否将残差的预测误差降低到当前观测之外。所得分数有界于[0， 1]，无需因果图构造。评估涵盖六个环境（CartPole、Pendulum、Acrobot、HalfCheetah、Hopper、Walker2d）、三种算法（PPO、A2C、SAC）、六个强度级别的受控AR（1）噪声，以及每种条件10个种子。在事后检测中，16对环境-算法对（主要是高维运动任务）中有7对在噪声强度与违规评分之间表现出显著的正单调性（Spearman rho最高可达0.78，经重复测量分析确认）;在训练时间噪声下，16对中有13对表现出统计学显著的奖励退化。在低维环境中，随机森林吸收噪声信号，导致分数下降，真实违规次数增加，这种失败模式被详细分析。一个实用的实用实验证明，所提出的分数正确识别了部分可观测性，并指导了架构选择，完全恢复了非马尔可夫观测所失去的性能。所有结果的复现源代码在此 https 网址。
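
The "controlled AR(1) noise" in this evaluation can be pictured with a small sketch. It assumes the standard first-order autoregressive recursion e_t = rho * e_{t-1} + w_t with Gaussian w_t; the parameter names `rho` and `sigma` and the helper `corrupt` are illustrative, and the paper's random-forest and ridge-regression scoring pipeline is not reproduced here.

```python
import random


def ar1_noise(n, rho, sigma=1.0, seed=0):
    """Generate n samples of AR(1) noise: e_t = rho * e_{t-1} + w_t."""
    rng = random.Random(seed)
    noise, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, sigma)
        noise.append(prev)
    return noise


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


def corrupt(observations, rho, sigma, seed=0):
    """Add temporally correlated noise to a scalar observation sequence."""
    noise = ar1_noise(len(observations), rho, sigma, seed)
    return [o + e for o, e in zip(observations, noise)]


correlated = ar1_noise(5000, rho=0.9)
white = ar1_noise(5000, rho=0.0)
```

When `rho` is positive, the noise at step t depends on the noise at step t-1, so past observations carry information about the current noise that the current observation alone does not. That residual dependence on history is the kind of structure a violation score is meant to detect.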

Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning

彩虹-演示RL：结合演示增强强化学习的改进

Authors: Dwait Bhatt, Shih-Chieh Chou, Nikolay Atanasov
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27400
Pdf link: https://arxiv.org/pdf/2603.27400
Abstract Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize RL objectives, or offline policy and value functions can first be learned from the data and then used for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is unclear which method has the most impact on sample efficiency, whether these approaches can be combined, and if there are cumulative benefits. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to isolate the contribution of each strategy and determine effective hybrid combinations for sample-efficient online RL. Our analysis reveals that directly reusing offline data and initializing with behavior cloning consistently outperform more complex offline RL pretraining methods for improving online sample efficiency.
中文摘要 已有多种方法被提出，通过利用离线收集的演示来提高在线强化学习（RL）的样本效率。离线数据可以直接作为优化强化学习目标的过渡，或者先从数据中学习离线策略和值函数，然后用于在线微调或提供参考操作。尽管这些策略都取得了令人信服的结果，但目前尚不清楚哪种方法对样本效率的影响最大，这些方法是否可以结合使用，以及它们是否会带来累积效益。我们将现有的演示增强强化学习方法分为三类，并对其优势、劣势和组合进行了广泛的实证研究，以分离每种策略的贡献，并确定有效的混合组合以实现样本高效的在线强化学习。我们的分析显示，直接重用离线数据并初始化行为克隆，在提升在线样本效率方面，持续优于更复杂的离线强化学习预训练方法。

Agent-Driven Autonomous Reinforcement Learning Research: Iterative Policy Improvement for Quadruped Locomotion

代理驱动自主强化学习研究：四足行走的迭代策略改进

Authors: Nimesh Khandelwal, Shakti S. Gupta
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27416
Pdf link: https://arxiv.org/pdf/2603.27416
Abstract This paper documents a case study in agent-driven autonomous reinforcement learning research for quadruped locomotion. The setting was not a fully self-starting research system. A human provided high-level directives through an agentic coding environment, while an agent carried out most of the execution loop: reading code, diagnosing failures, editing reward and terrain configurations, launching and monitoring jobs, analyzing intermediate metrics, and proposing the next wave of experiments. Across more than 70 experiments organized into fourteen waves on a DHAV1 12-DoF quadruped in Isaac Lab, the agent progressed from early rough-terrain runs with mean reward around 7 to a best logged Wave 12 run, exp063, with velocity error 0.263 and 97% timeout over 2000 iterations, independently reproduced five times across different GPUs. The archive also records several concrete autonomous research decisions: isolating PhysX deadlocks to terrain sets containing boxes and stair-like primitives, porting four reward terms from openly available reference implementations \cite{deeprobotics, rlsar}, correcting Isaac Sim import and bootstrapping issues, reducing environment count for diagnosis, terminating hung runs, and pivoting effort away from HIM after repeated terrain=0.0 outcomes. Relative to the AutoResearch paradigm \cite{autoresearch}, this case study operates in a more failure-prone robotics RL setting with multi-GPU experiment management and simulator-specific engineering constraints. The contribution is empirical and documentary: it shows that an agent can materially execute the iterative RL research loop in this domain with limited human intervention, while also making clear where human direction still shaped the agenda.
中文摘要 本文记录了一个关于四足行走的智能体驱动自主强化学习研究案例研究。这个设定并不是一个完全自启动的研究系统。人类通过代理编码环境提供高层指令，而代理则承担大部分执行循环：阅读代码、诊断故障、编辑奖励和地形配置、启动和监控作业、分析中间指标，并提出下一波实验方案。在Isaac实验室的DHAV1 12深度四足实验中，进行了70多次实验，分为十四波，该代理从早期崎岖地形运行（平均奖励约7波）发展到第12波最佳记录运行exp063，速度误差为0.263，超时率为97%，在2000次迭代中独立复制了五次不同GPU。档案还记录了若干具体的自主研究决策：将PhysX死锁隔离到包含箱子和阶梯状图元的地形集，从公开参考实现\cite{deeprobotics， rlsar}移植四个奖励词，纠正Isaac Sim导入和自备问题，减少诊断环境计数，终止挂机运行，以及在重复地形=0.0结果后将工作从HIM转移。相较于AutoResearch范式\cite{autoresearch}，本案例研究运行在一个更易失败的机器人强化学习环境中，采用多GPU实验管理和模拟器特定的工程约束。该贡献具有实证性和文献性：它表明代理可以在有限的人类干预下实质性执行该领域的反复强化学习研究循环，同时明确了人类方向在哪些方面仍影响着议程。

FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

FlowRL：带有扩散策略的强化学习分类法与模块化框架

Authors: Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27450
Pdf link: https://arxiv.org/pdf/2603.27450
Abstract Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT-compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at this https URL.
中文摘要 凭借其卓越的灵活性，扩散模型和流动模型已成为政策代表的有力候选者。然而，由于缺乏对普通策略梯度估计器的显式对数概率，对这些策略进行高效的强化学习（RL）仍是一个挑战。尽管已有许多尝试来解决这个问题，但该领域缺乏统一视角来调和这些看似无关的方法，从而阻碍了持续的发展。本文通过引入具有扩散/流策略的强化学习算法的全面分类法，弥补了这一空白。为了支持可重复性和敏捷原型开发，我们引入了一个模块化、基于JAX的开源代码库，利用JIT编译实现高通量训练。最后，我们提供了涵盖Gym-Locomotion、DeepMind Control Suite和IsaacLab的系统化标准基准测试，严谨地并排对比基于扩散的方法，并为从业者提供基于应用选择合适算法的指导。我们的工作为理解和算法设计奠定了清晰基础，为未来该领域研究提供了高效工具包，并为生成模型和机器人从业者提供了算法指南。我们的代码可在此 https URL 访问。

Driving Condition-Aware Multi-Agent Integrated Power and Thermal Management for Hybrid Electric Vehicles

混合动力电动汽车的驾驶状态感知多智能体集成动力与热管理

Authors: Hanghang Cui, Arash Khalatbarisoltani, Jie Han, Wenxue Liu, Muhammad Saeed, Xiaosong Hu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.27471
Pdf link: https://arxiv.org/pdf/2603.27471
Abstract Effective co-optimization of energy management strategy (EMS) and thermal management (TM) is crucial for optimizing fuel efficiency in hybrid electric vehicles (HEVs). Driving conditions significantly influence the performance of both EMS and TM in HEVs. This study presents a novel driving condition-aware integrated thermal and energy management (ITEM) framework. In this context, after analyzing and segmenting driving data into micro-trips, two primary features (average speed and maximum acceleration) are measured. Using the K-means approach, the micro-trips are clustered into three main groups. Finally, a deep neural network is employed to develop a real-time driving recognition model. An ITEM is then developed based on multi-agent deep reinforcement learning (DRL), leveraging the proposed real-time driving recognition model. The primary objectives are to improve the fuel economy and reduce TM power consumption while maintaining a pleasant cabin temperature for passengers. Our simulation results illustrate the effectiveness of the suggested framework and the positive impact of recognizing driving conditions on ITEM, improving fuel economy by 16.14% and reducing TM power consumption by 8.22% compared to the benchmark strategy.
中文摘要 有效的能源管理策略（EMS）和热管理（TM）协同优化对于优化混合动力电动汽车（HEVs）的燃油效率至关重要。驾驶条件显著影响HEV中EMS和TM的性能。本研究提出了一种新型驾驶状态感知综合热能与能源管理（ITEM）框架。在此背景下，分析并细分驾驶数据为微行程后，测量了两个主要特征（平均速度和最大加速度）。采用K-均值方法，微旅行被分为三大类。最后，采用深度神经网络开发实时驾驶识别模型。然后基于多智能体深度强化学习（DRL）开发ITEM，利用所提出的实时驾驶识别模型。主要目标是提高燃油经济性，减少TM动力消耗，同时保持乘客舒适的车内温度。我们的模拟结果展示了建议框架的有效性以及识别驾驶状况对ITEM的积极影响，使燃油经济性提升16.14%，并使TM耗电量降低了8.22%，相较基准策略。
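
The clustering step in this abstract (two features per micro-trip, K-means with three groups) can be sketched in plain Python. The feature definitions below treat a micro-trip as a list of speed samples at a fixed interval `dt`, which is an assumption, as are the farthest-point initialization, the toy feature values, and the absence of feature scaling.

```python
def micro_trip_features(speeds, dt=1.0):
    """Return (average speed, maximum acceleration) for one micro-trip."""
    avg_speed = sum(speeds) / len(speeds)
    max_accel = max((b - a) / dt for a, b in zip(speeds, speeds[1:]))
    return (avg_speed, max_accel)


def dist2(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dims = range(len(members[0]))
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in dims))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical features: (average speed in m/s, maximum acceleration in m/s^2)
trips = [(2.0, 0.5), (2.5, 0.6), (3.0, 0.4),
         (12.0, 1.5), (13.0, 1.4), (12.5, 1.6),
         (28.0, 0.8), (29.0, 0.9), (27.5, 0.7)]
centroids, labels = kmeans(trips, k=3)
example_features = micro_trip_features([0.0, 2.0, 4.0, 4.0])
```

In practice the two features would likely be standardized before clustering, since speed and acceleration are on different scales. The resulting cluster labels could then serve as training targets for a recognition model such as the neural network the abstract mentions.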

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

学习聚焦与精确裁剪：一个针对多层次学习者信息缺口和基础损失的强化学习框架

Authors: Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27494
Pdf link: https://arxiv.org/pdf/2603.27494
Abstract To enhance the perception and reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image-cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: this https URL.
中文摘要 为了增强复杂视觉场景中多模态大语言模型的感知和推理能力，近期研究引入了基于主体的工作流。在这些工作中，MLLM自主利用图像裁剪工具分析感兴趣区域以进行问答。尽管现有的训练策略，如采用监督式微调和强化学习，取得了显著进展，但我们的实证分析揭示了一个关键局限。我们展示了模型对全球输入的强烈依赖，以及对裁剪区域内细节的较弱依赖。为解决这一问题，我们提出了一种新的两阶段强化学习框架，无需轨迹监督。第一阶段，我们通过调整全局图像的粒度来引入“信息差距”机制。该机制通过聚焦裁剪的关键区域来训练模型回答问题，这些区域提供的信息获得由裁剪驱动。第二阶段通过引入接地损耗，使用少量包围框注释，进一步提升裁剪精度。实验表明，我们的方法显著提升了模型对裁剪区域的关注，使其在高分辨率视觉问答基准测试中实现了最先进的性能。我们的方法为感知和推理MLM中的细粒度细节提供了更高效的方法。代码可在：此 https URL 获取。

Match or Replay: Self Imitating Proximal Policy Optimization

匹配或回放：自我模仿的近端策略优化

Authors: Gaurav Chaudhary, Laxmidhar Behera, Washim Uddin Mondal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27515
Pdf link: https://arxiv.org/pdf/2603.27515
Abstract Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards. Traditional exploration strategies can lead to slow learning and suboptimal performance because agents fail to systematically build on previously successful experiences, thereby reducing sample efficiency. To tackle this issue, we propose a self-imitating on-policy algorithm that enhances exploration and sample efficiency by leveraging past high-reward state-action pairs to guide policy updates. Our method incorporates self-imitation by using optimal transport distance in dense reward environments to prioritize state visitation distributions that match the most rewarding trajectory. In sparse-reward environments, we uniformly replay successful self-encountered trajectories to facilitate structured exploration. Experimental results across diverse environments demonstrate substantial improvements in learning efficiency, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards. Our approach achieves faster convergence and significantly higher success rates compared to state-of-the-art self-imitating RL baselines. These findings underscore the potential of self-imitation as a robust strategy for enhancing exploration in RL, with applicability to more complex tasks.
中文摘要 强化学习（RL）智能体常常在探索效率低下中遇到困难，尤其是在奖励稀疏的环境中。传统的探索策略可能导致学习缓慢和表现不佳，因为代理未能系统地建立在之前成功的经验基础上，从而降低样本效率。为解决这一问题，我们提出了一种自我模仿的策略上算法，通过利用过去高回报的状态-动作对来指导策略更新，提升探索和样本效率。我们的方法通过在高奖励环境中使用最优运输距离，结合自我模仿，优先选择与最有利奖励轨迹相匹配的状态访问分布。在奖励稀疏的环境中，我们一致地重演成功的自我遭遇轨迹，以促进结构化探索。在不同环境中的实验结果显示学习效率显著提升，包括MuJoCo用于高密度奖励，以及部分可观察的3D Animal-AI Olympics和多目标PointMaze用于稀疏奖励。我们的方法相比最先进的自我模拟强化学习基线，实现了更快的收敛速度和显著更高的成功率。这些发现强调了自我模仿作为增强强化学习探索的有力策略的潜力，并适用于更复杂的任务。

Secure Reinforcement Learning: On Model-Free Detection of Man in the Middle Attacks

安全强化学习：关于无模型检测中间人攻击

Authors: Rishi Rani, Massimo Franceschetti
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27592
Pdf link: https://arxiv.org/pdf/2603.27592
Abstract We consider the problem of learning-based man-in-the-middle (MITM) attacks in cyber-physical systems (CPS), and extend our previously proposed Bellman Deviation Detection (BDD) framework for model-free reinforcement learning (RL). We refine the standard MDP attack model by allowing the reward function to depend on both the current and subsequent states, thereby capturing reward variations induced by errors in the adversary's transition estimate. We also derive an optimal system-identification strategy for the adversary that minimizes detectable value deviations. Further, we prove that the agent's asymptotic learning time required to secure the system scales linearly with the adversary's learning time, and that this matches the optimal lower bound. Hence, the proposed detection scheme is order-optimal in detection efficiency. Finally, we extend the framework to asynchronous and intermittent attack scenarios, where reliable detection is preserved.
中文摘要 我们考察了基于学习的中间人攻击（MITM）在网络物理系统（CPS）中的问题，并扩展了我们之前提出的贝尔曼偏差检测（BDD）框架，用于无模型强化学习（RL）。我们通过允许奖励函数依赖当前和后续状态，进一步优化了标准MDP攻击模型，从而捕捉了由对手转移估计误差引起的奖励变化。我们还为对手推导出了一种最优的系统识别策略，以最小化可检测的值偏差。此外，我们证明了智能体为保护系统所需的渐近学习时间与对手的学习时间呈线性增长，且这符合最优下界。因此，所提出的检测方案在检测效率上具有阶数最优。最后，我们将框架扩展到异步和间歇性攻击场景，保持了可靠的检测能力。

DSevolve: Enabling Real-Time Adaptive Scheduling on Dynamic Shop Floor with LLM-Evolved Heuristic Portfolios

DSevolve：通过LLM演化的启发式投资组合，在动态车间实现实时自适应调度

Authors: Jin Huang, Jie Yang, XinLei Zhou, Qihao Liu, Liang Gao, Xinyu Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27628
Pdf link: https://arxiv.org/pdf/2603.27628
Abstract In dynamic manufacturing environments, disruptions such as machine breakdowns and new order arrivals continuously shift the optimal dispatching strategy, making adaptive rule selection essential. Existing LLM-powered Automatic Heuristic Design (AHD) frameworks evolve toward a single elite rule that cannot meet this adaptability demand. To address this, we present DSevolve, an industrial scheduling framework that evolves a quality-diverse portfolio of dispatching rules offline and adaptively deploys them online with second-level response time. Multi-persona seeding and topology-aware evolutionary operators produce a behaviorally diverse rule archive indexed by a MAP-Elites feature space. Upon each disruption event, a probe-based fingerprinting mechanism characterizes the current shop floor state, retrieves high-quality candidate rules from an offline knowledge base, and selects the best one via rapid look-ahead simulation. Evaluated on 500 dynamic flexible job shop instances derived from real industrial data, DSevolve outperforms state-of-the-art AHD frameworks, classical dispatching rules, genetic programming, and deep reinforcement learning, offering a practical and deployable solution for intelligent shop floor scheduling.
中文摘要 在动态制造环境中，机器故障和新订单到货等干扰不断改变最佳调度策略，使得自适应规则选择变得不可或缺。现有的由LLM驱动的自动启发式设计（AHD）框架正朝着一个无法满足适应性需求的单一精英规则发展。为此，我们介绍了DSevolve，一个工业调度框架，它将质量多样化的调度规则组合离线演化，并以二级响应时间自适应地在线部署。多角色做种和拓扑感知的进化算子生成一个行为多样化的规则档案，并由MAP-Elites特征空间索引。每次中断事件后，基于探针的指纹识别机制会描述当前车间状态，从离线知识库检索高质量候选规则，并通过快速前瞻模拟选择最佳规则。DSevolve基于500个基于真实工业数据的动态灵活工序实例进行评估，其性能优于最先进的AHD框架、经典调度规则、遗传编程和深度强化学习，提供了一种实用且可部署的智能车间调度解决方案。
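
The "rule archive indexed by a MAP-Elites feature space" can be illustrated with the generic MAP-Elites insertion step: discretize a candidate's behavior features into a grid cell and keep only the best candidate per cell. The two-dimensional feature space, the bin count, and the rule names below are hypothetical placeholders; DSevolve's actual features and evolutionary operators are not shown.

```python
def cell_of(features, bins, lo=0.0, hi=1.0):
    """Map a feature vector in [lo, hi]^d to a tuple of grid indices."""
    idx = []
    for f in features:
        f = min(max(f, lo), hi)  # clamp to the feature range
        i = min(int((f - lo) / (hi - lo) * bins), bins - 1)
        idx.append(i)
    return tuple(idx)


def insert(archive, rule, fitness, features, bins=4):
    """Keep the rule only if its cell is empty or it beats the incumbent."""
    cell = cell_of(features, bins)
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, rule)
        return True
    return False


archive = {}
results = [
    insert(archive, "rule_A", fitness=1.0, features=(0.1, 0.9)),
    insert(archive, "rule_B", fitness=0.5, features=(0.1, 0.9)),  # same cell, worse
    insert(archive, "rule_C", fitness=2.0, features=(0.2, 0.8)),  # same cell, better
    insert(archive, "rule_D", fitness=0.3, features=(0.9, 0.1)),  # new cell
]
```

Because each cell retains its own elite, a low-fitness rule such as `rule_D` survives when it occupies a behavior region no other rule covers. This is how the archive stays diverse instead of collapsing to a single best rule.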

RTLSeek: Boosting the LLM-Based RTL Generation with Multi-Stage Diversity-Oriented Reinforcement Learning

RTLSeek：通过多阶段多样性导向强化学习提升基于LLM的RTL生成

Authors: Xinyu Zhang, Zhiteng Chao, Yonghao Wang, Bin Sun, Tianyun Ma, Tianmeng Yang, Jianan Mu, Jing Justin Ye, Huawei Li
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27630
Pdf link: https://arxiv.org/pdf/2603.27630
Abstract Register Transfer Level (RTL) design translates high-level specifications into hardware using HDLs such as Verilog. Although LLM-based RTL generation is promising, the scarcity of functionally verifiable high-quality data limits both accuracy and diversity. Existing post-training typically produces a single HDL implementation per specification, lacking awareness of RTL variations needed for different design goals. We propose RTLSeek, a post-training paradigm that applies rule-based Diversity-Oriented Reinforcement Learning to improve RTL correctness and diversity. Our Diversity-Centric Multi-Objective Reward Scheduling integrates expert knowledge with EDA feedback, and a three-stage framework maximizes the utility of limited data. Experiments on the RTLLM benchmark show that RTLSeek surpasses prior methods, with ablation results confirming that encouraging broader design-space exploration improves RTL quality and achieves the principle of "the more generated, the better results." Implementation framework, including the dataset, source code, and model weights, is shown at this https URL.
中文摘要 寄存器传输级别（RTL）设计将高级规范转换为硬件，使用如Verilog等HDL技术。尽管基于LLM的RTL生成前景看好，但功能性可验证的高质量数据稀缺限制了准确性和多样性。现有的后期培训通常每个规范只实现一个HDL实现，缺乏对不同设计目标所需RTL变体的认知。我们提出了RTLSeek，一种基于规则的多样性导向强化学习的培训后范式，以提升RTL的正确性和多样性。我们的以多样性为中心的多目标奖励调度将专家知识与EDA反馈相结合，三阶段框架最大化有限数据的效用。RTLLM基准测试的实验显示，RTLSeek超越了以往的方法，消融结果证实鼓励更广泛的设计空间探索能提升RTL质量，并实现“生成越多，结果越好”的原则。实现框架，包括数据集、源代码和模型权重，见此 https URL。

Optimizing Coverage and Difficulty in Reinforcement Learning for Quiz Composition

优化测验写作的强化学习覆盖和难度

Authors: Ricardo Pedro Querido Andrade Silva, Nassim Bouarour, Dina Fettache, Sarab Boussouar, Noha Ibrahim, Sihem Amer-Yahia
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27695
Pdf link: https://arxiv.org/pdf/2603.27695
Abstract Quiz design is a tedious process that teachers undertake to evaluate the acquisition of knowledge by students. Our goal in this paper is to automate quiz composition from a set of multiple choice questions (MCQs). We formalize a generic sequential decision-making problem with the goal of training an agent to compose a quiz that meets the desired topic coverage and difficulty levels. We investigate DQN, SARSA and A2C/A3C, three reinforcement learning solutions to solve our problem. We run extensive experiments on synthetic and real datasets that study the ability of RL to land on the best quiz. Our results reveal subtle differences in agent behavior and in transfer learning with different data distributions and teacher goals. This was supported by our user study, paving the way for automating various teachers' pedagogical goals.
中文摘要 测验设计是一个繁琐的过程，教师用以评估学生掌握知识的情况。本文的目标是通过一组选择题（MCQ）自动化测验作文。我们形式化了一个通用的顺序决策问题，目标是训练代理人编写符合期望主题、覆盖范围和难度水平的测验。我们研究了DQN、SARSA和A2C/A3C这三种强化学习解决方案来解决我们的问题。我们在合成和真实数据集上进行了大量实验，研究强化学习是否能获得最佳测验。我们的结果揭示了不同数据分布和教师目标下，代理行为和迁移学习存在细微差异。我们的用户研究支持了这一点，为自动化各类教师的教学目标铺平了道路。

KAT-Coder-V2 Technical Report

KAT-Coder-V2技术报告

Authors: Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng, Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, Junyi Peng, Leizhen Cui, Meimei Jing, Mingqi Wu, Shangpeng Yan, Shaotong Qi, Suzhe Xu, Wenxuan Zhao, Xianda Sun, Xuan Xie, Yanbo Wang, Yao Xia, Yinghan Cui, Yingpeng Chen, Yong Wang, Yuze Shi, Zhiwei Shen, Ziyu Wang, Ming Sun, Lin Ye, Bin Chen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27703
Pdf link: https://arxiv.org/pdf/2603.27703
Abstract We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at this https URL.
中文摘要 我们介绍由快手KwaiKAT团队开发的智能体编码模型KAT-Coder-V2。KAT-Coder-V2采用“先专精后统一”范式，将智能体编码分解为五个专家领域——SWE、WebCoding、Terminal、WebSearch和General——每个领域都经过独立的监督微调和强化学习，然后通过同策略蒸馏整合为单一模型。我们开发了KwaiEnv，一个模块化基础设施，支持数万个并发沙盒实例，并沿任务复杂度、意图对齐和脚手架泛化三个方向扩展强化学习训练。我们还提出MCLA用于稳定MoE强化学习训练，以及Tree Training用于消除树结构轨迹上的冗余计算，加速最高可达6.2倍。KAT-Coder-V2在SWE-bench Verified上获得79.6%（相比Claude Opus 4.6的80.8%），在PinchBench上获得88.7（超过GLM-5和MiniMax M2.7），在三种前端美学场景中均排名第一，并在Terminal-Bench Hard（46.8）和tau^2-Bench（93.9）上保持强劲的通用能力评分。我们的模型在此 https URL 公开。

TIR-Agent: Training an Explorative and Efficient Agent for Image Restoration

TIR代理：训练一款探索性且高效的图像修复代理

Authors: Yisheng Zhang, Guoli Jia, Haote Hu, Shanxu Zhao, Kaikai Zhao, Long Sun, Xinwei Long, Kai Tian, Che Jiang, Zhaoxiang Liu, Kai Wang, Shiguo Lian, Kaiyan Zhang, Bowen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.27742
Pdf link: https://arxiv.org/pdf/2603.27742
Abstract Vision-language agents that orchestrate specialized tools for image restoration (IR) have emerged as a promising method, yet most existing frameworks operate in a training-free manner. They rely on heuristic task scheduling and exhaustive tool traversal, resulting in sub-optimal restoration paths and prohibitive computational cost. We argue that the core bottleneck lies in the absence of a learned policy to make decisions, as a vision-language model cannot efficiently handle degradation-aware task ordering and tool composition. To this end, we propose TIR-Agent, a trainable image restoration agent that performs a direct tool-calling policy through a two-stage training pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL). Two key designs underpin effective RL training: (i) a random perturbation strategy applied to the SFT data, which broadens the policy's exploration over task schedules and tool compositions, and (ii) a multi-dimensional adaptive reward mechanism that dynamically re-weights heterogeneous image quality metrics to mitigate reward hacking. To support high-throughput, asynchronous GPU-based tool invocation during training, we further develop a globally shared model-call pool. Experiments on both in-domain and out-of-domain degradations show that TIR-Agent outperforms 12 baselines, including 6 all-in-one models, 3 training-free agents, and 3 proprietary models, and achieves over 2.5$\times$ inference speedup by eliminating redundant tool executions.
中文摘要 用于协调图像修复（IR）专用工具的视觉语言代理已成为一种有前景的方法，但大多数现有框架仍以无训练的方式运行。它们依赖启发式任务调度和穷尽工具遍历，导致恢复路径不理想且计算成本高昂。我们认为核心瓶颈在于缺乏学习过的决策策略，因为视觉语言模型无法高效处理基于退化的任务排序和工具组合。为此，我们提出了TIR代理，这是一种可训练的图像恢复代理，通过监督微调（SFT）和强化学习（RL）两阶段的训练流程，执行直接调用工具策略。有效强化学习训练的两项关键设计是：（i）对SFT数据应用的随机扰动策略，扩大了对任务计划和工具组合的策略探索范围;（ii）多维自适应奖励机制，动态重新加权异构图像质量指标以减轻奖励黑客行为。为了支持训练期间的高通量、异步基于GPU的工具调用，我们进一步开发了一个全局共享的模型调用池。对域内外退化的实验显示，TIR代理优于12个基线，包括6个全合一模型、3个无训练代理和3个专有模型，并通过消除冗余工具执行实现了超过2.5倍的推理加速。

SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games

天网：部分可观测随机博弈的信念感知规划

Authors: Adam Haile
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27751
Pdf link: https://arxiv.org/pdf/2603.27751
Abstract In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, $p < 10^{-50}$). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.
中文摘要 2019年，Google DeepMind发布了MuZero，这是一种基于模型的强化学习方法，通过将学习的动力学模型与蒙特卡洛树搜索（MCTS）结合，在完美信息博弈中取得显著效果。然而，相对较少的工作将 MuZero 扩展到部分可观察的随机多人环境中，代理必须在隐藏状态的不确定性下行动。这种设定不仅出现在纸牌游戏中，也出现在自主谈判、金融交易和多智能体机器人等领域。在缺乏显式信念建模的情况下，MuZero的潜在编码没有专门的机制来表示未观察变量的不确定性。为此，我们引入了SkyNet（信念感知MuZero），它在标准MuZero架构中增加了带有自我条件的辅助头，用于预测获胜者和排名估计。这些目标鼓励潜伏状态保留部分可观测性下预测结果的信息，而无需明确的信念状态追踪或搜索算法的修改。我们在Skyjo上评估SkyNet，这是一款部分可观测、非零和随机的纸牌游戏，采用决策粒度环境、基于变换器的编码以及一套具有自我对战的启发式对手课程。在匹配检查点进行的1000场对战评估中，天网对基线的峰值胜率达到75.3%（+194 Elo，$p < 10^{-50}$）。天网在对抗启发式对手时也优于基础（0.720胜率对0.466胜率）。关键是，信念感知模型起初表现不及基线，但一旦训练吞吐量足够，便明显超越基线，表明信念感知辅助监督在部分可观测性下能改善学习到的表征，但前提是数据流充足。
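
The two headline numbers in the abstract above are mutually consistent under the standard logistic Elo model, in which an expected score p corresponds to a rating gap of 400·log10(p / (1 − p)). The short check below assumes this standard conversion; the abstract does not state how the Elo figure was derived, and the function name is illustrative.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating difference implied by an expected score under the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


print(round(elo_gap(0.753)))  # prints 194
```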

Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning

Wan-R1：视频推理的可验证强化学习

Authors: Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, Wensheng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.27866
Pdf link: https://arxiv.org/pdf/2603.27866
Abstract Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design -- a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks. We first show that multimodal reward models fail catastrophically in this setting. To address this, we design verifiable reward functions grounded in objective task metrics. For structured game environments, we introduce a multi-component trajectory reward. For robotic navigation, we propose an embedding-level verifiable reward. Our experiments show that RL fine-tuning with verifiable rewards improves generalization. For example, on complex 3D mazes, our model improves exact match accuracy by 29.1% over the SFT baseline, and on trap-avoidance tasks by 51.4%. Our systematic reward analysis reveals that verifiable rewards are critical for stable training, while multimodal reward models could lead to degenerate solutions. These findings establish verifiable reward design as a key enabler for robust video reasoning. Code will be publicly available.
中文摘要 视频生成模型能产生视觉上连贯的内容，但在需要空间推理和多步规划的任务上表现不佳。强化学习（RL）为提升泛化提供了一条路径，但其在视频推理中的有效性依赖于奖励设计——这一挑战几乎没有系统性研究。我们通过将群体相对策略优化（GRPO）应用于基于流量的视频模型，并训练其进行迷宫解决和机器人导航任务来研究这一问题。我们首先证明多模态奖励模型在此环境中会灾难性地失败。为此，我们设计了基于客观任务指标的可验证奖励函数。对于结构化游戏环境，我们引入了多元轨迹奖励。对于机器人导航，我们提出了一种嵌入层级的可验证奖励。我们的实验表明，带有可验证奖励的强化学习微调能提升泛化能力。例如，在复杂的三维迷宫中，我们的模型比SFT基线提升了29.1%的精确匹配准确率，在陷阱避免任务中提升了51.4%。我们的系统奖励分析显示，可验证的奖励对于稳定训练至关重要，而多模态奖励模型可能导致退化解决方案。这些发现确立了可验证的奖励设计作为强有力视频推理的关键推动力。代码将公开。
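
For readers unfamiliar with GRPO, its core step is commonly described as normalizing each sampled output's reward against the other samples drawn for the same prompt, which removes the need for a learned value function. The sketch below shows only that normalization. The binary maze reward, the function name, and the use of the population standard deviation are assumptions, and how Wan-R1 adapts GRPO to flow-based video models is not specified in the abstract.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards: 1.0 if the generated maze path is exactly correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason reward design matters so much in this setting.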

Energy Efficient Orchestration in Multiple-Access Vehicular Aerial-Terrestrial 6G Networks

多接入车辆空中-陆地6G网络中的节能编排

Authors: Mohammad Farhoudi, Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb, Ignacio Lacalle
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.27870
Pdf link: https://arxiv.org/pdf/2603.27870
Abstract The proliferation of users, devices, and novel vehicular applications - propelled by advancements in autonomous systems and connected technologies - is precipitating an unprecedented surge in novel services. These emerging services require substantial bandwidth allocation, adherence to stringent Quality of Service (QoS) parameters, and energy-efficient implementations, particularly within highly dynamic vehicular environments. The complexity of these requirements necessitates a fundamental paradigm shift in service orchestration methodologies to facilitate seamless and robust service delivery. This paper addresses this challenge by presenting a novel framework for service orchestration in Unmanned Aerial Vehicles (UAV)-assisted 6G aerial-terrestrial networks. The proposed framework synergistically integrates UAV trajectory planning, Multiple-Access Control (MAC), and service placement to facilitate energy-efficient service coverage while maintaining ultra-low latency communication for vehicular user service requests. We first present a non-linear programming model that formulates the optimization problem. Next, to address the problem, we employ a Hierarchical Deep Reinforcement Learning (HDRL) algorithm that dynamically predicts service requests, user mobility, and channel conditions, addressing the challenges of interference, resource scarcity, and mobility in heterogeneous networks. Simulation results demonstrate that the proposed framework outperforms state-of-the-art solutions in request acceptance, energy efficiency, and latency minimization, showcasing its potential to support the high demands of next-generation vehicular networks.
中文摘要 用户、设备和新型车辆应用的激增——由自主系统和互联技术的进步推动——正引发前所未有的新型服务激增。这些新兴服务需要大量带宽分配，遵守严格的服务质量（QoS）参数，并实现节能，尤其是在高度动态的车辆环境中。这些需求的复杂性要求在服务编排方法上进行根本性的范式转变，以实现无缝且稳健的服务交付。本文通过提出无人机（UAV）辅助6G空中-陆地网络服务编排的新框架，解决了这一挑战。该框架协同整合了无人机轨迹规划、多重接入控制（MAC）和服务布置，以实现节能服务覆盖，同时保持车辆用户服务请求的超低延迟通信。我们首先提出一个非线性规划模型，用于表述优化问题。接下来，为了解决这个问题，我们采用了分层深度强化学习（HDRL）算法，能够动态预测服务请求、用户移动性和信道状况，解决异构网络中的干扰、资源稀缺和移动性等挑战。模拟结果显示，该框架在请求接受度、能效和延迟最小化方面优于最先进方案，展示了其支持下一代车载网络高需求的潜力。

Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards

用于学习具有对抗性奖励的线性混合CMDP的近优原始对偶算法

Authors: Kihyun Yu, Seoungbin Bae, Dabeen Lee
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.27884
Pdf link: https://arxiv.org/pdf/2603.27884
Abstract We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
中文摘要 我们研究了有限时域线性混合约束马尔可夫决策过程（CMDPs）中的安全强化学习，其设定为全信息反馈下的对抗性奖励和未知的转移核。我们提出了一种原始对偶策略优化算法，在温和条件下实现的遗憾和约束违背界限为$\widetilde{O}(\sqrt{d^2 H^3 K})$，其中$d$为特征维度，$H$为时域长度，$K$为回合数。据我们所知，这是首个针对具有对抗性奖励的线性混合CMDP的可证明高效算法。特别地，我们的遗憾界限是近似最优的，在对数因子意义下匹配已知的极小极大下界。关键思想是引入正则化的对偶更新，从而实现基于漂移的分析。这一步至关重要，因为当奖励函数在不同回合之间变化时，基于强对偶性的分析无法直接应用。此外，我们将基于加权岭回归的参数估计扩展到约束设定，使我们能够构建更紧的置信区间，这对于推导近似最优的遗憾界限至关重要。

Flip Stunts on Bicycle Robots using Iterative Motion Imitation

利用迭代运动模仿在自行车机器人上做翻转特技

Authors: Jeonghwan Kim, Shamel Fahmi, Seungeun Rho, Sehoon Ha, Gabriel Nelson
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.27944
Pdf link: https://arxiv.org/pdf/2603.27944
Abstract This work demonstrates a front-flip on bicycle robots via reinforcement learning, particularly by imitating reference motions that are infeasible and imperfect. To address this, we propose Iterative Motion Imitation (IMI), a method that iteratively imitates trajectories generated by prior policy rollouts. Starting from an initial reference that is kinematically or dynamically infeasible, IMI helps train policies that lead to feasible and agile behaviors. We demonstrate our method on the Ultra-Mobility Vehicle (UMV), a bicycle robot that is designed to enable agile behaviors. From a self-colliding table-to-ground flip reference generated by a model-based controller, we are able to train policies that enable ground-to-ground and ground-to-table front-flips. We show that, compared to single-shot motion imitation, IMI results in policies with higher success rates that can transfer robustly to the real world. To our knowledge, this is the first unassisted acrobatic flip behavior on such a platform.
中文摘要 这项工作通过强化学习展示了自行车机器人的前空翻，特别是通过模拟不可行且不完美的参考运动。为此，我们提出了迭代运动模仿（IMI）方法，该方法可迭代模拟先前政策推出产生的轨迹。从运动学或动态上不可行的初始参考开始，IMI帮助培训政策，从而实现可行且敏捷的行为。我们在超能出行车（UMV）上演示了我们的方法，这是一种设计用于实现敏捷行为的自行车机器人。通过基于模型的控制器生成的自碰撞桌面与地面翻转参考，我们能够训练出支持地对地和地面到桌面前翻的策略。我们证明，与单次动作模拟相比，IMI的政策成功率更高，并且能够稳健地转移到现实世界。据我们所知，这是首次在此类平台上进行无辅助的杂技翻转动作。

Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning

可解释强化学习流形的主要原型分析

Authors: Bodla Krishna Vamshi, Haizhao Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.27971
Pdf link: https://arxiv.org/pdf/2603.27971
Abstract Recent years have witnessed the widespread adoption of reinforcement learning (RL), from solving real-time games to fine-tuning large language models using human preference data, significantly improving alignment with user expectations. However, as model complexity grows exponentially, the interpretability of these systems becomes increasingly challenging. While numerous explainability methods have been developed for computer vision and natural language processing to elucidate both local and global reasoning patterns, their application to RL remains limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance within RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap by enhancing explainability in RL domains without sacrificing the efficiency of the original black-box models. However, these methods typically require manually defined reference prototypes, which often necessitate expert domain knowledge. In this work, we propose a method that removes this dependency by automatically selecting optimal prototypes from the available data. Preliminary experiments on standard Gym environments demonstrate that our approach matches the performance of existing PW-Nets, while remaining competitive with the original black-box models.
中文摘要 近年来，强化学习（RL）的广泛应用，从解决实时游戏到利用人类偏好数据微调大型语言模型，显著提升了与用户期望的契合度。然而，随着模型复杂度呈指数级增长，这些系统的可解释性变得越来越具有挑战性。尽管已有许多可解释方法用于计算机视觉和自然语言处理，以阐明局部和全局推理模式，但它们在强化学习中的应用仍然有限。这些方法的直接扩展往往难以在强化学习环境中维持可解释性和性能之间的微妙平衡。原型包装网络（PW-Nets）最近展现出在通过增强学习领域可解释性、同时不牺牲原始黑匣子模型效率的前提下，弥合这一差距的潜力。然而，这些方法通常需要手动定义的参考原型，这往往需要专业知识。在本研究中，我们提出了一种方法，通过自动从可用数据中选择最优原型来消除这种依赖关系。在标准健身房环境中的初步实验表明，我们的方法能够与现有PW-Nets的性能匹敌，同时仍能与原始黑箱模型竞争。

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

SARL：通过奖励推理拓扑实现的无标签强化学习

Authors: Yifan Wang, Bolian Li, David Cho, Ruqi Zhang, Fanping Sui, Ananth Grama
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27977
Pdf link: https://arxiv.org/pdf/2603.27977
Abstract Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show that SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond good performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and more generalized reasoning ability.
中文摘要 强化学习已成为改进大型推理模型的核心，但其成功仍高度依赖可验证的奖励或标记监督。这限制了其在正确性模糊且无法验证的开放式领域中的适用性。此外，推理轨迹基本不受约束，面向最终答案的优化可能更有利于早期利用而非泛化。在本研究中，我们探讨是否可以通过教授模型“如何思考”（推理的结构）而非“产生什么”（推理的结果）来提升一般推理能力，并将传统RLVR扩展到开放式环境。我们介绍了结构感知强化学习（SARL），这是一个无标签框架，从中间思考步骤为每个响应构建推理图，并奖励其小世界拓扑，其灵感来自复杂网络和人脑的功能组织。SARL鼓励既局部连贯又全局高效的推理轨迹，将监督从目的地转向路径。我们在Qwen3-4B上的实验显示，SARL优于基于真实标签的强化学习和此前的无标签强化学习基线，取得了最佳平均增益：数学任务上PPO下为9.1%、GRPO下为11.6%，开放式任务上PPO下为34.6%、GRPO下为30.4%。除了良好的表现外，SARL还表现出更低的KL散度和更高的策略熵，表明其训练更稳定、更具探索性，推理能力更为广义。
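
A small-world graph is usually characterized by high local clustering together with short average path length, which matches the "locally coherent and globally efficient" description above. The sketch below computes both quantities for a toy undirected graph in plain Python. The four-step graph is invented for illustration, and how SARL builds its Reasoning Map or turns these statistics into a reward is not described in the abstract.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def average_clustering(g: Graph) -> float:
    """Mean fraction of each node's neighbor pairs that are themselves linked."""
    total = 0.0
    for nbrs in g.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in g[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(g)


def average_path_length(g: Graph) -> float:
    """Mean BFS distance over ordered pairs of distinct, mutually reachable nodes."""
    dist_sum, pairs = 0, 0
    for src in g:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in g[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist_sum += sum(seen.values())
        pairs += len(seen) - 1
    return dist_sum / pairs if pairs else 0.0


# Hypothetical reasoning map: steps s1..s3 form a triangle, s4 hangs off s3.
steps: Graph = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2", "s4"},
    "s4": {"s3"},
}
clustering = average_clustering(steps)
path_length = average_path_length(steps)
```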

Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL

通过视觉语言嵌入减少基于偏好的强化学习的Oracle反馈

Authors: Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28053
Pdf link: https://arxiv.org/pdf/2603.28053
Abstract Preference-based reinforcement learning can learn effective reward functions from comparisons, but its scalability is constrained by the high cost of oracle feedback. Lightweight vision-language embedding (VLE) models provide a cheaper alternative, but their noisy outputs limit their effectiveness as standalone reward generators. To address this challenge, we propose ROVED, a hybrid framework that combines VLE-based supervision with targeted oracle feedback. Our method uses the VLE to generate segment-level preferences and defers to an oracle only for samples with high uncertainty, identified through a filtering mechanism. In addition, we introduce a parameter-efficient fine-tuning method that adapts the VLE with the obtained oracle feedback in order to improve the model over time in a synergistic fashion. This ensures the retention of the scalability of embeddings and the accuracy of oracles, while avoiding their inefficiencies. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior preference-based methods while reducing oracle queries by up to 80%. Remarkably, the adapted VLE generalizes across tasks, yielding cumulative annotation savings of up to 90%, highlighting the practicality of combining scalable embeddings with precise oracle supervision for preference-based RL.
中文摘要 基于偏好的强化学习可以通过比较学习有效的奖励函数，但其可扩展性受限于oracle反馈的高昂成本。轻量化视觉语言嵌入（VLE）模型提供了更便宜的替代方案，但其噪声较大的输出限制了其作为独立奖励生成器的有效性。为应对这一挑战，我们提出了ROVED这一混合框架，结合了基于VLE的监督与针对性的预言机反馈。我们的方法使用VLE生成片段级偏好，仅对高不确定性样本通过过滤机制识别，采用oracle。此外，我们引入了一种参数高效的微调方法，将VLE与获得的预言机反馈进行调整，以协同方式逐步改进模型。这确保了嵌入的可扩展性和预言机的准确性，同时避免了其低效。在多种机器人操作任务中，ROVED能够匹配甚至超过以往基于偏好的方法，同时将oracle查询减少多达80%。令人惊讶的是，适配的VLE可跨任务推广，累计节省多达90%的注释，凸显了将可扩展嵌入与精确预言机监督结合，用于基于偏好的强化学习的实用性。
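
The deferral idea in the abstract above can be sketched as follows: trust the cheap VLE label when it is confident, and spend an oracle query only when it is not. The uncertainty test used here (distance of the preference probability from 0.5), the margin value, and all names are hypothetical stand-ins. The abstract says only that uncertain samples are identified "through a filtering mechanism".

```python
from typing import Callable, List, Tuple


def label_pairs(
    vle_probs: List[float],
    oracle: Callable[[int], int],
    margin: float = 0.2,
) -> Tuple[List[int], int]:
    """Label each pair with the VLE unless its preference probability is near 0.5.

    vle_probs[i] is the (hypothetical) VLE probability that segment A beats
    segment B in pair i. Returns the labels (1 means A is preferred) and the
    number of oracle queries spent.
    """
    labels: List[int] = []
    queries = 0
    for i, p in enumerate(vle_probs):
        if abs(p - 0.5) < margin:  # uncertain: defer to the oracle
            labels.append(oracle(i))
            queries += 1
        else:  # confident: trust the VLE
            labels.append(1 if p > 0.5 else 0)
    return labels, queries


truth = [1, 0, 1, 0]
labels, queries = label_pairs([0.95, 0.10, 0.55, 0.45], oracle=lambda i: truth[i])
```

In this toy run only the two near-0.5 pairs reach the oracle, so half of the annotation budget is saved while all four labels still match the ground truth.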

Koopman-based surrogate modeling for reinforcement-learning-control of Rayleigh-Benard convection

基于库普曼的替代建模用于雷利-贝纳德对流的强化-学习-控制

Authors: Tim Plotzki, Sebastian Peitz
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2603.28074
Pdf link: https://arxiv.org/pdf/2603.28074
Abstract Training reinforcement learning (RL) agents to control fluid dynamics systems is computationally expensive due to the high cost of direct numerical simulations (DNS) of the governing equations. Surrogate models offer a promising alternative by approximating the dynamics at a fraction of the computational cost, but their feasibility as training environments for RL is limited by distribution shifts, as policies induce state distributions not covered by the surrogate training data. In this work, we investigate the use of Linear Recurrent Autoencoder Networks (LRANs) for accelerating RL-based control of 2D Rayleigh-Bénard convection. We evaluate two training strategies: a surrogate trained on precomputed data generated with random actions, and a policy-aware surrogate trained iteratively using data collected from an evolving policy. Our results show that while surrogate-only training leads to reduced control performance, combining surrogates with DNS in a pretraining scheme recovers state-of-the-art performance while reducing training time by more than 40%. We demonstrate that policy-aware training mitigates the effects of distribution shift, enabling more accurate predictions in policy-relevant regions of the state space.
中文摘要 训练强化学习（RL）智能体控制流体动力学系统的计算成本很高，因为对控制方程进行直接数值模拟（DNS）的成本高昂。替代模型以极低的计算成本近似动力学，提供了一种有前景的替代方案，但它们作为强化学习训练环境的可行性受限于分布偏移，因为策略会诱导出替代模型训练数据未覆盖的状态分布。本研究探讨利用线性循环自编码网络（LRAN）加速基于强化学习的二维瑞利-贝纳德对流控制。我们评估了两种训练策略：一种是基于随机动作生成的预计算数据训练的替代模型，另一种是利用不断演化的策略所收集的数据进行迭代训练的策略感知替代模型。我们的结果显示，虽然仅用替代模型训练会导致控制性能下降，但在预训练方案中将替代模型与DNS结合，可以恢复最先进的性能，同时将训练时间缩短超过40%。我们证明，策略感知训练能够减轻分布偏移的影响，使得在与策略相关的状态空间区域实现更准确的预测。

Heddle: A Distributed Orchestration System for Agentic RL Rollout

Heddle：一种用于代理强化学习（Agentic RL）推广的分布式编排系统

Authors: Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, Xin Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28101
Pdf link: https://arxiv.org/pdf/2603.28101
Abstract Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5$\times$ higher end-to-end rollout throughput compared to state-of-the-art baselines.
中文摘要 代理强化学习（RL）使LLM能够通过在数据收集推广阶段和策略训练阶段之间交替解决复杂任务。在部署过程中，代理生成轨迹，即大型语言模型与外部工具之间的多步交互。然而，频繁的工具调用会引发长尾轨迹生成，从而阻碍推广。这源于忽略轨迹上下文的步进中心设计，触发了长尾轨迹生成的三个系统问题：排队延迟、干扰开销和每个令牌时间膨胀。我们提出了Heddle系统，一种以轨迹为中心的系统，用于优化代理式推广的何时、地点和方式。Heddle集成了三种核心机制：轨迹级调度，利用运行时预测和渐进优先级以最小化积累队列;通过预排序动态规划实现轨迹感知布局，并在空闲工具调用间隔中机会性迁移以最小化干扰;以及轨迹自适应资源管理器，动态调优模型并行性，加速长尾轨迹的每个令牌时间，同时保持短轨迹的高吞吐量。跨不同代理型强化学习工作负载的评估表明，Heddle有效中和了长尾瓶颈，相比最先进的基线，端到端的部署吞吐量最高可达2.5倍。

$AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning

$AutoDrive\text{-}P^3$：通过强化微调实现的感知-预测-规划统一思维链

Authors: Yuqi Ye, Zijian Zhang, Junhong Lin, Shangkun Sun, Changhao Peng, Wei Gao
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28116
Pdf link: https://arxiv.org/pdf/2603.28116
Abstract Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\textbf{P}$erception, $\textbf{P}$rediction, and $\textbf{P}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at this https URL.
中文摘要 视觉语言模型（VLM）因其在处理长尾场景中的卓越性能，正日益被端到端自动驾驶系统采用。然而，当前基于VLM的方法存在两个主要局限：1）一些VLM直接输出规划结果，无需思考链（CoT）推理，绕过关键的感知和预测阶段，造成显著的领域空白并影响决策能力;2）其他VLM可以生成感知、预测和规划任务的输出，但采用分散的决策方法，这些模块各自独立运行，导致协同效应严重不足，削弱了真正的规划绩效。为解决这些限制，我们提出了 ${AutoDrive\text{-}P^3}$，一个新颖框架，通过结构化推理无缝整合了 $\textbf{P}$erception、$\textbf{P}$rediction 和 $\textbf{P}$lanning。我们引入了${P^3\text{-}CoT}$数据集以促进连贯推理，并提出了${P^3\text{-}GRPO}$，一种在三项任务中提供渐进式监督的分层强化学习算法。具体来说，${AutoDrive\text{-}P^3}$ 逐步生成 CoT 推理和对感知、预测和规划的答案，感知为后续预测和规划提供关键信息，而感知和预测共同参与最终规划决策，实现更安全、更易理解的自动驾驶。此外，为了平衡推理效率与性能，我们引入了双重思维模式：细致思考和快速思考。在开环（nuScenes）和闭环（NAVSIMv1/v2）基准测试上的大量实验表明，我们的方法在规划任务中实现了最先进的性能。代码可在此 https URL 访问。

MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

MedLoc-R1：基于GRPO的医学视觉基础的绩效意识课程奖励安排

Authors: Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang, Wanran Sun, Zhenyu Zhang, Bing Ji, Qicheng Lao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28120
Pdf link: https://arxiv.org/pdf/2603.28120
Abstract Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at this https URL.
中文摘要 医学视觉定位是细粒度多模态推理和可解释临床决策支持的重要基础。尽管强化学习（RL）在定位任务上近期取得了进展，现有方法如组相对策略优化（GRPO）在直接应用于医学图像时仍面临严重的奖励稀疏问题，主要原因是定位微小或模糊的感兴趣区域本身就很困难，而强化学习中基于IoU的固定奖励方案僵化且次优，使这一问题进一步加剧。这导致策略梯度消失、优化停滞，在训练早期尤为明显。为应对这一挑战，我们提出了MedLoc-R1，一种性能感知的奖励调度框架，根据模型的就绪程度逐步收紧奖励标准。MedLoc-R1引入了滑动窗口性能跟踪器和多条件更新规则，能够自动将奖励调度从密集且易于获得的信号调整为更严格、细粒度的定位要求，同时保持GRPO的有利特性，而无需引入辅助网络或额外的梯度路径。在三个医学视觉定位基准上的实验表明，MedLoc-R1在定位准确性和训练稳定性方面均持续优于基于GRPO的基线。我们的框架为高风险医疗应用中基于强化学习的定位提供了通用、轻量且有效的解决方案。代码和检查点可在此 https URL 获取。
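
To make the idea of a reward criterion that tightens with model readiness more concrete, here is a minimal Python sketch. It is not the MedLoc-R1 implementation: the class name `ThresholdScheduler`, the single hit-rate condition (the abstract mentions a multi-condition rule whose details are not given), and every numeric default are assumptions chosen for illustration.

```python
from collections import deque


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class ThresholdScheduler:
    """Hypothetical scheduler: raise the IoU threshold once the recent hit rate is high."""

    def __init__(self, start=0.3, step=0.1, max_threshold=0.7, window=4, target_rate=0.75):
        self.threshold = start
        self.step = step
        self.max_threshold = max_threshold
        self.target_rate = target_rate
        self.hits = deque(maxlen=window)  # sliding window of recent successes

    def reward(self, pred_box, gt_box):
        """Binary reward under the current threshold; may tighten the threshold afterwards."""
        hit = iou(pred_box, gt_box) >= self.threshold
        self.hits.append(hit)
        window_full = len(self.hits) == self.hits.maxlen
        if window_full and sum(self.hits) / len(self.hits) >= self.target_rate:
            # Assumed single-condition update; the paper's rule is multi-condition.
            self.threshold = min(self.max_threshold, round(self.threshold + self.step, 6))
            self.hits.clear()
        return 1.0 if hit else 0.0
```

Early in training a loose threshold yields frequent non-zero rewards; as the windowed hit rate rises, the same prediction quality stops being rewarded, which is the dense-to-strict progression the abstract describes.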

A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents

通过虚拟代理闭环引导鱼群的深度强化学习框架

Authors: Takato Shibayama, Hiroaki Kawashima
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Arxiv link: https://arxiv.org/abs/2603.28200
Pdf link: https://arxiv.org/pdf/2603.28200
Abstract Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.
中文摘要 引导生物群体的集体运动，是理解社会互动规则和开发动物管理自动化系统的一项根本挑战。本研究提出一个深度强化学习（RL）框架，利用虚拟智能体对鱼群进行闭环引导。这些智能体由在仿真中通过近端策略优化（PPO）训练的策略控制，并部署于以红鼻剪刀鱼（rummy-nose tetra，Petitella bleheri）为对象的物理实验中，实现人工智能体与活体个体之间的实时交互。为了应对活体个体的随机行为，我们设计了一个复合奖励函数，以平衡方向引导与群体凝聚力。我们对视觉参数的系统评估表明，在物理试验中，白色背景和较大的刺激尺寸能使引导效果最大化。此外，针对不同群体规模的评估显示，系统对五条鱼的群体能够有效引导，但当群体规模增加到八条时，这一能力明显下降。本研究凸显了深度强化学习在生物群体自动化引导中的潜力，并指出了在较大群体中维持人工影响所面临的挑战。

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

ERPO：面向大型推理模型的词元级熵调控策略优化

Authors: Song Yu, Li Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28204
Pdf link: https://arxiv.org/pdf/2603.28204
Abstract Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy's trajectory is most sensitive to perturbations. These pivots represent the "forks in the road" where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks (e.g., MATH, AIME) demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, establishing a new efficiency-accuracy frontier for large reasoning models.
中文摘要 基于可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。然而，标准的组相对策略优化（GRPO）通常为所有词元分配统一的序列级优势，从而忽略了推理链上内在的信息异质性。我们表明，这种粗粒度的信用分配会导致熵过早坍缩，并促使模型生成冗余且低质量的推理路径。通过系统的实证分析，我们识别出关键决策枢轴（CDP）：即策略轨迹对扰动最敏感的瞬态高熵状态。这些枢轴代表了“岔路口”，在这些位置上，有效的多路径探索最为关键，却常被统一的优势信号所抑制。基于这些见解，我们提出了熵调控策略优化（ERPO），将优化重点从粗粒度的序列转向细粒度的词元动态。ERPO引入了三个协同组件：（i）熵感知门控，自适应地放大CDP处的探索，以促进多样化路径的发现；（ii）基于桶的隐式归一化，通过对齐词元进度窗口来减轻难度偏差；以及（iii）结果锚定的优势合成，通过结果驱动的锚点对词元级信号重新加权。在竞赛级数学基准（如MATH、AIME）上的大量实验表明，ERPO显著优于GRPO。值得注意的是，ERPO不仅提升了推理准确率，还产生了明显更简洁、更稳健的推导路径，为大型推理模型确立了新的效率-准确率前沿。
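
The two quantities the abstract contrasts, a sequence-level group-relative advantage and per-token entropy, can be sketched in a few lines of Python. This is only an illustration of the standard definitions, not ERPO itself: the helper `high_entropy_positions` and its threshold are hypothetical stand-ins for the paper's Critical Decision Pivot detection, whose exact criterion the abstract does not state.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style).

    Every token of response i would share the single scalar advantage[i].
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_positions(token_dists, threshold):
    """Indices whose entropy exceeds a hypothetical threshold (assumed pivot proxy)."""
    return [i for i, dist in enumerate(token_dists) if token_entropy(dist) > threshold]
```

With rewards `[1, 0, 0, 1]` each response gets an advantage of roughly plus or minus one regardless of where in the sequence the decisive step occurred, which is the uniform credit assignment that a token-level scheme would refine.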

Cost-Matching Model Predictive Control for Efficient Reinforcement Learning in Humanoid Locomotion

面向人形机器人运动高效强化学习的成本匹配模型预测控制

Authors: Wenqi Cai, Kyriakos G. Vamvoudakis, Sébastien Gros, Anthony Tzes
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.28243
Pdf link: https://arxiv.org/pdf/2603.28243
Abstract In this paper, we propose a cost-matching approach for optimal humanoid locomotion within a Model Predictive Control (MPC)-based Reinforcement Learning (RL) framework. A parameterized MPC formulation with centroidal dynamics is trained to approximate the action-value function obtained from high-fidelity closed-loop data. Specifically, the MPC cost-to-go is evaluated along recorded state-action trajectories, and the parameters are updated to minimize the discrepancy between MPC-predicted values and measured returns. This formulation enables efficient gradient-based learning while avoiding the computational burden of repeatedly solving the MPC problem during training. The proposed method is validated in simulation using a commercial humanoid platform. Results demonstrate improved locomotion performance and robustness to model mismatch and external disturbances compared with manually tuned baselines.
中文摘要 本文在基于模型预测控制（MPC）的强化学习（RL）框架内，提出了一种用于最优人形机器人运动的成本匹配方法。我们训练一个采用质心动力学的参数化MPC表述，使其近似由高保真闭环数据得到的动作价值函数。具体来说，沿记录的状态-动作轨迹评估MPC的后续代价（cost-to-go），并更新参数以最小化MPC预测值与实测回报之间的差异。这种表述实现了高效的基于梯度的学习，同时避免了在训练过程中反复求解MPC问题的计算负担。所提方法在仿真中使用一款商用人形机器人平台进行了验证。结果显示，与人工调参的基线相比，该方法提升了运动性能，并增强了对模型失配和外部扰动的鲁棒性。
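
The cost-matching step, evaluating a parameterized cost-to-go along recorded trajectories and fitting it to measured returns, can be illustrated with a toy Python sketch. Everything specific here is an assumption: the scalar state and action, the quadratic stage cost `w_s * s**2 + w_a * a**2`, the discount factor, and plain gradient descent. The paper's formulation uses centroidal dynamics and an MPC cost that this sketch does not model.

```python
def discounted_tail_sums(values, gamma):
    """Return [sum over k >= t of gamma**(k - t) * values[k]] for every t."""
    tails, running = [], 0.0
    for v in reversed(values):
        running = v + gamma * running
        tails.append(running)
    return tails[::-1]


def fit_cost_weights(states, actions, costs, gamma=0.9, lr=0.5, steps=20000):
    """Fit (w_s, w_a) so the predicted cost-to-go matches the measured one.

    The prediction is evaluated along the recorded trajectory, so no
    optimization problem is solved inside the training loop.
    """
    feat_s = discounted_tail_sums([s * s for s in states], gamma)
    feat_a = discounted_tail_sums([a * a for a in actions], gamma)
    returns = discounted_tail_sums(costs, gamma)  # measured cost-to-go
    w_s, w_a = 0.0, 0.0
    n = len(returns)
    for _ in range(steps):
        grad_s = grad_a = 0.0
        for fs, fa, g in zip(feat_s, feat_a, returns):
            err = w_s * fs + w_a * fa - g  # predicted minus measured
            grad_s += 2.0 * err * fs / n
            grad_a += 2.0 * err * fa / n
        w_s -= lr * grad_s  # gradient step on the mean squared discrepancy
        w_a -= lr * grad_a
    return w_s, w_a
```

Because the prediction is linear in the weights for this toy cost, the fit reduces to least squares; the gradient loop is kept only to mirror the gradient-based learning the abstract mentions.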

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

抗数据污染的离线多智能体人类反馈强化学习

Authors: Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28281
Pdf link: https://arxiv.org/pdf/2603.28281
Abstract We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(\epsilon^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{\epsilon})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
中文摘要 我们在强污染模型下研究离线多智能体人类反馈强化学习（MARLHF）对数据污染的鲁棒性：给定由轨迹-偏好元组构成的数据集 $D$（每个偏好是一个 $n$ 维二进制标签向量，表示 $n$ 个智能体各自的偏好），其中 $\epsilon$ 比例的样本可能被任意篡改。我们用线性马尔可夫博弈的框架对该问题建模。首先，在均匀覆盖假设下，即每个感兴趣的策略在干净（被污染之前的）数据中都得到充分体现，我们引入一个鲁棒估计器，保证纳什均衡间隙的上界为 $O(\epsilon^{1 - o(1)})$。接下来，我们转向更具挑战性的单边覆盖设置，其中只有某个纳什均衡及其单个参与者的偏离被覆盖。在这种情况下，我们提出的算法在纳什间隙上达到 $O(\sqrt{\epsilon})$ 的界。然而，这两种方法在计算上都难以处理。为此，我们将解概念放宽为粗相关均衡（CCE）。在相同的单边覆盖条件下，我们推导出一个拟多项式时间算法，其CCE间隙按 $O(\sqrt{\epsilon})$ 缩放。据我们所知，这是对离线MARLHF中对抗性数据污染的首次系统性研究。

Competitor-aware Race Management for Electric Endurance Racing

面向电动耐力赛的对手感知比赛管理

Authors: Wytze de Vries, Erik van den Eshof, Jorn van Kampen, Mauro Salazar
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.28286
Pdf link: https://arxiv.org/pdf/2603.28286
Abstract Electric endurance racing is characterized by severe energy constraints and strong aerodynamic interactions. Determining race-winning policies therefore becomes a fundamentally multi-agent, game-theoretic problem. These policies must jointly govern low-level driver inputs as well as high-level strategic decisions, including energy management and charging. This paper proposes a bi-level framework for competitor-aware race management that combines game-theoretic optimal control with reinforcement learning. At the lower level, a multi-agent game-theoretic optimal control problem is solved to capture aerodynamic effects and asymmetric collision-avoidance constraints inspired by motorsport rules. Using this single-lap problem as the environment, reinforcement learning agents are trained to allocate battery energy and schedule pit stops over an entire race. The framework is demonstrated in a two-agent, 45-lap simulated race. The results show that effective exploitation of aerodynamic interactions is decisive for race outcome, with strategies that prioritize finishing position differing fundamentally from single-agent, minimum-time approaches.
中文摘要 电动耐力赛的特点是能量约束严苛、空气动力学相互作用强烈。因此，确定制胜策略从根本上成为一个多智能体博弈论问题。这些策略必须同时决定底层的车手操作输入和高层的战略决策，包括能量管理和充电。本文提出了一个对手感知的双层比赛管理框架，将博弈论最优控制与强化学习相结合。在下层，求解一个多智能体博弈论最优控制问题，以刻画空气动力学效应以及受赛车运动规则启发的非对称避撞约束。以该单圈问题作为环境，训练强化学习智能体在整场比赛中分配电池能量并安排进站。该框架在一场双智能体、45圈的仿真比赛中得到了验证。结果表明，有效利用空气动力学相互作用对比赛结果具有决定性作用，以完赛名次为优先的策略与单智能体的最短时间方法有根本区别。

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Kernel-Smith：进化内核优化的统一配方

Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28342
Pdf link: https://arxiv.org/pdf/2603.28342
Abstract We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
中文摘要 我们介绍Kernel-Smith，一个高性能GPU内核和操作符生成框架，结合了稳定的评估驱动进化代理和面向进化的后训练方案。在代理端，Kernel-Smith 维护一批可执行候选程序，并通过一个性能优异且多样化的程序档案，以及结构化的执行反馈，对编译、正确性和加速进行迭代改进。为了使搜索更可靠，我们为 NVIDIA GPU 上的 Triton 和 MetaX GPU 上的 Maca 构建了后端专用的评估服务。在训练端，我们通过保留保持正确性和高增益的修订，将长视野演化轨迹转化为以步骤为中心的监督和强化学习信号，使模型在进化循环中成为强有力的局部改进者，而非一次性生成器。在统一的进化协议下，Kernel-Smith-235B-RL 在搭载 Nvidia Triton 后端的 KernelBench 上实现了最先进的整体性能，实现了最佳的平均加速率，并超越了包括 Gemini-3.0-pro 和 Claude-4.6-opus 在内的前沿专有模型。我们在MetaX MACA后端进一步验证了该框架，我们的Kernel-Smith-MACA-30B超越了DeepSeek-V3.2-think和Qwen3-235B-2507-think等大规模对应产品，凸显了跨异构平台无缝适配的潜力。除了基准测试结果，同一工作流程还为生产系统（包括SGLang和LMDeploy）产生上游贡献，证明了LLM驱动的内核优化可以从受控评估转化为实际部署。

Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

重新思考基于视觉自回归模型的文本引导图像编辑中的结构保存

Authors: Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28367
Pdf link: https://arxiv.org/pdf/2603.28367
Abstract Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.
中文摘要 视觉自回归（VAR）模型近年来作为一类有前景的生成模型出现，支持了多种下游视觉任务，如文本引导图像编辑。通过将编辑范式从基于扩散方法的噪声操作转向令牌级操作，基于VAR的方法实现了更好的背景保存和显著加快的推理速度。然而，现有基于VAR的编辑方法仍面临两个关键挑战：准确定位可编辑标记和保持编辑结果结构一致性。在本研究中，我们提出了一种基于VAR模型中中间特征分布分析的新型文本引导图像编辑框架。首先，我们引入了一种从粗到细的令牌本地化策略，可以精炼可编辑区域，平衡编辑忠实度和背景保护。其次，我们分析VAR模型的中间表示，识别结构相关特征，设计出一种简单但有效的特征注入机制，以增强编辑图像与源图像之间的结构一致性。第三，我们开发了一种基于强化学习的自适应特征注入方案，能够自动学习尺度和层级的注入比率，共同优化编辑的真实性和结构保持。大量实验表明，我们的方法在本地和全局编辑场景下，相较于最先进方法，在结构一致性和编辑质量上都优越。

Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids

面向不规则六边形网格海上覆盖路径规划的无评论家深度强化学习

Authors: Carlos S. Sepúlveda, Gonzalo A. Ruz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.28385
Pdf link: https://arxiv.org/pdf/2603.28385
Abstract Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task where a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.
中文摘要 海上监视任务（如搜救和环境监测）依赖于在广阔且几何形状复杂的区域内高效分配感知资源。传统的覆盖路径规划（CPP）方法依赖分解技术，这些技术难以处理不规则的海岸线、岛屿和禁入区，或者需要对每个实例进行计算代价高昂的重新规划。我们提出了一个深度强化学习（DRL）框架，用于在不规则海域的六边形网格表示上求解CPP。与传统方法不同，我们将该问题表述为神经组合优化任务，由基于Transformer的指针策略以自回归方式构建覆盖路线。为克服长时域路径问题中价值估计的不稳定性，我们实现了一种无评论家（critic-free）的组相对策略优化（GRPO）方案。该方法通过在同一实例内比较采样轨迹来估计优势，而不依赖价值函数。在1000个未见过的合成海上环境中的实验表明，训练后的策略实现了99.0%的哈密顿成功率，是最佳启发式方法（46.0%）的两倍多，同时生成的路径比最接近的基线短7%，航向变化少24%。三种推理模式（贪心、随机采样以及带2-opt优化的采样）在笔记本电脑GPU上每个实例的运行时间均低于50 ms，证实了实时机载部署的可行性。
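
For readers unfamiliar with the 2-opt refinement named among the inference modes, the following Python sketch shows the generic textbook procedure on an open path. It is an assumption-laden illustration rather than the paper's post-processing step: it uses Euclidean distances between arbitrary points, keeps the start node fixed, and ignores the hex-grid adjacency constraints a real coverage tour would have to respect.

```python
import math


def path_length(points, order):
    """Total Euclidean length of the open path visiting points in the given order."""
    return sum(math.dist(points[order[i]], points[order[i + 1]])
               for i in range(len(order) - 1))


def two_opt(points, order):
    """Repeatedly reverse a segment of the path whenever that shortens it."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):            # start node stays fixed
            for j in range(i + 2, len(best) + 1):    # segment best[i:j], length >= 2
                candidate = best[:i] + best[i:j][::-1] + best[j:]
                if path_length(points, candidate) < path_length(points, best) - 1e-12:
                    best, improved = candidate, True
    return best
```

Applied to a sampled tour, such a pass can only keep or shorten the path, which is why it pairs naturally with stochastic sampling as a cheap final polish.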

Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

通过大型语言模型进化发现强化学习算法

Authors: Alkis Sygkounas, Amy Loutfi, Andreas Persson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28416
Pdf link: https://arxiv.org/pdf/2603.28416
Abstract Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor--critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.
中文摘要 强化学习算法由其学习更新规则定义，这些规则通常是手工设计和固定的。我们提出了一个进化框架，通过直接搜索实现完整训练过程的可执行更新规则，发现强化学习算法。该方法基于REvolve，一种使用大型语言模型作为生成变异算子的进化系统，并将其从奖励函数发现扩展到算法发现。为了促进非标准学习规则的出现，搜索排除了诸如actor-critic结构、时间差分损失和价值自助法等典型机制。由于强化学习算法对内部标量参数高度敏感，我们引入了一个进化后细化阶段，其中大型语言模型为每个进化后的更新规则提出可行的超参数范围。通过在多个Gymnasium基准测试上进行全程训练，发现的算法相较于已建立的基线（包括SAC、PPO、DQN和A2C）实现了竞争性能。

$R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation

$R_{dm}$：将分布匹配重新概念化为扩散蒸馏的奖励

Authors: Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28460
Pdf link: https://arxiv.org/pdf/2603.28460
Abstract Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.
中文摘要 扩散模型实现了最先进的生成性能，但从根本上受限于其缓慢的迭代采样过程。虽然扩散蒸馏技术实现了高保真的少步生成，但传统目标将学生模型仅锚定于教师模型，往往限制了学生模型的性能。近期的方法尝试通过整合强化学习（RL）来突破这一上限，通常是将蒸馏目标与RL目标简单相加。在本研究中，我们将分布匹配重新概念化为一种奖励（记为$R_{dm}$），由此提出一种新范式。这一统一视角弥合了扩散匹配蒸馏（DMD）与RL之间的算法差距，带来了多项关键优势。（1）优化稳定性增强：我们引入了组归一化分布匹配（GNDM），它借鉴标准RL中的组归一化来稳定$R_{dm}$的估计。通过利用组均值统计量，GNDM确立了一个更稳健、更有效的优化方向。（2）无缝的奖励集成：我们以奖励为中心的表述天然支持自适应加权机制，允许将DMD与外部奖励模型灵活组合。（3）采样效率提升：通过与RL原理对齐，该框架可以方便地引入重要性采样（IS），显著提升采样效率。大量实验表明，GNDM优于原始DMD，将FID降低了1.87。此外，我们的多奖励变体GNDMR超越了现有基线，在美学质量与保真度之间取得了良好平衡，HPS峰值达到30.37，FID-SD低至12.21。总体而言，$R_{dm}$为实时高保真合成提供了一个灵活、稳定且高效的框架。代码将在论文发表后公开。

CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

CiQi-Agent：在多模态智能体中对齐视觉、工具与美学，用于中国瓷器的文化推理

Authors: Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28474
Pdf link: https://arxiv.org/pdf/2603.28474
Abstract The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question--answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at this https URL.
中文摘要 中国古代瓷器的鉴赏需要丰富的历史知识、材料理解和审美敏感度，非专业人士难以参与。为了普及对文化遗产的理解并辅助专家鉴赏，我们推出了CiQi-Agent，一个面向特定领域的瓷器鉴赏智能体，用于智能分析中国古代瓷器。CiQi-Agent支持多图像瓷器输入，支持视觉工具调用和多模态检索增强生成，可针对六个属性进行细粒度鉴赏分析：朝代、在位时期、窑址、釉色、装饰纹样和器形。除属性分类外，它还能捕捉细微的视觉细节，检索相关领域知识，并整合视觉与文本证据，生成连贯且可解释的鉴赏描述。为实现这一能力，我们构建了大规模专家标注数据集CiQi-VQA，包含29,596件瓷器样本、51,553张图像和557,940个视觉问答对，并进一步建立了与上述六个属性对齐的综合基准CiQi-Bench。CiQi-Agent通过监督微调、强化学习和工具增强推理框架进行训练，该框架整合了两类工具：视觉工具和多模态检索工具。实验结果显示，CiQi-Agent（7B）在CiQi-Bench的全部六个属性上均优于所有参与比较的开源和闭源模型，平均准确率比GPT-5高12.2%。模型和数据集已发布，可在此 https URL 公开获取。

Tac2Real: Reliable and GPU Visuotactile Simulation for Online Reinforcement Learning and Zero-Shot Real-World Deployment

Tac2Real：用于在线强化学习和零样本真实世界部署的可靠GPU视触觉仿真

Authors: Ningyu Yan, Shuai Wang, Xing Shen, Hui Wang, Hanqing Wang, Yang Xiang, Jiangmiao Pang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.28475
Pdf link: https://arxiv.org/pdf/2603.28475
Abstract Visuotactile sensors are indispensable for contact-rich robotic manipulation tasks. However, policy learning with tactile feedback in simulation, especially for online reinforcement learning (RL), remains a critical challenge, as it demands a delicate balance between physics fidelity and computational efficiency. To address this challenge, we present Tac2Real, a lightweight visuotactile simulation framework designed to enable efficient online RL training. Tac2Real integrates the Preconditioned Nonlinear Conjugate Gradient Incremental Potential Contact (PNCG-IPC) method with a multi-node, multi-GPU high-throughput parallel simulation architecture, which can generate marker displacement fields at interactive rates. Meanwhile, we propose a systematic approach, TacAlign, to narrow both structured and stochastic sources of domain gap, ensuring a reliable zero-shot sim-to-real transfer. We further evaluate Tac2Real on the contact-rich peg insertion task. The zero-shot transfer results achieve a high success rate in the real-world scenario, verifying the effectiveness and robustness of our framework. The project page is: this https URL
中文摘要 视触觉传感器对于接触密集型机器人操作任务不可或缺。然而，在仿真中利用触觉反馈进行策略学习，尤其是在线强化学习（RL），仍是一项关键挑战，因为它要求在物理保真度与计算效率之间取得微妙平衡。为应对这一挑战，我们提出了Tac2Real，一个轻量级视触觉仿真框架，旨在实现高效的在线强化学习训练。Tac2Real将预条件非线性共轭梯度增量势接触（PNCG-IPC）方法与多节点、多GPU的高吞吐并行仿真架构相结合，能够以交互式速率生成标记点位移场。同时，我们提出了一种系统化方法TacAlign，用于缩小领域差距中结构性和随机性两类来源，确保可靠的零样本仿真到现实迁移。我们进一步在接触密集的插销任务上评估了Tac2Real。零样本迁移结果在真实场景中取得了较高的成功率，验证了我们框架的有效性和鲁棒性。项目页面：此 https URL

Intelligent Radio Resource Slicing for 6G In-Body Subnetworks

面向6G体内子网的智能无线资源切片

Authors: Samira Abdelrahman, Hossam Farag
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.28529
Pdf link: https://arxiv.org/pdf/2603.28529
Abstract 6G In-body Subnetworks (IBSs) represent a key enabler for supporting standalone eXtended Reality (XR) applications. IBSs are expected to operate as an underlay to existing cellular networks, giving rise to coexistence challenges when sharing radio resources with other cellular users, such as enhanced Mobile Broadband (eMBB) users. This resource allocation problem is highly dynamic and inherently non-convex due to heterogeneous service demands and fluctuating channel conditions. In this paper, we propose an intelligent radio resource slicing strategy based on the Soft Actor-Critic (SAC) deep reinforcement learning algorithm. The proposed SAC-based slicing method addresses the coexistence challenge between IBSs and eMBB users by optimizing a refined reward function that explicitly incorporates XR cross-modal delay alignment to ensure immersive experience while preserving eMBB service guarantees. Extensive system-level simulations are performed under realistic network conditions and the results demonstrate that the proposed method can enhance user experience by 12-85% under different network densities compared to baseline methods while maintaining the target data rate for eMBB users.
中文摘要 6G体内子网（IBS）是支持独立扩展现实（XR）应用的关键使能技术。IBS预计将作为现有蜂窝网络的底层（underlay）运行，在与其他蜂窝用户（如增强型移动宽带（eMBB）用户）共享无线资源时会带来共存挑战。由于服务需求异构且信道条件波动，这一资源分配问题高度动态且本质上非凸。本文提出了一种基于软演员-评论家（SAC）深度强化学习算法的智能无线资源切片策略。所提出的基于SAC的切片方法通过优化一个精细设计的奖励函数来解决IBS与eMBB用户之间的共存挑战，该奖励函数显式纳入XR跨模态时延对齐，在保障eMBB服务的同时确保沉浸式体验。我们在真实的网络条件下进行了大量系统级仿真，结果表明，与基线方法相比，所提方法在不同网络密度下可将用户体验提升12%至85%，同时保持eMBB用户的目标数据速率。

GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum

GraphWalker：基于合成轨迹课程的智能体式知识图谱问答

Authors: Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao, Kang Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.28533
Pdf link: https://arxiv.org/pdf/2603.28533
Abstract Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories. To this end, this paper proposes GraphWalker, a novel agentic KGQA framework that addresses these challenges through Automated Trajectory Synthesis and Stage-wise Fine-tuning. GraphWalker adopts a two-stage SFT training paradigm: first, the agent is trained on structurally diverse trajectories synthesized from constrained random-walk paths, establishing a broad exploration prior over the KG; second, the agent is further fine-tuned on a small set of expert trajectories to develop reflection and error recovery capabilities. Extensive experiments demonstrate that our stage-wise SFT paradigm unlocks a higher performance ceiling for a lightweight reinforcement learning (RL) stage, enabling GraphWalker to achieve state-of-the-art performance on CWQ and WebQSP. Additional results on GrailQA and our constructed GraphWalkerBench confirm that GraphWalker enhances generalization to out-of-distribution reasoning paths. The code is publicly available at this https URL
中文摘要 智能体式知识图谱问答（KGQA）要求智能体与知识图谱（KG）进行迭代交互，这在训练数据稀缺和推理泛化两方面都带来了挑战。具体来说，现有方法往往限制了智能体的探索：基于提示的方法缺乏自主导航训练，而当前的训练流程通常将推理限制在预定义的轨迹内。为此，本文提出了GraphWalker，一种新颖的智能体式KGQA框架，通过自动轨迹合成（Automated Trajectory Synthesis）和分阶段微调（Stage-wise Fine-tuning）来应对这些挑战。GraphWalker采用两阶段SFT训练范式：首先，智能体在由受约束随机游走路径合成的结构多样的轨迹上进行训练，建立对KG的广泛探索先验；其次，智能体在少量专家轨迹上进一步微调，以培养反思和错误恢复能力。大量实验表明，我们的分阶段SFT范式为轻量级强化学习（RL）阶段打开了更高的性能上限，使GraphWalker在CWQ和WebQSP上达到最先进的性能。在GrailQA和我们构建的GraphWalkerBench上的补充结果证实，GraphWalker增强了对分布外推理路径的泛化能力。代码已在此 https URL 公开。

Learning Partial Action Replacement in Offline MARL

离线MARL中的部分动作替换学习

Authors: Yue Jin, Giovanni Montana
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.28573
Pdf link: https://arxiv.org/pdf/2603.28573
Abstract Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
中文摘要 离线多智能体强化学习（MARL）面临一个关键挑战：联合动作空间随智能体数量呈指数增长，导致数据集覆盖呈指数级稀疏，分布外（OOD）联合动作不可避免。部分动作替换（PAR）通过将一部分智能体锚定到数据集动作来缓解这一问题，但现有方法依赖于以高昂计算成本枚举多个子集配置，且无法适应变化的状态。我们提出PLCQL，该框架将PAR子集选择表述为上下文赌博机问题，并使用带不确定性加权奖励的近端策略优化来学习依赖状态的PAR策略。这一自适应策略在每个更新步骤动态决定替换多少个智能体，在策略改进与保守价值估计之间取得平衡。我们证明了一个价值误差界，表明估计误差随偏离智能体的期望数量线性增长。与此前基于PAR的方法SPaCQL相比，PLCQL将每次迭代的Q函数评估次数从n减少到1，显著提升了计算效率。实验上，PLCQL在MPE、MaMuJoCo和SMAC基准的66%的任务上取得最高归一化分数，在84%的任务上优于SPaCQL，同时大幅降低了计算成本。
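
To make the partial action replacement idea concrete, here is a minimal sketch that builds a joint action in which `k` agents deviate to the learned policy's actions while the rest stay anchored to the dataset. In PLCQL the number of replaced agents would come from the learned state-dependent PAR policy; here `k` is a plain argument, and choosing the replaced agents uniformly at random is an assumption of this sketch, as are all function and variable names.

```python
import random


def partial_action_replacement(dataset_actions, policy_actions, k, rng):
    """Return a joint action where exactly `k` agents use policy actions.

    The remaining agents keep their dataset actions. The uniformly random
    choice of which agents deviate is an illustrative assumption.
    """
    n = len(dataset_actions)
    if len(policy_actions) != n:
        raise ValueError("both action lists need one entry per agent")
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and the number of agents")
    replaced = set(rng.sample(range(n), k))
    joint = [policy_actions[i] if i in replaced else dataset_actions[i] for i in range(n)]
    return joint, replaced


dataset_actions = ["d0", "d1", "d2", "d3"]  # actions stored in the offline dataset
policy_actions = ["p0", "p1", "p2", "p3"]   # actions proposed by the current policy
joint_action, replaced_agents = partial_action_replacement(
    dataset_actions, policy_actions, k=2, rng=random.Random(0)
)
```

With `k = 0` the joint action is fully in-distribution, and with `k = n` it is the fully policy-driven joint action, so `k` controls how far the value target moves away from dataset support.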

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

与你共见：多模态推理中的感知-推理共进化

Authors: Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28618
Pdf link: https://arxiv.org/pdf/2603.28618
Abstract Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.
中文摘要 带可验证奖励的强化学习（RLVR）显著提升了多模态大语言模型（MLLM）的推理能力。然而，现有RLVR方法通常依赖结果驱动的优化，用仅基于最终答案的共享奖励同时更新感知和推理。这种共享奖励模糊了信用分配，往往改善了推理模式，却无法可靠地提升上游视觉证据提取的准确性。为解决这一感知瓶颈，我们提出PRCO（感知-推理共进化），一个采用共享策略的双角色RLVR框架。PRCO由两个协作角色组成：观察者（Observer）生成针对问题的证据描述，求解器（Solver）基于该描述预测最终答案。关键在于，PRCO采用角色特定的奖励信号：求解器使用针对最终答案的可验证结果奖励进行优化，观察者则获得由求解器下游成功导出的效用奖励。在八个具有挑战性的多模态推理基准上的大量实验表明，PRCO在不同模型规模上带来一致的提升，平均准确率比基础模型高出7分以上，优于此前开源的强化学习调优基线。
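
The role-specific reward split can be illustrated with a small sketch. The abstract only states that the Solver receives a verifiable outcome reward and that the Observer's utility reward is derived from the Solver's downstream success; the exact-match check and the averaging over several Solver answers below are assumptions, and the function names are hypothetical.

```python
def solver_reward(predicted, reference):
    """Verifiable outcome reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def observer_utility(solver_answers, reference):
    """Utility of one evidence caption, assumed here to be the mean Solver
    reward over the answers sampled while conditioning on that caption."""
    if not solver_answers:
        return 0.0
    return sum(solver_reward(a, reference) for a in solver_answers) / len(solver_answers)


# Three Solver answers sampled from the same Observer caption; the reference is "4".
answers_for_caption = ["4", "5", " 4 "]
caption_utility = observer_utility(answers_for_caption, "4")
```

Under this assumed scheme a caption that lets the Solver answer correctly more often earns a higher reward, so credit for perception is separated from credit for reasoning.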

DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

DreamLite：一款轻量级的设备内统一图像生成与编辑模型

Authors: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28713
Pdf link: https://arxiv.org/pdf/2603.28713
Abstract Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves 0.72 on GenEval for image generation and 4.11 on ImgEdit for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
中文摘要 扩散模型在文本到图像（T2I）生成和文本引导的图像编辑方面都取得了显著进展。然而，这些模型通常包含数十亿参数，导致延迟高、部署难度增加。端侧扩散模型虽然提高了效率，但主要专注于T2I生成，缺乏对图像编辑的支持。本文提出DreamLite，一个紧凑的统一端侧扩散模型（0.39B），在单一网络内同时支持T2I生成和文本引导的图像编辑。DreamLite基于剪枝后的移动端U-Net骨干网络，并通过潜在空间中的上下文空间拼接来统一条件输入。它将图像水平拼接作为输入，生成任务使用（目标|空白）配置，编辑任务使用（目标|源）配置。为稳定这一紧凑模型的训练，我们引入任务渐进式联合预训练策略，依次针对T2I、编辑和联合任务。经过高质量SFT和强化学习后，DreamLite在图像生成的GenEval上达到0.72，在图像编辑的ImgEdit上达到4.11，优于现有端侧模型，并与多个服务器端模型保持竞争力。通过步数蒸馏，我们将去噪过程进一步减少到仅4步，使DreamLite能够在小米14智能手机上不到1秒内生成或编辑一张1024 x 1024的图像。据我们所知，DreamLite是首个同时支持图像生成和图像编辑的统一端侧扩散模型。
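
The (target | blank) and (target | source) input layouts can be sketched with plain nested lists standing in for single-channel latents. This is only an illustration of horizontal concatenation: representing the blank half as zeros, the list-based latents, and the names `hconcat` and `build_model_input` are assumptions, not details taken from the paper.

```python
def hconcat(left, right):
    """Concatenate two 2-D grids (lists of rows) side by side."""
    if len(left) != len(right):
        raise ValueError("latents must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_model_input(target, source=None):
    """Build (target | source) for editing, or (target | blank) for generation.

    The blank half is filled with zeros here, which is an assumption.
    """
    if source is None:
        source = [[0.0] * len(row) for row in target]
    return hconcat(target, source)


noisy_target = [[0.3, -1.2], [0.8, 0.1]]   # stand-in for the noisy target latent
source_latent = [[1.0, 2.0], [3.0, 4.0]]   # stand-in for the encoded source image

generation_input = build_model_input(noisy_target)                # (target | blank)
editing_input = build_model_input(noisy_target, source_latent)    # (target | source)
```

Because both tasks share one input layout, a single network can serve generation and editing without task-specific conditioning branches.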

Dynamic Dual-Granularity Skill Bank for Agentic RL

面向智能体强化学习的动态双粒度技能库

Authors: Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dongbin Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28716
Pdf link: https://arxiv.org/pdf/2603.28716
Abstract Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
中文摘要 智能体强化学习（RL）可以从可复用的经验中显著获益，但现有基于技能的方法主要提取轨迹级指导，且往往缺乏维护不断演化的技能记忆的原则性机制。我们提出D2Skill，一个面向智能体强化学习的动态双粒度技能库，将可复用经验组织为用于高层指导的任务技能，以及用于细粒度决策支持和错误纠正的步骤技能。D2Skill在同一策略下进行成对的基线rollout和技能注入rollout，联合训练策略和技能库，并利用两者的性能差距导出事后效用信号，用于技能更新和策略优化。技能库完全由训练时的经验构建，通过反思不断扩充，并通过效用感知的检索和剪枝进行维护。在ALFWorld和WebShop上使用Qwen2.5-7B-Instruct和Qwen3-4B-Instruct-2507的实验表明，D2Skill相比无技能基线将成功率稳定提升10-20分。进一步的消融和分析表明，双粒度技能建模和动态技能维护对这些提升都至关重要，同时学到的技能具有更高的效用，能够跨评估设置迁移，且只带来适度的训练开销。
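
The hindsight utility signal can be sketched as the return gap between paired rollouts. The abstract confirms only that the gap between baseline and skill-injected rollouts drives skill updating and pruning; the moving-average update, the pruning threshold, the skill names, and the function names below are assumptions made for illustration.

```python
from statistics import mean


def hindsight_utility(baseline_returns, skill_returns):
    """Performance gap between skill-injected and baseline rollouts."""
    return mean(skill_returns) - mean(baseline_returns)


def update_skill_bank(bank, skill_id, gap, lr=0.5, prune_below=-0.1):
    """Blend the new gap into the stored utility (assumed moving average)
    and drop the skill if its utility falls below `prune_below`."""
    new_utility = (1 - lr) * bank.get(skill_id, 0.0) + lr * gap
    if new_utility < prune_below:
        bank.pop(skill_id, None)
    else:
        bank[skill_id] = new_utility
    return bank


bank = {}
# A helpful skill: success rises from 0.5 to 0.75 when the skill is injected.
helpful_gap = hindsight_utility([0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0])
update_skill_bank(bank, "open_drawer_first", helpful_gap)
# A harmful skill: success drops from 0.5 to 0.0, so the skill is pruned.
harmful_gap = hindsight_utility([0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
update_skill_bank(bank, "always_go_left", harmful_gap)
```

The same gap could also be added to the policy's training signal, which is how the abstract describes coupling skill maintenance with policy optimization.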

Stepwise Credit Assignment for GRPO on Flow-Matching Models

流匹配模型上GRPO的逐步信用分配

Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28718
Pdf link: https://arxiv.org/pdf/2603.28718
Abstract Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
中文摘要 Flow-GRPO成功地将强化学习应用于流模型，但在所有步骤上采用均匀的信用分配。这忽略了扩散生成的时间结构：早期步骤决定构图和内容（低频结构），后期步骤则刻画细节和纹理（高频细节）。此外，仅基于最终图像分配均匀信用，可能会无意中奖励次优的中间步骤，尤其是当错误在扩散轨迹后期被纠正时。我们提出Stepwise-Flow-GRPO，根据每一步带来的奖励提升来分配信用。通过利用Tweedie公式获得中间奖励估计并引入基于增益的优势，我们的方法实现了更高的样本效率和更快的收敛。我们还引入了一种受DDIM启发的SDE，在为策略梯度保留随机性的同时提升奖励质量。
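
A small sketch can show how credit based on each step's reward improvement differs from uniform credit. It assumes intermediate reward estimates are already available for every trajectory in a sampled group (the abstract obtains them via Tweedie's formula, which is not reproduced here), defines a step's gain as the change in estimated reward across that step, and normalizes gains across the group at each step in the usual GRPO style. The normalization details and all names are assumptions.

```python
from statistics import mean, pstdev


def stepwise_gains(reward_estimates):
    """Gain of each step: change in the estimated reward across that step."""
    return [after - before for before, after in zip(reward_estimates, reward_estimates[1:])]


def gain_based_advantages(group_reward_estimates, eps=1e-8):
    """Per-step advantages: each gain is standardized across the group."""
    gains = [stepwise_gains(r) for r in group_reward_estimates]
    n_steps = len(gains[0])
    advantages = [[0.0] * n_steps for _ in gains]
    for t in range(n_steps):
        column = [g[t] for g in gains]
        mu, sigma = mean(column), pstdev(column)
        for i, g in enumerate(column):
            advantages[i][t] = (g - mu) / (sigma + eps)
    return advantages


# Two trajectories in one group, each with reward estimates at 3 time points.
group = [
    [0.0, 0.5, 1.0],   # improves by 0.5 at every step
    [0.0, 0.25, 0.5],  # improves by 0.25 at every step
]
advantages = gain_based_advantages(group)
```

The gains of one trajectory sum to its final reward estimate minus its initial one, so the total credit is preserved while being redistributed to the steps that actually produced the improvement.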

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

SOLE-R1：视频语言推理作为机器人强化学习的唯一奖励

Authors: Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28730
Pdf link: https://arxiv.org/pdf/2603.28730
Abstract Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
中文摘要 视觉语言模型（VLM）在多种任务中展现出令人印象深刻的能力，促使人们尝试利用这些模型来监督机器人学习。然而，当在强化学习（RL）中用作评估器时，当今最强的模型也常常在部分可观测和分布偏移下失效，使策略得以利用感知错误而不是真正完成任务。为解决这一局限，我们提出SOLE-R1（Self-Observing LEarner），一个专门设计用作在线强化学习唯一奖励信号的视频语言推理模型。仅给定原始视频观测和自然语言目标，SOLE-R1逐时间步执行时空思维链（CoT）推理，并生成可直接用作奖励的稠密任务进度估计。为训练SOLE-R1，我们开发了大规模视频轨迹与推理合成流水线，生成与连续进度监督对齐的、具有时间锚定的CoT轨迹。这些数据与基础的空间推理和多帧时间推理数据相结合，并通过一个将监督微调与基于可验证奖励的强化学习相耦合的混合框架来训练模型。在四个不同的仿真环境和一个真实机器人场景中，SOLE-R1实现了从随机初始化开始的零样本在线强化学习：机器人无需真值奖励、成功指示、演示或任务特定调优即可学会此前未见过的操作任务。SOLE-R1在24个未见任务上取得成功，并显著优于包括GPT-5和Gemini-3-Pro在内的强大视觉语言奖励模型，同时对奖励欺骗（reward hacking）表现出明显更强的鲁棒性。

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Gen-Searcher：强化面向图像生成的智能体搜索

Authors: Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28767
Pdf link: https://arxiv.org/pdf/2603.28767
Abstract Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge and thus often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models along multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.
中文摘要 近期的图像生成模型在生成高保真、照片级真实感图像方面展现出强大能力。然而，它们从根本上受限于冻结的内部知识，因此在知识密集型或需要最新信息的现实场景中常常失败。本文提出Gen-Searcher，首次尝试训练搜索增强的图像生成智能体，该智能体通过多跳推理和搜索来收集有依据的生成所需的文本知识和参考图像。为此，我们构建了定制的数据流水线，并整理了两个高质量数据集Gen-Searcher-SFT-10k和Gen-Searcher-RL-6k，包含多样化的搜索密集型提示及对应的真值合成图像。我们还提出KnowGen，一个全面的基准，明确要求图像生成依赖基于搜索的外部知识，并从多个维度评估模型。基于这些资源，我们先用SFT训练Gen-Searcher，再进行带双重奖励反馈的智能体强化学习，该反馈结合基于文本和基于图像的奖励，为GRPO训练提供更稳定、信息更丰富的学习信号。实验表明，Gen-Searcher带来了显著提升，使Qwen-Image在KnowGen上提升约16分，在WISE上提升15分。我们希望这项工作能成为图像生成搜索智能体的开放基础，并完全开源我们的数据、模型和代码。

Keyword: diffusion policy

UMI-Underwater: Learning Underwater Manipulation without Underwater Teleoperation

UMI-Underwater：无需水下遥操作的水下操作学习

Authors: Hao Li, Long Yin Chung, Jack Goler, Ryan Zhang, Xiaochi Xie, Huy Ha, Shuran Song, Mark Cutkosky
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.27012
Pdf link: https://arxiv.org/pdf/2603.27012
Abstract Underwater robotic grasping is difficult due to degraded, highly variable imagery and the expense of collecting diverse underwater demonstrations. We introduce a system that (i) autonomously collects successful underwater grasp demonstrations via a self-supervised data collection pipeline and (ii) transfers grasp knowledge from on-land human demonstrations through a depth-based affordance representation that bridges the on-land-to-underwater domain gap and is robust to lighting and color shift. An affordance model trained on on-land handheld demonstrations is deployed underwater zero-shot via geometric alignment, and an affordance-conditioned diffusion policy is then trained on underwater demonstrations to generate control actions. In pool experiments, our approach improves grasping performance and robustness to background shifts, and enables generalization to objects seen only in on-land data, outperforming RGB-only baselines. Code, videos, and additional results are available at this https URL.
中文摘要 由于图像质量退化且变化很大，以及收集多样化水下演示的成本高昂，水下机器人抓取十分困难。我们提出一套系统：（i）通过自监督数据收集流水线自主收集成功的水下抓取演示；（ii）通过基于深度的可供性（affordance）表示迁移陆上人类演示中的抓取知识，该表示弥合了陆上到水下的域差距，并对光照和颜色偏移具有鲁棒性。在陆上手持演示上训练的可供性模型通过几何对齐以零样本方式部署到水下，随后在水下演示上训练一个以可供性为条件的扩散策略来生成控制动作。在水池实验中，我们的方法提升了抓取性能和对背景变化的鲁棒性，并能泛化到仅在陆上数据中出现过的物体，优于仅使用RGB的基线。代码、视频及更多结果可在此 https URL 获取。

Tele-Catch: Adaptive Teleoperation for Dexterous Dynamic 3D Object Catching

Tele-Catch：面向灵巧动态三维物体接取的自适应遥操作

Authors: Weiguang Zhao, Junting Dong, Rui Zhang, Kailin Li, Qin Zhao, Kaizhu Huang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.28427
Pdf link: https://arxiv.org/pdf/2603.28427
Abstract Teleoperation is a key paradigm for transferring human dexterity to robots, yet most prior work targets tasks in which objects are initially static, such as grasping or manipulation. Dynamic object catching, where objects move before contact, remains underexplored. Pure teleoperation in this task often fails due to timing, pose, and force errors, highlighting the need for shared autonomy that combines human input with autonomous policies. To this end, we present Tele-Catch, a systematic framework for dexterous hand teleoperation in dynamic object catching. At its core, we design DAIM, a dynamics-aware adaptive integration mechanism that realizes shared autonomy by fusing glove-based teleoperation signals into the diffusion policy denoising process. It adaptively modulates control based on the state of the interaction object. To improve policy robustness, we introduce DP-U3R, which integrates unsupervised geometric representations from point cloud observations into diffusion policy learning, enabling geometry-aware decision making. Extensive experiments demonstrate that Tele-Catch significantly improves accuracy and robustness in dynamic catching tasks, while also exhibiting consistent gains across distinct dexterous hand embodiments and previously unseen object categories.
中文摘要 遥操作是将人类灵巧性迁移到机器人的关键范式，但以往大多数工作针对的是物体初始静止的任务，例如抓取或操作。动态物体接取，即物体在接触前处于运动状态，仍未得到充分探索。在该任务中，纯遥操作常因时机、位姿和力的误差而失败，凸显了将人类输入与自主策略相结合的共享自主的必要性。为此，我们提出Tele-Catch，一个用于动态物体接取中灵巧手遥操作的系统化框架。其核心是我们设计的DAIM，一种动力学感知的自适应融合机制，通过将基于手套的遥操作信号融合进扩散策略的去噪过程来实现共享自主。它根据交互物体的状态自适应地调节控制。为提升策略鲁棒性，我们引入DP-U3R，将来自点云观测的无监督几何表示整合进扩散策略学习，实现几何感知的决策。大量实验表明，Tele-Catch在动态接取任务中显著提升了准确性和鲁棒性，并在不同的灵巧手形态和此前未见的物体类别上都展现出一致的提升。