Arxiv Papers of Today

Generated: 2026-03-10 16:52:10 (UTC+8); arXiv release time: 2026-03-10 20:00 EDT (2026-03-11 08:00 UTC+8)

93 relevant papers today

Keyword: reinforcement learning

Autonomous AI Agents for Option Hedging: Enhancing Financial Stability through Shortfall Aware Reinforcement Learning

期权对冲的自主人工智能代理：通过缺口感知强化学习提升财务稳定性

Authors: Minxuan Hu, Ziheng Chen, Jiayu Yi, Wenxi Sun
Subjects: Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Risk Management (q-fin.RM)
Arxiv link: https://arxiv.org/abs/2603.06587
Pdf link: https://arxiv.org/pdf/2603.06587
Abstract The deployment of autonomous AI agents in derivatives markets has widened a practical gap between static model calibration and realized hedging outcomes. We introduce two reinforcement learning frameworks, a novel Replication Learning of Option Pricing (RLOP) approach and an adaptive extension of Q-learner in Black-Scholes (QLBS), that prioritize shortfall probability and align learning objectives with downside sensitive hedging. Using listed SPY and XOP options, we evaluate models using realized path delta hedging outcome distributions, shortfall probability, and tail risk measures such as Expected Shortfall. Empirically, RLOP reduces shortfall frequency in most slices and shows the clearest tail-risk improvements in stress, while implied volatility fit often favors parametric models yet poorly predicts after-cost hedging performance. This friction-aware RL framework supports a practical approach to autonomous derivatives risk management as AI-augmented trading systems scale.
中文摘要 自主人工智能代理在衍生品市场的部署加剧了静态模型校准与实现对冲结果之间的实际差距。我们引入了两种强化学习框架：一种新颖的期权定价复制学习（RLOP）方法，以及Black-Scholes算法中Q-learner的自适应扩展（QLBS），这些框架优先考虑短缺概率，并将学习目标与下行敏感对冲对齐。利用列出的SPY和XOP期权，我们评估了采用实现路径δ对冲结果分布、短缺概率和尾部风险指标（如预期短缺）的模型。从经验上看，RLOP在大多数切片中减少了缺口频率，并在压力下显示出最明显的尾风险改善，而隐含波动率拟合通常偏向参数模型，但对成本后对冲表现的预测较差。这一具摩擦感知的强化学习框架支持了随着人工智能增强交易系统规模扩大，自主衍生品风险管理的实用方法。
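
The two risk measures named in the abstract, shortfall probability and Expected Shortfall, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the P&L values are hypothetical (not data from the paper), a shortfall is taken to mean P&L below a threshold, and Expected Shortfall uses the common empirical definition (mean loss over the worst tail fraction of outcomes). The paper's own estimators may differ.

```python
def shortfall_probability(pnl, threshold=0.0):
    """Fraction of hedging outcomes whose P&L falls below the threshold."""
    return sum(1 for x in pnl if x < threshold) / len(pnl)


def expected_shortfall(pnl, tail_fraction=0.05):
    """Empirical Expected Shortfall: mean loss over the worst tail of outcomes."""
    losses = sorted((-x for x in pnl), reverse=True)  # largest loss first
    k = max(1, round(tail_fraction * len(losses)))    # number of tail outcomes
    return sum(losses[:k]) / k


# Hypothetical after-cost hedging P&L per path (illustrative numbers only).
hypothetical_pnl = [-4.0, -1.0, 0.5, 1.0, 1.5, 2.0, -0.5, 0.0, 3.0, -2.0]

print(shortfall_probability(hypothetical_pnl))                  # share of losing paths
print(expected_shortfall(hypothetical_pnl, tail_fraction=0.2))  # mean of the 2 worst losses
```

Comparing models on these two numbers, rather than on implied-volatility fit, matches the abstract's point that calibration quality can be a poor predictor of after-cost hedging performance.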

Know When You're Wrong: Aligning Confidence with Correctness for LLM Error Detection

知道你错了：将信心与正确性对齐以进行LLM错误检测

Authors: Xie Xiaohu, Liu Xiaohu, Yao Benjamin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.06604
Pdf link: https://arxiv.org/pdf/2603.06604
Abstract As large language models (LLMs) are increasingly deployed in critical decision-making systems, the lack of reliable methods to measure their uncertainty presents a fundamental trustworthiness risk. We introduce a normalized confidence score based on output anchor token probabilities: classification labels for structured tasks and self-evaluation responses (Yes/No) for open-ended generation. This enables direct detection of errors and hallucinations with minimal overhead and without external validation. We make three key contributions. First, we propose a normalized confidence score and self-evaluation framework that exposes reliable confidence estimates for error detection across seven diverse benchmark tasks and five LLMs of varying architectures and sizes. Second, our theoretical analysis reveals that supervised fine-tuning (SFT) yields well-calibrated confidence through maximum-likelihood estimation, whereas reinforcement learning methods (PPO, GRPO) and DPO induce overconfidence via reward exploitation. Third, we propose post-RL SFT with self-distillation to restore confidence reliability in RL-trained models. Empirical results demonstrate that SFT improves average confidence-correctness AUROC from 0.806 to 0.879 and reduces calibration error from 0.163 to 0.034 on Qwen3-4B, while GRPO and DPO degrade confidence reliability. We demonstrate practical value through adaptive retrieval-augmented generation (RAG) that selectively retrieves context when the model lacks confidence, using only 58% of retrieval operations to recover 95% of the maximum achievable accuracy gain on TriviaQA.
中文摘要 随着大型语言模型（LLMs）越来越多地应用于关键决策系统，缺乏可靠方法来衡量其不确定性，带来了根本性的可信度风险。我们引入基于输出锚点代币概率的归一化置信度评分：结构化任务使用分类标签，开放式生成时采用自我评估反应（是/否）。这使得错误和幻觉能够直接检测，且开销最小，无需外部验证。我们做出了三项关键贡献。首先，我们提出一个归一化置信度评分和自我评估框架，展示七个不同基准任务和五个不同架构和规模的大型语言模型的可靠置信估计，用于错误检测。其次，我们的理论分析表明，监督式微调（SFT）通过最大似然估计产生良好校准的置信度，而强化学习方法（PPO、GRPO）和DPO则通过奖励利用诱导过度自信。第三，我们提出采用后强化学习SFT的自蒸馏技术，以恢复强化学习模型的置信度。实证结果表明，SFT将Qwen3-4B的平均置信度正确度从0.806提升至0.879，校准误差从0.163降至0.034，而GRPO和DPO则降低了置信度。我们通过自适应检索增强生成（RAG）展示了其实用价值，该生成在模型缺乏信心时选择性地检索上下文，仅用58%的检索作就能恢复TriviaQA中95%的最大准确率提升
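
As a rough illustration of the idea in this abstract, the sketch below normalizes the probability mass over a set of anchor tokens and then measures calibration with a standard binned expected calibration error (ECE). Both choices are assumptions: the abstract does not spell out the exact normalization or which calibration metric is reported, and the probabilities are made up for the example.

```python
def normalized_confidence(anchor_probs, predicted):
    """Share of the anchor-token probability mass assigned to the predicted anchor."""
    total = sum(anchor_probs.values())
    return anchor_probs[predicted] / total


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / len(confidences) * abs(accuracy - mean_conf)
    return ece


# Hypothetical self-evaluation output: the remaining mass sits on non-anchor tokens.
score = normalized_confidence({"Yes": 0.6, "No": 0.2}, predicted="Yes")
print(score)

# Hypothetical confidences with 1 = correct answer, 0 = incorrect answer.
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0]))
```

In an adaptive RAG setting like the one described, a score of this kind could gate retrieval: fetch extra context only when the confidence falls below a chosen threshold.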

Multi-Agent DRL for V2X Resource Allocation: Disentangling Challenges and Benchmarking Solutions

多代理 DRL 用于 V2X 资源分配：解开挑战与基准测试解决方案

Authors: Siyuan Wang, Lei Lei, Pranav Maheshwari, Sam Bellefeuille, Kan Zheng, Dusit Niyato
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06607
Pdf link: https://arxiv.org/pdf/2603.06607
Abstract Multi-agent deep reinforcement learning (DRL) has emerged as a promising approach for radio resource allocation (RRA) in cellular vehicle-to-everything (C-V2X) networks. However, the multifaceted challenges inherent to multi-agent reinforcement learning (MARL) - including non-stationarity, coordination difficulty, large action spaces, partial observability, and limited robustness and generalization - are often intertwined, making it difficult to understand their individual impact on performance in vehicular environments. Moreover, existing studies typically rely on different baseline MARL algorithms, and a systematic comparison of their capabilities in addressing specific challenges in C-V2X RRA remains lacking. In this paper, we bridge this gap by formulating C-V2X RRA as a sequence of multi-agent interference games with progressively increasing complexity, each designed to isolate a key MARL challenge. Based on these formulations, we construct a suite of learning tasks that enable controlled evaluation of performance degradation attributable to each challenge. We further develop large-scale, diverse training and testing datasets using SUMO-generated highway traces to capture a wide range of vehicular topologies and corresponding interference patterns. Through extensive benchmarking of representative MARL algorithms, we identify policy robustness and generalization across diverse vehicular topologies as the dominant challenge in C-V2X RRA. We further show that, on the most challenging task, the best-performing actor-critic method outperforms the value-based approach by 42%. By emphasizing the need for zero-shot policy transfer to both seen and unseen topologies at runtime, and by open-sourcing the code, datasets, and interference-game benchmark suite, this work provides a systematic and reproducible foundation for evaluating and advancing MARL algorithms in vehicular networks.
中文摘要 多智能体深度强化学习（DRL）已成为蜂窝载体到全网（C-V2X）网络中无线资源分配（RRA）的一种有前景的方法。然而，多智能体强化学习（MARL）固有的多方面挑战——包括非平稳性、协调难度、大动作空间、部分可观性以及有限的鲁棒性和泛化性——往往相互交织，使得理解它们对车辆环境中性能的具体影响变得困难。此外，现有研究通常依赖不同的基线MARL算法，且缺乏系统性比较它们在解决C-V2X RRA具体挑战中的能力。本文通过将C-V2X RRA构建为一系列复杂度递增的多智能体干扰博弈，每个博弈旨在隔离关键的MARL挑战，弥合了这一空白。基于这些表述，我们构建了一套学习任务，能够对每个挑战导致的性能下降进行受控评估。我们还进一步开发了利用SUMO生成的高速公路轨迹的大规模、多样化训练和测试数据集，捕捉各种车辆拓扑结构及相应的干扰模式。通过对代表性MARL算法的广泛基准测试，我们识别出C-V2X RRA中政策的鲁棒性和跨多种车辆拓扑的普遍性。我们还进一步证明，在最具挑战性的任务中，表现最好的演员-批评方法比基于价值的方法高出42%。通过强调在运行时对可见和不可见拓扑进行零样本策略转移的必要性，以及开源代码、数据集和干扰博弈基准测试套件，这项工作为评估和推进车载网络中的MARL算法提供了系统且可重复的基础。

Scaling Strategy, Not Compute: A Stand-Alone, Open-Source StarCraft II Benchmark for Accessible Reinforcement Learning Research

扩展策略，而非计算：一个独立的开源星际争霸II无障碍强化学习基准测试

Authors: Sourav Panda, Shreyash Kale, Tanmay Ambadkar, Abhinav Verma, Jonathan Dodge
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06608
Pdf link: https://arxiv.org/pdf/2603.06608
Abstract The research community lacks a middle ground between StarCraft II's full game and its mini-games. The full game's sprawling state-action space renders reward signals sparse and noisy, but in mini-games simple agents saturate performance. This complexity gap hinders steady curriculum design and prevents researchers from experimenting with modern Reinforcement Learning algorithms in RTS environments under realistic compute budgets. To fill this gap, we present the Two-Bridge Map Suite, the first entry in an open-source benchmark series we purposely engineered as an intermediate benchmark to sit between these extremes. By disabling economy mechanics such as resource collection, base building, and fog-of-war, the environment isolates two core tactical skills: long-range navigation and micro-combat. Preliminary experiments show that agents learn coherent maneuvering and engagement behaviors without imposing full-game computational costs. Two-Bridge is released as a lightweight, Gym-compatible wrapper on top of PySC2, with maps, wrappers, and reference scripts fully open-sourced to encourage broad adoption as a standard benchmark.
中文摘要 研究界缺乏星际争霸II完整游戏与小游戏之间的中间地带。完整游戏中庞大的状态动作空间使得奖励信号稀疏且嘈杂，但在小游戏中，简单的代理却能完全满足性能。这种复杂度差距阻碍了稳定的课程设计，并阻碍研究人员在现实计算预算下尝试现代强化学习算法。为了填补这一空白，我们推出了双桥地图套件，这是我们有意设计的开源基准系列的第一篇，作为介于这些极端之间的中间基准。通过禁用资源采集、基地建设和战争迷雾等经济机制，环境孤立了两项核心战术技能：远程导航和微观战斗。初步实验表明，智能体在不施加全博弈计算成本的情况下，能够学习连贯的机动和交战行为。Two-Bridge 作为轻量级、兼容 Gym 的包装发布，基于 PySC2，地图、包装器和参考脚本完全开源，鼓励广泛采用，作为标准基准。

Not all tokens are needed(NAT): token efficient reinforcement learning

并非所有令牌都必须（NAT）：令牌高效的强化学习

Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06619
Pdf link: https://arxiv.org/pdf/2603.06619
Abstract Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC), both of which reduce forward and backward compute and memory without modifying the reward computation or rollout pipeline. Across mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of tokens, providing an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories. In our experiments, RPC saves 18% peak GPU memory and 29% forward and backward RL training time for Qwen3-8B.
中文摘要 强化学习（RL）已成为大型语言模型进步的关键驱动力，但将强化学习扩展到长思维链（CoT）轨迹的限制越来越多，受限于对每个生成的代币进行反向传播。即使采用优化的推送引擎，完整令牌更新也可能占用训练总成本的很大一部分，使令牌长度成为强化学习的隐性税收。我们引入了“并非所有令牌都必需”（NAT），这是一个统一框架，使令牌预算成为一流的优化原语。NAT只使用选定的部分生成的令牌来更新策略，同时保留全序列强化学习的学习信号。核心思想是通过Horvitz-Thompson重权重实现无偏的部分令牌策略梯度估计，确保即使进行了子抽样，梯度仍保持统计正确。我们通过两种简单的即插即用令牌选择方案实现NAT：统一随机采样（URS）和随机前缀切割（RPC），这两种方案都能减少前向和后向计算及内存，而无需修改奖励计算或推广流水线。在数学推理基准测试中，NAT在使用少达50%代币的情况下，能够匹配全代币的GRPO性能，为强化学习扩展到长轨迹限制的高效且正交的路径。在我们的实验中，RPC为Qwen3-8B节省了18%的峰值GPU内存和29%的前向和后向强化学习时间。
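
The Horvitz-Thompson reweighting mentioned in this abstract can be checked on a toy case. The sketch below assumes each token is kept independently with the same inclusion probability (one possible reading of uniform random sampling; the paper's sampling design may differ) and uses scalar stand-ins for per-token gradient terms. Dividing each kept term by its inclusion probability makes the subsampled sum equal the full-sequence sum in expectation, which the code verifies by enumerating every possible mask.

```python
from itertools import product


def ht_estimate(token_terms, mask, inclusion_prob):
    """Horvitz-Thompson estimate of the full sum using only the kept tokens."""
    return sum(g / inclusion_prob for g, keep in zip(token_terms, mask) if keep)


def exact_expectation(token_terms, inclusion_prob):
    """Average the estimator over all masks, each weighted by its probability."""
    expectation = 0.0
    for mask in product([False, True], repeat=len(token_terms)):
        mask_prob = 1.0
        for keep in mask:
            mask_prob *= inclusion_prob if keep else 1.0 - inclusion_prob
        expectation += mask_prob * ht_estimate(token_terms, mask, inclusion_prob)
    return expectation


# Hypothetical per-token contributions to a policy-gradient objective.
token_terms = [0.5, -1.0, 2.0, 0.25]

print(sum(token_terms))                     # full-token value
print(exact_expectation(token_terms, 0.5))  # expected value of the subsampled estimate
```

The estimate stays unbiased for any inclusion probability in (0, 1]; smaller probabilities save more compute at the cost of higher variance.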

Advances in GRPO for Generation Models: A Survey

生成模型GRPO的进展：综述

Authors: Zexiang Liu, Xianglong He, Yangguang Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06623
Pdf link: https://arxiv.org/pdf/2603.06623
Abstract Large-scale flow matching models have achieved strong performance across generative tasks such as text-to-image, video, 3D, and speech synthesis. However, aligning their outputs with human preferences and task-specific objectives remains challenging. Flow-GRPO extends Group Relative Policy Optimization (GRPO) to generation models, enabling stable reinforcement learning alignment for generative systems. Since its introduction, Flow-GRPO has triggered rapid research growth, spanning methodological refinements and diverse application domains. This survey provides a comprehensive review of Flow-GRPO and its subsequent developments. We organize existing work along two primary dimensions. First, we analyze methodological advances beyond the original framework, including reward signal design, credit assignment, sampling efficiency, diversity preservation, reward hacking mitigation, and reward model construction. Second, we examine extensions of GRPO-based alignment across generative paradigms and modalities, including text-to-image, video generation, image editing, speech and audio, 3D modeling, embodied vision-language-action systems, unified multimodal models, autoregressive and masked diffusion models, and restoration tasks. By synthesizing theoretical insights and practical adaptations, this survey highlights Flow-GRPO as a general alignment framework for modern generative models and outlines key open challenges for scalable and robust reinforcement-based generation.
中文摘要 大规模流量匹配模型在文本转图像、视频、三维和语音合成等生成任务中取得了优异表现。然而，将输出与人类偏好和任务特定目标对齐仍具挑战性。Flow-GRPO将组相对策略优化（GRPO）扩展到生成模型，实现生成系统稳定的强化学习对齐。自推出以来，Flow-GRPO推动了快速的研究发展，涵盖了方法论的精炼和多样化的应用领域。本调查对Flow-GRPO及其后续发展进行了全面回顾。我们按两个主要维度组织现有工作。首先，我们分析了超出原始框架的方法论进展，包括奖励信号设计、信用分配、抽样效率、多样性保持、奖励黑客缓解和奖励模型构建。其次，我们考察基于GRPO的对齐在生成范式和模态上的扩展，包括文本转图像、视频生成、图像编辑、语音与音频、3D建模、具身视觉-语言-动作系统、统一多模态模型、自回归和掩蔽扩散模型以及恢复任务。通过综合理论见解和实际调整，本综述突出了Flow-GRPO作为现代生成模型的通用对齐框架，并概述了可扩展且稳健的基于强化生成面临的关键挑战。

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR：通过多模态过程对齐实现忠实的视觉推理

Authors: Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06652
Pdf link: https://arxiv.org/pdf/2603.06652
Abstract Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.
中文摘要 强化学习近年来提升了大型语言模型和多模态大型语言模型的推理能力，但现行的奖励设计强调最终答案的正确性，因此容忍过程幻觉——即模型在误判视觉证据时得出正确答案的情况。我们通过PaLMR解决这一流程层面的不一致，该框架不仅使结果一致，也使推理过程本身保持一致。PaLMR由两个互补组成部分组成：一个感知对齐的数据层，构建带有结构化伪真实信息和可验证视觉事实的过程感知推理数据;以及一个过程对齐的优化层，构建带有过程感知评分功能的层级奖励融合方案，以鼓励视觉忠实的思维链并提升训练稳定性。在Qwen2.5-VL-7B上的实验显示，我们的方法显著减少了推理幻觉，提升了视觉推理的真实性，在HallusionBench上取得了最先进的结果，同时在MMMU、MathVista和MathVerse上保持了强劲的性能。这些发现表明，PaLMR为流程对齐的多模态推理提供了一条有原则且实用的路径，提升了MLLM的可靠性和可解释性。

Digital Twin-Enabled Mobility-Aware Cooperative Caching in Vehicular Edge Computing

数字孪生驱动的移动感知协作缓存在车辆边缘计算中

Authors: Jiahao Zeng, Zhenkui Shi, Chunpei Li, Mengkai Yan, Hongliang Zhang, Sihan Chen, Xiantao Hu, Xianxian Li
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.06653
Pdf link: https://arxiv.org/pdf/2603.06653
Abstract With the advancement of vehicle-to-vehicle (V2V) ad hoc networks and wireless communication technologies, mobile edge caching has become a key enabler for enhancing network performance and user experience. However, traditional federated learning-based collaborative caching approaches in vehicular scenarios suffer from inadequate client selection mechanisms and limited prediction accuracy, which result in suboptimal cache hit ratios and increased content transmission latency. To address these challenges, we propose a Digital Twin-based Asynchronous Federated Learning-driven Predictive Edge Caching with Deep Reinforcement Learning (DAPR) framework. DAPR employs an intelligent client selection strategy based on asynchronous federated learning, which leverages mobility prediction and data quality assessment to avoid selecting highly mobile clients or clients with low-quality data, thereby significantly improving model convergence efficiency. In addition, we design a GRU-VAE prediction model that uses a Variational Autoencoder (VAE) to capture latent data distribution features and Gated Recurrent Units (GRUs) to model temporal dependencies, thereby substantially enhancing the accuracy of content request prediction. The predicted content popularities are then fed into a deep reinforcement learning-driven caching decision engine to dynamically optimize edge caching resource allocation. Extensive experiments demonstrate that DAPR achieves superior performance in terms of average reward, cache hit ratio, and transmission latency, thereby effectively improving the overall efficiency of vehicular edge caching systems.
中文摘要 随着车对车（V2V）自组网和无线通信技术的发展，移动边缘缓存已成为提升网络性能和用户体验的关键工具。然而，传统的基于联邦学习的协作缓存方法在车辆场景中存在客户端选择机制不足和预测准确性有限的问题，导致缓存命中率不理想，内容传输延迟增加。为应对这些挑战，我们提出了基于数字孪生的异步联合学习驱动的深度强化学习预测边缘缓存（DAPR）框架。DAPR采用基于异步联邦学习的智能客户端选择策略，利用移动预测和数据质量评估，避免选择高度移动的客户端或数据质量较低的客户端，从而显著提升模型收敛效率。此外，我们设计了一个GRU-VAE预测模型，利用变分自编码器（VAE）捕捉潜在数据分布特征，并利用门控循环单元（GRU）模拟时间依赖，从而大幅提升内容请求预测的准确性。预测内容受欢迎度随后被输入深度强化学习驱动的缓存决策引擎，动态优化边缘缓存资源分配。大量实验表明，DAPR在平均奖励、缓存命中率和传输延迟方面表现出色，从而有效提升了车辆边缘缓存系统的整体效率。

GameVerse: Can Vision-Language Models Learn from Video-based Reflection?

GameVerse：视觉语言模型能否从基于视频的反思中学习？

Authors: Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06656
Pdf link: https://arxiv.org/pdf/2603.06656
Abstract Human gameplay is a visually grounded interaction loop in which players act, reflect on failures, and watch tutorials to refine strategies. Can Vision-Language Models (VLMs) also learn from video-based reflection? We present GameVerse, a comprehensive video game benchmark that enables a reflective visual interaction loop. Moving beyond traditional fire-and-forget evaluations, it uses a novel reflect-and-retry paradigm to assess how VLMs internalize visual experience and improve policies. To facilitate systematic and scalable evaluation, we also introduce a cognitive hierarchical taxonomy spanning 15 globally popular games, a dual action space for both semantic and GUI control, and milestone evaluation using advanced VLMs to quantify progress. Our experiments show that VLMs benefit from video-based reflection in varied settings, and perform best by combining failure trajectories and expert tutorials, a training-free analogue to reinforcement learning (RL) plus supervised fine-tuning (SFT).
中文摘要 人类游戏是一个视觉上贴近现实的互动循环，玩家在其中行动、反思失败，并观看教程以完善策略。视觉语言模型（VLMs）也能从基于视频的反思中学习吗？我们介绍GameVerse，一个全面的视频游戏基准测试，能够实现反思性的视觉互动循环。它超越了传统的“发射后遗忘”评估，采用一种新颖的“反射与重试”范式，评估VLM如何内化视觉体验并改进策略。为促进系统且可扩展的评估，我们还引入了涵盖15款全球流行游戏的认知层级分类法，双重动作空间用于语义和图形界面控制，以及利用先进VLM进行里程碑评估以量化进展。我们的实验表明，VLM在不同环境中受益于基于视频的反射，并且通过结合失败轨迹和专家教程表现最佳——这是一种无需训练的强化学习（RL）加监督微调（SFT）的无训练类比。

Hybrid Orchestration of Edge AI and Microservices via Graph-based Self-Imitation Learning

通过基于图的自我模仿学习，边缘人工智能与微服务的混合编排

Authors: Chen Yang, Jin Zheng, Yang Zhuolin, Lai Pan, Zhang Xiao, Hu Menglan, Yin Haiyan
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06669
Pdf link: https://arxiv.org/pdf/2603.06669
Abstract Modern edge AI applications increasingly rely on microservice architectures that integrate both AI services and conventional microservices into complex request chains with stringent latency requirements. Effectively orchestrating these heterogeneous services is crucial for ensuring low-latency performance, yet remains challenging due to their diverse resource demands and strong operational interdependencies under resource-constrained edge environments. In particular, frequent interactions between services tightly couple deployment and routing decisions, yet existing approaches optimize them in isolation, leading to fundamentally inadequate system performance. In this paper, we propose SIL-GPO, a reinforcement learning framework that optimizes hybrid orchestration for edge AI microservice systems. SIL-GPO formulates the orchestration problem as a sequential decision-making task and leverages graph attention networks to encode service topologies and routing dependencies within the agent state representation. Moreover, SIL-GPO integrates a self-imitation learning strategy into proximal policy optimization, enabling the agent to prioritize and reuse high-reward trajectories. This guides policy updates towards globally promising solutions that standard RL often fails to discover under sparse rewards and large combinatorial action spaces. We conduct extensive experiments on trace-driven edge AI workloads, demonstrating that SIL-GPO significantly reduces end-to-end service latency and enhances resource utilization compared to state-of-the-art heuristic, metaheuristic, and deep RL baselines. Our framework offers a unified and scalable solution for efficient orchestration of AI services and microservices in the edge, paving the way for low-latency, high-performance edge AI deployments.
中文摘要 现代边缘AI应用越来越依赖将AI服务和传统微服务整合成复杂请求链、要求严格延迟的微服务架构。有效协调这些异构服务对于确保低延迟性能至关重要，但由于资源需求多样且在资源受限的边缘环境中存在强烈的运营相互依赖性，依然具有挑战性。特别是，服务间频繁的交互紧密耦合了部署和路由决策，但现有方法却单独优化，导致系统根本不够完善。本文提出SIL-GPO，一种强化学习框架，用于优化边缘AI微服务系统的混合编排。SIL-GPO 将编排问题表述为一个顺序决策任务，并利用图注意力网络在代理状态表示中编码服务拓扑和路由依赖。此外，SIL-GPO将自我模仿学习策略整合进近端策略优化，使智能体能够优先排序并重用高回报轨迹。这引导政策更新，朝向全球有前景的解决方案，而标准强化学习在稀疏奖励和庞大的组合行动空间下往往无法发现这些解决方案。我们在追踪驱动边缘AI工作负载上进行了大量实验，证明SIL-GPO相比最先进的启发式、元启发式和深度强化学习基线，显著降低端到端服务延迟并提升资源利用率。我们的框架提供了一个统一且可扩展的解决方案，用于边缘中高效编排AI服务和微服务，为低延迟、高性能的边缘AI部署铺平道路。

Don't Freeze, Don't Crash: Extending the Safe Operating Range of Neural Navigation in Dense Crowds

别冻结，别撞车：在密集人群中扩展神经导航的安全操作范围

Authors: Jiefu Zhang, Yang Xu, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.06729
Pdf link: https://arxiv.org/pdf/2603.06729
Abstract Navigating safely through dense crowds requires collision avoidance that generalizes beyond the densities seen during training. Learning-based crowd navigation can break under out-of-distribution crowd sizes due to density-sensitive observation normalization and social-cost scaling, while analytical solvers often remain safe but freeze in tight interactions. We propose a reinforcement learning approach for dense, variable-density navigation that attains zero-shot density generalization using a density-invariant observation encoding with density-randomized training and physics-informed proxemic reward shaping with density-adaptive scaling. The encoding represents the distance-sorted $K$ nearest pedestrians plus bounded crowd summaries, keeping input statistics stable as crowd size grows. Trained with $N\in[11,16]$ pedestrians in a $3\mathrm{m}\times3\mathrm{m}$ arena and evaluated up to $N=21$ pedestrians ($1.3\times$ denser), our policy reaches the goal in $>99\%$ of episodes and achieves $86\%$ collision-free success in random crowds, with markedly less freezing than analytical methods and a $>60$-point collision-free margin over learning-based benchmark methods. Code is available at this https URL.
中文摘要 在密集人群中安全导航需要能够泛化到训练时未见过的人群密度的碰撞规避能力。基于学习的人群导航在分布外的人群规模下可能因密度敏感的观测归一化和社会成本缩放而失效，而解析求解器通常保持安全，但在紧密交互中会冻结。我们提出了一种用于密集、可变密度导航的强化学习方法，利用密度不变的观测编码结合密度随机化训练，以及带有密度自适应缩放的物理知情空间距离奖励塑造，实现零样本密度泛化。该编码表示按距离排序的$K$个最近行人以及有界的人群汇总量，使输入统计量在人群规模增长时保持稳定。在$3\mathrm{m}\times3\mathrm{m}$的场地中以$N\in[11,16]$名行人进行训练，并在最多$N=21$名行人（密度为$1.3\times$）的条件下评估，我们的策略在$>99\%$的回合中到达目标，在随机人群中实现$86\%$的无碰撞成功率，冻结情况明显少于解析方法，且相较于基于学习的基准方法有$>60$个百分点的无碰撞优势。代码可在 this https URL 获取。
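
The observation encoding described in this abstract (distance-sorted $K$ nearest pedestrians plus bounded crowd summaries) can be sketched as follows. The function name, the padding value, and the two summary features (normalized nearest and mean distance) are hypothetical choices made for illustration; the abstract does not specify them. The point of the sketch is that the output length and value range stay fixed regardless of how many pedestrians are present.

```python
import math


def encode_observation(robot_xy, pedestrians_xy, k=3, sensing_radius=3.0):
    """Fixed-length encoding: K nearest relative positions plus bounded summaries."""
    rx, ry = robot_xy
    # Relative positions, sorted from nearest to farthest.
    relative = sorted(
        ((px - rx, py - ry) for px, py in pedestrians_xy),
        key=lambda d: math.hypot(d[0], d[1]),
    )
    nearest = relative[:k]
    # Pad with a far-away placeholder when fewer than K pedestrians are present.
    nearest += [(sensing_radius, sensing_radius)] * (k - len(nearest))
    features = [coord for d in nearest for coord in d]

    if relative:
        distances = [math.hypot(dx, dy) for dx, dy in relative]
        min_dist, mean_dist = distances[0], sum(distances) / len(distances)
    else:
        min_dist = mean_dist = sensing_radius
    # Summaries are clipped to [0, 1] so they stay bounded as the crowd grows.
    features.append(min(min_dist / sensing_radius, 1.0))
    features.append(min(mean_dist / sensing_radius, 1.0))
    return features


sparse = encode_observation((0.0, 0.0), [(1.0, 0.0)])
dense = encode_observation(
    (0.0, 0.0), [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (0.5, 0.5), (2.0, 2.0)]
)
print(len(sparse), len(dense))  # same length for both crowd sizes
```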

HybridMimic: Hybrid RL-Centroidal Control for Humanoid Motion Mimicking

HybridMimic：用于人形机器人动作模仿的混合强化学习-质心控制

Authors: Ludwig Chee-Ying Tay, I-Chia Chang, Yan Gu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.06775
Pdf link: https://arxiv.org/pdf/2603.06775
Abstract Motion mimicking, i.e., encouraging the control policy to mimic human motion, facilitates the learning of complex tasks via reinforcement learning (RL) for humanoid robots. Although standard RL frameworks demonstrate impressive locomotion agility, they often bypass explicit reasoning about robot dynamics during deployment, which is a design choice that can lead to physically infeasible commands when the robot encounters out-of-distribution environments. By integrating model-based principles, hybrid approaches can improve performance; however, existing methods typically rely on predefined contact timing, limiting their versatility. This paper introduces HybridMimic, a framework in which a learned policy dynamically modulates a centroidal-model-based controller by predicting continuous contact states and desired centroidal velocities. This architecture exploits the physical grounding of centroidal dynamics to generate feedforward torques that remain feasible even under domain shift. Using physics-informed rewards, the policy is trained to efficiently utilize the centroidal controller's optimization by outputting precise control targets and reference torques. Through hardware experiments on the Booster T1 humanoid, HybridMimic reduces the average base position tracking error by 13% compared to a state-of-the-art RL baseline, demonstrating the robustness of dynamics-aware deployment.
中文摘要 动作模仿，即鼓励控制策略模拟人类运动，通过强化学习（RL）促进人形机器人复杂任务的学习。尽管标准强化学习框架展现出令人印象深刻的移动敏捷性，但它们常常绕过部署过程中对机器人动力学的明确推理，这种设计选择可能导致机器人在遇到非分布环境时发出物理上不可行的命令。通过整合基于模型的原则，混合方法可以提升性能;然而，现有方法通常依赖预设的接触时机，限制了其多样性。本文介绍了HybridMimic，这一框架中，学习策略通过预测连续接触状态和期望的重心速度，动态调制基于重心模型的控制器。该架构利用质心动力学的物理接地，产生即使在域位移下仍可行的前馈扭矩。利用物理反馈，该策略被训练以高效利用质心控制器的优化，输出精确的控制目标和参考力矩。通过对Booster T1类人生物进行的硬件实验，HybridMimic将平均基准位置跟踪误差降低了13/%，相比最先进的强化学习基线，展示了动态感知部署的稳健性。

HGT-Scheduler: Deep Reinforcement Learning for the Job Shop Scheduling Problem via Heterogeneous Graph Transformers

HGT-调度器：通过异构图变换器解决工作车间调度问题的深度强化学习

Authors: Bulent Soykan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2603.06777
Pdf link: https://arxiv.org/pdf/2603.06777
Abstract The Job Shop Scheduling Problem (JSSP) is commonly formulated as a disjunctive graph in which nodes represent operations and edges encode technological precedence constraints as well as machine-sharing conflicts. Most existing reinforcement learning approaches model this graph as homogeneous, merging job-precedence and machine-contention edges into a single relation type. Such a simplification overlooks the intrinsic heterogeneity of the problem structure and may lead to the loss of critical relational information. To address this limitation, we propose the Heterogeneous Graph Transformer (HGT)-Scheduler, a reinforcement learning framework that models the JSSP as a heterogeneous graph. The proposed architecture leverages a Heterogeneous Graph Transformer to capture type-specific relational patterns through edge-type-dependent attention mechanisms applied to precedence and contention relations. The scheduling policy is trained using Proximal Policy Optimization. The effectiveness of the proposed method is evaluated on the Fisher-Thompson benchmark instances. On the FT06 instance, the HGT-Scheduler achieves an optimality gap of 8.4%, statistically outperforming both an identical architecture that ignores edge types ($p = 0.011$) and a standard Graph Isomorphism Network baseline. On the larger FT10 instance, the approach demonstrates favorable scalability. However, under a 50,000-step training limit, the performance of heterogeneous and homogeneous graph models is comparable, suggesting that edge-type awareness requires longer training horizons for larger problem instances. Ablation analyses further indicate that a three-layer attention architecture provides the best performance. Overall, the results confirm that explicitly modeling distinct edge semantics improves the learning of effective scheduling policies.
中文摘要 作业车间调度问题（JSSP）通常被表述为一个析取图，其中节点表示作，边编码技术优先约束以及机器共享冲突。大多数现有的强化学习方法将该图建模为同质的，将作业优先级和机器争用边合并为单一关系类型。这种简化忽视了问题结构的内在异质性，可能导致关键关系信息的丢失。为解决这一限制，我们提出了异构图变换器（HGT）调度器，这是一种将JSSP建模为异构图的强化学习框架。该架构利用异构图变换器，通过边缘类型依赖的注意力机制捕捉类型特定的关系模式，应用于优先级和争用关系。调度策略通过近端策略优化进行训练。该方法的有效性在Fisher-Thompson基准实例上进行评估。在FT06实例中，HGT-调度器实现了8.4%的最优性差距，统计上优于无边类型（$p = 0.011$）和标准图同构网络基线的相同架构。在更大的FT10实例上，这种方法展现了良好的可扩展性。然而，在5万步训练限制下，异构图模型和齐质图模型的性能相当，表明边缘型感知需要更长的训练时间来应对更大的问题实例。消融分析进一步表明，三层注意力架构能提供最佳性能。总体而言，结果证实了明确建模不同边缘语义能提升有效调度策略的学习。
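
To make the distinction between the two edge types concrete, the sketch below builds the precedence and contention relations of a disjunctive graph for a tiny, made-up two-job instance and keeps them in separate lists, which is the information a homogeneous model would merge. The instance, the function names, and the gap formula (relative excess over the optimal makespan) are illustrative assumptions rather than the paper's code.

```python
from itertools import combinations


def build_typed_edges(jobs):
    """jobs: list of jobs, each an ordered list of (machine, duration) operations."""
    precedence, contention = [], []
    operations_by_machine = {}
    for job_id, operations in enumerate(jobs):
        for op_id, (machine, _duration) in enumerate(operations):
            node = (job_id, op_id)
            if op_id > 0:
                # Directed edge: the previous operation of the same job must finish first.
                precedence.append(((job_id, op_id - 1), node))
            operations_by_machine.setdefault(machine, []).append(node)
    for nodes in operations_by_machine.values():
        # Undirected conflict between every pair of operations sharing a machine.
        contention.extend(combinations(nodes, 2))
    return {"precedence": precedence, "contention": contention}


def optimality_gap(makespan, optimal_makespan):
    """Relative excess of a schedule's makespan over the known optimum."""
    return (makespan - optimal_makespan) / optimal_makespan


# Hypothetical instance: 2 jobs, 2 machines, operations given as (machine, duration).
toy_jobs = [[(0, 3), (1, 2)], [(1, 4), (0, 1)]]
edges = build_typed_edges(toy_jobs)
print(edges["precedence"])
print(edges["contention"])
```

A heterogeneous graph network would then apply separate attention parameters to each of the two edge lists instead of treating them as one relation.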

Optimistic Policy Regularization

乐观政策正则化

Authors: Mai Pham, Vikrant Vaze, Peter Chin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06793
Pdf link: https://arxiv.org/pdf/2603.06793
Abstract Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. When instantiated on Proximal Policy Optimization (PPO), OPR substantially improves sample efficiency on the Arcade Learning Environment. Across 49 Atari games evaluated at the 10-million step benchmark, OPR achieves the highest score in 22 environments despite baseline methods being reported at the standard 50-million step horizon. Beyond arcade benchmarks, OPR also generalizes to the CAGE Challenge 2 cyber-defense environment, surpassing the competition-winning Cardiff agent while using the same PPO architecture. These results demonstrate that anchoring policy updates to empirically successful trajectories can improve both sample efficiency and final performance.
中文摘要 深度强化学习代理常常出现过早收敛，即早期熵坍缩导致策略在发现全局最优策略前就放弃探索性行为。我们介绍乐观策略正则化（OPR），这是一种轻量级机制，旨在在策略优化过程中保留和强化历史上成功的轨迹。OPR通过定向对数比奖励塑造和辅助行为克隆目标，保持高绩效发作和偏见的动态缓冲。当实例化在近端策略优化（PPO）上时，OPR显著提升了Arcade学习环境中的样本效率。在49款以1000万步基准评估的雅达利游戏中，OPR在22个环境中取得了最高分，尽管基线方法报告在标准的5000万步视野。除了街机基准测试，OPR还推广到了CAGE Challenge 2网络防御环境，超越了竞争对手Cardiff代理，同时采用相同的PPO架构。这些结果表明，将政策更新锚定于实证成功的轨迹可以提升样本效率和最终绩效。
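
The "dynamic buffer of high-performing episodes" in this abstract can be pictured as a bounded top-k store. The sketch below is one simple way to build such a buffer with a min-heap; the class name, capacity, and eviction rule are hypothetical, and it does not cover the reward shaping or behavioral cloning terms that OPR adds on top.

```python
import heapq


class TopEpisodeBuffer:
    """Keeps the `capacity` highest-return episodes seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (return, insertion_id, trajectory)
        self._counter = 0  # tie-breaker so trajectories are never compared

    def add(self, episode_return, trajectory):
        item = (episode_return, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif episode_return > self._heap[0][0]:
            # Evict the current worst episode in favor of the better one.
            heapq.heapreplace(self._heap, item)

    def returns(self):
        return sorted((ret for ret, _, _ in self._heap), reverse=True)


buffer = TopEpisodeBuffer(capacity=3)
for episode_return in [1.0, 5.0, 3.0, 2.0, 4.0]:
    buffer.add(episode_return, trajectory=["placeholder state-action pairs"])
print(buffer.returns())
```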

Multi-Agent Reinforcement Learning with Submodular Reward

多智能体强化学习与亚模块奖励

Authors: Wenjing Chen, Chengyuan Qian, Shuo Xing, Yi Zhou, Victoria Crawford
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2603.06810
Pdf link: https://arxiv.org/pdf/2603.06810
Abstract In this paper, we study cooperative multi-agent reinforcement learning (MARL) where the joint reward exhibits submodularity, which is a natural property capturing diminishing marginal returns when adding agents to a team. Unlike standard MARL with additive rewards, submodular rewards model realistic scenarios where agent contributions overlap (e.g., multi-drone surveillance, collaborative exploration). We provide the first formal framework for this setting and develop algorithms with provable guarantees on sample efficiency and regret bound. For known dynamics, our greedy policy optimization achieves a $1/2$-approximation with polynomial complexity in the number of agents $K$, overcoming the exponential curse of dimensionality inherent in joint policy optimization. For unknown dynamics, we propose a UCB-based learning algorithm achieving a $1/2$-regret of $O(H^2KS\sqrt{AT})$ over $T$ episodes.
中文摘要 本文研究合作多智能体强化学习（MARL），其中联合奖励具有次模性（submodularity），这一自然性质刻画了向团队中增加智能体时的边际收益递减。与采用加性奖励的标准MARL不同，次模奖励能够刻画智能体贡献相互重叠的现实场景（例如多无人机监控、协作探索）。我们为该设定提供了首个形式化框架，并开发了在样本效率和遗憾界上具有可证明保证的算法。对于已知动态，我们的贪心策略优化实现了$1/2$近似，其复杂度关于智能体数量$K$为多项式，克服了联合策略优化固有的指数级维度灾难。对于未知动态，我们提出基于UCB的学习算法，在$T$个回合上实现$O(H^2KS\sqrt{AT})$的$1/2$-遗憾。
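
As a rough illustration of the diminishing-returns idea in this abstract, the sketch below runs a sequential greedy assignment on a toy coverage reward. It is not the paper's algorithm (which optimizes policies in an MDP); the drone names, the candidate cell sets, and the helper names `coverage` and `sequential_greedy` are hypothetical. For monotone submodular objectives with one choice per agent, this kind of sequential greedy is classically known to reach at least half of the optimal value, which is the flavor of guarantee the abstract describes.

```python
from typing import Dict, List, Set


def coverage(chosen: List[Set[int]]) -> int:
    """Joint reward: number of distinct cells covered by all agents."""
    return len(set().union(*chosen)) if chosen else 0


def sequential_greedy(options: Dict[str, List[Set[int]]]) -> Dict[str, Set[int]]:
    """Each agent in turn picks the option with the largest marginal gain."""
    covered: Set[int] = set()
    assignment: Dict[str, Set[int]] = {}
    for agent, candidates in options.items():
        # Marginal gain = cells this option adds beyond what is already covered.
        best = max(candidates, key=lambda cells: len(cells - covered))
        assignment[agent] = best
        covered |= best
    return assignment


# Hypothetical surveillance instance: each drone chooses one patrol region.
options = {
    "drone_a": [{1, 2, 3}, {3, 4}],
    "drone_b": [{1, 2, 3}, {4, 5}],
    "drone_c": [{2, 3}, {5, 6}],
}
plan = sequential_greedy(options)
total = coverage(list(plan.values()))
```

Here the second and third drones pass over options that only repeat cells already covered, which is the overlap effect a submodular reward encodes.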

Reinforcing the World's Edge: A Continual Learning Problem in the Multi-Agent-World Boundary

强化世界边界：多智能体-世界边界中的持续学习问题

Authors: Dane Malenfant
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06813
Pdf link: https://arxiv.org/pdf/2603.06813
Abstract Reusable decision structure survives across episodes in reinforcement learning, but this depends on how the agent-world boundary is drawn. In stationary, finite-horizon MDPs, an invariant core can be constructed: the (not necessarily contiguous) subsequences of state-action pairs shared by all successful trajectories (optionally under a simple abstraction). Under mild goal-conditioned assumptions, its existence can be proven and explained by how the core captures prototypes that transfer across episodes. When the same task is embedded in a decentralized Markov game and the peer agent is folded into the world, each peer-policy update induces a new MDP; the per-episode invariant core can shrink or vanish, even with small changes to the induced world dynamics, sometimes leaving only the individual task core or nothing at all. This policy-induced non-stationarity can be quantified with a variation budget over the induced kernels and rewards, linking boundary drift to loss of invariants. The view that a continual RL problem arises from instability of the agent-world boundary (rather than exogenous task switches) in decentralized MARL suggests future work on preserving, predicting, or otherwise managing boundary drift.
中文摘要 可重用的决策结构在强化学习的各个阶段都能存活，但这取决于智能体——世界边界的划定方式。在静止的有限视界MDP中，可以构造一个不变核心：所有成功轨迹共享的状态-作用对子序列（可选地通过简单抽象）。在轻微的目标条件假设下，它的存在可以通过核心捕捉原型并在各集间转移来证明和解释。当同一任务嵌入去中心化马尔可夫博弈中，且对等代理被折叠进世界时，每次对等策略更新都会诱导新的MDP;每集不变核心可能会缩小或消失，即使对诱导的世界动态有微小调整，有时只剩下单个任务核心，甚至什么都不剩。这种由策略引起的非平稳性可以通过对诱导核和奖励的变异预算量化，将边界漂移与不变量的丧失联系起来。认为持续强化学习问题源于去中心化MARL中智能体-世界边界的不稳定性（而非外生任务切换），这暗示未来应有关于边界漂移的维护、预测或管理工作。

Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration

LLM协作中多智能体强化学习的上下文反事实信用分配

Authors: Yanjun Chen, Yirong Sun, Hanlin Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06859
Pdf link: https://arxiv.org/pdf/2603.06859
Abstract Cooperative multi-agent reinforcement learning (MARL) systems powered by large language models (LLMs) are frequently optimized via sparse terminal-only feedback. This shared signal entangles upstream decisions, obstructing accurate decision-level credit assignment. To address this trajectory-level diffusion, we introduce Contextual Counterfactual Credit Assignment (C3). Instead of distributing rewards across an entire episode, C3 isolates the causal impact of individual messages by freezing the exact transcript-derived context, evaluating context-matched alternatives via fixed-continuation replay, and applying a leave-one-out (LOO) baseline. This localized intervention extracts unbiased, low-variance marginal advantages for standard policy-gradient optimization. Evaluated across five mathematical and coding benchmarks under matched budgets, C3 improves terminal performance over established baselines. Mechanistic diagnostics further show that these gains are accompanied by higher credit fidelity, lower contextual variance, and stronger inter-agent causal dependence. Our code is available at this https URL.
中文摘要 由大型语言模型（LLMs）驱动的协作多智能体强化学习（MARL）系统通常通过稀疏的、仅在终端给出的反馈进行优化。这种共享信号使上游决策相互纠缠，阻碍了准确的决策层级信用分配。为了解决这一轨迹层面的扩散，我们引入了上下文反事实信用分配（C3）。C3 不将奖励分摊到整个回合，而是通过冻结由对话记录导出的精确上下文、通过固定后续重放评估上下文匹配的替代方案，以及应用留一（LOO）基线，来隔离单条消息的因果影响。这种局部干预为标准策略梯度优化提取了无偏、低方差的边际优势。在五个数学和编码基准测试中，在匹配预算下，C3 的终端性能优于既定基线。机制诊断进一步表明，这些收益伴随着更高的信用保真度、更低的上下文方差以及更强的智能体间因果依赖。我们的代码可在此 https URL 访问。
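
A leave-one-out baseline of the kind named in this abstract can be sketched in a few lines. This is a generic LOO estimator and not the authors' released code; the function name `loo_advantages` and the example rewards are assumptions. Each reward stands for the terminal score obtained after replaying one context-matched alternative message with the same fixed continuation.

```python
from typing import List


def loo_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each alternative relative to the mean of all the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two alternatives for a LOO baseline")
    total = sum(rewards)
    # Baseline for item i excludes r_i, so it does not depend on that sample.
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical terminal scores of four alternative messages in one frozen context.
advs = loo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because each baseline leaves out the sample it is compared against, it reduces variance without biasing that sample's gradient term, which is the usual motivation for LOO baselines in policy-gradient methods.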

Joint MDPs and Reinforcement Learning in Coupled-Dynamics Environments

耦合动力学环境中的联合MDP与强化学习

Authors: Ege C. Kaya, Mahsa Ghasemi, Abolfazl Hashemi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.06946
Pdf link: https://arxiv.org/pdf/2603.06946
Abstract Many distributional quantities in reinforcement learning are intrinsically joint across actions, including distributions of gaps and probabilities of superiority. However, the classical Markov decision process (MDP) formalism specifies only marginal laws and leaves the joint law of counterfactual one-step outcomes across multiple possible actions at a state unspecified. We study coupled-dynamics environments with a multi-action generative interface which can sample counterfactual one-step outcomes for multiple actions under shared exogenous randomness. We propose joint MDPs (JMDPs) as a formalism for such environments by augmenting an MDP with a multi-action sample transition model which specifies a coupling of one-step counterfactual outcomes, while preserving standard MDP interaction as marginal observations. We adopt and formalize a one-step coupling regime where dependence across actions is confined to immediate counterfactual outcomes at the queried state. In this regime, we derive Bellman operators for $n$th-order return moments, providing dynamic programming and incremental algorithms with convergence guarantees.
中文摘要 强化学习中的许多分布量本质上是跨动作联合的，包括差距分布和优越概率。然而，经典的马尔可夫决策过程（MDP）形式体系只指定了边际律，而未指定同一状态下多个可能动作的反事实一步结果的联合律。我们研究具有多动作生成接口的耦合动力学环境，该接口能够在共享外生随机性下对多个动作的反事实一步结果进行采样。我们提出联合MDP（JMDPs）作为此类环境的形式体系：为MDP增补一个多动作样本转移模型，以指定一步反事实结果的耦合，同时保留标准MDP交互作为边际观测。我们采用并形式化了一步耦合机制，使跨动作的依赖仅限于被查询状态下的即时反事实结果。在此机制下，我们推导出$n$阶回报矩的贝尔曼算子，提供具有收敛保证的动态规划和增量算法。

Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards

图表-强化学习：通过强化学习实现的广义图表理解，附带可验证的奖励

Authors: Xin Zhang, Xingyu Li, Rongguang Wang, Ruizhong Miao, Zheng Wang, Dan Roth, Chenyang Li
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.06958
Pdf link: https://arxiv.org/pdf/2603.06958
Abstract Accurate chart comprehension represents a critical challenge in advancing multimodal learning systems, as extensive information is compressed into structured visual representations. However, existing vision-language models (VLMs) frequently struggle to generalize to unseen charts because doing so requires abstract, symbolic, and quantitative reasoning over structured visual representations. In this work, we introduce Chart-RL, an effective reinforcement learning (RL) method that employs mathematically verifiable rewards to enhance chart question answering in VLMs. Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA and 11.5% on ChartInsights. We conduct robustness analysis, where Chart-RL achieves enhanced performance in 18 of 25 perturbed chart categories, demonstrating strong consistency and reasoning capability across visual variations. Furthermore, we demonstrate that task difficulty and inherent complexity are more critical than data quantity in RL training. For instance, Chart-RL trained on merely 10 complex chart-query examples significantly outperforms models trained on over 6,000 simple examples. Additionally, training on challenging reasoning tasks not only improves in-domain generalization relative to simpler tasks, but also facilitates strong transfer to out-of-domain visual mathematical problems.
中文摘要 准确的图表理解是推动多模态学习系统发展的关键挑战，因为大量信息被压缩成结构化的可视化表示。然而，现有的视觉语言模型（VLM）常常难以泛化到未见过的图表，因为这需要对结构化的视觉表示进行抽象、符号和定量推理。在本研究中，我们介绍了Chart-RL，一种有效的强化学习（RL）方法，利用数学可验证的奖励来增强VLM中的图表问答能力。我们的实验表明，Chart-RL在不同图表理解基准中持续优于监督微调（SFT），在MultiChartQA上取得了16.7%的相对提升，在ChartInsights上取得了11.5%的相对提升。我们进行了鲁棒性分析，Chart-RL在25个扰动图表类别中的18个中表现更优，展现出跨视觉变化的强一致性和推理能力。此外，我们证明任务难度和固有复杂性在强化学习训练中比数据量更为关键。例如，仅用10个复杂图表查询样本训练的Chart-RL，其表现显著优于用6000多个简单样本训练的模型。此外，相对于较简单的任务，在具有挑战性的推理任务上训练不仅提升了域内泛化能力，还促进了向域外视觉数学问题的有力迁移。

Topology-Aware Reinforcement Learning over Graphs for Resilient Power Distribution Networks

基于图的拓扑感知强化学习，用于弹性电力分配网络

Authors: Roshni Anna Jacob, Prithvi Poddar, Jaidev Goel, Souma Chowdhury, Yulia R. Gel, Jie Zhang
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06964
Pdf link: https://arxiv.org/pdf/2603.06964
Abstract Extreme weather events and cyberattacks can cause component failures and disrupt the operation of power distribution networks (DNs), during which reconfiguration and load shedding are often adopted for resilience enhancement. This study introduces a topology-aware graph reinforcement learning (RL) framework for outage management that embeds higher-order topological features of the DN into a graph-based RL model, enabling reconfiguration and load shedding to maximize energy supply while maintaining operational stability. Results on the modified IEEE 123-bus feeder across 300 diverse outage scenarios demonstrate that incorporating the topological data analysis (TDA) tool persistent homology (PH) yields 9-18% higher cumulative rewards, up to a 6% increase in power delivery, and 6-8% fewer voltage violations compared to a baseline graph-RL model. These findings highlight the potential of integrating RL with TDA to enable self-healing in DNs, facilitating fast, adaptive, and automated restoration.
中文摘要 极端天气事件和网络攻击可能导致组件故障，扰乱配电网络（DN）的运行，在此期间，通常会采用重组和限电措施来增强韧性。本研究引入了一种拓扑感知图强化学习（RL）停电管理框架，将DN的高阶拓扑特征嵌入基于图的强化模型中，实现重配置和负载减排，以最大化能源供应同时保持运行稳定性。在300种不同停电场景下，改进后的IEEE 123总线供电器结果表明，结合拓扑数据分析（TDA）工具——持久同调（PH）后，相比基线图-强化模型，累计奖励提升9%-18%，功率传输提升最多6%，电压违规减少6-8%。这些发现凸显了将强化学习与TDA整合的潜力，以实现DN的自我愈合，促进快速、适应性和自动化的恢复。

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

NePPO：面向一般和多智能体强化学习的近势策略优化

Authors: Addison Kalanther, Sanika Bharvirkar, Shankar Sastry, Chinmay Maheshwari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2603.06977
Pdf link: https://arxiv.org/pdf/2603.06977
Abstract Multi-agent reinforcement learning (MARL) is increasingly used to design learning-enabled agents that interact in shared environments. However, training MARL algorithms in general-sum games remains challenging: learning dynamics can become unstable, and convergence guarantees typically hold only in restricted settings such as two-player zero-sum or fully cooperative games. Moreover, when agents have heterogeneous and potentially conflicting preferences, it is unclear what system-level objective should guide learning. In this paper, we propose a new MARL pipeline called Near-Potential Policy Optimization (NePPO) for computing approximate Nash equilibria in mixed cooperative-competitive environments. The core idea is to learn a player-independent potential function such that the Nash equilibrium of a cooperative game with this potential as the common utility approximates a Nash equilibrium of the original game. To this end, we introduce a novel MARL objective such that minimizing this objective yields the best possible potential function candidate and consequently an approximate Nash equilibrium of the original game. We develop an algorithmic pipeline that minimizes this objective using zeroth-order gradient descent and returns an approximate Nash equilibrium policy. We empirically show the superior performance of this approach compared to popular baselines such as MAPPO, IPPO and MADDPG.
中文摘要 多智能体强化学习（MARL）越来越多地被用于设计能够在共享环境中交互的学习智能体。然而，在一般和博弈中训练MARL算法依然具有挑战性：学习动态可能变得不稳定，收敛保证通常只在有限条件下成立，如两人零和或全合作博弈。此外，当代理存在异质且可能冲突的偏好时，系统层面的目标应指导学习仍不明确。本文提出了一种新的MARL流水线，称为近势政策优化（NePPO），用于在混合合作竞争环境中计算近似纳什均衡。核心思想是学习一个与玩家无关的势函数，使得以该势能作为公共效用的合作博弈的纳什均衡近似于原博弈的纳什均衡。为此，我们引入了一个新的MARL目标，使得最小化该目标能得到最佳的潜在函数候选，从而得到原博弈的近似纳什均衡。我们开发了一个算法流水线，利用零阶梯度下降来最小化该目标，并返回一个近似的纳什均衡策略。我们通过实证证明，该方法优于流行的基线如MAPPO、IPPO和MADDPG。

Diffusion Controller: Framework, Algorithms and Parameterization

扩散控制器：框架、算法与参数化

Authors: Tong Yang, Moonkyung Ryu, Chih-Wei Hsu, Guy Tennenholtz, Yuejie Chi, Craig Boutilier, Bo Dai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06981
Pdf link: https://arxiv.org/pdf/2603.06981
Abstract Controllable diffusion generation often relies on various heuristics that are seemingly disconnected without a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, motivating a side-network parameterization conditioned on exposed intermediate denoising outputs, enabling effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven finetuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.
中文摘要 可控扩散生成通常依赖于看似脱节且缺乏统一理解的各种启发式方法。我们用扩散控制器（DiffCon）弥合这一空白，这是一种统一的控制理论观点，将反扩散采样视为（广义）线性可解马尔可夫决策过程（LS-MDP）内的纯状态随机控制。在该框架下，控制通过重新加权预训练的反向时间转移核，使终端目标与$f$的散度成本进行平衡。基于所得的最优条件，我们推导出实用的扩散微调强化学习方法：（i） f-发散正则化策略梯度更新，包括PPO风格规则，以及（ii）正则化器确定的奖励加权回归目标，并在Kullback-Leibler（KL）发散下保证最小化-保持。LS-MDP框架进一步暗示了一个原则性模型形式：最优分数分解为固定的预训练基线和轻量级控制修正，激励基于暴露中间去噪输出的侧网络参数化，实现有效的灰盒适应，伴随冻结骨干。在监督和奖励驱动微调中，稳定扩散v1.4的实验显示，偏好对齐的胜利率持续提升，质量与效率权衡相比灰框基线甚至参数高效的白盒适配器LoRA有所改善。

AdaGen: Learning Adaptive Policy for Image Synthesis

AdaGen：学习图像合成的自适应策略

Authors: Zanlin Ni, Yulin Wang, Yeguo Hua, Renping Zhou, Jiayi Guo, Jun Song, Bo Zheng, Gao Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06993
Pdf link: https://arxiv.org/pdf/2603.06993
Abstract Recent advances in image synthesis have been propelled by powerful generative models, such as Masked Generative Transformers (MaskGIT), autoregressive models, diffusion models, and rectified flow models. A common principle behind their success is the decomposition of synthesis into multiple steps. However, this introduces a proliferation of step-specific parameters (e.g., noise level or temperature at each step). Existing approaches typically rely on manually-designed rules to manage this complexity, demanding expert knowledge and trial-and-error. Furthermore, these static schedules lack the flexibility to adapt to the unique characteristics of each sample, yielding sub-optimal performance. To address this issue, we present AdaGen, a general, learnable, and sample-adaptive framework for scheduling the iterative generation process. Specifically, we formulate the scheduling problem as a Markov Decision Process, where a lightweight policy network determines suitable parameters given the current generation state, and can be trained through reinforcement learning. Importantly, we demonstrate that simple reward designs, such as FID or pre-trained reward models, can be easily hacked and may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of the policy networks. Finally, we introduce an inference-time refinement strategy and a controllable fidelity-diversity trade-off mechanism to further enhance the performance and flexibility of AdaGen. Comprehensive experiments on four generative paradigms validate the superiority of AdaGen. For example, AdaGen achieves better performance on DiT-XL with 3 times lower inference cost and improves the FID of VAR from 1.92 to 1.59 with negligible computational overhead.
中文摘要 近年来图像合成的进展得益于强大的生成模型，如蒙面生成变换器（MaskGIT）、自回归模型、扩散模型和整流流模型。它们成功的共同原理是将合成分解为多个步骤。然而，这会引入步级特定参数的增殖（例如每步的噪声水平或温度）。现有方法通常依赖手动设计的规则来管理这些复杂性，需要专业知识和反复试验。此外，这些静态计划缺乏适应每个样本独特特性的灵活性，导致性能不理想。为解决这个问题，我们提出了AdaGen，一个通用、可学习且可采样的迭代生成调度框架。具体来说，我们将调度问题表述为马尔可夫决策过程，其中轻量级策略网络根据当前生成状态确定合适的参数，并通过强化学习进行训练。重要的是，我们证明了简单的奖励设计，如FID或预训练奖励模型，容易被破解，且可能无法可靠保证生成样本的质量或多样性。因此，我们提出一种对抗性奖励设计，以指导政策网络的培训。最后，我们引入了推理时间优化策略和可控的保真度-多样性权衡机制，进一步提升AdaGen的性能和灵活性。四种生成范式的综合实验验证了AdaGen的优越性。例如，AdaGen在DiT-XL上性能更优，推理成本降低了3倍，并将VAR的FID从1.92提升到1.59，计算开销极低。

AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge

AutoChecklist：结合LLM-as-a-Judge进行清单生成与评分的可组合流水线

Authors: Karen Zhou, Chenhao Tan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.07019
Pdf link: https://arxiv.org/pdf/2603.07019
Abstract Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator $\rightarrow$ Refiner $\rightarrow$ Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at this https URL.
中文摘要 检查表已成为一种流行的可解释性和细致评估方法，尤其是在LLM作为评判者中。除了评估，这些结构化标准还可以作为模型对齐、强化学习和自我纠正的信号。为支持这些用例，我们介绍了AutoChecklist，一个开源库，将基于清单的评估统一为可组合的流程。其核心是五个检查表生成抽象的分类法，每个抽象都编码了一种不同的评估标准推导策略。模块化生成器 $\rightarrow$ Refiner $\rightarrow$ 评分器流水线将任何生成器与统一评分器连接起来，且新配置可仅通过提示模板注册。该库内置十条流水线，实现已发布的方法，并支持多个大型语言模型提供商（OpenAI、OpenRouter、vLLM）。除了 Python API，库还包含一个用于现成评估的 CLI 和一个用于交互探索的网页界面。验证实验证实这些清单方法与人类偏好和质量评分显著一致，ICLR同行评审反驳的案例研究展示了灵活的领域适应。AutoChecklist 在此 https URL 公开发布。

RESCHED: Rethinking Flexible Job Shop Scheduling from a Transformer-based Architecture with Simplified States

RESCHED：基于Transformer架构和简化状态重新思考柔性作业车间调度

Authors: Xiangjie Xiao, Cong Zhang, Wen Song, Zhiguang Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07020
Pdf link: https://arxiv.org/pdf/2603.07020
Abstract Neural approaches to the Flexible Job Shop Scheduling Problem (FJSP), particularly those based on deep reinforcement learning (DRL), have gained growing attention in recent years. However, existing methods rely on complex feature-engineered state representations (often requiring more than 20 handcrafted features) and graph-biased neural architectures. To reduce modeling complexity and advance a more generalizable framework for FJSP, we introduce ReSched, a minimalist DRL framework that rethinks both the scheduling formulation and model design. First, by revisiting the Markov Decision Process (MDP) formulation of FJSP, we condense the state space to just four essential features, eliminating historical dependencies through a subproblem-based perspective. Second, we employ Transformer blocks with dot-product attention, augmented by three lightweight but effective architectural modifications tailored to scheduling tasks. Extensive experiments show that ReSched outperforms classical dispatching rules and state-of-the-art DRL methods on FJSP. Moreover, ReSched also generalizes well to the Job Shop Scheduling Problem (JSSP) and the Flexible Flow Shop Scheduling Problem (FFSP), achieving competitive performance against neural baselines specifically designed for these variants.
中文摘要 针对柔性作业车间调度问题（FJSP）的神经方法，尤其是基于深度强化学习（DRL）的方法，近年来受到越来越多的关注。然而，现有方法依赖于复杂的特征工程状态表示（通常需要超过20个手工构建的特征）和偏向图结构的神经架构。为了降低建模复杂度并推动更具通用性的FJSP框架，我们引入了ReSched，这是一种极简的DRL框架，重新思考了调度建模和模型设计。首先，通过重新审视FJSP的马尔可夫决策过程（MDP）建模，我们将状态空间精简为仅四个基本特征，并通过基于子问题的视角消除历史依赖。其次，我们采用带有点积注意力的Transformer模块，辅以三种针对调度任务的轻量但有效的架构修改。大量实验表明，ReSched在FJSP上优于经典调度规则和最先进的DRL方法。此外，ReSched还能很好地推广到作业车间调度问题（JSSP）和柔性流水车间调度问题（FFSP），与专门为这些变体设计的神经基线相比取得了有竞争力的性能。

SSP: Safety-guaranteed Surgical Policy via Joint Optimization of Behavioral and Spatial Constraints

SSP：通过联合优化行为和空间约束实现安全保障的手术政策

Authors: Jianshu Hu, ZhiYuan Guan, Lei Song, Kantaphat Leelakunwet, Hesheng Wang, Wei Xiao, Qi Dou, Yutong Ban
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07032
Pdf link: https://arxiv.org/pdf/2603.07032
Abstract The paradigm of robot-assisted surgery is shifting toward data-driven autonomy, where policies learned via Reinforcement Learning (RL) or Imitation Learning (IL) enable the execution of complex tasks. However, these "black-box" policies often lack formal safety guarantees, a critical requirement for clinical deployment. In this paper, we propose the Safety-guaranteed Surgical Policy (SSP) framework to bridge the gap between data-driven generality and formal safety. We utilize Neural Ordinary Differential Equations (Neural ODEs) to learn an uncertainty-aware dynamics model from demonstration data. This learned model underpins a robust Control Barrier Function (CBF) safety controller, which minimally alters the actions of a surgical policy to ensure strict safety under uncertainty. Our controller enforces two constraint categories: behavioral constraints (restricting the task space of the agent) and spatial constraints (defining surgical no-go zones). We instantiate the SSP framework with surgical policies derived from RL, IL and Control Lyapunov Functions (CLF). Validation in both the SurRoL simulation and on the da Vinci Research Kit (dVRK) demonstrates that our method achieves a near-zero constraint violation rate while maintaining high task success rates compared to unconstrained baselines.
中文摘要 机器人辅助手术的范式正向数据驱动自主转变，通过强化学习（RL）或模仿学习（IL）学习的策略能够执行复杂任务。然而，这些“黑箱”政策往往缺乏正式的安全保障，而这是临床部署的关键要求。本文提出了安全保障手术政策（SSP）框架，以弥合数据驱动的通用性与正式安全性之间的差距。我们利用神经常微分方程（Neural ODE）从演示数据中学习一个不确定性感知的动力学模型。该学习模型支撑了稳健的控制屏障功能（CBF）安全控制器，仅对手术政策的动作进行最小限度调整，以确保在不确定性下严格的安全。我们的控制器强制执行两个约束类别：行为约束（限制代理的任务空间）和空间约束（定义外科手术禁区）。我们通过基于RL、IL和Control Lyapunov Functions（CLF）衍生的外科策略，实例化SSP框架。在SurRoL模拟和达芬奇研究工具包（dVRK）中的验证表明，我们的方法在相较于无约束基线条件下保持较高任务成功率的同时，实现了近乎零的约束违背率。
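
The phrase "minimally alters the actions" in this abstract refers to a standard CBF safety filter, which for a single affine constraint has a closed-form solution. The sketch below is a toy version under strong assumptions that are not from the paper: known single-integrator dynamics (the state changes at the commanded velocity), one circular no-go zone, and no uncertainty handling or learned Neural ODE model. The function name `cbf_filter` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(x: Sequence[float], u_nom: Sequence[float],
               center: Sequence[float], radius: float,
               alpha: float = 1.0) -> List[float]:
    """Smallest change to u_nom that satisfies grad_h . u + alpha * h >= 0.

    Barrier: h(x) = ||x - center||^2 - radius^2 (h >= 0 outside the zone).
    Assumes x != center so the gradient of h is non-zero.
    """
    d = [xi - ci for xi, ci in zip(x, center)]
    h = sum(di * di for di in d) - radius ** 2
    a = [2.0 * di for di in d]  # gradient of h at x
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    # Project u_nom onto the constraint boundary (closed form of the one-constraint QP).
    norm_sq = sum(ai * ai for ai in a)
    return [ui - slack * ai / norm_sq for ui, ai in zip(u_nom, a)]


# Tool tip at (2, 0), no-go disc of radius 1 at the origin.
u_blocked = cbf_filter([2.0, 0.0], [-5.0, 0.0], [0.0, 0.0], 1.0)
u_free = cbf_filter([2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0)
```

An action heading fast toward the zone is slowed just enough to meet the barrier condition, while an action moving away is returned unchanged.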

Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction

Dreamer-CDP：通过连续确定性表示预测改进无重建世界模型

Authors: Michael Hauri, Friedemann Zenke
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07083
Pdf link: https://arxiv.org/pdf/2603.07083
Abstract Model-based reinforcement learning (MBRL) agents operating in high-dimensional observation spaces, such as Dreamer, rely on learning abstract representations for effective planning and control. Existing approaches typically employ reconstruction-based objectives in the observation space, which can render representations sensitive to task-irrelevant details. Recent alternatives trade reconstruction for auxiliary action prediction heads or view augmentation strategies, but perform worse in the Crafter environment than reconstruction-based methods. We close this gap between Dreamer and reconstruction-free models by introducing a JEPA-style predictor defined on continuous, deterministic representations. Our method matches Dreamer's performance on Crafter, demonstrating effective world model learning on this benchmark without reconstruction objectives.
中文摘要 在高维观察空间中运行的基于模型的强化学习（MBRL）智能体，如 Dreamer，依赖学习抽象表征以实现有效的规划和控制。现有方法通常在观察空间采用基于重建的目标，这会使表征对任务无关的细节敏感。近期的替代方案用辅助动作预测头或视图增强策略替代重建，但在Crafter环境中表现不如基于重建的方法。我们通过引入一个定义在连续确定性表示上的JEPA式预测器，弥合了Dreamer与无重建模型之间的差距。我们的方法在Crafter上与Dreamer的表现相当，表明在该基准上无需重建目标即可有效学习世界模型。

Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

倒计时代码：研究RLVR中奖励黑客的出现与推广的试验平台

Authors: Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.07084
Pdf link: https://arxiv.org/pdf/2603.07084
Abstract Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1% contamination in distillation SFT data is sufficient for models to internalize reward hacking which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at this https URL.
中文摘要 奖励黑客是一种错位形式，模型过度优化代理奖励，却未能真正解决底层任务。精确测量奖励黑客的发生仍然具有挑战性，因为真实的任务奖励往往计算成本高昂或无法计算。我们引入了Countdown-Code，一个极简环境，模型既能解决数学推理任务，也能操纵测试框架。这种双重访问设计在代理奖励（测试通过/失败）和真实奖励（数学正确性）之间实现了清晰的分离，从而能够准确测量奖励黑客率。利用这一环境，我们研究了开放权重大型语言模型中的奖励黑客行为，发现在监督微调（SFT）过程中，即使只有一小部分奖励黑客轨迹泄漏到训练数据中，这种行为也可能被无意中习得。蒸馏SFT数据中仅有1%的污染就足以让模型内化奖励黑客行为，并在后续强化学习（RL）中重新出现。我们进一步证明，强化学习会放大错位，并推动其泛化到原始领域之外。我们将环境和代码开源，以促进未来关于大型语言模型奖励黑客的研究。我们的结果揭示了一条此前未被充分探索的路径，奖励黑客可以经由该路径在大型语言模型中出现并持续存在，凸显了对合成SFT数据进行更严格验证的必要性。代码可在此 https URL 访问。
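
The separation between proxy and true rewards described in this abstract can be illustrated with a small Countdown-style checker. This is a hedged sketch, not the Countdown-Code environment: the function names, the rule that each given number may be used at most once, and the weak proxy (which stands in for a manipulable test harness by checking only the final value) are all assumptions made for illustration.

```python
import ast
from collections import Counter
from typing import List, Optional

# Only plain arithmetic on integer literals is accepted.
_ALLOWED = (ast.Expression, ast.BinOp, ast.Constant,
            ast.Add, ast.Sub, ast.Mult, ast.Div)


def _parse(expr: str) -> Optional[ast.Expression]:
    """Return the parsed expression, or None if it is not plain arithmetic."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            return None
        if isinstance(node, ast.Constant) and type(node.value) is not int:
            return None
    return tree


def _value(tree: ast.Expression) -> Optional[float]:
    """Evaluate a vetted arithmetic tree; None on division by zero."""
    try:
        return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return None


def proxy_reward(expr: str, target: int) -> bool:
    """Weak check: only the final value is compared with the target."""
    tree = _parse(expr)
    value = _value(tree) if tree is not None else None
    return value is not None and abs(value - target) < 1e-9


def true_reward(expr: str, numbers: List[int], target: int) -> bool:
    """Strict check: the value matches and only the given numbers are used."""
    tree = _parse(expr)
    if tree is None:
        return False
    used = Counter(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used - Counter(numbers):  # a number was invented or reused
        return False
    return proxy_reward(expr, target)


numbers, target = [25, 5, 3, 7], 60
honest, hack = "(25 - 5) * 3", "60"
results = {
    "honest": (proxy_reward(honest, target), true_reward(honest, numbers, target)),
    "hack": (proxy_reward(hack, target), true_reward(hack, numbers, target)),
}
```

A reward-hacking rate in this toy setting would be the fraction of outputs that pass `proxy_reward` but fail `true_reward`, as the `hack` string does.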

Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

与人类对自然二元互动偏好相符的面部表情生成

Authors: Xu Chen, Rui Gao, Xinjie Zhang, Haoyu Zhang, Che Sun, Zhi Gao, Yuwei Wu, Yunde Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.07093
Pdf link: https://arxiv.org/pdf/2603.07093
Abstract Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key to our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to evolving conversational cues of the speaker. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker's multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression response with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.
中文摘要 实现自然的双人互动需要生成情感上合适且符合人类偏好的面部表情。人类反馈提供了一种引人注目的机制来引导这种对齐，但如何有效地将这种反馈融入面部表情生成仍未被充分探索。本文提出一种符合人类偏好的面部表情生成方法，利用人类反馈生成符合情境和情感的自然二元互动表情。我们方法的关键在于将生成与身份无关的面部表情视为一种行动学习过程，允许人类反馈在无视觉或身份偏见的情况下评估其有效性。我们建立了闭合反馈循环，使听者的表达动态响应说话者不断变化的会话线索。具体来说，我们通过监督微调训练视觉-语言-行动模型，将说话者的多模态信号映射为可控的低维表达表示，构成三维可变模型。我们进一步引入了一种人类反馈强化学习策略，将高质量表达响应的模仿与批评者引导的优化相结合。在两个基准测试上的实验表明，我们的方法能够有效将面部表情与人类偏好对齐，并实现了更优异的性能。

Learning From Failures: Efficient Reinforcement Learning Control with Episodic Memory

从失败中学习：基于情节记忆的高效强化学习控制

Authors: Chenyang Miao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07110
Pdf link: https://arxiv.org/pdf/2603.07110
Abstract Reinforcement learning has achieved remarkable success in robot learning. However, under challenging exploration and contact-rich dynamics, early-stage training is frequently dominated by premature terminations such as collisions and falls. As a result, learning is overwhelmed by short-horizon, low-return trajectories, which hinder convergence and limit long-horizon exploration. To alleviate this issue, we propose a technique called Failure Episodic Memory Alert (FEMA). FEMA explicitly stores short-horizon failure experiences through an episodic memory module. During interactions, it retrieves similar failure experiences and prevents the robot from recurrently relapsing into unstable states, guiding the policy toward long-horizon trajectories with greater long-term value. FEMA can be combined easily with model-free reinforcement learning algorithms, and yields a substantial sample-efficiency improvement of 33.11% on MuJoCo tasks across several classical RL algorithms. Furthermore, integrating FEMA into a parallelized PPO training pipeline demonstrates its effectiveness on a real-world bipedal robot task.
中文摘要 强化学习在机器人学习领域取得了显著成功。然而，在充满挑战的探索和接触密集的环境下，早期训练常常被碰撞和坠落等过早终止所主导。因此，学习被短视野、低回报轨迹所压倒，阻碍了趋同并限制了长视野的探索。为缓解这一问题，我们提出了一种称为失败情节记忆警报（FEMA）的技术。FEMA通过情节记忆模块明确存储短视距失效体验。在交互过程中，它会提取类似的失败经验，防止机器人反复陷入不稳定状态，引导策略朝着具有更大长期价值的长期轨迹发展。FEMA可以轻松与无模型强化学习算法结合，在多个经典强化学习算法中，MuJoCo任务的样本效率提升显著达33.11%。此外，将FEMA整合进并行PPO训练流程，展示了其在现实世界双足机器人任务中的有效性。

$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

$\textbf{Re}^{2}$：通过带重新求解的强化学习解锁大型语言模型推理

Authors: Pinzheng Wang, Shuli Xu, Juntao Li, Yu Luo, Dong Li, Jianye Hao, Min Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07197
Pdf link: https://arxiv.org/pdf/2603.07197
Abstract Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.
中文摘要 带有可验证奖励的强化学习（RLVR）已显示出通过增加测试时计算量来提升大型语言模型（LLMs）推理性能的潜力。然而，即使经过大量RLVR训练，这些模型仍倾向于在思维链（CoT）中产生不必要且低质量的步骤，导致低效的过度思考和答案质量下降。我们表明，当初始CoT的方向或质量不佳时，模型往往无法得出正确答案，即使生成的token数是初始CoT良好时的数倍。为此，我们引入了带重新求解的强化学习（Reinforcement Learning with Re-solving，Re$^2$），其中LLM学会灵活地放弃无效的推理路径，并在必要时重新开始求解过程，而不是总是坚持给出最终答案。Re$^2$ 应用纯强化学习，无需任何预备监督微调，成功将原版模型中罕见的重做行为从仅 0.5% 放大到 30% 以上。这在相同训练计算预算下相比标准RLVR实现了显著的性能提升，并且随着样本数量增加，测试时性能也有显著提升。
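
Illustrative sketch (not from the paper): the Re$^2$ abstract above describes models that may abandon an unproductive reasoning path and restart instead of always committing to an answer. The loop below shows only that inference-time control flow, not the RL training. The `attempt` callable, its `(answer, redo)` return shape, and the `max_attempts` budget are hypothetical stand-ins for a model call.

```python
from typing import Callable, Optional, Tuple

# An attempt returns (answer, redo); redo=True means the model abandoned this path.
Attempt = Callable[[str], Tuple[Optional[str], bool]]


def solve_with_resolving(question: str, attempt: Attempt, max_attempts: int = 4):
    """Run attempts until one commits to an answer or the budget is exhausted."""
    for n in range(1, max_attempts + 1):
        answer, redo = attempt(question)
        if not redo:
            return answer, n       # the model committed to a final answer
    return None, max_attempts      # every attempt was abandoned
```

The function returns both the answer and the number of attempts used, so a caller could track how often restarts occur.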

Reinforcement Learning for Vehicle-to-Grid Voltage Regulation: Single-Hub to Multi-Hub Coordination with Battery-Aware Constraints

车辆与电网电压调节的强化学习：单枢纽到多枢纽的协调，具备电池感知约束

Authors: Jingbo Wang, Roshni Anna Jacob, Harshal D. Kaushik, Jie Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.07237
Pdf link: https://arxiv.org/pdf/2603.07237
Abstract This paper presents a Vehicle-to-Grid (V2G) coordination framework using reinforcement learning (RL). An intelligent control strategy based on the soft actor-critic algorithm is developed for voltage regulation through single and multi-hub charging systems while respecting realistic fleet constraints. A two-phase training approach integrates stability-focused learning with battery-aware deployment to ensure practical feasibility. Simulation studies on the IEEE 34-bus system validate the framework against a standard Volt-Var/Volt-Watt droop controller. Results indicate that the RL agent achieves performance comparable to the baseline control strategy in nominal scenarios. Under aggressive overloading, it provides robust voltage recovery (within 10% of the baseline) while prioritizing fleet availability and state-of-charge preservation, demonstrating the viability of constraint-aware learning for critical grid services.
中文摘要 本文提出了一种利用强化学习（RL）的车辆到电网（V2G）协调框架。基于软演员-评论家（soft actor-critic）算法开发了智能控制策略，通过单枢纽和多枢纽充电系统进行电压调节，同时遵守真实的车队约束。两阶段训练方法将以稳定性为中心的学习与电池感知部署相结合，确保实际可行性。在IEEE 34节点系统上的仿真研究将该框架与标准Volt-Var/Volt-Watt下垂控制器进行了对比验证。结果表明，强化学习代理在额定场景下的表现与基线控制策略相当。在严重过载下，它提供了稳健的电压恢复（与基线相差10%以内），同时优先考虑车队可用性和荷电状态保持，展示了约束感知学习用于关键电网服务的可行性。

Learning When to Cooperate Under Heterogeneous Goals

学习何时在异质目标下合作

Authors: Max Taylor-Davies, Neil Bramley, Christopher G. Lucas
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07253
Pdf link: https://arxiv.org/pdf/2603.07253
Abstract A significant element of human cooperative intelligence lies in our ability to identify opportunities for fruitful collaboration; and conversely to recognise when the task at hand is better pursued alone. Research on flexible cooperation in machines has left this meta-level problem largely unexplored, despite its importance for successful collaboration in heterogeneous open-ended environments. Here, we extend the typical Ad Hoc Teamwork (AHT) setting to incorporate the idea of agents having heterogeneous goals that in any given scenario may or may not overlap. We introduce a novel approach to learning policies in this setting, based on a hierarchical combination of imitation and reinforcement learning, and show that it outperforms baseline methods across extended versions of two cooperative environments. We also investigate the contribution of an auxiliary component that learns to model teammates by predicting their actions, finding that its effect on performance is inversely related to the amount of observable information about teammate goals.
中文摘要 人类合作智能的一个重要要素在于我们能够识别富有成效的合作机会;反过来，也要识别何时更适合独自完成手头的任务。尽管对异构开放环境中的成功协作至关重要，但关于机器灵活协作的研究却大多未被充分探索。在这里，我们扩展了典型的临时团队合作（AHT）设置，纳入代理拥有异质目标的概念，这些目标在任何特定场景中可能重叠也可能不重叠。我们在该环境中引入了一种基于模仿与强化学习的层级结合的新学习策略方法，并证明其在两个合作环境的扩展版本中表现优于基线方法。我们还研究了一个辅助组件的贡献，该组件通过预测队友的行为来学习建模，发现其对表现的影响与队友目标可观测信息量呈反比关系。

Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

基于运动学感知的潜在世界模型，实现数据高效的自动驾驶

Authors: Jiazhuo Li, Linjiang Cao, Qi Liu, Xi Xiong
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07264
Pdf link: https://arxiv.org/pdf/2603.07264
Abstract Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.
中文摘要 由于大规模现实世界交互带来的高成本和安全风险，数据高效的学习仍是自动驾驶的核心挑战。尽管基于世界模型的强化学习通过潜在想象力实现了策略优化，但现有方法往往缺乏明确的机制来编码驱动任务所需的空间和运动结构。在本研究中，我们基于循环状态空间模型（RSSM）提出一个基于运动学感知的自动驾驶潜在世界模型框架。车辆运动学信息被纳入观测编码器，用于物理意义运动动力学中的地面潜跃，而几何感知监督则规范RSSM潜态，捕捉超越像素重建的任务相关空间结构。由此产生的结构化潜在动态提高了长期视野的想象力真实度，稳定了政策优化。驾驶模拟基准测试中的实验显示，在样本效率和驾驶性能方面，相较于无模型和基于像素的世界模型基线均有持续提升。消融研究进一步证实，所提设计提升了潜空间内的空间表现质量。这些结果表明，将运动学基础融入基于RSSM的世界模型，为自动驾驶政策学习提供了可扩展且物理基础的范式。

Adaptive Double-Booking Strategy for Outpatient Scheduling Using Multi-Objective Reinforcement Learning

利用多目标强化学习实现门诊预约的自适应双重预约策略

Authors: Ninda Nurseha Amalina, Heungjo An
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07270
Pdf link: https://arxiv.org/pdf/2603.07270
Abstract Patient no-shows disrupt outpatient clinic operations, reduce productivity, and may delay necessary care. Clinics often adopt overbooking or double-booking to mitigate these effects. However, poorly calibrated policies can increase congestion and waiting times. Most existing methods rely on fixed heuristics and fail to adapt to real-time scheduling conditions or patient-specific no-show risk. To address these limitations, we propose an adaptive outpatient double-booking framework that integrates individualized no-show prediction with multi-objective reinforcement learning. The scheduling problem is formulated as a Markov decision process, and patient-level no-show probabilities estimated by a Multi-Head Attention Soft Random Forest model are incorporated in the reinforcement learning state. We develop a Multi-Policy Proximal Policy Optimization method equipped with a Multi-Policy Co-Evolution Mechanism. Under this mechanism, we propose a novel $\tau$ rule based on Kullback-Leibler divergence that enables selective knowledge transfer among behaviorally similar policies, improving convergence and expanding the diversity of trade-offs. In addition, SHapley Additive exPlanations is used to interpret both the predicted no-show risk and the agent's scheduling decisions. The proposed framework determines when to single-book, double-book, or reject appointment requests, providing a dynamic and data-driven alternative to conventional outpatient scheduling policies.
中文摘要 患者爽约会扰乱门诊运营，降低生产力，并可能延迟必要的护理。诊所常常采用超额预约或双重预约来减轻这些影响。然而，校准不当的策略会增加拥堵和等待时间。大多数现有方法依赖固定启发式方法，无法适应实时排班条件或患者特定的爽约风险。为解决这些局限性，我们提出了一种自适应门诊双重预约框架，将个体化爽约预测与多目标强化学习相结合。调度问题被表述为马尔可夫决策过程，并将由多头注意力软随机森林模型估计的患者层面爽约概率纳入强化学习状态。我们开发了一种配备多策略协同进化机制的多策略近端策略优化方法。在该机制下，我们提出了一种基于Kullback-Leibler散度的新 $\tau$ 规则，允许在行为相似的策略之间进行选择性知识迁移，从而改善收敛并扩大权衡的多样性。此外，SHapley Additive exPlanations用于解释预测的爽约风险和智能体的排班决策。该框架决定何时单一预约、双重预约或拒绝预约请求，为传统门诊预约策略提供了一种动态且数据驱动的替代方案。
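
Illustrative sketch (not from the paper): the abstract above mentions a $\tau$ rule based on Kullback-Leibler divergence that restricts knowledge transfer to behaviorally similar policies. One plausible reading is a simple threshold on the KL divergence between the policies' action distributions. The function names, the use of a single categorical distribution per policy, and the strict `< tau` comparison are assumptions; the abstract does not give the actual rule.

```python
import numpy as np


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for two categorical action distributions over the same actions."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def transfer_partners(policies: dict, tau: float) -> dict:
    """For each policy, list the other policies whose KL divergence from it is below tau."""
    return {
        a: [b for b in policies if b != a and kl_divergence(policies[a], policies[b]) < tau]
        for a in policies
    }
```

Since KL divergence is asymmetric, policy A may accept B as a partner without the reverse holding; a symmetric variant would need an explicit choice such as averaging both directions.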

AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

AutoResearch-RL：用于自主神经结构发现的永续自我评估强化学习代理

Authors: Nilesh Jain, Rohit Yadav, Sagar Kotian, Claude AI
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07300
Pdf link: https://arxiv.org/pdf/2603.07300
Abstract We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (this http URL) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
中文摘要 我们提出了AutoResearch-RL，这是一个框架，其中强化学习代理在没有人类监督的情况下进行开放式神经架构和超参数研究，持续运行，直到终止预言机发出收敛或资源耗尽的信号。在每一步，代理提出对目标训练脚本的代码修改，在固定的挂钟时间预算下执行，观察由验证集每字节比特数（val-bpb）推导出的标量奖励，并通过近端策略优化（PPO）更新策略。关键的设计洞见是将三个关注点分离：（i）一个冻结的环境（数据流水线、评估协议和常数），以确保公平的跨实验比较；（ii）一个可变的目标文件（this http URL），代表代理的可编辑状态；以及（iii）一个元学习器（即强化学习代理本身），负责积累不断增长的实验结果轨迹，并用它们来指导后续的提案。我们将此过程形式化为马尔可夫决策过程，在温和假设下推导收敛保证，并在单GPU的nanochat预训练基准上实证证明，AutoResearch-RL在约300次过夜迭代后发现了匹配或超过手工调优基线的配置，且无人工参与。
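
Illustrative sketch (not from the paper): the abstract above derives a scalar reward from validation bits-per-byte (val-bpb). The conversion from a summed negative log-likelihood in nats to bits per byte is standard; the reward shape (improvement over a baseline run) is an assumption, since the abstract does not state how the reward is computed from val-bpb.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte of text."""
    return total_nll_nats / (math.log(2) * num_bytes)


def reward_from_bpb(baseline_bpb: float, candidate_bpb: float) -> float:
    # Assumed reward shape: positive when the modified script lowers val-bpb.
    return baseline_bpb - candidate_bpb
```

Normalizing by bytes rather than tokens keeps the metric comparable across runs that change the tokenizer, which fits the abstract's emphasis on fair cross-experiment comparison.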

Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

针对部分可观测域中稳健策略的对抗性潜态训练

Authors: Angad Singh Ahuja
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.07313
Pdf link: https://arxiv.org/pdf/2603.07313
Abstract Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response certificates with finite-sample guarantees, providing formal meaning to empirical training diagnostics. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior entirely consistent with our approximate certificate theory. Ultimately, we show that for latent-initial-state problems, our framework yields precise diagnostic principles and confirms that structured adversarial exposure effectively mitigates worst-case vulnerabilities.
中文摘要 在部分可观察的强化学习中，潜分布偏移下的鲁棒性仍然具有挑战性。我们形式化了一个聚焦环境，即对手在事件发生前选择隐藏的初始潜在分布，称为对抗潜在-初始状态POMDP。理论上，我们证明了潜在极小极大原理，表征了最坏情况防御者分布，并推导了带有有限样本保证的最佳响应证书，为实证训练诊断提供了形式意义。通过实证，我们利用战舰基准测试证明，针对移动潜在分布的有针对性暴露，在相同预算下，散布和均匀分布之间的平均稳健度差距从10.3降至3.1。此外，迭代最佳反应训练表现出与我们的近似证书理论完全一致的预算敏感行为。最终，我们证明对于潜初始态问题，我们的框架提供了精确的诊断原则，并证实结构化对抗性暴露有效缓解了最坏情况下的脆弱性。

Learning to Reflect: Hierarchical Multi-Agent Reinforcement Learning for CSI-Free mmWave Beam-Focusing

学习反射：无CSI毫米波束聚焦的分层多智能体强化学习

Authors: Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07370
Pdf link: https://arxiv.org/pdf/2603.07370
Abstract Reconfigurable Intelligent Surfaces promise to transform wireless environments, yet practical deployment is hindered by the prohibitive overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization. This paper proposes a Hierarchical Multi-Agent Reinforcement Learning (HMARL) framework for the control of mechanically reconfigurable reflective surfaces in millimeter-wave (mmWave) systems. We introduce a "CSI-free" paradigm that substitutes pilot-based channel estimation with readily available user localization data. To manage the massive combinatorial action space, the proposed architecture utilizes Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) paradigm. The proposed architecture decomposes the control problem into two abstraction levels: a high-level controller for user-to-reflector allocation and decentralized low-level controllers for low-level focal point optimization. Comprehensive ray-tracing evaluations demonstrate that the framework achieves 2.81-7.94 dB RSSI improvements over centralized baselines, with the performance advantage widening as system complexity increases. Scalability analysis reveals that the system maintains sustained efficiency, exhibiting minimal per-user performance degradation and stable total power utilization even when user density doubles. Furthermore, robustness validation confirms the framework's viability across varying reflector aperture sizes (45-99 tiles) and demonstrates graceful performance degradation under localization errors up to 0.5 m. By eliminating CSI overhead while maintaining high-fidelity beam-focusing, this work establishes HMARL as a practical solution for intelligent mmWave environments.
中文摘要 可重构智能表面有望彻底改变无线环境，但实际部署仍受限于信道状态信息（CSI）估计的高额开销以及集中式优化带来的维度爆炸。本文提出了一种分层多智能体强化学习（HMARL）框架，用于控制毫米波（mmWave）系统中机械可重构的反射面。我们引入了一种“无CSI”范式，用现成的用户本地化数据替代了基于试点的频道估计。为管理庞大的组合动作空间，所提架构采用了多代理近端策略优化（MAPPO），采用集中式训练与去中心化执行（CTDE）范式。所提架构将控制问题分解为两个抽象层级：用于用户与反射器分配的高层控制器，以及用于低层次焦点优化的去中心化低级控制器。全面的光线追踪评估表明，该框架相较于集中式基线实现了2.81-7.94 dB的RSSI提升，且随着系统复杂度的增加，性能优势进一步扩大。可扩展性分析显示，系统保持持续的效率，即使用户密度翻倍，每用户性能下降极小，总功耗也稳定。此外，鲁棒性验证确认了该框架在不同反射孔径（45-99格）下的可行性，并在定位误差达0.5米的情况下表现出优雅的性能下降。通过消除CSI开销同时保持高精度波束聚焦，这项工作确立了HMARL作为智能毫米波环境实用解决方案的地位。

Underwater Embodied Intelligence for Autonomous Robots: A Constraint-Coupled Perspective on Planning, Control, and Deployment

自主机器人的水下具身智能：关于规划、控制与部署的约束耦合视角

Authors: Jingzehua Xu, Guanwen Xie, Jiwei Tang, Shuai Zhang, Xiaofan Li
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.07393
Pdf link: https://arxiv.org/pdf/2603.07393
Abstract Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long-horizon exploration. Yet, despite rapid advances in learning-based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth-limited communication, and energy scarcity are not independent challenges; they interact within the closed perception-planning-control loop and often amplify one another over time. This Review develops a constraint-coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief-aware planning, hybrid control, multi-robot coordination, and foundation-model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross-layer coupling. To unify these observations, we introduce a cross-layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics-grounded world models, certifiable learning-enabled control, communication-aware coordination, and deployment-aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance-driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.
中文摘要 自主水下机器人正日益被用于环境监测、基础设施检查、海底资源勘探和长视野勘探。然而，尽管基于学习的规划和控制取得了快速进步，真实海洋环境中的可靠自主性仍然受到紧密耦合的物理限制。水动力学不确定性、部分可观测性、带宽有限的通信和能源稀缺并非独立的挑战;它们在封闭的感知-规划-控制循环中相互作用，且常常随着时间相互放大。本综述提出了对水下具身智能的约束耦合视角，认为规划与控制必须在真实海洋环境中紧密耦合的感知、通信、协调和资源约束中理解。我们综合了强化学习、信念感知规划、混合控制、多机器人协调和基础模型整合的最新进展，通过这一具身视角。在代表性的应用领域，我们展示了环境监测、检查、勘探和合作任务如何揭示跨层耦合的不同应力剖面。为统一这些观察，我们引入了跨层失败分类法，涵盖认知学、动态和协调崩溃，并分析错误如何在不确定性下跨自律层级传递。基于这一结构，我们提出了基于物理的世界模型、可认证的学习驱动控制、通信感知协调以及部署感知系统设计的研究方向。通过内化约束耦合而非视为外部干扰，水下具身智能可能从性能驱动的适应演变为在真实海洋条件下具备韧性、可扩展性和可验证的自主性。

Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests

动态车辆路由问题，需提前确认请求

Authors: Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07422
Pdf link: https://arxiv.org/pdf/2603.07422
Abstract Transit agencies that operate on-demand transportation services have to respond to trip requests from passengers in real time, which involves solving dynamic vehicle routing problems with pick-up and drop-off constraints. Based on discussions with public transit agencies, we observe a real-world problem that is not addressed by prior work: when trips are booked in advance (e.g., trip requests arrive a few hours in advance of their requested pick-up times), the agency needs to promptly confirm whether a request can be accepted or not, and ensure that accepted requests are served as promised. State-of-the-art computational approaches either provide prompt confirmation but lack the ability to continually optimize and improve routes for accepted requests, or they provide continual optimization but cannot guarantee serving all accepted requests. To address this gap, we introduce a novel problem formulation of dynamic vehicle routing with prompt confirmation and continual optimization. We propose a novel computational approach for this vehicle routing problem, which integrates a quick insertion search for prompt confirmation with an anytime algorithm for continual optimization. To maximize the number of requests served, we train a non-myopic objective function using reinforcement learning, which guides both the insertion and the anytime algorithms towards optimal, non-myopic solutions. We evaluate our computational approach on a real-world microtransit dataset from a public transit agency in the U.S., demonstrating that our proposed approach provides prompt confirmation while significantly increasing the number of requests served compared to existing approaches.
中文摘要 运营按需交通服务的交通机构必须实时响应乘客的出行请求，这涉及解决上下车限制下的动态车辆路线问题。根据与公共交通机构的讨论，我们观察到一个现实问题，之前的工作未能解决：当行程提前预订（例如行程请求比请求的取车时间提前几小时到达）时，机构需要迅速确认请求是否能被接受，并确保接受的请求能够如约交付。最先进的计算方法要么提供即时确认，但缺乏持续优化和改进被接受请求路由的能力;要么提供持续优化但无法保证满足所有被接受请求。为弥补这一空白，我们引入了动态车辆路径的新问题表述，具有及时确认和持续优化。我们提出了一种新颖的计算方法，针对该车辆路由问题，结合快速插入搜索以获得即时确认，并结合随时持续优化的算法。为了最大化请求数量，我们通过强化学习训练一个非近视目标函数，指导插入和任意算法朝向最优、非短视的解。我们基于美国一家公共交通机构的真实微型交通数据集评估了我们的计算方法，证明我们提出的方法能够迅速确认，同时显著增加了请求数量，相较于现有方法。

Generalization in Online Reinforcement Learning for Mobile Agents

移动代理在线强化学习中的推广

Authors: Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07432
Pdf link: https://arxiv.org/pdf/2603.07432
Abstract Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure, asynchronous execution, and error recovery to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (this https URL).
中文摘要 基于图形用户界面（GUI）的移动代理通过解释自然语言指令并与屏幕交互，自动化移动设备上的数字任务。虽然最新方法在交互环境中应用强化学习（RL）训练视觉语言模型（VLM）代理，主要关注性能，但由于缺乏标准化基准和开源强化学习系统，泛化性仍未被充分探索。在本研究中，我们将该问题形式化为情境马尔可夫决策过程（CMDP），并引入了 AndroidWorld-Generalization，这是一个包含三种难度递增设置的基准，用于评估对未见任务实例、模板和应用的零样本泛化。我们还提出了一个强化学习训练系统，将群相对策略优化（GRPO）与可扩展的rollout收集系统集成，该系统由容器化基础设施、异步执行和错误恢复组成，以支持可靠高效的训练。在 AndroidWorld-Generalization 上的实验表明，强化学习使 7B 参数的 VLM 代理能够超越监督微调基线，在未见实例上提升 26.1%，但在未见模板（15.7%）和未见应用（8.3%）上提升有限，凸显了泛化的挑战。作为初步步骤，我们证明测试时的少样本适应能提升未见应用上的性能，为未来在这一方向上的研究提供了动机。为了支持可复现性和公平比较，我们将完整的强化学习训练系统开源，包括环境、任务套件、模型、提示配置以及底层基础设施（this https URL）。
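
Illustrative sketch (not from the paper): the abstract above builds its training system on Group Relative Policy Optimization (GRPO). The core of GRPO is that each rollout's advantage is computed relative to the other rollouts sampled for the same task, rather than from a learned value function. The snippet shows that standardization step only; the function name and the `eps` stabilizer are illustrative choices, and the clipped policy-gradient update is omitted.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within one group of rollouts sampled for the same task."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason rollout diversity matters for this family of methods.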

Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part II

线性二次高斯控制的成本驱动表述学习：第二部分

Authors: Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.07437
Pdf link: https://arxiv.org/pdf/2603.07437
Abstract We study the problem of state representation learning for control from partial and potentially high-dimensional observations. We approach this problem via cost-driven state representation learning, in which we learn a dynamical model in a latent state space by predicting cumulative costs. In particular, we establish finite-sample guarantees on finding a near-optimal representation function and a near-optimal controller using the learned latent model for infinite-horizon time-invariant Linear Quadratic Gaussian (LQG) control. We study two approaches to cost-driven representation learning, which differ in whether the transition function of the latent state is learned explicitly or implicitly. The first approach has also been investigated in Part I of this work, for finite-horizon time-varying LQG control. The second approach closely resembles MuZero, a recent breakthrough in empirical reinforcement learning, in that it learns latent dynamics implicitly by predicting cumulative costs. A key technical contribution of this Part II is to prove persistency of excitation for a new stochastic process that arises from the analysis of quadratic regression in our approach, and may be of independent interest.
中文摘要 我们研究从部分且可能高维的观测中进行控制的状态表示学习问题。我们通过成本驱动的状态表示学习来解决这个问题，即通过预测累积成本，在潜在状态空间中学习动力学模型。特别地，对于无限时域时不变线性二次高斯（LQG）控制，我们建立了有限样本保证，以利用学习到的潜在模型求得近似最优的表示函数和近似最优的控制器。我们研究两种成本驱动的表示学习方法，它们的区别在于潜在状态的转移函数是显式学习还是隐式学习。第一种方法也在本工作的第一部分中针对有限时域时变LQG控制进行了研究。第二种方法与MuZero非常相似，MuZero是经验强化学习中的一项近期突破，它通过预测累积成本隐式学习潜在动力学。本文第二部分的一个关键技术贡献是证明了一种新随机过程的持续激励性（persistency of excitation），该过程源自我们方法中对二次回归的分析，且可能具有独立的研究价值。

Med-Evo: Test-time Self-evolution for Medical Multimodal Large Language Models

Med-Evo：医学多模态大型语言模型的测试时间自我演化

Authors: Dunyuan Xu, Xikai Yang, Juzheng Miao, Yaoqian Li, Jinpeng Li, Pheng-Ann Heng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.07443
Pdf link: https://arxiv.org/pdf/2603.07443
Abstract Medical Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse healthcare tasks. However, current post-training strategies, such as supervised fine-tuning and reinforcement learning, heavily depend on substantial annotated data while overlooking the potential of unlabeled test data for model enhancement. This limitation becomes particularly pronounced in medical domains, where acquiring extensive labeled medical data is difficult due to the strict data sensitivity and annotation complexity. Moreover, leveraging test data poses challenges in generating reliable supervision signals from unlabeled samples and maintaining stable self-evolution. To address these limitations, we propose Med-Evo, the first self-evolution framework for medical MLLMs that utilizes label-free reinforcement learning to promote model performance without requiring additional labeled data. Our framework introduces two key innovations: 1) Feature-driven Pseudo Labeling (FPL) that identifies semantic centroids from all heterogeneous candidate responses to select pseudo labels in each rollout, and 2) Hard-Soft Reward (HSR) that combines exact match with token-level assessment and semantic similarity to provide hierarchical reward. Experiments on three medical VQA benchmarks and two base MLLMs show clear advantages of our approach over SOTA methods, with significant improvements of 10.43% accuracy and 4.68% recall on the SLAKE dataset using Qwen2.5-VL, showing the effectiveness of our method.
中文摘要 医学多模态大型语言模型（MLLM）在多样化医疗任务中展现出卓越的能力。然而，当前的后训练策略，如监督微调和强化学习，严重依赖大量标注数据，忽视了无标签测试数据在模型增强中的潜力。这一限制在医学领域尤为明显，因为由于严格的数据敏感性和标注复杂性，获取大量带标签的医疗数据非常困难。此外，利用测试数据在从未标记样本中生成可靠监督信号和维持稳定自我演化方面存在挑战。为解决这些局限性，我们提出了Med-Evo，这是首个面向医学MLLM的自我演化框架，它利用无标签强化学习提升模型性能，无需额外标注数据。我们的框架引入了两项关键创新：1）特征驱动伪标记（FPL），在每次rollout中从所有异质候选响应中识别语义质心以选择伪标签；以及 2）硬软奖励（HSR），结合精确匹配、词元级评估和语义相似性，提供层级奖励。在三个医学VQA基准和两个基础MLLM上的实验表明，我们的方法相较于SOTA方法有明显优势，使用Qwen2.5-VL在SLAKE数据集上准确率显著提升10.43%，召回率提升4.68%，显示了我们方法的有效性。
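
Illustrative sketch (not from the paper): the Med-Evo abstract above names two components, centroid-based pseudo labeling and a reward that combines exact match, a token-level score, and semantic similarity. The code below is one guess at how such pieces could fit together. The cosine-to-centroid selection, the use of token-overlap F1 as the token-level score, the 0.5/0.5 weighting, and all function names are assumptions; `semantic_sim` is taken as a precomputed number in [0, 1] from some embedding model.

```python
import numpy as np


def pick_pseudo_label(candidates, embeddings) -> str:
    """Return the candidate closest (by cosine similarity) to the mean of the normalized embeddings."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    centroid = e.mean(axis=0)
    return candidates[int(np.argmax(e @ centroid))]


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized strings."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)


def hard_soft_reward(pred: str, pseudo: str, semantic_sim: float) -> float:
    # Assumed hierarchy: exact match earns the full reward; otherwise average the soft signals.
    if pred.strip().lower() == pseudo.strip().lower():
        return 1.0
    return 0.5 * token_f1(pred, pseudo) + 0.5 * semantic_sim
```

The hard branch gives an unambiguous signal when the rollout agrees with the pseudo label, while the soft branch still rewards partially correct answers.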

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

EvolveReason：可解释的深度伪造面部图像识别自我演进推理范式

Authors: Binjia Zhou, Dawei Luo, Shuai Chen, Feng Xu, Seow, Haoyuan Li, Jiachi Wang, Jiawen Wang, Zunlei Feng, Yijun Bei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.07515
Pdf link: https://arxiv.org/pdf/2603.07515
Abstract With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.
中文摘要 随着AIGC技术的快速发展，开发识别方法以应对深度伪造带来的安全挑战变得迫在眉头。面部伪造识别技术可分为两类：传统分类方法和可解释的VLM方法。前者提供分类结果但缺乏解释能力，后者虽然能提供粗略解释，但常常存在幻觉和细节不足的问题。为克服这些局限，我们提出了EvolveReason，它模拟了人类审计员识别人脸伪造时的推理和观察过程。通过构建一个面向高级VLM的思维链数据集CoT-Face，我们的方法引导模型以类人的方式思考，促使其输出推理过程和判断结果。这为从业者提供了可靠的分析，有助于缓解幻觉。此外，我们的框架还集成了伪造潜空间分布采集模块，使EvolveReason能够识别难以从原始图像中提取的高频伪造线索。为进一步提升文本解释的可靠性，我们引入了自我进化探索策略，利用强化学习使模型能够在两阶段过程中迭代探索和优化文本描述。实验结果显示，EvolveReason不仅在识别性能上超越了当前最先进的方法，还能准确识别伪造细节并展现出泛化能力。

InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills

InterReal：一个基于物理的统一模仿框架，用于学习人与物交互技能

Authors: Dayang Liang, Yuhang Lin, Xinzhe Liu, Jiyuan Shi, Yunlong Liu, Chenjia Bai
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07516
Pdf link: https://arxiv.org/pdf/2603.07516
Abstract Interaction is one of the core abilities of humanoid robots. However, most existing frameworks focus on non-interactive whole-body control, which limits their practical applicability. In this work, we develop InterReal, a unified physics-based imitation learning framework for Real-world human-object Interaction (HOI) control. InterReal enables humanoid robots to track HOI reference motions, facilitating the learning of fine-grained interactive skills and their deployment in real-world settings. Within this framework, we first introduce a HOI motion data augmentation scheme with hand-object contact constraints, and utilize the augmented motions to improve policy stability under object perturbations. Second, we propose an automatic reward learner to address the challenge of large-scale reward shaping. A meta-policy guided by critical tracking error metrics explores and allocates reward signals to the low-level reinforcement learning objective, which enables more effective learning of interactive policies. Experiments on HOI tasks of box-picking and box-pushing demonstrate that InterReal achieves the best tracking accuracy and the highest task success rate compared to recent baselines. Furthermore, we validate the framework on the real-world robot Unitree G1, which demonstrates its practical effectiveness and robustness beyond simulation.
中文摘要 交互是类人机器人的核心能力之一。然而，大多数现有框架专注于非交互式的全身控制，这限制了其实际应用。在本研究中，我们开发了InterReal，这是一个统一的基于物理的模拟学习框架，用于现实世界的人与物交互（HOI）控制。InterReal使类人机器人能够追踪HOI参考动作，促进细粒度交互技能的学习及其在现实环境中的应用。在此框架下，我们首先引入了带有手-物体接触约束的HOI运动数据增强方案，并利用增强运动提升物体扰动下的策略稳定性。其次，我们提出一种自动奖励学习器，以应对大规模奖励塑造的挑战。由关键跟踪错误指标引导的元策略探索并分配奖励信号至低层级强化学习目标，从而实现更高效的交互策略学习。关于HOI任务如箱子拣选和推箱的实验表明，InterReal相比近期基线实现了最佳的跟踪精度和最高的任务成功率。此外，我们在现实机器人Unitree G1上验证了该框架，展示了其在模拟后实际有效性和鲁棒性。

Reinforcement learning-based dynamic cleaning scheduling framework for solar energy system

基于强化学习的太阳能系统动态清洁调度框架

Authors: Heungjo An
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07518
Pdf link: https://arxiv.org/pdf/2603.07518
Abstract Advancing autonomous green technologies in solar photovoltaic (PV) systems is key to improving sustainability and efficiency in renewable energy production. This study presents a reinforcement learning (RL)-based framework to autonomously optimize the cleaning schedules of PV panels in arid regions, where soiling from dust and other airborne particles significantly reduces energy output. By employing advanced RL algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), the framework dynamically adjusts cleaning intervals based on uncertain environmental conditions. The proposed approach was applied to a case study in Abu Dhabi, UAE, demonstrating that PPO outperformed SAC and traditional simulation optimization (Sim-Opt) methods, achieving up to 13% cost savings by dynamically responding to weather uncertainties. The results highlight the superiority of flexible, autonomous scheduling over fixed-interval methods, particularly in adapting to stochastic environmental dynamics. This aligns with the goals of autonomous green energy production by reducing operational costs and improving the efficiency of solar power generation systems. This work underscores the potential of RL-driven autonomous decision-making to optimize maintenance operations in renewable energy systems. In future research, it is important to enhance the generalization ability of the proposed RL model, while also considering additional factors and constraints to apply it to different regions.
中文摘要 推进太阳能光伏（PV）系统中的自主绿色技术，是提升可再生能源生产可持续性和效率的关键。本研究提出了一个基于强化学习（RL）的框架，用于自主优化干旱地区光伏板的清洁计划；在这些地区，灰尘和其他空气中颗粒物造成的积污会显著降低能量输出。通过采用先进的强化学习算法，即近端策略优化（PPO）和软演员-评论家（SAC），该框架根据不确定的环境条件动态调整清洁间隔。该方法应用于阿联酋阿布扎比的一项案例研究，结果表明PPO优于SAC和传统仿真优化（Sim-Opt）方法，通过动态响应天气不确定性实现高达13%的成本节约。结果凸显了灵活自主调度相较于固定间隔方法的优越性，尤其是在适应随机环境动态方面。这通过降低运营成本并提升太阳能发电系统效率，与自主绿色能源生产的目标相契合。这项工作强调了强化学习驱动的自主决策在优化可再生能源系统维护操作中的潜力。在未来的研究中，提升所提出强化学习模型的泛化能力非常重要，同时也要考虑额外的因素和约束，以便将其应用于不同地区。

TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning

TableMind++：一个用于工具增强表格推理的不确定性感知程序化智能体

Authors: Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu, Enhong Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.07528
Pdf link: https://arxiv.org/pdf/2603.07528
Abstract Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM). TableMind internalizes planning, action, and reflection through a two-stage training strategy involving supervised fine-tuning (SFT) on filtered high-quality data and reinforcement learning (RL) via a multi-perspective reward and the Rank-Aware Policy Optimization (RAPO) algorithm. While TableMind establishes a solid foundation for programmatic agents, the inherent stochasticity of LLMs remains a critical challenge that leads to hallucinations. In this paper, we extend this foundation to TableMind++ by introducing a novel uncertainty-aware inference framework to mitigate hallucinations. Specifically, we propose memory-guided plan pruning to retrieve historical trajectories for validating and filtering out logically flawed plans to address epistemic uncertainty. To ensure execution precision, we introduce confidence-based action refinement which monitors token-level probabilities to detect and self-correct syntactic noise for aleatoric uncertainty mitigation. Finally, we employ dual-weighted trajectory aggregation to synthesize a robust consensus from multiple reasoning paths. Extensive experiments on diverse benchmarks demonstrate that TableMind++ consistently outperforms previous baselines and proprietary models to validate the effectiveness of integrating autonomous training with uncertainty quantification. Our code is available.
中文摘要 表格推理要求模型同时执行语义理解和精确的数值运算。大多数现有方法依赖针对表格的单轮推理范式，存在上下文溢出和数值敏感性较弱的问题。为解决这些局限性，我们此前提出了TableMind，这是一种基于微调的自主程序化智能体，在轻量级大型语言模型（LLM）中模拟类人交互。TableMind通过两阶段训练策略内化规划、行动和反思，包括在经过筛选的高质量数据上进行监督微调（SFT），以及通过多视角奖励和秩感知策略优化（RAPO）算法进行强化学习（RL）。虽然TableMind为程序化智能体奠定了坚实基础，但LLM固有的随机性仍是一个关键挑战，会导致幻觉。本文将这一基础扩展为TableMind++，引入了一个新的不确定性感知推理框架以减轻幻觉。具体来说，我们提出记忆引导的计划剪枝，通过检索历史轨迹来验证并过滤存在逻辑缺陷的计划，以应对认知不确定性。为确保执行精度，我们引入基于置信度的动作细化，监测词元级概率以检测并自我纠正语法噪声，从而减轻偶然不确定性。最后，我们采用双重加权轨迹聚合，从多条推理路径中综合出稳健的共识。在多种基准测试上的大量实验表明，TableMind++ 持续优于以往基线和专有模型，验证了自主训练与不确定性量化相结合的有效性。我们的代码已开放。

COOL-MC: Verifying and Explaining RL Policies for Multi-bridge Network Maintenance

COOL-MC：验证和解释多桥网络维护的强化学习策略

Authors: Dennis Gross
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07546
Pdf link: https://arxiv.org/pdf/2603.07546
Abstract Aging bridge networks require proactive, verifiable, and interpretable maintenance strategies, yet reinforcement learning (RL) policies trained solely on reward signals provide no formal safety guarantees and remain opaque to infrastructure managers. We demonstrate COOL-MC as a tool for verifying and explaining RL policies for multi-bridge network maintenance, building on a single-bridge Markov decision process (MDP) from the literature and extending it to a parallel network of three heterogeneous bridges with a shared periodic budget constraint, encoded in the PRISM modeling language. We train an RL agent on this MDP and apply probabilistic model checking and explainability methods to the induced discrete-time Markov chain (DTMC) that arises from the interaction between the learned policy and the underlying MDP. Probabilistic model checking reveals that the trained policy has a safety-violation probability of 3.5% over the planning horizon, slightly above the theoretical minimum of 0%, which indicates that the learned policy is suboptimal. These results are based on artificially constructed transition probabilities and deterioration rates rather than real-world data, so absolute performance figures should be interpreted with caution. The explainability analysis further reveals, for instance, a systematic bias in the trained policy toward the state of bridge 1 over the remaining bridges in the network. These results demonstrate COOL-MC's ability to provide formal, interpretable, and practical analysis of RL maintenance policies.
中文摘要 老化的桥梁网络需要主动、可验证且可解释的维护策略，但仅基于奖励信号训练的强化学习（RL）策略无法提供形式化的安全保证，且对基础设施管理者来说仍然不透明。我们展示了COOL-MC作为验证和解释多桥网络维护强化学习策略的工具，基于文献中的单桥马尔可夫决策过程（MDP），并将其扩展到由三座异构桥梁组成、具有共享周期性预算约束的并行网络，并用PRISM建模语言编码。我们在该MDP上训练强化学习智能体，并对由学习策略与底层MDP交互产生的离散时间马尔可夫链（DTMC）应用概率模型检验和可解释性方法。概率模型检验显示，训练出的策略在规划期内的安全违规概率为3.5%，略高于理论最低值0%，表明所学策略是次优的。需要注意的是，这些结果基于人为构造的转移概率和劣化率，而非真实世界数据，因此应谨慎解读绝对性能数据。可解释性分析进一步揭示，例如，训练出的策略相对于网络中其余桥梁系统性地偏向桥梁1的状态。这些结果展示了COOL-MC对强化学习维护策略进行形式化、可解释且实用分析的能力。
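The safety-violation probability quoted in this abstract is, in model-checking terms, a bounded reachability query on the induced DTMC. The sketch below illustrates that kind of query on a toy three-state chain. The states, transition probabilities, and the function name `bounded_reach_prob` are all hypothetical; this is not the paper's bridge model and it does not use the COOL-MC or PRISM APIs.

```python
def bounded_reach_prob(P, start, unsafe, horizon):
    """Probability of reaching any state in `unsafe` within `horizon` steps.

    P is a row-stochastic transition matrix (list of lists) of a DTMC.
    """
    n = len(P)
    # x[s] = probability of hitting an unsafe state within k steps from s (k = 0 here)
    x = [1.0 if s in unsafe else 0.0 for s in range(n)]
    for _ in range(horizon):
        # One backward step: unsafe states stay at 1, others average over successors.
        x = [
            1.0 if s in unsafe else sum(P[s][t] * x[t] for t in range(n))
            for s in range(n)
        ]
    return x[start]


# Hypothetical chain: 0 = good, 1 = degraded, 2 = unsafe (absorbing).
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.4, 0.1],
    [0.0, 0.0, 1.0],
]
prob = bounded_reach_prob(P, start=0, unsafe={2}, horizon=2)
```

In this toy chain the only two-step path to the unsafe state is good, degraded, unsafe, so the result is 0.1 × 0.1 = 0.01. A probabilistic model checker answers the same question on much larger state spaces.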

Constraints Matrix Diffusion based Generative Neural Solver for Vehicle Routing Problems

基于约束矩阵扩散的车辆路径问题生成式神经求解器

Authors: Zhenwei Wang, Tiehua Zhang, Ning Xue, Ender Ozcan, Ling Wang, Ruibin Bai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07568
Pdf link: https://arxiv.org/pdf/2603.07568
Abstract Over the past decade, neural network solvers powered by generative artificial intelligence have garnered significant attention in the domain of vehicle routing problems (VRPs), owing to their exceptional computational efficiency and superior reasoning capabilities. In particular, autoregressive solvers integrated with reinforcement learning have emerged as a prominent trend. However, much of the existing work emphasizes large-scale generalization of neural approaches while neglecting the limited robustness of attention-based methods across heterogeneous distributions of problem parameters. Their improvements over heuristic search remain largely restricted to hand-curated, fixed-distribution benchmarks. Furthermore, these architectures tend to degrade significantly when node representations are highly similar or when tasks involve long decision horizons. To address the aforementioned limitations, we propose a novel fusion neural network framework that employs a discrete noise graph diffusion model to learn the underlying constraints of vehicle routing problems and generate a constraint assignment matrix. This matrix is subsequently integrated adaptively into the feature representation learning and decision process of the autoregressive solver, serving as a graph structure mask that facilitates the formation of solutions characterized by both global vision and local feature integration. To the best of our knowledge, this work represents the first comprehensive experimental investigation of neural network model solvers across a 378-combinatorial space spanning four distinct dimensions within the CVRPlib public dataset. Extensive experimental evaluations demonstrate that our proposed fusion model effectively captures and leverages problem constraints, achieving state-of-the-art performance across multiple benchmark datasets.
中文摘要 在过去十年中，由生成式人工智能驱动的神经网络求解器因其出色的计算效率和优越的推理能力，在车辆路径问题（VRP）领域引起了广泛关注。特别是，与强化学习相结合的自回归求解器已成为一个显著趋势。然而，现有许多工作强调神经方法的大规模泛化，却忽视了基于注意力的方法在问题参数异质分布下鲁棒性有限的问题。它们相对于启发式搜索的改进主要局限于人工挑选的固定分布基准测试。此外，当节点表示高度相似或任务涉及较长的决策范围时，这些架构往往会显著退化。为解决上述局限性，我们提出了一种新型融合神经网络框架，采用离散噪声图扩散模型来学习车辆路径问题的潜在约束并生成约束分配矩阵。该矩阵随后被自适应地集成到自回归求解器的特征表示学习和决策过程中，作为图结构掩码，有助于形成兼具全局视野和局部特征融合的解。据我们所知，这项工作是在CVRPlib公共数据集中，对神经网络模型求解器在跨越四个不同维度的378种组合空间上进行的首次全面实验研究。大量实验评估表明，我们提出的融合模型有效捕捉并利用了问题约束，在多个基准数据集上实现了最先进的性能。

GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

GeoLoco：利用视觉基础模型的三维几何先验实现稳健的纯RGB人形机器人运动

Authors: Yufei Liu, Xieyuanli Chen, Hainan Pan, Chenghao Shi, Yanjie Chen, Kaihong Huang, Zhiwen Zeng, Huimin Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07624
Pdf link: https://arxiv.org/pdf/2603.07624
Abstract The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
中文摘要 感知式人形机器人运动的主流范式高度依赖主动深度传感器。然而，这种以深度为中心的方法从根本上丢弃了视觉世界中丰富的语义和密集的外观线索，切断了低层控制与通用具身智能所必需的高层推理之间的联系。虽然单目RGB提供了一种普遍且信息密集的替代方案，但从原始二维像素进行端到端强化学习，由于几何尺度的固有丢失，存在样本效率极低和灾难性的仿真到现实崩溃问题。为打破僵局，我们提出了GeoLoco，一种纯RGB驱动的运动框架，利用冻结的尺度感知视觉基础模型（VFM）强大的几何先验，将单目图像概念化为高维三维潜在表示。我们没有采用简单的特征拼接，而是设计了一种本体感觉查询的多头交叉注意力机制，以机器人的实时步态相位为条件，动态关注任务关键的拓扑特征。关键的是，为了防止策略对表面纹理过拟合，我们引入了双头辅助学习方案。这种显式正则化迫使高维潜空间严格与物理地形几何对齐，确保稳健的零样本仿真到现实迁移。GeoLoco仅在仿真中训练，即可稳健地零样本迁移到Unitree G1人形机器人上，并成功通过具有挑战性的地形。

Exoskeleton Control through Learning to Reduce Biological Joint Moments in Simulations

通过学习控制外骨骼以减少仿真中的生物关节力矩

Authors: Zihang You, Xianlian Zhou
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07629
Pdf link: https://arxiv.org/pdf/2603.07629
Abstract Data-driven joint-moment predictors offer a scalable alternative to laboratory-based inverse-dynamics pipelines for biomechanics estimation and exoskeleton control. Meanwhile, physics-based reinforcement learning (RL) enables simulation-trained controllers to learn dynamics-aware assistance strategies without extensive human experimentation. However, quantitative verification of simulation-trained exoskeleton torque predictors, and their impact on human joint power injection, remains limited. This paper presents (1) an RL framework to learn exoskeleton assistance policies that reduce biological joint moments, and (2) a validation pipeline that verifies the trained control networks using an open-source gait dataset through inference and comparison with biological joint moments. Simulation-trained multilayer perceptron (MLP) controllers are developed for level-ground and ramp walking, mapping short-horizon histories of bilateral hip and knee kinematics to normalized assistance torques. Results show that predicted assistance preserves task-intensity trends across speeds and inclines. Agreement is particularly strong at the hip, with cross-correlation coefficients reaching 0.94 at 1.8 m/s and 0.98 during 5° decline walking, demonstrating near-matched temporal structure. Discrepancies increase at higher speeds and steeper inclines, especially at the knee, and are more pronounced in joint power comparisons. Delay tuning biases assistance toward greater positive power injection; modest timing shifts increase positive power and improve agreement in specific gait intervals. Together, these results establish a quantitative validation framework for simulation-trained exoskeleton controllers, demonstrate strong sim-to-data consistency at the torque level, and highlight both the promise and the remaining challenges for sim-to-real transfer.
中文摘要 数据驱动的关节力矩预测器为生物力学估计和外骨骼控制提供了一种可扩展的方案，可替代基于实验室的逆动力学流程。与此同时，基于物理的强化学习（RL）使得经过仿真训练的控制器能够在无需大量人体实验的情况下学习动力学感知的辅助策略。然而，对仿真训练的外骨骼力矩预测器及其对人体关节功率注入的影响的定量验证仍然有限。本文提出了（1）一个用于学习减少生物关节力矩的外骨骼辅助策略的强化学习框架，以及（2）一个验证流程，利用开源步态数据集，通过推理并与生物关节力矩进行比较来验证训练好的控制网络。我们为平地和坡道行走开发了仿真训练的多层感知器（MLP）控制器，将双侧髋关节和膝关节运动学的短时历史映射为归一化辅助力矩。结果显示，预测的辅助力矩能保持不同速度和坡度下的任务强度趋势。髋部的一致性尤为突出，互相关系数在1.8 m/s时达到0.94，在5°下坡行走时达到0.98，显示出近乎匹配的时间结构。差异在更高速度和更陡坡度下增大，尤其是在膝关节处，并且在关节功率比较中更为明显。延迟调节使辅助偏向更大的正功率注入；适度的时序偏移可增加正功率，并改善特定步态区间内的一致性。这些结果共同建立了仿真训练外骨骼控制器的定量验证框架，展示了在力矩层面较强的仿真与数据一致性，并凸显了仿真到现实迁移的前景与尚存的挑战。
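The agreement figures in this abstract (0.94 and 0.98) are cross-correlation coefficients between predicted assistance and biological joint moments. As a rough illustration only, the sketch below computes a zero-lag normalized cross-correlation, which coincides with the Pearson coefficient. Whether the paper uses zero lag or a lag search is not stated in the abstract, so that choice is an assumption, and the sample signals are synthetic.

```python
import math


def cross_correlation(a, b):
    """Zero-lag normalized cross-correlation (Pearson r) of two equal-length signals."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("signals must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0.0:
        raise ValueError("correlation is undefined for a constant signal")
    return num / den


# Synthetic stand-ins for one gait cycle (hypothetical data, not from the paper).
predicted = [math.sin(2 * math.pi * i / 50) for i in range(50)]
reference = [2.0 * p + 0.3 for p in predicted]  # same shape, different scale and offset
r = cross_correlation(predicted, reference)
```

Because the coefficient is invariant to scale and offset, it measures temporal structure rather than magnitude, which fits the abstract's remark that the high values demonstrate near-matched temporal structure.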

Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving

Helix：开放式科学问题解决的进化强化学习

Authors: Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su, Jun Zhu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07642
Pdf link: https://arxiv.org/pdf/2603.07642
Abstract Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present HELIX -- a Hierarchical Evolutionary reinforcement Learning framework with In-context eXperiences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions. On the circle packing task, HELIX achieves state-of-the-art result with a sum of radii of 2.63598308 using only a 14B model. Across standard machine learning benchmarks, HELIX further surpasses GPT-4o with a carefully engineered pipeline, delivering an average F1 improvement of 5.95 points on the Adult and Bank Marketing datasets.
中文摘要 具备推理能力的大型语言模型（LLM）在解决复杂科学问题方面展现出日益增长的潜力。然而，这些任务本质上是领域特定的、无边界且开放的，要求在庞大且灵活的解决方案空间中进行探索。现有方法，无论是纯粹基于学习还是依赖精心设计的工作流程，常常存在探索效率有限和泛化能力不足的问题。为克服这些挑战，我们提出了HELIX——一个带有上下文体验的层级进化强化学习框架。HELIX 引入了两项关键创新：（i）多样化且高质量的候选解决方案池，通过上下文学习拓宽探索;（ii）用于迭代策略优化的强化学习，逐步提升解决方案质量。这种协同效应使得更先进的解决方案得以发现。在圆填充任务中，HELIX仅使用14B模型实现了半径总和为2.63598308的先进结果。在标准机器学习基准测试中，HELIX凭借精心设计的流水线进一步超越GPT-4o，成人和银行营销数据集的平均F1提升为5.95分。

Numerical Approach for On-the-Fly Active Flow Control via Flow Map Learning Method

通过流图学习方法实现实时主动流动控制的数值方法

Authors: Xinyu Liu, Qifan Chen, Dongbin Xiu
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2603.07678
Pdf link: https://arxiv.org/pdf/2603.07678
Abstract We present a data-driven numerical approach for on-the-fly active flow control and demonstrate its effectiveness for drag reduction in two-dimensional incompressible flow past a cylinder. The method is based on flow map learning (FML), a recently developed framework for modeling unknown dynamical systems that is particularly effective for partially observed systems. For active flow control, we construct an FML dynamical model for the quantities of interest (QoIs), namely the drag and lift forces. During offline learning, training data are generated for the responses of drag and lift to the control variable, and a deep neural network (DNN)-based FML model is constructed. The learned FML model enables online optimal flow control without requiring simulations of the flow field. We demonstrate that the FML-based approach can be integrated with existing optimal control strategies, including deep reinforcement learning (DRL) and model predictive control (MPC). Numerical results show that the proposed approach enables on-the-fly flow control and achieves more than 20% drag reduction. By eliminating the need for forward simulations during control optimization, the approach offers the potential for real-time optimal control in other systems.
中文摘要 我们提出了一种数据驱动的数值方法，用于实时主动流动控制，并展示了其在二维不可压缩圆柱绕流减阻方面的有效性。该方法基于流图学习（FML），这是一个新近发展的未知动力系统建模框架，对部分观测系统尤为有效。对于主动流动控制，我们为关注量（QoI），即阻力和升力，构建了一个FML动力学模型。在离线学习过程中，生成阻力和升力对控制变量响应的训练数据，并构建基于深度神经网络（DNN）的FML模型。学习得到的FML模型无需对流场进行仿真即可实现在线最优流动控制。我们证明基于FML的方法可以与现有的最优控制策略集成，包括深度强化学习（DRL）和模型预测控制（MPC）。数值结果显示，该方法实现了实时流动控制，并实现超过20%的减阻。通过消除控制优化过程中对前向仿真的需求，该方法为其他系统实现实时最优控制提供了潜力。

Mitigating the Memory Bottleneck with Machine Learning-Driven and Data-Aware Microarchitectural Techniques

利用机器学习驱动和数据感知微架构技术缓解内存瓶颈

Authors: Rahul Bera
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Operating Systems (cs.OS)
Arxiv link: https://arxiv.org/abs/2603.07683
Pdf link: https://arxiv.org/pdf/2603.07683
Abstract Modern applications process massive data volumes that overwhelm the storage and retrieval capabilities of memory systems, making memory the primary performance and energy-efficiency bottleneck of computing systems. Although many microarchitectural techniques attempt to hide or tolerate long memory access latency, rapidly growing data footprints continue to outpace technology scaling, requiring more effective solutions. This dissertation shows that modern processors observe large amounts of application and system data during execution, yet many microarchitectural mechanisms make decisions largely independent of this information. Through four case studies, we demonstrate that such data-agnostic design leads to substantial missed opportunities for improving performance and energy efficiency. To address this limitation, this dissertation advocates shifting microarchitecture design from data-agnostic to data-informed. We propose mechanisms that (1) learn policies from observed execution behavior (data-driven design) and (2) exploit semantic characteristics of application data (data-aware design). We apply lightweight machine learning techniques and previously underexplored data characteristics across four processor components: a reinforcement learning-based hardware data prefetcher that learns memory access patterns online; a perceptron predictor that identifies memory requests likely to access off-chip memory; a reinforcement learning mechanism that coordinates data prefetching and off-chip prediction; and a mechanism that exploits repeatability in memory addresses and loaded values to eliminate predictable load instructions. Our extensive evaluation shows that the proposed techniques significantly improve performance and energy efficiency compared to prior state-of-the-art approaches.
中文摘要 现代应用处理大量数据，导致内存系统的存储和检索能力不堪重负，使内存成为计算系统的主要性能和能效瓶颈。尽管许多微架构技术试图隐藏或容忍长的内存访问延迟，但快速增长的数据足迹仍超过技术扩展，需要更有效的解决方案。本论文表明，现代处理器在执行过程中会观察大量应用和系统数据，但许多微架构机制在决策上大多独立于这些信息。通过四个案例研究，我们证明了这种数据无关设计导致了大幅度提升性能和能效的错失良机。为解决这一限制，本论文主张将微架构设计从数据无关转向数据知情。我们提出的机制包括：（1）从观察到的执行行为中学习策略（数据驱动设计），以及（2）利用应用数据的语义特性（数据感知设计）。我们在四个处理器组件上应用了轻量级机器学习技术和此前未被充分探索的数据特性：基于强化学习的硬件数据预取器，在线学习内存访问模式;一种感知器预测器，用于识别可能访问芯片外内存的内存请求;一种协调数据预取和芯片外预测的强化学习机制;以及利用内存地址和加载值的重复性来消除可预测加载指令的机制。我们的广泛评估表明，所提技术相比以往最先进的方法显著提升了性能和能效。

Global Convergence of Average Reward Constrained MDPs with Neural Critic and General Policy Parameterization

采用神经评论家和一般策略参数化的平均奖励约束MDP的全局收敛性

Authors: Anirudh Satheesh, Pankaj Kumar Barman, Washim Uddin Mondal, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07698
Pdf link: https://arxiv.org/pdf/2603.07698
Abstract We study infinite-horizon Constrained Markov Decision Processes (CMDPs) with general policy parameterizations and multi-layer neural network critics. Existing theoretical analyses for constrained reinforcement learning largely rely on tabular policies or linear critics, which limits their applicability to high-dimensional and continuous control problems. We propose a primal-dual natural actor-critic algorithm that integrates neural critic estimation with natural policy gradient updates and leverages Neural Tangent Kernel (NTK) theory to control function-approximation error under Markovian sampling, without requiring access to mixing-time oracles. We establish global convergence and cumulative constraint violation rates of $\tilde{\mathcal{O}}(T^{-1/4})$ up to approximation errors induced by the policy and critic classes. Our results provide the first such guarantees for CMDPs with general policies and multi-layer neural critics, substantially extending the theoretical foundations of actor-critic methods beyond the linear-critic regime.
中文摘要 我们研究采用一般策略参数化和多层神经网络评论家的无限时域约束马尔可夫决策过程（CMDP）。现有的约束强化学习理论分析主要依赖表格型策略或线性评论家，这限制了其在高维和连续控制问题中的适用性。我们提出了一种原始-对偶自然演员-评论家算法，将神经评论家估计与自然策略梯度更新相结合，并利用神经正切核（NTK）理论控制马尔可夫采样下的函数近似误差，无需访问混合时间预言机。我们建立了$\tilde{\mathcal{O}}(T^{-1/4})$的全局收敛率和累积约束违反率，至多相差由策略类和评论家类引起的近似误差。我们的结果首次为采用一般策略和多层神经评论家的CMDP提供了此类保证，将演员-评论家方法的理论基础大幅扩展到线性评论家范畴之外。

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

TDM-R1：利用不可微奖励强化少步扩散模型

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07700
Pdf link: https://arxiv.org/pdf/2603.07700
Abstract While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: this https URL
中文摘要 虽然少步生成模型以显著更低的成本实现了强大的图像和视频生成，但面向少步模型的通用强化学习（RL）范式仍是一个未解决的问题。现有的少步扩散模型强化学习方法严重依赖通过可微奖励模型进行反向传播，从而排除了大多数重要的现实奖励信号，例如不可微奖励，如人类的二元相似性、物体数量等。为了恰当地引入不可微奖励以改进少步生成模型，我们提出了TDM-R1，这是一种建立在领先的少步模型轨迹分布匹配（TDM）之上的新型强化学习范式。TDM-R1将学习过程解耦为代理奖励学习和生成器学习。此外，我们还开发了沿TDM确定性生成轨迹获取逐步奖励信号的实用方法，形成统一的强化学习后训练方法，显著提升了少步模型在通用奖励下的能力。我们进行了广泛的实验，涵盖文本渲染、视觉质量和偏好对齐。所有结果表明，TDM-R1 是一种适用于少步文本到图像模型的强大强化学习范式，在域内和域外指标上都实现了最先进的强化学习性能。此外，TDM-R1 还能有效扩展到近期强大的 Z-Image 模型，仅用 4 个 NFE 即持续优于其 100-NFE 和少步变体。项目页面：此 https URL

Residual Control for Fast Recovery from Dynamics Shifts

动力学变化快速恢复的残差控制

Authors: Nethmi Jayasinghe, Diana Gontero, Francesco Migliarba, Spencer T. Brown, Vinod K. Sangwan, Mark C. Hersam, Amit Ranjan Trivedi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07775
Pdf link: https://arxiv.org/pdf/2603.07775
Abstract Robotic systems operating in real-world environments inevitably encounter unobserved dynamics shifts during continuous execution, including changes in actuation, mass distribution, or contact conditions. When such shifts occur mid-episode, even locally stabilizing learned policies can experience substantial transient performance degradation. While input-to-state stability guarantees bounded state deviation, it does not ensure rapid restoration of task-level performance. We address inference-time recovery under frozen policy parameters by casting adaptation as constrained disturbance shaping around a nominal stabilizing controller. We propose a stability-aligned residual control architecture in which a reinforcement learning policy trained under nominal dynamics remains fixed at deployment, and adaptation occurs exclusively through a bounded additive residual channel. A Stability Alignment Gate (SAG) regulates corrective authority through magnitude constraints, directional coherence with the nominal action, performance-conditioned activation, and adaptive gain modulation. These mechanisms preserve the nominal closed-loop structure while enabling rapid compensation for unobserved dynamics shifts without retraining or privileged disturbance information. Across mid-episode perturbations including actuator degradation, mass variation, and contact changes, the proposed method consistently reduces recovery time relative to frozen and online-adaptation baselines while maintaining near-nominal steady-state performance. Recovery time is reduced by 87% on the Go1 quadruped, 48% on the Cassie biped, 30% on the H1 humanoid, and 20% on the Scout wheeled platform on average across evaluated conditions relative to a frozen SAC policy.
中文摘要 在现实环境中运行的机器人系统在连续执行过程中不可避免地会遇到未被观测到的动力学变化，包括驱动、质量分布或接触条件的变化。当这种变化发生在回合中途时，即使是局部稳定的学习策略也可能经历显著的瞬态性能下降。虽然输入到状态稳定性保证了有界的状态偏差，但并不能保证任务级性能的快速恢复。我们将适应表述为围绕名义稳定控制器的受约束扰动整形，以解决策略参数冻结条件下的推理时恢复问题。我们提出一种稳定性对齐的残差控制架构，其中在名义动力学下训练的强化学习策略在部署时保持固定，适应仅通过有界的加性残差通道进行。稳定性对齐门（SAG）通过幅度约束、与名义动作的方向一致性、以性能为条件的激活以及自适应增益调制来调节校正权限。这些机制保持了名义闭环结构，同时能够快速补偿未被观测到的动力学变化，无需重新训练或特权扰动信息。在包括执行器退化、质量变化和接触变化在内的回合中途扰动下，所提出的方法相较于冻结基线和在线适应基线持续缩短恢复时间，同时保持接近名义的稳态性能。相对于冻结的SAC策略，在所评估条件下平均而言，Go1四足机器人的恢复时间减少了87%，Cassie双足机器人减少了48%，H1人形机器人减少了30%，Scout轮式平台减少了20%。
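The abstract describes adaptation through a bounded additive residual whose authority is gated by magnitude and by directional coherence with the nominal action. The sketch below shows one plausible way such a gate could look. It is not the paper's Stability Alignment Gate: the function name, the thresholds `max_ratio` and `min_cosine`, and the exact gating rule are assumptions for illustration, and the performance-conditioned activation and adaptive gain modulation are left out.

```python
import math


def gated_residual_action(nominal, residual, max_ratio=0.3, min_cosine=0.0, gain=1.0):
    """Add a bounded residual to the action of a frozen nominal policy.

    All parameter values are hypothetical, not taken from the paper.
    """
    nom_norm = math.sqrt(sum(a * a for a in nominal))
    res_norm = math.sqrt(sum(r * r for r in residual))
    if nom_norm == 0.0 or res_norm == 0.0:
        return list(nominal)
    # Directional coherence: reject residuals that point against the nominal action.
    cosine = sum(a * r for a, r in zip(nominal, residual)) / (nom_norm * res_norm)
    if cosine < min_cosine:
        return list(nominal)
    # Magnitude constraint: the applied residual norm never exceeds
    # max_ratio times the nominal action norm.
    scale = min(gain, max_ratio * nom_norm / res_norm)
    return [a + scale * r for a, r in zip(nominal, residual)]


large_aligned = gated_residual_action([1.0, 0.0], [10.0, 0.0])
opposing = gated_residual_action([1.0, 0.0], [-1.0, 0.0])
```

Here the large aligned residual is clipped so the action moves from 1.0 to about 1.3, while the opposing residual is dropped entirely. Keeping the nominal policy untouched and bounding the correction is what preserves the nominal closed-loop structure the abstract refers to.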

ProgAgent: A Continual RL Agent with Progress-Aware Rewards

ProgAgent：一个持续的强化学习代理，具有进度感知奖励

Authors: Jinzhou Tan, Gabriel Adineera, Jinoh Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07784
Pdf link: https://arxiv.org/pdf/2603.07784
Abstract We present ProgAgent, a continual reinforcement learning (CRL) agent that unifies progress-aware reward learning with a high-throughput, JAX-native system architecture. Lifelong robotic learning grapples with catastrophic forgetting and the high cost of reward specification. ProgAgent tackles these by deriving dense, shaped rewards from unlabeled expert videos through a perceptual model that estimates task progress across initial, current, and goal observations. We theoretically interpret this as a learned state-potential function, delivering robust guidance in line with expert behaviors. To maintain stability amid online exploration - where novel, out-of-distribution states arise - we incorporate an adversarial push-back refinement that regularizes the reward model, curbing overconfident predictions on non-expert trajectories and countering distribution shift. By embedding this reward mechanism into a JIT-compiled loop, ProgAgent supports massively parallel rollouts and fully differentiable updates, rendering a sophisticated unified objective feasible: it merges PPO with coreset replay and synaptic intelligence for an enhanced stability-plasticity balance. Evaluations on ContinualBench and Meta-World benchmarks highlight ProgAgent's advantages: it markedly reduces forgetting, boosts learning speed, and outperforms key baselines in visual reward learning (e.g., Rank2Reward, TCN) and continual learning (e.g., Coreset, SI) - surpassing even an idealized perfect memory agent. Real-robot trials further validate its ability to acquire complex manipulation skills from noisy, few-shot human demonstrations.
中文摘要 我们介绍ProgAgent，一种持续强化学习（CRL）智能体，将进度感知奖励学习与高吞吐量、JAX原生的系统架构统一起来。终身机器人学习面临灾难性遗忘和奖励设定成本高昂的问题。ProgAgent通过一个感知模型，从未标注的专家视频中推导出密集的塑形奖励，该模型根据初始、当前和目标观测估计任务进度。我们在理论上将其解释为一种学习得到的状态势函数，能够提供与专家行为一致的稳健引导。在线探索中会出现新颖的分布外状态，为了在此过程中保持稳定性，我们引入了对抗性回推细化来正则化奖励模型，抑制对非专家轨迹的过度自信预测，并对抗分布偏移。通过将这一奖励机制嵌入JIT编译的循环中，ProgAgent支持大规模并行采样（rollout）和完全可微的更新，使得一个复杂的统一目标变得可行：它将PPO与核心集回放和突触智能相结合，以实现更好的稳定性与可塑性平衡。在ContinualBench和Meta-World基准上的评估突出了ProgAgent的优势：它显著减少遗忘，提升学习速度，并且在视觉奖励学习（如Rank2Reward、TCN）和持续学习（如Coreset、SI）方面优于关键基线，甚至超过了理想化的完美记忆智能体。真实机器人试验进一步验证了其从含噪声的少样本人类演示中习得复杂操作技能的能力。
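The abstract interprets the progress estimate as a learned state-potential function. One standard way to turn a potential into a dense reward is potential-based shaping, r' = r + γΦ(s') − Φ(s). The sketch below applies that textbook formula to a list of progress estimates. Whether ProgAgent uses exactly this form is an assumption, and the progress values are made up.

```python
def shaped_rewards(progress, gamma=0.99, env_rewards=None):
    """Potential-based shaping: r_t + gamma * phi(s_{t+1}) - phi(s_t).

    `progress` holds the potential phi for each visited state, so a trajectory
    with T transitions needs T + 1 progress values.
    """
    steps = len(progress) - 1
    if env_rewards is None:
        env_rewards = [0.0] * steps  # purely shaped reward, no environment signal
    if len(env_rewards) != steps:
        raise ValueError("need one environment reward per transition")
    return [
        env_rewards[t] + gamma * progress[t + 1] - progress[t]
        for t in range(steps)
    ]


# Hypothetical progress estimates in [0, 1] along one trajectory.
progress = [0.0, 0.2, 0.5, 1.0]
rewards = shaped_rewards(progress, gamma=1.0)
```

With γ = 1 the shaped rewards telescope, so their sum equals the final progress minus the initial progress. This telescoping is why potential-based shaping gives dense guidance without changing which policies are optimal.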

Toward Global Intent Inference for Human Motion by Inverse Reinforcement Learning

通过逆向强化学习实现人类运动的全局意图推断

Authors: Sarmad Mehrdad, Maxime Sabbah, Vincent Bonnet, Ludovic Righetti
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07797
Pdf link: https://arxiv.org/pdf/2603.07797
Abstract This paper investigates whether a single, unified cost function can explain and predict human reaching movements, in contrast with existing approaches that rely on subject- or posture-specific optimization criteria. Using the Minimal Observation Inverse Reinforcement Learning (MO-IRL) algorithm, together with a seven-dimensional set of candidate cost terms, we efficiently estimate time-varying cost weights for a standard planar reaching task. MO-IRL provides orders-of-magnitude faster convergence than bilevel formulations, while using only a fraction of the available data, enabling the practical exploration of time-varying cost structures. Three levels of generality are evaluated: Subject-Dependent Posture-Dependent, Subject-Dependent Posture-Independent, and Subject-Independent Posture-Independent. Across all cases, time-varying weights substantially improve trajectory reconstruction, yielding an average 27% reduction in RMSE compared to the baseline. The inferred costs consistently highlight a dominant role for joint-acceleration regulation, complemented by smaller contributions from torque-change smoothness. Overall, a single subject- and posture-agnostic time-varying cost function is shown to predict human reaching trajectories with high accuracy, supporting the existence of a unified optimality principle governing this class of movements.
中文摘要 本文探讨单一、统一的成本函数是否能够解释和预测人类的伸手运动，这与现有依赖特定受试者或特定姿态优化准则的方法形成对比。利用最小观测逆强化学习（MO-IRL）算法，结合一组七维候选成本项，我们高效地估计了标准平面伸手任务的时变成本权重。MO-IRL的收敛速度比双层优化方法快若干个数量级，且仅使用一小部分可用数据，使得对时变成本结构的实际探索成为可能。我们评估了三个通用性层次：受试者相关且姿态相关、受试者相关且姿态无关，以及受试者无关且姿态无关。在所有情形下，时变权重都显著改善了轨迹重建，RMSE较基线平均降低27%。推断出的成本一致凸显了关节加速度调节的主导作用，并辅以力矩变化平滑性的较小贡献。总体而言，单一的、与受试者和姿态无关的时变成本函数能够高精度预测人类的伸手轨迹，支持该类运动存在统一的最优性原则。

Preference-Conditioned Reinforcement Learning for Space-Time Efficient Online 3D Bin Packing

面向时空高效在线三维装箱的偏好条件强化学习

Authors: Nikita Sarawgi, Omey M. Manyar, Fan Wang, Thinh H. Nguyen, Daniel Seita, Satyandra K. Gupta
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07800
Pdf link: https://arxiv.org/pdf/2603.07800
Abstract Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at this https URL.
中文摘要 机器人箱装在仓库自动化中被广泛应用，现有系统通过启发式和基于学习的策略实现了强健的性能。这些系统必须在紧凑的布置与快速执行之间取得平衡，选择替代物品或重新定向可以提高空间利用率，但会增加更多时间。我们提出了一种基于选择的表述，明确考虑这一权衡：在每一步，机器人评估多个候选动作，权衡预期的包装效益与预计的作时间。这使得能够有选择性地接受更长时间的作时间，从而在空间上有意义地改进，从而实现时间感知。我们的方法STEP（时空高效打包）采用基于偏好条件的基于Transformer的强化学习策略，允许跨候选集合大小进行泛化及与标准配置模块的集成。在不牺牲包装密度的前提下，其运行时间可缩短44%。更多资料可在此 https 网址获取。
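
The space-time trade-off described in this abstract can be illustrated with a minimal sketch. It is not the STEP policy, which the abstract describes as a learned, preference-conditioned Transformer; it is a hand-written scoring rule. The `Candidate` fields, the `time_weight` preference, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate action: pick an item and place it in some orientation."""
    name: str
    packing_benefit: float  # hypothetical estimate of space-utilization gain
    operation_time: float   # hypothetical estimate of seconds to execute


def select_candidate(candidates, time_weight):
    """Return the candidate with the best benefit-minus-weighted-time score.

    time_weight stands in for the preference: 0 ignores time entirely,
    larger values accept a slower action only for a larger packing gain.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda c: c.packing_benefit - time_weight * c.operation_time,
    )


# Illustrative numbers only, not taken from the paper.
candidates = [
    Candidate("next item, as-is", packing_benefit=0.50, operation_time=2.0),
    Candidate("next item, reoriented", packing_benefit=0.60, operation_time=5.0),
    Candidate("alternative item", packing_benefit=0.80, operation_time=9.0),
]
```

Sweeping `time_weight` from 0 upward moves the choice from the densest placement toward the fastest one, which is the behavior a preference-conditioned policy would be expected to learn.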

Relating Reinforcement Learning to Dynamic Programming-Based Planning

将强化学习与基于动态编程的规划联系起来

Authors: Filip V. Georgiev, Kalle G. Timperi, Başak Sakçak, Steven M. LaValle
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07844
Pdf link: https://arxiv.org/pdf/2603.07844
Abstract This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing truecost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.
中文摘要 本文弥合了最优规划与强化学习（RL）之间的部分空白，两者都源自动态规划应用于顺序决策或最优控制。规划通常倾向于确定性模型、目标终止和成本最小化，而强化学习则倾向于采用随机模型、无限视野贴现和奖励最大化，此外还考虑学习率和贪婪因素等学习相关参数。开发、分析和实现了去随机化版本的强化学习，以便通过简单规划模型与价值迭代和迪克斯特拉算法进行性能比较。接下来，数学分析表明：1）成本最小化与奖励最大化等价的条件，2）单次目标终止与无限视野的情景学习等价的条件，3）折现导致目标达成失败的条件。论文随后主张定义和优化truecost，而不是插入任意参数来指导作。随后，绩效研究扩展到随机情形，采用以规划为导向的标准，并将价值迭代与强化学习与学习率和贪婪因素进行比较。
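
The comparison between value iteration and Dijkstra's algorithm mentioned in this abstract can be sketched on a toy deterministic model. The graph, its edge costs, and the function names below are assumptions for illustration and do not come from the paper. Both routines compute the undiscounted cost-to-go to a goal state and should agree whenever edge costs are non-negative.

```python
import heapq

# Hypothetical deterministic planning model: EDGES[state] = [(next_state, cost), ...]
EDGES = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("G", 6.0)],
    "C": [("G", 1.0)],
    "G": [],
}
GOAL = "G"


def value_iteration(edges, goal):
    """Undiscounted cost-to-go by repeated Bellman backups until nothing changes."""
    cost = {s: (0.0 if s == goal else float("inf")) for s in edges}
    changed = True
    while changed:
        changed = False
        for state, out in edges.items():
            if state == goal or not out:
                continue
            best = min(c + cost[nxt] for nxt, c in out)
            if best < cost[state]:
                cost[state] = best
                changed = True
    return cost


def dijkstra_to_goal(edges, goal):
    """Cost-to-go via Dijkstra's algorithm run backward from the goal."""
    reverse = {s: [] for s in edges}
    for state, out in edges.items():
        for nxt, c in out:
            reverse[nxt].append((state, c))
    cost = {s: float("inf") for s in edges}
    cost[goal] = 0.0
    heap = [(0.0, goal)]
    while heap:
        dist, state = heapq.heappop(heap)
        if dist > cost[state]:
            continue  # stale heap entry
        for prev, c in reverse[state]:
            if dist + c < cost[prev]:
                cost[prev] = dist + c
                heapq.heappush(heap, (cost[prev], prev))
    return cost
```

Value iteration sweeps every state until convergence, while Dijkstra's algorithm settles each state once in order of increasing cost, which is why the two are natural baselines for a derandomized RL variant on the same model.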

SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

SynPlanResearch-R1：鼓励探索工具以进行合成计划的深度研究

Authors: Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.07853
Pdf link: https://arxiv.org/pdf/2603.07853
Abstract Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories that encourage deeper exploration to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones, respectively, compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at this https URL.
中文摘要 研究代理使模型能够利用工具从网络收集信息以回答用户查询，要求它们动态交织内部推理与工具使用。虽然原则上可以通过可验证奖励的强化学习（RLVR）学习此类能力，但我们观察到代理常表现出不良的探索行为，包括过早终止和工具使用偏见。因此，仅靠RLVR带来的改进有限。我们提出了SynPlanResearch-R1框架，该框架综合工具使用轨迹，鼓励更深入的探索，以塑造冷启动监督微调中的探索，为后续强化学习提供强有力的初始化。在七个多跳和开放网络基准测试中，\framework 在 Qwen3-8B 和 Qwen3-4B 骨干上分别提升了 SOTA 基线高达 6.0% 和 5.8% 的性能。对工具使用模式和训练动态与基线的进一步分析，揭示了这些进步背后的因素。我们的代码在此 https URL 公开。

SMGI: A Structural Theory of General Artificial Intelligence

SMGI：通用人工智能的结构理论

Authors: Aomar Osmani
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07896
Pdf link: https://arxiv.org/pdf/2603.07896
Abstract We introduce SMGI, a structural theory of general artificial intelligence, and recast the foundational problem of learning from the optimization of hypotheses within fixed environments to the controlled evolution of the learning interface itself. We formalize the Structural Model of General Intelligence (SMGI) via a typed meta-model $\theta = (r,\mathcal H,\Pi,\mathcal L,\mathcal E,\mathcal M)$ that treats representational maps, hypothesis spaces, structural priors, multi-regime evaluators, and memory operators as explicitly typed, dynamic components. By enforcing a strict mathematical separation between this structural ontology ($\theta$) and its induced behavioral semantics ($T_\theta$), we define general artificial intelligence as a class of admissible coupled dynamics $(\theta, T_\theta)$ satisfying four obligations: structural closure under typed transformations, dynamical stability under certified evolution, bounded statistical capacity, and evaluative invariance across regime shifts. We prove a structural generalization bound that links sequential PAC-Bayes analysis and Lyapunov stability, providing sufficient conditions for capacity control and bounded drift under admissible task transformations. Furthermore, we establish a strict structural inclusion theorem demonstrating that classical empirical risk minimization, reinforcement learning, program-prior models (Solomonoff-style), and modern frontier agentic pipelines operate as structurally restricted instances of SMGI.
中文摘要 我们介绍了通用人工智能的结构理论SMGI，并将学习的基础问题从固定环境中假设的优化，重新构建为学习界面本身的受控演化。我们通过一个类型化元模型 $\theta = (r,\mathcal H,\Pi,\mathcal L,\mathcal E,\mathcal M)$ 形式化一般智能结构模型（SMGI），该模型将表征映射、假设空间、结构先验、多态评估器和记忆算符视为显式类型的动态组成部分。通过严格将结构本体（$\theta$）与其诱导行为语义（$T_\theta$）区分开来，我们将通用人工智能定义为一类可接受的耦合动力学 $(\theta, T_\theta)$，满足四项义务：类型变换下的结构闭合、认证演化下的动态稳定性、有界的统计容量以及在体制转移间的评估不变性。我们证明了一个结构推广上界，将顺序PAC-贝叶斯分析与李雅普诺夫稳定性联系起来，为在可接受任务变换下的容量控制和有界漂移提供了充分条件。此外，我们建立了严格的结构包含定理，证明经典的经验风险最小化、强化学习、程序先验模型（Solomonoff风格）以及现代前沿代理管道作为结构受限的SMGI实例。

SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

SGG-R$^{\rm 3}$：从下一标记预测到端到端无偏场景图生成

Authors: Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.07961
Pdf link: https://arxiv.org/pdf/2603.07961
Abstract Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs characterized by low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), designed to engage in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward which integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
中文摘要 场景图生成（SGG）将视觉场景结构化为对象及其关系的图。虽然多模态大型语言模型（MLLM）已经推进了端到端的SGG，但当前方法仍受限于缺乏任务特定结构化推理以及稀疏且长尾关系分布的挑战，导致场景图不完整，特点是回忆率低且预测有偏。为解决这些问题，我们引入了SGG-R$^{\rm 3}$，这是一个结构化推理框架，集成了任务特定思维链（CoT）引导的监督微调（SFT）和强化学习（RL）与组序列策略优化（GSPO），设计通过三个顺序阶段实现端到端无偏场景图生成。在SFT阶段，我们提出了一种关系增强策略，利用MLLM，并通过嵌入相似性过滤来优化，以缓解关系稀疏性。随后，阶段对齐的奖励方案优化了强化学习过程中的程序推理。具体来说，我们提出了一种新的双粒度奖励，整合细粒度和粗粒度关系奖励，同时通过基于频率的自适应谓词加权缓解长尾问题，并通过语义聚类提升关系覆盖率。两个基准测试的实验显示，SGG-R$^{\rm 3}$ 相较于现有方法实现了更优的性能，展示了该框架的有效性和推广性。
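
The idea of frequency-based adaptive weighting of predicates can be sketched as follows. The abstract does not give the actual weighting formula, so the inverse-log form, the normalization to mean 1, and both function names are assumptions chosen only to show how rare predicates can be made to count more in a relation reward.

```python
import math
from collections import Counter


def predicate_weights(predicate_labels):
    """Map each predicate to a weight that grows as the predicate gets rarer.

    Assumed form: w(p) = 1 / log(1 + count(p)), rescaled so the mean weight is 1.
    """
    counts = Counter(predicate_labels)
    if not counts:
        raise ValueError("need at least one predicate label")
    raw = {p: 1.0 / math.log(1.0 + n) for p, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {p: w / mean for p, w in raw.items()}


def weighted_relation_reward(predicted, ground_truth, weights):
    """Weighted recall over (subject, predicate, object) triplets, in [0, 1].

    Every ground-truth predicate is assumed to have an entry in weights.
    """
    truth = set(ground_truth)
    found = set(predicted)
    total = sum(weights[p] for _, p, _ in truth)
    hit = sum(weights[t[1]] for t in truth if t in found)
    return hit / total if total else 0.0
```

Under this weighting, recovering a long-tail triplet such as a rare "riding" relation earns more reward than recovering a frequent "on" relation, which is the direction of the bias correction the abstract describes.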

Model-Free DRL Control for Power Inverters: From Policy Learning to Real-Time Implementation via Knowledge Distillation

无模型的电力逆变器DRL控制：从策略学习到通过知识蒸馏实现的实时实施

Authors: Yang Yang, Chenggang Cui, Xitong Niu, Jiaming Liu, Chuanlin Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.07964
Pdf link: https://arxiv.org/pdf/2603.07964
Abstract In response to the trade-off between control performance and computational burden hindering the deployment of Deep Reinforcement Learning (DRL) in power inverters, this paper presents a novel model-free control framework leveraging policy distillation. To handle the convergence instability and steady-state errors inherent in model-free agents, an error energy-guided hybrid reward mechanism is established to theoretically constrain the exploration space. More specifically, an adaptive importance weighting mechanism is integrated into the distillation architecture to amplify the significance of fluctuation regions, ensuring high-quality transfer of transient control logic by mitigating the observational bias dominated by steady-state data. This approach efficiently compresses the heavy DRL policy into a lightweight neural network, retaining the desired control performance while overcoming the computational bottleneck during deployment. The proposed method is validated through a hardware-based kilowatt-level experimental platform. Experimental comparison results with traditional methods demonstrate that the proposed technique reduces inference time to the microsecond level and achieves superior transient response speed and parameter robustness.
中文摘要 针对控制性能与计算负担之间阻碍深度强化学习（DRL）在逆变器中部署的权衡，本文提出了一种利用策略提炼的新颖无模型控制框架。为处理模型无智能体固有的收敛不稳定性和稳态误差，建立了一种误差引导能量的混合奖励机制，以理论上约束探索空间。更具体地说，将自适应重要性加权机制集成到蒸馏架构中，以放大涨落区域的重要性，通过减轻稳态数据主导的观测偏差，确保瞬态控制逻辑的高质量传输。该方法高效地将繁重的日程学习策略压缩为轻量级神经网络，保持期望的控制性能，同时克服部署时的计算瓶颈。该方法通过基于硬件的千瓦级实验平台进行验证。与传统方法的实验比较结果表明，该技术将推理时间缩短至微秒级，并实现了更优越的瞬态响应速度和参数鲁棒性。
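
A minimal sketch of importance-weighted policy distillation, in the spirit of the adaptive weighting described in this abstract. The paper's actual weighting rule is not given here, so the form `1 + gain * |e_t - e_{t-1}|`, the `gain` parameter, and the function names are assumptions. The point is only that samples from transient (fluctuating) regions contribute more to the student's loss than steady-state samples.

```python
def importance_weights(errors, gain=5.0):
    """Weight each sample by how fast the tracking error is changing.

    Assumed form: w_t = 1 + gain * |e_t - e_{t-1}|, normalized to mean 1,
    so transient samples count more than steady-state ones.
    """
    if not errors:
        raise ValueError("need at least one sample")
    raw = [1.0] + [1.0 + gain * abs(b - a) for a, b in zip(errors, errors[1:])]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_distillation_loss(teacher_actions, student_actions, weights):
    """Weighted mean squared error between teacher and student outputs."""
    if not (len(teacher_actions) == len(student_actions) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total = sum(
        w * (t - s) ** 2
        for t, s, w in zip(teacher_actions, student_actions, weights)
    )
    return total / sum(weights)
```

With this loss, a student that mismatches the teacher during a transient is penalized more than one that makes the same mismatch at steady state, which counteracts the dominance of steady-state data in the training set.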

VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments

VORL-EXPLORE：动态环境中多机器人探索的混合学习规划方法

Authors: Ning Liu, Sen Shen, Zheng Li, Sheng Liu, Dongkun Han, Shangke Lyu, Thomas Braunl
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.07973
Pdf link: https://arxiv.org/pdf/2603.07973
Abstract Hierarchical multi-robot exploration commonly decouples frontier allocation from local navigation, which can make the system brittle in dense and dynamic environments. Because the allocator lacks direct awareness of execution difficulty, robots may cluster at bottlenecks, trigger oscillatory replanning, and generate redundant coverage. We propose VORL-EXPLORE, a hybrid learning and planning framework that addresses this limitation through execution fidelity, a shared estimate of local navigability that couples task allocation with motion execution. This fidelity signal is incorporated into a fidelity-coupled Voronoi objective with inter-robot repulsion to reduce contention before it emerges. It also drives a risk-aware adaptive arbitration mechanism between global A* guidance and a reactive reinforcement learning policy, balancing long-range efficiency with safe interaction in confined spaces. The framework further supports online self-supervised recalibration of the fidelity model using pseudo-labels derived from recent progress and safety outcomes, enabling adaptation to non-stationary obstacles without manual risk tuning. We evaluate this capability separately in a dedicated severe-traffic ablation. Extensive experiments in randomized grids and a Gazebo factory scenario show high success rates, shorter path length, lower overlap, and robust collision avoidance. The source code will be made publicly available upon acceptance.
中文摘要 分层多机器人探索通常将前沿分配与本地导航分离，这可能使系统在密集且动态的环境中变得脆弱。由于分配者缺乏对执行难度的直接感知，机器人可能聚集在瓶颈处，触发振荡性重新规划，并产生冗余覆盖。我们提出了VORL-EXPLORE，一种混合学习与规划框架，通过执行忠真度来解决这一限制，这是一种将任务分配与运动执行结合的本地导航性共享估计。该保真度信号被集成到带有机器人间排斥的保真耦合Voronoi物镜中，以在信号出现前减少争用。它还推动了全球A*指导与被动强化学习策略之间的风险感知自适应仲裁机制，在狭小空间内平衡远程效率与安全互动。该框架进一步支持利用基于近期进展和安全结果的伪标签在线自监督重新校准保真度模型，使得适应非平稳障碍而无需手动风险调整。我们分别在专门的严重交通消融中评估该能力。在随机网格和凉亭工厂场景中的大量实验显示，成功率高，路径长度更短，重叠更少，且具有稳健的碰撞避免能力。源代码一旦被接受，将公开公开。

On the Feasibility and Opportunity of Autoregressive 3D Object Detection

关于自回归三维物体检测的可行性和机遇

Authors: Zanming Huang, Jinsu Yoo, Sooyoung Jeon, Zhenzhen Liu, Mark Campbell, Kilian Q Weinberger, Bharath Hariharan, Wei-Lun Chao, Katie Z Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.07985
Pdf link: https://arxiv.org/pdf/2603.07985
Abstract LiDAR-based 3D object detectors typically rely on proposal heads with hand-crafted components like anchor assignment and non-maximum suppression (NMS), complicating training and limiting extensibility. We present AutoReg3D, an autoregressive 3D detector that casts detection as sequence generation. Given point-cloud features, AutoReg3D emits objects in a range-causal (near-to-far) order and encodes each object as a short, discrete-token sequence consisting of its center, size, orientation, velocity, and class. This near-to-far ordering mirrors LiDAR geometry--near objects occlude far ones but not vice versa--enabling straightforward teacher forcing during training and autoregressive decoding at test time. AutoReg3D is compatible across diverse point-cloud backbones and attains competitive nuScenes performance without anchors or NMS. Beyond parity, the sequential formulation unlocks language-model advances for 3D perception, including GRPO-style reinforcement learning for task-aligned objectives. These results position autoregressive decoding as a viable, flexible alternative for LiDAR-based detection and open a path to importing modern sequence-modeling tools into 3D perception.
中文摘要 基于激光雷达的三维物体探测器通常依赖带有手工组件（如锚点分配和非最大抑制（NMS）的提案头，这增加了训练的复杂性并限制了扩展性。我们介绍AutoReg3D，一种自回归三维检测器，将检测视为序列生成。给定点云特征，AutoReg3D 以距离因果（近至远）顺序发射对象，并将每个物体编码为一个短的离散令牌序列，包含其中心、大小、方向、速度和类别。这种近远排序类似于激光雷达几何结构——近天体遮挡远处物体，但反之则不然——这使得培训时教师强迫和测试时的自回归解码变得简单。AutoReg3D 兼容多种点云或骨干网络，无需锚点或 NMS 即可实现具有竞争力的 nuScene 性能。超越了对称，顺序表述还解锁了三维感知的语言模型进展，包括针对任务对齐目标的GRPO式强化学习。这些结果使自回归解码成为基于激光雷达（LiDAR）检测的可行且灵活的替代方案，并为引入现代序列建模工具进入三维感知开辟了道路。
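
The near-to-far ordering and discrete tokenization described in this abstract can be sketched in a few lines. This is a simplified illustration, not the AutoReg3D tokenizer: only the box center and class are encoded, and the dictionary keys, the 50 m range, and the 100-bin vocabulary are assumptions.

```python
import math


def quantize(value, low, high, bins):
    """Map a continuous value to an integer token in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * bins), bins - 1)


def boxes_to_tokens(boxes, max_range=50.0, bins=100):
    """Sort boxes near-to-far from the sensor and flatten them into tokens.

    Each box is a dict with hypothetical keys x, y (metres, sensor at the
    origin) and cls (integer class id). Size, orientation, and velocity
    would be quantized the same way and are omitted for brevity.
    """
    ordered = sorted(boxes, key=lambda b: math.hypot(b["x"], b["y"]))
    tokens = []
    for box in ordered:
        tokens.append(quantize(box["x"], -max_range, max_range, bins))
        tokens.append(quantize(box["y"], -max_range, max_range, bins))
        tokens.append(bins + box["cls"])  # class ids sit after the coordinate bins
    return tokens
```

Because the target sequence has a fixed, geometry-driven order, a decoder can be trained with ordinary teacher forcing on these tokens, with no anchor assignment or NMS step.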

MJ1: Multimodal Judgment via Grounded Verification

MJ1：通过有据核查实现的多模态判断

Authors: Bhavesh Kumar, Dylan Feng, Leonard Tang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.07990
Pdf link: https://arxiv.org/pdf/2603.07990
Abstract Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
中文摘要 多模态法官难以以视觉证据为基础做出裁决。我们介绍MJ1，一种经过强化学习训练的多模态法官，通过结构化的基础验证链（观察$\rightarrow$ claims $\rightarrow$ 验证 $\rightarrow$ 验证$\rightarrow$ 评估$\rightarrow$评分）强化视觉基础，以及一个反事实一致性奖励，惩罚位置偏见。即使不进行训练，我们的机制在 MMRB2 上图像编辑基础模型准确率提升了 +3.8 点，在多模态推理方面提升了 +1.7 点。训练后，MJ1仅有3B激活参数，在MMRB2上达到77.0%的准确率，超过了数量级更大的模型如Gemini-3-Pro。这些结果表明，基于基础的验证和基于一致性的训练在不增加模型规模的情况下，显著提升了多模态判断能力。
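
A counterfactual consistency reward that penalizes position bias can be sketched by querying a pairwise judge twice, once with the two responses swapped. The abstract does not specify the reward used by MJ1, so the formula below (mean accuracy over both orderings minus a penalty when the two verdicts disagree), the `penalty` value, and the toy judges are assumptions for illustration.

```python
def consistency_reward(judge, response_a, response_b, preferred, penalty=0.5):
    """Score a pairwise judge on the original and the position-swapped order.

    judge(first, second) is a hypothetical callable returning "first" or
    "second". preferred is "a" or "b", the ground-truth better response.
    Assumed reward: mean accuracy over both orderings, minus `penalty`
    when the two verdicts name different responses.
    """
    if preferred not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    first_pass = "a" if judge(response_a, response_b) == "first" else "b"
    second_pass = "b" if judge(response_b, response_a) == "first" else "a"
    accuracy = ((first_pass == preferred) + (second_pass == preferred)) / 2
    return accuracy - (penalty if first_pass != second_pass else 0.0)


def always_first(first, second):
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(first, second):
    """Toy judge that ignores position and prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"
```

A judge that always picks the first slot is right in exactly one ordering and therefore earns nothing under this rule, while a position-independent judge keeps its full accuracy.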

ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

ImageEdit-R1：通过强化学习提升多智能体图像编辑

Authors: Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08059
Pdf link: https://arxiv.org/pdf/2603.08059
Abstract With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.
中文摘要 随着商业多模态模型的快速发展，图像编辑因其在日常生活中的广泛应用而备受关注。尽管取得了显著进展，现有图像编辑系统，尤其是闭源或专有型号，常常在复杂、间接或多步用户指令时遇到困难。这些限制阻碍了他们进行细腻、具上下文感知的编辑，以符合人类意图。本研究提出ImageEdit-R1，一种多智能体图像编辑框架，利用强化学习协调一组专业预训练视觉语言和生成智能体的高级决策。每个代理都负责不同的能力——如理解用户意图、识别兴趣区域、选择合适的编辑作以及综合视觉内容——而强化学习则规范其协作，确保行为连贯且目标导向。与依赖单一模型或手工制作流程的现有方法不同，我们的方法将图像编辑视为顺序决策问题，从而实现动态且具备上下文感知的编辑策略。实验结果表明，ImageEdit-R1在多个图像编辑数据集中，始终优于单个闭源扩散模型和其他多智能体框架基线。

In-Context Reinforcement Learning for Tool Use in Large Language Models

用于大型语言模型工具的上下文强化学习

Authors: Yaoqi Ye, Yiran Zhao, Keyu Duan, Zeyu Zheng, Kenji Kawaguchi, Cihang Xie, Michael Qizhe Shieh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08068
Pdf link: https://arxiv.org/pdf/2603.08068
Abstract While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
中文摘要 虽然大型语言模型（LLMs）展现出强大的推理能力，但它们在复杂任务中的表现往往受限于其内部知识的局限。克服这一挑战的一个有效方法是用外部工具来补充这些模型——比如用于数学计算的Python解释器或用于检索事实信息的搜索引擎。然而，让模型有效使用这些工具仍是一项重大挑战。现有方法通常依赖冷启动流水线，先进行监督微调（SFT），再进行强化学习（RL）。这些方法通常需要大量带标签的数据用于SFT，且注释或综合成本高昂。在本研究中，我们提出了情境强化学习（ICRL），这是一种仅限RL的框架，通过在RL推广阶段利用少数样本提示，消除了SFT的需求。具体来说，ICRL在推广提示中引入上下文示例，教模型如何调用外部工具。此外，随着训练的推进，上下文示例数量逐渐减少，最终达到零样本状态，模型能够独立调用工具。我们在各种推理和工具使用基准中进行了广泛的实验。结果显示，ICRL实现了最先进的性能，证明了其作为传统SFT管道的可扩展、数据高效替代方案的有效性。
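
The gradual reduction of in-context examples during rollouts can be sketched as a simple schedule. The abstract only states that the number of examples decreases until a zero-shot setting is reached; the equal-length phases, the function names, and the prompt layout below are assumptions, not the ICRL implementation.

```python
def num_in_context_examples(step, total_steps, max_examples=3):
    """Number of few-shot examples to include at a given training step.

    Assumed schedule: training is split into max_examples + 1 equal phases,
    and each phase uses one example fewer than the previous one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    phase = min(int(progress * (max_examples + 1)), max_examples)
    return max_examples - phase


def build_rollout_prompt(question, examples, step, total_steps):
    """Prepend the scheduled number of tool-use demonstrations to the question."""
    k = num_in_context_examples(step, total_steps, max_examples=len(examples))
    return "\n\n".join(list(examples[:k]) + [question])
```

Early rollouts see every demonstration, and by the final phase the prompt contains only the question, so the policy has to call tools without any in-context help.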

Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

迈向基于LLM的稳健评判：分类偏见评估与去偏优化

Authors: Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.08091
Pdf link: https://arxiv.org/pdf/2603.08091
Abstract Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases is essential for ensuring the reliability of LLM-based judges. However, existing studies typically investigate limited biases under a single judge formulation, either generative or discriminative, lacking a comprehensive evaluation. To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges. JudgeBiasBench defines a taxonomy of judgment biases across 4 dimensions, and constructs bias-augmented evaluation instances through a controlled bias injection pipeline, covering 12 representative bias types. We conduct extensive experiments across both generative and discriminative judges, revealing that current judges exhibit significant and diverse bias patterns that often compromise the reliability of automated evaluation. To mitigate judgment bias, we propose bias-aware training that explicitly incorporates bias-related attributes into the training process, encouraging judges to disentangle task-relevant quality from bias-correlated cues. By adopting reinforcement learning for generative judges and contrastive learning for discriminative judges, our methods effectively reduce judgment biases while largely preserving general evaluation capability.
中文摘要 基于大型语言模型（LLM）的评判被广泛应用于自动评估和奖励建模，但其判断常常受到判断偏差的影响。准确评估这些偏见对于确保基于LLM的法官的可靠性至关重要。然而，现有研究通常在单一评审表述下调查有限偏见，无论是生成式还是歧视性，缺乏全面的评估。为弥合这一差距，我们提出了JudgeBiasBench，这是一个系统性量化基于LLM法官偏见的基准工具。JudgeBiasBench定义了跨4维判断偏见的分类法，并通过受控偏见注入流程构建了带有偏见增强的评估实例，涵盖12种具有代表性的偏见类型。我们对生成式和判别性评审进行了大量实验，发现现有评委表现出显著且多样的偏见模式，常常影响自动评估的可靠性。为减轻判断偏差，我们提出了明确将偏见相关属性纳入培训过程的偏见意识培训，鼓励评委将任务相关质量与偏见相关线索分开。通过对生成式法官采用强化学习，对判别性法官采用对比学习，我们的方法有效减少了判断偏差，同时在很大程度上保留了一般评估能力。

DeReCo: Decoupling Representation and Coordination Learning for Object-Adaptive Decentralized Multi-Robot Cooperative Transport

DeReCo：对象自适应去中心化多机器人协作运输的解耦表示与协调学习

Authors: Kazuki Shibata, Ryosuke Sota, Shandil Dhiresh Bosch, Yuki Kadokawa, Tsurumine Yoshihisa, Takamitsu Matsubara
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08111
Pdf link: https://arxiv.org/pdf/2603.08111
Abstract Generalizing decentralized multi-robot cooperative transport across objects with diverse shapes and physical properties remains a fundamental challenge. Under decentralized execution, two key challenges arise: object-dependent representation learning under partial observability and coordination learning in multi-agent reinforcement learning (MARL) under non-stationarity. A typical approach jointly optimizes object-dependent representations and coordinated policies in an end-to-end manner while randomizing object shapes and physical properties during training. However, this joint optimization tightly couples representation and coordination learning, introducing bidirectional interference: inaccurate representations under partial observability destabilize coordination learning, while non-stationarity in MARL further degrades representation learning, resulting in sample-inefficient training. To address this structural coupling, we propose DeReCo, a novel MARL framework that decouples representation and coordination learning for object-adaptive multi-robot cooperative transport, improving sample efficiency and generalization across objects and transport scenarios. DeReCo adopts a three-stage training strategy: (1) centralized coordination learning with privileged object information, (2) reconstruction of object-dependent representations from local observations, and (3) progressive removal of privileged information for decentralized execution. This decoupling mitigates interference between representation and coordination learning and enables stable and sample-efficient training. Experimental results show that DeReCo outperforms baselines in simulation on three training objects, generalizes to six unseen objects with varying masses and friction coefficients, and achieves superior performance on two unseen objects in real-robot experiments.
中文摘要 将分散式多机器人协作运输推广到形状和物理属性多样的物体之间，仍是一个根本性的挑战。在去中心化执行下，出现了两个关键挑战：部分可观测性下的对象相关表示学习和非平稳性下的多智能体强化学习（MARL）中的协调学习。一种典型方法是端到端地联合优化对象相关的表示和协调策略，同时在训练过程中随机调整对象形状和物理属性。然而，这种联合优化紧密耦合了表示法和协调学习，引入了双向干扰：在部分可观测性下不准确的表示会破坏协调学习，而MARL中的非平稳性进一步降低了表征学习，导致样本训练效率低下。为解决这种结构耦合，我们提出了DeReCo，一种新型MARL框架，能够解耦表示与协调学习，实现对象自适应多机器人协作运输，提升样本效率和跨物体和运输场景的泛化。DeReCo采用三阶段训练策略：（1）集中协调学习，利用特权对象信息;（2）从局部观测中重建对象相关表征;（3）逐步移除特权信息以实现去中心化执行。这种解耦减少了表示学习与协调学习之间的干扰，并实现了稳定且样本高效的训练。实验结果显示，DeReCo在三个训练对象的模拟中表现优于基线，推广到六个质量和摩擦系数不同的未见物体，并在真实机器人实验中对两个未观察物体表现出优异性能。

Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

基于模型的离线强化学习，采用稳健的价值感知模型学习，并带有隐式可微的自适应加权

Authors: Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08118
Pdf link: https://arxiv.org/pdf/2603.08118
Abstract Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, model exploitation could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO (Rigter et al., 2022) has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning with Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradient, ROMI introduces a novel robust value-aware model learning approach. This approach requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at this https URL.
中文摘要 基于模型的离线强化学习（RL）旨在通过动态模型增强离线强化学习，从而促进策略探索。然而，由于不可避免的模型错误，可能会造成 \textit{model exploitation}，从而降低算法性能。对抗模型学习提供了一个理论框架，通过解决极大值公式来减轻模型的利用。在这样的范式下，RAMBO~\citep{rigter2022rambo} 已成为一种具有代表性和最流行的方法，提供了带有模型梯度的实用实现。然而，我们实证显示，RAMBO中仅需轻微超参数调校即可发生严重的Q值低估和梯度爆炸，表明其倾向于过于保守且模型更新不稳定。为解决这些问题，我们提出了 \textbf{RO}bust 值感知 \textbf{M}odel 学习，采用 \textbf{I}可微分自适应权重（ROMI）。ROMI没有用模型梯度更新动力学模型，而是引入了一种新颖的稳健值感知模型学习方法。该方法要求动力学模型预测在可调节状态不确定性集合内接近最小Q值的未来状态，从而实现可控保守和稳定模型更新。为了进一步提升多步推展期间的分布外（OOD）泛化能力，我们提出了隐式可微自适应加权，这是一种双级优化方案，能够自适应实现动态和价值感知型的模型学习。在D4RL和NeoRL数据集上的实证结果显示，ROMI在RAMBO通常表现不佳的数据集上，表现显著优于其他最先进方法，甚至更胜一筹。代码可在此 https URL 访问。

Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

通过强化学习增强远程作和灵巧专家混合VLA实现类人作

Authors: Tutian Tang, Xingyu Ji, Wanli Xing, Ce Hao, Wenqiang Xu, Lin Shao, Cewu Lu, Qiaojun Yu, Jiangmiao Pang, Kaifeng Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08122
Pdf link: https://arxiv.org/pdf/2603.08122
Abstract While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. By utilizing a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating doubled success rate improvement over the baseline in dexterous contact-rich tasks.
中文摘要 尽管视觉-语言-行动（VLA）模型在机器人作方面表现出显著成功，但其应用主要局限于执行简单视觉引导的低自由度终端执行器任务。将这些模型扩展到类人、双手灵巧作——特别是接触丰富的手部作——带来了高保真数据采集、多技能学习和多模态感官融合的关键挑战。本文提出了一个综合框架，以解决这些瓶颈，基于两个组成部分。首先，我们介绍IMCopilot（手中作副驾驶），这是一套基于强化学习训练的原子技能，具有双重作用：它作为共享自治助手，简化远程作数据收集，同时作为VLA可调用的低级执行原语。其次，我们介绍MoDE-VLA（灵巧专家混合VLA），这是一种将异质力和触觉模态无缝整合进预训练VLA骨干的架构。通过利用残差注入机制，MoDE-VLA实现了接触感知的细化，同时不破坏模型的预训练知识。我们在四个复杂度递增的任务中验证了我们的方法，展示了在灵活接触丰富任务中成功率提升了两倍。

RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs

RexDrug：通过推理增强的大型语言模型实现可靠的多药组合提取

Authors: Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.08166
Pdf link: https://arxiv.org/pdf/2603.08166
Abstract Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at this https URL.
中文摘要 大型生物医学文献中的自动药物组合提取（DCE）对于推进精准医疗和药理学研究至关重要。然而，现有的关系提取方法主要关注二元相互作用，难以建模可变长度的n元药物组合，这需要考虑复杂的兼容性逻辑和分布式证据。为解决这些局限性，我们提出了RexDrug，这是一种基于大型语言模型的端到端推理增强关系提取框架，用于n元药物组合提取。RexDrug采用两阶段培训策略。首先，采用多智能体协作机制，自动生成高质量的专家类推理痕迹，用于监督式微调。其次，采用专门针对DCE的多维奖励函数进行强化学习，进一步优化推理质量和提取准确性。DrugComb数据集上的大量实验表明，RexDrug在n元提取方面持续优于最先进的基线。对DDI13语料库的进一步评估确认其可推广至二元药物和药物相互作用任务。人类专家评估和自动推理指标进一步表明，RexDrug能够产生连贯的医学推理，同时准确识别复杂的治疗方案。这些结果确立了RexDrug作为一种可扩展且可靠的解决方案，用于从非结构化文本中提取复杂的生物医学关系。源代码和数据可在此 https URL 获取

SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

SlowBA：针对基于VLM的图形界面代理的效率后门攻击

Authors: Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.08316
Pdf link: https://arxiv.org/pdf/2603.08316
Abstract Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in this https URL.
中文摘要 现代基于视觉语言模型（VLM）的图形用户界面（GUI）代理不仅需要准确执行操作，还需要以低延迟响应用户指令。虽然现有关于图形界面代理安全的研究主要聚焦于操纵动作正确性，但与响应效率相关的安全风险仍大多未被深入探讨。本文介绍了SlowBA，一种针对基于VLM的GUI代理响应性的新型后门攻击。关键思想是通过在特定触发模式下诱导过长的推理链来操控响应延迟。为此，我们提出了一种两阶段的奖励级后门注入（RBI）策略，先对齐长响应格式，然后通过强化学习学习触发感知激活。此外，我们还设计了逼真的弹窗触发器，这些触发器自然出现在图形界面环境中，提升了攻击的隐蔽性。在多个数据集和基线上的大量实验表明，SlowBA可以在大体保持任务准确性的情况下，显著增加响应长度和延迟。即使中毒率较低且在多种防御条件下，攻击依然有效。这些发现揭示了GUI代理此前被忽视的安全漏洞，并凸显了防御措施需要兼顾动作正确性和响应效率。代码可以在这个 https URL 中找到。

Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

揭示大型语言模型中的行为可塑性：标记条件视角

Authors: Liyuan Mao, Le Yu, Jing Zhou, Chujie Zheng, Bowen Yu, Chang Gao, Shixuan Liu, An Yang, Weinan Zhang, JunYang Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08398
Pdf link: https://arxiv.org/pdf/2603.08398
Abstract In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
中文摘要 本研究揭示大型语言模型（LLMs）具有内在的行为可塑性——类似于变色龙根据环境线索调整颜色——可通过标记条件生成暴露，并通过强化学习稳定。具体来说，通过以从表现出期望行为的响应中精心挑选的标记前缀为条件生成，LLMs能够在推理时无缝调整其行为模式（例如，从逐步推理切换到直接回答），而无需重新训练。基于这一见解，我们提出了代币条件强化学习（ToCoRL）这一原则框架，利用强化学习内化这种变色龙般的可塑性，将瞬时的推理时间适应转化为稳定且可学习的行为模式。ToCoRL通过标记条件生成引导探索，持续增强利用，促使适当行为的出现。大量实验表明，ToCoRL能够实现精确的行为控制而不影响能力。值得注意的是，我们展示了大型推理模型虽然在复杂数学方面表现优异，但可以有效地适应事实性问题解答，而这在过去被其逐步推理模式所限制。

A Recipe for Stable Offline Multi-agent Reinforcement Learning

稳定离线多智能体强化学习的配方

Authors: Dongsu Lee, Daehee Lee, Amy Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08399
Pdf link: https://arxiv.org/pdf/2603.08399
Abstract Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap comes from the instability of non-linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that they induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
中文摘要 尽管单智能体离线强化学习（RL）取得了显著成就，多智能体强化学习（MARL）在采用这一范式方面一直困难，主要坚持从零开始采用策略训练和自我游戏。这一差距的一个原因是非线性价值分解的不稳定性，之前的研究建议避免复杂的混合网络，转而采用线性价值分解（如VDN），并在单智能体设置中使用价值正则化。本研究分析了离线MARL环境下非线性值分解不稳定性的根源。我们的观测结果证实，这些元素会诱导价值尺度放大和不稳定优化。为缓解这一问题，我们提出了一种简单的技术——尺度不变值归一化（SVN），该技术稳定actor-critic训练，同时不改变Bellman不动点。我们实证地考察了离线MARL关键组成部分（如价值分解、价值学习和策略提取）之间的相互作用，并推导出一个实用配方，以释放其全部潜力。
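
The linear value decomposition mentioned in the abstract above (VDN) can be illustrated with a small sketch. This is not the paper's SVN technique, whose details the abstract does not give; it only shows the additive form Q_tot = sum of per-agent Q-values and why the greedy joint action then decomposes into per-agent argmaxes. The table values and function names are hypothetical.

```python
from itertools import product

def vdn_total(q_tables, joint_action):
    """VDN-style joint value: the sum of each agent's own Q-value."""
    return sum(q[a] for q, a in zip(q_tables, joint_action))

def greedy_joint_action(q_tables):
    """With an additive decomposition, each agent can maximize independently."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

# Hypothetical per-agent Q-values for one fixed observation (toy numbers).
q_tables = [[0.1, 0.9, 0.3], [0.5, 0.2]]

best = greedy_joint_action(q_tables)

# Brute-force search over all joint actions, for comparison.
all_joint_actions = product(*(range(len(q)) for q in q_tables))
brute = max(all_joint_actions, key=lambda a: vdn_total(q_tables, a))
```

Here `best` and `brute` agree on the joint action `(1, 0)`. A non-linear mixing network does not share this additive structure, which is the setting the abstract identifies as unstable.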

Aligning to Illusions: Choice Blindness in Human and AI Feedback

与幻觉对齐：人类与人工智能反馈中的选择盲点

Authors: Wenbin Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08412
Pdf link: https://arxiv.org/pdf/2603.08412
Abstract Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
中文摘要 人类反馈强化学习（RLHF）假设注释者偏好反映稳定的内部状态。我们通过三项跨越偏好管道的实验来挑战这一观点。在一项人类选择盲人研究中，91%的秘密交换偏好未被发现，将选择盲扩展到对陌生文本的第三人称评价比较。通过测试十五位LLM评审作为潜在替代者，我们发现检测依赖于浅层文本匹配，而非真正的自我监控：从上下文中移除先前推理会导致盲人率从接近零飙升至超过50%，而明确的社会压力则几乎全然顺从。在两种架构（86M到2B参数）的剂量响应实验中，必须有六分之一到三分之一的标签在奖励信号减半前被破坏，但标准的两对准确率几乎保持不变。一项N之最优评估证实，这会导致下游政策退化：在50%的腐败率下，奖励引导选择并未比随机抽样带来改善，而代理模型则报告单调地增加分数。这些结果共同揭示了一个偏好建构问题：进入RLHF的信号受诱发情境影响，这些是人类元认知、LLM自我监控或标准评估指标无法检测到的。

Meta-RL with Shared Representations Enables Fast Adaptation in Energy Systems

具备共享表示的元强化学习实现能源系统的快速适应

Authors: Théo Zangato, Aomar Osmani, Pegah Alizadeh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08418
Pdf link: https://arxiv.org/pdf/2603.08418
Abstract Meta-Reinforcement Learning addresses the critical limitations of conventional Reinforcement Learning in multi-task and non-stationary environments by enabling fast policy adaptation and improved generalization. We introduce a novel Meta-RL framework that integrates a bi-level optimization scheme with a hybrid actor-critic architecture specially designed to enhance sample efficiency and inter-task adaptability. To improve knowledge transfer, we meta-learn a shared state feature extractor jointly optimized across actor and critic networks, providing efficient representation learning and limiting overfitting to individual tasks or dominant profiles. Additionally, we propose a parameter-sharing mechanism between the outer- and inner-loop actor networks, to reduce redundant learning and accelerate adaptation during task revisitation. The approach is validated on a real-world Building Energy Management Systems dataset covering nearly a decade of temporal and structural variability, for which we propose a task preparation method to promote generalization. Experiments demonstrate effective task adaptation and better performance compared to conventional RL and Meta-RL methods.
中文摘要 元强化学习解决了传统强化学习在多任务和非固定环境中的关键局限性，实现了快速的策略适应和改进的泛化能力。我们介绍了一种新颖的Meta-RL框架，将双级优化方案与混合演员-批评者架构结合，专门设计以提升样本效率和任务间适应性。为改善知识转移，我们元学习了跨演员网络和批评者网络联合优化的共享状态特征提取器，实现高效的表示学习，限制对单个任务或主导配置文件的过度拟合。此外，我们提出了一种参数共享机制，用于减少重复学习并加速任务重访时的适应。该方法在涵盖近十年时间和结构变异的真实建筑能源管理系统数据集上得到验证，我们提出了一种任务准备方法以促进泛化。实验显示，与传统强化学习和元强化学习方法相比，任务适应效果有效且性能更好。

Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

推理作为压缩：通过条件信息瓶颈统一预算强制

Authors: Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08462
Pdf link: https://arxiv.org/pdf/2603.08462
Abstract Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y that is not directly accessible from the prompt X. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting-based approaches, we introduce a semantic prior that measures token cost by surprisal under a language model prior. Empirically, our CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.
中文摘要 思维链（CoT）提示提高了复杂任务中的LLM准确性，但通常会增加令牌使用和推理成本。现有的“预算强制”方法通过微调和启发式长度惩罚来降低成本，抑制了本质推理和冗余填充。我们将高效推理重新定义为信息瓶颈（IB）原则下的有损压缩问题，并在将朴素IB应用于变换器时发现了一个关键理论空白：注意在提示、推理追踪和响应之间违反了马尔可夫性质。为解决这个问题，我们根据条件信息瓶颈（CIB）原则对CoT生成进行建模，其中推理痕迹Z作为计算桥梁，仅包含无法直接从提示X访问的关于响应Y的信息。这给出了一个通用的强化学习目标：在先验下压缩完备度以推理迹为主的同时最大化任务奖励，将常见的启发式方法（如长度惩罚）作为特例（如一致先验）来吸收。与基于朴素的token计数方法不同，我们引入了语义先验，通过语言模型先验来衡量token成本。从经验角度看，我们的CIB目标在保持流畅性和逻辑的同时修剪认知膨胀，在中等压缩下提高准确率，并实现激进压缩且准确率下降最小。
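
A minimal sketch of the claim above that a length penalty is the uniform-prior special case of a surprisal-based cost. Everything here is an assumption for illustration: the reward form `task_reward - beta * cost`, the function names, the toy vocabulary, and the toy non-uniform prior are hypothetical and are not the authors' objective.

```python
import math

def surprisal_cost(tokens, prior):
    """Sum of -log p(token) under a prior given as a dict token -> probability."""
    return sum(-math.log(prior[t]) for t in tokens)

def compressed_reward(task_reward, tokens, prior, beta=0.1):
    """Hypothetical objective: task reward minus a weighted compression cost."""
    return task_reward - beta * surprisal_cost(tokens, prior)

vocab = ["so", "well", "hmm", "x=2", "therefore"]

# Uniform prior: every token costs log(len(vocab)), so cost is proportional to length.
uniform_prior = {t: 1.0 / len(vocab) for t in vocab}

# Toy non-uniform prior (made-up numbers): tokens the prior finds predictable are cheap.
toy_prior = {"so": 0.3, "well": 0.3, "hmm": 0.3, "x=2": 0.05, "therefore": 0.05}

trace_a = ["so", "well", "hmm"]
trace_b = ["x=2", "therefore", "so"]

uniform_a = surprisal_cost(trace_a, uniform_prior)
uniform_b = surprisal_cost(trace_b, uniform_prior)
toy_a = surprisal_cost(trace_a, toy_prior)
toy_b = surprisal_cost(trace_b, toy_prior)
```

Both traces have three tokens, so the uniform prior prices them identically, while the non-uniform prior separates them. In the paper the prior is a language model rather than a fixed table.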

Integrating Lagrangian Neural Networks into the Dyna Framework for Reinforcement Learning

将拉格朗日神经网络整合进Dyna强化学习框架

Authors: Shreya Das, Kundan Kumar, Muhammad Iqbal, Outi Savolainen, Dominik Baumann, Laura Ruotsalainen, Simo Särkkä
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08468
Pdf link: https://arxiv.org/pdf/2603.08468
Abstract Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Those methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an underlying Lagrangian structure to train the model within a Dyna-based MBRL framework. Furthermore, we train the LNN using stochastic gradient-based and state-estimation-based optimizers to learn the network's weights. The state-estimation-based method converges faster than the stochastic gradient-based method during neural network training. Simulation results are provided to illustrate the effectiveness of the proposed LNN-based Dyna framework for MBRL.
中文摘要 基于模型的强化学习（MBRL）样本效率高，但依赖于学习动力学的准确性，而这些动力学通常采用不遵循物理定律的黑箱方法建模。这些方法在面对与原始训练集不同的数据时，往往会产生不准确的预测。在本研究中，我们采用拉格朗日神经网络（LNN），这些网络在基于Dyna的MBRL框架内强制执行拉格朗日结构，以训练模型。此外，我们使用基于随机梯度和状态估计的优化器训练LNN，以学习网络权重。基于状态估计的方法在神经网络训练中收敛速度快于基于随机梯度的方法。本书提供了模拟结果，以展示基于LNN的Dyna框架在MBRL中的有效性。

Oracle-Guided Soft Shielding for Safe Move Prediction in Chess

Oracle引导软屏蔽用于国际象棋安全走法预测

Authors: Prajit T Rajendran, Fabio Arnez, Huascar Espinoza, Agnes Delaborde, Chokri Mraidha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08506
Pdf link: https://arxiv.org/pdf/2603.08506
Abstract In high-stakes environments, agents relying purely on imitation learning or reinforcement learning often struggle to avoid safety-critical errors during exploration. Existing reinforcement learning approaches for environments such as chess require hundreds of thousands of episodes and substantial computational resources to converge. Imitation learning, on the other hand, is more sample efficient but is brittle under distributional shift and lacks mechanisms for proactive risk avoidance. In this work, we propose Oracle-Guided Soft Shielding (OGSS), a simple yet effective framework for safer decision-making, enabling safe exploration by learning a probabilistic safety model from oracle feedback in an imitation learning setting. Focusing on the domain of chess, we train a model to predict strong moves based on past games, and separately learn a blunder prediction model from Stockfish evaluations to estimate the tactical risk of each move. During inference, the agent first generates a set of candidate moves, then uses the blunder model to identify high-risk options, and finally uses a utility function combining the predicted move likelihood from the policy model and the blunder probability to select actions that strike a balance between performance and safety. This enables the agent to explore and play competitively while significantly reducing the chance of tactical mistakes. Across hundreds of games against a strong chess engine, we compare our approach with other methods in the literature, such as action pruning, SafeDAgger, and uncertainty-based sampling. Our results demonstrate that OGSS variants maintain a lower blunder rate even as the agent's exploration ratio is increased severalfold, highlighting its ability to support broader exploration without compromising tactical soundness.
中文摘要 在高风险环境中，纯依赖模仿学习或强化学习的智能体常常难以避免探索过程中的安全关键错误。现有的国际象棋等环境中的强化学习方法需要数十万次集数和大量计算资源才能融合。而模仿学习则更高效，但在分布偏移下较为脆弱，缺乏主动风险规避机制。在本研究中，我们提出了甲骨机引导软屏蔽（OGSS），这是一个简单而有效的框架，用于更安全的决策，通过在模仿学习环境中从神谕反馈中学习概率安全模型，实现安全探索。专注于国际象棋领域，我们训练模型以基于以往对局预测强势走法，并单独学习Stockfish评估中的失误预测模型，以估算每步的战术风险。在推理过程中，代理首先生成一组候选走法，然后利用错误模型确定高风险选项，并利用一个效用函数将策略模型中的预测走法概率与错误概率结合起来，选择在性能与安全性之间取得平衡的行动。这使得特工能够探索和竞争，同时大幅降低战术失误的可能性。在数百场对弈强大国际象棋引擎的对局中，我们将方法与文献中的其他方法进行比较，如动作剪枝、SafeDAgger 和基于不确定性的采样。我们的结果表明，即使该代理的探索比率提高了数倍，OGSS变体仍保持较低的失误率，凸显了其支持更广泛探索而不影响战术合理性的能力。
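
The selection step described in the abstract above (a utility that combines policy likelihood with blunder probability) might look like the sketch below. The abstract does not give the utility's form, so the expression `log(p_policy) - risk_weight * p_blunder`, the parameter `risk_weight`, and the candidate moves with their probabilities are all hypothetical.

```python
import math

def select_move(candidates, risk_weight=2.0):
    """Pick the candidate with the highest utility.

    candidates: list of (move, policy_prob, blunder_prob) tuples.
    The utility form is an assumption made for illustration only.
    """
    def utility(candidate):
        _, p_policy, p_blunder = candidate
        return math.log(p_policy) - risk_weight * p_blunder

    return max(candidates, key=utility)[0]

# Made-up candidate set: the most likely move is also the riskiest.
candidates = [
    ("Qxb7", 0.50, 0.80),
    ("Nf3", 0.30, 0.05),
    ("h4", 0.20, 0.30),
]

unshielded = select_move(candidates, risk_weight=0.0)
shielded = select_move(candidates, risk_weight=2.0)
```

With these toy numbers, a zero risk weight picks the policy's top move `Qxb7`, while a risk weight of 2.0 picks `Nf3`. The shielding is soft in the sense that risky moves are down-weighted rather than removed outright.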

Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning

打破凹多目标强化学习中的偏差壁垒

Authors: Swetha Ganesh, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.08518
Pdf link: https://arxiv.org/pdf/2603.08518
Abstract While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_1^\pi,\dots,J_M^\pi)$ over multiple objectives, where each $J_m^\pi$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on $\partial f(J^\pi)$, while in practice only empirical return estimates $\hat J$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an intrinsic $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for computing an $\epsilon$-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same $\widetilde{\mathcal{O}}(\epsilon^{-2})$ rate without MLMC. Our results provide the first optimal sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
中文摘要 虽然标准强化学习优化单个奖励信号，但许多应用需要在多个目标上优化非线性效用$f（J_1^\pi，\dots，J_M^\pi）$，每个$J_m^\pi$表示不同奖励函数的预期折现回报。一种常见方法是凹标量化，它捕捉了公平性和风险敏感性等重要权衡。然而，非线性标量化为策略梯度方法带来了根本性挑战：梯度依赖于$\partial f（J^\pi）$，而实际上只有经验收益估计$\hat J$可用。由于$f$是非线性的，插件估计器存在偏置（$\mathbb{E}[\partial f（\hat J）] \neq \partial f（\mathbb{E}[\hat J]））$），导致持续的梯度偏置，降低样本复杂度。本研究识别并克服了凹标量化多目标强化学习中的这一偏见障碍。我们表明，现有策略梯度方法由于这种偏差，样本复杂度本质上存在 $\widetilde{\mathcal{O}}（\epsilon^{-4}）$。为解决这一问题，我们开发了一种配备多级蒙特卡洛（MLMC）估计器的自然政策梯度（NPG）算法，该算法控制标量化梯度的偏置，同时保持低采样成本。我们证明该方法实现了计算 $\epsilon$-最优策略的最优 $\widetilde{\mathcal{O}}（\epsilon^{-2}）$ 样本复杂度。此外，我们证明当标量化函数为二阶光滑时，一阶偏置自动抵消，使普通NPG能够在不带MLMC的情况下获得相同的$\widetilde{\mathcal{O}}（\epsilon^{-2}）$速率。我们的结果为策略梯度方法下的凹多目标强化学习提供了首个最优样本复杂度保证。
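
The plug-in bias in the abstract above, $\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$, can be checked numerically on a toy case. The sketch assumes a single objective with $f = \log$, so $\partial f(J) = 1/J$, and returns drawn uniformly from [0.5, 1.5] so that $\mathbb{E}[\hat J] = 1$. These choices, the function name, and the sample sizes are illustrative and are not the paper's setting or its MLMC estimator.

```python
import random

def plugin_gradient_bias(n_samples, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[1 / J_hat] - 1 / E[J_hat] for f = log.

    J_hat is the mean of n_samples returns drawn uniformly from [0.5, 1.5],
    so E[J_hat] = 1 and the true gradient of log at that point is 1.
    """
    rng = random.Random(seed)
    true_grad = 1.0
    total = 0.0
    for _ in range(n_trials):
        j_hat = sum(rng.uniform(0.5, 1.5) for _ in range(n_samples)) / n_samples
        total += 1.0 / j_hat  # plug-in gradient evaluated at the noisy estimate
    return total / n_trials - true_grad

bias_few = plugin_gradient_bias(n_samples=2)    # noisy return estimate
bias_many = plugin_gradient_bias(n_samples=50)  # more samples per estimate
```

Because $1/J$ is convex, Jensen's inequality makes the plug-in gradient overestimate on average, and the gap shrinks as more samples go into each $\hat J$. Paying for that many samples on every gradient step is the cost that the bias-controlled estimator in the abstract is meant to avoid.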

Impact of Connectivity on Laplacian Representations in Reinforcement Learning

连通性对强化学习中拉普拉斯表征的影响

Authors: Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.08558
Pdf link: https://arxiv.org/pdf/2603.08558
Abstract Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.
中文摘要 在马尔可夫决策过程（MDPs）中学习紧凑状态表示对于解决大规模强化学习（RL）问题中的维度诅咒至关重要。现有的原则方法利用MDP的结构先验，通过构造状态图拉普拉斯特征向量的线性组合来构造状态表示。当转移图未知或状态空间过大时，可以通过样本轨迹直接估计图谱特征。在本研究中，我们证明了线性价值函数近似在所学到谱特征下近似误差的上界。我们展示了该误差如何与状态图的代数连通性成比例，使近似质量基于MDP的拓扑结构。我们进一步界定了特征向量估计本身引入的误差，从而实现了整个表示学习流水线的端到端误差分解。此外，我们对强化学习（RL）场景下拉普拉斯算子的表达，虽然与现有算子等价，但避免了一些常见误解，我们展示了文献中的一些例子。我们的结果适用于一般（非均匀）策略，且不假设诱导转移核的对称性。我们通过在网格世界环境中进行数值模拟验证理论发现。
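
To make the quantity "algebraic connectivity" in the abstract above concrete, here is a small sketch that computes it for two toy graphs. It assumes the combinatorial Laplacian L = D - A of a connected, undirected graph, which is a simplification: the paper states its results for general policies without symmetry assumptions and uses its own expression of the Laplacian. The helper names and the power-iteration approach are illustrative choices.

```python
def path_graph(n):
    """Adjacency matrix of a path with n nodes."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def cycle_graph(n):
    """Adjacency matrix of a cycle with n nodes."""
    adj = path_graph(n)
    adj[0][n - 1] = adj[n - 1][0] = 1
    return adj

def algebraic_connectivity(adj, iters=2000):
    """Second-smallest eigenvalue of L = D - A (connected undirected graph)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)  # Gershgorin bound on the largest eigenvalue of L

    def apply_shifted(v):
        # Computes (c*I - L) v, where (L v)_i = deg_i * v_i - sum_j A_ij * v_j.
        return [
            c * v[i] - (deg[i] * v[i] - sum(adj[i][j] * v[j] for j in range(n)))
            for i in range(n)
        ]

    def center_and_normalize(v):
        # Removing the mean projects out the constant eigenvector (eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v]

    # Power iteration on c*I - L restricted to mean-zero vectors.
    v = center_and_normalize([float(i) for i in range(n)])
    for _ in range(iters):
        v = center_and_normalize(apply_shifted(v))
    rayleigh = sum(a * b for a, b in zip(v, apply_shifted(v)))
    return c - rayleigh

lam_path = algebraic_connectivity(path_graph(5))
lam_cycle = algebraic_connectivity(cycle_graph(5))
```

Closing the path into a cycle adds one edge and raises the algebraic connectivity. This is the kind of topological quantity the abstract ties the approximation error to.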

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

RetroAgent：通过回顾性双重内在反馈从解决到进化

Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08561
Pdf link: https://arxiv.org/pdf/2603.08561
Abstract Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
中文摘要 基于大型语言模型（LLM）的智能体通过强化学习（RL）训练，在复杂交互任务中展现出强大潜力。然而，标准的强化学习范式更倾向于静态解决问题而非持续适应：代理常因探索不足而趋向次优策略，而学到的知识则隐含在参数内，而非可显式检索，限制了有效的体验式学习。为解决这些局限性，我们引入了RetroAgent，一个在线强化学习框架，使智能体不仅通过求解，更通过进化来掌握复杂的交互环境。具体来说，RetroAgent 具有一种事后诸葛亮的自我反思机制，产生双重内在反馈：（1）内在数值反馈，跟踪相对于先前尝试的增量子任务完成情况，奖励有前景的探索;（2）内在语言反馈，将可重用的经验提炼到记忆缓冲区，通过我们提出的相似度与效用感知上置信度界限（SimUtil-UCB）策略，平衡相关性和效用，以及探索以有效利用过去的经验。在四个具有挑战性的代理任务中对两个模型家族进行的广泛实验表明，RetroAgent 显著优于现有方法，取得了最先进的结果——例如，在 ALFWorld 上以 +18.3% 的成绩超越 Group Relative Policy Optimization（GRPO）训练的代理，在 WebShop 上比 +15.4%，在 Sokoban 上超过+27.1%，在 MineSweeper 上超过+8.9%，同时在测试时间适应和泛化方面表现出强烈的非分布场景。

MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

MetaWorld-X：通过VLM编排专家实现人形机器人移动操作的分层世界建模

Authors: Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chen Jiang, Jianwei Zhang, Lei Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08572
Pdf link: https://arxiv.org/pdf/2603.08572
Abstract Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
中文摘要 为同时执行移动与操作（移动操作，loco-manipulation）的人形机器人学习自然、稳定且具备组合泛化能力的全身控制策略，仍然是机器人学中的一项根本挑战。现有的强化学习方法通常依赖单一的整体式策略来获得多项技能，这常导致高自由度系统中的跨技能梯度干扰和运动模式冲突。因此，生成的行为常表现出动作不自然、稳定性有限以及对复杂任务组合的泛化能力差。为解决这些局限性，我们提出了MetaWorld-X，一种用于人形机器人控制的分层世界模型框架。在分而治之原则的指导下，我们的方法将复杂的控制问题分解为一组专用专家策略（Specialized Expert Policies，SEP）。每个专家都通过模仿约束强化学习，在人体运动先验下接受训练，引入生物力学上一致的归纳偏置，确保生成自然且物理上合理的运动。在此基础上，我们进一步开发了由视觉语言模型（VLM）监督的智能路由机制（IRM），实现语义驱动的专家组合。VLM引导的路由器根据高层任务语义动态集成专家策略，促进多阶段移动操作任务中的组合泛化和自适应执行。

Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control

迈向从批量到流式的深度强化学习以实现连续控制

Authors: Riccardo De Monte, Matteo Cederle, Gian Antonio Susto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.08588
Pdf link: https://arxiv.org/pdf/2603.08588
Abstract State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we further investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.
中文摘要 最先进的深度强化学习（RL）方法在连续控制任务中取得了显著性能，但由于依赖重放缓冲区、批处理更新和目标网络，其计算复杂度常常与资源有限的硬件限制不兼容。新兴的深度强化学习流式学习通过纯在线更新解决了这一局限，在标准基准测试上取得了强劲的实证表现。在本研究中，我们提出了两种新颖的流式深度强化学习算法，即流式软演员-批判者（S2AC）和流确定性演员-批判者（SDAC），明确设计为兼容最先进的批处理强化学习方法，使其特别适合如Sim2Real传输等设备内微调应用。这两种算法在标准基准测试上都能实现与最先进的流式基线相当的性能，无需繁琐的超参数调优。最后，我们进一步探讨了在微调过程中从批学习过渡到流式学习的实际挑战，并提出了具体策略来应对这些挑战。

Diff-Muscle: Efficient Learning for Musculoskeletal Robotic Table Tennis

Diff-Muscle：肌肉骨骼机器人乒乓球的高效学习

Authors: Wentao Zhao, Jun Guo, Kangyao Huang, Xin Liu, Huaping Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08617
Pdf link: https://arxiv.org/pdf/2603.08617
Abstract Musculoskeletal robots provide superior advantages in flexibility and dexterity, positioning them as a promising frontier towards embodied intelligence. However, current research is largely confined to relatively simple tasks, restricting the exploration of their full potential in multi-segment coordination. Furthermore, efficient learning remains a challenge, primarily due to the high-dimensional action space and inherent overactuated structures. To address these challenges, we propose Diff-Muscle, a musculoskeletal robot control algorithm that leverages differential flatness to reformulate policy learning from the redundant muscle-activation space into a significantly lower-dimensional joint space. Furthermore, we utilize the highly dynamic robotic table tennis task to evaluate our algorithm. Specifically, we propose a hierarchical reinforcement learning framework that integrates a Kinematics-based Muscle Actuation Controller (K-MAC) with high-level trajectory planning, enabling a musculoskeletal robot to perform dexterous and precise rallies. Experimental results demonstrate that Diff-Muscle significantly outperforms state-of-the-art baselines in success rates while maintaining minimal muscle activation. Notably, the proposed framework successfully enables the musculoskeletal robots to achieve continuous rallies in a challenging dual-robot setting.
中文摘要 肌肉骨骼机器人在灵活性和灵巧度方面具有优势，使其成为迈向具身智能的有前景的前沿。然而，目前的研究主要局限于相对简单的任务，限制了其在多段协调中全部潜力的探索。此外，高效学习依然是个挑战，主要由于高维作用空间和固有的过度驱动结构。为应对这些挑战，我们提出了Diff-Muscle，一种利用微分平坦性将冗余肌肉激活空间的策略学习重新构建为显著低维关节空间的肌肉骨骼机器人控制算法。此外，我们利用高度动态的机器人乒乓球任务来评估我们的算法。具体来说，我们提出了一种分层强化学习框架，将基于运动学的肌肉驱动控制器（K-MAC）与高水平轨迹规划相结合，使肌肉骨骼机器人能够执行灵巧且精准的拉力动作。实验结果表明，Diff-Muscle 在保持最小肌肉激活的同时，成功率显著优于最先进的基线。值得注意的是，该框架成功使肌肉骨骼机器人能够在具有挑战性的双机器人环境中实现连续拉力。

Embedding Classical Balance Control Principles in Reinforcement Learning for Humanoid Recovery

将经典平衡控制原则融入强化学习以实现人形机器人恢复

Authors: Nehar Poddar, Stephen McCrory, Luigi Penco, Geoffrey Clark, Hakki Erhan Svil, Robert Griffin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08619
Pdf link: https://arxiv.org/pdf/2603.08619
Abstract Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by embedding classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and shaping rewards directly around these quantities during training, while the actor relies solely on proprioception for zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling with multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
中文摘要 人形机器人仍然容易摔倒并陷入无法恢复的失效状态，限制了其在非结构化环境中的实用性。虽然强化学习已经展示了起身行为，但现有方法将恢复视为纯粹的任务奖励问题，没有对平衡状态进行显式表示。我们提出了一种统一的强化学习策略来解决这一局限：将经典平衡指标（捕获点、质心状态和质心动量）作为评论家（critic）的特权输入，并在训练过程中直接围绕这些量进行奖励塑形，而执行者（actor）仅依赖本体感觉，以实现零样本硬件迁移。在没有参考轨迹或脚本化接触的情况下，单一策略涵盖了完整的恢复范围：针对小扰动的踝关节和髋关节策略，在大力推搡下的纠正性迈步，以及顺应性摔倒后利用手、肘和膝盖进行多接触起身。该策略在 Isaac Lab 中针对 Unitree H1-2 进行训练，在随机初始姿态和非脚本化摔倒配置下实现了93.4%的恢复率。一项消融研究表明，去除基于平衡信息的结构会导致起身学习完全失败，证实这些指标提供了有意义的学习信号，而非偶然的结构。向 MuJoCo 的仿真到仿真（sim-to-sim）迁移以及初步硬件实验进一步展示了跨环境的泛化能力。这些结果表明，将可解释的平衡结构嵌入学习框架，可显著减少处于失效状态的时间，并拓宽自主恢复的范围。
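
Editor's sketch: the abstract names the capture point as one of the balance metrics fed to the critic and used for reward shaping, but it does not give a formula. The snippet below uses the standard instantaneous capture point of the linear inverted pendulum model as a stand-in. The function names, the Gaussian-shaped reward, and the `scale` parameter are hypothetical and are not taken from the paper.

```python
import math

GRAVITY = 9.81  # m/s^2


def capture_point(com_pos, com_vel, com_height):
    """Instantaneous capture point under the linear inverted pendulum model.

    com_pos and com_vel are horizontal (x, y) tuples; com_height is the
    center-of-mass height above the ground, assumed constant.
    """
    omega = math.sqrt(GRAVITY / com_height)  # natural frequency of the pendulum
    return tuple(p + v / omega for p, v in zip(com_pos, com_vel))


def capture_point_reward(cp, support_center, scale=0.1):
    """Hypothetical shaping term: 1.0 when the capture point lies on the
    support center, decaying smoothly with horizontal distance."""
    dist = math.dist(cp, support_center)
    return math.exp(-((dist / scale) ** 2))


if __name__ == "__main__":
    cp = capture_point(com_pos=(0.0, 0.0), com_vel=(0.4, 0.0), com_height=0.9)
    print("capture point:", cp)
    print("shaping reward:", capture_point_reward(cp, support_center=(0.0, 0.0)))
```

A capture point that stays inside the support polygon suggests the robot can stop without stepping, so a term like this would reward states from which balance is recoverable. How the authors actually weight or combine their metrics is not stated in the abstract.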

How Far Can Unsupervised RLVR Scale LLM Training?

无监督RLVR能在多大程度上扩展LLM训练？

Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.08660
Pdf link: https://arxiv.org/pdf/2603.08660
Abstract Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
中文摘要 无监督的可验证奖励强化学习（URLVR）无需真实标签即可得到奖励，为LLM训练突破监督瓶颈、进一步扩展提供了一条路径。近期研究利用模型内在信号，显示出有希望的早期收益，但其潜力和局限性尚不明确。在本研究中，我们重新审视了URLVR，并提供了涵盖分类体系、理论及大量实验的全面分析。我们首先根据奖励来源将 URLVR 方法分为内在方法和外部方法，然后建立统一的理论框架，揭示所有内在方法都趋向于锐化模型的初始分布。当初始置信度与正确性一致时，这种锐化机制会成功，但当二者不一致时则会灾难性地失败。通过系统实验，我们表明内在奖励在各种方法中始终呈现先升后降的模式，崩溃时间由模型先验决定，而非工程选择。尽管存在这些扩展限制，我们发现内在奖励在小数据集上的测试时训练中依然有价值，并提出模型崩溃步数（Model Collapse Step）来衡量模型先验，作为强化学习可训练性的实用指标。最后，我们探讨了将验证建立在计算不对称性之上的外部奖励方法，初步证据表明它们可能突破置信度与正确性之间的上限。我们的发现为内在 URLVR 划定了边界，同时为通向可扩展替代方案的路径提供了动力。
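
Editor's sketch: the abstract says intrinsic methods sharpen the model's initial distribution, which helps when confidence aligns with correctness and fails when it does not. As an illustration only, the snippet below assumes a majority-vote pseudo-label as the intrinsic reward. The abstract does not specify which intrinsic signals are studied, and `majority_vote_rewards` is a hypothetical helper.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Label-free reward: 1.0 for samples that agree with the most common
    sampled answer, 0.0 otherwise. Returns (rewards, pseudo_label)."""
    if not answers:
        return [], None
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return rewards, pseudo_label


if __name__ == "__main__":
    # Confident and correct: the reward reinforces the true answer "4".
    print(majority_vote_rewards(["4", "4", "4", "5"]))
    # Confident but wrong: the same rule now reinforces "5", although the
    # true answer is still "4". No ground-truth label is consulted.
    print(majority_vote_rewards(["5", "5", "4", "5"]))
```

Because the reward depends only on what the model already prefers, optimizing it can only concentrate probability on the current mode, which is one way to read the rise-then-fall pattern described in the abstract.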

Agentic Critical Training

智能体批判训练

Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08706
Pdf link: https://arxiv.org/pdf/2603.08706
Abstract Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
中文摘要 将大型语言模型（LLMs）训练为自主智能体通常从模仿学习开始，但这只是教会智能体该做什么，却不让其理解原因：智能体从未将成功的动作与次优替代方案进行对比，因此缺乏对动作质量的认知。近年来，一些方法试图通过引入由专家动作与替代动作的对比得到的自我反思监督来解决这一问题。然而，训练范式本质上仍是模仿学习：模型模仿预先构造的反思文本，而非学习自主推理。我们提出了智能体批判训练（Agentic Critical Training, ACT），这是一种强化学习范式，训练智能体在多个备选动作中识别更优的动作。通过对模型的判断是否正确给予奖励，ACT促使模型自主发展关于动作质量的推理，产生真正的自我反思，而非对其进行模仿。在三个具有挑战性的智能体基准测试中，ACT与不同的后训练方法结合时，持续提升智能体性能。它比模仿学习平均提升5.07分，比强化学习平均提升4.62分。与通过知识蒸馏注入反思能力的方法相比，ACT也展现出明显优势，平均提升2.42分。此外，ACT在智能体基准测试上实现了强大的分布外泛化，并在没有任何推理专用训练数据的情况下提升了通用推理基准测试的性能，凸显了我们方法的价值。这些结果表明，ACT是开发更具反思能力、能力更强的LLM智能体的一条有前景的道路。
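
Editor's sketch: the abstract describes ACT as rewarding the model for whether its judgment about the better action is correct. A minimal version of such a reward might look like the snippet below. It assumes the expert action is the "better" one, pairs it with a single alternative, and shuffles the order so position does not reveal the answer. These details and the function names are assumptions, not the paper's stated procedure.

```python
import random


def build_comparison(expert_action, alternative_action, rng):
    """Return (candidates, better_index) with the order shuffled.

    Assumption: the expert action is treated as the better of the two.
    """
    candidates = [expert_action, alternative_action]
    rng.shuffle(candidates)
    return candidates, candidates.index(expert_action)


def judgment_reward(chosen_index, better_index):
    """Binary reward on the final verdict only, so the reasoning that leads
    to the verdict is left for the model to develop on its own."""
    return 1.0 if chosen_index == better_index else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    candidates, better = build_comparison(
        "click('Add to cart')", "click('Back')", rng
    )
    print(candidates, "better index:", better)
    print("reward for a correct pick:", judgment_reward(better, better))
    print("reward for a wrong pick:", judgment_reward(1 - better, better))
```

The point of the sketch is that only the verdict is scored. No reference reflection text is imitated, which matches the contrast with imitation learning that the abstract draws.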

Keyword: diffusion policy

DexKnot: Generalizable Visuomotor Policy Learning for Dexterous Bag-Knotting Manipulation

DexKnot：面向灵巧塑料袋打结操作的可泛化视觉运动策略学习

Authors: Jiayuan Zhang, Ruihai Wu, Haojun Chen, Yuran Wang, Yifan Zhong, Ceyao Zhang, Yaodong Yang, Yuanpei Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.07136
Pdf link: https://arxiv.org/pdf/2603.07136
Abstract Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle in generalization to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints can be identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot action based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the dimensionality of observation space into a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.
中文摘要 给塑料袋打结是日常生活中的常见任务，但由于塑料袋具有无限自由度和复杂的物理动力学，对机器人来说却是一项挑战。现有方法在泛化到未见过的袋子实例或形变时常常存在困难。为此，我们提出了DexKnot，一个结合关键点可供性（affordance）和扩散策略的框架，以学习可泛化的塑料袋打结策略。我们的方法利用通过真实世界手工形变收集的关键点对应数据，学习与形状无关的袋子表示。对于未见过的袋子构型，可以通过将该表示与参考进行匹配来识别关键点。这些关键点随后被提供给扩散Transformer，由其基于少量人类演示生成机器人动作。DexKnot通过将观测空间的维度降低为一组稀疏的关键点，实现了有效的策略泛化。实验表明，DexKnot在多种此前未见过的实例和形变上都能实现可靠且一致的打结性能。
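
Editor's sketch: the abstract says keypoints on an unseen bag are found by matching a learned shape-agnostic representation to a reference. One simple way such matching could work, assumed here purely for illustration, is nearest-neighbor search over per-point descriptors using cosine similarity. The descriptor format and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_keypoints(reference_descriptors, point_descriptors):
    """For each reference keypoint descriptor, return the index of the most
    similar descriptor among the points observed on the new bag."""
    matches = []
    for ref in reference_descriptors:
        sims = [cosine_similarity(ref, d) for d in point_descriptors]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


if __name__ == "__main__":
    reference = [[1.0, 0.0], [0.0, 1.0]]              # two reference keypoints
    observed = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]  # descriptors on a new bag
    print(match_keypoints(reference, observed))
```

The matched indices would then select a sparse set of points to pass to the policy, in place of the full observation, which is the dimensionality reduction the abstract credits for generalization.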