Arxiv Papers of Today

Generated: 2026-03-13 16:44:26 (UTC+8); arXiv release time: 2026-03-13 20:00 EDT (2026-03-14 08:00 UTC+8)

48 relevant papers today

Keyword: reinforcement learning

ResWM: Residual-Action World Model for Visual RL

ResWM：视觉强化学习的残余作用世界模型

Authors: Jseen Zhang, Gabriel Adineera, Jinzhou Tan, Jinoh Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.11110
Pdf link: https://arxiv.org/pdf/2603.11110
Abstract Learning predictive world models from raw visual observations is a central challenge in reinforcement learning (RL), especially for robotics and continuous control. Conventional model-based RL frameworks directly condition future predictions on absolute actions, which makes optimization unstable: the optimal action distributions are task-dependent, unknown a priori, and often lead to oscillatory or inefficient control. To address this, we introduce the Residual-Action World Model (ResWM), a new framework that reformulates the control variable from absolute actions to residual actions -- incremental adjustments relative to the previous step. This design aligns with the inherent smoothness of real-world control, reduces the effective search space, and stabilizes long-horizon planning. To further strengthen the representation, we propose an Observation Difference Encoder that explicitly models the changes between adjacent frames, yielding compact latent dynamics that are naturally coupled with residual actions. ResWM is integrated into a Dreamer-style latent dynamics model with minimal modifications and no extra hyperparameters. Both imagination rollouts and policy optimization are conducted in the residual-action space, enabling smoother exploration, lower control variance, and more reliable planning. Empirical results on the DeepMind Control Suite demonstrate that ResWM achieves consistent improvements in sample efficiency, asymptotic returns, and control smoothness, significantly surpassing strong baselines such as Dreamer and TD-MPC. Beyond performance, ResWM produces more stable and energy-efficient action trajectories, a property critical for robotic systems deployed in real-world environments. These findings suggest that residual action modeling provides a simple yet powerful principle for bridging algorithmic advances in RL with the practical requirements of robotics.
中文摘要 从原始视觉观察中学习预测世界模型是强化学习（RL）中的核心挑战，尤其是在机器人技术和连续控制领域。传统的基于模型的强化学习框架直接将未来预测置于绝对动作，这使得优化不稳定：最优动作分布依赖任务，先验未知，且常导致振荡或控制效率低下。为此，我们引入了残差作用世界模型（ResWM），这是一个新的框架，将控制变量从绝对作用重新表述为残差作用——相对于前一步的增量调整。这种设计符合现实控制固有的平滑性，减少了有效搜索空间，并稳定了长视野规划。为进一步强化表述，我们提出了一种观测差分编码器，明确建模相邻帧之间的变化，产生紧凑的潜在动力学，自然与残差作用耦合。ResWM集成到Dreamer风格的潜在动力学模型中，几乎没有修改，没有额外的超参数。想象力推广和策略优化均在残余行动领域进行，使探索更顺畅，控制方差更低，规划更可靠。DeepMind 控制套件的实证结果表明，ResWM 在样本效率、渐近回波和控制平滑度方面持续提升，显著超越了 Dreamer 和 TD-MPC 等强基线。除了性能，ResWM还能产生更稳定、更节能的动作轨迹，这对在实际环境中部署的机器人系统至关重要。这些发现表明，残差作用建模为强化学习的算法进步与机器人的实际需求之间提供了简单而强大的原理。
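
The residual-action idea in the ResWM abstract can be illustrated with a minimal sketch. The abstract only states that the control variable becomes an incremental adjustment relative to the previous step; the clipping to an action range and the names `apply_residual_action` and `rollout_actions` are assumptions made here for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def apply_residual_action(prev_action: Sequence[float],
                          residual: Sequence[float],
                          low: float = -1.0,
                          high: float = 1.0) -> List[float]:
    """Return a_t = clip(a_{t-1} + delta_t, low, high), element-wise.

    The [low, high] bounds are an assumed action range, not from the paper.
    """
    return [min(high, max(low, a + d)) for a, d in zip(prev_action, residual)]


def rollout_actions(initial_action: Sequence[float],
                    residuals: Sequence[Sequence[float]],
                    low: float = -1.0,
                    high: float = 1.0) -> List[List[float]]:
    """Accumulate a sequence of residuals into absolute actions."""
    actions = []
    action = list(initial_action)
    for delta in residuals:
        action = apply_residual_action(action, delta, low, high)
        actions.append(action)
    return actions


if __name__ == "__main__":
    # Three identical residual steps; the third is clipped at the upper bound.
    print(rollout_actions([0.0], [[0.5], [0.5], [0.5]]))
```

Because each step only adds a small adjustment, consecutive absolute actions stay close to each other, which is one way to read the smoothness argument in the abstract.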

Learning Tree-Based Models with Gradient Descent

学习带有梯度下降的树状模型

Authors: Sascha Marton
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.11117
Pdf link: https://arxiv.org/pdf/2603.11117
Abstract Tree-based models are widely recognized for their interpretability and have proven effective in various application domains, particularly in high-stakes domains. However, learning decision trees (DTs) poses a significant challenge due to their combinatorial complexity and discrete, non-differentiable nature. As a result, traditional methods such as CART, which rely on greedy search procedures, remain the most widely used approaches. These methods make locally optimal decisions at each node, constraining the search space and often leading to suboptimal tree structures. Additionally, their demand for custom training methods precludes a seamless integration into modern machine learning (ML) approaches. In this thesis, we propose a novel method for learning hard, axis-aligned DTs through gradient descent. Our approach utilizes backpropagation with a straight-through operator on a dense DT representation, enabling the joint optimization of all tree parameters, thereby addressing the two primary limitations of traditional DT algorithms. First, gradient-based training is not constrained by the sequential selection of locally optimal splits but, instead, jointly optimizes all tree parameters. Second, by leveraging gradient descent for optimization, our approach seamlessly integrates into existing ML approaches, e.g., for multimodal and reinforcement learning tasks, which inherently rely on gradient descent. These advancements allow us to achieve state-of-the-art results across multiple domains, including interpretable DTs for small tabular datasets, advanced models for complex tabular data, multimodal learning, and interpretable reinforcement learning without information loss. By bridging the gap between DTs and gradient-based optimization, our method significantly enhances the performance and applicability of tree-based models across various ML domains.
中文摘要 基于树的模型因其可解释性而广受认可，并在多个应用领域，尤其是高风险领域中，已被证明非常有效。然而，学习决策树（DT）因其组合复杂性和离散、不可微性而面临重大挑战。因此，依赖贪婪搜索程序的传统方法如CART，仍然是最广泛使用的方法。这些方法在每个节点做出局部最优决策，限制搜索空间，常常导致树结构不优。此外，他们对定制训练方法的需求阻碍了与现代机器学习（ML）方法的无缝整合。在本论文中，我们提出了一种通过梯度下降学习硬轴对齐DT的新方法。我们的方法利用在密集的DT表示上使用直通算符进行反向传播，实现所有树参数的联合优化，从而解决了传统DT算法的两个主要局限。首先，基于梯度的训练不受局限于顺序选择局部最优分拆，而是联合优化所有树参数。其次，通过利用梯度下降进行优化，我们的方法能够无缝集成到现有机器学习方法中，例如多模态和强化学习任务，这些任务本质上依赖梯度下降。这些进展使我们能够在多个领域实现最先进的结果，包括针对小型表格数据集的可解释DTs、复杂表格数据的高级模型、多模态学习以及无信息丢失的可解释强化学习。通过弥合基于梯度的技术分析与梯度优化之间的差距，我们的方法显著提升了基于树模型在各机器学习领域的性能和适用性。
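
The straight-through idea mentioned in the abstract above can be sketched on a toy problem: a single hard split whose threshold is trained by gradient descent. This is not the thesis's dense tree representation. The sigmoid surrogate, the fixed leaf values, the squared-error loss, and the name `fit_threshold` are all assumptions chosen to keep the example small.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_threshold(xs, ys, threshold=0.2, lr=0.05, epochs=50):
    """Learn one hard split `x > threshold` with a straight-through gradient.

    Forward pass: a hard 0/1 decision. Backward pass: the derivative of a
    sigmoid surrogate is used in place of the (zero) derivative of the step.
    """
    left_value, right_value = 0.0, 1.0  # fixed leaf predictions (assumption)
    for _ in range(epochs):
        grad = 0.0
        for x, y in zip(xs, ys):
            hard = 1.0 if x > threshold else 0.0       # forward: hard decision
            pred = hard * right_value + (1.0 - hard) * left_value
            soft = sigmoid(x - threshold)              # surrogate for backward
            d_hard_d_thr = -soft * (1.0 - soft)        # d(surrogate)/d(threshold)
            # Squared-error loss, chain rule through the surrogate.
            grad += 2.0 * (pred - y) * (right_value - left_value) * d_hard_d_thr
        threshold -= lr * grad
    return threshold


if __name__ == "__main__":
    xs = [0.1, 0.3, 0.4, 0.6, 0.8]
    ys = [0, 0, 0, 1, 1]
    print(fit_threshold(xs, ys))
```

Only misclassified points contribute to the gradient, so the threshold moves until the hard split separates the two classes and then stops changing.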

Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion

通过多智能体系统和组合融合增强大型语言模型的价值对齐

Authors: Yuanhong Wu, Djallel Bouneffouf, D. Frank Hsu
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.11126
Pdf link: https://arxiv.org/pdf/2603.11126
Abstract Aligning large language models (LLMs) with human values is a central challenge for ensuring trustworthy and safe deployment. While existing methods such as Reinforcement Learning from Human Feedback (RLHF) and its variants have improved alignment, they often rely on a single evaluator or narrowly defined reward signals, limiting their ability to capture ethical pluralism. In this work, we propose the Value Alignment System using Combinatorial Fusion Analysis (VAS-CFA), a framework that operationalizes multi-agent fusion alignment. It instantiates multiple moral agents, each fine-tuned to represent a distinct normative perspective, and fuses their outputs using CFA with both rank- and score-based aggregation. This design leverages cognitive diversity between agents to mitigate conflicts and redundancies across multiple agents, producing responses that better reflect human values. Empirical evaluation demonstrates that VAS-CFA outperforms both single-agent baselines and prior aggregation approaches on standard metrics, showing that multi-agent fusion provides a robust and effective mechanism for advancing value alignment in LLMs.
中文摘要 将大型语言模型（LLMs）与人类价值观对齐，是确保可信和安全部署的核心挑战。虽然现有方法如人类反馈强化学习（RLHF）及其变体有所改进，但它们通常依赖单一评估器或狭义的奖励信号，限制了其捕捉伦理多元的能力。在本研究中，我们提出了基于组合融合分析（VAS-CFA）的价值对齐系统，这一框架将多代理融合对齐运用化。它实例化多个道德主体，每个主体都经过微调以代表不同的规范视角，并通过CFA将其输出与基于排名和分数的聚合结合起来。该设计利用智能体间的认知多样性，减少多智能体间的冲突和重复，产生更能体现人类价值观的反应。实证评估表明，VAS-CFA在标准指标上优于单一智能体基线和以往聚合方法，表明多智能体融合为推动大型语言模型价值对齐提供了稳健且有效的机制。
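
A rough sketch of what rank-based and score-based aggregation across several agents can look like is given below. The abstract does not specify how VAS-CFA combines outputs, so the plain averaging used here, the tie-breaking by insertion order, and the names `score_to_rank`, `score_fusion`, and `rank_fusion` are illustrative assumptions only.

```python
from typing import Dict, List


def score_to_rank(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank candidates by descending score; rank 1 is best.

    Ties keep insertion order because sorted() is stable (an assumption
    made for this sketch).
    """
    ordered = sorted(scores, key=lambda c: scores[c], reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def score_fusion(agent_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each candidate's score across agents (higher is better)."""
    candidates = list(agent_scores[0])
    n = len(agent_scores)
    return {c: sum(s[c] for s in agent_scores) / n for c in candidates}


def rank_fusion(agent_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each candidate's rank across agents (lower is better)."""
    ranks = [score_to_rank(s) for s in agent_scores]
    candidates = list(agent_scores[0])
    n = len(ranks)
    return {c: sum(r[c] for r in ranks) / n for c in candidates}


if __name__ == "__main__":
    # Hypothetical scores that two agents assign to three candidate responses.
    agents = [
        {"a": 0.9, "b": 0.5, "c": 0.1},
        {"a": 0.2, "b": 0.6, "c": 0.4},
    ]
    print(score_fusion(agents))
    print(rank_fusion(agents))
```

The two views can disagree: a candidate that one agent scores very highly may tie on average score yet lose on average rank, which is why keeping both aggregations is informative.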

DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

DeReason：难度意识课程改进了解耦SFT再强化学习的通用推理训练

Authors: Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.11193
Pdf link: https://arxiv.org/pdf/2603.11193
Abstract Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为激发大型语言模型推理能力的强大范式，尤其是在数学和编码领域。尽管近期努力将这一范式扩展到更广泛的通用科学（STEM）领域，但监督微调（SFT）与强化学习（RL）在这些背景下的复杂相互作用仍未被充分探讨。本文进行了受控实验，揭示了一个关键挑战：对于一般STEM领域，直接应用于基础模型的强化学习样本效率极低，且在中等质量的响应上，持续被监督微调（SFT）超越。然而，顺序SFT和强化学习可以进一步提升性能，表明这两个阶段起着互补的作用，训练数据的分配也很重要。因此，我们提出了DeReason，一种基于难度的数据解耦策略，用于通用推理。DeReason 通过基于大型语言模型的评分估计的推理强度，将训练数据划分为推理强度和非推理强度的子集。它为SFT分配了广泛覆盖、非推理密集型问题，以建立基础领域知识，并为强化学习保留了一部分重点的复杂问题，用于培养复杂推理能力。我们证明了这种原则性的解耦比随机拆分数据在顺序SFT和RL中更优。对一般STEM和数学基准的广泛实验表明，我们的解耦课程培训显著优于仅SFT、仅RL和随机拆分基线。我们的工作系统地研究了SFT与RL之间的相互作用，提供了一套高效且通用的训练后方案。
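
The data allocation described in the DeReason abstract can be sketched as a simple partition by a reasoning-intensity score. In the paper the scores come from an LLM-based scorer; here they are plain numbers supplied by the caller, and the threshold value, the score scale, and the name `split_by_reasoning_intensity` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_by_reasoning_intensity(
    problems: Sequence[str],
    scores: Sequence[float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Partition problems into (sft_set, rl_set).

    Problems scoring below `threshold` go to SFT (broad coverage);
    problems at or above it are reserved for RL (reasoning-intensive).
    """
    sft_set: List[str] = []
    rl_set: List[str] = []
    for problem, score in zip(problems, scores):
        if score >= threshold:
            rl_set.append(problem)
        else:
            sft_set.append(problem)
    return sft_set, rl_set


if __name__ == "__main__":
    problems = ["recall a definition", "multi-step proof", "unit conversion"]
    scores = [1.0, 4.5, 2.0]  # hypothetical scores on an assumed 1-5 scale
    print(split_by_reasoning_intensity(problems, scores, threshold=3.0))
```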

Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning

Senna-2：整合VLM与端到端驾驶政策，以实现一致的决策和规划

Authors: Yuehao Song, Shaoyu Chen, Hao Gao, Yifan Zhu, Weixiang Yue, Jialv Zou, Bo Jiang, Zihao Lu, Yu Wang, Qian Zhang, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.11219
Pdf link: https://arxiv.org/pdf/2603.11219
Abstract Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
中文摘要 视觉语言模型（VLMs）通过利用高层语义推理，增强端到端（E2E）驱动政策的规划能力。然而，现有方法常忽视VLM高层决策与E2E低层规划之间的双系统一致性。因此，生成的轨迹可能与预期的驱动决策不匹配，导致系统的自上而下引导和决策跟随能力减弱。为解决这一问题，我们提出了Senna-2，一种先进的VLM-E2驱动政策，明确将两者系统对齐，实现一致的决策和规划。我们的方法遵循以一致性为导向的三阶段训练范式。第一阶段，我们进行驱动预训练以实现初步决策和规划，决策适配器以隐式嵌入的形式将VLM决策传输到E2E策略。第二阶段，我们将VLM和端对端政策（E2E）政策对齐于开环环境。第三阶段，我们通过自下而上的层级强化学习在3DGS环境中进行闭环对齐，以强化安全性和效率。大量实验表明，Senna-2在双系统一致性方面实现了优越的稳定性（F1分数提升19.3%），并在开环（FDE降低5.7%）和闭环（AF-CR降低30.6%）中显著提升了驾驶安全。

ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning

ExecVerify：白盒强化学习，代码执行推理提供可验证的分步骤奖励

Authors: Lingxiao Tang, He Ye, Zhaoyang Chu, Muyang Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.11226
Pdf link: https://arxiv.org/pdf/2603.11226
Abstract Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction chains and (2) natural-language descriptions of execution traces. However, intermediate execution steps cannot be explicitly verified during SFT, so the training objective can reduce to merely matching teacher explanations. Moreover, training data is typically collected without explicit control over task difficulty. We introduce ExecVerify, which goes beyond text imitation by incorporating verifiable white-box rewards derived from execution traces, including next-statement prediction and variable value/type prediction. Our work first builds a dataset with multiple difficulty levels via constraint-based program synthesis. Then, we apply reinforcement learning (RL) to reward correct answers about both intermediate execution steps and final outputs, aligning the training objective with semantic correctness at each execution step. Finally, we adopt a two-stage training pipeline that first enhances execution reasoning and then transfers to code generation. Experiments demonstrate that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9% on code generation tasks over strong post-training baselines.
中文摘要 代码大型语言模型在代码执行推理上仍然存在困难，尤其是在较小的模型中。现有方法依赖监督微调（SFT）和教师生成的解释，主要有两种形式：（1）输入输出（I/O）预测链和（2）执行痕迹的自然语言描述。然而，SFT中无法明确验证中间执行步骤，因此培训目标可能仅为匹配教师的解释。此外，训练数据通常没有明确控制任务难度。我们介绍ExecVerify，它超越了文本模仿，还结合了基于执行痕迹得出的可验证白盒奖励，包括下一语句预测和变量值/类型预测。我们的工作首先通过基于约束的程序综合构建一个具有多难度级别的数据集。然后，我们应用强化学习（RL）来奖励关于中间执行步骤和最终输出的正确答案，使训练目标在每个执行步骤的语义正确性上保持一致。最后，我们采用了两阶段训练流程，先提升执行推理能力，然后转入代码生成。实验表明，使用ExecVerify训练的7B模型在代码推理基准测试中的性能可与32B模型相当，且在代码生成任务上的较强的训练后基线提升pass@1多达5.9%。

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

事后诸葛亮的策略优化：在稀疏奖励环境中将失败转化为反馈

Authors: Yuning Wu, Ke Wang, Devin Chen, Kai Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.11321
Pdf link: https://arxiv.org/pdf/2603.11321
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
中文摘要 带可验证奖励的强化学习（RLVR）已成为训练后推理模型的有前景范式。然而，基于群体的方法如群体相对策略优化（GRPO）在稀疏奖励环境中面临关键困境：纯强化学习（RL）存在优势崩溃和高方差梯度估计，而混合策略优化则引入持续的分布偏见。为解决这一难题，我们引入了事后诸葛亮锚定策略优化（HAPO）。HAPO采用合成成功注入（SSI）算子，这是一种事后诸葛亮机制，能够在失败时选择性地将优化锚定于教师演示。这种注入由受汤普森采样启发的门槛机制管理，创建自主、自定进度的课程。理论上，我们证明HAPO实现了\textit{渐近一致性}：通过自然退火教师信号，随着策略改进，HAPO恢复了无偏的政策梯度。这确保了非政策指导作为临时支架而非持续的天花板，使模型能够超越静态教师强制的局限。
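
The "advantage collapse" that the HAPO abstract attributes to group-based methods in sparse-reward settings can be seen in a few lines. The sketch below uses the commonly described group-relative normalization (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` constant, and the function name are assumptions, and HAPO's own SSI operator and gating are not implemented here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Every sampled response failed: all advantages are zero, so the
    # policy-gradient signal for this prompt vanishes.
    print(group_relative_advantages([0, 0, 0, 0]))
    # One success in the group restores a non-zero learning signal.
    print(group_relative_advantages([1, 0, 0, 0]))
```

When every response in a group receives the same reward, the numerator is zero for all of them, which is the failure case the abstract's hindsight mechanism is meant to address.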

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

带有自我反思的元强化学习用于智能搜索

Authors: Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.11327
Pdf link: https://arxiv.org/pdf/2603.11327
Abstract This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at this https URL.
中文摘要 本文介绍了MR-Search，这是一种基于情境的元强化学习（RL）形式，用于带有自我反思的代理搜索。MR-Search 不是在单个独立剧集内优化策略，而是训练一个基于过去剧集并调整搜索策略的策略。MR-Search 通过自我反思学习搜索策略，使搜索代理在测试时能够提升上下文中的探索能力。具体来说，MR-Search通过在每集后生成明确的自我反思，并利用这些内容作为额外上下文指导后续尝试，从而实现跨剧集探索，从而促进测试期间更有效的探索。我们还引入了一种多回合强化学习算法，能够估算回合层面的密集相对优势，从而实现每集的细粒度署名分配。跨多个基准测试的实证结果显示MR-Search相较于基于基线的强化学习（RL）优势，八个基准测试中显示出强有力的泛化性和相对提升，分别达到9.2%至19.3%。我们的代码和数据可在此 https URL 访问。

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

学习协助：基于物理的多智能体强化学习人与人控制

Authors: Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa, Katerina Fragkiadaki
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11346
Pdf link: https://arxiv.org/pdf/2603.11346
Abstract Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.
中文摘要 类人机器人具有强大潜力，能够改变日常服务和护理应用。尽管物理引擎（GMT）中通用运动跟踪技术的最新进展使虚拟角色和类人机器人能够重现广泛的人类动作，但这些行为主要局限于无接触的社交互动或孤立的运动。相比之下，辅助情境需要持续关注人类伴侣，并迅速适应其不断变化的姿势和动态。本文将模拟紧密交互、力交换的人人运动序列，提出多智能体强化学习问题。我们在物理模拟器中共同训练支持者（助理）代理和接收代理的合作伙伴感知策略，以追踪辅助运动参考。为了使该问题更易解决，我们引入了一种合作伙伴策略初始化方案，将单人运动跟踪控制器的先验数据转移，大大提升了探索水平。我们还提出了动态引用重定向和接触促进奖励，这些奖励将助理的引用动作调整到接收者的实时姿势，并鼓励有意义的身体支持。我们证明AssistMimic是首个能够在既定基准上成功追踪辅助互动动作的方法，展示了多智能体强化学习方案在物理基础和社会意识类人控制方面的优势。

Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning

通过混合大型语言模型（LLM）-符号规划和LLM引导强化学习的新颖适应

Authors: Hong Lu, Pierrick Lorang, Timothy R. Duggan, Jivko Sinapov, Matthias Scheutz
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.11351
Pdf link: https://arxiv.org/pdf/2603.11351
Abstract In dynamic open-world environments, autonomous agents often encounter novelties that hinder their ability to find plans to achieve their goals. Specifically, traditional symbolic planners fail to generate plans when the robot's planning domain lacks the operators that enable it to interact appropriately with novel objects in the environment. We propose a neuro-symbolic architecture that integrates symbolic planning, reinforcement learning, and a large language model (LLM) to learn how to handle novel objects. In particular, we leverage the common sense reasoning capability of the LLM to identify missing operators, generate plans with the symbolic AI planner, and write reward functions to guide the reinforcement learning agent in learning control policies for newly identified operators. Our method outperforms the state-of-the-art methods in operator discovery as well as operator learning in continuous robotic domains.
中文摘要 在动态的开放世界环境中，自主智能体常常会遇到新奇事物，阻碍他们寻找实现目标的计划。具体来说，当机器人的规划领域缺乏能够与环境中新对象适当交互的操作符时，传统的符号规划器无法生成计划。我们提出了一种神经符号架构，整合了符号规划、强化学习和大型语言模型（LLM），用于学习如何处理新对象。特别是，我们利用LLM的常识推理能力识别缺失操作符，使用符号AI规划器生成计划，编写奖励函数以指导强化学习主体对新识别操作符的控制策略。我们的方法在连续机器人领域中，在操作员发现和操作员学习方面表现优于最先进的方法。

abx_amr_simulator: A simulation environment for antibiotic prescribing policy optimization under antimicrobial resistance

abx_amr_simulator：一个用于抗菌耐药性下抗生素处方政策优化的模拟环境

Authors: Joyce Lee, Seth Blumberg
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Arxiv link: https://arxiv.org/abs/2603.11369
Pdf link: https://arxiv.org/pdf/2603.11369
Abstract Antimicrobial resistance (AMR) poses a global health threat, reducing the effectiveness of antibiotics and complicating clinical decision-making. To address this challenge, we introduce abx_amr_simulator, a Python-based simulation package designed to model antibiotic prescribing and AMR dynamics within a controlled, reinforcement learning (RL)-compatible environment. The simulator allows users to specify patient populations, antibiotic-specific AMR response curves, and reward functions that balance immediate clinical benefit against long-term resistance management. Key features include a modular design for configuring patient attributes, antibiotic resistance dynamics modeled via a leaky-balloon abstraction, and tools to explore partial observability through noise, bias, and delay in observations. The package is compatible with the Gymnasium RL API, enabling users to train and test RL agents under diverse clinical scenarios. From an ML perspective, the package provides a configurable benchmark environment for sequential decision-making under uncertainty, including partial observability induced by noisy, biased, and delayed observations. By providing a customizable and extensible framework, abx_amr_simulator offers a valuable tool for studying AMR dynamics and optimizing antibiotic stewardship strategies under realistic uncertainty.
中文摘要 抗微生物耐药性（AMR）构成全球健康威胁，降低抗生素的有效性并使临床决策更加复杂。为应对这一挑战，我们推出了abx_amr_simulator，这是一款基于Python的仿真软件包，旨在模拟抗生素开具和AMR动态，在受控且支持强化学习（RL）的环境中。该模拟器允许用户指定患者群体、抗生素特异性AMR反应曲线及奖励函数，以平衡即时临床益处与长期耐药性管理。关键功能包括用于配置患者属性的模块化设计、通过泄漏气球抽象建模抗生素耐药性动态，以及通过噪声、偏置和观测延迟探索部分可观测性的工具。该软件包兼容Gymnasium RL API，使用户能够在多种临床场景下训练和测试强化学习代理。从机器学习的角度来看，该软件包提供了一个可配置的基准环境，用于在不确定性下做出顺序决策，包括由噪声、偏置和延迟观测引起的部分可观测性。通过提供可定制且可扩展的框架，abx_amr_simulator为研究AMR动态和优化现实不确定性下的抗生素管理策略提供了宝贵工具。

Ensuring Safety in Automated Mechanical Ventilation through Offline Reinforcement Learning and Digital Twin Verification

通过离线强化学习和数字孪生验证确保自动机械通气的安全

Authors: Hang Yu, Huidong Liu, Qingchen Zhang, William Joy, Kateryna Nikulina, Andreas A. Schuppert, Sina Saffaran, Declan Bates
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.11372
Pdf link: https://arxiv.org/pdf/2603.11372
Abstract Mechanical ventilation (MV) is a life-saving intervention for patients with acute respiratory failure (ARF) in the ICU. However, inappropriate ventilator settings could cause ventilator-induced lung injury (VILI). Also, clinicians' workload has been shown to be directly linked to patient outcomes. Hence, MV should be personalized and automated to improve patient outcomes. Previous attempts to incorporate personalization and automation in MV include traditional supervised learning and offline reinforcement learning (RL) approaches, which often neglect temporal dependencies and rely excessively on mortality-based rewards. As a result, early-stage physiological deterioration and the risk of VILI are not adequately captured. To address these limitations, we propose Transformer-based Conservative Q-Learning (T-CQL), a novel offline RL framework that integrates a Transformer encoder for effective temporal modeling of patient dynamics, conservative adaptive regularization based on uncertainty quantification to ensure safety, and consistency regularization for robust decision-making. We build a clinically informed reward function that incorporates indicators of VILI and a score for the severity of patients' illness. Also, previous work predominantly uses Fitted Q-Evaluation (FQE) for RL policy evaluation on static offline data, which is less responsive to dynamic environmental changes and susceptible to distribution shifts. To overcome these evaluation limitations, interactive digital twins of ARF patients were used for online "at the bedside" evaluation. Our results demonstrate that T-CQL consistently outperforms existing state-of-the-art offline RL methodologies, providing safer and more effective ventilatory adjustments. Our framework demonstrates the potential of Transformer-based models combined with conservative RL strategies as a decision support tool in critical care.
中文摘要 机械通气（MV）是ICU急性呼吸衰竭（ARF）患者的救命干预。然而，不合适的呼吸机设置可能导致呼吸机诱发的肺损伤（VILI）。此外，临床医生的工作量与患者的治疗结果直接相关。因此，MV应实现个性化和自动化，以改善患者治疗效果。此前在虚拟智能中融入个性化和自动化的尝试包括传统的监督式学习和离线强化学习（RL）方法，这些方法常常忽视时间依赖性，过度依赖基于死亡率的奖励。因此，早期生理恶化和VILI风险未能被充分捕捉。为解决这些局限性，我们提出了基于Transformer的保守Q-学习（T-CQL）新颖的离线强化学习框架，集成了Transformer编码器以有效建模患者动态，采用基于不确定性量化的保守自适应正则化以确保安全性，并采用一致性正则化以实现稳健决策。我们构建了一个临床知情的奖励函数，包含VILI指标和患者病情严重程度评分。此外，以往的工作主要使用拟合Q评估（FQE）来评估静态离线数据的强化学习策略，该方法对动态环境变化响应较差，且易受分布变化影响。为克服这些评估局限，ARF患者的互动数字孪生被用于在线“床边”评估。我们的结果表明，T-CQL持续优于现有最先进的离线强行学习方法，提供了更安全、更有效的通气调整。我们的框架展示了基于Transformer的模型与保守的强化学习策略结合，作为重症监护决策支持工具的潜力。
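
For readers unfamiliar with the "conservative" part of Conservative Q-Learning, the sketch below computes the widely used CQL regularizer for a discrete action set: the log-sum-exp of the Q-values minus the Q-value of the action actually taken in the dataset. This is the generic term only. T-CQL's Transformer encoder, uncertainty-based adaptive weighting, and consistency regularization are not shown, and the function names are assumptions.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_values: Sequence[float], dataset_action: int) -> float:
    """Generic CQL term for one state: logsumexp_a Q(s, a) - Q(s, a_data).

    The term is non-negative and grows when actions other than the logged
    one have high Q-values, discouraging overestimation on unseen actions.
    """
    return logsumexp(q_values) - q_values[dataset_action]


if __name__ == "__main__":
    # Hypothetical Q-values for three ventilator-setting actions in one state.
    print(cql_penalty([1.0, 0.5, 3.0], dataset_action=0))
```

In training, a weighted version of this term would be added to the usual temporal-difference loss; the weight is where an adaptive, uncertainty-based scheme such as the one the abstract mentions could enter.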

SliceFed: Federated Constrained Multi-Agent DRL for Dynamic Spectrum Slicing in 6G

SliceFed：用于6G动态频谱切片的联邦约束多智能体DRL

Authors: Hossein Mohammadi, Seyed Bagher Hashemi Natanzi, Ramak Nassiri, Jamshid Hassanpour, Bo Tang, Vuk Marojevic
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.11390
Pdf link: https://arxiv.org/pdf/2603.11390
Abstract Dynamic spectrum slicing is a critical enabler for 6G Radio Access Networks (RANs), allowing the coexistence of heterogeneous services. However, optimizing resource allocation in dense, interference-limited deployments remains challenging due to non-stationary channel dynamics, strict Quality-of-Service (QoS) requirements, and the need for data privacy. In this paper, we propose SliceFed, a novel Federated Constrained Multi-Agent Deep Reinforcement Learning (F-MADRL) framework. SliceFed formulates the slicing problem as a Constrained Markov Decision Process (CMDP) where autonomous gNB agents maximize spectral efficiency while explicitly satisfying inter-cell interference budgets and hard ultra-reliable low-latency communication (URLLC) latency deadlines. We employ a Lagrangian primal-dual approach integrated with Proximal Policy Optimization (PPO) to enforce constraints, while Federated Averaging enables collaborative learning without exchanging raw local data. Extensive simulations in a dense multi-cell environment demonstrate that SliceFed converges to a stable, safety-aware policy. Unlike heuristic and unconstrained baselines, SliceFed achieves nearly 100% satisfaction of 1 ms URLLC latency deadlines and exhibits superior robustness to traffic load variations, verifying its potential for reliable and scalable 6G spectrum management.
中文摘要 动态频谱切片是6G无线接入网（RAN）的关键推动力，使异构服务得以共存。然而，由于非固定信道动态、严格的服务质量（QoS）要求以及数据隐私需求，在密集且干扰有限的部署中优化资源分配仍然具有挑战性。本文提出SliceFed，一种新型联合受限多智能体深度强化学习（F-MADRL）框架。SliceFed将切片问题表述为受限马尔可夫决策过程（CMDP），其中自主的gNB代理在满足单元间干扰预算和硬性超可靠低延迟通信（URLLC）延迟截止时间的同时，最大化频谱效率。我们采用拉格朗日原始对偶方法，结合近端策略优化（PPO）来强制约束，而联邦平均则实现了无需交换原始局部数据的协作学习。在密集多单元环境中的大量模拟表明，SliceFed趋向稳定且安全意识强的策略。与启发式和无约束基线不同，SliceFed几乎100%满足1~ms URLLC延迟的满足，并且对流量负载变化表现出卓越的鲁棒性，验证了其在可靠且可扩展的6G频谱管理方面的潜力。
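
Two building blocks named in the SliceFed abstract, a Lagrangian dual update for a constraint budget and Federated Averaging of client parameters, can be sketched independently of PPO. The update rule and weighted mean below are the textbook forms; the function names, learning rate, and flat-list parameter layout are assumptions for illustration rather than the paper's implementation.

```python
from typing import List, Sequence


def dual_ascent_step(lmbda: float, constraint_cost: float,
                     budget: float, lr: float) -> float:
    """Projected dual update: lambda <- max(0, lambda + lr * (cost - budget))."""
    return max(0.0, lmbda + lr * (constraint_cost - budget))


def penalized_reward(reward: float, constraint_cost: float,
                     lmbda: float) -> float:
    """Lagrangian-shaped reward the policy optimizer would maximize."""
    return reward - lmbda * constraint_cost


def federated_average(client_params: Sequence[Sequence[float]],
                      client_weights: Sequence[float]) -> List[float]:
    """Weighted mean of client parameter vectors (e.g. weights = sample counts)."""
    total = sum(client_weights)
    dim = len(client_params[0])
    return [
        sum(w * params[i] for params, w in zip(client_params, client_weights)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Constraint violated (cost above budget): the multiplier grows.
    print(dual_ascent_step(0.0, constraint_cost=2.0, budget=1.0, lr=0.5))
    # Two hypothetical clients with 1 and 3 local samples.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

The multiplier rises while the interference or latency constraint is violated and decays toward zero once it is satisfied, so the penalty on the reward adapts automatically instead of being hand-tuned.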

ARROW: Augmented Replay for RObust World models

ARROW：增强重放，适用于RoBust World模型

Authors: Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.11395
Pdf link: https://arxiv.org/pdf/2603.11395
Abstract Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari) and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
中文摘要 持续强化学习挑战代理在保留既有技能的同时习得新技能，目标是提升过去和未来任务的表现。大多数现有方法依赖无模型方法和重放缓冲区来减轻灾难性遗忘;然而，由于内存需求大，这些解决方案常常面临显著的扩展性挑战。我们从神经科学中汲取灵感，神经科学将经验重放给预测性世界模型，而非直接回放政策，我们提出了ARROW（增强重放，面向强健世界模型），这是一种基于模型的持续强化学习算法，扩展了DreamerV3，增加了内存高效、分布匹配的回放缓冲区。与标准固定大小的先进先出缓冲区不同，ARROW维护两个互补缓冲区：一个用于近期经验的短期缓冲区，以及通过智能采样保持任务多样性的长期缓冲区。我们在两种具有挑战性的持续强化学习环境中评估ARROW：无共享结构任务（Atari）和具有共享结构、可实现知识转移的任务（Procgen CoinRun变体）。与无模型和基于模型、重放缓冲区大小相同的基线相比，ARROW在没有共享结构的情况下显著减少了对任务的遗忘，同时保持了可比的前向传输。我们的发现凸显了基于模型的强化学习和仿生启发方法在持续强化学习中的潜力，值得进一步研究。
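
The two-buffer layout described in the ARROW abstract can be sketched as a FIFO short-term buffer plus a fixed-size long-term buffer. The abstract does not say how the long-term buffer selects what to keep, so reservoir sampling is swapped in here as a stand-in that retains a uniform sample of everything seen so far. The class name, capacities, and sampling scheme are assumptions.

```python
import random
from collections import deque
from typing import Any, List


class DualReplayBuffer:
    """Short-term FIFO buffer plus a reservoir-sampled long-term buffer."""

    def __init__(self, short_capacity: int, long_capacity: int, seed: int = 0):
        self.short = deque(maxlen=short_capacity)  # most recent transitions
        self.long: List[Any] = []                  # uniform sample of history
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, transition: Any) -> None:
        self.short.append(transition)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(transition)
        else:
            # Reservoir sampling: keep the new item with probability
            # long_capacity / seen, replacing a uniformly chosen slot.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = transition

    def sample(self, batch_size: int) -> List[Any]:
        """Draw a batch (with replacement) from both buffers combined."""
        pool = list(self.short) + self.long
        return [self.rng.choice(pool) for _ in range(batch_size)]


if __name__ == "__main__":
    buf = DualReplayBuffer(short_capacity=5, long_capacity=10)
    for step in range(100):
        buf.add(step)
    print(list(buf.short), sorted(buf.long))
```

The memory cost stays fixed at the two capacities no matter how many tasks are seen, while the long-term buffer still holds samples from early tasks that a single FIFO buffer of the same size would have discarded.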

Adversarial Reinforcement Learning for Detecting False Data Injection Attacks in Vehicular Routing

用于检测车辆路由中虚假数据注入攻击的对抗强化学习

Authors: Taha Eghtesad, Yevgeniy Vorobeychik, Aron Laszka
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2603.11433
Pdf link: https://arxiv.org/pdf/2603.11433
Abstract In modern transportation networks, adversaries can manipulate routing algorithms using false data injection attacks, such as simulating heavy traffic with multiple devices running crowdsourced navigation applications, to mislead vehicles toward suboptimal routes and increase congestion. To address these threats, we formulate a strategically zero-sum game between an attacker, who injects such perturbations, and a defender, who detects anomalies based on the observed travel times of network edges. We propose a computational method based on multi-agent reinforcement learning to compute a Nash equilibrium of this game, providing an optimal detection strategy, which ensures that total travel time remains within a worst-case bound, even in the presence of an attack. We present an extensive experimental evaluation that demonstrates the robustness and practical benefits of our approach, providing a powerful framework to improve the resilience of transportation networks against false data injection. In particular, we show that our approach yields approximate equilibrium policies and significantly outperforms baselines for both the attacker and the defender.
中文摘要 在现代交通网络中，对手可以通过虚假数据注入攻击操控路由算法，例如多台设备运行众包导航应用模拟繁重交通，误导车辆走上次优路线，加剧拥堵。为应对这些威胁，我们设计了一个战略性零和博弈：攻击者注入扰动，防御者则基于观察到的网络边缘传输时间检测异常。我们提出一种基于多智能体强化学习的计算方法，用于计算该博弈的纳什均衡，提供最优的检测策略，确保即使在攻击存在的情况下，总旅行时间仍保持在最坏情况下的边界内。我们进行了全面的实验评估，展示了我们方法的鲁棒性和实用优势，为提升交通网络对虚假数据注入的韧性提供了有力框架。特别是，我们证明了我们的方法能够产生近似均衡策略，并且对攻击方和防守方都显著优于基线。

NFPO: Stabilized Policy Optimization of Normalizing Flow for Robotic Policy Learning

NFPO：面向机器人策略学习的归一化流稳定策略优化

Authors: Diyuan Shi, Yiqi Tang, Zifeng Zhuang, Donglin Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11470
Pdf link: https://arxiv.org/pdf/2603.11470
Abstract Deep Reinforcement Learning (DRL) has experienced significant advancements in recent years and has been widely used in many fields. In DRL-based robotic policy learning, however, the current de facto policy parameterization is still a multivariate Gaussian (with a diagonal covariance matrix), which cannot model multi-modal distributions. In this work, we explore the adoption of a modern network architecture, i.e., Normalizing Flow (NF), as the policy parameterization for its multi-modal modeling ability, closed-form log probability, and low computation and memory overhead. However, naively training NF in online Reinforcement Learning (RL) usually leads to training instability. We provide a detailed analysis of this phenomenon and successfully address it via a simple but effective technique. With extensive experiments in multiple simulation environments, we show that our method, NFPO, obtains robust and strong performance in widely used robotic learning tasks and transfers successfully to real-world robots.
中文摘要 深度强化学习（DRL）近年来取得了显著进展，并被广泛应用于多个领域。然而，在基于DRL的机器人策略学习中，当前事实上的策略参数化仍然是多元高斯分布（采用对角协方差矩阵），缺乏对多模态分布建模的能力。在本研究中，我们探讨采用一种现代网络架构，即归一化流（Normalizing Flow，NF）作为策略参数化，因为它具备多模态建模能力、闭式的对数概率以及较低的计算和内存开销。然而，在在线强化学习（RL）中直接训练NF通常会导致训练不稳定。我们对这一现象进行了详细分析，并通过一种简单而有效的技术成功解决了该问题。通过在多种仿真环境中的大量实验，我们表明我们的方法NFPO能够在广泛使用的机器人学习任务中获得稳健且优异的性能，并成功迁移到真实机器人上。
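
The "closed form of log probability" mentioned in this abstract comes from the change-of-variables rule: if an action is produced by pushing base noise z through invertible layers, then log p(a) = log p(z) minus the sum of the layers' log |derivative| terms. The sketch below is a one-dimensional toy with two hand-picked layers (an affine map and a tanh squash). It is not the NFPO architecture, and the function names and default parameters are hypothetical.

```python
import math


def affine(x, scale, shift):
    """Invertible affine layer; returns the output and log|dy/dx|."""
    return scale * x + shift, math.log(abs(scale))


def tanh_squash(x):
    """Squash to (-1, 1); returns the output and log|dy/dx| = log(1 - tanh(x)^2)."""
    y = math.tanh(x)
    return y, math.log(1.0 - y * y)


def flow_action_log_prob(z, scale=0.5, shift=0.2):
    """Map base noise z ~ N(0, 1) to an action and return (action, log p(action))."""
    log_p = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # standard normal log-density
    x, log_det_affine = affine(z, scale, shift)
    a, log_det_tanh = tanh_squash(x)
    # Change of variables: subtract each layer's log-derivative.
    return a, log_p - log_det_affine - log_det_tanh
```

In a policy-gradient setting the returned log-probability would feed the surrogate loss directly, with no sampling-based density estimate needed. The plain `log(1 - y * y)` term is kept for readability; it loses precision when the tanh saturates.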

SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

SVLL：面向物理接地具身任务规划的分阶段视觉语言学习

Authors: Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren, Ge Wang, Mingcong Lei, Shenhao Yan, Jiahao Yang, Chengsi Yao, Xi Li, Yiming Zhao, Yatong Han, Jinke Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11563
Pdf link: https://arxiv.org/pdf/2603.11563
Abstract Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
中文摘要 具身任务规划要求视觉语言模型生成既有视觉依据、又在时间上因果连贯的动作序列。然而，现有训练范式面临一个关键权衡：联合端到端训练常导致过早的时间绑定，而标准强化学习方法则存在优化不稳定的问题。为了弥合这一差距，我们提出了分阶段视觉语言学习（SVLL），这是一个统一的三阶段框架，用于稳健的、物理接地的具身规划。在前两个阶段，SVLL将空间接地与时间推理解耦，先建立稳健的视觉依赖，再引入顺序动作历史。在最后阶段，我们指出标准直接偏好优化（DPO）的一个关键局限：其纯粹的相对性，即仅优化胜负轨迹之间的偏好差距，而忽视对最优路径的绝对似然约束，常常导致不安全或幻觉行为。为此，我们进一步引入了Bias-DPO，这是一种新的对齐目标，通过显式最大化真实标注动作的似然，同时惩罚过度自信的幻觉，注入偏向专家轨迹的归纳偏置。通过将策略锚定在专家流形上并减轻因果错位，由Bias-DPO驱动的SVLL确保严格遵守环境可供性，并有效抑制物理上不可能的捷径。最后，在交互式AI2-THOR基准和真实机器人部署上的大量实验表明，SVLL在任务成功率上优于最先进的开源模型（如Qwen2.5-VL-7B）和闭源模型（如GPT-4o、Gemini-2.0-flash），同时显著减少了物理约束违规。
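
The "purely relative" limitation of DPO described above can be shown numerically. The sketch below implements the standard DPO loss on sequence log-probabilities and then adds a negative log-likelihood term on the preferred (expert) trajectory. That extra term is only an assumed stand-in for the likelihood anchoring the abstract describes; the actual Bias-DPO objective is not given here, and `anchored_dpo_loss` and `lam` are hypothetical names.

```python
import math


def dpo_loss(lp_win, lp_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((lp_win - ref_win) - (lp_lose - ref_lose))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def anchored_dpo_loss(lp_win, lp_lose, ref_win, ref_lose, beta=0.1, lam=1.0):
    """Hypothetical variant: DPO plus an absolute likelihood term on the expert path."""
    return dpo_loss(lp_win, lp_lose, ref_win, ref_lose, beta) + lam * (-lp_win)


# Lowering BOTH policy log-probabilities by 5 nats keeps the gap identical,
# so plain DPO cannot tell the two policies apart.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
drifted = dpo_loss(-6.0, -8.0, -2.0, -2.0)
```

Here `good` and `drifted` are equal even though the second policy assigns far less probability to the expert trajectory, while `anchored_dpo_loss` penalizes the drifted policy more.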

Multi-Agent Reinforcement Learning for UAV-Based Chemical Plume Source Localization

基于无人机的化学羽流源定位的多智能体强化学习

Authors: Zhirun Li, Derek Hollenbeck, Ruikun Wu, Michelle Sherman, Sihua Shao, Xiang Sun, Mostafa Hassanalian
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.11582
Pdf link: https://arxiv.org/pdf/2603.11582
Abstract Undocumented orphaned wells pose significant health and environmental risks to nearby communities by releasing toxic gases and contaminating water sources, with methane emissions being a primary concern. Traditional survey methods such as magnetometry often fail to detect older wells effectively. In contrast, aerial in-situ sensing using unmanned aerial vehicles (UAVs) offers a promising alternative for methane emission detection and source localization. This study presents a robust and efficient framework based on a multi-agent deep reinforcement learning (MARL) algorithm for the chemical plume source localization (CPSL) problem. The proposed approach leverages virtual anchor nodes to coordinate UAV navigation, enabling collaborative sensing of gas concentrations and wind velocities through onboard and shared measurements. Source identification is achieved by analyzing the historical trajectory of anchor node placements within the plume. Comparative evaluations against the fluxotaxis method demonstrate that the MARL framework achieves superior performance in both localization accuracy and operational efficiency.
中文摘要 无证孤井通过释放有毒气体和污染水源，对附近社区构成重大健康和环境风险，甲烷排放是主要关注点。传统的测量方法如磁力测量常常无法有效检测老井。相比之下，利用无人机（UAV）进行空中原位感测，为甲烷排放检测和源头定位提供了有前景的替代方案。本研究基于多智能体深度强化学习（MARL）算法，提出了一个稳健高效的化学羽流源定位（CPSL）问题框架。该方法利用虚拟锚节点协调无人机导航，实现通过机载和共享测量协同感测气体浓度和风速。源头识别通过分析锚节点在羽流中历史轨迹来实现。与通量趋向法的比较评估表明，MARL框架在定位精度和操作效率方面均表现优异。

WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

WeEdit：一个数据集、基准测试和字形引导的文本中心图像编辑框架

Authors: Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.11593
Pdf link: https://arxiv.org/pdf/2603.11593
Abstract Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
中文摘要 基于指令的图像编辑旨在根据用户提供的指示修改现有图像中的具体内容，同时保留非目标区域。除了传统的以对象和风格为中心的操作，以文本为中心的图像编辑还侧重于修改、翻译或重新排列嵌入图像中的文本元素。然而，现有的主流模型常常难以精确执行复杂的文本编辑，常常产生模糊或幻觉的字符。我们主要将这些失败归因于缺乏专门针对文本编辑的训练范式，以及缺乏大规模数据集和标准化基准测试，这些都是闭环训练和评估系统所必需的。为解决这些局限性，我们介绍WeEdit，一种系统化解决方案，包含可扩展的数据构建流程、两个基准测试和量身定制的两阶段训练策略。具体来说，我们提出了一种基于HTML的新型自动编辑流程，生成33万对训练对，涵盖多样化的编辑操作和15种语言，并配有标准化的双语和多语言基准，用于全面评估。在算法方面，我们采用字形引导的监督微调，注入显式的空间和内容先验，随后进行多目标强化学习阶段，使生成过程与指令遵循、文本清晰度和背景保存对齐。大量实验表明，WeEdit在多样化的编辑操作中明显优于以往的开源模型。

Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization

混合能量感知奖励塑形：一种统一的轻量级物理引导策略优化方法

Authors: Qijun Liao (1), Jue Yang (1), Yiting Kang (1), Xinxin Zhao (1), Yong Zhang (2), Mingan Zhao (2) ((1) School of Mechanical Engineering, University of Science and Technology Beijing, China, (2) Jiangsu XCMG Construction Machinery Research Institute Co., Ltd., China)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.11600
Pdf link: https://arxiv.org/pdf/2603.11600
Abstract Deep reinforcement learning excels in continuous control but often requires extensive exploration, while physics-based models demand complete equations and suffer cubic complexity. This study proposes Hybrid Energy-Aware Reward Shaping (H-EARS), unifying potential-based reward shaping with energy-aware action regularization. H-EARS constrains action magnitude while balancing task-specific and energy-based potentials via functional decomposition, achieving linear complexity O(n) by capturing dominant energy components without full dynamics. We establish a theoretical foundation including: (1) functional independence for separate task/energy optimization; (2) energy-based convergence acceleration; (3) convergence guarantees under function approximation; and (4) approximate potential error bounds. Lyapunov stability connections are analyzed as heuristic guides. Experiments across baselines show improved convergence, stability, and energy efficiency. Vehicle simulations validate applicability in safety-critical domains under extreme conditions. Results confirm that integrating lightweight physics priors enhances model-free RL without complete system models, enabling transfer from lab research to industrial applications.
中文摘要 深度强化学习擅长连续控制，但通常需要大量探索，而基于物理的模型则需要完整的方程，且存在立方复杂度。本研究提出了混合能量感知奖励塑形（H-EARS），将基于势函数的奖励塑形与能量感知的动作正则化统一起来。H-EARS在约束动作幅度的同时，通过函数分解平衡任务特定势函数和基于能量的势函数，并通过捕捉主导能量成分而无需完整动力学，实现线性复杂度O(n)。我们建立了理论基础，包括：（1）使任务与能量可分别优化的函数独立性；（2）基于能量的收敛加速；（3）函数近似下的收敛保证；以及（4）近似势函数的误差界。与李雅普诺夫稳定性的联系被作为启发式指导加以分析。针对多个基线的实验显示，收敛性、稳定性和能效均有所提升。车辆仿真验证了该方法在极端条件下安全关键领域的适用性。结果证实，融入轻量级物理先验可以在无需完整系统模型的情况下增强无模型强化学习，从而推动从实验室研究到工业应用的转化。
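
A minimal sketch of the two ingredients named in this abstract: potential-based reward shaping (the classic form gamma * Phi(s') - Phi(s)) and a penalty on action magnitude. The weighted sum of a task potential and an energy potential, the specific potentials, and all weights are illustrative assumptions rather than the H-EARS formulation.

```python
def make_potential(phi_task, phi_energy, w_task=1.0, w_energy=0.1):
    """Hypothetical combined potential: weighted sum of task and energy terms."""
    return lambda s: w_task * phi_task(s) + w_energy * phi_energy(s)


def shaped_reward(r, s, s_next, action, phi, gamma=0.99, lam=0.01):
    """Environment reward plus potential-based shaping minus an action-magnitude penalty."""
    shaping = gamma * phi(s_next) - phi(s)
    action_penalty = lam * sum(a * a for a in action)
    return r + shaping - action_penalty


# Toy 1-D example with state = (position, velocity) and a goal at position 1.0.
phi = make_potential(
    phi_task=lambda s: -abs(s[0] - 1.0),     # closer to the goal means higher potential
    phi_energy=lambda s: -0.5 * s[1] ** 2,   # lower kinetic energy means higher potential
)
r_shaped = shaped_reward(0.0, (0.0, 0.0), (0.5, 1.0), (1.0,), phi, gamma=1.0)
```

The shaping term depends only on the potentials of the two states, so evaluating it costs one potential call per state and never requires the full system dynamics.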

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

简单方法即有效：视觉-语言-动作模型借助强化学习成为天然的持续学习者

Authors: Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11653
Pdf link: https://arxiv.org/pdf/2603.11653
Abstract Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across three models and five challenging lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at this http URL.
中文摘要 面向视觉-语言-动作（VLA）模型的持续强化学习（CRL）是一个有前景的方向，有望实现能够在开放式、不断演变的环境中适应的自我改进具身智能体。然而，持续学习的传统观点认为，朴素的顺序微调（Seq. FT）会导致灾难性遗忘，因此需要复杂的CRL策略。在本研究中，我们退一步，针对大型预训练VLA的CRL，在三个模型和五个具有挑战性的终身强化学习基准上进行了系统研究。我们发现，与既有看法相反，结合低秩适应（LoRA）的简单Seq. FT表现非常强：它实现了高可塑性，几乎没有遗忘，并保持了强大的零样本泛化能力，常常优于更复杂的CRL方法。通过详细分析，我们表明这种鲁棒性源于大型预训练模型、参数高效适应和同策略（on-policy）强化学习之间的协同作用。这些组成部分共同重塑了稳定性与可塑性的权衡，使持续适应既稳定又可扩展。我们的结果表明，顺序微调是VLA持续强化学习的一种强有力的方法，并为大模型时代的终身学习提供了新的见解。代码可在此 http 网址获取。
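
For readers unfamiliar with low-rank adaptation, the sketch below shows the usual LoRA forward pass, y = W x + (alpha / r) * B (A x), in plain Python. The pretrained weight W stays frozen and only the small matrices A (r x d_in) and B (d_out x r) are trained. The tiny matrices and the `alpha` values are illustrative and unrelated to the models studied in the paper.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=16.0):
    """Frozen base projection plus a scaled low-rank update."""
    rank = len(A)                       # A has one row per rank dimension
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]            # frozen 2x2 weight
A = [[1.0, 1.0]]                        # rank-1 down-projection
B_init = [[0.0], [0.0]]                 # B starts at zero, so the update is a no-op
x = [1.0, 2.0]
y_before = lora_forward(W, A, B_init, x)
y_after = lora_forward(W, A, [[0.5], [0.0]], x, alpha=2.0)
```

Because B is initialised to zero, the adapted model starts out identical to the pretrained one. The base weights are never overwritten, which is consistent with the stability the abstract attributes to parameter-efficient adaptation.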

Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models

Resonate：通过大型音频语言模型的在线反馈强化文本转音频生成

Authors: Xiquan Li, Junxi Liu, Wenxi Chen, Haina Zhu, Ziyang Ma, Xie Chen
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2603.11661
Pdf link: https://arxiv.org/pdf/2603.11661
Abstract Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, Resonate, establishes a new SOTA on TTA-Bench in terms of both audio quality and semantic alignment.
中文摘要 强化学习（RL）已成为增强大型语言模型（LLMs）和视觉生成模型的有效范式。然而，其在文本转音频（TTA）生成中的应用在很大程度上仍未得到充分探索。以往的工作通常采用直接偏好优化（DPO）等离线方法，并利用对比语言-音频预训练（CLAP）模型作为奖励函数。本研究探讨将在线的组相对策略优化（GRPO）整合进TTA生成。我们将该算法适配到基于流匹配的音频模型，并证明在线强化学习显著优于离线方法。此外，我们还引入了来自大型音频语言模型（LALMs）的奖励，这些模型可以提供更贴合人类感知的细粒度评分信号。我们的最终模型Resonate仅有4.7亿参数，在TTA-Bench上的音频质量和语义对齐两方面均建立了新的SOTA。
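
The core of GRPO is that several samples are generated for the same prompt and each sample's reward is normalised against its own group, so no learned value function is needed. A minimal sketch of that advantage computation follows. Using the population standard deviation and the `eps` value are implementation choices made here, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four audio clips generated from one text prompt (made-up numbers).
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples scoring above the group mean get positive advantages and are reinforced, and the rest are pushed down. If every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient.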

Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

用于增强多模态LLM-as-a-Judge的多任务强化学习

Authors: Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.11665
Pdf link: https://arxiv.org/pdf/2603.11665
Abstract Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.
中文摘要 多模态大型语言模型（MLLM）因其在各种视觉任务中与人类判断高度一致，已被广泛用作评判模型（MLLM-as-a-Judge）。然而，大多数现有的评判模型只针对单一任务场景进行优化，难以泛化到多样化的情境，而这对可靠评估至关重要。为解决这一局限，我们提出了面向MLLM-as-a-Judge的多任务强化学习（MT-RL-Judge）框架，该框架利用RL的泛化能力，在多个任务上联合优化评判模型。实验结果表明，MT-RL-Judge在判断一致性和与人类偏好的相关性上均优于多个强基线。此外，我们的方法在分布外任务上展现出稳健的泛化能力，进一步验证了其有效性。

STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning

STAIRS-Former：面向离线多任务多智能体强化学习的交错递归结构时空注意力Transformer

Authors: Jiwon Jeon, Myungsik Cho, Youngchul Sung
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.11691
Pdf link: https://arxiv.org/pdf/2603.11691
Abstract Offline multi-agent reinforcement learning (MARL) with multi-task datasets is challenging due to varying numbers of agents across tasks and the need to generalize to unseen scenarios. Prior works employ transformers with observation tokenization and hierarchical skill learning to address these issues. However, they underutilize the transformer attention mechanism for inter-agent coordination and rely on a single history token, which limits their ability to capture long-horizon temporal dependencies in partially observable MARL settings. In this paper, we propose STAIRS-Former, a transformer architecture augmented with spatial and temporal hierarchies that enables effective attention over critical tokens while capturing long interaction histories. We further introduce token dropout to enhance robustness and generalization across varying agent populations. Extensive experiments on diverse multi-agent benchmarks, including SMAC, SMAC-v2, MPE, and MaMuJoCo, with multi-task datasets demonstrate that STAIRS-Former consistently outperforms prior methods and achieves new state-of-the-art performance.
中文摘要 由于不同任务中智能体数量各异，且需要泛化到未见场景，基于多任务数据集的离线多智能体强化学习（MARL）颇具挑战。以往的研究采用带有观测词元化和层级技能学习的Transformer来解决这些问题。然而，它们未能充分利用Transformer的注意力机制进行智能体间协调，且依赖单一历史词元，这限制了它们在部分可观测MARL环境中捕捉长时程时间依赖的能力。本文提出STAIRS-Former，一种通过空间和时间层级增强的Transformer架构，能够有效关注关键词元，同时捕捉较长的交互历史。我们还引入了词元丢弃（token dropout），以增强在不同智能体数量下的鲁棒性和泛化性。在SMAC、SMAC-v2、MPE和MaMuJoCo等多种多智能体基准的多任务数据集上进行的大量实验表明，STAIRS-Former始终优于以往方法，并取得了新的最先进性能。
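
Token dropout, as mentioned in this abstract, can be pictured as randomly removing some observation tokens during training so that the model does not rely on a fixed number of entities. The sketch below is a generic version of that idea. Whether STAIRS-Former protects any token from being dropped is not stated, so the `keep_first` flag (which always keeps the first token, e.g. the agent's own observation) is a hypothetical choice.

```python
import random


def token_dropout(tokens, drop_prob, rng, keep_first=True):
    """Return a random subsequence of tokens; each token is dropped with drop_prob."""
    kept = []
    for index, token in enumerate(tokens):
        if keep_first and index == 0:
            kept.append(token)          # assumption: never drop the agent's own token
        elif rng.random() >= drop_prob:
            kept.append(token)
    return kept


rng = random.Random(0)
observation_tokens = ["self", "ally_1", "ally_2", "enemy_1", "enemy_2"]
training_view = token_dropout(observation_tokens, drop_prob=0.3, rng=rng)
```

At evaluation time `drop_prob` would normally be set to 0, so the model sees every token. Training on randomly shortened token sets is one plausible way to prepare it for tasks with fewer or more agents.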

Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach

在社会多臂老虎机学习中利用非专家与多样化智能体的专长：一种自由能方法

Authors: Erfan Mirzaei, Seyed Pooya Shariatpanahi, Alireza Tavakoli, Reshad Hosseini, Majid Nili Ahmadabadi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.11757
Pdf link: https://arxiv.org/pdf/2603.11757
Abstract Personalized AI-based services involve a population of individual reinforcement learning agents. However, most reinforcement learning algorithms focus on harnessing individual learning and fail to leverage the social learning capabilities commonly exhibited by humans and animals. Social learning integrates individual experience with observing others' behavior, presenting opportunities for improved learning outcomes. In this study, we focus on a social bandit learning scenario where a social agent observes other agents' actions without knowledge of their rewards. The agents independently pursue their own policy without explicit motivation to teach each other. We propose a free energy-based social bandit learning algorithm over the policy space, where the social agent evaluates others' expertise levels without resorting to any oracle or social norms. Accordingly, the social agent integrates its direct experiences in the environment and others' estimated policies. The theoretical convergence of our algorithm to the optimal policy is proven. Empirical evaluations validate the superiority of our social learning method over alternative approaches in various scenarios. Our algorithm strategically identifies the relevant agents, even in the presence of random or suboptimal agents, and skillfully exploits their behavioral information. In addition to societies including expert agents, in the presence of relevant but non-expert agents, our algorithm significantly enhances individual learning performance, where most related methods fail. Importantly, it also maintains logarithmic regret.
中文摘要 基于人工智能的个性化服务涉及一群独立的强化学习智能体。然而，大多数强化学习算法专注于利用个体学习，未能利用人类和动物普遍展现的社会学习能力。社会学习将个体经验与对他人行为的观察相结合，为改善学习效果提供了机会。本研究聚焦于一种社会多臂老虎机（bandit）学习场景：社会智能体可以观察其他智能体的动作，但不知道它们获得的奖励。各智能体独立执行自己的策略，没有相互教授的明确动机。我们提出一种在策略空间上基于自由能的社会多臂老虎机学习算法，其中社会智能体无需借助任何先知（oracle）或社会规范即可评估他人的专业水平。据此，社会智能体将其在环境中的直接经验与对他人策略的估计相结合。我们证明了该算法在理论上收敛到最优策略。实证评估验证了我们的社会学习方法在多种场景下优于其他方法。即使存在随机或次优的智能体，我们的算法也能有策略地识别相关智能体，并巧妙利用其行为信息。除了包含专家智能体的群体之外，在存在相关但非专家的智能体时，我们的算法也能显著提升个体学习性能，而大多数相关方法在这种情况下会失效。重要的是，它还保持了对数级的遗憾（regret）。

Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

通过LLM驱动的程序生成和基于文本的B-Rep图元定位实现高保真CAD生成

Authors: Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou, Xiangdong Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.11831
Pdf link: https://arxiv.org/pdf/2603.11831
Abstract The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, however, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.
中文摘要 计算机辅助设计（CAD）生成领域近年来取得了显著进展。现有方法通常分为两个分类：参数化CAD建模和直接边界表示（B-Rep）综合。在现代基于特征的CAD系统中，参数化建模与B-Rep本质上密不可分，因为高级参数化操作（如修角和倒角）需要明确选择B-Rep几何原语，而B-Rep本身就是从参数化操作推导出来的。因此，这一范式鸿沟仍然是限制复杂工业产品设计中AI驱动CAD建模的关键因素。本文介绍了FutureCAD，一种新型文本转CAD框架，利用大型语言模型（LLMs）和B-Rep接地变换器（BRepGround）实现高精度CAD生成。我们的方法生成可执行的CadQuery脚本，并引入基于文本的查询机制，使LLM能够通过自然语言指定几何选择，然后BRepGround将这些选择与目标原语进行基础化。为了训练我们的框架，我们构建了一个包含真实世界CAD模型的新数据集。对于LLM，我们采用监督微调（SFT）建立基本的CAD生成能力，随后进行强化学习（RL）以提升泛化能力。实验显示，FutureCAD实现了最先进的CAD生成性能。

Hybrid Human-Agent Social Dilemmas in Energy Markets

能源市场中人类与智能体混合的社会困境

Authors: Isuri Perera, Frits de Nijs, Julian Garcia
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2603.11834
Pdf link: https://arxiv.org/pdf/2603.11834
Abstract In hybrid populations where humans delegate strategic decision-making to autonomous agents, understanding when and how cooperative behaviors can emerge remains a key challenge. We study this problem in the context of energy load management: consumer agents schedule their appliance use under demand-dependent pricing. This structure can create a social dilemma where everybody would benefit from coordination, but in equilibrium agents often choose to incur the congestion costs that cooperative turn-taking would avoid. To address the problem of coordination, we introduce artificial agents that use globally observable signals to increase coordination. Using evolutionary dynamics, and reinforcement learning experiments, we show that artificial agents can shift the learning dynamics to favour coordination outcomes. An often neglected problem is partial adoption: what happens when the technology of artificial agents is in the early adoption stages? We analyze mixed populations of adopters and non-adopters, demonstrating that unilateral entry is feasible: adopters are not structurally penalized, and partial adoption can still improve aggregate outcomes. However, in some parameter regimes, non-adopters may benefit disproportionately from the cooperation induced by adopters. This asymmetry, while not precluding beneficial entry, warrants consideration in deployment, and highlights strategic issues around the adoption of AI technology in multiagent settings.
中文摘要 在人类将战略决策委托给自主智能体的混合群体中，理解合作行为何时以及如何出现仍是一个关键挑战。我们在能源负荷管理的背景下研究这个问题：消费者智能体在依赖需求的定价下安排其电器的使用。这种结构可能造成社会困境：所有人都能从协调中受益，但在均衡状态下，智能体往往选择承担本可通过合作轮流使用而避免的拥堵成本。为解决协调问题，我们引入了利用全局可观测信号来增强协调的人工智能体。通过演化动力学和强化学习实验，我们表明人工智能体可以改变学习动态，使其有利于协调结果。一个常被忽视的问题是部分采用：当人工智能体技术处于早期采用阶段时会发生什么？我们分析了采用者和非采用者的混合群体，证明单方面进入是可行的：采用者不会受到结构性惩罚，部分采用仍能改善总体结果。然而，在某些参数区间内，非采用者可能从采用者引发的合作中获得不成比例的收益。这种不对称性虽不妨碍有益的进入，但值得在部署时加以考虑，并凸显了多智能体环境中采用AI技术的战略性问题。

Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Bielik-Minitron-7B：通过结构化剪枝和知识蒸馏压缩波兰语大型语言模型

Authors: Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.11881
Pdf link: https://arxiv.org/pdf/2603.11881
Abstract This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
中文摘要 本报告详细介绍了Bielik-Minitron-7B的创建过程，这是Bielik-11B-v3.0模型的7.35B参数压缩版本，专为欧洲语言优化。通过采用受NVIDIA Minitron方法启发的两阶段压缩方法，我们将结构化混合剪枝与知识蒸馏结合，将模型参数数量从11.04B减少到7.35B，减少了33.4%。我们使用NVIDIA Model Optimizer进行结构修剪，使用NVIDIA NeMo框架进行基于logit的蒸馏以实现质量恢复。提炼后，模型经历了严格的对齐流程，包括监督微调（SFT）、直接偏好优化（DPO-P）和强化学习（GRPO）。我们的最终模型成功恢复了基线模型约90%的性能，同时提供了高达50%的推理加速。这种方法展示了一种高效的路径，可以为代表性较少的语言创建语言模型，既保持原始模型质量，又降低推理部署成本。
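
Two quantities in this abstract are easy to make concrete: the reported 33.4% parameter reduction, and the kind of loss a logit-based distillation stage typically minimises. The sketch below recomputes the reduction from the two parameter counts and implements a generic KL divergence between teacher and student softmax distributions. The KL form and the `temperature` parameter are common choices assumed here, not details confirmed by the report.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Parameter counts in billions, as quoted in the abstract.
reduction = (11.04 - 7.35) / 11.04
reduction_percent = round(reduction * 100, 1)
```

The arithmetic reproduces the quoted figure: 3.69B of 11.04B parameters removed is about 33.4%. The KL term is zero when the student matches the teacher's distribution exactly and grows as the two diverge.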

The price of decentralization in managing engineering systems through multi-agent reinforcement learning

通过多智能体强化学习管理工程系统的去中心化代价

Authors: Prateek Bhustali, Pablo G. Morato, Konstantinos G. Papakonstantinou, Charalampos P. Andriotis
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.11884
Pdf link: https://arxiv.org/pdf/2603.11884
Abstract Inspection and maintenance (I&M) planning involves sequential decision making under uncertainties and incomplete information, and can be modeled as a partially observable Markov decision process (POMDP). While single-agent deep reinforcement learning provides approximate solutions to POMDPs, it does not scale well in multi-component systems. Scalability can be achieved through multi-agent deep reinforcement learning (MADRL), which decentralizes decision-making across multiple agents, locally controlling individual components. However, this decentralization can induce cooperation pathologies that degrade the optimality of the learned policies. To examine these effects in I&M planning, we introduce a set of deteriorating systems in which redundancy is varied systematically. These benchmark environments are designed such that computation of centralized (near-)optimal policies remains tractable, enabling direct comparison of solution methods. We implement and benchmark a broad set of MADRL algorithms spanning fully centralized and decentralized training paradigms, from value-factorization to actor-critic methods. Our results show a clear effect of redundancy on coordination: MADRL algorithms achieve near-optimal performance in series-like settings, whereas increasing redundancy amplifies coordination challenges and can lead to optimality losses. Nonetheless, decentralized agents learn structured policies that consistently outperform optimized heuristic baselines, highlighting both the promise and current limitations of decentralized learning for scalable maintenance planning.
中文摘要 检查与维护（I&M）规划涉及在不确定性和信息不完整的情况下进行顺序决策，可以建模为部分可观测的马尔可夫决策过程（POMDP）。虽然单智能体深度强化学习为POMDP提供了近似解，但在多组件系统中扩展性不佳。可扩展性可以通过多智能体深度强化学习（MADRL）实现，该技术将决策分散到多个智能体之间，局部控制各个组件。然而，这种去中心化可能引发合作病态，从而降低所学政策的最优性。为了研究这些影响，我们引入了一组系统性地变化冗余的退化系统。这些基准环境设计使得计算集中化（近）最优策略保持可操作性，从而实现解决方案方法的直接比较。我们实现并基准测试涵盖完全集中式和去中心化训练范式的广泛MADRL算法，涵盖价值因数到actor-critic方法。我们的结果明确显示冗余对协调有明显影响：MADRL算法在类串数环境中实现近似最优性能，而冗余增加则加剧协调挑战，可能导致最优性损失。尽管如此，去中心化代理学习的结构化策略始终优于优化的启发式基线，凸显了去中心化学习在可扩展维护规划方面的前景与当前局限性。

Learning Visuomotor Policy for Multi-Robot Laser Tag Game

学习多机器人激光对战游戏的视觉运动策略

Authors: Kai Li, Shiyu Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11980
Pdf link: https://arxiv.org/pdf/2603.11980
Abstract In this paper, we study multi-robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches to such tasks face challenges such as limited observability and reliance on depth mapping and inter-robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high-performing teacher policy with multi-agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.
中文摘要 本文研究了多机器人激光枪战，这是一种简化但实用的射击游戏式任务。经典模块化方法在此类任务上面临诸如可观测性有限、依赖深度建图和机器人间通信等挑战。为克服这些问题，我们提出了一种端到端视觉运动策略，将图像直接映射到机器人的动作。我们通过多智能体强化学习训练高性能的教师策略，并将其知识蒸馏到基于视觉的学生策略中。技术设计，包括置换不变特征提取器和深度热图输入，提升了相较标准架构的性能。我们的策略在命中率上比经典方法高出16.7%，在避碰方面高出6%，并且已成功部署于真实机器人。代码将公开发布。
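
The abstract mentions a permutation-invariant feature extractor without describing it. A minimal sketch of the general idea, assuming simple mean and max pooling over per-robot feature vectors (the pooling choice and the function name are illustrative, not the authors' architecture):

```python
def pooled_features(entities):
    """Permutation-invariant summary of a set of per-robot feature vectors.

    entities: list of equal-length feature vectors, one per observed robot.
    Element-wise mean and max do not depend on the order of the inputs.
    """
    dim = len(entities[0])
    mean = [sum(e[i] for e in entities) / len(entities) for i in range(dim)]
    peak = [max(e[i] for e in entities) for i in range(dim)]
    return mean + peak  # concatenate the two pooled vectors

a = pooled_features([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
b = pooled_features([[5.0, 4.0], [1.0, 2.0], [3.0, 0.0]])
```

Here `a` and `b` come from the same robots listed in a different order and yield the same summary, so a policy built on such features would not depend on how teammates or opponents happen to be indexed.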

Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application

深度强化学习的模拟现实适配应用于水下对接应用

Authors: Alaaeddine Chaarani, Narcis Palomeras, Pere Ridao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12020
Pdf link: https://arxiv.org/pdf/2603.12020
Abstract Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the "sim-to-real" gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking using the Girona Autonomous Underwater Vehicle (AUV) by leveraging a high-fidelity digital twin environment. We adapted the Stonefish simulator into a multiprocessing RL framework to significantly accelerate the learning process while incorporating realistic AUV dynamics, collision models, and sensor noise. Using the Proximal Policy Optimization (PPO) algorithm, we developed a 6-DoF control policy trained in a headless environment with randomized starting positions to ensure generalized performance. Our reward structure accounts for distance, orientation, action smoothness, and adaptive collision penalties to facilitate soft docking. Experimental results demonstrate that the agent achieved a success rate of over 90% in simulation. Furthermore, successful validation in a physical test tank confirmed the efficacy of the sim-to-reality adaptation, with the DRL controller exhibiting emergent behaviors such as pitch-based braking and yaw oscillations to assist in mechanical alignment.
中文摘要 深度强化学习（DRL）为自主水下对接提供了传统控制方法之外的稳健替代方案，尤其是在适应不可预测的环境条件方面。然而，弥合“仿真到现实”（sim-to-real）差距以及应对较高的训练延迟，仍然是实际部署中的重要瓶颈。本文借助高保真数字孪生环境，提出了一种使用 Girona 自主水下航行器（AUV）实现自主对接的系统方法。我们将 Stonefish 仿真器改造为多进程强化学习框架，在显著加快学习过程的同时，纳入了逼真的 AUV 动力学、碰撞模型和传感器噪声。利用近端策略优化（PPO）算法，我们开发了一套六自由度（6-DoF）控制策略，该策略在无界面（headless）环境中训练，并随机化起始位置以确保泛化性能。我们的奖励结构考虑了距离、姿态、动作平滑性和自适应碰撞惩罚，以促进柔性对接。实验结果表明，智能体在仿真中的成功率超过90%。此外，在实体测试水池中的成功验证证实了仿真到现实适配的有效性，DRL 控制器表现出基于俯仰的制动和偏航振荡等涌现行为，以辅助机械对准。
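
The reward terms named in the abstract (distance, orientation, action smoothness, adaptive collision penalty) could be combined along the following lines. This is only a sketch: the weights, the linear functional forms, and the reading of "adaptive" as "scaled by impact speed" are all assumptions, not values from the paper.

```python
import math

def docking_reward(dist, heading_err, action, prev_action, collided, impact_speed,
                   w_dist=1.0, w_head=0.5, w_smooth=0.1, w_coll=5.0):
    """Shaped reward with the four terms named in the abstract.

    All weights and functional forms here are hypothetical.
    """
    r_dist = -w_dist * dist                  # closer to the dock is better
    r_head = -w_head * abs(heading_err)      # radians off the docking axis
    change = math.sqrt(sum((a - b) ** 2 for a, b in zip(action, prev_action)))
    r_smooth = -w_smooth * change            # penalise abrupt command changes
    # "Adaptive" is read here as: the penalty grows with impact speed.
    r_coll = -w_coll * impact_speed if collided else 0.0
    return r_dist + r_head + r_smooth + r_coll
```

Under these assumptions a slow contact with the dock costs little while a fast one is penalised heavily, which is one way a reward could encourage the soft docking described above.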

AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

AGMARL-DKS：一种面向动态Kubernetes调度的自适应图增强多智能体强化学习

Authors: Hamed Hamzeh
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.12031
Pdf link: https://arxiv.org/pdf/2603.12031
Abstract State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.
中文摘要 最先进的云原生应用需要智能调度器，能够有效平衡系统稳定性、资源利用率及相关成本。虽然Kubernetes默认提供基于可行性的放置，但近期研究已探索利用强化学习（RL）实现更智能的调度决策。然而，当前基于强化学习的调度器有三个主要限制。首先，大多数调度器使用单体集中化代理，对于大型异构集群来说不可扩展。其次，使用多目标奖励函数的方案假设目标的简单、静态、线性组合。第三，以往没有任何工作能够产生能够自适应动态条件的压力感知调度器。为弥补当前研究中的这些空白，我们提出了自适应图增强型多智能体强化学习动态Kubernetes调度器（AGMARL-DKS）。AGMARL-DKS通过引入三项主要创新来弥补这些空白。首先，我们构建了一个可扩展的解决方案，将调度挑战视为协作多代理问题，每个集群节点作为代理运行，采用集中训练方法后再进行去中心化执行。其次，为了既具上下文感知又实现去中心化，我们使用图神经网络（GNN）在每个代理处构建全局簇上下文的状态表示。这相比仅依赖局部观测的方法有了改进。最后，为了在这些目标之间做出权衡，我们采用应力感知的词典序排序策略，而非简单静态的线性加权。Google Kubernetes Engine（GKE）的评估显示，AGMARL-DKS在容错性、利用率和成本方面显著优于默认调度器，尤其是在批处理和关键任务工作负载调度方面。
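
The contrast between a static linear weighting and a stress-aware lexicographic ordering can be illustrated with tuple comparison. The objective names (`risk`, `cost`, `util`), the two orderings, and the node names below are hypothetical; the abstract does not give the actual ordering used by AGMARL-DKS.

```python
def pick_node(candidates, stressed):
    """Choose a node by lexicographic comparison of objective tuples.

    candidates: dict mapping node name -> dict with 'risk', 'cost', 'util'.
    stressed: when True, fault risk is ranked first; otherwise cost is.
    The objectives and both orderings are illustrative assumptions.
    """
    if stressed:
        key = lambda n: (candidates[n]["risk"], candidates[n]["cost"], -candidates[n]["util"])
    else:
        key = lambda n: (candidates[n]["cost"], candidates[n]["risk"], -candidates[n]["util"])
    # Tuples compare element by element, so a later objective only breaks ties.
    return min(candidates, key=key)

nodes = {
    "node-a": {"risk": 0.30, "cost": 1.0, "util": 0.6},
    "node-b": {"risk": 0.05, "cost": 2.0, "util": 0.4},
}
```

With these numbers the cheaper `node-a` wins under normal conditions and the safer `node-b` wins under stress. No single fixed set of linear weights can reproduce both choices, since a fixed weighting always returns the same node for the same inputs.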

Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

通过Bellman一致性和混合批评者的跨域策略优化

Authors: Ming-Hong Chen, Kuan-Chen Pan, You-De Huang, Xi Liu, Ping-Chun Hsieh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.12087
Pdf link: https://arxiv.org/pdf/2603.12087
Abstract Cross-domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross-domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state space or action space, and this makes direct transfer infeasible and thereby requires more sophisticated inter-domain mappings; (ii) The transferability of a source-domain model in RL is not easily identifiable a priori, and hence CDRL can be prone to negative effect during transfer. In this paper, we propose to jointly tackle these two challenges through the lens of \textit{cross-domain Bellman consistency} and \textit{hybrid critic}. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure transferability of a source-domain model. Then, we propose $Q$Avatar, which combines the Q functions from both the source and target domains with an adaptive hyperparameter-free weight function. Through this design, we characterize the convergence behavior of $Q$Avatar and show that $Q$Avatar achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at this https URL.
中文摘要 跨域强化学习（CDRL）旨在通过利用源域收集的数据样本，促进类似目标领域的学习，从而提高强化学习的数据效率。尽管具有潜力，RL中的跨域传输已知存在两个根本且相互交织的挑战：（i）源域和目标域可能具有不同的状态空间或作用空间，这使得直接传输不可行，因此需要更复杂的域间映射;（ii）源域模型在强化学习中的可转移性不易事先识别，因此CDRL在迁移过程中可能产生负面影响。本文拟通过 \textit{跨域 Bellman 一致性}和 \textit{混合批评}的视角，共同应对这两个挑战。具体来说，我们首先引入了跨域贝尔曼一致性的概念，作为衡量源域模型可转移性的一种方式。然后，我们提出了$Q$Avatar，它结合了源域和目标域的Q函数，并结合了一个自适应的超参数无权重函数。通过该设计，我们描述了$Q$Avatar的收敛行为，并展示了$Q$Avatar实现了可靠的传输，即有效利用源域Q函数将知识传递到目标域。通过实验，我们证明$Q$Avatar在包括移动和机械臂操作在内的多种强化学习基准任务中实现了良好的迁移性。我们的代码可在此 https URL 访问。
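
As a rough illustration of a hybrid critic, the sketch below blends a source-domain and a target-domain Q estimate with a weight derived from their Bellman errors on target data. The inverse-error weighting is an assumption chosen for simplicity; the abstract only states that the weight function is adaptive and hyperparameter-free, not what it is.

```python
def hybrid_q(q_src, q_tgt, err_src, err_tgt):
    """Blend two Q estimates, trusting the one with the smaller Bellman error.

    q_src, q_tgt: Q-value estimates for the same target-domain (state, action).
    err_src, err_tgt: non-negative Bellman errors measured on target data.
    The weighting rule is an assumption, not the paper's formula.
    """
    total = err_src + err_tgt
    # Low source error relative to target error gives the source a high weight.
    w_src = 0.5 if total == 0 else err_tgt / total
    return w_src * q_src + (1.0 - w_src) * q_tgt
```

Under this rule a source critic that is badly inconsistent with target-domain transitions receives a weight near zero, which is the kind of safeguard against negative transfer that the abstract describes.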

A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

一个稳健高效的多智能体强化学习框架用于交通信号控制

Authors: Sheng-You Huang, Hsiao-Chuan Chang, Yen-Chi Chen, Ting-Han Wei, I-Hau Yeh, Sheng-Yao Kuan, Chien-Yao Wang, Hsuan-Han Lee, I-Chen Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12096
Pdf link: https://arxiv.org/pdf/2603.12096
Abstract Reinforcement Learning (RL) in Traffic Signal Control (TSC) faces significant hurdles in real-world deployment due to limited generalization to dynamic traffic flow variations. Existing approaches often overfit static patterns and use action spaces incompatible with driver expectations. This paper proposes a robust Multi-Agent Reinforcement Learning (MARL) framework validated in the Vissim traffic simulator. The framework integrates three mechanisms: (1) Turning Ratio Randomization, a training strategy that exposes agents to dynamic turning probabilities to enhance robustness against unseen scenarios; (2) a stability-oriented Exponential Phase Duration Adjustment action space, which balances responsiveness and precision through cyclical, exponential phase adjustments; and (3) a Neighbor-Based Observation scheme utilizing the MAPPO algorithm with Centralized Training with Decentralized Execution (CTDE). By leveraging centralized updates, this approach approximates the efficacy of global observations while maintaining scalable local communication. Experimental results demonstrate that our framework outperforms standard RL baselines, reducing average waiting time by over 10%. The proposed model exhibits superior generalization in unseen traffic scenarios and maintains high control stability, offering a practical solution for adaptive signal control.
中文摘要 交通信号控制（TSC）中的强化学习（RL）在实际应用中面临重大挑战，因为其对动态交通流变化的泛化有限。现有方法常常过拟合静态模式，并使用与驱动期望不兼容的动作空间。本文提出了一个稳健的多智能体强化学习（MARL）框架，并在Vissim交通模拟器中得到了验证。该框架整合了三种机制：（1）转向比率随机化，这是一种训练策略，使智能体接触动态转向概率以增强对未知场景的鲁棒性;（2）以稳定性为导向的指数相位持续时间调整作用空间，通过周期性、指数级的相位调整来平衡响应性和精确性;以及（3）基于邻居的观察方案，采用MAPPO算法，采用集中式训练与去中心化执行（CTDE）。通过利用集中式更新，这种方法在保持可扩展的本地通信的同时，近似全球观测的有效性。实验结果显示，我们的框架优于标准强化学习基线，平均等待时间减少了10%以上。该模型在未见交通场景中展现出更优的泛化性，并保持高度控制稳定性，为自适应信号控制提供了实用解决方案。
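
One possible reading of an exponential phase-duration action space is that each action adds or removes a power-of-two number of seconds from the current phase while the phase order stays fixed. The step sizes and the green-time bounds below are hypothetical, chosen only to show the shape of such an action space.

```python
DELTAS = [0, 1, 2, 4, 8, -1, -2, -4, -8]  # seconds; hypothetical step sizes

def adjust_phase(duration, action, min_green=5, max_green=60):
    """Apply one exponential adjustment to a phase's green time.

    The phase order stays fixed (cyclical); only the duration changes.
    Step sizes and green-time bounds are illustrative assumptions.
    """
    new_duration = duration + DELTAS[action]
    return max(min_green, min(max_green, new_duration))
```

Small steps allow fine tuning and large steps allow a quick reaction to a surge, while the clipping keeps every phase within a range drivers would expect.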

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

关于强化学习中对大型语言模型代理主动推理的信息自锁

Authors: Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12109
Pdf link: https://arxiv.org/pdf/2603.12109
Abstract Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.
中文摘要 基于结果的奖励强化学习（RL）在训练大型语言模型（LLM）智能体处理复杂推理任务方面取得了显著成功。然而，在主动推理中，代理需要战略性提问以获取与任务相关的信息，我们发现接受强化学习训练的大型语言模型代理常常存在信息自锁问题：代理停止提出有信息性的问题，难以内化已获得的信息。为了理解这一现象，我们将主动推理分解为两个核心能力：行动选择（AS），通过查询确定观察流;以及信念追踪（BT），根据收集到的证据更新智能体的信念。我们表明，AS和BT能力的不足将限制强化学习训练中的信息探索。此外，探索不足反过来阻碍AS和BT的改进，形成反馈循环，使智能体被锁定在低信息状态。为解决此问题，我们提出了一种简单但有效的方法，通过注入易于获得的方向性批判来重新分配学习信号，帮助智能体逃脱自锁。对7个数据集的广泛实验表明，我们的方法显著减轻了信息自锁问题，提升了高达60%。

Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives

驯服对手：通过分数目标实现稳定的极小极大深度确定性策略梯度

Authors: Taeho Lee, Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12110
Pdf link: https://arxiv.org/pdf/2603.12110
Abstract Reinforcement learning (RL) has achieved remarkable success in a wide range of control and decision-making tasks. However, RL agents often exhibit unstable or degraded performance when deployed in environments subject to unexpected external disturbances and model uncertainties. Consequently, ensuring reliable performance under such conditions remains a critical challenge. In this paper, we propose minimax deep deterministic policy gradient (MMDDPG), a framework for learning disturbance-resilient policies in continuous control tasks. The training process is formulated as a minimax optimization problem between a user policy and an adversarial disturbance policy. In this problem, the user learns a robust policy that minimizes the objective function, while the adversary generates disturbances that maximize it. To stabilize this interaction, we introduce a fractional objective that balances task performance and disturbance magnitude. This objective prevents excessively aggressive disturbances and promotes robust learning. Experimental evaluations in MuJoCo environments demonstrate that the proposed MMDDPG achieves significantly improved robustness against both external force perturbations and model parameter variations.
中文摘要 强化学习（RL）在广泛的控制和决策任务中取得了显著成功。然而，强化学习代理在面对意外外部干扰和模型不确定性的环境中部署时，常表现出不稳定或性能下降。因此，在这种条件下确保可靠性仍是关键挑战。本文提出极小极大深度确定性策略梯度（MMDDPG），作为持续控制任务中学习干扰韧性策略的框架。训练过程被表述为用户策略与对抗干扰策略之间的极小极大优化问题。在此问题中，用户学习一个稳健策略，使目标函数最小化，而攻击者则产生扰动最大化目标函数。为了稳定这种相互作用，我们引入了一个分数目标，平衡任务表现和干扰强度。这一目标防止了过度攻击性的干扰，促进了扎实的学习。MuJoCo环境下的实验评估表明，所提出的MMDDPG在对外部力扰动和模型参数变化方面表现出显著提升的鲁棒性。

Increasing intelligence in AI agents can worsen collective outcomes

提升AI智能体的智能可能恶化集体结果

Authors: Neil F. Johnson
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); General Economics (econ.GN); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2603.12129
Pdf link: https://arxiv.org/pdf/2603.12129
Abstract When resources are scarce, will a population of AI agents coordinate in harmony, or descend into tribal chaos? Diverse decision-making AI from different developers is entering everyday devices -- from phones and medical devices to battlefield drones and cars -- and these AI agents typically compete for finite shared resources such as charging slots, relay bandwidth, and traffic priority. Yet their collective dynamics and hence risks to users and society are poorly understood. Here we study AI-agent populations as the first system of real agents in which four key variables governing collective behaviour can be independently toggled: nature (innate LLM diversity), nurture (individual reinforcement learning), culture (emergent tribe formation), and resource scarcity. We show empirically and mathematically that when resources are scarce, AI model diversity and reinforcement learning increase dangerous system overload, though tribe formation lessens this risk. Meanwhile, some individuals profit handsomely. When resources are abundant, the same ingredients drive overload to near zero, though tribe formation makes the overload slightly worse. The crossover is arithmetical: it is where opposing tribes that form spontaneously first fit inside the available capacity. More sophisticated AI-agent populations are not better: whether their sophistication helps or harms depends entirely on a single number -- the capacity-to-population ratio -- that is knowable before any AI-agent ships.
中文摘要 当资源稀缺时，AI代理群体会协调一致，还是陷入部落混乱？来自不同开发者的多样化决策人工智能正进入日常设备——从手机、医疗设备到战场无人机和汽车——这些AI代理通常争夺有限的共享资源，如充电时段、中继带宽和流量优先级。然而，它们的集体动态以及对用户和社会的风险却被理解得很有限。本文我们研究人工智能代理群体，作为首个能够独立切换四个关键变量的真实代理系统：天赋（先天的LLM多样性）、教养（个体强化学习）、文化（新兴部落形成）和资源稀缺性。我们通过实证和数学方法表明，当资源稀缺时，AI模型多样性和强化学习会增加危险的系统过载，尽管部落形成会降低这一风险。与此同时，有些人却从中获利丰厚。当资源充足时，同样的成分会使过载接近零，尽管部落形成会稍微加重过载。交叉是算术性的：即自发形成的对立部落首次在可用容量内的位置。更复杂的人工智能代理群体并不更好：它们的复杂性是有益还是有害，完全取决于一个数字——容量与人口比——这是在任何AI代理舰船出现前即可知晓的。

Automatic Generation of High-Performance RL Environments

高性能强化学习环境的自动生成

Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.12145
Pdf link: https://arxiv.org/pdf/2603.12145
Abstract Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.
中文摘要 将复杂的强化学习（RL）环境转换为高性能实现，传统上需要数月的专门工程工作。我们提出了一套可复用的方案（通用提示模板、分层验证和智能体辅助的迭代修复），能够以低于10美元的计算成本生成语义等价的高性能环境。我们在五个环境中展示了三种不同的工作流程。直接转译（此前不存在高性能实现）：EmuRust（借助 Rust 并行为 Game Boy 模拟器带来1.5倍的 PPO 加速）和 PokeJAX，首个 GPU 并行的宝可梦对战模拟器（随机动作 500M SPS，PPO 15.2M SPS；相比 TypeScript 参考实现提升22,320倍）。对照现有高性能实现验证的转译：吞吐量与 MJX 持平（1.04倍），在相同 GPU 批大小下为 Brax 的5倍（HalfCheetah JAX）；PPO 提速42倍（Puffer Pong）。新环境创建：TCGJax，首个可部署的 JAX 宝可梦 TCG 引擎（随机动作 717K SPS，PPO 153K SPS；相比 Python 参考实现提升6.6倍），由从网页提取的规范合成而来。在2亿参数规模下，环境开销降至训练时间的4%以下。分层验证（属性测试、交互测试和 rollout 测试）确认了全部五个环境的语义等价性；跨后端策略迁移确认全部五个环境的 sim-to-sim 差距为零。TCGJax 由一个未出现在公共仓库中的私有参考实现合成而来，可作为针对智能体预训练数据污染问题的对照。论文包含足够的细节（包括代表性提示、验证方法和完整结果），编码智能体可以直接依据论文复现这些转译。
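
The rollout level of such a verification hierarchy can be pictured as driving a reference environment and its translation with identical actions and comparing every output. The sketch below assumes a tiny `reset()` / `step(action)` interface and two toy environments invented for the example; none of this is the paper's actual test harness or the API of a specific RL library.

```python
import random

def rollouts_match(make_ref, make_new, n_steps=100, seed=0):
    """Drive two environments with the same random actions and compare outputs.

    make_ref / make_new: zero-argument factories returning objects with
    reset() -> obs and step(action) -> (obs, reward, done). This interface
    is a stand-in, not the API of any particular RL library.
    """
    rng = random.Random(seed)
    ref, new = make_ref(), make_new()
    if ref.reset() != new.reset():
        return False
    for _ in range(n_steps):
        action = rng.choice([0, 1])
        out_ref, out_new = ref.step(action), new.step(action)
        if out_ref != out_new:
            return False
        if out_ref[2]:  # outputs matched, so both episodes ended; restart both
            if ref.reset() != new.reset():
                return False
    return True

class CounterEnv:
    """Toy reference environment: the state adds up actions and ends at 5."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += action
        return self.s, float(action), self.s >= 5

class OffByOneEnv(CounterEnv):
    """Deliberately wrong translation: every observation is shifted by one."""
    def step(self, action):
        obs, reward, done = super().step(action)
        return obs + 1, reward, done
```

Calling `rollouts_match(CounterEnv, OffByOneEnv)` fails on the first step, while comparing `CounterEnv` with itself passes. A real harness would also need tolerances for floating-point observations.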

Linking Perception, Confidence and Accuracy in MLLMs

连接多模态大语言模型（MLLM）中的感知、置信度与准确性

Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.12149
Pdf link: https://arxiv.org/pdf/2603.12149
Abstract Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
中文摘要 近年来，多模态大型语言模型（MLLM）的进展主要集中在增强视觉感知以提高准确性。然而，一个关键问题仍未被探讨：模型是否知道自己不知道？通过探测实验，我们揭示了MLLM中严重的置信度校准错误问题。为此，我们提出了信心驱动强化学习（CDRL），该方法利用原始噪声图像对和一种新颖的基于置信度的奖励，以增强感知敏感度并稳健校准模型的置信度。除了训练带来的好处外，校准自信还能让测试时间的扩展更为高效，就像免费的午餐一样。我们进一步提出了信心感知测试时间尺度（CA-TTS），它动态协调自洽、自我反思和视觉自我检查模块，并以信心信号为指导。专家模型承担多重角色（如规划者、批评者、投票者），安排这些模块并提供外部验证。我们的综合框架建立了新的最先进成果，在四个基准中持续增长8.8%。更多的消融研究证明了每个模块的有效性和扩展优势。
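
Confidence miscalibration is commonly quantified with the expected calibration error (ECE). The abstract does not say which metric the authors use, so the sketch below is a standard equal-width-bin ECE offered only to make the notion concrete.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: matching list of booleans (True when the answer was right).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(members) / len(confidences) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only one right has an ECE of about 0.65 under this definition; a model that is always certain and always right scores 0.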

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

IsoCompute手册：LLM RL的优化扩展采样计算

Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12151
Pdf link: https://arxiv.org/pdf/2603.12151
Abstract While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.
中文摘要 虽然缩放律指导LLM预训练的计算分配，但大型语言模型（LLMs）后训练强化学习（RL）的类似处方仍不充分。我们研究了LLM中策略强化学习方法中采样计算的计算最优分配，将扩展性框架为对三种资源的计算受限优化：每个问题的并行展开次数、每批问题的问题数量和更新步骤数。我们发现，每个问题的并行展开次数可预测地随着计算预算增加而增加，随后趋于饱和。这一趋势在简单问题和困难问题中都存在，但驱动机制不同：简单问题的解力锐化，困难问题的覆盖范围扩展。我们还进一步表明，增加并行部署数量可以减轻问题间的干扰，而每批次的问题数量主要影响训练稳定性，并且可以在较宽范围内选择。经过基础模型和数据分布验证，我们的结果将强化学习的尺度律重新定义为规范性分配规则，并为高效计算的LLM RL在训练后提供实用指导。
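
The three resources in the abstract (parallel rollouts per problem, problems per batch, update steps) trade off against each other under a fixed sampling budget. A minimal sketch of enumerating such iso-compute allocations, assuming the total cost is simply the product of the three and that every rollout costs the same:

```python
def iso_compute_allocations(budget, rollout_options, batch_options):
    """List (rollouts n, problems per batch B, update steps M) under one budget.

    Sampling cost is assumed to be n * B * M rollouts in total; treating every
    rollout as equally expensive is a simplification made for this sketch.
    """
    plans = []
    for n in rollout_options:
        for b in batch_options:
            steps = budget // (n * b)  # the remaining budget buys update steps
            if steps >= 1:
                plans.append((n, b, steps))
    return plans

plans = iso_compute_allocations(budget=65536, rollout_options=[8, 64, 512],
                                batch_options=[32, 256])
```

Each tuple in `plans` spends the same budget differently, for example many update steps with few rollouts, or few steps with many rollouts. The paper's question is which of these allocations gives the best final performance as the budget grows.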

LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

LatentGeo：多模几何推理的潜在空间中可学习辅助构造

Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.12166
Pdf link: https://arxiv.org/pdf/2603.12166
Abstract Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
中文摘要 尽管多模态推理取得了近期进展，表示辅助几何构造仍然是多模态大型语言模型（MLLM）面临的根本挑战。这些构造在原始图中缺失，必须在定理应用前引入。现有方法主要依赖显式构造范式，包括基于文本的几何规范、推理时的可视化令牌交错以及工具增强的几何执行。然而，这些方法要么无法忠实地表示复杂的空间关系，要么导致离散符号与连续几何结构之间的表示不匹配，或者依赖外部能力阻碍端到端优化。为解决这些限制，我们提出了 LatentGeo 框架，该框架通过学习连续潜在视觉表示，内化辅助几何构造，无需像素级渲染或外部执行器。我们设计了一套三阶段课程，通过辅助视觉监督逐步对齐和内化这些潜在表征，随后是LaGDPO，一种潜在意识强化学习过程，在策略优化过程中稳定潜在表征，同时提升终端任务的正确性。为了系统评估以构造为中心的表示质量，我们引入了GeoAux，一个针对视觉依赖几何问题的新基准测试，并在GeoAux和MathVerse上进行了实验。结果显示，LatentGeo在几何推理任务上取得了显著提升，尤其是在需要辅助构造的任务中。广泛的分析和消融研究进一步验证了我们框架中每个组成部分的有效性。

Integrated Online Monitoring and Adaption of Process Model Predictive Controllers

集成在线监控与流程模型预测控制器的适配

Authors: Samuel Mallick, Laura Boca de Giuli, Alessio La Bella, Azita Dabiri, Bart De Schutter, Riccardo Scattolini
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.12187
Pdf link: https://arxiv.org/pdf/2603.12187
Abstract This paper addresses the design of an event-triggered, data-based, and performance-oriented adaption method for model predictive control (MPC). The performance of such a strategy strongly depends on the accuracy of the prediction model, which may require online adaption to prevent performance degradation under changing operating conditions. Unlike existing methods that continuously update model and control parameters from data, potentially leading to catastrophic forgetting and unnecessary control modifications, we propose a novel approach based on statistical monitoring of closed-loop performance indicators. This framework enables the detection of performance degradation, and, when required, controller adaption is performed via reinforcement learning and identification techniques. The proposed strategy is validated on a high-fidelity simulation of a district heating system benchmark.
中文摘要 本文探讨了一种基于事件、基于数据且以性能为导向的模型预测控制（MPC）适应方法的设计。这种策略的性能高度依赖于预测模型的准确性，而预测模型可能需要在线调整以防止在变化的操作条件下性能下降。与现有不断更新模型和控制参数的方法不同，这些方法可能导致灾难性的遗忘和不必要的控制修改，我们提出了一种基于闭环绩效指标统计监测的新方法。该框架能够检测性能下降，并在需要时通过强化学习和识别技术进行控制器适配。该策略通过高保真模拟对区域供热系统基准进行了验证。
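
Event-triggered adaption based on statistical monitoring can be pictured as a control-chart test on a closed-loop performance indicator. The k-sigma rule on a window mean used below is an assumed, deliberately simple test; the abstract does not state which statistic the authors monitor.

```python
from statistics import mean, stdev

def needs_adaption(baseline, recent, k=3.0):
    """Flag degradation when the recent mean cost leaves the baseline band.

    baseline: closed-loop cost samples collected while performance was good.
    recent: the latest window of the same indicator.
    The k-sigma rule on the window mean is an assumption for this sketch.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    limit = mu + k * sigma / len(recent) ** 0.5  # standard error of a window mean
    return mean(recent) > limit
```

The controller is left untouched while the indicator stays inside the band, and adaption is triggered only when the test fires. This avoids the continuous parameter updates that the abstract associates with catastrophic forgetting.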

HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies

HandelBot：通过灵巧机器人策略的快速适应实现真实世界钢琴演奏

Authors: Amber Xie, Haozhi Qi, Dorsa Sadigh
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.12243
Pdf link: https://arxiv.org/pdf/2603.12243
Abstract Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8x and requires only 30 minutes of physical interaction data.
中文摘要 几十年来，掌握多指灵巧的操作一直是机器人领域的一项重大挑战。尽管具有潜力，收集高质量数据的难度仍是高精度任务的主要瓶颈。虽然强化学习和模拟到现实世界的转移提供了有前景的替代方案，但转移策略常常在要求毫米级精度的任务中失败，比如双手钢琴演奏。在本研究中，我们介绍了HandelBot，这是一个结合了仿真策略和通过两阶段流水线快速适应的框架。从模拟训练的策略出发，我们首先应用结构化的细化阶段，通过根据物理展开调整侧指关节来纠正空间对齐。接下来，我们利用残余强化学习自主学习细粒度的纠正措施。通过对五首已知歌曲进行的大量硬件实验，我们证明了HandelBot能够成功完成精确的双手钢琴演奏。我们的系统性能是直接仿真部署的1.8倍，且只需30分钟的物理交互数据。
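
Residual reinforcement learning, as used in the second stage, generally means adding a small learned correction to the output of a fixed base policy. A minimal sketch of that composition, with a hypothetical scale factor and action bounds:

```python
def residual_action(base_action, residual, scale=0.1, low=-1.0, high=1.0):
    """Add a small learned correction to a simulation-trained action.

    base_action: joint targets from the frozen simulation policy.
    residual: output of the residual policy, assumed to lie in [-1, 1].
    scale bounds how far the correction can move each joint (hypothetical).
    """
    combined = [b + scale * r for b, r in zip(base_action, residual)]
    return [max(low, min(high, a)) for a in combined]  # keep within joint limits
```

With a zero residual the robot reproduces the simulation policy exactly, so learning starts from the transferred behaviour and only has to discover small corrections. That is consistent with the short real-world interaction time the abstract reports.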

Separable neural architectures as a primitive for unified predictive and generative intelligence

可分离神经架构作为统一预测与生成智能的原始工具

Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.12244
Pdf link: https://arxiv.org/pdf/2603.12244
Abstract Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.
中文摘要 物理学、语言和感知领域的智能系统通常表现出可因式分解的结构，但通常由未明确利用该结构的单体神经架构建模。可分离神经结构（SNA）通过形式化一个表示类来解决这个问题，该类统一了加法、二次和张量分解的神经模型。通过约束相互作用阶和张量秩，SNA施加结构性归纳偏置，将高维映射分解为低元分量。可分离性不一定是系统本身的属性：它通常体现在系统表达的坐标或表示中。关键的是，这种坐标感知表述揭示了混沌时空动力学与语言自归之间的结构类比。通过将连续物理态视为平滑、可分离的嵌入，SNA使混沌系统的分布建模成为可能。该方法减轻了确定性算符的非物理漂移特性，同时仍适用于离散序列。该方法的组合多样性在四个领域得到了展示：通过强化学习实现自主航点导航、多功能微结构的逆生成、湍流的分布建模以及神经语言建模。这些结果确立了可分离神经结构作为一种领域无关的预测性和生成智能原语，能够统一确定性和分布性表征。
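
One way to picture a separable model is as a sum of products of one-dimensional functions, f(x) = Σ_r Π_d g_{r,d}(x_d), where the number of terms plays the role of tensor rank. This CP-style form is an interpretation of "tensor-decomposed" rather than the paper's definition, and the univariate functions below are hand-written placeholders for small learned networks.

```python
def separable_model(x, factors):
    """Evaluate f(x) = sum_r prod_d g[r][d](x[d]) for univariate functions g.

    factors: list over rank terms; each term holds one callable per input
    coordinate. This form is one reading of "tensor-decomposed".
    """
    total = 0.0
    for term in factors:
        prod = 1.0
        for g, xd in zip(term, x):
            prod *= g(xd)
        total += prod
    return total

one = lambda v: 1.0
ident = lambda v: v
# Rank 2 with constant factors gives an additive model: f(x) = x0 + x1.
additive = [[ident, one], [one, ident]]
# Rank 1 with both factors active gives a pure interaction: f(x) = x0 * x1.
interaction = [[ident, ident]]
```

The same evaluation routine covers both the additive and the interacting case. This illustrates how constraining rank and interaction order could yield a family of models with a shared structure.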

Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

信任你的批评者：强有力的奖励建模与强化学习，实现忠实的图像编辑与生成

Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.12247
Pdf link: https://arxiv.org/pdf/2603.12247
Abstract Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at this https URL.
中文摘要 强化学习（RL）已成为增强图像编辑和文本到图像（T2I）生成的有前景范式。然而，当前在强化学习中充当评判器（critic）的奖励模型常常出现幻觉并给出带噪声的评分，从根本上误导了优化过程。本文提出FIRM（Faithful Image Reward Modeling，忠实图像奖励建模），这是一个综合框架，用于开发稳健的奖励模型，为忠实的图像生成和编辑提供准确可靠的指导。首先，我们设计了定制的数据整理流程，以构建高质量的评分数据集。具体来说，我们从执行度和一致性两方面评估编辑，而生成主要通过指令遵循来评估。利用这些流程，我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集，并训练了能够准确反映这些标准的专用奖励模型（FIRM-Edit-8B和FIRM-Gen-8B）。其次，我们提出FIRM-Bench，这是一个专为编辑和生成评判器设计的综合基准。评估表明，与现有指标相比，我们的模型与人类判断的一致性更高。此外，为了将这些评判器无缝整合进强化学习流程，我们提出了一种新颖的“基础加奖金”（Base-and-Bonus）奖励策略，用于平衡相互竞争的目标：面向编辑的一致性调制执行（CME）和面向生成的质量调制对齐（QMA）。在该框架的支持下，我们得到的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验表明，FIRM能够减轻幻觉，在保真度和指令遵循方面相较现有通用模型确立了新标准。我们所有的数据集、模型和代码均已在此 https URL 公开。

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

先关注：通过自回归凝视实现高效且可扩展的视频理解

Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.12254
Pdf link: https://arxiv.org/pdf/2603.12254
Abstract Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: this https URL.
中文摘要 多模态大型语言模型（MLLM）推进了通用视频理解，但在处理长时、高分辨率视频时表现不佳：尽管存在显著的时空冗余，它们仍在视觉Transformer（ViT）或大型语言模型中对每个像素同等处理。我们提出AutoGaze，这是一个轻量级模块，可在ViT或MLLM处理之前移除冗余图像块（patch）。AutoGaze通过下一词元预测和强化学习进行训练，以自回归方式选择一个最小的多尺度图像块集合，使其能够在用户指定的误差阈值内重建视频，在消除冗余的同时保留信息。实验表明，AutoGaze将视觉词元减少4倍至100倍，并将ViT和MLLM加速最多19倍，使MLLM能够扩展到1K帧、4K分辨率的视频，并在视频基准上取得优异结果（例如在VideoMME上达到67.0%）。此外，我们提出HLVid：首个高分辨率长视频问答（QA）基准，包含5分钟的4K分辨率视频；在该基准上，借助AutoGaze扩展的MLLM比基线提升10.1%，并比此前最佳的MLLM高出4.5%。项目页面：此 https URL。
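
The abstract describes selecting a small set of patches from which the video can still be reconstructed within an error threshold. The sketch below illustrates only that objective with a hand-written stand-in heuristic; it is not AutoGaze, which is a learned autoregressive module. The function name `select_patches`, the one-scalar-per-patch representation, and the copy-from-previous-frame reconstruction rule are all assumptions made for illustration.

```python
def select_patches(frames, threshold):
    """Keep only patches that cannot be copied from the running reconstruction.

    frames: list of frames, each a list of per-patch values (hypothetical
            simplification: one float per patch, single scale).
    threshold: maximum allowed absolute error per patch.
    Returns (kept, recon): kept[t] is the set of patch indices retained in
    frame t, recon[t] is the frame rebuilt from kept patches plus copies.
    """
    kept, recon = [], []
    previous = None  # reconstruction of the previous frame
    for frame in frames:
        if previous is None:
            # First frame: nothing to copy from, so every patch is kept.
            keep = set(range(len(frame)))
            current = list(frame)
        else:
            # Compare against the reconstruction (not the raw previous frame)
            # so that small errors cannot accumulate past the threshold.
            keep = {i for i, v in enumerate(frame)
                    if abs(v - previous[i]) > threshold}
            current = [frame[i] if i in keep else previous[i]
                       for i in range(len(frame))]
        kept.append(keep)
        recon.append(current)
        previous = current
    return kept, recon


frames = [[0.0, 1.0, 2.0], [0.0, 1.05, 5.0], [0.0, 1.08, 5.0]]
kept, recon = select_patches(frames, threshold=0.1)
print(sum(len(k) for k in kept), "of", sum(len(f) for f in frames), "patches kept")
```

In this toy input only the third patch changes noticeably, so later frames keep few or no patches while every reconstructed value stays within the threshold.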

Keyword: diffusion policy

V2A-DPO: Omni-Preference Optimization for Video-to-Audio Generation

V2A-DPO：视频到音频生成的全偏好优化

Authors: Nolan Chan, Timmy Gang, Yongqian Wang, Yuzhe Liang, Dingdong Wang
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2603.11089
Pdf link: https://arxiv.org/pdf/2603.11089
Abstract This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core innovations: (1) AudioScore, a comprehensive human preference-aligned scoring system for assessing semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference pair data for DPO optimization; (3) a curriculum learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on the benchmark VGGSound dataset demonstrate that human-preference aligned Frieren and MMAudio using V2A-DPO outperform their counterparts optimized using Denoising Diffusion Policy Optimization (DDPO) as well as pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
中文摘要 本文提出V2A-DPO，一种为基于流的视频到音频生成（V2A）模型量身定制的新型直接偏好优化（DPO）框架，它包含若干关键调整，以有效地将生成的音频与人类偏好对齐。我们的方法包含三项核心创新：（1）AudioScore，一个与人类偏好对齐的综合评分系统，用于评估合成音频的语义一致性、时间对齐和感知质量；（2）由AudioScore驱动的自动化流程，用于生成供DPO优化使用的大规模偏好对数据；（3）专为基于流的生成模型设计、结合课程学习的DPO优化策略。在基准数据集VGGSound上的实验表明，经V2A-DPO进行人类偏好对齐的Frieren和MMAudio，优于使用去噪扩散策略优化（DDPO）优化的对应模型以及预训练基线。此外，我们经DPO优化的MMAudio在多项指标上达到了最先进的性能，超越了已发表的V2A模型。
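
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from log-probabilities under the trained policy and a frozen reference model. It shows the generic objective only; the flow-based adaptation and curriculum strategy described in the abstract are not reproduced here. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one pair.

    *_win / *_lose are log-probabilities of the preferred and rejected
    samples; beta scales how strongly the policy may deviate from the
    reference model.
    """
    # Margin: how much more the policy favors the preferred sample over the
    # rejected one, relative to the reference model.
    margin = ((policy_logp_win - ref_logp_win)
              - (policy_logp_lose - ref_logp_lose))
    z = beta * margin
    # Numerically stable evaluation of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy already prefers the winning sample: the loss is smaller.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and lowers the rejected one's.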

Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks

并行抓握与非抓握操作：多阶段灵巧任务的实用方法

Authors: Hao Jiang, Yue Wu, Yue Wang, Gaurav S. Sukhatme, Daniel Seita
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.11655
Pdf link: https://arxiv.org/pdf/2603.11655
Abstract Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
中文摘要 灵巧手能够同时进行抓握式和非抓握式操作，例如在握持一个物体的同时与另一个物体交互，这种能力对日常任务至关重要，但在机器人领域尚未得到充分研究。学习这类长时程、接触丰富的多阶段行为具有挑战性，因为演示的采集成本高昂，而端到端策略需要大量数据才能泛化到不同的物体几何形状和摆放位置。我们提出DexMulti，这是一种面向真实世界灵巧多任务操作的样本高效方法，它将演示分解为具有明确时间边界的、以物体为中心的技能。我们的方法不学习单一的整体策略，而是根据当前物体的几何形状检索已演示的技能，利用一个跟踪质心和偏航角的不确定性感知估计器将其与观测到的物体状态对齐，并通过“检索-对齐-执行”范式加以执行。我们在1000多次真实世界试验中，使用两种灵巧手（Allegro和LEAP），对三个需要并行操作的多阶段任务（抓握+拉动、抓握+打开、抓握+抓握）进行了评估。我们的方法在训练物体上的平均成功率为66%，每个物体只需3-4次演示，比扩散策略基线高出2-3倍，同时所需演示数量少得多。结果表明，该方法对未见过的物体以及最大+/-25厘米的空间变化具有稳健的泛化能力。
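
The retrieve-align-execute idea in the abstract can be pictured with a small planar sketch: pick the stored skill whose geometry descriptor is nearest to the observed object, then move its waypoints from the demonstrated object pose to the observed pose using centroid and yaw. This is a minimal illustration under stated assumptions, not the DexMulti implementation: the skill dictionary layout, the descriptor vectors, the functions `retrieve_skill` and `align_waypoints`, and the restriction to 2D are hypothetical, and the uncertainty-aware estimator is not modeled.

```python
import math


def retrieve_skill(library, query_descriptor):
    """Return the stored skill whose descriptor is closest (Euclidean)."""
    return min(library,
               key=lambda skill: math.dist(skill["descriptor"], query_descriptor))


def align_waypoints(waypoints, demo_pose, observed_pose):
    """Map planar waypoints from the demo object pose to the observed pose.

    Poses are (centroid_x, centroid_y, yaw_in_radians).
    """
    demo_x, demo_y, demo_yaw = demo_pose
    obs_x, obs_y, obs_yaw = observed_pose
    delta = obs_yaw - demo_yaw
    cos_d, sin_d = math.cos(delta), math.sin(delta)
    aligned = []
    for x, y in waypoints:
        # Express the waypoint relative to the demo centroid, rotate it by the
        # yaw difference, then re-anchor it at the observed centroid.
        rel_x, rel_y = x - demo_x, y - demo_y
        aligned.append((obs_x + cos_d * rel_x - sin_d * rel_y,
                        obs_y + sin_d * rel_x + cos_d * rel_y))
    return aligned


# Hypothetical two-skill library with made-up descriptors and waypoints.
library = [
    {"name": "grasp_box", "descriptor": (0.10, 0.10, 0.05),
     "pose": (0.0, 0.0, 0.0), "waypoints": [(1.0, 0.0), (0.5, 0.0)]},
    {"name": "grasp_bottle", "descriptor": (0.03, 0.03, 0.20),
     "pose": (0.0, 0.0, 0.0), "waypoints": [(0.0, 1.0)]},
]
skill = retrieve_skill(library, (0.09, 0.11, 0.06))
path = align_waypoints(skill["waypoints"], skill["pose"], (1.0, 1.0, math.pi / 2))
print(skill["name"], path)
```

Execution would then send the aligned waypoints to the robot controller, which is outside the scope of this sketch.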