Arxiv Papers of Today

Generated: 2026-06-02 20:10:25 (UTC+8); arXiv release time: 2026-06-02 20:00 EDT (2026-06-03 08:00 UTC+8)

102 relevant papers today

Keyword: reinforcement learning

SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant

Authors: Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2606.00015
Pdf link: https://arxiv.org/pdf/2606.00015
Abstract Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval-augmented generation (RAG) framework and multi-agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context-aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Authors: Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.00017
Pdf link: https://arxiv.org/pdf/2606.00017
Abstract Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
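
The abstract's idea of computing rewards only at episode end, propagating them back to the originating steps, and gating out steps without valid outcome information might be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: the `Step` fields, the `final_scores` mapping, and the optional discount `gamma` are all hypothetical, and the paper's task-specific propagation semantics are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    player: int                      # which agent acted (hypothetical field)
    action: str
    valid: bool                      # False if the move violated the game rules
    reward: Optional[float] = None   # filled in only after the episode ends
    eligible: bool = False           # whether the step may be used for training


def attribute_rewards(steps: List[Step],
                      final_scores: Dict[int, float],
                      gamma: float = 1.0) -> List[Step]:
    """Assign per-step rewards after the episode ends (assumed semantics)."""
    n = len(steps)
    for i, step in enumerate(steps):
        if not step.valid or step.player not in final_scores:
            # No valid dependent information: leave the reward unset
            # and exclude the step from training.
            step.eligible = False
            continue
        # Propagate the acting player's terminal score back to this step,
        # optionally discounted by the distance to the end of the episode.
        step.reward = final_scores[step.player] * gamma ** (n - 1 - i)
        step.eligible = True
    return steps


def training_batch(steps: List[Step]) -> List[Step]:
    """Keep only steps that passed the eligibility gate."""
    return [s for s in steps if s.eligible]
```

Under these assumptions, a rule-violating step never receives a reward and simply drops out of the batch instead of contributing a noisy signal.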

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

Authors: Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00059
Pdf link: https://arxiv.org/pdf/2606.00059
Abstract Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Authors: Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.00083
Pdf link: https://arxiv.org/pdf/2606.00083
Abstract Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior to policy training. We propose Demo2Reward, a test-time adaptation technique to optimize the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

Authors: Arif Hassan Zidan, Yi Pan, Hanqi Jiang, Ruiyu Yan, Wei Ruan, Zihao Wu, Lifeng Chen, Weihang You, Xinliang Li, Bowen Chen, Huawen Hu, Peilong Wang, Sizhuang Liu, Jing Zhang, Siyuan Li, Zhengliang Liu, Yu Bao, Lin Zhao, Lichao Sun, Dajiang Zhu, Xiang Li, Jinglei Lv, Quanzheng Li, Wei Liu, Tianming Liu, Wei Zhang
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2606.00133
Pdf link: https://arxiv.org/pdf/2606.00133
Abstract World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi-axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state-space and recurrent approaches, transformer-based models, diffusion-based generators, physics-informed networks, and language-augmented multimodal systems; (iii) reasoning strategy, covering imagination-based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive-science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain-of-thought reasoning with world-model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim-to-real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation-scale interactive simulators, and safe deployment in safety-critical domains.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Authors: Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00135
Pdf link: https://arxiv.org/pdf/2606.00135
Abstract Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices, including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where, without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Authors: Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00151
Pdf link: https://arxiv.org/pdf/2606.00151
Abstract In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
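
To make the "expected maximum return over $M$ samples" objective concrete, the following is a minimal Monte Carlo sketch. It covers only the plain expected-maximum part; the abstract's return-uncertainty term, the policy-gradient derivation, and the continuous parameter $m$ are not modelled. The two toy "arms" and the function names are hypothetical and chosen for illustration.

```python
import random
from typing import Callable


def expected_max_return(sample_return: Callable[[random.Random], float],
                        m: int,
                        n_trials: int = 20000,
                        seed: int = 0) -> float:
    """Monte Carlo estimate of E[max(R_1, ..., R_m)] for i.i.d. returns."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(sample_return(rng) for _ in range(m))
    return total / n_trials


def safe_arm(rng: random.Random) -> float:
    """Deterministic return of 1.0 (the greedy choice when m = 1)."""
    return 1.0


def risky_arm(rng: random.Random) -> float:
    """Return 2.0 with probability 0.4, else 0.0 (mean 0.8)."""
    return 2.0 if rng.random() < 0.4 else 0.0
```

In this toy setting the risky arm has the lower mean (0.8 versus 1.0), but its expected maximum over two tries is $2(1 - 0.6^2) = 1.28$, which exceeds the safe arm's 1.0. This matches the abstract's intuition that exploration pays off only when retries are available.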

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Authors: Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan, Liwen Hu, Lei Ma
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00172
Pdf link: https://arxiv.org/pdf/2606.00172
Abstract Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses an answer-free self-teacher. Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
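
The vanishing-advantage problem mentioned in the abstract can be seen in a small sketch of group-relative advantage normalization. It assumes the common mean/standard-deviation form with a population standard deviation and a small `eps`; it illustrates only the failure case and does not implement CAST's bounded base advantages or token-level shaping.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group yields a usable signal of roughly +1 / -1 ...
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# ... while an all-correct (or all-wrong) group yields all zeros,
# so it contributes no gradient under this normalization.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```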

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.00183
Pdf link: https://arxiv.org/pdf/2606.00183
Abstract Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic $k$-ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth-$1$ and depth-$2$ trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.
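
For readers unfamiliar with the target behaviour, randomized depth-first search with backtracking on a full $k$-ary tree can be sketched classically as below. This is the textbook algorithm, not the paper's two-head transformer construction, and the environment's stochasticity is not modelled; representing the hidden goal as a tuple of actions is an assumption made for this sketch.

```python
import random
from typing import List, Tuple

Path = Tuple[int, ...]


def randomized_dfs(k: int, depth: int, goal: Path,
                   seed: int = 0) -> Tuple[bool, List[Path]]:
    """Search a full k-ary tree for a hidden goal leaf.

    Returns whether the goal was found and the leaves visited, in order.
    """
    rng = random.Random(seed)
    visited_leaves: List[Path] = []

    def visit(path: Path) -> bool:
        if len(path) == depth:
            visited_leaves.append(path)
            return path == goal          # terminal success or failure
        children = list(range(k))
        rng.shuffle(children)            # randomized exploration order
        for action in children:
            if visit(path + (action,)):
                return True
            # Failure below this child: backtrack and try the next one.
        return False

    return visit(()), visited_leaves
```

Because each subtree is abandoned only after all of its leaves have failed, no leaf is visited twice, which mirrors the "remember failures and backtrack" behaviour described in the abstract.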

LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching

Authors: Yao Lai, Xuyuan Xiong, Zeyue Xue, Guojin Chen, Jing Wang, Xihui Liu, Rui Zhang, Robert Mullins, Bei Yu, Ping Luo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00228
Pdf link: https://arxiv.org/pdf/2606.00228
Abstract In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow-matching paradigm with GRPO-based reinforcement learning (RL) fine-tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization-based approaches, RL in LithoGRPO exploits the explicitly defined, physics-based reward function of ILT, enabling optimization under complex, process-aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shot-counting algorithm for manufacturability evaluation, achieving over 130x speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state-of-the-art performance over both optimization-based and learning-based methods, while maintaining efficient mask generation.

MindZero: Learning Online Mental Reasoning With Zero Annotations

Authors: Shunchi Zhang, Jin Lu, Chuanyang Jin, Yichao Zhou, Zhining Zhang, Tianmin Shu
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.00240
Pdf link: https://arxiv.org/pdf/2606.00240
Abstract Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Authors: Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou, Shangqian Gao, Heng Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00251
Pdf link: https://arxiv.org/pdf/2606.00251
Abstract The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

Authors: Songyang Liu, Shunyu Yao, Dingyuan Huang, Shuai Li
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00252
Pdf link: https://arxiv.org/pdf/2606.00252
Abstract Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST: Humanoid Optimized with Imitation and Sample-efficient Tuning for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Authors: Rodney Lafuente-Mercado
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00257
Pdf link: https://arxiv.org/pdf/2606.00257
Abstract Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals (surprisal, entropy reduction, and policy divergence) can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce Adapter-Residual Credit Assignment (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.
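
The residual-norm salience $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ and the two concentration diagnostics named in the abstract could be computed along the following lines. This is a sketch under stated assumptions: sum-normalization within a trajectory, the standard mean-absolute-difference Gini coefficient, and an effective-token ratio defined as the inverse participation ratio divided by sequence length are choices made here and may differ from the paper's definitions.

```python
import math
from typing import List, Sequence


def residual_salience(h_adapted: Sequence[Sequence[float]],
                      h_base: Sequence[Sequence[float]]) -> List[float]:
    """Per-token L2 norm of the adapted-minus-base hidden state."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(ha, hb)))
            for ha, hb in zip(h_adapted, h_base)]


def normalize(salience: Sequence[float]) -> List[float]:
    """Turn salience into per-token weights that sum to 1 (assumed scheme)."""
    total = sum(salience)
    n = len(salience)
    if total == 0:
        return [1.0 / n] * n
    return [s / total for s in salience]


def gini(weights: Sequence[float]) -> float:
    """Gini coefficient: 0 for uniform weights, near 1 when concentrated."""
    n = len(weights)
    diff = sum(abs(a - b) for a in weights for b in weights)
    return diff / (2 * n * sum(weights))


def effective_token_ratio(weights: Sequence[float]) -> float:
    """Inverse participation ratio over length (hypothetical definition)."""
    return 1.0 / (len(weights) * sum(w * w for w in weights))
```

With these definitions, uniform weights give a Gini of 0 and an effective-token ratio of 1, while weights concentrated on a single token out of $n$ give a ratio of $1/n$; the two degenerate regimes described in the abstract sit at these extremes.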

Closed-Loop Neural Activation Control in Vision-Language-Action Models

Authors: Abhijith Babu, Ramneet Kaur, Nathaniel D. Bastian, Olivera Kotevska, Susmit Jha, Yanzhao Wu, Sumit Kumar Jha, Anirban Roy
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00269
Pdf link: https://arxiv.org/pdf/2606.00269
Abstract Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly suited to embodied control, where task state and concept error evolve over time, often causing overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming temporal concepts are directly controlled by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts intervention magnitude online. We instantiate this framework with both PID and reinforcement learning based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better steering-task success trade-off than fixed-coefficient baselines, without modifying or retraining the base model.
中文摘要 视觉-语言-行动（VLA）模型可以在测试时通过干预具有语义意义的内部方向来加以引导，但现有方法使用固定的引导系数，实际上是开环运行。这并不适合具身控制：任务状态和概念误差会随时间演变，固定系数常导致过度校正、振荡和任务成功率下降，对速度、平滑度等时间性行为尤其如此。我们提出CTRL-STEER，一种闭环框架，用自适应、时变的控制信号取代静态的干预强度。其核心思想是将表征与调节解耦：我们不假设时间性概念由单个神经元直接控制，而是沿与运动对齐的残差方向进行引导，同时由反馈控制器在线调整干预幅度。我们分别用PID控制器和基于强化学习的控制器实例化了该框架。在四个LIBERO任务套件上对微调后的OpenVLA策略进行的实验表明，与固定系数基线相比，CTRL-STEER实现了更稳定的概念调节以及更好的引导效果与任务成功率之间的权衡，且无需修改或重新训练基础模型。
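The closed-loop idea described above can be sketched with a textbook PID controller that sets the steering coefficient at every control step. This is a minimal sketch under stated assumptions: the class name, gains, clamping bounds, and the additive steering rule `h + alpha * direction` are illustrative choices, not the authors' implementation.

```python
class PIDSteeringController:
    """Textbook PID loop that outputs a time-varying steering coefficient alpha."""

    def __init__(self, kp, ki, kd, alpha_min=-10.0, alpha_max=10.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, measured, dt=1.0):
        """Return the clamped coefficient for the current concept error."""
        error = target - measured
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        alpha = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(self.alpha_max, max(self.alpha_min, alpha))

def steer(hidden, direction, alpha):
    """Shift a hidden state along a fixed direction (assumed additive intervention)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

A fixed-coefficient baseline corresponds to calling `steer` with a constant `alpha`; the controller instead shrinks the coefficient as the measured concept value approaches the target.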

Robust Shielding for Safe Reinforcement Learning

安全强化学习的强健屏蔽

Authors: Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli, Francesco Fabiano
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2606.00270
Pdf link: https://arxiv.org/pdf/2606.00270
Abstract Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.
中文摘要 屏蔽（shielding）是一种能够为马尔可夫决策过程（MDP）中的强化学习智能体提供形式化安全保证的有效方法。然而，现有屏蔽技术通常假设与安全相关的转移动态已知，而这一要求在实践中很少得到满足。为解决这一局限，我们提出一种面向鲁棒MDP（RMDP，即带有转移概率集合的MDP）的新型屏蔽框架。我们将安全性定义为：在RMDP的最坏情况转移概率下，以不低于给定阈值的概率满足某个线性时序逻辑（LTL）公式。我们证明该屏蔽框架对RMDP既可靠又最优：屏蔽器允许的每个策略都是安全的，反之，每个安全的RMDP策略也都被屏蔽器允许。我们将该方法与现有的采样方法相结合，后者能以概率近似正确（PAC）保证学习MDP的转移概率。这种结合使我们能够为MDP构造屏蔽器，在高置信度下保证安全，同时保持最小的限制性。实验表明，基于所学RMDP构造的屏蔽器能在未知MDP中保证安全，并随着样本数量的增加恢复出较高的期望回报。

DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties

基于DRL的双阿克曼机器人在驱动不确定性下的姿态控制

Authors: Oussama Zaim, Mélodie Daniel, Aly Magassouba, Miguel Aranda, Olivier Ly
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00313
Pdf link: https://arxiv.org/pdf/2606.00313
Abstract Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation and real-world dynamics. We address this issue in the context of maneuvering with double-Ackermann-steering mobile robots, which introduce additional constraints due to their non-holonomic nature. Building upon the DRL framework ManeuverNet, we extend its objective from position control to full pose control, resulting in a more challenging task. We further investigate the impact of actuation-related uncertainties on policy transfer. The use of simplified actuation models during training of the extended policy can lead to poor generalization, shown by a success rate drop from 100% in PyBullet to 25% in Gazebo under stricter evaluation conditions. To address this limitation, we adopt a sim-to-sim-to-real approach, where actuation effects observed in Gazebo are incorporated into the PyBullet training environment. Using multi-environment DRL with SAC and CrossQ, we learn policies that remain robust despite modeling inaccuracies. This approach can significantly reduce the performance gap across simulators, achieving up to 92% success rate in Gazebo and maintaining 69% under stricter thresholds, with successful transfer to a real robot without additional tuning.
中文摘要 由于仿真与真实世界动力学之间存在差异，在真实机器人上稳健地部署深度强化学习（DRL）策略仍然具有挑战性。我们以双阿克曼转向移动机器人的机动控制为背景研究这一问题，这类机器人的非完整约束特性带来了额外的限制。在DRL框架ManeuverNet的基础上，我们将其目标从位置控制扩展到完整位姿控制，任务因此更具挑战性。我们进一步研究了与执行机构相关的不确定性对策略迁移的影响。在训练扩展后的策略时使用简化的执行模型可能导致泛化能力差：在更严格的评估条件下，成功率从PyBullet中的100%降至Gazebo中的25%。为解决这一局限，我们采用sim-to-sim-to-real方法，将在Gazebo中观察到的执行效应纳入PyBullet训练环境。借助使用SAC和CrossQ的多环境DRL，我们学到了在建模不准确的情况下仍保持稳健的策略。该方法能显著缩小不同仿真器之间的性能差距，在Gazebo中取得最高92%的成功率，在更严格的阈值下仍保持69%，并且无需额外调参即可成功迁移到真实机器人。

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

分离LLM词汇偏差：一种面向偏好学习阶段、无需人工整理的三角对照指标

Authors: Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00334
Pdf link: https://arxiv.org/pdf/2606.00334
Abstract Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this, by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach's utility by analyzing whether preference learning shifts models toward what could be interpreted as a "language of prestige". The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus, may help inform model alignment and development of trustworthy AI.
中文摘要 近年来，多个语言领域发生了显著变化；这些变化在很大程度上被归因于大型语言模型的出现及其与自然语言用法之间的错位。一般认为，这些错位部分源自偏好学习阶段，例如基于人类反馈的强化学习；该阶段通常会让模型更有用，但同时也可能引入系统性的词汇偏差。在词汇行为上，这表现为模型偏好某些格式或过度使用某些词（如 delve、furthermore），即使这些模式并未出现在基础模型的输出中。关于偏好训练所引发的词汇错位的研究，一直受制于对人工整理的依赖。为此，我们提出三角偏好偏移评分（Triangulated Preference Shift score），该指标在人类黄金标准、基础模型和指令微调变体三者之间进行三角对照，从而在无需人工整理的情况下分离出专门由偏好学习引起的偏移。我们提供了覆盖六个模型家族的数据，将结果与已有文献相互印证，并通过分析偏好学习是否使模型向可被解读为“声望语言”的方向偏移，来说明这一通用方法的实用性。该指标为量化可归因于偏好调优的行为偏移提供了一种初步的自动化方法，因而可能有助于指导模型对齐和可信AI的开发。

Drift Q-Learning

漂移Q-学习

Authors: Anas Houssaini, Mohamad H. Danesh, Amin Abyaneh, Scott Fujimoto, Hsiu-Chin Lin, David Meger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00350
Pdf link: https://arxiv.org/pdf/2606.00350
Abstract Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value estimates. Diffusion and flow policies handle this trade-off by modeling the behavior distribution to regularize the RL objective, but they require iterative denoising, solver integrations, and in more efficient variants, distillation or other approximations at inference. We propose DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement. The value signal biases the policy toward high-value regions of the data support, while attraction and repulsion together keep generated actions near the data and prevent collapse onto a single mode. DriftQL is implemented as a single network with a unified training objective and generates actions in a single forward pass. On D4RL and OGBench, DriftQL consistently outperforms diffusion and flow methods, advancing the state of the art. Under degraded data quality, where the baselines visibly struggle, DriftQL remains close to its clean-data performance, positioning it as a promising alternative to diffusion and flow-based methods while maintaining the simplicity and efficiency of deterministic approaches. Project page: this https URL
中文摘要 离线强化学习需要基于固定数据改进策略，同时避免价值估计不可靠的分布外动作。扩散策略和流策略通过对行为分布建模来正则化强化学习目标，以此处理这一权衡，但它们需要迭代去噪和求解器积分，而更高效的变体则需要在推理时进行蒸馏或其他近似。我们提出DriftQL，它将基于漂移的行为正则项与由评论家（critic）驱动的策略改进结合起来。价值信号使策略偏向数据支撑集中的高价值区域，而吸引与排斥共同作用，使生成的动作保持在数据附近，并防止坍缩到单一模态。DriftQL以单个网络和统一的训练目标实现，并在一次前向传播中生成动作。在D4RL和OGBench上，DriftQL始终优于扩散和流方法，刷新了当前最优水平。在数据质量下降、基线方法明显吃力的情况下，DriftQL仍接近其在干净数据上的表现，这使其成为扩散和基于流的方法的有前景的替代方案，同时保持了确定性方法的简洁与高效。项目页面：此 https URL

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

长期决策问题中的配对偏好强化学习

Authors: Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00367
Pdf link: https://arxiv.org/pdf/2606.00367
Abstract Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.
中文摘要 强化学习问题通常将目标定义为最大化标量奖励函数的期望值。但成对偏好往往比标量奖励更容易指定，并且能够表达某些标量奖励无法表达的目标。因此，基于成对偏好的强化学习方法受到越来越多的关注。遗憾的是，这些方法在长时间跨度的问题中效率低下，并且缺乏马尔可夫策略相对于历史依赖策略的性能保证，而这正是连接强化学习理论与实践的桥梁。为此，我们提出\textit{马尔可夫决策竞赛}（Markov decision contest），作为基于成对偏好的强化学习的新问题模型。我们证明：平稳马尔可夫策略在所有历史依赖策略中是最优的；精确求解马尔可夫决策竞赛属于P；并且一个简单的迭代算法以次线性速率收敛到最优策略。最后，在一组长时间跨度的高维决策问题上，我们表明所提出的近似算法的学习效率显著高于已有工作。

Constrained Whole-Body Tracking for Humanoid Robots

人形机器人的带约束全身跟踪

Authors: Daniel Morton, Pranit Mohnot, Marco Pavone
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.00374
Pdf link: https://arxiv.org/pdf/2606.00374
Abstract Recent advances in reinforcement learning (RL) have demonstrated impressive whole-body agility for humanoid robots, yet ensuring safety and satisfying constraints -- particularly those specified after training -- remains a challenge. Towards this goal, we present ConstrainedMimic, a control framework that leverages whole-body kinematics and dynamics for real-time constraint enforcement within RL tracking policies. By integrating principles from operational space control and control barrier functions (CBFs), we enable the satisfaction of arbitrary runtime constraints on both the kinematic reference motion and the underlying dynamics. In whole-body motion-tracking and teleoperation experiments on a (simulated) Unitree G1 with a learned policy, we demonstrate collision avoidance (both with the robot body and external obstacles), joint limits, and center of mass stability constraints. By remaining consistent with the current contact mode and tracking objectives, we minimally restrict the capabilities of the policy when constraints are active. Our method is fully differentiable, runs on CPU, GPU, and TPU, and can be deployed at up to 300-500 Hz. All software will be freely available upon publication.
中文摘要 强化学习（RL）的最新进展已在人形机器人上展示了令人印象深刻的全身敏捷性，但确保安全并满足约束（尤其是训练之后才指定的约束）仍是一项挑战。为此，我们提出ConstrainedMimic，一个利用全身运动学和动力学在RL跟踪策略中实时执行约束的控制框架。通过融合操作空间控制和控制障碍函数（CBF）的原理，我们能够在运动学参考运动和底层动力学两个层面满足任意的运行时约束。在使用学习策略的（仿真）Unitree G1上进行的全身运动跟踪和遥操作实验中，我们演示了碰撞避免（包括与机器人自身及外部障碍物的碰撞）、关节限位和质心稳定性约束。通过与当前接触模式和跟踪目标保持一致，我们在约束生效时将对策略能力的限制降到最低。我们的方法完全可微，可在CPU、GPU和TPU上运行，部署频率最高可达300-500 Hz。所有软件将在论文发表后免费提供。

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过受限策略优化进行检测器-规避式大型语言模型的改写

Authors: Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00392
Pdf link: https://arxiv.org/pdf/2606.00392
Abstract AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
中文摘要 AI文本检测器容易受到改写攻击和检测器引导的改写攻击，但现有的检测规避方法往往缺乏对语义保持的精确控制。具体而言，直接针对检测规避进行优化可能损害细粒度语义，而标量化的奖励设计只能对规避与语义之间的权衡提供间接且对权重敏感的控制。为解决这一局限，我们将规避检测器的LLM改写建模为约束马尔可夫决策过程，其中检测规避是主要目标，语义保持则作为显式约束来执行。我们提出检测器规避策略优化（DEPO），这是一种拉格朗日原始-对偶强化学习算法，采用新颖的GRPO风格的基于组的策略更新。DEPO在训练过程中自适应地平衡语义保持与检测规避，使策略能够在规定的语义保持区域内提高攻击成功率。在MAGE、M4、RAID和同行评审数据集上，针对MAGE、RoBERTa、RADAR、Binoculars和Fast-DetectGPT检测器进行的实验表明，DEPO在精确满足语义保持约束的同时实现了很强的检测规避效果。DEPO还表现出跨领域、跨检测器和提示级的鲁棒性。
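The primal-dual structure mentioned in this abstract can be illustrated generically: a Lagrangian reward mixes the primary objective with a constraint term, advantages are standardized within a sampled group, and the multiplier is updated by projected gradient ascent. The sketch below is a generic constrained-RL illustration under assumed names and an assumed constraint form (mean semantic score at least `threshold`); it is not the DEPO update itself.

```python
import statistics

def group_advantages(evasion, semantic, lam, threshold):
    """Lagrangian reward per sample, standardized within the group (GRPO-style)."""
    lagrangian = [e + lam * (s - threshold) for e, s in zip(evasion, semantic)]
    mean = statistics.fmean(lagrangian)
    std = statistics.pstdev(lagrangian)
    if std == 0:
        return [0.0] * len(lagrangian)
    return [(x - mean) / std for x in lagrangian]

def dual_update(lam, semantic, threshold, lr):
    """Projected step on the multiplier: it grows while the constraint is violated."""
    violation = threshold - statistics.fmean(semantic)
    return max(0.0, lam + lr * violation)
```

When the group's mean semantic score sits below the threshold, the multiplier increases and samples that preserve meaning gain advantage relative to samples that only evade the detector.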

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

PR2：基于MoE的大型语言模型强化学习的预测路由重放

Authors: Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu, Huanwei Di, Jiang Liu, Jialian Wu, Zhengzhong Liu, Zicheng Liu, Emad Barsoum, Dimitris N. Metaxas, Hongyi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00395
Pdf link: https://arxiv.org/pdf/2606.00395
Abstract Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert activations can change drastically across model updates and differ between disaggregated rollout and training phases, causing large rollout--training mismatch and unstable importance sampling weights in PPO-style RL algorithms. Routing replay mitigates this issue by freezing the replay route within each reasoning trajectory, but it ignores how the router evolves under off-policy updates and thus causes router staleness. To address this limitation, we propose Predictive Routing Replay (PR2), which augments each router with a lightweight evolution predictor that learns to anticipate short-horizon router evolution. During the rollout phase, we use the predictive routing distribution to apply top-$k$ routing, enabling gradients to reach experts that are likely to become active after updates. During the training phase, we replay the resulting predicted route to retain consistency for stable importance estimation. Theoretical analysis and experiments support that PR2 reduces routing-induced mismatch, improves RL stability, and yields stronger performance across various reasoning benchmarks.
中文摘要 混合专家（MoE）大型语言模型（LLM）在大规模下取得了强劲的性能。然而，在基于MoE的LLM上进行强化学习（RL）常常存在训练不稳定的问题。一个根本原因是路由器漂移：专家激活可能随模型更新发生剧烈变化，并且在分离部署的rollout阶段与训练阶段之间存在差异，从而在PPO风格的RL算法中造成较大的rollout与训练不匹配以及不稳定的重要性采样权重。路由重放通过在每条推理轨迹内冻结重放路由来缓解这一问题，但它忽略了路由器在离策略更新下的演化，因而导致路由器过时。为解决这一局限，我们提出预测路由重放（PR2），它为每个路由器增加一个轻量级的演化预测器，学习预测路由器的短期演化。在rollout阶段，我们使用预测的路由分布进行top-$k$路由，使梯度能够到达更新后可能被激活的专家。在训练阶段，我们重放由此得到的预测路由，以保持一致性，从而实现稳定的重要性估计。理论分析和实验表明，PR2减少了路由引起的不匹配，提高了RL的稳定性，并在多种推理基准上取得了更强的性能。

Topology-Aware State Abstraction with Tangle Cores for Markov Decision Processes

具有纠结核心的拓扑感知状态抽象，用于马尔可夫决策过程

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00427
Pdf link: https://arxiv.org/pdf/2606.00427
Abstract State abstraction in reinforcement learning is usually formulated as a partition of states based on reward and transition similarity. This excludes a common structural pattern in navigation, graph, and hierarchical decision problems: interface states such as doors, hubs, and bottlenecks naturally participate in more than one region. We introduce \emph{tangle-core abstraction}, an overlapping state-abstraction framework based on graph tangles of empirical transition graphs. The method constructs abstract states from consistently oriented low-order separations and represents shared interfaces through a membership kernel rather than a hard partition. We give value-preservation guarantees for the induced overlapping abstract MDP under an explicit action-consistency condition, identify an interior-homogeneity/boundary-leakage error decomposition, and prove a quantitative interface-overlap result showing when hard partitions incur an avoidable boundary error. Empirically, tangle-core abstractions achieve favorable compression--return tradeoffs against reward-aware, learned, topological-map, and graph-partitioning baselines across bottlenecked tabular domains, procedurally generated mazes, and MiniGrid representations. We also identify a clear failure regime in which transition topology is uninformative, where tangles predictably offer little benefit. These results position graph tangles as an effective topology-aware abstraction prior for decision problems with shared interface structure.
中文摘要 强化学习中的状态抽象通常被表述为基于奖励和转移相似性对状态进行划分。这排除了导航、图和层级决策问题中一种常见的结构模式：门、枢纽和瓶颈等接口状态天然地同时属于多个区域。我们提出\emph{纠结核心（tangle-core）抽象}，这是一种基于经验转移图的图纠结（graph tangle）的重叠状态抽象框架。该方法由定向一致的低阶分离构造抽象状态，并通过隶属核而非硬划分来表示共享接口。在显式的动作一致性条件下，我们为所诱导的重叠抽象MDP给出价值保持保证，给出内部同质性/边界泄漏的误差分解，并证明一个定量的接口重叠结果，说明硬划分在何种情况下会产生本可避免的边界误差。实验上，在带瓶颈的表格型环境、程序生成的迷宫和MiniGrid表示上，纠结核心抽象相对于奖励感知、学习式、拓扑地图和图划分基线取得了更优的压缩与回报权衡。我们还识别出一个明确的失效情形：当转移拓扑不含有效信息时，纠结如预期那样几乎没有帮助。这些结果表明，对于具有共享接口结构的决策问题，图纠结是一种有效的拓扑感知抽象先验。

SDR: Set-Distance Rewards for Radiology Report Generation

SDR：放射报告生成的集合距离奖励

Authors: Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00440
Pdf link: https://arxiv.org/pdf/2606.00440
Abstract Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average \%6.80, \%7.82 and \%4.45 relative improvements respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average \%16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{this https URL}{available}.
中文摘要 带可验证奖励的强化学习迅速推进了视觉-语言模型的推理能力。然而，对于胸部X光报告生成，标准奖励（即精确匹配准确率和步骤级过程）并不适用，因为报告由无序且相互正交的发现组成，而不是一条因果推理链。我们用基于集合的视角来填补这一空白：将每份报告拆分成句子，并用冻结的句子Transformer进行嵌入，得到无序的嵌入集合。我们提出将生成报告与参考报告的嵌入之间的集合到集合距离用作连续且置换不变的奖励。在两个数据集和三个视觉-语言模型（Qwen3-VL-2B/4B、Gemma3-4B）上，通过GRPO使用基于集合距离的奖励进行后训练，在所有主要指标上都稳定优于监督微调和精确匹配GRPO（BERTScore、RadGraph F1和CheXbert F1的平均相对提升分别为6.80%、7.82%和4.45%）。同样的集合距离还支持测试时的best-of-$N$选择：按候选报告与训练报告嵌入的距离打分，在我们训练的模型以及三个闭源LLM（Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini）上均优于随机选择，BERTScore平均相对提升16.4%。作为流式信号使用时，它们支持一种更高效的测试时扩展方式：在生成过程中剪除低分候选，可将生成的词元数减少50%以上，同时保持完整best-of-$N$选择的Findings质量。这些结果共同表明，集合距离奖励可以作为胸部X光报告生成中后训练和测试时扩展的统一信号。我们的代码已公开 \href{this https URL}{available}。
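To make the permutation-invariance property concrete, the sketch below scores a generated report against a reference by a symmetric Chamfer-style distance over sentence embeddings. The abstract does not say which set-to-set distance is used, so the Chamfer form, the cosine base distance, and the function names are assumptions; embeddings are plain lists of floats standing in for sentence-transformer outputs.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def chamfer_distance(gen, ref):
    """Average nearest-neighbour distance in both directions (order-independent)."""
    gen_to_ref = sum(min(cosine_distance(g, r) for r in ref) for g in gen) / len(gen)
    ref_to_gen = sum(min(cosine_distance(r, g) for g in gen) for r in ref) / len(ref)
    return 0.5 * (gen_to_ref + ref_to_gen)

def set_distance_reward(gen, ref):
    """Continuous reward: a smaller set distance gives a larger reward."""
    return -chamfer_distance(gen, ref)
```

Reordering the sentences of either report leaves the reward unchanged, while omitting a reference finding lowers it through the reference-to-generation term.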

DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning

DriveAnchor：面向自动驾驶规划的渐进式基于锚点的流学习

Authors: Limin Yan, Haoyun Tang, Yutao Qiu, Hongqing Liu, Haoyu Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.00519
Pdf link: https://arxiv.org/pdf/2606.00519
Abstract We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
中文摘要 我们提出DriveAnchor，一个三阶段的自动驾驶规划框架，在可组合的流水线中实现行为多样性、可控性和安全性。演示流预训练（Demonstration Flow Pretraining）用一个包含2,398种轨迹形状的词表取代无结构的高斯先验，该词表由最远点采样构建，从结构上将行为多样性建立在词表覆盖度之上。引导流后训练（Guided Flow Post-training）将能量场（Energy Field）模块与流匹配（FM）联合进行后训练，能量场仅以静态道路几何为条件，在流生成之前将锚点移向用户指定的通行走廊多边形，从而在不使用可微引导的情况下增加可控性；第二阶段之后，新的走廊预设只需更新能量场，无需重新训练FM。奖励精炼流微调（Reward-Refined Flow Fine-tuning）应用零阶强化学习，使每个锚点的输出与避碰目标对齐：由于流匹配模型在单步模式下是确定性的前馈网络，每个锚点唯一确定输出轨迹，奖励优化因此简化为锚点空间中的方向搜索，无需计算对数似然，也无需进行ODE到SDE的转换。在约200万个留出的驾驶场景上评估，DriveAnchor将近距离碰撞率降低89%，平均奖励提升32%，且模仿精度没有下降，在NVIDIA Drive Orin上的推理时间为2.06毫秒。DriveAnchor已通过实车测试验证，证实了其用于量产部署的实用性。
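Farthest-point sampling, which the abstract names as the way the trajectory vocabulary is built, can be sketched as a greedy loop that repeatedly picks the candidate farthest from everything chosen so far. The trajectory distance used here (mean pointwise Euclidean distance between equal-length trajectories) and the choice of starting index are assumptions for illustration.

```python
import math

def trajectory_distance(a, b):
    """Mean pointwise Euclidean distance between two equal-length trajectories."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def farthest_point_sampling(trajectories, k, start=0):
    """Greedily pick k indices so each new pick maximizes distance to the chosen set."""
    chosen = [start]
    # min_dist[i] = distance from trajectory i to its nearest chosen trajectory
    min_dist = [trajectory_distance(t, trajectories[start]) for t in trajectories]
    while len(chosen) < min(k, len(trajectories)):
        nxt = max(range(len(trajectories)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, t in enumerate(trajectories):
            min_dist[i] = min(min_dist[i], trajectory_distance(t, trajectories[nxt]))
    return chosen
```

Because each pick covers the largest remaining gap, the selected set spreads over the space of trajectory shapes rather than clustering around frequent ones.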

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

学习检索：面向文本转SQL智能体的双层长期记忆

Authors: Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.00547
Pdf link: https://arxiv.org/pdf/2606.00547
Abstract Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.
中文摘要 交互式文本转SQL智能体通过多轮交互来解决数据库任务，包括模式探索、查询执行、反馈解读和决策修正。长期记忆帮助智能体复用过去的经验，但现有的检索方法仍有局限。静态方法依赖固定的相似度启发式，无法针对下游效用进行优化；动态方法则往往从稀疏的最终结果中学习，并且只在单一决策视野上检索记忆。当记忆的有用性随交互阶段变化时，这是不够的，因为对初始规划有用的记忆可能不同于局部的、以状态为条件的执行所需要的记忆。我们提出MERIT，一种动态多视野记忆检索框架。MERIT维护回合（episode）级记忆以提供全局策略指导，并维护轮次（turn）级记忆以支持局部决策。两个层级都使用经强化学习优化的学习式检索策略。为了在中间监督有限的情况下训练轮次级检索，MERIT使用轻量级的过程奖励模型为局部记忆选择提供稠密的代理奖励。在BIRD-Interact上的实验表明，MERIT在成功率上优于无记忆、静态检索和动态检索基线，同时减少了平均交互轮数。在Spider2-Snow上的迁移结果进一步表明，无需针对特定基准调优即可实现正向的跨基准迁移。这些结果表明，多视野检索改善了交互式文本转SQL智能体中的经验复用。

Interpretable Policy Distillation for Power Grid Topology Control

电网拓扑控制的可解释策略提炼

Authors: Aleksandra Dmitruka, Karlis Freivalds
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00561
Pdf link: https://arxiv.org/pdf/2606.00561
Abstract Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op's standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.
中文摘要 深度强化学习（RL）为电网实时运行提供了一条有前景的途径，但大型神经网络策略评估开销大、难以部署在资源受限的硬件上，而且对运行人员而言不透明。我们研究用于电网拓扑控制的近端策略优化（PPO）智能体能否在不损失运行性能的前提下压缩为紧凑的基于树的替代模型。我们在Grid2Op的标准14节点环境中使用面向稳定性的奖励训练PPO教师策略，并采用聚焦压力工况的数据采集，重点覆盖关键的高负载状态。随后将该策略蒸馏为一棵决策树和一个随机森林。在留出的验证回合中，两种替代模型的平均奖励和存活时长都超过了教师策略，而推理开销仅为其一小部分。决策树与PPO的argmax动作具有很高的精确一致率，在排名靠前的动作范围内几乎完全一致，同时规模足够小，可以直接检查。特征重要性分析揭示了一种表征上的转变：PPO策略主要依赖线路负载信号，而蒸馏得到的决策树主要由母线拓扑变量驱动。这些结果表明，聚焦压力工况的蒸馏可以把黑箱神经控制器转换为轻量、可审计、类似规则的替代模型，适合实时部署，同时也暴露出与确定性动作和特定拓扑泛化相关的风险。

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

DeepLatent：通过并行潜在视觉推理用图像思考

Authors: Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00562
Pdf link: https://arxiv.org/pdf/2606.00562
Abstract The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.
中文摘要 “用图像思考”这一新兴范式将视觉状态嵌入中间推理步骤，为视觉-语言模型开辟了新的前沿。现有方法分为两条路线。工具辅助方法执行显式的视觉操作，但延迟高且可用的操作类型受限。潜在推理方法以自回归方式生成隐式视觉状态，但性能不及工具辅助方法，其潜在词元也未能捕捉到有效的视觉信息。在本工作中，我们提出DeepLatent，一种用于潜在视觉推理的并行框架。首先，我们提出LatentFormer，它使用可学习的二维词元并行生成以上下文为条件的潜在状态，使每次视觉更新都直接锚定在原始图像特征上。其次，我们设计了一种连续空间强化学习算法，它直接在嵌入空间中优化潜在调制参数，显著提升了潜在表示的质量。该框架先通过知识蒸馏训练，再使用这一连续空间RL算法训练。此外，我们还贡献了DeepLatent-180K，一个为潜在视觉推理量身定制的大规模数据集。在多个基准上的广泛评估表明，DeepLatent达到了最先进的性能。

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER：面向多答案问答的逐步同伴优势与多样性感知探索奖励

Authors: Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00593
Pdf link: https://arxiv.org/pdf/2606.00593
Abstract Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at this https URL.
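Chinese abstract (translated): Large language models are increasingly deployed as tool-augmented agents to obtain information beyond their parametric knowledge. Although recent work has improved long-horizon tool-use reasoning, most methods still target tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting poses two challenges: fine-grained credit assignment over long search trajectories, and reward alignment that sustains exploration beyond easy, high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at this https URL.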
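The two SPADER components described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-step standardization, and the `1 / log(e + frequency)` rarity weight are assumptions chosen only to show the idea of comparing parallel trajectories step by step and rewarding rare, non-redundant findings.

```python
import numpy as np


def stepwise_peer_advantage(returns, eps=1e-8):
    """Critic-free, step-level advantages from peer trajectories (hypothetical form).

    returns[i][t] is the return-to-go of trajectory i at decision step t.
    Trajectories may differ in length; steps a trajectory never reached are NaN.
    """
    max_len = max(len(r) for r in returns)
    padded = np.full((len(returns), max_len), np.nan)
    for i, r in enumerate(returns):
        padded[i, : len(r)] = r
    # Align by decision step: each trajectory is compared only with the
    # peers that are still running at the same step.
    peer_mean = np.nanmean(padded, axis=0)
    peer_std = np.nanstd(padded, axis=0)
    return (padded - peer_mean) / (peer_std + eps)


def diversity_reward(found, already_found, entity_freq):
    """Reward new answers; rare entities weigh more, repeats earn nothing.

    entity_freq maps an entity to an assumed corpus frequency count.
    """
    seen = set(already_found)
    reward = 0.0
    for entity in found:
        if entity in seen:
            continue  # redundant finding: weight reduced to zero in this sketch
        seen.add(entity)
        reward += 1.0 / np.log(np.e + entity_freq.get(entity, 0))
    return reward


adv = stepwise_peer_advantage([[1.0, 0.5], [3.0, 1.5], [2.0]])
rare = diversity_reward(["rare_entity"], [], {"rare_entity": 1})
common = diversity_reward(["common_entity"], [], {"common_entity": 1000})
repeat = diversity_reward(["rare_entity"], ["rare_entity"], {"rare_entity": 1})
```

Standardizing within each step means no learned critic is needed; the peers at the same step act as the baseline.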

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

Authors: Rui Zhang, Xinle Wu, Yao Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00609
Pdf link: https://arxiv.org/pdf/2606.00609
Abstract Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
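Chinese abstract (translated): Reinforcement learning (RL) with verifiable rewards has made notable progress in reasoning-oriented large language models, but extending it to multi-domain RL remains challenging because rewards are unreliable on non-verifiable tasks and capabilities interfere across domains. We propose CARE-RL, which combines protocol-aware reward generation with capability-aware optimization to mitigate cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) builds prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from earlier RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments on math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, reaching Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.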
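The "amplify aligned, suppress conflicting, preserve orthogonal" behavior attributed to DACSP can be sketched as a projection onto a capability subspace. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes the historical directions are orthonormal rows, treats a positive projection coefficient as "aligned" and a negative one as "conflicting", and uses made-up scaling factors.

```python
import numpy as np


def modulate_update(update, directions, amplify=1.5, suppress=0.1):
    """Rescale the part of an update that lies in a capability subspace.

    update:     (d,) parameter update from the current RL stage.
    directions: (k, d) orthonormal historical capability directions (assumed).
    amplify / suppress: hypothetical scaling factors for this sketch.
    """
    coeffs = directions @ update                  # signed projection coefficients
    in_span = directions.T @ coeffs               # component inside the subspace
    orthogonal = update - in_span                 # component left untouched
    scale = np.where(coeffs >= 0, amplify, suppress)
    return orthogonal + directions.T @ (scale * coeffs)


directions = np.eye(3)[:2]                        # two toy capability directions
update = np.array([1.0, -2.0, 3.0])
modulated = modulate_update(update, directions)
```

In this toy case the first coordinate (aligned) grows, the second (conflicting) shrinks, and the third (orthogonal to both directions) is unchanged.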

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

Authors: Shengcheng Fu, Yang Zhang, Zhanxiang Cao, Liyun Yan, Yizhi Chen, Yunpeng Yin, Yue Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.00637
Pdf link: https://arxiv.org/pdf/2606.00637
Abstract Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and constrained environments. Success in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this challenge, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Realized by a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch utilizes attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit attention decomposition prevents the dilution of fine-grained spatial cues while reducing training overhead. Experiments demonstrate that GLAD enables reliable locomotion over challenging gaps, stepping stones, and stairs. Furthermore, the learned policy exhibits emergent terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without explicit navigation planners. In real-world deployment on a Unitree G1 humanoid robot using onboard LiDAR, the proposed method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.
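Chinese abstract (translated): Although reinforcement learning has significantly advanced humanoid locomotion, perceptive policies still struggle on sparse-foothold terrain and in constrained environments. Succeeding in these scenarios requires both broad terrain awareness and precise foothold selection, two perceptual roles that conventional encoders often entangle. To address this, we propose Global-Local Attention Decomposition (GLAD) for terrain encoding in humanoid locomotion. Implemented as a coarse-to-fine encoder over a robot-centric elevation map, GLAD explicitly separates these objectives: a global attention branch uses attention pooling to summarize the surrounding terrain context, while a state-conditioned local attention branch sparsifies and encodes precise foothold-relevant geometry. This explicit decomposition prevents fine-grained spatial cues from being diluted while reducing training overhead. Experiments show that GLAD enables reliable traversal of challenging gaps, stepping stones, and stairs. The learned policy also exhibits terrain-responsive behaviors, autonomously following narrow paths and avoiding obstacles under simple velocity commands without an explicit navigation planner. Deployed on a Unitree G1 humanoid robot with onboard LiDAR, the method achieves robust zero-shot sim-to-real transfer across diverse sparse-foothold and obstacle-rich domains.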

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Authors: Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00674
Pdf link: https://arxiv.org/pdf/2606.00674
Abstract Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a "Markovian Screening" of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ($\eta$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.
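Chinese abstract (translated): Large language models (LLMs) aligned via outcome-based reinforcement learning (RL) often exhibit a critical failure mode: they perform well on in-distribution benchmarks yet show brittle reasoning on out-of-distribution (OOD) tasks. We call this phenomenon Reward-Induced Manifold Collapse. We build a theoretical framework that connects Structural Causal Models (SCM) with the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of stochastic gradient descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution permits a "Markovian Screening" of the true causal mechanism. We derive a new generalization bound based on a Semantic Coverage Measure ($\eta$) rather than sample size, which explains why scaling data on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) act as topological filters, enforcing step-wise mutual information constraints that make the low-complexity shortcut manifold inadmissible. These results give a mathematical grounding for the role of process supervision beyond simple credit assignment.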

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

Authors: Hongqiang Lin, Pengfei Wang, Nenggan Zheng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00680
Pdf link: https://arxiv.org/pdf/2606.00680
Abstract Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.
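Chinese abstract (translated): Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which stems from limited data coverage (sample level) and from the ambiguity of identifying transition dynamics from finite data (model level). To quantify these uncertainties in a unified way, Bayesian RL treats the dynamics model as a random variable and maintains a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging because it requires solving composite objectives that involve expectations. Prior methods either use search-based techniques that scale poorly or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis shows that the objective discrepancy introduced by this approximation remains bounded. Building on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees of monotonic improvement until convergence. Empirical results show that PhyB achieves state-of-the-art performance on a range of benchmarks.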

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

Authors: Nico Bohlinger, Jan Peters
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00702
Pdf link: https://arxiv.org/pdf/2606.00702
Abstract We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
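Chinese abstract (translated): We propose turning generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function serves as a differentiable surrogate for optimizing candidate embodiments through value gradients. We evaluate the approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of more than 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.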
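The idea of optimizing an embodiment by ascending the gradient of a frozen value function can be sketched as follows. Everything here is a stand-in: the quadratic `toy_value`, the parameter bounds, and the finite-difference gradient are assumptions for illustration. In the setting the abstract describes, the gradient would instead come from differentiating a trained value network with respect to its embodiment inputs.

```python
import numpy as np


def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad


def optimize_embodiment(value_fn, init, lower, upper, lr=0.1, steps=200):
    """Gradient ascent on embodiment parameters; the value function stays fixed."""
    e = np.array(init, dtype=float)
    for _ in range(steps):
        e = e + lr * numerical_grad(value_fn, e)
        e = np.clip(e, lower, upper)  # keep the design inside its allowed range
    return e


# Toy stand-in for a frozen value function: best design at (0.4, 1.2).
target = np.array([0.4, 1.2])


def toy_value(e):
    return -np.sum((e - target) ** 2)


best = optimize_embodiment(toy_value, init=[1.0, 1.0], lower=0.0, upper=1.0)
```

The second parameter stops at its upper bound of 1.0 because the unconstrained optimum (1.2) lies outside the design range. The per-parameter gradient magnitudes are also what one would inspect to see which parameters limit performance most.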

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Authors: Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K.H. Tung, Lukasz Szpruch, Hao Ni
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00708
Pdf link: https://arxiv.org/pdf/2606.00708
Abstract Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.
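Chinese abstract (translated): Automated data science is a structured model-selection problem. A solution must choose the data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process but usually search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer more flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, hard to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and a dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation that specifies the selected modelling components, their composition, interface constraints, and execution requirements. The blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.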

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Authors: Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00755
Pdf link: https://arxiv.org/pdf/2606.00755
Abstract Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
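Chinese abstract (translated): Reinforcement learning from verifiable rewards improves the reasoning ability of large language models but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (for example, entropy regularization) or adjust the sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into the model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.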
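The core mechanism, building a teacher by temperature-scaling the model's own logits and distilling it back, can be shown on a single token distribution. This is a toy sketch under strong simplifications: the "student" is just a free logit vector rather than a network, the temperature of 2.0 and the step count are arbitrary, and no on-policy sampling is modeled.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)


def distill_step(student_logits, teacher_probs, lr=1.0):
    """One gradient step on KL(teacher || student) with respect to the logits.

    For a softmax student, that gradient is (student_probs - teacher_probs).
    """
    return student_logits - lr * (softmax(student_logits) - teacher_probs)


collapsed_logits = np.array([6.0, 1.0, 0.0, -1.0])       # sharply peaked policy
teacher = softmax(collapsed_logits, temperature=2.0)     # smoother self-teacher

student = collapsed_logits.copy()
for _ in range(500):
    student = distill_step(student, teacher)

reheated = softmax(student)  # sampled at temperature 1, no extra inference cost
```

After distillation the student reproduces the smoother distribution at temperature 1, with higher entropy than the collapsed policy and the same ranking of candidates, which is the sense in which the temperature has been "internalized".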

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

Authors: Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00775
Pdf link: https://arxiv.org/pdf/2606.00775
Abstract Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, leading to optimization stagnation during the late stages of training and trapping boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
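Chinese abstract (translated): The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries that align with natural language queries, but many models suffer from a mismatch between continuous surrogate losses and non-differentiable metrics, which causes optimization to stagnate late in training and traps boundary predictions in suboptimal solutions. Although reinforcement learning (RL) post-training has successfully optimized localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), which introduces RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. A Text-Guided Gating (TGG) mechanism then dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise inputs for temporal prediction. Once supervised training has converged, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU through a Three-stage Progressive Reinforcement Learning (TPRL) strategy to improve localization accuracy. This achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS show that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy gains with minimal parameter updates, providing a robust new pathway for RL in lightweight VMR models.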
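For readers unfamiliar with the metric named above, temporal IoU (tIoU) between a predicted and a ground-truth segment is the length of their overlap divided by the length of their union. The sketch below shows the standard definition plus a thresholded "hit" of the kind used in recall-style evaluation; the function names and the 0.5 default threshold are illustrative choices, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU of two segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(pred, gt, threshold=0.5):
    """Thresholded success indicator: a step function of the prediction."""
    return temporal_iou(pred, gt) >= threshold


overlap_score = temporal_iou((2.0, 6.0), (4.0, 8.0))   # overlap 2 s, union 6 s
disjoint_score = temporal_iou((0.0, 1.0), (2.0, 3.0))  # no overlap
```

Because `is_hit` is a step function and the tIoU of disjoint segments is flat at zero, gradients carry little or no signal in those regions, which is one reason such metrics are natural candidates for use as an RL reward rather than as a direct loss.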

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Authors: Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00780
Pdf link: https://arxiv.org/pdf/2606.00780
Abstract Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.
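Chinese abstract (translated): Offline meta-reinforcement learning uses static datasets to let agents generalize to unseen environments by combining offline efficiency with meta-learning adaptability, but it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments and are further aggravated in sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma and fail to generalize robustly. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to the behavior policy, effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations show that our method outperforms state-of-the-art approaches, with better stability and generalization under out-of-distribution and sparse-reward settings.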

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Authors: Vignesh Subramanian, Subhajit Roy, Suguman Bansal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00838
Pdf link: https://arxiv.org/pdf/2606.00838
Abstract Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.
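Chinese abstract (translated): Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure with a higher-order policy-evolution function learned directly with RL, but it scales poorly in training: as the number of training tasks grows, the aggregated reward feedback becomes noisy and conflicting, which destabilizes training and weakens generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn an individual teacher policy for each task with standard RL, then fit the evolution function by behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization over existing RL and meta-RL algorithms.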

Certificate-Guided Evaluation of Reinforcement Learning Generalization

Certificate-Guided Evaluation of Reinforcement Learning Generalization

Authors: Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00840
Pdf link: https://arxiv.org/pdf/2606.00840
Abstract This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.
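Chinese abstract (translated): This work presents a logic-driven framework for evaluating how well reinforcement learning (RL) algorithms generalize to unseen tasks. The framework defines a family of inductive reach-avoid tasks characterized by structural similarities in task dynamics, which makes it possible to evaluate generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, serving as a litmus test for RL generalization. We empirically demonstrate the method's ability to certify generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks solved, highlighting the framework's effectiveness in evaluating and distinguishing the generalization capabilities of RL algorithms. This work provides a principled approach to benchmarking RL generalization.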

Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA

Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA

Authors: Xiao Jin, Yongxiong Wang, Haobo Liu, Yudong Du, Yukun Du
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00862
Pdf link: https://arxiv.org/pdf/2606.00862
Abstract Surrogate-assisted evolutionary algorithms (SAEAs) have been widely used for expensive black-box optimization problems. However, their reliance on rigid and manually designed components limits their flexibility and generalization across tasks. Meta-black-box optimization (MetaBBO) provides a promising paradigm for adaptively configuring algorithmic components. Nevertheless, existing MetaBBO methods usually control only a single component, and few studies have investigated the unified control of multi-component optimizers such as SAEAs. Moreover, the robustness-accuracy trade-off in surrogate modeling, which is crucial for stable early-stage exploration and accurate late-stage exploitation, has rarely been explicitly considered. To address these issues, we propose AdaE-SAEA, an adaptive ensemble surrogate-assisted evolutionary algorithm for expensive multi-objective optimization. AdaE-SAEA embeds SAEA as the low-level optimizer within the MetaBBO framework and jointly controls the infill criterion and ensemble-based surrogate modeling. Specifically, bagging and boosting are designed as surrogate modeling modules to adaptively balance robustness and accuracy across different search phases, while the meta-policy simultaneously selects the infill criterion to enable adaptive sampling decisions. The meta-policy is trained through reinforcement learning with parallel sampling and centralized training, improving both training efficiency and transferability. Experiments on synthetic and real-world problems demonstrate that AdaE-SAEA outperforms state-of-the-art baselines and MetaBBO-based methods. We further verify the effectiveness of TabPFN as the base surrogate model for ensemble learning. To the best of our knowledge, this is the first work to unify the control of surrogate modeling and infill criteria in SAEAs while explicitly addressing the robustness-accuracy trade-off.
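Chinese abstract (translated): Surrogate-assisted evolutionary algorithms (SAEAs) are widely used for expensive black-box optimization problems. However, their reliance on rigid, manually designed components limits their flexibility and generalization across tasks. Meta-black-box optimization (MetaBBO) offers a promising paradigm for adaptively configuring algorithmic components. Nevertheless, existing MetaBBO methods usually control only a single component, and few studies have examined unified control of multi-component optimizers such as SAEAs. Moreover, the robustness-accuracy trade-off in surrogate modeling, which is crucial for stable early-stage exploration and accurate late-stage exploitation, has rarely been considered explicitly. To address these issues, we propose AdaE-SAEA, an adaptive ensemble surrogate-assisted evolutionary algorithm for expensive multi-objective optimization. AdaE-SAEA embeds SAEA as the low-level optimizer within the MetaBBO framework and jointly controls the infill criterion and ensemble-based surrogate modeling. Specifically, bagging and boosting are designed as surrogate modeling modules that adaptively balance robustness and accuracy across search phases, while the meta-policy simultaneously selects the infill criterion to enable adaptive sampling decisions. The meta-policy is trained by reinforcement learning with parallel sampling and centralized training, improving both training efficiency and transferability. Experiments on synthetic and real-world problems show that AdaE-SAEA outperforms state-of-the-art baselines and MetaBBO-based methods. We further verify the effectiveness of TabPFN as the base surrogate model for ensemble learning. To the best of our knowledge, this is the first work to unify the control of surrogate modeling and infill criteria in SAEAs while explicitly addressing the robustness-accuracy trade-off.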

Enhancing LLM Metacognition via Cognitive Pairwise Training

Enhancing LLM Metacognition via Cognitive Pairwise Training

Authors: Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00869
Pdf link: https://arxiv.org/pdf/2606.00869
Abstract Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at this https URL.
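Chinese abstract (translated): Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when the evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or to express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to tell trustworthy reasoning from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and is robust and scalable across evaluation and training settings. Code and models are released at this https URL.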

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Authors: Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00880
Pdf link: https://arxiv.org/pdf/2606.00880
Abstract Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.
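Chinese abstract (translated): Continual reinforcement learning aims to produce agents that not only improve at their current tasks but also adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training, with frozen weights. Whether task diversity also improves an agent's ability to keep learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis lets agents begin training on the new tasks near the performance reached on the previous ones, even when the shift changes the structure of the optimal policy. However, as the number of shifts grows, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.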

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Authors: Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su, Luo Mai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.00902
Pdf link: https://arxiv.org/pdf/2606.00902
Abstract General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.
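Chinese abstract (translated): General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence spread across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that loses this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors through chart- and table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy that combines supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B for under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, which is +12.6 percentage points (pp) above the base model and +3.8 pp above GPT-5.2. We release Ryze as open source together with the trained BioVLM-8B model.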

Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

通过扩散建模与多智能体强化学习指导实现生成式多机器人运动规划

Authors: Suk Ki Lee, Venkata Sai Deepak Mutta, Hyunwoong Ko
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.00933
Pdf link: https://arxiv.org/pdf/2606.00933
Abstract Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter-agent interactions. This paper presents a framework for coordinated multi-robot motion planning that combines decentralized generative trajectory planning with multi-agent reinforcement learning (MARL)-based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single-agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient-based steering, enabling interaction-aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi-agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value-guided diffusion planning reduces the inter-agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL-based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi-robot model.
中文摘要 在共享环境中协调多个机器人需要为每个智能体生成可行的轨迹，同时考虑智能体之间的交互。随着机器人数量的增加，集中式规划方法难以扩展，而允许每个代理独立规划的去中心化方法，本质上并未考虑代理间的交互。本文提出了一个协调多机器人运动规划的框架，结合了去中心化生成轨迹规划与基于多智能体强化学习（MARL）的协调。每个机器人独立使用基于单代理运动数据训练的扩散模型生成候选轨迹，利用生成模型生成可行且多样化轨迹的能力。为减少代理间冲突，通过MARL训练的中心值函数通过梯度引导反向扩散过程，实现交互感知轨迹生成，无需集中联合规划或生成模型的重新训练。该指导遵循指数倾斜表述，其中值函数使去噪分布偏向期望多代理返回更高的轨迹。该框架在模拟迷宫环境中进行评估，配备四台移动机器人。实验结果表明，所提出的价值引导扩散规划将代理间干扰率从55.4%降至41.8%，表明在保持分散式轨迹生成可扩展性的同时，可以有效实现协调。这些结果表明，基于MARL的价值指导能够有效地将协调引入去中心化生成规划器，而无需完全联合的多机器人模型。
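
A minimal sketch of the exponential-tilting idea described in the abstract above, reduced to one dimension so it can be checked by hand. It assumes a Gaussian denoising step N(mu, sigma^2) and a value function that is locally linear; in that case tilting the density by exp(V(x) / temperature) shifts the mean by sigma^2 * dV/dx / temperature and leaves the variance unchanged. The function names, the `temperature` parameter, and the constant gradient in the usage lines are illustrative assumptions, not the paper's implementation.

```python
import random


def tilted_mean(mu, sigma, value_grad, temperature=1.0):
    """Mean of N(mu, sigma^2) after tilting by exp(V(x) / temperature).

    Exact when V is linear with slope `value_grad`; only a first-order
    approximation for a general value function.
    """
    return mu + (sigma ** 2) * value_grad / temperature


def guided_denoise_step(mu, sigma, value_grad_fn, temperature=1.0, rng=random):
    """Draw one sample from a value-guided reverse step (toy, 1-D).

    `value_grad_fn` is a hypothetical callable returning dV/dx. In the
    abstract's setting this role would be played by the gradient of the
    centralized value function trained with MARL.
    """
    shifted = tilted_mean(mu, sigma, value_grad_fn(mu), temperature)
    return shifted + sigma * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [
        guided_denoise_step(0.0, 0.5, lambda x: 2.0, rng=rng)
        for _ in range(20000)
    ]
    # The empirical mean should sit near tilted_mean(0.0, 0.5, 2.0).
    print(sum(samples) / len(samples), tilted_mean(0.0, 0.5, 2.0))
```

Because the diffusion model itself is untouched in this sketch and only the step mean is shifted, it mirrors the abstract's point that guidance needs no retraining of the generative model.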

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释的深度强化学习揭示了节能的湍流阻力减缓控制策略

Authors: Federica Tonti, Ricardo Vinuesa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2606.00949
Pdf link: https://arxiv.org/pdf/2606.00949
Abstract We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as baselines agents trained to target wall-shear stress directly and opposition control, we compare three SHAP-guided approaches. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall-pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, with a drag reduction (DR) of 34.44% and a net energy saving (NES) of 34.01% at only 0.43% normalized input power. Relative to opposition control, DR and NES increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a timescale comparable to the lifetime of the near-wall turbulent structures.
中文摘要 我们提出一种结合多智能体深度强化学习（MARL）和可解析深度学习（XDL）的方法，以减少壁面受限湍流中的阻力。以直接针对壁面剪切应力和对立控制的训练剂结果为基线，比较了三种SHAP引导的方法。第一种方法是通过预测未来速度场的U网的SHAP归因计算的;第二种是通过SHAP归因的U-网预测表皮摩擦系数;第三个则是结合两个U线的SHAP归因，分别预测了表皮摩擦系数和壁面压力波动。基于皮肤摩擦系数和壁面压力波动的综合SHAP策略实现了最佳的整体性能，DR为34.44%，NES为34.01%，且仅有0.43%的归一化输入功率。相对于对向控制，阻力减缓和净节能分别提升了49.41%和48.52%。与直接壁面剪切应力基线相比，所提策略同时提升性能，同时将归一化驱动成本从5.90%降至0.43%。结果分析显示，这种能量高效策略与压力门控驱动一致，主要在接近零壁面压力下激活，且其时间尺度与近壁湍流结构的寿命相当。

MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

MedGym：动态医疗治疗强化学习的统一连续时间基准

Authors: Yuepeng Wang, Ken Kawano, Yongqi Zhou, Yoshihiko Fujisawa, Richard Weiss, Akifumi Wachi, Katsuki Fujisawa, Ying Chen, Mehrshad Sadria, Xin Liu, Kyoung-Sook Kim, Xiao Hu, Sebastien Gros, Xun Shen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01028
Pdf link: https://arxiv.org/pdf/2606.01028
Abstract Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.
中文摘要 医学治疗建议对强化学习（RL）提出了诸多挑战：患者生理结构以连续时间演变，测量和干预间隔不定，且治疗效果在个体间差异显著。然而，现有的强化学习表述和模拟环境基于离散时间的MDP或POMDP抽象，具有固定或预先指定的决策区间。因此，评估强化学习方法是否能处理时间间隔相关的疾病进展、个性化治疗反应以及连续测量点之间的安全性仍然困难。为弥补这一空白，我们推出了MedGym，这是一个动态治疗推荐的标杆环境。MedGym在连续时间框架中建模患者纵向演变，并利用物理知情神经网络从临床数据构建可配置的医学强化学习基准。最终基准支持离线和在线强化学习，并实现离散时间和连续时间方法在不规则治疗时机和患者具体动态下的直接比较。此外，MedGym还支持从临床重要视角进行评估，包括个性化、轨迹级安全性以及基于模型的离线学习与在线部署之间的性能差距。通过为连续时间动态治疗提供标准化且可配置的基准，MedGym旨在促进医学强化学习方法的更真实和信息丰富的评估。

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+：重新思考政策提炼的优势设计

Authors: Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01039
Pdf link: https://arxiv.org/pdf/2606.01039
Abstract On-policy distillation (OPD) is a widely used technique for transferring capabilities from strong teacher language models to base student models, and it can be formulated as a reinforcement-learning-style objective over student-generated rollouts. Yet, although the divergence reward depends on the student model's likelihood, existing works usually adopt a stop-gradient design, primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on the f-divergence between the student and the teacher, and mathematically revisit whether this design is valid. We prove that a general stop-gradient operation leads to biased estimates of the reward objective and of the corresponding gradient for general divergence functions. We propose OPD+, a corrected version of OPD that improves on the baseline KL approach and supports a choice of various f-divergences. We validate our findings on mathematical reasoning and tool-use benchmarks.
中文摘要 策略提纯（OPD）是一种广泛使用的技术，用于将能力从具备能力的教师语言模型转移到基础学生模型，并可通过学生生成的推广项目，形成强化学习风格的目标。然而，尽管发散奖励依赖于学生模型似然度，现有研究通常主要采用停止梯度设计以求稳定性，这使得所得优势估计存在疑问。本研究提供了基于学生与教师之间f-发散的通用优化框架，并数学上重新审视此类设计空间是否有效。我们证明了一般的停止梯度运算会导致一般散度函数的奖励目标和相应梯度的有偏估计值。我们提出OPD+，即OPD的修正版本，它展示了优于基线KL方法的性能，并支持选择各种f-发散。我们验证了数学推理和工具使用基准的发现。
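
The bias claim in the abstract above can be illustrated with a short derivation. This is a sketch in our own notation, not the paper's: it assumes a sequence-level reward of the form R_θ(y) = h(ℓ_θ(y)), where ℓ_θ is the student-teacher log-likelihood ratio, π_T is a fixed teacher, and h is a differentiable function standing in for the chosen divergence.

```latex
\begin{align}
\ell_\theta(y) &= \log \pi_\theta(y \mid x) - \log \pi_T(y \mid x), \qquad
R_\theta(y) = h\big(\ell_\theta(y)\big), \\
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R_\theta(y) \big], \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{y \sim \pi_\theta}\big[ R_\theta(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]
   + \mathbb{E}_{y \sim \pi_\theta}\big[ \nabla_\theta R_\theta(y) \big] \\
  &= \mathbb{E}_{y \sim \pi_\theta}\Big[ \big( R_\theta(y) + h'(\ell_\theta(y)) \big)\,
     \nabla_\theta \log \pi_\theta(y \mid x) \Big].
\end{align}
```

A stop-gradient on the reward keeps only the first term of the third line, so under these assumptions its bias is the expectation of h'(ℓ_θ(y)) ∇_θ log π_θ(y | x). For the reverse-KL choice h(ℓ) = -ℓ, h' is the constant -1 and the bias vanishes because the score function has zero mean; for a non-affine h it generally does not.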

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

ExpWeaver：LLM代理通过潜在RAG从经验中学习

Authors: Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.01041
Pdf link: https://arxiv.org/pdf/2606.01041
Abstract Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at this https URL.
中文摘要 经验学习通过将过去的互动整合为可重用的知识，在提升LLM代理的规划和推理方面取得了有希望的成果。然而，现有方法仍局限于显式文本空间，通过语义相似性检索体验并将其串接到上下文窗口，导致大量令牌开销和解耦架构，将检索与生成分离。为解决这些局限性，我们提出了ExpWeaver框架，使LLM代理能够通过潜在检索增强生成从经验中学习，而无需单独的RAG模块。ExpWeaver 利用 LLM 自身的隐藏状态编码经验，在每个解码步骤直接在潜伏空间中检索相关经验，并通过交叉注意力聚合和门控残留机制进行整合。整个流程通过强化学习实现端到端优化，支持生成任务和排名任务。我们评估ExpWeaver在13项涵盖问答、推理、编码、科学预测和推荐等多项任务上。结果显示，ExpWeaver 在13项任务中有12项实现了最先进的性能，比最强基线高出6.8%以上;其令牌效率与非检索基线相当，而基于文本的检索方法则需要多出1.5到2倍的令牌数;并且表现出更优异的跨域泛化能力，在零剂量转移下比最强基线高出16.32%，在少量注射转移下高出15.21%。我们的ExpWeaver代码在此HTTPS网址发布。

Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment

限制交互安全连续时间强化学习用于动态医学治疗

Authors: Xun Shen, Yuepeng Wang, Akifumi Wachi, Yongqi Zhou, Richard Weiss, Yoshihiko Fujisawa, Ken Kawano, Mehrshad Sadria, Ying Chen, Xin Liu, Sebastien Gros, Xiao Hu, Kyoung-Sook Kim, Mengmou Li, Katsuki Fujisawa, Kenji Wakabayashi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01051
Pdf link: https://arxiv.org/pdf/2606.01051
Abstract Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.
中文摘要 动态医疗治疗需要决定治疗强度和干预时机，而患者状态持续变化，临床交互之间可能发生不良事件。大多数现有的治疗学习方法假设固定的时间表，或仅在离散决策点执行安全。我们提出了交互限制安全连续时间强化学习框架，该框架在轨迹级安全约束下联合优化治疗施用和临床交互时机。我们的核心思想是将连续时间处理问题重新表述为基于选项的半马尔可夫决策过程，每个选项指定连续时间处理策略及其持续时间。我们开发了一种安全紧致机制，证明在相互作用时间处构建适当约束能够以高概率保证整个连续时间轨迹的安全。我们进一步建立了从记录处理轨迹中获取策略学习的有限样本保证，并引入了实用的数据驱动保守替代指标。实验表明，所提出的自适应交互-时序机制在不同安全策略优化方法中，相较于等距离相互作用方案，提高了安全性和处理效果。

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

在模型学习漏洞之前：RLVR验证器在模糊中

Authors: Jaideep Ray
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01066
Pdf link: https://arxiv.org/pdf/2606.01066
Abstract Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
中文摘要 带可验证奖励的强化学习（RLVR）用可执行的奖励函数替代人类偏好标签，如数学答案检查器、JSON 工具调用验证器和代码单元测试工具。这使得奖励部分成为软件伪影：如果验证者错误，优化可以学习漏洞。我们采用轻量级验证器模糊框架研究该失败模式，生成对抗性补全，比较有缺陷和更严格的引用验证器，记录配对决策，并报告误判、假阴性、不一致、利用和不确定性指标。
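
To make the paired-verifier comparison in the abstract above concrete, here is a small sketch. Everything in it is hypothetical: the two verifiers, the `Answer: <integer>` output convention, and the adversarial completions are invented for illustration and are not taken from the paper's framework. It shows only the general pattern of running a lenient and a stricter checker on the same completions and counting where they disagree.

```python
import re


def buggy_verifier(completion: str, answer: str) -> bool:
    """Deliberately flawed: accepts any completion containing the answer as a substring."""
    return answer in completion


def strict_verifier(completion: str, answer: str) -> bool:
    """Reference check: the last non-empty line must be exactly 'Answer: <integer>'."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return False
    match = re.fullmatch(r"Answer:\s*(-?\d+)", lines[-1])
    return match is not None and match.group(1) == answer


def fuzz(answer: str, completions):
    """Run both verifiers on every completion and count paired outcomes."""
    report = {"false_positive": 0, "false_negative": 0, "disagreement": 0}
    for text in completions:
        buggy = buggy_verifier(text, answer)
        strict = strict_verifier(text, answer)
        if buggy and not strict:
            report["false_positive"] += 1  # reward granted for a wrong output
        elif strict and not buggy:
            report["false_negative"] += 1  # reward withheld from a right output
        if buggy != strict:
            report["disagreement"] += 1
    return report


# Hand-written adversarial completions for the target answer "42".
ADVERSARIAL = [
    "Answer: 42",
    "Answer: 142",
    "42 or 43, not sure",
    "Answer: 41\n(42 was my first guess)",
    "",
    "The answer is forty-two",
]

if __name__ == "__main__":
    print(fuzz("42", ADVERSARIAL))
```

In this toy setup the false positives are the completions a policy could learn to produce if it were optimized against the lenient checker.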

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter：通过多视角交替注意力内化几何等变性，实现组合路由

Authors: Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01084
Pdf link: https://arxiv.org/pdf/2606.01084
Abstract Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.
中文摘要 组合路由问题，如旅行推销员问题（TSP）和电容车辆路由问题（CVRP），是具有广泛实际应用的基础性NP难问题。尽管近期的深度强化学习方法表现出良好表现，但它们通常仅通过数据增强处理几何对称，导致决策不一致且泛化有限。为解决这一问题，我们提出了MViewRouter，一种多视角框架，将几何等变性内化为结构性归纳偏置，实现跨路由问题变体的不变决策。我们的方法引入了多视角交替注意力（MAA）机制，使得在$D_4$对称组上实现并行处理，并在视图内关系建模和跨视图特征对齐之间交替进行。此外，我们通过集体策略梯度聚合（CPGA）优化策略，利用多个对称视角的共识梯度来稳定训练并加速收敛。TSP和CVRP基准测试以及真实TSPLIB实例的实验表明，MViewRouter实现了具有竞争力的解质量和强的零样本泛化能力。
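
For readers unfamiliar with the $D_4$ symmetry group mentioned in the abstract above, the following sketch enumerates its eight elements as coordinate maps on the unit square and checks that tour length is unchanged under each. It assumes node coordinates normalized to [0, 1]^2, a common convention in neural routing work but an assumption here; the function names are illustrative and the code covers only the symmetric views, not the attention mechanism.

```python
import math

# The eight symmetries of the unit square (rotations and reflections).
D4_TRANSFORMS = [
    lambda x, y: (x, y),
    lambda x, y: (y, x),
    lambda x, y: (x, 1 - y),
    lambda x, y: (y, 1 - x),
    lambda x, y: (1 - x, y),
    lambda x, y: (1 - y, x),
    lambda x, y: (1 - x, 1 - y),
    lambda x, y: (1 - y, 1 - x),
]


def d4_views(coords):
    """Return the eight symmetric views of a list of (x, y) node coordinates."""
    return [[t(x, y) for x, y in coords] for t in D4_TRANSFORMS]


def tour_length(coords, tour):
    """Length of the closed tour visiting `coords` in the order given by `tour`."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


if __name__ == "__main__":
    nodes = [(0.1, 0.2), (0.7, 0.3), (0.4, 0.9)]
    order = [0, 1, 2]
    for view in d4_views(nodes):
        print(round(tour_length(view, order), 6))  # same value for every view
```

Since every view yields the same tour cost, a policy that decides differently across views is inconsistent in the sense the abstract describes.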

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent：具结构化推理和工具集成的临床代理，用于订单生成

Authors: Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01094
Pdf link: https://arxiv.org/pdf/2606.01094
Abstract Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.
中文摘要 临床医嘱生成是临床决策与现实实践之间的关键桥梁，将医疗决策转化为具体且可执行的医嘱。现有的代理主要关注粗粒度决策，忽视了临床订单所需的细粒度、可执行信息。为弥补这一空白，我们提出了CAREAgent，一个临床订单生成代理。为支持其训练，我们引入了一种两阶段的代理推理数据构建方法。首先，我们设计了一个代理框架，构建可验证的推理轨迹，符合现实的临床工具使用情况。其次，我们通过格式合规性、顺序效度和临床合理性来过滤推理轨迹。基于构建的数据，模型首先通过监督微调训练以获得基础推理格式和医学知识，随后通过多维奖励函数的强化学习优化，以增强复杂的临床推理能力。多个基准测试的实验证明了CAREAgent的有效性。在ClinicalBench（培训期间未见）上，CAREAgent分别将F1分数提升了5.05%、2.09%和0.86%，均优于单智能体、多智能体和智能推理方法。

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

MiCU：基于大型语言模型实现端到端智能家居指令理解

Authors: Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01099
Pdf link: https://arxiv.org/pdf/2606.01099
Abstract Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human-audited accuracy by 32.05%. Our data and code are available at this https URL.
中文摘要 智能家居生态系统中的指令理解系统可以自动化设备控制，显著提升用户体验。然而，虽然它们在精准的言语上表现良好（例如“打开卧室灯”），但在模糊或错位的指令上表现不佳（例如“让卧室变得舒适”）时会遇到困难。大型语言模型（LLMs）在多个领域具有良好的推广性，并且在此类任务中能够优于传统基于规则的系统，但其效能常常受到领域特定数据稀缺、任务特定适应不足以及高计算成本的限制。本文提出了利用用户日志和大型语言模型（LLM）实现的自动化训练数据综合工作流程;然后我们构建MiCU，一个擅长命令理解的领域专用大型语言模型。具体来说，我们通过课程学习将领域知识注入基础LLM，然后通过冷启动训练结合强化学习（RL），并以领域特定思维规则为指导，提升其推理能力。此外，我们引入了一种令牌压缩技术，将设备描述浓缩为单一特殊令牌，显著降低推理开销，并实现了高效变体\model-fast，优化长输入。大量实验表明，MiCU在所有设备类别中平均准确率提升20.01%，显著优于基线产品。我们已在小米家居应用中部署了MiCU，每天约有170万次页面浏览量。生产评估显示，MiCU使用户更正率降低了1.57%，并使人工审核准确率提升了32.05%。我们的数据和代码可在此 https URL 获取

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

从无奖励表征到偏好：重新思考基于偏好的离线强化学习

Authors: Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01123
Pdf link: https://arxiv.org/pdf/2606.01123
Abstract Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at this https URL.
中文摘要 基于偏好的强化学习（PbRL）通过学习两对的人类偏好反馈，避免了显式奖励工程。现有的离线PbRL方法通常遵循两阶段流程，首先从标记偏好中学习奖励或偏好模型，然后对未标记数据进行离线强化学习。我们通过零样本强化学习文献中的无奖励表示学习（RFRL）视角重新审视离线PbRL，并提出了一种新的训练框架，先从无奖励离线数据学习潜在后继测度表示，然后通过对比搜索和利用偏好数据进行微调。通过大量实验和消融，我们表明该方法优于离线PbRL基线，实现了更优越的偏好效率。这项工作首次将RFRL与PbRL连接起来，突出其作为高效反馈解决方案的潜力。我们的代码在此 https URL 公开。

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

拉格朗日扰动扩散引导：生成策略的潜在强化学习

Authors: Hikmet Simsir, Ozgur S. Oguz
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01151
Pdf link: https://arxiv.org/pdf/2606.01151
Abstract Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: this https URL.
中文摘要 使用高容量生成策略的行为克隆能实现强的模仿性能，但通常受限于示范覆盖率和分布转移。直接强化学习的微调可以提升性能，但更新大型动作解码器通常不稳定且采样效率低下。我们提出了拉格朗日微扰扩散引导（LP-DS），这是一种轻量级适应方法，通过在解码前学习紧凑的噪声空间扰动，改进冻结的生成策略。LP-DS 通过拉格朗日信任区域目标优化该扰动，改善下游值，同时约束与潜在先验的偏差。在RoboMimic操作、OpenAI健身房移动和Adroit灵巧操作基准测试中，LP-DS在保持比无约束噪声空间引导更高的动作空间熵的同时，提高了样本效率、成功率和回报，回报提升高达25%。通过流量匹配骨干、大型视觉-语言-行动模型以及Franka的物理部署进行额外评估，表明LP-DS并不局限于紧凑的扩散策略或模拟基准测试。项目页面：这个 https URL。

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

形式数学验证中生成奖励建模的期望值对齐

Authors: Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01160
Pdf link: https://arxiv.org/pdf/2606.01160
Abstract Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in Leibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.
中文摘要 大型语言模型（LLM）越来越多地与形式交互式定理证明器如精益4一起使用。用强化学习或搜索方法扩展这些系统需要过程奖励模型（PRM），以评估中间推理步骤。现有的奖励模型设计揭示了一个实际的权衡。价值头模型提供连续评分，但修改生成模型界面;生成奖励模型保留文本逻辑，但由于数值被分散在多个代币之间，与连续浮点回归匹配度较差。我们引入了期望价值对齐（EVA），这是一种奖励建模过程，保持表面输出离散，同时从模型的代币分布中提取连续得分。该模型以结构化的JSON格式输出整数分数，EVA则计算对相应锚点代币logit的期望值，获得连续分数。训练将因果语言建模目标与这些期望值上的辅助均方误差损失相结合。我们在 \textit{Leibniz} 中实例化 EVA 的 Eva，这是一种用于精益 4 形式验证的奖励模型，并结合零机会和奖励建模基线进行评估。评估表明，基于连续logit的评分显著减少离散化伪影，同时保留生成批判的可解释性。
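
The expected-value scoring step described in the abstract above can be sketched as follows. This sketch assumes that the logits of the anchor tokens (one per integer score) are renormalized with a softmax restricted to those tokens before the expectation is taken; the abstract does not spell out the normalization, so that detail and all names below are assumptions rather than the paper's procedure.

```python
import math


def expected_score(anchor_logits):
    """Continuous score from integer-score anchor tokens.

    `anchor_logits` maps each integer score to the logit of its anchor token.
    The logits are softmax-normalized over the anchors only (an assumption),
    and the result is the probability-weighted mean of the integer scores.
    """
    top = max(anchor_logits.values())  # subtract the max for numerical stability
    weights = {s: math.exp(l - top) for s, l in anchor_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def auxiliary_mse(anchor_logits, target):
    """Squared error between the expected score and a continuous target label."""
    return (expected_score(anchor_logits) - target) ** 2


if __name__ == "__main__":
    # Two anchors with equal logits give a score halfway between them.
    print(expected_score({1: 0.0, 5: 0.0}))
    # A sharply peaked distribution stays close to the dominant integer.
    print(expected_score({0: -50.0, 10: 50.0}))
```

The surface output remains a single integer token, while the expectation can take any value between the smallest and largest anchor score, which is the property the abstract attributes to EVA.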

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

通过专家混合灵活安排动态云工作流程，灵活安排不同截止日期

Authors: Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01162
Pdf link: https://arxiv.org/pdf/2606.01162
Abstract Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce DEFT (Deadline-pErceptive Mixture-oF-Experts), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a graph-adaptive gating mechanism that encodes workflow deadlines and DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.
中文摘要 云计算中的工作流调度要求将动态到达、图结构化、截止日期不一的工作流智能分配到不断变化的虚拟机资源上。然而，现有的深度强化学习（DRL）调度器仍受限于僵化的单路径推理架构，难以应对多样化的调度场景。我们引入了 \textbf{DEFT}（\textbf{D}eadline-p\textbf{E}rceptive Mixture-o\textbf{F}-Exper\textbf{t}s），这是一种创新的DRL政策架构，利用了多位专业专家，每位专家都经过培训，能够管理不同程度的截止时间紧迫度。据我们所知，DEFT 是首个引入并验证动态云工作流程调度的专家混合架构的机构。通过适应性地将决策通过最合适的专家，DEFT能够满足单一专家无法完成的广泛截止日期要求。DEFT 的核心是一个 \textbf{graph-adaptive} 门控机制，编码工作流程截止日期和 DAG、任务状态和虚拟机条件，利用交叉注意力引导专家以细粒度、对截止日期敏感的方式激活。动态云工作流程基准测试的实验表明，DEFT显著降低了执行成本和截止日期违规，优于多个最先进的DRL基线。

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调分子生成的扩散模型

Authors: Guang Lin, Shikui Tu, Lei Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01220
Pdf link: https://arxiv.org/pdf/2606.01220
Abstract Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during sampling or require carefully curated datasets during training, yet still achieve only modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains difficult. To address these challenges, we propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high-quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.
中文摘要 生成既满足药物类特性又符合目标蛋白三维结构的分子，是基于结构的药物设计（SBDD）的核心挑战。然而，现有的生成方法往往依赖于采样时昂贵的事后处理，或在训练时需要精心策划的数据集，但仍能取得适度的收益。这些限制在多目标环境中尤为明显，平衡相互冲突的标准仍是核心挑战。为应对这些挑战，我们提出了FTDiff，一种针对结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为确保稳定和样本高效优化，FTDiff采用群相对策略优化（GRPO）风格策略。此外，FTDiff基于无时间预训练扩散模型，采用快速采样机制，减少去噪步骤，显著加快训练和推断速度，同时保持生成质量。通过优化固定的阈值感知奖励，FTDiff有效引导模型生成有效、多样化且高质量的分子，以平衡多个药物设计目标。对基准数据集的大量实验表明，FTDiff 持续优于以往方法，无需昂贵的事后优化或复杂的数据工程。

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

无无效样本的RLVR：针对LLM推理的组优先级非策略优化

Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01281
Pdf link: https://arxiv.org/pdf/2606.01281
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的强大范式。然而，无效训练数据的普遍存在严重制约了其效果：许多被采样的提示所生成的回答组要么全部正确，要么全部错误，导致奖励方差为零、学习信号有限。近期最先进的方法通过大量的LLM采样（rollout）来过滤无效样本，但代价是可观的计算开销。预测采样和轨迹回放等替代方法旨在提高数据效率，但往往仍不充分，并可能带来系统性偏差或次优约束等额外问题。为解决这些局限，我们提出了组优先级离策略优化（POPO），这是一个简单而有效的框架，能够充分利用有效的训练批次，且不引入额外的采样开销。POPO包含两个关键组成部分：优先级组回放和解耦离策略优化。前者通过基于新近度的回放机制，用有效的离策略组替换无效的在策略组，该机制同时考虑样本质量和离策略程度。为进一步缩小离策略差距，POPO采用解耦重要性采样来纠正离策略偏差，同时在一致的信任域约束下保持策略更新的稳定。在数学、规划和视觉几何等多种推理任务上的实证评估表明，POPO显著加快了强化学习微调，并以明显更少的采样次数取得了强劲的推理性能。
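The zero-variance problem described in this abstract can be illustrated with a small sketch. It assumes a GRPO-style group-standardized advantage, which the abstract does not name, and the helper names `group_advantages` and `is_effective` are hypothetical. The point is only that a group whose rewards are all equal yields all-zero advantages and therefore no gradient signal.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's response group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_effective(rewards):
    """A group carries a learning signal only if its rewards are not all equal."""
    return len(set(rewards)) > 1

mixed = [1.0, 0.0, 1.0, 0.0]        # some responses correct, some incorrect
all_correct = [1.0, 1.0, 1.0, 1.0]  # zero-variance group

print(is_effective(mixed), group_advantages(mixed))
print(is_effective(all_correct), group_advantages(all_correct))
```

A replay scheme such as the one the abstract describes would swap the second kind of group for a stored group that still has mixed rewards; the selection rule itself is not specified in the abstract and is not sketched here.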

Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks

数字孪生辅助的自适应多智能体深度强化学习（DRL），用于Open-RAN无人机赋能6G网络中的智能频谱与资源管理

Authors: Marwan Dhuheir, Thang X. Vu, Symeon Chatzinotas
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01324
Pdf link: https://arxiv.org/pdf/2606.01324
Abstract The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles (UAVs) play a pivotal role in extending coverage, enhancing resilience, and ensuring reliable connectivity for ground users. However, efficiently managing spectrum and resources in such highly dynamic UAV-assisted environments remains a major challenge due to nonlinear system interactions, mobility-induced topology variations, and stringent latency and energy constraints. To address these challenges, we propose a digital twin (DT)-assisted adaptive deep reinforcement learning (DRL) framework that enables intelligent spectrum sharing and resource allocation across distributed ground users. The complex optimization problem is decomposed into UAV trajectory optimization using particle swarm optimization (PSO) and dynamic spectrum-power-association management via multi-agent DRL (MADRL). This hybrid DT-driven approach empowers intelligent, context-aware decision-making and adaptive coordination among UAVs. Extensive simulations demonstrate significant gains in spectral efficiency, data rates, and energy utilization, showcasing a transformative path toward self-evolving, autonomous connectivity between 6G UAVs and ground users (GUs).
中文摘要 向6G无线网络的演进设想了一个无缝智能、支持Open-RAN的架构，其中无人机（UAV）在扩展覆盖、增强韧性以及确保地面用户的可靠连接方面发挥关键作用。然而，由于非线性的系统相互作用、移动性引发的拓扑变化以及严格的时延和能量约束，在如此高度动态的无人机辅助环境中高效管理频谱和资源仍是一大挑战。为应对这些挑战，我们提出了一个数字孪生（DT）辅助的自适应深度强化学习（DRL）框架，实现分布式地面用户之间的智能频谱共享和资源分配。该复杂优化问题被分解为两部分：利用粒子群优化（PSO）进行无人机轨迹优化，以及通过多智能体DRL（MADRL）进行动态的频谱-功率-关联管理。这种DT驱动的混合方法使无人机之间能够进行智能的、情境感知的决策和自适应协调。大量仿真显示，频谱效率、数据速率和能量利用率均有显著提升，展示了一条通向自我演化、自主的6G无人机与地面用户（GU）连接的变革性路径。

S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot

S2M-Trek：在轮腿机器人上通过逐帧Deep Sets实现从单球到多球的运输

Authors: Zong Chen, Xuebin Li, Jinpeng Xiao, Shaoyang Li, Ben Liu, Min Li, Zhouping Yin, Yiqun Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.01332
Pdf link: https://arxiv.org/pdf/2606.01332
Abstract We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a per-frame permutation symmetry that standard history-concatenation set encoders do not explicitly enforce; these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (\HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets (\PFDS), which performs permutation-invariant pooling within each history frame before temporal readout; we prove that \PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2 \times 2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and \PFDS reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds. We further distill the \PFDS teacher into \TactSet via DAgger, replacing privileged sphere-state observations with a $16 \times 16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
中文摘要 我们研究如何将动态移动操作（loco-manipulation）从单个自由滚动的球体扩展到多个球体：这些球体同时被放在轮腿式四足机器人的背部运输，且不使用围栏、夹爪或机械限位。多个相同的自由滚动球体构成一个没有持久身份的无序集合：它们的排序可能在每个历史帧上独立变化，从而形成一种逐帧置换对称性，而标准的历史拼接式集合编码器并未显式地保证这一点，它们只在整个历史上施加共享的对角置换对称性。我们表明，这种对称性不匹配会在基于课程的强化学习中导致一种具体的失败模式。在相同的PPO训练预算内，扁平MLP和分支式编码器停滞在两球阶段或更低，而历史拼接的Deep Sets基线（\HCDS）在我们的实验中无法越过两球阶段，除非在训练时随机化球到槽位的分配，这表明它把槽位索引当作课程捷径，而非学习与身份无关的多球动力学。我们提出逐帧Deep Sets（\PFDS），它先在每个历史帧内进行置换不变池化，再进行时间维度的读出；我们证明了 \PFDS 是 $\Gframe$-不变的，并且能够通用逼近连续的 $\Gframe$-不变策略。一项针对编码器架构和槽位随机化的 $2 \times 2$ 消融实验将架构途径与数据增强途径区分开来，\PFDS 在全部五个随机种子的仿真中都以100%无掉落的运输达到了五球阶段。我们进一步通过 DAgger 将 \PFDS 教师策略蒸馏为 \TactSet，用 $16 \times 16$ 的布尔并集接触图取代特权的球体状态观测，得到一种紧凑且天然 $\Gframe$-不变的触觉表示。
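The idea of pooling within each history frame before temporal readout can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than the authors' encoder: the feature map `phi`, the 2-D sphere states, and the helper names are hypothetical, and sum pooling stands in for whatever pooling the paper uses. The sketch only checks that independently reordering the spheres in every frame leaves the encoding unchanged.

```python
import math
import random

def phi(state):
    """Hypothetical per-sphere feature map on a 2-D position (x, y)."""
    x, y = state
    return (x, y, x * y, x * x + y * y)

def pool_frame(frame):
    """Sum-pool per-sphere features; math.fsum keeps the sum order-independent."""
    feats = [phi(s) for s in frame]
    return tuple(math.fsum(col) for col in zip(*feats))

def encode_history(history):
    """Pool inside each frame first, then keep the frames in temporal order."""
    return [pool_frame(frame) for frame in history]

def shuffle_each_frame(history, rng):
    """Apply an independent random permutation to every frame."""
    return [rng.sample(frame, len(frame)) for frame in history]

history = [
    [(0.1, 0.2), (0.4, -0.3), (-0.5, 0.0)],  # frame t-2
    [(0.2, 0.2), (0.3, -0.2), (-0.4, 0.1)],  # frame t-1
    [(0.3, 0.1), (0.2, -0.1), (-0.3, 0.2)],  # frame t
]
shuffled = shuffle_each_frame(history, random.Random(0))
print(encode_history(history) == encode_history(shuffled))
```

An encoder that first concatenates each sphere's states across frames and then pools over spheres would only be invariant when the same permutation is applied to all frames, which is the mismatch the abstract points to.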

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

所有模型都是错的，知道错在哪里才有用：论强化学习中的模型不确定性

Authors: Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.01363
Pdf link: https://arxiv.org/pdf/2606.01363
Abstract Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccuracies of the learned dynamics model are typically exploited by the agent, substantially hampering the capabilities of MBRL methods. We present a framework for dealing with inaccuracies of probabilistic models through targeted handling of uncertainty that effectively mitigates model exploitation. We present recent successes in learning directly on hardware and safe exploration, and discuss future directions for uncertainty-aware MBRL.
中文摘要 基于模型的强化学习（MBRL）从学习的动力学模型中推断环境信息，并有望解决诸如机器人中数据高效且安全的学习等未解决问题。然而，学习动力学模型的不准确性通常会被智能体利用，极大地限制了MBRL方法的能力。我们提出了一个框架，通过针对不确定性的处理来应对概率模型的不准确性，有效减少模型的利用。我们介绍了近期在硬件上直接学习和安全探索的成功案例，并讨论了不确定性感知MBRL的未来方向。

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

面向语言模型多语言推理的跨语言自洽性

Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.01464
Pdf link: https://arxiv.org/pdf/2606.01464
Abstract Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.
中文摘要 尽管多语言覆盖范围不断扩大，LLM的高级推理能力仍主要局限于英语等少数高资源语言。为此，我们提出了一种无监督强化学习（RL）方法，通过强制跨语言自洽性来增强多语言推理能力，其原则是：对于不同语言中的等价问题，模型应给出相同的最终答案。现有方法受限于多语言推理数据的稀缺，且对未见语言的泛化能力较弱。我们的方法既不需要标准答案，也不需要平行数据，在MGSM的10种语言上平均提升最高可达21.7%。此外，我们的方法展现出很强的泛化能力：在训练期间未见过的MGSM语言上平均提升18.2%，在3个分布外基准上最高提升6.2%。这些结果表明，基于一致性的方法有潜力在不需要监督数据的情况下提升LLM的多语言能力。
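One simple way to turn cross-lingual self-consistency into a reward without gold answers is a majority vote over the final answers produced for the same problem in different languages. The abstract does not state the exact reward, so the function below is a hypothetical sketch of that idea, not the paper's objective.

```python
from collections import Counter

def consistency_rewards(answers_by_lang):
    """Reward each language's answer with 1.0 if it matches the cross-lingual majority.

    answers_by_lang maps a language code to the model's final answer for the
    same underlying problem. No gold answer is used (assumed reward design).
    """
    counts = Counter(answers_by_lang.values())
    majority, _ = counts.most_common(1)[0]
    return {lang: 1.0 if ans == majority else 0.0
            for lang, ans in answers_by_lang.items()}

answers = {"en": "42", "de": "42", "sw": "41", "th": "42"}
print(consistency_rewards(answers))
```

With this design the languages that already agree act as a pseudo-label for the ones that do not; ties are broken by whichever answer was seen first, which a real system would likely handle more carefully.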

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

OmniOPD：通过推测式验证实现无Logit的在策略蒸馏

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.01476
Pdf link: https://arxiv.org/pdf/2606.01476
Abstract On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
中文摘要 在策略蒸馏（OPD）让学生模型在自身生成的轨迹上训练，并接受更强教师提供的密集token级反馈，从而同时缓解监督微调（SFT）的离策略分布偏移和强化学习（RL）的稀疏信用分配问题。然而，标准OPD面临两个相互耦合的局限。首先，它需要直接访问教师的token级logit，这使得一大类能力强的专有模型无法充当教师。其次，token级logit信号本身很脆弱：它依赖于教师与学生之间合理下一个token的狭窄重叠，并且容易放大重复循环等退化模式。本文提出OmniOPD，这是一个通过无logit的块级监督信号同时解决这两个局限的新框架。OmniOPD用蒙特卡洛采样（rollout）取代确定性的logit匹配，通过多token块上的连续语义相似度度量来近似教师的局部偏好，并借助峰值熵调度器集中监督，只在学生的高不确定性推理分叉处进行审核。狄利克雷-多项式贝叶斯先验和基座模型KL锚进一步约束了离散采样的方差，并防止策略在未审核的token上崩溃。在多个有竞争力的基准上，OmniOPD在数学任务上比标准OPD方法最多高出+28.64%，证实块级语义验证比token级logit匹配能提取更可靠的学习信号，后者的高信息密度被显著的噪声和脆弱性所抵消。此外，当与Claude-4.5-Haiku和Gemini-2.5-Flash等更强的黑盒教师搭配时，OmniOPD在数学任务上相对于使用开放权重教师的版本又取得了+9.54%的相对提升，使学生模型超越了自我探索式强化学习的表现。

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow：一款在JAX中实现的精确、GPU加速、可微分的无人机模拟器

Authors: Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.01478
Pdf link: https://arxiv.org/pdf/2606.01478
Abstract High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
中文摘要 来自仿真的高质量、大规模合成数据正成为推动机器人算法能力的基石。虽然空中机器人模拟器已经发展到能够分别支持保真度、可微性和集群等专门需求，但仍缺少一个能够在所有这些方面合成数据的统一平台。在本研究中，我们提出了Crazyflow，一款旨在推动空中机器人算法开发极限的模拟器，涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法，以及从单智能体到多智能体的系统。与现有最先进的无人机模拟器相比，它在单架无人机上的速度快了一个数量级以上，并能模拟数千个各含4000架无人机的集群。真实世界实验表明，Crazyflow既支持基于解析梯度的策略学习，在无需域随机化的情况下达到亚厘米级的轨迹跟踪精度，也支持基于采样的避障，速度超过每秒5亿步。我们打破传统的先训练后部署范式，展示了其前所未有的速度甚至可以实现飞行中的强化学习：我们将一架实体无人机抛向空中，并在0.38秒内从零开始训练出恢复策略，成功稳定了无人机。Crazyflow支持多个层次的仿真抽象，直接兼容所有开源的Crazyflie模型，并通过提供轻量级的系统辨识流水线，实现在定制无人机平台和应用之间的快速重新配置。通过同时推进精度、速度和可微性，Crazyflow成为一个用于合成数据生成的开源资源，并具备面向在线、执行中学习与优化的大规模并行化的新兴能力，为新算法的开发打开了大门。

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

分层语义增强导航：面向视觉语言导航的最优传输与图驱动推理

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.01565
Pdf link: https://arxiv.org/pdf/2606.01565
Abstract Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
中文摘要 连续环境中的视觉语言导航（VLN-CE）对自主智能体构成了巨大挑战，它要求将自然语言指令与视觉观测无缝结合，以在复杂的三维室内空间中导航。现有方法常因场景理解有限、规划效率低下以及缺乏稳健的决策框架而在长时程任务中表现不佳。我们提出了分层语义增强导航（HSAN）框架，该方法通过三项相互协同的创新重新定义VLN-CE。首先，HSAN构建动态的分层语义场景图，利用视觉语言模型捕捉从物体到区域再到分区的多层次环境表征，从而支持细致的空间推理。其次，它采用以康托罗维奇对偶为基础的最优传输拓扑规划器，通过权衡语义相关性与空间可达性来选择长期目标，并具有最优性的理论保证。第三，图感知的强化学习策略保证了精确的底层控制，在前往各子目标的同时稳健地避开障碍物。通过整合谱图理论、最优传输和先进的多模态学习，HSAN弥补了以往工作中常见的静态地图和启发式规划器的不足。在多个具有挑战性的VLN-CE数据集上的大量实验表明，HSAN达到了最先进的性能，在导航成功率和对未见环境的泛化能力上均有显著提升。

Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms

机器人集群涌现行为的物理信息建模与控制

Authors: Zixuan Jin, Wenzhuo Zhang, Shuxian Quan, Zirui Dong, Fangwen Ye, Yuchen Shi, Cheng Xu
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.01597
Pdf link: https://arxiv.org/pdf/2606.01597
Abstract Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro--macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection--diffusion--reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a reinforcement learning--PINN objective that combines task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions -- including trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue -- we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
中文摘要 机器人群体可以通过局部感知、有限的交流和去中心化决策展现出连贯的集体行为，但当行为在多个阶段展开时，建模和控制这种出现仍然充满挑战。这里我们介绍PhySwarm，一个基于物理学的微宏框架，将多阶段群体的出现表示为物理约束的密度场演化与可执行机器人运动耦合。在宏观层面，多相对流-扩散-反应模型（Macro-ADR）描述了通过定向输运、基于扩散的空间调控和行为相变，实现相依赖的群体密度演化。在微观层面，等效确定性运动模型（Micro-EDM）通过势场对流、密度梯度补偿以及速率门控或事件门控相位切换实现这些机制。神经物理控制器（NPC）将局部观测和时间记忆映射到有界物理参数，并以强化学习-PINN目标训练，该目标将任务奖励与宏观尺度密度残差及微观尺度运动一致性约束结合起来。在多个概念验证的群体任务中——包括沿途引导采集、可编队重构导航和角色自适应搜救——我们展示了PhySwarm能够在统一的物理导向建模框架内生成独特的多阶段涌现行为。所学到的密度场和物理参数为平流、扩散和反应如何共同调控多阶段群体组织提供了可解读的证据。这些结果为学习、解读和控制机器人群体中涌现行为建立了基于物理的路径。

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

TRON：面向视觉推理强化学习的有针对性、规则可验证的在线环境

Authors: Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01599
Pdf link: https://arxiv.org/pdf/2606.01599
Abstract Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with TRON consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.
中文摘要 用于视觉推理的强化学习（RL）需要可扩展、可验证且可控的训练信号。现有的视觉RL后训练在静态的人工整理数据集上进行，其固定的图像-问题-答案样本受限于数据收集预算。在本研究中，我们提出了TRON（Targeted, Rule-verifiable Online eNvironments，即有针对性的、规则可验证的在线环境），这是一种在线环境基座：每条训练轨迹（rollout）由可控的生成器-验证器程序按需生成，该程序采样一个新的潜在视觉状态，渲染图像，提出问题，并精确验证答案。因此，单次训练即可按当前课程所需的难度等级抽取无限的新实例流。当前的TRON套件包含520个环境，分为五个能力类别（空间、数学、图表、模式/逻辑和计数）；同一基座既支持在所有类别上训练的单一完整模型，也支持按类别训练的能力专家模型，且无需额外收集数据。我们还提出了一项基座分析，涵盖生成可靠性、实例与等级多样性、跨环境近似重复，以及按难度等级划分的基础模型通过率。使用TRON进行RL后训练，在Qwen3-VL-4B、Qwen2.5-VL-7B和MiMo-VL-7B-SFT上，于十个外部多模态推理基准中持续提升了性能。

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill：在智能体强化学习中协调技能创建与策略优化

Authors: Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.01619
Pdf link: https://arxiv.org/pdf/2606.01619
Abstract Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
中文摘要 智能体强化学习（RL）使LLM智能体能够根据环境奖励持续改进，但由此得到的策略并不会系统地积累可跨任务泛化的可复用策略。模块化技能可以提供这类可复用策略，然而现有的技能增强RL方法将技能创建与策略优化割裂开来，有可能采纳与不断演化的策略相冲突的技能。受Anthropic的Skill Creator启发，我们提出ReSkill，这是一个将RL置于回路中的技能创建框架，使技能演化与策略学习相协调。ReSkill利用GRPO的分组结构，以极小的额外开销自然地嵌入三种机制：（1）断言驱动的技能创建器，它根据过往经验诊断失败，并提出有条件的、基于触发器的技能修订；（2）组内rollout采样，它支持对技能版本进行受控比较，从而找出哪个版本最有利于策略当前的学习；（3）带自适应折扣的汤普森采样，用于在策略演化过程中平衡技能版本选择中的探索与利用。在多个领域中，ReSkill始终优于现有的基于记忆和基于技能的RL方法，并在未见任务上取得最大的提升。对技能生命周期的分析显示，随着策略的改进，技能会被自动创建、测试、改进和剪除，体现了技能与策略相协调的共同演化。
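Thompson Sampling with discounting, the third mechanism listed in this abstract, can be sketched as a Beta-Bernoulli bandit whose evidence decays toward the prior so that old outcomes matter less as the policy changes. The abstract describes the discounting as adaptive; the fixed factor `gamma`, the class name, and the binary reward below are simplifying assumptions made for illustration only.

```python
import random

class DiscountedThompson:
    """Beta-Bernoulli Thompson Sampling over skill versions with a fixed discount."""

    def __init__(self, n_arms, gamma=0.9, seed=0):
        self.alpha = [1.0] * n_arms  # 1 + discounted success count per version
        self.beta = [1.0] * n_arms   # 1 + discounted failure count per version
        self.gamma = gamma           # assumed constant; the paper adapts it
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per version and pick the largest sample."""
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Decay all evidence toward the Beta(1, 1) prior, then add the new outcome."""
        for i in range(len(self.alpha)):
            self.alpha[i] = 1.0 + self.gamma * (self.alpha[i] - 1.0)
            self.beta[i] = 1.0 + self.gamma * (self.beta[i] - 1.0)
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

bandit = DiscountedThompson(n_arms=2, gamma=0.9)
for _ in range(50):
    bandit.update(0, 1.0)  # version 0 keeps succeeding
print(bandit.alpha, bandit.beta, bandit.select())
```

Because of the decay, the effective sample size is capped near 1 / (1 - gamma), so a skill version that was good under an earlier policy cannot accumulate unbounded confidence.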

RDA: Reward Design Agent for Reinforcement Learning

RDA：强化学习奖励设计代理

Authors: Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01672
Pdf link: https://arxiv.org/pdf/2606.01672
Abstract Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, they rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, their trained policies achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available on this https URL.
中文摘要 强化学习已经能够习得令人印象深刻的机器人技能，但通常需要手工设计的奖励函数，而这类奖励函数设计缓慢，且难以与人类意图对齐。Eureka等近期工作使用LLM从任务描述中迭代地生成并改进奖励代码，从而实现奖励设计的自动化。然而，它们依赖成功率等粗粒度的反馈信号，这些信号几乎无法提供关于所学行为的语义信息。因此，它们训练出的策略虽然能达成最终目标，却常常与任务指令对齐不佳。我们提出奖励设计智能体（RDA），这是一个基于VLM的智能体框架，将语义理解注入奖励设计。RDA分解任务，以视觉方式评估轨迹，总结失败模式，并迭代修订奖励代码，使其更好地符合任务指令。在来自ManiSkill的12个桌面操作任务和来自HumanoidBench的4个全身操作任务上，RDA产生的策略与指令的对齐程度明显高于其他基线，同时取得了相当的任务成功率。视频和生成的奖励代码可在该 https URL 获取。

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign：迈向个性化大语言模型对齐中的普遍真理一致性

Authors: Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.01755
Pdf link: https://arxiv.org/pdf/2606.01755
Abstract Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an agent interacting. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.
中文摘要 个性化大型语言模型会根据用户偏好和社会属性调整响应，但可能会在不同社会群体之间引入显著的普遍真理不一致，比如某些群体在客观任务上系统性地获得较不准确的回答。现有的对齐方法要么忽视个性化，要么主要关注主观偏好对齐，在很大程度上忽视了普遍真理的公平与一致性。为弥补这一差距，我们研究了真理不变对齐（TIA），这是个性化大型语言模型的一种对齐问题，旨在确保普遍真理在社会群体间保持一致，同时保持个性化。我们提出了TriAlign，这是TIA首个离线多智能体强化学习（MARL）框架，每个社交群体都被建模为一个交互的智能体。TriAlign通过公平性目标和明确的不一致性惩罚，共同优化了普遍真理的准确性、跨群体的真值一致性和个性化。跨多项基准测试的实验表明，TriAlign在这三个目标之间实现了比强基线更平衡的效果，减少了社会群体间的普遍真理差异，同时提升了客观任务的表现和个性化质量。

MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand

MetaForge：一款自我进化的多模态智能体，能够按需检索、适应并锻造工具

Authors: Shouang Wei, Houcheng Min, Xinpeng Dong, Xin Lin, Sen Cui, Bo Jiang, Zhongxiang Dai, Kun Kuang, Guandong Xu, Fei Wu, Min Zhang
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.01801
Pdf link: https://arxiv.org/pdf/2606.01801
Abstract Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induced errors. We propose MetaForge, a multimodal agent framework that learns when to invoke tools and how to evolve its toolset on demand. MetaForge factorizes agentic behavior into four coupled stages: Decide (judging whether tool use is warranted), Retrieve (selecting suitable tools), Adapt (grounding tool parameters in task context), and Forge (synthesizing new skills online and recycling them into the tool library for reuse), forming a closed judge-retrieve-adapt-forge-recycle loop. A unified orchestration policy enables the agent to choose among answering directly, reusing existing tools, or forging new ones. We jointly optimize invocation necessity, retrieval accuracy, execution effectiveness, and forged-skill reusability via reinforcement learning, with an explicit invocation-cost penalty discouraging redundant calls. Across 12 benchmarks, MetaForge consistently surpasses 16 baselines in accuracy, efficiency, and generalization, validating a paradigm shift from static tool inventories to on-demand self-evolution.
中文摘要 多模态智能体通过使用工具在复杂推理任务上取得了显著进展，但仍受限于两个问题：静态预定义的工具库无法泛化到未见场景，而不加区分的工具调用会带来冗余成本和由噪声引起的错误。我们提出MetaForge，这是一个多模态智能体框架，能够学习何时调用工具以及如何按需演化其工具集。MetaForge将智能体行为分解为四个相互耦合的阶段：决策（Decide，判断是否有必要使用工具）、检索（Retrieve，选择合适的工具）、适配（Adapt，根据任务上下文确定工具参数）和锻造（Forge，在线合成新技能并将其回收进工具库以供复用），形成一个“判断-检索-适配-锻造-回收”的闭环。统一的编排策略使智能体能够在直接回答、复用现有工具和锻造新工具之间做出选择。我们通过强化学习联合优化调用必要性、检索准确性、执行有效性和锻造技能的可复用性，并设置显式的调用成本惩罚以抑制冗余调用。在12个基准上，MetaForge在准确性、效率和泛化能力方面始终超过16个基线，验证了从静态工具库到按需自我演化的范式转变。

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

CAPF：以信用衰减的特权反馈引导搜索智能体的rollout

Authors: Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01830
Pdf link: https://arxiv.org/pdf/2606.01830
Abstract Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent's submitted answer and guide revision within the rollout. We propose a training-time mechanism called Credit-Attenuated Privileged Feedback (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions to accommodate deployment without this call. Empirical research demonstrates that CAPF improves Qwen3-4B's average exact-match score from 44.7% under outcome-only RLVR to 48.5% on seven open-domain QA benchmarks.
中文摘要 近期的LLM搜索智能体使用带可验证奖励的强化学习（RLVR），从结果奖励中学习搜索增强的推理。在困难问题上，这些智能体很少能采样到端到端成功的rollout，使得仅依赖结果的RLVR几乎得不到正奖励轨迹。我们认为，要改进在这类问题上的学习，需要在训练期间提供额外的指导，而RLVR本身已经包含可以提供这种指导的验证器侧信息。这些信息能够指出智能体所提交答案中的错误或遗漏，并在rollout内部引导其修订。我们提出一种名为信用衰减特权反馈（Credit-Attenuated Privileged Feedback，CAPF）的训练期机制，它在训练期间通过一次特权反馈调用提供这种验证器侧信息。CAPF让策略把零奖励的尝试修订为获得正奖励的修复轨迹，并对反馈调用及其之前的动作所获得的信用进行衰减，以适应部署时没有该调用的情形。实验表明，在七个开放域问答基准上，CAPF将Qwen3-4B的平均精确匹配得分从仅依赖结果的RLVR下的44.7%提升到48.5%。

From Global Policies to Local Strategies: Multi-Objective Optimization of Resource-Specific Handover Policies

从全局策略到局部策略：资源特定切换策略的多目标优化

Authors: Lukas Kirchdorfer, Artemis Doumeni, Han van der Aa, Hugo A. López
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.01857
Pdf link: https://arxiv.org/pdf/2606.01857
Abstract Efficient resource allocation is a key challenge in business process management, with direct implications for cost, throughput time, and utilization. While recent Reinforcement Learning (RL) approaches have shown promise in deriving adaptive allocation policies, they typically neglect inter-resource collaboration patterns that can strongly influence real-world task handovers. Recognizing this, this paper introduces the first approach for multi-objective optimization of resource-level decision-making, enabling the recommendation of person-specific handover policies. To achieve this, our work combines an existing Multi-Agent System-based process simulator with a multi-objective evolutionary algorithm. The resulting approach produces Pareto-optimal, resource-specific policies that optimize the process across multiple objectives. Experimental results on synthetic and real-world datasets show that our approach reduces costs by an average of 37% and waiting time by 58%, consistently outperforming heuristic baselines and demonstrating the potential of leveraging collaboration-aware optimization to improve process performance.
中文摘要 高效的资源分配是业务流程管理中的一个关键挑战，直接影响成本、吞吐量时间和利用率。尽管近期强化学习（RL）方法在制定自适应分配策略方面展现出潜力，但通常忽视了资源间协作模式，这些模式会强烈影响现实任务的切换。鉴于这一点，本文介绍了首个多目标优化资源级决策的方法，使得推荐针对个人的切换政策成为可能。为此，我们的工作结合了现有的基于多智能体系统的过程模拟器与多目标进化算法。由此产生的方法产生帕累托最优的资源特定策略，优化多个目标的流程。在合成和现实世界数据集上的实验结果显示，我们的方法平均降低了37%的成本，降低了58%的等待时间，持续优于启发式基线，并展示了利用协作感知优化提升流程性能的潜力。
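
The abstract above reports Pareto-optimal policies over several objectives. As a reminder of what that selection means, here is a minimal non-dominated filter over two minimized objectives. The tuple layout `(cost, waiting_time)` and the function names are illustrative assumptions; the paper's evolutionary algorithm and simulator are not reproduced here.

```python
from typing import List, Tuple

Objectives = Tuple[float, float]  # assumed layout: (cost, waiting_time), both minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep the points that no other point dominates (quadratic-time filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each surviving point represents a policy for which no alternative is better on one objective without being worse on the other.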

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

任务诱导的表征不变性依赖于深度强化学习中的学习目标

Authors: Manu Srinath Halvagal, Sebastian Lee, SueYeon Chung
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01868
Pdf link: https://arxiv.org/pdf/2606.01868
Abstract Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection. The ability to learn abstract representations of high-dimensional state spaces underlies much of this success. However, theoretical understanding of these learned representations remains limited, hindering direct comparisons between models and animal learning. We address this gap by analyzing deep RL representations through the lens of MDP reduction theory. Investigating canonical RL algorithms in a navigation task, we find that even when performance is comparable, the value-based method (DQN) learns representations that are invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences emerge consistently across domains, have downstream consequences for transfer learning, and appear in LLMs in a prompt-dependent manner. Our findings provide a principled approach to comparing learned representations across RL algorithms, with demonstrated practical implications and possible insights for neural coding in the brain.
中文摘要 强化学习（RL）长期以来一直是神经科学中目标导向动物行为的模型。现代深度强化学习在多个领域取得了显著成功，进一步强化了这种联系。能够学习高维状态空间的抽象表示，是这些成功的基础。然而，对这些学习得来的表征的理论理解仍然有限，这阻碍了模型与动物学习之间的直接比较。我们通过MDP约简理论的视角分析深度强化学习表示来弥补这一空白。在导航任务中研究规范强化学习算法时，我们发现即使性能相当，基于值的方法（DQN）学习的表示对MDP同态对称性不变，而策略梯度法（PPO）则学习对作用对称性不变的表示。这些差异在不同领域中持续出现，对迁移学习产生后续影响，并且以提示依赖的方式出现在大型语言模型中。我们的发现为比较不同强化学习算法的学习表征提供了一种有原则的方法，并对大脑神经编码提出了实际意义和可能的见解。

Comparing ML-Specific and General Python Code Smells Across Project Characteristics

比较机器学习专用和通用Python代码在不同项目特性上的表现

Authors: Halimeh Agh, Betül Cimendag, Stefan Wagner
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.01882
Pdf link: https://arxiv.org/pdf/2606.01882
Abstract Machine learning systems consist of general-purpose code as well as machine-learning-specific code. While ML-specific code smells have been identified, their connection to project characteristics and their interaction with overall code quality are not well understood. Without this knowledge, quality assurance strategies remain one-size-fits-all, failing to account for the contextual factors that drive technical debt in ML systems. We present empirical evidence by examining how six project features (size, age, contributors, commit frequency, CI/CD adoption, and domain) relate to both ML-specific and general Python code quality in 279 open-source ML projects on GitHub. Using CodeSmile for ML code smells and Pylint for general Python smells, our results show: (1) ML code smells are 41-94 times less frequent than general Python smells; (2) commit frequency and domain are significantly associated with ML-specific quality, while project size, team size, age, and CI/CD adoption are not, challenging traditional views on technical debt; (3) general Python smells are not linked to any project characteristic, indicating systemic coding issues that are independent of project context; (4) domains that suffer most from ML-specific smells are not necessarily the same domains that suffer most from general Python smells, necessitating tailored quality strategies for each smell type. MLOps often involves configuration issues, Reinforcement Learning faces challenges with tensor manipulation, and Computer Vision encounters problems with GPU workflows. Overall, ML code quality depends on domain-specific practices and specialized CI/CD quality gates, as standard automation often overlooks domain-specific correctness problems.
中文摘要 机器学习系统包括通用代码以及机器学习专用代码。虽然机器学习特有的代码气味已被识别，但它们与项目特性的联系以及与整体代码质量的相互作用尚不充分。没有这些知识，质量保证策略往往是一刀切的，未能考虑到驱动机器学习系统技术债务的背景因素。我们通过分析六个项目特性（规模、年龄、贡献者、提交频率、CI/CD采纳和域）如何与机器学习特有和一般Python代码质量的关系，呈现了实证证据，涵盖GitHub上279个开源机器学习项目。使用 CodeSmile 进行机器学习代码气味检测，使用 Pylint 进行通用 Python 气味检测，我们的结果显示：（1）机器学习代码气味的频率是一般 Python 气味的 41-94 倍;（2）提交频率和领域与机器学习特定质量有显著关联，而项目规模、团队规模、年龄和CI/CD采用率则不相关，挑战了传统对技术债务的看法;（3）一般的 Python 嗅觉与任何项目特征无关，表明存在与项目上下文无关的系统性编码问题;（4）最常受机器学习特异气味影响的领域，不一定与受一般Python气味影响最多的领域相同，因此需要针对每种气味类型量身定制质量策略。MLOps常伴随配置问题，强化学习在张量操作上面临挑战，计算机视觉在GPU工作流程中遇到问题。总体而言，机器学习代码质量依赖于领域特定的实践和专门的CI/CD质量门，因为标准自动化往往忽视了领域特定的正确性问题。

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

社区意识评估社会文本参与与共鸣：以人为本的视角下用户生成内容评估

Authors: Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu, Huyang Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.01897
Pdf link: https://arxiv.org/pdf/2606.01897
Abstract Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.
中文摘要 传统的视频质量评估（VQA）则狭隘地关注美学的真实度，忽视了定义用户生成内容（UGC）质量的复杂社会动态。本研究提出从信号中心指标转向以人为中心的共振评估范式转变。我们引入了CASTER（社区意识社会文本参与与共鸣评估），这是一项新任务，评估UGC项目是否基于其多模态属性而非仅凭视觉质量获得积极社区共鸣。为此，我们提出了MEDEA（多模态参与驱动评估架构），它引入了一种新的社会思维链（Social-CoT）机制。与传统的逻辑CoT不同，Social-CoT执行多模态视角获取，实现多样的观众人格，以模拟集体认知和情感反应（即“社区心智”），然后再得出高质量判断。MEDEA通过两阶段方法进行培训，包括监督式微调和过程监督强化学习，配合社会对齐奖励，确保推理路径扎根于真实的人类社会认知。为支持此项任务，我们发布了CASTER-Bench，一个涵盖多种UGC类别的全面人工注释基准测试。实验表明，MEDEA在CASTER-Bench上显著优于最先进的基线，同时提供了可解释且富有同理心的推理路径，与真实社区反馈相符。

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

HMPO：用于思维链压缩的混合中位数长度策略优化

Authors: Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.01934
Pdf link: https://arxiv.org/pdf/2606.01934
Abstract Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
中文摘要 大型语言模型通过扩展思维链（CoT）推理实现了卓越的性能，但这一漫长过程带来了大量的推理开销。现有的CoT压缩方法面临着不灵活的手动长度预算、计算成本高的多阶段训练流程以及仅限于小模型的脆弱可扩展性。我们提出了HMPO（混合中位数长度策略优化），这是一个成本效益高的单阶段强化学习框架。HMPO通过三个协同组件高效压缩CoT：基于中位数的自适应预算，源自成功推出以消除手动调校;余弦衰减代币奖励以实现平滑长度惩罚;以及乘法奖励表述，通过严格优先考虑答案正确性，显著减少了琐碎的奖励黑客行为。HMPO完全基于数学数据进行训练，能够无缝推广数学、代码、科学和指令遵循任务。通过在密集和专家混合（MoE）架构上从9B到122B参数进行大规模扩展的实验表明，HMPO实现了19%-46%的令牌压缩率，且准确率几乎不下降，同时相比现有多阶段基线大幅降低了训练成本。
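
The three components named in the HMPO abstract (median budget from successful rollouts, cosine-decay length reward, multiplicative combination with correctness) can be sketched as follows. The abstract gives no formulas, so the exact decay shape, the `max_length` cutoff, the `floor` parameter, and all function names are assumptions for illustration only.

```python
import math
from statistics import median
from typing import List, Optional


def median_budget(lengths: List[int], correct: List[bool]) -> Optional[float]:
    """Median token length over the correct rollouts, or None if none succeeded."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length: int, budget: float, max_length: int,
                         floor: float = 0.0) -> float:
    """1.0 at or below the budget, then a half-cosine decay down to `floor`."""
    if length <= budget:
        return 1.0
    if length >= max_length:
        return floor
    t = (length - budget) / (max_length - budget)  # progress in (0, 1)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * t))


def hmpo_style_rewards(lengths: List[int], correct: List[bool],
                       max_length: int) -> List[float]:
    """Multiply correctness by the length reward, so wrong answers always get 0."""
    budget = median_budget(lengths, correct)
    if budget is None:
        return [0.0] * len(lengths)
    return [float(c) * cosine_length_reward(n, budget, max_length)
            for n, c in zip(lengths, correct)]
```

Because the two terms are multiplied rather than added, a short but incorrect answer cannot collect any length reward, which is the property the abstract attributes to the multiplicative formulation.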

Randomized Least Squares Value Iteration itself is Joint Differentially Private

随机最小二乘值迭代本身是联合微分私有的

Authors: Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.01952
Pdf link: https://arxiv.org/pdf/2606.01952
Abstract As reinforcement learning (RL) is increasingly applied to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we present a new privacy analysis that characterizes how the noise that RLSVI injects for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI, as is, is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDPs with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.
中文摘要 随着强化学习（RL）越来越多地应用于敏感领域，如医疗和推荐系统，隐私保护技术变得保护用户敏感信息至关重要。我们研究在情景化环境中保护隐私的强化学习，重点关注基于随机探索的算法，如随机最小二乘值迭代（RLSVI）。总体目标是研究随机探索如何与隐私机制所需的注入噪声相互作用。本研究展示了一项新的隐私分析，描述了RLSVI中待探索噪声如何同时提供隐私保护。具体来说，我们证明RLSVI在表格MDP中是$（\varepsilon（\delta），\delta）$-关节的差分私有，就像$\varepsilon（\delta） = \frac{2AK}{H^2\log（2HSA）} + 2\sqrt{\frac{2AK\log（1/\delta）}{H^2\log（2HSA）}}}$，其中$S$和$A$分别表示状态和动作的数量，$H$是发作的长度，$K$是发作次数。
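
To get a feel for the stated bound, the expression for $\varepsilon(\delta)$ can be evaluated numerically. The sketch below transcribes the formula from the abstract; it assumes natural logarithms, and the function name `rlsvi_jdp_epsilon` is hypothetical.

```python
import math


def rlsvi_jdp_epsilon(S: int, A: int, H: int, K: int, delta: float) -> float:
    """Evaluate epsilon(delta) from the abstract's bound (natural log assumed).

    S, A: number of states and actions; H: episode length; K: number of episodes.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # Shared ratio 2AK / (H^2 log(2HSA)) appears in both terms of the bound.
    ratio = 2 * A * K / (H ** 2 * math.log(2 * H * S * A))
    return ratio + 2 * math.sqrt(ratio * math.log(1 / delta))
```

As written, the first term grows linearly in $K$ and both terms shrink as $H$ grows, so the value of $\varepsilon$ stays small only when the number of episodes is small relative to $H^2\log(2HSA)$.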

AI-Based KPI Prediction Methods in Future 6G Networks: A Survey

基于人工智能的未来6G网络KPI预测方法：一项调查

Authors: Niloofar Mehrnia, Gourav Prateek Sharma, Samie Mostafavi, Andreas Johnsson, Sinem Coleri, Carlo Fischione, James Gross
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.01972
Pdf link: https://arxiv.org/pdf/2606.01972
Abstract The evolution from 5G to 5G-Advanced and the vision of 6G demand unprecedented levels of network performance, in which meeting stringent network Key Performance Indicators (KPIs), including capacity, latency, coverage, and reliability, is critical to supporting emerging applications such as autonomous driving, industrial automation, and immersive communications. Traditional reactive network management is insufficient in this context, driving the need for predictive, data-driven approaches. Machine Learning (ML) has emerged as a key enabler, enabling the forecasting of KPI trends from diverse data sources and thereby enabling proactive, AI-native automation in mobile networks. This survey provides the first comprehensive and systematic review of data-driven KPI prediction methods for future 6G networks. We introduce a multi-dimensional taxonomy that classifies prediction approaches by KPI type, data source, the network protocol stack at which the KPI is predicted, prediction horizon, model family, and prediction objective. Using this taxonomy, we analyze the state of the art across various KPIs, highlighting representative methods ranging from classical statistical models to deep learning and reinforcement learning. We further discuss enabling system aspects, including data collection and learning architectures, and examine deployment challenges, including data availability, scalability, privacy, and sustainability. Finally, we outline open research directions spanning new KPI definitions, probabilistic and explainable predictions. This survey aims to provide researchers and practitioners with a structured understanding of the KPI prediction landscape and a roadmap toward predictive network automation in future 6G systems.
中文摘要 从5G向5G先进发展以及6G愿景要求前所未有的网络性能，满足严格的网络关键绩效指标（KPI），包括容量、延迟、覆盖和可靠性，对于支持自动驾驶、工业自动化和沉浸式通信等新兴应用至关重要。传统的被动式网络管理在此背景下已不够，推动了对预测性和数据驱动方法的需求。机器学习已成为关键推动力，能够预测来自多样数据源的KPI趋势，从而实现移动网络中的主动、AI原生自动化。本调查首次全面且系统地回顾了未来6G网络的数据驱动KPI预测方法。我们引入了多维分类法，按KPI类型、数据源、预测的网络协议栈、预测视野、模型族和预测目标对预测方法进行分类。通过该分类法，我们分析了各类关键绩效指标（KPI）的技术最新进展，重点介绍了从经典统计模型到深度学习和强化学习等代表性方法。我们还进一步讨论了系统赋能方面，包括数据收集和学习架构，并探讨部署挑战，包括数据可用性、可扩展性、隐私和可持续性。最后，我们概述了涵盖新KPI定义、概率性和可解释预测的开放研究方向。本调查旨在为研究人员和从业者提供对KPI预测格局的结构化理解，并为未来6G系统中预测性网络自动化的路线图提供指导。

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow：多回合图像编辑的强化学习，支持流程匹配

Authors: Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.01985
Pdf link: https://arxiv.org/pdf/2606.01985
Abstract Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing--the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
中文摘要 基于指令的图像编辑的最新突破引起了广泛关注，模型现在能够以日常用户所需的实用性应对现实世界的编辑需求。然而，主要训练为单回合编辑的编辑模型在多回合编辑中常常出现故障——多回合编辑是用户根据模型自身之前输出迭代优化图像的自然交互方式。这种失败源于“全有或全无”的要求，即一次失败的回合就会破坏整个序列;以及错误传播，即曝光偏差导致编辑错误叠加。为应对这些挑战，我们引入了MT-EditFlow，一种流程匹配强化学习框架，旨在优化连续图像编辑的奖励信号。MT-EditFlow将多回合视角与多奖励表述相结合，提供适用于GRPO和基于NFT的强化学习方法的统一结构。我们系统地分析和优化奖励信号，研究回合级聚合的有效得分策略、用于平衡奖励偏差和方差的VLM推理模式，以及防止奖励黑客的优势融合水平。我们的发现表明，将整体优势广播到整个编辑轨迹，有效弥合了本地规划与全球多回合任务成功之间的差距。大量实验表明，MT-EditFlow显著提升了在不同基模型中的性能。值得注意的是，它在第三回合整体性能上提升了 FLUX.1-Kontext-dev 6.85 点，超过了诸如 Qwen-Image-Edit 等最先进的开源模型。通过保持高边际成功率和减少曝光偏差，MT-EditFlow为视觉内容创作中更可靠、更自然的人机协作奠定了基础。
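
The finding about broadcasting an aggregated advantage over the whole editing trajectory can be illustrated with a GRPO-style group normalization. The mean as the turn-level aggregator, the population standard deviation, and the function name are assumptions; the abstract says these choices were studied but does not specify which ones were selected.

```python
from statistics import mean, pstdev
from typing import List


def broadcast_advantages(turn_rewards: List[List[float]],
                         eps: float = 1e-8) -> List[List[float]]:
    """Give every turn of a trajectory the same group-normalized advantage.

    turn_rewards[i][t] is the reward of turn t in trajectory i; all
    trajectories are sampled for the same prompt and must be non-empty.
    """
    scores = [mean(r) for r in turn_rewards]       # aggregate turns per trajectory
    mu, sigma = mean(scores), pstdev(scores)       # group statistics
    return [[(s - mu) / (sigma + eps)] * len(r)    # broadcast to all turns
            for s, r in zip(scores, turn_rewards)]
```

Under this scheme an early turn is credited or blamed for the outcome of the full sequence, rather than only for its own local score.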

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP：通过环境基础的前瞻性推理，主动监管LLM代理防御

Authors: Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2606.01991
Pdf link: https://arxiv.org/pdf/2606.01991
Abstract As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
中文摘要 随着大型语言模型（LLM）代理越来越多地利用模型上下文协议（MCP）在复杂环境中运行，其动作空间的扩展为代理提供了不安全的能力，并凸显了权力寻求的风险。虽然宽广的行动空间和更大的环境影响对任务完成至关重要，但它们制造了一个脆弱的风险表面，小错误或幻觉被放大成灾难性的失败。作为回应，我们提出了SafeMCP，一个{服务器端}防御插件，通过预测推理未来安全风险来限制工具的获取。SafeMCP采用内部世界模型进行前瞻性推理，实现两层防御：主动工具过滤以限制危险电力扩张，以及作为故障保障的即时干预。为训练SafeMCP，我们引入了三阶段流程，包括环境动态接地、安全策略初始化和强化学习（RL），并具有双重可验证的奖励。对PowerSeeking bench、ToolEmu和AgentHarm的实验表明，SafeMCP实现了安全平衡，有效降低风险，同时保持了药物的效用。

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

RL-ACRGNet：基于强化学习的胸部放射报告生成网络

Authors: Yogesh Kumar Meena, Saurabh Agarwal, K.V. Arya
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.02035
Pdf link: https://arxiv.org/pdf/2606.02035
Abstract Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR dataset confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports.
中文摘要 医学影像解读是现代临床诊断的基础支柱，但人工生成放射报告仍是一个耗时且容易存在解读不一致的过程。在医疗人工智能领域，通过深度学习自动化这些描述有望简化临床工作流程并标准化诊断输出。然而，由于捕捉细粒度视觉特征和确保临床连贯性有限，准确的疾病检测和报告生成仍面临重大挑战。为解决这些问题，我们提出了RL-ACRGNet，一种改进的编码器-解码器模型，将预训练的DenseNet编码器与多级LSTM译码器集成在非策略强化学习框架内。我们通过双网络方法通过基于指标的奖励机制优化视觉语义嵌入，证明RL-ACRGNet在IU-Xray数据集上持续优于最先进基线，在BLEU-4（0.47%）、METEOR（0.17%）和ROUGE-L（0.518）方面实现了定量提升。此外，对大规模MIMIC-CXR数据集的全面评估证实了该模型的稳健泛化能力及其生成高质量、临床相关报告的能力

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

可解释的数据驱动深度强化学习方法，用于建筑物中最优的能源管理

Authors: Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer, Erfan Tajalli-Ardekani, Simnon Waczowicz, Luigi Spatafora, Veit Hagenmeyer, Benjamin Schäfer
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02049
Pdf link: https://arxiv.org/pdf/2606.02049
Abstract The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and a growing number of entities, e.g., PV systems and heat pumps, make the system harder to operate. This creates demand for additional control and optimization routes, including data-based controls such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its usage on both synthetic data and real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendrical and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elaborate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent's decision-making process.
中文摘要 可再生能源日益融入电力系统，尤其是在配备光伏（PV）电池板和储能系统的建筑中，这带来了能源系统的显著复杂性。电力生产波动、电价波动以及实体数量的增加（如光伏系统和热泵）增加了系统复杂性，使系统运行更加困难。这导致了对更多控制和优化路径的需求，包括基于数据的控制，如强化学习。虽然深度强化学习（DRL）已成为优化建筑运行在动态且日益复杂环境中的有前景解决方案，但其黑箱特性阻碍了用户的信任和实际应用。本文提出了一个可解释深度强化学习（XRL）应用于住宅建筑能源管理的框架。我们展示了其在合成数据以及来自基特大学生活实验室能源园区（LLEC）的真实世界数据上的应用。我们在扩展的状态空间中训练并比较政策内和非政策的日程车代理，该空间包含实时测量（需求、光伏发电量、电池功率、充电状态）、外部信号（动态电价、本地天气数据）、日历和节假日指标，以及需求和价格预测。我们的实验结果表明，策略内算法，特别是优势行为者批评者（A2C）和近端策略优化（PPO），在累计奖励和策略稳定性方面优于策略外方法。为了解释这些模型，我们采用事后解释技术来阐述所学到的控制策略。我们的发现表明，XRL框架不仅通过最佳电池管理降低了电费，还为代理的决策过程提供了透明且可操作的洞察。

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼共识控制

Authors: Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.02107
Pdf link: https://arxiv.org/pdf/2606.02107
Abstract This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information from only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability: policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence, with steady-state spread increasing at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.
中文摘要 本文提出了一种用于四旋翼共识控制的网络分布式多智能体强化学习（ND-MARL）框架。与依赖集中规划或完全去中心化执行的传统多智能体MARL表述相比，ND-MARL将群体通信图纳入决策过程。在二邻居通信拓扑下，每个代理只观察两个邻居的信息，并通过分布式策略输出一个动作。高级分布式共识规划器通过多智能体软性演员-批判者（MASAC）训练，并嵌入分层栈中，生成由低级四旋翼控制器跟踪的参考目标位置。结果显示，与集中式MARL控制器相比，实现了平稳的共识轨迹和规划者-跟踪器整合。最显著的是，学习的控制器表现出零样本可扩展性，在三代理系统上训练的策略可部署到最多250个代理的集群中，在同一二邻通信拓扑下无需重新训练或微调，实现了在大团队规模中因信息传播稀疏而实现稳定的收敛。这些发现凸显了ND-MARL作为分布式、通信感知四旋翼共识控制的稳定框架。

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

学习何时不采取行动：减轻智能强化学习中工具滥用

Authors: Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02132
Pdf link: https://arxiv.org/pdf/2606.02132
Abstract Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
中文摘要 代理强化学习可能导致工具滥用，即模型过度使用外部工具，即使是通过内部推理解决的查询。现有方法通过统一的工具使用惩罚或硬性限制来缓解这一问题，虽然减少工具使用频率，但也可能抑制有用的工具辅助探索。我们提出了EAPO，一种高效的代理策略优化框架，学习选择性工具的使用。EAPO在每个部署组中引入无工具轨迹，应用难度感知奖励塑形，主要惩罚对较简单查询的重复工具调用，并使用信心感知令牌重权以提升策略学习。在九个数学和知识密集型推理基准测试中，EAPO持续提升Qwen2.5-3B、Qwen2.5-7B和Llama3.1-8B的准确性效率权衡。与GRPO相比，EAPO的平均性能提升了10.45%、7.27%和9.69%，平均工具调用分别减少了18.33%、18.33%和24.59%。这些结果表明，智能体可以在不影响工具整合推理的情况下学会何时不使用工具。
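
One way to picture difficulty-aware reward shaping is to scale the tool-call penalty by how easy the query appears within its rollout group. The sketch below is an assumption-laden illustration, not EAPO's actual reward: the pass rate as a difficulty proxy, the linear penalty, the `penalty` coefficient, and the function name are all hypothetical.

```python
from typing import List


def shaped_rewards(correct: List[bool], tool_calls: List[int],
                   penalty: float = 0.1) -> List[float]:
    """Penalize tool calls on correct rollouts, more strongly on easy queries.

    The group pass rate serves as an easiness estimate: a query most rollouts
    solve is treated as easy, so its tool calls are considered more redundant.
    Incorrect rollouts keep a reward of 0 and are not penalized further.
    """
    if not correct:
        return []
    easiness = sum(correct) / len(correct)
    rewards = []
    for ok, calls in zip(correct, tool_calls):
        if ok:
            rewards.append(max(1.0 - penalty * easiness * calls, 0.0))
        else:
            rewards.append(0.0)
    return rewards
```

A rollout with zero tool calls, such as the tool-free trajectories the abstract mentions, is never penalized, so on easy queries it ranks above equally correct rollouts that called tools.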

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

具学习奖励的大型行为模型的连贯非策略改进

Authors: Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.02194
Pdf link: https://arxiv.org/pdf/2606.02194
Abstract Distilling expert demonstration data into large generative models using behavioral cloning (BC) is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning (IRL), where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy by using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.
中文摘要 利用行为克隆将专家演示数据提炼为大型生成模型，是一种可扩展的方法，用于学习机器人控制的有效策略，尤其是在灵巧操作方面。强化学习（RL）可以作为通过额外经验进一步微化这些策略的方法。一个未解之谜是，强化学习是否比收集更多人类演示更高效。先前的工作通过将强化学习应用于更小的残余策略，修正预训练模型，从而以可扩展的方式微调大型预训练策略。然而，对于典型的稀疏奖励任务，强化学习算法在以样本效率的方式优化行为时可能遇到困难。我们探讨了逆强化学习，通过专家演示学习密集的奖励函数，可能降低强化学习的微调挑战。我们特别考虑了连贯模仿学习，这是一种现实学习方法，通过使用带有理论保证的特定奖励表述，促进BC政策的改进。我们证明，我们的IRL方法在所有六个稀疏操作任务中都能保持或提升pi-0.5的性能，并在六个复杂操作任务中有五个实现$\geq的90%%成功率，优于使用稀疏奖励的基于强化学习的基线。通过确保初始预训练的微调策略对初始奖励和批评者最优，我们的方法绕过了强化学习微调中常见的初始下降，从而实现更快的改进。

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过落后者感知组大小实现更快的同步策略强化学习

Authors: Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02218
Pdf link: https://arxiv.org/pdf/2606.02218
Abstract Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
中文摘要 同步强化学习方法如群相对策略优化（GRPO）提供稳定且可重复的策略训练，但它们极易受到落后者的影响，单次异常长的部署可能会延迟整个组的奖励计算和参数更新。随着组规模的增加，这个问题变得更加严重，导致大组的优势与同步停滞带来的墙钟成本之间产生紧张关系。我们提出了落后者感知群控制（SAGC），这是一种动态的组规模控制器，根据观察到的展开行为调整训练组的在线状态。SAGC将组规模选择表述为在线受限优化问题，旨在保留更大组的优势，同时控制长期滞留事件的发生率。在同步GRPO和DAPO培训中，以及基础和强工程基准的基础上，SAGC持续减少落后者发生率，提高墙钟效率，同时实现竞争性或更好的训练回报。我们还进一步证明，这些优势转化为最终模型质量：SAGC在下游推理基准测试中与最强静态组规模基线竞争甚至更优，且通常产出更短的输出且无明显的长度惩罚。这些结果使动态群控制成为使同步策略强化学习更高效、更稳健的实用方式。
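
The claim that stragglers hurt more as the group grows follows from a simple probability argument: if each rollout is independently a straggler with probability q, a group of size G stalls with probability 1 - (1 - q)^G. The sketch below evaluates that expression and picks the largest candidate size under a stall budget. It is not SAGC's controller; the independence assumption, the static selection rule, and the function names are illustrative only.

```python
from typing import Iterable, Optional


def stall_probability(q: float, group_size: int) -> float:
    """Probability that at least one of `group_size` independent rollouts straggles."""
    return 1.0 - (1.0 - q) ** group_size


def largest_group_within_budget(q: float, candidates: Iterable[int],
                                max_stall: float) -> Optional[int]:
    """Largest candidate group size whose stall probability stays within budget."""
    ok = [g for g in sorted(candidates) if stall_probability(q, g) <= max_stall]
    return ok[-1] if ok else None
```

An online controller would re-estimate q from observed rollouts and revisit this choice during training, which is the part the abstract frames as a constrained optimization problem.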

ResMerge: Residual-based Spectral Merging of Large Language Models

ResMerge：基于残差的大型语言模型谱合并

Authors: Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang, Haiyun Guo, Hongyan An, weizhen wang, Jinqiao Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.02252
Pdf link: https://arxiv.org/pdf/2606.02252
Abstract Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at this https URL.
中文摘要 模型合并提供了一种无需训练的方式，可以将多个后训练专家模型合并，但通过强化学习（RL）获得的专家合并仍然具有挑战性。现有的频谱合并方法通常假设主导奇异方向包含主要任务信号，而低能量残余分量则可压缩、选择或衰减以减少干扰。我们发现该假设对强化学习任务向量不成立：在将每个任务向量分解为前导频谱头和残差分量后，两部分可以独立恢复大量行为知识，同时表现出不同的合并性质。头部高度集中且信息丰富，但更容易引发尖锐的跨专家冲突，而残余部分则更分散，为聚合提供更稳定的基础。基于这一观察，我们提出了ResMerge，一种面向强化学习专家的基于残差的频谱合并框架。ResMerge首先构建了一个稳定的残差骨干，采用球面残差共识适应，该方法估计了弗罗贝尼乌斯球上的可靠性加权共识方向。随后，它通过一个由专家一致认可的轻量级头部矫正模块重新引入领先头信息。跨多个强化学习专家组和能力领域的实验表明，ResMerge比代表性的任务向量和频谱合并基线更能保留专家能力。ResMerge 的实现可在此 https URL 公开获取。
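The decomposition step mentioned in the abstract (a leading spectral head plus a residual component) can be pictured with a truncated SVD. The sketch below is an assumption-laden illustration: the function name, the rank, and the random stand-in weights are hypothetical, and it does not implement Spherical Residual Consensus Adaptation or Lightweight Head Correction, whose details the abstract does not give.

```python
import numpy as np


def split_task_vector(delta: np.ndarray, rank: int) -> tuple[np.ndarray, np.ndarray]:
    """Split a weight delta into a rank-`rank` spectral head and its residual."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Head: reconstruction from the `rank` largest singular directions.
    head = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Residual: everything the head does not capture.
    residual = delta - head
    return head, residual


rng = np.random.default_rng(0)
base = rng.normal(size=(8, 6))                   # stand-in for a pre-RL weight matrix
expert = base + 0.01 * rng.normal(size=(8, 6))   # stand-in for an RL expert
delta = expert - base                            # task vector for this layer

head, residual = split_task_vector(delta, rank=2)
# Share of the squared Frobenius norm carried by the head.
head_energy = np.linalg.norm(head) ** 2 / np.linalg.norm(delta) ** 2
```

By construction the head and the residual sum back to the task vector and are orthogonal under the Frobenius inner product, so each part can be studied or merged separately.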

Dynamics Are Learned, Not Told: Semi-Supervised Discovery of Latent Dynamics Geometries For Zero-Shot Policy Adaptation

动力学是学习的，而非被告知：半监督发现潜在动力学几何以实现零射策略适应

Authors: Zhiming Xu, Weitao Zhou, Xianghui Pan, Nanshan Deng, Chengju Liu, Qijun Chen, Chenpeng Yao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.02280
Pdf link: https://arxiv.org/pdf/2606.02280
Abstract Real-world dynamics shifts pose a critical challenge for reinforcement learning in robotics, as policies tightly coupled to nominal environments often fail catastrophically when physical conditions change. Most existing methods rely on encoding explicitly identified physical parameters into a latent context, a parameter-centric paradigm that depends on pre-specified axes of variation and becomes brittle under unmodeled or compound dynamics changes. We revisit dynamics adaptation from an outcome-centric perspective: rather than telling policies what the dynamics are, we enable them to learn how dynamics affect interaction outcomes. Theoretically, this is grounded in a monotonic relationship between target-domain regret and the Lipschitz constant of a trajectory dynamics encoder. Practically, this constant can be upper-bounded through contrastive learning, yielding a smooth, task-relevant latent topology without privileged dynamics information. On MuJoCo benchmarks, our method consistently outperforms parameter-centric baselines under severe dynamics shifts, including unmodeled and time-varying parameters, while also improving in-distribution stability and latent interpretability. Overall, these results validate that controlling latent geometry is a principled mechanism for robust adaptation.
中文摘要 现实世界的动态变化对机器人强化学习构成关键挑战，因为紧密绑定于名义环境的策略在物理条件变化时常常会灾难性地失效。大多数现有方法依赖于将明确识别的物理参数编码到潜在上下文中，即依赖预先指定的变异轴的参数中心范式，并在未建模或复合动力学变化下变得脆弱。我们从以结果为中心的角度重新审视动态适应：我们不告诉政策动态是什么，而是让他们了解动态如何影响交互结果。理论上，这基于目标域遗憾与轨迹动力学编码器的利普希茨常数之间的单调关系。实际上，该常数可以通过对比学习进行上界，从而获得一个平滑、与任务相关的潜在拓扑，而无需特权动力学信息。在MuJoCo基准测试中，我们的方法在严重动态变化（包括未建模和时间变化参数）下，持续优于以参数为中心的基线，同时提升了分布内稳定性和潜在可解释性。总体而言，这些结果验证了控制潜在几何是稳健适应的原则机制。

Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

通过专家引导GRPO实现精确意图对齐VLA航天导航

Authors: Tianyang Chen, Wenjun Li, Xin zhou, Yuze Wu, Fei Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.02313
Pdf link: https://arxiv.org/pdf/2606.02313
Abstract Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate these challenges and align policy behaviors with human intents through designable feedback, but applying it to aerial navigation remains challenging due to inefficient exploration in expansive continuous spaces. To address these challenges, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core, we propose EG-GRPO (Expert-Guided Group Relative Policy Optimization) to augment online rollouts with few-shot expert data. Additionally, we design a heterogeneous pipeline enabling parallel simulation and inference, which reduces rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO improves the success rate to 2.13x that of the SFT baseline, while improving intent alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise intent-aligned flight.
中文摘要 视觉-语言-行动（VLA）模型为无人机（UAV）提供了一种有前景的端到端范式，用于完成由细粒度指令指定的复杂任务。然而，标准的监督微调（SFT）存在数据稀缺、泛化有限以及对复杂人类意图的监督薄弱的问题。强化微调为缓解这些挑战、通过可设计反馈使政策行为与人类意图保持一致提供了一种自然的方法，但由于在广阔连续空间中探索效率低下，将其应用于空中导航仍然具有挑战性。为应对这些挑战，我们引入了基于VLA的高效强化学习（RL）框架。我们的核心建议是EG-GRPO（专家指导小组相对政策优化），用少数样本专家数据来增强在线推广。此外，我们设计了异构流水线，实现并行模拟和推断，将部署时间缩短了43.5%。在多项由复杂人类意图指定的任务中，EG-GRPO将成功率提升至SFT基线的2.13倍，同时意图对齐表现提升60.9%。这些结果表明，我们的框架能够推动航空导航朝向精确、意图对齐的飞行方向发展。
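One way to see why expert data can help under inefficient exploration is to look at group-relative advantages. The sketch below uses the usual GRPO-style standardization within a group and assumes, purely for illustration, that an expert demonstration is simply appended to the online group; the abstract does not say how EG-GRPO actually mixes expert data, and the reward values are made up.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group, as in GRPO-style advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


online_rewards = [0.0, 0.0, 0.0, 0.0]   # assumed group in which every rollout failed
expert_rewards = [1.0]                  # assumed reward of one expert demonstration

plain = group_relative_advantages(online_rewards)
augmented = group_relative_advantages(online_rewards + expert_rewards)
```

In the all-failure group every advantage is zero, so the policy gradient carries no signal. Once a successful sample is present, the failed rollouts receive negative advantages and the successful one a positive advantage.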

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

受限多智能体强化学习的协调图

Authors: Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02337
Pdf link: https://arxiv.org/pdf/2606.02337
Abstract Constrained multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective-constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
中文摘要 受限多智能体强化学习（CMARL）面临两个交织的挑战：联合行动空间随着智能体数量呈指数增长，且额外需求以结构奖励无法捕捉的方式耦合智能体。我们介绍了受限多智能体强化学习协调图（CG-CMARL），该框架通过将协调图与拉格朗日对偶性结合，解决了这两个挑战。该系统将联合问题分解为两个区域，每个区域由一组共享的Q函数服务，一个代表主要目标，一个代表每个约束，因此学习模型的数量与代理数量无关。执行时，最大和消息传递坐标在因子图上的动作，而拉格朗日乘子控制目标-约束权衡，使单个训练模型无需重新训练即可追踪帕累托前缘。我们在温和条件下提供收敛保证，并附带一个组合误差界限，分解为独立可解释的源，每个源可追溯到特定设计选择且可独立控制。在合作导航任务（最多10名代理团队必须协调以达到目标位置并满足成对约束）实验显示，我们的方法产生了在固定奖励塑造比下训练的既有基线的帕累托前沿，同时在集中化方法变得难以解决的团队规模下，能够扩展到团队规模。
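The idea of tracing a Pareto front by varying a Lagrangian multiplier at execution time can be illustrated on a toy coordination graph. In the sketch below the pairwise Q-tables, the three-agent chain, and the multiplier values are all made-up assumptions, and the joint action is found by exhaustive search rather than by Max-Sum message passing, which is only practical here because the example is tiny.

```python
import itertools

import numpy as np

# Hypothetical shared pairwise Q-tables indexed by (action_i, action_j).
Q_OBJECTIVE = np.array([[1.0, 3.0], [2.5, 2.0]])    # assumed task value
Q_CONSTRAINT = np.array([[0.0, 2.0], [1.5, 0.5]])   # assumed constraint cost

EDGES = [(0, 1), (1, 2)]  # chain-shaped coordination graph over three agents


def totals(joint: tuple[int, ...]) -> tuple[float, float]:
    """Sum objective value and constraint cost over all edges."""
    objective = sum(Q_OBJECTIVE[joint[i], joint[j]] for i, j in EDGES)
    cost = sum(Q_CONSTRAINT[joint[i], joint[j]] for i, j in EDGES)
    return float(objective), float(cost)


def best_joint_action(lam: float, n_agents: int = 3) -> tuple[int, ...]:
    """Exhaustively maximize objective minus lam times constraint cost."""
    def score(joint: tuple[int, ...]) -> float:
        objective, cost = totals(joint)
        return objective - lam * cost
    return max(itertools.product(range(2), repeat=n_agents), key=score)


# Sweep the multiplier with the same tables; no retraining is involved.
pareto_points = [totals(best_joint_action(lam)) for lam in (0.0, 1.0, 3.0)]
```

For these tables the sweep yields the (objective, cost) pairs (5.5, 3.5), (4.0, 1.0), and (2.0, 0.0): a larger multiplier trades objective value for lower constraint cost while the Q-tables stay fixed.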

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI：基于内在技能的自我内化强化学习，用于LLM代理培训

Authors: Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.02355
Pdf link: https://arxiv.org/pdf/2606.02355
Abstract Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. With Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation from a closed-source large model. Our code is available at this https URL.
中文摘要 长期视野的LLM代理可以受益于可复用的技能，但现有基于技能的方法往往在训练时或推理时持续检索技能时依赖外部技能生成器，增加了工程复杂度、上下文长度和部署延迟。我们提出了“内在技能自内化强化学习”（SIRI），这是一个三阶段框架，使智能体能够在无需外部技能生成器或推理时间技能库的情况下发现、验证和内化技能。SIRI首先通过GiGPO热身策略，以获得基本的交互能力并收集成功的无技能轨迹。然后它进行自我技能挖掘，当前政策总结了自身成功基础部署中的紧凑技能，并通过技能增强和无技能的双对推广进行验证。最后，SIRI仅将有益的技能引导动作标记提炼为纯策略，利用轨迹层效用和行动层优势。推理时，代理只执行原始提示。在ALFWorld和WebShop上，配合Qwen2.5-7B-Instruct，SIRI将GIGPO从ALFWorld的0.908提升到0.930，WebShop上的0.728提升到0.813，优于基于提示、基于强化学习和内存增强的基线。进一步分析表明，我们的自挖掘策略可实现与闭源大模型蒸馏相当的性能。我们的代码可在此 https URL 访问。

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

束带-1：具备状态外化束带的搜索代理的强化学习

Authors: Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.02373
Pdf link: https://arxiv.org/pdf/2606.02373
Abstract Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at this https URL.
中文摘要 搜索代理通常被训练为对不断增长的转录的策略：模型必须决定如何搜索，同时记住所见、哪些证据有用、哪些约束尚未解决，以及哪些主张已被核实。我们认为这种表述在策略中包含了过多常规状态管理：强化学习被迫优化语义搜索决策和环境能更可靠维护的可恢复账务。我们介绍了 Harness-1，这是一个 20B 搜索代理（检索子代理），通过强化学习在有状态搜索框架中训练。该工具维护环境侧工作记忆，包括候选人池、重要性标记的策划集、紧凑的证据链接、验证记录、压缩和去重观察值，以及预算感知的上下文渲染。该政策保留语义决策：搜索什么、保留或丢弃哪些文件、验证哪些以及何时停止。在涵盖网络、金融、专利和多跳质量保证的八个检索基准中，Harness-1 平均策划召回率为 0.730，比次强的开放搜索子代理高出 +11.4 分，并与规模更大的前沿模型搜索者保持竞争力。其在保留转移基准测试中取得的优势尤为显著，表明强化学习在显式搜索状态上可以产生超越训练领域的泛化检索行为。我们的代码可在此 https URL 访问。

Policy and World Modeling Co-Training for Language Agents

语言代理的政策与世界建模共同培训

Authors: Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02388
Pdf link: https://arxiv.org/pdf/2606.02388
Abstract Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
中文摘要 强化学习（RL）通过教大型语言模型（LLM）代理哪些行为带来高回报，提升他们，但对这些行为对环境的影响几乎没有监督。世界建模（WM）可以填补这一空白，但现有方法通常需要独立的模拟器、额外的训练阶段或额外的推理时间计算。我们观察到，策略内强化学习的推广已经包含所需的信号：每个过渡都将一个动作与其下一个观测匹配。基于这一观察，我们提出了PaW，这是一个政策与世界建模的共训练框架，在强化学习期间为同一策略增加了辅助WM监督，但不改变推理范式。为了使辅助WM监督既有信息量又稳定，PaW引入了三个组成部分：基于动作熵的WM数据选择、容噪WM损失和奖励自适应损失平衡。在三个能动任务基准测试上的实验显示，跨模型和强化学习算法相比强化学习基线在不同模型中持续有显著提升。这些结果表明，标准的强化学习推广是语言代理训练中WM监督的实用来源。

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

多域强化学习中跨域干涉与恢复的局部微扰理论

Authors: Lei Yang, Siyu Ding, Deyi Xiong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.02398
Pdf link: https://arxiv.org/pdf/2606.02398
Abstract Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
中文摘要 强化学习（RL）在后续训练中提升了大型语言模型（LLM）在数学推理、代码生成、问答和创意写作（CW）等多个领域的表现，但对某一领域的训练往往会降低其他领域的表现。基于灾难性遗忘或全局梯度冲突的现有解释是不完整的：即使全模型梯度几乎正交，仍可能发生显著干涉。我们表明，单域强化学习会产生稀疏、小幅度的参数编辑，且顶位变化神经元之间的重叠较弱，而不同领域仍共享大量主动计算路径，更新方向决定它们是协同还是冲突。基于这一观察，我们在多域强化学习的局部扰动模型下证明，后期域训练主要通过二阶损伤项损害早期域，在观察到的稀疏路径结构下，该损伤项集中于低维共享冲突子空间。此外，短时间域刷新会收缩该子空间上的有害成分，从而实现选择性恢复，同时减少附带损害。与理论一致，代码 $\rightarrow$ 数学 $\rightarrow$ QA $\rightarrow$ CW 后，简短刷新 Re-Math 将数学从 57.66 恢复到 66.04，同时在其他领域基本保持性能，获得了最佳平均分 66.39。除了刷新外，对Math-QA对稀疏代理冲突坐标集进行无训练回滚，部分恢复了数学，直接提供代理级的局部损害证据。这些结果为多域强化学习中干扰与恢复提供了局部机制性解释。
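The claim that damage can be dominated by a second-order term even when gradients are nearly orthogonal follows the shape of a standard Taylor expansion. The notation below is an assumption introduced for illustration ($L_A$ for the earlier domain's loss, $\theta$ for the parameters after training on it, $\delta$ for the later-domain update, $H_A$ for the Hessian of $L_A$); the paper's local perturbation model may be defined differently.

```latex
\begin{align}
L_A(\theta + \delta) - L_A(\theta)
  &= \nabla L_A(\theta)^{\top} \delta
   + \tfrac{1}{2}\, \delta^{\top} H_A(\theta)\, \delta
   + O\!\left(\lVert \delta \rVert^{3}\right) \\
\nabla L_A(\theta)^{\top} \delta \approx 0
  \;&\Longrightarrow\;
L_A(\theta + \delta) - L_A(\theta)
  \approx \tfrac{1}{2}\, \delta^{\top} H_A(\theta)\, \delta
\end{align}
```

Under this reading, near-orthogonal gradients only remove the first-order term. The quadratic term can still be large if $\delta$ has a component along high-curvature directions of $L_A$, which is consistent with the abstract's statement that the damage concentrates in a low-dimensional shared subspace.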

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

像鸽子一样主动探索：通过能动视觉语言模型强化空间推理

Authors: Wei Deng, Xianlin Zhang, Mengshi Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.02459
Pdf link: https://arxiv.org/pdf/2606.02459
Abstract Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their use in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a dynamic cognitive map that parameterizes scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement fine-tuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at this https URL.
中文摘要 使视觉语言模型（VLMs）能够进行空间推理仍然具有挑战性。现有方法将VLM视为被动观察者，这在实际应用中较为困难。此外，强化学习方法依赖稀疏的奖励，限制了其在复杂推理任务中的有效性。受鸽子构建和利用认知地图导航的启发，我们提出了一种新的空间推理能动管道。首先，我们引入了一个新的\emph{动态认知映射}，将场景布局参数化为物体的位置和方向，作为新观察的持久记忆。其次，我们提出了一种新颖的\emph{空间断言代码（SAC）}，即Python表达式，用程序描述空间关系。通过与动态认知图谱协作，SAC支持对中间推理步骤的验证，提供密集的奖励信号。我们通过监督和强化微调来优化模型。MindCube基准测试的实验显示，整体准确率为\emph{80.5\%}，在具有挑战性强的\textsc{Rotation}子集上，准确率比目前最佳方法高出\emph{29.5}分（相对提升于\emph{53.2\%}）。我们的代码和数据是开源的，地址是这个 https URL。
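To make the idea of verifying intermediate reasoning steps against a map concrete, here is a minimal sketch. The abstract says only that SAC are Python expressions describing spatial relationships; the `Pose` fields, the predicates `left_of` and `in_front_of`, the coordinate frame, and the fraction-based reward are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float        # metres to the right of the map origin (assumed frame)
    y: float        # metres in front of the map origin (assumed frame)
    heading: float  # degrees, counter-clockwise from the +x axis (assumed)


def left_of(a: Pose, b: Pose) -> bool:
    return a.x < b.x


def in_front_of(a: Pose, b: Pose) -> bool:
    return a.y > b.y


def assertion_reward(cognitive_map: dict[str, Pose], assertions: list[str]) -> float:
    """Return the fraction of assertion expressions that hold on the map."""
    scope = {"left_of": left_of, "in_front_of": in_front_of, **cognitive_map}
    passed = 0
    for expr in assertions:
        try:
            # Builtins are removed so expressions see only the map and predicates.
            passed += bool(eval(expr, {"__builtins__": {}}, scope))
        except Exception:
            pass  # a malformed or unverifiable assertion earns no credit
    return passed / len(assertions) if assertions else 0.0


scene = {"chair": Pose(0.0, 1.0, 90.0), "table": Pose(2.0, 1.5, 0.0)}
steps = ["left_of(chair, table)", "in_front_of(chair, table)", "left_of(chair, lamp)"]
reward = assertion_reward(scene, steps)
```

Each reasoning step contributes to the score on its own, so the signal is denser than a single end-of-episode reward. Removing builtins from `eval` is not a real sandbox, so a production system would need a safer expression evaluator.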

Learning When to Translate for Multilingual Reasoning

学习何时进行多语言推理翻译

Authors: Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02465
Pdf link: https://arxiv.org/pdf/2606.02465
Abstract Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures on non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at this https URL.
中文摘要 推理语言模型（RLMs）在复杂的推理任务中表现出色，但仍存在显著的多语言推理差距，这在很大程度上是由于非英语输入中的语言理解失败所致。英语翻译可以通过以RLM更可靠的方式表达非英语输入来缓解这些失败，但当模型能够可靠地从原始查询推理时，翻译所有输入就变得不必要。为应对这一挑战，我们提出了Luar，一种语言理解边界感知强化学习框架，训练RLM在直接理解不可靠时选择性调用翻译。Luar训练模型在直接解决原始输入和推理其英文翻译之间做出选择，只有在预期翻译增强推理远超直接推理时才鼓励翻译。在多语言推理基准测试中，Luar优于标准GRPO及其他基于训练的基线，在低资源语言上表现尤为显著。进一步分析显示，Luar在直接推理充分的情况下避免了不必要的翻译，同时将其翻译调用行为扩展到看不见的低资源语言。我们的研究共同提出了一种选择性多语言推理的方法：RLM只有在其直接理解不可靠时才能学会调用翻译。该项目将通过该 https URL 公开发布

Keyword: diffusion policy

From Noise to Control: Parameterized Diffusion Policies

从噪声到控制：参数化扩散政策

Authors: Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris, Yilun Du, Bruno Castro da Silva
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.00336
Pdf link: https://arxiv.org/pdf/2606.00336
Abstract We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.
中文摘要 我们提出了参数化扩散策略（PDP），这是一种基于嵌入于学习行为流形中的低维连续参数的扩散策略学习框架。通过构建该流形，使潜在表征之间的距离反映物理轨迹间的语义相似性，我们将扩散从随机多样性机制转变为一种精确且可优化的行为引导工具。我们的方法能够实现已知策略之间的平滑插值，并高效适应新约束，而无需更新策略权重。我们证明，PDP在模拟和真实机器人实验中，在复杂多模态基准测试上的适应性能显著提升，相较于标准扩散策略，尤其是在需要综合新颖行为的场景中。

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

集合监督扩散策略：通过修正学习动作分块扩散

Authors: Zhaoting Li, Gang Chen, Javier Alonso-Mora, Cosimo Della Santina, Jens Kober
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.01865
Pdf link: https://arxiv.org/pdf/2606.01865
Abstract Diffusion policies have recently emerged as a powerful framework for robotic manipulation. However, like other behavior cloning methods, they remain vulnerable to distributional shift, often requiring human-in-the-loop interventions to correct failures during deployment. These interactions naturally provide paired supervision in the form of the robot's undesired actions and the human teacher's corrective actions. Yet existing data aggregation pipelines and standard behavior cloning losses largely ignore this negative signal from undesired actions, leading to overfitting to teacher's actions and an increasing reliance on costly expert data. To address this limitation, we propose Set-Supervised Diffusion Policy (SDP), a novel learning framework that utilizes contrastive action-chunk data to train diffusion policies from human corrections. From paired positive and negative action-chunks, SDP constructs a set of desired action-chunks and designs a training pipeline that encourages the diffusion policy to align with the set. Through extensive experiments across multiple robotic manipulation tasks, we demonstrate that SDP consistently improves policy performance, with particularly strong gains in robustness to noisy data. Moreover, SDP induces high-quality aggregated datasets, enabling more efficient and reliable policy learning from human-in-the-loop corrections. Our code is available at this https URL.
中文摘要 扩散政策最近成为机器人操作的强大框架。然而，与其他行为克隆方法一样，它们仍易受分布式转移影响，常常需要人工干预以纠正部署中的失败。这些互动自然形成了机器人不希望的行为和人类教师的纠正措施之间的成对监督。然而，现有的数据聚合流程和标准行为克隆损失大多忽视了这种不良行为带来的负面信号，导致对教师行为的过度拟合，并加深对昂贵专家数据的依赖。为解决这一局限，我们提出了集合监督扩散策略（SDP），这是一种利用对比作用块数据训练人类修正扩散策略的新学习框架。从成对的正负动作块，SDP构建一组期望的动作块，并设计训练流水线，鼓励扩散策略与集合对齐。通过对多项机器人操作任务的广泛实验，我们证明SDP持续提升策略性能，尤其在对噪声数据的鲁棒性方面有显著提升。此外，SDP还能生成高质量的聚合数据集，使得从人工干预修正中实现更高效、更可靠的政策学习。我们的代码可在此 https URL 访问。