Arxiv Papers of Today

Generated: 2026-06-19 20:08:11 (UTC+8); arXiv announcement time: 2026-06-19 20:00 EDT (2026-06-20 08:00 UTC+8)

32 relevant papers today

Keyword: reinforcement learning

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

实体雅达利：一个强大且易于访问的实时机器人强化学习平台

Authors: Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19357
Pdf link: https://arxiv.org/pdf/2606.19357
Abstract We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.
中文摘要 我们制造了一个叫Robotroller的机器人，它驱动Atari CX40+控制器，还有一个叫Atari Devbox的设备，它能在屏幕上渲染游戏帧和游戏机学习环境的奖励信号。Robotroller和Atari Devbox，加上现成的相机和台式电脑，构成了一个可用于物理世界中研究强化学习算法的系统。我们称整个系统为实体雅达利。本文详细介绍了使实体雅达利成为强大且易于访问平台的关键决策。为了让系统更稳健，我们设计了机器人滚路机，使所有运动都通过轴承完成，从而减少了磨损。此外，我们还编写了软件，监控伺服机高频状态并介入以减少应力。为了让系统更易接近，我们使用了可通过消费级3D打印机制造的经济实惠的现成组件和零件。实体雅达利组装成本不到1000美元，已经连续数周进行强化学习实验，且未发生机械故障。我们利用它验证了强化学习算法可以直接在机器人上学习，并证明即使是学习与部署之间的小幅分布变化也可能显著降低策略的性能。我们的结果强调了设备内适配对于机器人高性能的重要性。

Human-like autonomy emerges from self-play and a pinch of human data

类人自主性源自自我博弈和少量人类数据

Authors: Daphne Cornelisse, Julian Hunt, Zixu Zhang, Waël Doulazmi, Kevin Joseph, Jaime Fernández Fisac, Eugene Vinitsky
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.19370
Pdf link: https://arxiv.org/pdf/2606.19370
Abstract Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at this https URL.
中文摘要 自我游戏强化学习最近被用作无需人工数据即可训练驾驶政策的方法。它用廉价的大规模模拟替代昂贵的大规模人类驾驶演示。这种方法的一个关键局限是，通过纯粹自我游戏训练的政策可以学习到有效但与人不兼容的陌生驾驶习惯。以往的研究试图通过广泛的奖励工程和领域随机化来缓解此类行为错位，这些方法脆弱且劳动密集。我们的方法没有完全放弃人类示范，而是将其视为规范化目标，同时给予最低限度的安全达成目标。就像炖菜中的香料一样，我们发现少量人类数据能带来很大帮助：我们的方法只需30分钟的人类示范，比同类模仿学习方法少了2500倍。由此产生的政策与预期的人类轨迹协调，并在一块消费级GPU上完成15小时的培训。视频和完整源代码可在此 https 网址获取。

Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning

Insulin4RL：重症监护病房的实时胰岛素管理，用于离线强化学习

Authors: Thomas Frost, Steve Harris
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.19481
Pdf link: https://arxiv.org/pdf/2606.19481
Abstract Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset's structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.
中文摘要 离线强化学习（ORL）有望利用历史电子健康记录（EHR）数据提升临床决策质量。当前该领域的培训和评估实践高度依赖于被时间离散化为固定、规律时间区间的电子健康记录数据集。离散化会创造复杂临床场景的虚构表现，并损害回顾性模型评估的普遍性。本文介绍了Insulin4RL，一个医疗ORL数据集，包含来自真实临床轨迹的自然不规则输入和动作。Insulin4RL源自MIMIC-IV，包含超过375,000个标记决策，涵盖12,209名在重症监护室需要滴定胰岛素的患者。因此，该数据集可用于在现实的临床抽样假设下研究ORL模型的性能。我们提供了数据集的结构和特征描述，基于无模型离线强化学习的基线性能指标，以及使用拟合Q评估的标准化评估协议。最后，我们提出了未来研究的建议，这些领域可以利用本资源进行探讨。

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

通过决策树蒸馏对学习到的多智能体通信策略进行形式验证

Authors: Ahmad Farooq, Kamran Iqbal
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.19632
Pdf link: https://arxiv.org/pdf/2606.19632
Abstract Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
中文摘要 多智能体强化学习（MARL）使智能体能够通过涌现通信制定协调策略，但神经策略缺乏在无人机群和自主车辆车队中安全关键机器人部署所需的正式安全保障。我们提出了首个通过策略抽象实现多智能体通信策略安全验证的端到端框架：神经策略被提炼成可解释的决策树，然后进行形式验证，实证验证确认验证的安全属性会转移到原始网络。我们的四阶段流程包括从主体观察中提取领域特异特征、决策树蒸馏（实现97.9% +/- 1.2%的神经策略忠实度）、自动转换为带有完整特征与状态变量对应的PRISM概率模型检查器规范，以及通过结合聚合和经验邻居建模的成对分解验证概率计算树逻辑（PCTL）属性。在评估5-7名智能体下多无人机协调的矢量量化变分信息瓶颈（VQ-VIB）策略时，我们验证了18项时间逻辑属性在安全性、活性性和协作性方面，实现了88.9%的属性满意度，满足了所有五个安全阈值（碰撞概率0.3%对抗1%）。蒙特卡洛对原始神经策略的验证确认，经过验证的安全属性转移时，偏差<=0.6个百分点（95%置信区间）。离散的VQ-VIB消息相比连续方法提供了+11.6至+13.6%的百分点保真度优势，使验证速度快3-4倍。我们的框架为精炼政策抽象提供了经过实证验证的安全验证，作为深度MARL与多机器人部署正式安全工作流程之间的实用桥梁。
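The union-bound aggregation named in the abstract can be illustrated with a minimal sketch. It assumes each agent pair already has a verified upper bound on its collision probability; the function name and the per-pair value of 0.0002 are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def union_bound(pairwise_probs):
    """Upper-bound P(at least one pair collides) by the sum of per-pair bounds, capped at 1."""
    return min(1.0, sum(pairwise_probs.values()))


# Hypothetical input: 5 agents, each pair verified to collide with probability <= 0.0002.
agents = range(5)
pairwise = {pair: 0.0002 for pair in combinations(agents, 2)}

# 10 pairs, so the system-level bound is about 10 * 0.0002.
system_bound = union_bound(pairwise)
```

The bound is conservative because it ignores overlap between pairwise events, which is the price of verifying pairs separately instead of the full joint model.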

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

CTS-MoE：通过专家混合实现隐式地形适应以实现感知移动

Authors: Francisco Affonso, Matheus P. Angarola, Ana Luiza Mineiro, Aditya Potnis, Marcelo Becker, Girish Chowdhary
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19633
Pdf link: https://arxiv.org/pdf/2606.19633
Abstract Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: this https URL .
中文摘要 在不连续地形（如楼梯、缝隙和障碍物）上进行感知性腿部运动需要适应性行为，因为单一保守步态无法产生实现突变拓扑所需的预判动作。该问题被定位为多任务强化学习，引入了共享与分离之间的张力。任务使用共同的移动基底，但奖励冲突，因此策略必须共享行为，同时避免价值干扰。以往的工作仅针对一面，单一政策牺牲了专业化，层级子政策则牺牲了跨过渡和未知领域的泛化。我们提出了CTS-MoE，它结合了密集的专家参与者与基于感知的门控来组合共享行为，以及一个多重批评者，并以任务特定值头为目标，以防止干扰。该模型采用单阶段师生并发配置，实现端到端训练，处理部分可观测性并避免顺序蒸馏，任务标签仅在培训期间使用。部署时，路线完全依赖感知，允许在不依赖高级选择器或地形分类器的情况下进行地形适应。在Unitree Go1上的模拟和硬件实验显示出任务感知特性，追踪误差更低，成功率高于单一基线。项目网站：此 https URL 。

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

DF-ExpEnse：用于样本高效微调的扩散过滤探索

Authors: Calvin Luo, Chen Sun, Shuran Song
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.19656
Pdf link: https://arxiv.org/pdf/2606.19656
Abstract A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at this https URL.
中文摘要 智能机器人决策的自然配方是从预训练的生成控制策略中初始化，这些策略总结了离线体验，并将其调整到自我收集的在线体验中。我们介绍DF-ExpEnse，一种探索技术，能够提升在线体验收集的质量，从而提高微调样本效率。DF-ExpEnse 利用生成控制策略的多模态建模能力，创建一个表达性强且易于评估的候选集。然后，它利用一批评论家来确定在质量与高度探索兴趣之间最平衡的行动。在舰队环境中，DF-ExpEnse进一步促进跨智能体通信，促进团队协作探索。DF-ExpEnse 可以无缝集成于现有策略中，通过强化学习微调预训练的生成控制策略。我们通过实验验证了通过DF-ExpEnse在多种操作和移动任务中，与默认微调和替代动作选择方案相比，样本效率提升一致。项目可以在这个 https 网址找到。

Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems

多粒度注意力驱动强化学习框架，用于网络智能增强系统

Authors: Navin Chhibber, Deepak Singh, Anokh Kishore, Nikita Chawla, K. Anguraj
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.19690
Pdf link: https://arxiv.org/pdf/2606.19690
Abstract Over the past few years, web intelligent enhancement systems have increasingly relied on heterogeneous and dynamic web data to deliver personalized, context-aware services. However, traditional machine learning, deep learning, and reinforcement learning models often struggle with semantic understanding, adaptability, and scalability in continuously evolving web environments. In this research, a Multi-Granular Attention-based Reinforcement Web Intelligent Enhancement System (MGAR-WIES) is proposed to address these challenges by integrating semantic graph modeling, attention mechanisms, and adaptive reinforcement learning. Initially, heterogeneous web data comprising structured, semi-structured, and unstructured sources are collected and preprocessed to generate unified feature representations. These representations are transformed into a dynamic semantic graph, where entities and their relationships are modeled using graph embeddings enhanced by attention mechanisms to capture both local relevance and global contextual dependencies. Subsequently, an adaptive multi-agent reinforcement learning strategy leverages the attention-aware semantic states to optimize personalized web actions such as content recommendation, navigation optimization, and service adaptation. Finally, continuous online feedback is integrated to update graph representations and learning policies in real time, ensuring sustained adaptability and performance. The proposed MGAR-WIES achieved better results in terms of accuracy (80%) when compared with existing approaches.
中文摘要 近年来，网络智能增强系统越来越依赖异构且动态的网络数据，以提供个性化、上下文感知的服务。然而，传统的机器学习、深度学习和强化学习模型在不断演变的网络环境中常常在语义理解、适应性和可扩展性方面遇到困难。本研究提出了一种多粒注意力强化网络智能增强系统（MGAR-WIES），通过整合语义图建模、注意力机制和自适应强化学习来应对这些挑战。最初，收集并预处理包含结构化、半结构化和非结构化源的异构网络数据，以生成统一的特征表示。这些表示被转化为动态语义图，通过图嵌入建模实体及其关系，并结合注意力机制以捕捉局部相关性和全局上下文依赖。随后，自适应多智能体强化学习策略利用注意力感知语义状态，优化个性化网络动作，如内容推荐、导航优化和服务适应。最后，持续的在线反馈进一步整合，实时更新图表表示和学习策略，确保持续的适应性和性能。与现有方法相比，所提出的MGAR-WIES在准确率方面取得了更好的结果（80%）。

OnDeFog: Online Decision Transformer under Frame Dropping

OnDeFog：帧丢失下的在线决策变换器

Authors: Daiki Yotsufuji, Kenta Nishihara, Shoma Shimizu, Kento Uchida, Shinichi Shirakawa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19721
Pdf link: https://arxiv.org/pdf/2606.19721
Abstract In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the performance degradation caused by frame dropping, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the decision transformer to tackle frame dropping. Although DeFog can mitigate performance degradation in frame-dropping environments, since DeFog is an offline learning method, it struggles to effectively generalize to novel states not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms in DeFog with the online decision transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that our proposed OnDeFog achieves superior performance compared to ODT in environments characterized by high dropping frame rate and outperforms DeFog on datasets containing a large amount of low-reward data.
中文摘要 在具有挑战的现实强化学习应用中，通信延迟或传感器故障常导致帧丢弃，代理无法接收丢弃状态及相关奖励。为了解决帧丢弃带来的性能下降，开发了随机丢帧决策转换器（DeFog），在决策变换器中加入了更多机制以应对帧丢失问题。虽然DeFog可以在丢帧环境中减轻性能下降，但由于DeFog是一种离线学习方法，它很难有效泛化到训练数据集中未充分代表的新状态。本研究提出OnDeFog，将DeFog中的机制与在线决策变换器（ODT）整合，ODT是一种通过直接环境互动学习策略的在线强化学习方法。全面的实验评估表明，我们提出的OnDeFog在高帧率下降的环境中表现优于ODT，并且在包含大量低回报数据的数据集上表现优于DeFog。
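The frame-dropping setting described in the abstract can be sketched as a small wrapper. This is an illustrative model only: it assumes that on a dropped frame the agent keeps seeing the last delivered (state, reward) pair together with a counter of frames elapsed since that delivery. The class name and interface are hypothetical and are not the DeFog or OnDeFog implementation.

```python
import random


class FrameDropper:
    """Simulate frame dropping: a dropped frame repeats the last delivered pair."""

    def __init__(self, drop_rate, seed=0):
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)
        self.last = None               # last (state, reward) that reached the agent
        self.frames_since_update = 0   # frames elapsed since that delivery

    def observe(self, state, reward):
        # The first frame is always delivered so the agent has something to act on.
        dropped = self.last is not None and self.rng.random() < self.drop_rate
        if dropped:
            self.frames_since_update += 1
        else:
            self.last = (state, reward)
            self.frames_since_update = 0
        return self.last + (self.frames_since_update,)
```

With `drop_rate=1.0` every frame after the first is dropped, so the agent keeps receiving the initial state while the counter grows; with `drop_rate=0.0` the wrapper is transparent.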

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形老虎机：基于大型语言模型潜在几何的贝叶斯课程学习

Authors: Darrien McKenzie, Nicklas Hansen, Xiaolong Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.19750
Pdf link: https://arxiv.org/pdf/2606.19750
Abstract Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
中文摘要 强化学习（RL）是提升大型语言模型（LLM）推理能力的核心方法，其中训练效率关键在于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示，将问题选择视为带有独立臂的标准盗贼问题，忽视了任务空间的结构化和异质性。在本研究中，我们将问题采样框架为一种具有内生非平稳性的流形结构盗垒问题：问题通过模型的潜在表示空间相互关联，采样决策可以引导学习信号在该空间中的演变。为了实现这一观点，我们引入了贝叶斯流形课程（BMC），这是一种结构感知框架，将问题组织成层级任务树，并应用贝叶斯学习指导抽样。实证上，我们发现不同的抽样策略会在生产力（学习信号）、多样性（任务流形覆盖度）和效用（评估相关性）之间产生非平凡的权衡。这些结果表明，仅仅优先考虑难度不足以实现强有力的下游表现，凸显了将结构和类型意识融入问题抽样的重要性。

Temporal Self-Imitation Learning

时间自我模仿学习

Authors: Yinsen Jia, Boyuan Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19752
Pdf link: https://arxiv.org/pdf/2606.19752
Abstract Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
中文摘要 经过奖励塑造训练的长期机器人操作策略仍可通过低效率的交互利用高密度奖励，而罕见的高效行为则可能在训练中被遗忘。我们认为时间效率本身为强化学习提供了强大且未被充分利用的自我监督来源。我们介绍了时间自我模仿学习（TSIL），这是一种强化学习框架，挖掘学习过程中产生的时间高效成功轨迹，并将其转化为可重复使用的监督，用于未来政策改进。TSIL通过基于快速成功轨迹的配置条件自适应时间目标逐步优化学习，同时通过效率加权自我模仿学习保持和重放高效行为。在15个不同的长视野操作任务中，TSIL持续提升学习效率、任务完成效率、快速成功行为的重审以及对不稳定训练条件的韧性。更广泛地说，我们的结果表明，成功行为的时间结构本身为强化学习提供了可扩展的自我监督信号，超越了单纯的人工工程奖励塑造。

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

超越熵：从词元级分布偏差中学习以提升LLM推理

Authors: Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19771
Pdf link: https://arxiv.org/pdf/2606.19771
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
中文摘要 带可验证奖励的强化学习（RLVR）显著推动了大型语言模型（LLM）推理;然而，它面临着根本的优化不稳定性：统一令牌更新会导致熵崩溃，导致过早收敛到次优策略;而过度的香农熵最大化则可能导致熵爆炸，推动盲目探索走向不连贯的推理链。为解决这一二分法，我们引入了独立组合令牌（ICT）框架，将优化重点从标量不确定性转向令牌逻辑的分布性质。通过利用 Jensen-Shannon（JS）在代币日志分布之间的发散，ICT 将具有独特分布模式的代币识别为指导有效探索 LLM 的关键分支点。我们的理论分析基于香农熵和二阶雷尼熵，证明对这些代币进行选择性更新可以调节政策集中：它降低了香农熵测量的整体分布不确定性，同时控制了二阶雷尼熵捕获的概率集中。这种双重效果防止了过度集中的标记生成削弱探索，有效稳定了训练环境。实证结果显示，仅更新Qwen2.5（0.5B/1.5B/7B）模型中排名前10%的独特代币，平均pass@4提升为4.58%，最大提升14.9%，在涵盖数学、常识和奥林匹克级别的七个基准测试中，相较GRPO、20熵和STAPO基线提升14.9%。
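The three quantities the abstract relies on (Shannon entropy, second-order Rényi entropy, and Jensen-Shannon divergence) can be computed from token logits as in the sketch below. It uses the standard textbook definitions with natural logarithms; which pairs of distributions ICT actually compares is not stated in the abstract, so the functions take two arbitrary distributions and all function names are illustrative.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi2_entropy(p):
    """Second-order Renyi entropy: H_2(p) = -log sum_i p_i^2."""
    return -math.log(sum(pi * pi for pi in p))


def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The JS divergence is symmetric and bounded above by log 2, and the second-order Rényi entropy never exceeds the Shannon entropy, which is why the two entropies can capture different aspects of how concentrated a token distribution is.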

Uncertainty-Aware Reward Modeling for Stable RLHF

稳定RLHF的不确定性感知奖励建模

Authors: Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19818
Pdf link: https://arxiv.org/pdf/2606.19818
Abstract Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1) reward models cannot signal when their predictions are unreliable, since they usually act as deterministic point estimators; and (2) modern group-based policy optimization can amplify unreliable reward signals, as exemplified by GRPO's uniform treatment of rewards during advantage computation. As policies explore increasingly diverse responses, these two limitations create a critical vulnerability: unreliable reward estimates may be granted disproportionate influence, triggering severe reward hacking. We propose Uncertainty-Aware Reward Modeling (UARM), which equips reward models with calibrated uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments across HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.
中文摘要 人类反馈强化学习（RLHF）通过基于偏好数据训练奖励模型并优化策略以最大化预测奖励，从而对大语言模型进行对齐。然而，该流程面临两个根本挑战：（1）奖励模型无法表明其预测不可靠，因为它们通常作为确定性点估计器;以及（2）现代基于群体的策略优化可以放大不可靠的奖励信号，例如GRPO在优势计算中对奖励的统一处理。随着政策探索越来越多样化的反应，这两个局限性造成了一个关键漏洞：不可靠的奖励估计可能被赋予不成比例的影响，从而引发严重的奖励黑客行为。我们提出了不确定性感知奖励建模（UARM），通过基于分位数的共形预测为奖励模型配备校准不确定性，并通过异方差方差分解重新加权GRPO优势。HelpSteer、UltraFeedback和PKU-SafeRLHF的实验表明，UARM相比标准GRPO和不确定性无关基线，显著提升了奖励模型校准，减少了奖励黑客行为，并提升了下游对齐质量。
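As background for the conformal prediction step mentioned in the abstract, the sketch below shows generic split conformal prediction: a held-out calibration set of absolute residuals yields an interval radius with finite-sample coverage of at least 1 - alpha. This is the textbook procedure, not UARM's specific quantile-based variant, and the function names and sample numbers are hypothetical.

```python
import math


def conformal_radius(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual.

    If that rank exceeds n, the calibration set is too small for the requested
    coverage and the radius is infinite.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def reward_interval(predicted_reward, radius):
    """Symmetric prediction interval around a point reward estimate."""
    return (predicted_reward - radius, predicted_reward + radius)


# Hypothetical calibration residuals |true score - predicted reward|.
calibration = [i / 10 for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
radius = conformal_radius(calibration, alpha=0.2)
low, high = reward_interval(2.0, radius)
```

A wide interval flags a reward estimate as unreliable, which is the kind of signal a downstream policy update could use to discount that sample.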

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

MetaResearcher：在对抗性虚拟环境中通过自我反思强化学习扩展深度研究

Authors: Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19893
Pdf link: https://arxiv.org/pdf/2606.19893
Abstract Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.
中文摘要 深度研究代理在自主信息收集和综合方面展现出卓越能力，但其训练仍受制于模拟环境的静态性、仅靠事实检索的任务设计局限性，以及基于结果的强化学习效率低下。在本研究中，我们提出了MetaResearcher，一种新颖框架，能够在四个协同维度上扩展深度研究代理培训。首先，我们引入了一个不断演变的虚拟世界，将时间动态和对抗性错误信息注入训练环境中，迫使代理培养源信度评估和时间冲突解决技能。其次，我们设计以发现为导向的任务——包括假设生成和矛盾解决——超越简单事实检索，推动代理采取真实的研究行为。第三，我们提出了在GRPO框架内的自我反思元奖励机制，联合优化答案正确性、搜索路径效率、反射深度和工具调用多样性，直接解决先前研究中观察到的重复动作循环问题。第四，我们引入了异构多智能体群架构，包含专门的侦察器、滤波器和合成器模型，通过协调强化学习学习协作研究策略。MetaResearcher 基于 LiteResearcher 基础设施，训练无需任何边际 API 成本，同时在基准性能（GAIA、Xbench-DS）和对抗条件下的认知鲁棒性方面实现显著提升。我们展示了完整的框架设计、培训方法论和计划中的实验验证。

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

CARE：面向视频MLLM自适应推理长度的能力感知奖励塑造

Authors: Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.19927
Pdf link: https://arxiv.org/pdf/2606.19927
Abstract In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at this https URL.
中文摘要 在多模态视频推理中，基于强化学习的方法通常依赖于简单且不灵活的推理长度控制策略，这些策略无法适应模型不断演变的能力。这种不匹配可能抑制早期必要的探索，同时鼓励冗余推理和解码效率低下，尤其是在模型变得更成熟时。本文提出CARE，一种能力感知奖励塑造框架，用于多模态推理中的自适应推理长度优化。具体来说，CARE通过指数移动平均的通过率维持平滑能力估计，并利用该方法将训练划分为渐进阶段，将奖励偏好从探索导向的长式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混为一谈，CARE进一步将推理工作量与批次级统计量归一化，并引入后验放大器以增强对历史上困难样本中意外强劲表现的奖励信号。该机制可无缝集成到GRPO培训流程中，且不会产生额外的推理时间开销。多视频推理和通用视频理解基准测试的大量实验表明，CARE持续提升推理准确性，稳定强化学习，并显著提升代币效率。此外，CARE在训练过程中表现出推理长度的典型反向U轨迹，并在收敛处产生更短但更具信息量的推理轨迹，表明推理预算的有效自适应分配。我们在此 https URL 上提供了我们提出的 CARE 框架和实验的源代码。
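The competence estimate described in the abstract (an exponential moving average of pass rates that routes training into stages) can be sketched as follows. The smoothing factor, the two thresholds, and the stage names are hypothetical choices for illustration; the abstract does not give the values CARE uses.

```python
def update_competence(prev, pass_rate, beta=0.9):
    """Exponential moving average: keep a fraction beta of history, blend in the new pass rate."""
    return beta * prev + (1 - beta) * pass_rate


def stage(competence, thresholds=(0.3, 0.7)):
    """Map smoothed competence to a training stage (hypothetical thresholds and names)."""
    low, high = thresholds
    if competence < low:
        return "explore"   # reward preference favors long-form reasoning
    if competence < high:
        return "balance"
    return "concise"       # reward preference favors short reasoning


# Simulate a model whose per-batch pass rate improves over training.
competence = 0.0
history = []
for pass_rate in [0.2] * 5 + [0.6] * 10 + [0.95] * 30:
    competence = update_competence(competence, pass_rate)
    history.append(stage(competence))
```

Because the average is smoothed, a single unusually good or bad batch does not flip the stage, which is the practical reason to prefer an EMA over the raw pass rate.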

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

连接点：通过强化学习实现跨域泛化的长生命周期代理大型语言模型训练

Authors: Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.20002
Pdf link: https://arxiv.org/pdf/2606.20002
Abstract This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at this https URL.
中文摘要 本研究提出了一个通用框架，用于训练大型语言模型（LLM）以实现“连接点”（Connect the Dots，CoD），这是长生命周期代理所需的元能力：当基于LLM的AI智能体部署到环境中时，它在不断探索环境、从自身经验中学习并迭代自我更新环境上下文的同时，解决了一系列任务，从而在基于更新上下文的条件下，实现未来任务的逐步提升表现。CoD框架的主要组成部分包括：（1）端到端强化学习（RL）的算法设计和基础设施，采用长时间的展开序列，穿插于求解任务和更新上下文章节之间;（2）在培训过程中激励和激发LLMs目标元能力的任务和环境，以及在评估过程中忠实测量进展。我们展示了CoD框架的概念验证实现，包括带有细粒度信用分配的GRPO式强化学习算法，以及针对目标元能力量身定制的任务和环境（而非领域特定LLM能力或标准逐任务强化学习）。实证结果验证了CoD环境下端到端强化学习训练的有效性，并展示了在训练领域内、不同领域以及从CoD到Ralph环设置中，诱导的元能力具有分布外泛化的潜力。我们对《使命召唤》的研究连接了多条以往研究线，为大型语言模型和人工智能代理的发展开辟了新机遇。为了促进进一步的研究和应用，我们将实现文件发布于 \url{this https URL}。

VIMPO: Value-Implicit Policy Optimization for LLMs

VIMPO：面向大型语言模型的价值隐式策略优化

Authors: Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20008
Pdf link: https://arxiv.org/pdf/2606.20008
Abstract Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
中文摘要 带有可验证奖励的强化学习已成为提升大型语言模型推理能力的核心工具，但现有方法在简单性与信用分配之间面临权衡。GRPO等组相对方法避免了训练评论家（critic），但通常为每个token分配轨迹级优势。Actor-critic方法提供更密集的学习信号，但需要一个学习得到的价值函数，而它自身的训练并不稳定。我们提出VIMPO，一种无评论家的策略优化方法，它从KL正则化强化学习的最优性条件中推导出策略隐含的价值函数。对于自回归生成，所得的价值递推式可以用策略与参考模型的对数比表示，并以轨迹结束时不再有未来奖励这一终止条件为锚点。由此得到一个简单的价值损失，无需训练评论家即可纳入结果级的可验证奖励。同样的推导还给出了无评论家的actor优势，使VIMPO能够把通过价值损失纳入奖励与通过PPO式actor更新进行策略改进这两件事分开。在数学RLVR基准测试中，VIMPO在MATH-500、AIME 2024、AIME 2025和OlympiadBench上均优于GRPO，在竞赛式评估中提升尤其明显。在噪声奖励下，VIMPO始终保持对GRPO的优势，表明策略隐含价值优化可以在保持无评论家训练的实用简洁性的同时，提供更精细的信用分配。
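
As a point of reference for the "trajectory-level advantage assigned to every token" that the abstract attributes to group-relative methods, here is a minimal sketch. It assumes the commonly used mean/standard-deviation normalization within a sampling group; the function names and the `eps` constant are illustrative and are not taken from the VIMPO paper, and the sketch does not implement VIMPO itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy the single trajectory-level advantage onto every token of a rollout."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Four sampled answers to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]  # hypothetical response lengths

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)
```

Every token in a rollout receives the same number, which is the coarse credit assignment that denser, per-token signals aim to refine.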

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

多智能体游戏中的层级控制：基于LLM的规划与强化学习执行

Authors: Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20014
Pdf link: https://arxiv.org/pdf/2606.20014
Abstract Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.
中文摘要 强化学习（RL）在序贯决策方面取得了强劲表现，但由于奖励稀疏、状态-动作空间庞大以及协调策略难以学习，扩展到复杂的多智能体环境仍然具有挑战性。我们提出一种分层架构，其中预训练大型语言模型（LLM）作为集中式战略控制器，为一队智能体在专门的强化学习技能策略之间进行选择，而强化学习策略则负责反应式的低层执行。我们在竞争性的2v2“山丘之王”（King of the Hill）环境中，将该混合系统与行为树（BT）基线和“Flat”强化学习（无技能分解的端到端训练）基线进行了对比评估。LLM+RL系统取得了与手工设计的BT在统计上相当的任务表现（胜率46.4%对51.5%，$p=0.103$），且两者都显著优于未做技能分解训练的Flat RL。一项用户研究（$n=15$）显示，60%的参与者认为LLM+RL智能体最像人类（$p=0.027$），理由包括行为适应性和战术多变性。这些结果表明，预训练LLM的推理能够有效编排预训练的强化学习技能，在无需手动规则工程的情况下实现有竞争力的多智能体协调和更高的感知可信度。

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

一个用于机器人移动履约系统高效寻路的神经形态强化学习框架

Authors: Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20031
Pdf link: https://arxiv.org/pdf/2606.20031
Abstract Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.
中文摘要 动态的环境变化、受限的工作空间和严格的实时约束，使机器人移动履约系统（RMFS）中的寻路对传统的基于搜索和基于规则的方法而言成为难题，这些方法通常存在计算复杂度高、决策延迟长的问题。虽然强化学习（RL）已成为一种强大的替代方案，但在资源受限的硬件上以极高能效部署学习到的策略仍是一个开放的挑战。我们提出SDQN-RMFS，一个端到端框架，实现了强化学习训练策略从全精度人工神经网络（ANN）到神经形态芯片的高保真部署。该框架仅在稀疏事件触发时才进行计算，从而实现超低功耗的RMFS寻路。我们的全栈流水线工作原理如下：先通过允许碰撞的策略高效训练ANN策略，以增加有信息量的轨迹的密度，然后通过硬标签知识蒸馏方法将其转换为脉冲神经网络（SNN）。这有效解决了输出分布不匹配问题，在ANN到SNN的流水线中保留了策略能力，同时大幅降低推理延迟。硬件实验显示，与高性能GPU基线相比，能耗节省最高达11,281倍，延迟降低近一半，同时决策质量与原始训练策略相当。这些结果确立了物理神经形态推理作为大规模RMFS运行的一条实用且能源可持续的途径。

Process-Verified Reinforcement Learning for Theorem Proving via Lean

基于Lean的定理证明过程验证强化学习

Authors: Minsu Kim, Se-Young Yun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20068
Pdf link: https://arxiv.org/pdf/2606.20068
Abstract While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
中文摘要 可验证奖励强化学习（RLVR）通常依赖单一的二元验证信号，而形式推理中的符号证明助手则能提供丰富且细粒度的结构化反馈。结构化过程与非结构化奖励之间的这一差距，凸显了既密集又可靠的反馈的重要性。在本研究中，我们展示了Lean证明助手本身可以充当符号化的过程预言机，在训练期间同时提供结果级和细粒度策略（tactic）级的已验证反馈。证明尝试被解析为tactic序列，Lean的繁饰（elaboration）过程既标记出局部可靠的步骤，也标记出最早失败的步骤，从而产生植根于类型论、以验证器为依据的密集信用信号。我们将这些结构化奖励纳入GRPO式的强化学习目标，采用首个错误传播和首token信用方法，以平衡结果级与过程级优势。基于STP-Lean和DeepSeek-Prover-V1.5的实验表明，tactic级监督在大多数设置下优于仅使用结果的基线，在MiniF2F和ProofNet等基准上带来了提升。除了实证收益外，我们的研究还强调了一个更广泛的视角：符号证明助手不仅是评估时的验证器，还能在训练过程中充当过程级奖励预言机。这为结合语言模型的可扩展性与符号验证的可靠性、面向形式推理的强化学习框架开辟了道路。
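
To make the idea of tactic-level credit concrete, the sketch below turns "locally sound steps plus the earliest failing step" into a per-tactic signal. It is only an illustration: the reward values, the choice to give zero credit to tactics after the first error, and the function name `tactic_credits` are assumptions, not the scheme used in the paper, and no Lean API is called.

```python
def tactic_credits(num_tactics, first_error=None, ok_reward=1.0, fail_penalty=-1.0):
    """Per-tactic credit from a verifier report (hypothetical reward values).

    first_error is the index of the earliest failing tactic, or None if every
    tactic was accepted.
    """
    credits = []
    for i in range(num_tactics):
        if first_error is None or i < first_error:
            credits.append(ok_reward)      # step checked as locally sound
        elif i == first_error:
            credits.append(fail_penalty)   # earliest failing step
        else:
            credits.append(0.0)            # after the first error: not verified
    return credits


# A five-tactic proof attempt whose third tactic (index 2) is rejected.
failed_attempt = tactic_credits(5, first_error=2)
complete_proof = tactic_credits(3)
```

An outcome-only baseline would instead give the whole failed attempt a single zero, which is the contrast the abstract draws.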

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

基于多头注意力的特征提取器与软演员-评论家算法的集成，用于增材制造中的孔隙率预测和工艺参数优化

Authors: Kianoush Aqabakee, Leonardo Stella
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20087
Pdf link: https://arxiv.org/pdf/2606.20087
Abstract Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent's ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.
中文摘要 增材制造工艺优化需要精确的参数控制，以最大限度减少孔隙率等缺陷。传统的强化学习（RL）方法使用离散动作空间，收敛缓慢且易陷入局部最优，限制了其在高精度制造任务中的有效性。本研究采用连续动作空间，并结合一种将多头注意力机制与软演员-评论家（SAC）算法相集成的新颖架构，以解决这些局限性。基于注意力的特征提取器增强了智能体捕捉低维输入特征细微变化的能力，使其在具有局部极小值的价值空间中实现更有效的探索与利用平衡。我们在激光粉末床熔融的孔隙率预测和工艺参数优化上验证了该方法，结果表明，与包括DQN、PPO、TD3和原版SAC在内的标准强化学习方法相比，其收敛更快、最终奖励值更高。该方法在14个回合内达到322.79的收敛值，优于现有方法，同时在整个训练过程中保持稳定。

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

均值分位数：一种无需奖励加成项的集成方法，用于极小极大最优强化学习

Authors: Asaf Cassel, Aviv Rosenberg
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20107
Pdf link: https://arxiv.org/pdf/2606.20107
Abstract Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.
中文摘要 最优强化学习（RL）算法通常依赖精心构建的基于计数的不确定性估计来驱动探索。这类估计虽然在理论上可靠，但在实际场景中难以计算，因此对设计探索启发式方法的指导有限。与此同时，集成（ensembling）已成为一种实用方法，但仍缺乏理论依据。基于近期一种针对多臂老虎机的集成方法，我们提出了一种基于分位数的集成方法，用于有限时域马尔可夫决策过程（MDP）。我们这种简单的无计数方法达到了最优的方差相关遗憾界，为强化学习中基于集成的探索提供了理论基础。

Augmenting Game AI with Deep Reinforcement Learning

用深度强化学习增强游戏AI

Authors: Alessandro Sestini, Joakim Bergdahl, Amir Baghi, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Linus Gisslén
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20210
Pdf link: https://arxiv.org/pdf/2606.20210
Abstract Immersion in video games depends not only on graphics, audio, and game mechanics, but also on the quality of in-game characters. Producing believable characters, or game AI, remains a significant challenge as behavioral complexity is hard to capture with hand-coded systems. Game AI is a source of immersion and engagement; however, the limitations stemming from the challenges of creating game AI often lead to frustration and the breaking of the illusion of realism within the game. The introduction of machine learning models opens the door to creating more believable, authentic, and relatable characters in games. The promise is that they either learn from interacting with the game, or from player data, to develop true human-like behavior. In this paper, we envision more applications of reinforcement learning for game AI in the future. For this to materialize, current research limitations are prohibitive to broad deployment across game genres. Therefore, we propose a framework for training reinforcement learning models with a set of requirements in mind that are suited towards game AI and game development. We present examples of games with reinforcement learning-augmented game AI and describe the practicalities of deploying player-facing machine learning agents in modern games. Furthermore, we identify bottlenecks and hard problems in these areas, which we believe offer promising research directions to accelerate the adoption of machine learning in game AI for the video game industry.
中文摘要 电子游戏的沉浸感不仅取决于画面、音频和游戏机制，还取决于游戏内角色的质量。制作可信的角色（即游戏AI）仍是一项重大挑战，因为手工编码的系统难以刻画行为的复杂性。游戏AI是沉浸感和参与感的来源；然而，开发游戏AI的种种困难所带来的局限常常令人沮丧，并打破游戏内的真实感幻象。机器学习模型的引入为在游戏中创造更可信、更真实、更能引起共鸣的角色打开了大门。其前景在于，模型可以通过与游戏交互或从玩家数据中学习，发展出真正类人的行为。本文展望了强化学习未来在游戏AI中的更多应用。但要实现这一点，当前研究的局限仍妨碍其在各类游戏中的广泛部署。因此，我们提出了一个训练强化学习模型的框架，其设计考虑了一组适合游戏AI和游戏开发的需求。我们展示了采用强化学习增强游戏AI的游戏示例，并描述了在现代游戏中部署面向玩家的机器学习智能体的实际问题。此外，我们还指出了这些领域的瓶颈和难题，并认为它们为加速电子游戏行业在游戏AI中采用机器学习提供了有前景的研究方向。

A Multi-Agent system for Multi-Objective constrained optimization

一个用于多目标约束优化的多智能体系统

Authors: Federica Filippini
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.20236
Pdf link: https://arxiv.org/pdf/2606.20236
Abstract Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
中文摘要 计算和网络系统中的许多决策问题可以自然地表述为性能约束下的成本最小化问题。在动态环境中，强化学习（RL）常被用于在运行时解决此类问题：遵循受拉格朗日方法启发的形式，通过加权惩罚项将成本和约束违反嵌入单一标量奖励中。然而，在这种情况下，所学策略的行为严重依赖于这些权重的选择，而权重通常是手动选定的。这使得在优化主要目标与有效避免约束违反之间难以找到合适的权衡，尤其是在两者相对重要性可能变化的非平稳环境中。本文提出了MAMO（Multi-Agent system for Multi-Objective constrained optimization，用于多目标约束优化的多智能体系统），一种通过多智能体强化学习解决这一平衡问题的方法。MAMO将奖励权重的选择表述为一个学习问题，从而把任务执行与目标设计解耦，为在动态环境中以更自主、更稳健的强化学习方案求解约束优化问题迈出了第一步。
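
The weighted-penalty reward that the abstract describes as the usual starting point can be sketched as follows. The hinge on violations, the example numbers, and the name `penalized_reward` are illustrative assumptions; the sketch shows only the hand-tuned baseline, not MAMO's learned weight selection.

```python
def penalized_reward(cost, violations, weights):
    """Scalar reward: negative cost minus weighted constraint violations.

    violations[i] > 0 means constraint i is violated by that amount;
    non-positive values are treated as satisfied and add no penalty.
    """
    if len(violations) != len(weights):
        raise ValueError("one weight is needed per constraint")
    penalty = sum(w * max(0.0, v) for w, v in zip(weights, violations))
    return -cost - penalty


# Same cost and violations, two hand-picked weight vectors (hypothetical values).
cost = 2.0
violations = [0.5, -0.1]  # first constraint violated, second satisfied

strict = penalized_reward(cost, violations, weights=[10.0, 3.0])
lenient = penalized_reward(cost, violations, weights=[1.0, 3.0])
```

The same transition scores -7.0 or -2.5 depending only on the weights, which is the sensitivity that motivates treating weight selection as a learning problem.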

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

ELVA：探索排名驱动的通用多模态检索

Authors: Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20280
Pdf link: https://arxiv.org/pdf/2606.20280
Abstract Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positives and negatives. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.
中文摘要 通过对比学习利用多模态大型语言模型（MLLM）已成为提升通用多模态检索（UMR）性能的主流范式。然而，以往的研究在将对比范式应用于检索任务时，忽略了粒度盲区（grain blindness）问题。粒度盲区指模型倾向于忽略查询中包含的细粒度信息，而这些信息对于有效处理复杂查询至关重要。这源于对比学习将样本按二元分类（正/负）处理，而忽略了每个负样本所携带的不同信息。为此，我们认为应根据负样本与正样本的相似度对其区别对待，使模型能够从每个负样本中学习不同的粒度信息。本文提出了一个简单而有效的框架ELVA，这是一个新颖的基于规则的强化学习框架，通过排序驱动的MLLM缓解粒度盲区。1）我们不依赖奖励模型，而是将带可验证奖励的强化学习（RLVR）扩展到检索任务，使模型能够在没有显式排序标签的情况下探索新的排序行为。2）借助基于规则的奖励，我们的方法在联合优化负样本排序的同时，扩大正负样本之间的相似度差距。为了更精确地衡量粒度盲区，我们进一步引入了MRBench，这是一个专为多粒度查询场景设计的新基准。ELVA在标准检索基准上取得了最先进的结果，其在MRBench上13.1%的显著提升进一步证明了其在缓解粒度盲区方面的有效性。

A Model-Driven Approach for Developing Families of Reinforcement Learning Environments

一种用于开发强化学习环境族的模型驱动方法

Authors: Xiaoran Liu, Istvan David
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20324
Pdf link: https://arxiv.org/pdf/2606.20324
Abstract Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents in real-world settings. However, to converge, most realistic RL problems require training in multiple, mostly similar but slightly different environments - i.e., families of environment variants. The typical development process of environment families is a labor-intensive and error-prone manual endeavor that does not scale well. To alleviate these issues, in this paper, we propose a model-driven approach for developing families of RL training environments. To obtain the family of environments, we develop an approach and prototype tool. In our approach, a hybrid genetic algorithm - a combination of population-based global search and heuristic local search - generates environment families. Mutations and constraints are expressed as model transformations and are operationalized into a search process by a state-of-the-art model transformation engine. We demonstrate the soundness of our approach in a wildfire mitigation scenario and curriculum learning - a particular learning paradigm that relies on environment families.
中文摘要 虚拟训练环境是软件密集型系统，强化学习（RL）智能体在其中学习、适应并展示有意义的行为。与在真实世界中训练智能体相比，虚拟训练环境是一种安全且经济高效的替代方案。然而，为了收敛，大多数现实的强化学习问题需要在多个大体相似但略有差异的环境中训练，即环境变体族。环境族的典型开发过程是一项劳动密集且易出错的手工工作，难以扩展。为缓解这些问题，本文提出了一种用于开发强化学习训练环境族的模型驱动方法。为了获得环境族，我们开发了一种方法和原型工具。在我们的方法中，一种混合遗传算法（基于种群的全局搜索与启发式局部搜索的结合）生成环境族。变异和约束被表达为模型转换，并由先进的模型转换引擎落实为搜索过程。我们在野火缓解场景和课程学习（一种依赖环境族的特定学习范式）中展示了该方法的合理性。

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX：快速安全强化学习基准测试

Authors: Tristan Tomilin, Mourad Boustani, Mickey Beurskens, Thiago D. Simão
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20376
Pdf link: https://arxiv.org/pdf/2606.20376
Abstract Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.
中文摘要 在机器人和自动驾驶等真实世界领域部署强化学习（RL）智能体时，安全是核心关注点。虽然基准测试一直是强化学习进步的核心，但现有的具有高保真3D物理的安全基准计算速度仍然较慢，限制了大规模实验和快速原型开发。为弥补这一空白，我们提出了CRAX（Constrained RL Accelerated with JAX，即用JAX加速的约束强化学习）。CRAX构建于具有逼真三维动力学的MuJoCo XLA（MJX）物理引擎之上，利用向量化运算和硬件加速，相比同类基于CPU的安全基准最高可提速约100倍。该基准包含六个环境套件和三个针对特定智能体的任务，每个都涵盖三个难度等级。对六种流行的安全强化学习方法的评估表明，没有任何单一方法能在所有任务中占优，并揭示了性能与安全性之间的权衡。我们发现，跨难度等级的课程学习和安全迁移相比直接在更难的设置中训练，能够提升表现。

Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

直接优势估计，实现可扩展且样本高效的深度强化学习

Authors: Hsiao-Ru Pan, Bernhard Schölkopf
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20411
Pdf link: https://arxiv.org/pdf/2606.20411
Abstract Direct Advantage Estimation (DAE) has been shown to improve the sample efficiency of deep reinforcement learning algorithms. However, its reliance on full environment observability limits its applicability in realistic settings, and its requirement to model transition probabilities incurs substantial computational overhead for high-dimensional observations. In the present work, we address both limitations. First, we extend the theoretical framework of DAE to partially observable domains with minimal modifications. Second, we reduce its computational complexity by introducing discrete latent dynamics models that efficiently approximate transition probabilities. We evaluate our approach on the Arcade Learning Environment and find that DAE scales effectively with function approximator capacity while retaining high sample efficiency.
中文摘要 直接优势估计（DAE）已被证明能提高深度强化学习算法的样本效率。然而，它对环境完全可观测的依赖限制了其在现实场景中的适用性，且其需要对转移概率建模，这在高维观测下会带来大量计算开销。在本工作中，我们同时解决了这两个局限。首先，我们以极少的修改将DAE的理论框架扩展到部分可观测的领域。其次，我们通过引入能高效近似转移概率的离散潜在动力学模型，降低了其计算复杂度。我们在街机学习环境（Arcade Learning Environment）上评估了我们的方法，发现DAE能随函数逼近器容量的增大而有效扩展，同时保持高样本效率。

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

TaCauchy：一个用于视觉触觉仿真的可扩展有限元（FEM）框架

Authors: Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu, Weihua He, Shoujie Li, Haohuan Fu, Wenbo Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.20426
Pdf link: https://arxiv.org/pdf/2606.20426
Abstract Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
中文摘要 基于视觉的触觉传感器需要高保真仿真来进行强化学习，但现有方法难以在GPU加速的机器人平台内提供准确的力学应力场。我们提出TaCauchy，一个可扩展的有限元法（FEM）框架，将严谨的基于物理的力计算集成到Isaac Sim中。TaCauchy基于统一增量势接触（UIPC）求解器，直接由超弹性本构定律计算柯西应力张量，并将其投影到接触面上以获得牵引力和压力分布，从第一性原理而非经验估计出发提供力学真值。我们的框架具备带几何感知自适应细化的自动网格生成功能，以及模块化传感器接口，能够以最少的配置快速集成多种传感器（GelSight Mini、DIGIT、9DTact）。性能基准测试显示，单环境可达33.40 FPS，60个并行环境的总吞吐量为555 FPS，应力提取开销低于1毫秒。物理验证实验显示，在1.2556 N至4.7332 N的受力范围内，仿真与真实触觉响应高度一致，SSIM超过0.93，证实该框架能够为下游机器人操作任务提供准确、有物理依据的力监督。
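
The step of projecting a Cauchy stress tensor onto a contact surface follows the standard continuum-mechanics relation t = σ·n for a unit surface normal n. The sketch below applies it to a single hand-written tensor and splits the traction into normal pressure and shear; the numbers, units, and function names are illustrative, and nothing here reflects TaCauchy's or UIPC's actual interfaces.

```python
def traction(sigma, n):
    """Traction vector t = sigma . n on a surface with unit normal n."""
    return [sum(sigma[i][j] * n[j] for j in range(3)) for i in range(3)]


def pressure_and_shear(sigma, n):
    """Split the traction into contact pressure (compression positive) and shear."""
    t = traction(sigma, n)
    normal_component = sum(t[i] * n[i] for i in range(3))
    shear = [t[i] - normal_component * n[i] for i in range(3)]
    return -normal_component, shear


# Hypothetical symmetric stress state (e.g. in kPa) at one surface point.
sigma = [
    [-2.0, 0.0, 0.5],
    [0.0, -2.0, 0.0],
    [0.5, 0.0, -3.0],
]
n = [0.0, 0.0, 1.0]  # outward unit normal of the contact surface

t = traction(sigma, n)
pressure, shear = pressure_and_shear(sigma, n)
```

Here the surface carries a pressure of 3.0 and a shear of 0.5 along x, both read directly from the stress tensor rather than estimated from images.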

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

自主导航中注视引导主动感知的快速人类注意力预测

Authors: Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.20491
Pdf link: https://arxiv.org/pdf/2606.20491
Abstract Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.
中文摘要 人类视觉注意力依赖结构化的扫描路径来高效处理场景，但将这种行为引入机器人自主系统仍处于起步阶段，且受限于现有预测模型的高计算成本。为此，我们提出GazeLNN，一个计算上轻量的扫描路径预测模型，它以液态神经网络（Liquid Neural Networks）作为循环引擎，并采用MobileNetV3进行特征提取。该架构以自回归方式运行，根据当前视觉刺激和注视历史预测序列化的注视热图。尽管只需0.61 GFLOPs，GazeLNN仍在MIT Low Resolution数据集上取得了最先进的性能，ScanMatch得分为0.47。它在多种评估指标上优于现有的循环基线，同时将计算成本降低了99.40%，并将推理速度提升至最多六倍。为了研究人类注意力建模在机器人自主性中的作用，并展示这一高效架构的实用价值，我们将GazeLNN整合进通过强化学习训练的主动相机-机器人控制策略中。这一整合实现了自主导航过程中由人类注视引导的感知，并通过在空中机器人上的成功实地部署得到验证。

Keyword: diffusion policy

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

DiffusionVS：基于扩散策略的鲁棒视觉伺服生成式框架

Authors: Hongkang Cui, Rui He, Haoyao Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.19397
Pdf link: https://arxiv.org/pdf/2606.19397
Abstract Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
中文摘要 视觉伺服是机器人操作和导航中的一项基础技术。基于回归的视觉伺服由于单步映射对噪声敏感，且在分布偏移时误差不断累积，经常出现轨迹抖动。相比之下，扩散策略（Diffusion Policy）通过预测动作序列保持时间一致性，并通过隐式数据增强提升鲁棒性。本文提出了一种新的基于扩散的伺服方法。该方法以扩散策略为基础，使用观测到的标签角点的归一化图像坐标作为输入，并通过条件去噪生成相机速度。为克服在静态数据集上训练的模型的泛化局限，我们采用了在线训练范式，通过交互式经验收集不断扩展训练数据的多样性。这一策略显著提升了模型的性能和泛化能力。全面的仿真和真实世界实验证明了该方法的有效性，在仿真中成功率接近100%，在物理实验中成功率为93%。除这一具体流程之外，我们还进一步验证了扩散机制的通用性。实验表明，现有的视觉伺服网络在与我们基于扩散的模块集成后，性能均得到一致提升。这些结果表明，所提策略具有广泛的适用性，能够增强本文所述具体架构之外的多种视觉伺服系统。
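
The "normalized image coordinates of observed tag corners" used as policy input are conventionally obtained from pixel coordinates with the pinhole intrinsics, x = (u - cx) / fx and y = (v - cy) / fy. The sketch below assumes an undistorted pinhole camera; the intrinsic values and corner pixels are made up for illustration and are not from the paper.

```python
def normalize_corners(corners_px, fx, fy, cx, cy):
    """Map pixel corners (u, v) to normalized image coordinates (x, y).

    Assumes a pinhole camera with no lens distortion.
    """
    return [((u - cx) / fx, (v - cy) / fy) for u, v in corners_px]


# Hypothetical intrinsics for a 640x480 camera and four detected tag corners.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
corners_px = [(320.0, 240.0), (620.0, 240.0), (620.0, 540.0), (320.0, 540.0)]

corners_norm = normalize_corners(corners_px, fx, fy, cx, cy)
# Flatten into the eight-number feature vector a servoing policy could consume.
features = [value for corner in corners_norm for value in corner]
```

Working in normalized coordinates keeps the input independent of image resolution and focal length, which is why classical image-based visual servoing uses the same representation.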

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

MirrorDuo：基于镜像示范对的反射一致视觉运动学习

Authors: Zheyu Zhuang, Ruiyu Wang, Giovanni Luca Marchetti, Florian T. Pokorny, Danica Kragic
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.20048
Pdf link: https://arxiv.org/pdf/2606.20048
Abstract Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.
中文摘要 基于图像的行为克隆利用由随处可见的RGB摄像头采集的演示。然而，它仍受限于收集多样化演示的成本，尤其是在需要跨工作空间变化进行泛化时。我们提出了MirrorDuo，这是一种基于反射的表述，作用于图像、本体感知和完整的6自由度（6-DoF）末端执行器动作元组，为每个原始演示生成镜像对应物，有效实现“采一赠一”的效果。它可以作为现有学习流程（如标准行为克隆或扩散策略）的数据增强策略，也可以作为反射等变策略网络的结构先验。通过利用原始域与镜像域的重叠，当演示均匀分布于工作空间两侧时，MirrorDuo在相同数据预算下显著提升性能。当演示仅限于一侧时，MirrorDuo能够高效地将技能迁移到镜像工作空间，在目标布置中少至零个或五个演示即可。
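To make the idea of mirroring an image, proprioception, and 6-DoF pose tuple concrete, here is a minimal sketch of a reflection across the plane y = 0 of the robot base frame. It assumes the camera view is symmetric about that plane, so a horizontal image flip matches the workspace reflection. The choice of plane, the pose representation (position plus rotation matrix), and all function names are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

# Reflection across the x-z plane (y -> -y). Assumed mirror plane.
M = np.diag([1.0, -1.0, 1.0])

def mirror_image(img):
    """Flip an H x W x C image left to right."""
    return img[:, ::-1].copy()

def mirror_position(p):
    """Reflect a 3-D position across the mirror plane."""
    return M @ np.asarray(p, dtype=float)

def mirror_rotation(R):
    """Reflect an orientation. M @ R @ M stays a proper rotation,
    because the two reflections cancel the sign of the determinant."""
    return M @ np.asarray(R, dtype=float) @ M

def mirror_sample(img, position, rotation):
    """Produce the mirrored counterpart of one (image, pose) sample."""
    return mirror_image(img), mirror_position(position), mirror_rotation(rotation)

# Example: an end-effector yawed by +30 degrees on the left of the plane.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
img = np.arange(6).reshape(1, 3, 2)
img_m, p_m, R_m = mirror_sample(img, [0.4, 0.2, 0.1], R)
```

Conjugating the rotation by `M` rather than multiplying once is the design choice that matters here: a single multiplication would yield a matrix with determinant -1, which is not a valid end-effector orientation.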

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

频率感知流匹配，实现连续且一致的机器人动作生成

Authors: Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20135
Pdf link: https://arxiv.org/pdf/2606.20135
Abstract Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters and applies to standalone flow-matching policies and vision-language action models. Across synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at this https URL.
中文摘要 流匹配因其在建模复杂多模态动作分布方面的强大表达能力，已与扩散策略等类似方法一起成为机器人操作的标准范式。然而，现有方法依赖离散化的动作块，使其对以异构控制频率采集的演示较为脆弱，且容易产生时间上不一致的动作，从而降低控制稳定性。本文提出了频率感知流匹配（FAFM），该方法输出连续且时间一致的动作。为处理异构频率输入，我们使用离散余弦变换（DCT）将离散动作序列变换到频域，对所得系数进行流匹配，并通过余弦基展开重建连续动作。为了生成时间一致的动作，我们对一阶时间导数进行正则化以促进动作平滑。这对应于一种Sobolev型约束，可抑制高频误差并避免动作突变。我们的FAFM结构简单，不引入额外网络参数，适用于独立的流匹配策略和视觉-语言-动作模型。在合成玩具基准、避障、LapGym和LIBERO上，FAFM提升了成功率、多模态表达能力、运动平滑性、收敛速度，以及对机械偏差和混合频率输入的鲁棒性。这些提升在真实的Franka机器人上部署时同样成立。代码可在此 https URL 获取。
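The frequency-domain step described in the abstract (DCT of a discrete action chunk, then reconstruction of a continuous action via cosine basis expansion, plus a first-order derivative penalty) can be sketched as follows. This is an illustrative sketch only: it uses an unnormalized DCT-II written directly in NumPy, measures time in units of the sample index, and approximates the derivative penalty by finite differences. The function names and these conventions are assumptions, and the flow-matching model that would operate on the coefficients is omitted.

```python
import numpy as np

def dct_coeffs(actions):
    """Unnormalized DCT-II along time for an (N, D) action chunk."""
    a = np.asarray(actions, dtype=float)
    N = a.shape[0]
    n = np.arange(N)
    k = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * k[:, None] / N)  # (K, N)
    return basis @ a  # (K, D)

def reconstruct(coeffs, t):
    """Evaluate the cosine expansion at continuous times t in [0, N - 1].

    At integer t this is the exact inverse of dct_coeffs.
    """
    c = np.asarray(coeffs, dtype=float)
    N = c.shape[0]
    t = np.atleast_1d(np.asarray(t, dtype=float))
    k = np.arange(N)
    basis = np.cos(np.pi * (t[:, None] + 0.5) * k[None, :] / N)  # (T, K)
    weights = np.full(N, 2.0)
    weights[0] = 1.0  # the DC term is counted once
    return (basis * weights) @ c / N  # (T, D)

def derivative_penalty(coeffs, num_samples=64):
    """Mean squared first derivative of the reconstructed action,
    approximated by finite differences on a dense time grid."""
    N = np.asarray(coeffs).shape[0]
    ts = np.linspace(0.0, N - 1.0, num_samples)
    vals = reconstruct(coeffs, ts)
    diffs = np.diff(vals, axis=0) / (ts[1] - ts[0])
    return float(np.mean(diffs ** 2))

# A smooth ramp and a jittery version of it, both 8 steps, 2 action dims.
steps = np.arange(8, dtype=float)
smooth = np.stack([0.1 * steps, -0.05 * steps], axis=1)
jitter = smooth + 0.2 * ((-1.0) ** steps)[:, None]
coeffs = dct_coeffs(smooth)
halfway = reconstruct(coeffs, [2.5])  # query between samples 2 and 3
```

Because the reconstruction accepts any real-valued `t`, the same coefficients can be resampled at whatever control rate the robot runs, which is the property the abstract relies on for mixed-frequency demonstrations.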