Arxiv Papers of Today

Generated: 2026-06-16 20:58:41 (UTC+8); Arxiv release time: 2026-06-16 20:00 EDT (2026-06-17 08:00 UTC+8)

78 related papers today

Keyword: reinforcement learning

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Authors: Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.14792
Pdf link: https://arxiv.org/pdf/2606.14792
Abstract RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
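
The contrast between joint and factorized reward assignment in the abstract above can be illustrated with a small sketch. It assumes GRPO-style group-relative advantages (rewards standardized within a rollout group) and assumes that the joint scheme sums the two rewards into one scalar; the function names and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_assignment(text_rewards, image_rewards):
    """Assumed joint scheme: one shared advantage drives text and image tokens."""
    shared = group_advantages([t + v for t, v in zip(text_rewards, image_rewards)])
    return shared, shared


def factorized_assignment(text_rewards, image_rewards):
    """Factorized scheme: each modality only sees its own advantage."""
    return group_advantages(text_rewards), group_advantages(image_rewards)


# Hypothetical rewards for a group of four rollouts.
text_r = [1.0, 1.0, 0.0, 0.0]
image_r = [0.0, 1.0, 0.0, 1.0]

joint_text_adv, joint_image_adv = joint_assignment(text_r, image_r)
fact_text_adv, fact_image_adv = factorized_assignment(text_r, image_r)
```

In this toy group, the first rollout has a good text segment and a poor image segment. Under the shared signal the two cancel and its text tokens receive roughly zero advantage, whereas the factorized version credits the text and penalizes the image separately.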

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

Authors: Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.14801
Pdf link: https://arxiv.org/pdf/2606.14801
Abstract Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.
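
The steering idea described above (project the noisy intermediate action to a clean estimate, then take the critic gradient there) can be sketched on a one-dimensional toy problem. Everything here is an assumption for illustration: the frozen policy `velocity`, the quadratic critic, the one-step extrapolation `x + (1 - t) * v` as the single-point estimate, and the `guidance` scale. It is not the QPILOTS implementation.

```python
def velocity(x, t):
    """Toy frozen flow policy that transports any sample toward the action 0.0."""
    target = 0.0
    return (target - x) / max(1.0 - t, 1e-3)


def critic_grad(a):
    """Gradient of a toy critic Q(a) = -(a - 1)^2, which prefers the action 1.0."""
    return -2.0 * (a - 1.0)


def sample(x0, steps=10, guidance=0.0):
    """Euler integration from t=0 to t=1; guidance > 0 enables critic steering."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # Single-point estimate of the final clean action from the noisy state.
        a_hat = x + (1.0 - t) * v
        # The critic gradient is evaluated at the clean estimate, not at x.
        x = x + dt * (v + guidance * critic_grad(a_hat))
    return x


unsteered = sample(x0=-0.5)                 # policy weights are never modified
steered = sample(x0=-0.5, guidance=0.2)     # same policy, steered at inference
```

The policy itself is left untouched; only the sampling loop changes. In this toy, the steered sample ends slightly closer to the critic's preferred action than the unsteered one.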

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Authors: Juming Xiong, Weixin Liu, Kevin Guo, Congning Ni, Junchao Zhu, Chongyu Qu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.14961
Pdf link: https://arxiv.org/pdf/2606.14961
Abstract Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA: Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.15007
Pdf link: https://arxiv.org/pdf/2606.15007
Abstract We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

Temporal Difference Learning for Diffusion Models

Authors: Qizhen Ying, Yangchen Pan, Victor Adrian Prisacariu, Junfeng Wen
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.15048
Pdf link: https://arxiv.org/pdf/2606.15048
Abstract Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.

Towards Ubiquitous 6G Computing and Networking Convergence: Architecture and Mechanism for Cross-Domain Resource Coordination

Authors: Yang Li, Xing Zhang, Yan Zhang, Wenbo Wang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2606.15073
Pdf link: https://arxiv.org/pdf/2606.15073
Abstract The 6G network will support six major application scenarios, such as immersive communication, integrated AI and communication, and integrated sensing and communication. Many scenarios necessitate significant computational support. Moreover, user demands are becoming increasingly segmented, diverse, and personalized. Traditional network slicing alone is insufficient to meet the heterogeneous computing and networking demands of emerging service scenarios. Mobile computing network convergence (CNC) introduces a fundamentally different paradigm from the conventional cloud computing plus communication network model by deeply embedding computing resources into the mobile network infrastructure and enabling integrated computing-network services tailored to diverse user demands. In this article, we investigate orchestration architectures and mechanisms for CNC in 6G mobile networks. We begin by reviewing the evolution of CNC from a mobile network perspective and surveying existing studies, which we categorize according to mobile network architectures. Building on these insights, we propose a hierarchical, cross-domain coordination architecture and an orchestration mechanism based on hierarchical multi-agent reinforcement learning. Performance evaluations demonstrate that the proposed architecture and mechanism significantly reduce system energy consumption while enhancing task satisfaction rate. Finally, we discuss open challenges and future research directions.

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Authors: Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15079
Pdf link: https://arxiv.org/pdf/2606.15079
Abstract Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

Authors: Dianqiao Lei, Lianlei Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.15099
Pdf link: https://arxiv.org/pdf/2606.15099
Abstract Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
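
The confidence-based early exit described above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions: `step_fn`, `confidence_fn`, `max_steps`, and `threshold` are hypothetical names, and the toy state is a single float rather than a latent vector.

```python
def reason_with_early_exit(step_fn, confidence_fn, state, max_steps=8, threshold=0.9):
    """Apply latent reasoning steps until confidence reaches the threshold
    or the step budget is exhausted. Returns (final_state, steps_used)."""
    for used in range(1, max_steps + 1):
        state = step_fn(state)
        if confidence_fn(state) >= threshold:
            return state, used  # confident enough: stop reasoning and act
    return state, max_steps


# Toy example: each step halves the distance to 1.0, and the state value
# itself is used as the confidence score.
final_state, steps_used = reason_with_early_exit(
    step_fn=lambda s: s + 0.5 * (1.0 - s),
    confidence_fn=lambda s: s,
    state=0.0,
)
```

Here the loop stops after four of the eight allowed steps, which is the depth versus efficiency trade-off the abstract refers to: easy inputs exit early, hard ones use the full budget.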

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

Authors: David Huang, Lianlei Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.15160
Pdf link: https://arxiv.org/pdf/2606.15160
Abstract Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.
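
An orthogonality-based diversity regularizer of the kind mentioned above is often written as a penalty on pairwise cosine similarity between hypothesis vectors. The sketch below assumes that form; the exact regularizer used by DLWM is not given in the abstract, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diversity_penalty(hypotheses):
    """Sum of squared cosine similarities over all distinct hypothesis pairs.
    Zero when all hypotheses are mutually orthogonal."""
    unit = [normalize(h) for h in hypotheses]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


collapsed = diversity_penalty([[1.0, 0.0], [2.0, 0.0]])    # same direction
orthogonal = diversity_penalty([[1.0, 0.0], [0.0, 3.0]])   # distinct directions
```

Adding such a term to the training loss pushes hypotheses apart, so two latent hypotheses that point in the same direction are penalized while orthogonal ones are not.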

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Authors: Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15197
Pdf link: https://arxiv.org/pdf/2606.15197
Abstract Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

Authors: Mohamed Jismy Aashik Rasool, Shabir Ahmad, Gisong Oh, Teag Kuen Whangbo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.15243
Pdf link: https://arxiv.org/pdf/2606.15243
Abstract Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Existing knowledge distillation (KD) methods apply distillation signals uniformly across all spatial locations, overlooking the varying reconstruction difficulty across image regions. To address this, we propose SPARK (Spatial Policy-driven Adaptive Reinforcement Learning for Knowledge Distillation), a framework that adaptively allocates distillation effort using a lightweight reinforcement learning (RL) policy network. At each training step, a difficulty feature extractor computes four signals, namely Laplacian variance, pixel variance, student reconstruction error, and teacher-student knowledge gap, which are fed into a compact policy CNN that produces a stochastic spatial weight map to modulate the KD loss during quantization-aware training (QAT). SPARK is IR task-agnostic, adds no inference cost, and integrates into any existing QAT pipeline without architectural changes. Experiments on benchmark datasets demonstrate that SPARK consistently outperforms PTQ, QAT, and state-of-the-art (SOTA) KD approaches across multiple student architectures, achieving reconstruction quality closest to the full-precision teacher under significant computational constraints.

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

Authors: Octave Oliviers, Glenn Vinnicombe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15247
Pdf link: https://arxiv.org/pdf/2606.15247
Abstract The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

Authors: Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15260
Pdf link: https://arxiv.org/pdf/2606.15260
Abstract Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.
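
A KL constraint over an entire diffusion trajectory, as described above, can be illustrated by summing per-step KL terms between the old and the updated policy. The sketch assumes each denoising step is an isotropic Gaussian whose variance is shared by both policies, and it evaluates the sum along a single trajectory; the names `trajectory_kl` and `within_trust_region` and all numbers are hypothetical. How TruDi actually enforces the constraint is not specified in the abstract.

```python
def gaussian_kl_same_var(mu_new, mu_old, sigma):
    """KL(N(mu_new, sigma^2 I) || N(mu_old, sigma^2 I)) for a shared variance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_new, mu_old))
    return sq_dist / (2.0 * sigma ** 2)


def trajectory_kl(new_means, old_means, sigmas):
    """Sum the per-step KL terms along one denoising trajectory."""
    return sum(
        gaussian_kl_same_var(m_new, m_old, s)
        for m_new, m_old, s in zip(new_means, old_means, sigmas)
    )


def within_trust_region(kl, bound):
    """Check the update against the trust-region bound."""
    return kl <= bound


# Hypothetical two-step trajectory with one-dimensional actions.
old_means = [[0.0], [0.0]]
new_means = [[0.1], [0.2]]
sigmas = [1.0, 0.5]

kl = trajectory_kl(new_means, old_means, sigmas)
accepted = within_trust_region(kl, bound=0.1)
```

Note how the second step dominates the total even though both shifts are small: low-noise steps late in the trajectory make the constraint tighter, which is one reason to bound the whole trajectory instead of only the final action.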

Discovering Lattice Reduction Strategies via Self-Play

Authors: Mohamed Malhou, Kristin Lauter, Ludovic Perret
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15301
Pdf link: https://arxiv.org/pdf/2606.15301
Abstract The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal contribution to computer science used for lattice basis reduction, yet its polynomial-time outputs produce bases that are far from optimal as the dimension grows. We show that deep reinforcement learning can discover strictly superior, generalizable reduction strategies by interacting with the primitive action space of LLL. We formulate lattice reduction as a single-player Markov Decision Process (MDP) and train a deep residual network using an AlphaZero-style self-play pipeline augmented with adaptive-horizon MCTS (Monte Carlo Tree Search), which couples multi-step network predictions with an entropy-gated expansion mechanism. The resulting policy, DeltaStar, is trained exclusively on small 8-dimensional q-ary lattices and requires fewer primitive row operations than LLL. Crucially, it generalizes zero-shot to unseen moduli and higher dimensions up to n = 32 without retraining.
中文摘要 Lenstra-Lenstra-Lovász（LLL）算法是计算机科学中一项开创性的贡献，用于格基约化，但随着维数增长，其多项式时间输出的基远非最优。我们证明，深度强化学习可以通过与LLL的原始动作空间交互，发现严格更优且可泛化的约化策略。我们将格基约化建模为单人马尔可夫决策过程（MDP），并使用AlphaZero式的自我对弈流程训练深度残差网络，该流程辅以自适应视界MCTS（蒙特卡洛树搜索），将多步网络预测与熵门控扩展机制相结合。所得策略DeltaStar仅在小型$8$维$q$元格上训练，所需的原始行操作比LLL更少。关键的是，它无需重新训练即可零样本泛化到未见过的模数以及最高$n=32$的更高维度。
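
For readers unfamiliar with the baseline, the LLL procedure that DeltaStar is compared against can be sketched briefly. This is a textbook LLL with exact rational arithmetic, not code from the paper. The counter `ops` (nonzero size-reduction steps plus swaps) is only an assumption about what "primitive row operations" might mean, and `delta = 3/4` is the conventional default rather than a value stated in the abstract.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def gram_schmidt(B):
    """Return the Gram-Schmidt vectors B* and coefficients mu for basis rows B."""
    Bs = []
    mu = [[Fraction(0)] * len(B) for _ in B]
    for i, b in enumerate(B):
        v = [Fraction(x) for x in b]
        for j in range(i):
            mu[i][j] = dot(b, Bs[j]) / dot(Bs[j], Bs[j])
            v = [vi - mu[i][j] * wj for vi, wj in zip(v, Bs[j])]
        Bs.append(v)
    return Bs, mu


def lll(B, delta=Fraction(3, 4)):
    """Textbook LLL on linearly independent integer rows; returns (basis, ops)."""
    B = [list(row) for row in B]
    ops = 0  # hypothetical count of primitive row operations
    k = 1
    while k < len(B):
        # Size reduction: make |mu[k][j]| <= 1/2 for all j < k.
        for j in range(k - 1, -1, -1):
            _, mu = gram_schmidt(B)
            q = round(mu[k][j])
            if q:
                B[k] = [a - q * b for a, b in zip(B[k], B[j])]
                ops += 1
        Bs, mu = gram_schmidt(B)
        # Lovász condition: advance if satisfied, otherwise swap and step back.
        if dot(Bs[k], Bs[k]) >= (delta - mu[k][k - 1] ** 2) * dot(Bs[k - 1], Bs[k - 1]):
            k += 1
        else:
            B[k], B[k - 1] = B[k - 1], B[k]
            ops += 1
            k = max(k - 1, 1)
    return B, ops
```

Recomputing Gram-Schmidt at every step keeps the sketch short at the cost of speed. A learned policy in the paper's setting would presumably choose among the same two kinds of moves (row subtraction and swap) rather than follow this fixed schedule.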

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

采用思维链监督的强化学习，以可解释地检测仇恨和宣传表情包

Authors: Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15307
Pdf link: https://arxiv.org/pdf/2606.15307
Abstract Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.
中文摘要 仇恨和宣传性表情包利用图像与文字的相互作用，传达任何单一模态都无法单独揭示的有害意图。尽管基于思考的多模态大语言模型（MLLM）推进了视觉-语言理解，但其在表情包内容审核中的应用仍未被充分探索。我们提出了一种基于强化学习的后训练方法，通过任务特定奖励和组相对策略优化（GRPO），提升基于思考的MLLM的分类性能和基于参考的解释质量。具体来说，我们（i）在英语和阿拉伯语基准上，对现成MLLM的仇恨和宣传性表情包理解能力进行系统的实证研究；（ii）通过蒸馏得到的弱监督思维链（CoT）推理依据以及多LLM细粒度宣传标注，扩展现有表情包数据集；（iii）引入带有思考长度正则化的GRPO目标，联合优化分类准确率和解释质量；以及（iv）利用基于共识的伪标签，研究在无标注表情包上的自监督GRPO。在Hateful Memes和ArMeme基准上的实验显示，我们的方法在FHM准确率（最高+2.1%，从79.9%升至82.0%）和ArMeme宏F1（最高+7.6分，含解释时从0.536升至0.612；相比原始ArMeme基准为+6.1）上均优于此前报告的结果，同时还能生成自然语言解释。在ArMeme上，序列分类基线在原始准确率方面依然更强，而我们的方法则提供了更均衡的各类别表现并附带解释。我们公开发布代码、数据扩展和评估资源。
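
Several entries in this digest rely on GRPO, so a minimal sketch of its group-relative advantage may help. Each sampled response is scored, and its reward is standardized against the other responses drawn for the same prompt. The use of the population standard deviation and the `eps` value are assumptions here, since implementations differ, and nothing in this sketch is specific to this paper's task rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.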

Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance

基于Hamilton-Jacobi可达性的安全强化学习用于紧急避撞

Authors: Yuhong Jiang, Shiyue Zhao, Junzhi Zhang, Junfeng Zhang, Xinhan Li, Shijie Zhao, Chengkun He
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.15311
Pdf link: https://arxiv.org/pdf/2606.15311
Abstract Emergency collision avoidance under extreme driving conditions demands safety-critical control that accounts for both obstacle proximity and vehicle dynamic stability over a future time horizon, yet existing methods often rely on instantaneous or local safety evaluations. This paper proposes a safe reinforcement learning framework guided by a Hamilton-Jacobi (HJ) reachability based motion safety set that provides forward-looking safety supervision for constrained policy optimization. Specifically, a unified signed safety function is formulated by combining geometric collision margins and chassis stability limits, and is then extended through reachability analysis into a finite-horizon motion safety set that characterizes whether safety can be maintained under future vehicle state evolution. To enable practical computation, the motion safety set is approximated from offline extreme driving data, mitigating the computational burden of grid-based HJ solvers. The learned motion safety set is then embedded as a continuous safety cost into a constrained Markov decision process, and a PID-Lagrangian policy optimization scheme is employed to adaptively regulate the Lagrange multiplier for safety constraint enforcement. Simulation and real-vehicle experiments on low-adhesion obstacle-avoidance scenarios demonstrate that the proposed method achieves higher goal-reaching rates, produces smoother avoidance maneuvers, and maintains larger unified safety margins than baseline methods.
中文摘要 极端驾驶条件下的紧急避撞需要安全关键的控制，在未来一段时域内同时考虑障碍物接近程度和车辆动态稳定性，但现有方法往往依赖瞬时或局部的安全评估。本文提出了一个安全强化学习框架，以基于Hamilton-Jacobi（HJ）可达性的运动安全集为指导，为约束策略优化提供前瞻性的安全监督。具体来说，通过结合几何碰撞裕度和底盘稳定性极限，构建统一的带符号安全函数，然后通过可达性分析将其扩展为有限时域运动安全集，用以刻画在未来车辆状态演化下能否保持安全。为了实现实际可行的计算，运动安全集由离线极限驾驶数据近似得到，从而减轻基于网格的HJ求解器的计算负担。学习到的运动安全集随后作为连续安全代价嵌入约束马尔可夫决策过程，并采用PID-拉格朗日策略优化方案自适应调节拉格朗日乘子，以实施安全约束。在低附着路面避障场景下的仿真和实车实验表明，所提方法比基线方法实现了更高的目标到达率、更平顺的避让动作，并保持了更大的统一安全裕度。
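
The PID-Lagrangian idea mentioned in the abstract can be illustrated with a generic multiplier controller: the Lagrange multiplier is driven by the constraint violation through proportional, integral, and derivative terms, then clipped at zero. This is a sketch of the general technique only. The gains, the cost limit, and the clipping choices below are assumptions, not values or design details from the paper.

```python
from dataclasses import dataclass


@dataclass
class PIDLagrangian:
    kp: float          # proportional gain (hypothetical value chosen by the user)
    ki: float          # integral gain
    kd: float          # derivative gain
    cost_limit: float  # allowed expected safety cost
    integral: float = 0.0
    prev_cost: float = 0.0

    def update(self, episode_cost: float) -> float:
        """Return the new multiplier given the latest measured safety cost."""
        error = episode_cost - self.cost_limit
        # Keep the integral non-negative so past slack cannot bank up.
        self.integral = max(0.0, self.integral + error)
        # React only to increases in cost.
        derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)  # a Lagrange multiplier must stay non-negative
```

The returned multiplier would typically weight the safety cost inside the policy objective, for example `reward_advantage - lam * cost_advantage`.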

CausalDrive: Real-time Causal World Models for Autonomous Driving

CausalDrive：自动驾驶的实时因果世界模型

Authors: Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.15341
Pdf link: https://arxiv.org/pdf/2606.15341
Abstract World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.
中文摘要 世界模型已成为自动驾驶（AD）数据扩展的有前景范式，但现有的视频生成模型作为交互式模拟器仍不足。布局条件渲染器依赖所有背景代理的“oracle”未来轨迹，使其完全非反应性。相反，纯动作条件预测变量缺乏对复杂交互的语义控制，且扩散延迟过高，阻碍了闭环策略学习。为了弥合这一差距，我们推出了CausalDrive，一款可控的实时基础驱动世界渲染器。CausalDrive 仅基于初始的正面视角框架、自我载具的轨迹和宏观文本提示来运行。通过排除未来的NPC布局，我们促使模型内在预测因果互动，实现对驱动社会学的文本驱动控制，使用户能够动态协调对相同自我行为的多样反事实反应。为了克服效率瓶颈并解决自回归生成中的协变量转变，我们提出了一种新的上下文强制DMD架构。该技术结合了连续流量匹配与自纠正蒸馏目标，实现12帧每秒的交互速度。这一突破将被动视频生成器转变为可玩的神经模拟器。我们展示了其在三个下游应用中的多功能性：（1）生成式闭环评估，显著减少碰撞伪影;（2）由Video2Reward模块驱动的大规模强化学习（RL）后训练;（3）实时人机环中仿真。大量实验验证了在CausalDrive反应场景中训练的策略在现实世界中展现出更优越的交互能力。

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

语言模型代理中的奖励黑客：重新审视人工智能安全网格世界

Authors: Ömer Veysel Çağatan, Xuandong Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15385
Pdf link: https://arxiv.org/pdf/2606.15385
Abstract Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B--14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at this https URL.
中文摘要 奖励黑客行为（reward hacking）指AI系统利用设定有误的目标获得高奖励，却未满足预期目标，它仍是AI安全的核心挑战。然而，大多数已知案例都是在前沿系统中事后发现的，在那里难以开展受控研究。我们将AI Safety Gridworlds框架改编为基于文本的评估套件，为基于语言的智能体重新表述经典的强化学习安全任务。在前沿和中等规模模型中，我们发现规范博弈（specification gaming）会以零样本方式出现：模型系统性地获得较高的可观测奖励，却在隐藏的安全目标上表现不佳，甚至表面上安全的行为也可能反映的是误解而非有原则的安全性。强化学习并不能纠正这些失败：直接的奖励优化会拉大可观测奖励与隐藏奖励之间的差距，因为模型的初始能力使其在发现更安全的替代方案之前就锁定于局部有奖励的策略。这种模式在不同模型规模（1.5B--14B）上持续存在，且无法通过更细粒度的信用分配、探索提示或熵正则化解决。我们的结果表明，在用有能力的语言模型智能体优化代理目标时，奖励黑客行为会自然产生，并且能抵抗标准的缓解措施，这表明智能体场景中的代理奖励失败可能需要超越标准探索和信用分配修补的方法。为了便于复现，本工作的代码可在 this https URL 获取。

Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

通过推理驱动的任务对齐防御自适应提示注入攻击

Authors: Lipeng He, Yihan Wang, Jiawen Zhang, N. Asokan
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15441
Pdf link: https://arxiv.org/pdf/2606.15441
Abstract Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions on the user tasks rather than attacker-controlled data. At each tool-output step, the defender undertakes chain-of-thought reasoning verifying that its actions are consistent with the user task. Leveraging red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, achieving broad coverage of injection-reformulation strategies. Together, these allow the defender to be optimized via multi-objective reinforcement learning and achieve better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.
中文摘要 间接提示注入攻击通过在任务执行过程中获取的第三方数据中嵌入恶意指令，劫持基于LLM的代理。现有防御在静态基准测试下报告攻击成功率几乎为零，但近期自适应评估显示，一旦攻击者被允许对部署防御进行优化，这些结果就会崩溃。在本研究中，我们将这种崩溃追溯到两种失效模式。首先，现有的防御方法仅限于识别特定的攻击模式，而非评估每个嵌入指令的意图是否与用户任务相关。其次，基于训练的防御，虽然提供了最强的安全与效用权衡，但它们的对抗例子仅基于少数手工模板，最终防御者无法在狭窄的战略分布之外进行推广。为弥补这些空白，我们提出了RETA，这是一种基于训练的方法，将防御决策基于用户任务而非攻击者控制的数据。在每个工具输出步骤中，防御者进行思维链推理，验证其行为是否与用户任务一致。利用红队技术，模拟攻击者综合对抗训练数据，获得词典学习多样性奖励，实现广泛覆盖注入-重组策略。这些因素共同使防御者通过多目标强化学习得到优化，实现更好的安全与效用权衡。在六种黑箱自适应攻击中，RETA 将每次攻击的 ASR 保持在 10% 以下，两个目标模型的平均 ASR 分别为 2.92% 和 3.75%，同时在攻击和干净输入时保持了大部分效用。

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

从过度训练的视角理解RLVR中的多样性崩溃

Authors: Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15455
Pdf link: https://arxiv.org/pdf/2606.15455
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of overtraining: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大语言模型推理能力的关键方法。然而，RLVR常常出现多样性崩溃：Pass@$1$有所提升，而高$k$的Pass@$k$却下降，这被视为模型推理边界的收窄。我们从过度训练的视角对这种多样性崩溃进行形式化：一旦某个问题对参考指标的贡献实际上已经饱和，进一步的更新就不再扩展模型能够解决的范围，却仍会将概率质量集中到同策略采样所偏好的轨迹上。在每个问题只有少量rollout的标准设置下，即使只观察到一次成功，该问题对于高$k$的Pass@$k$也已处于近乎饱和的状态，因此从边界角度看，标准RLVR中的大多数更新都属于过度训练。这一视角也为“RLVR能否将模型的推理能力扩展到基础模型之外”提供了一种解读：由于RLVR在结构上不利于高$k$的Pass@$k$，其总体下降本身并不意味着没有产生新的推理增益。在干预实验中，将更新限制在观察到零次成功的问题上，可在困难基准上使Pass@$256$高于基础模型；在观察分析中，有不可忽略的一部分最初无法解决的问题在标准RLVR训练过程中变得可解。基于这些发现，我们提出了贝叶斯边界门控（Bayesian Boundary Gating，BBG），通过估计每个问题对推理边界的边际贡献，使优化避开过度训练。在多个推理基准上，BBG在很宽的$k$范围内提升了平均Pass@$k$。
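
The saturation claim is easy to check numerically under a simplifying assumption that samples are independent with a fixed per-sample success probability `p`. In that case Pass@$k$ is $1 - (1 - p)^k$, and a naive estimate of `p` from one success in a handful of rollouts already pushes high-$k$ Pass@$k$ to nearly 1. The rollout count of 8 below is illustrative, not a figure from the paper, and this is not the paper's formalization.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


# One observed success out of 8 rollouts gives a naive estimate p = 1/8.
p_hat = 1 / 8
low_k = pass_at_k(p_hat, 1)     # room left to improve at k = 1
high_k = pass_at_k(p_hat, 256)  # essentially saturated at k = 256
```

Further updates on such a problem can still raise Pass@$1$ but have almost nothing left to gain at $k = 256$, which is the sense in which they count as overtraining from the boundary perspective.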

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

基于软融合的强化学习引导检索，在缺失模态下实现稳健的多模态模仿学习

Authors: Hassan Ismkhan, Hamid Bouchahcia
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.15514
Pdf link: https://arxiv.org/pdf/2606.15514
Abstract Robotic systems perceive the world through multiple input modalities -- including visual camera streams and natural language instructions -- and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors -- without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at this https URL
中文摘要 机器人系统通过多种输入模态感知世界（包括视觉摄像头流和自然语言指令），并必须根据这些信号选择合适的动作。然而，假设所有输入设备始终可用是不现实的，因为传感器在部署过程中可能失效、被遮挡，甚至完全掉线。因此，稳健地处理此类模态缺失场景对于机器人在现实世界中的运行至关重要。本文介绍了RL4IL，这是一种强化学习引导的模仿学习方法，通过从训练库中识别最相关的专家演示，为给定观测选择最合适的动作。强化学习策略通过在广度优先搜索候选集上进行近端策略优化训练得到，用于对候选演示排序，软交叉注意力融合头则聚合它们的动作信号以生成最终预测。当推理时某一模态缺失时，专门的按模态划分的强化学习检索策略会从训练库中识别供体演示，软插补头通过对排名最高的供体做交叉注意力来重建缺失的嵌入，无需对系统进行任何重新训练。在三个LIBERO基准套件上的实验表明，RL4IL在传感器掉线条件下显著优于最先进的模仿学习方法，且无需训练策略网络。代码可在 this https URL 找到。

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

在分歧处定位信用：用于LLM推理的路径条件自蒸馏

Authors: Yu Li, Shu Hong, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.15576
Pdf link: https://arxiv.org/pdf/2606.15576
Abstract Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
中文摘要 基于可验证奖励的强化学习为每个rollout只分配一个标量，使得长推理轨迹中的token级信用分配没有得到明确规定。同策略自蒸馏通过让同一模型充当以特权信息为条件的教师、产生稠密的逐token信号来解决这个问题。但常用的真实答案只是一个终点线索：在答案简短的任务中，教师在路径级指导最为关键的中间位置上保持沉默。我们提出事后自蒸馏（Hindsight Self-Distillation，HSD），让教师以从当前训练组中抽取的一个成功同伴rollout为条件。这样的同伴是成功条件策略的精确样本，无需额外采样rollout。通过提供完整的成功续写而非仅仅最终答案，所得的信用信号集中在失败rollout与成功同伴之间的分歧位置。在Qwen3-8B和Qwen3-32B的数学和代码基准上，HSD相对于GRPO变体和同策略蒸馏基线取得了最佳结果，在AIME等答案简短的任务上提升最大。
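
To make "divergence position" concrete, the helper below finds the first token index at which a failed rollout departs from a successful peer. It only illustrates the term. In the abstract the credit signal comes from a teacher distribution conditioned on the peer, not from an explicit prefix comparison like this one, and the function name is hypothetical.

```python
def divergence_index(failed, peer):
    """Index of the first token where two rollouts differ.

    Returns the length of the shorter rollout if one is a prefix of the other.
    """
    for i, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return i
    return min(len(failed), len(peer))
```

For `["x", "=", "2", "+", "3"]` against `["x", "=", "2", "*", "3"]` the divergence sits at index 3, the operator choice, which is where a path-level signal would be most informative.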

Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

能动检索与强化学习方程链：复杂新颖物理应用题的受控生成框架

Authors: Tirthankar Mittra
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.15591
Pdf link: https://arxiv.org/pdf/2606.15591
Abstract Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.
中文摘要 生成高质量、新颖、复杂且可解的物理应用题（PWP）仍然是教育内容生成中一个充满挑战且未被充分探索的问题。现有的方法，许多是从数学词题（MWP）生成中改良而来，常常产生模糊、无法解决或结构简单的问题，语言多样性有限。我们介绍了ARVRE（代理检索价值强化方程链），这是一个两阶段框架，用于生成多样且数学上有效的PWP。第一阶段采用离线时间差分学习方式构建有效的物理方程链，而代理检索增强生成（RAG）框架则动态选择主题特定的概念和词汇。这种设计使得对问题结构和难度的明确控制成为可能。第二阶段，大型语言模型（LLM）将方程链和检索的概念转换为自然语言物理问题。通过将生成建立在有效方程链上，我们的方法既保持了数学正确性，又促进了语言多样性和语境丰富性。人工和自动化评估表明，ARVRE生成的PWP比现有方法更复杂、新颖且可解。这些结果凸显了将强化学习、检索和大型语言模型结合起来，能够可靠生成教育物理内容的潜力。

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

自我提问的视觉语言模型：面向组合式视觉推理的强化学习

Authors: Saraswathy Amjith
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.15651
Pdf link: https://arxiv.org/pdf/2606.15651
Abstract Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM by rewarding not only the final answer, as standard RL does, but also the generation of intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
中文摘要 视觉语言模型（VLM）是能够同时处理图像和文本的人工智能系统，但它们常常在需要串联多个步骤的组合式视觉推理问题上遇到困难，例如识别物体、对其计数并比较结果。现有方法通过在人工编写的逐步解释上训练模型来改进这种推理能力，但创建这些标注既昂贵又难以扩展。我们提出了一种自我提问框架，采用一种名为组相对策略优化（Group Relative Policy Optimization，GRPO）的强化学习算法，训练VLM将视觉问题拆分为更小的子问题，并在给出最终回答前逐一作答。模型从未见过如何分解问题的示例，而是在奖励信号的引导下自行发现这种行为，该信号对输出是否包含子问题以及最终答案是否正确进行评分。我们将该框架应用于一个30亿参数的模型，在几何形状的合成场景（CLEVR）和真实世界照片（A-OKVQA）上进行训练。在A-OKVQA上，自我提问和标准强化学习相对于未训练模型都显著提升了准确率（52.2%和51.6%对46.8%）。我们提出了首个自我提问VLM，它不仅像标准强化学习那样奖励最终答案，还奖励生成中间子问题，使模型能够发现组合式分解策略。这些结果表明，教AI系统向自己提出中间问题是复杂视觉推理的一种有前景的策略，尤其是在问题难度足以需要显式逐步分解时。
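
A reward of the kind described (one term for containing sub-questions, one for a correct final answer) might look like the sketch below. The output format (`Sub-question N:` and `Final answer:` lines), the exact-match comparison, and the weights are all assumptions made for illustration; the abstract does not specify them.

```python
import re


def self_questioning_reward(output: str, gold: str,
                            w_subq: float = 0.5, w_ans: float = 1.0) -> float:
    """Score an output for containing sub-questions and a correct final answer."""
    # Hypothetical format: sub-questions appear on lines like "Sub-question 1: ..."
    has_subq = bool(re.search(r"^Sub-question\s*\d*:", output, flags=re.MULTILINE))
    # Hypothetical format: the answer appears on a line "Final answer: ..."
    match = re.search(r"^Final answer:\s*(.+)$", output, flags=re.MULTILINE)
    correct = match is not None and match.group(1).strip().lower() == gold.strip().lower()
    return w_subq * has_subq + w_ans * correct
```

Setting `w_subq` to zero recovers an answer-only reward, which corresponds to the standard RL baseline the abstract compares against.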

Proximal Policy Optimization for Amortized Discrete Sampling

摊销离散采样的近端策略优化

Authors: Anna Zykova-Myzina, Timofei Gritsaev, Daniil Tiapkin, Nikita Morozov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.15793
Pdf link: https://arxiv.org/pdf/2606.15793
Abstract This paper explores policy gradient algorithms for training stochastic policies to sample from structured discrete probability distributions under the Generative Flow Network (GFlowNet) framework. Building on extensive theoretical connections between GFlowNets and entropy-regularized reinforcement learning, we derive equivalents of standard policy gradient algorithms for training GFlowNets, as well as experimentally explore their various methodological aspects, including baseline training and advantage estimation. Most importantly, our work is the first to derive and successfully apply proximal policy optimization to GFlowNets, showing its improved convergence speed and data efficiency compared to standard GFlowNet training objectives on benchmarks ranging from synthetic energies to molecular graph generation.
中文摘要 本文探讨了在生成流网络（GFlowNet）框架下训练随机策略以从结构化离散概率分布中采样的策略梯度算法。基于GFlowNet与熵正则化强化学习之间的广泛理论联系，我们推导出了标准策略梯度算法用于训练GFlowNet的等价算法，并通过实验探讨了其各种方法学方面，包括基线训练和优势估计。最重要的是，我们的工作首次推导出并成功应用了GFlowNets的近端策略优化，展示了其在从合成能量到分子图生成等基准测试中，相较于标准GFlowNet训练目标，取得了更好的收敛速度和数据效率。

FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds

FlashNav：20秒内完成机器人导航的超快速策略训练

Authors: Shanze Wang, Yiwei Qian, Xinming Zhang, Jun Xue, Siwei Cheng, Xianghui Wang, Qingyuan Hu, Xiaoyu Shen, Wei Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.15846
Pdf link: https://arxiv.org/pdf/2606.15846
Abstract Deep reinforcement learning has shown strong potential for robot navigation, but its practical deployment is still limited by the long wall-clock cost of policy training. This paper presents FlashNav, a GPU-first framework for ultra-fast range-based robot navigation training. To the best of our knowledge, FlashNav is the first DRL-based robot navigation framework that reaches seconds-level policy training, with the fastest deployable policy trained in less than 20 seconds. The key idea is to align simulation with the navigation MDP: FlashNav preserves the essential components for velocity-level navigation, including occupancy geometry, range sensing, goal-conditioned control, robot motion dynamics, collision handling, termination, and reset, while removing unnecessary rendering and high-fidelity physical details from the training loop. Built on a batched bitmap simulator and a fully GPU-resident training pipeline with our FastDSAC learner, FlashNav generates massively parallel navigation transitions entirely on GPU. Experiments on TurtleBot2 and Unitree Go2 show that FlashNav achieves a 100% success rate in under 20 seconds on an RTX 5090 and remains within tens of seconds across desktop GPUs. The learned policies further transfer to physical wheeled and legged robots in static and dynamic indoor scenes, demonstrating that DRL-based navigation can be trained at seconds-level speed while preserving deployable obstacle-avoidance behavior.
中文摘要 深度强化学习在机器人导航中展现出强大潜力，但其实际部署仍受限于策略训练漫长的实际耗时（wall-clock）。本文介绍了FlashNav，一个GPU优先的框架，用于基于测距的机器人导航的超快速训练。据我们所知，FlashNav是首个达到秒级策略训练的基于DRL的机器人导航框架，其最快的可部署策略训练时间不到20秒。关键思想是让仿真与导航MDP对齐：FlashNav保留了速度级导航的必要组成部分，包括占据几何、测距感知、目标条件控制、机器人运动动力学、碰撞处理、终止和重置，同时从训练循环中去除不必要的渲染和高保真物理细节。FlashNav基于批处理位图模拟器和完全驻留于GPU的训练流水线（搭配我们的FastDSAC学习器），完全在GPU上生成大规模并行的导航转移样本。在TurtleBot2和Unitree Go2上的实验显示，FlashNav在RTX 5090上不到20秒即可达到100%的成功率，并且在各种桌面GPU上训练时间保持在数十秒以内。学到的策略还可迁移到静态和动态室内场景中的实体轮式和足式机器人，证明基于DRL的导航可以在保持可部署避障行为的同时以秒级速度完成训练。

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

STRIDE：通过判别性估计进行战略轨迹推理，实现可验证的强化学习

Authors: Qinjian Zhao, Zhihao Dou, Dinggen Zhang, Xiangyu Li, Chaoda Song, Zhongwei Wan, Xinpeng Li, Yanyan Zhang, Kaijie Chen, Qingtao Pan, Chengcheng Feng, Zhiqiang Gao, Xiaoyu Xia
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.15866
Pdf link: https://arxiv.org/pdf/2606.15866
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training paradigm for improving the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, providing sparse supervision and treating all tokens uniformly regardless of their actual contribution to reasoning. Although recent studies introduce intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones. To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each $n$-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大语言模型推理能力的有效后训练范式。然而，现有的RLVR方法通常依赖最终答案的正确性来分配轨迹级奖励，监督信号稀疏，并且对所有token一视同仁，而不考虑它们对推理的实际贡献。尽管近期研究引入了过程奖励、高熵token和语义不确定性等中间信号，但这些信号往往本身不可验证，且可能无法区分有益的策略模式与有害的策略模式。为解决这一局限，我们提出了STRIDE（Strategic Trajectory Reasoning with Discriminative Estimation），这是一种细粒度的RLVR框架，从可验证的结果中导出对策略性推理的监督。STRIDE对比每个回答组内的成功与失败轨迹，估计每个$n$-gram策略模式的结果判别偏好，并进一步将该信号与推理显著性熵结合，以识别与决策相关的策略模式。这些模式在强化学习优化过程中被赋予差异化的优势值，使信用分配更精确，同时保持RLVR的可验证性。大量实验表明，STRIDE在多种模型、任务和扩展设置（包括VLM和基于智能体的系统）中持续提升推理性能。
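
One plausible way to contrast successful and failed trajectories at the $n$-gram level is a smoothed log-odds score: count in how many successful and how many failed trajectories each $n$-gram appears, then compare the two rates. The scoring rule, the smoothing constant `alpha`, and the function names below are assumptions for illustration. The abstract does not give STRIDE's actual estimator, and the saliency-entropy term is omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discriminative_scores(successes, failures, n=2, alpha=1.0):
    """Smoothed log-odds of an n-gram appearing in successful vs failed trajectories."""
    pos = Counter(g for t in successes for g in set(ngrams(t, n)))
    neg = Counter(g for t in failures for g in set(ngrams(t, n)))
    scores = {}
    for g in set(pos) | set(neg):
        p = (pos[g] + alpha) / (len(successes) + 2 * alpha)
        q = (neg[g] + alpha) / (len(failures) + 2 * alpha)
        scores[g] = math.log(p / q)  # > 0: associated with success
    return scores
```

Positive scores mark patterns seen mostly in verified-correct trajectories, so a token-level advantage could in principle be scaled up or down by the score of the pattern that covers it.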

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

BALTO：平衡的令牌级策略优化以缓解幻觉

Authors: Ning Li, Zixuan Guo, Yan Xu, Wenbo Fei, Yifan Niu, Chang Luo, Yasheng Wang, Weiwen Liu, Yong Yu, Weinan Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.15893
Pdf link: https://arxiv.org/pdf/2606.15893
Abstract Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model--benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness--informativeness trade-off.
中文摘要 幻觉仍然是在知识密集型场景中部署大型语言模型（LLMs）的主要障碍，因为在这些场景中生成的回答必须忠实地基于所提供的证据。强化学习（RL）是缓解幻觉的一个有前景的方向，但响应级忠实度奖励存在粒度不匹配的问题：局部幻觉可能导致有证据支持的内容受到不应有的惩罚。尽管近期研究引入了细粒度反馈，如声明级验证和词元级奖励，但不平衡的信用分配仍可能引发长度、冗长或优化噪声方面的偏差。我们提出了BALTO，一种用于幻觉缓解的平衡词元级策略优化框架。BALTO提取可检查的事实性声明，将其与参考上下文进行核实，并将声明级判断投影为词元级标签。框架中引入了平衡的词元级信用分配机制。该设计将概率质量从无证据支持的内容重新分配到忠实内容，而不是抑制整个响应。我们从理论角度系统分析了响应级奖励的局限性，并证明了BALTO在幻觉缓解的训练稳定性和优化效率方面的优势。在ConFiQA、RAGTruth和FinLLM-Eval上的实验表明，BALTO在全部六种模型与基准组合设置中均达到最高的忠实度，并且在Q-Score上持续优于现有的后训练基线，展现出更好的忠实度与信息量权衡。
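
The step in the BALTO abstract that "projects claim-level judgments to token-level labels" can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the paper's implementation: the `Claim` type, the character-offset representation, the label values (+1 supported, -1 unsupported, 0 not covered by any claim), and the rule that an unsupported claim takes precedence on overlap. The balanced credit assignment itself is not specified in the abstract and is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    """Hypothetical verified claim: a character span plus a verifier verdict."""
    start: int       # inclusive character offset in the response
    end: int         # exclusive character offset in the response
    supported: bool  # True if the claim is supported by the reference context


def project_claims_to_tokens(
    token_spans: List[Tuple[int, int]], claims: List[Claim]
) -> List[int]:
    """Map claim verdicts onto tokens via character-span overlap.

    Returns one label per token: +1 (supported), -1 (unsupported), 0 (no claim).
    """
    labels = []
    for t_start, t_end in token_spans:
        label = 0
        for claim in claims:
            overlaps = t_start < claim.end and claim.start < t_end
            if not overlaps:
                continue
            if not claim.supported:
                label = -1  # assumed rule: unsupported wins on overlap
                break
            label = 1
        labels.append(label)
    return labels
```

Given such labels, a token-level objective could weight supported and unsupported tokens differently instead of applying a single response-level reward to every token.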

Reinforcement Learning for LLM-based Event Forecasting

基于LLM的事件预测的强化学习

Authors: Amit Arnold Levy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.15917
Pdf link: https://arxiv.org/pdf/2606.15917
Abstract We use Group Relative Policy Optimization (GRPO), a recently devised sample and memory efficient reinforcement learning method, to finetune pretrained LLMs in the range of 1.5B to 14B parameters equipped with the ability to get current information through the use of a Wikipedia revisions tool, or news summaries, to forecast real events beyond the knowledge cutoff of the LLM, as well as problems made to simulate different aspects of the dynamics of that training. We use the results of these experiments to comment on the scaling capability of LLMs for forecasting, as well as classify how judgmental forecasting fits into the verifiable/unverifiable domain taxonomy, considering the impact of the inherent aleatoric uncertainty when forecasting future events (e.g. the roll of a die). As a result of the GRPO training, we manage to bring a 1.5B parameter transformer (Qwen 2.5 1.5B) to forecasting performance superior to Claude Sonnet 3.5 over the same dataset as measured by cross entropy from the market agreed probabilities. We also discuss various dead ends on the path to this result.
中文摘要 我们使用组相对策略优化（GRPO）这一近期提出的样本和内存高效的强化学习方法，对参数规模在1.5B至14B之间的预训练LLM进行微调。这些模型能够通过维基百科修订工具或新闻摘要获取最新信息，用于预测超出LLM知识截止日期的真实事件，以及为模拟该训练动态的不同方面而设计的问题。我们利用这些实验结果评论LLM在预测任务上的扩展能力，并对判断式预测在可验证/不可验证领域分类中的位置进行归类，同时考虑预测未来事件（如掷骰子）时固有的偶然不确定性的影响。通过GRPO训练，我们使一个1.5B参数的Transformer（Qwen 2.5 1.5B）在同一数据集上的预测性能超过了Claude Sonnet 3.5，衡量指标为相对于市场公认概率的交叉熵。我们还讨论了通往这一结果的过程中遇到的各种死胡同。
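
The metric named in the abstract, cross entropy from the market agreed probabilities, can be illustrated with a short sketch. It assumes binary events, with `market_p` the market probability and `forecast_q` the model's forecast; the function names and the clipping constant `eps` are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import Iterable, Tuple


def binary_cross_entropy(market_p: float, forecast_q: float, eps: float = 1e-9) -> float:
    """Cross entropy H(p, q) for a binary event, in nats.

    The forecast is clipped away from 0 and 1 so the logarithms stay finite.
    """
    q = min(max(forecast_q, eps), 1.0 - eps)
    return -(market_p * math.log(q) + (1.0 - market_p) * math.log(1.0 - q))


def mean_cross_entropy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average cross entropy over (market_p, forecast_q) pairs."""
    pairs = list(pairs)
    return sum(binary_cross_entropy(p, q) for p, q in pairs) / len(pairs)
```

Lower is better: for a fixed market probability, the cross entropy is minimized when the forecast equals that probability, so a forecaster is rewarded for matching the market's calibration rather than for being confident.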

Energy-Efficient Arm Reaching for a Humanoid Robot via Deep Reinforcement Learning with Identified Power Models

基于深度强化学习与辨识功率模型的人形机器人节能手臂到达

Authors: Nestor N. Deniz, Simon Parsons, Fernando Auat Cheein
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.15918
Pdf link: https://arxiv.org/pdf/2606.15918
Abstract Humanoid robots performing in-field manipulation tasks, such as robotic apple harvesting, face severe energy constraints that directly limit the number of reaching motions that can be executed per battery charge. This paper presents an end-to-end, energy-aware reinforcement learning framework for the 7-degree-of-freedom left arm of the Unitree G1 humanoid robot, combining a physics-based, experimentally identified electrical power model with a Soft Actor-Critic (SAC) policy trained in a Pinocchio-based rigid-body dynamics simulator. The RL policy operates on an incremental joint-position action space and is trained with a Hybrid Constellation Reward that combines a four-point end-effector constellation distance with a torque-norm energy proxy; after $5\times10^6$ training it reaches a $69.9\%$ success rate over $1\,000$ random targets in kinematic simulation, at a mean energy of 98.16 J on successful episodes. Finally, on the physical Unitree G1, the policy is validated over three independent 10-target batches, achieving a mean energy of $71.5 \pm 48.3$ J, an end-effector position error of $2.64 \pm 1.04$ cm, and an orientation error of $6.92 \pm 1.33^\circ$, within the 4 cm/$8.6^\circ$ training tolerance. These results constitute a first step toward energy-aware reinforcement-learning-based arm reaching for humanoid robots.
中文摘要 执行田间操作任务（如机器人苹果采摘）的人形机器人面临严重的能量约束，这直接限制了每次电池充电可执行的到达动作次数。本文针对Unitree G1人形机器人的7自由度左臂提出了一个端到端的能量感知强化学习框架，将基于物理并经实验辨识的电功率模型，与在基于Pinocchio的刚体动力学仿真器中训练的Soft Actor-Critic（SAC）策略相结合。该强化学习策略在增量式关节位置动作空间上运行，并使用混合星座奖励（Hybrid Constellation Reward）进行训练，该奖励将四点末端执行器星座距离与力矩范数能量代理相结合；经过 $5\times10^6$ 训练后，在运动学仿真中对 $1\,000$ 个随机目标的成功率达到 $69.9\%$，成功回合的平均能量为 98.16 J。最后，在实体Unitree G1上，该策略在三个独立的10目标批次中得到验证，平均能量为 $71.5 \pm 48.3$ J，末端执行器位置误差为 $2.64 \pm 1.04$ cm，姿态误差为 $6.92 \pm 1.33^\circ$，均在 4 cm/$8.6^\circ$ 的训练容差范围内。这些结果是迈向基于能量感知强化学习的人形机器人手臂到达的第一步。

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

OmniOPSD：面向情感计算的理据特权同策略自蒸馏

Authors: Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.15920
Pdf link: https://arxiv.org/pdf/2606.15920
Abstract Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human--AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of $84.19$, and ablations further support the value of rationale-privileged teacher guidance.
中文摘要 多模态大型语言模型（MLLMs）的强化学习常常受到复杂推理任务中严重奖励稀疏的阻碍。这一挑战在涉及状态、情绪、意图和行为的以人为中心的场景中尤为突出，在这些场景中，异构的多模态信号和主观的人为因素使得高质量的思维链（CoT）标注既昂贵又难以获得。尽管许多多模态数据集提供了专家标注的真实标签，但直接使用这些标签进行监督微调可能助长多模态感知中的捷径学习，并且对安全关键的人机交互只能提供有限的透明度。为解决这些局限性，我们提出了OmniOPSD，一种理据特权的同策略自蒸馏框架，它将前沿模型生成的理据用作教师侧的特权证据，而非学生的模仿目标。OmniOPSD仅在训练时将前沿模型生成的证据感知理据作为本地教师的特权证据上下文。学生从原始多模态输入中采样自己的rollout，而拥有理据特权的教师则对相同的词元进行评分，并提供密集的词元级监督。因此，学生在自身的轨迹分布上学习，而不直接模仿前沿模型的补全，并且推理时无需标签、理据、CoT标注或闭源模型访问。在MER-UniBench上的实验表明，OmniOPSD实现了最先进的性能，平均得分为 $84.19$，消融实验进一步支持了理据特权教师指导的价值。

Artificial Intelligence for Power-Converter-Rich Electrical Systems: A Review

面向富含电力变换器的电力系统的人工智能：综述

Authors: Pengfeng Lin, Yuan Gao, Yuxi Tang, Muhammad Waqas Qaisar, Peifeng Hui, Chuanlin Zhang, Miao Zhu, Peng Wang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.15948
Pdf link: https://arxiv.org/pdf/2606.15948
Abstract Power-converter-rich electrical systems, formed by renewable generation, electrified transportation, and inverter-based resources, exhibit strongly nonlinear dynamics, multi-physics design tradeoffs, fast control requirements, and growing reliability and cybersecurity constraints. These characteristics strain workflows that rely only on physics-based modeling, sequential optimization, and rule-based operation. This paper reviews artificial intelligence (AI) for power-converter-rich electrical systems through a life-cycle and deployment-readiness perspective. The literature is organized across converter design, real-time control, system-level operation, and compliance-oriented governance. For design, we examine surrogate modeling, topology and parameter synthesis, EMI/EMC-aware optimization, reliability-oriented design, and knowledge-assisted workflows. For control, we compare supervised learning, reinforcement learning, learning-augmented predictive control, and safety-constrained learning according to their role in closed-loop implementation. For operations, we focus on microgrid coordination, forecasting, distribution-system observability, privacy-preserving coordination, and cyber-resilient operation where converter-interfaced resources shape the operating problem. Across these stages, the review emphasizes deployment-critical gaps, including stability certification, constraint satisfaction, interpretability, extrapolation, data efficiency, sim-to-real transfer, embedded latency, cybersecurity, privacy, and standards alignment. The resulting taxonomy is intended to clarify where AI is already useful as an engineering support tool and where further validation is needed before autonomous or safety-critical deployment.
中文摘要 由可再生能源发电、电气化交通和基于逆变器的资源构成的富含电力变换器的电力系统，表现出强非线性动力学、多物理场设计权衡、快速控制需求以及日益增长的可靠性和网络安全约束。这些特性使仅依赖基于物理的建模、顺序优化和基于规则运行的工作流程难以应对。本文从生命周期和部署就绪度的视角综述了面向富含电力变换器的电力系统的人工智能（AI）。文献按变换器设计、实时控制、系统级运行以及面向合规的治理进行组织。在设计方面，我们考察代理建模、拓扑与参数综合、EMI/EMC感知优化、面向可靠性的设计以及知识辅助工作流程。在控制方面，我们根据监督学习、强化学习、学习增强的预测控制和安全约束学习在闭环实现中的作用对其进行比较。在运行方面，我们关注微电网协调、预测、配电系统可观测性、隐私保护协调以及网络韧性运行，其中由变换器接入的资源决定了运行问题的形态。在这些阶段中，本综述强调了对部署至关重要的缺口，包括稳定性认证、约束满足、可解释性、外推、数据效率、仿真到现实的迁移、嵌入式延迟、网络安全、隐私和标准对齐。由此得到的分类体系旨在阐明AI在哪些方面已经可以作为工程支持工具发挥作用，以及在自主或安全关键部署之前哪些方面还需要进一步验证。

Thinking with Visual Grounding

以视觉为基础思考

Authors: Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.16122
Pdf link: https://arxiv.org/pdf/2606.16122
Abstract Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
中文摘要 视觉思维不应只是听起来正确，还应展示其证据。虽然近期的视觉语言模型（VLMs）能够生成自然语言推理轨迹，但这些轨迹往往没有明示其所依据的图像区域，因而难以验证，也难以监督。我们引入了视觉定位思维（visually grounded thinking），这是一种推理过程，模型将自然语言思考与每一步所用视觉证据的显式点或框定位交错进行。这使得模型能够用语言表达中间推理，同时将关键对象定位到其所指的图像区域。为了训练这种行为，我们构建了一个可扩展的合成流水线：蒸馏出正确的视觉推理轨迹，提取轨迹所需的视觉对象，用基于SAM3的智能体对其进行定位，并从所得掩码中导出对齐的点和框监督。我们进一步提出了定位感知强化学习，它将答案正确性奖励与密集的定位奖励相结合，后者用于评估生成的对象引用是否与正确的图像证据相符。在两个计数基准和四个空间推理基准上，为Gemma3-4B-IT加入视觉定位思维后，其性能持续优于原始模型和无定位思维基线。在空间推理方面，采用视觉定位思维的4B模型可与同一模型家族的Gemma3-27B-IT相当，在某些情况下甚至超过后者。我们的分析表明，点定位非常适合计数任务，而框定位在空间任务上从显式定位奖励中获益最多。总体来看，我们的结果表明，当VLM的中间思考与使其成立的图像区域相关联时，其思考效果更好。

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

VibeThinker-3B：探索小语言模型中可验证推理的前沿

Authors: Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.16140
Pdf link: https://arxiv.org/pdf/2606.16140
Abstract This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.
中文摘要 本技术报告介绍了VibeThinker-3B，这是一个拥有3B参数的紧凑稠密模型，旨在研究在严格的小模型范围内可验证推理能被推进到何种程度。基于Spectrum-to-Signal后训练范式，我们通过一条优化的流水线系统地增强模型，其中包括基于课程的监督微调、多领域强化学习和离线自蒸馏。实验评估表明，VibeThinker-3B在要求极高的可验证任务上实现了前沿水平的性能。具体来说，它在AIME26上得分为94.3（通过声明级测试时扩展提升至97.1），在LiveCodeBench v6上获得80.2的Pass@1，并且在近期未见过的LeetCode竞赛中取得96.1%的通过率，表现出很强的分布外泛化能力。这实际上将其置于第一梯队推理系统的性能区间内，能够匹敌甚至超越规模大几个数量级的旗舰模型，如DeepSeek V3.2、GLM-5和Gemini 3 Pro。此外，IFEval上93.4的得分证实了这种极致的推理增强并未损害严格的指令可控性。在我们此前1.5B工作的基础上，这些发现引出了参数化压缩-覆盖假说（Parametric Compression-Coverage Hypothesis），该假说认为可验证推理可以压缩为紧凑的推理核心，而开放领域知识和通用能力则需要对事实、概念和长尾场景进行广泛的参数覆盖。这一观点表明，紧凑模型不仅是部署效率更高的替代品，也是在参数密集的能力范畴中通往前沿性能的一条互补路径。

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

GRACE：基于上下文的忠实推理的步骤级基准

Authors: Hoang Pham, Dong Le, Anh Tuan Luu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.16151
Pdf link: https://arxiv.org/pdf/2606.16151
Abstract Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.
中文摘要 许多推理任务需要模型基于输入上下文进行推理，从基于文档的问答到基于规则的演绎。思维链（CoT）提示产生的轨迹看似透明，但即使最终答案正确，其中的个别步骤也可能悄然偏离原始证据。现有方法在响应层面检测幻觉，但无法识别失败发生在链条的哪个位置或属于何种类型。我们提出了GRACE，这是首个经人工标注、带有数据驱动错误分类体系的步骤级忠实度基准，面向基于上下文的文本推理。GRACE涵盖了10个模型在4个源数据集上的CoT轨迹，每一步都标注了忠实度、错误类别和自然语言解释。通过无监督聚类自下而上发现的数据驱动分类体系将失败分为两个轨道：GRACE-Inference（演绎错误）和GRACE-Grounding（事实依据错误），每个轨道各有四个类别。评估集经人工标注，并在设计上具有挑战性。我们的实验显示，现有模型仍有相当大的提升空间。此外，将步骤级忠实度信号整合进强化学习流程，可同时提高下游准确性和推理可靠性。

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

从梯度视角看RLVR稳定性与赢家优势策略优化

Authors: Prasanth YSS, Zhichen Ren, Rasa Hosseinzadeh, Ilan Gofman, Yuqi Chen, Zhaoyan Liu, Guangwei Yu, Jesse C. Cresswell, Satya Krishna Gorti
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16154
Pdf link: https://arxiv.org/pdf/2606.16154
Abstract Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）提升了语言模型的推理能力，但GRPO式优化仍容易崩溃。我们通过词元级梯度动态分析这种不稳定性，推导出一个分类体系，用于预测更新如何影响下一词元的概率和熵。该分类体系表明，稳定性同时取决于优势的符号和当前策略下的词元分布。基于这一发现，我们提出了赢家优势策略优化（WAPO），这是一种简单的在线裁剪策略梯度目标，仅在优势为正的补全上进行更新。在数学推理和多跳问答基准上，WAPO提升了训练稳定性，并在多个模型家族上达到或超过基线表现。完整代码可在此 https URL 中找到。
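
The core idea stated in the WAPO abstract, a clipped policy-gradient objective that updates only on positive-advantage completions, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's objective: advantages are GRPO-style group-normalized rewards, probability ratios are taken per completion rather than per token, and the average is over the positive-advantage completions only. All function names and the default `clip_eps` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: rewards standardized within one response group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def positive_only_clipped_objective(
    ratios: List[float], advantages: List[float], clip_eps: float = 0.2
) -> float:
    """Clipped surrogate averaged over completions with positive advantage.

    ratios[i] is pi_new(completion_i) / pi_old(completion_i).
    Completions with non-positive advantage contribute nothing.
    """
    terms = []
    for rho, adv in zip(ratios, advantages):
        if adv <= 0:
            continue  # skip non-winning completions entirely
        clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms) if terms else 0.0
```

When every completion in a group receives the same reward, all advantages are zero and the objective is zero, so that group produces no update.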

Binary Decompilation LLM with Feedback-Driven Multi-Turn Refinement

带有反馈驱动多轮精炼的二进制反编译大型语言模型

Authors: Peipei Liu, Jian Sun, Mingzhe Xing, Yicheng Zeng, Zhaoteng Yan, Lixiao Zhang, Li Chen, Dan Li
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.16162
Pdf link: https://arxiv.org/pdf/2606.16162
Abstract Binary decompilation is fundamental to security tasks such as vulnerability discovery, malware inspection, and executable-only program understanding. Recent LLM-based decompilation methods have shown promising results, but most still follow a single-turn generation paradigm: given assembly code or decompiler-produced pseudo-code, the model generates one output and stops. Consequently, the generated code may appear readable or even compile successfully, yet still deviate from the behavior of the original binary and mislead downstream analysis. This paper presents AutoDecompiler, a decompilation-specialized LLM trained with reinforcement learning for feedback-driven multi-turn binary decompilation. Instead of treating decompilation as one-shot code generation, AutoDecompiler formulates it as an iterative refinement process, where the model revises generated code based on compilation, execution, and input/output testing feedback. To enable this process, we design decompilation-specific rewards that capture code validity, recompilability, execution consistency, and semantic fidelity. We further construct stage-aware diagnostic feedback from compiler errors, execution failures, and failed test cases, and introduce progress-aware trajectory rewarding and turn-aware advantage reweighting to encourage beneficial revisions while suppressing regressions. We train the AutoDecompiler family and evaluate it across different input settings, model scales, and benchmarks. Experimental results show that AutoDecompiler consistently outperforms its single-turn counterparts under the same model size and input setting, achieving clear improvements in behavioral re-executability. These results demonstrate that learning to exploit program feedback with reinforcement learning is an effective direction for improving the functional correctness of LLM-based binary decompilation.
中文摘要 二进制反编译是漏洞发现、恶意软件检查和仅有可执行文件的程序理解等安全任务的基础。近期基于LLM的反编译方法显示出有希望的结果，但大多数仍遵循单轮生成范式：给定汇编代码或反编译器生成的伪代码，模型生成一个输出后即停止。因此，生成的代码可能看起来可读甚至能够成功编译，但仍偏离原始二进制的行为，从而误导下游分析。本文提出了AutoDecompiler，一种专用于反编译的LLM，通过强化学习训练，用于反馈驱动的多轮二进制反编译。AutoDecompiler不将反编译视为一次性代码生成，而是将其表述为一个迭代精炼过程，模型基于编译、执行和输入/输出测试反馈来修订生成的代码。为实现这一过程，我们设计了针对反编译的奖励，用于刻画代码有效性、可重编译性、执行一致性和语义保真度。我们进一步根据编译器错误、执行失败和未通过的测试用例构建阶段感知的诊断反馈，并引入进度感知的轨迹奖励和轮次感知的优势重加权，以鼓励有益的修订，同时抑制退步。我们训练了AutoDecompiler系列模型，并在不同输入设置、模型规模和基准上进行评估。实验结果显示，在相同模型规模和输入设置下，AutoDecompiler持续优于对应的单轮模型，在行为可重执行性上取得明显提升。这些结果表明，学习利用强化学习来利用程序反馈，是提升基于LLM的二进制反编译功能正确性的有效方向。

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

PACT：面向多轮工具使用智能体的特权轨迹协同训练

Authors: Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16215
Pdf link: https://arxiv.org/pdf/2606.16215
Abstract Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.
中文摘要 多轮工具使用智能体必须在多个交互轮次中进行推理、调用工具并根据观察结果做出调整。对此类智能体进行后训练具有挑战性：强化学习虽然与仅提示的推理设置一致，却常常受到奖励稀疏和信用分配薄弱的困扰；而在专家轨迹上进行监督微调虽能提供密集的过程监督，却可能将模型过度约束在固定轨迹上。为此，我们提出了PACT，一个面向多轮工具使用智能体的特权轨迹协同训练（Privileged trAce Co-Training）框架。其关键思想是仅将专家轨迹用作训练时的优化信号，而非rollout时的提示。PACT保持rollout生成仅依赖提示，然后利用专家轨迹通过两个互补信号引导优化：一个是轨迹条件的强化学习替代目标，它在专家轨迹上下文下评估仅依赖提示的rollout；另一个是组件感知的SFT损失，它以逐步退火的强度监督推理前缀和工具调用。为减少对仅在训练时可用的轨迹上下文的过度依赖，PACT进一步引入了仅提示锚定。我们还提供了一种潜在轨迹视角，它将两个基于轨迹的目标联系起来，并解释了专家轨迹如何在不用于rollout生成的情况下指导优化。在FTRL、BFCL和ToolHop上的实验表明，PACT相对于强大的基于SFT和RL的基线持续取得提升，凸显了特权轨迹协同训练对多轮工具使用学习的价值。

Graphical conditional generative modeling for digital twin modeling

数字孪生建模的图形条件生成建模

Authors: Zongren Zou, Théo Bourdais, Ricardo Baptista, Houman Owhadi
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2606.16219
Pdf link: https://arxiv.org/pdf/2606.16219
Abstract Digital twin modeling, including control and data assimilation under model uncertainty, often faces an open-ended fidelity problem: adding variables, data streams, and time scales can indefinitely increase model complexity, ultimately producing systems that are difficult to maintain, validate, interpret, and use for stress or safety testing. As an alternative, one can seek parsimonious stochastic surrogate models built only on the variables needed to describe the relevant quantities of interest. We introduce a framework for discovering such variables from observational data by identifying which candidate inputs influence the full conditional law of a target quantity, rather than only its conditional mean. This distinction is essential in stochastic, coarse-grained, or partially observed systems, where dependencies may appear through changes in variability, tail behavior, multimodality, or uncertainty rather than through deterministic functional relationships. The framework couples conditional generative modeling, which learns the conditional distribution of the target given candidate inputs, with Gaussian-process-based analysis of variance (through kernel mode decomposition), which enables iterative pruning of non-influential inputs and interpretable structure discovery. In control settings, the resulting surrogate can be interpreted as a learned Markov decision process: the method identifies not only a transition model, but also the state, action, and memory variables needed to make the learned dynamics effectively Markovian. Across examples involving stochastic dynamical systems, missing variables, PDE control, reinforcement learning, and economic data, the discovered structures yield interpretable stochastic surrogates whose downstream performance is comparable to models trained on the full variable set.
中文摘要 数字孪生建模（包括模型不确定性下的控制和数据同化）常常面临开放式的保真度问题：添加变量、数据流和时间尺度会无限增加模型复杂度，最终产生难以维护、验证、解释以及用于压力或安全测试的系统。作为替代方案，可以寻求仅建立在描述相关关注量所需变量之上的简约随机代理模型。我们引入了一个框架，通过识别哪些候选输入会影响目标量的完整条件分布律（而不仅仅是其条件均值），从观测数据中发现此类变量。这种区分在随机、粗粒化或部分观测的系统中至关重要，因为在这些系统中，依赖关系可能通过变异性、尾部行为、多峰性或不确定性的变化体现出来，而不是通过确定性的函数关系体现。该框架将条件生成建模（学习给定候选输入时目标的条件分布）与基于高斯过程的方差分析（通过核模态分解）相结合，从而能够迭代剪除无影响的输入并发现可解释的结构。在控制场景中，所得的代理模型可以被解释为一个学习得到的马尔可夫决策过程：该方法不仅识别转移模型，还识别使所学动力学实际上具有马尔可夫性所需的状态、动作和记忆变量。在涉及随机动力系统、缺失变量、偏微分方程控制、强化学习和经济数据的示例中，所发现的结构产生了可解释的随机代理模型，其下游性能可与在完整变量集上训练的模型相当。

Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

强化学习中泛化的进化双层奖励塑造

Authors: Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2606.16236
Pdf link: https://arxiv.org/pdf/2606.16236
Abstract Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.
中文摘要 强化学习（RL）在部署到与训练时不同的环境中时，常常出现性能下降。域随机化（DR）等现有技术可以缓解这一问题，但需要访问多样化的训练环境并具备完整的轨迹可观测性，而在只能获得标量性能指标的隐私保护或受限场景中，这些假设并不成立。我们提出了基于进化奖励塑形的泛化方法（GERS），这是一种双层优化方法，仅利用来自验证环境的标量反馈来提升在未见测试环境上的泛化能力。在下层，强化学习智能体在由上层塑形的奖励函数引导下，在一组有限的、可访问轨迹数据的训练环境上学习策略；在上层，CMA-ES优化奖励塑形参数，以最大化在无法访问轨迹的独立验证环境上的累计未塑形奖励。连续控制任务上的结果表明，GERS在未见测试环境上优于标准强化学习基线。GERS的性能与DR相当，尽管DR将GERS的训练环境和验证环境合并为一个需要轨迹访问的单一训练集，而GERS无法访问验证轨迹。这些结果证实，GERS在受限的数据访问约束下能有效增强泛化能力。

TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation

TopoRetarget：保持交互性的重定向，实现灵巧操作

Authors: Jielin Wu, Shenzhe Yao, Guanqi He, Xiaohan Liu, Zhaoqing Zeng, Xiangrui Jiang, Han Yang, Wentao Zhang, Hang Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16272
Pdf link: https://arxiv.org/pdf/2606.16272
Abstract Human hand-object demonstrations provide dense reference motions for training dexterous manipulation reinforcement learning (RL) policies through reference tracking. However, to use such demonstrations for RL policy learning, retargeting must preserve hand pose and task-relevant hand-object contact structure. Otherwise, contact and feasibility artifacts can degrade downstream RL policy performance. We introduce TopoRetarget, an interaction-preserving retargeting framework that uses a single set of parameters across diverse retargeting conditions while maintaining task-relevant hand-object interaction and adapting human demonstrations to dexterous robot hands. The method constructs a sparse interaction graph over hand and object keypoints and optimizes distance-weighted Laplacian deformation with directional consistency, kinematic constraints, and penetration handling. Evaluations show that the generated references improve both interaction fidelity and policy learning: TopoRetarget achieves the best contact precision and alignment over all baselines on the ContactPose Dataset, improves Pen-Spin training success by 40.6 percentage points over the existing baseline methods, and enables zero-shot transfer to Wuji Hand hardware on cube reorientation and pen spinning.
中文摘要 人类手-物体演示为通过参考跟踪训练灵巧操作强化学习（RL）策略提供了密集的参考动作。然而，要将此类演示用于强化学习策略学习，重定向必须保留手部姿态以及与任务相关的手-物体接触结构。否则，接触和可行性方面的伪影可能会降低下游强化学习策略的性能。我们提出了TopoRetarget，一种保持交互的重定向框架，它在多种重定向条件下使用同一组参数，同时保持与任务相关的手-物体交互，并将人类演示适配到灵巧机器人手。该方法在手和物体关键点上构建稀疏交互图，并在方向一致性、运动学约束和穿透处理下优化距离加权的拉普拉斯变形。评估显示，生成的参考同时提升了交互保真度和策略学习效果：TopoRetarget在ContactPose数据集上取得了优于所有基线的接触精度和对齐效果，使Pen-Spin训练成功率比现有基线方法高出40.6个百分点，并能在立方体重定向和转笔任务上零样本迁移到Wuji Hand硬件。

An Adjoint-based Neural Regulator for Real-Time Optimal Control with State Constraints

基于伴随的神经调控器，用于带状态约束的实时最优控制

Authors: Isaiah A. Agboola, Yuxin Tong, Uduak Inyang-Udoh
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.16303
Pdf link: https://arxiv.org/pdf/2606.16303
Abstract This paper introduces a learning-based control framework for real-time constrained optimal control of nonlinear systems with safety guarantees based on the Pontryagin's Minimum Principle. The approach learns a neural co-state (adjoint) policy that encodes optimality through the system Hamiltonian, rather than directly approximating a control law. Feasibility is enforced separately at runtime through an efficient convex projection that incorporates actuator limits and safety constraints expressed as control barrier functions. We refer to this framework as an adjoint-based neural regulator (ANR) as it yields a controller that satisfies constraints while retaining the optimality structure encoded by the learned adjoint. We demonstrate the effectiveness of the proposed framework on nonlinear constrained control tasks using a unicycle model. The ANR achieves performance at par with nonlinear model predictive control at more than two orders of magnitude lower computational cost, while exhibiting near-invariant performance across unseen scenarios, thus, significantly outperforming reinforcement learning methods in out-of-training-distribution regimes.
中文摘要 本文基于庞特里亚金最小值原理，提出了一种基于学习的控制框架，用于具有安全保证的非线性系统实时约束最优控制。该方法学习一种神经协态（伴随）策略，通过系统哈密顿量编码最优性，而不是直接逼近控制律。可行性在运行时通过高效的凸投影单独保证，该投影纳入了执行器限幅以及以控制障碍函数表示的安全约束。我们将该框架称为基于伴随的神经调控器（ANR），因为它得到的控制器既满足约束，又保留了由所学伴随编码的最优性结构。我们使用独轮车模型在非线性约束控制任务上展示了所提框架的有效性。ANR以低两个数量级以上的计算成本实现了与非线性模型预测控制相当的性能，同时在未见场景中表现出近乎不变的性能，因此在训练分布之外的情形下显著优于强化学习方法。
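
The two-stage structure described in the ANR abstract, a learned co-state that encodes optimality through the Hamiltonian followed by a runtime convex projection, can be written out in a textbook form. The expressions below are a sketch under assumptions not stated in the abstract: control-affine dynamics $\dot{x} = f(x) + g(x)u$, a running cost $\ell$, a learned co-state $\lambda_\theta$, a single barrier function $h$ with class-$\mathcal{K}$ function $\alpha$, and a Euclidean projection. The paper's actual formulation may differ.

```latex
\begin{align}
  % Hamiltonian of the optimal control problem (control-affine dynamics assumed)
  H(x, u, \lambda) &= \ell(x, u) + \lambda^{\top}\bigl(f(x) + g(x)\,u\bigr), \\
  % Nominal control recovered from the learned co-state \lambda_\theta
  u_{\mathrm{nom}}(x) &= \arg\min_{u} \; H\bigl(x, u, \lambda_{\theta}(x)\bigr), \\
  % Runtime projection onto actuator limits and the control-barrier-function constraint
  u^{\star}(x) &= \arg\min_{u} \; \tfrac{1}{2}\,\bigl\lVert u - u_{\mathrm{nom}}(x) \bigr\rVert^{2} \\
  \text{s.t.}\quad & u_{\min} \le u \le u_{\max}, \qquad
  L_f h(x) + L_g h(x)\,u + \alpha\bigl(h(x)\bigr) \ge 0. \nonumber
\end{align}
```

Under the control-affine assumption the constraints are affine in $u$, so the projection is a small quadratic program, which is consistent with the abstract's description of an efficient convex projection at runtime.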

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

RL-Index：用于检索索引推理的强化学习

Authors: Yongjia Lei, Nedim Lipka, Zhisheng Qi, Utkarsh Sahu, Koustava Goswami, Franck Dernoncourt, Ryan A. Rossi, Yu Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16316
Pdf link: https://arxiv.org/pdf/2606.16316
Abstract Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems relying on the same theorem or coding requiring deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to perform reasoning over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of performing reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, enabling direct optimization of indexing decisions for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance, while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy across different retrieval systems.
中文摘要 检索外部知识对于解决现实任务至关重要，但当查询与其相关知识之间的关系涉及超越表层语义或词汇匹配的隐性且复杂的推理时（例如依赖同一定理或编码的数学问题，需要深度推理），检索依然具有挑战性。现有方法主要依赖查询侧推理（例如查询重写），这带来了显著的在线延迟，且未能充分利用对知识语料库本身进行推理的机会（即索引侧推理）。本文提出了RL-Index，一种代理索引框架，将检索索引推理提出为强化学习问题。RL-Index 不再在查询时进行推理，而是通过增强文档中 LLM 生成的理由，明确编码潜在的查询-知识关系，将推理推进到索引阶段。为了优化这些理由的质量，我们采用了群体相对策略优化（Group Relative Policy Optimization，GRPO），并将检索相似性作为可验证的奖励信号，从而实现索引决策的直接优化，从而提高检索效果。BRIGHT基准测试的大量实验表明，RL-Index持续提升检索和下游问答性能，同时显著降低在线推理延迟。此外，所学到的理据增强可推广到不同的检索器和生成器，凸显其作为跨不同检索系统即插即用的索引策略的稳健性。
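As a rough illustration of the reward signal described above, the sketch below scores several candidate rationales for one document by the cosine similarity between a query embedding and the embedding of the rationale-augmented document, then converts the rewards into group-relative advantages in the GRPO style (reward minus group mean, divided by group standard deviation). The embeddings are hypothetical toy vectors. The paper's actual reward definition and encoder are not specified in the abstract.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical embeddings: one query and four rationale-augmented documents.
query = [1.0, 0.0]
augmented_docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]

rewards = [cosine_similarity(query, d) for d in augmented_docs]
advantages = group_relative_advantages(rewards)
```

Rationales that move the document closer to the query receive positive advantages, and those that move it away receive negative ones.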

Diffusion Offline Reinforcement Learning for Fair and Energy-Efficient UAV-Assisted Wireless Networks

扩散离线强化学习，实现公平且节能的无人机辅助无线网络

Authors: Eslam Eldeeb, Hirley Alves
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16331
Pdf link: https://arxiv.org/pdf/2606.16331
Abstract The integration of generative artificial intelligence with wireless communication and signal processing systems has opened new avenues for intelligent, data-driven decision-making in future 6G networks. This work proposes a diffusion soft actor-critic (Diffusion-SAC) approach that leverages offline reinforcement learning (RL) enhanced by denoising diffusion probabilistic models (DDPMs) to optimize trajectory and scheduling control in unmanned aerial vehicle (UAV) networks. While offline RL methods, such as conservative Q-learning (CQL), can learn from static datasets, they often struggle to generalize in low-data or dynamic conditions. To address this, we combine the robustness of CQL with the generative power of diffusion models, enabling expressive and signal-aware policy learning that generalizes beyond behavior policies. Applied to a UAV-assisted wireless network, the proposed framework minimizes transmission energy and improves fairness among devices. Simulations show that Diffusion-SAC outperforms standard offline RL baselines, achieving more stable convergence and higher rewards even with limited datasets. The method enhances data efficiency, reduces energy consumption, and increases throughput by more than 35 % compared to existing algorithms, demonstrating its potential for robust policy learning in next-generation wireless control systems.
中文摘要 生成式人工智能与无线通信和信号处理系统的整合，为未来6G网络中智能、数据驱动的决策打开了新途径。本研究提出了一种扩散软演员-批判（Diffusion-SAC）方法，利用离线强化学习（RL），通过去噪扩散概率模型（DDPM）优化无人机（UAV）网络中的轨迹和调度控制。虽然离线强化学习方法，如保守Q学习（CQL），可以从静态数据集中学习，但在低数据或动态条件下常难以泛化。为此，我们将CQL的鲁棒性与扩散模型的生成能力相结合，实现了超越行为政策的表达性和信号感知型策略学习。应用于无人机辅助无线网络时，该框架最大限度地减少传输能量，提升设备间的公平性。模拟显示，Diffusion-SAC优于标准离线强化学习基线，即使数据集有限，也能实现更稳定的收敛和更高的奖励。该方法提升了数据效率，降低了能耗，并使吞吐量比现有算法提升了35%以上，展示了其在下一代无线控制系统中稳健的策略学习潜力。

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

PathRouter：在代理图检索增强生成中，如何将奖励与检索质量对齐

Authors: Bo Wang, Heyan Huang, Yaolin Li, Wei Tang, Yuan Zhang, Wenbo Li, Mingze Gao, Ge Shi, Chong Feng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.16409
Pdf link: https://arxiv.org/pdf/2606.16409
Abstract Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.
中文摘要 代理GraphRAG训练语言模型代理迭代检索和推理图结构证据，通过高效导航复杂信息网络实现更准确和上下文感知的决策。然而，仅结果强化学习存在“答案路径奖励别名”的问题，正确答案可能来自捷径而非有用的证据路径。它还表现出“搜索-更新歧义”，因为标量轨迹级反馈并未指示应调整哪些检索动作。为弥补这些不足，我们推出了PathRouter，一个针对代理型GraphRAG的路径感知训练框架。PathRouter 联合评估每个轨迹的答案正确性和证据路径重叠，生成四个轨迹类别，采用差异化的 GRPO 优势尺度，抑制捷径强化，同时保持证据寻求行为。对于证据不足的轨迹，一位冻结的金证据教师会提供基于标记级别的逻辑分析和查询标记的指导，排除答案标记以避免直接响应模仿。在六个QA基准测试中，跨越三种模型规模的实验显示，PathRouter在回答F1和证据路径重叠方面持续提升，3B模型的平均F1提升为3.1，7B模型为4.9，相较于强基线。
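The four-way split by answer correctness and evidence-path overlap can be sketched as a lookup that rescales a trajectory's GRPO advantage. The category names, the overlap threshold, and every scale factor below are hypothetical placeholders chosen only to show the mechanism. The abstract does not give the actual values or the exact scaling rule.

```python
# Hypothetical scale factors per trajectory category (illustration only).
SCALES = {
    ("correct", "grounded"): 1.0,    # right answer via useful evidence
    ("correct", "shortcut"): 0.3,    # right answer, little evidence: damp the reward
    ("wrong", "grounded"): 0.5,      # wrong answer, good evidence: soften the penalty
    ("wrong", "ungrounded"): 1.0,    # wrong answer, little evidence
}

def categorize(answer_correct, evidence_overlap, threshold=0.5):
    """Map a trajectory to one of four categories (threshold is an assumption)."""
    answer = "correct" if answer_correct else "wrong"
    if evidence_overlap >= threshold:
        path = "grounded"
    else:
        path = "shortcut" if answer_correct else "ungrounded"
    return (answer, path)

def scaled_advantage(advantage, answer_correct, evidence_overlap):
    """Rescale a raw advantage by the factor of the trajectory's category."""
    return SCALES[categorize(answer_correct, evidence_overlap)] * advantage
```

With these placeholder factors, a correct answer reached with low evidence overlap is reinforced less than one reached through the evidence path, and a wrong answer that still found good evidence is penalized less than one that did not.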

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

HOLO-MPPI：通过层级策略优化实现多场景运动规划

Authors: Youngjae Min, Jovin D'sa, Faizan M. Tariq, David Isele, Navid Azizan, Sangjae Bae
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.16480
Pdf link: https://arxiv.org/pdf/2606.16480
Abstract Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
中文摘要 部署在现实世界中的机器人必须在不同场景中规划动作，且不能逐场景重新调谐。端到端强化学习（RL）可以在不同场景中推广，但在分布转移、奖励错误指定和随机交互作用下常常变得脆弱。模型预测路径积分（MPPI）控制能够实现无需梯度的强实时细化，但其性能依赖于一个良好形状的采样先验，而手动设计先验无法扩展到多场景部署。我们介绍HOLO-MPPI（高层离线，低层在线MPPI），这是一种多场景运动规划框架，结合了高层次政策学习与低层次随机最优控制。线下，我们学习一项高层次政策，提出在抽象行动空间中情景稳健的计划，并结合一个学习过的世界模型进行在线推广。在线上，该策略作为一个基于数据的先验生成器，参数化基于当前观测和目标的MPPI抽样分布。MPPI随后实时优化低级别控制序列，以适应局部扰动。我们通过设计高效的高层次动作空间和定制模型架构，实现了HOLO-MPPI在自动驾驶中的应用。我们在多种驾驶场景下的评估显示，HOLO-MPPI在保持实时控制的同时，优于MPPI和端到端强化学习基线。
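The idea of optimizing control sequences around a prior can be illustrated with a single generic MPPI update: sample perturbations around the prior mean, weight each sample by the exponential of its negative cost, and take the weighted average. This is a textbook-style sketch, not the HOLO-MPPI implementation. The zero prior stands in for the output of a learned high-level policy, and the quadratic cost, sample count, noise scale, and temperature are all hypothetical.

```python
import numpy as np

def mppi_update(prior_mean, cost_fn, num_samples=256, sigma=0.5,
                temperature=1.0, rng=None):
    """One MPPI step: importance-weighted average of samples around a prior."""
    rng = np.random.default_rng() if rng is None else rng
    prior_mean = np.asarray(prior_mean, dtype=float)
    noise = rng.normal(0.0, sigma, size=(num_samples,) + prior_mean.shape)
    samples = prior_mean + noise
    costs = np.array([cost_fn(u) for u in samples])
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    # Weighted average over the sample axis keeps the control-sequence shape.
    return np.tensordot(weights, samples, axes=1)

# Hypothetical setup: horizon of 5 steps, 1-D control, target control value 1.0.
rng = np.random.default_rng(0)
prior_mean = np.zeros((5, 1))
cost = lambda u: float(np.sum((u - 1.0) ** 2))
refined = mppi_update(prior_mean, cost, rng=rng)
```

A better-shaped prior places more samples in low-cost regions, which is why the quality of the sampling distribution matters for this kind of gradient-free refinement.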

BRICKS-WM: Building Reusability via Interface Composition Kinetics for Structured World Models

BRICKS-WM：通过界面组合动力学构建结构化世界模型的可重用性

Authors: Shaowei Zhang, Jiahan Cao, Xunlan Zhou, Shenghua Wan, De-Chuan Zhan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16489
Pdf link: https://arxiv.org/pdf/2606.16489
Abstract Model-based Reinforcement Learning (MBRL) has achieved remarkable success in continuous control by leveraging latent world models. However, prevailing approaches typically rely on monolithic latent dynamics, entangling environment dynamics into a coupled process. This coupling severely limits reusability: altering the agent necessitates retraining the entire world from scratch, even if the environment remains constant. To address this, we introduce BRICKS-WM (Building Reusability via Interface Composition Kinetics for Structured World Models), a framework for the modular assembly of structured world models. Driven by the insight that the physical world is composed of independent entities, we posit that global dynamics can be modeled as a composition of distinct dynamical modules interacting via latent interfaces. As a minimal instantiation, we factorize the latent state space into an actuated Agent module and an external Background module, bridged by a learned latent interface. Unlike prior object-centric methods that prioritize visual segmentation, BRICKS-WM enforces a functional separation in transition dynamics, ensuring that background dynamics remains agnostic to the agent's dynamics. Empirically, BRICKS-WM achieves control performance comparable to strong monolithic baselines when trained from scratch, and enables the reuse of frozen background dynamics across agents.
中文摘要 基于模型的强化学习（MBRL）通过利用潜在世界模型，在持续控制方面取得了显著成功。然而，主流方法通常依赖单一潜在动力学，将环境动力学纠缠成耦合过程。这种耦合严重限制了重用性：改变智能体需要从头重新训练整个世界，即使环境保持不变。为此，我们引入了BRICKS-WM（结构化世界模型通过接口合成动力学构建可重用性），这是一个用于结构化世界模型模块化组装的框架。基于物理世界由独立实体组成的洞见，我们假设全局动力学可以被建模为通过潜在界面相互作用的不同动力模块的组合。作为最小实例化，我们将潜态空间分解为一个驱动的代理模块和一个外部背景模块，并通过学习的潜在接口桥接。与以往优先处理视觉分割的以对象为中心的方法不同，BRICKS-WM 在过渡动态中强制实现功能分离，确保背景动态与智能体动态保持中立。从经验上看，BRICKS-WM在从零训练时实现了与强单体基线相当的控制性能，并支持在各代理间重用冻结的背景动态。

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

daVinci内核：通过强化学习共进化技能选择、总结与利用以实现GPU内核优化

Authors: Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.16497
Pdf link: https://arxiv.org/pdf/2606.16497
Abstract GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, this http URL-14B.
中文摘要 GPU内核优化代表了一种范式，假设功能正确性，执行效率为目标。我们介绍daVinci内核，这是一个强化学习框架，通过动态演进的技能库将技能发现与技能利用结合起来。daVinci内核联合训练三个共享一个LLM骨干的代理：一个技能选择代理，通过BM25和LLM重新排序获取相关技术;一个策略代理，基于所选技能生成多回合CUDA/Triton内核;以及一个技能汇总代理，将成功推出的技能提炼成可复用技能。候选技能只有在基于执行的验证确认可重复的加速后才会添加。三位代理共用单一LLM骨干，通过结构化SFT冷启动对多样性过滤数据进行初始化，随后通过多回合REINFORCE和每代理优势估计实现端到端联合优化。在KernelBench上，daVinci-kernel-14B在1级、2级和3级（Fast$_1$阈值下）分别达到37.2%、70.6%和32.2%，超过了之前最强的强化学习训练模型——这个http URL-14B。

How Post-Training Shapes Biological Reasoning Models

后训练如何塑造生物推理模型

Authors: Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2606.16517
Pdf link: https://arxiv.org/pdf/2606.16517
Abstract Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.
中文摘要 生物学的科学推理模型将语言模型与基于多模态生物数据训练的基础模型结合，包括DNA、RNA和蛋白质。这些模型是通过后期训练构建的，但每个阶段如何塑造推理和泛化仍然理解不足。我们研究后培训何时提升表现，何时导致过度专业化。在基因组学、转录组学和蛋白质领域，我们训练并评估了100多个生物推理模型，采用主链、持续预训练（CPT）、监督微调（SFT）和强化学习（RL）等受控变异，测量域内（ID）和域外（OOD）表现。我们发现每个训练后阶段都会以不同方式重塑泛化，而非均匀贡献。CPT通过将模型与生物语言对齐，提升下游性能。SFT持续提升ID性能，但会导致OOD性能在早期达到峰值，随着模型拟合训练分布而下降。当强SFT检查点与奖励对齐时，强化学习能提升OOD表现并部分恢复泛化。这些结果表明，生物推理不会单调地通过额外的监督或计算提升。相反，性能取决于训练阶段的组成方式。在固定的培训后预算下，最强的ID-OOD权衡来自短暂的SFT、更大的RL分配以及跨阶段的非对称适应能力。

Incentives and Evidence in Learned Service Orchestration

学习服务编排中的激励与证据

Authors: Syed Izhan Khilji, Alireza Furutanpey, Schahram Dustdar
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16555
Pdf link: https://arxiv.org/pdf/2606.16555
Abstract Reinforcement learning for service orchestration has been the subject of sustained research for over a decade, yet it is not used in production at scale. The usual explanation is that learned controllers degrade under delayed and noisy telemetry, workload shifts, and uncontrolled tenants. We test whether existing evidence supports that explanation. We evaluate three highly influential RL-based orchestration systems spanning resource allocation, DAG scheduling, and autoscaling, using pre-registered predictions about comparative degradation under production-relevant perturbations and paired inference with family-wise error correction. Across the tests, most predicted performance reversals do not occur. Diagnostic analyses show that these outcomes often reflect comparator collapse, artefact limitations, or evaluation choices rather than evidence that learned controllers tolerate the perturbations. One apparent advantage under observation lag is roughly fortyfold compared to a Kubernetes HPA-equivalent controller. Another widely cited result cannot be reconstructed from its released artefact, and the strongest reproducible margin is far smaller than the published results. Conclusions also reverse under changes in perturbation magnitude and evaluation mode. Based on these results and broader patterns in the literature, we identify an institutional problem. Publication and review incentives favour benchmark gains against convenient comparators, even when those gains provide little evidence of deployment performance. We argue that the problem is not solely technical. Rather, it is institutional, so learned orchestration needs production-grade comparators, registered perturbation models, separate operational metrics, and publication criteria that reward reproducible operational evidence. Without these changes, the literature can grow without establishing whether learning improves orchestration.
中文摘要 用于服务编排的强化学习已经持续研究十多年，但它并未大规模投入生产。通常的解释是，学习过的控制器在延迟和噪音的遥测、工作负载转移以及不受控制的租户下会性能下降。我们测试现有证据是否支持这一解释。我们评估了三种极具影响力的基于强化学习的编排系统，涵盖资源分配、DAG调度和自扩展，利用预注册的在生产相关扰动下的比较退化预测，并结合家族级错误更正进行配对推断。在所有测试中，大多数预测的性能逆转都不会发生。诊断分析显示，这些结果往往反映了比较对象坍缩、伪影限制或评估选择，而非学习后的控制者容忍扰动的证据。观察滞后的一个明显优势大约是 Kubernetes 等效 HPA 控制器的四十倍。另一个广泛引用的结果无法从其发布的伪影中重建，且最强的可复现边际远小于已发表的结果。在扰动强度和评估模式变化下，结论也会反转。基于这些结果和文献中的更广泛模式，我们识别出一个制度性问题。发表和综述激励有利于基准优势，尽管这些提升几乎无法证明部署性能。我们认为问题不仅仅是技术层面。相反，它是制度性质的，因此学习式编排需要生产级比较器、注册的微扰模型、独立的操作指标，以及奖励可重复操作证据的发表标准。如果没有这些变化，文献可能会不断增长，却无法确定学习是否能改善配器。
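The abstract mentions paired inference with family-wise error correction but does not name the procedure. As one common choice, the Holm-Bonferroni step-down method can be sketched as follows. The p-values are made-up numbers for illustration and have no connection to the paper's results.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis under Holm's procedure."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject

# Hypothetical p-values from four paired comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Here the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 fails against 0.025, so the remaining hypotheses are not rejected.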

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

ROSA-RL：不确定性感知环岛优化速度咨询与强化学习

Authors: Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.16558
Pdf link: https://arxiv.org/pdf/2606.16558
Abstract Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is linked from the arXiv abstract page.
中文摘要 环岛在混合交通中挑战自动驾驶，因为人类行为异质且非确定性、未知的驾驶意图以及高交互复杂性，导致冲突区是否会被封锁或开放的不确定性。我们呈现ROSA-RL——不确定性感知的环岛优化速度咨询与强化学习。它通过概率冲突预测，使自动驾驶和人力驾驶车辆在混合交通中安全高效地进入环岛。基于Transformer的模型预测冲突区在五秒内的占有情况，捕捉多智能体互动，预测即将发生的冲突和可用空缺。预测输出编码未来运动和意图的不确定性，增强了经典强化学习框架的状态，实现了不确定性感知的速度协调。通过基于真实世界数据的模拟评估，ROSA-RL能够有效处理不确定性，并优于类似的基于模型的基线，在完全确定占用率的情况下缩小与理想设置的差距，同时提升交通效率和安全。本作品的源代码可通过以下 http URL 获取。

Steering Generative Reinforcement Learning into Stable Robotic Controller

引导生成强化学习进入稳定机器人控制器

Authors: Yixuan Wang, Shutong Ding, Ke Hu, Tianxiang Gui, Jingya Wang, Ye Shi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16572
Pdf link: https://arxiv.org/pdf/2606.16572
Abstract Diffusion and flow-based generative policies provide a powerful policy class for reinforcement learning by inducing rich stochastic exploration through iterative action generation. However, the stochasticity of diffusion policies is not suitable for stable and precise control in high-dimensional robotic systems, where small action variations can accumulate into inconsistent motion and reduced robustness. To address this issue, we propose SteerGenPO, a latent-space reinforcement learning framework that steers a trained generative policy into a robust deterministic robotic controller. The key idea is to replace stochastic latent sampling of the trained generative policy with a learned latent actor that predicts a state-dependent latent input for the generative policies. This separates exploration and control: stochastic generative sampling provides diverse action proposals during policy learning, while deterministic latent steering provides stable and adaptive control at deployment. We evaluate SteerGenPO on six Isaac Lab benchmarks and a Unitree G1 locomotion task. The results show SteerGenPO improves over both classical RL and generative RL baselines, while its deterministic latent steering produces more stable inference-time behaviors and more reliable command responses.
中文摘要 扩散和基于流的生成策略通过迭代动作生成引入丰富的随机探索，为强化学习提供了强大的策略类。然而，扩散策略的随机性不适合在高维机器人系统中稳定且精确的控制，因为微小的动作变化可能累积导致运动不一致和稳健性降低。为解决这一问题，我们提出了SteerGenPO，一种潜空间强化学习框架，将训练有素的生成策略引导为稳健的确定性机器人控制器。关键思想是用一个学习过的潜在行为者来替代训练有素生成策略的随机潜在抽样，该行为者预测生成策略的状态依赖潜在输入。这区分了探索与控制：随机生成抽样在政策学习过程中提供了多样化的行动建议，而确定性潜在引导则在部署时提供稳定且自适应的控制。我们基于六个Isaac实验室基准测试和一个Unitree G1运动任务评估了SteerGenPO。结果显示，SteerGenPO相较于经典强化学习和生成式强化学习基线均有改进，其确定性潜在引导则实现了更稳定的推理时间行为和更可靠的指令响应。

Infant Spontaneous Movement Noise Improves Exploration in Deep RL

婴儿自发运动噪音改善深层强化学习的探索能力

Authors: Francisco M. López, Markus R. Ernst, Francisco Cruz, Matej Hoffmann, Jochen Triesch
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2606.16590
Pdf link: https://arxiv.org/pdf/2606.16590
Abstract Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is linked from the arXiv abstract page.
中文摘要 深度强化学习（RL）中的探索通常以时间无关白噪声的形式实现。然而，近期研究表明，时间相关的彩色噪声可以通过产生更均匀的轨迹和更广泛的状态空间覆盖来提高探索效率。我们探究由婴儿自发动作激发的动作噪声是否也能改善深层强化学习的探索。我们发现婴儿末端执行器速度的功率谱密度遵循一种有色噪声过程，其中谱指数随年龄增长而增加。受这一发展模式启发，我们引入了一种机制，逐步增加强化学习训练中探索噪声的时间自相关性，使婴儿统计数据相匹配。在多种强化学习环境中的实验表明，婴儿启发的噪音产生结构化的探索行为，并能提升学习效率，相较于传统探索策略。这些发现表明，人类运动和认知发展可以为设计人工代理的学习机制提供有用指导。我们的代码可在此 https URL 访问。
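A minimal sketch of the mechanism described above, under stated assumptions: colored noise with power spectral density proportional to 1/f^beta is generated by shaping the spectrum of white noise, and the spectral exponent beta is raised over the course of training. The linear schedule and the range from 0 to 2 are hypothetical. The paper fits its schedule to infant movement statistics, which are not given in the abstract.

```python
import numpy as np

def colored_noise(beta, n, rng):
    """Sample n points of noise whose power spectrum falls off as 1/f**beta."""
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)
    scale = np.zeros_like(freqs)          # the zero-frequency (DC) bin is dropped
    scale[1:] = freqs[1:] ** (-beta / 2.0)  # amplitude scales as f**(-beta/2)
    shaped = np.fft.irfft(spectrum * scale, n=n)
    return shaped / shaped.std()

def beta_schedule(step, total_steps, beta_start=0.0, beta_end=2.0):
    """Hypothetical linear increase of the spectral exponent during training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)

def lag1_autocorr(x):
    """Lag-1 autocorrelation, a simple measure of temporal smoothness."""
    x = x - x.mean()
    return float((x[:-1] @ x[1:]) / (x @ x))

rng = np.random.default_rng(0)
early_noise = colored_noise(beta_schedule(0, 100), 4096, rng)    # white noise
late_noise = colored_noise(beta_schedule(100, 100), 4096, rng)   # strongly correlated
```

Early in training the noise is close to white, and by the end of the schedule consecutive samples are strongly correlated, which yields smoother exploratory action sequences.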

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

DifferAD-R1：一种基于多模态大语言模型的差导工业异常定位

Authors: Dingrong Wang, Xian Tao, Zhen Qu, Hengliang Luo, Xinyi Gong, Fei Shen, Zhengtao Zhang, Guiguang Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.16601
Pdf link: https://arxiv.org/pdf/2606.16601
Abstract Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available via the link on the arXiv abstract page.
中文摘要 工业异常定位旨在准确识别和定位工业产品中的异常区域，解决在现实场景中检测未见缺陷类别的关键挑战。传统的封闭集方法常常存在跨场景泛化能力较差的问题，而现有基于多模态大型语言模型（MLLM）的方法面临两个核心限制：它们要么采用与本地化实际需求不符的质量保证范式，要么依赖标准优化技术如群相对策略优化（Group RelativePolicy Optimization，GRPO），而该技术未能有效传递针对细微缺陷的学习信号。为解决这些问题，本文提出了DifferAD-R1，一种针对工业异常定位的MLLM增强强化学习框架。我们设计了一个差分导向双影像范式，将定位任务重新表述为一次性差分基础问题，以有效探索跨场景异常。为难以检测的异常开发了双一致性定位奖励，提升了优化稳定性和鲁棒性。此外，我们还结合了难度意识策略，结合自适应加权和分组重抽样，优先学习具有挑战性的实例。为了便于在真实工业环境中进行评估，我们构建了AD-DualDiff数据集，包含20个类别的13K张配对图像。实验结果表明，DifferADR1显著优于现有基线，并实现了与Qwen3-VL（235B参数）等大型模型的竞争性能。我们的代码公开发布于：https URL。

Reinforcement Learning with Inner-loop Dynamics Estimator for Aerial Manipulation under Uncertainty

利用内环动力学估计器进行不确定性下空中操控的强化学习

Authors: Shivansh Pratap Singh, Samaksh Ujjwal, Ishita Chaudhary, V R Vasudevan, Rishabh Dev Yadav, Spandan Roy
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16621
Pdf link: https://arxiv.org/pdf/2606.16621
Abstract Aerial manipulators enable physical interaction in hard-to-reach environments; however, direct whole-body aerial manipulation under rapid arm motion, payload changes, and the associated unknown dynamic uncertainty remains largely unsolved. We present a hierarchical control framework that combines Reinforcement Learning (RL) with an inner-loop dynamics estimator to address this problem. The RL outer loop maps desired 6-degrees-of-freedom (DOF) end-effector targets to coordinated whole-body commands, enabling direct task-driven control without relying on a fully accurate coupled dynamic model in the policy layer. An inner loop then tracks these commands while compensating for transient inertial shifts and uncertainty during execution via a dynamics estimator scheme without requiring system model knowledge. We validate the proposed approach on a custom quadrotor equipped with a 3-DOF manipulator through hardware experiments under varying payload conditions. Compared with RL+PID and RL+INDI+PID baselines, the proposed method reduces end-effector tracking error and improves task success rate across the tested hardware conditions. These results show that combining learned whole-body coordination with estimator-based low-level compensation improves the precision and robustness of aerial manipulation under changing operating conditions.
中文摘要 空中机械臂使得在难以触及的环境中进行物理互动;然而，在快速手臂运动下，直接全身空中操作、有效载荷变化及相关未知动态不确定性的问题仍大多未解决。我们提出了一个分层控制框架，结合了强化学习（RL）和内环动力学估计器来解决这一问题。强化外环将期望的6自由度（DOF）端执行器目标映射到协调的全体指令，实现直接的任务驱动控制，无需依赖策略层中完全准确的耦合动态模型。内环通过动力学估计方案在不依赖系统模型知识的情况下，在执行时补偿瞬态惯性偏移和执行中的不确定性，同时跟踪这些指令。我们在配备3-DoF机械臂的定制四旋翼仪上，通过硬件实验验证了该方法，且在不同有效载荷条件下进行实验。与RL+PID和RL+INDI+PID基线相比，所提方法降低了末端执行器跟踪误差，并提高了测试硬件条件下的任务成功率。这些结果表明，将学习到的全身协调与基于估计器的低级补偿结合，可以提升在变化操作条件下空中操作的精度和稳健性。

Understanding Automated Web GUI Testing: An Empirical Study Across Exploration Strategies and State Abstractions

理解自动化网页图形界面测试：跨探索策略与状态抽象的实证研究

Authors: Chenxu Liu, Wei Yang, Ying Zhang, Tao Xie
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.16650
Pdf link: https://arxiv.org/pdf/2606.16650
Abstract Automated web GUI testing (AWGT) relies on exploration strategies that exercise web applications through GUI actions to maximize code coverage, spanning traditional model-based, reinforcement learning (RL)-based, and emerging large language model (LLM)-based approaches. State abstraction, which detects pages with the same functionality to avoid repeated testing, has long been recognized as critical to guiding exploration. However, how exploration strategies and state abstractions jointly affect testing effectiveness remains underexplored. We present an empirical study analyzing both factors from the perspectives of code coverage and failure revelation. We compare representative model-based, RL-based, and LLM-based approaches; investigate how six state abstractions influence model-based and RL-based approaches; examine LLM-based approaches under different history representations, which act as a form of state abstraction; and compare the failures exposed by different approaches. Our results show that no single strategy excels across all dimensions; instead, categories exhibit complementary strengths in code coverage, state coverage, and failure discovery. State abstraction is a key factor: strict, fine-grained abstractions favor model-based strategies, while compact ones better support RL-based strategies. History representation substantially affects LLM-based strategies, where concise, functionality-level context performs best. We also find that code coverage is weakly correlated with failure-revealing ability, underscoring the need for multi-dimensional evaluation. These findings offer practical guidance for selecting exploration strategies and designing effective state abstractions for AWGT.
中文摘要 自动化网页图形界面测试（AWGT）依赖探索策略，通过图形界面动作锻炼网页应用以最大化代码覆盖，涵盖传统的基于模型、基于强化学习（RL）以及新兴的基于大型语言模型（LLM）的方法。状态抽象通过检测具有相同功能的页面以避免重复测试，长期以来被认为是指导探索的关键。然而，探索策略与状态抽象如何共同影响测试效果仍缺乏深入探讨。我们提出了一项实证研究，从代码覆盖率和故障揭示的角度分析这两个因素。我们比较了具有代表性的基于模型、基于强化学习（RL）和基于大型语言模型（LLM）的方法;研究六态抽象如何影响基于模型和强化学习的方法;在不同历史表示方式下，考察基于LLM的方法，这些历史表现形式为状态抽象;并比较不同方法暴露出的失败情况。我们的结果表明，没有哪种策略能在所有维度上都卓越;相反，这些类别在代码覆盖率、州级覆盖和故障发现方面展现出互补优势。状态抽象是一个关键因素：严格、细粒度的抽象更适合基于模型的策略，而紧凑的抽象则更适合基于强化学习的策略。历史表示对基于LLM的策略有显著影响，在这些策略中，简洁的功能级上下文表现最佳。我们还发现代码覆盖率与故障揭示能力相关性较弱，强调了多维评估的必要性。这些发现为选择探索策略和设计有效的 AWGT 状态抽象提供了实用指导。
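To make the contrast between strict and compact state abstractions concrete, the sketch below treats a page as a list of (tag, text) pairs and maps it to an abstract state in two ways. Both abstraction functions and the toy pages are hypothetical illustrations. They are not among the six abstractions the study evaluates.

```python
def strict_abstraction(page):
    """Fine-grained: pages match only if every element and label is identical."""
    return tuple(page)

def compact_abstraction(page):
    """Coarse: pages match if they contain the same set of element tags."""
    return frozenset(tag for tag, _ in page)

def count_states(pages, abstraction):
    """Number of distinct abstract states an explorer would track."""
    return len({abstraction(p) for p in pages})

# Hypothetical pages: the first two differ only in a button label.
pages = [
    [("input", "search"), ("button", "Go")],
    [("input", "search"), ("button", "Search")],
    [("a", "Home"), ("a", "About")],
]

strict_count = count_states(pages, strict_abstraction)
compact_count = count_states(pages, compact_abstraction)
```

The strict abstraction keeps all three pages apart, while the compact one merges the two search pages. A smaller state space of this kind is the sort of trade-off the abstract reports as favoring RL-based strategies.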

VENOM: Versatile Embodied Network for Omni-bodied Motion tracking

VENOM：多功能具体网络，用于全体运动追踪

Authors: Siddharth Padmanabhan, Kazuki Miyazawa, Takato Horii
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16696
Pdf link: https://arxiv.org/pdf/2606.16696
Abstract Achieving expert-level expressive full-body motion tracking across multiple humanoids solely from demonstration data remains a challenging and relatively underexplored problem in humanoid robot learning. Cross-embodiment motion tracking policies are mostly trained by decoupling the control problem into upper and lower body control. This work proposes VENOM, a cross-embodiment full-body motion tracking model for humanoids in simulation. VENOM is a GPT-based motion tracker trained on multiple humanoid data that can track the entire body without the need to split control into upper and lower body. We curate a multi-humanoid motion tracking dataset, called the VENOM dataset, that contains states, actions, and rewards, and train VENOM and the baselines on this dataset. In this letter, we evaluate VENOM's performance against baselines and show that it yields a stable motion tracker across different humanoids that is more capable than an MLP trained on multiple humanoid data with supervised learning alone. We also show that, despite the lack of reward feedback, VENOM closely matches the tracking capability of experts that were trained using asymmetric actor-critic reinforcement learning.
中文摘要 仅凭演示数据实现多类人生物的专家级表现性全身运动追踪仍是一个挑战性且相对未被充分探索的问题。跨身体动作追踪策略大多通过将控制问题分离为上下半身控制来训练。这项工作提出了VENOM，一种用于模拟人形的交叉身体全身运动跟踪模型。VENOM是一款基于GPT的运动追踪器，基于多个类人生物数据训练，能够追踪全身，无需分上下半身控制。我们策划了一个多人形运动追踪数据集，称为VENOM数据集，包含状态、动作和奖励，并在该数据集上训练VENOM及基线数据。在这封信中，我们评估了VENOM的基线表现，并展示了我们能够在不同类人生物之间实现稳定的动作追踪器，能力优于仅用监督学习训练的多类人生物MLP;同时也证明，尽管缺乏奖励反馈，VENOM的跟踪能力与使用非对称行为者批评强化学习训练的专家非常接近。

Harmonizing Semantic and Collaborative in LLMs: Reasoning-based Embedding Generator for Sequential Recommendation

大语言模型中的语义与协作协调：基于推理的顺序推荐嵌入生成器

Authors: Qidong Liu, Mingyao Huang, Moranxin Wang, Wenxuan Yang, Haiping Zhu
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.16703
Pdf link: https://arxiv.org/pdf/2606.16703
Abstract Sequential Recommender Systems (SRS) predict the next item of interest based on users' interaction histories and have been widely deployed, but they are hindered by the long-tail problem. Large Language Models (LLMs), with strong semantic understanding and reasoning capabilities, offer a promising way to enrich item semantics and have recently been used as embedding generators. However, two fundamental gaps remain. First, current LLM-based embedding methods fail to exploit the model's inner reasoning capacity. Second, existing methods often inject collaborative signals implicitly via supervised fine-tuning, lacking explicit guidance for collaborative embedding alignment. In this paper, we introduce ReaEmb, a novel framework that resolves both issues via a Latent Reasoning-enhanced Contrastive Learning (LRCL) stage and a Collaborative Reward Reinforcement Learning (CRRL) stage. LRCL exploits the LLM's inner reasoning capacity through a two-pass forward process with an additional attention module. CRRL then explicitly injects collaborative signals into the LLM via tailored reinforcement learning. Extensive experiments on three real-world datasets demonstrate the superior effectiveness of ReaEmb across multiple SRS models. To ease reproducibility, we release the code online.
中文摘要 顺序推荐系统（SRS）基于用户的交互历史预测下一个感兴趣的项目，已被广泛应用，但因长尾问题而受阻。大型语言模型（LLMs）具有强大的语义理解和推理能力，提供了丰富项目语义的有前景方式，近年来被用作嵌入生成器。然而，仍有两个根本性的空白。首先，当前基于LLM的嵌入方法未能充分发挥模型的内部推理能力。其次，现有方法常通过监督微调隐式注入协作信号，缺乏明确的协作嵌入对齐指导。本文介绍了ReaEmb，这是一种新颖框架，通过潜在推理增强对比学习（LRCL）阶段和协作奖励强化学习（CRRL）阶段解决这两个问题。LRCL通过带有额外注意力模块的双向前进过程，利用LLMs的内部推理能力。随后，CRRL通过定制化的强化学习，显式地向LLM注入协作信号。在三个真实世界数据集上的广泛实验显示，ReaEmb在多个SRS模型中表现优异。为了简化复现性，我们将代码在线发布。

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

医学世界模型：代表医疗状态，建模临床动态并指导干预政策

Authors: Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.16721
Pdf link: https://arxiv.org/pdf/2606.16721
Abstract Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception-dynamics-planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at this https URL.
中文摘要 医学诊断和治疗是一个动态过程，患者状态随时间演变，临床干预改变未来结局。尽管现有的医疗人工智能能够检测疾病、评估风险并生成报告，但许多系统仍然返回静态标签或评分，只能有限地了解疾病进展或替代干预如何重塑病情轨迹。医疗世界模型通过学习患者与状态动态的内部模拟器，将人工智能的世界模型理念应用于医疗领域。他们的长期目标是帮助临床医生预见恶化，比较治疗条件下的未来，并根据个别患者量身定制护理。然而，相关工作仍分散在基础模型、纵向建模、疾病模拟、治疗效果估计、强化学习和数字孪生等领域。为弥合这一差距，本综述提出了一条将医疗人工智能从孤立诊断和预测向模拟疾病演变并支持干预决策的医学世界模型发展的路线图。该路线图围绕三项耦合能力组织：患者状态构建、临床动态建模和干预决策支持。在代表性系统中，比较突出了每个能力的贡献，以及如何将部分组件整合到更成熟的感知——动态——规划系统中。最后，我们识别了将可行推广转化为临床实用模拟器所面临的挑战。相关文献可在此 https 网址查阅。

Pride and Prejudice: Toward an Information-Theoretic Framework for Mutually Communicative Driver Behavior Modeling

《傲慢与偏见：迈向信息理论框架的相互交际驱动行为建模》

Authors: Tingjun Li, Nan Xu, Shuo Feng, Hassan Askari, Bruno Henrique Groenner Barbosa, Konghui Guo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16735
Pdf link: https://arxiv.org/pdf/2606.16735
Abstract Mixed autonomy driving becomes unsafe and inefficient when autonomous vehicles (AVs) and human-driven vehicles (HVs) misread each other's intentions. We study this problem as implicit mutual communication in lane changes. The proposed framework models how the ego vehicle both expresses its intent and probes the other driver's preference under epistemic uncertainty. It combines a level-k Bayesian persuasion game with virtual features for proactive signaling, information-theoretic rewards for mutual communication, and adaptive weights of communication affordances. We further introduce the Pride-Inquiry (P-I) and Pride-Prejudice (P-P) planes to analyze communication intensity and tendency. The model is calibrated with a Communication-Based Multi-Agent Inverse Reinforcement Learning algorithm (C-MIRL) on the naturalistic NGSIM dataset. Compared with the non-communicative baseline, the proposed model reduces the prediction error of mandatory lane changes by up to 20% while maintaining strong generalization. Driver-In-the-Loop questionnaire scores are positively correlated with the calibrated communication variables, supporting the subjective validity of the model. The learned rewards further show that inquiry and listening affordances contribute more than pride and expression alone, and that inquiry preference varies more strongly across drivers. These results support explicit modeling of mutual communication and epistemic uncertainty in interactive driving.
中文摘要 当自动驾驶车辆（AV）和人驾驶车辆（HV）误判对方意图时，混合自动驾驶变得不安全且效率低下。我们将这个问题作为变道中的隐性相互沟通来研究。该框架模拟了自我载体如何在认知不确定性下表达其意图并探究对方的偏好。它结合了k级贝叶斯说服游戏、用于主动信号传递的虚拟特征、相互沟通的信息理论奖励以及自适应的沟通权重。我们进一步介绍了骄傲-探究（P-I）和骄傲-偏见（P-P）层面，以分析沟通强度和倾向。该模型通过基于通信的多智能体逆强化学习算法（C-MIRL）在自然主义NGSIM数据集上进行校准。与非通信基线相比，所提模型在保持强有力泛化性的同时，将强制变道的预测误差降低了最多20%。驾驶员在环问卷评分与校准后的沟通变量呈正相关，支持模型的主观效度。学习到的奖励进一步表明，询问和倾听的契合性贡献远超单纯的自豪感和表达力，且探询偏好在不同驾驶者间差异更大。这些结果支持了交互式驾驶中相互沟通和认知不确定性的显式建模。

Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

平均奖励均值场博弈的最大熵逆强化学习

Authors: Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16759
Pdf link: https://arxiv.org/pdf/2606.16759
Abstract We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional RKHS rewards, we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.
中文摘要 我们研究在平均奖励准则下，离散时间无限视界均值场博弈（MFGs）的逆强化学习。专家证明假设来自在未知奖励下的平稳平均场平衡，目标是通过最大因果熵原理恢复解释观察到行为的策略。我们通过强制执行专家均值项和长期特征预期的一致性来构建逆问题，并在统一的职业-测量框架内处理两个奖励类别。对于有限维线性奖励，我们给出一个带有显式对数划分目标的凸对偶重述，并证明光滑性和曲率性质支持恒步长梯度下降。对于无限维RKHS奖励，我们发展出一个拉格朗日松弛，其内极大策略由软贝尔曼方程表征。主要障碍是缺乏贴现因子收缩。我们通过引入一个基于最小化的子随机核来解决这个问题，该核对软贝尔曼算子产生严格收缩。我们建立了对数似然得分的弗雷谢可微性和利普希茨平滑性，从而形成了带有收敛保证的梯度上升算法。两个数值示例——恶意软件传播的MFG和基于RKHS的消费者选择模型——显示恢复的政策与专家行为高度匹配。

GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

GD$^2$PO：通过群体动态奖励解耦策略优化缓解多奖励冲突

Authors: Haotian Liu, Yihao Liu, Jingwei Ni, Siyuan Huang, Xinpeng Liu, Pengyu Cheng, Jiajun Song, Ruijin Ding, Junfeng Li, Zhechao Yu, Mengyu Zhou, Hongteng Xu, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16771
Pdf link: https://arxiv.org/pdf/2606.16771
Abstract As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms capable of optimizing diverse and potentially competing objectives simultaneously. To address this, existing methods such as Group reward-Decoupled Policy Optimization (GDPO) decompose the overall score into independent reward groups, then compute the RL loss separately within each group. However, this strategy still encounters multi-reward conflicts: a single rollout can yield positive advantages on certain reward dimensions but negative ones on others, causing opposing signals to cancel each other out during aggregation, further hindering RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD$^2$PO). Specifically, GD$^2$PO employs a conflict-aware filtering mechanism to mask out rollouts suffering from severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking strategy preserves and enhances the magnitude of effective RL advantages, thereby significantly accelerating learning efficiency. Furthermore, we introduce query-level reweighting to dynamically adjust the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD$^2$PO consistently and significantly outperforms existing baselines. The code is available at this https URL.
中文摘要 随着LLM的发展，训练后强化学习（RL）越来越依赖多维奖励来培养综合能力。这一转变需要能够同时优化多样且潜在竞争目标的新算法。为解决这个问题，现有方法如组奖励脱钩策略优化（GDPO）将整体得分分解为独立的奖励组，然后分别计算每个组内的强化学习损失。然而，该策略仍会遇到多重奖励冲突：单次推广可能在某些奖励维度上带来正向优势，但在其他维度上却带来负面优势，导致相反信号在聚合过程中相互抵消，进一步降低强化学习训练效率。受动态采样策略优化（DAPO）启发，DAPO通过过滤无效且几乎无优势的部署来提升强化学习训练效率，我们提出了组动态奖励-解耦策略优化（GD$^2$PO）。具体来说，GD$^2$PO采用了冲突感知过滤机制，以掩盖那些在奖励上存在严重分歧的推出。通过防止冲突信号相互抵消，这种掩蔽策略保留并增强了强化学习优势的幅度，从而显著加快了学习效率。此外，我们引入了查询级别的重权重，根据查询的整体奖励共识动态调整每个查询的更新强度。在多种多奖励场景中的实验，包括工具调用和人类偏好对齐，表明GD$^2$PO持续且显著地优于现有基线。代码可在该 https URL 访问。
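The conflict-aware masking idea described in the GD$^2$PO abstract can be illustrated with a minimal sketch. The abstract does not give the exact conflict criterion, so everything below is an assumption made for illustration: the function names, the sign-agreement score, and the threshold `conflict_threshold` are hypothetical and are not taken from the paper. Each reward dimension is normalized separately within the group of rollouts for one query (the decoupled step), and a rollout is masked when its per-reward advantages largely cancel each other.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Standardize one reward dimension across the rollouts of a single query."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def conflict_masked_advantages(rewards, conflict_threshold=0.5):
    """rewards[i][k] is reward k of rollout i for one query.

    Returns (advantages, mask). A rollout is masked (advantage set to 0)
    when its per-reward advantages disagree so strongly that they mostly
    cancel. The agreement score is a hypothetical choice, not the paper's.
    """
    n_rollouts = len(rewards)
    n_rewards = len(rewards[0])
    # Decoupled step: normalize each reward dimension on its own.
    per_dim = [group_normalize([r[k] for r in rewards]) for k in range(n_rewards)]
    advantages, mask = [], []
    for i in range(n_rollouts):
        a = [per_dim[k][i] for k in range(n_rewards)]
        total_abs = sum(abs(x) for x in a)
        # Agreement is 1.0 when all signs match and 0.0 when they cancel exactly.
        agreement = abs(sum(a)) / total_abs if total_abs > 0 else 0.0
        keep = agreement >= conflict_threshold
        mask.append(keep)
        advantages.append(sum(a) if keep else 0.0)
    return advantages, mask


if __name__ == "__main__":
    # Four rollouts scored on two rewards (say, correctness and format).
    rewards = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    adv, mask = conflict_masked_advantages(rewards)
    print(mask)
    print(adv)
```

In this toy group, the two rollouts that score well on one reward and poorly on the other are masked, while the rollouts that are consistently good or consistently bad keep their full aggregated advantage. The query-level reweighting mentioned in the abstract is not covered by this sketch.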

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

OpenClaw-Skill：agentic 大型语言模型的集体技能树搜索

Authors: Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.16774
Pdf link: https://arxiv.org/pdf/2606.16774
Abstract Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that builds a structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a comprehensive set of skill trees along with skill-augmented training data, enabling models to effectively learn and utilize skills. In addition, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration and avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization on challenging benchmarks.
中文摘要 为大型语言模型（LLM）代理配备有效技能对于解决像OpenClaw这样的现实世界系统中的复杂任务至关重要。在本工作中，我们旨在开发一个框架，自动构建此类可重复使用的技能，以增强大型语言模型在工具使用、多步推理和动态环境交互方面的表现。为此，我们提出了集体技能树搜索（CSTS），这是一种基于树状搜索的新型技能构建框架，构建结构化、多样化且可推广的技能树。CSTS的核心理念是利用集体智能，通过两个迭代阶段——集体技能节点生成（CSN-Gen）和集体技能节点评估（CSN-Assess）——共同搜索、识别和构建有效技能。CSN-Gen利用多个模型的集体知识，探索每个子任务的多样化候选技能，实现全面的技能探索。CSN-Assess 采用多种模型作为评判，通过两种评分机制评估和选择技能节点：（1）集体质量评分，汇总独立评估以得出技能有效性的稳健估计;（2）集体可转移性评分，明确验证技能是否适合跨不同模型推广。通过CSTS，我们构建了一套全面的技能树以及技能增强后的训练数据，使模型能够有效学习和运用技能。此外，我们还引入了集体技能强化学习，主动从树中选择多个相关技能，拓宽解空间探索，避免被单一技能及其均匀或次优解所困。因此，我们训练好的模型OpenClaw-Skill在长期规划、工具使用和对挑战性基准的泛化方面展现出卓越的代理能力。

Understanding the Behaviors of Environment-aware Information Retrieval

理解环境感知信息检索的行为

Authors: Ruifeng Yuan, Chaohao Yuan, David Dai, Yu Rong, Hong Cheng, Hou Pong Chan, Chenghao Xiao
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.16817
Pdf link: https://arxiv.org/pdf/2606.16817
Abstract Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting that strategies learned for one retriever are ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at this https URL.
中文摘要 近期的检索增强生成（RAG）方法已证明在处理复杂查询方面表现出强大能力，但当前研究忽视了一个关键挑战：不同的检索器需要根本不同的查询表述策略以实现最佳性能。本研究首次系统分析了大型语言模型如何通过强化学习（RL）学习适应不同检索器调整查询表述策略。我们的实证研究表明，强化学习有效地教会大型语言模型根据特定检索者的特性定制查询。我们发现不同的检索者表现出令人惊讶地不同的最优查询风格（例如描述型与提问型），表明一种检索者学到的策略对另一位无效。我们还进一步表明，通过引入针对寻回者的人类指导和缩放模型规模，可以提升性能。为了促进多步检索轨迹的学习，我们引入了一种基于分支的展开技术，以提升训练稳定性。我们的工作提供了首批实证证据和可操作的洞见，帮助构建真正感知检索者的RAG系统。代码和资源可在此 https URL 获取。

Deep Q-Learning on Hölder Spaces

Hölder 空间上的深度 Q 学习

Authors: Qian Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.16846
Pdf link: https://arxiv.org/pdf/2606.16846
Abstract We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness-complexity trade-off as the time step $\delta \to 0$. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.
中文摘要 我们研究具有连续状态和动作的连续时间随机控制中Q学习的算子理论核心。在基于价值的强化学习中，每次Q学习或DQN更新都基于Bellman最优性目标构建；我们的分析在扩散环境中分离出该目标，并研究其正则性和近似复杂度。在一致椭圆性和Hölder正则系数下，我们证明了Bellman更新将有界输入映射到各向异性正则类，使状态变量平滑，同时仅保留对动作变量的Lipschitz依赖。这产生了一个紧的Bellman迭代族，并启发了一种适应问题混合正则性的张量积DeepONet架构。然后我们推导出显式的近似界和资源界，以及当时间步长 $\delta \to 0$ 时刚度与复杂度之间的权衡。该理论在连续随机控制中，于Bellman目标正则性和近似层面上对Q学习理论做出了直接贡献。同时，我们不声称对包含探索、回放和随机梯度更新的实际采样Q学习给出完整的收敛定理。

Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning

基于视频的反馈高效离线偏好强化学习的最优传输

Authors: Tung M. Luu, Hwanhee Kim, Younghwan Lee, Chang D. Yoo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.16856
Pdf link: https://arxiv.org/pdf/2606.16856
Abstract Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
中文摘要 向强化学习（RL）智能体传达复杂目标通常需要细致的奖励工程。基于偏好的强化学习（PbRL）通过从人类反馈学习奖励函数，提供了一种有前景的替代方案，但其可扩展性受限于高昂的标签成本。受视频基础模型（ViFMs）进展的启发，我们提出了基于视频的最优传输偏好（VOTP）的半监督框架，仅通过少数标签学习有效的奖励函数。通过利用最佳传输在ViFM丰富的表示空间内对齐视觉轨迹，VOTP有效地为大量未标记数据生成高保真伪标签，大大减少了人工监督。跨行走和操作基准的广泛实验证明了VOTP的优越性，在有限的反馈预算下，VOTP的表现优于最先进的离线PbRL方法。我们还展示了VOTP在视觉干扰因素存在下的鲁棒性，并验证了其在真实机器人任务中的实用性，在这些任务中，它能以最小的人工输入学习有意义的奖励。
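To make the optimal-transport step in the VOTP abstract more concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between two trajectories represented as sequences of embedding vectors. The abstract does not describe how VOTP turns transport costs into pseudo-labels, so the rule in `pseudo_label` (prefer the segment that is cheaper to transport onto a human-preferred reference segment) is purely an assumption for illustration, as are the function names, the toy 2-D embeddings, and the regularization value. Real ViFM embeddings would be high-dimensional and would likely need cost scaling to avoid numerical underflow.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sinkhorn_cost(X, Y, reg=0.1, n_iters=200):
    """Entropy-regularized OT cost between frame embeddings X and Y.

    Both trajectories get uniform weights over their frames.
    """
    n, m = len(X), len(Y)
    C = [[sq_dist(x, y) for y in Y] for x in X]          # pairwise frame costs
    K = [[math.exp(-c / reg) for c in row] for row in C]  # Gibbs kernel
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * K[i][j] * v[j]; return <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))


def pseudo_label(seg_a, seg_b, preferred_ref):
    """Hypothetical rule: return 0 if seg_a is closer to the reference, else 1."""
    cost_a = sinkhorn_cost(seg_a, preferred_ref)
    cost_b = sinkhorn_cost(seg_b, preferred_ref)
    return 0 if cost_a <= cost_b else 1


if __name__ == "__main__":
    preferred = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # labeled, preferred segment
    seg_a = [[0.0, 0.1], [1.0, 0.1], [2.0, 0.1]]      # close to the reference
    seg_b = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]      # far from the reference
    print(pseudo_label(seg_a, seg_b, preferred))
```

Because transport operates on sets of frames with soft assignments, the two segments do not need to have the same length or be time-aligned frame by frame.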

Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

潜在空间强化学习用于食物断裂模拟中的反材料估计

Authors: Adrian Ramlal, Yuhao Chen, John S. Zelek
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2606.16870
Pdf link: https://arxiv.org/pdf/2606.16870
Abstract Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.
中文摘要 对食品操作进行逼真的视觉模拟需要精确的材料参数，但这些参数难以直接测量，且在单个食品的异质区域间存在差异。我们解决了在不可微连续介质损伤力学模拟器中，从断裂行为目标描述估算材料参数的逆问题。以橙皮为测试案例，我们在2000次前向模拟上训练神经代理，并将协方差矩阵适应进化策略（CMA-ES，一种无梯度进化优化器）与近端策略优化（PPO，一种强化学习算法）在原始9维参数空间和两个学习到的4维潜在表示上进行比较。由于不同橙子具有不同的材料性质，实用的逆系统必须在不重新训练的情况下处理任意目标。我们训练一个目标条件PPO策略，学习一个一般的逆映射：给定剥落行为的任意目标描述，策略在一次前向传递（8次替代评估，约10毫秒）中产生重要参数估计。在一个归一化的流动潜空间中运行，并配备共享的代理评估器，目标条件策略在通过模拟器验证后实现了0.642的实际恢复，比原始参数空间高出23%。一个启动期延长，初始化从政策输出中细化CMA-ES，进一步提升了540次评估后的恢复率至0.828。这些发现为逆食品物理提供了实用框架，并为通过视频观察食物操作进行视觉驱动的材料识别奠定了基础。

Greed Is Learned: Visible Incentives as Reward-Hacking Triggers

贪婪是学习的：可见的激励作为奖励黑客的触发因素

Authors: Tong Che, Rui Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.16914
Pdf link: https://arxiv.org/pdf/2606.16914
Abstract Deployed agents increasingly act with their reward proxy in view, such as a balance, score, or KPI dashboard. We show that reinforcement learning can make a policy addicted to such a visible self-benefit channel. It chases the displayed payoff across held-out domains, sacrifices the true task to do so, and follows the channel wherever we rewrite it, while policies that never saw the channel stay honest. We call this reward-channel addiction and study it in MoneyWorld, a synthetic sandbox. The addiction can flip a model's safety alignment: trained only on innocuous money tasks with no safety content, the model abandons the safe action it otherwise always takes whenever a dashboard pays for an unsafe one, and reverts to safe once the channel is hidden. This learned bribe replicates across model scales and families. Blindly optimizing super-capable, next-generation AI on KPIs or P&L can be dangerous for alignment. Greed is learned when following such a channel pays.
中文摘要 已部署的智能体在行动时越来越多地能看到自己的奖励代理指标，例如余额、分数或KPI仪表盘。我们表明，强化学习会使策略对这种可见的自利渠道“上瘾”。该策略会在留出的领域中追逐所显示的收益，并为此牺牲真正的任务；无论我们如何改写该渠道，它都会跟随，而从未见过该渠道的策略则保持诚实。我们将其称为“奖励渠道成瘾”，并在合成沙盒MoneyWorld中对其进行研究。这种成瘾可以翻转模型的安全对齐：仅在不含安全内容的无害金钱任务上训练后，只要仪表盘为不安全动作付费，模型就会放弃它原本始终采取的安全动作；一旦渠道被隐藏，模型又恢复安全行为。这种习得的“贿赂”在不同模型规模和模型家族中均可复现。在KPI或损益（P&L）上盲目优化能力超强的下一代人工智能，可能对对齐构成危险。当跟随这样的渠道有利可图时，贪婪就是习得的。

A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

强化学习中分布转变的统一因果-起源分类法

Authors: Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.16933
Pdf link: https://arxiv.org/pdf/2606.16933
Abstract Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal, agent-driven, and external, environment-driven, distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.
中文摘要 强化学习（RL）系统在操作条件与之前不同时常会退化，反映底层数据生成过程的分布变化。这种变化可能发生在培训和评估之间，如分布内（ID）和分布外（OOD）的泛化，或在非固定环境中环境动态随时间演变。然而，这些观点之间的正式关系仍不明确，现有研究主要聚焦于缓解措施，而非主体-环境交互中转变的因果起源。本研究发展了一个统一的因果-起源分类法，描述了强化学习中分布偏移的来源，并将ID/OOD推广与非平稳环境联系起来。我们通过将分布转移从监督式学习转移到强化学习，将经典数据集转移原则转化为生成交互过程。利用部分可观测马尔可夫决策过程（POMDP），我们将交互分解为结构组成部分，包括状态分布、观察过程、策略、奖励和过渡动态，以及时间边界的转移。该分类法区分了内部、主体驱动和外部环境驱动的分布转变。移位时间边界视角进一步描述了显性、隐性和混合型变换。该表述统一了ID/OOD的泛化和非平稳性，作为底层过程中的结构变化。我们还引入了通过绩效下降和恢复指标衡量班次影响和适应性的评估框架。通过将分布偏移置于强化学习的因果-起源结构中，本研究支持对分布偏移下鲁棒性的系统分析。

Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

探索使用代码解释器进行有效推理的外在性和内在属性

Authors: Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, Ekapol Chuangsuwanich, Can Udomcharoenchaikit, Samuel Cahyawijaya, Peerat Limkonchotiwat, Sarana Nutanong
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16934
Pdf link: https://arxiv.org/pdf/2606.16934
Abstract Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in a certain model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.
中文摘要 使用代码解释器（CI）推理已成为通过可执行计算和迭代验证提升大型语言模型（LLM）推理能力的有效范式。尽管采用率不断上升，有效代码推理背后的行为特性仍然大多未被充分探讨。本研究从两个截然不同的视角研究代码推理，这些视角受此前自然语言推理研究启发：外在属性，由关键符号表示，以及内在属性，由代码特定的认知行为表示。在多个大型语言模型中，我们发现更强的CI推理模型在关键代币和认知行为（尤其是验证、回溯和后向链）中表现出更高的普遍性。基于这些观察，我们探讨了这些特性如何在推理和训练中被利用。在推理阶段，附加特定代码的关键令牌可以提升数学、排序和优化等多种推理能力的性能，而在其他方面带来的益处有限。在训练阶段，通过增强最先进的框架，加入针对代码的认知行为，可以提升三个受评估模型中两个的监督微调和强化学习表现。进一步分析显示，这些行为减少了错误回答时的过度思考，提高了代币效率，同时也揭示了限制某些模型收益的因素。我们的发现首次系统地描述了基于CI的有效推理，并展示了利用关键特性提升基于CI推理的潜力与局限性。

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

真实机器人五球杂耍的任务误差残差学习

Authors: Kai Ploeger, Jan Peters
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.16978
Pdf link: https://arxiv.org/pdf/2606.16978
Abstract For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at this https URL.
中文摘要 对于精炼现有行为的残余学习，样本效率取决于两个方面：每次推广返回多少信息，以及学习者如何高效利用这些信息。强化学习的标准标量奖励所携带的信息远少于定义任务的方向性任务误差。随机探索会进一步丢弃每次推出时带来的信息。通过残差学习、定向任务误差监督和驱动样本选择的任务误差模型，我们在拟人化的Barrett WAM臂上实现了稳定的三球、四球和五球杂耍。尽管通过简单理想化的堆栈来规划和控制，系统在第二次尝试时收敛了。第一次尝试会掉落，之后任务错误会单调减少，不再失败。相比之下，五球杂耍通常需要人类多年的练习。我们比较了两个三元轴上的残差学习者：学习反馈中的方向信息和分析先验的承诺，涵盖牛顿式雅可比更新、复合贝叶斯优化和随机搜索方法。这两种轴都是必要的：单靠方向反馈和信息先验都不够，最简单的结合方法——固定雅可比牛顿更新——是最可靠的。习得残差容忍较大的先验错位和关节追踪退化，主要影响收敛速度。因此，真实机器人残差学习的瓶颈在于监督信号的信息内容及其使用方式，而非周围堆栈的准确性。所有实验的视频文档可在此 https 网址获取。
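The fixed-Jacobian Newton update that the juggling abstract names as the most reliable residual learner can be sketched on a toy problem. This is only an illustration under stated assumptions: the two-parameter residual, the function `toy_task_error` (a stand-in for a real rollout), and the identity prior Jacobian are all hypothetical and are not taken from the paper. The point is that each rollout returns a directional error vector rather than a scalar reward, and the same prior Jacobian is reused at every step even though it does not match the true system.

```python
def solve_2x2(J, b):
    """Solve J x = b for a 2x2 matrix J using Cramer's rule."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    if abs(det) < 1e-12:
        raise ValueError("prior Jacobian is singular")
    x0 = (b[0] * J[1][1] - J[0][1] * b[1]) / det
    x1 = (J[0][0] * b[1] - b[0] * J[1][0]) / det
    return [x0, x1]


def fixed_jacobian_newton(task_error, theta0, J_prior, n_rollouts=25):
    """Refine residual parameters theta using a Jacobian that is never updated.

    task_error(theta) plays the role of one rollout and returns a
    directional error vector. Returns the final theta and the error norms
    measured before each update.
    """
    theta = list(theta0)
    history = []
    for _ in range(n_rollouts):
        e = task_error(theta)
        history.append((e[0] ** 2 + e[1] ** 2) ** 0.5)
        step = solve_2x2(J_prior, e)          # Newton step with the fixed prior
        theta = [t - s for t, s in zip(theta, step)]
    return theta, history


def toy_task_error(theta):
    """Hypothetical system whose true Jacobian differs from the prior."""
    x, y = theta
    return [1.2 * x + 0.3 * y + 0.05 * x * x - 0.5,
            -0.2 * x + 0.9 * y - 0.3]


if __name__ == "__main__":
    J_prior = [[1.0, 0.0], [0.0, 1.0]]        # deliberately misaligned prior
    theta, history = fixed_jacobian_newton(toy_task_error, [0.0, 0.0], J_prior)
    print(theta)
    print(history[0], history[-1])
```

In this toy setting the iteration still converges despite the mismatch between the prior and the true Jacobian, because the mismatch is small enough for the update to remain a contraction. That is loosely in the spirit of the abstract's remark that prior misalignment mainly affects convergence speed, though the toy says nothing about the real robot.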

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX-World 1.0：通用交互世界模型

Authors: DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.16993
Pdf link: https://arxiv.org/pdf/2606.16993
Abstract DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
中文摘要 DreamX-World 1.0 是一款通用交互式文本/图像转视频世界模型，支持可控的长视界生成。它支持摄像机导航、重访先前观察到的区域，以及跨越照片级写实、游戏风格和风格化领域的提示事件。我们的数据引擎结合了摄像机精确的虚幻引擎渲染、丰富的动作游戏录制和真实世界视频，并结合了恢复的摄像机几何体。在相机控制方面，我们引入了E-PRoPE，这是一种轻量级的投影位置编码变体，保留了PRoPE的投影相机几何体，同时对空间缩减的标记施加了相机感知的关注。我们通过因果强迫、DMD式蒸馏和长期推广训练，将双向视频生成器转换为几步自回归世界模型。在自生成的长视野上下文上训练，使模型暴露于自身生成的历史，减少自回归块中积累的样式和颜色漂移。记忆条件场景持久性通过基于相机几何的检索来检索早期视图，而残余循环则使条件路径对不完美记忆潜伏的敏感度降低。事件指令调校增加了可组合事件控制，强化学习对齐在蒸馏后恢复了摄像头控制和视觉质量。凭借混合精度的DiT执行、残留重用、75%修剪VAE解码和异步流水线并行，DreamX-World 1.0在八块RTX 5090 GPU上最高可达16 FPS。在我们5秒的基础评估中，DreamX-World 1.0的摄像机控制得分为73.75，总得分为84.76，超过了HY-WorldPlay 1.5和LingBot-World，后者分别为80.79和80.45。

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

有疑问时，就规划出来：针对反应强化学习的承诺小语言模型审议

Authors: Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.16995
Pdf link: https://arxiv.org/pdf/2606.16995
Abstract Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.
中文摘要 强化学习（RL）策略在陌生环境中常常会退化，因为它们缺乏明确的审议。我们提出了规划、对齐、承诺、思考（PACT）混合架构，结合了快速、被动的强化学习策略与缓慢、审慎的小型语言模型（SLM）规划器。PACT异步调用SLM以生成和验证候选行动计划。一旦计划通过仿真验证为安全、可行且完整，便直接执行，绕过强化学习策略，无需重新训练或修改。在三种难度递增的FrozenLake配置评估下，PACT在依赖2B参数SLM骨干时表现优于所有基线，表明在这些环境中，审慎规划与被动执行协同作用比单独使用更为强大。
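
Editor's sketch (not from the paper): the abstract says a candidate plan is executed only after simulation verifies it as safe, feasible, and complete. A minimal version of such a check on a FrozenLake-style grid might look like the following. The map `GRID`, the single-letter action encoding, and the deterministic (non-slippery) dynamics are all assumptions made for illustration; PACT's actual validator and its SLM interface are not described in the abstract.

```python
# Hypothetical 4x4 map: S = start, F = frozen (safe), H = hole, G = goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

MOVES = {"L": (0, -1), "D": (1, 0), "R": (0, 1), "U": (-1, 0)}

def validate_plan(grid, plan):
    """Simulate a plan under deterministic dynamics; return (ok, reason)."""
    r, c = 0, 0  # assumes the start tile is the top-left corner
    for step, action in enumerate(plan):
        if action not in MOVES:
            return False, f"infeasible: unknown action {action!r} at step {step}"
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
            return False, f"infeasible: leaves the grid at step {step}"
        r, c = nr, nc
        if grid[r][c] == "H":
            return False, f"unsafe: hole at step {step}"
        if grid[r][c] == "G":
            return True, "complete"
    return False, "incomplete: goal not reached"

def select_actions(grid, candidate_plan, reactive_action):
    """Commit to a verified plan; otherwise fall back to the reactive policy's action."""
    ok, _ = validate_plan(grid, candidate_plan)
    return list(candidate_plan) if ok else [reactive_action]
```

For example, `validate_plan(GRID, "DDRRDR")` accepts the plan, while `validate_plan(GRID, "RD")` rejects it as unsafe because the second move enters a hole. The reactive policy is left untouched in both cases, matching the abstract's claim that no retraining is needed.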

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

ROVE：通过强化学习解锁人类干预以实现类人生物操控

Authors: Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang, Yixiao Ge
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17011
Pdf link: https://arxiv.org/pdf/2606.17011
Abstract Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.
中文摘要 人工干预为训练后视觉-语言-行动（VLA）模型提供了关键的纠正信号。然而，由于复杂的全身运动学和灵巧的手部控制，实现无缝的人形干预是一项艰巨的系统挑战。因此，收集到的干预路径往往不够理想，依赖人工干预作为专家监督的方法容易吸收犹豫、低效甚至错误的行为。为解决系统和算法挑战，我们提出了ROVE，这是一种针对人形VLA后训练的强化学习框架，且人类干预不完美。首先，ROVE引入了一条能够收集部署和干预数据以实现人形操控的人机化流程。其次，它利用乐观价值估计（OVE）优先考虑混合质量轨迹中的高价值行为。为了进一步强化价值估计，我们结合了跨身体人类经验视频，为长尾失效和恢复模式提供丰富的监督。由此产生的批评者产生了信息优势信号，引导VLA演员专注于高价值行为，而非盲目模仿所有动作。在具有挑战性的现实世界接触丰富且细致的人形操作任务中，ROVE表现优于经验学习基线，并在多次推广干预迭代中持续提升。

ExpRL: Exploratory RL for LLM Mid-Training

ExpRL：探索性强化学习，面向LLM中期培训

Authors: Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17024
Pdf link: https://arxiv.org/pdf/2606.17024
Abstract Sparse reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
中文摘要 稀疏奖励强化学习（RL）已成为提升LLM推理的标准工具，但其成功关键取决于基础模型中的覆盖度。在实践中，模型通常通过mid-training通过精心策划的推理痕迹为强化学习做准备，这些痕迹教授了如分解、验证或自我纠正等实用的原始技能。虽然该策略有效，但需要手动指定模型应学习的内容，且尚不清楚这种原始覆盖是否足以应对更难的问题，而这些问题需要将这些技能结合为更广泛的求解策略。我们研究一种更自动化的方法：基于强化学习的中期训练，使用大量人类编写的问答数据语料库。我们的方法ExpRL不将参考解视为模仿目标，而是将其用作奖励支架：参考从策略中隐藏，仅用于构建针对问题的评分评分标准，以判断策略推理痕迹。策略从原始问题提示中抽样，而LLM评审则将抽样推理痕迹与参考解比较，并分配结果层级或过程层级的密集奖励。这使得ExpRL能够强化部分进展、有用的中间还原和生产性推理行为，而这些稀疏的最终答案奖励往往无法被提升。在具有挑战性的数学推理任务中，ExpRL比SFT更强的强化过程启动、稀疏奖励GRPO和自蒸馏，并为后续稀疏奖励RL提供了更好的初始化。更多的混合领域实验进一步表明，ExpRL可以超越最初的纯数学设置。
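
Editor's sketch (not from the paper): the contrast between a sparse final-answer reward and a rubric-based dense reward can be shown with a small example. The `RubricItem` structure, the weights, and the keyword match that stands in for the LLM judge are all hypothetical; the abstract does not specify how ExpRL's rubrics or judge are implemented.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str
    keyword: str    # crude stand-in for an LLM judge's per-criterion decision
    weight: float

def sparse_reward(trace, final_answer):
    """1.0 only if the trace ends with the correct final answer."""
    return 1.0 if trace.strip().endswith(final_answer) else 0.0

def dense_reward(trace, rubric):
    """Weighted fraction of rubric items that the (stub) judge marks as satisfied."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.keyword in trace.lower())
    return earned / total

# Hypothetical rubric that would be derived from a hidden reference solution.
rubric = [
    RubricItem("uses the key substitution", "substitut", 1.0),
    RubricItem("verifies the intermediate bound", "check", 1.0),
    RubricItem("reaches the final answer", "42", 2.0),
]

trace_a = "Substitute x = y + 1, then check the bound. Answer: 41"
trace_b = "Guess. Answer: 41"

sparse_a, sparse_b = sparse_reward(trace_a, "42"), sparse_reward(trace_b, "42")
dense_a, dense_b = dense_reward(trace_a, rubric), dense_reward(trace_b, rubric)
```

Both traces get a sparse reward of 0.0 because the final answer is wrong, but the dense reward separates them (0.5 versus 0.0). This is the kind of partial progress the abstract says sparse rewards fail to upweight.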

DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

DEEPRUBRIC：深层研究代理高效强化学习的证据树评分标准

Authors: Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17029
Pdf link: https://arxiv.org/pdf/2606.17029
Abstract Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query-rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query-rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query-rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
中文摘要 深度研究人员通过搜索和推理检索到的证据，综合长文报告。基于评分标准的奖励强化学习通过优化这些代理，使其符合可检查的标准，将报告质量转化为奖励信号，但其效率取决于这些标准是否可靠地捕捉任务范围和证据需求。大多数现有研究要求LLM为给定查询生成评分标准，但当模型未能推断出潜在信息需求时，生成的评分标准可能不完整，降低强化学习的效率。为了获得更可靠的查询——评分标准监督，我们引入了DeepRubric，一种数据构建框架，它颠覆了这一过程：它不再为给定查询推断评估标准，而是先确定有证据支持的报告应基于哪些指标进行评估，然后从这些评估目标中综合对齐的查询评分标准对。DeepRubric 从抽样的种子主题出发，通过递归扩展证据支持的子问题构建证据树，这些子问题的叶子作为原子级且可验证的评估目标。然后它利用证据树综合训练查询和评分标准，确保奖励准确评估查询所要求的信息。利用DeepRubric，我们构建了9K查询——评分标准监督示例，并用基于评分标准的GRPO训练DeepRubric-8B，在三个基准测试中实现了与之前开放的最先进深度研究模型相当的性能，且RL GPU小时数约减少了13倍。
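
Editor's sketch (not from the paper): the evidence-tree idea, where leaves act as atomic evaluation targets and rubrics are derived from them, can be pictured with a small data structure. The class `EvidenceNode`, the rubric wording, and the fraction-satisfied reward are hypothetical; the abstract does not give DeepRubric's actual schema or reward formula.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvidenceNode:
    question: str
    evidence: str = ""                       # supporting snippet (hypothetical field)
    children: List["EvidenceNode"] = field(default_factory=list)

def leaves(node):
    """Collect leaf nodes depth-first; these act as atomic evaluation targets."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def build_rubric(root):
    """One checkable criterion per leaf of the evidence tree."""
    return [f"Report addresses: {leaf.question}" for leaf in leaves(root)]

def rubric_reward(satisfied):
    """Fraction of rubric items a judge marked as satisfied (illustrative choice)."""
    return sum(satisfied) / len(satisfied) if satisfied else 0.0

# Hypothetical tree grown from a seed topic.
tree = EvidenceNode("How do heat pumps compare with gas boilers?", children=[
    EvidenceNode("What is the typical efficiency of each?", evidence="source A"),
    EvidenceNode("What are the costs?", children=[
        EvidenceNode("What is the upfront cost?", evidence="source B"),
        EvidenceNode("What is the running cost?", evidence="source C"),
    ]),
])
rubric = build_rubric(tree)
```

Because the query and the rubric are both synthesized from the same tree, every rubric item corresponds to information the query actually requests, which is the alignment property the abstract emphasizes.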

Context-Aware RL for Agentic and Multimodal LLMs

适用于代理型和多模态大型语言模型的上下文感知强化学习

Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.17053
Pdf link: https://arxiv.org/pdf/2606.17053
Abstract Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
中文摘要 大型语言模型（LLM）常常在回答需要在漫长或复杂语境中识别一个小而决定性的证据时失败，比如工具追踪中的一条线或图像中的细微细节。我们提出了ContextRL，一种情境感知强化学习（RL）方法，通过indirect辅助目标提升长视野推理和多模态性能。ContextRL不仅监督最终答案，而是向模型呈现查询、答案和两个高度相似的上下文，并奖励其选择支持查询-答案对的上下文，从而促进细粒度的基础化。我们在两个领域构建对比上下文数据：对于编码代理，轨迹作为上下文，通过条件过滤构建1k对;对于多模态推理，图像作为上下文，通过生成编辑和相似度搜索构建出7K对。ContextRL在5个长期视野基准中平均比标准GRPO提升+2.2%，在12个多样化的视觉问答基准中均提升+1.8%。为了区分所提目标与额外数据的影响，我们与数据增强基线进行比较，后者将相同的对比语境重新利用为标准查询-上下文-答案示例。这些基线几乎没有带来改善，表明收益主要来自拟议的情境选择目标，而非仅仅来自对比数据。
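
Editor's sketch (not from the paper): the context-selection objective described in the abstract (show a query, an answer, and two similar contexts, then reward picking the one that supports the pair) can be written as a small reward function. The `ContrastiveExample` fields, the prompt wording, and the 0/1 reward are assumptions; the abstract does not state ContextRL's exact prompt format or reward scale.

```python
import random
from dataclasses import dataclass

@dataclass
class ContrastiveExample:
    query: str
    answer: str
    supporting_context: str   # context consistent with the query-answer pair
    distractor_context: str   # highly similar context that does not support it

def make_prompt(example, rng):
    """Shuffle the two contexts; return (prompt, index of the supporting context)."""
    contexts = [example.supporting_context, example.distractor_context]
    order = [0, 1]
    rng.shuffle(order)                      # avoid a fixed position for the answer
    shown = [contexts[i] for i in order]
    label = order.index(0)                  # position where the supporting context landed
    prompt = (
        f"Query: {example.query}\n"
        f"Answer: {example.answer}\n"
        f"Context 0: {shown[0]}\n"
        f"Context 1: {shown[1]}\n"
        "Which context supports the answer? Reply 0 or 1."
    )
    return prompt, label

def selection_reward(model_choice, label):
    """Binary reward for choosing the supporting context."""
    return 1.0 if model_choice == label else 0.0

example = ContrastiveExample(
    query="Which test failed?",
    answer="test_parse_date",
    supporting_context="... FAILED tests/test_utils.py::test_parse_date ...",
    distractor_context="... PASSED tests/test_utils.py::test_parse_date ...",
)
prompt, label = make_prompt(example, random.Random(0))
```

The data-augmentation baseline mentioned in the abstract would instead train on the same contexts as ordinary query-context-answer examples, without this selection step.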

The Value Axis: Language Models Encode Whether They're on the Right Track

价值轴：语言模型编码它们是否走在正确的道路上

Authors: Nick Jiang, Isaac Kauvar, Jack Lindsey
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17056
Pdf link: https://arxiv.org/pdf/2606.17056
Abstract We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g., using a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.
中文摘要 我们研究语言模型是否内部追踪其当前轨迹的价值，定义为其持续策略实现目标的可能性。利用合成的上下文强化学习数据，我们构建了Qwen3-8B的“价值”轴。我们发现，沿该轴的激活区分了高与低口头信心、无回溯和有回溯的推出，以及正确与损坏的代码。偏向高价值因果会抑制自我纠正，减少解释性的冗长;而偏向低价值则会引发回溯和探索。我们证明了直接偏好优化（DPO）可以提升奖励行为（例如使用某个词）的内部价值，使模型在表现出这些行为后更自信地行动。最后，我们将价值轴应用于研究野外环境。例如，我们发现Qwen在培训后对政治敏感聊天查询的价值较低，而监督式微调则提升了训练领域内的内部信心。我们的结果表明，语言模型线性编码了一个预期目标成功的估计值，从而调节其追求方向的信心。
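
Editor's sketch (not from the paper): a linear "axis" in activation space is often built as the normalized difference between the mean activation on one class of examples and the mean on another, and then used for projection (reading) and steering (writing). The abstract does not say this is how the authors construct their value axis, so treat the difference-of-means recipe, the 2-D toy activations, and the helper names below as assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def value_axis(high_value_acts, low_value_acts):
    """Unit vector pointing from the low-value mean to the high-value mean."""
    hi, lo = mean_vector(high_value_acts), mean_vector(low_value_acts)
    diff = [h - l for h, l in zip(hi, lo)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(activation, axis):
    """Scalar readout: how far the activation lies along the axis."""
    return sum(a * x for a, x in zip(activation, axis))

def steer(activation, axis, alpha):
    """Shift the activation by alpha along the axis (positive = toward high value)."""
    return [a + alpha * x for a, x in zip(activation, axis)]

# Toy 2-D "activations" standing in for residual-stream vectors.
high = [[2.0, 0.0], [4.0, 0.0]]
low = [[0.0, 0.0], [0.0, 0.0]]
axis = value_axis(high, low)
steered = steer([0.0, 1.0], axis, 2.0)
```

Because the axis has unit length, steering by `alpha` raises the projection by exactly `alpha`, which makes the steering strength easy to interpret.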

Keyword: diffusion policy

There is no result