Arxiv Papers of Today

Generated: 2026-06-03 20:44:04 (UTC+8); arXiv release time: 2026-06-03 20:00 EDT (2026-06-04 08:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

Authors: Antonio de Sousa Leitão Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa, Luís Jorge Mesquita de Jesus, Dennys Correia da Silva, Allan Kardec Duailibe Barros Filho
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.02614
Pdf link: https://arxiv.org/pdf/2606.02614
Abstract The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazonas basin. Its assets are fiscally and territorially linked primarily to Maranhao -- the state with the lowest HDI in the Federation (0.676, IBGE 2022). This raises the central policy question: under what conditions does BEM exploration generate net positive externalities for Maranhao? The problem is intrinsically multi-agent: the Federal Government seeks revenue and energy security; the state seeks regional welfare under constitutional royalty earmarking; the operator maximizes profit under risk; ANP and IBAMA hold conflicting mandates; and Amazonian communities prioritize territorial and environmental vectors over monetary income. We present Margin Play, a Multi-Agent Reinforcement Learning (MARL) system simulating these tensions under Brazilian empirical calibration and classical economic literature. It implements six agents under the CTDE paradigm, trained with BRO-MARL. Results from 60,000 episodes across six scenarios indicate the answer is conditional on the institutional regime: under the reference baseline, the welfare gain is marginal (W_aval ≈ 1.68), whereas the MA-Prospero configuration yields ΔW = +17.5% and ΔR_com = +21.3%, with a lower environmental liability (E_amb = 0.048 vs. 0.076). The fundamental problem is not a trade-off between production and welfare, but the choice of public policy regime linked to exploration.

Inference Cost Attacks for Retrieval-Augmented Large Language Models

Authors: Chengliang Liu, Liangbo Ning, Yujuan Ding, Wenqi Fan
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2606.02643
Pdf link: https://arxiv.org/pdf/2606.02643
Abstract Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi-stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation. We argue that a more feasible and potent threat to RAG-enhanced LLM systems arises from poisoning external knowledge bases (e.g., web knowledge from the Internet). In this work, we introduce the Retrieval-Augmented Inference Cost Attack (RA-ICA), a novel attacking paradigm that targets the computational cost of RAG-enhanced LLM systems by injecting malicious documents into the external knowledge corpus. To operationalize this attack, we propose Computational Resource Exhaustion via External Poisoning (CREEP), a novel framework that leverages LLM agents to automatically craft malicious documents that are both semantically relevant for retrieval and potent for inducing an abnormal increase in token consumption during the inference phase. To enhance the attack's effectiveness, we introduce Memory-Augmented Group Relative Policy Optimization (MA-GRPO), a novel reinforcement learning algorithm that fine-tunes the agents by learning from a dynamic memory of historical best adversarial documents. Extensive experiments across three real-world datasets demonstrate that RA-ICA increases token consumption by up to 13.12 times with an over 90% success rate, without degrading the integrity of the generated answer.

Motion Planning in Dynamic Environments: A Survey from Classical to Modern Methods

Authors: Zongyuan Shen, Yaming Ou, Shalabh Gupta, Shancheng Zhao, Dehua Zhou, Gao Wang, Zhongqiang Ren, Junfeng Fan, Long Cheng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.02677
Pdf link: https://arxiv.org/pdf/2606.02677
Abstract Motion planning in dynamic environments requires robots to continuously adapt their paths in response to environmental changes for safe and uninterrupted navigation. While many surveys have reviewed planning in static settings, systematic reviews focused on dynamic environments remain limited. This paper presents a comprehensive survey of 138 works, primarily published between 2015 and 2025, spanning both classical and learning-based approaches. The motion planning methods are grouped into five categories based on the concepts of sampling, graph search, model predictive control, learning, and additional classical local planning approaches, including velocity obstacles, potential fields and dynamic windows. The learning techniques include supervised learning and reinforcement learning. We also discuss the role of dynamic perception in motion planning, covering techniques for detecting and modeling moving obstacles using cameras, LiDAR, and event-based sensors. The survey analyzes the principles, strengths, and limitations of each method, with particular attention to challenges unique to dynamic environments, such as prediction uncertainty, human-robot interaction, and the freezing robot problem. The survey provides researchers with a structured understanding of motion planning methods in dynamic environments.

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Authors: Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.02812
Pdf link: https://arxiv.org/pdf/2606.02812
Abstract Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.
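The leave-one-out cross-retrieval idea in the Traj-Evolve abstract can be illustrated with a small sketch. It assumes each patient is already summarized as an embedding vector and that similarity is cosine similarity; both are assumptions made here for illustration, as are the names `retrieve`, `pool`, and `exclude_id`. The abstract does not specify how the ExPool is indexed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(pool, query_vec, k=2, exclude_id=None):
    """Return ids of the k most similar pool entries, skipping exclude_id.

    Passing the query patient's own id as exclude_id gives leave-one-out
    retrieval, so a training patient never retrieves its own trace.
    """
    scored = [
        (cosine(vec, query_vec), pid)
        for pid, vec in pool.items()
        if pid != exclude_id
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]

# Hypothetical embeddings for four patients already in the pool.
pool = {
    "p1": [1.0, 0.0, 0.2],
    "p2": [0.9, 0.1, 0.3],
    "p3": [0.0, 1.0, 0.0],
    "p4": [0.8, 0.0, 0.1],
}

# Training time: p1 is in the pool, so it is excluded from its own neighbors.
train_neighbors = retrieve(pool, pool["p1"], k=2, exclude_id="p1")
```

Without the exclusion, a training patient would trivially retrieve itself, which an unseen patient at inference time can never do; excluding it is one way to keep training-time and inference-time retrieval behavior aligned.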

Fairness Definitions and Metrics in Deep Reinforcement Learning for Drug Discovery in Healthcare: A Rapid Evidence Review

Authors: Esmaeil Shakeri, Ronnie de Souza Santos, Behrouz Far
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.02902
Pdf link: https://arxiv.org/pdf/2606.02902
Abstract Deep reinforcement learning (DRL) is increasingly applied to de novo molecular design, but choices in data, rewards, and evaluation can yield uneven performance across disease areas and chemotypes. Despite this, there is no concise synthesis of how fairness is defined, measured, and tested in DRL-based drug discovery. In this rapid evidence review, we synthesize fairness definitions and metrics for DRL-driven molecule generation in healthcare. We focus on three questions: (i) how dataset composition and split strategies, especially scaffold versus random splits, affect evaluation and distribution shift; (ii) how reward design (e.g., QED, docking, toxicity, synthetic accessibility) can create or mitigate bias, with emphasis on cancer targets; and (iii) which measurable metrics best capture fairness. This includes parity across cancer versus non-cancer indications and across cancer subtypes. It also includes distributional balance in key physicochemical descriptors, scaffold/chemotype diversity, groupwise validity, toxicity, and synthetic accessibility. From 2017 onward, we searched major biomedical, computer science, and engineering literature databases and used arXiv for horizon scanning. Records were screened using PRISMA-style procedures and analyzed via content coding to link reported parity outcomes to dataset and reward choices. Our review provides a concise set of fairness definitions and metrics for DRL molecule generation. It offers practical guidance for reporting distribution parity and outcome parity. It also summarizes how dataset and reward choices relate to observed parity effects and identifies open gaps relevant to trustworthy, cancer-relevant DRL generation.

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

Authors: Yikang Gui, Bikramjit Banerjee, Prashant Doshi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.03017
Pdf link: https://arxiv.org/pdf/2606.03017
Abstract Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment dynamics and task goals. We propose Factorized Contrastive Abstractions for Transferable IRL (ConTraIRL), a framework that enables compositional reward transfer by learning decoupled latent representations of these two factors. ConTraIRL uses a dual-encoder architecture that maps observations into separate dynamics and goal latent spaces, trained with a dual contrastive objective. Temporal alignment encourages the dynamics encoder to learn goal-invariant structure, while the goal encoder captures dynamics-invariant features. This factorization supports reward inference under recombined dynamics-goal settings. Experiments on continuous control benchmarks demonstrate effective few-shot transfer to unseen dynamics-goal pairings, improving sample efficiency and reward recovery over transfer IRL baselines.

Hint-Guided Diversified Policy Optimization for LLM Reasoning

Authors: Zhiyu Cao, Kaixin Wu, Mingjie Zhong, Peifeng Li, Xiaobo Li, Can Ye, Qiaoming Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.03021
Pdf link: https://arxiv.org/pdf/2606.03021
Abstract Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the "propose-select-think" trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

Brief Announcement: Generative Markov Model for Distributed Computing Systems

Authors: Alfreds Lapkovskis, Ali Beikmohammadi, Sindri Magnússon, Praveen Kumar Donta
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.03061
Pdf link: https://arxiv.org/pdf/2606.03061
Abstract Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.
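A factorized Markov model of the kind the abstract describes can be sketched as follows: the next value of each state variable is sampled from its own conditional, which depends only on a small set of parent variables. The two variables (`queue`, `server_up`), their probabilities, and the function names are all hypothetical and chosen only to keep the example small; they are not taken from the paper.

```python
import random

def step(state, factors, rng):
    """Sample the next state one variable at a time.

    factors maps a variable name to (parents, fn); fn(parent_values, rng)
    returns that variable's next value given only its parents.
    """
    return {
        name: fn({p: state[p] for p in parents}, rng)
        for name, (parents, fn) in factors.items()
    }

def next_queue(parents, rng):
    """Queue length: one arrival with probability 0.6, one departure if served."""
    arrivals = 1 if rng.random() < 0.6 else 0
    served = 1 if parents["server_up"] and parents["queue"] > 0 else 0
    return parents["queue"] + arrivals - served

def next_server_up(parents, rng):
    """Server availability: fails with prob. 0.05, recovers with prob. 0.5."""
    if parents["server_up"]:
        return rng.random() >= 0.05
    return rng.random() < 0.5

# Sparse dependency structure: server_up depends only on itself.
factors = {
    "queue": (("queue", "server_up"), next_queue),
    "server_up": (("server_up",), next_server_up),
}

def simulate(initial, factors, steps, seed=0):
    """Roll the factorized model forward and return the visited states."""
    rng = random.Random(seed)
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1], factors, rng))
    return trajectory

trajectory = simulate({"queue": 0, "server_up": True}, factors, steps=20)
```

Because each factor reads only its parents, the joint transition never has to be enumerated, which is the property that makes simulation and policy learning tractable in this kind of model.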

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Authors: Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03070
Pdf link: https://arxiv.org/pdf/2606.03070
Abstract Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.
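The scale-imbalance failure mode and the normalization that the ASymPO abstract describes can be made concrete with a few numbers. The sketch below is one reading of a single sentence of the abstract, not the paper's objective: it assumes a per-response loss of the form advantage times mean token negative log-probability (NLL), and the rewards and token NLLs are invented for illustration.

```python
def zero_sum_advantages(rewards):
    """Group-relative advantages: reward minus the group mean (sums to zero)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def mean_nll(token_nlls):
    """Average token negative log-probability of one response."""
    return sum(token_nlls) / len(token_nlls)

def loss_contributions(advantages, responses_nll, normalize):
    """Per-response loss values A_i * mean_nll_i.

    With normalize=True each term is divided by its own mean NLL. In an
    autograd framework the divisor would be treated as a constant
    (stop-gradient), so the gradient stays nonzero; that detail is an
    assumption of this sketch.
    """
    out = []
    for adv, nlls in zip(advantages, responses_nll):
        scale = mean_nll(nlls)
        value = adv * scale
        if normalize:
            value = value / scale
        out.append(value)
    return out

# Hypothetical group of four stale responses scored under the current policy.
rewards = [1.0, 0.0, 1.0, 0.0]
responses_nll = [[0.2, 0.4], [2.0, 1.0], [0.3, 0.3], [1.5, 2.5]]

advantages = zero_sum_advantages(rewards)
plain = loss_contributions(advantages, responses_nll, normalize=False)
balanced = loss_contributions(advantages, responses_nll, normalize=True)
```

In this toy group the rewarded responses have low NLL and the unrewarded ones high NLL, so the unnormalized terms are dominated by the negative side even though the advantages sum to zero; after per-response normalization the contributions sum to zero again.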

Efficient Hyperparameter Optimization for LLM Reinforcement Learning

Authors: Minping Chen, Bowen Xiao, Du Liang, Chuxuan Zeng, Zeyi Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03073
Pdf link: https://arxiv.org/pdf/2606.03073
Abstract Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO methods remain inefficient for LLM RL due to the massive model scale and resource-intensive training cycles. In this paper, we propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which simultaneously adapts both model size and training budget as fidelity dimensions. JF-HPO has three components: (i) it leverages a small proxy model of the target LLM for efficient training and evaluation in each HPO trial; (ii) it integrates carefully designed early-stopping strategies based on training dynamics; (iii) it introduces an efficient checkpointing mechanism to eliminate redundant computations. Compared with existing HPO methods, JF-HPO significantly improves the computational efficiency of each trial (up to 14.9 times), while achieving better or competitive predictive accuracy under the same time budget. Notably, compared with utilizing hyperparameter configurations from the VeRL Recipe, JF-HPO delivers performance improvements ranging from 5.8% to 111.6%.

Libra: Efficient Resource Management for Agentic RL Post-Training

Authors: Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.03077
Pdf link: https://arxiv.org/pdf/2606.03077
Abstract Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that challenge conventional resource-management assumptions. Three fundamental challenges arise. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory-length distribution drifts over time, rendering any static resource split progressively suboptimal. We present Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0× higher throughput and converges up to 2.5× faster in reward compared to the baselines.

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Peng Fu, Zheng Lin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03087
Pdf link: https://arxiv.org/pdf/2606.03087
Abstract Reinforcement learning with verifiable rewards (RLVR) improves the ability of large language models, yet headline accuracy gains often conceal a hidden cost: previously solved problems quietly become unsolvable as training proceeds. We frame this phenomenon as correct-set turnover, representing the coupled dynamics of solution acquisition and regression over the mastered set. Under this view, retention becomes an explicit optimization target alongside acquisition. We analytically and empirically establish the repair-window principle: the cost of restoring a regressed prompt grows sharply with review delay, defining a low-cost window that standard RLVR pipelines fail to exploit. To address this, we propose a retention-aware review mechanism that tracks mastered prompts and periodically reintroduces them to remind the model of previous solutions. By utilizing pre-rollout batch replacement, the method incurs zero additional rollout overhead. Evaluated across 20 benchmarks spanning image-text, video, and text-only tasks with Qwen3-VL and Qwen2.5-Math, it consistently improves performance over GRPO, DAPO, and replay baselines, demonstrating robust generalizability across modalities and algorithms.
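Correct-set turnover and pre-rollout batch replacement can be illustrated with plain set operations. This is a minimal sketch under stated assumptions: the prompt ids are invented, `turnover` and `replace_in_batch` are hypothetical helper names, and the rule for choosing which mastered prompts to review (and when) is a placeholder, since that choice is what the paper's repair-window analysis addresses.

```python
def turnover(prev_correct, curr_correct):
    """Split two checkpoints' correct sets into acquired, regressed, retained."""
    acquired = curr_correct - prev_correct
    regressed = prev_correct - curr_correct
    retained = prev_correct & curr_correct
    return acquired, regressed, retained

def replace_in_batch(batch, review_prompts, max_review):
    """Swap the last slots of a batch for review prompts before rollout.

    The batch size is unchanged, so no extra rollouts are generated.
    """
    review = [p for p in review_prompts if p not in batch][:max_review]
    if not review:
        return list(batch)
    return list(batch[: len(batch) - len(review)]) + review

# Hypothetical prompt ids solved at two consecutive checkpoints.
prev_correct = {"q1", "q2", "q3"}
curr_correct = {"q2", "q3", "q4", "q5"}
acquired, regressed, retained = turnover(prev_correct, curr_correct)

# Net accuracy rises by one prompt even though q1 was lost.
net_gain = len(acquired) - len(regressed)

# Placeholder review rule: reintroduce tracked mastered prompts in sorted order.
batch = ["q6", "q7", "q8", "q9"]
new_batch = replace_in_batch(batch, sorted(prev_correct), max_review=1)
```

The example shows why headline accuracy can hide regressions: the net gain is positive while one previously solved prompt has dropped out of the correct set.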

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

Authors: Pengyu Chen, Shaowei Li, Kai Wang, Yunsheng Yuan, Kai Han, Jun Luo, Feng Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03094
Pdf link: https://arxiv.org/pdf/2606.03094
Abstract Recent advances in language models have established reinforcement learning as the primary paradigm for eliciting self-correction and long-chain reasoning. While group relative policy optimization (GRPO) offers superior scalability by eliminating the critic network, deploying it on a central infrastructure entails collecting a large volume of data from distributed owners, which poses significant privacy risks. To address these concerns, we introduce federated GRPO (FGRPO), a framework designed to decentralize the fine-tuning of reasoning models across heterogeneous data owners. To effectively mitigate the instability caused by divergent reward scales across heterogeneous tasks, FGRPO incorporates an adaptive aggregation mechanism based on relative performance gain. By characterizing each client's improvement relative to its personalized historical baseline, the framework dynamically prioritizes effective learning trajectories regardless of local task difficulty. FGRPO ensures robust convergence on non-IID data while preserving data privacy.
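Aggregation by relative performance gain can be sketched as below. The abstract does not give FGRPO's weighting formula, so this example assumes one plausible form: each client's gain is its reward improvement divided by the magnitude of its own historical baseline, and aggregation weights are the normalized positive gains. All function names and numbers are hypothetical.

```python
def relative_gain(current_reward, baseline_reward, eps=1e-8):
    """Improvement over a client's own baseline, in units of that baseline."""
    return (current_reward - baseline_reward) / (abs(baseline_reward) + eps)

def aggregation_weights(gains):
    """Normalize positive gains into weights; fall back to uniform if none."""
    positive = [max(g, 0.0) for g in gains]
    total = sum(positive)
    if total == 0.0:
        return [1.0 / len(gains)] * len(gains)
    return [p / total for p in positive]

def aggregate(updates, weights):
    """Weighted sum of client parameter updates (flat lists of floats)."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

# Three clients whose tasks have very different reward scales.
baselines = [0.10, 10.0, 2.0]
currents = [0.15, 11.0, 1.8]
gains = [relative_gain(c, b) for c, b in zip(currents, baselines)]
weights = aggregation_weights(gains)

# Hypothetical two-parameter updates from each client.
updates = [[0.2, -0.1], [1.0, 1.0], [-0.5, 0.3]]
global_update = aggregate(updates, weights)
```

The second client has the largest raw reward increase (1.0) but a smaller relative gain than the first (10% versus 50%), so the first client receives the larger weight. Measuring progress against each client's own baseline is what makes the weighting insensitive to reward scale.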

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Authors: Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.03102
Pdf link: https://arxiv.org/pdf/2606.03102
Abstract Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
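The MDP framing in this abstract can be sketched in a few lines: a state built only from statistics of the final answers, a stop-or-sample action, and a return that subtracts latency and compute penalties from correctness, which is the Lagrangian form of a budget-constrained objective. The feature set, the penalty weights, and the fixed-threshold policy are illustrative assumptions; the threshold rule is a stand-in for the learned controller, not the paper's policy.

```python
from collections import Counter

def state_features(answers, round_idx):
    """Summarize the answers sampled so far (assumes at least one answer)."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "round": round_idx,
        "n_samples": len(answers),
        "top_share": top_count / len(answers),
        "n_distinct": len(counts),
        "top_answer": top_answer,
    }

def episode_return(correct, rounds, samples, lam_latency, lam_cost):
    """Correctness minus multiplier-weighted latency (rounds) and compute (samples)."""
    return float(correct) - lam_latency * rounds - lam_cost * samples

def threshold_policy(state, share_threshold=0.75, max_rounds=4):
    """Stand-in for the learned controller: stop once the answers agree enough."""
    if state["top_share"] >= share_threshold or state["round"] >= max_rounds:
        return "stop"
    return "sample"

# One round of four samples in which three answers agree.
state = state_features(["42", "42", "41", "42"], round_idx=1)
action = threshold_policy(state)
ret = episode_return(correct=True, rounds=1, samples=4,
                     lam_latency=0.05, lam_cost=0.01)
```

Raising `lam_latency` or `lam_cost` corresponds to tightening the round or sample budget in the constrained problem, which pushes an RL-trained controller toward stopping earlier.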

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

Authors: Yanyu Zhu, Hoilam Pao, Niu Hu, Wei Guo, Shaoxiong Zhan, Boyu Lai, Zitai Wang, Yongqin Zeng, Hai-Tao Zheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.03113
Pdf link: https://arxiv.org/pdf/2606.03113
Abstract Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization as a Markov Decision Process and propose LEDE, a framework that uses offline reinforcement learning. LEDE learns a policy to dynamically select the optimal exit layer and speculation length based on the local context of the generated sequence at each step, balancing computational cost and draft quality. Comprehensive evaluations on Llama-2 and Llama-3 models show LEDE achieves up to a 2.0× to 2.7× speedup over autoregressive decoding and provides an additional 17% speedup over the static speculative baselines.
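The trade-off between exit layer and speculation length can be seen in a simple cost model. The sketch below uses the standard speculative-decoding estimate that, with a per-token acceptance rate α and k drafted tokens, a verification step yields (1 − α^(k+1)) / (1 − α) tokens in expectation. It further assumes that drafting one token costs the fraction of layers executed and that verification costs one full forward pass. The acceptance rates per exit layer are invented, and the exhaustive search is a stand-in for, not a description of, LEDE's learned policy.

```python
def expected_tokens(alpha, k):
    """Expected tokens produced per verification step with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def speedup(alpha, k, exit_layer, num_layers):
    """Tokens per unit cost relative to plain autoregressive decoding.

    Cost is measured in full forward passes: k early-exit drafts plus one
    full verification pass.
    """
    draft_cost = k * exit_layer / num_layers
    return expected_tokens(alpha, k) / (draft_cost + 1.0)

def best_config(acceptance_by_layer, num_layers, max_k):
    """Exhaustively pick the (speedup, exit_layer, k) with the highest speedup."""
    candidates = [
        (speedup(alpha, k, layer, num_layers), layer, k)
        for layer, alpha in acceptance_by_layer.items()
        for k in range(1, max_k + 1)
    ]
    return max(candidates)

# Hypothetical acceptance rates: deeper exits draft better but cost more.
acceptance_by_layer = {8: 0.6, 16: 0.8}
best = best_config(acceptance_by_layer, num_layers=32, max_k=8)
```

In practice the acceptance rate varies with the local context of the sequence, so a single static choice of exit layer and speculation length is rarely best at every step; that is the motivation the abstract gives for learning a per-step policy.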

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

Authors: Jacob Lavoie, Marwan Besrour, William Lemaire, Jean Rouat, Réjean Fontaine, Eric Plourde
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2606.03118
Pdf link: https://arxiv.org/pdf/2606.03118
Abstract Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion cells with a microelectrode array such as an epiretinal implant. Epiretinal implants are known to generate visible anisotropic shapes elongated along the axon fascicles of neighboring retinal ganglion cells. Recent work has demonstrated that to obtain isotropic pixel-like shapes, it is possible to map axon fascicles and avoid stimulating them by inactivating electrodes or lowering stimulation current levels. Avoiding axon fascicle stimulation aims to remove brushstroke-like shapes in favor of a more limited set of pixel-like shapes. Approach: In this study, we propose the use of isotropic and anisotropic shapes to render intelligible images on the retina of a virtual patient in a reinforcement learning environment named rlretina. The environment formalizes the task as using brushstrokes in a stroke-based rendering task. Main Results: We train a deep reinforcement learning agent that learns to assemble isotropic and anisotropic shapes to form an image. We investigate which error-based or perception-based metrics are adequate to reward the agent. The agent is trained in a model-based data generation fashion using the psychophysically validated axon map model to render images as perceived by different virtual patients. We show that the agent can generate more intelligible images compared to the naive method in different virtual patients. Significance: This work shares a new way to address epiretinal stimulation that constitutes a first step towards improving visual acuity in artificially-restored vision using anisotropic phosphenes.
中文摘要 目的：诸如年龄相关黄斑变性和色素性视网膜炎等疾病会导致感光细胞层的退化。恢复视力的一种方法是用微电极阵列（如视网膜前植入物）电刺激幸存的视网膜神经节细胞。已知视网膜前植入物能在邻近视网膜神经节细胞的轴突束上产生可见的各向异性形状。最新研究表明，为了获得各向同性的像素状形状，可以绘制轴突束的图谱，并通过失活电极或降低刺激电流来避免刺激它们。避免轴突束刺激旨在去除笔触状形状，转而呈现更为简化的像素状形状。方法：本研究提出利用各向同性和各向异性形状，在名为rlretina的强化学习环境中，在虚拟患者的视网膜上渲染可理解的图像。环境将任务形式化为使用笔触的笔触渲染任务。主要结果：我们训练了一个深度强化学习代理，它学习组装各向同性和各向异性形状以形成图像。我们调查哪些基于错误或基于感知的指标足以奖励代理人。该代理采用基于模型的数据生成方式训练，使用经过心理物理验证的轴突映射模型，渲染不同虚拟患者的感知图像。我们证明，代理在不同的虚拟患者中能够生成比天真方法更易理解的图像。重要性：本研究分享了一种新的视网膜前刺激方法，这是利用各向异性光显效果改善人工恢复视力的第一步。

Cost-Aware Optimization for Agentic Query Execution

代理查询执行的成本感知优化

Authors: Lunyiu Nie, Yilin Xia, Yiren Liu, Christopher Jermaine, Swarat Chaudhuri
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2606.03152
Pdf link: https://arxiv.org/pdf/2606.03152
Abstract Classical query optimization searches over algebraically equivalent plans that differ only in cost. This assumption breaks once LLM-backed operators enter the picture: their placement, ordering, and granularity jointly determine both dollar cost and answer quality, and the right choice among the alternatives is often revealed only at runtime. We formalize this setting as agentic query execution, a query execution paradigm in which agent-based planning is interleaved with execution, and agent workflow optimization becomes the analogue of classical query optimization. We then present EnumGRPO, a self-improving optimizer for this setting. During a learning stage, EnumGRPO enumerates query plans over decisions such as execution paradigm, operator type, operator placement, selectivity scope, and projection width, then distills quality-cost feedback into reusable planning heuristics via in-context reinforcement learning. Across four databases in SWAN, EnumGRPO achieves 35.4% execution accuracy at $0.011 per query in LLM-operator cost, a ~317x cost reduction over the hybrid query baseline with an 18% relative improvement in answer accuracy.
中文摘要 经典查询优化是在代数等价方案上搜索，这些方案仅在成本上有所不同。一旦LLM支持的算符出现，这一假设就被打破了：它们的位置、排序和细度共同决定了费用成本和答案质量，而正确的选择往往只在运行时才会被揭示。我们将此设定形式化为代理查询执行，这是一种基于代理的规划与执行交错的查询执行范式，代理工作流优化成为经典查询优化的类比。随后介绍EnumGRPO，一个针对该设定的自我优化优化器。在学习阶段，EnumGRPO枚举执行范式、操作符类型、操作符位置、选择范围和投影宽度等决策的查询计划，然后通过上下文强化学习将质量-成本反馈提炼为可重用的规划启发式。在SWAN的四个数据库中，EnumGRPO实现了35.4%的执行准确率，且每查询成本为LLM操作员成本0.011美元，成本比混合查询基线降低约317倍，且回答准确度相对提升18%。
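The enumeration step described in the EnumGRPO abstract above can be pictured as a Cartesian product over the decision dimensions, followed by a comparison of plans within one group. The sketch below is only illustrative: the option values in `DECISIONS` are hypothetical, and the group-standardized score is an assumption suggested by the method's name rather than a detail stated in the abstract.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical option values for the decision dimensions named in the abstract.
DECISIONS = {
    "paradigm": ["sql_first", "llm_first"],
    "operator_type": ["map", "filter"],
    "placement": ["before_join", "after_join"],
    "selectivity_scope": ["row", "group"],
    "projection_width": ["narrow", "wide"],
}


def enumerate_plans(decisions):
    """Yield every combination of decision values as a dict."""
    keys = list(decisions)
    for values in product(*(decisions[k] for k in keys)):
        yield dict(zip(keys, values))


def group_relative_advantages(scores):
    """Standardize scores within one group (assumed GRPO-style comparison)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All plans scored the same, so none is preferred.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


plans = list(enumerate_plans(DECISIONS))
```

With two options per dimension the plan space has 2^5 = 32 entries. A real optimizer would prune this space or sample from it, since each evaluated plan costs LLM calls.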

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

ConTrack：受限的手部动作追踪，带自适应权衡控制

Authors: Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, Xiaolong Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.03177
Pdf link: https://arxiv.org/pdf/2606.03177
Abstract Human demonstrations provide strong priors for robot manipulation, yet transferring them to real robots is non-trivial due to the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted rewards that require per-sequence tuning and break under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task--style trade-offs online using a dual-variable update. ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulated tracking and on a real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior art while preserving joint and contact fidelity. Website: this https URL.
中文摘要 人体演示为机器人操作提供了坚实的先验，但由于运动学差距，将其转化到真实机器人上并不容易。在灵巧操作中，即使在模拟器中，追踪长视距且接触丰富的序列仍具挑战性：参考跟踪策略必须保持物体在目标轨迹上，同时保持关节运动和接触时序的展示。现有方法通常依赖手工定制的奖励调优，需要逐序列调优，且在有限的交互预算下会被破坏。我们介绍ConTrack，一个可随跟踪数据扩展的强化学习（RL）框架。ConTrack将对象跟踪视为约束，剩余控制权分配给运动保真度，从而通过双变量更新在线调整任务式权衡。此外，ConTrack还通过自适应的中期轨迹重置库稳定长视野学习，该库可重用策略可达的模拟器状态。我们在模拟跟踪和真实机器人上的定性和定量结果表明，ConTrack在保持关节和接触真实度的同时，显著提升了成功率和物体姿态的准确性。网站：这个 https URL。
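The dual-variable update mentioned in the ConTrack abstract above resembles standard Lagrangian dual ascent for constrained RL. The following is a minimal sketch under that assumption; the function names, the tolerance, and the step size are hypothetical and not taken from the paper.

```python
def dual_update(lmbda, constraint_violation, step_size=0.05):
    """Projected dual ascent: raise lambda when the constraint is violated,
    lower it otherwise, and never let it go below zero."""
    return max(0.0, lmbda + step_size * constraint_violation)


def combined_reward(fidelity_reward, tracking_error, lmbda, tolerance=0.02):
    """Lagrangian-style reward: motion fidelity minus a weighted penalty
    on how far the object tracking error exceeds the tolerance."""
    violation = tracking_error - tolerance
    return fidelity_reward - lmbda * violation, violation


# Toy run: tracking error starts high, then drops below the tolerance.
lmbda = 0.0
history = []
for err in [0.10, 0.08, 0.01, 0.01]:
    _, violation = combined_reward(1.0, err, lmbda)
    lmbda = dual_update(lmbda, violation)
    history.append(lmbda)
```

In this toy run the multiplier grows while the object drifts off its target and shrinks once tracking is within tolerance, which hands weight back to motion fidelity.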

MemTrain: Self-Supervised Context Memory Training

MemTrain：自我监督上下文记忆训练

Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.03197
Pdf link: https://arxiv.org/pdf/2606.03197
Abstract Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.
中文摘要 内存是长视野LLM代理不可或缺的能力，使它们能够保存和利用通过长时间交互积累的信息。现有的记忆代理方法通常通过下游任务的强化学习进行端到端训练。然而，为内存密集型场景收集高质量注释问题成本高昂，且产生的训练数据往往缺乏足够的多样性来涵盖一般的记忆行为。在本研究中，我们提出了MemTrain，一种自我监督的训练框架，旨在普遍提升LLM代理的上下文记忆能力，从而在后续训练中更有效。MemTrain 在无标签的维基百科语料库上引入了两个耦合代理任务：（1）端到端的掩蔽重建目标，要求模型在多次内存更新后恢复掩蔽实体，从而从最终结果的角度鼓励内存维护;以及（2）中间记忆回忆目标，要求模型利用中间记忆状态重建掩蔽的历史信息，鼓励在整个交互过程中忠实压缩和记忆完整性。这两个目标均通过GRPO共同优化。对长文本质量保证和基于搜索的质量保证基准的广泛实验表明，MemTrain在不同模型中持续提升下游对内存密集型推理的性能，较直接针对特定任务的训练后提升高达17.67分。
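The end-to-end masked reconstruction objective in the MemTrain abstract above needs a reward that can be computed without labels. One plausible form, assumed here and not confirmed by the abstract, is the fraction of masked entities the model recovers. The helper names and the `[MASK]` token are hypothetical.

```python
def mask_entities(text, entities, mask_token="[MASK]"):
    """Replace each listed entity in the text with a mask token."""
    masked = text
    for entity in entities:
        masked = masked.replace(entity, mask_token)
    return masked


def reconstruction_reward(predicted, gold):
    """Fraction of masked entities recovered exactly (case-insensitive).

    Predictions are compared position by position with the gold entities;
    missing predictions count as misses.
    """
    if not gold:
        return 0.0
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predicted, gold)
    )
    return hits / len(gold)


masked = mask_entities("Paris is the capital of France.", ["Paris", "France"])
reward = reconstruction_reward(["paris", "Spain"], ["Paris", "France"])
```

Because the gold entities come from the unlabeled text itself, this kind of reward needs no human annotation, which matches the self-supervised framing of the abstract.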

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

基于跨领域视频的视频预测模型强化学习

Authors: Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux, He Liu, Shujian Yu, Kevin Sebastian Luck, Vincent François-Lavet
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03201
Pdf link: https://arxiv.org/pdf/2606.03201
Abstract Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: this https URL
中文摘要 由于缺乏奖励信号和领域间隙，从专家视频中进行强化学习具有挑战性。我们引入了XIPER（跨领域视频预测奖励），这是一种用于从视觉上不同领域收集的专家视频中学习的奖励模型，该领域代理的外观因颜色、形态或模拟与现实之间的差距等因素而不同。更具体地说，XIPER训练了一个跨域视频预测模型，将代理观察映射到专家领域，并将预测似然作为奖励信号。DMC色彩套件（8项任务）和DMC身体套件（3项任务）的实验表明，尽管存在因子颜色和形态差异等领域差距，XIPER仍持续优于基线。我们进一步分析了模拟到真实传输数据集上的XIPER，证明它仅给出模拟专家视频即可为真实机器人观察产生有意义的奖励信号。代码、预训练模型、数据集和视频演示可在我们的项目网页找到：https URL

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

正义造就力量：对齐已验证的隐藏状态增强现实逻辑

Authors: Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong, Wenpu Liu, Yiran Yao, Weichu Xie, Yuqi Xu, Caoyuan Ma, Wenqi Shao, Xiaoying Zhang, Nan Duan, Jiaqi Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03234
Pdf link: https://arxiv.org/pdf/2606.03234
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring the geometric structure shared among their hidden states. Investigating this structure, we find that at the anchor token (the position immediately before the answer marker), correct rollouts converge naturally because they must produce the same answer (cosine similarity ~0.84), yet each retains residual variance from its unique reasoning path. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. Based on this observation, we propose Hidden-Align, an auxiliary loss function that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, with zero overhead in both training and inference. On eight mathematical reasoning benchmarks, Hidden-Align improves average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B respectively, with consistent pass@k gains across all three scales, supported by ablations on loss type, anchor position, layer depth, and loss weight.
中文摘要 可验证奖励强化学习（RLVR）已成为提升大型语言模型数学推理的主流方法，但当前方法将每个正确展开简化为单个奖励位，忽视了隐藏状态之间共享的几何结构。研究该结构时，我们发现在锚点（即答案标记前的位置），正确的滚出自然收敛，因为它们必须产生相同的答案（余弦相似度~0.84），但每个序列仍保留其唯一推理路径的残差方差。此时鼓励完全对齐，推动模型提取统一的“正确决策”表示，降低对推理路径的敏感度。基于这一观察，我们提出了隐藏对齐（Hidden-Align）这一辅助损失函数，在强化学习训练期间，将正确展开的最后一层隐藏状态对齐锚点标记处，且训练和推断均无开销。在八个数学推理基准测试中，Hidden-Align 在 QWEN3-1.7B、4B 和 14B 的 DAPO 基线平均pass@1分别提升了 3.8、6.2 和 5.4 个百分点，三个尺度均有稳定的 pass@k 提升，并得到了损失类型、锚点位置、层深度和损失重量的消融支持。
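The Hidden-Align abstract above describes aligning the anchor-token hidden states of correct rollouts. As a rough illustration, the sketch below uses one minus the mean pairwise cosine similarity as the auxiliary loss. This particular loss form is an assumption (the abstract mentions ablations over loss type), and plain Python lists stand in for real hidden-state tensors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(hidden_states, is_correct):
    """1 - mean pairwise cosine similarity among correct rollouts.

    hidden_states: one anchor-token vector per rollout.
    is_correct: one boolean per rollout from the verifier.
    Returns 0.0 when fewer than two rollouts are correct.
    """
    kept = [h for h, ok in zip(hidden_states, is_correct) if ok]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_sim
```

With the reported similarity of about 0.84 between correct rollouts, a loss of this form would start near 0.16 and be driven toward zero. Incorrect rollouts are excluded, so they contribute no alignment gradient.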

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

当RLHF失败时：奖励黑客、崩溃与评估者游戏的机制分类法

Authors: Zelalem Abahana
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03238
Pdf link: https://arxiv.org/pdf/2606.03238
Abstract Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.
中文摘要 来自人类反馈的强化学习（RLHF）通过用学习且可扩展的代理替代一个未明确的人类目标，使大规模的后期训练成为可能。同样的替代会形成结构化的失效面：优化可能提高学习奖励，而外部质量下降，降低代理和评判得分，暴露代理不对齐，或产生评估者特有的分歧。我们提出了一项针对紧凑RLHF流水线的实证失效模式研究，该流水线包含近端策略优化（PPO）、直接偏好优化（DPO）、不确定性惩罚PPO（UP-PPO）、奖励模型不确定性、近似策略漂移、多样性和重复诊断，以及两名外部LLM评判。我们不将奖励黑客视为单一终端事件，而是利用学习到的奖励方向、裁判得分和平均裁判得分对匹配的检查点转换进行分类。在61个检查点行和1920行级转换中，激进PPO拥有最高的局部奖励黑客率（14.45%;自助95%置信区间：10.16-18.75），而UP-PPO在同一攻击模式下获得较低的奖励率（11.33-10.94%）。过渡前逻辑模型预测未来行级奖励黑客的ROC-AUC 0.821，行级分析发现局部奖励黑客在12种设置中有3种检查点平均值未命中。核心结论是方法论上的：RLHF失效不仅是最终模型的病理，更是可分类、局部化和部分预见的训练动态。
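The RLHF abstract above classifies checkpoint-to-checkpoint transitions by the direction of the learned reward and of the judge scores. A direction-based rule of that kind could look like the sketch below. The label names, the mapping of directions to labels, and the `eps` dead zone are all hypothetical, since the abstract does not give the exact decision rules.

```python
def classify_transition(d_reward, d_judge_a, d_judge_b, eps=0.0):
    """Label one transition from the signs of its score changes.

    d_reward: change in learned (proxy) reward.
    d_judge_a, d_judge_b: changes in the two external judge scores.
    eps: changes with magnitude <= eps count as flat.
    """
    d_avg = (d_judge_a + d_judge_b) / 2
    judges_split = (d_judge_a > eps and d_judge_b < -eps) or (
        d_judge_a < -eps and d_judge_b > eps
    )
    if judges_split:
        return "evaluator_disagreement"
    if d_reward > eps and d_avg < -eps:
        return "reward_hacking"  # proxy up, external quality down
    if d_reward < -eps and d_avg < -eps:
        return "collapse"  # proxy and judges both down
    if d_reward < -eps and d_avg > eps:
        return "proxy_under_alignment"  # judges up, proxy down
    return "aligned_or_flat"
```

Applying a rule like this per row instead of per checkpoint average is what lets localized reward hacking show up even when the averaged scores look healthy.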

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

PaddleOCR-VL-1.6：在区域优化不足和渐进式后训练下拓展文档解析的前沿

Authors: Zelun Zhang, Hongen Liu, Suyin Liang, Yubo Zhang, Yiqing Xiang, Jiaxuan Liu, Ting Sun, Manhui Lin, Yue Zhang, Changda Zhou, Tingquan Gao, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.03264
Pdf link: https://arxiv.org/pdf/2606.03264
Abstract We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.
中文摘要 我们介绍了 PaddleOCR-VL-1.6，这是一种基于 PaddleOCR-VL-1.5 的升级版紧凑文档解析模型。尽管PaddleOCR-VL-1.5建立了强有力的0.9B基线，其剩余误差集中在模型行为不稳定、数据覆盖稀少或监督不可靠等优化不足的区域。PaddleOCR-VL-1.6 没有无差别地扩展训练语料库，而是引入了一个区域感知数据优化框架，识别前一模型中的弱区域，对这些区域进行有针对性的增强，并提升监督信号的可靠性。它还采用基于精心策划的数据选择和强化学习的渐进式训练后方案，通过分阶段优化将模型性能提升到更高水平。PaddleOCR-VL-1.6在OmniDocBench v1.6上取得了96.33%的最高得分，展现出对顶级VLM的强劲竞争力，并为PaddleOCR-VL系列提供了实用的培训后方案。

EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations

EaDex：低成本演示中的交叉身体灵巧操作框架

Authors: Qian Zhao, Xin Tong, Chengdong Wu, Yang Yang, Yingtian Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.03268
Pdf link: https://arxiv.org/pdf/2606.03268
Abstract Dexterous manipulation learning has long been hindered by the high costs of data and training, as pure reinforcement learning typically requires large-scale interactive exploration and imitation learning depends on high-quality demonstrations that are expensive to collect. To address this problem, we propose EaDex, a multi-embodiment dexterous manipulation learning framework under low-cost demonstration conditions, which enables rapid generation of demonstration data and consequently reduces training time for efficient dexterous manipulation. At the data level, EaDex captures human hand motions using only a single RGB-D camera and constructs structured demonstration data through MANO-based hand modeling, data normalization, and motion retargeting. At the learning level, we introduce a contact-reward-based dynamic demonstration annealing mechanism, which guides early-stage exploration under demonstration and gradually transitions to autonomous optimization with accumulating contact rewards. Using our custom dataset, we evaluate EaDex on three dexterous hands and three articulated object-opening tasks, covering nine cross-embodiment manipulation settings, achieving a 55.3% relative improvement over the baseline without demonstration annealing. These results validate the effectiveness of the proposed low-cost demonstration pipeline and the dynamic demonstration annealing strategy for dexterous manipulation learning.
中文摘要 灵巧操作学习长期以来受制于高昂的数据和训练成本，纯强化学习通常需要大规模的互动探索，而模仿学习依赖于高质量且收集成本高昂的演示。为解决这一问题，我们提出了EaDex，一种多身形的灵活操作学习框架，在低成本的演示条件下实现，能够快速生成演示数据，从而缩短训练时间，实现高效的灵巧操作。在数据层面，EaDex 仅用一台 RGB-D 摄像机捕捉人类手部动作，并通过基于 MANO 的手部建模、数据归一化和运动重定向构建结构化演示数据。在学习层面，我们引入了基于接触奖励的动态演示退火机制，在演示下引导早期探索，逐步过渡到自主优化，并逐步积累接触奖励。利用我们的自定义数据集，我们在三项灵活手和三项关节开启物体任务中评估了EaDex，涵盖九种交叉身体操作设置，较基线相比无示范退火时的相对提升达55.3%。这些结果验证了所提低成本示范流水线及动态示范退火策略在灵巧操作学习中的有效性。
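The dynamic demonstration annealing in the EaDex abstract above shifts training from demonstration guidance to autonomous optimization as contact rewards accumulate. The sketch below shows one simple way such a schedule could work. The linear schedule and the `saturation` constant are assumptions for illustration, not the paper's mechanism.

```python
def demo_weight(cumulative_contact_reward, saturation=100.0):
    """Weight on the demonstration term, decaying linearly from 1 to 0
    as accumulated contact reward approaches the saturation level."""
    frac = cumulative_contact_reward / saturation
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - frac


def blended_loss(rl_loss, imitation_loss, cumulative_contact_reward,
                 saturation=100.0):
    """Mix imitation and RL losses according to the annealed weight."""
    w = demo_weight(cumulative_contact_reward, saturation)
    return w * imitation_loss + (1.0 - w) * rl_loss
```

Tying the schedule to contact reward instead of a fixed step count means the demonstrations fade only once the policy actually makes useful contact with the object.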

GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization

GPU并行多任务强化学习，演示引导策略优化

Authors: Rui Zhang, Qiwei Wu, Zhengyu Zhang, Tao Li, Yunrong Guo, Junjie Lai, Renjing Xu, Weihua Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.03335
Pdf link: https://arxiv.org/pdf/2606.03335
Abstract Large-scale GPU-parallel reinforcement learning has changed what can be trained in robot simulation, yet most systems still optimize one specialist policy per task. We propose a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, and instantiate it as MT-Libero using LIBERO assets and task predicates in Isaac Lab. The resulting benchmark supports simultaneous reinforcement learning over heterogeneous task suites with parallel rendering, physics randomization, and state-input or visual-input policies. To make such training practical under sparse success signals and limited prior data, we further propose DGPO, an on-policy demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. DGPO enables a tunable preference toward demonstrated task distributions, outperforming both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO.
中文摘要 大规模GPU并行强化学习改变了机器人仿真中可训练的内容，但大多数系统仍然为每个任务优化一个专业策略。我们提出了一种构建方法，将结构化操作任务族转化为GPU并行多任务强化学习基准，并在Isaac实验室使用LIBERO资产和任务谓词实现MT-Libero。最终基准测试支持异构任务套件上的同时强化学习，支持并行渲染、物理随机化以及状态输入或视觉输入策略。为了使此类训练在成功信号稀疏且先前数据有限的情况下实用，我们进一步提出了DGPO，这是一种基于策略的示范指导方法，结合了重要性加权PPO与匹配示范行动的自适应行为克隆。DGPO允许对已演示任务分布进行可调节的偏好，优于先验无强化学习和现有基于演示的方法，同时保持策略PPO的稳定性和在线改进优势。

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

本地指导，全球影响：高斯重塑信任区域解锁行为转变

Authors: Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang, Runze Liu, Pablo Samuel Castro, Aaron Courville, Ling Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03382
Pdf link: https://arxiv.org/pdf/2606.03382
Abstract While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hinders transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at this https URL.
中文摘要 虽然近端策略优化（PPO）在静止环境中表现出强劲的性能，但我们发现其标准优化范式在连续和非平稳环境中表现不佳。故障并非源于模型容量不足或过于限制性的削波。相反，PPO执行的是持续且方向性低效的局部更新，表明缺乏具备具几何感知的指导来积累有意义的行为变化，最终阻碍向新行为模式的转变。尽管基于发散的正则化引入了部分几何意识，但其单调增加的惩罚隐含地抑制了大幅度的政策偏离，即使这些转变对有效适应是必要的。为解决这一限制，我们提出了高斯信任区域策略优化（GTR），即利用高斯核重塑信任区域。由此产生的约束是有界且非单调的，在持续高优势更新下，在逐渐放松的同时，局部稳定性增强。为进一步提升鲁棒性，我们引入了混合高斯锚，能适应近期政策轨迹，减少因陈旧引用引起的方差。GTR不依赖架构，能够在各类游戏、模拟机器人控制、开放世界探索和语言模型后训练中实现强劲性能。这些结果表明，几何感知的信任区域设计可以成为复杂非平稳环境中稳健强化学习的有前景方向。我们的代码可在此 https URL 访问。

PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion

PerchRL：基于视觉的敏捷栖息，在快速且不规则运动下的倾斜平台上

Authors: Zihong Lu, Zongzhuo Liu, Huaxu Li, Jinqiang Cui, Jie Mei, Youmin Gong, U Kei Cheang, Boyu Zhou
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03441
Pdf link: https://arxiv.org/pdf/2606.03441
Abstract Autonomous vision-based perching of quadrotors on moving inclined platforms is critical for air-ground collaboration but remains challenging due to the limited field of view (FOV). In this paper, we propose PerchRL, a reinforcement learning (RL) framework for vision-based agile perching on inclined platforms under rapid and irregular motion. Specifically, we employ a two-stage learning strategy consisting of state-based pre-training followed by vision-based fine-tuning. To improve generalization across diverse platform motions, we employ randomized platform trajectories to prevent overfitting and temporal augmentation methods to capture latent motion patterns from historical observations. During vision-based fine-tuning, a hybrid learning framework consisting of visibility-aware state augmentation and active perception rewards is presented to improve robustness under intermittent visual loss. Extensive simulation and real-world experiments demonstrate the feasibility, stability, and real-time performance of PerchRL, while successful deployment across distinct quadrotor platforms further validates its adaptability. The source code will be released to benefit the community.
中文摘要 基于自主视觉的四旋翼机停靠在移动倾斜平台上对于空地协作至关重要，但由于视野有限，仍具挑战性。本文提出了PerchRL，一种用于在快速且不规则运动下倾斜平台上进行基于视觉的敏捷蹲点的强化学习（RL）框架。具体来说，我们采用了两阶段学习策略，先是基于状态的预训练，随后是基于视觉的微调。为了提升跨不同平台运动的泛化，我们采用随机平台轨迹以防止过拟合，并采用时间增强方法捕捉历史观测中的潜动模式。在基于视觉的微调过程中，呈现一种混合学习框架，结合可见性感知状态增强和主动感知奖励，以提升间歇性视觉损失下的鲁棒性。广泛的模拟和实际实验证明了PerchRL的可行性、稳定性和实时性能，而在不同四旋翼平台上的成功部署进一步验证了其适应性。源代码将被发布以惠及社区。

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

从错误中学习：安全代码大型语言模型的树状自玩

Authors: Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu, Hengheng Zhang, Zhengsu Chen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03489
Pdf link: https://arxiv.org/pdf/2606.03489
Abstract While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.
中文摘要 虽然大型语言模型（LLM）在代码生成方面表现出色，但它们仍容易复制训练数据中存在的微妙但关键的漏洞。当前比对技术，如监督微调（SFT）和强化学习（RL），通常在序列层面应用粗粒度优化。这种方法往往无法解决安全漏洞的局部性问题，一个错误的令牌选择就可能危及整个程序。为弥合这一差距，我们引入了类似树的自玩（TSP）框架，将安全代码生成重新定义为一个细粒度的顺序决策过程。与盲目最大化似然的标准方法不同，TSP构建了一个决策树，模型探索分支轨迹——既生成安全的“黄金路径”，也生成易受影响的变体。通过将代码生成视为自我游戏，模型学会严格区分自身的局部错误。这提供了一个密集的、符合策略的学习信号，迫使在漏洞通常出现的关键决策节点进行自我纠正。我们的实验表明TSP从根本上提升了模型的可靠性。在Python安全基准测试中，TSP将CodeLlama-7B的通过率（SPR@1）提升至75.8%，显著优于SFT（57.0%）和非结构化自玩基线。关键是，TSP实现了强健的分布外泛化：该模型不仅将看不见类别（CWE）中的漏洞减少了24.5%，还成功将从C/C++学到的安全原则转移到包括Python、Go和JavaScript在内的多种语言中。这表明TSP不仅仅是记忆补丁，而是内化抽象且语言无关的安全逻辑。

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

ThoughtFold：通过内省偏好学习折叠推理链条

Authors: Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03503
Pdf link: https://arxiv.org/pdf/2606.03503
Abstract Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains-of-Thought (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in over-thinking in LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
中文摘要 大型推理模型（LRM）得益于基于思维链（CoTs）的可验证奖励强化学习（RLVR）取得了显著进展。然而，由于长CoT自然包含试错，主流RLVR方法选择结果正确的CoT轨迹进行记忆，长期CoT中的重复探索不可避免地被强化，导致LRMs的过度思考问题。以往解决这一问题的尝试主要赋予较短轨迹更多优势，但它们的学习信号仍基于结果，无法减少长CoT中冗余探索的记忆。因此，我们提出了ThoughtFold框架，利用细粒度偏好学习来减少冗余探索，实现高效的推理。ThoughtFold采用内省策略识别每个正确轨迹中的冗余，从而产生一系列候选子轨迹。利用这一光谱，我们引入了一个掩蔽偏好优化目标，明确惩罚冗余探索，鼓励模型直接桥接关键推理片段，有效地将推理链折叠成更简洁的路径。大量实验表明，ThoughtFold显著提升了效率。它在保持最先进精度的同时，将DeepSeek-R1-Distill-Qwen-7B的代币使用率降低约56%。

Post-Hoc Robustness for Model-Based Reinforcement Learning

基于模型的强化学习的事后鲁棒性

Authors: Siemen Herremans, Ali Anwar, Siegfried Mercelis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03521
Pdf link: https://arxiv.org/pdf/2606.03521
Abstract To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.
中文摘要 为了提高强化学习（RL）在现实世界的适用性，对抗性强韧强化学习领域研究如何在对抗环境扰动下训练代理。在此设定中，主角代理人在对手的环境扰动下优化策略，形成零和马尔可夫博弈。当对抗性强化学习与基于模型的强化学习结合时，攻击者可以针对已学习的过渡模型而非训练环境。在此基础上，本研究引入了深度强化智能体在推理时的事后稳健化。通过结合学习到的模型与训练后的名义策略，我们的方法实现了稳健的策略改进步骤。目标是在不额外训练神经网络的情况下提升鲁棒性。具体来说，我们在对抗性展开下利用模型预测控制，这些推展通过在有界不确定性集中内的投影梯度下降来近似。此外，这些离线推广是在考虑并缓解分销外问题的同时进行的。在考虑事后推理环境的计算局限性的情况下，算法在受扰动的Gymnasium MuJoCo环境中评估算法时，验证了该方法鲁棒性显著提升。
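The adversarial rollouts in the abstract above are approximated with projected gradient descent inside a bounded uncertainty set. The sketch below shows that generic technique on a toy problem: an adversary searches an L-infinity ball for the perturbation that minimizes a value function. The linear `toy_value` function, the finite-difference gradient, and the choice of an L-infinity ball are assumptions for illustration. The paper's adversary perturbs a learned transition model, which is not reproduced here.

```python
def project_linf(delta, radius):
    """Clip each coordinate so the perturbation stays in the L-inf ball."""
    return [max(-radius, min(radius, d)) for d in delta]


def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def pgd_worst_case(value_fn, dim, radius, steps=50, step_size=0.1):
    """Find a perturbation inside the ball that minimizes value_fn."""
    delta = [0.0] * dim
    for _ in range(steps):
        g = numerical_grad(value_fn, delta)
        # Gradient descent step on the protagonist's value, then project.
        stepped = [d - step_size * gi for d, gi in zip(delta, g)]
        delta = project_linf(stepped, radius)
    return delta


def toy_value(delta):
    """Hypothetical value of a plan under perturbation delta."""
    return 1.0 - 2.0 * delta[0] + 0.5 * delta[1]


worst = pgd_worst_case(toy_value, dim=2, radius=0.3)
```

A robust planner would then pick the action sequence whose value under `worst` is highest, which trades some nominal return for protection against the perturbation.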

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

通过宽基线匹配诱导MLLM中的复杂空间推理

Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.03577
Pdf link: https://arxiv.org/pdf/2606.03577
Abstract Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
中文摘要 宽基线匹配（WBM）需要整合几何理解、视角变化、细粒度感知和遮挡推理，使其成为多模态大型语言模型（MLLM）中空间推理的挑战性测试平台，应用于物理环境中。然而，当前的MLLM缺乏系统化的评估和培训框架。我们介绍了ReasonMatch-Bench，这是一个按视点位移和匹配粒度在室内、室外及以对象为中心场景进行分层的基准测试，并显示当前MLLM在细粒度宽基线对应方面仍存在困难：在90个样本的复杂子集上，人工注释者达到84.0 F1，而最佳基线为37.2。为弥合这一差距，我们构建了一个可扩展的数据生成流水线，能够自动从大型视频-三维语料库（包括RGB-D视频和SfM重建）中提取宽基线视图对，实现多样化且可验证的监督。我们还提出了动态对应强化学习（DCRL），结合图像级视角进展和点级对应课程，通过可验证的奖励提升WBM培训，无需明确的CoT监督。大量实验表明，DCRL显著提升了ReasonMatch-Bench，并可迁移到相关的空间基准测试，同时保持了整体的视觉理解性能，并在多个基准测试上略有提升。

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

利用验证-生成差距：基于置信条件验证的测试时间强化学习

Authors: Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03608
Pdf link: https://arxiv.org/pdf/2606.03608
Abstract Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, whereas optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs that are effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: this https URL.
中文摘要 测试时强化学习已成为一种有前景的范式，能够以完全无标签的方式提升大型语言模型的复杂推理能力。尽管已有研究聚焦Pass@1性能，优化Pass@k在无标签环境中仍然未被充分探索，而在无标签环境下至关重要，该环境衡量持续探索的生成覆盖率。在无标签环境中优化Pass@k非常困难，因为直接应用对RLVR有效的Pass@k优势设计会导致性能不理想。通过深入的实证分析，我们发现了阻碍表现的根本原因：低置信样本的伪标签估计出错概率很高，而高置信样本的候选答案则严重存在多样性崩溃。为克服这些障碍，我们提出了TTRL-CoCoV（带信心条件验证的测试时间强化学习），这是一种新型信心自适应框架，扩展了Pass@k覆盖范围并提升Pass@1性能。基于我们对验证能力通常领先生成能力的关键见解，TTRL-CoCoV采用置信条件机制：对于高置信度样本，它启动验证器并施加增强探索奖励以防止多样性崩溃;对于低置信样本，它将伪标签选择委托给验证者，以过滤错误的伪标签;对于中等置信度样本，则完全绕过验证。大量实验表明，TTRL-CoCoV在6个广泛认可的基准测试中优于最佳竞争方法，Pass@1平均绝对提升为+9.8%，Pass@16平均提升+18.7%，甚至在多个推理基准测试中，与完全监督强化学习方法相比，绝对Pass@1提升高达+5.0%。我们的代码仓库：这个 https URL。
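
A minimal sketch of two quantities the TTRL-CoCoV abstract relies on. The abstract does not say how Pass@k is estimated or how confidence is defined, so everything below is an assumption: `pass_at_k` is the standard unbiased estimator commonly used on code and reasoning benchmarks, `majority_confidence` takes confidence to be the vote share of the most frequent sampled answer, and the thresholds in `route` are hypothetical values chosen only for illustration.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_confidence(answers: list[str]) -> tuple[str, float]:
    """Pseudo-label = most frequent answer; confidence = its vote share (assumed)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def route(confidence: float, low: float = 0.3, high: float = 0.8) -> str:
    """Three-way routing described in the abstract; thresholds are hypothetical."""
    if confidence >= high:
        return "explore"  # high confidence: add an exploration-enhancing reward
    if confidence < low:
        return "verify"   # low confidence: let the verifier pick the pseudo-label
    return "bypass"       # medium confidence: skip verification


label, conf = majority_confidence(["4", "4", "5", "4"])
branch = route(conf)
```

With one correct answer out of four samples, `pass_at_k(4, 1, 1)` is 0.25, while `pass_at_k(4, 1, 4)` is 1.0. That gap is why coverage and single-sample accuracy can move independently.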

Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

Multi$^2$：基于LLM的智能体在交互环境中进行层级多智能体决策

Authors: Sangeun Park, Minhae Kwon
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03698
Pdf link: https://arxiv.org/pdf/2606.03698
Abstract A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi$^2$, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (System 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi$^2$ consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.
中文摘要 大型语言模型（LLM）研究的核心目标之一是构建能够通过与动态环境持续互动来规划、行动和适应的智能体系统。尽管近期基于LLM的智能体展现出令人印象深刻的情境推理能力，但其长期决策仍较为脆弱，常常存在客观漂移，目标和计划在长时间互动中漂移。我们介绍Multi$^2$，一种层级多智能体决策框架，明确将智能体行为分解为互补角色。高级代理（System 1）专注于通过监督微调（SFT）生成上下文感知的子目标，而低级代理（System 2）则通过离线到在线强化学习（RL）在交互环境中执行原子动作。这种分离实现了稳定的长视距控制，减轻了客观漂移，并实现了高效的适应。在多样化的交互环境中，Multi$^2$ 持续优于强代理基线，展现出多回合交互的鲁棒性和协调性提升。除了性能之外，我们还推出了三个分层基准数据集，填补了基于LLM的智能体在训练和评估分层决策方面长期存在的空白。

When are supercapacitors practically feasible in electric vehicles?

超级电容器在电动汽车中什么时候才算可行？

Authors: Yue Wu, Ziqing Xia, Shaokun Li, Heng Li, Shengyu Tao, Zhiwu Huang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.03732
Pdf link: https://arxiv.org/pdf/2606.03732
Abstract While the hybrid energy storage system (HESS) can theoretically mitigate battery degradation in electric vehicles, its practical implementation remains highly limited. To delineate the specific scenarios and application boundaries where supercapacitors remain feasible, this study proposes a multi-dimensional techno-economic feasibility evaluation framework. First, a cross-vehicle sizing method based on dynamic programming is established to quantify physical mass-volume packaging constraints and identify feasible supercapacitor candidates across different vehicle types. Building upon the optimal sizing parameters derived from the battery aging Pareto front, an expert-guided deep reinforcement learning energy management strategy is integrated to yield near-optimal online performance, ensuring a fair life-cycle economic assessment. Finally, a comprehensive feasibility matrix is constructed to systematically evaluate mass, volume, battery lifespan, additional supercapacitor costs, total cost of ownership, future energy storage prices, and the influence of emerging solid-state batteries. Results reveal that city buses remain the most promising vehicle type for HESS due to minimal additional costs and sufficient packaging space. Current mass-volume penalties and limited economic benefits hinder HESS application in passenger vehicles and heavy-duty trucks, respectively. This situation may only improve if supercapacitor prices drop significantly in the future. Beyond vehicle types, the HESS feasibility is governed by load-frequency characteristics. Furthermore, looking toward the 2030+ solid-state battery era, we highlight that integrating increasingly affordable supercapacitors can provide substantial asset protection leverage.
中文摘要 虽然混合储能系统（HESS）理论上可以减轻电动汽车电池的劣化，但其实际应用仍然非常有限。为界定超级电容器仍可行的具体情景和应用边界，本研究提出了一个多维技术经济可行性评估框架。首先，建立了基于动态规划的跨车辆尺寸测定方法，以量化物理质量-体积封装约束，并识别不同车辆类型中可行的超级电容器候选。基于电池老化帕累托前沿的优化尺寸参数，整合了专家指导的深度强化学习能量管理策略，实现接近最佳的在线性能，确保公平的生命周期经济评估。最后，构建了一个全面的可行性矩阵，系统评估质量、体积、电池寿命、额外超级电容器成本、总拥有成本、未来储能价格以及新兴固态电池的影响。结果显示，由于额外成本极低且布置空间充足，城市公交仍是HESS最有前景的车辆类型。目前质量与体积上的代价和有限的经济效益分别阻碍了HESS在乘用车和重型卡车上的应用。如果超级电容器价格未来大幅下降，这种情况可能才会改善。除了车辆类型外，HESS的可行性还受负载频率特性的影响。此外，展望2030+固态电池时代，我们强调，整合日益实惠的超级电容器可以带来显著的资产保护杠杆。

Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

工具感知优化与熵指导，实现高效的代理强化学习

Authors: Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03762
Pdf link: https://arxiv.org/pdf/2606.03762
Abstract Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, while overly conservative tool use limits effective exploration. To address this issue, we propose a unified framework TAO-RL that couples tool-aware trajectory filtering with entropy-guided exploration for efficient policy optimization. Specifically, at the data level, TAO-RL filters rollout trajectories along two criteria: discarding those where all tool invocations fail to execute, and removing those where all rollouts are either correct or incorrect, as both cases yield degenerate advantage estimates that contribute no discriminative learning signal. This joint filtering retains data that are both tool-capable and informative, establishing a high-quality training distribution. At the algorithmic level, we introduce a tool-aware entropy-guided bonus that reshapes the advantage function at post-tool-call tokens, encouraging the policy to explore more diverse reasoning paths at critical decision points. These two components are mutually reinforcing: trajectory filtering establishes a clean and informative training foundation, while entropy-guided exploration drives stronger reasoning behaviors at critical tool-interaction junctures. Extensive experiments on 7 challenging reasoning benchmarks across 3 model scales demonstrate the superiority of TAO-RL over existing methods.
中文摘要 智能强化学习（RL）赋予大型语言模型（LLMs）工具使用能力，显著提升复杂任务的推理能力。然而，整合外部工具往往会破坏培训的稳定性：过度依赖工具会导致输入分布转移，而过于保守的工具使用则限制了有效的探索。为解决这一问题，我们提出了一个统一的TAO-RL框架，将工具感知轨迹过滤与熵引导探索相结合，实现高效的策略优化。具体来说，在数据层面，TAO-RL根据两个标准过滤推广轨迹：丢弃所有工具调用均未执行的轨迹，以及剔除所有推展正确或错误的推销，因为这两种情况都会产生退化优势估计，且不提供判别性学习信号。这种联合过滤保留了既具备工具支持且信息丰富的数据，建立了高质量的训练分布。在算法层面，我们引入了工具感知熵引导的加成，重塑工具调用后代币的优势函数，鼓励策略在关键决策点探索更多样化的推理路径。这两个组成部分相互强化：轨迹过滤建立了清晰且富有信息量的训练基础，而熵引导探索则在关键工具与交互的关键节点推动更强的推理行为。在3个模型尺度上对7个具有挑战性的推理基准进行的广泛实验证明了TAO-RL优于现有方法的优势。
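
The two data-level filters in the TAO-RL abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Rollout` record, the binary reward, the choice to keep rollouts that make no tool call, and the order of the two filters are all hypothetical. `group_advantages` uses a group-normalized advantage (reward minus group mean, divided by group standard deviation) to show why a group whose rollouts are all correct or all incorrect carries no learning signal.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    correct: bool              # final answer verified as correct
    tool_calls_ok: list[bool]  # execution status of each tool invocation


def all_tools_failed(rollout: Rollout) -> bool:
    """True only if the rollout called tools and every call failed to execute."""
    return bool(rollout.tool_calls_ok) and not any(rollout.tool_calls_ok)


def group_advantages(group: list[Rollout]) -> list[float]:
    """Group-normalized advantages with a binary correctness reward (assumed)."""
    rewards = [1.0 if r.correct else 0.0 for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All rollouts share one outcome, so every advantage is zero.
        return [0.0] * len(rewards)
    return [(x - mu) / sigma for x in rewards]


def filter_groups(groups: list[list[Rollout]]) -> list[list[Rollout]]:
    """Drop all-tools-failed rollouts, then drop groups with a single outcome."""
    kept = []
    for group in groups:
        usable = [r for r in group if not all_tools_failed(r)]
        if len({r.correct for r in usable}) == 2:
            kept.append(usable)
    return kept


groups = [
    [Rollout(True, [True]), Rollout(False, [False, False]), Rollout(False, [True])],
    [Rollout(True, [True]), Rollout(True, [])],  # all correct: degenerate group
]
kept = filter_groups(groups)
```

Here the first group survives with two rollouts (the one whose tool calls all failed is discarded), and the second group is removed because its advantages would all be zero.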

Trading Human Curation for Synthetic Augmentation in RLVR

在RLVR中用合成增强替代人类策划

Authors: Akshansh, Leonardo Rosa Rodrigues, Michael Korostelev, Youssef Hassan, Mark E. Whiting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03800
Pdf link: https://arxiv.org/pdf/2606.03800
Abstract The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $\rho_{\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $\rho_{\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across the plausible $c_{\text{human}}/c_{\text{aug}}$ range.
中文摘要 高质量训练任务的供应是智能语言模型上可验证奖励强化学习（RLVR）的核心瓶颈。每个任务都需要沙盒设置、提示和手工编写的奖励函数，只有通过质量标准的任务才会产生有用的训练信号。在这一质量标准下手工策划无法经济地适应有效强化学习训练所需的任务数量，且自动生成任务变体与人类编写任务变体之间的替换率尚未确定。我们研究使用预先指定的、经过门过滤的小型手工作者基础的增强，作为RLVR期间额外人工筛选的替代。我们形式化了增强任务与人工任务之间的成本调整交易率$\rho_{\text{cost}}$，通过对不同增强份额的训练语料库进行受控消融来衡量，并描述增强流程的端到端经济学。用增强内容替代额外的人工任务，可以保留对涵盖代码、指令跟踪、推理和多回合智能函数调用的十个基准套件的综合推广。经过成本调整的交易速率$\rho_{\text{cost}}$在门控合成任务和人类编写的RLVR任务之间，在合理的$c_{\text{human}}/c_{\text{aug}}$范围内保持在$[1.4\times, 11.6\times]$。

Easy-to-Use Shielding for Reinforcement Learning

易用屏蔽用于强化学习

Authors: Stefan Pranger, Bettina Könighofer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.03804
Pdf link: https://arxiv.org/pdf/2606.03804
Abstract Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. Shielding is one such technique that assumes domain knowledge in the form of an environment model to decide upon action safety. Although well-established, shielding has seen limited adoption in RL due to the lack of accessible end-to-end infrastructure connecting formal shield synthesis with standard RL frameworks. Applying shielding typically requires expertise in formal methods and substantial engineering effort, keeping it outside the typical RL workflow. We address this by extending our shield synthesis tool Tempest into a practical backend for safe RL. Our core contribution is tempestpy, a Python library that integrates Tempest-based shield synthesis directly into the Gymnasium API, allowing shields to be synthesized and deployed within existing RL pipelines. This lowers the barrier to entry for shielding and turns formal safe-exploration methods into a usable component for RL practitioners. We also extend Tempest's algorithmic support to compute sound shields for stochastic multiplayer games, preserving formal safety guarantees. We demonstrate the resulting workflow end to end and evaluate shielded and unshielded RL across multiple environments. To facilitate modeling, we provide symbolic models for MiniGrid and introduce MiniGridSafe, a collection of playground environments designed to make shielding easily accessible and experimentally transparent. MiniGridSafe extends MiniGrid with safety-oriented scenarios featuring probabilistic transitions and additional agents, enabling the study of challenging safety aspects in a simple and intuitive setting.
中文摘要 安全探索是强化学习（RL）中的一个关键挑战，旨在防止智能体在探索环境中做出有害决策。屏蔽是一种假设以环境模型形式具备领域知识来决定动作安全性的技术。尽管屏蔽已经很成熟，但由于缺乏连接正式屏蔽综合与标准强化学习框架的端到端基础设施，其在强化学习中的应用有限。应用屏蔽通常需要具备形式方法的专业知识和大量工程投入，使其不属于典型的强化学习工作流程。我们通过将盾牌合成工具Tempest扩展为实用的后端，以实现安全的强化学习。我们的核心贡献是 tempestpy，这是一个 Python 库，将基于 Tempest 的盾牌合成直接集成到 Gymnasium API，使盾牌能够被合成并部署在现有的强化学习流水线中。这降低了屏蔽的门槛，使正式的安全探索方法成为强化学习从业者可用的组成部分。我们还扩展了Tempest的算法支持，用于计算随机多人博弈的可靠（sound）屏蔽，同时保持形式安全保障。我们展示了由此产生的工作流程，并评估多个环境中的屏蔽和非屏蔽强化学习。为了促进建模，我们为MiniGrid提供了符号模型，并推出了MiniGridSafe，这是一组旨在使屏蔽易于访问且实验性透明的游乐场环境集合。MiniGridSafe通过包含概率转变和额外代理的以安全为导向的场景扩展MiniGrid，使得在简单直观的环境中研究具有挑战性的安全问题成为可能。
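
The general idea of shielding, as the abstract describes it, is to consult an environment model before each action and block choices the model marks unsafe. The sketch below illustrates that idea only; it does not use tempestpy or Gymnasium, whose interfaces are not described in the abstract. The one-dimensional corridor, the `HAZARD` cell, and every function name are hypothetical, and the model is deterministic, whereas the paper targets probabilistic models.

```python
import random

HAZARD = 3            # hypothetical unsafe cell in a corridor with cells 0..5
ACTIONS = (-1, 0, 1)  # move left, stay, move right


def next_state(state: int, action: int) -> int:
    """Deterministic toy environment model, clipped to the corridor."""
    return max(0, min(5, state + action))


def safe_actions(state: int) -> list[int]:
    """Actions whose modelled successor is not the hazard cell."""
    return [a for a in ACTIONS if next_state(state, a) != HAZARD]


def shielded(state: int, proposed: int) -> int:
    """Pass a safe proposal through; otherwise substitute a safe action."""
    allowed = safe_actions(state)
    return proposed if proposed in allowed else allowed[0]


def run_episode(seed: int = 0, steps: int = 200) -> list[int]:
    """Random exploration with every proposed action filtered by the shield."""
    rng = random.Random(seed)
    state, visited = 0, [0]
    for _ in range(steps):
        proposed = rng.choice(ACTIONS)
        state = next_state(state, shielded(state, proposed))
        visited.append(state)
    return visited


visited = run_episode()
```

Because the shield overrides only unsafe proposals, the agent still explores freely among safe actions, and in this toy model the hazard cell is never entered regardless of the random seed.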

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

EvoDS：具备技能学习和上下文管理的自我演进自主数据科学代理

Authors: Zherui Yang, Fan Liu, Yansong Ning, Hao Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03841
Pdf link: https://arxiv.org/pdf/2606.03841
Abstract Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at this https URL.
中文摘要 大型语言模型（LLM）代理的最新进展推动了自动化数据科学的有望进展。然而，现有方法仍受限于静态的动作集和缺乏原则性的长期上下文管理，阻碍了它们在任务间积累可重用经验和在多阶段迭代数据科学流程中可靠运行的能力。为应对这些挑战，我们介绍了EvoDS，一款自我进化的自主数据科学代理，通过代理强化学习学习扩展技能并自适应管理长期情境。具体来说，EvoDS引入了两项关键策略：（1）自主技能习得（ASA）机制，使智能体能够综合、验证并重用可执行技能;以及（2）自适应上下文压缩（ACC）策略，将上下文管理视为学习式控制问题，而非被动截断。这些策略通过两阶段多智能体训练方案协调，使EvoDS能够自主地随着时间提升。理论上，我们证明了EvoDS的分层设计减少了工具选择误差，其优化目标符合信息瓶颈原则，确保了上下文的高效使用。从实证数据来看，EvoDS在四个多样化基准测试中平均优于最先进的开源数据科学代理28.9%，同时消除了代币外的失败。我们的代码和数据可在此 https URL 访问。

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

熵还不够：通过视觉锚定的代币选择解锁视觉推理的有效强化学习

Authors: Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03937
Pdf link: https://arxiv.org/pdf/2606.03937
Abstract While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.
中文摘要 虽然代币级熵在仅文本可验证奖励强化学习（RLVR）中通常被认为对学分分配有效，但该机制在视觉推理中是否仍然成立尚不清楚。我们的对控研究表明，由于缺少具有自然低熵的视觉敏感标记，这一机制在视觉推理中会崩溃。尽管现有的多模态强化学习方法越来越重视视觉感知的重要性，但它们难以满足将精确感知基础与语义推理交织的固有需求，要么缺乏系统性的视觉测量，要么忽视了符号熵主要驱动语义探索。为此，我们引入了VEPO（策略优化中的视觉-熵代币选择），这是一个有效的强化学习框架，通过原则乘法耦合明确整合视觉敏感性与代币熵，其中VEPO将梯度信用转向既直观且信息丰富的代币。大量实验显示VEPO表现领先，在7B尺度显著优于仅熵基线2.28个百分点，3B尺度高3.15个百分点。消融进一步证明了我们方法的合理性。
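
A small sketch of why multiplying token entropy by a visual-sensitivity score selects different tokens than entropy alone. The abstract states only that VEPO couples the two multiplicatively, so the rest is assumed for illustration: the per-token sensitivity values are taken as given (how VEPO measures them is not described here), and the top-fraction selection rule and `keep_fraction` parameter are hypothetical.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_fraction_mask(scores: list[float], keep_fraction: float) -> list[bool]:
    """Boolean mask marking the highest-scoring fraction of tokens."""
    k = max(1, round(keep_fraction * len(scores)))
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(scores))]


def coupled_scores(entropies: list[float], sensitivity: list[float]) -> list[float]:
    """Multiplicative coupling: high only when both factors are high."""
    return [h * s for h, s in zip(entropies, sensitivity)]


# Token 0: low entropy but vision-sensitive; token 1: high entropy, text-driven.
entropies = [0.1, 2.0, 1.0, 0.05]
sensitivity = [0.9, 0.02, 0.8, 0.1]  # assumed to be precomputed per token

entropy_only = top_fraction_mask(entropies, keep_fraction=0.5)
coupled = top_fraction_mask(coupled_scores(entropies, sensitivity), keep_fraction=0.5)
```

In this toy input the entropy-only mask keeps tokens 1 and 2, while the coupled mask keeps tokens 0 and 2, so the low-entropy vision-sensitive token is no longer dropped.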

Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

偏好校准的机器人操作人机循环强化学习

Authors: Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.03949
Pdf link: https://arxiv.org/pdf/2606.03949
Abstract Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q-values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at this https URL.
中文摘要 人工在环中强化学习（HIL-RL）通过在线人类干预提升了真实机器人操作中的样本效率。然而，成功的轨迹可能包含偏离期望任务执行路径的次优行为，迫使人类干预。现有的HIL-RL方法通常将一致的信用分配原则应用于所有转换，均匀地将折现后的终端奖励通过次优分段传播，忽略每个转换对任务成功的实际贡献。这高估了批判者学习的Q值，间接地误导了演员更新，使行为模式变得次优。为此，我们提出了PACT，这是一个偏好校准的行为者-批评者培训框架，利用干预引发的隐性偏好信号，对识别出的次优片段进行信用重新分配，同时直接指导政策培训以实现无偏的批评者-行为者学习。具体来说，我们首先设计了一个进展模型，从人工演示中学习，并识别信用修正的次优部分。然后，从干预状态的人工行为和重抽样的政策行动中，我们构建偏好对，定义一种反事实优势，惩罚已识别出的次优部分的贝尔曼目标，从而实现方向性信用校准。此外，我们直接将政策与有界均空间内的人为纠正措施对齐，提供了超越批评者引导更新的额外信号。在五个真实机器人操作任务中，PACT将平均成功率提高了24.5%，收敛速度提升了1.3倍，从而提升了强化学习的样本效率和性能。代码可在此 https URL 访问。

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

利用奖励不确定性诱导强化学习中的多样化行为

Authors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03962
Pdf link: https://arxiv.org/pdf/2606.03962
Abstract Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
中文摘要 经典强化学习（RL）通常寻求一种确定性策略，以最大化标量奖励的期望总和。然而，现代应用如语言模型微调或科学发现要求多样性。现有的补救措施如熵正则化或多样性加成，往往需要脆弱的权衡，牺牲性能以换取随机性，或依赖启发式指标，从而导致策略排序错位。我们认为多样性更自然地被理解为对奖励不确定性的理性反应。当奖励函数不完全确定时（如偏好模糊或奖励模型不完美），仅承诺单一行动可能不够优。在此基础上，我们提出了对强化学习目标的根本性重述，通过用奖励函数分布取代标量奖励，并在行动集合上应用非线性目标。结果是一个框架，在这种框架中，校准的行为多样性自然出现，通过奖励函数分布保持可控，并且在不牺牲预期奖励的情况下实现。聚焦于上下文赌博机（contextual bandit）设定，我们为该目标推导出一个原则性的梯度估计，并证明我们的表述自然地推广了原始策略梯度和较新发展的行动集方法。我们的实证结果表明，该框架为复杂强化学习任务提供了一种稳健且理论基础的替代方案，尤其是在传统问题表述无法诱导出预期的智能体行为广度时。
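
A worked toy example of the abstract's central claim that diversity can be the rational response to reward uncertainty. The abstract does not specify its non-linear set objective, so this sketch assumes one simple instance: the expected best reward within a set of actions, averaged over a distribution of reward functions. The two actions, the two reward hypotheses, and their probabilities are hypothetical.

```python
# Each hypothesis is (probability, reward table). Under the first, only action
# "A" pays off; under the second, only action "B" does.
REWARD_HYPOTHESES = [
    (0.5, {"A": 1.0, "B": 0.0}),
    (0.5, {"A": 0.0, "B": 1.0}),
]


def expected_reward(action: str) -> float:
    """Classical objective: mean reward of one action over the hypotheses."""
    return sum(p * r[action] for p, r in REWARD_HYPOTHESES)


def expected_best_of_set(actions: list[str]) -> float:
    """Assumed set objective: expected maximum reward within the action set."""
    return sum(p * max(r[a] for a in actions) for p, r in REWARD_HYPOTHESES)


single = expected_reward("A")                  # same value as for "B"
repeated = expected_best_of_set(["A", "A"])    # committing to one action twice
diverse = expected_best_of_set(["A", "B"])     # covering both hypotheses
```

Both actions have the same expected reward of 0.5, so the classical objective is indifferent between them. The set objective scores the repeated set at 0.5 and the diverse set at 1.0, so covering both hypotheses is preferred without lowering any single action's expected reward.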

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

视觉条件无人机导航的自我精炼智能强化学习

Authors: Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03963
Pdf link: https://arxiv.org/pdf/2606.03963
Abstract Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior by 71% compared with the initial rewards. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.
中文摘要 深度强化学习展现出强大的潜力，使自主机器人能够学习复杂的导航任务。然而，其实际应用仍高度依赖于人工设计的奖励函数和反复的人工微调，这耗时且无法保证任务的高度成功率。本文介绍了AgenticRL，一种代理引导强化学习框架，能够提升无人机（UAV）导航任务中奖励设计、策略细化和实际部署的自主性。AgenticRL使用多模态生成预训练Transformer（GPT）智能体来解释任务信息和视觉场景观察，生成任务特定的奖励函数，使用近端策略优化（PPO）算法训练策略，然后通过诊断包评估训练好的策略以生成反馈，作为批评者。基于这些反馈，代理识别失败模式，并在闭环自我改进过程中细化奖励函数。为了在推理过程中进一步利用多模态GPT代理，AgenticRL利用真实世界的图像和自然语言任务信息，自动识别当前场景并选择合适的训练策略执行。该框架评估多个导航任务，包括门穿过、障碍物规避、着陆时穿越墙壁障碍、轨迹跟踪和运动行为学习。实验结果显示，闭环优化过程相比初始奖励改善了71%的策略行为。我们还展示了所提框架的模拟到现实转移，实现了91%的真实世界成功率和94%的模拟真实准确率。

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

智能思维链引导，实现高效且可控的大型语言模型推理

Authors: Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03965
Pdf link: https://arxiv.org/pdf/2606.03965
Abstract Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at this https URL.
中文摘要 大型语言模型通过扩展的思维链推理提升最终答案的准确性，但通常消耗代币效率低，且推理时间控制有限。现有的高效推理方法通过缩短、提前停止或压缩痕迹来控制思维长度，使模型的思维方式隐含。本文提出了智能思维链引导（ACTS），将推理引导表述为一种马尔可夫决策过程，控制代理在推理过程中自适应地引导冻结推理者。在每一步，控制者观察推理追踪和剩余的思维预算，然后发出一个引导动作，包括推理策略和引导短语，启动下一步推理步骤。这使预算感知的策略控制能够高效地进行推理，同时保持推理器的生成连续性。我们通过多预算增强从构建的合成引导轨迹初始化控制代理，并通过预算条件奖励塑形的强化学习进一步优化。跨多个基准测试的实验表明，ACTS能够匹配全思考性能，同时节省大量代币成本，并实现不同推理者和任务之间可控的准确性与效率权衡。代码可在该 https URL 访问。

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

QUBRIC：联合设计强化学习超越可验证奖励的查询和评分标准

Authors: Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03968
Pdf link: https://arxiv.org/pdf/2606.03968
Abstract Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
中文摘要 基于评分标准的强化学习是将强化学习扩展到可验证奖励之外的有前景路径，但现有方法在优化评分标准的同时，将查询分布视为固定。我们发现了一个结构性瓶颈：评分标准的质量受查询结构的限制。开放式查询会给出模糊的评分标准;天真地缩小范围会引入任何模型无法验证的虚假引用，导致所有响应失败，训练也接收不到奖励信号。我们介绍QUBRIC，一个共同设计查询和评分标准的框架。教师提出的关键点将开放式问题的重写转化为基于情境、有价值的问题。对比性评分标准生成将教师与政策的差距转化为查询层面的标准，可学习性过滤仅保留信息性的查询-评分标准对以供GRPO培训。QUBRIC在ArenaHard上比SFT基线提升了+5.5分。仅基于指令跟随数据训练，它进一步转向涵盖法律、道德和叙事推理的三个未定基准（平均+6.3分），改进集中在推理相关维度。这些结果证明，共同设计查询和评分标准可以使基于评分标准的强化学习成为RLVR在严格可验证任务之外的实用补充。

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

语言模型需要睡眠：学习自我修改和巩固记忆

Authors: Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.03979
Pdf link: https://arxiv.org/pdf/2606.03979
Abstract The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller-self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.
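Abstract (translated from Chinese) The past few decades have seen significant progress in the design of machine learning algorithms, from early research on task-specific shallow models to more general deep Large Language Models (LLMs). Although they show promising results on tasks that require instant prediction or in-context learning, existing models still lack the ability to learn continually and to effectively transfer their temporal in-context knowledge into long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that lets models learn continually, distill fragile short-term memories into stable long-term knowledge, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, which distills the memories of a smaller self into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase in which the model uses reinforcement learning to generate a curriculum of synthetic data, rehearsing new knowledge and refining existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.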
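
To make the direction of the "upward" distillation in Knowledge Seeding concrete, here is a minimal sketch of a plain soft-label distillation loss, KL(teacher || student), in which the smaller model plays the teacher and the larger model the student. This is the textbook forward-KL form, swapped in for illustration only; it is not the Generalized Distillation process the abstract mentions, and the logits and names are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) between softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits: the small model holds the memory,
# the freshly initialized large model is close to uniform.
small_model_logits = [2.0, 0.5, -1.0]
large_model_logits = [0.1, 0.0, -0.1]

loss_before = distillation_loss(small_model_logits, large_model_logits)
loss_matched = distillation_loss(small_model_logits, small_model_logits)
```

The loss is positive while the large model disagrees with the small one and falls to zero once their distributions match, which is the sense in which knowledge is preserved while capacity grows.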

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Translated title (from Chinese): Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.03980
Pdf link: https://arxiv.org/pdf/2606.03980
Abstract Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at this https URL.
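Abstract (translated from Chinese) Reward models (RMs) provide critical feedback signals for LLM post-training, especially in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, and a unified mechanism for integrating all of this evidence has not yet been explored. To this end, we propose the Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface for orchestrating heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach lets the reward model move beyond static evaluation and ensures consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, show that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at this https URL.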
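
The idea of selecting and aggregating heterogeneous evidence per input, then using the result for best-of-N selection, can be sketched as follows. This is an assumption-laden toy: the `Task` fields, the substring-based checks, and the unweighted average are all hypothetical choices made for illustration, whereas Skill-RM itself is described as an agentic skill rather than fixed rules like these.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """A prompt plus whatever evaluation resources happen to exist for it."""
    prompt: str
    verifier: Optional[Callable[[str], bool]] = None  # rule-based check
    reference: Optional[str] = None                   # ground-truth answer
    checklist: List[str] = field(default_factory=list)  # procedural items


def collect_evidence(task: Task, response: str) -> Dict[str, float]:
    """Score the response with every evidence source available for this task."""
    evidence: Dict[str, float] = {}
    text = response.lower()
    if task.verifier is not None:
        evidence["verifier"] = 1.0 if task.verifier(response) else 0.0
    if task.reference is not None:
        evidence["reference"] = 1.0 if task.reference.strip().lower() in text else 0.0
    if task.checklist:
        hits = sum(item.lower() in text for item in task.checklist)
        evidence["checklist"] = hits / len(task.checklist)
    return evidence


def reward(task: Task, response: str) -> float:
    """Aggregate the available evidence with a plain average (an assumption)."""
    evidence = collect_evidence(task, response)
    if not evidence:
        return 0.0
    return sum(evidence.values()) / len(evidence)


def best_of_n(task: Task, candidates: List[str]) -> str:
    """Best-of-N selection: return the candidate with the highest reward."""
    return max(candidates, key=lambda c: reward(task, c))


task = Task(
    prompt="What is 2 + 3? Show your work.",
    verifier=lambda r: r.strip().endswith("5"),
    reference="5",
    checklist=["2 + 3", "="],
)
candidates = ["The answer is 6", "2 + 3 = 5", "5"]
chosen = best_of_n(task, candidates)
```

Here the bare answer "5" passes the verifier and the reference check but misses both checklist items, so the worked response "2 + 3 = 5" is chosen. A task with no evaluation resources at all falls back to a reward of 0.0.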

Keyword: diffusion policy

There are no results