Arxiv Papers of Today

Generated: 2026-06-30 18:59:55 (UTC+8); arXiv release time: 2026-06-30 20:00 EDT (2026-07-01 08:00 UTC+8)

65 relevant papers today

Keyword: reinforcement learning

Multi-Agent DRL for QoS and Energy Optimization in RIS-Enabled Open-RAN Industrial 6G TN/NTN Networks

多智能体DRL用于RIS支持的开放式RAN工业6G TN/NTN网络中的QoS和能源优化

Authors: Marwan Dhuheir, Thang X. Vu, Symeon Chatzinotas
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28339
Pdf link: https://arxiv.org/pdf/2606.28339
Abstract Industrial 6G networks require ultra-reliable, low-latency, and energy-efficient connectivity in dynamic and blockage-prone environments, where conventional terrestrial deployments often fail to ensure stable coverage. Hence, in this paper, we propose a RIS-enabled Open-RAN framework for integrated terrestrial/non-terrestrial (TN/NTN) industrial 6G networks, in which UAV-mounted reconfigurable intelligent surfaces (RISs) cooperate with ground radio units and a high-altitude platform (HAP) to enhance connectivity for dense industrial IoT devices. Owing to the high dimensionality and strong coupling among decision variables, conventional optimization techniques become computationally intractable. To overcome this limitation, the joint optimization problem of data rate, latency, and energy consumption is formulated as a decentralized partially observable Markov decision process (Dec-POMDP) and solved using a multi-agent deep reinforcement learning framework. Simulation results show improvements of up to 75% in data rate, a 25% reduction in latency, and 16% energy savings compared with state-of-the-art learning-based and non-RIS baselines, demonstrating the effectiveness of RIS-assisted Open-RAN intelligence for industrial 6G networks.
中文摘要 工业6G网络需要在动态且易受阻挡的环境中提供超可靠、低延迟且节能的连接，而传统地面部署往往无法确保稳定覆盖。因此，本文提出了一个基于RIS的开放RAN框架，用于集成地面/非地面（TN/NTN）工业6G网络，其中无人机安装的可重构智能表面（RIS）与地面无线电单元和高空平台（HAP）协同，增强密集工业物联网设备的连接性。由于高维度和决策变量间强烈耦合，传统优化技术在计算上变得难以处理。为克服这一限制，数据速率、延迟和能耗的联合优化问题被表述为去中心化的部分可观测马尔可夫决策过程（Dec-POMDP），并通过多智能体深度强化学习框架求解。模拟结果显示，与最先进的基于学习和非RIS基线相比，数据速率提升高达75%，延迟减少25%，节能16%，证明RIS辅助的开放RAN智能在工业6G网络中的有效性。

RADIANT-PET: Reasoning-Augmented PET/CT Lesion Segmentation with Large Language Models and Reinforcement Learning

RADIANT-PET：基于大型语言模型和强化学习的推理增强PET/CT病灶分割

Authors: Jiasheng Wang, Tanun Jitwatcharakomol, Piyawadee Jongpradubgiat, Simeng Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28392
Pdf link: https://arxiv.org/pdf/2606.28392
Abstract Accurate lesion segmentation in PET/CT is critical for oncology, yet remains challenging because physiologic tracer uptake and artifacts can mimic malignant signal. We present RADIANT-PET, a reasoning-augmented framework that couples a high-sensitivity voxel-level segmentation model with lesion-level large language model (LLM) adjudication. Candidate uptake regions are generated with a deliberately permissive segmentation stage, then converted into structured textual descriptions that summarize uptake intensity, morphology, and regional and global anatomical context. An LLM classifies each candidate as true lesion vs. false positive, optionally leveraging the radiology report as additional clinical context. To strengthen lesion-level reasoning, we further optimize a local LLM via reinforcement learning using Group Relative Policy Optimization, rewarding correct lesion classification and anatomically concordant site assignment. Across AutoPET and an OSU test cohort, RADIANT-PET consistently outperforms strong image-only baselines, with the largest improvements observed when radiology reports are provided. Overall, these results demonstrate that LLM-based lesion-level reasoning adds a novel reasoning layer beyond conventional segmentation, suppressing physiologic false positives and aligning voxel-level predictions with clinical interpretation. The project repository is available at: this https URL.
中文摘要 PET/CT中准确的病灶分割对肿瘤学至关重要，但由于生理示踪剂摄取和伪影可能模拟恶性信号，这仍然具有挑战性。我们介绍了RADIANT-PET，一种推理增强框架，将高灵敏度体素级别的分割模型与病灶级大型语言模型（LLM）判定结合起来。候选摄取区域通过有意允许的分段阶段生成，然后转化为结构化文本描述，总结吸收强度、形态学以及区域和全球解剖背景。LLM将每个候选人分为真病变与假阳性，并可选择性地利用放射科报告作为额外的临床背景。为强化病灶层级推理，我们通过强化学习利用群体相对策略优化进一步优化局部LLM，奖励病灶分类的正确和解剖学匹配的部位分配。在AutoPET和俄亥俄州立大学的一组测试队列中，RADIANT-PET持续优于强的仅图像基线，且在提供放射学报告时，改善幅度最大。总体而言，这些结果表明基于LLM的病灶级推理在传统分割之外增加了新的推理层，抑制了生理误报，并将体素级预测与临床解释相匹配。项目仓库可在以下网址访问：https URL。

Reinforcement Learning for Software Vulnerability Analysis: A Systematic Review with Emphasis on C/C++ Source Code and Static Analysis

软件漏洞分析的强化学习：系统综述，重点为C/C++源代码和静态分析

Authors: Bruno Caro-Vásquez, Carola Figueroa-Flores, Gastón Marquez
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28403
Pdf link: https://arxiv.org/pdf/2606.28403
Abstract Vulnerability detection in C/C++ software remains a major security challenge due to code complexity, manual memory management, and the limitations of traditional static analysis. Reinforcement Learning (RL) has emerged as a promising approach, particularly for fuzzing, test generation, program exploration, and, more recently, vulnerability detection and localization. Following PRISMA 2020 guidelines, this work reviews RL techniques for software vulnerability analysis, focusing on C/C++ source code and static analysis. We identified 21 primary studies published between 2015 and 2026 from major scientific databases and complementary searches. We analyze the addressed tasks, algorithms, state-action-reward-environment formulations, code representations, datasets, and evaluation metrics. Results show that 15 studies focus on fuzzing and guided exploration, only 3 on direct vulnerability detection, and just 1 on statement-level localization. Moreover, statically extracted structural representations such as Control Flow Graphs (CFGs) and Abstract Syntax Trees (ASTs) are rarely used as agent states, and benchmarks lack comparability. We propose a task- and formulation-oriented taxonomy and identify a key research gap: the absence of RL agents that use source-code CFGs as states to detect and localize vulnerable nodes.
中文摘要 由于代码复杂性、手动内存管理以及传统静态分析的局限性，C/C++ 软件中的漏洞检测仍是重大安全挑战。强化学习（RL）已成为一种有前景的方法，尤其适用于模糊检测、测试生成、程序探索，以及最近的漏洞检测和定位。遵循PRISMA 2020指南，本研究回顾了软件漏洞分析的强化学习技术，重点关注C/C++源代码和静态分析。我们识别了2015年至2026年间发表于主要科学数据库和补充检索的21项主要研究。我们分析了所针对的任务、算法、状态-动作-奖励环境表述、代码表示、数据集和评估指标。结果显示，15项研究聚焦于模糊和引导探索，只有3项专注于直接脆弱性检测，只有1项专注于语句级本地化。此外，静态提取的结构表示，如控制流图（CFG）和抽象语法树（AST）很少被用作代理状态，基准测试也缺乏可比性。我们提出了一个任务和表述导向的分类法，并指出一个关键研究空白：缺乏使用源代码CFG作为状态来检测和定位易受攻击节点的强化学习代理。

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

位置：强化学习的研究人员需要区分求解模拟器和使用模拟器作为代理

Authors: Matthew Vandergrift, Esraa Elelimy, Martha White
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28433
Pdf link: https://arxiv.org/pdf/2606.28433
Abstract One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of achieving high performance in the simulator can mutate into focusing exclusively on solving the simulator. To achieve high scores, researchers may adopt solutions exclusively meant for solving simulators, rather than learning while the agent is deployed outside a simulator. Solving simulators is also worthy of investigation, but it is a fundamentally different RL research question. In this paper, we argue that RL researchers need to distinguish between two use cases of simulators: solving simulators and using simulators as a proxy for learning in deployment. We first discuss how these two use-cases are importantly different, in terms of constraints on how the agent can use the simulator, which algorithms are appropriate, and which evaluation metrics are appropriate. We then highlight several issues and misleading conclusions that can occur by not making the distinction between these two settings clear, supported with examples and simple experiments. This work is a call to the community to begin clearly distinguishing how they are using simulators in their work, hopefully sparking further discussion on which empirical practices work best in each setting.
中文摘要 强化学习（RL）研究的一个目标是理解通用的顺序决策，利用基准模拟器作为部署环境中学习的代理。然而，在运行实验时，追求模拟器高性能的目标可能会转变为专注于求解模拟器本身。为了获得高分，研究人员可能会采用专门用于求解模拟器的解法，而不是在代理部署于模拟器外时学习。解模拟器同样值得研究，但这是完全不同的强化学习研究问题。本文论证了强化学习研究者需要区分模拟器的两种用例：求解模拟器和将模拟器作为部署学习的代理。我们首先讨论了这两种用例在对智能体如何使用模拟器的约束、哪些算法适用以及哪些评估指标方面有显著差异。随后，我们通过示例和简单实验，强调了若干问题和误导性结论，这些问题可能因未能明确区分这两个环境而产生。这项工作呼吁社区开始明确区分他们在工作中使用模拟器的方式，希望能激发更多关于在每种环境中最有效的实证实践的讨论。

Dockerless: Environment-Free Program Verifier for Coding Agents

Dockerless：编程代理的无环境程序验证器

Authors: Wenhao Zeng, Yuling Shi, Xiaodong Gu, Chao Hu, Chaofan Wang, Yuhao Cui, Hongting Zhou, Mengnan Qi, Jianqiao Wangni, Zhaojian Yu, Shuzheng Gao, Kai Cai, Shilin He
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28436
Pdf link: https://arxiv.org/pdf/2606.28436
Abstract Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.
中文摘要 程序验证器在训练编码代理中发挥核心作用，包括选择监督微调（SFT）轨迹以及为强化学习（RL）提供奖励。标准的基于执行的验证需要在每个仓库环境中运行单元测试，如 Docker 镜像，导致大量环境设置成本。我们提出Dockerless，一种环境无约束的智能补丁验证器，可以评估生成的代码补丁而不执行它们。Dockerless 不仅仅将候选补丁与引用匹配，而是通过代理仓库探索收集的证据来判断补丁的正确性。在验证器评估基准测试中，Dockerless 比最强的开源验证者高出 14.3 AUC 点。将Dockerless作为SFT轨迹过滤器和强化学习奖励，实现了完全无环境的培训后流程。最终模型在 SWE-bench Verified、Multilingual 和 Pro 中分别达到 62.0%、50.0% 和 35.2% 的解决率。它比Qwen3.5-9B基线高出2.4点、8.7分和2.9分，与基于环境的后期训练相当。

R$^2$-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search

R$^2$-搜索器：校准代理搜索的检索与推理边界

Authors: Sheng Zhang, Junyi Li, Wenlin Zhang, Xiaowei Qian, Yichao Wang, Yingyi Zhang, Maolin Wang, Yong Liu, Xiangyu Zhao
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.28566
Pdf link: https://arxiv.org/pdf/2606.28566
Abstract Recent search agents for multi-hop reasoning often fail by either retrieving incomplete evidence or reasoning over irrelevant portions of the retrieved content, leading to a retrieval-reasoning boundary shift. We propose R$^2$-Searcher, a novel framework that explicitly explores and calibrates the retrieval and reasoning boundaries via fine-grained, query-token-guided evidence modeling and post-retrieval reflection. Specifically, R$^2$-Searcher: (1) constructs fine-grained reasoning contexts by extracting precise facts from retrieved content based on query token semantics (e.g., subjects, actions, temporal markers, and degree modifiers), thereby guiding the attention of search agent; (2) introduces a retrieval reflection mechanism that evaluates and corrects boundary deviations after each retrieval step, guiding the generation of improved queries grounded in the extracted reasoning contexts; and (3) employs an end-to-end reasoning-reflection-guided reinforcement learning algorithm, R$^2$PO, which jointly optimizes both boundaries through a tree-based exploration of reasoning regions and reflections. Our method significantly enhances the quality of both retrieval and reasoning, establishing an iterative loop where retrieval and reasoning mutually enhance each other. Extensive experiments on seven complex multi-hop QA benchmarks demonstrate that R$^2$-Searcher significantly outperforms state-of-the-art agentic search methods in answer accuracy and retrieval-reasoning quality. Ablation studies further confirm the critical role of retrieval-reasoning boundary calibration.
中文摘要 近期用于多跳推理的搜索代理常常因检索不完整证据或对检索内容中无关部分进行推理而失败，导致检索推理边界发生转移。我们提出了R$^2$-Searcher，这是一种新颖框架，通过细粒度的查询标记引导证据建模和检索后反思，明确探索并校准检索和推理边界。具体来说，R$^2$-Searcher：（1）通过根据查询标记语义（如主语、动作、时间标记和度数修饰符）从检索内容中提取精确事实，构建细粒度推理上下文，从而引导搜索代理的注意力;（2）引入检索反射机制，在每次检索步骤后评估并修正边界偏差，指导基于提取的推理上下文生成改进查询;（3）采用端到端推理-反射引导强化学习算法R$^2$PO，通过基于树的推理区域和反射探索共同优化这两个边界。我们的方法显著提升了检索和推理的质量，建立了一个相互提升的迭代循环。对七个复杂多跳质询基准测试的广泛实验表明，R$^2$-Searcher在答案准确性和检索推理质量方面显著优于最先进的代理搜索方法。消融研究进一步证实了反演推理边界校准的关键作用。

Neuromorphic Energy-Aware Learning for Adaptive Deep Brain Stimulation

神经形态能量感知学习用于适应性深脑刺激

Authors: Binh Nguyen, Colleen Josephson, Mircea Teodorescu, Gert Cauwenberghs, Jason Eshraghian
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.28600
Pdf link: https://arxiv.org/pdf/2606.28600
Abstract Neuromorphic and edge computing research has focused on reducing the inference cost of neural network controllers, yet in physical closed-loop systems the actuator can rival or exceed an efficient controller in energy. An efficient controller is therefore necessary but not sufficient, because the actuator becomes the cost worth reducing once inference no longer dominates it. Here, we introduce energy-aware learning, an approach that incorporates actuator energy directly into the reinforcement learning reward, and demonstrate it in closed-loop deep brain stimulation (DBS) for Parkinson's disease. A deep spiking Q-network, trained in a biophysical cortico-basal ganglia-thalamic circuit model, learns to suppress pathological alpha-beta oscillations by 45.2% while reducing stimulation charge by 80.0% relative to continuous DBS. Sparsity-constrained knowledge distillation compresses the policy onto the SynSense XyloAudio 3 neuromorphic processor at 0.52 mW inference power, yielding 28.1x lower energy per inference than an equivalent artificial neural network on conventional edge hardware. By co-optimizing stimulation energy and inference efficiency, the framework addresses both major power demands in implantable neuromodulation.
中文摘要 神经形态和边缘计算的研究一直致力于降低神经网络控制器的推理成本，但在物理闭环系统中，执行器在能量上可以与高效控制器媲美甚至超过。因此，高效的控制器是必要的，但还不够，因为执行器在推理不再主导时，就成为值得降低的成本。在这里，我们介绍了能量感知学习，这是一种将执行器能量直接纳入强化学习奖励的方法，并在帕金森病的闭环深脑刺激（DBS）中进行了演示。经过生物物理皮质-基底核-丘脑回路模型训练的深度尖峰Q网络，能够相较于连续DBS将病理性的α-β振荡抑制45.2%，同时将刺激电荷降低80.0%。稀疏约束知识蒸馏将策略压缩到SynSense XyloAudio 3神经形态处理器上，推理功率为0.52毫瓦，每次推理的能量比传统边缘硬件上的同等人工神经网络低28.1倍。通过对刺激能量和推理效率的协同优化，该框架满足了植入式神经调节中两大主要的功率需求。
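
The idea of putting actuator energy directly into the reward can be illustrated with a minimal sketch. The abstract does not give the actual reward, so the function name, the normalization, and the `energy_weight` trade-off parameter below are all assumptions made for illustration only.

```python
def energy_aware_reward(band_power, baseline_band_power,
                        stim_charge, max_charge, energy_weight=0.5):
    """Hypothetical reward: oscillation suppression minus weighted actuator energy.

    band_power          -- pathological oscillation power under the current policy
    baseline_band_power -- the same quantity with no stimulation
    stim_charge         -- charge delivered by the stimulator in this step
    max_charge          -- charge delivered by continuous stimulation
    energy_weight       -- assumed trade-off coefficient (not from the paper)
    """
    # Fraction of pathological power removed (1.0 means full suppression).
    suppression = 1.0 - band_power / baseline_band_power
    # Fraction of the continuous-stimulation charge budget that was spent.
    energy_cost = stim_charge / max_charge
    return suppression - energy_weight * energy_cost


# Same suppression, different charge: the sparser policy earns more reward.
continuous = energy_aware_reward(0.5, 1.0, stim_charge=1.0, max_charge=1.0)
sparse = energy_aware_reward(0.5, 1.0, stim_charge=0.2, max_charge=1.0)
```

With a reward of this shape, an agent is no longer indifferent between two policies that suppress oscillations equally well; the one that stimulates less is preferred.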

Entropy Regularized Reinforcement Learning for Zero-Sum Stochastic Differential Games in a Regime-Switching Jump-Diffusion Process

熵正则化强化学习在状态切换跳扩散过程中的零和随机微分博弈

Authors: Congde Hu, Zhuo Jin, Danping Li, Lin Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28669
Pdf link: https://arxiv.org/pdf/2606.28669
Abstract To address parameter misspecification and sudden structural environmental changes in conventional stochastic differential game (SDG) frameworks, this paper introduces a distributional control approach that characterizes optimal strategies as probability distributions over actions, conditioned on the continuous state, the discrete regime state, and parameters. This forms a reinforcement learning framework for entropy-regularized zero-sum stochastic differential games (ERRL-ZSSDGs) in a regime-switching jump-diffusion process. Using the dynamic programming principle (DPP), we derive the associated coupled systems of Hamilton-Jacobi-Bellman-Isaacs (HJBI) equations, from which equilibrium strategies are expressed via gradients of the value function. For linear-quadratic problems, semi-analytical solutions for both value function and equilibrium strategies are obtained by solving a system of coupled ordinary differential equations (ODEs). In more general settings, an Actor-Critic policy improvement algorithm is developed to approximate the value functions and equilibrium policies across different regimes. The method is applied to an investment game, and numerical examples illustrate the effect of the temperature parameter and regime transitions on optimal policies and values.
中文摘要 为解决传统随机微分博弈（SDG）框架中的参数错误和突发结构性环境变化，本文引入了一种分布控制方法，将最优策略描述为基于连续状态、离散状态和参数的概率分布。这为熵正则化零和随机微分博弈（ERRL-ZSSDGs）在态态切换跳跃扩散过程中构建了一个强化学习框架。利用动态规划原理（DPP），我们推导出相关的Hamilton-Jacobi-Bellman-Isaacs（HJBI）方程的耦合系统，其中平衡策略通过价值函数的梯度来表达。对于线性-二次问题，通过求解一组耦合常微分方程（ODE）可以获得价值函数和均衡策略的半解析解。在更一般的环境中，开发了Actor-Critic策略改进算法，用于近似不同体制下的价值函数和均衡政策。该方法应用于投资博弈，数值示例展示了温度参数和体制转变对最优政策和价值的影响。
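
As background for why entropy regularization leads to strategies that are probability distributions over actions, consider the simplest single-player building block. This is a standard result stated under simplifying assumptions (a finite action set $\mathcal{A}$, a fixed payoff $q(a)$, and temperature $\tau > 0$); it is not the paper's continuous-time HJBI derivation.

```latex
\max_{\pi \in \Delta(\mathcal{A})} \; \sum_{a \in \mathcal{A}} \pi(a)\, q(a) \;-\; \tau \sum_{a \in \mathcal{A}} \pi(a) \log \pi(a),
\qquad
\pi^{*}(a) = \frac{\exp\!\big(q(a)/\tau\big)}{\sum_{b \in \mathcal{A}} \exp\!\big(q(b)/\tau\big)},
\qquad
V^{*} = \tau \log \sum_{b \in \mathcal{A}} \exp\!\big(q(b)/\tau\big).
```

As $\tau \to 0$ the optimal distribution concentrates on the maximizing action, while a larger $\tau$ spreads probability mass across actions and so encourages exploration.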

Entropy-Regularized Reinforcement Learning for Linear-Quadratic Stackelberg Differential Games in Regime-Switching Diffusion Models

熵正则化强化学习在态态切换扩散模型中的线性二次斯塔克伯格微分博弈

Authors: Congde Hu, Danping Li, Lin Xu, Wenying Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28671
Pdf link: https://arxiv.org/pdf/2606.28671
Abstract Stackelberg differential games (SDGs) provide a powerful framework for hierarchical decision-making in stochastic and continuous-time environments, yet their solution remains computationally challenging due to the complexity of traditional dynamic programming and Hamilton-Jacobi-Bellman-Isaacs (HJBI) methods, especially in high-dimensional systems. This paper proposes an entropy-regularized reinforcement learning (ERRL) approach for linear-quadratic SDGs (LQ-SDGs) within a continuous-time diffusion framework governed by Markovian regime switching. The key innovation lies in deriving exploratory weakly-coupled HJBI equations with entropy regularization, which promotes stochastic policies that actively avoid suboptimal equilibria -- a limitation of classical SDG methods. Neural networks are integrated to approximate regime-dependent value functions and solve high-dimensional partial differential equations (PDEs) efficiently, while a novel sampling technique enhances computational tractability. Numerical results demonstrate the effectiveness of the framework compared to conventional approaches, particularly in escaping suboptimal traps through exploratory policies. The study highlights the critical role of entropy regularization and neural network approximations in achieving robust solutions for hierarchical decision-making problems under abrupt environmental shifts.
中文摘要 斯塔克尔伯格微分博弈（SDGs）为随机和连续时间环境中的层级决策提供了一个强大的框架，但由于传统动态规划和汉密尔顿-雅各比-贝尔曼-艾萨克斯（HJBI）方法的复杂性，尤其是在高维系统中，其解法仍然具有计算上的挑战。本文提出了一种熵正则化强化学习（ERRL）方法，适用于线性-二次SDGs（LQ-SDGs），在连续时间扩散框架下，该框架由马尔可夫态切换控制。关键创新在于推导具有熵正则化的探索性弱耦合HJBI方程，这促进了主动避免次优均衡的随机策略——这是经典SDG方法的局限性。神经网络被集成用于近似状态依赖的值函数，并高效求解高维偏微分方程（PDE），同时一种新型采样技术提升了计算的可处理性。数值结果显示该框架相较于传统方法的有效性，尤其是在通过探索性策略逃避次优陷阱方面。该研究强调了熵正则化和神经网络近似在实现环境骤变下层级决策问题稳健解决方案中的关键作用。

An AI agent for treatment reasoning over a biomedical tool universe

一个用于治疗推理的人工智能代理，超越生物医学工具宇宙

Authors: Shanghua Gao, Ayush Noori, Richard Zhu, Curtis Ginder, Zhenglun Kong, Xiaorui Su, Justin Kauffman, Benjamin S. Glicksberg, Joshua Lampert, Ankit Sakhuja, Ashwin Sawant, ATHENA-R1 Evaluation Consortium, David A. Clifton, Noa Dagan, Ran Balicer, Marinka Zitnik
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28692
Pdf link: https://arxiv.org/pdf/2606.28692
Abstract Treatment reasoning underpins every therapeutic decision, integrating disease context, comorbidities, medications, contraindications, and evolving biomedical knowledge to select an appropriate therapy. It is inherently iterative: candidates are weighed against many constraints, revised as evidence emerges, and grounded in verifiable sources. Here we introduce ATHENA-R1, an AI agent for treatment reasoning across all FDA approved drugs since 1939, trained by reinforcement learning over a universe of 212 biomedical tools. At each step it identifies missing information, selects and runs relevant tools, and incorporates the evidence. To train it without human-annotated traces, we build a two-level self-learning framework: multi-agent systems construct the tools, tasks, and reasoning trajectories for supervised fine-tuning, then reinforcement learning with scientific feedback rewards reasoning quality (evidence gathering, grounded tool use, logical non-redundancy). Across five benchmarks of 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA-R1 outperforms language models and tool-use systems, reaching 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, 17.8 and 10.7 points above GPT-5. In blinded evaluations by experts from 28 rare disease organizations, it is preferred over reference models on all criteria, and physicians rated it favorably on complex hospitalized cardiovascular and infectious-disease cases. Adverse-event hypotheses it generated, tested in electronic health records from 5.4 million patients, reached adjusted odds ratios of 1.48-1.84, with no elevation among negative controls. Because it requires knowing what evidence to seek before concluding, treatment reasoning has long been hard for AI; we show it can be reframed as a learnable process of iterative evidence gathering that reinforcement learning can train AI to perform.
中文摘要 治疗推理是每一个治疗决策的基础，整合疾病背景、共病、用药、禁忌症以及不断发展的生物医学知识，以选择合适的治疗方案。它本质上是迭代的：候选人会在众多限制条件下权衡，随着证据出现而修订，并以可验证的来源为基础。这里我们介绍ATHENA-R1，这是一种自1939年以来用于治疗推理的人工智能药物，通过强化学习训练，涵盖212种生物医学工具。每一步它都会识别缺失的信息，选择并运行相关工具，并整合证据。为了在没有人工注释痕迹的情况下训练它，我们构建了一个两层自学框架：多智能体系统构建工具、任务和推理轨迹以进行监督微调，然后通过科学反馈强化学习奖励推理质量（证据收集、工具使用基础、逻辑非冗余）。在五个基准测试中，包含3,168个药物推理任务和456个患者治疗案例，ATHENA-R1的表现优于语言模型和工具使用系统，开放式药物推理准确率达94.7%，治疗推理准确率为82.9%，分别比GPT-5高出17.8个和10.7个百分点。在28个罕见病组织专家的盲测评估中，它在所有标准上优于参考模型，医生在复杂的住院心血管和传染病病例中给予了积极评价。其生成的不良事件假设在540万患者的电子健康记录中测试，调整后的比值比值为1.48-1.84，阴性对照组无升高。由于需要知道在结论前需要寻找哪些证据，治疗推理对人工智能来说一直很困难;我们展示了强化学习可以被重新框架为一个可学习的迭代证据收集过程，从而训练人工智能执行任务。

BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

BV-Blend：稳定无批评的历史基线，奖励可验证

Authors: Yupeng Chang, Yuan Wu, Yi Chang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28707
Pdf link: https://arxiv.org/pdf/2606.28707
Abstract Critic-free reinforcement learning with verifiable rewards (RLVR), exemplified by Group Relative Policy Optimization (GRPO), avoids training a value function (critic) and reduces memory and compute overhead relative to critic-based PPO pipelines for aligning large language models. However, GRPO-style advantage estimation depends on prompt-local (within-prompt-group) reward statistics and can be unstable. In particular, when all rollouts in a prompt group receive identical rewards, the within-group reward variance becomes zero, and group normalization yields zero advantages for that group, impeding learning in cold-start regimes with binary verifiers. We introduce BV-Blend, a critic-free framework that stabilizes advantage estimation by combining prompt-local on-policy statistics with semantic-cluster-conditioned historical moments. BV-Blend maintains EMA-tracked reward moments for each cluster, derives a confidence weight from a standard error of the mean (SEM) proxy, and uses this weight to blend historical and prompt-local baseline and variance statistics into a standardized advantage for PPO-style clipped updates. Experiments on verifiable reasoning benchmarks show that BV-Blend improves training stability and performance, and remains robust in regimes where group-normalized methods may stall.
中文摘要 无批判者可验证奖励强化学习（RLVR），以群体相对策略优化（GRPO）为例，避免了价值函数（critic）的训练，并且相较于基于批判的PPO管道，减少了内存和计算开销，用于对齐大型语言模型。然而，GRPO式优势估计依赖于提示-局部（提示组内）的奖励统计，且可能不稳定。特别是，当一个即时组内的所有推广都获得相同的奖励时，组内奖励方差变为零，组归一化对该组没有任何优势，阻碍了在带有二元验证器的冷启动模式下学习。我们介绍了BV-Blend，这是一个无批评的框架，通过结合提示-局部-政策统计与语义-集群条件的历史时刻，稳定优势估计。BV-Blend为每个簇维护EMA追踪的奖励时刻，从均值（SEM）代理的标准误差中推导置信权重，并利用该权重将历史统计和提示-局部基线及方差统计数据融合，为PPO风格的剪辑更新提供标准化优势。对可验证推理基准的实验表明，BV-Blend 能提升训练稳定性和表现，并且在群体归一化方法可能停滞的环境中依然保持稳健。
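
The zero-variance failure mode and the blending idea described in this abstract can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: the `ClusterMoments` helper, the SEM proxy `sqrt(variance / count)`, and the mapping `weight = 1 / (1 + sem)` are assumptions chosen only to make the example concrete.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within a single prompt group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


class ClusterMoments:
    """EMA-tracked reward moments for one semantic cluster (hypothetical helper)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.m1 = 0.0    # EMA of the reward
        self.m2 = 0.0    # EMA of the squared reward
        self.count = 0   # number of rewards seen so far

    def update(self, reward):
        if self.count == 0:
            self.m1, self.m2 = reward, reward * reward
        else:
            d = self.decay
            self.m1 = d * self.m1 + (1 - d) * reward
            self.m2 = d * self.m2 + (1 - d) * reward * reward
        self.count += 1

    def variance(self):
        return max(self.m2 - self.m1 ** 2, 0.0)


def blended_advantages(rewards, hist, eps=1e-8):
    """Blend historical and prompt-local statistics into one standardized advantage."""
    n = len(rewards)
    local_mean = sum(rewards) / n
    local_var = sum((r - local_mean) ** 2 for r in rewards) / n
    if hist.count == 0:
        weight = 0.0  # no history yet: fall back to prompt-local statistics
    else:
        sem = math.sqrt(hist.variance() / hist.count)  # assumed SEM proxy
        weight = 1.0 / (1.0 + sem)                     # assumed confidence mapping
    baseline = weight * hist.m1 + (1 - weight) * local_mean
    var = weight * hist.variance() + (1 - weight) * local_var
    return [(r - baseline) / (math.sqrt(var) + eps) for r in rewards]


# A prompt group in which every rollout passes a binary verifier.
group = [1.0, 1.0, 1.0, 1.0]

# History for the cluster this prompt belongs to (mixed past outcomes).
hist = ClusterMoments()
for past_reward in [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]:
    hist.update(past_reward)

local_only = group_advantages(group)        # all zeros: no learning signal
blended = blended_advantages(group, hist)   # positive: history supplies a baseline
```

In this sketch the group-normalized advantages vanish because every rollout earns the same reward, whereas the blended version still credits the rollouts for beating the cluster's historical average.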

Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization

结构化策略的层级决策：通过逆优化实现的原则性设计

Authors: Yuexuan Wang, Jingyuan Zhou, Kaidi Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28764
Pdf link: https://arxiv.org/pdf/2606.28764
Abstract Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical limitations: (i) reinforcement learning (RL)-based methods struggle to guarantee strict constraint satisfaction, and (ii) optimal control (OC)-based approaches often rely on myopic and computationally prohibitive formulations. To reconcile these trade-offs, hierarchical RL-OC architectures have emerged as a promising paradigm. However, the formulation of the lower-level optimization within these frameworks remains underexplored, often relying on heuristic or myopic objectives. In this work, we propose a principled framework that systematically integrates upper-level goal abstraction with structured lower-level decision making. We adopt an inverse optimization approach to inform the structure of the lower-level problem from expert demonstrations, ensuring that the objective of the lower-level policy remains aligned with the overall long-term task goal. To validate the approach, our framework is evaluated on distinct decision making tasks: network-based resource allocation and continuous collision avoidance. Empirical results demonstrate that our method consistently outperforms strong baselines based on end-to-end RL, learning-augmented optimal control, and existing hierarchical RL approaches in both efficiency and decision quality.
中文摘要 分层决策框架对于处理复杂的控制任务至关重要，使智能体能够将复杂的问题分解为可管理的子目标。尽管前景看好，现有的层级策略仍面临关键局限：（i）基于强化学习（RL）的方法难以保证严格的约束满足，（ii）基于最优控制（OC）的方法常依赖目光短浅且计算量过大的表述。为了平衡这些权衡，层级RL-OC架构成为一种有前景的范式。然而，这些框架下底层优化的表述仍然缺乏充分探索，常常依赖启发式或目光短浅的目标。在本研究中，我们提出了一个原则性框架，系统地整合了上层目标抽象与结构化的下层决策。我们采用逆优化方法，通过专家演示来指导下层问题的结构，确保下层策略的目标与整体长期任务目标保持一致。为验证该方法，我们的框架在不同的决策任务上进行了评估：基于网络的资源分配和连续碰撞避免。实证结果表明，我们的方法在效率和决策质量方面，始终优于基于端到端强化学习、学习增强最优控制和现有层级强化学习方法的强基线。

Physics Models for Sim-to-Real Transfer in Professional-Level Robot Table Tennis

职业级机器人乒乓球模拟到真实转移的物理模型

Authors: Christian Conti (1), Bilan Yang (1), Alexander Sigrist (2), Lorenzo Miele (2), Yamen Saraiji (1), Peter Dürr (2), Naoya Takahashi (2) ((1) Sony AI, Tokyo, Japan, (2) Sony AI, Zurich, Switzerland)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.28805
Pdf link: https://arxiv.org/pdf/2606.28805
Abstract At competitive speeds and spins, a table tennis ball follows complex, counterintuitive trajectories that a robot must track and precisely counter within fractions of a second. Training a reinforcement learning policy capable of these skills is prohibitively expensive and dangerous in the real world, making high-fidelity simulation essential. Transferability of such policies, however, critically depends on how faithfully the simulation captures real-world dynamics--a requirement made even more stringent by the adversarial nature of the game, where any regime in which a model fails to approximate reality becomes an exploitable weakness for the opponent. Prior state-of-the-art in robot table tennis generally focuses on a limited range of velocities and spins and fails to capture the richness of ball behaviors encountered in professional-level play. In this work, we present physics models for the aerodynamic ball flight, for the contact dynamics between the ball and the table, as well as between the ball and the racket that accurately capture the ball behavior over a vast range of speeds and spins relevant to the game. Specifically, we model drag and Magnus force coefficients as functions of Reynolds number and spin ratio in the aerodynamics equations. For the table contact model we model effects of ball buckling on the coefficient of restitution and incorporate residuals into the instantaneous point-contact models. For the racket contact model we introduce a residual neural network component to complement coefficients related to normal and tangential coefficients of restitution as well as torsional spin damping. The resulting models were used for the first real-world robot table tennis AI agent capable of competing against professional players, to train reinforcement learning policies.
中文摘要 在竞技速度和旋转下，乒乓球遵循复杂且反直觉的轨迹，机器人必须在几分之一秒内追踪并精确反击。在现实世界中，训练具备这些技能的强化学习策略成本高昂且危险，因此高精度模拟变得不可或缺。然而，这些策略的可转移性关键在于模拟对现实世界动态的忠实程度——这一要求因游戏的对抗性质而更加严格，任何模型未能近似现实的状态都会成为对手可利用的弱点。以往机器人乒乓球的先进技术通常只关注有限的速度和旋转范围，未能捕捉职业水平比赛中球的丰富表现。在本研究中，我们提出了空气动力学球体飞行、球与球台之间以及球与球拍之间的接触动力学物理模型，这些模型准确捕捉了球在与比赛相关的广泛速度和旋转范围内的行为。具体来说，我们将阻力和马格努斯力系数建模为空气动力学方程中的雷诺数和自旋比的函数。对于桌面接触模型，我们模拟了球屈曲对恢复系数的影响，并将残差纳入瞬时点接触模型。对于球拍接触模型，我们引入了残差神经网络组件，以补补与正向和切向恢复系数相关的系数以及扭转自旋阻尼。这些模型被用于首个能够与职业选手竞争的现实世界机器人乒乓球AI代理，用于训练强化学习策略。
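
To make the aerodynamic part concrete, the sketch below computes the acceleration of a spinning ball from gravity, drag, and Magnus force, with both coefficients written as functions of Reynolds number and spin ratio. The force expressions are the standard textbook forms; the two coefficient functions are placeholders with invented constants, not the fitted models from the paper, and the air properties and ball dimensions are nominal values assumed for illustration.

```python
import math

RHO = 1.2            # air density, kg/m^3 (nominal)
MU = 1.8e-5          # dynamic viscosity of air, Pa*s (nominal)
DIAMETER = 0.040     # ball diameter, m
MASS = 0.0027        # ball mass, kg
AREA = math.pi * (DIAMETER / 2) ** 2
GRAVITY = (0.0, 0.0, -9.81)


def drag_coefficient(reynolds, spin_ratio):
    """Placeholder: a real model would be fitted as a function of both inputs."""
    return 0.45


def lift_coefficient(reynolds, spin_ratio):
    """Placeholder: grows with spin ratio and saturates (invented constants)."""
    return min(0.6 * spin_ratio, 0.4)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def ball_acceleration(velocity, omega):
    """Acceleration (m/s^2) from gravity, drag, and Magnus force."""
    speed = norm(velocity)
    if speed == 0.0:
        return GRAVITY
    reynolds = RHO * speed * DIAMETER / MU
    spin_ratio = (DIAMETER / 2) * norm(omega) / speed
    cd = drag_coefficient(reynolds, spin_ratio)
    cl = lift_coefficient(reynolds, spin_ratio)
    k = 0.5 * RHO * AREA / MASS
    # Drag opposes the velocity and scales with speed squared.
    drag = tuple(-k * cd * speed * v for v in velocity)
    # Magnus force acts along omega x velocity.
    axis = cross(omega, velocity)
    axis_norm = norm(axis)
    if axis_norm > 0.0:
        magnus = tuple(k * cl * speed ** 2 * a / axis_norm for a in axis)
    else:
        magnus = (0.0, 0.0, 0.0)
    return tuple(g + d + m for g, d, m in zip(GRAVITY, drag, magnus))


# Ball flying along +x with z up: spin about +y is topspin, about -y is backspin.
topspin = ball_acceleration((10.0, 0.0, 0.0), (0.0, 100.0, 0.0))
backspin = ball_acceleration((10.0, 0.0, 0.0), (0.0, -100.0, 0.0))
```

With these conventions, topspin adds a downward Magnus component so the ball dips faster than under gravity alone, while backspin partially counteracts gravity.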

Q-DASC: State-of-the-Art Safe Quantum Control for HVAC under Local Model Misspecification

Q-DASC：局部模型错误规定下最先进的暖通空调安全量子控制

Authors: Yifan Wang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.28834
Pdf link: https://arxiv.org/pdf/2606.28834
Abstract Variational quantum reinforcement learning offers a compact policy class for building-energy control, but it inherits a deployment weakness shared by learned controllers: when the thermal model is locally wrong, a policy that appears safe on the model can violate occupant comfort in the real building. Guarantees that depend on noisy quantum read-out are also insufficient for safety-critical control. We address this gap with Q-DASC, Discrepancy-Attributed Safe Quantum Control. Q-DASC wraps a variational-quantum-circuit (VQC) policy with a certified classical safety layer that discovers misspecified operating regimes with false-discovery-rate control, repairs their local thermal gains with shrinkage, projects the proposed quantum schedule onto the repaired comfort-feasible set, and attributes residual violations to policy error, model error, or physical limits. Because the final certificate is produced by classical projection, comfort feasibility is invariant to finite-shot and depolarizing read-out noise. On real BOPTEST building emulators across three buildings, two localized misspecifications, and three seeds, Q-DASC reduces average comfort violation from 26.0% for the raw VQC controller and 55.3% for a model-trusting scheduler to 0.02%, matching a clairvoyant oracle, and remains at 0.24% under NISQ read-out noise. A repair-aware VQC variant reaches 0.00% violation and reduces projection intervention, while the default Q-DASC keeps lower energy and stronger observational-data behavior. The same wrapper transfers to EnergyPlus heating and cooling benchmarks and to real hospital air-handling-unit data. These results establish a safety-efficiency frontier for deploying quantum policies in physics-constrained control.
中文摘要 变分量子强化学习提供了一个紧凑的建筑能能控制策略类，但它继承了一个与学习控制者共享的部署弱点：当热模型局部错误时，模型上看似安全的策略可能会侵犯真实建筑中的居住者舒适度。依赖噪声量子读出的保证也不足以实现安全关键控制。我们用Q-DASC（差异归因安全量子控制）来弥补这一空白。Q-DASC将变分量子电路（VQC）策略包裹在经过认证的经典安全层中，该层通过假发现率控制发现错误的操作模式，通过收缩修复其局部热增益，将拟议的量子调度投影到修复后的舒适可行集合上，并将残余违规归因于策略错误、模型错误或物理极限。由于最终证书由经典投影产生，舒适可行性不受有限单值和去极化读出噪声的影响。在三栋建筑的真实BOPTEST仿真器上，涉及两处局部错误配置和三个种子，Q-DASC将原始VQC控制器的平均舒适度违规率从26.0%和模型信任调度器55.3%降至0.02%，与预言机相匹配，且在NISQ读出噪声下仍保持在0.24%。可修复的VQC变体达到0.00\%的违规率，减少投影干预，而默认的Q-DASC则保持较低能量和更强的观测数据行为。同样的包装也传递到EnergyPlus的供暖和制冷基准以及真实医院空气处理单元数据。这些结果为在物理约束控制中部署量子政策奠定了安全效率的前沿。

A3M: Adaptive, Adversarial and Multi-Objective Learning for Strategic Bidding in Repeated Auctions

A3M：自适应、对抗和多目标学习，用于重复拍卖中的战略竞价

Authors: Junhan Li, Yuxin Zhang, Haoran Wang, Minghao Chen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.28943
Pdf link: https://arxiv.org/pdf/2606.28943
Abstract Learning to bid in repeated multi-unit auctions with bandit feedback poses a fundamental challenge. Existing methods often rely on rigid explore-then-exploit schedules, assume stationary adversaries, and optimize solely for bidder utility, thereby limiting adaptability and strategic robustness. To address these limitations, we introduce the A3M framework, which integrates adaptive deep reinforcement learning (DRL), explicit adversarial reasoning, and principled multi-objective reward design for online auction strategy optimization. A3M employs an actor-critic DRL backbone to dynamically balance exploration and exploitation, an opponent model for fictitious play against non-stationary adversaries, and a composite reward function to jointly maximize utility, auctioneer revenue, and fairness. We provide the first comprehensive empirical evaluation of this integrated approach against established baselines in both discriminatory and uniform price auctions. Results show that A3M reduces final regret by 30--40\% in standard settings, maintains robust performance against adversarial strategy shifts, scales favorably with the number of units $K$, and enables tunable multi-objective trade-offs. An extensive ablation study confirms the necessity of each core component. Our work establishes A3M as a powerful and flexible framework for learning in complex auction environments.
中文摘要 在带有老虎机（bandit）反馈的重复多单位拍卖中学习出价是一项根本性的挑战。现有方法常依赖僵化的“先探索后利用”计划，假设对手是平稳的，并仅针对竞标者效用进行优化，从而限制了适应性和策略稳健性。为解决这些局限，我们引入了A3M框架，该框架整合了自适应深度强化学习（DRL）、显式对抗推理和有原则的多目标奖励设计，用于在线拍卖策略优化。A3M采用演员-评论家DRL骨干来动态平衡探索与利用，用对手模型针对非平稳对手进行虚拟博弈（fictitious play），并采用复合奖励函数联合最大化效用、拍卖方收入和公平性。我们在歧视性价格拍卖和统一价格拍卖中，首次针对既有基线对这一综合方法进行了全面的实证评估。结果显示，A3M在标准设置下将最终遗憾降低30%至40%，在对手策略发生转变时保持稳健性能，随单位数量$K$的增加具有良好的可扩展性，并支持可调节的多目标权衡。一项广泛的消融研究证实了每个核心组件的必要性。我们的工作确立了A3M作为在复杂拍卖环境中学习的强大且灵活的框架。

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

考虑修改的价值学习：缓解强化学习中的奖励黑客行为

Authors: Evgenii Opryshko, Umangi Jain, Igor Gilitschenski
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.28955
Pdf link: https://arxiv.org/pdf/2606.28955
Abstract Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain policy updates to stay near a known safe reference, creating a tension between suppressing hacking and permitting legitimate improvement. We propose Modification-Considering Value Learning (MCVL), which operationalizes the theoretical idea of current utility optimization for standard value-based RL. MCVL wraps an off-policy learner and treats each incoming transition as a candidate modification: it forecasts two training paths, one that includes the transition and one that does not, and scores both with a frozen bootstrapped-return estimator derived from a learned reward model and value function. The transition is admitted only if inclusion does not decrease the score. We formalize conditions under which this filtering is both safe and permissive, and instantiate MCVL with DDQN and TD3. Across four safety-relevant gridworlds and three modified MuJoCo continuous-control tasks with diverse hacking mechanisms, MCVL mitigates reward hacking while continuing to improve the intended objective. Project website: this http URL.
中文摘要 强化学习代理可以利用错误指定的奖励信号，在未达预期目标的同时获得高表观回报，这种失败模式被称为奖励黑客。现有的实际防御通常限制政策更新必须靠近已知的安全参考，这在遏制黑客攻击与允许合法改进之间产生了张力。我们提出了修改考虑价值学习（MCVL），将当前效用优化的理论理念应用于标准基于价值的强化学习。MCVL包裹一个非策略学习器，并将每个进入的转换视为候选修改：它预测两条训练路径，一条包含过渡，一条不包含，并用基于学习奖励模型和价值函数的冻结自助收益估计器对两者进行评分。只有在纳入不降低分数的情况下，过渡才被允许。我们正式确定该过滤既安全又允许的条件，并与DDQN和TD3实现MCVL的生成。在四个安全相关的网格世界和三个带有多样黑客机制的修改版MuJoCo连续控制任务中，MCVL在不断改进预期目标的同时，减轻了奖励黑客行为。项目网站：这个http网址。

Efficient Spatio-Temporal Grounding with Multimodal Large Models via Second-Level Tracking and RL Verification

通过秒级跟踪和强化学习验证，利用多模态大模型实现高效时空定位

Authors: Tianshu Zhang, Yan Wang, Ji Qi, Lijie Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29023
Pdf link: https://arxiv.org/pdf/2606.29023
Abstract Spatio-temporal grounding in long videos requires precise temporal localization and robust object tracking conditioned on natural-language queries. While recent vision-language models (VLMs) show strong reasoning ability, directly applying frame-by-frame inference to long sequences is computationally expensive and unstable. We propose a practical pipeline that shifts from frame-level to second-level tracking and performs cross-second smoothing to preserve continuity while reducing sequence length. To improve reasoning supervision, we synthesize chain-of-thought style trajectories using advanced multimodal models for temporal localization and target selection, and replace generated spatio-temporal coordinates with ground-truth annotations to avoid noisy supervision. We further optimize the policy with reinforcement learning using a verifier based on $t_\mathrm{IoU}+mv_\mathrm{IoU}$. Experiments across multiple FPS settings show that our method achieves a strong trade-off between efficiency and localization quality.
中文摘要 长视频中的时空定位（spatio-temporal grounding）需要根据自然语言查询进行精确的时间定位和稳健的目标跟踪。虽然最新的视觉语言模型（VLMs）展现出强大的推理能力，但直接对长序列逐帧推理计算成本高且不稳定。我们提出了一种实用的流水线，将帧级跟踪改为秒级跟踪，并进行跨秒平滑，在缩短序列长度的同时保持连续性。为了改进推理监督，我们利用先进的多模态模型合成思维链风格的轨迹，用于时间定位和目标选择，并用真实标注替换生成的时空坐标，以避免噪声监督。我们进一步使用基于$t_\mathrm{IoU}+mv_\mathrm{IoU}$的验证器，通过强化学习优化策略。在多种FPS设置下的实验表明，我们的方法在效率与定位质量之间取得了良好的权衡。
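
The verifier score $t_\mathrm{IoU}+mv_\mathrm{IoU}$ mentioned in the abstract can be illustrated with a small sketch. The abstract does not define either term, so this assumes that $t_\mathrm{IoU}$ is the IoU of the predicted and ground-truth time spans and that the visual term is a box IoU summed over shared seconds and divided by the number of seconds covered by either track (a common vIoU convention in video grounding). All function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def visual_iou(pred_boxes, gt_boxes):
    """Assumed vIoU: box IoU summed over shared seconds, divided by the
    number of seconds covered by either track. Inputs map second -> box."""
    shared = pred_boxes.keys() & gt_boxes.keys()
    covered = pred_boxes.keys() | gt_boxes.keys()
    if not covered:
        return 0.0
    return sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in shared) / len(covered)


def verifier_reward(pred_span, gt_span, pred_boxes, gt_boxes):
    """Hypothetical scalar reward: temporal IoU plus visual IoU."""
    return temporal_iou(pred_span, gt_span) + visual_iou(pred_boxes, gt_boxes)
```

Keying the boxes by second rather than by frame mirrors the second-level tracking described in the abstract: the reward is computed over far fewer entries than a frame-level track would need.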

Fairness Attacks on Recommender Systems

对推荐系统公平性的攻击

Authors: Yanan Wang, Yong Ge
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29064
Pdf link: https://arxiv.org/pdf/2606.29064
Abstract The unfairness of recommender systems has become a topic of concern due to its significant social and ethical implications. Although existing works have shown the effectiveness of attacks on the performance of recommender systems (e.g., promotion and demotion attack), the study of fairness attacks on recommender systems remains largely under-explored. To this end, we propose a novel structure-aware reinforcement learning-based fairness attack method designed to exacerbate the unfairness of target recommender systems. Specifically, we first employ a graph-based structure encoder to model the structural dependencies among the generated fake user-item interactions and the original user-item interactions. Then, we model the sequential dependency of the injected fake items using a recurrent neural network. Based on the learned structure-aware and sequence-aware representations of the fake user and item, the item selection policy attentively decides the next injected fake item. Since the target recommender system may employ fairness-aware training and leverage the user's sensitive attribute information, such as gender, we further designed a gender selection policy to decide the gender of the entire fake user profile. Both the item selection and gender selection policy are learned jointly in our proposed method. Finally, experimental results on four types of target recommendation models and two real-world datasets demonstrate the effectiveness of the proposed attack method in exacerbating the unfairness of recommender systems.
中文摘要 推荐系统不公平性因其重大的社会和伦理影响而成为关注的话题。尽管现有研究已证明攻击对推荐系统性能的有效性（例如晋升和降级攻击），但对公平性攻击对推荐系统的研究仍然较少被充分探讨。为此，我们提出了一种基于结构感知强化学习的公平性攻击方法，旨在加剧目标推荐系统中的不公平性。具体来说，我们首先使用基于图的结构编码器来建模生成的虚假用户-物品交互与原始用户-物品交互之间的结构依赖关系。然后，我们用循环神经网络模拟注入假物品的顺序依赖性。基于对虚假用户和物品的结构感知和顺序感知表征，项目选择策略会认真决定下一个注入的假物品。由于目标推荐系统可能采用公平意识培训，并利用用户的敏感属性信息（如性别），我们进一步设计了性别选择政策，决定整个虚假用户档案的性别。我们提出的方法中，题目选择和性别选择政策均是共同学习的。最后，针对四种目标推荐模型和两个真实世界数据集的实验结果证明了该攻击方法在加剧推荐系统不公平性的有效性。

Masked Diffusion Decoding as $x$-Prediction Flow

掩蔽扩散解码作为$x$-预测流

Authors: Weitian Wang, Lianlei Shan, Shubham Rai, Cecilia De La Parra, Akash Kumar
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29066
Pdf link: https://arxiv.org/pdf/2606.29066
Abstract Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens, but their standard decoder reduces each step to a binary action: a position is either committed to a single token or left fully masked, with no representation of partial belief in between. This all-or-nothing regime discards rich predictive information and forces premature, irrevocable commitments, leading to poor performance under a limited decoding budget. In this paper, we reinterpret mask prediction as clean-state prediction ($x$-prediction) and show that it can be used to induce a continuous flow in input embedding space. Building on this view, we propose a continuous decoding framework for MDLMs where tokens can accumulate partial progress at each diffusion step and remain revisable. To match the uneven contextual constraints across positions in language, we replace the globally synchronous schedule in image diffusion with a confidence-based asynchronous update in which the diffusion progress is token-wise accumulated. Additionally, we introduce a lightweight policy network and formulate its training as a reinforcement learning problem. Applied to pretrained LLaDA, our continuous decoder reaches 97% of its performance on the HumanEval dataset with 25% of decoding budget.
中文摘要 掩盖扩散语言模型（MDLMs）通过迭代卸掩护令牌生成文本，但其标准解码器将每一步简化为二元动作：一个位置要么被承诺到单个令牌，要么完全掩蔽，中间没有部分信念的表示。这种全有或全无的制度摒弃了丰富的预测信息，迫使过早且不可撤销的承诺，导致在有限的解码预算下表现不佳。本文将掩膜预测重新解释为干净状态预测（$x$-预测），并证明它可以在输入嵌入空间中诱导连续流。基于这一观点，我们提出了一个连续解码的MDLM框架，其中代币在每个扩散步骤中可以累积部分进展并保持可修订性。为了匹配语言中不同位置的上下文约束不均，我们将图像扩散中的全局同步调度替换为基于置信度的异步更新，其中扩散进度按标记累计。此外，我们引入了一个轻量级策略网络，并将其训练制定为强化学习问题。应用于预训练的LLaDA，我们的连续解码器在HumanEval数据集上性能达到97%，解码预算仅占25%。

HiComm: Hierarchical Communication for Multi-agent Reinforcement Learning

HiComm：多智能体强化学习的分层通信

Authors: Runze Zhao, Dongruo Zhou, Sumit Kumar Jha, Nathaniel D. Bastian, Ankit Shah
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29126
Pdf link: https://arxiv.org/pdf/2606.29126
Abstract Cooperative multi-agent reinforcement learning (MARL) often relies on communication to mitigate partial observability, yet most existing protocols treat messages as flat dense vectors detached from the structure of the observations they summarize. This design overlooks an important source of inductive bias in many cooperative environments, where observations naturally follow a hierarchy such as groups and entities. We propose \textsc{HiComm}, a plug-in communication module that grounds messages in the sender's hierarchical observation. \textsc{HiComm} is receiver-driven: the receiver issues a query, and the hierarchy is resolved through a three-stage decoding process that first selects a group, then a sender, and then an entity within that group, returning the corresponding feature slice as the message. This converts communication from unstructured vector transmission into structured information retrieval over the sender's observation hierarchy. We instantiate this mechanism with Straight-Through Gumbel-Softmax for differentiable discrete selection and a lightweight shared projection design that attaches to standard MARL pipelines. Experiments across cooperative MARL tasks with different observation structures and coordination demands show that \textsc{HiComm} matches or outperforms representative learned communication baselines while reducing communication volume by up to $23\times$ per receiver per episode.
中文摘要 合作式多智能体强化学习（MARL）通常依赖通信来缓解部分可观测性，但大多数现有协议将消息视为扁平的稠密向量，与其所概括的观测结构相脱离。这种设计忽略了许多合作环境中归纳偏置的一个重要来源：这些环境中的观测天然遵循层级结构，例如群组和实体。我们提出了 \textsc{HiComm}，一个即插即用的通信模块，将消息建立在发送者的层级观测之上。\textsc{HiComm} 是接收者驱动的：接收者发出查询，层级结构通过三阶段解码过程逐级解析，先选择一个群组，然后选择一个发送者，再选择该群组内的一个实体，并返回相应的特征切片作为消息。这将通信从非结构化的向量传输转变为在发送者观测层级上的结构化信息检索。我们用直通（Straight-Through）Gumbel-Softmax实现可微分的离散选择，并采用可接入标准MARL流水线的轻量级共享投影设计来实例化这一机制。在具有不同观测结构和协调需求的合作MARL任务上的实验表明，\textsc{HiComm} 达到或超过了具有代表性的学习式通信基线，同时将每个接收者每回合的通信量最多减少$23\times$。
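
The three-stage selection (group, then sender, then entity) can be sketched as the forward pass of a hard Gumbel-Softmax sample applied three times. This is only an illustration: the nested-list layout `obs[sender][group][entity]`, the function names, and the way logits are passed in are all assumptions, and no gradients are computed, so the straight-through part is described in a comment rather than implemented.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample; u is clamped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_hard(logits, tau, rng):
    """Forward pass of a hard Gumbel-Softmax sample.

    Returns (index, soft): `soft` is the relaxed distribution, `index` its argmax.
    A straight-through estimator would emit the one-hot of `index` in the forward
    pass and back-propagate through `soft`; no gradients are computed here.
    """
    noisy = [(x + sample_gumbel(rng)) / tau for x in logits]
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    index = max(range(len(soft)), key=soft.__getitem__)
    return index, soft


def retrieve(obs, group_logits, sender_logits, entity_logits, tau=1.0, rng=None):
    """Hypothetical three-stage lookup: group, then sender, then entity.

    obs[sender][group][entity] is a feature slice; entity_logits[group] scores
    the entities of that group. How the logits are produced from the receiver's
    query is not specified in the abstract and is left to the caller.
    """
    rng = rng or random.Random()
    g, _ = gumbel_softmax_hard(group_logits, tau, rng)
    s, _ = gumbel_softmax_hard(sender_logits, tau, rng)
    e, _ = gumbel_softmax_hard(entity_logits[g], tau, rng)
    return obs[s][g][e]
```

Because only one feature slice is returned per query, the message size is that of a single entity's features rather than a dense summary vector, which is the intuition behind the reported reduction in communication volume.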

GPC: Large-Scale Generative Pretraining for Transferable Motor Control

GPC：可迁移运动控制的大规模生成预训练

Authors: Yi Shi, Yifeng Jiang, Chen Tessler, Xue Bin Peng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.29148
Pdf link: https://arxiv.org/pdf/2606.29148
Abstract Developing controllers capable of completing a wide range of tasks in a natural and life-like manner is a key challenge in enabling practical applications of physics-based character animation. In this work, we introduce Generative Pretrained Controllers (GPC), which leverage tokenization and next-token modeling to create general-purpose, reusable generative controllers from large-scale motion datasets. Our framework utilizes end-to-end reinforcement learning to jointly optimize a "motion vocabulary", modeled via Finite Scalar Quantization (FSQ), along with a corresponding control policy that can map the discrete codes to physics-based controls. After the "codebook" has been learned, the underlying structure of this large vocabulary is modeled by training a GPT-style autoregressive transformer, leading to a powerful generative controller that generates controls for a physically simulated character by performing next-token prediction. Once the generative controller has been trained, we propose a suite of adaptation techniques for finetuning the controller for new downstream tasks. Our proposed framework greatly simplifies the training process compared to previous tokenized methods, and achieves a 99.98% success rate in reproducing a vast corpus of motion clips. The generative controller exhibits a variety of natural emergent behaviors, such as responsive behaviors to perturbations and recovery behaviors after falling. This results in highly robust general purpose controllers for a variety of downstream applications.
中文摘要 开发能够以自然且逼真的方式完成各种任务的控制器，是推动基于物理的角色动画实际应用的关键挑战。在本研究中，我们介绍了生成预训练控制器（GPC），利用标记化和下一标记建模，从大规模运动数据集中创建通用、可重用的生成控制器。我们的框架利用端到端强化学习，共同优化通过有限标量量化（FSQ）建模的“运动词汇”，并配套相应的控制策略，将离散代码映射到基于物理的控制。在学习完“代码本”后，通过训练类似GPT的自回归变换器来建模这个庞大词汇的底层结构，从而生成强大的生成控制器，通过执行下一个令牌预测，为物理模拟的字符生成控制。生成控制器训练完成后，我们提出了一套适应技术，用于微调控制器以适应新的下游任务。我们提出的框架相比以往的标记化方法大大简化了训练过程，并实现了99.98%的成功率，重现了大量动作片段。生成控制器表现出多种自然涌现行为，如对扰动的反应性行为和坠落后的恢复行为。这造就了适用于多种下游应用的高通用控制器。
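
As a rough illustration of how Finite Scalar Quantization turns a continuous latent into a discrete "motion token", the sketch below bounds each latent dimension with `tanh`, rounds it to a small set of integer levels, and packs the per-dimension codes into one token id. It assumes an odd number of levels per dimension and made-up level counts; the paper's actual configuration is not given in the abstract, and training would additionally need a straight-through gradient for the rounding step.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of `levels[i]` integer values.

    Assumes every entry of `levels` is odd, so dimension i takes values in
    {-h, ..., h} with h = (levels[i] - 1) // 2.
    """
    codes = []
    for value, n in zip(z, levels):
        half = (n - 1) // 2
        codes.append(round(half * math.tanh(value)))  # bound, then round
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to one token id via mixed-radix encoding."""
    index, radix = 0, 1
    for code, n in zip(codes, levels):
        index += (code + (n - 1) // 2) * radix  # shift each code to 0..n-1
        radix *= n
    return index
```

With `levels = [3, 5, 5]` the implicit vocabulary has 3 × 5 × 5 = 75 entries, and there is no learned codebook to store: the token id is a direct function of the rounded latent.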

OASIF: An Efficient Obfuscation-Aware Self-Improving Framework for LLM-Based Assembly Code Instruction Following and Comprehension

OASIF：一个高效的混淆感知自我改进框架，用于基于LLM的汇编代码指令跟随与理解

Authors: Xinyi Wang, Rongze Chen, Ke Wang, Qiyuan Chen, Yanming Liu, Xiang Li, Chunfu Jia
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.29155
Pdf link: https://arxiv.org/pdf/2606.29155
Abstract Large Language Models (LLMs) have recently shown promise in automated binary analysis, yet they remain brittle under commercial-grade obfuscation. We present OASIF, an Obfuscation-Aware Self-evolving Instruction-Following framework for obfuscated assembly comprehension. OASIF couples a token-efficient assembly encoder with a lightweight projector to expose long obfuscated code to a pretrained code LLM under a bounded context budget and follows a three-phase training: (i) feature-space alignment, (ii) supervised instruction fine-tuning, and (iii) online self-evolving reinforcement learning with hybrid rewards, enabling continual adaptation with minimal manual verification. On VMISA-Bench, a challenging out-of-distribution suite featuring three commercial VM-based obfuscators, OASIF consistently improves open-source backbones; Qwen2.5-Coder-Instruct-14B attains Success Rate gains of +15.9, +5.8, and +16.9 percentage points (pp) on Code Virtualizer, Themida (v3.0.7), and VMProtect (v3.5), respectively, and improves the OASIF-Bench average by +9.8. OASIF further delivers stable gains across seven standard BCSD benchmarks while preserving general and domain-relevant capabilities on HumanEval, VulBench, and HumanEval-Decompile.
中文摘要 大型语言模型（LLMs）最近在自动化二进制分析中展现出潜力，但在商业级混淆下仍然较为脆弱。我们提出了OASIF，一个混淆感知的自演进指令跟随框架，用于混淆汇编理解。OASIF 将高效的汇编编码器与轻量级投影器结合，将长期混淆代码暴露给预训练代码 LLM，且在有限制的上下文预算下，遵循三阶段训练：（i）特征空间对齐，（ii）监督式指令微调，（iii）在线自我演化强化学习与混合奖励，实现持续适应且需最小的手动验证。在VMISA-Bench上，这是一个具有三个商业虚拟机混淆器的复杂非发行套件，OASIF持续改进开源骨干网;Qwen2.5-Coder-Instruct-14B 在 Code Virtualizer、Themida（v3.0.7）和 VMProtect（v3.5）上分别提升了 +15.9、+5.8 和 +16.9 个百分点（pp），并将 OASIF-Bench 平均提升了 +9.8。OASIF 在七个标准 BCSD 基准测试中实现了稳定的提升，同时保留了 HumanEval、VulBench 和 HumanEval-Decompile 上的通用和领域相关能力。

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

MIThinker：一个即插即用、经策略优化的思考模型，用于动机性访谈咨询

Authors: Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, Ee-Peng Lim
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29265
Pdf link: https://arxiv.org/pdf/2606.29265
Abstract Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents, including those using Motivational Interviewing (MI), generate responses without explicitly aligning thoughts with counseling techniques, limiting their effectiveness. We propose MIThinker, a lightweight thinking model that generates therapeutic thoughts to guide MI counseling agents in strategy selection and response generation. To overcome the lack of annotated thought data, we introduce AugR1-MI, an automated pipeline that reverse-engineers counselor's thoughts from observed responses. Through two-stage training combining supervised fine-tuning and reinforcement learning, MIThinker demonstrates improved theory-of-mind assessment and strategy alignment. Comprehensive evaluations show that MindfulMI, our agent leveraging MIThinker, achieves MI competency comparable to state-of-the-art systems with an order of magnitude less computation.
中文摘要 推理型大型语言模型（LLMs）近来在复杂问题求解方面取得了很大进展，它们利用内部推理（或思考）来指导解答的生成。然而，现有基于LLM的咨询智能体，包括使用动机性访谈（MI）的智能体，在生成回应时并未将思考与咨询技术显式对齐，限制了其效果。我们提出了MIThinker，一个轻量级思考模型，它生成治疗性思考，以指导MI咨询智能体进行策略选择和回应生成。为解决带标注思考数据缺乏的问题，我们引入了AugR1-MI，一个从观察到的回应中逆向还原咨询师思考的自动化流程。通过结合监督微调和强化学习的两阶段训练，MIThinker在心智理论评估和策略对齐方面表现出提升。综合评估显示，我们基于MIThinker的智能体MindfulMI以少一个数量级的计算量达到了与最先进系统相当的MI胜任力。

Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners

过程优势信号整形：一种面向LLM推理器中过程监督强化学习的范式无关中间件

Authors: Chao Wang, Hongtao Tian, Tao Yang, Yunsheng Shi, Ting Yao, Wenbo Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29296
Pdf link: https://arxiv.org/pdf/2606.29296
Abstract Group Relative Policy Optimization (GRPO) is a default recipe for process-supervised reinforcement learning of LLM reasoners, and dense process supervision -- via learned process reward models (PRMs) or on-policy-distillation KL signals -- is a common way to densify its otherwise weak outcome reward. Layering such a step-level signal on top of GRPO's group-standardized advantage, however, exposes three structural pathologies: \emph{channel contamination} between the pooled process, outcome, and format streams at group standardization; \emph{resolution mismatch} between the granularity of the process signal and the granularity of the logical decisions being credited; and a \emph{cumulative trap} by which GRPO's return-to-go sum surfaces either length inflation or truncated exploration depending on the sign regime of the signal. We propose \textbf{PASS} (\emph{Process Advantage Signal Shaping}), a compact middleware that sits between any scalar step-level process signal and GRPO's clipped surrogate and addresses the three pathologies in turn: \emph{Advantage Fusion} standardizes the three streams independently within each group, \emph{Chunk-by-Value} derives value-homogeneous chunks from the signal itself and broadcasts credit within each chunk, and \emph{Divide-Length} converts the cumulative objective into an average-value-density score. We validate PASS across two domains and two process-signal paradigms -- a learned PRM on mathematical reasoning and an on-policy-distillation KL signal (with a generalized variant) on multi-hop question answering -- and under two group-standardization operators. In every regime PASS delivers a consistent pass@1 gain over the corresponding GRPO baseline.
中文摘要 群相对策略优化（GRPO）是LLM推理器过程监督强化学习的默认配方，而密集过程监督——通过学习过程奖励模型（PRMs）或策略提纯上的基层代码（KL）——是加强其本来较弱的结果奖励的常见方式。然而，在GRPO的组标准化优势基础上叠加此类阶级信号，暴露出三种结构性病态：\emph{通道污染}：组标准化时汇集过程、结果和格式流之间的污染;\emph{分辨率不匹配}，说明过程信号的粒度与被归功逻辑决策的粒度;以及一个\emph{累积陷阱}，通过该陷阱，GRPO的回归求和根据信号的符号区段出现长度膨胀或截断探索。我们提出了 \textbf{PASS}（\emph{Process Advantage Signal Shaping}），这是一个紧凑的中间件，位于任何标量级进程信号和 GRPO 截剪代理之间，依次解决这三种病态问题：\emph{Advantage Fusion} 在每个组内独立标准化三条流，\emph{Chunk-by-Value} 从信号本身推导出价值同质块，并在每个区块内广播信用， \emph{除长度}将累积目标转换为平均值密度分数。我们在两个领域和两种过程-信号范式中验证PASS——一个基于数学推理的学习PRM，以及一个基于策略提纯的KL信号（带有广义变体）用于多跳问答——并在两个组标准化算符下验证。在每个阶段，PASS相较相应的GRPO基线都能提供一致的pass@1增益。
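
The Advantage Fusion step, which standardizes the process, outcome, and format streams independently within each group, can be contrasted with pooled standardization in a few lines. This sketch reduces each stream to one scalar per rollout and combines the standardized streams by a weighted sum; both simplifications, the default weights, and the function names are assumptions rather than details from the abstract.

```python
from statistics import fmean, pstdev


def standardize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling within one group of rollouts."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def pooled_advantage(process, outcome, fmt):
    """Baseline for contrast: sum the three streams, then standardize once."""
    return standardize([p + o + f for p, o, f in zip(process, outcome, fmt)])


def fused_advantage(process, outcome, fmt, weights=(1.0, 1.0, 1.0)):
    """Standardize each stream on its own, then combine.

    The weighted sum and its default weights are assumptions.
    """
    streams = [standardize(s) for s in (process, outcome, fmt)]
    return [sum(w * s[i] for w, s in zip(weights, streams))
            for i in range(len(process))]
```

For a group with `process = [10, 20, 30, 40]` and `outcome = [1, 0, 1, 0]`, the pooled version ranks the second rollout (wrong answer) above the first (right answer) because the process scale dominates, while the fused version ranks the first above the second. This is one concrete way the "channel contamination" described in the abstract can show up.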

LAMP: Long-Horizon Adaptive Manipulation Planning for Multi-Robot Collaboration in Cluttered Space

LAMP：多机器人协作的长视野自适应操作规划，适用于拥挤空间中的协作

Authors: Shuai Zhou, Yorai Shaoul, Jiaoyang Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.29358
Pdf link: https://arxiv.org/pdf/2606.29358
Abstract Multi-robot manipulation requires jointly reasoning about contact formations, robot motions under coupled dynamics, and collision avoidance. Systematically searching over this large space is difficult and becomes increasingly intractable as the number of robots grows, the task horizon lengthens, or the scene becomes more cluttered. Existing approaches therefore either learn to solve the problem end-to-end via reinforcement learning or restrict planning to a simpler surrogate problem, such as planning object motions while learning short-horizon contact primitives. However, neither paradigm scales to the problem instances we target: long-horizon multi-robot manipulation in extremely dense environments. In this paper, we propose a Long-horizon Adaptive Manipulation Planning (LAMP) framework with two planners that enable tractable search over the full coupled space by combining a learned generative manipulation model: a LAMPA* planner that systematically searches over the coupled object-robot space, and LAMP-Lazy: a lazy planner that enables real-time replanning through deferred evaluation. Experiments in challenging simulated environments demonstrate that our approach solves complex long-horizon tasks in highly cluttered environments that prior methods cannot handle.
中文摘要 多机器人操作需要共同推理接触形成、耦合动力学下的机器人运动以及碰撞避免。系统性地在这片广阔空间中搜索变得困难且随着机器人数量增加、任务视野拉长或场景变得杂乱，变得越来越难以处理。因此，现有方法要么通过强化学习从端到端学习解决问题，要么将规划限制在更简单的替代问题上，比如在学习短视距接触原语时规划物体运动。然而，这两种范式都无法适用于我们所针对的问题实例：在极高密度环境中的长视界多机器人操作。本文提出了一个长视野自适应操作规划（LAMP）框架，结合了学习到的生成操作模型，实现对整个耦合空间的可操作搜索：一个LAMPA*规划器系统性搜索耦合对象机器人空间，以及LAMP-Lazy：一个懒散规划器，通过延迟评估实现实时重新规划。在具有挑战性的模拟环境中的实验表明，我们的方法能够解决复杂且长期且高度拥乱的任务，这些任务是以往方法无法应对的。

EntroRouter: Learning Efficient Model Routing via Entropy Regulation

EntroRouter：通过熵调控学习高效的模型路由

Authors: Kaiyi Zhang, Xueliang Zhao, Zhuocheng Gong, Wei Wu, Yankai Lin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29424
Pdf link: https://arxiv.org/pdf/2606.29424
Abstract Model routing balances solution accuracy and computational cost by selecting among models of varying capabilities. While recent multi-round frameworks interleave reasoning and planning, we identify a structural failure mode termed Trust Region Collapse. We demonstrate that the deep coupling of reasoning and routing, exacerbated by the dominance of strong pre-training priors under sparse supervision, leads to degenerate local optima where capable experts are systematically suppressed. To decouple these processes, we propose $\textbf{EntroRouter}$, a single-round routing framework that treats entropy regulation as a core objective. We first initialize the policy via Soft Supervision, fitting a distribution of suitable models to establish a high-entropy prior for exploration. Subsequently, we stabilize Reinforcement Learning using a Soft Anchor, which utilizes offline capability estimates to orchestrate controlled entropy contraction within a safe trust region. Extensive experiments demonstrate that EntroRouter retains 98.3% of the strongest expert's accuracy while reducing computational costs by 48.25%.
中文摘要 模型路由通过在能力不同的模型之间进行选择，平衡求解准确性和计算成本。虽然近期的多轮框架将推理与规划交织在一起，但我们识别出一种结构性失败模式，称为信任域崩溃（Trust Region Collapse）。我们证明，推理与路由的深度耦合，加上稀疏监督下强预训练先验的主导作用，会导致退化的局部最优，使有能力的专家模型被系统性地压制。为了解耦这些过程，我们提出了$\textbf{EntroRouter}$，一个将熵调控视为核心目标的单轮路由框架。我们首先通过软监督（Soft Supervision）初始化策略，拟合合适模型的分布，从而建立有利于探索的高熵先验。随后，我们使用软锚点（Soft Anchor）稳定强化学习，该锚点利用离线能力估计，在安全的信任域内引导受控的熵收缩。大量实验表明，EntroRouter保持了最强专家98.3%的准确率，同时将计算成本降低了48.25%。

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

CRAFT：利用免费的同组（sibling）rollout进行反事实信用分配，用于自蒸馏的智能体强化学习

Authors: Zibin Meng, Kani Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29476
Pdf link: https://arxiv.org/pdf/2606.29476
Abstract Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a single scalar, the teacher-student log-probability gap. This signal is doubly limited: it is retrospective, scoring only the realised rollout and never the counterfactual ones, and it is sign-blind, never signalling when a teacher-preferred action would have harmed the trajectory. We introduce CRAFT, a three-pillar credit-assignment scheme that addresses both limitations. Pillar 1, Counterfactual Token Importance, reuses the G-1 sibling rollouts that GRPO already samples and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step; this yields a signed per-token credit at near-zero extra compute. Pillar 2 is an asymmetric controller that raises the distillation weight as it lowers the reference-KL weight along an exponential moving average of gate activity, and conversely. Pillar 3 polarises the KL penalty token by token, switching between a mode-seeking and a mode-covering update according to the sign of the credit. Each pillar has an independent switch that, when disabled, renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic, so any measured gain is attributable to algorithmic change rather than implementation drift. We prove the estimator's consistency and a variance bound, give structural and bit-exact reproducibility guarantees, and evaluate CRAFT across three agentic environments, four model scales, and five end-to-end methods, plus two tabulated prior-work baselines. Among these is Adaptive-CRINGE, a comparator sharing Pillar 2 with CRAFT, isolating the counterfactual contribution.
中文摘要 自提纯的代理强化学习通过代币级蒸馏损失来增强轨迹级奖励，其教师使用了基于特权上下文的相同策略。主流的做法是将这种损失以一个标量限制，即师生对数概率差距。这种信号有双重限制：它是事后性的，只评分已实现的推广，从不计入反事实的，且是信号盲点，从不标明教师偏好的行动何时会损害发展轨迹。我们引入CRAFT，一个三支柱学分分配方案，解决了这两个限制。第一支柱，反事实令牌重要性，重用GRPO已采样的G-1兄弟姐妹部署，并通过对数概率差距加权，形成自归一化估计，评估每步教师优先行动加权带来的群体层面反事实优势变化;这几乎为零额外计算，每个代币获得签名信用。第二支柱是一个非对称控制器，随着门极活动的指数移动平均降低参考KL权重，同时提高蒸馏权重，反之亦然。第三支柱通过令牌极化KL惩罚代币，根据信用符号在寻模式和覆盖模式更新之间切换。每个支柱都有一个独立开关，禁用时损耗和梯度字节与IEEE-754算术中的基线完全相同，因此任何测量到的增益都归因于算法变更，而非实现漂移。我们证明估计器的一致性和方差界限，提供结构性和位精确可重复性保证，并在三种代理环境、四个模型尺度、五种端到端方法以及两个已统计的前期基线中评估CRAFT。其中包括Adaptive-CRINGE，一个与CRAFT共享支柱2的比较器，分离了反事实贡献。

To Reason or to Fabricate: Reasoning Without Shortcuts via Hint-Anchored Pairwise Aggregation

推理还是编造：通过提示锚定的成对聚合实现无捷径推理

Authors: Jiuheng Lin, Chen Zhang, Yansong Feng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29481
Pdf link: https://arxiv.org/pdf/2606.29481
Abstract While reinforcement learning (RL) significantly enhances LLM reasoning, its efficacy is severely undermined by Pre-RL data overlap, where RL datasets overlap with pretraining or SFT corpora, causing models to exploit shortcuts by memorizing correct answers and fabricating post-hoc reasoning. To address this, we introduce HIPPO, a novel RL framework that integrates hint-injected aggregation with a tailored pairwise reward model. By utilizing hint injection to deliberately trigger overlap-induced behaviors, the resulting traces naturally serve as explicit anchors for pairwise comparison. This provides highly discriminable preference signals, enabling a lightweight judge model to reliably distinguish genuine reasoning deduction from shortcut-driven rationalization, while the pairwise formulation ensures stable and robust optimization compared to standard PRMs. Extensive experiments demonstrate that HIPPO yields substantial improvements over standard baselines and generalizes effectively to out-of-distribution general tasks, showing it extracts authentic, transferable reasoning skills rather than superficial shortcut patterns.
中文摘要 虽然强化学习（RL）显著增强了LLM的推理能力，但其效果因RL前数据重叠（Pre-RL data overlap）而被严重削弱：RL数据集与预训练或SFT语料库重叠，导致模型通过记忆正确答案并事后编造推理过程来利用捷径。为此，我们引入了HIPPO，一种新颖的强化学习框架，将提示注入聚合与定制的成对奖励模型相结合。通过利用提示注入刻意触发由重叠诱发的行为，所得的轨迹自然成为成对比较的显式锚点。这提供了高度可区分的偏好信号，使轻量级评判模型能够可靠地区分真正的推理演绎与捷径驱动的合理化，而成对形式相较于标准PRM保证了稳定且稳健的优化。大量实验表明，HIPPO相较标准基线有显著改进，并能有效泛化到分布外的通用任务，表明它提取的是真实且可迁移的推理技能，而不是表面的捷径模式。

UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

UCOB：通过信用感知的同策略（on-policy）双向自蒸馏，学习利用和演化智能体技能

Authors: Songjun Tu, Chengdong Xu, Qichao Zhang, Yiwen Ma, Yaocheng Zhang, Linjing Li, Dong Li, Xiangyuan Lan, Dongbin Zhao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29502
Pdf link: https://arxiv.org/pdf/2606.29502
Abstract Skill memories can improve agentic reinforcement learning by reusing past experience as textual guidance, but retrieved skills are not oracular: they may help in one state while misleading the same policy in another. This makes the common privileged-teacher assumption fragile, namely that a skill-conditioned prompt can be treated as a fixed teacher for the no-skill prompt. We introduce UCOB, a framework for learning to utilize and evolve agentic skills via credit-aware on-policy bidirectional self-distillation. UCOB treats skill-conditioned and no-skill prompts as two on-policy context views of the same model, compares their return-to-go within the same task and anchor state, and uses the higher-return view as the local teacher. This local credit signal internalizes useful skill-conditioned behavior, corrects misleading skill usage, and guides task/state skill memory updates, utility-aware retrieval, and reflection self-training. Experiments on agentic tasks, including ALFWorld, WebShop, and Search-QA, show that UCOB outperforms skill-free RL, skill-memory baselines, and self-distillation methods across model scales, with up to 23.5 and 18.0 point gains over SOTA baselines on ALFWorld and WebShop. Ablations and analyses further validate its core mechanisms and efficiency.
中文摘要 技能记忆可以通过重复利用过去经验作为文本指导来改善能动强化学习，但检索到的技能并非神谕式的：它们可能在某一状态中有所帮助，而在另一状态中误导同一政策。这使得常见的特权教师假设变得脆弱，即技能条件提示可以被视为无技能提示的固定教师。我们介绍UCOB，这是一个通过信用意识的政策双向自我蒸馏，学习利用和演进代理技能的框架。UCOB将技能条件提示和无技能提示视为同一模型的两个政策上下文视图，比较它们在同一任务和锚定状态下的返回，并以高回报视角作为本地教师。该局部信用信号内化有用的技能条件行为，纠正误导性的技能使用，指导任务/状态技能记忆更新、效用感知检索和反思自我训练。针对代理任务的实验，包括ALFWorld、WebShop和Search-QA，显示UCOB在模型尺度上优于无技能强化学习、技能记忆基线和自蒸馏方法，在ALFWorld和WebShop上相比SOTA基线提升了最多23.5点和18.0点。消融和分析进一步验证了其核心机制和效率。

Reinforcement Learning in Super Mario Bros: Curriculum, Pedagogy, and Optimal Level Design in World 1-1

《超级马里奥兄弟》中的强化学习：课程、教学法与世界1-1中的最优关卡设计

Authors: Jesse Ponnock, Lucas Ho
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.29511
Pdf link: https://arxiv.org/pdf/2606.29511
Abstract World 1-1 of Super Mario Bros is widely celebrated as a masterclass in game design: its progressive structure is credited with teaching players core mechanics through the level itself. We ask whether that structure is empirically measurable using reinforcement learning. We implement World 1-1 from scratch as a fully discrete environment and compare four algorithms -- Q-Learning, SARSA, Monte Carlo, and Deep Q-Network (DQN) -- across three progressively complex versions of the same level. Monte Carlo emerges as the strongest agent (94.9% ± 1.5% win rate), outperforming DQN (76.4% ± 3.4%) by learning to maximize intermediate rewards along winning paths rather than taking the most direct route. We then use Monte Carlo in a curriculum experiment permuting World 1-1's six canonical segments across twelve conditions. Canonical ordering converges fastest, achieves the highest learning efficiency, and is the only condition with zero catastrophic failures; no random permutation matches all three criteria simultaneously. These results provide, to the best of our knowledge, the first empirical validation that World 1-1's canonical design encodes genuine pedagogical structure: one that measurably accelerates learning and cannot be replicated by chance.
中文摘要 《超级马里奥兄弟》的1-1世界被广泛誉为游戏设计的典范课：其渐进式结构被认为通过关卡本身教会玩家核心机制。我们询问这种结构是否可以通过强化学习进行实证测量。我们将World 1-1从零开始实现为一个完全离散的环境，并在三个逐步复杂化的同一层级版本中比较了四个算法——Q-Learning、SARSA、Monte Carlo和Deep Q-Network（DQN）。蒙特卡洛成为最强代理（94.9% $\pm$ 1.5%胜率），通过学会最大化中级奖励而非走最直接路线，表现优于DQN（76.4% $\pm$ 3.4%）。然后我们在一个课程实验中使用蒙特卡洛，将世界1-1的六个典型片段置换成十二个条件。典范排序收敛最快，学习效率最高，且是唯一零灾难性失败的条件;没有随机排列能同时满足所有三个条件。据我们所知，这些结果首次实证验证了World 1-1的规范设计编码了真正的教学结构：一种可测量地加速学习且无法偶然复制的结构。
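
The abstract names the algorithms it compares but not their update rules. As a reminder of how three of them differ, here is a minimal tabular sketch of the textbook forms. The function names, hyperparameters, and the `(state, action)` dictionary layout are illustrative assumptions, not the paper's implementation, and DQN is omitted because it needs a neural network.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy TD update: bootstrap from the action actually taken next."""
    next_q = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])


def monte_carlo_update(Q, counts, episode, gamma=0.99):
    """First-visit Monte Carlo: average full returns, no bootstrapping.

    `episode` is a list of (state, action, reward) tuples for one finished run.
    """
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = r + gamma * G  # return from step t to the end of the episode
        if first_visit[(s, a)] == t:
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]


Q = defaultdict(float)
counts = defaultdict(int)
monte_carlo_update(Q, counts, [(0, "right", 0.0), (1, "right", 1.0)], gamma=0.9)
```

Because Monte Carlo updates toward complete episode returns, every reward collected along a winning path feeds directly into the value of earlier actions, which is at least consistent with the behavior the abstract reports.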

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

优化训练策略的幻影：单调推断策略作为大型语言模型强化学习的真正目标

Authors: Jing Liang, Hongyao Tang, Yi Ma, Yancheng He, Weixun Wang, Xiaoyang Li, Ju Huang, Wenbo Su, Jinyi Liu, Yan Zheng, Jianye Hao, Bo Zheng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.29526
Pdf link: https://arxiv.org/pdf/2606.29526
Abstract Reinforcement learning (RL) has gained growing attention in large language model (LLM) post-training, yet RL training remains fragile and can suffer from instability or collapse. One vital cause is training-inference mismatch: LLM RL systems adopt separate inference and training engines for generation efficiency and training precision, which in practice exhibit inconsistent probabilities for the same trajectories on the training and inference sides, even with synchronized model parameters. This naturally induces a special, ever-present type of off-policyness that poisons training. Prior works have made various efforts to address this off-policyness and stabilize the training policies under the mismatch. In this paper, we point out an objective misalignment neglected by existing works: an effective update to the policy in the training engine does not necessarily ensure the improvement of the inference policy, i.e., the one used in deployment. To this end, we propose a new policy optimization objective for LLM RL, named Monotonic Inference Policy Improvement (MIPI). Following this principle, we introduce Monotonic Inference Policy Update (MIPU), a two-step LLM RL framework that constructs sampler-referenced candidate updates and selectively accepts synchronized candidates using an inference-side gap proxy. Experiments conducted on two model scales under high mismatch show that MIPU improves average reasoning performance and training stability.
中文摘要 强化学习（RL）在大型语言模型（LLM）训练后逐渐受到关注，但强化学习依然脆弱，可能出现不稳定或崩溃。一个关键原因是训练-推断不匹配：LLM采用不同的推理和训练引擎以提升生成效率和训练精度，但实际上即使模型参数同步，训练和推理双方在相同轨迹上也存在不一致的概率。这自然会引发一种特殊的违规状态，持续存在，毒害培训。以往的工作已多次努力解决政策不符的问题，以稳定培训政策在不匹配情况下的表现。本文指出现有工作忽视的客观错位，即对训练引擎策略的有效更新并不一定能保证推理策略（即部署时的策略）的改进。为此，我们提出了一个新的LLM强化策略优化目标，称为单调推断策略改进（MIPI）。遵循这一原则，我们引入单调推理策略更新（MIPU），这是一个两步的大型语言模型强化学习框架，构建采样器引用的候选更新，并通过推理侧的间隙代理选择性接受同步候选。在两个模型尺度上进行的高不匹配实验显示，MIPU能提升平均推理表现和训练稳定性。

Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book

Persona训练的蒙特卡洛：通过限价单簿中大量Persona条件神经策略机器人估算市场结果分布

Authors: Salavat Ishbulatov
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.29556
Pdf link: https://arxiv.org/pdf/2606.29556
Abstract We propose Persona-Trained Monte Carlo (PTMC), a method for estimating distributions of market-outcome statistics by repeatedly simulating limit-order-book interaction among swarms of persona-conditioned neural-policy trading bots. Each run instantiates many bots sharing one trained policy network but conditioned on heterogeneous, individually sampled persona parameters drawn from a learned trader-heterogeneity distribution; the bots interact in a continuous double auction, and the resulting price path is one Monte Carlo sample. Repeating this over independent persona-population draws yields an ensemble from which a target market statistic is estimated. Randomness enters through persona draws, within-run action sampling, and optional exogenous shocks, not solely through price as in classical Monte Carlo. We distinguish PTMC from adjacent paradigms, including classical Monte Carlo, hand-coded agent-based models, single-agent reinforcement learning, and large-language-model-based generative agents. To justify the design, we survey cross-disciplinary foundations -- agent-based computational economics, market microstructure, behavioral finance, deep reinforcement learning, generative/LLM-based agents, news-driven trading, systemic risk, econophysics, and game theory -- connecting each literature to a specific design choice in the policy network, training data, or validation protocol. We formalize the PTMC estimator and its convergence properties, specify a candidate bot architecture and training objective, and propose a four-level validation methodology: stylized-fact matching, microstructure- and agent-level checks, and historical stress-test comparison against a zero-intelligence baseline. The framework is proposed but not implemented: we contribute a formal estimator, a cross-disciplinary design justification, and a validation roadmap, and conclude with open research questions.
中文摘要 我们提出了Persona-Trained Monte Carlo（PTMC），这是一种通过反复模拟群居条件神经政策交易机器人之间限价单相互作用来估算市场结果统计分布的方法。每次运行实例化多个共享一个训练策略网络的机器人，但条件是从学习到的交易者-异质性分布中提取的异构、个别抽样的人格参数;机器人在连续的双重拍卖中互动，最终的价格路径是一条蒙特卡洛样本。在独立的人物-人群抽取上重复此过程，得到一个集合，从中估算目标市场统计量。随机性通过角色抽取、赛中动作抽样和可选的外来震撼进入，而不仅仅是像经典蒙特卡洛那样通过价格。我们将PTMC与邻近范式区分开来，包括经典蒙特卡洛、手工编码的基于主体的模型、单主体强化学习以及基于大型语言模型的生成代理。为了证明设计的合理性，我们考察了跨学科基础——基于主体的计算经济学、市场微观结构、行为金融学、深度强化学习、生成式/基于大型语言模型的主体、新闻驱动交易、系统性风险、经济物理学和博弈论——并将每项文献与政策网络、训练数据或验证协议中的具体设计选择联系起来。我们形式化了PTMC估计器及其收敛性质，指定候选机器人架构和训练目标，并提出了四级验证方法：风格化事实匹配、微观结构和代理级检查，以及与零智能基线的历史压力测试比较。框架被提出但未实现：我们贡献了形式估计器、跨学科设计理由和验证路线图，并以开放性研究问题作结。

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

GUICrafter：弱监督的GUI代理，利用大量无注释截图

Authors: Sunqi Fan, Lingshan Chen, Runqi Yin, Qingle Liu, Yongming Rao, Meng-Hao Guo, Shi-Min Hu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.29705
Pdf link: https://arxiv.org/pdf/2606.29705
Abstract Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents through a similar paradigm. However, GUI agent data cannot be directly harvested from the internet, making it costly and difficult to collect at scale. As a result, current GUI agents suffer from poor cross-device generalization and limited visual grounding ability for fine-grained GUI elements. As an attempt to address the data challenge in GUI agents, we propose GUICrafter, a weakly-supervised GUI agent leveraging massive unannotated screenshots to substantially reduce the reliance on expensive human annotations. GUICrafter explores a curriculum learning framework for training GUI agents through two progressive stages. First, the model learns visual grounding from large-scale unannotated screenshots and webpages, leveraging the rich contextual signals inherent in GUI interactions without human annotations. Then, in Stage 2, we leverage a small amount of high-quality data to calibrate the model via reinforcement learning. Experiments show that GUICrafter achieves competitive, or even superior, performance to advanced systems like UI-TARS while using only 0.1% of its data. Furthermore, under the same amount of annotated data, GUICrafter surpasses all previous methods such as GUI-R1. Code, data, and models are available at this https URL.
中文摘要 数据作为现代智能的基础，极大地推动了当前基础模型的发展。自然，研究人员旨在将这一范式扩展到图形界面代理领域，希望通过类似的范式构建强大的图形界面代理。然而，GUI 代理数据无法直接从互联网采集，这使得大规模收集成本高昂且困难。因此，当前的图形界面代理存在跨设备泛化能力较差，且对细粒度图形界面元素的视觉接地能力有限。为了解决GUI代理的数据挑战，我们提出了GUICrafter，一个监督较弱的GUI代理，利用大量无注释截图大幅减少对昂贵人工注释的依赖。GUICrafter 探索了一套课程学习框架，通过两个渐进阶段培训 GUI 代理。首先，模型通过大规模无注释截图和网页学习视觉基础，利用无需人工注释的图形界面交互中蕴含的丰富上下文信号。然后，在第二阶段，我们利用少量高质量数据通过强化学习校准模型。实验显示，GUICrafter 在仅用 0.1% 的数据的情况下，就能与先进系统如 UI-TARS 竞争甚至更优。此外，在相同注释数据量下，GUICrafter超越了所有之前的方法，如GUI-R1。代码、数据和模型均可在此 https URL 访问。

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

为什么要与持续的潜在情绪挣扎？通过渲染压缩实现可解释的离散潜在推理

Authors: Shuochen Chang, Qingyang Liu, Shaobo Wang, Bingjie Gao, Qianli Ma, Haonan Zhao, Yibo Miao, Yulin Sun, Zelin Peng, Jiangtong Li, Li Niu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29712
Pdf link: https://arxiv.org/pdf/2606.29712
Abstract Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting computation into a latent space; however, continuous latent methods are hard to train, suffering from unstable and uninterpretable reasoning trajectories. We argue these issues stem from a misalignment between continuous-space reasoning and discrete symbolic supervision, as continuous states lack explicit anchors for step-by-step alignment. To resolve this, we propose Discrete Latent Reasoning (DLR), the first method that converts continuous latent states into explicit discrete tokens. Inspired by render-based compression, we render textual chains of thought into images, extract visual features, and construct a discrete latent vocabulary via clustering-based fine-tuning. Expanding the vocabulary and output head enables standard autoregressive modeling over both natural language and latent tokens, supporting pretraining alignment, SFT, and RL. Experiments on five reasoning benchmarks and two model series (Qwen3-VL and LLaMA-3) confirm that DLR outperforms prior latent reasoning baselines with up to 20× compression. Furthermore, the learned latent trajectories retain an interpretable semantic structure. Overall, discrete latent tokens provide a controllable and interpretable basis for efficient latent reasoning.
中文摘要 大型语言模型通过显式思维链和强化学习实现高推理性能，但需要较长的输出序列和更长的推理时间。潜在推理通过将计算过程转移到潜在空间来降低这一成本;然而，连续潜在方法难以训练，存在不稳定且无法解释的推理轨迹。我们认为这些问题源于连续空间推理与离散符号监督之间的不一致，因为连续状态缺乏明确的逐步对齐锚点。为解决此问题，我们提出了 \textbf{离散潜在推理~（DLR）}，这是第一个将连续潜态转换为显式离散标记的方法。受基于渲染的压缩启发，我们将文本思维链渲染到图像中，提取视觉特征，并通过基于聚类的微调构建离散的潜在词汇。扩展词汇和输出头支持对自然语言和潜在令牌进行标准自回归建模，支持预训练比对、SFT和强化学习。五个推理基准测试和两个模型系列~（Qwen3-VL和LLaMA-3）的实验证实，\textbf{DLR}在最高\textbf{20$\times$压缩}下优于先前的潜在推理基线。此外，学习的潜在轨迹保留了可解释的语义结构。总体而言，离散潜在标记为高效潜在推理提供了可控且可解释的基础。

PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

PS-PPO：前缀采样PPO，用于无评论RLHF

Authors: Doo Hwan Hwang, Kee-Eung Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29758
Pdf link: https://arxiv.org/pdf/2606.29758
Abstract Reinforcement Learning from Human Feedback (RLHF) for Large Language Models increasingly relies on critic-free methods as a practical alternative to actor--critic training. Despite their simplicity, existing critic-free approaches propagate a trajectory-level learning signal uniformly across all tokens in a trajectory. This requires full-trajectory policy updates for every rollout, leading to substantial optimization cost for long reasoning traces, even though intermediate prefixes often contain enough information to largely determine the final outcome. We propose Prefix-Sampling Proximal Policy Optimization (PS-PPO), a compute-efficient critic-free method for RLHF that exploits this temporal redundancy. PS-PPO introduces a prompt-conditioned cutoff distribution and samples a cutoff timestep for each trajectory. During the update pass, PS-PPO backpropagates only through the sampled prefix of each trajectory and applies an importance-weighting correction so that the resulting truncated gradient estimator remains unbiased with respect to the full-trajectory objective. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO achieves large reductions in training compute and peak GPU memory, while maintaining accuracy comparable to strong critic-free baselines.
中文摘要 来自人类反馈的强化学习（RLHF）针对大型语言模型，越来越依赖无批评方法作为演员-批评者训练的实用替代方案。尽管简单，现有无批评方法会在轨迹上的所有标记均匀传播轨迹级学习信号。这要求每次部署都进行完整轨迹策略更新，导致长推理轨迹的优化成本巨大，尽管中间前缀通常包含足够信息以大致决定最终结果。我们提出了前缀采样近端策略优化（PS-PPO），这是一种计算高效的无批判RLHF方法，利用了这一时间冗余。PS-PPO 引入了即时条件截止分布，并为每个轨迹采样截止时间步。在更新过程中，PS-PPO 仅通过每个轨迹的采样前缀反向传播，并对重要性加权进行修正，使得的截断梯度估计器相对于完整轨迹目标保持无偏。数学推理和RLHF基准测试的实验表明，PS-PPO在训练计算和峰值GPU内存方面实现了大幅减少，同时保持了与强无批判基线相当的准确性。
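
The abstract says the truncated estimator stays unbiased through an importance-weighting correction but does not give its form. One standard way to get that property, shown here as a hedged sketch rather than the paper's method, is to divide each kept token's contribution by the probability that the sampled cutoff reaches it. Scalar `token_terms` stand in for per-token gradient contributions, and a fixed `cutoff_probs` vector stands in for the prompt-conditioned cutoff distribution; both are assumptions for illustration.

```python
import numpy as np


def survival(cutoff_probs):
    """q[t] = P(cutoff >= t), where P(cutoff = t) = cutoff_probs[t]."""
    probs = np.asarray(cutoff_probs, dtype=float)
    return np.cumsum(probs[::-1])[::-1]


def truncated_estimate(token_terms, cutoff_probs, cutoff):
    """Sum only tokens 0..cutoff, each reweighted by 1 / P(cutoff >= t)."""
    terms = np.asarray(token_terms, dtype=float)
    q = survival(cutoff_probs)
    return float(np.sum(terms[: cutoff + 1] / q[: cutoff + 1]))


def expected_estimate(token_terms, cutoff_probs):
    """Exact expectation of the truncated estimator over the cutoff distribution."""
    return sum(
        p * truncated_estimate(token_terms, cutoff_probs, c)
        for c, p in enumerate(cutoff_probs)
    )


token_terms = [1.0, -2.0, 0.5, 3.0]
cutoff_probs = [0.4, 0.3, 0.2, 0.1]  # must put nonzero mass on the last position
full_sum = float(np.sum(token_terms))
mean_truncated = expected_estimate(token_terms, cutoff_probs)
```

Here `mean_truncated` matches `full_sum` because token t is kept with probability q[t] and scaled by 1/q[t]. The price is variance: late tokens are rarely kept and heavily upweighted when they are, so the choice of cutoff distribution matters in practice.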

MR-IQA: A Unified Margin View of Regression and Ranking for Blind Image Quality Assessment

MR-IQA：盲图质量评估中回归与排名的统一边际视图

Authors: Yuan Li, Youyuan Lin, Zitang Sun, Yung-Hao Yang, Kiyofumi Miyoshi, Chenhui Chu, Shin'ya Nishida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.29760
Pdf link: https://arxiv.org/pdf/2606.29760
Abstract Blind image quality assessment (BIQA) is commonly built on two basic learning paradigms: regression and ranking. Regression calibrates absolute scores, whereas ranking recovers quality structure from ordinal relations. Although joint regression-ranking supervision often improves BIQA, the relation between the two paradigms remains largely empirical and underexplored. In this work, we revisit what underlies regression and ranking and identify pairwise relational distance, termed quality margin, as their common bridge. Our derivation shows that, at the objective-optimization level, both paradigms fit quality margins: regression fits margins induced by score endpoints, while ranking fits transformed or sign-level margins through preference probabilities. Motivated by this insight, we propose MR-IQA, a direct quality-margin optimization framework for reinforcement learning (RL)-based BIQA. MR-IQA samples quality scores and optimizes pairwise margin errors as policy rewards, thereby modeling quality structure more explicitly. Experiments on six BIQA benchmarks show competitive general performance, and controlled comparisons demonstrate that MR-IQA achieves the strongest average PLCC/SRCC over regression- or ranking-based RL methods. Our findings provide a new insight into unifying regression and ranking, offering a theoretical basis for understanding quality-structure modeling in BIQA and beyond.
中文摘要 盲图质量评估（BIQA）通常基于两种基本学习范式：回归和排名。回归校准绝对得分，而排名则从序数关系中恢复质量结构。尽管联合回归排名监督常常能改善BIQA，但两者之间的关系仍主要处于实证层面且未被充分探讨。在本研究中，我们重新审视回归和排名的基础，并将两对关系距离，称为质量边际，作为它们的共同桥梁。我们的推导表明，在客观优化层面，两种范式都适合质量边界：回归拟合由分数端点诱导的边界，而排名拟合通过偏好概率拟合变换后或符号级边界。基于这一见解，我们提出了MR-IQA，一种基于强化学习（RL）的BIQA直接质量边际优化框架。MR-IQA采样质量评分，并将两两边际误差作为政策奖励优化，从而更明确地建模质量结构。六个BIQA基准测试的实验显示其具有竞争力的整体表现，受控比较显示MR-IQA在回归或基于排名的强化学习方法中实现了最强的平均PLCC/SRCC。我们的发现为统一回归与排名提供了新的见解，为理解BIQA及其他领域质量结构建模提供了理论基础。

SMART-MIG: A Learning Framework for Scalable and Energy-Efficient GPU Scheduling

SMART-MIG：一个可扩展且节能的GPU调度学习框架

Authors: Wenqing Yu, Neel Karia, Tanvi Hisaria, Clifford Stein, Olivier Tardieu, Asser Tantawi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.29775
Pdf link: https://arxiv.org/pdf/2606.29775
Abstract The emergence of Multi-Instance GPU (MIG) technology enables us to run smaller machine learning models on partitions of a GPU rather than the entire device, thus improving utilization and reducing energy consumption, albeit with potential performance trade-offs. Meanwhile, the growing energy demands of GPU-equipped data centers motivate the development of online partitioning and scheduling schemes that not only ensure fast job processing but also achieve high energy efficiency. However, achieving energy-tardiness efficiency with manageable algorithmic complexity in large-scale scheduling remains a great challenge, due to the dual objectives of deciding on the GPU partitions and scheduling jobs onto the slices of the heterogeneous partitions. To address this challenge, we propose SMART-MIG, a parallel computing system that combines Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) for large-scale MIG repartitioning with tailored heuristic algorithms for job scheduling. We demonstrate that the complexity of the repartitioning component remains constant even as the number of jobs and GPUs increases. We also establish theoretical lower bounds on energy consumption and tardiness to rigorously benchmark system performance. Finally, extensive experiments show that SMART-MIG improves the energy-tardiness efficiency by 18% compared to its corresponding static-partitioning counterpart, while being only 27% above the theoretical lower bound on energy consumption.
中文摘要 多实例GPU（MIG）技术的出现使我们能够在GPU的分区上运行更小的机器学习模型，从而提升利用率并降低能耗，尽管存在性能的潜在权衡。与此同时，GPU数据中心日益增长的能源需求推动了在线分区和调度方案的发展，这些方案不仅确保了快速的作业处理，还实现了高能效。然而，在大规模调度中实现能耗延迟效率和可控算法复杂性仍是巨大挑战，因为两者目标是决定GPU分区和将作业调度到异构分区的切片。为应对这一挑战，我们提出了SMART-MIG，这是一种并行计算系统，结合了大规模MIG重分的均场多智能体强化学习（MF-MARL）与定制的启发式作业调度算法。我们证明，即使作业和GPU数量增加，重分区组件的复杂度依然保持不变。我们还设定了能耗和延迟的理论下限，以严格基准测试系统性能。最后，大量实验表明，SMART-MIG相比其对应的静态分配版本，提高了18%$的能量延迟效率，同时仅比理论能耗下限高出27%$。

Accelerating Q-learning through Efficient Value-Sharing across Actions

通过高效跨行动的价值共享加速Q学习

Authors: Prabhat Nagarajan, Brett Daley, Martha White, Marlos C. Machado
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29806
Pdf link: https://arxiv.org/pdf/2606.29806
Abstract Action-values are foundational to many control algorithms such as Q-learning. Therefore learning action-values efficiently is central to reinforcement learning (RL). However, learning them can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state-action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion layer, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.
中文摘要 动作值是许多控制算法（如Q学习）的基础。因此，高效学习动作值是强化学习（RL）的核心。然而，学习它们的速度可能较慢，需要多次更新，才能将值从初始化（通常接近零）移动到真实值，而真实值可能远离零。此外，动作价值学习算法通常独立更新每个状态-动作对，而不会在状态内跨动作学习共享的价值结构。本文通过引入均值展开层来解决这些低效问题，该层通过在状态内的动作间共享值来加速动作值学习，并将问题从直接学习潜在较大的动作值转变为学习它们的低范数表示。在深度强化学习中，这一层可以作为无参数的补充应用到Q网络架构中，而无需改变底层算法。应用于深度Q网络和隐式分位数网络，它提升了57款Atari游戏的整体性能，同时扩大动作差距，显著减少价值高估。

Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning

一致性作为归纳偏见：学习交叉视图不变性以实现稳健的多模态推理

Authors: Xin Zou, Haolin Deng, Yibo Yan, Shuliang Liu, Kening Zheng, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.29812
Pdf link: https://arxiv.org/pdf/2606.29812
Abstract Inductive biases steer learning toward generalizable solutions by encoding task structure. In this work, we identify a crucial missing bias in MLLMs: cross-view consistency, i.e., semantically invariant views of the same instance should lead to the same answer. Standard reinforcement learning with verifiable rewards (RLVR) objectives do not impose this constraint, but instead assign pointwise rewards to each visual input. Even with data augmentation (DA), transformed views are typically rewarded independently, providing little signal once within-view rewards saturate. We propose ConsistRoll, a simple but effective method that injects cross-view consistency into RLVR training by reusing the group-sampling mechanism of GRPO. Specifically, ConsistRoll places original and semantically invariant transformed views in the same generation group, and assigns a joint reward only when paired completions are both correct and consistent. In this way, ConsistRoll turns consistency into an online credit-assignment signal, without extra generation overhead and annotations. Theoretically, we show that cross-view consistency is a valid inductive bias, and ConsistRoll introduces a cross-view correction term absent from DA, penalizing view dependence and alleviating advantage collapse. Comprehensive benchmarks across math, general-purpose, and hallucination domains confirm that ConsistRoll achieves robust improvements in multimodal reasoning.
中文摘要 归纳偏见通过编码任务结构引导学习朝向可推广的解。在本研究中，我们识别出MLLM中一个关键的缺失偏差：交叉视图一致性、\textit{即}，同一实例的语义不变视图应得出相同的答案。标准的可验证奖励强化学习（RLVR）目标不施加这种限制，而是为每个视觉输入分配逐点奖励。即使采用数据增强（DA），转换后的视图通常也会独立获得奖励，一旦视内奖励过饱和，信号就很少。我们提出了 \textbf{ConsistRoll}，这是一种简单但有效的方法，通过重用 GRPO 的组抽样机制，为 RLVR 训练注入交叉视图一致性。具体来说，ConsistRoll将原始视图和语义不变的转换视图置于同一生成组，只有当配对完备化都正确且一致时，才会分配联合奖励。通过这种方式，ConsistRoll将一致性转化为在线署名分配信号，\textbf{无需额外的生成开销和注释}。理论上，我们证明交叉视角一致性是一种有效的归纳偏见，ConsistRoll引入了一个DA中缺少的交叉视角修正项，惩罚视角依赖性并减轻优势崩溃。涵盖数学、通用和幻觉领域的综合基准测试证实，ConsistRoll在多模态推理方面取得了强劲的提升。
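
As a rough illustration of the joint reward described above, the sketch below pairs each original-view completion with a transformed-view completion, rewards a pair only when both answers are correct and agree, and then applies the group-normalized advantage used by GRPO. The pairing scheme, the exact-match correctness check, and all function names are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def joint_rewards(orig_answers, view_answers, gold):
    """Reward 1.0 only when both completions in a pair are correct and agree.

    With exact-match checking, agreement follows from correctness; the explicit
    `a == b` term is kept for cases where a more lenient verifier is swapped in.
    """
    return np.array([
        1.0 if (a == gold and b == gold and a == b) else 0.0
        for a, b in zip(orig_answers, view_answers)
    ])


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: rewards standardized within one generation group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


orig = ["B", "B", "A", "B"]   # answers on the original image
view = ["B", "C", "A", "B"]   # answers on a semantically invariant transform
rewards = joint_rewards(orig, view, gold="B")
advantages = group_advantages(rewards)
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])
```

The `saturated` case shows why pointwise rewards stop helping once every completion in a group is correct: all advantages become zero. A joint reward can keep the group non-degenerate for longer, since the second pair above is penalized even though its original-view answer is right.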

Dual-Flow Reinforcement Learning with State-Aware Exploration

双流强化学习与状态感知探索

Authors: Qijun Li, Zheng Fu, Qi Song, Yifei He, Weitao Zhou, Kun Jiang, Diange Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29820
Pdf link: https://arxiv.org/pdf/2606.29820
Abstract In complex continuous-control reinforcement learning tasks, multimodal optimal actions often coincide with uncertain, multimodal return distributions, making reliable value estimation and multimodal exploration challenging. Existing value estimation methods using unimodal Gaussians restrict expressiveness and yield biased estimates. Recent generative policies can represent multimodal actions but often collapse to a few modes and under-explore high-value areas of the action space. Motivated by these challenges, we propose Dual-Flow RL, a unified actor-critic framework that jointly models a continuous return distribution and a multimodal policy distribution using conditional flow matching (CFM). This design supports reliable value estimation and sustained multimodal exploration. To further enhance exploration, we introduce an Entropy-Covariance Exploration Regulator (ECER) that enables state-aware exploration regulation leveraging policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench show that Dual-Flow RL achieves state-of-the-art performance on most tasks, significantly outperforming prior diffusion-based and flow-based methods.
中文摘要 在复杂的连续控制强化学习任务中，多模态最优动作常与不确定的多模态收益分布重合，使得可靠的价值估计和多模态探索变得具有挑战性。现有使用单峰高斯量的值估计方法限制了表达性并产生偏置估计。近期的生成策略可以代表多模态动作，但往往简化为少数模式，且未能充分探索行动空间中高价值的领域。基于这些挑战，我们提出了双流强化学习（Dual-flow RL），这是一个统一的actor-critic框架，结合条件流匹配（CFM）建模连续返回分布和多模态策略分布。该设计支持可靠的价值估算和持续的多模态勘探。为进一步优化勘探，我们引入了熵协方差探索调节器（ECER），利用政策熵和行动不确定性协方差实现状态感知的勘探调控。DeepMind Control Suite 和 Humanoid-Bench 的实验表明，双流强化学习在大多数任务上都达到了最先进的性能，远超以往基于扩散和基于流程的方法。

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

KbSD：知识边界感知自我蒸馏，用于智能搜索中的行为校准

Authors: Tao Feng, Xinke Jiang, Chao Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29863
Pdf link: https://arxiv.org/pdf/2606.29863
Abstract Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric memory, when to rely on retrieved evidence, and when to abstain. Binary rewards can penalize undesirable outcomes, but provide little guidance on the reasoning process required to make calibrated decisions across different knowledge states. To address this, we propose KbSD (Knowledge boundary Self-Distillation), a framework that tackles this limitation through dense token-level supervision, outcome-level sparse rewards, and quadrant-adaptive optimization. KbSD constructs a hint-augmented teacher, architecturally identical to the student, that receives explicit knowledge boundary signals -- including parametric certainty, retrieval quality, and ground-truth answers -- to generate calibrated reasoning demonstrations. This information-asymmetric self-distillation enables dense supervision without requiring a larger external model. To further account for the heterogeneous reasoning distributions across knowledge states, we introduce a quadrant-adaptive distillation objective: reverse KL for concentrated integration, forward KL for diverse refusal, and Pareto-optimal bidirectional KL for asymmetric quadrants requiring both precision and coverage. Experiments on multiple benchmarks show that KbSD consistently improves both task accuracy and hallucination mitigation over strong baselines, with the largest gains appearing in the challenging quadrants where sparse rewards are least informative.
中文摘要 代理搜索赋予大型语言模型动态检索能力，但现有强化学习方法仍受限于知识边界校准中的奖励稀疏——决定何时信任参数记忆，何时依赖检索证据，何时弃用。二元奖励可能会惩罚不良结果，但对在不同知识状态之间做出校准决策所需的推理过程几乎没有指导。为此，我们提出了KbSD（知识边界自我蒸馏）框架，通过密集的代币级监督、结果级稀疏奖励和象限自适应优化来解决这一局限。KbSD构建了一个提示增强教师，其架构与学生完全相同，接收显式知识边界信号——包括参数确定性、检索质量和真实答案——以生成校准的推理演示。这种信息非对称自蒸馏使得密集监督成为可能，而无需更大的外部模型。为进一步解释知识状态间的异质推理分布，我们引入象限自适应蒸馏目标：集中积分采用反KL，多样性拒绝采用正向KL，非对称象限需精确度和覆盖度的帕累托最优双向KL。多个基准测试的实验显示，KbSD在强基线下持续提升任务准确性和幻觉缓解，最大收益出现在奖励稀疏且信息量最差的挑战象限。
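
The distinction between forward and reverse KL that the abstract relies on can be made concrete with a small numerical sketch. Forward KL, KL(teacher || student), penalizes the student for missing any mode the teacher covers, while reverse KL, KL(student || teacher), penalizes the student for putting mass where the teacher has little. The `distill_loss` helper and its fixed mixing `weight` are assumptions for illustration only; the abstract does not say how the Pareto-optimal bidirectional objective is weighted.

```python
import numpy as np


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def distill_loss(teacher, student, mode, weight=0.5):
    """Hypothetical per-token distillation loss selected by mode."""
    forward = kl(teacher, student)   # mass-covering
    reverse = kl(student, teacher)   # mode-seeking
    if mode == "forward":
        return forward
    if mode == "reverse":
        return reverse
    if mode == "bidirectional":
        return weight * forward + (1.0 - weight) * reverse
    raise ValueError(f"unknown mode: {mode}")


teacher = [0.5, 0.5]   # teacher spreads mass over two tokens
student = [0.9, 0.1]   # student concentrates on one of them
forward_loss = distill_loss(teacher, student, "forward")
reverse_loss = distill_loss(teacher, student, "reverse")
mixed_loss = distill_loss(teacher, student, "bidirectional")
```

In this example the forward loss is larger than the reverse loss, because the student nearly drops a token the teacher considers equally likely. That asymmetry is one reason to pick the direction per knowledge state rather than globally.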

RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning

RoAd-RL：一个统一的库和强健对抗强化学习的基准

Authors: Adithya Mohan, Daniel Kriegl, Torsten Schön
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29867
Pdf link: https://arxiv.org/pdf/2606.29867
Abstract Deep Reinforcement Learning (DRL) has achieved significant success in robotics and autonomous systems, yet remains vulnerable to adversarial perturbations that can severely degrade performance. Research in adversarial reinforcement learning is often limited by fragmented implementations, inconsistent evaluation protocols, and poor reproducibility. To address these challenges, we present RoAd-RL, an open-source benchmarking framework that provides unified abstractions for policies, attacks, defenses, and robustness metrics, together with reproducible evaluation pipelines and seamless integration with Stable-Baselines3 and Gymnasium. We evaluate DQN, PPO, and SAC agents in LunarLander and Highway-v0 under 192 attack-defense configurations. Results reveal substantial variations in robustness across environments and show that some commonly used defenses can be more detrimental than the attacks they aim to mitigate, while temporal smoothing consistently achieves strong performance. RoAd-RL establishes a standardized benchmark for adversarial reinforcement learning research and is publicly available at this https URL.
中文摘要 深度强化学习（DRL）在机器人和自主系统领域取得了显著成功，但仍易受到对抗性扰动的影响，这些扰动可能严重降低性能。对抗性强化学习的研究常受限于实现碎片化、评估方案不一致以及重复性差。为应对这些挑战，我们推出了 \textbf{RoAd-RL}，一个开源基准测试框架，提供策略、攻击、防御和鲁棒性指标的统一抽象，并具备可复现的评估流水线，并与 Stable-Baselines3 和 Gymnasium 无缝集成。我们在192种攻防配置下评估了LunarLander和Highway-v0中的DQN、PPO和SAC代理。结果显示不同环境的鲁棒性差异显著，并表明某些常用防御手段可能比其旨在缓解的攻击更具破坏性，而时间平滑则持续获得强劲表现。RoAd-RL 建立了对抗性强化学习研究的标准化基准，并可在此 https URL 公开获取。

AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes

AI训练管理器：有界闭环控制自适应训练配方

Authors: Anjali Rao, Nikhil Kamalkumar Advani
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29871
Pdf link: https://arxiv.org/pdf/2606.29871
Abstract We present the AI Training Manager, a bounded LLM-based supervisory controller for adaptive machine learning training. Standard training pipelines often rely on fixed recipes or single-axis schedulers, which can struggle with mid-run failures such as severe overfitting, loss imbalance, exploration collapse, or unsafe exploration. Rather than replacing mathematical optimizers or acting as an unconstrained coding agent, the manager operates through a schema-conditioned interface: it reads structured telemetry snapshots from an active run, audits a constrained action space, and returns validated updates to training parameters such as learning rate, regularization strength, loss-weight coefficients, and exploration settings. We evaluate this architecture across supervised language modeling and reinforcement learning. On TinyStories, the manager detects and corrects overfitting, achieving a validation loss 60% lower than the baseline while producing auditable intervention logs. In this supervised setting, we additionally show that manager inference does not need to block the training loop: training can continue while a manager response is pending, and validated updates can be applied asynchronously once available. In a robotic manipulation reinforcement-learning task, we use the same bounded decision interface in an episodic closed-loop setting, where manager updates are applied at evaluation or checkpoint boundaries. The manager mitigates both conservative and unsafe exploration regimes. These results suggest that schema-conditioned LLMs can serve as bounded supervisory managers for live training runs, complementing conventional optimizers and schedulers with interpretable, multi-axis intervention capabilities.
中文摘要 我们介绍AI训练管理器，一款基于有限LLM的自适应机器学习训练监督控制器。标准训练流程通常依赖固定配方或单轴调度器，这些流程中途可能出现严重过拟合、损失失衡、勘探崩溃或不安全勘探等故障。管理器不替代数学优化器或作为无约束编码代理，而是通过模式条件界面操作：它读取活跃运行中的结构化遥测快照，审计受限动作空间，并返回经过验证的训练参数更新，如学习率、正则化强度、损权系数和探索设置。我们评估了该架构在监督式语言建模和强化学习中的表现。在TinyStories中，管理者检测并纠正过拟合，使验证损失比基线低60%，同时产生可审计的干预日志。在这种监督式环境中，我们还展示了经理推断不必阻断培训循环：培训可以在经理响应待处理时继续进行，验证后的更新一旦可用，可以异步应用。在机器人操作强化学习任务中，我们在情节闭环环境中使用相同的有界决策接口，管理者更新在评估或检查点边界处应用。管理方案既能缓解保守的勘探方式，也能缓解不安全的探险方案。这些结果表明，模式条件LLM可以作为实时训练运行的有界监督管理器，补充传统优化器和调度器，具备可解释的多轴干预能力
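
To make the idea of a bounded, schema-conditioned interface more concrete, here is a minimal sketch of how proposed parameter updates might be validated before they reach a live run. The schema contents, parameter names, bounds, and function names are all hypothetical; the abstract only states that updates are validated against a constrained action space.

```python
# Hypothetical schema: parameter name -> (lower bound, upper bound).
SCHEMA = {
    "learning_rate": (1e-6, 1e-2),
    "weight_decay": (0.0, 0.3),
    "entropy_coef": (0.0, 0.1),
}


def validate_update(proposed, schema=SCHEMA):
    """Split a proposed update into accepted values and rejection reasons."""
    accepted, rejected = {}, {}
    for name, value in proposed.items():
        bounds = schema.get(name)
        if bounds is None:
            rejected[name] = "not in action space"
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            rejected[name] = "not a number"
        elif not bounds[0] <= value <= bounds[1]:
            rejected[name] = f"outside [{bounds[0]}, {bounds[1]}]"
        else:
            accepted[name] = float(value)
    return accepted, rejected


def apply_update(config, proposed, schema=SCHEMA):
    """Return a new config with only validated changes, plus an audit record."""
    accepted, rejected = validate_update(proposed, schema)
    return {**config, **accepted}, rejected


config = {"learning_rate": 1e-3, "weight_decay": 0.0}
proposed = {"learning_rate": 3e-4, "weight_decay": 5.0, "batch_size": 64}
new_config, audit = apply_update(config, proposed)
```

Rejected entries are returned rather than silently dropped, so they can be written to an intervention log. Returning a new dictionary instead of mutating `config` also fits the asynchronous setting the abstract mentions, where an update may arrive while training is still reading the old values.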

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models

相信你的直觉：基于信心的测试时间强化学习，适用于视觉-语言-行动模型

Authors: Siyao Chen, Jiakang Yuan, Jiaxin Wang, Tao Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29892
Pdf link: https://arxiv.org/pdf/2606.29892
Abstract Reinforcement learning (RL) has become indispensable for pushing Vision-Language-Action Models (VLAs) beyond static imitation learning. However, existing RL methods typically require external environmental feedback, relying on predefined success signals to guide policy updates. In this work, we show that VLA models possess useful internal evaluative capabilities: in discrete-action VLAs, trajectories with higher generation confidence are significantly more likely to succeed. Based on this observation, we introduce T^2VLA (Test-time VLA), an architecture-agnostic test-time RL framework that enables VLA models to achieve self-bootstrapping policy improvement. Instead of relying on external rewards, T^2VLA leverages trajectory-level similarity to high-confidence expert demonstrations as an intrinsic reward signal. In addition, we propose a Confidence-Driven Dual Expert Bootstrapping mechanism, which dynamically balances a Local Pseudo-Expert for exploration and a Global Expert Pool for training stability. Extensive experiments on the LIBERO and RoboTwin benchmarks show that T^2VLA consistently outperforms supervised baselines and approaches oracle RL performance with ground-truth rewards, achieving effective improvement without external reward feedback. Furthermore, T^2VLA adapts to distinct VLA paradigms, including both OpenVLA-OFT and the pi series.
中文摘要 强化学习（RL）已成为推动视觉-语言-行动模型（VLA）超越静态模仿学习的不可或缺手段。然而，现有的强化学习方法通常需要外部环境反馈，依赖预设的成功信号来指导政策更新。本研究表明VLA模型具备有用的内部评估能力：在离散动作VLA中，生成置信度更高的轨迹成功率显著更高。基于这一观察，我们引入了T^2VLA（测试时VLA），这是一个与架构无关的测试时RL框架，使VLA模型能够实现自我引导策略改进。T^2VLA不依赖外部奖励，而是利用轨迹层面与高置信度专家演示的相似性作为内在奖励信号。此外，我们提出了一种信心驱动双专家自助机制，动态平衡一个用于探索的本地伪专家和一个用于训练稳定性的全球专家池。在LIBERO和RoboTwin基准测试上的大量实验表明，T^2VLA始终优于监督基准，并以真实奖励接近oracle强化学习表现，无需外部奖励即可实现有效改进。此外，T^2VLA适应不同的VLA范式，包括OpenVLA-OFT和pi系列。
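
As a rough illustration of the trajectory-level confidence idea in the abstract above, the sketch below scores a trajectory by the mean log-probability of its sampled action tokens and keeps the high-scoring ones. The function names, the averaging rule, and the threshold are assumptions made for illustration; the abstract does not say how T^2VLA actually computes confidence.

```python
import math

def trajectory_confidence(token_probs):
    """Mean log-probability of the sampled action tokens (assumed metric)."""
    if not token_probs:
        raise ValueError("trajectory has no action tokens")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def select_pseudo_experts(trajectories, threshold):
    """Keep trajectories whose confidence exceeds a hypothetical threshold."""
    return [t for t in trajectories if trajectory_confidence(t) > threshold]

# Toy per-token probabilities for two rollouts (made-up numbers).
confident = [0.9, 0.95, 0.85]
hesitant = [0.4, 0.5, 0.3]
kept = select_pseudo_experts([confident, hesitant], threshold=math.log(0.7))
```

Under these toy numbers only the first rollout passes the filter; in a real setting the threshold would have to be tuned against observed success rates.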

StrucTab: A Structured Optimization Framework for Table Parsing

StrucTab：一个用于表解析的结构化优化框架

Authors: Gengluo Li, Shangpin Peng, Chengquan Zhang, Binghong Wu, Hao Feng, Weinong Wang, Pengyuan Lyu, Huawen Shen, Xingyu Wan, Zhuotao Tian, Han Hu, Can Ma, Yu Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.29905
Pdf link: https://arxiv.org/pdf/2606.29905
Abstract Table parsing aims to convert table images into structured, machine-readable representations, a task requiring the joint perception of complex spatial layouts and textual content. While recent vision-language models (VLMs) enable end-to-end parsing, they typically rely on direct supervision of the final output, thereby bypassing the explicit intermediate reasoning that is crucial for understanding complex table structures. Furthermore, attempts to optimize these models using reinforcement learning (RL) are often hindered by unstable or ambiguous reward designs, limiting potential performance gains. To address these limitations, we propose StrucTab, a table parsing model learned through intermediate structural supervision and reward decomposition. At the modeling level, by decomposing the parsing process into human-inspired subtasks, such as row-column counting and merged-cell analysis, StrucTab progressively unifies them through a sequential reasoning strategy. At the optimization level, we introduce Uni-TabRL, a unified RL framework that leverages decomposed rewards (validity, structure, and content) to provide stable and informative optimization signals. Finally, at the evaluation level, we present TableVerse-5K, a large-scale, challenging benchmark encompassing diverse, real-world table scenarios. Extensive experiments demonstrate the state-of-the-art performance of StrucTab across all evaluated public benchmarks and significant improvements on TableVerse-5K, validating the effectiveness of explicit structural modeling and decomposed reward optimization. Code and benchmark are publicly available at this https URL.
中文摘要 表格解析旨在将表格图像转换为结构化、机器可读的表示，这需要对复杂空间布局和文本内容的共同感知。虽然最新的视觉语言模型（VLMs）支持端到端解析，但它们通常依赖对最终输出的直接监督，从而绕过了理解复杂表格结构所需的显式中间推理。此外，利用强化学习（RL）优化这些模型的尝试常常受到不稳定或模糊的奖励设计阻碍，限制了潜在的性能提升。为解决这些局限性，我们提出了StrucTab，一种通过中间结构监督和奖励分解学习的表解析模型。在建模层面，通过将解析过程分解为人为驱动的子任务，如行列计数和合并单元分析，StrucTab通过顺序推理策略逐步统一它们。在优化层面，我们引入了Uni-TabRL，一个统一的强化学习框架，利用分解后的奖励（效度、结构和内容）提供稳定且信息丰富的优化信号。最后，在评估层面，我们呈现TableVerse-5K，这是一个涵盖多样真实表格场景的大规模且具有挑战性的基准测试。大量实验展示了StrucTab在所有评估的公开基准测试中的最先进性能，以及TableVerse-5K的显著改进，验证了显式结构建模和分解奖励优化的有效性。代码和基准测试可在该 https URL 公开获取。

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

迈向对齐动力学的物理直觉：以随机性结晶为例的案例研究

Authors: Kunal Samanta, Ari Holtzman, Peter West
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29933
Pdf link: https://arxiv.org/pdf/2606.29933
Abstract The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and thermodynamic phase-transition theory in particular, offer a principled and underexplored vocabulary for reasoning about these dynamics. As a case study, we instantiate this position through the lens of material crystallization, a well-studied thermodynamic phase transition. For tasks like random number generation, this breaks into three phases: (1) the high-entropy liquid phase in the pretrained model, with many distinct sampling distributions promptable from the model; (2) the nucleation phase caused by supervised finetuning, in which behavior collapses onto a single seed distribution present in the pretrained LLM; and (3) a settling phase in which reinforcement learning techniques redistribute the probability of the collapsed distribution, but largely keep it concentrated on the same options as the seed distribution. We propose intuitive metrics to verify the transitions between these phases, and validate the idea across a range of random tasks. Crystallization is one instance of a broader class of physical frameworks we believe alignment research should import to answer questions about where alignment-induced structure comes from, why it converges where it does, and what it fundamentally cannot change.
中文摘要 语言模型的对齐通常通过能力基准的视角研究，但模型在训练后的变化动态仍难以理解。我们认为，物理科学，尤其是热力学相变理论，提供了一套原则性且未被充分探讨的推理这些动力学的词汇。作为案例研究，我们通过材料结晶这一经过充分研究的热力学相变来阐述这一立场。对于随机数生成等任务，这分为三个阶段：（1）预训练模型中的高熵液体相，模型可提示许多不同的抽样分布;（2）由监督微调引起的成核阶段，行为会归结为预训练LLM中存在的单一种子分布;以及（3）一个稳定阶段，强化学习技术重新分配了崩溃分布的概率，但主要集中于与种子分布相同的选项。我们提出了直观的指标来验证这些阶段之间的转换，并在各种随机任务中验证这一想法。结晶是我们认为比对研究应当引入的更广泛物理框架类别中的一个实例，以解答关于比对诱导结构的来源、为何会在某处汇聚以及其根本无法改变的部分。

RoamFlow: Reinforcement-Aligned One-Step Action MeanFlow Policy for Image-Goal Navigation

RoamFlow：基于强化对齐的一步动作MeanFlow策略，用于图像目标导航

Authors: Zixuan Zhang, Yuqi Chen, Junjie Gao, Siyuan Song, Yongzhou Pan, Beichen Wang, Mir Feroskhan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.29934
Pdf link: https://arxiv.org/pdf/2606.29934
Abstract Image-goal navigation is a key challenge in embodied robotics, where an agent must reach a target specified solely by a goal image. While existing reinforcement learning approaches map perceptual observations directly to actions, they struggle to model long-horizon dependencies, often leading to suboptimal trajectories. To address this limitation, we propose RoamFlow, a generative navigation framework that leverages MeanFlow to predict the average velocity field for trajectory synthesis, enabling efficient few-step generation and reducing inference latency. We further adopt a two-stage training strategy that combines expert imitation for stable initialization with reinforcement learning for task-specific policy refinement. Extensive experiments in both Habitat simulation and real-world robotic platforms demonstrate that RoamFlow achieves efficient inference while maintaining strong navigation performance under real-time constraints.
中文摘要 图像-目标导航是具身机器人技术中的一个关键挑战，智能体必须达到仅由目标图像指定的目标。现有的强化学习方法将感知观察直接映射到行动，但它们难以建模长视野依赖关系，常导致轨迹不理想。为解决这一限制，我们提出了RoamFlow，一种生成式导航框架，利用MeanFlow预测轨迹综合的平均速度场，实现高效的少步生成并降低推理延迟。我们进一步采用两阶段训练策略，结合专家模仿以实现稳定初始化，并通过强化学习进行任务特定策略的细化。在Habitat模拟和现实机器人平台上的广泛实验表明，RoamFlow在实时约束下实现了高效的推断，同时保持了强大的导航性能。

LatentRevise: Learning from Zero-Hit Reasoning

LatentRevise：从零命中推理中学习

Authors: Yiqiu Guo, Xueting Han, Qi Jia, Guangtao Zhai, Jing Bai
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.29938
Pdf link: https://arxiv.org/pdf/2606.29938
Abstract Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little useful signal. We frame such zero-hit prompts as RLVR's sampling frontier, where new reasoning behavior is most valuable yet least likely to be sampled. Importantly, failed rollouts can be informative: they expose where the model's reasoning went wrong. We introduce LatentRevise, a first-order latent revision method that recovers training signal for this zero-hit regime. Given a failed rollout and the gold answer as an anchor, LatentRevise optimizes the input embeddings of its reasoning prefix under two complementary gradients, moving the prefix away from the failed continuation and toward the gold answer. The optimization is constrained to the convex hull of the model's vocabulary embeddings, so each update moves the latent toward a real token embedding rather than an arbitrary feature direction. We find that continuations from the revised prefix lengthen, exhibit self-reflection, and reach correct answers missed by the original rollouts. Used as training data, these trajectories improve SFT and RLVR on math benchmarks over standard baselines.
中文摘要 带有可验证奖励的强化学习（RLVR）被硬提示限制，这些提示正确轨迹概率较低，因此抽样在实际预算内会错过这些提示，导致政策更新时几乎没有有用的信号。我们将此类零命中提示框定为RLVR采样前沿，即新推理行为最有价值但最不可能被采样。重要的是，失败的推广可以带来信息：它们揭示了模型推理出错的地方。我们介绍了LatentRevise，一种一阶潜在修正方法，能够恢复该零命中状态的训练信号。在推测失败且金色答案作为锚点的情况下，LatentRevise 会优化推理前缀的输入嵌入，使其在两个互补梯度下，将前缀从失败延续方向移向金色答案。优化受限于模型词汇嵌入的凸包，因此每次更新都使潜在嵌入趋向真实的令牌嵌入，而非任意特征方向。我们发现，修订前缀的延续会延长，展现自我反思，并达到原始推出中遗漏的正确答案。作为训练数据，这些轨迹在数学基准测试上提升了SFT和RLVR相较于标准基线的表现。

Exploration and Online Transfer with Behavioral Foundation Models

基于行为基础模型的探索与在线转移

Authors: Louis Bagot (SyCoSMA), Mathieu Lefort (LIRIS, SyCoSMA, IRISA, MALT, UR), Laëtitia Matignon (SyCoSMA)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.29980
Pdf link: https://arxiv.org/pdf/2606.29980
Abstract Zero-shot Transfer in Reinforcement Learning (RL) aims to train an agent that can generate optimal policies for any reward function, without additional learning at transfer time, while training only on reward-free trajectories. For their generality over tasks, such models are sometimes called "Behavioral Foundation Models" (BFMs). While they have shown strong performance and improvements in recent years, the current framework and algorithms still assume that, during the transfer phase, the agent is informed offline about the reward (the task to solve) through a dataset of state-reward pairs, which it uses to pick the best policy to deploy. However, in practice, if the reward is a black box (e.g., direct user feedback), it is not possible to generate such a dataset: the reward must be observed through interactions with the environment. In other words, the current framework of offline transfer is not aligned with the traditional RL setting of online learning through trial and error, which requires exploration in order to find rewards. This paper proposes to tackle this new online transfer in zero-shot RL, with the key insight that the BFM itself can be used to generate exploration policies. We show that it is possible to frame this online learning problem in terms of a bandit-like exploration-exploitation problem. More precisely, at each step the bandit algorithm recommends a policy, the BFM executes it in the environment, which yields a reward and a new state; we repeat the process until we converge to the optimal policy. In the popular context of linear reward approximation, we derive a formulation inspired by Upper Confidence Bound and show that exploration can be achieved through the minimization of the eigenvalues of an uncertainty matrix. We evaluate our framework qualitatively and quantitatively on a simple environment to validate the concept of our method.
中文摘要 强化学习（RL）中的零样本转移旨在训练一个能够为任意奖励函数生成最优策略的智能体，无需在转移时额外学习，同时仅在无奖励轨迹上进行训练。由于其对任务的通用性，这类模型有时被称为“行为基础模型”（BFMs）。尽管近年来表现出强劲的性能和改进，但当前框架和算法仍假设在转移阶段，智能体通过状态-奖励对数据集离线告知奖励（需解决的任务），并据此选择最佳策略部署。然而，在实际操作中，如果奖励是黑箱（例如直接用户反馈），就无法生成这样的数据集：必须通过与环境的交互来观察奖励。换句话说，当前的离线迁移框架与传统的通过试错法进行在线学习的强化学习环境不一致，后者需要探索以寻找回报。本文提出在零样本强化学习中解决这一新的在线传输，关键见解是BFM本身可用于生成探索策略。我们证明，可以将该在线学习问题框架为类似盗贼的探索-利用问题。更准确地说，每一步bandit算法都会推荐策略，BFM在环境中执行该策略，从而获得奖励和新状态;我们反复进行这个过程，直到收敛到最优策略。在线性奖励近似的流行语境中，我们推导出受上置信界启发的表述，并证明可以通过最小化不确定矩阵的特征值来实现探索。我们在一个简单的环境中对我们的框架进行定性和定量评估，以验证我们方法的理念。
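
The recommend-execute-observe loop described in the abstract above can be sketched with a textbook bandit. The example below uses plain UCB1 over a small, finite set of candidate policies; this is a deliberately simpler stand-in, not the linear-reward formulation with an uncertainty matrix that the paper derives. The reward callables are hypothetical placeholders for "let the BFM run policy i and report the observed reward".

```python
import math

def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 score; untried arms go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def run_bandit(reward_fns, steps):
    """Repeatedly pick a policy, observe its reward, and update running means."""
    counts = [0] * len(reward_fns)
    means = [0.0] * len(reward_fns)
    for t in range(1, steps + 1):
        i = ucb1_select(counts, means, t)
        r = reward_fns[i]()            # stand-in for executing policy i
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return counts, means

# Hypothetical deterministic rewards for three candidate policies.
policies = [lambda: 0.2, lambda: 0.8, lambda: 0.5]
counts, means = run_bandit(policies, steps=200)
```

With these toy rewards the second policy ends up being selected most often, while the other two are still sampled occasionally because of the exploration bonus.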

Be Faithful When Response: Returning Fluent and Grounded Answers for Vision-Language Models Reinforcement Learning

回答时要忠实：为视觉语言模型强化学习反馈流畅且扎根的答案

Authors: Peng, Lee, Yin Zhang, Yanglin Zhang, Haonan Wu, Zishan Liu, Ruoxi Zang, Xin Zhu, Jiayin Zheng, Jian Yao, Zefeng Ji, Fei Ma
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29984
Pdf link: https://arxiv.org/pdf/2606.29984
Abstract Reinforcement Learning (RL) is an important paradigm for improving the reasoning capabilities of Vision-Language Models (VLMs). However, directly applying RL to multimodal reasoning rollouts can lead to instability, due to the exploitation of language priors, the neglect of visual evidence, and the generation of reasoning traces that are fluent yet not visually grounded. The question arises: can we first steer the policy toward a visually faithful reasoning regime before applying reinforcement learning? To this end, we propose a Faithful Warm-Start (FWS) strategy that first curates samples with explicit vision-language causal relationships from six general VQA benchmarks to construct the FaithfulQA dataset, where each image-question pair is paired with visual observations, question requirements, commonsense knowledge, domain knowledge, and the final answer. Subsequently, a VLM-based judge is employed to further purify the dataset, ensuring strong causal consistency and visual faithfulness. This warm-start stage equips the model with the capability to understand causally grounded vision-language patterns before subsequent RL optimization under sparse answer-level rewards. Experimental results show that such faithful supervision improves answer accuracy, stabilizes RL training, and reduces visually unsupported reasoning.
中文摘要 强化学习（RL）是提升视觉语言模型（VLMs）推理能力的重要范式。然而，直接将强化学习应用于多模态推理，可能导致不稳定，因为语言先验被利用，忽视视觉证据，以及产生流畅但缺乏视觉基础的推理痕迹。问题是：在应用强化学习之前，能否先将策略引导到视觉忠实的推理体系？为此，我们提出了一种忠实热启（FWS）策略，首先从六个一般VQA基准中筛选具有明确视觉-语言因果关系的样本，构建FaithfulQA数据集，每个图像-问题对分别获得一定程度的视觉观察、问题要求、常识知识、领域知识以及最终答案。随后，聘请基于VLM的评判人员进一步净化数据集，确保因果一致性和视觉准确性。该热启动阶段使模型具备理解因果基础视觉语言模式的能力，然后在回答层面奖励稀疏的情况下进行后续强化学习优化。实验结果显示，这种忠实的监督能提高答案准确性，稳定强化学习训练，并减少视觉上无依据的推理。

ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning

ACPO：多智能体强化学习的代理链式策略优化

Authors: Daiki E. Matsunaga, Junho Na, Tri Wahyu Guntara, Scott Sanner, Pascal Poupart, Jongmin Lee, Kee-Eung Kim
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30072
Pdf link: https://arxiv.org/pdf/2606.30072
Abstract Cooperative tasks in Multi-Agent Reinforcement Learning (MARL) require agents to collectively maximize a shared return. Under the Centralized Training with Decentralized Execution (CTDE) paradigm, policy gradients have remained difficult to compute directly. Prior methods largely follow two approaches: independent factorized updates with centralized critics, which lack general joint-improvement guarantees without value decomposition assumptions, or alternating best-response updates, which can converge to suboptimal Nash Equilibria. In this paper, we show the joint policy gradient admits an exact decentralized decomposition of per-agent terms, each formed from per-agent score functions and decentralized critics. Based on this decomposition, we develop Agent-Chained Policy Optimization (ACPO), where actors are trained independently, with their updates together constituting a single step on the joint policy gradient. Central to this result is a serialized view of the simultaneous joint decision in which agents commit actions one at a time, each conditioning on a belief over preceding actions. The belief acts as the coordination mechanism which ties the independent per-agent updates into a joint gradient step. We evaluate ACPO on Multi-Robot Warehouse, SMACv2, and MA-MuJoCo, where it outperforms strong baselines, with the gap widening as the number of agents grows.
中文摘要 多智能体强化学习（MARL）中的合作任务要求智能体集体最大化共享回报。在集中式训练与去中心化执行（CTDE）范式下，政策梯度仍难以直接计算。以往的方法主要遵循两种方法：独立分解更新与中心批评者，缺乏一般的联合改进保证，且无值分解假设;或交替最佳响应更新，可能收敛到次优纳什均衡。本文展示了联合策略梯度允许每个代理的精确分散分解，每个指标由每个代理的评分函数和去中心化批评者组成。基于该分解，我们开发了代理链策略优化（ACPO），参与者独立训练，其更新构成联合策略梯度上的单一步骤。这一结果的核心是对同时联合决策的序列化视角，即代理者一次执行一个行动，每个行为都基于对前行为的信念。信念作为协调机制，将独立的每个智能体更新绑定为联合梯度步骤。我们在Multi-Robot Warehouse、SMACv2和MA-MuJoCo上评估ACPO，这些指标表现优于强劲基线，且随着代理数量的增加，差距进一步扩大。

Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts

基于影响图和聚类脚本的星际争霸微操分层强化学习

Authors: Chunhui Bai, Changhe Li, Dequan Li, Xinye Cai, Shengxiang Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30092
Pdf link: https://arxiv.org/pdf/2606.30092
Abstract Real-time strategy (RTS) games present significant AI challenges, characterized by expansive state-action spaces arising from multi-unit coordination in continuous battlefields, and sparse delayed rewards stemming from final win/lose signals. Existing approaches face a trade-off between managing the dimensionality explosion of joint actions and maintaining the interpretability of complex state representations. This complexity is further intensified by the limitation of traditional hierarchical structures in adaptively decomposing tasks into effective tactical modules. Such difficulties are compounded by the black-box nature of deep learning models and their reliance on sparse rewards, which together result in limited sample efficiency and a lack of decision-making transparency. To address these limitations, this paper proposes HRL-IM/CBS, a hierarchical reinforcement learning framework with influence map hashing and cluster-based scripts for StarCraft micromanagement. Influence map hashing encodes global battlefield situations into compact hexadecimal codes, capturing spatial control and relative advantage. Cluster-based scripts enable dynamic local coordination through adaptive unit partitioning. The hierarchical multi-Q-table architecture decomposes decision-making into upper-level clustering strategy selection and lower-level tactical execution, with reward allocation providing dense learning signals. Experiments across six asymmetric scenarios demonstrate competitive performance against deep RL baselines while offering advantages in sample efficiency and interpretability through transparent Q-table representations.
中文摘要 即时战略（RTS）游戏带来了重大的人工智能挑战，特点是多单位在连续战场上的协调导致的广阔状态行动空间，以及最终胜负信号带来的稀疏延迟奖励。现有方法面临着管理联合动作维度爆炸与维护复杂状态表示可解释性的权衡。这种复杂性因传统层级结构在自适应分解任务为有效战术模块方面的局限而进一步加剧。这些困难因深度学习模型的黑箱特性以及对稀疏奖励的依赖而加剧，这些因素共同导致样本效率有限和决策透明度不足。为解决这些局限性，本文提出了HRL-IM/CBS，一种具有影响图哈希和基于集群脚本的层级强化学习框架，用于星际争霸微观管理。影响地图哈希将全球战场情势编码为紧凑的十六进制代码，捕捉空间控制权和相对优势。基于集群的脚本通过自适应单元分区实现动态本地协调。分层多Q表架构将决策分解为上层聚类策略选择和下层战术执行，奖励分配则提供密集的学习信号。六种非对称场景的实验展示了在深度强化学习基线下的竞争性能，同时通过透明的Q表表示在样本效率和可解释性方面提供了优势。
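
To make the "influence map hashing" idea in the abstract above more concrete, here is a minimal sketch that builds a signed influence grid and packs one bit per cell into a hexadecimal code. The decay rule, the one-bit quantization, and all names are assumptions for illustration; the abstract does not describe the actual encoding used by HRL-IM/CBS.

```python
def influence_map(width, height, units, decay=1.0):
    """Sum signed influence per cell: positive for allies, negative for enemies."""
    grid = [[0.0] * width for _ in range(height)]
    for (ux, uy, strength) in units:
        for y in range(height):
            for x in range(width):
                dist = abs(x - ux) + abs(y - uy)  # Manhattan distance to the unit
                grid[y][x] += strength / (1.0 + decay * dist)
    return grid

def hash_influence(grid):
    """Pack one bit per cell (1 = allied advantage) into a hex string."""
    bits = 0
    n = 0
    for row in grid:
        for v in row:
            bits = (bits << 1) | (1 if v > 0 else 0)
            n += 1
    digits = (n + 3) // 4  # four cells per hexadecimal digit
    return format(bits, "0{}x".format(digits))

# One ally in the top-left corner, one enemy in the bottom-right corner.
units = [(0, 0, 1.0), (3, 3, -1.0)]
code = hash_influence(influence_map(4, 4, units))
```

A compact code of this kind could serve as a Q-table key, which is one plausible reason to prefer hashing over feeding the raw grid to a tabular learner.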

Domain Adaptation with Adaptive Imagination for Visual Reinforcement Learning under Limited Target Data

在有限目标数据下，利用自适应想象力进行视觉强化学习的领域适应

Authors: Hyunwoo Park, Sang-Hyun Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30192
Pdf link: https://arxiv.org/pdf/2606.30192
Abstract Sim-to-real transfer remains a major obstacle for reinforcement learning (RL), especially for vision-based control where image observations exacerbate the state-distribution shift between simulation and the real world. Domain adaptation (DA) is a promising remedy for this challenge. Prior sim-to-real DA works have demonstrated encouraging results, yet these approaches typically assume substantially more target data, which is not available in practice. Indeed, their performance degrades significantly when the target data budget is reduced. To address this challenge, we propose AIDA (Adaptive Imagination for Domain Adaptation), a domain adaptation framework for visual reinforcement learning that addresses sim-to-real transfer under scarce target data without requiring additional interaction with the target environment. Our key idea is adaptive imagination: generating reliable and semantic imagination rollouts to augment limited target data. Specifically, AIDA employs a distribution-shift-aware discriminator that truncates rollouts when imagined transitions drift into low-confidence regions, so that only reliable transitions contribute to the augmentation. On these reliable transitions, AIDA introduces a self-consistency loss that cycles through state -> image observation -> state, penalizing discrepancies between the original and reconstructed states. This provides additional adaptation signals beyond the scarce target data. Our experiments demonstrate that adaptive imagination effectively truncates unreliable rollouts. By enforcing a self-consistency loss on the resulting reliable transitions, AIDA learns semantically meaningful state representations and outperforms baselines across five MuJoCo tasks and two Gymnasium-Robotics tasks.
中文摘要 模拟到现实的转移仍然是强化学习（RL）的主要障碍，尤其是在基于视觉的控制中，图像观察加剧了模拟与现实世界之间的状态分布转移。域适应（DA）是解决这一挑战的有前景的解决方案。以往模拟到现实的DA工作已显示出令人鼓舞的效果，但这些方法通常假设了更多的目标数据，而这些数据在实际中并不存在。事实上，当目标数据预算减少时，它们的性能会显著下降。为应对这一挑战，我们提出了AIDA（自适应想象领域适应）框架，这是一种用于视觉强化学习的领域适应框架，能够在稀缺目标数据下实现模拟到现实的转移，而无需与目标环境进行额外交互。我们的核心理念是自适应想象力：生成可靠且语义化的想象力推广，以增强有限的目标数据。具体来说，AIDA采用分布移位感知判别器，当想象中的转移漂移到低置信区时，会截断推广，只有可靠的过渡才会对增强做出贡献。在这些可靠跃迁上，AIDA引入了自一致性损失，该过程循环于状态->图像观测->状态，惩罚原始状态与重建状态之间的差异。这提供了超出稀缺目标数据的额外适应信号。我们的实验表明，自适应想象力有效截断了不可靠的推广。通过对最终可靠转变强制自一致性丧失，AIDA学习了语义上有意义的状态表示，并在五个MuJoCo任务和两个体育馆-机器人任务中表现优于基线。

Sparse Sensor Placement in Multi-Agent Reinforcement Learning Control of Rayleigh-Bénard Convection

Rayleigh-Bénard对流多智能体强化学习控制中的稀疏传感器布置

Authors: Jan Stenner, Hans Harder, Sebastian Peitz
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.30238
Pdf link: https://arxiv.org/pdf/2606.30238
Abstract This paper studies sparse sensor placement for control of Rayleigh-Bénard convection with multi-agent reinforcement learning. We train dense expert policies with windowed observations and distill sparse apprentice policies by supervised learning with grouped regularization on encoder input weights. The framework combines ordered non-convex grouped regularization and iterative reweighted grouped regularization, and uses a grouping construction that enforces consistent pruning across overlapping observation windows. Experiments with fixed and varying initial conditions show that Multi-Agent Transformer policies train more stably than proximal policy optimization baselines, while sparse apprentices retain control behavior comparable to dense experts. Sparsity results are strong for the proposed grouped methods across settings, including maximal sparsity in all fixed-initial-condition setting variants and maximal or near-maximal sparsity in varying-initial-condition setting variants. As an additional proof of concept, training from learned minimal sensor sets reduces per-agent observation size from 360 to 12 and preserves the overall training trend in simulation while reducing data throughput. The results provide both an interpretable basis for identifying control-relevant spatial regions and state components, and a practical pathway toward sensor-efficient control under realistic hardware constraints.
中文摘要 本文研究了利用多智能体强化学习控制Rayleigh-Bénard对流的稀疏传感器布置。我们通过窗口观察训练密集专家策略，并通过编码器输入权重的分组正则化，提炼稀疏学徒策略。该框架结合了有序非凸分组正则化和迭代重加权分组正则化，并采用分组构造，强制在重叠的观察窗口间实现一致剪枝。在固定且可变初始条件下的实验表明，多代理变换器策略的训练稳定性优于近端策略优化基线，而稀疏学徒的控制行为则可与密集专家相当。对于提出的分组方法在不同设置中，稀疏性结果很强，包括所有固定初始条件设置变体的最大稀疏性，以及在不同初始条件设置变体中最大或接近极大稀疏性。作为进一步的概念验证，从学习到的最小传感器集训练可将每位代理的观察规模从360减少到12，同时保持模拟中的整体训练趋势，同时降低数据吞吐量。结果既为识别控制相关空间区域和状态分量提供了可解释的基础，也为在现实硬件约束下实现传感器高效控制提供了实用路径。
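
Grouped regularization on encoder input weights, as mentioned in the abstract above, prunes whole sensors by driving all weights attached to a sensor to zero together. The sketch below shows the proximal step of the ordinary group-lasso penalty as a simple reference point; it is not the ordered non-convex or iteratively reweighted variants the paper proposes, and the sensor names and weights are made up.

```python
import math

def group_soft_threshold(groups, lam):
    """Proximal step for the penalty lam * sum over groups of ||w_g||_2."""
    out = {}
    for name, w in groups.items():
        norm = math.sqrt(sum(x * x for x in w))
        # Shrink the whole group; groups with norm <= lam collapse to zero.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[name] = [scale * x for x in w]
    return out

def active_sensors(groups):
    """Names of sensors that still have at least one non-zero weight."""
    return sorted(name for name, w in groups.items() if any(x != 0.0 for x in w))

weights = {
    "sensor_a": [3.0, 4.0],   # group norm 5.0, survives shrinkage
    "sensor_b": [0.1, -0.2],  # small group norm, pruned entirely
}
pruned = group_soft_threshold(weights, lam=1.0)
```

Because the penalty acts on the norm of each group rather than on individual weights, a sensor is either kept (with shrunken weights) or dropped as a unit, which is what makes the result interpretable as a sensor placement.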

KYON: Semi-Modular Wheel-Legged Quadruped With Agile Bimanual Capability

KYON：半模块化轮腿四足，具备灵活双手能力

Authors: Luca Rossini, Arturo Laurenzi, Francesco Ruscelli, Yifang Zhang, Giovanbattista Gravina, Lorenzo Baccelliere, Corrado Burchielli, Stefano Cordasco, Nikos Tsagarakis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.30243
Pdf link: https://arxiv.org/pdf/2606.30243
Abstract This paper presents KYON, a hybrid wheel-legged quadruped robot equipped with a bimanual upper body for loco-manipulation tasks. The platform features a semi-modular design with reconfigurable lower legs, enabling both wheeled and legged locomotion depending on the environment. A design approach that places actuators in the base and uses transmission mechanisms reduces distal inertia, improving agility and dynamic performance. The robot integrates a whole-body control framework with a reinforcement-learning-based policy to handle nonlinear dynamics and enhance robustness to disturbances when executing locomotion and manipulation tasks independently. Experimental results demonstrate effective dynamic locomotion and bimanual manipulation, validating the platform's capability to operate in complex and unstructured scenarios.
中文摘要 本文介绍了KYON，一种配备双手上半身用于机车操作任务的混合轮腿四足机器人。该平台采用半模块化设计，下腿可重构，根据环境支持轮式和腿式移动。将执行器置于基座并采用传动机构的设计方法，减少了远端惯性，提升了灵活性和动态性能。该机器人集成了全身控制框架和基于强化学习的策略，能够处理非线性动力学，并增强对干扰的鲁棒性，以独立执行运动和操作任务。实验结果显示其动态移动和双手操作有效，验证了平台在复杂和非结构化场景下的运行能力。

Toward an Energy-Optimized Operation of Data Centers Located in Wind Farms Using Reinforcement Learning

利用强化学习实现风电场内数据中心的能源优化运行

Authors: Jan Stenner, Alexander Kilian, Sebastian Peitz, Hermann de Meer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.30316
Pdf link: https://arxiv.org/pdf/2606.30316
Abstract This paper studies Reinforcement Learning as an online controller for curtailment-aware workload shifting in wind-turbine-integrated high-performance computing (HPC) data centers. We introduce a reproducible fixed-day simulation framework with synthetic wind and price signals and delayed completion feedback, designed to be extensible toward more complex scenarios. As a controlled benchmarking basis, we then focus on the minimal case with one wind turbine and one co-located data center. In this setting, pure Reinforcement Learning exhibits a pronounced credit-assignment problem and tends to underuse free wind energy early in the day. We therefore evaluate two complementary countermeasures: optimization-based Imitation Learning and potential-based Reward Shaping. Across multi-seed training and a 200-day test set, Proximal Policy Optimization (PPO) and a Soft Actor-Critic (SAC) variant with an additional on-policy update routine achieve strong empirical performance among learned policies, and both Imitation Learning and Reward Shaping provide improvements in relevant configurations. A performance gap to the optimizer remains, which is expected: the optimizer plans offline with full-day foresight, whereas Reinforcement Learning must decide online from current observations without future realizations. The benchmark and ablation results provide a transparent basis for extending the approach toward richer multi-site and continuous-time scenarios.
中文摘要 本文研究强化学习作为风力涡轮机集成高性能计算（HPC）数据中心中限时感知工作负载转移的在线控制器。我们引入了可复现的固定日模拟框架，结合合成风力和价格信号及延迟完工反馈，旨在扩展至更复杂的场景。作为受控基准测试基础，我们关注最小情形，即一台风力涡轮机和一台共址数据中心。在此环境中，纯强化学习表现出明显的学分分配问题，且在一天的早晨往往会减少自由风能的利用。因此，我们评估了两种互补的对策：基于优化的模仿学习和基于潜力的奖励塑造。在多种子训练和200天测试集中，近端策略优化（PPO）和带有额外策略更新程序的软性演员-批判者（SAC）变体在学习策略中取得了强劲的实证表现，模仿学习和奖励塑造在相关配置上提供了改进。与优化器之间存在性能差距，这是预期中的：优化者以全天预见离线进行规划，而强化学习则必须基于当前观察在线决策，无法实现未来的实现。基准和消融结果为将该方法扩展到更丰富的多位点和连续时间场景提供了透明基础。
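
Potential-based reward shaping, one of the two countermeasures named in the abstract above, adds the term gamma * Phi(s') - Phi(s) to each reward. The sketch below applies it to a toy episode; the potential function (fraction of workload completed) and the state layout are assumptions chosen for illustration, not the potential used in the paper.

```python
def shaped_reward(r, phi_s, phi_next, gamma):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * phi_next - phi_s

def potential(state):
    """Hypothetical potential: fraction of the day's workload already completed."""
    done, total = state
    return done / total

# Toy episode: (jobs done, jobs total) at each step, with a sparse final reward.
states = [(0, 4), (1, 4), (3, 4), (4, 4)]
rewards = [0.0, 0.0, 1.0]
gamma = 1.0
shaped = [
    shaped_reward(r, potential(s), potential(s_next), gamma)
    for r, s, s_next in zip(rewards, states, states[1:])
]
```

With gamma = 1 the shaping terms telescope, so the shaped return differs from the original return only by Phi(last state) - Phi(first state); the agent receives denser feedback early in the episode without the comparison between policies being changed.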

DRIFT: Difficulty Routing Self-DIstillation with Rhythm-Gated Exploration and Success BuFfer Training

DRIFT：结合节奏门控探索与成功缓冲训练的难度路由自蒸馏

Authors: Haisen Luo, Yiwei Liu, Haoning Wang, Dan Liu, Junxi Yin, Haotian Wang, Lei Zhang, Xiaoyu Tian, Shuaiting Chen, Yuansheng Song, Baoyan Guo, Xiongfei Yan, Bolan Yang, Chengwei Liu, Ming Cui, Jiong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30345
Pdf link: https://arxiv.org/pdf/2606.30345
Abstract Enabling large language models to achieve stable self-improvement without external expert supervision remains a central challenge in complex reasoning tasks. Existing self-distillation and reinforcement learning methods lack explicit mechanisms for tracking problem-level learning progress and adapting optimization strategies accordingly. Consequently, training may over-optimize easy problems, receive weak supervision from hard problems, and fail to sufficiently explore borderline cases. To resolve these issues, we propose DRIFT, an online self-evolution policy optimization framework for large language models. DRIFT regulates the model's self-improvement process through the joint use of Difficulty Routing and Rhythm Gating. The former identifies the model's learning state at the problem level and dynamically allocates self-distillation and reinforcement learning signals, while the latter refines policy updates at the token level, concentrating exploration on critical reasoning positions. By further incorporating a success buffer and a two-stage curriculum learning strategy, DRIFT preserves high-quality historical experience while progressively guiding the model from reliable behavior acquisition toward stable policy evolution. Evaluated across five benchmarks and three model scales, DRIFT surpasses the peak performance of both GRPO and SDPO across all evaluated metrics. On the average score over the five benchmarks, DRIFT achieves 79.5%, outperforming GRPO by 9.5% and SDPO by 7.5%, establishing a new state-of-the-art result. Notably, on ToolUse, DRIFT reaches an accuracy of 79.2%, improving over GRPO by 13.5% and SDPO by 10.7%, setting a new state of the art and substantially outperforming all concurrent methods.
中文摘要 使大型语言模型在没有外部专家监督的情况下实现稳定的自我提升，仍然是复杂推理任务中的核心挑战。现有的自蒸馏和强化学习方法缺乏明确的机制来跟踪问题层级的学习进展并相应调整优化策略。因此，训练可能过度优化简单问题，从难题中获得的监督信号较弱，且未能充分探索边界案例。为解决这些问题，我们提出了DRIFT，一个面向大型语言模型的在线自我演化策略优化框架。DRIFT通过结合使用难度路由和节奏门控来调节模型的自我提升过程。前者识别模型在问题层面的学习状态，动态分配自蒸馏和强化学习信号，后者则在token层面细化策略更新，将探索集中在关键推理位置。通过进一步引入成功缓冲和两阶段课程学习策略，DRIFT保留了高质量的历史经验，同时逐步引导模型从可靠的行为习得走向稳定的策略演化。经过五个基准和三个模型规模的评估，DRIFT在所有评估指标上均超过了GRPO和SDPO的峰值表现。在五个基准的平均得分上，DRIFT达到79.5%，比GRPO高9.5%，比SDPO高7.5%，创下了新的最先进成绩。值得注意的是，在ToolUse上，DRIFT的准确率达到79.2%，比GRPO提升13.5%，比SDPO提升10.7%，创下了新的最先进水平，并显著优于所有同期方法。

FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification

FlowAWR：通过优势加权整流的在线自适应流强化

Authors: Zheming Fu, Ruizhe He, Wei Shang, Xiaoxiao Ma, Lei Wang, Chang Liu, Siming Fu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.30376
Pdf link: https://arxiv.org/pdf/2606.30376
Abstract Aligning generative flow models on continuous spaces via online reinforcement learning is constrained by intractable trajectory likelihoods. Existing density-approximated policy gradient methods rely on stochastic SDE samplers to construct tractable transition kernels, which introduces training-inference inconsistencies and necessitates Classifier-Free Guidance (CFG). While implicit frameworks such as DiffusionNFT directly optimize forward-process velocity fields, their heuristic fixed-magnitude corrections prevent the optimization strength from reflecting relative intra-group quality. We propose Flow Advantage-Weighted Rectification (FlowAWR), a paradigm that recasts continuous generative policy optimization as supervised regression toward a theoretically optimal velocity field. Starting from the optimal policy of a KL-constrained reward maximization, FlowAWR derives the optimal velocity field that admits a magnitude-aware, advantage-weighted rectification form, yielding SDE-free optimization and CFG-free generation. In comparative evaluations on SD3.5-Medium, FlowAWR achieves improved alignment performance alongside a 2× to 5× convergence acceleration over DiffusionNFT (e.g., reaching a 24.12 PickScore in 1.2k steps, versus 23.82 in 2.0k steps for DiffusionNFT and 23.50 in >4k steps for FlowGRPO). Under multi-reward constraints, FlowAWR sustains generation quality, satisfying structural rules while maintaining stable out-of-domain performance.
中文摘要 通过在线强化学习对齐连续空间上的生成流模型，受限于难以处理的轨迹似然。现有的密度近似策略梯度方法依赖随机SDE采样器构建可处理的转移核，这会引入训练-推理不一致，并需要无分类器引导（CFG）。虽然像DiffusionNFT这样的隐式框架直接优化前向过程的速度场，但其启发式的固定幅度修正使优化强度无法反映组内相对质量。我们提出了Flow Advantage-Weighted Rectification（FlowAWR），这是一种将连续生成策略优化重新表述为朝向理论最优速度场的监督回归的范式。从KL约束奖励最大化的最优策略出发，FlowAWR推导出具有幅度感知、优势加权整流形式的最优速度场，实现无SDE优化和无CFG生成。在SD3.5-Medium上的对比评估中，FlowAWR取得了更好的对齐性能，并相对DiffusionNFT实现了2×至5×的收敛加速（例如，在1.2k步达到24.12的PickScore，而DiffusionNFT在2.0k步达到23.82，FlowGRPO在超过4k步时达到23.50）。在多奖励约束下，FlowAWR维持生成质量，满足结构规则，同时保持稳定的域外性能。

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

MOPD：面向LLM后训练能力整合的多教师在策略蒸馏

Authors: Wenhan Ma, Jianyu Wei, Liang Zhao, Hailin Zhang, Bangjun Xiao, Lei Li, Qibin Yang, Bofei Gao, Yudong Wang, Rang Li, Jinhao Dong, Zhifang Sui, Fuli Luo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.30406
Pdf link: https://arxiv.org/pdf/2606.30406
Abstract Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher's capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model, demonstrating its practical value for capability integration in frontier-scale LLMs.
中文摘要 现代大型语言模型（LLMs）依赖于训练后强化学习来推动特定能力，但将多种能力整合到一个模型中仍然困难。现有方法，如Off-Policy微调和Mix-RL，要么效率低下，要么性能下降。本研究提出多教师政策提炼（MOPD），这是一种培训后范式，用于结合多个领域强化学习教师的能力：我们首先运行每个领域专门的强化学习，获得一组领域教师，然后在学生自身的推广中将这些教师提炼进学生中。这消除了曝光偏置，提供了密集的优化信号。在Qwen3-30B-A3B中，MOPD的表现优于Mix-RL、Cascade RL、Off-Policy微调和Param-Merge基线，几乎继承了每位教师的所有能力。MOPD还促进了领域教师的并行、独立发展，消除了多领域后期培训中典型的跨领域耦合。MOPD已被部署在MiMo-V2-Flash（工业规模前沿模型）的后期训练中，展示了其在前沿规模大型语言模型中功能集成的实用价值。
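
A minimal sketch of the on-policy distillation signal described above, assuming the student is scored against a domain teacher on tokens the student itself sampled. The per-token reverse KL, the routing by a `domain` key, and all function names are illustrative assumptions; the abstract does not specify the divergence or how teachers are selected.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at every position; inputs have shape (T, V)."""
    p_s = softmax(student_logits)
    log_p_s = np.log(p_s + 1e-12)
    log_p_t = np.log(softmax(teacher_logits) + 1e-12)
    return (p_s * (log_p_s - log_p_t)).sum(axis=-1)

def distill_loss(student_logits, teacher_logits_by_domain, domain):
    # Hypothetical routing: each student rollout is scored by the teacher
    # of the domain its prompt came from. Every token gets a loss term,
    # which is what makes the signal dense.
    teacher_logits = teacher_logits_by_domain[domain]
    return float(reverse_kl_per_token(student_logits, teacher_logits).mean())
```

Because the logits are evaluated on the student's own rollout, the student is trained on the states it actually visits, which is the sense in which exposure bias is avoided.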

Diffusion Fine-tuning with Rewarded Moment Matching Distillation

基于奖励矩匹配蒸馏的扩散模型微调

Authors: Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet, Arnaud Doucet, Romuald Elie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.30414
Pdf link: https://arxiv.org/pdf/2606.30414
Abstract Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. Because the two phases have traditionally been studied in isolation, their interaction remains poorly understood, in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity "naturalness" characteristic of advanced distillation (such as 8-step Moment Matching) by adapting the sampling loop for on-policy training and repurposing the distillation loss as a proxy for integral KL regularization. By evaluating the FID-Reward Pareto fronts on ImageNet, we demonstrate that RMMD achieves superior trade-offs compared to single-step baselines (DI++) and multi-step competitors (DRaFT, HyperNoise). Finally, we apply RMMD to GenCast, a state-of-the-art weather forecasting model, distilling it while optimizing the Continuous Ranked Probability Score (CRPS) metric. The resulting distilled model achieves a 7.5x speedup while outperforming the teacher model on 93% of target weather variables and being better calibrated. This shows that RMMD scales to complex, high-dimensional scientific domains.
中文摘要 蒸馏与强化学习（RL）微调是培训后扩散的主要支柱。虽然传统上是孤立研究，但这些阶段之间的相互作用仍然了解不足，尤其是微调如何影响蒸馏模型的生成质量。我们介绍了奖励矩匹配蒸馏（RMMD），这是一种新颖的框架，既能提炼扩散模型，又最大化奖励函数。RMMD通过调整采样环路以适应政策训练，并将蒸馏损耗重新利用为整体KL正则化的代理，保持了高级蒸馏（如8步矩匹配）的高保真“自然性”特性。通过评估ImageNet上的FID-奖励帕累托前沿，我们证明了RMMD相较于单步基线（DI++）和多步竞争者（DRaFT、HyperNoise）在权衡上更优越。最后，我们将RMMD应用于GenCast这一先进的天气预报模型，在优化连续排名概率评分（CRPS）指标的同时进行提炼。最终的提炼模型在93%的目标天气变量上表现优于教师模型，且校准更佳，速度提升了7.5倍。这证明了RMMD能够扩展到复杂且高维的科学领域。
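
For readers unfamiliar with CRPS, the sketch below shows the standard ensemble estimator for a scalar variable: mean absolute error of the members minus half the mean pairwise spread. The function name is hypothetical, and the exact estimator and weighting used for GenCast in the paper are not given in the abstract.

```python
import numpy as np

def ensemble_crps(samples, observation):
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    samples = np.asarray(samples, dtype=float)
    # Accuracy term: how far ensemble members are from the observed value.
    accuracy = np.abs(samples - observation).mean()
    # Spread term: average distance between all pairs of members.
    spread = np.abs(samples[:, None] - samples[None, :]).mean()
    return float(accuracy - 0.5 * spread)
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE.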

Experience Augmented Policy Optimization for LLM Reasoning

面向LLM推理的经验增强策略优化

Authors: Jinda Lu, Kexin Huang, Junkang Wu, Shuo Yang, Jinghan Li, Chiyu Ma, Shaohang Wei, Xiang Wang, Guoyin Wang, Jingren Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.30420
Pdf link: https://arxiv.org/pdf/2606.30420
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but instead expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments using Qwen-2.5-math 7b and Qwen-3-8B on five benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.
中文摘要 带可验证奖励的强化学习（RLVR）是一种强有力的范式，用于提升大型语言模型（LLM）的推理能力。然而，现有的RLVR方法通常从零开始依赖策略优化，导致抽样成本高且积累经验的利用效率低下。随着模型能力和策略行为在训练过程中演变，近期通过固定推理轨迹重用经验的尝试进一步存在策略不匹配的问题。基于这些局限，我们主张RLVR中的经验不应被重复用作固定推理轨迹，而应以政策适应性的方式表达。本研究提出经验增强策略优化（EAPO），利用先前强化学习优化策略作为行动层经验，并在部署关键决策点有选择地注入经验。为了确保经验增强推广带来稳定且无偏的学习，EAPO还引入了适应性重要性抽样方案。在五个不同基准测试上使用 Qwen-2.5-math 7b 和 Qwen-3-8B 的实验表明，EAPO 在推理性能上持续优于最先进的 RLVR 方法。

Grasp-Oriented Non-Prehensile Manipulation via Learning a Graspability Field

通过学习可抓取性场实现面向抓取的非抓握操作

Authors: Licheng Zhong, Gim Hee Lee
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.30474
Pdf link: https://arxiv.org/pdf/2606.30474
Abstract Non-prehensile manipulation is often used as a preparatory step for robotic grasping, yet existing approaches typically require a predefined target object pose. In practice, however, objects admit multiple graspable configurations and the desired pose is not known in advance. We reformulate non-prehensile manipulation for grasping as optimizing an object-centric graspability objective rather than reaching a specific pose. We construct a graspable set from synthesized grasps and define a graspability field that measures how suitable an object configuration is for successful grasp execution. This scalar measure provides a dense learning signal for reinforcement learning and determines when to terminate manipulation. This yields a closed-loop manipulation-to-grasp pipeline driven by a single policy. Experiments in simulation and on a real robot show that the policy reliably reconfigures objects into graspable states and transitions to grasping without external planners or manually specified stopping conditions. The predicted graspability distance correlates with real-world grasp success, which indicates that the learned representation captures the grasp feasibility of object configurations.
中文摘要 非抓握操作常被用作机器人抓取的准备步骤，但现有方法通常需要预先定义的目标物体姿态。然而，实际上物体可以接受多种可抓握的配置，且期望的姿态事先未知。我们将非抓握操作重新表述为优化以物体为中心的可抓取目标，而非达到特定姿势。我们从合成的抓取构建一个可抓取集，并定义一个可抓取场，衡量对象配置在成功抓取执行中的适用性。标量为强化学习提供了密集的学习信号，并决定何时终止操作。这就形成了由单一策略驱动的闭环操作到掌握流程。在模拟和真实机器人上的实验表明，该策略能够可靠地将物体重新配置为可抓取状态，并在无需外部规划器或手动指定停止条件的情况下实现抓取的过渡。预测的可抓距与现实世界的抓取成功率相关，表明所学的表征能够捕捉物体配置的抓握可行性。
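
One way to picture how a scalar graspability measure can serve both as a dense reward and as a stopping rule is sketched below. It assumes object configurations are plain vectors compared by Euclidean distance and that the graspable set is a finite list; the paper learns a field over object configurations, so the metric, the threshold value, and the function names here are illustrative assumptions only.

```python
import numpy as np

def graspability_distance(config, graspable_set):
    """Distance from the current configuration to the nearest graspable one."""
    diffs = np.asarray(graspable_set, dtype=float) - np.asarray(config, dtype=float)
    return float(np.linalg.norm(diffs, axis=1).min())

def step_signal(config, graspable_set, threshold=0.05):
    # Dense reward: closer to any graspable configuration is better,
    # so no single target pose has to be specified in advance.
    d = graspability_distance(config, graspable_set)
    reward = -d
    # Termination: stop pushing and hand over to grasping once close enough.
    done = d < threshold
    return reward, done
```

Taking the minimum over the set is what lets the policy settle on whichever graspable configuration is easiest to reach.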

When and Which Sensor to Observe? Timely Tracking of a Joint Markov Source

何时以及观察哪个传感器？联合马尔可夫源的及时跟踪

Authors: Ismail Cosandal, Sennur Ulukus, Nail Akar
Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2606.30623
Pdf link: https://arxiv.org/pdf/2606.30623
Abstract We investigate the problem of remote estimation (at a monitor) of a discrete-time joint Markov process whose individual components can each be observed by a dedicated sensor. At a given time slot, the monitor has the option of staying idle or sending a pull request to one of the sensors to obtain a partial state value, while the sensors are assumed to have heterogeneous sampling costs. Our goal is to develop a monitor pull policy, i.e., to determine when and to which sensor to send a pull request, in order to minimize a weighted sum of the average age of incorrect information (AoII), or age for short, and the sampling costs. As the communication model, we assume an erasure channel with a fixed one-slot delay from each sensor to the monitor. In this setting, at any given time the monitor knows neither the state of the process nor the age perfectly. We first obtain a sufficient statistic, namely the belief, representing the joint distribution of the age and the current state of the observed process, by using the history of all pull requests and observations. Then, we formulate the optimization problem as a continuous state-space Markov decision process (MDP), namely a belief-MDP, which we solve with two model predictive control (MPC) methods: MPC without terminal costs (MPC-WTC) and reinforcement learning MPC (RL-MPC). The effectiveness of the proposed methods is validated by numerical examples.
中文摘要 我们研究在监测器上远程估计离散时间联合马尔可夫过程的问题，这些过程具有独立分量，可用专用传感器观测。在给定时间段，监视器可以选择保持空闲或向某个传感器发送拉取请求以获取部分状态值，而传感器则假设采样成本异构。我们的目标是制定监控拉取策略，即确定何时以及向哪个传感器发送拉取请求，以最小化错误信息的平均年龄（AoII）、短年龄和采样成本的加权总和。作为通信模型，我们假设每个传感器到显示器之间有一个固定的一时隙延迟的擦除信道。在此环境中，监测仪无法完全知道过程的状态或任何时间的年龄。首先，我们通过利用所有拉取请求和观测值的历史，获得一个充分的统计量，即信念值，表示年龄与当前状态的联合分布。随后，我们将优化问题表述为连续状态空间马尔可夫决策过程（MDP），即信念-MDP，为此我们提出了两种模型预测控制（MPC）方法，分别是无终端成本的MPC（MPC-WTC）和强化学习MPC（RL-MPC）。所提方法的有效性通过数值实例得到验证。
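
The belief idea can be illustrated with a heavily simplified sketch: a single finite-state Markov chain with a known transition matrix `P`, where the belief covers only the current state. The paper's belief also covers the age and a joint multi-component source, which this sketch omits; the function name and the handling of the one-slot delay are assumptions made for illustration.

```python
import numpy as np

def belief_update(belief, P, observed_state=None):
    """One-slot belief update for a Markov source with transition matrix P.

    observed_state is None when the monitor stayed idle or the reply was
    erased; otherwise it is the state sampled one slot earlier.
    """
    P = np.asarray(P, dtype=float)
    if observed_state is None:
        # No new information: propagate the old belief through the chain.
        return np.asarray(belief, dtype=float) @ P
    # The sample is one slot old, so the current state is distributed
    # according to the row of P for the observed state.
    return P[observed_state].copy()
```

The belief is a probability vector, so the decision problem built on it has a continuous state space even though the source itself is finite-state.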

Keyword: diffusion policy

Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer

关键姿态探索：高效的自动轨迹标注与跨本体策略迁移

Authors: Yupu Lu, Hang Xu, Yizhou Chen, Jia Pan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.29028
Pdf link: https://arxiv.org/pdf/2606.29028
Abstract Keypose-based manipulation decomposes tasks into critical waypoints to simplify policy learning for long-horizon tasks, but existing approaches rely on task-specific heuristics or manual annotation to extract keyposes from demonstrations. We present an automatic trajectory labelling pipeline for grasp-related tasks. This pipeline combines vision-language models (VLMs) for semantic event detection with classical trajectory analysis for precise temporal alignment, requiring VLM inference on only a single demonstration among the repeated ones for each task. Using the labelled data, we train a keypose-guided Diffusion Policy (DP) that exploits keypose conditioning to intervene on demonstration distributions. We explore the possibility of applying this property to cross-embodiment transfer: candidate keyposes are sampled and filtered via a reachability map, steering the policy toward kinematically feasible keyposes for the target robot. As a preliminary feasibility study, experiments on two robomimic tasks show that the labelled data produces policies matching a standard DP baseline, and that reachability-filtered keypose conditioning may benefit zero-shot transfer on the multimodal insertion task when feasible candidates are available.
中文摘要 基于关键姿势的操作将任务分解为关键路径点，简化长期任务的策略学习，但现有方法依赖任务特定的启发式或手动注释来从演示中提取关键位置。我们为抓握相关任务提供了自动轨迹标记流程。该流程结合了用于语义事件检测的视觉语言模型（VLM）与经典轨迹分析以实现精确的时间对齐，只需在每个任务重复的演示中推断一次VLM即可。利用标记数据，我们训练了一个关键姿态引导扩散策略（DP），利用关键位态条件干预演示分布。我们探讨将该特性应用于跨身体转移的可能性：候选关键姿态通过可达性图进行抽样和过滤，引导策略朝向目标机器人运动学上可行的关键姿态。作为初步可行性研究，对两个机器人模拟任务的实验表明，标记数据产生的策略与标准DP基线相匹配，且当有可行候选方案时，可达性过滤关键姿态条件可能有助于多模态插入任务的零样本转移。
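
The sample-then-filter step for cross-embodiment transfer might look like the sketch below. It stands in for the reachability map with a crude spherical-shell workspace around the target robot's base and treats keyposes as 3D positions only; the shell radii, the position-only representation, and the function names are hypothetical, and a real reachability map would also account for orientation and joint limits.

```python
import numpy as np

def reachable_mask(candidates, base, r_min, r_max):
    """True for candidates whose distance from the base lies in [r_min, r_max]."""
    offsets = np.asarray(candidates, dtype=float) - np.asarray(base, dtype=float)
    dist = np.linalg.norm(offsets, axis=1)
    return (dist >= r_min) & (dist <= r_max)

def filter_keyposes(candidates, base, r_min=0.2, r_max=0.8):
    # Keep only keyposes the target robot could plausibly reach; the
    # survivors would then be used to condition the policy.
    candidates = np.asarray(candidates, dtype=float)
    return candidates[reachable_mask(candidates, base, r_min, r_max)]
```

If the filter returns no candidates there is nothing feasible to condition on, which matches the abstract's caveat that the benefit appears only when feasible candidates are available.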

Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering

行为解克隆：将模式重定向转化为策略权重，无需推理时间引导

Authors: Hao Wang, Jiuzhou Lei, Dayou Li, Bangya Liu, Minghui Zheng, Manling Li, Ruohan Zhang, Zhiwen Fan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.29201
Pdf link: https://arxiv.org/pdf/2606.29201
Abstract Behavior-cloned policies often learn multiple behavior modes from demonstration datasets, including modes that are unsafe or otherwise undesired at deployment. For example, a policy trained on diverse handover demonstrations may learn to pass a knife blade-first. Standard remedies such as data curation and inference-time steering either require access to the original demonstrations for full retraining or add substantial inference-time overhead. To address this gap, we propose MoRE (Mode Redirection), which redirects policy rollouts toward desired behavior modes through a short "uncloning" step. Specifically, MoRE distills the redirection signal from a temporary mode classifier into the policy weights to steer behavior. A retain loss balances this edit by preserving desired-mode competence, allowing the standalone policy to suppress unwanted modes with zero inference-time overhead. Across eight simulated and real-world tasks, MoRE improves the average deployment success rate (SR) by 44 percentage points over the original mixed-mode policy. Among all compared adaptation and steering baselines, MoRE achieves the strongest SR and approaches the filtered-data retraining reference, while preserving task competence and inference speed. MoRE also generalizes across robot policy backbones, including Diffusion Policy and the Pi0.5 VLA, diverse task categories, and real-world deployments.
中文摘要 行为克隆策略通常从演示数据集中学习多种行为模式，包括部署时不安全或不受欢迎的模式。例如，一个经过多样化交接演示训练的策略，可能会学会刀刃朝前地传递刀具。标准的补救措施如数据整理和推理时引导，要么要求访问原始演示以进行完整的重新训练，要么增加大量推理时开销。为弥补这一空白，我们提出了MoRE（模式重定向），通过一个简短的“解克隆”步骤将策略的执行轨迹引导至期望的行为模式。具体来说，MoRE将临时模式分类器的重定向信号蒸馏进策略权重以引导行为。保留损失通过保持期望模式下的能力来平衡该编辑，使独立策略能够以零推理时开销抑制不需要的模式。在八个模拟和现实任务中，MoRE使平均部署成功率（SR）比原始混合模式策略提高了44个百分点。在所有比较的适应和引导基线中，MoRE实现了最高的SR，并接近使用过滤数据重新训练的参考结果，同时保持任务能力和推理速度。MoRE还能泛化到不同的机器人策略骨干网络（包括Diffusion Policy和Pi0.5 VLA）、多样化的任务类别以及实际部署。