Arxiv Papers of Today

Generated: 2026-06-25 18:46:02 (UTC+8); arXiv announcement time: 2026-06-25 20:00 EDT (2026-06-26 08:00 UTC+8)

54 relevant papers today

Keyword: reinforcement learning

ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact

ReviewGuard：将LLM辅助同行评审与长期科学影响相结合

Authors: Abdur Rasool, Xiaohui Huang, Yanqing Hu, Linyi Yang
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24892
Pdf link: https://arxiv.org/pdf/2606.24892
Abstract Peer review is central to scientific quality control, yet it can undervalue papers that later achieve substantial citation impact. While frontier large language models have shown promise in automating aspects of peer review, they primarily mimic human reviewer preferences rather than predict long-term scientific value. We introduce ReviewGuard, a two-stage framework that aligns LLM-generated reviews with citation-based estimates of long-term scientific impact rather than contemporaneous reviewer judgments. On 20,861 AI/ML papers from OpenReview augmented with Semantic Scholar citation data, ReviewGuard achieves a Spearman correlation of $\rho = 0.776$ with future citations on rejected-then-published papers, outperforming human reviewers ($\rho = 0.492$) and a supervised Expert model ($\rho = 0.681$). Under the same decision threshold, ReviewGuard flags 10.2% of high-impact rejected papers, compared with 1.8% for human reviewers, corresponding to a 5.6x improvement. Our results demonstrate that impact-aligned reinforcement learning can provide editors with a complementary signal for identifying high-potential work, without replacing human judgment.
中文摘要 同行评审是科学质量控制的核心，但它可能低估了那些后来获得显著引用影响力的论文。虽然前沿大型语言模型在自动化同行评审方面展现出潜力，但它们主要模仿人类审稿人的偏好，而非预测长期科学价值。我们介绍了ReviewGuard，这是一个两阶段框架，将LLM生成的评审意见与基于引用的长期科学影响估计对齐，而非审稿人当时的判断。在OpenReview的20,861篇AI/ML论文中，加上Semantic Scholar引用数据，ReviewGuard与被拒后发表论文的后续引用实现了$\rho = 0.776$的斯皮尔曼相关系数，优于人类评审者（$\rho = 0.492$）和监督专家模型（$\rho = 0.681$）。在同一决策门槛下，ReviewGuard标记了10.2%的高影响力被拒论文，而人类评审者为1.8%，对应提升了5.6倍。我们的结果表明，影响对齐强化学习可以为编辑者提供识别高潜力作品的互补信号，而无需取代人类判断。
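
The headline metric in this entry is a Spearman rank correlation between review-derived scores and later citation counts. As a reminder of what that statistic measures, here is a minimal sketch in plain Python. The toy numbers and the names `predicted_scores` and `future_citations` are hypothetical and are not taken from the paper; the sketch also assumes neither input is constant, since the correlation is undefined in that case.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data, not taken from the paper.
predicted_scores = [0.9, 0.2, 0.6, 0.4, 0.8]
future_citations = [310, 4, 25, 40, 120]
rho = spearman(predicted_scores, future_citations)
```

Because only the ordering matters, heavy-tailed citation counts do not dominate the statistic the way they would in a Pearson correlation on raw values.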

LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning

作为行业规模生态系统的LLM演进：持续学习的生命周期视角

Authors: Hao Jiang, Enneng Yang, Guojie Zhu, Yibin Chen, Yunkun Xu, Zifu Kou, Jiayi Li, Chong Chen, Zhao Cao, Li Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24901
Pdf link: https://arxiv.org/pdf/2606.24901
Abstract Continual learning capability is critical for Industrial LLMs, as deployed models must be continuously updated to meet evolving requirements and environments, rather than repeatedly retrained from scratch. However, most existing research focuses on improvements on static benchmarks, failing to capture real industrial needs. In this survey, we reformulate Industrial Continual Learning (ICL) for LLMs as a closed-loop update-and-release problem in a versioned ecosystem, where updates propagate hierarchically to industrial, application-specific models and LLM-powered applications, with capability inheritance and transfer across versions and model families. From this ecosystem perspective, we identify three core challenges: repeated adaptation erodes model plasticity, foundation-model upgrades break capability inheritance, and long-term sustainability is constrained by deployment requirements. We then organize the technical landscape of ICL around five lifecycle design principles: preserving plasticity headroom, treating upgrades as capability transfer, enabling trustworthy continual reinforcement learning, making training recipes self-optimizing, and building accountability as a base layer for long-term iteration. For each principle, we synthesize representative technical directions. Finally, we evaluate the maturity of each principle and its technical components via an evidence-based lens, identify key gaps hindering real-world deployment, and outline a practical ICL deployment blueprint and a pathway for feeding industrial realities back into academic research.
中文摘要 持续学习能力对工业大型语言模型至关重要，因为部署的模型必须不断更新以满足不断变化的需求和环境，而不是反复从零重新训练。然而，大多数现有研究都集中在改进静态基准，未能捕捉到真实的工业需求。在本次调查中，我们将工业持续学习（ICL）重新表述为LLM的闭环更新与发布问题，在一个版本化生态系统中，更新通过分层传播到工业专用的应用模型和基于LLM的应用，能力在版本和模型家族间继承和转移。从生态系统角度，我们识别出三大核心挑战：反复适应削弱模型可塑性，基础模型升级破坏能力继承，长期可持续性受部署需求限制。随后，我们将ICL的技术环境围绕五个生命周期设计原则来组织：保持可塑性剩余空间，将升级视为能力转移，支持可信的持续强化学习，使训练配方自我优化，以及建立责任作为长期迭代的基础层。针对每个原则，我们综合了代表性的技术方向。最后，我们通过循证视角评估每个原则及其技术组成部分的成熟度，识别阻碍实际部署的关键漏洞，并概述了切实可行的ICL部署蓝图及将工业现实反馈回学术研究的路径。

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

智能人工智能搭便车指南：从基础到系统

Authors: Haggai Roitman
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.24937
Pdf link: https://arxiv.org/pdf/2606.24937
Abstract The Hitchhiker's Guide to Agentic AI is a comprehensive practitioner's reference for building autonomous AI systems. The book covers the full stack from first principles to production deployment, organized around a central thesis: building great agentic systems requires understanding every layer of the pipeline, not just one. The book opens with the LLM substrate -- transformer architecture, GPU systems, training and fine-tuning (SFT, LoRA, MoE), model compression, and inference optimization -- treated as essential foundations rather than the primary focus. It then develops the alignment and reasoning layer: reinforcement learning from human feedback (RLHF), PPO, DPO and its variants, GRPO, reward modeling, and RL for large reasoning models including chain-of-thought and test-time scaling. The second half is devoted to agentic AI proper. Topics include agentic training and trajectory-based RL, retrieval-augmented generation (RAG and Agentic RAG), memory systems (in-context, external, episodic, and semantic), agent harness design and context management, and a taxonomy of agent design patterns. Inter-agent coordination is covered in depth: the Model Context Protocol (MCP), agent skills and tool use, the Agent-to-Agent (A2A) communication protocol, and multi-agent architectures spanning centralized, decentralized, and hierarchical topologies. The book concludes with agent development frameworks, agentic UI design, evaluation methodology for agentic tasks, and production deployment. Each chapter pairs rigorous theoretical foundations with implementation guidance, code examples, and references to the primary literature.
中文摘要 《智能人工智能银河系漫游指南》是构建自主人工智能系统的综合实践参考书。本书涵盖了从基本原理到生产部署的全栈，围绕一个核心论点组织：构建优秀的代理系统需要理解流水线的每一层，而不仅仅是一层。本书开篇介绍了LLM基底——变换器架构、GPU系统、训练与微调（SFT、LoRA、MoE）、模型压缩以及推理优化——被视为重要基础，而非主要焦点。随后，它开发了对齐与推理层：基于人类反馈的强化学习（RLHF）、PPO、DPO 及其变体、GRPO、奖励建模以及针对大型推理模型（包括思维链和测试时间尺度）的强化学习。后半部分则专注于真正的智能人工智能。主题包括代理训练与基于轨迹的强化学习、检索增强生成（RAG和代理RAG）、记忆系统（上下文、外部、情节和语义）、代理机束设计与上下文管理，以及代理设计模式的分类法。代理间协调有详细介绍：模型上下文协议（MCP）、代理技能与工具使用、代理间（A2A）通信协议，以及跨越中心化、去中心化和层级拓扑的多代理架构。本书最后介绍了代理开发框架、代理界面设计、代理任务的评估方法论以及生产部署。每章结合严谨的理论基础、实现指导、代码示例及原始文献参考。

Supervised Reinforcement Learning for the Coordination of Distributed Energy Resources

分布式能源资源协调的监督强化学习

Authors: Haoyuan Deng, Yihong Zhou, Thomas Morstyn, Yi Wang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.24947
Pdf link: https://arxiv.org/pdf/2606.24947
Abstract The increasing integration of distributed energy resources (DERs) is crucial for power system decarbonization, yet unlocking DERs' flexibility is challenged by their inherent uncertainties and modelling complexity. As traditional optimization methods struggle with such uncertainty and complexity of DERs, reinforcement learning (RL) has emerged as a promising alternative for DER management. However, standard RL methods suffer from sample inefficiency and sub-optimality when trained from scratch. Inspired by the training paradigms in large language models, this paper proposes a Supervised Reinforcement Learning (SRL) framework for learning DER coordination policies. This framework first pre-trains a policy on demonstration data in a supervised-learning fashion, which is then further fine-tuned using RL. Furthermore, we propose a two-step fine-tuning process: offline fine-tuning for enhancing policy performance and online fine-tuning for adapting it to the real-world dynamics. Experiments demonstrate that RL implementations based on the proposed framework significantly outperform all benchmarks, achieving high cost efficiency even under low-quality demonstration data.
中文摘要 分布式能源资源（DERs）日益整合对电力系统脱碳至关重要，但释放DER的灵活性仍面临其固有的不确定性和建模复杂性。随着传统优化方法难以应对DER的不确定性和复杂性，强化学习（RL）已成为DER管理的有前景替代方案。然而，标准强化学习方法在从零训练时存在样本效率低和次优问题。本文受大型语言模型训练范式启发，提出了一种用于学习DER协调策略的监督强化学习（SRL）框架。该框架首先以监督式学习方式预训练示范数据策略，然后通过强化学习进一步微调。此外，我们提出了一个两步微调过程：离线微调以提升政策绩效，在线微调以适应现实世界的动态。实验表明，基于该框架的强化学习实现显著优于所有基准测试，即使在低质量演示数据下也能实现高成本效率。

Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity

数字孪生驱动的自适应模拟与现实对齐，通过强化学习实现基于振动的轴承健康监测，在数据稀缺下实现

Authors: Jinghan Wang, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu, Gaoliang Peng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.24954
Pdf link: https://arxiv.org/pdf/2606.24954
Abstract Vibration-based health monitoring of rotating machinery requires reliable fault diagnosis under operational data constraints, yet condition assessment remains challenged by structural scarcity of fault events and heterogeneous sim-to-real gaps in digital twin-generated signals. Each fault type generates impulses with distinct periodicity, amplitude modulation, and spectral character, making feature-space discrepancies fundamentally heterogeneous across fault classes. Existing domain adaptation methods apply a class-agnostic global transformation that cannot close all fault-specific gaps without distorting inter-class separability, while uniform source-target mixing introduces distributional noise into the data-abundant Normal class. These limitations stem from treating a sequential, state-dependent alignment problem as a one-shot optimization. Each corrective transformation simultaneously reshapes all class distributions, creating state dependencies that static gradient descent cannot resolve. We formulate feature alignment as a continuous-action Markov decision process solved via Proximal Policy Optimization, where the learned policy issues fault-type-specific affine corrections responsive to the current feature-space configuration, with a dual-objective reward balancing gap minimization against separability preservation. An asymmetry-aware strategy reserves real data for the Normal class while augmenting fault classes with policy-aligned simulated samples. Validation across XJTU-SY, CWRU, and a self-built slewing bearing testbed confirms the dominant gain from reinforcement learning-driven alignment, and cross-equipment linear probing achieves 92.8% without encoder retraining, demonstrating transferable monitoring capability.
中文摘要 基于振动的旋转机械健康监测需要在运行数据限制下可靠地故障诊断，但状态评估仍面临故障事件结构稀缺性和数字孪生信号中模拟到实的异构缺口。每种故障类型产生具有不同周期性、幅度调制和频谱特征的脉冲，使得特征空间差异在不同故障类别间根本性地异质化。现有的域适应方法采用了类无关的全局变换，无法在不扭曲类间可分离性的情况下弥合所有故障特异性，而均匀的源-目标混合则会向数据丰富的正规类引入分布噪声。这些局限源于将顺序、状态依赖的比对问题视为一次性优化。每次修正变换同时重塑所有类别分布，产生静态梯度下降无法解决的状态依赖关系。我们将特征比对表述为通过近端策略优化解决的连续动作马尔可夫决策过程，其中学习策略对当前特征空间配置响应故障类型特定仿射修正，同时实现双目标奖励平衡差距最小化与可分离性保持。不对称感知策略将真实数据保留给正常类，同时用策略对齐的模拟样本补充故障类。跨越XJTU-SY、CWRU及自建旋转轴承测试平台的验证证实，增强学习驱动的对准获得主导增益，跨设备线性探测在无编码器重新训练的情况下实现92.8%，展示了可转移的监控能力。

Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models

迈向可扩展的多任务强化学习，采用大型决策模型

Authors: Thibaut Kulak
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.24962
Pdf link: https://arxiv.org/pdf/2606.24962
Abstract Recent progress in large-scale sequence modeling has shown that a single model can learn useful representations across highly diverse data distributions. Inspired by these advances, we investigate whether a unified transformer policy can be trained across large collections of heterogeneous reinforcement learning environments. We introduce LDM-v0, a Large Decision Model trained offline on trajectories collected from thousands of environments spanning multiple domains and modalities. LDM-v0 is a multi-task, multi-modal transformer policy conditioned on histories of observations, actions, rewards, and termination signals, and trained through supervised next-action prediction over offline trajectories. We describe the environment infrastructure, automated data generation pipeline, model architecture, and training methodology used to build LDM-v0, and evaluate its performance across diverse environments. We show that a single pretrained model matches the performance of independently trained task-specific reference policies on approximately 1,000 environments including robotics, autonomous driving, inventory management, cybersecurity, trading, and video games. These results demonstrate the feasibility of large-scale offline pretraining across heterogeneous reinforcement learning environments using a single transformer policy.
中文摘要 大规模序列建模的最新进展表明，单个模型能够在高度多样化的数据分布中学习有用的表示。受这些进展启发，我们研究统一的变换器策略是否可以跨越大量异构强化学习环境进行训练。我们介绍LDM-v0，一个基于数千个跨领域和模态环境中收集轨迹的离线大型决策模型。LDM-v0 是一种多任务、多模态变换器策略，基于观测、动作、奖励和终止信号的历史，并通过监督下的下一个动作预测训练离线轨迹。我们介绍了构建LDM-v0所使用的环境基础设施、自动化数据生成流水线、模型架构和训练方法，并评估其在不同环境中的性能。我们证明，单个预训练模型在约1000个环境（包括机器人、自动驾驶、库存管理、网络安全、交易和视频游戏）上，能够与独立训练的任务特定参考策略的性能匹配。这些结果展示了利用单一变换器策略在异构强化学习环境中进行大规模离线预训练的可行性。

Uncertainty-aware reinforcement learning for chemical language models

化学语言模型的不确定性感知强化学习

Authors: Borja Medina, Jon Paul Janet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24990
Pdf link: https://arxiv.org/pdf/2606.24990
Abstract Reinforcement Learning (RL) has become a powerful paradigm for de novo molecular design, enabling Chemical Language Models (CLMs) to navigate and explore the chemical space while optimizing specific desired properties. However, the existing RL frameworks treat all scoring functions as deterministic oracles, neglecting the inherent uncertainty attached to the predictions of the different molecular properties. This can lead to the exploration of highly-uncertain regions of the chemical space, focusing on the generation of highly scored molecules which are poorly supported by the training data. This can destabilize the optimization process, yielding predictions that are far from their true values. We propose and compare two complementary ways of incorporating predictive uncertainty into RL. In the first one, uncertainty is treated as an additional optimization objective and incorporated along with the rest of the scoring functions, allowing the policy to trade off exploitation against reliability. Secondly, uncertainty is used to modulate policy updates, reducing the influence of molecules whose properties lie far outside the scoring function confidence domain. Both approaches were evaluated across three different settings: (i) a controlled model system, in which the prediction error is modeled as a Gaussian distribution, with a variance proportional to the distance to the training data; and two real-world tasks, making use of (ii) ChemProp models and (iii) a Conformal Prediction wrapper applied to a Random forest classifier. We show that uncertainty-aware RL enables CLMs to explore chemical space more robustly by favoring lower-uncertainty regions. This leads to more reliable hit discovery without compromising molecular score, increasing the true hit rate by 0.25 (from 0.5 to 0.75), and nearly doubling the total number of true hits.
中文摘要 强化学习（RL）已成为新生分子设计的强大范式，使化学语言模型（CLMs）能够在优化特定期望性质的同时，导航和探索化学空间。然而，现有的强化学习框架将所有评分函数视为确定性预言机，忽视了不同分子性质预测本身的固有不确定性。这可能导致对化学空间中高度不确定的区域的探索，重点是生成训练数据支持不足的高分分子。这会破坏优化过程，导致预测值远离真实值。我们提出了并比较两种互补的方式，将预测不确定性纳入强化学习。在第一种情况下，不确定性被视为额外的优化目标，并与其他评分函数一同整合，使策略能够在利用性与可靠性之间进行权衡。其次，不确定性被用来调节政策更新，减少那些属性远超出评分函数置信域的分子的影响。这两种方法在三种不同环境中进行了评估：（i）受控模型系统，其中预测误差以高斯分布建模，方差与训练数据的距离成正比;以及两个实际任务，分别使用（ii）ChemProp模型和（iii）应用于随机森林分类器的共形预测包装器。我们证明，不确定性感知的强化学习使CLM能够通过优先选择低不确定性区域，更有效地探索化学空间。这使得更可靠的命中率发现，同时不影响分子评分，真实命中率提高了0.25（从0.5提升到0.75），真实命中总数几乎翻倍。

Solving Markov Decision Processes with Future Information via MPC

通过MPC解决马尔可夫决策过程与未来信息

Authors: Shambhuraj Sawant, Akhil S Anand, Dirk Reinhardt, Sebastien Gros
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.24991
Pdf link: https://arxiv.org/pdf/2606.24991
Abstract Model Predictive Control (MPC) is widely used in industrial and robotic systems for enforcing constraints and embedding domain knowledge through finite-horizon optimization-based planning. However, despite these strengths, an MPC scheme typically does not yield optimal policies for sequential decision-making problems formulated as Markov Decision Processes (MDPs). Recent combinations of MPC with Reinforcement Learning (RL) alleviate this issue by treating MPC as a parameterized model of the optimal policy of an MDP and adjusting its parameters using data. While these approaches typically consider classical MDPs, many real-world problems include future information--such as forecasts, prices, or reference trajectories--at decision time, which must be included in the MDP state for optimal decision-making. Current MPC-RL approaches do not directly account for this augmented-state structure, raising the question of how to incorporate future information into MPC to obtain an optimal policy. This work establishes the structural requirements under which a parameterized MPC can exactly represent the optimal value functions and policy of an MDP with future information. We further demonstrate that such a parameterized MPC can serve as a structured function approximator, with its parameters learned using RL. The approach is illustrated on a point-mass racing task with future reference information.
中文摘要 模型预测控制（MPC）广泛应用于工业和机器人系统，通过基于有限视界的优化规划来执行约束和嵌入领域知识。然而，尽管有这些优势，MPC方案通常无法为以马尔可夫决策过程（MDP）形式表述的顺序决策问题提供最优策略。近期MPC与强化学习（RL）的结合通过将MPC视为MDP最优策略的参数化模型，并利用数据调整其参数，缓解了这一问题。虽然这些方法通常考虑经典MDP，但许多现实问题在决策时包含未来信息——如预测、价格或参考轨迹——必须包含在MDP状态中以实现最佳决策。当前的MPC-RL方法并未直接考虑这种增强状态结构，这引发了如何将未来信息纳入MPC以获得最佳策略的问题。本研究确立了参数化MPC在未来信息下能够精确表示MDP的最优值函数和策略的结构要求。我们还进一步证明了这样的参数化MPC可以作为结构化函数近似器，其参数通过强化学习获得。该方法在点质量竞赛任务中展示，并附有未来参考信息。

ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

ExTra：语言模型强化学习的探索性轨迹优化

Authors: Wenyang Hu, Junxiang Jia, Zhen Shu, Daniel Dahlmeier, See-Kiong Ng, Bryan Kian Hsiang Low
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24994
Pdf link: https://arxiv.org/pdf/2606.24994
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal, while hard prompts can produce all-incorrect groups with no positive reward. We introduce ExTra (Exploratory Trajectory Optimization), a GRPO-compatible framework that extracts exploration signals from the model's own rollouts. ExTra combines two mechanisms: (i) a novelty reward that adds embedding-based diversity bonuses after GRPO normalization, rewarding diverse correct solutions; and (ii) entropy-guided prefix regeneration, which scores partial trajectories using entropy signals and continues exploration from promising intermediate steps. Across six mathematical reasoning benchmarks, ExTra improves Qwen3-1.7B over GRPO by about +5 points on pass@1 and +7 points on pass@16, showing that trajectory-level exploration signals can improve both single-sample accuracy and inference-time coverage.
中文摘要 带可验证奖励的强化学习（RLVR）在任务难度的两个极端都可能失败：简单提示通常产生全正确、低多样性且梯度信号较小的展开组，而困难提示则可能产生全错且无正面奖励的组。我们引入ExTra（探索轨迹优化），这是一个兼容GRPO的框架，能够从模型自身的部署中提取探索信号。ExTra 结合了两种机制：（i）新颖奖励，在 GRPO 归一化后增加基于嵌入的多样性加值，奖励多样化的正确解;以及（ii）由熵引导的前缀再生，利用熵信号对部分轨迹进行评分，并从有前景的中间步骤继续探索。在六个数学推理基准测试中，ExTra在pass@1上将Qwen3-1.7B提升了约+5个点，pass@16上提升了+7个百分点，表明轨迹级探索信号可以提升单样本准确性和推断时间覆盖率。
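
The failure mode described in this entry, where an all-correct rollout group carries no gradient signal, follows directly from group normalization, and a diversity bonus added after normalization can restore a signal. The sketch below illustrates that idea only. The abstract does not give ExTra's formula, so the novelty measure (mean cosine distance to the other rollouts), the weight `beta`, and all function names are assumptions. Embeddings are assumed to be non-zero vectors.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def novelty_bonus(embeddings, correct, beta=0.1):
    """Assumed bonus: beta * mean distance to the other rollouts, correct ones only."""
    n = len(embeddings)
    bonuses = []
    for i in range(n):
        if not correct[i] or n < 2:
            bonuses.append(0.0)
            continue
        mean_dist = sum(
            cosine_distance(embeddings[i], embeddings[j]) for j in range(n) if j != i
        ) / (n - 1)
        bonuses.append(beta * mean_dist)
    return bonuses


def advantages_with_novelty(rewards, embeddings, beta=0.1):
    """Add the diversity bonus after normalization, so it is not rescaled away."""
    base = grpo_advantages(rewards)
    correct = [r > 0 for r in rewards]
    bonus = novelty_bonus(embeddings, correct, beta)
    return [a + b for a, b in zip(base, bonus)]


# Hypothetical all-correct group: two near-identical solutions and one distinct one.
rewards = [1.0, 1.0, 1.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
plain = grpo_advantages(rewards)
shaped = advantages_with_novelty(rewards, embeddings)
```

In this toy group the plain advantages are all zero, while the shaped advantages favor the third rollout, the one whose embedding differs from the rest.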

A Zeroth-Order Deep Learning Method for Fully Nonlinear Parabolic Partial Differential Equations with Unknown Coefficients

一种用于完全非线性抛物型偏微分方程且系数未知的零阶深度学习方法

Authors: Yanwei Jia, Du Ouyang, Huyên Pham, Xun Yu Zhou
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.24999
Pdf link: https://arxiv.org/pdf/2606.24999
Abstract High-dimensional partial differential equations (PDEs) with unknown coefficients arise widely in scientific machine learning, including continuous-time reinforcement learning, yet solving them efficiently in a data-driven way remains challenging. Existing deep learning solvers often rely on repeated automatic differentiation to evaluate differential operators, which can cause instability and amplify derivative errors in high dimensions, while probabilistic methods based on stochastic representations require explicit knowledge of the data-generating dynamics and therefore do not apply to black-box environments. We introduce two types of simulators as data-generating mechanisms, and take a "representing-then-learning" approach that learns the solutions and their derivatives under settings where the underlying PDE operators are accessible only through simulations and pointwise evaluations. Our representation of derivatives relies on the zeroth-order derivative (ZOD) estimators derived from perturbed Monte Carlo trajectories. This fully model-free approach generates targets for the gradient and Hessian networks using only function evaluations. We provide a statistical learning analysis of the proposed approach, including a bias--variance tradeoff for ZODs. Assuming a standard contraction property of the underlying operator, we establish a non-asymptotic error bound that decomposes the total error into discretization error, approximation error, statistical error, and ZOD bias. Crucially, we derive the sample complexity of the learned representations in (weighted) Sobolev space, characterizing the error up to second-order derivatives. Numerical experiments illustrate the competitive performance of the method in moderate and high dimensions.
中文摘要 高维偏微分方程（PDE）在科学机器学习中广泛出现，包括连续时间强化学习，但以数据驱动的方式高效求解仍然具有挑战性。现有的深度学习求解器通常依赖反复自动微分来评估微分算子，这可能导致高维时的不稳定性并放大导数误差;而基于随机表示的概率方法则需要对数据生成动态有明确了解，因此不适用于黑箱环境。我们引入了两种类型的模拟器作为数据生成机制，并采用“先表示后学习”的方法，在仅通过模拟和点评测能够访问底层偏微分方程算子的环境中学习解及其导数。我们对导数的表示依赖于从受扰动蒙特卡洛轨迹推导出的零阶导数（ZOD）估计量。这种完全无模型的方法仅通过函数评估生成梯度网络和黑森网络的目标。我们对所提方法进行了统计学习分析，包括对ZODs的偏差-方差权衡。假设底层算符具有标准收缩性质，我们建立一个非渐近误差界限，将总误差分解为离散化误差、近似误差、统计误差和ZOD偏差。关键是，我们推导了（加权）索博列夫空间中所学表示的样本复杂度，并将误差描述为二阶导数。数值实验展示了该方法在中高维度下的竞争性能。
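
To make the phrase "targets ... using only function evaluations" concrete, the following sketch shows a generic antithetic Gaussian-smoothing gradient estimator. It is a standard zeroth-order construction and is offered only as an illustration: the paper's ZOD estimators are built from perturbed Monte Carlo trajectories and may differ. The function names, `sigma`, and the quadratic test function are assumptions.

```python
import random


def zeroth_order_gradient(f, x, sigma=0.1, samples=20000, seed=0):
    """Estimate grad f(x) as the average of (f(x+sz) - f(x-sz)) / (2s) * z, z ~ N(0, I)."""
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + sigma * zi for xi, zi in zip(x, z)]
        x_minus = [xi - sigma * zi for xi, zi in zip(x, z)]
        # Directional slope along z, obtained from two function evaluations.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * sigma)
        for i in range(dim):
            grad[i] += slope * z[i] / samples
    return grad


def quadratic(x):
    """Test function f(x) = sum of x_i^2, whose exact gradient is 2x."""
    return sum(xi * xi for xi in x)


estimate = zeroth_order_gradient(quadratic, [1.0, -2.0])
```

For a quadratic the antithetic estimator is unbiased, so the only error here is Monte Carlo noise. For general functions a larger `sigma` adds smoothing bias, while a smaller `sigma` amplifies any noise in the function evaluations, which is one way a bias-variance tradeoff of the kind mentioned in the abstract can arise.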

Geo-Strat-RL: Learning Geological Event Reasoning from Verifiable Tasks

Geo-Strat-RL：从可验证任务中学习地质事件推理

Authors: Lukas Mosser
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Arxiv link: https://arxiv.org/abs/2606.25000
Pdf link: https://arxiv.org/pdf/2606.25000
Abstract To evaluate whether vision-language models can reason about geological histories, it is necessary to construct observations for which the underlying process history is known. Furthermore, reasoning over geological histories is not just a question of recognizing visual patterns, but also of understanding temporal and structural relationships that may be only indirectly visible or highly ambiguous. When ground-truth event histories are not uniquely identifiable or are unavailable, it remains an open challenge to teach models capable of visual reasoning to produce valid geological reconstructions that are consistent with both observed evidence and geological principles. We therefore investigate whether defining a verifiable geological reasoning task can improve geological event reconstruction across observation domains through reinforcement learning with verifiable rewards (RLVR). To this end, we present Geo-Strat-RL, a synthetic environment that generates stratigraphic observations and compact visible-evidence event histories. The environment combines a geological generator with an executable verifier that scores chronology, event identity, deposition, and structural relationships. We show that RLVR improves geological reconstruction in vision-language models (VLMs), increasing geological content scores on held out stratigraphic diagrams. We further evaluate the same held-out geological histories in a synthetic seismic observation domain by converting the generated scenes into acoustic-impedance-derived amplitude sections. In this controlled paired-renderer setting, we present evidence that geological reasoning learned from stratigraphic diagram-domain RLVR training transfers to synthetic seismic representations without seismic-specific training examples, supporting the hypothesis that RLVR can teach reusable geological reasoning concepts across related observation formats.
中文摘要 为了评估视觉语言模型是否能够推理地质历史，有必要构建已知其底层过程历史的观测数据。此外，对地质历史的推理不仅仅是识别视觉模式的问题，更是理解那些可能只能间接可见或高度模糊的时间和结构关系。当真实事件历史无法唯一识别或无法获得时，教导具备视觉推理能力的模型，如何产出既符合观察证据又符合地质原理的有效地质重建，仍是一个开放的挑战。因此，我们探讨定义可验证地质推理任务是否能通过带可验证奖励的强化学习（RLVR）改善跨观察域的地质事件重建。为此，我们介绍了Geo-Strat-RL，一种合成环境，能够生成地层观测和紧凑的可见证据事件历史。该环境结合了地质生成器和可执行验证器，后者对年代学、事件身份、沉积和结构关系进行评分。我们表明RLVR能改善视觉语言模型（VLM）中的地质重建，提高保留地层图的地质内容评分。我们进一步通过将生成的场景转换为声阻抗导出的振幅剖面，在合成地震观测领域中评估相同的地质历史。在这种受控的配对渲染器环境中，我们展示了从地层图领域RLVR训练中学到的地质推理能够转移到合成地震表征中，无需地震特有的训练实例，支持RLVR能够跨相关观测格式教授可重复使用的地质推理概念的假说。

Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL

偏倚控制的原始-对偶自然行为者-批评者：受限多目标平均奖励强化学习的最优率

Authors: Ankur Naskar, Swetha Ganesh, Vaneet Aggarwal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.25012
Pdf link: https://arxiv.org/pdf/2606.25012
Abstract Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization, where the agent maximizes a utility $f(J^\pi_{r_1}, \ldots, J^\pi_{r_M})$ subject to a scalarized constraint $g(J^\pi_{c_1}, \ldots, J^\pi_{c_N}) \ge 0$, where $J^\pi_{r_m}$ and $J^\pi_{c_n}$ denote the average reward and cost under policy $\pi$. However, the nonlinearity of $f$ and $g$ introduces bias in policy-gradient and actor-critic methods, since gradients must be evaluated using noisy estimates of $J^\pi$, and $\mathbb{E}[\partial f(J^\pi)] \neq \partial f(\mathbb{E}[J^\pi])$, and this bias propagates through both primal and dual updates. We propose an MLMC-based primal-dual Natural Actor-Critic algorithm for average-reward MDPs that controls bias in scalarized objectives, constraint evaluation, and actor-critic estimation without requiring mixing-time knowledge. We show that the algorithm achieves optimal global convergence and constraint-violation rates of $\tilde{O}(1/\sqrt{T})$. To our knowledge, this is the first result establishing optimal convergence for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints, and the first to do so without mixing-time information even in the absence of scalarization.
中文摘要 许多在无限视野平均奖励设置下的强化学习（RL）问题需要在满足多个安全约束的同时优化多个冲突目标。一种常见方法是凹标量化，其中代理在标量化约束 $g(J^\pi_{c_1}, \ldots, J^\pi_{c_N}) \ge 0$ 下最大化效用 $f(J^\pi_{r_1}, \ldots, J^\pi_{r_M})$，其中 $J^\pi_{r_m}$ 和 $J^\pi_{c_n}$ 表示在策略 $\pi$ 下的平均奖励和成本。然而，$f$ 和 $g$ 的非线性会在策略梯度和 actor-critic 方法中引入偏差，因为梯度必须使用 $J^\pi$ 的噪声估计来计算，而 $\mathbb{E}[\partial f(J^\pi)] \neq \partial f(\mathbb{E}[J^\pi])$，这种偏差会通过原始更新和对偶更新传播。我们提出了一种基于MLMC的原始-对偶自然 actor-critic 算法，用于平均奖励MDP，能够控制标量化目标、约束评估和 actor-critic 估计中的偏差，而无需混合时间知识。我们证明该算法实现了最优的全局收敛率和约束违规率 $\tilde{O}(1/\sqrt{T})$。据我们所知，这是首个在平均奖励设置下（无论有无约束）为凹标量化多目标强化学习建立最优收敛性的结果，也是首个即使不含标量化也无需混合时间信息即可做到这一点的结果。
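
The bias that this entry attributes to nonlinear scalarization can be seen in a small numerical experiment. The sketch below is only an illustration of the inequality $\mathbb{E}[\partial f(\hat{J})] \neq \partial f(\mathbb{E}[\hat{J}])$ and does not implement the paper's MLMC estimator. The choices $f(J) = \log J$, a uniform noise model on $[0.5, 1.5]$, and all names are assumptions made for the example.

```python
import math
import random


def plug_in_gradient(rng, batch):
    """Evaluate d/dJ log(J) = 1/J at the mean of `batch` noisy estimates of J."""
    j_hat = sum(rng.uniform(0.5, 1.5) for _ in range(batch)) / batch
    return 1.0 / j_hat


def mean_plug_in(batch, trials, seed=0):
    """Average the plug-in gradient over many independent trials."""
    rng = random.Random(seed)
    return sum(plug_in_gradient(rng, batch) for _ in range(trials)) / trials


# The noisy estimates have mean 1, so the gradient at the true value is 1 / 1.
true_gradient = 1.0
single = mean_plug_in(batch=1, trials=20000)   # one noisy sample per gradient
batched = mean_plug_in(batch=32, trials=5000)  # average 32 samples first
```

With one sample per evaluation, the plug-in gradient averages to about $\ln 3 \approx 1.10$ instead of 1, a bias of roughly 10 percent that no amount of averaging over trials removes. Averaging more samples before applying the nonlinearity shrinks the bias, at a higher sampling cost per update.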

GCT-MARL: Graph-Based Contrastive Transfer for Sample-Efficient Cooperative Multi-Agent Reinforcement Learning

GCT-MARL：基于图的对比转移用于样本高效合作多代理强化学习

Authors: Animesh Animesh, Satheesh K Perepu, Kaushik Dey
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.25073
Pdf link: https://arxiv.org/pdf/2606.25073
Abstract In cooperative multi-agent reinforcement learning (MARL), from a deployment perspective, it is challenging and expensive to train agents from scratch for each new environment or task. In this work, we propose GCT-MARL, a transfer learning framework that builds on the multi-view graph contrastive backbone of MAIL and augments it with a per-view, adaptively weighted alignment loss and a two-phase training protocol specifically designed for transfer across populations of varying sizes and compositions. We empirically demonstrate that the proposed framework markedly accelerates convergence on the target task relative to from-scratch training, in both homogeneous (within-faction, varying N) and heterogeneous (cross-faction and mixed unit-type) transfer scenarios. Furthermore, we show that the framework naturally supports continual learning by sequentially chaining the two-phase transfer protocol across a series of related tasks. Overall, this work provides a unified approach to mitigating key limitations in current MARL transfer methods with new insights at both methodological and empirical levels.
中文摘要 在协作多智能体强化学习（MARL）中，从部署角度看，从零开始训练每个新环境或任务的代理既具有挑战性又昂贵。本研究提出GCT-MARL，一种转移学习框架，基于MAIL的多视图图对比骨干，并补充了每视角、自适应加权比对丢失和专为跨不同规模和组成种群迁移设计的两阶段训练协议。我们实证证明，所提出的框架相较于从零开始训练，显著加速了目标任务的趋同，无论是在均质（派系内，变化 N 个体）还是异质（跨派系和混合单位类型）转移场景中。此外，我们证明该框架通过连续链式链式连接两阶段传输协议，自然支持持续学习。总体而言，这项工作为缓解当前MARL转移方法的关键局限性提供了统一方法，并在方法论和实证层面都带来了新的见解。

Energy Efficient Scheduling of AI/ML Workloads on Multi Instance GPUs with Dynamic Repartitioning

多实例GPU上的AI/ML工作负载节能调度，采用动态重分

Authors: Ellie Lipe, Neel Karia, Connor Espenshade, Clifford Stein, Asser Tantawi, Olivier Tardieu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.25082
Pdf link: https://arxiv.org/pdf/2606.25082
Abstract Increasing demand from AI/ML workloads is exacerbating the rising energy consumption of data centers. Recent advances in hardware such as NVIDIA's Multi Instance GPUs (MIGs) offer improvements in flexibility and computational power and the opportunity for data centers to manage incoming jobs in energy-efficient ways, while maintaining acceptable performance. The challenge in achieving this multi-objective in a MIG environment through job scheduling is multi-faceted. Firstly, for a given MIG configuration, one seeks an easy-to-implement scheduling algorithm which selects a job from the queue as well as decides on which slice in the configuration the job runs. Secondly, for the identified scheduling algorithm, a particular MIG configuration may not always be suitable (as the workload fluctuates) and may need to be repartitioned. We tackle both problems using simulations and reinforcement learning (RL). We present a dynamic repartitioning scheduling framework for a single MIG as a solution to a multi-objective heterogeneous machine scheduling problem with preemption. In particular, we compare four scheduling algorithms and identify a promising one. Then, we employ reinforcement learning to perform dynamic repartitioning over a day. Furthermore, using a diurnal workload pattern based on real-world data center traces, we demonstrate the superiority of our dynamic repartitioning algorithm over twice-daily repartitioning ($26\%$), static partitioning ($31\%$) and no partitioning at all ($68\%$) according to a multi-objective function of energy consumption and tardiness. Our results indicate specific preferred configurations at different times of the day under different queue conditions, suggesting a policy for predictive and automatic reconfiguration.
中文摘要 对AI/ML工作负载需求的增加加剧了数据中心能源消耗的上升。硬件的最新进展，如英伟达的多实例GPU（MIG），提升了灵活性和计算能力，并为数据中心提供了高效管理新任务的机会，同时保持可接受的性能。在MIG环境中通过作业调度实现这一多目标的挑战是多方面的。首先，对于给定的MIG配置，需要一种易于实现的调度算法，该算法不仅能从队列中选择作业，还决定该作业在配置中运行的哪个片。其次，对于所识别的调度算法，某些MIG配置可能并不总是适用（因为工作负载会波动），可能需要重新分区。我们通过模拟和强化学习（RL）来解决这两个问题。我们提出了一个针对单个MIG的动态重分排程框架，作为多目标异构机器调度问题的解决方案，且带有抢占。特别地，我们比较了四种调度算法，并找出一个有前景的算法。然后，我们采用强化学习，在一天内进行动态重新分配。此外，我们利用基于真实数据中心追踪的日间工作负载模式，展示了动态重分算法相较于每天两次重划分（$26\%$）、静态分区（$31\%$）和完全不分区（$68\%$），这些都基于能耗和延迟的多目标函数。我们的结果表明，在一天中不同时间段、不同的队列条件下，存在特定的首选配置，这表明了预测和自动重配置的策略。

RGB: RL Guided Whole-Body MPPI for Humanoid Control

RGB：用于人形控制的强化全身MPPI

Authors: Yunsoo Seo, Sol Choi, Euncheol Im, Myo Taeg Lim, Yisoo Lee
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25123
Pdf link: https://arxiv.org/pdf/2606.25123
Abstract Humanoid robots require whole-body controllers that are both robust and precise in contact-rich environments. While deep reinforcement learning (RL) achieves robust stability, its behavior is tightly coupled to the training objective and command interface, making it difficult to add new feedback objectives without retraining. In this study, we propose an RL-guided whole-body model predictive path integral (MPPI) framework that acts as an add-on feedback controller on top of a pretrained RL policy. Instead of using the RL policy as the final controller, we use it as a sampling prior that biases MPPI rollouts toward dynamically feasible behaviors. Task objectives are specified through modular MPPI cost terms, and MPPI closes the loop by continuously correcting the RL prior online to satisfy these objectives without retraining the policy. Simulations on a 29-DoF Unitree G1 humanoid in MuJoCo demonstrate stable high-rate control (average 280 Hz). The proposed method improves task-level precision over a pure RL baseline under the same command interface. This is achieved by correcting systematic drift during straight walking and by tracking additional whole-body reference signals imposed through the cost.
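Abstract (translated from Chinese) Humanoid robots need whole-body controllers that are both robust and precise in contact-rich environments. Deep reinforcement learning (RL) achieves robust stability, but its behavior is tightly coupled to the training objective and command interface, so adding new feedback objectives without retraining is difficult. This study proposes an RL-guided whole-body model predictive path integral (MPPI) framework that serves as an add-on feedback controller on top of a pretrained RL policy. Rather than using the RL policy as the final controller, we use it as a sampling prior that biases MPPI rollouts toward dynamically feasible behaviors. Task objectives are specified through modular MPPI cost terms, and MPPI closes the loop by continuously correcting the RL prior online to meet these objectives without retraining the policy. Simulations of a 29-DoF Unitree G1 humanoid in MuJoCo show stable high-rate control (280 Hz on average). Under the same command interface, the method improves task-level precision over a pure RL baseline by correcting systematic drift during straight walking and by tracking additional whole-body reference signals imposed through the cost.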
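The idea of using a policy as a sampling prior for MPPI can be sketched on a toy problem. The snippet below is a minimal illustration under stated assumptions, not the authors' controller: `prior_policy` is a hypothetical stand-in for a pretrained RL policy (given a deliberate drift), the dynamics are a 1-D integrator, and the horizon, sample count, noise scale `sigma`, and temperature `lam` are arbitrary choices.

```python
import math
import random


def prior_policy(x):
    # Hypothetical stand-in for a pretrained RL policy: it settles at x = 0.3
    # instead of 0, mimicking a systematic drift.
    return -x + 0.3


def step(x, u, dt=0.05):
    # Toy 1-D integrator dynamics (an assumption, not a humanoid model).
    return x + u * dt


def mppi_action(x0, cost_fn, rng, horizon=10, samples=128, sigma=0.5, lam=0.1):
    """Return the first control of an MPPI update sampled around the prior."""
    costs, first_controls = [], []
    for _ in range(samples):
        x, total = x0, 0.0
        for t in range(horizon):
            # Sample around the prior's action instead of around zero.
            u = prior_policy(x) + rng.gauss(0.0, sigma)
            if t == 0:
                first_controls.append(u)
            x = step(x, u)
            total += cost_fn(x)
        costs.append(total)
    best = min(costs)
    # Exponential weighting: low-cost rollouts dominate the average.
    weights = [math.exp(-(c - best) / lam) for c in costs]
    return sum(w * u for w, u in zip(weights, first_controls)) / sum(weights)


def run(controller, x0=0.3, steps=100):
    # Closed-loop simulation with the given controller.
    x = x0
    for _ in range(steps):
        x = step(x, controller(x))
    return x


def task_cost(x):
    # Modular cost term added without retraining the prior: stay at x = 0.
    return x * x


rng = random.Random(0)
x_prior = run(prior_policy)
x_mppi = run(lambda x: mppi_action(x, task_cost, rng))
print(f"prior only: {x_prior:.3f}, prior + MPPI: {x_mppi:.3f}")
```

In this toy setting the prior alone stays at its drifted equilibrium, while the MPPI layer pulls the state toward the target encoded in `task_cost`. Swapping in a different cost function changes the behavior without touching `prior_policy`, which is the property the abstract emphasizes.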

Reward-Conditioned Attention: How Reward Design Shapes What Autonomous Driving Agents See

奖励条件注意力：奖励设计如何影响自动驾驶智能体所见

Authors: Mohamed Benabdelouahad, Ahmed Djalal Hacini, Nadir Farhi, Aissa Boulmerka
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.25127
Pdf link: https://arxiv.org/pdf/2606.25127
Abstract We investigate how reward design shapes the internal attention patterns of reinforcement learning agents trained for autonomous driving. Using three Perceiver-based agents that share identical architectures and training data but differ only in their reward configurations (ranging from basic violation penalties to continuous proximity penalties), we analyze cross-attention allocation across 50 real-world scenarios from the Waymo Open Motion Dataset. A central methodological finding is that naïve pooling of timesteps across episodes substantially underestimates the attention-risk relationship; within-episode correlation with Fisher z-transform aggregation is the appropriate statistic and reveals a robustly positive link between collision risk and agent-directed attention. Building on this validated methodology, we demonstrate two reward-conditioned effects. First, agents trained with navigation rewards allocate up to $2.0\times$ more attention to GPS-path tokens than those trained with additional proximity penalties, and $4.7\times$ more than agents with no navigation incentive, revealing that reward content directly determines which scene elements the encoder prioritizes. Second, continuous time-to-collision penalties create a "learned vigilance prior": elevated resting agent surveillance maintained throughout collision-free phases. In several scenarios, the complete-reward and minimal-reward models exhibit opposite attention-risk correlation directions, demonstrating that reward design can qualitatively reverse attentional strategy rather than merely modulating its magnitude. These results suggest that attention analysis is a practical diagnostic for verifying that a reward function produces the intended representational behaviour in safety-critical RL systems.
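Abstract (translated from Chinese) We study how reward design shapes the internal attention patterns of reinforcement learning agents trained for autonomous driving. Using three Perceiver-based agents that share the same architecture and training data and differ only in their reward configurations, which range from basic violation penalties to continuous proximity penalties, we analyze cross-attention allocation across 50 real-world scenarios from the Waymo Open Motion Dataset. A central methodological finding is that naive pooling of timesteps across episodes substantially underestimates the attention-risk relationship; within-episode correlation aggregated with the Fisher z-transform is the appropriate statistic and reveals a robustly positive link between collision risk and agent-directed attention. Building on this validated methodology, we demonstrate two reward-conditioned effects. First, agents trained with navigation rewards allocate up to $2.0\times$ more attention to GPS-path tokens than agents trained with additional proximity penalties, and $4.7\times$ more than agents with no navigation incentive, showing that reward content directly determines which scene elements the encoder prioritizes. Second, continuous time-to-collision penalties create a "learned vigilance prior": elevated resting agent surveillance that is maintained throughout collision-free phases. In several scenarios, the complete-reward and minimal-reward models show opposite attention-risk correlation directions, indicating that reward design can qualitatively reverse attentional strategy rather than merely modulate its magnitude. These results suggest that attention analysis is a practical diagnostic for verifying that a reward function produces the intended representational behavior in safety-critical RL systems.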
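The contrast between pooled and within-episode correlation can be illustrated with a small sketch. It assumes an unweighted mean of Fisher z-values across episodes (the abstract does not state the weighting), and the two toy episodes and the function names are made up for illustration.

```python
import math


def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def pooled_r(episodes):
    # Naive approach: concatenate all timesteps, then correlate once.
    risk = [v for r, _ in episodes for v in r]
    attn = [v for _, a in episodes for v in a]
    return pearson(risk, attn)


def within_episode_r(episodes):
    # Correlate inside each episode, average in Fisher z-space, map back.
    zs = []
    for risk, attn in episodes:
        r = pearson(risk, attn)
        r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
        zs.append(math.atanh(r))
    return math.tanh(sum(zs) / len(zs))


# Toy data (assumed): attention rises with risk inside each episode,
# but the episodes have different baselines.
episodes = [
    ([0.1, 0.2, 0.3], [0.5, 0.6, 0.7]),
    ([0.6, 0.7, 0.8], [0.1, 0.2, 0.3]),
]
print("pooled:", pooled_r(episodes))
print("within-episode:", within_episode_r(episodes))
```

On this toy data the pooled statistic is negative even though attention tracks risk perfectly inside each episode, because between-episode baseline differences dominate the pooled sample. This is an extreme version of the underestimation the abstract describes.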

TRUSTMEM: Learning Trustworthy Memory Consolidation for LLM Agents with Long-Term Memory

TRUSTMEM：为具备长期记忆的LLM代理学习可信记忆巩固

Authors: Tianyu Yang, Sudipta Paul, Vijay Srinivasan, Vivek Kulkarni, Srinivas Chappidi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25161
Pdf link: https://arxiv.org/pdf/2606.25161
Abstract Large language model (LLM) agents rely on long-term memory to support extended interactions and personalized assistance beyond finite context windows. Existing memory agents actively update external memory through generated write, revise, and delete operations, but these updates may omit important information, corrupt existing memory, or introduce unsupported hallucinated content. Once stored, such errors become persistent system-state failures that can affect future reasoning and generation. In this paper, we propose TrustMem, a framework designed to improve the trustworthiness of memory consolidation. TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, enabling preference-guided reinforcement learning to directly optimize memory updating behaviors. Extensive experiments demonstrate that TrustMem improves both memory utility and reliability: it achieves state-of-the-art results across MemoryAgentBench, HaluMem, and the Mem-alpha validation set, improves HaluMem memory extraction by 12.14 F1 points, and reduces transition-level omission, corruption, and hallucination by 40.1%, 79.1%, and 50.0%, respectively, compared with the strongest baseline for each error type.
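Abstract (translated from Chinese) Large language model (LLM) agents rely on long-term memory to support extended interactions and personalized assistance beyond finite context windows. Existing memory agents actively update external memory through generated write, revise, and delete operations, but these updates can omit important information, corrupt existing memory, or introduce unsupported hallucinated content. Once stored, such errors become persistent system-state failures that affect future reasoning and generation. This paper proposes TrustMem, a framework designed to improve the trustworthiness of memory consolidation. TrustMem relies on a Memory Transition Verifier to evaluate the transition process of memory updates in terms of coverage, preservation, and faithfulness. It further constructs preference pairs among candidate updates under the same memory state, so that preference-guided reinforcement learning can directly optimize memory-updating behavior. Extensive experiments show that TrustMem improves both memory utility and reliability: it achieves state-of-the-art results on MemoryAgentBench, HaluMem, and the Mem-alpha validation set, improves HaluMem memory extraction by 12.14 F1 points, and reduces transition-level omission, corruption, and hallucination by 40.1%, 79.1%, and 50.0%, respectively, compared with the strongest baseline for each error type.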
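As a rough illustration of building preference pairs from verifier scores, the sketch below ranks candidate updates proposed for the same memory state. Everything concrete here is an assumption: the abstract names the three criteria but not how they are combined, so the equal-weight mean in `verifier_score`, the `margin` filter, and the dictionary layout are hypothetical.

```python
from itertools import combinations


def verifier_score(candidate):
    # Assumed aggregation: equal-weight mean of the three criteria.
    return (
        candidate["coverage"]
        + candidate["preservation"]
        + candidate["faithfulness"]
    ) / 3.0


def preference_pairs(candidates, margin=0.1):
    """Return (chosen, rejected) update ids for one shared memory state."""
    pairs = []
    for a, b in combinations(candidates, 2):
        score_a, score_b = verifier_score(a), verifier_score(b)
        # Skip near-ties so the preference signal stays unambiguous.
        if abs(score_a - score_b) < margin:
            continue
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append((chosen["update"], rejected["update"]))
    return pairs


# Hypothetical candidates for the same memory state.
candidates = [
    {"update": "A", "coverage": 0.9, "preservation": 0.9, "faithfulness": 0.9},
    {"update": "B", "coverage": 0.5, "preservation": 0.9, "faithfulness": 0.4},
    {"update": "C", "coverage": 0.9, "preservation": 0.8, "faithfulness": 0.9},
]
print(preference_pairs(candidates))
```

Here updates A and C both beat B, while the A versus C comparison is dropped as a near-tie. Pairs of this form are the usual input to preference-based objectives.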

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

通用推理的可迁移性：多领域RLVR自动化课程

Authors: Yongjin Yang, Jiarui Liu, Yinghui He, Lezhen Zhang, Bernhard Schölkopf, Zhijing Jin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25178
Pdf link: https://arxiv.org/pdf/2606.25178
Abstract Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
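Abstract (translated from Chinese) Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is usually fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving but are blind to whether a gradient step on the selected domain benefits the other domains. This paper proposes the Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC reuses signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients taken from the GRPO step being computed estimate cross-domain transferability through gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). On a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving on the last of these by up to 2.8 points (10% relative). Ablations show that performance drops sharply when the transferability term is removed, and that TAC stays robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.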
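One way to picture combining learnability with gradient alignment is the sketch below. It is not TAC itself: the abstract does not give the scoring rule, so the mean absolute advantage as a learnability proxy, the mean cosine similarity as a transferability proxy, the mixing weight `beta`, and the softmax sampling are all assumptions, and a full bandit would also track estimates over time.

```python
import math


def cosine(u, v):
    # Cosine similarity between two (projected) gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def domain_scores(advantages, grads, beta=1.0):
    """Score each domain by learnability plus alignment with other domains."""
    scores = {}
    for d in grads:
        learnability = sum(abs(a) for a in advantages[d]) / len(advantages[d])
        aligned = [cosine(grads[d], grads[o]) for o in grads if o != d]
        transferability = sum(aligned) / len(aligned)
        scores[d] = learnability + beta * transferability
    return scores


def sampling_probs(scores, temperature=1.0):
    # Softmax over scores, shifted by the max for numerical stability.
    top = max(scores.values())
    exps = {d: math.exp((s - top) / temperature) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


# Hypothetical low-dimensional gradient projections and advantages.
grads = {"math": [1.0, 0.0], "code": [0.9, 0.1], "trivia": [-1.0, 0.0]}
advantages = {d: [0.5, -0.5] for d in grads}
probs = sampling_probs(domain_scores(advantages, grads))
print(probs)
```

With equal learnability, the domain whose gradient opposes the others ("trivia" in this made-up example) receives the lowest sampling probability, which is the qualitative behavior a transfer-aware term is meant to add.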

Learning Perceptive Platform Adaptive Locomotion Controllers for Quadrupedal Robots

学习四足机器人的感知平台自适应运动控制器

Authors: David Rytz, Kim Tien Ly, Ioannis Havoutis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25179
Pdf link: https://arxiv.org/pdf/2606.25179
Abstract Universal quadrupedal locomotion remains limited by the difficulty of integrating perception across diverse robot morphologies. State-of-the-art controllers rely on single-robot training or blind policies that omit real-time perception, leading to poor cross-embodiment generalization. Designing locomotion policies that remain robust across related quadruped morphologies while incorporating perception is challenging. Moreover, fully perceptive policies are often sensitive to noise, whereas blind controllers lack terrain awareness. In this work, we study how perception should be integrated into morphology-aware reinforcement learning architectures for deployable quadrupedal control. Building on MorAL, we train morphology-specialized universal controllers on multiple reference quadrupeds using adaptive terrain curricula. We compare a blind baseline, a critic-perceptive variant (MorAL+), and a fully perceptive actor-critic (PPAL). Policies are evaluated in simulation on flat and rough terrains, and deployed on ANYmal hardware. Results show that critic-only perception improves robustness and tracking consistency over blind baselines while remaining more stable than fully perceptive policies under perception noise. These findings highlight that perception placement and curriculum design are key factors for scalable, morphology-aware locomotion.
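Abstract (translated from Chinese) Universal quadrupedal locomotion remains limited by the difficulty of integrating perception across diverse robot morphologies. State-of-the-art controllers rely on single-robot training or on blind policies that omit real-time perception, which leads to poor cross-embodiment generalization. Designing locomotion policies that stay robust across related quadruped morphologies while incorporating perception is challenging. Moreover, fully perceptive policies are often sensitive to noise, whereas blind controllers lack terrain awareness. This work studies how perception should be integrated into morphology-aware reinforcement learning architectures for deployable quadrupedal control. Building on MorAL, we train morphology-specialized universal controllers on multiple reference quadrupeds using adaptive terrain curricula. We compare a blind baseline, a critic-perceptive variant (MorAL+), and a fully perceptive actor-critic (PPAL). Policies are evaluated in simulation on flat and rough terrain and deployed on ANYmal hardware. The results show that critic-only perception improves robustness and tracking consistency over blind baselines while remaining more stable than fully perceptive policies under perception noise. These findings highlight perception placement and curriculum design as key factors for scalable, morphology-aware locomotion.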

SoK: AI Secure Code Generation: Progress, Pitfalls, and Paths Forward

SoK：AI安全代码生成：进展、陷阱与前进路径

Authors: Rupam Patir, Keyan Guo, Haipeng Cai, Hongxin Hu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25195
Pdf link: https://arxiv.org/pdf/2606.25195
Abstract The increasing use of AI systems for code generation raises a central security question: what can today's models and coding agents actually do to produce secure code, where do they still fail, and what would move the field forward? Existing work has explored prompting, fine-tuning, reinforcement learning, and agentic workflows for secure code generation, but the field still lacks a systematic understanding of how these techniques improve security and why substantial failures persist. In this SoK, we systematize the progress, pitfalls, and paths forward for AI secure code generation. We introduce a three-level framework that measures models' natural-language understanding of secure coding principles, their code-level actuation of those principles during generation, and the knowledge-actuation gaps between the two. We instantiate this framework across models and coding agents on benchmarks covering both isolated function-level security and full web-application security. Our results show that secure-coding-principle understanding is a statistically strong predictor of code-level outcomes, including functional correctness, security, and joint functional-security correctness. Yet substantial knowledge-actuation gaps remain: models can recognize relevant security principles but still fail to translate them into secure and functional code. These findings offer a principle-centered account of where AI secure code generation stands today and identify concrete paths forward through principle-guided generation, evaluation, benchmarking, and agentic workflows.
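Abstract (translated from Chinese) The growing use of AI systems for code generation raises a central security question: what can today's models and coding agents actually do to produce secure code, where do they still fail, and what would move the field forward? Existing work has explored prompting, fine-tuning, reinforcement learning, and agentic workflows for secure code generation, but the field still lacks a systematic understanding of how these techniques improve security and why substantial failures persist. In this SoK, we systematize the progress, pitfalls, and paths forward for AI secure code generation. We introduce a three-level framework that measures models' natural-language understanding of secure coding principles, their code-level actuation of those principles during generation, and the knowledge-actuation gap between the two. We instantiate the framework across models and coding agents on benchmarks covering both isolated function-level security and full web-application security. Our results show that understanding of secure coding principles is a statistically strong predictor of code-level outcomes, including functional correctness, security, and joint functional-security correctness. Yet substantial knowledge-actuation gaps remain: models can recognize the relevant security principles but still fail to translate them into secure, functional code. These findings give a principle-centered account of where AI secure code generation stands today and identify concrete paths forward through principle-guided generation, evaluation, benchmarking, and agentic workflows.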

Inverse Reinforcement Learning for Interpretable Keystroke Biomarkers in Parkinson's Disease

帕金森病中可解释的按键生物标志物的逆强化学习

Authors: Navin Bondade
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.25270
Pdf link: https://arxiv.org/pdf/2606.25270
Abstract Keystroke dynamics have been explored extensively as a passive digital biomarker for Parkinson's disease (PD), typically by extracting summary statistics from typing timing and training a classifier to discriminate PD from healthy controls. We instead apply inverse reinforcement learning (IRL) to keystroke data, modeling each keystroke as a discrete choice over typing speed and recovering, per subject, an interpretable reward function that explains their observed timing behavior. To our knowledge this is the first application of IRL to keystroke dynamics. On the public neuroQWERTY MIT-CSXPD dataset (85 subjects, 42 with PD), an initial four-parameter reward decomposition (speed, effort, smoothness, hand-alternation cost) was found to suffer severe feature collinearity between two terms ($r=1.000$ in typical contexts); we diagnose and correct this, yielding an identifiable three-parameter model. The recovered speed-preference weight correlates with UPDRS-III severity at $r=-0.607$ ($p<0.001$, $n=42$), replicates independently across two sub-cohorts, is stable across nine sensitivity configurations, and retains a statistically significant contribution beyond raw typing speed alone (incremental $R^2$ from 0.194 to 0.338, $p=0.006$). Two other recovered weights (consistency, hand-alternation) did not survive confound checks and are reported as negative results. We document two implementation bugs found during adversarial code review (session-boundary contamination, a rolling-window data leakage) and show the headline result is materially unchanged after fixing both. We discuss this result in the context of a literature where reported accuracies vary widely between studies (pooled AUC 0.85, I^2=94% in a 2022 meta-analysis), and argue that the validation process itself, not only the correlation coefficient, is part of the contribution.
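Abstract (translated from Chinese) Keystroke dynamics have been widely explored as a passive digital biomarker for Parkinson's disease (PD), typically by extracting summary statistics from typing timing and training a classifier to separate PD from healthy controls. We instead apply inverse reinforcement learning (IRL) to keystroke data, modeling each keystroke as a discrete choice over typing speed and recovering, for each subject, an interpretable reward function that explains the observed timing behavior. To our knowledge, this is the first application of IRL to keystroke dynamics. On the public neuroQWERTY MIT-CSXPD dataset (85 subjects, 42 with PD), an initial four-parameter reward decomposition (speed, effort, smoothness, hand-alternation cost) showed severe feature collinearity between two terms ($r=1.000$ in typical contexts); we diagnose and correct this, obtaining an identifiable three-parameter model. The recovered speed-preference weight correlates with UPDRS-III severity at $r=-0.607$ ($p<0.001$, $n=42$), replicates independently across two sub-cohorts, is stable across nine sensitivity configurations, and retains a statistically significant contribution beyond raw typing speed alone (incremental $R^2$ from 0.194 to 0.338, $p=0.006$). Two other recovered weights (consistency, hand-alternation) did not survive confound checks and are reported as negative results. We document two implementation bugs found during adversarial code review (session-boundary contamination and a rolling-window data leak) and show that the headline result is materially unchanged after both are fixed. We discuss the result against a literature in which reported accuracies vary widely between studies (pooled AUC 0.85, I^2=94% in a 2022 meta-analysis), and argue that the validation process itself, not only the correlation coefficient, is part of the contribution.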

DynaMOMA: Instantaneous Prediction of Grasp Poses for Mobile Manipulation of Dynamic Objects

DynaMOMA：动态物体移动操作中的抓握姿态瞬时预测

Authors: Zhinan Yu, Junyan Xu, Jiazhao Zhang, Zheng Qin, Yijie Tang, Yuhang Huang, Yihan Cao, Zhiyuan Yu, Yongjun Wang, Renjiao Yi, Chenyang Zhu, Kai Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25295
Pdf link: https://arxiv.org/pdf/2606.25295
Abstract Mobile manipulation is a fundamental robotics task and has advanced rapidly in recent years, enabling robots to navigate, reach, and interact with objects in complex environments. However, mobile manipulation of dynamic objects remains highly challenging, as robots must coordinate the mobile base and arm while adapting to continuously evolving target poses. A key challenge lies in predicting temporally consistent short-horizon grasp trajectories from dynamic observations. In this work, we propose DynaMOMA, a dynamic mobile manipulation framework that couples instantaneous grasp trajectory prediction with a whole-body control policy. Our predictor uses an anchor-based diffusion model to generate temporally consistent short-horizon grasp trajectories conditioned on historical observations. The predicted trajectories are then encoded as compact features and fed to a whole-body reinforcement learning policy, which controls the mobile manipulator for dynamic grasping. We further introduce an anticipation-guided reward that equips the policy with an anticipatory grasping horizon by adaptively shifting the target from the current grasp observation to the instantaneously predicted grasp trajectory. Through extensive experiments in Isaac Gym simulation, we show that our method achieves strong performance in mobile manipulation of dynamic objects across diverse settings and grasping metrics. Furthermore, our predictor and policy demonstrate strong generalizability in real-world experiments.
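Abstract (translated from Chinese) Mobile manipulation is a fundamental robotics task that has advanced rapidly in recent years, enabling robots to navigate, reach, and interact with objects in complex environments. However, mobile manipulation of dynamic objects remains highly challenging, because the robot must coordinate its mobile base and arm while adapting to continuously changing target poses. A key challenge is predicting temporally consistent short-horizon grasp trajectories from dynamic observations. This work proposes DynaMOMA, a dynamic mobile manipulation framework that couples instantaneous grasp trajectory prediction with a whole-body control policy. Our predictor uses an anchor-based diffusion model to generate temporally consistent short-horizon grasp trajectories conditioned on historical observations. The predicted trajectories are then encoded as compact features and fed to a whole-body reinforcement learning policy that controls the mobile manipulator for dynamic grasping. We further introduce an anticipation-guided reward that gives the policy an anticipatory grasping horizon by adaptively shifting the target from the current grasp observation to the instantaneously predicted grasp trajectory. Extensive experiments in Isaac Gym simulation show that the method performs strongly in mobile manipulation of dynamic objects across diverse settings and grasping metrics. In addition, our predictor and policy generalize well in real-world experiments.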

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

V-Zero：无答案标签的政策提炼，并以对比证据门控实现细粒度的视觉推理

Authors: Haoxiang Sun, Zhihang Yi, Langxuan Deng, Yuhao Zhou, Peiqi Jia, Jian Zhao, Li Yuan, Jiancheng Lv, Tao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.25319
Pdf link: https://arxiv.org/pdf/2606.25319
Abstract Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5$\times$ faster than previous supervised fine-tuning methods and more than 10$\times$ faster than reinforcement learning baselines. Code and dataset will be released at this https URL
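Abstract (translated from Chinese) Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or on supervised fine-tuning over large-scale annotated reasoning traces, which leads to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that although OPD provides effective token-level correction, its ceiling is limited by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5$\times$ faster than previous supervised fine-tuning methods and more than 10$\times$ faster than reinforcement learning baselines. Code and dataset will be released at this https URL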

Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

多模态情绪推理的全知觉策略优化

Authors: Zhiyuan Han, Beier Zhu, Wenwen Tong, Pengyang Shao, Peipei Song, Xinyi Wang, Jiangnan Chen, Lewei Lu, Xun Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25325
Pdf link: https://arxiv.org/pdf/2606.25325
Abstract We find that current emotion-oriented Omni-MLLMs still lack reliable omni-modal perception: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose OPPO (Omni-Perception Policy Optimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and MME-Emotion, while substantially improving utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning.
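Abstract (translated from Chinese) We find that current emotion-oriented Omni-MLLMs still lack reliable omni-modal perception: they (i) underuse multimodal cues in their reasoning trajectories and (ii) behave unfaithfully, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose OPPO (Omni-Perception Policy Optimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce MEP-Bench, a diagnostic benchmark that quantifies utilization and faithfulness. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and MME-Emotion while substantially improving utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni-modal perception for multimodal emotion reasoning.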

Stagnant Neuron: Towards Understanding the Plasticity Loss in Multi-Agent Reinforcement Learning Value Factorization Methods

停滞神经元：探讨多智能体强化学习价值因子方法中的可塑性损失

Authors: Zhengzhu Liu, Zeming Gao, Haoyuan Qin, Jiawei Hu, Junhao Wu, Miao Zhu, Haipeng Zhang, Chennan Ma, Siqi Shen, Cheng Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.25335
Pdf link: https://arxiv.org/pdf/2606.25335
Abstract Multi-Agent Reinforcement Learning (MARL) value factorization methods can suffer from a loss of plasticity, gradually failing to adapt when transferring to new task instances. We trace this issue to stagnant neurons, units whose gradient updates become negligibly small relative to their weights, thereby hindering learning. Although plasticity injection methods exist, they prove ineffective for such neurons. To address this, we propose Knowledge-retentive Neuron-level PlastIcity Focusing InjEction (KNIFE), a novel method that directly targets stagnant neurons. KNIFE replaces each stagnant neuron with a composite unit comprising three specialized components: a frozen knowledge neuron to preserve acquired knowledge, a re-initialized active neuron to restore learning capacity, and a compensation neuron to ensure the combined output matches the original, thus maintaining previously learned cooperation knowledge. Extensive experiments on SMACv2, predator-prey, and matrix games demonstrate that KNIFE significantly outperforms state-of-the-art plasticity injection methods.
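Abstract (translated from Chinese) Multi-agent reinforcement learning (MARL) value factorization methods can suffer from a loss of plasticity, gradually failing to adapt when transferred to new task instances. We trace this problem to stagnant neurons, units whose gradient updates become negligibly small relative to their weights and therefore hinder learning. Plasticity injection methods exist, but they are ineffective for such neurons. To address this, we propose Knowledge-retentive Neuron-level PlastIcity Focusing InjEction (KNIFE), a new method that directly targets stagnant neurons. KNIFE replaces each stagnant neuron with a composite unit made of three specialized components: a frozen knowledge neuron that preserves acquired knowledge, a re-initialized active neuron that restores learning capacity, and a compensation neuron that ensures the combined output matches the original, thereby retaining previously learned cooperation knowledge. Extensive experiments on SMACv2, predator-prey, and matrix games show that KNIFE significantly outperforms state-of-the-art plasticity injection methods.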
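The three-part replacement can be sketched for a single ReLU unit. This is one plausible reading of the abstract, not the authors' implementation: the stagnation test (gradient norm divided by weight norm below a tolerance `tol`), the choice of a frozen copy of the active neuron's initial weights as the compensation term, and all names are assumptions.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def is_stagnant(weights, grads, tol=1e-3):
    # Assumed criterion: gradient is tiny relative to the weight magnitude.
    g_norm = math.sqrt(sum(g * g for g in grads))
    w_norm = math.sqrt(sum(w * w for w in weights))
    return w_norm > 0 and g_norm / w_norm < tol


class CompositeNeuron:
    """Frozen knowledge neuron + trainable active neuron + compensation."""

    def __init__(self, old_weights, rng):
        self.knowledge = list(old_weights)  # frozen: keeps what was learned
        self.active = [rng.gauss(0.0, 0.1) for _ in old_weights]  # trainable
        self.compensation = list(self.active)  # frozen copy of initial active

    def forward(self, x):
        # At injection time active == compensation, so the bracket is zero
        # and the unit reproduces the original neuron's output.
        return relu(dot(self.knowledge, x)) + (
            relu(dot(self.active, x)) - relu(dot(self.compensation, x))
        )


rng = random.Random(0)
old_weights = [0.8, -0.3, 0.5]
unit = CompositeNeuron(old_weights, rng)
x = [1.0, 2.0, 0.5]
print(unit.forward(x), relu(dot(old_weights, x)))
```

Only `active` would receive gradient updates after the swap, so new learning happens through a freshly initialized unit while the network's output is unchanged at the moment of replacement.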

AI Coaching for Accelerating Human Skill Development with Reinforcement Learning

人工智能教练通过强化学习加速人类技能发展

Authors: Wei Wang, Enlin Gu, Antonio Loquercio, Haimin Hu, Rahul Mangharam
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2606.25337
Pdf link: https://arxiv.org/pdf/2606.25337
Abstract AI copilots can substantially boost human performance through shared control, but excessive assistance can induce over-reliance and skill atrophy. This paper studies how an embodied AI agent can act as a coach that accelerates human motor-skill development. We argue that effective coaching requires strategic scaffolding and stepping back that are aligned with the learner's capability, allowing productive failures that drive learning. We formalize the interactive AI coaching process as a non-cooperative dynamic game in which the learner optimizes task performance while the coach targets the learner's independent competence. Building on this formalism, we develop a reinforcement learning framework combining adaptive shared control with probabilistic models of the coach's causal influence on skill evolution, enabling tractable training of coaching policies. A comprehensive user study (N=33) on first-person-view drone racing shows significant gains in human learning outcomes over state-of-the-art AI coaching baselines.
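Abstract (translated from Chinese) AI copilots can substantially boost human performance through shared control, but excessive assistance can cause over-reliance and skill atrophy. This paper studies how an embodied AI agent can act as a coach that accelerates human motor-skill development. We argue that effective coaching requires strategic scaffolding and stepping back that match the learner's capability, allowing productive failures that drive learning. We formalize the interactive AI coaching process as a non-cooperative dynamic game in which the learner optimizes task performance while the coach targets the learner's independent competence. Building on this formalism, we develop a reinforcement learning framework that combines adaptive shared control with probabilistic models of the coach's causal influence on skill evolution, which makes training coaching policies tractable. A comprehensive user study (N=33) on first-person-view drone racing shows significant gains in human learning outcomes over state-of-the-art AI coaching baselines.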

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

高效且可训练的语言模型通过局部分支路由进行测试时间扩展

Authors: Yutong Yin, Mingyu Jin, Jin Pan, Changyi Yang, Zijie Xia, Dhruv Pai, Shuming Hu, Zhen Zhang, Chenyang Zhao, Jinman Zhao, Wujiang Xu, Raymond Li, Xin Eric Wang, Julian McAuley, Zhaoran Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.25354
Pdf link: https://arxiv.org/pdf/2606.25354
Abstract Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence- or solution-level search can be computationally expensive and hard to train end-to-end. We introduce Local Branch Routing (LBR), a token-level test-time scaling framework that expands a small local lookahead tree, forwards all sampled branches through the language model, and uses a lightweight router to select the depth-1 subtree to commit. By routing over the hidden states of candidate local futures, LBR allows each token decision to use evidence beyond the root next-token distribution while avoiding full solution-level search. The resulting prune-shift-grow decoding process preserves discrete branch identities and defines a tractable tree-trajectory likelihood: newly grown nodes are counted when first sampled, and router decisions are assigned explicit probabilities. This enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and router under the same likelihood-ratio principle as discrete-token RLVR. On synthetic hierarchical-planning tasks, LBR shows that post-candidate hidden states provide useful routing evidence. On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. These results suggest that lightweight local branching offers an efficient, trainable, and discrete form of language-model test-time scaling.
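Abstract (translated from Chinese) Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence-level or solution-level search is computationally expensive and hard to train end-to-end. We introduce Local Branch Routing (LBR), a token-level test-time scaling framework that expands a small local lookahead tree, forwards all sampled branches through the language model, and uses a lightweight router to select the depth-1 subtree to commit. By routing over the hidden states of candidate local futures, LBR lets each token decision use evidence beyond the root next-token distribution while avoiding full solution-level search. The resulting prune-shift-grow decoding process preserves discrete branch identities and defines a tractable tree-trajectory likelihood: newly grown nodes are counted when first sampled, and router decisions are assigned explicit probabilities. This enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and the router under the same likelihood-ratio principle as discrete-token RLVR. On synthetic hierarchical-planning tasks, LBR shows that post-candidate hidden states provide useful routing evidence. On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. These results suggest that lightweight local branching offers an efficient, trainable, and discrete form of language-model test-time scaling.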

Compositional Behavioral Semantics for State Abstraction in Reinforcement Learning

强化学习中状态抽象的组合行为语义

Authors: Yivan Zhang, Ziyan Luo, Manuel Baltieri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Category Theory (math.CT)
Arxiv link: https://arxiv.org/abs/2606.25357
Pdf link: https://arxiv.org/pdf/2606.25357
Abstract State abstraction plays a key role in scaling reinforcement learning to complex but structured systems. In studying such systems, a wide range of behavioral structures have been studied in reinforcement learning, including value functions, invariants, bisimulation relations, and behavioral metrics. However, a general principle for determining what structures are provably preserved under state abstraction is still lacking. In this paper, we present a unified framework for defining and analyzing behavioral structures in reinforcement learning. Our framework provides a compositional way to specify behavioral semantics based on local, one-step descriptions of system dynamics. Using this framework, we establish results showing how behavioral structures can be safely transferred between abstract and concrete systems. We further show how to construct quantitative metrics from logical behavioral semantics with soundness guarantees. Together, these results provide a principled foundation for reasoning about behaviors under state abstraction in reinforcement learning and offer reusable definition and proof principles for a broad class of behavioral structures in reinforcement learning.
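Abstract (translated from Chinese) State abstraction plays a key role in scaling reinforcement learning to complex but structured systems. In studying such systems, reinforcement learning research has examined a wide range of behavioral structures, including value functions, invariants, bisimulation relations, and behavioral metrics. However, a general principle for determining which structures are provably preserved under state abstraction is still lacking. This paper presents a unified framework for defining and analyzing behavioral structures in reinforcement learning. The framework provides a compositional way to specify behavioral semantics from local, one-step descriptions of system dynamics. Using it, we establish results showing how behavioral structures can be safely transferred between abstract and concrete systems. We further show how to construct quantitative metrics from logical behavioral semantics with soundness guarantees. Together, these results provide a principled foundation for reasoning about behaviors under state abstraction in reinforcement learning and offer reusable definition and proof principles for a broad class of behavioral structures.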

FactorLibrary: From Polynomials to Circuits via Recursive Subgoals

因子库：通过递归子目标从多项式到电路

Authors: Rohan Pandey, Michael Ruofan Zeng, Weikun K. Zhang, Kaijie Jin, Naomi Morato, Archit Ganapule, Bhaumik Mehta, Jarod Alper
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25394
Pdf link: https://arxiv.org/pdf/2606.25394
Abstract Finding minimal arithmetic circuits for polynomials over finite fields is a combinatorially hard problem central to algebraic complexity theory. We formulate it as a reinforcement learning problem in two directions, bottom-up and top-down. To address the challenge of a fast-growing combinatorial search space, we introduce FactorLibrary, which stores factorizable subexpressions that serve as reusable subgoals across training episodes. We trained a bottom-up agent with Gumbel-PPO-MCTS and two top-down agents with PPO+MCTS and SAC. The PPO+MCTS top-down agent exhibited the most stable performance, finding certified optimal circuits up to complexity $8$ with a success rate of $91.8\%$.
中文摘要 在有限域上为多项式寻找最小算术电路是一个组合意义上的困难问题，也是代数复杂性理论的核心。我们从自下而上和自上而下两个方向将其表述为强化学习问题。为了应对快速增长的组合搜索空间，我们引入了 FactorLibrary，它存储可因式分解的子表达式，作为可在各训练回合间重复使用的子目标。我们用 Gumbel-PPO-MCTS 训练了一个自下而上的智能体，并分别用 PPO+MCTS 和 SAC 训练了两个自上而下的智能体。PPO+MCTS 自上而下智能体表现最稳定，能够找到复杂度最高为 $8$ 的经认证最优电路，成功率为 $91.8\%$。

MAPL: Multi-Objective Preference Learning for Robot Locomotion

MAPL：机器人运动的多目标偏好学习

Authors: Xiyue Chen, Muhan Lin, Shuyang Shi, Joseph Campbell
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25398
Pdf link: https://arxiv.org/pdf/2606.25398
Abstract Reward design remains a major bottleneck in reinforcement learning for robot locomotion, where successful policies often depend on carefully tuned, task-specific reward functions. Preference-based reinforcement learning offers an alternative, but existing LLM-based methods typically ask for a single overall judgment between behaviors, making it difficult to capture the multiple competing objectives that underlie high-quality locomotion. We present Multi-Objective AI-Informed Preference Learning (MAPL), a framework that learns locomotion rewards from high-level natural language objectives rather than manually engineered reward equations. MAPL prompts a large language model to compare trajectories independently along semantically meaningful criteria, using generic language descriptions that are terrain-invariant and require little domain expertise. These objective-wise preferences are used to train a multi-head preference scoring model, whose outputs are aggregated to form a scalar reward for policy optimization. Across four quadruped locomotion environments, MAPL trains policies using only LLM-generated preferences and achieves performance comparable to or better than expert-designed rewards, while eliminating task-specific reward engineering.
中文摘要 奖励设计仍然是机器人运动强化学习中的一个主要瓶颈，成功的策略往往依赖于精心调优的任务特定奖励函数。基于偏好的强化学习提供了另一种选择，但现有基于LLM的方法通常只要求对不同行为给出单一的整体判断，难以捕捉支撑高质量运动的多个相互竞争的目标。我们提出了多目标AI辅助偏好学习（MAPL），这是一个从高层自然语言目标而非手工设计的奖励方程中学习运动奖励的框架。MAPL向大型语言模型发出提示，让其按照语义明确的标准分别独立比较轨迹，所用的通用语言描述与地形无关，且几乎不需要领域专业知识。这些按目标划分的偏好用于训练多头偏好评分模型，其输出被聚合为用于策略优化的标量奖励。在四个四足运动环境中，MAPL仅使用LLM生成的偏好训练策略，性能与专家设计的奖励相当甚至更好，同时免去了针对任务的奖励工程。

Learning with a Single Rollout via Monte Carlo Pass@k Critic

通过蒙特卡洛 Pass@k Critic 实现单次 rollout 学习

Authors: Fengdi Che, Yang Liu, Lei Yu, Meng Cao, Tong Che, Rupam Mahmood, Dale Schuurmans
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25451
Pdf link: https://arxiv.org/pdf/2606.25451
Abstract Estimating token-level advantages in reinforcement learning (RL) for language models remains challenging because scaling up episodic experience collection is expensive. The difficulty intensifies for baseline advantage estimation methods, where repeated sampling causes trajectories to diverge into substantially different reasoning prefixes. In this context, RL algorithms such as GRPO prove limited: an outcome reward is too sparse to be attributed to specific actions like intermediate steps, and comparisons across sampled traces are non-trivial because they are heterogeneous. To mitigate both the computational cost of repeated sampling and the difficulty of credit assignment, we study single-rollout proximal policy optimization (SR-PPO) featuring token-level credit assignment in RL for language models. Instead of estimating advantages by normalizing episodic returns within the candidate group, we train a calibrated token-level credit critic using Monte Carlo outcomes from one rollout per prompt. Specifically, we use the critic to predict the Pass@k success probability at the prompt prefix, which is derived from a Pass@1 attempt. This choice yields a more selective learning signal than Pass@1: it discounts easily solved prefixes while prioritizing hard ones whose success probability remains marginal. We show that as $k$ increases, Pass@k converges to a reachability indicator, reflecting whether a prefix can lead to at least one successful continuation. In an explicit state graph, the limit ($k \rightarrow \infty$) can be computed in $O(|V|+|E|)$ time, offering a promising surrogate for direct credit assignment without the need to sample contrastive traces. As an initial validation, SR-PPO exhibits stable learning dynamics, along with consistent gains in Pass@128 success rates on mathematical reasoning benchmarks such as HMMT26 and AIME24.
中文摘要 在语言模型的强化学习（RL）中估计词元级优势仍然具有挑战性，因为扩大回合式经验收集的成本很高。对于基于基线的优势估计方法，这一难度更加突出，因为重复采样会使轨迹分化为差异显著的推理前缀。在此背景下，GRPO 等强化学习算法显示出局限性：结果奖励过于稀疏，无法归因于中间步骤等具体动作，而且采样轨迹彼此异构，相互比较并非易事。为同时降低重复采样的计算成本和信用分配的难度，我们研究了在语言模型强化学习中采用词元级信用分配的单次 rollout 近端策略优化（SR-PPO）。我们不是通过在候选组内归一化回合回报来估计优势，而是使用每个提示一次 rollout 的蒙特卡洛结果来训练经过校准的词元级信用 critic。具体来说，我们用 critic 预测提示前缀处的 Pass@k 成功概率，该概率由一次 Pass@1 尝试推导得到。与 Pass@1 相比，这一选择给出了更具选择性的学习信号：它降低已被轻松解决的前缀的权重，同时优先关注成功概率仍处于边缘的困难前缀。我们证明，随着 $k$ 的增加，Pass@k 收敛到一个可达性指示量，反映某个前缀是否能引出至少一个成功的延续。在显式状态图中，极限（$k \rightarrow \infty$）可以在 $O(|V|+|E|)$ 时间内计算，为直接信用分配提供了一个有前景的替代量，而无需采样对比轨迹。作为初步验证，SR-PPO 表现出稳定的学习动态，并在 HMMT26 和 AIME24 等数学推理基准上持续提升 Pass@128 成功率。
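Illustrative sketch (not the authors' code): the abstract's claim that Pass@k converges to a reachability indicator can be checked on a toy example. The sketch assumes the k attempts are independent, so that Pass@k = 1 - (1 - p)^k for a Pass@1 probability p; the abstract does not state the exact derivation used in the paper. The function names `pass_at_k` and `reachable_success`, and the toy graph, are hypothetical. The reverse breadth-first search visits every node and edge at most once, which matches the $O(|V|+|E|)$ cost mentioned above.

```python
from collections import deque


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    given single-attempt success probability p (independence is an assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 - (1.0 - p) ** k


def reachable_success(graph: dict, success: set) -> set:
    """Return the set of nodes from which some success node is reachable.

    graph maps each node to a list of successor nodes. A breadth-first
    search over the reversed edges touches each node and edge once.
    """
    # Build the reversed adjacency list.
    reverse = {node: [] for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(node)

    seen = set(success)
    queue = deque(success)
    while queue:
        node = queue.popleft()
        for prev in reverse.get(node, []):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen


if __name__ == "__main__":
    # A hard prefix (p = 0.01) gains far more from larger k than an easy one.
    for p in (0.01, 0.9):
        print(p, [round(pass_at_k(p, k), 4) for k in (1, 8, 128)])

    # Hypothetical prefix graph: "c" is a dead end, "win" is a success state.
    toy = {"a": ["b", "c"], "b": ["win"], "c": [], "win": []}
    print(sorted(reachable_success(toy, {"win"})))
```

For any p > 0 the value tends to 1 as k grows, and for p = 0 it stays at 0, which is the indicator behaviour the abstract describes.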

Rate-Aware Quantum-Inspired Trajectory Learning for Interference-Limited Multi-UAV Networks

干扰限制多无人机网络的速率感知量子启发轨迹学习

Authors: Khaoula Khaled, Muhammad Afaq, Ali Arshad Nasir, Zeeshan Kaleem
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25480
Pdf link: https://arxiv.org/pdf/2606.25480
Abstract Unmanned aerial vehicles (UAVs) can provide on-demand, high-capacity connectivity in both disaster and normal situations. However, their trajectory optimization suffers from the curse of dimensionality: interference-limited environments and vast search spaces make real-time coordination computationally expensive. To overcome this challenge, we propose the Rate-Aware Quantum-Annealed Graph Condensation (RA-QAGC) scheme, which combines rate-aware graph abstraction with decentralized reinforcement learning to enable scalable, interference-aware UAV coordination. By identifying high-throughput locations and guiding UAV trajectory adaptation toward throughput-optimal regions, RA-QAGC effectively balances network capacity while maintaining quality-of-service (QoS) requirements. Simulation results demonstrate that the proposed scheme outperforms existing schemes, achieving 59.4 Mbps total throughput and 23.9 Mbps priority-user throughput, representing gains of approximately 15% and 34%, respectively, over the baseline schemes.
中文摘要 无人机（UAV）可以在灾难和正常情况下提供按需、高容量的连接。然而，它在轨迹优化中面临维度诅咒的挑战，干扰受限环境和庞大的搜索空间使实时协调计算成本高昂。为克服这一挑战，我们提出了速率感知量子退火图凝聚（RA-QAGC）方案，该方案结合了速率感知图抽象与去中心化强化学习，实现可扩展、干扰感知的无人机协调。通过识别高吞吐量位置并引导无人机轨迹适应至吞吐量最佳区域，RA-QAGC有效平衡网络容量，同时保持服务质量（QoS）需求。模拟结果表明，该提案优于现有方案，实现了59.4 Mbps的总吞吐量和23.9 Mbps的优先用户吞吐量，分别比基线方案提升了约15%和34%。

Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning

合作多智能体强化学习中的低方差信任区域优化，采用独立演员和顺序更新

Authors: Bang Giang Le, Viet Cuong Ta
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.25526
Pdf link: https://arxiv.org/pdf/2606.25526
Abstract Cooperative multi-agent reinforcement learning assumes that all agents share the same reward function and can be trained effectively using the single-agent Trust Region framework. Instead of relying on other agents' actions, the independent-actors setting has each agent act based only on its local information, which allows more flexible applications. However, the sequential update framework requires re-estimating the joint advantage function after each individual agent's policy step. Despite the practical success of importance sampling, the updated advantage function suffers from exponentially high variance, which likely results in unstable convergence. In this work, we first analyze the high-variance advantage both empirically and theoretically. To overcome this limitation, we introduce a clipping objective to control the upper bounds of the advantage fluctuation in sequential updates. With the proposed objective, we provide a monotonic bound with sub-linear convergence to $\epsilon$-Nash Equilibria. We further derive two new practical algorithms using our clipping objective. Experimental results on three popular multi-agent reinforcement learning benchmarks show that our proposed method outperforms the tested baselines in most environments. A careful analysis of different training settings highlights both the stable convergence properties and the desired low-variance advantage estimation of our proposed method. For reproducibility purposes, our source code is publicly available at this https URL.
中文摘要 合作多智能体强化学习假设每个智能体共享相同的奖励函数，并可通过单智能体的信任区域框架有效训练。独立行为者设置不依赖其他代理的操作，而是仅根据其本地信息考虑每个代理的行为，因此应用更灵活。然而，在顺序更新框架中，每个代理的策略步骤后需要重新估算联合优势函数。尽管重要性抽样在实际中取得了成功，更新后的优势函数仍面临指数级高方差问题，可能导致收敛不稳定。在本研究中，我们首先从实证和理论角度分析高方差优势。为克服这一限制，我们引入了一个裁剪目标，以控制连续更新中优势波动的上限。在提出的目标下，我们给出一个单调界限，具有亚线性收敛性到 $\epsilon$-Nash 均衡。我们进一步利用裁剪目标推导出两个新的实用算法。基于三个流行的多智能体强化学习基准测试的实验结果表明，我们提出的方法在大多数环境中优于测试的基线。通过仔细分析不同的训练设置，我们提出的方法既具有稳定收敛性质，也得到了期望的低优势方差估计。为了可重复性，我们的源代码在此 https URL 公开。

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

超越一刀切：基于诊断的线上强化学习与线下先验

Authors: Guozheng Ma, Lu Li, Zilin Wang, Pierre-Luc Bacon, Dacheng Tao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.25527
Pdf link: https://arxiv.org/pdf/2606.25527
Abstract Online reinforcement learning (RL) agents increasingly depend on knowledge acquired offline to achieve practical efficiency. Originally studied in offline-to-online RL, this paradigm now spans foundation model post-training and embodied intelligence, with prior types expanding from offline datasets and pre-trained policies to increasingly diverse knowledge sources such as multimodal foundation models and generative world models. Offline priors have become central to how deep RL is developed and deployed. However, this reliance introduces a challenge that the prevailing benchmark-driven paradigm cannot resolve: because prior validity varies across deployments and shifts during training, no single approach to managing it is universally optimal, and benchmark rankings offer limited guidance for real-world deployments. Rather than pursuing universal solutions, we argue that the field should shift to diagnosis-driven tension management, in which deployment-specific evidence guides how the learner relates to its priors throughout training, enabling both flexible and adaptive deployment. We support this position with a framework characterizing how priors reshape online optimization through three functional roles, controlled experiments demonstrating help-or-hurt reversals, cross-domain evidence from foundation model post-training to embodied intelligence, and engagement with five substantive counterarguments.
中文摘要 在线强化学习（RL）智能体越来越依赖离线获取的知识以实现实用效率。该范式最初在离线到在线强化学习中研究，现已涵盖基础模型的后训练和具身智能，之前的类型从离线数据集和预训练策略扩展到越来越多样化的知识源，如多模态基础模型和生成世界模型。离线先验已成为强化学习深度的核心。然而，这种依赖带来了一个挑战，而现有的基准驱动范式无法解决：由于先验效度在不同部署和培训期间的转变中存在差异，没有单一的管理方法是普遍最优的，基准排名对实际部署的指导有限。我们主张，与其追求普遍解决方案，不如转向以诊断为驱动的张力管理，通过部署具体证据指导学习者在整个培训过程中如何与其先验建立关系，从而实现灵活和适应性的部署。我们通过一个框架来支持这一立场，通过三个功能角色、受控实验展示帮助或伤害的逆转、从基础模型训练后到具身智能的跨领域证据，以及五个实质性反驳的参与来支持这一立场。

Latency-Aware Service Placement using Neural Combinatorial Optimisers for Edge--Cloud Systems

利用神经组合优化器实现边缘云系统的延迟感知服务部署

Authors: Kimia Abedpour, Mohammadsadeq Garshasbi Herabad, Zheng Li, Javid Taheri
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.25553
Pdf link: https://arxiv.org/pdf/2606.25553
Abstract The growth of Internet of Things (IoT) applications and latency-sensitive services has increased the demand for efficient service placement across compute continuum platforms, such as edge--cloud systems. Modern applications are decomposed into interdependent microservices deployed over heterogeneous infrastructures, making placement under resource and network constraints an intractable NP-hard combinatorial optimisation problem. This study proposes a latency-aware Edge Placement Neural Combinatorial Optimiser (EP-NCO), a learning-based framework for service placement in compute continuum platforms. EP-NCO employs a dual-graph model to capture resource relationships and service dependencies within both computing infrastructure and application structure. Graph neural networks (GNNs) learn structural embeddings of infrastructure nodes and service components, whereas reinforcement learning policies construct feasible placements that account for execution latency, communication link delays, and bandwidth-sharing effects. Extensive simulations across multiple system scales demonstrate that EP-NCO consistently achieves high-quality placement decisions, reducing the total service response time by 46%--50% compared with metaheuristics (genetic algorithm and particle swarm optimisation) and by 25%--35% compared with controlled RL ablation baselines. Once trained, EP-NCO enables fast online inference, making it a practical solution for dynamic large-scale edge--cloud environments with hundreds of computing nodes, hosting thousands of applications, which is significantly beyond the capability of current scheduling systems.
中文摘要 物联网（IoT）应用和延迟敏感服务的增长，增加了对跨计算连续体平台（如边缘云系统）高效服务部署的需求。现代应用被分解为部署在异构基础设施上的相互依赖微服务，使得在资源和网络约束下部署成为一个难以解决的NP-难组合优化问题。本研究提出了一种延迟感知的边缘置入神经组合优化器（EP-NCO），这是一种基于学习的连续计算平台服务部署框架。EP-NCO 采用双图模型来捕捉计算基础设施和应用结构中的资源关系和服务依赖关系。图神经网络（GNN）学习基础设施节点和服务组件的结构嵌入，而强化学习策略则构建可行的布局，考虑执行延迟、通信链路延迟和带宽共享效应。跨多系统尺度的广泛模拟表明，EP-NCO始终实现高质量的布置决策，与元启发式（遗传算法和粒子群优化）相比，总服务响应时间减少了46%-50%，与受控强化学习消融基线相比减少了25%-35%。一旦训练完成，EP-NCO能够实现快速的在线推理，使其成为拥有数百个计算节点、托管数千应用的动态大规模边缘云环境的实用解决方案，这远远超出现有调度系统的能力。

FeVOS: Foresight Expression Video Object Segmentation

FeVOS：前瞻性表达视频对象分割

Authors: Kehan Lan, Kaining Ying, Henghui Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.25585
Pdf link: https://arxiv.org/pdf/2606.25585
Abstract Existing Referring Video Object Segmentation tasks focus on referring expressions describing events, actions or appearances of relevant objects within the observed frames, lacking evaluation in scenarios that require pre-decisive spatio-temporal reasoning, thereby limiting their applicability. To address this, we propose Foresight Expression Video Object Segmentation, a task that queries future events in upcoming video segments and requires masks of the objects in the observed frames as visual answers. For example, in ego-centric scenes, the question "What tool will be used?" demands reasoning over spatio-temporal cues to predict the masks of the next tool to be used, which helps with the understanding of future actions and decisions. To support this task, we introduce FeVOS, a dataset with 968 video clips, 14,525 foresight expressions, and 2,904 chain-of-thought annotations to provide explicit and interpretable reasoning steps. We further develop FeVOS-R1, an MLLM-based model trained on our dataset via a two-stage pipeline of supervised fine-tuning and reinforcement learning. FeVOS-R1 not only achieves state-of-the-art performance on FeVOS, but also demonstrates strong generalization to existing RVOS benchmarks. We hope this work can inspire more research on predictive reasoning in video perception.
中文摘要 现有的参照视频对象分割任务侧重于描述观察帧内相关物体事件、动作或出现的指称表达，缺乏在需要预先决定性时空推理的场景中评估，从而限制了其适用性。为此，我们提出了前瞻性表达视频对象分割的任务，该任务查询即将到来的视频片段中的未来事件，并需要在观察到的帧中对物体进行遮罩作为视觉答案。例如，在以自我为中心的场景中，“将使用什么工具？”这个问题要求通过时空线索推理，预测下一个工具的面罩，有助于理解未来的行动和决策。为支持这项任务，我们引入了FeVOS，这是一个包含968个视频片段、14,525个前瞻性表达和2,904条思维链注释的数据集，以提供明确且可解释的推理步骤。我们还进一步开发了FeVOS-R1，这是一个基于MLLM的模型，通过监督式微调和强化学习的两阶段流水线训练。FeVOS-R1 不仅在 FeVOS 上实现了最先进的性能，还展现了对现有 RVOS 基准测试的强有力推广能力。我们希望这项工作能激发更多关于视频感知预测推理的研究。

Low-Complexity Policy Tessellations in Structured Markov Decision Processes

结构化马尔可夫决策过程中的低复杂度策略镶嵌

Authors: Fredy Pokou (CRIStAL)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25593
Pdf link: https://arxiv.org/pdf/2606.25593
Abstract We study optimal-policy geometry in structured Markov decision processes. While approximate dynamic programming and reinforcement learning typically approximate high-dimensional value functions, we show that optimal policies induce simpler decision tessellations. We propose boundary-based policy approximations that learn policy regions directly. A policy-loss decomposition links performance degradation to action margins and explains why errors concentrate near indifference boundaries. Inventory control and queue admission experiments show lower policy error, smaller value gaps, faster error decay, and greater stability than reinforcement learning baselines.
中文摘要 我们研究结构化马尔可夫决策过程中的最优策略几何。虽然近似动态规划和强化学习通常逼近高维价值函数，但我们证明最优策略会诱导出更简单的决策镶嵌。我们提出了基于边界的策略近似，直接学习策略区域。策略损失分解将性能下降与动作裕度联系起来，并解释了为何误差集中在无差异边界附近。库存控制和队列准入实验显示，与强化学习基线相比，该方法策略误差更低、价值差距更小、误差衰减更快，且更稳定。

Power-Budgeted Underwater Vehicle Control via Constrained Reinforcement Learning

通过约束强化学习实现功率预算受限的水下航行器控制

Authors: Yinuo Wang, Gavin Tao, Yuze Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.25680
Pdf link: https://arxiv.org/pdf/2606.25680
Abstract Underwater vehicles operate from a fixed onboard energy budget that propulsion rapidly depletes, so a controller that completes its task while drawing less thruster power directly extends mission range and endurance. Reinforcement learning yields capable model-free controllers for station-keeping and trajectory tracking, but optimizing task accuracy alone drives the policy toward oscillatory, energy-wasting actuation. The established remedy subtracts an energy penalty from the reward, yet this sets the task-power trade-off through a single weight with no physical units: a target power level cannot be specified, the weight must be re-tuned for every vehicle and task, and a mismatched weight can even raise power. This paper instead formulates energy-efficient underwater control as a constrained Markov decision process in which average thruster power is subject to an explicit budget, solved with a PPO-Lagrangian algorithm. The power level is set by declaring a budget in physical units, and a single dual variable is updated online to meet it for each vehicle and task, without manual weight search. Across three vehicles and four tasks in the MarineGym simulator, the energy-constrained policy draws the least power in all twelve settings, reducing it by 14--65\% (up to 64.9\%) over a task-only baseline and below an energy-reward baseline everywhere, while remaining the smoothest in ten settings and preserving task accuracy except in one deliberately power-limited regime. Imposing energy as an explicit constraint thus offers a tuning-free route to energy-efficient underwater control that needs no per-vehicle, per-task weight search.
中文摘要 水下航行器依靠固定的机载能量预算运行，而推进会迅速消耗这一预算，因此一个在完成任务的同时消耗更少推进器功率的控制器，能直接延长任务航程和续航时间。强化学习可以为定点保持和轨迹跟踪提供性能良好的无模型控制器，但仅优化任务精度会把策略推向振荡、耗能的驱动方式。既有的补救方案是从奖励中减去能量惩罚，但这是通过一个没有物理单位的单一权重来设定任务与功率之间的权衡：无法指定目标功率水平，必须为每种航行器和任务重新调整权重，而且不匹配的权重甚至可能使功率上升。本文转而将节能水下控制表述为一个约束马尔可夫决策过程，其中平均推进器功率受显式预算约束，并用 PPO-Lagrangian 算法求解。功率水平通过以物理单位声明预算来设定，并在线更新单个对偶变量，使每种航行器和任务都满足该预算，无需手动搜索权重。在 MarineGym 模拟器中的三种航行器和四个任务上，能量约束策略在全部十二种设置中功耗最低，相比仅考虑任务的基线降低 14--65%（最高 64.9%），并且在所有设置中都低于能量奖励基线，同时在十种设置中动作最平滑，并保持任务精度，仅在一个刻意限制功率的情形下例外。因此，将能量作为显式约束提供了一条免调参的节能水下控制路径，无需针对每种航行器、每个任务搜索权重。
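Illustrative sketch (not the authors' implementation): the abstract describes a single dual variable that is updated online so that average thruster power meets a declared budget. One common way to realise this in Lagrangian-style constrained RL is projected dual ascent, shown below. The function names, the learning rate, and the simulated power readings are all hypothetical; the paper's actual update rule and PPO integration are not given in the abstract.

```python
def update_multiplier(lmbda: float, avg_power: float, budget: float, lr: float) -> float:
    """Projected dual ascent on the constraint avg_power <= budget.

    The multiplier grows while the budget is exceeded, shrinks otherwise,
    and is clipped at zero so it never rewards extra power use.
    """
    return max(0.0, lmbda + lr * (avg_power - budget))


def penalized_reward(task_reward: float, power: float, lmbda: float) -> float:
    """Reward passed to the policy update: task reward minus the priced power draw."""
    return task_reward - lmbda * power


if __name__ == "__main__":
    budget_watts = 10.0   # declared in physical units (hypothetical value)
    lmbda = 0.0
    # Hypothetical per-iteration average power measurements from rollouts.
    for avg_power in (14.0, 13.0, 11.5, 10.2, 9.6, 9.9):
        lmbda = update_multiplier(lmbda, avg_power, budget_watts, lr=0.05)
        print(f"avg_power={avg_power:5.1f} W  lambda={lmbda:.3f}")
```

The design point the abstract emphasises is visible here: the user declares `budget_watts` in physical units, and the multiplier adapts on its own, whereas a fixed penalty weight would have to be re-tuned for each vehicle and task.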

Memory-Efficient Policy Libraries with Low-Rank Adaptation in Reinforcement Learning

强化学习中具低秩适应性的高效内存策略库

Authors: Samuel Valland Lyngset, Tor Viljen Raanaas, Gard Sveipe, Eirik Møller Nilsen, Jim Torresen, Kai Olav Ellefsen, Tobias Lømo
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25700
Pdf link: https://arxiv.org/pdf/2606.25700
Abstract When fine-tuning Large Language Models (LLMs), there has been success in minimizing both memory usage and computation with Parameter-Efficient Fine-Tuning (PEFT), like Low Rank Adaptation (LoRA). In this article, we have explored whether this approach is transferable to the world of robotics and Reinforcement Learning (RL), allowing learning with reduced memory usage and improved computational performance. Specifically, we focused on a version of multi-task robotics, where a library of specialist policies is created. In such a library, memory efficiency is especially important. We used a Proximal Policy Optimization (PPO) algorithm and fine-tuned a baseline model to different tasks using LoRA. Our results demonstrate that, depending on the hyperparameters, LoRA can minimize memory usage by a factor of 20-160 compared to full fine-tuning of all layers. This implies a 90-95% storage saving when deploying a library of many (10-50) specialized policies, which can be the differentiating factor between being able to store the entire library in memory or having to use swap memory in an applied robotics setting. At the same time, our results indicate that there is no significant difference in the success rate between full fine-tuning and LoRA fine-tuning for the selected tasks.
中文摘要 在微调大型语言模型（LLMs）时，通过参数高效微调（PEFT），如低秩适应（LoRA），成功地最大限度地减少了内存使用和计算。本文探讨了这种方法是否适用于机器人学和强化学习（RL）领域，从而实现减少内存使用和提升计算性能的学习。具体来说，我们关注了多任务机器人的一个版本，通过创建专门策略库。在这样的库中，内存效率尤为重要。我们使用了近端策略优化（PPO）算法，并用LoRA微调了基线模型以适应不同任务。我们的结果表明，根据超参数的不同，LoRA可以将内存使用量减少20-160倍，相较于对所有层进行全面微调。这意味着在部署包含多个（10-50个）专用策略的库时，存储空间节省了90-95%，这可能是区分能否将整个库存储在内存中，还是在应用机器人环境中使用交换内存的区别因素。同时，我们的结果表明，对于所选任务，完全微调与LoRA微调之间的成功率无显著差异。
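Illustrative sketch (not taken from the paper): the storage argument can be made concrete by counting parameters. For a dense layer of shape d_in x d_out, a rank-r LoRA adapter stores two factors with r * (d_in + d_out) parameters in total, and a library of N specialists needs one shared base model plus N adapters instead of N full copies. The layer size, rank, and library size below are hypothetical, and the count covers a single weight matrix only (no biases or other layers), so the printed numbers are not the paper's measurements.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a rank-`rank` LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


def library_storage(n_policies: int, base: int, per_policy: int) -> int:
    """Total parameters for a shared base model plus one adapter per policy."""
    return base + n_policies * per_policy


if __name__ == "__main__":
    d_in, d_out, rank, n_policies = 256, 256, 4, 20  # hypothetical sizes

    full = full_params(d_in, d_out)
    adapter = lora_params(d_in, d_out, rank)
    print("per-policy reduction factor:", full / adapter)

    # Full fine-tuning stores n complete copies; LoRA stores one base plus n adapters.
    all_full = n_policies * full
    all_lora = library_storage(n_policies, full, adapter)
    print("library storage saving:", 1 - all_lora / all_full)
```

Note that the saving for the whole library is smaller than the per-policy reduction factor suggests, because the shared base model still has to be stored once.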

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

OPERA：通过基于客观困惑度的强化学习对齐开放式推理

Authors: Wenxuan Jiang, Zining Fan, Zijian Zhang, Xuecheng Wu, Hongming Tan, Haoyang Dai, Xiaoyu Li, Xuezhi Cao, Ninghao Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.25757
Pdf link: https://arxiv.org/pdf/2606.25757
Abstract Reinforcement Learning (RL) has enabled LLMs to excel in objective reasoning tasks such as mathematics and code generation. However, applying RL to open-ended tasks, such as creative writing, remains challenging because LLM-as-a-judge reward models often exhibit stylistic biases and positional inconsistencies, leading to unstable supervision. To address this, we propose OPERA (Objective Perplexity-based Reflective Alignment), which replaces unreliable external judges with intrinsic rewards derived from perplexity dynamics. Specifically, we derive an intrinsic reward signal from perplexity dynamics, quantifying uncertainty reduction at critical reflective states. During the cold-start phase, we introduce a data synthesis method that leverages carefully designed guiding words to generate diverse reasoning traces, along with perplexity-prioritized rollouts that utilize internal log-probabilities to identify logically consistent reasoning branches. This pipeline yields a large-scale dataset comprising 20,000 high-quality reasoning trajectories. Empirical evaluations consistently demonstrate the scalability and efficacy of our approach in alignment for open-ended tasks. Implementing OPERA on Qwen3-8B establishes a new state-of-the-art among open-source models, achieving parity with or surpassing proprietary models like Gemini2.5 and MiniMax-M2.5 in some open-ended tasks. The code is available at this https URL.
中文摘要 强化学习（RL）使大型语言模型在数学和代码生成等客观推理任务中表现出色。然而，将强化学习应用于创意写作等开放式任务仍然具有挑战性，因为以LLM为评判者的奖励模型常常表现出风格偏见和位置不一致，导致监督不稳定。为此，我们提出了 OPERA（Objective Perplexity-based Reflective Alignment，基于客观困惑度的反思对齐），用源自困惑度动态的内在奖励取代不可靠的外部评判。具体来说，我们从困惑度动态中推导出内在奖励信号，量化关键反思状态下的不确定性减少。在冷启动阶段，我们引入了一种数据合成方法，利用精心设计的引导词生成多样的推理轨迹，同时采用困惑度优先的 rollout，利用内部对数概率识别逻辑一致的推理分支。该流程产生了包含 20,000 条高质量推理轨迹的大规模数据集。实证评估一致表明我们的方法在开放式任务对齐中具有可扩展性和有效性。在 Qwen3-8B 上实现 OPERA 在开源模型中达到了新的最先进水平，在某些开放式任务中与 Gemini2.5 和 MiniMax-M2.5 等专有模型持平甚至超越。代码可在该 https URL 访问。

StairMaster: Learning to Conquer Risky Hollow Stairs for Agile Quadrupedal Robots

StairMaster：让敏捷四足机器人学会征服高风险镂空楼梯

Authors: Xincheng Tang, Youhan Xie, Zhengjie Shu, Wanyu Li, Lai Jiang, Wenkang Hu, Yitong Li, Ruigang Yang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25765
Pdf link: https://arxiv.org/pdf/2606.25765
Abstract Climbing hollow stairs remains a challenging problem for quadruped robots due to the high risk of leg trapping, severe depth sparsity, and high-frequency depth-sensing noise. In this paper, we propose StairMaster, a novel three-stage reinforcement learning framework for stable locomotion on such extreme discontinuous terrains. Our architecture integrates a Cross-Attention mechanism to extract structural features from noisy depth data, alongside a Spatial-aware Recurrent Unit (SRU) that maintains robust spatio-temporal memory to mitigate perception blind spots. To bridge the sim-to-real gap in depth perception, we propose a high-fidelity sim-to-real depth sensor modeling pipeline that faithfully replicates real-world sensor artifacts. Additionally, we employ a 3D waypoint-guided active perception reward for proactive sensing, alongside hollow gap kinematic and stair edge penalties to ensure precise foothold placement. We successfully deployed StairMaster on a Unitree Go2 robot, demonstrating its ability to conquer hollow stairs with an unprecedented incline of up to 55$^\circ$ through zero-shot transfer. To the best of our knowledge, this is the first RL-based policy to achieve such steep hollow stair climbing in real-world environments. Project Website: this https URL.
中文摘要 由于腿部被卡住的风险高、深度信息严重稀疏以及高频深度传感噪声，攀爬镂空楼梯对四足机器人而言仍是一个难题。本文提出了 StairMaster，一种用于在此类极端不连续地形上稳定运动的新型三阶段强化学习框架。我们的架构集成了交叉注意力机制，从含噪深度数据中提取结构特征，同时配备空间感知循环单元（SRU），保持鲁棒的时空记忆以缓解感知盲区。为了弥合深度感知中的仿真到现实差距，我们提出了一条高保真的仿真到现实深度传感器建模流水线，忠实复现真实世界的传感器伪影。此外，我们采用 3D 路点引导的主动感知奖励以实现主动感知，并设置镂空间隙运动学惩罚和楼梯边缘惩罚，确保精确的落足点选择。我们成功地将 StairMaster 部署在 Unitree Go2 机器人上，展示了它通过零样本迁移征服坡度高达 55$^\circ$ 的镂空楼梯的能力，这一坡度前所未有。据我们所知，这是首个在真实环境中实现如此陡峭的镂空楼梯攀爬的基于强化学习的策略。项目网站：this https URL。

MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

MiniOpt：用推理方法建模并解决有限资源下的通用优化问题

Authors: Ke Zhao, Zixiang Di, Hong Qian, Xiang Shu, Yaolin Wen, Qitao Shi, Bingdong Li, Xingyu Lu, Xiangfeng Wang, Jun Zhou, Ke Tang, Yang Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25832
Pdf link: https://arxiv.org/pdf/2606.25832
Abstract Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns to solve optimization problems through a "reasoning-to-model-and-solve" paradigm. MiniOpt decomposes optimization reasoning into structured optimization modeling and executable solver generation. Building upon this paradigm, we introduce OptReward, a reward function with a hierarchical score structure that jointly evaluates formulation and solution, enabling effective policy learning without expert demonstrations. We further develop an optimization-oriented policy optimization strategy that improves exploration efficiency and stabilizes reinforcement learning for compact models. Extensive experiments show that MiniOpt-3B exhibits strong optimization generalization across various optimization types, problem scenarios, and task domains. For models with fewer than 10B parameters, the MiniOpt series achieves the highest average solving accuracy (SA). For models with more than 10B parameters, MiniOpt still shows competitive performance. These results suggest that optimization-oriented reward design and reinforcement learning provide an effective pathway for developing compact optimization-specialized language models with strong optimization generalization capabilities. The code is available at this https URL.
中文摘要 在有限的训练资源下，实现跨多种优化问题的强推广仍是面向优化的大型语言模型（LLMs）面临的挑战。现有方法通常依赖大规模监督数据集、昂贵的推理注释和昂贵的中间步骤验证，导致训练开销巨大。为应对这些挑战，我们提出了MiniOpt，一种通过“推理到建模和求解”范式学习优化问题的强化学习框架。MiniOpt将优化推理分解为结构化优化建模和可执行求解器生成。基于这一范式，我们引入了OptReward，一种具有层级评分结构的奖励函数，能够联合评估制定与解决方案，实现无需专家演示即可实现有效的政策学习。我们进一步开发了一种以优化为导向的策略优化策略，提升了探索效率并稳定了紧凑模型的强化学习。大量实验表明，MiniOpt-3B 在各种优化类型、问题场景和任务领域表现出强烈的优化泛化能力。对于参数少于10B的模型，MiniOpt系列实现了最高的平均求解精度（SA）。对于参数超过 100 亿的模型，MiniOpt 依然表现出竞争力。这些结果表明，优化导向的奖励设计和强化学习为开发具有强大优化泛化能力的紧凑优化专用语言模型提供了有效路径。代码可在该 https URL 访问。

Semantic Consistency Policy Optimization for Reinforcement Learning of LLM Agents

语义一致性策略优化用于LLM代理的强化学习

Authors: Peng Xu, Sijia Chen, Junzhuo Li, Xuming Hu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25852
Pdf link: https://arxiv.org/pdf/2606.25852
Abstract Group-based reinforcement learning effectively post-trains LLM agents for long-horizon, sparse-reward tasks by deriving step-level credit from trajectory outcomes. However, this ties a step's credit to its rollout's final outcome: semantically near-identical intermediate steps receive opposite credit depending on whether their trajectory eventually succeeded or failed. Such semantic credit inconsistency sends conflicting gradients to similar actions and wastes the partially-correct progress inside failed rollouts. Motivated by this, we propose Semantic Consistency Policy Optimization (SCPO), a value-free reward-shaping method that mitigates this inconsistency by recovering step-level credit from successful siblings in the same rollout group. Concretely, SCPO scores each failed step against a successful sibling and adds positive step-level credit for new progress along that sibling. On ALFWorld and WebShop, SCPO matches or exceeds strong group-based baselines, reaching 93.7+/-4.1 percent success on ALFWorld and 74.8+/-2.0 percent on WebShop at 1.5B parameters, with gains concentrated on the hardest multi-step tasks.
中文摘要 基于组的强化学习通过从轨迹结果中导出步骤级信用，能够有效地对LLM智能体进行长时程、稀疏奖励任务的后训练。然而，这把每一步的信用与其所在 rollout 的最终结果绑定在一起：语义上几乎相同的中间步骤，会因其轨迹最终成功还是失败而获得相反的信用。这种语义信用不一致会向相似的动作传递相互冲突的梯度，并浪费失败 rollout 中部分正确的进展。基于此，我们提出了语义一致性策略优化（Semantic Consistency Policy Optimization，简称SCPO），这是一种无需价值函数的奖励塑造方法，通过从同一 rollout 组中成功的同组轨迹恢复步骤级信用来缓解这种不一致。具体来说，SCPO 将每个失败步骤与一条成功的同组轨迹进行比对打分，并对沿该轨迹取得的新进展给予正的步骤级信用。在 ALFWorld 和 WebShop 上，SCPO 达到或超过强大的基于组的基线，在 15 亿参数规模下，ALFWorld 成功率达到 93.7+/-4.1%，WebShop 达到 74.8+/-2.0%，增益集中在最难的多步骤任务上。

Enhancing Brain MRI Anomaly Detection and Reasoning with ROI Rethink and Synthetic Data

通过 ROI 再思考（ROI Rethink）和合成数据增强脑MRI异常检测与推理

Authors: Shangkun Li, Jie Xu, Yi Guo, Zeju Li, Yuanyuan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.25894
Pdf link: https://arxiv.org/pdf/2606.25894
Abstract Medical vision-language models typically generate diagnoses through single-pass inference without indicating which image regions support their conclusions. This lack of spatial grounding limits clinical utility: outputs cannot be audited, and models may hallucinate findings on normal scans. We present BrReMark (Brain Rethink via ROI Marking), a framework that introduces explicit region marking into brain MRI diagnosis. The model first generates hypotheses about potential abnormalities and grounds them through explicit bounding box marking, then verifies conclusions by re-examining the marked evidence. Training combines supervised fine-tuning on structured reasoning trajectories with reinforcement learning using a composite reward over localization accuracy and diagnostic reasoning. Furthermore, we integrate a domain randomization-based pathology synthesis augmentation strategy to improve the model's generalizability to out-of-distribution (OOD) data. On an internal benchmark, BrReMark improves mAP50 from 0.74% to 37.54% compared to the base model, while achieving 21.57% Clinical F1 and 45.26% diagnostic accuracy. On the NOVA OOD benchmark, it also achieves competitive overall performance with a 45.7% reduction in false positives compared to the state-of-the-art, indicating reduced hallucination on rare pathologies. These findings suggest that explicit hypothesis-verification grounding is a practical path toward trustworthy open-ended brain MRI diagnosis across both in-distribution and OOD settings.
中文摘要 医学视觉语言模型通常通过单次推断生成诊断，但不指示哪些图像区域支持其结论。这种缺乏空间基础限制了临床效用：输出无法审计，模型可能会在正常扫描中产生幻觉。我们介绍了BrReMark（通过ROI标记实现的脑部再思考），这是一个将显性区域标记引入脑MRI诊断的框架。该模型首先生成关于潜在异常的假设，并通过显式边界框标记进行基础，然后通过重新审视标记的证据验证结论。培训结合了对结构化推理轨迹的监督微调，以及基于定位准确性和诊断推理的复合奖励强化学习。此外，我们整合了基于领域随机化的病理综合增强策略，以提高模型对非分布（OOD）数据的泛化性。在内部基准测试中，BrReMark将mAP50从0.74%提升至37.54%，同时实现了21.57%的临床F1和45.26%的诊断准确率。在NOVA OOD基准测试中，它也实现了竞争性的整体表现，假阳性率比最先进设备减少了45.7%，显示出罕见病理的幻觉减少。这些发现表明，明确的假设验证基础是实现可信开放式脑MRI诊断的切实可行路径，适用于分布内和当年患者。

WinDOM: Self-Family Distillation for Small-Model GUI Grounding

WinDOM：面向小模型GUI定位的自家族蒸馏

Authors: Chengheng Li-Chen, Zhiqian Zhou, Hao Chen, Nicolas Chauvin
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.25964
Pdf link: https://arxiv.org/pdf/2606.25964
Abstract Small ($\sim$2B) GUI-grounding agents are attractive for on-device deployment, accessibility tooling, and low-cost iteration, but at this scale they face two open recipe questions: how to obtain bounding-box training data without expensive human annotation, and how to combine supervised fine-tuning with reinforcement learning. We address both, with the explicit goal of pushing small-model performance rather than scaling up. WinDOM is a $54{,}425$-record grounding corpus harvested by driving an open-source Windows 11 web reimplementation under headless Playwright, with bounding boxes read directly off the DOM and no OCR or human annotation. Self-Family Distillation (SFD) is a single rejection-sampling cold-start parameterised only by the teacher choice: either an EMA of the student (no external model) or a frozen larger same-family teacher. We then treat the saturation depth of the SFD cold-start as an explicit GRPO hyperparameter. On a Qwen3.5-2B student, the under-saturated cold-start is a better GRPO initialiser than the converged one: SFD-4B with Early-init RL gains $+5.4$ OOD-mean ($+3.5$ ScreenSpot-Pro, $+7.0$ OSWorld-G, $+5.8$ ScreenSpot-V2) over the base. The same-size EMA mode lands within roughly one OOD-mean point of the cross-size $4$B variant ($65.2$ vs $66.3$) without an external teacher.
中文摘要 小型（$\sim$2B）GUI定位智能体在端侧部署、无障碍工具和低成本迭代方面很有吸引力，但在这一规模下，它们面临两个尚无定论的配方问题：如何在不依赖昂贵人工标注的情况下获得边界框训练数据，以及如何将监督微调与强化学习结合起来。我们同时回答这两个问题，明确目标是提升小模型性能，而非扩大模型规模。WinDOM是一个包含$54{,}425$条记录的定位语料库，通过在无头Playwright下驱动一个开源的Windows 11网页复刻版采集而成，边界框直接从DOM读取，无需OCR或人工标注。自家族蒸馏（Self-Family Distillation，SFD）是一种单次拒绝采样冷启动，仅由教师的选择来参数化：教师要么是学生的EMA（无需外部模型），要么是一个冻结的、更大的同家族模型。随后，我们将SFD冷启动的饱和深度视为一个显式的GRPO超参数。在Qwen3.5-2B学生模型上，欠饱和的冷启动比收敛的冷启动更适合作为GRPO的初始化：采用Early-init RL的SFD-4B相对基础模型取得$+5.4$的OOD平均提升（ScreenSpot-Pro $+3.5$，OSWorld-G $+7.0$，ScreenSpot-V2 $+5.8$）。同尺寸的EMA模式在没有外部教师的情况下，与跨尺寸的$4$B变体相差约一个OOD平均分（$65.2$对$66.3$）。

Mixture-of-Experts RL for Fault-Tolerant Legged Locomotion

用于容错腿式运动的混合专家强化学习

Authors: Giulio Turrisi, Ozan Pali, Luca Oneto, Claudio Semini
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25965
Pdf link: https://arxiv.org/pdf/2606.25965
Abstract Legged robots deployed in planetary exploration and other remote environments must maintain reliable locomotion despite actuator failures and challenging terrain conditions. Although reinforcement learning has achieved strong results in legged locomotion, monolithic policies can struggle to efficiently represent the diverse control strategies required to compensate for different fault conditions. In this work, we propose a fault-aware modular control architecture that explicitly leverages fault-diagnosis information to activate specialized control experts associated with distinct actuator failure modes. Experimental results show that explicit fault-conditioned modular policies consistently outperform monolithic policies of comparable size, achieving higher locomotion performance across failure scenarios. Moreover, the proposed modular architecture retains competitive performance even under significantly reduced network capacity, highlighting its suitability for compute-constrained robotic platforms, such as those typically employed in space applications. The code associated with this work is available at: this https URL.
中文摘要 部署在行星探测和其他偏远环境中的腿式机器人，必须在执行器故障和恶劣地形条件下保持可靠的移动。尽管强化学习在腿部运动中取得了显著成效，但单一策略在高效表示为补偿不同故障条件所需的多样控制策略时可能遇到困难。本研究提出一种故障感知模块化控制架构，明确利用故障诊断信息激活与不同执行器失效模式相关的专业控制专家。实验结果表明，显式故障条件模块策略在性能上始终优于同等规模的单体策略，在故障场景下实现更高的移动性能。此外，该模块化架构即使在显著减少网络容量的情况下仍保持竞争力，凸显其适用于计算受限机器人平台，如通常用于航天应用的机器人平台。与本工作相关的代码可在：此 https URL。

Hierarchical Reinforcement Learning for Neural Network Compression (HiReLC): Pruning and Quantization

神经网络压缩的层级强化学习（HiReLC）：剪枝与量化

Authors: Kamar Hibatallah Baghdadi, Kawther Guoual Belhamidi, Sara Belhadj, Aissa Boulmerka, Nadir Farhi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.26002
Pdf link: https://arxiv.org/pdf/2606.26002
Abstract We present HiReLC, a hierarchical ensemble-reinforcement learning framework for automated joint quantization and structured pruning of deep neural networks. The framework decomposes the compression search across two levels of abstraction: low-level agents (LLAs) operate independently per block, selecting per-kernel configurations over a multi-discrete action space spanning bitwidth, pruning keep-ratio, quantization type, and granularity, while high-level agents (HLAs) coordinate global budget allocation via ensemble voting guided by Fisher Information-based sensitivity estimates. To mitigate the computational cost of policy evaluation, an iterative active learning loop interleaves surrogate-guided RL optimization with post-compression fine-tuning, using a lightweight MLP surrogate to amortize expensive evaluations and a logit-MSE proxy during cold-start. The surrogate is used for reward shaping rather than as a replacement for final post-compression evaluation. The controller is architecture-agnostic by design, with a modular layer abstraction decoupling the RL environment from the underlying network topology. Experiments across Vision Transformer and CNN benchmarks demonstrate effective parameter-storage compression ratios of 5.99-6.72$\times$ with a 3.83% gain in one setting and 0.55-5.62% accuracy drops elsewhere, supporting hierarchical policy decomposition and sensitivity-aware guidance as practical design choices for joint neural network compression.
中文摘要 我们提出HiReLC，一种用于深度神经网络自动联合量化与结构化剪枝的分层集成强化学习框架。该框架将压缩搜索分解到两个抽象层级：低层智能体（LLA）在每个块上独立运行，在涵盖比特宽度、剪枝保留率、量化类型和粒度的多离散动作空间中选择逐核（per-kernel）配置；高层智能体（HLA）则在基于Fisher信息的敏感度估计引导下，通过集成投票协调全局预算分配。为降低策略评估的计算成本，一个迭代式主动学习循环将代理模型引导的强化学习优化与压缩后微调交替进行，使用轻量级MLP代理模型来摊销昂贵的评估，并在冷启动阶段使用logit-MSE作为替代指标。代理模型用于奖励塑造，而不能取代最终的压缩后评估。该控制器在设计上与架构无关，通过模块化的层抽象将强化学习环境与底层网络拓扑解耦。在Vision Transformer和CNN基准上的实验显示，有效参数存储压缩比为5.99-6.72$\times$，在一种设置下提升3.83%，在其他设置下精度下降0.55-5.62%，这支持将分层策略分解和敏感度感知引导作为联合神经网络压缩的实用设计选择。

FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

FORCE：通过价值校准预热和自蒸馏实现高效的VLA强化微调

Authors: Shuyi Zhang, Yunfan Lou, Hongyang Cheng, Yichen Guo, Chuyao Fu, Yaoxu Lyu, Xiaojie Zhang, Haoran Li, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26006
Pdf link: https://arxiv.org/pdf/2606.26006
Abstract Vision-Language-Action (VLA) models are often constrained by the imitation ceiling imposed by sub-optimal data. While Reinforcement Learning (RL) fine-tuning can surpass this limit, it is notoriously sample inefficient. This challenge arises from two core issues: (1) catastrophic initial unlearning due to an unstable Q-function and (2) inefficient policy updates caused by low-quality exploration data, often forcing a reliance on costly human interventions. We introduce FORCE, a 3-stage framework that stabilizes fine-tuning by tackling both issues. FORCE first incorporates a Value-Calibrated Warm-Up phase, utilizing on-policy rollouts to mitigate the distributional shift of the Q-function. Subsequently, during the online stage, this calibrated Q-function acts as a filter for both the policy's own action proposals and expert data, ensuring only high-value actions are used for the policy update. We evaluate FORCE on various simulation and real-world tasks, and the results show that FORCE achieves a 79% absolute improvement in success rates and outperforms prior RL methods by 10%, while accelerating training by 32.5%. Critically, it mitigates the common success rate drop and achieves this robust performance without human intervention, marking a significant step towards deploying capable and autonomous robotic agents.
中文摘要 视觉-语言-动作（VLA）模型常常受限于次优数据带来的模仿上限。虽然强化学习（RL）微调可以突破这一限制，但其样本效率之低是出了名的。这一挑战源于两个核心问题：（1）Q函数不稳定导致的灾难性初始遗忘（unlearning）；（2）低质量探索数据导致的低效策略更新，这往往迫使人们依赖昂贵的人工干预。我们提出FORCE，一个三阶段框架，通过同时解决这两个问题来稳定微调。FORCE首先引入价值校准预热阶段，利用在策略（on-policy）rollout来缓解Q函数的分布偏移。随后，在在线阶段，这个经过校准的Q函数充当过滤器，同时筛选策略自身的动作提议和专家数据，确保只有高价值动作被用于策略更新。我们在多种仿真和真实世界任务上评估FORCE，结果显示FORCE在成功率上取得79%的绝对提升，比以往的强化学习方法高出10%，同时训练加速32.5%。关键的是，它缓解了常见的成功率下降，并且在无需人工干预的情况下取得了这种稳健的性能，标志着向部署有能力且自主的机器人智能体迈出了重要一步。

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

为什么多步骤工具使用强化学习会崩溃，以及监督信号如何解决这个问题

Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26027
Pdf link: https://arxiv.org/pdf/2606.26027
Abstract Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catastrophic collapse, where performance abruptly drops and tool-invocation structures fail. The analysis reveals that these failures stem from unexpected probability spikes in specific control tokens, disrupting structured execution, yet the underlying tool-use capability remains intact, merely obscured by specific formats. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, erroneous example supervision, and others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability, but exhibits degraded performance under format and content out-of-distribution (OOD) evaluation. We also analyze the impact of learning rates and generalization across settings. These results highlight the importance of understanding RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at this https URL.
中文摘要 工具使用使大型语言模型（LLM）能够执行复杂任务，而近期的智能体强化学习（RL）方法在提升模型能力方面展现出潜力。然而，仅靠强化学习往往会在工具使用任务中导致训练不稳定或收益有限。在我们的实验中，一些模型出现灾难性崩溃：性能骤降，工具调用结构失效。分析表明，这些失败源于特定控制token上意外的概率尖峰，它们破坏了结构化执行；但底层的工具使用能力依然完好，只是被特定格式所掩盖。为此，我们系统地研究了一组多样化的监督信号，包括离策略（off-policy）监督、基于提示的引导、错误示例监督等，并在同步和交错两种训练方案下加以应用。我们发现，将监督微调（SFT）与强化学习交错进行可以显著提升稳定性，但在格式和内容的分布外（OOD）评估下性能有所下降。我们还分析了学习率的影响以及在不同设置间的泛化情况。这些结果凸显了理解强化学习失败的重要性，并展示了多样化的监督信号如何引导探索性学习，从而为复杂的多步工具使用任务稳健地训练LLM。我们的代码可在此 https URL 获取。

Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations

通过意图感知场景表征学习机器人在人群中的视觉导航

Authors: Han Bao, Bingyi Xia, Hanjing Ye, Yu Zhan, Hao Cheng, Baozhi Jia, Wenjun Xu, Jiankun Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26047
Pdf link: https://arxiv.org/pdf/2606.26047
Abstract Robot crowd navigation requires the ability to infer human intentions while accounting for the structural constraints of the environment. Currently, deep reinforcement learning (DRL) provides a promising method for learning navigation policies that understand human intentions. However, most existing DRL methods rely on limited scene representations, treating pedestrians as simple 2D points and ignoring rich visual cues from both humans and the environment. To address this issue, we introduce iCrowdNav, a novel visual crowd navigation method with intention-aware scene representations, to encode behavioral and structural context from egocentric visual observations. Our method employs two key components: a spatio-temporal encoder for extracting occupancy features of the scene, and Intent-Interact Former (I$^2$ Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that supports effective DRL policy training. Extensive experiments show that our method achieves superior performance over baselines, and real-world deployment demonstrates vision-based crowd navigation.
中文摘要 机器人人群导航需要在考虑环境结构约束的同时推断人类意图。目前，深度强化学习（DRL）为学习能够理解人类意图的导航策略提供了一种有前景的方法。然而，其中大多数方法依赖有限的场景表征，将行人视为简单的二维点，忽视了来自人类和环境的丰富视觉线索。为解决这一问题，我们提出iCrowdNav，一种采用意图感知场景表征的新型视觉人群导航方法，用于从第一人称（egocentric）视觉观测中编码行为和结构上下文。我们的方法包含两个关键组件：用于提取场景占用特征的时空编码器，以及Intent-Interact Former（I$^2$ Former），后者是一个基于注意力的模块，通过编码人体姿态来推断行人的运动意图。这些特征被整合为一个紧凑的状态嵌入，以支持有效的DRL策略训练。大量实验表明，我们的方法优于各基线方法，真实世界部署也展示了基于视觉的人群导航能力。

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

后训练中被忽视的免费午餐：LLM智能体的进展优势

Authors: Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick, Sharon Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26080
Pdf link: https://arxiv.org/pdf/2606.26080
Abstract Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term the progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.
中文摘要 过程奖励模型使得对LLM进行细粒度的步级评估成为可能，但为智能体场景构建这类模型仍然极其困难：长程交互、不可逆动作和随机的环境反馈，使得人工标注和蒙特卡洛估计在大规模下都不可行。本研究表明，强化学习（RL）后训练本身已经提供了有效步级评分所需的要素，从而完全无需专门训练奖励模型。具体而言，我们在一般的随机马尔可夫决策过程下推导出一种隐式优势，称之为进展优势（progress advantage）：经RL训练的策略与其参考策略之间的对数概率比恰好恢复了最优优势函数。这一表述使得所得信号无需标注、与领域无关，并且可作为标准RL后训练流程的副产品直接获得。我们在五个基准和四个模型家族上，通过三种不同的应用验证了进展优势的有效性：测试时扩展、不确定性量化和失败归因。在所有设置下，它始终优于基于置信度的基线，并且尽管不需要任何针对特定任务的训练，仍超过了专门训练的奖励模型。我们还对进展优势的特性进行了更深入的分析，以补充上述结果，并为其在真实世界智能体系统中的应用提供实用指导。

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

使用采样演示的在策略自蒸馏会降低输出多样性

Authors: Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26091
Pdf link: https://arxiv.org/pdf/2606.26091
Abstract On-policy self-distillation achieves strong pass@1 accuracy by using a single model as both teacher and student, with the teacher conditioned on a correct demonstration to provide dense token-level feedback. We show that this could come at a hidden cost: rollout diversity decreases and pass@k curves flatten (i.e., generating more rollouts fails to improve accuracy). We trace this to compounding biases in the design of self-distillation with sampled demonstrations. The teacher scores each student rollout while conditioned on a sampled correct rollout, channeling its feedback through the model's own biases. We theoretically analyze the optimal self-distillation policy and show that it tilts the base distribution by a pointwise conditional mutual information score between the student's rollout and the correct rollout used as context. Unlike the ideal optimal on-policy reinforcement learning (RL), which preserves probability ratios among equally correct rollouts, self-distillation can amplify existing probability gaps, concentrating mass on already-dominant modes. On a controlled graph path-finding task and science question-answering benchmarks, self-distilled models match or exceed RL on average performance but exhibit substantially lower functional and semantic diversity, failing on out-of-distribution settings that require diverse strategies.
中文摘要 在策略（on-policy）自蒸馏使用同一个模型同时充当教师和学生，其中教师以一条正确的演示为条件，提供密集的token级反馈，从而取得很高的pass@1准确率。我们表明，这可能带来一个隐藏的代价：rollout多样性下降，pass@k曲线趋于平坦（即生成更多rollout并不能提升准确率）。我们将其归因于使用采样演示的自蒸馏在设计上存在的偏差累积。教师在以一条采样得到的正确rollout为条件的情况下为每条学生rollout打分，使其反馈经由模型自身的偏差传递。我们从理论上分析了最优自蒸馏策略，并证明它按照学生rollout与用作上下文的正确rollout之间的逐点条件互信息分数对基础分布进行倾斜。理想的最优在策略强化学习（RL）会保持同样正确的rollout之间的概率比，而自蒸馏则可能放大已有的概率差距，将概率质量集中到本已占主导的模式上。在一个受控的图路径搜索任务和若干科学问答基准上，自蒸馏模型的平均性能与RL持平或更优，但功能多样性和语义多样性显著更低，在需要多样化策略的分布外设置上失败。

Keyword: diffusion policy

One Body, Two Minds: Variable Autonomy Approach for a Co-embodied Robotic Hand

一体两智：共具身机器人手的可变自主性方法

Authors: Piotr Koczy, Yuchong Zhang, Danica Kragic, Michael C. Welle
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25575
Pdf link: https://arxiv.org/pdf/2606.25575
Abstract Assistive robotic systems face a fundamental trade-off: fully autonomous systems lack user agency, while fully user-controlled systems demand continuous cognitive effort. Existing shared autonomy approaches blend human and robot commands but are mostly deployed in separate physical bodies. We introduce co-embodiment with variable autonomy, where human and robot share a single physical body and operate at different autonomy levels across task phases, from mutual autonomy during object search and grasping to human-dominant control during actuation. We present a co-embodied, wearable robotic hand that has its "own mind" and operates with variable autonomy levels. A learning-from-demonstration visuomotor diffusion policy enables autonomous grasping when the user positions the hand near known objects. Once grasped, the system signals completion and the human can actuate the grasped tool (drill, spray bottle, infrared thermometer, lighter, and ice-cream scoop) via hands-free head gestures. The human retains veto authority at all times through a release gesture that returns the system to the initial phase. Unlike blended autonomy, where control is continuously negotiated, our co-embodied approach consists of variable autonomy from full human control to full independent actions while maintaining physical coupling, realizing a one body, two minds paradigm. In a user study with 44 participants performing five bimanual tasks, users rapidly adapted to this "two minds" paradigm: completion times improved by 23.3% across trials ($p < 0.001$, Cohen's $d = 0.94$), the best-performing policy variant reached a 93.6% task success rate, and acceptance ratings were high (5.70/7 overall impression, 5.52/7 daily use willingness). This work establishes co-embodiment with variable autonomy as a viable approach for assistive robotics, enabling human-robot collaboration through co-embodiment.
中文摘要 辅助机器人系统面临一个根本性的权衡：完全自主的系统缺乏用户主导权，而完全由用户控制的系统则需要持续的认知投入。现有的共享自主方法融合人类和机器人的指令，但大多部署在相互分离的物理躯体上。我们提出具有可变自主性的共具身（co-embodiment），即人类和机器人共享同一个物理躯体，并在任务的不同阶段以不同的自主层级运作，从物体搜索和抓取阶段的相互自主，到操作阶段的人类主导控制。我们展示了一只共具身的可穿戴机械手，它拥有“自己的意识”，并以可变的自主层级运作。当用户将手移到已知物体附近时，一个基于示范学习的视觉运动扩散策略使其能够自主抓取。抓取完成后，系统会发出完成信号，人类可以通过免手持的头部姿态来操作所抓取的工具（电钻、喷雾瓶、红外测温仪、打火机和冰淇淋勺）。人类始终可以通过一个释放手势行使否决权，该手势使系统回到初始阶段。与控制权被持续协商的混合自主不同，我们的共具身方法在保持物理耦合的同时，提供从完全人类控制到完全独立行动的可变自主性，实现了“一体两智”的范式。在一项由44名参与者完成五项双手任务的用户研究中，用户迅速适应了这一“两智”范式：完成时间在各次试验间改善了23.3%（$p < 0.001$，Cohen's $d = 0.94$），表现最佳的策略变体达到93.6%的任务成功率，接受度评分也较高（整体印象5.70/7，日常使用意愿5.52/7）。这项工作确立了具有可变自主性的共具身作为辅助机器人的一种可行方法，通过共具身实现人机协作。

Stage-Aware and Roughness-Constrained Diffusion Policy for Multi-Stage Robotic Polishing

面向多阶段机器人抛光的阶段感知与粗糙度约束扩散策略

Authors: Shuai Ke, Jiexin Zhang, Huan Zhao, Zhiao Wei, Yikun Guo, Tiange Wu, Guoqiang Guo, Haoyuan Zhou, Jie Pan, Han Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.25754
Pdf link: https://arxiv.org/pdf/2606.25754
Abstract Polishing is a critical finishing process in high-end manufacturing fields such as aerospace, where surface quality directly affects the service performance and reliability of components. Robotic imitation learning provides a flexible solution for such tasks, but current methods remain limited in industrial polishing because of long-horizon dependencies, uncertain stage transitions, and the difficulty of modeling and regulating coupled process parameters. To address these issues, this paper proposes a Stage-Aware and Roughness-Constrained Diffusion Policy (SRDP) for robotic polishing. SRDP infers the process-stage posterior from multimodal observation histories and uses it to condition the shared reverse denoising process, enabling stage-consistent action generation without external stage labels during execution. Furthermore, a roughness-oriented process-constrained diffusion sampling method is incorporated to generate constrained feed speed and normal contact force under stage-wise preset spindle speeds, thereby improving process consistency and physical feasibility. Systematic experiments are conducted on two representative scenarios, namely spacecraft cabin coating-surface polishing and inner-cavity structural surface finishing. Comparisons with advanced baselines, ablation studies, and real-robot validations comprehensively evaluate the proposed method. The results show that SRDP improves stage-transition stability, process-parameter consistency, and final surface quality across different polishing scenarios.
中文摘要 抛光是航空航天等高端制造领域中关键的精加工工艺，表面质量直接影响零部件的服役性能和可靠性。机器人模仿学习为此类任务提供了灵活的解决方案，但由于长时程依赖、阶段转换不确定以及耦合工艺参数难以建模和调控，现有方法在工业抛光中仍然受限。为解决这些问题，本文提出了一种面向机器人抛光的阶段感知与粗糙度约束扩散策略（SRDP）。SRDP从多模态观测历史中推断工艺阶段的后验，并以此作为共享反向去噪过程的条件，从而在执行过程中无需外部阶段标签即可生成阶段一致的动作。此外，本文引入了一种面向粗糙度的工艺约束扩散采样方法，在按阶段预设的主轴转速下生成受约束的进给速度和法向接触力，从而提升工艺一致性和物理可行性。在两个代表性场景上进行了系统实验，即航天器舱体涂层表面抛光和内腔结构表面精加工。通过与先进基线的比较、消融研究和真实机器人验证，对所提方法进行了全面评估。结果表明，SRDP在不同抛光场景下提升了阶段转换稳定性、工艺参数一致性和最终表面质量。