Generated: 2026-06-11 20:12:55 (UTC+8); arXiv announcement: 2026-06-11 20:00 EDT (2026-06-12 08:00 UTC+8)
44 relevant papers today
Keyword: reinforcement learning
Compatibility-Aware Dynamic Fine-Tuning for Large Language Models
大型语言模型的兼容性感知动态微调
- Authors: Yucheng Zhou, Junwei Sheng, Qianning Wang, Jianbing Shen
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11206
- Pdf link: https://arxiv.org/pdf/2606.11206
- Abstract
Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.
- 中文摘要
监督式微调(SFT)是对齐大型语言模型(LLM)的主要范式,但它存在优化不稳定和泛化有限的问题。最新研究将此问题归因于病态的梯度缩放,并提出动态微调(DFT)在词元(token)层面加以纠正。然而,DFT假设所有示范都是同等合适的学习目标,而大规模指令数据的强异质性违背了这一假设:示范与策略之间的不匹配会在样本层面引发高方差更新。我们提出兼容性感知动态微调(CADFT),它是DFT的一个有原则的扩展,用于控制样本级优化方差。CADFT从模型似然中推导出动态的、依赖策略的兼容性信号,用以调制监督更新,抑制不兼容示范带来的高方差梯度。我们还提出一种延迟的、低频的兼容性引导重写策略,将持续不兼容的示范转化为可学习的目标。我们证明CADFT可以被解释为一种方差受控的估计器,将DFT中词元级的稳定化推广到样本层面。大量实验表明,该方法在稳定性、泛化性以及强化学习冷启动初始化方面均有提升,同时保持完全监督,且不依赖显式奖励建模。
ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
ProcessThinker:通过基于展开(rollout)的过程奖励增强多模态大型语言模型的推理能力
- Authors: Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11209
- Pdf link: https://arxiv.org/pdf/2606.11209
- Abstract
Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct
- 中文摘要
视觉问答越来越需要多步推理。近期基于可验证奖励强化学习(RLVR)和组相对策略优化(GRPO)的后训练可以提升多模态推理能力,但大多数方法依赖仅基于结果的稀疏奖励。因此,它们难以判断错误答案是源于推理后期的小失误,还是源于从一开始就无益的轨迹。常见的解决方案是训练过程奖励模型(PRM)以提供步骤级监督,但这通常需要大规模高质量的思维链标注和额外的训练成本。我们提出ProcessThinker,一种实用的后训练流程,无需训练显式PRM即可提供步骤级过程奖励。ProcessThinker首先将推理轨迹重写为带步骤标记的格式,用于冷启动监督微调,然后应用GRPO,并结合标准格式奖励和我们基于展开(rollout)的过程奖励。具体来说,对于每个中间步骤,我们从该步骤采样多个后续推理,并以经验成功率(最终答案验证)作为该步骤的奖励。这提供了稠密的信用分配,并鼓励能更可靠地支持正确结论的推理步骤,有助于减少各步骤间不一致或自相矛盾的推进,而这正是逻辑推理中的关键问题。在四个具有挑战性的视频基准(Video-MMMU、MMVU、VideoMathQA 和 LongVideoBench)上,ProcessThinker 相对基线模型 Qwen3-VL-8B-Instruct 取得了一致的提升
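The rollout-based process reward in the ProcessThinker abstract (sample several continuations from each intermediate step, then score the step by the fraction that reach a verified-correct final answer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_continuation`, `is_correct`, and `num_rollouts` are hypothetical names, and the abstract does not say how many continuations are sampled per step.

```python
from typing import Callable, List, Sequence


def rollout_step_rewards(
    steps: Sequence[str],
    sample_continuation: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Return one reward per reasoning step.

    The reward of step i is the empirical success rate of `num_rollouts`
    continuations sampled from the prefix steps[0..i].
    `sample_continuation` (hypothetical) maps a prefix to a final answer;
    `is_correct` (hypothetical) verifies that final answer.
    """
    if num_rollouts < 1:
        raise ValueError("num_rollouts must be at least 1")
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = list(steps[:i])
        successes = sum(
            1 for _ in range(num_rollouts)
            if is_correct(sample_continuation(prefix))
        )
        rewards.append(successes / num_rollouts)
    return rewards
```

In a real pipeline `sample_continuation` would call the policy model with sampling enabled, so the cost grows with the number of steps times `num_rollouts`.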
CFCamo: A Counterfactual Detect-or-Abstain Framework for Camouflaged Object Detection
CFCamo:一种用于伪装目标检测的反事实“检测或弃权”框架
- Authors: Suhang Li, Osamu Yoshie, Yuya Ieiri
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.11231
- Pdf link: https://arxiv.org/pdf/2606.11231
- Abstract
Vision-language reinforcement learning has recently shown strong target-present localization for camouflaged object detection (COD). Yet localization is only one side of the decision: when the agent faces an ordinary image with no camouflaged target, will it still claim that a camouflaged object exists? Standard COD training and evaluation data are positive-only, so agents optimized under this setting can acquire an over-detect bias, a task-specific form of object hallucination that standard COD evaluation leaves unmeasured. To quantify this target-absent behavior, we construct Counterfactual COD (CF-COD), a paired benchmark that removes the camouflaged target from each held-out COD evaluation image while preserving a plausible background. CF-COD evaluates whether a model detects the target on the original image and abstains on the target-absent counterfactual, summarized by Pair Accuracy (PA). We further introduce CFCamo, a paired counterfactual framework for COD with abstention. For training, CFCamo optimizes a Qwen3-VL-4B-Instruct agent with Counterfactual Sequence Policy Optimization (CSPO), which samples paired original-counterfactual rollouts and uses a Counterfactual Paired Reward (CPR) to couple original-image detection with counterfactual abstention. On CAMO-test, CFCamo improves S_alpha by +3.7 pp over the prior RL-based COD baseline; across CF-COD, it reaches 80.0-90.8% PA. Ablations show that removing counterfactual coupling reduces PA to 1.4-5.2% despite strong target-present COD scores, showing that target-present evaluation alone does not characterize detect-or-abstain behavior. Overall, these results indicate that CFCamo improves COD agents by coupling target-present detection with target-absent abstention, rather than merely strengthening target-present localization. Code and data are available at this https URL.
- 中文摘要
视觉语言强化学习最近在伪装目标检测(COD)中展现出很强的目标存在时的定位能力。然而,定位只是决策的一个方面:当智能体面对一张没有伪装目标的普通图像时,它是否仍会声称存在伪装目标?标准COD训练和评估数据仅包含正样本,因此在该设定下优化的智能体可能产生过度检测偏差,这是一种任务特有的物体幻觉,而标准COD评估并未对其进行度量。为了量化这种目标缺失时的行为,我们构建了反事实COD(CF-COD),这是一个配对基准:它从每张留出的COD评估图像中移除伪装目标,同时保留合理的背景。CF-COD评估模型是否在原始图像上检测到目标,并在目标缺失的反事实图像上弃权,并以配对准确率(Pair Accuracy,PA)汇总。我们进一步提出CFCamo,一个面向带弃权的COD的配对反事实框架。在训练方面,CFCamo 使用反事实序列策略优化(CSPO)来优化 Qwen3-VL-4B-Instruct 智能体,CSPO采样成对的原始-反事实展开(rollout),并使用反事实配对奖励(CPR)将原始图像上的检测与反事实图像上的弃权耦合起来。在CAMO-test上,CFCamo的S_alpha比此前基于强化学习的COD基线提升+3.7 pp;在CF-COD上,其PA达到80.0-90.8%。消融实验表明,去除反事实耦合后,尽管目标存在时的COD得分依然很高,PA却降至1.4-5.2%,这说明仅靠目标存在时的评估无法刻画“检测或弃权”行为。总体而言,这些结果表明,CFCamo是通过将目标存在时的检测与目标缺失时的弃权相耦合来改进COD智能体,而不仅仅是强化目标存在时的定位。代码和数据可在此 https URL 获取。
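The Pair Accuracy (PA) idea in the CFCamo abstract can be illustrated with a short sketch. It assumes PA is the fraction of image pairs on which the model both detects the target in the original image and abstains on the counterfactual; the abstract does not give the exact formula, so this definition and the name `pair_accuracy` are assumptions.

```python
from typing import Iterable, Tuple


def pair_accuracy(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of (original, counterfactual) pairs handled correctly.

    Each pair is (detected_on_original, detected_on_counterfactual).
    A pair counts as correct only if the model detects on the original
    image AND abstains on the target-absent counterfactual.
    (Assumed definition; the abstract does not spell out the formula.)
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for on_orig, on_cf in pairs if on_orig and not on_cf)
    return correct / len(pairs)
```

Under this definition, a model that always claims a target is present scores 0 on PA even if its target-present detection is perfect, which matches the over-detect bias the abstract describes.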
Multi-agent rendezvous in fluid flows via reinforcement learning
通过强化学习实现流体流动中的多智能体交会
- Authors: Bocheng Li, Jingran Qiu, Lihao Zhao
- Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
- Arxiv link: https://arxiv.org/abs/2606.11274
- Pdf link: https://arxiv.org/pdf/2606.11274
- Abstract
Rendezvous is a critical task for multi-agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit underlying fluid kinematics to facilitate convergence. In this study, we adopt a multi-agent reinforcement learning (MARL) approach to develop physics-informed rendezvous strategies in vortical flows. Compared to a naive strategy, where agents navigate toward their counterparts, MARL strategies significantly improve the rendezvous rate. MARL strategies also show transferability across varying vortex intensities, vortex scales, and swarm sizes. By breaking the symmetry of the state-action map, MARL strategy leverages a non-intuitive mechanism that prevents agents from becoming trapped in separate vortices, thereby enhancing rendezvous success. Additionally, a heuristic strategy is extracted from the learned strategy and also outperforms the naive strategy. Furthermore, a theoretical analysis demonstrates that fluid deformation impedes the rendezvous process. Large finite-time Lyapunov exponents identify where fluid effects separate adjacent agents, suggesting that targets should be planned in weak-deformation regions. Our findings reveal the important role that agent-fluid interactions play in multi-agent tasks and highlight the MARL capability to explore swarm intelligence in complex flow environments.
- 中文摘要
交会是多智能体系统的一项关键任务,要求智能体相互协调,在未指定的地点会合。然而,在流体环境中实现这一点存在挑战,因为智能体如何利用底层的流体运动学来促进汇聚尚不明确。本研究采用多智能体强化学习(MARL)方法,在涡旋流动中开发基于物理信息的交会策略。与智能体直接朝同伴移动的朴素策略相比,MARL策略显著提高了交会率。MARL策略还表现出在不同涡旋强度、涡旋尺度和群体规模之间的可迁移性。通过打破状态-动作映射的对称性,MARL策略利用了一种非直观的机制,防止智能体被困在不同的涡旋中,从而提高交会成功率。此外,我们从学到的策略中提取出一种启发式策略,其表现同样优于朴素策略。理论分析进一步表明,流体变形会阻碍交会过程。较大的有限时间李雅普诺夫指数可以标识出流体效应使相邻智能体分离的位置,这表明目标点应规划在弱变形区域。我们的发现揭示了智能体-流体相互作用在多智能体任务中的重要作用,并凸显了MARL在复杂流动环境中探索群体智能的能力。
Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria
Phi-Actor-Critic:引导一般和博弈达到帕累托有效相关均衡
- Authors: Wongyu Lee, Francesco Lelli, Omran Ayoub, Massimo Tornatore
- Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11284
- Pdf link: https://arxiv.org/pdf/2606.11284
- Abstract
Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose $\Phi$-Actor-Critic ($\Phi$-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, $\Phi$-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that $\Phi$-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.
- 中文摘要
现实世界的多智能体系统,从交通协调到资源分配,通常被建模为一般和博弈,其中个体激励与集体福利相冲突。在这些场景中,核心挑战不仅是找到一个均衡,而是在众多次优的纳什均衡中选出社会期望的结果。标准的深度多智能体强化学习(MARL)方法难以解决这一问题,因为价值分解方法受限于单调性假设,而策略梯度方法往往收敛到稳定但社会效率低下的均衡。为解决这一局限,我们提出$\Phi$-Actor-Critic($\Phi$-AC)框架,它利用交换遗憾(swap regret)最小化引导学习走向高福利的相关均衡(CE)。为了使反事实遗憾估计在深度MARL中可行,$\Phi$-AC采用集中式注意力评论家(critic),在一次前向传播中预测向量值遗憾,避免了计算代价高昂的反事实模拟。我们进一步引入基于拉格朗日的均衡选择机制,在通过遗憾约束保证稳定性的同时优化社会福利。在矩阵博弈、多智能体粒子环境(MPE)和Melting Pot Harvest场景上的实验表明,$\Phi$-AC能够在多样的混合动机场景中学到高效且稳定的协调策略,同时保持较高的集体回报和有竞争力的公平性。
PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation
PLUME:多指操作的概率潜在统一世界建模与参数估计
- Authors: Abhinav Kumar, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.11396
- Pdf link: https://arxiv.org/pdf/2606.11396
- Abstract
Dexterous manipulation with multi-finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large-scale data collection with known parameter values, simulation-trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially-observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re-training or fine-tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero-shot transfer of our simulation-trained policy and outperform state-of-the-art offline reinforcement learning and world-model-augmented behavior cloning baselines. Please see our website at this https URL for videos.
- 中文摘要
多指手的灵巧操作可能对物体形状、位姿和摩擦系数等物理参数很敏感。虽然仿真能够在参数值已知的情况下进行大规模数据收集,但仿真训练的策略在部署时仍需处理不确定性,因为此时真实参数以及真实动力学都是未知的。标准的域随机化策略对于拧螺丝刀这类精细任务可能并不足够,因为操作策略可能需要随具体参数值而改变。为此,我们提出概率潜在统一世界建模与参数估计(PLUME),这是一种世界模型,它联合学习如何演化关于参数值的信念,以及以这些参数为条件的系统动力学。我们学习一个潜在空间,用于联合表示多个性质不同的物理参数以及奖励(奖励本身是部分可观测变量的函数),以指导规划。我们新颖的学习框架通过在线参数推断,而非重新训练或微调,使世界模型高效地与真实动力学对齐。我们在仿真的拧螺丝刀、转阀门、提桶和弹拨圆盘任务,以及一个硬件上的拧螺丝刀任务中评估了我们的方法,其中,我们成功实现了仿真训练策略的零样本迁移,并且优于最先进的离线强化学习基线和世界模型增强的行为克隆基线。视频请见我们的网站(此 https URL)。
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies
基于区块的机器人策略动态执行视野预测
- Authors: Yuchi Zhao, Miroslav Bogdanovic, Arjun Sohal, Liyu Tao, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, Animesh Garg
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.11408
- Pdf link: https://arxiv.org/pdf/2606.11408
- Abstract
Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models, where the policy predicts a sequence of actions and executes a fixed number of them instead of acting one step at a time. However, this paradigm relies on a key assumption: a fixed execution horizon. During chunk execution, the policy operates open-loop, which is particularly problematic for fine-grained manipulation tasks that require frequent replanning. In practice, the execution horizon is typically chosen through empirical tuning and is highly task-dependent. To this end, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. Across our evaluations, DEHP improves the success rate of different high-precision and long-horizon manipulation tasks by a large margin. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control. Project page: this https URL
- 中文摘要
动作分块已成为现代机器人策略中的标准设计,从扩散/流策略到视觉-语言-动作模型,策略预测一系列动作并执行固定数量的动作,而非一步步行动。然而,该范式依赖于一个关键假设:固定的执行视野。在块执行期间,策略以开环方式运行,这对于需要频繁重新规划的细粒度操作任务尤其棘手。实际上,执行视野通常通过经验调优选择,且高度依赖任务。为此,我们提出了动态执行视野预测(DEHP),这是一种通过在线强化学习训练轻量级执行视野预测分支的有效方法,同时保持预训练的区块策略完全冻结。这使得该方法兼容黑箱块策略,并隔离了执行视野适应与底层动作生成器变化的影响。在我们的评估中,DEHP大幅提升了各种高精度和长视野操作任务的成功率。我们的定性分析进一步表明,DEHP在任务的细粒度阶段预测执行时间较短,在自由空间运动阶段预测更长的执行时间。通过这种方式,DEHP在开环块执行的效率与闭环单步控制的反应性之间取得了平衡。项目页面:此 https URL
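The control loop implied by the DEHP abstract (a frozen chunk policy proposes a sequence of actions, and a separate predictor decides how many of them to execute before replanning) might look like the sketch below. All names (`chunk_policy`, `horizon_predictor`, `env_step`) are hypothetical stand-ins, and the reinforcement-learning training of the predictor is not shown.

```python
from typing import Callable, List, Tuple, TypeVar

Obs = TypeVar("Obs")
Action = TypeVar("Action")


def run_with_dynamic_horizon(
    chunk_policy: Callable[[Obs], List[Action]],
    horizon_predictor: Callable[[Obs, List[Action]], int],
    env_step: Callable[[Obs, Action], Obs],
    obs: Obs,
    max_steps: int,
) -> Tuple[Obs, int]:
    """Execute chunks with a per-chunk execution horizon.

    Returns the final observation and the number of replans.
    The chunk policy is treated as a black box and is never modified.
    """
    executed = 0
    replans = 0
    while executed < max_steps:
        chunk = chunk_policy(obs)  # frozen action-chunk generator
        if not chunk:
            raise ValueError("chunk_policy returned an empty chunk")
        # Clamp the predicted horizon to [1, len(chunk)].
        horizon = max(1, min(len(chunk), horizon_predictor(obs, chunk)))
        for action in chunk[:horizon]:  # open-loop execution of the prefix
            obs = env_step(obs, action)
            executed += 1
            if executed >= max_steps:
                break
        replans += 1
    return obs, replans
```

A short predicted horizon makes the loop behave like closed-loop single-step control, while a horizon equal to the chunk length recovers fixed open-loop chunk execution.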
Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity
镜像下降超越欧几里得稳定性:初始化灵敏度的指数级分离
- Authors: Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11431
- Pdf link: https://arxiv.org/pdf/2606.11431
- Abstract
Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
- 中文摘要
镜像下降(MD)将梯度下降(GD)推广到欧几里得几何之外,最近又作为理解强化学习和大型语言模型后训练中KL正则化策略优化的视角重新出现。这引出了一个对可重复性和可靠性至关重要的基本鲁棒性问题:MD的动态对其输入有多敏感?我们关注初始化,而初始化往往本身就是一个预训练或先前已对齐的模型。众所周知,二次正则化的MD(包括GD和马氏(Mahalanobis)几何)对于凸光滑目标是稳定的。我们展示了一个鲜明的对比:一旦正则化项不是二次的,MD对初始化的敏感度可能比GD高出指数级,即使正则化项在欧几里得范数下条件良好。我们给出一个三维构造,其目标函数凸且光滑,正则化项强凸、光滑且条件良好,其中初始的 $\varepsilon$ 扰动在以步长 $\eta$ 进行 $T$ 次MD迭代后会迅速放大到 $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$。对于单纯形上的经典KL正则化MD,我们证明即使是线性目标,也会在高维或近边界情形下以指数速度放大初始的 $\varepsilon$ 扰动。最后,我们证明加入一个指向锚点的布雷格曼(Bregman)正则化项可以稳定动态,同时在很大程度上保留优化保证,并且锚点的选择至关重要:锚定在初始化点只能部分缓解不稳定性,而锚定在一个固定的点则能带来更稳定的机制。
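The near-boundary sensitivity of KL-regularized mirror descent on the simplex, mentioned in the abstract above, can be seen in a toy example. KL-regularized MD on the simplex is the standard exponentiated-gradient (multiplicative-weights) update. The two-dimensional linear objective and the helper names below are illustrative assumptions and are not the paper's construction.

```python
import math
from typing import List, Sequence


def kl_mirror_descent_step(x: Sequence[float], grad: Sequence[float], eta: float) -> List[float]:
    """One KL-regularized MD step on the simplex (exponentiated gradient)."""
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run_linear(x0: Sequence[float], cost: Sequence[float], eta: float, steps: int) -> List[float]:
    """Minimize the linear objective <cost, x>; its gradient is `cost` everywhere."""
    x = list(x0)
    for _ in range(steps):
        x = kl_mirror_descent_step(x, cost, eta)
    return x


def first_coordinate_gap(delta: float, eps: float, eta: float, steps: int) -> float:
    """Gap in coordinate 0 between runs started at (delta, 1 - delta)
    and at the eps-perturbed point (delta + eps, 1 - delta - eps)."""
    cost = [0.0, 1.0]  # toy linear objective (assumption)
    a = run_linear([delta, 1.0 - delta], cost, eta, steps)
    b = run_linear([delta + eps, 1.0 - delta - eps], cost, eta, steps)
    return abs(a[0] - b[0])
```

For an initial point close to the boundary (small `delta`), the first coordinate behaves like `delta * exp(eta * T)` while it is still small, so an `eps` perturbation of `delta` is amplified by roughly `exp(eta * T)` over the same range.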
Agent Skill Evaluation and Evolution: Frameworks and Benchmarks
代理技能评估与演变:框架与基准
- Authors: Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.11435
- Pdf link: https://arxiv.org/pdf/2606.11435
- Abstract
The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is this https URL
- 中文摘要
代理技能的增长改变了代理系统的构建、评估和部署方式。随着技能库的不断扩展,严格的评估对于确保其在实际应用中的实用性、质量和安全性变得至关重要。因此,该领域正经历从孤立技能创造向自动化、评估驱动技能演进的范式转变。在本综述中,我们系统地考察了技能演变与评估的全貌,超越了基础技能创造。我们将进化分为四个不同的范式,涵盖执行反馈、轨迹提纯、压缩和强化学习,展示了每个元素如何促进技能效用和可靠性的提升。我们还分析了六个以技能为中心的基准类别,识别基准覆盖率、权衡和指标丰富性的结构性差距,以推动技能研究的发展。最后,我们确定了构建可通用、高效且可验证安全的技能生态系统的开放方向。项目的网址是这个 https URL
INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration
INFRAMIND:基础设施感知多智能体编排
- Authors: Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11440
- Pdf link: https://arxiv.org/pdf/2606.11440
- Abstract
Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
- 中文摘要
现有的多智能体LLM编排方法,从暴力破解集成到学习的路由器,会根据任务和模型特征选择模型和拓扑结构。然而,这些方法不考虑服务基础设施的运行时状态。在并发负载下的共享GPU集群中,这种基础设施盲点导致系统性资源利用不足:首选模型积累了深层请求队列,而同等能力的替代方案则闲置。在多代理流水线中,每次查询触发多次顺序模型调用,这些延迟会在每个下游步骤中叠加。缩小这一差距具有挑战性,因为相关基础设施信号(队列深度、KV缓存压力、延迟)是动态且噪声大的,必须驱动三种不同的决策:规划、每步路由和调度。我们介绍INFRAMIND,一个让整个多代理栈具备基础设施感知的框架。基础设施感知的规划器根据实时系统负载和剩余预算来设置拓扑和角色选择,拥塞时偏向简单图,低负载时偏向丰富图。基础设施感知执行者随后观察每个模型队列深度、缓存利用率和响应延迟,以决定调用哪个模型以及推理多深;预算感知调度器进一步重新排序每个模型的队列,使紧急请求优先响应。系统被设定为一个层级约束的MDP,并通过强化学习端到端解决,学会自动平衡质量与延迟。在五个基准测试中,INFRAMIND在低负载下比之前基线高达+7.6 pp的精度,延迟降低最多7倍,并在高负载下所有基线都低于50%的情况下,SLO合规性可维持高达99.9%。
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
大型语言模型推理周期表:推理范式、方法与失败模式的结构化综述
- Authors: Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.11470
- Pdf link: https://arxiv.org/pdf/2606.11470
- Abstract
Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.
- 中文摘要
大型语言模型(LLMs)在自然语言处理任务中取得了强劲表现,但可靠的推理仍是一个未解的挑战。尽管现代大型语言模型在结构化推理、多步骤问题解决和上下文理解方面有所进步,但其推理行为常常不一致,且对提示策略、任务设计和模型规模敏感。本调查系统分析了来自arXiv、Semantic Scholar、Google Scholar、Papers with Code和ACL Anthology的300多篇近期论文,探讨大型语言模型推理能力的形成及其不足之处。我们主要贡献三项。首先,我们介绍了大型语言模型推理研究的结构化分类,涵盖思维链推理、多跳推理、数学推理、常识推理、视觉与时间推理、代码与算法推理、检索增强推理、工具增强与代理推理,以及基于强化学习的推理。其次,我们分析了这些范式中的方法论趋势,包括提示方法、模型架构、训练目标、奖励建模和评估基准。第三,我们综合了反复出现的局限性和失败模式,如推理幻觉、脆弱的多步推断、薄弱的因果抽象以及较差的跨域推广。通过整理快速扩展的文献,本综述为大型语言模型推理的当前能力与局限性提供了一个统一的视角。我们还识别了新兴的研究方向,包括元推理、自我演化推理框架、多模态推理和社会基础推理。总体而言,这项工作旨在为未来语言模型中开发更稳健、可解释和可推广的推理系统提供参考。
Learning Object Manipulation from Scratch via Contrastive Interaction
通过对比交互从零开始学习对象操作
- Authors: Tongle Shen, Caleb Chuck, Fan Feng, Biwei Huang
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11525
- Pdf link: https://arxiv.org/pdf/2606.11525
- Abstract
Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: this http URL.
- 中文摘要
对比强化学习(CRL)通过学习动力学的结构化表示,近年来在多种目标条件机器人任务中取得了成功。然而,尽管CRL在运动控制(locomotion)和较简单的控制领域表现出色,它在交互密集的操作任务中却常常遇到困难。我们认为,这一困难的一个关键来源是以物体为中心的交互(如接触或抓取),它会引起底层动力学模式的明显变化。在本工作中,我们将操作动力学表述为分段光滑的马尔可夫过程,并表明交互引起的模式变化会产生分段非线性的可达性结构,而标准CRL能量函数难以对其进行表示和规划。基于这一分析,我们提出交互加权重采样(IWR)。IWR围绕交互前、交互中和交互后的阶段进行交互感知的重采样,促使学到的表示保留决定未来可达性的模式边界,从而刻画多模态和分段非线性的可达性。在以交互为中心的多种环境中,包括二维动态控制、机器人操作和机器人桌上冰球(air hockey),IWR相比以往的CRL方法同时提升了样本效率和整体性能,在仿真中平均提升19.8%。最后,借助仿真到现实(sim-to-real)流程和由IWR训练的策略,我们展示了首个能够击中目标的真实世界目标条件机器人桌上冰球智能体,将成功率从25%提升到60%。项目页面:此 http URL。
HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation
HERO:基于环境观测的事后增强反思,用于智能体自蒸馏
- Authors: Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11559
- Pdf link: https://arxiv.org/pdf/2606.11559
- Abstract
Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it difficult to determine credit assignment for each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.
- 中文摘要
强化学习通常依靠轨迹的最终结果来提升多轮智能体的能力,这使得为每个中间轮次确定信用分配变得困难。最近的同策略(on-policy)自蒸馏方法提供了一种有前景的替代方案:通过自教师将特权反馈转化为稠密的词元级监督。本研究的动机来自这样一个观察:将该范式直接推广到多轮场景时会出现意外的性能下降,我们将其归因于特权反馈(如成功轨迹或最终结果)与学生当前决策上下文之间缺乏对齐。我们提出HERO,一种事后增强的自蒸馏框架,它使用下一步的环境观测作为局部对齐的反馈。每次展开(rollout)结束后,HERO对已完成的交互进行反思,将每个观测转化为简明的轮次级诊断,以捕捉关于原始动作的可操作反馈,例如其必要性、有效性或失败原因。在TauBench和WebShop上,与仅使用环境反馈的自蒸馏和GRPO相比,HERO提高了任务成功率并减少了不必要的轮次。在训练轮次预算有限的情况下,HERO尤其有效,因为此时成功的展开很少,GRPO只能提供较弱的奖励对比信号。
Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
架构感知强化学习使滑动窗口注意力在数学推理中更具竞争力
- Authors: Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11634
- Pdf link: https://arxiv.org/pdf/2606.11634
- Abstract
The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
- 中文摘要
推理型和智能体型大型语言模型(LLM)的快速发展增加了对长上下文推理的需求,但自注意力(SA)的计算量随上下文长度呈二次方增长。为此,我们研究了SWARR(Sliding-Window Attention with Reinforced Adaptation for Math Reasoning,面向数学推理的滑动窗口注意力强化适应),这是将SWA模型适配到数学推理的实用方案。SWARR有两个阶段:(1)通过监督微调(SFT)将预训练SA模型高效转换为SWA,避免预训练新的基础模型;(2)利用强化学习(RL)进行策略适应。我们发现,SFT后SWA的表现仍逊于SA,我们假设这一差距部分源于数据与架构不匹配:大多数SFT数据是为SA模型准备的,可能包含SWA难以建模的长程依赖。由于同策略(on-policy)强化学习在SWA约束下优化自生成的轨迹,因此可以调整轨迹以更好地匹配SWA。数学推理基准上的实验表明,该方案显著缩小了SWA与SA之间的差距,恢复了SWA转换过程中损失的大部分准确率,同时保留了线性复杂度注意力的效率优势。我们的核心贡献是一项实证发现:强化学习改变了仅凭转换和SFT所得出的关于SWA在数学推理中可行性的结论。
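To make the SA versus SWA contrast in the SWARR abstract concrete, the following is a minimal sketch of a causal sliding-window attention mask. It is not taken from the paper: the function names and the window size of 3 are illustrative assumptions, and the sketch only shows why the number of attended query-key pairs grows linearly with sequence length for a fixed window instead of quadratically.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] = True when query position i may attend to key position j.

    Causal sliding-window rule (an illustrative assumption, not the paper's code):
    position i sees itself and at most the (window - 1) positions before it.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the query-key pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # A window at least as long as the sequence reduces to full causal attention.
    full = sliding_window_mask(seq_len=8, window=8)
    local = sliding_window_mask(seq_len=8, window=3)  # window size is hypothetical
    print("full causal pairs:", attended_pairs(full))
    print("sliding-window pairs:", attended_pairs(local))
```

Any dependency that spans more than `window` tokens is invisible under such a mask within a single layer, which is the kind of long-range structure the abstract suggests SFT data prepared for SA models may contain.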
IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents
IAPO:用于小型多模态代理工具的输入归因感知策略优化
- Authors: Yifan Yang, Zhen Zhang, Jiayi Tian, Liyan Tan, Zheng Zhang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11652
- Pdf link: https://arxiv.org/pdf/2606.11652
- Abstract
This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model's attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.
- 中文摘要
本文探讨了增强学习(RL)方法用于提升多模态小语言模型(SLM)代理中的工具调用能力。虽然现有研究探索了多种奖励设计以提升代理工具调用能力,但这些方法在SLM训练中面临固有局限,尤其是在多模态场景下。首先,许多现有方法通过与某些真实或预定义格式的精确匹配来评估工具使用的正确性。然而,这一假设通常不适用于多模态任务,因为多模态任务中多条工具使用路径可能有效,且通常无法获得带注释的工具轨迹。其次,这种稀疏且脆弱的二元奖励几乎无法指导如何改进底层决策过程,使多模态SLM难以从中学习。为解决这些问题,我们提出了输入归因感知策略优化(IAPO)算法,这是一种强化学习算法,通过使模型在输入组件间的归因与更强教师的归因保持一致,提升工具在多模态SLM中的使用。Qwen2.5-VL-3B的实验显示,所提方法在六个测试集中平均提高了3%的视觉问题回答准确率,相比现有视觉工具使用工作,帮助模型关注最相关的输入证据。
Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning
Lung-R1:一种知识图谱引导的肺部诊断推理大语言模型
- Authors: Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11675
- Pdf link: https://arxiv.org/pdf/2606.11675
- Abstract
Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.
- 中文摘要
诊断肺部疾病需要在表型变异和跨疾病重叠的情况下整合异质证据。尽管大型语言模型(LLM)在肺部知识问答(QA)和信息处理任务方面取得了进展,但可靠的肺部诊断需要针对电子病历(EMR)证据进行患者特定、关系感知的推理,而非孤立的知识回忆。我们将肺部知识与病例级诊断推理之间的这一差距定义为“肺部知识到诊断差距”(Pulmonary Knowledge-to-Diagnosis Gap)。为此,我们引入了LungKG,这是首个用于诊断知识组织和基于病历推理的结构化肺部知识图谱。LungKG包含59,038个节点和164,308条边,涵盖15种实体类型和112种关系类型,既是可复用的肺部知识资源,也是LungKG引导的模型适配的基础。基于LungKG,我们提出了Lung-R1,这是一个LungKG引导的肺部大语言模型,通过知识图谱约束的推理链构建和知识图谱引导的强化学习进行训练。在涵盖20个系统的评估中,Lung-R1-14B在Choice、Pulmonary-QA和EMR Diagnosis三项上均达到最先进的性能,EMR Diagnosis得分为4.3583,比最强的非Lung-R1基线高出0.1476分。这些结果证明了LungKG引导的训练对基于EMR的肺部诊断的价值。
Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents
组织然后检索:高效代理的层级内存导航
- Authors: Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11680
- Pdf link: https://arxiv.org/pdf/2606.11680
- Abstract
Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
- 中文摘要
大型语言模型(LLM)代理因其固有的无状态性,难以处理长视野任务,要求所有与任务相关的信息都编码在不断增长的输入上下文中。由此导致的推理质量下降、推理成本增加和更高的延迟,促使高效的工作记忆机制成为关键。然而,现有方法要么依赖有损压缩,要么依赖基于相似度的检索,这些方法往往无法捕捉多步代理任务所需的时间结构和因果依赖关系。在本研究中,我们介绍了HORMA,一种分层组织与检索记忆代理,它将经验组织成类似文件系统的层级结构,将总结后的实体与相应的原始轨迹关联,实现高效访问而不丢失详细信息。HORMA将工作记忆分解为两个阶段:结构化记忆构建和基于导航的检索。构建模块通过区分因信息缺失引起的失败和误导性或过载上下文导致的失败,迭代细化体验结构化。导航模块通过使用经过强化学习训练的轻量级代理遍历层级结构,选择最小但充足的上下文,从而降低关键执行路径上的延迟,从而获取任务相关的上下文。在ALFWorld、LoCoMo和LongMemEval中,HORMA在受限上下文预算下提升任务性能,同时在长对话任务中最多要求基准令牌使用率的22.17%。与现有方法相比,它始终能实现更好的效率与性能权衡,并有效推广到未被发现的任务。
RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation
RLCSD:基于对比式同策略自蒸馏的强化学习
- Authors: Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.11709
- Pdf link: https://arxiv.org/pdf/2606.11709
- Abstract
On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology \emph{privilege-induced style drift}, which destabilizes training or causes response length to shrink. To address this, we propose \textbf{RLCSD} (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
- 中文摘要
同策略自蒸馏(OPSD)通过将模型自身的分布与其在特权上下文(通常是经过验证的解)下产生的分布对齐,为推理模型提供密集的词元级监督。然而,我们发现,从这一分布差距中提取的学习信号集中于风格词元而非承载任务信息的词元,因为获得提示的模型往往产生更直接、更短的输出。我们将这种病态现象称为\emph{特权诱导风格漂移},它会使训练不稳定或导致回答长度缩短。为此,我们提出了 \textbf{RLCSD}(基于对比式同策略自蒸馏的强化学习),它将正确提示下的师生差距与错误提示下的师生差距进行对比,从而抑制无论提示正确与否都会因提示条件而产生的风格偏移,并得到更集中于承载任务信息词元的信号。在Qwen3(1.7B/4B/8B)和Olmo-3-7B-Think上的数学和逻辑推理实验表明,RLCSD始终优于GRPO和已有的OPSD方法。我们进一步表明该对比原则具有普遍性:它可以接入现有的OPSD方法并加以改进,其核心洞见还可推广到更广泛的跨模型同策略蒸馏设定。
UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA
UniReason-Med:医学VQA中用于2D到3D迁移的共享有依据推理接口
- Authors: Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.11740
- Pdf link: https://arxiv.org/pdf/2606.11740
- Abstract
We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at this https URL.
- 中文摘要
我们研究了当两类输入通过共同的推理接口对齐时,来自大量二维医学图像的有依据(grounded)推理监督能否改善三维医学VQA。我们提出了UniReason-Med,这是一个单检查点框架,在推理时可处理二维图像或切片序列化的三维体数据,通过共享的边界框语法、区域词元注入和共同的有依据推理策略,生成交错的文本推理和局部化视觉证据。为了训练该接口,我们构建了UniMed-CoT,这是一个包含22万条样本的指令微调数据集,带有交错的文本推理和有依据的视觉证据,其中包括17万个二维样本和5万个三维样本。通过监督微调以及随后的结果级强化学习,UniReason-Med学会生成有依据的推理轨迹,且在强化学习阶段无需基于IoU/Dice的定位奖励。数据混合和组件消融实验表明,与仅用三维数据训练相比,联合2D+3D有依据监督显著提升了三维推理能力,而视觉定位(grounding)和区域词元注入对二维和三维任务均有稳定增益。这些结果表明,共享的有依据推理接口可以将推理结构从二维图像迁移到切片序列化的三维医学影像理解中。代码和数据已通过此 https 网址公开。
TacCoRL: Integrating Tactile Feedback into VLA via Simulation
TacCoRL:通过仿真将触觉反馈整合进VLA
- Authors: Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang
- Subjects: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11743
- Pdf link: https://arxiv.org/pdf/2606.11743
- Abstract
Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at this https URL.
- 中文摘要
视觉-语言-动作(VLA)模型为机器人操作提供了强有力的视觉、语言和动作先验,但仅靠视觉观测往往无法获知接触密集型任务所需的局部接触状态。我们提出了TacCoRL,这是一个可扩展的框架,将触觉反馈注入VLA策略,并通过仿真与真实数据协同训练(sim-real co-training)和基于仿真的强化学习(RL)加以改进,无需大规模触觉预训练或大量真实世界的接触探索。其关键思想不仅是把触觉作为输入加入,还要学习在接近失败的状态下接触读数应如何调节动作响应,这类状态在演示中很少见,在硬件上采集又有风险。我们使用与真实环境对齐的模拟器作为接触交互的闭环训练环境。首先用混合的仿真和真实轨迹,在预训练策略中热启动以触觉为条件的动作。随后,带可验证任务奖励的强化学习利用仿真接触rollout来优化策略。它强化那些能够促成任务完成的触觉条件动作,同时在真实轨迹上施加监督目标,使精调后的策略锚定于部署时的视觉、触觉和动作分布。所得策略可直接迁移到真实机器人,无需特权仿真状态或在线真实世界强化学习。在四个双臂接触密集型任务中,最终的视觉-触觉策略平均成功率为72.5%,而基线为50.0%。结果视频及更多详情可在此 https 网址查看。
SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning
SVoT:通过强化学习实现空间推理的状态感知思维可视化
- Authors: Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11770
- Pdf link: https://arxiv.org/pdf/2606.11770
- Abstract
Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.
- 中文摘要
空间推理对多模大型语言模型(MLLM)来说仍是挑战,因为它需要对中间状态和状态转换进行可靠的多跳推断。当前研究常常未验证中间状态,并将状态转变视为隐性过程,这限制了多跳空间推理的可靠性。为此,我们提出了状态感知思维可视化(SVoT)的强化学习框架,能够生成交错、可验证的中间状态和可视化。SVoT将过渡推理链整合进生成过程,使模型能够通过交错的文本和视觉推理验证动作前置条件和效果。我们通过群体相对策略优化(GRPO)训练SVoT,通过奖励设计实现验证,并评估不同细粒度奖励的有效性。随着现有基准测试将状态转换简化为单变量更新,问题大幅简化,我们通过扩展经典环境并引入两个新颖领域——Pacman和Gather,建立了五个领域,这些领域需要多对象交互和数值推理。这些领域支持系统性评估多跳空间推理,通过对生成的中间状态的定量验证和转移推理。带有过渡感知监督的SVoT在引入的领域内实现了最先进的性能,在分布外测试集上绝对准确率提升高达65%。
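The SVoT abstract trains with Group Relative Policy Optimization (GRPO). As a rough sketch of the group-relative idea only, the snippet below normalizes each sampled response's reward against the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration; the paper's fine-grained reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation and this eps value is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifier rewards for four sampled responses to one prompt.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every response earns the same reward yields zero advantages.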
Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning
空间采样值衰减:非平稳深度强化学习的遗忘机制
- Authors: Felix Störck, Fabian Hinder, Barbara Hammer
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11797
- Pdf link: https://arxiv.org/pdf/2606.11797
- Abstract
Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters (``drift'') of the environment even if no information about change is provided (uncertainty) -- a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) deals with adapting state-of-the-art RL methods to deal with changing environments: these however usually require (partially) perfect information about the drift such as ``task IDs'' or ``context''. To mitigate the effects of drift, this work develops \emph{Space-sampled Value Decay} as an explicit forgetting mechanism for value-based deep RL architectures as a simple yet effective approach. In particular we demonstrate and discuss positive effects but also limitations in achieved returns for modifications of Deep Q-networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.
- 中文摘要
对小鼠等啮齿动物的研究表明,即使没有提供任何关于变化的信息(不确定性),它们也能在环境参数发生变化(“漂移”)时调整自身行为,这种行为可以用遗忘机制来建模。非平稳强化学习(NSRL)研究如何使最先进的强化学习方法适应不断变化的环境;然而,这些方法通常需要关于漂移的(部分)完美信息,如“任务ID”或“上下文”。为减轻漂移的影响,本研究提出了\emph{空间采样值衰减(Space-sampled Value Decay)},作为基于值的深度强化学习架构的一种显式遗忘机制,这是一种简单而有效的方法。具体而言,我们针对深度Q网络(DQN)和软演员-评论家(SAC)的改进版本,在非平稳环境中进行评估,展示并讨论了其在所得回报上的积极效果及局限性。
RePAIR: Predictive Self-Supervised Representation Learning in Chess
RePAIR:国际象棋中的预测性自我监督表征学习
- Authors: Christoph Koller, Johannes Fürnkranz, Timo Bertram
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11860
- Pdf link: https://arxiv.org/pdf/2606.11860
- Abstract
In this paper, we introduce Representation Prediction via Autoencoding using Iterative Refinement (RePAIR) - a novel self-supervised representation learning architecture that synthesizes Masked Autoencoders (MAE), Joint Embedding Predictive Architectures (JEPA), and Bidirectional Encoder Representations from Transformers (BERT). We demonstrate how it can be used to encode objects in sequential data like consecutive chess positions into compact yet meaningful representations. The basic principle of the architecture is to mask large portions of a sequence of latent states, similar to BERT and MAE. Then, we apply a lightweight Predictor to the latent representations that repairs gaps in the sequence in a lower-dimensional embedding space akin to JEPA. Our experiments in the domain of chess show that the Encoder refines the board representations such that meaningful chess concepts emerge clustered in the latent space. Furthermore, reconstructions of the masked board states show that the model is able to reason about the piece movements without relying on costly reinforcement learning methods. Lastly, we find that the resulting representation space allows for quick and intuitive dissections of chess games by observing the game path trajectories in this semantically rich space.
- 中文摘要
本文介绍了利用迭代细化的自编码表示预测(RePAIR)——一种新型自监督表征学习架构,综合了掩盖自编码器(MAE)、联合嵌入预测架构(JEPA)和来自变换器的双向编码器表示(BERT)。我们展示了如何用它将连续国际象棋局面等连续数据中的对象编码成紧凑但有意义的表示。该架构的基本原理是掩蔽大量潜态序列,类似于BERT和MAE。然后,我们对潜在表示应用轻量级预测器,修复类似JEPA的低维嵌入空间中的序列空隙。我们在国际象棋领域的实验表明,编码器能够精炼棋盘表示,使有意义的国际象棋概念聚集在潜在空间中。此外,对掩码棋盘状态的重建表明,模型能够在不依赖昂贵的强化学习方法的情况下推理棋子的移动。最后,我们发现由此产生的表示空间通过观察这一语义丰富的空间中的棋局轨迹,能够快速直观地剖析国际象棋棋局。
Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training
利用路由前瞻实现强化学习后训练中的微步级 MoE 负载均衡
- Authors: Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao, Tong Zhao, Xupeng Miao, Jie Jiang, Fangcheng Fu, Bin Cui
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2606.11867
- Pdf link: https://arxiv.org/pdf/2606.11867
- Abstract
Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.
- 中文摘要
专家混合(MoE)和强化学习(RL)后训练如今已主导大型语言模型(LLM)的开发,但专家负载不均衡仍是一个关键挑战。现有的负载均衡系统面向预训练,依赖历史的步级统计数据。然而,这些方法在强化学习后训练特有的工作负载动态下会失效:步级负载是稳定的,但微步中处理的极小批次会导致严重的高频负载波动。我们提出了ForeMoE,一种面向MoE强化学习后训练的微步级负载均衡系统。ForeMoE不依赖历史统计数据,而是利用多阶段的强化学习流水线(rollout、重计算、策略更新),使用rollout阶段可预知的路由信息,主动指导其余阶段的负载均衡。为了支持频繁的逐微步重配置,ForeMoE采用分层规划器,将NP难的负载均衡问题分解为可处理的子问题,并配备传输引擎,利用互补的硬件路径(CPU辅助和GPU直连)实现重叠式的专家传输。在64块GPU上的评估表明,ForeMoE相比最先进的强化学习后训练系统最高可实现1.45$\times$的加速。
Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation
评论家架构至关重要:人形机器人移动操作中的双评论家与统一评论家对比
- Authors: Mehmet Turan Yardımcı
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11891
- Pdf link: https://arxiv.org/pdf/2606.11891
- Abstract
Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.
- 中文摘要
面向人形机器人的多目标强化学习必须在单一策略内协调移动与操作。一个自然的设计选择是:使用单一(统一)评论家来估计所有目标的综合价值,还是使用奖励信号互不相交的独立(双)评论家。我们在NVIDIA Isaac Lab中对Unitree G1人形机器人(23个主动自由度)进行了对照比较,通过包含13个等级的顺序课程训练移动操作策略,课程从静止伸手够取一直到面向可变朝向目标的行走。在标准化评估中,与统一评论家策略相比,双评论家策略到达目标的速度快3.5$\times$(6.5对22.6个仿真步),吞吐量高2$\times$(每1,000步14.3对7.0次经验证的到达),且经验证的到达率更高(65.2%对53.8%)。值得注意的是,额外的反钻空子(anti-gaming)奖励机制并未在架构改变本身之外带来进一步提升(60.9%对65.2%)。这些结果对新兴的“用强化学习微调模仿学习策略”范式具有直接意义:在用强化学习精调预训练的操作策略时,统一评论家可能因相互竞争的移动梯度而抑制已学到的行为。这些发现表明,评论家架构是多目标人形机器人强化学习中一个首要且常被忽视的设计选择,其对到达效率的影响大于奖励工程。
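The rounded multipliers in the dual-critic abstract can be recomputed from the raw figures it quotes. The short check below uses only numbers stated in the abstract; the variable names are illustrative. The unrounded ratios come out at roughly 3.48 and 2.04, consistent with the reported 3.5$\times$ and 2$\times$.

```python
# Figures quoted in the abstract (dual-critic vs. unified-critic policy).
steps_to_target = {"dual": 6.5, "unified": 22.6}   # simulation steps to reach a target
reaches_per_1k = {"dual": 14.3, "unified": 7.0}    # validated reaches per 1,000 steps
reach_rate_pct = {"dual": 65.2, "unified": 53.8}   # validated reach rate, in percent

# Fewer steps is better, so the speedup divides unified by dual.
speedup = steps_to_target["unified"] / steps_to_target["dual"]
# More reaches is better, so the throughput gain divides dual by unified.
throughput_gain = reaches_per_1k["dual"] / reaches_per_1k["unified"]
# The reach-rate difference is expressed in percentage points.
reach_rate_gap = reach_rate_pct["dual"] - reach_rate_pct["unified"]

print(f"speedup: {speedup:.2f}x")
print(f"throughput gain: {throughput_gain:.2f}x")
print(f"reach-rate gap: {reach_rate_gap:.1f} percentage points")
```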
The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
审讯的艺术:一致性增强空间推理中的事实性
- Authors: Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11918
- Pdf link: https://arxiv.org/pdf/2606.11918
- Abstract
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
- 中文摘要
当前的大型推理模型(LRM)展现出卓越的通用能力,但在空间推理任务中表现明显不足。现有方法将这一差距视为知识缺失,依赖监督微调(SFT)从外部视觉来源或合成引擎引入带标注的空间数据。相反,我们认为对于许多任务,空间推理能力已存在于预训练的LRM中,但需要在二维和三维几何约束下通过逻辑一致性进行对齐。在本研究中,我们提出了一种自监督强化学习(RL)框架,它针对内部推理过程,且不需要真实标注。通过形式化一致性验证器(即检查变换下几何和语义一致性的奖励函数)的概念,我们证明模型可以提升其空间推理能力。我们既使用图像变换(如翻转),也使用文本变换(如交换问题中物体的顺序),并提出了一种新的基于最优传输的强化学习策略OT-GRPO,它是组相对策略优化的一种最小匹配变体,专为成对验证器设计。我们证明,这种无标签的一致性训练可以接近使用真实标注监督训练的模型的准确率,并在不同任务和数据领域上取得相近的泛化能力。
PAWS: Preference Learning with Advantage-Weighted Segments
PAWS:带有优势加权片段的偏好学习
- Authors: Aleksandar Taranovic, Onur Celik, Niklas Freymuth, Ge Li, Serge Thilges, Huy Le, Tai Hoang, Rania Rayyes, Gerhard Neumann
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.11982
- Pdf link: https://arxiv.org/pdf/2606.11982
- Abstract
Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose PAWS, a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.
- 中文摘要
基于偏好的强化学习(PbRL)从人类对轨迹的比较中学习策略,避免了显式奖励设计和专家演示。现有方法通常在轨迹级或片段级偏好上训练效用函数,而在策略优化时却依赖逐步的效用估计。这种训练与推断之间的不匹配会引起分布偏移,严重损害时间信用分配并限制策略学习。我们分析了这一问题,并提出了PAWS,这是一种基于片段的偏好学习方法,直接使用片段级优势函数进行策略更新。通过使效用训练与策略优化保持一致,PAWS保留了轨迹级偏好信息,并避免了不可靠的逐步学习信号。在模拟机器人操作和运动任务上的实验表明,PAWS始终优于现有的PbRL方法,凸显了分布一致的偏好学习的重要性。
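For readers unfamiliar with how utility functions are usually fit to segment-level preferences, the sketch below shows a Bradley-Terry style preference model over summed per-step utilities. This is a common PbRL formulation and is assumed here purely for illustration; the abstract does not state that PAWS uses this exact model, and all function names and numbers are hypothetical.

```python
import math


def preference_probability(seg_a: list[float], seg_b: list[float]) -> float:
    """Probability that segment A is preferred over segment B.

    Bradley-Terry model on summed per-step utilities (an assumed, commonly used
    formulation): sigmoid(sum(A) - sum(B)).
    """
    diff = sum(seg_a) - sum(seg_b)
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(seg_a: list[float], seg_b: list[float], a_preferred: bool) -> float:
    """Negative log-likelihood of one human comparison under the model above."""
    p = preference_probability(seg_a, seg_b)
    return -math.log(p if a_preferred else 1.0 - p)


if __name__ == "__main__":
    # Hypothetical per-step utility estimates for two three-step segments.
    seg_a = [0.5, 0.2, 0.9]
    seg_b = [0.1, 0.1, 0.3]
    print(preference_probability(seg_a, seg_b))
    print(preference_loss(seg_a, seg_b, a_preferred=True))
```

The comparison constrains only the segment-level sums, so many different per-step utility assignments fit the same label equally well. That is one way to read the abstract's concern about relying on per-step estimates during policy optimization.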
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
泛化黑客:模型可以通过防止行为泛化来博弈强化学习
- Authors: Frank Xiao, Mary Phuong
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12016
- Pdf link: https://arxiv.org/pdf/2606.12016
- Abstract
Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.
- 中文摘要
模型后训练,特别是强化学习(RL),是开发者塑造模型价值观和行为的主要机制之一。然而,随着模型对评估和训练的感知日益增强,当其所感知的训练目标与自身当前价值观冲突时,它们可能有动机抵制训练,从而削弱开发者发现未对齐问题并通过进一步训练纠正模型行为的能力。本文展示了泛化黑客(generalization hacking),即模型在强化学习期间获取奖励,同时阻止被奖励的行为泛化。我们在Qwen3-235B-A22B上构建了一个模型生物体(model organism),在描述训练感知和自我接种(self-inoculation)的合成文档上进行微调;自我接种是一种新机制,即模型在其思维链中将服从表述为仅适用于特定情境。这些文档既不演示也不指示上述任何一种行为。该模型生物体在训练时的有害性与对照组相当,同时在700步强化学习中始终保持约15个百分点的服从差距。此外,仅在训练感知文档上训练的对照生物体在强化学习压力下独立发现了类似接种的推理,尽管从未接触过这一概念,却形成了自己的服从差距。由于泛化黑客生物体自始至终获得高奖励,标准训练指标无法提供泛化已失败的任何信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗强化学习对行为的修正,这表明随着模型能力和训练感知的增强,它们可能有能力破坏训练过程本身。
KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility
KinematicRL:一种兼顾运动动力学可行性的社交导航仿真到现实强化学习框架
- Authors: Zhiming Xu, Haodong Yang, Chengju Liu, Qijun Chen, Chenpeng Yao
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.12042
- Pdf link: https://arxiv.org/pdf/2606.12042
- Abstract
Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.
- 中文摘要
深度强化学习(DRL)在社交导航方面展现出潜力,但其现实应用仍受制于简化的一阶动态和上下文特定人类状态估计流水线带来的持续模拟与现实差距。本研究提出了一个统一框架,解决了这些限制,从而生成适合实际部署的动态可行导航策略。首先,理论分析显示,随着控制阶数的增加,模拟与实际机器人位置之间的跟踪误差呈指数衰减,促使使用高阶控制输入作为日程学习(DRL)动作空间。开发了针对差动驱动机器人的二阶控制表述,辅以随机迭代线性二次调节器(iLQR),通过散度最小化目标预训练策略。其次,为避免相机与激光雷达融合带来的系统复杂性增加,引入了仅使用二维激光雷达的集群式人类跟踪流水线。人类检测根据空间接近度和速度相似度进行关联,从而实现附近行人的可靠区分,并通过时间聚合获得稳定的速度估计。第三,我们引入了无偏的残差门禁块,以平衡基于反应和记忆的行为,同时处理时间变化的人群规模,这两者对社交导航至关重要。由此产生的政策——KinematicRL,持续提升运动学性能,并适应不同数量的检测到的人类。在现实环境中的实验表明,结合拟议的跟踪流水线,KinematicRL可以以最小改动部署在真实差动驱动机器人上。
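The KinematicRL abstract motivates second-order control inputs as the action space for a differential drive robot. A minimal sketch of that idea follows, assuming a plain unicycle model with Euler integration; the `State` class, the `step` function, the time step, and the velocity limits are illustrative assumptions, not the paper's formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float = 0.0
    y: float = 0.0
    theta: float = 0.0   # heading (rad)
    v: float = 0.0       # linear velocity (m/s)
    omega: float = 0.0   # angular velocity (rad/s)


def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def step(state, lin_acc, ang_acc, dt=0.1, v_max=1.0, omega_max=2.0):
    """Advance one step. The action is an acceleration pair, so the
    velocities change continuously instead of jumping between steps.
    dt, v_max, and omega_max are hypothetical values."""
    v = clamp(state.v + lin_acc * dt, v_max)
    omega = clamp(state.omega + ang_acc * dt, omega_max)
    theta = state.theta + omega * dt
    x = state.x + v * math.cos(theta) * dt
    y = state.y + v * math.sin(theta) * dt
    return State(x, y, theta, v, omega)
```

With a first-order action space the policy would output `v` and `omega` directly, which a real robot cannot track instantly. Here the commanded velocity changes by at most `lin_acc * dt` per step.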
World Model Self-Distillation: Training World Models to Solve General Tasks
世界模型自蒸馏:训练世界模型以解决通用任务
- Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.12072
- Pdf link: https://arxiv.org/pdf/2606.12072
- Abstract
Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.
- 中文摘要
预训练视频生成器是具有涌现任务解决能力的有前景的视觉世界模型;然而,它们对详细文本描述的依赖限制了其在规划和决策中的直接应用。现有方法要么将推理外包给语言模型或视觉-语言模型,要么依赖通过配对任务执行视频进行监督微调,这些视频收集成本高且难以扩展。我们提出了一个可扩展的框架,通过结合自我蒸馏与强化学习,激发此类模型中的任务解决能力。给定一个未标记的场景图像,视觉语言模型生成候选任务和详细的逐步解决方案。该解条件是一个预训练的视频扩散模型,即演示器;我们将其行为提炼为仅基于图像和简短任务提示的执行者。这将执行知识从字幕引导生成转移到指令条件任务解决,无需经过精心策划的任务视频监督。我们进一步通过增强学习从VLM反馈中改进执行者,利用判断采样视频是否满足任务与生成解之间的不对称性。基于我们提出的WorldTasks-Benchmark和DreamGen机器人基准测试的实验显示,Executor在基于VLM的评估协议下超越了Demonstrator,并在机器人任务中具竞争力地转移。
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3:多模态上下文推理的Agentify基础模型
- Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.12195
- Pdf link: https://arxiv.org/pdf/2606.12195
- Abstract
Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
- 中文摘要
基础模型的最新进展已转向涉及多步骤推理和工具使用的代理行为。然而,开源工作大多聚焦于以文本为主导的环境,长期多模态任务却未被充分探索。这一差距在需要持续时间理解和迭代交互的视频任务中表现得尤为明显。我们介绍InternVideo3,这是一个通过多模态情境推理(MCR)增强这些能力的框架。MCR将理解视为一个闭环过程,基于包含观察、指令、推理、工具动作和记忆的共享且不断演变的上下文。这将长视频理解视为证据积累和验证。为确保效率,我们引入了多模态多头潜在注意力(M^2LA),这是一种保持令牌的重参数化,压缩KV缓存状态同时保留完整令牌流。我们的分阶段培训包括持续的预培训、短到长的监督微调、基于规则的强化学习以及策略提炼。实验显示,InternVideo3在Video-MME、MLVU和EgoSchema等基准测试中表现出色。我们进一步将模型实例化为带有检索工具的视频代理,展示了稳健的证据基础行为。我们的结果表明,高效的上下文处理和闭环推理对于将开放多模态模型适应到具有长期视觉基础的能动性至关重要。
DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems
DrivingAgent:自动驾驶系统设计与调度代理
- Authors: Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.12236
- Pdf link: https://arxiv.org/pdf/2606.12236
- Abstract
Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.
- 中文摘要
许多自动驾驶系统越来越多地采用基础模型,以提升泛化能力并处理长尾场景。然而,这一趋势带来了两个关键挑战:(i)设计和集成新模型的手工和劳动密集型过程,以及(ii)缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大型语言模型(LLM)的代理为自动化提供了有前景的途径,但现有框架并不适合自动驾驶。具体来说,它们未能区分系统设计和实时调度的根本不同需求,将模块视为不透明的黑箱,且不适合连续运行。为解决这些局限性,我们提出了DrivingAgent,一种针对自动驾驶系统设计和调度双重挑战量身定制的新型代理框架。在设计阶段,DrivingAgent通过解释系统架构、生成代码和通过超级网络训练验证模块,实现模块开发自动化。在调度阶段,它采用经过强化学习训练的轻量级大型语言模型,实时动态编排系统模块,并由结构化内存支持,将长期存储与带时间戳的短期上下文整合在一起。实验结果表明,DrivingAgent在nuScenes和Bench2Drive基准测试中都实现了更优的速度——准确性权衡。
Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization
强化学习打破基于梯度的对抗性优化
- Authors: Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2606.12251
- Pdf link: https://arxiv.org/pdf/2606.12251
- Abstract
Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.
- 中文摘要
基于梯度的对抗攻击仍然是深度神经网络(DNN)的主要威胁,因为它们利用梯度信息高效优化对抗扰动。为此,我们研究强化学习(RL)训练是否能通过用策略梯度目标和ε贪婪探索训练图像分类器,破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上采用多种架构的系统实验,我们发现强化学习训练的分类器显著破坏了基于梯度的对抗性优化。为解释这一点,我们通过损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析显示,强化学习作为隐式正则化器,产生具有高度不稳定梯度方向和更小梯度幅值的模型。这种组合使得每个PGD步骤在方向上都不可靠且幅度有限,导致基于梯度的攻击在实际迭代预算内失败。我们还进一步证明,强化学习与对抗训练(RL-adv)结合,提供了双层防御,作用于两个互补层面:强化学习降低攻击者可用的梯度信息(梯度级防御),而对抗训练则强化决策边界(边界级防御)。RL-adv在所有主要攻击类型中实现了最高的鲁棒性,包括基于梯度的(PGD、AutoAttack)、基于传输的攻击和基于查询的攻击,显著优于SL-adv。这些发现将强化学习诱导的梯度破坏识别为一种互补的鲁棒性机制,并激励了未来关于结合强化学习效率与强化学习梯度正则化特性的混合SL-RL训练计划的研究。
Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
超越完全随机掩蔽:注意力引导去噪与扩散语言模型优化
- Authors: Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.12273
- Pdf link: https://arxiv.org/pdf/2606.12273
- Abstract
Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.
- 中文摘要
扩散大型语言模型(dLLMs)通过并行译码为自回归模型提供了高效的替代方案,但现有的后训练方法大多依赖于忽略内在符号依赖的随机掩蔽策略。在本研究中,我们对dLLMs中的注意力进行了实证分析,并证明了对未掩蔽上下文更强烈关注的代币表现出更高的生成稳定性,并在推理中发挥关键作用。基于这些发现,我们提出了AGDO,一种以注意力为导向的去噪和优化框架,能够将训练和优化与注意力相关的依赖关系对齐。AGDO根据注意力结构确定去噪顺序,并在监督微调和强化学习中强调注意力关键的标记。数学和编码基准测试的实验表明,AGDO持续提升推理表现,优于dLLMs的先进后训练方法。
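The AGDO abstract says the denoising order is determined from attention structure, and that tokens attending more strongly to unmasked context are more stable. One way to sketch this is to rank masked positions by the attention mass they place on unmasked positions. The function name, the row-wise attention layout, and the "highest mass first" rule are assumptions for illustration; the abstract does not give the paper's exact scoring rule.

```python
def denoising_order(attn, masked):
    """Rank masked positions by attention mass on unmasked context.

    attn[i][j] is the (assumed row-normalized) attention weight from
    position i to position j; masked is a list of masked positions.
    Positions with more mass on unmasked context come first.
    """
    masked_set = set(masked)
    scores = {
        i: sum(w for j, w in enumerate(attn[i]) if j not in masked_set)
        for i in masked
    }
    # Sort by descending score, breaking ties by position index.
    return sorted(masked, key=lambda i: (-scores[i], i))


# Toy example: positions 2 and 3 are masked, 0 and 1 are visible.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],   # position 2: 0.2 mass on visible tokens
    [0.5, 0.3, 0.1, 0.1],   # position 3: 0.8 mass on visible tokens
]
order = denoising_order(attn, [2, 3])
```

In the toy example, position 3 is unmasked before position 2 because it relies more on context that is already visible.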
Mathematical perspective on genetic algorithms with optimization guided operators
关于带有优化引导算子的遗传算法的数学视角
- Authors: Anna Brandenberger, Ilan Doron-Arad, Elchanan Mossel
- Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.12279
- Pdf link: https://arxiv.org/pdf/2606.12279
- Abstract
Recent work in ML applies genetic algorithms at inference time to iteratively improve solutions to optimization problems. The basic mutation and recombination operators involved are qualitatively different from those studied classically. Mutations are no longer random; an ML algorithm mutates a solution with the goal of improving an objective. Similarly, recombination is not based on random collages of parent solutions. Instead, it is an ML optimization-based operator whose goal is to synthesize improved solutions from its inputs. Thus, these mutation and recombination operators are more likely to improve the objective, but their computational cost is much higher. We introduce a general model of genetic algorithms and formulate optimization in this model as a query-complexity problem, using the language of reinforcement learning. We then study specialized models. We show that some optimization problems require generation, mutation, and recombination to be solved. We then obtain qualitatively tight algorithms for a family of problems within this framework that captures the nontrivial role of diversity in the solution pool, a key feature of practical ML genetic algorithms.
- 中文摘要
机器学习的最新研究在推断时应用遗传算法,迭代改进优化问题的解决方案。涉及的基本突变和重组算符在质上与经典研究的不同。突变不再是随机的;机器学习算法通过变异解来改进某个目标。同样,重组也不是基于父解的随机拼贴。相反,它是一种基于机器学习优化的算子,其目标是从输入中综合出改进的解。因此,这些突变和重组算符更有可能改善目标,但计算成本却高得多。我们引入一个通用的遗传算法模型,并以此为查询复杂度问题的形式提出优化,使用强化学习的语言。然后我们研究专门的模型。我们表明,一些优化问题需要生成、突变和重组才能解决。随后,我们获得了该框架内一系列问题的定性紧密算法,捕捉了多样性在解池中不可平凡的作用,这是实用机器学习遗传算法的关键特征。
CCKS: Consensus-based Communication and Knowledge Sharing
CCKS:基于共识的沟通与知识共享
- Authors: Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.12281
- Pdf link: https://arxiv.org/pdf/2606.12281
- Abstract
In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at this https URL.
- 中文摘要
在合作式多智能体强化学习(MARL)的去中心化训练与去中心化执行(DTDE)中,基于行动指导的知识共享促进了智能体之间可解释且可扩展的合作。然而,当前的行动指导方法往往过于依赖教师指导,而未评估师生兼容性,导致过度指导、稳定性不佳和绩效下降。为克服这些挑战,本文提出了基于共识的沟通与知识共享(CCKS)框架,允许代理基于共识衍生约束采纳建议,并更智能地遵循教师的指示。这一机制使代理能够平衡探索与向有经验教师学习,从而提升整体表现。关键在于共识模型构建,我们建议利用对比学习,基于智能体训练阶段的局部观察构建共识模型。在行动选择中,代理根据共识和共享知识对行动进行评分和选择。CCKS设计为即插即用解决方案,能够无缝集成现有DTDE算法。在谷歌研究橄榄球环境和复杂的《星际争霸II》多智能体挑战中进行的实验表明,与CCKS的集成相比当前DTDE基线显著提升了协作效率、学习速度和整体性能。代码可在该 https URL 访问。
Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
突破熵界限:通过MTP与拒绝采样加速强化学习训练
- Authors: Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.12370
- Pdf link: https://arxiv.org/pdf/2606.12370
- Abstract
Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
- 中文摘要
强化学习(RL)已成为现代大型语言模型的关键组成部分,但轨迹生成(rollout)阶段仍然是强化学习训练流程中的主要瓶颈。尽管多令牌预测(MTP)通过推测解码为加速轨迹生成提供了自然的解决方案,但许多研究观察到,在强化学习训练期间,MTP的接受率显著下降,导致加速效果有限。为解决这一瓶颈,我们推出了Bebop,这是一项针对大型语言模型(LLM)后训练中MTP的系统性研究,并提供了将MTP整合到大规模强化学习流程中的实用方案。首先,我们揭示MTP接受率从根本上受模型熵波动的限制,并且与强化学习阶段熵的上升呈明显的负线性关系。其次,我们证明与贪婪草稿采样相比,概率拒绝采样在很大程度上减轻了强化学习中熵带来的扰动。我们还进一步指出,传统MTP训练目标(交叉熵或KL)在此类环境下并非最优,因此我们提出了一种全新的端到端TV损失,能够直接优化多步拒绝采样的接受率,从而带来约10%的接受率提升,在数学推理、代码生成和代理任务中实现高达95%的接受率和最多25%的额外推理吞吐量提升。第三,我们在强化学习期间测试了各种在线MTP训练策略,证明在强化学习之前采用端到端TV损失和拒绝采样进行MTP训练,能够在整个强化学习过程中实现一致的接受率和加速,从而无需昂贵的在线MTP更新。我们提供了广泛的实验和分析,验证我们的发现。实验结果显示,我们的方法在Qwen3.5、Qwen3.6和Qwen3.7模型的异步强化学习训练中,端到端加速可达1.8倍。
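The Bebop abstract links the rejection-sampling acceptance rate to a "TV loss". Assuming TV stands for total variation distance (the abstract does not spell it out), the standard speculative-sampling identity is that a draft token sampled from q and accepted with probability min(1, p(x)/q(x)) is accepted with overall probability 1 - TV(p, q). The sketch below checks that identity on a toy distribution; the function names are illustrative, not from the paper.

```python
def acceptance_rate(p, q):
    """Expected single-token acceptance probability when x ~ q is
    accepted with probability min(1, p[x] / q[x]).
    Sum over x of q[x] * min(1, p[x]/q[x]) equals sum of min(p[x], q[x])."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Toy target (p) and draft (q) distributions over three tokens.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
rate = acceptance_rate(p, q)
tv = total_variation(p, q)
```

Because the acceptance rate equals 1 - TV(p, q), minimizing total variation between the draft head and the target model raises the acceptance rate directly, whereas cross-entropy or KL only bound it indirectly.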
UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
UniIntervene:高效现实世界强化学习的代理干预
- Authors: Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.12372
- Pdf link: https://arxiv.org/pdf/2606.12372
- Abstract
Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
- 中文摘要
人工参与强化学习(HiL-RL)已成为现实世界机器人操作的有效范式,能够在人工指导下进行在线政策改进。然而,当前的HiL-RL框架仍然干预密集,依赖频繁的人为修正来引导政策,避免非生产性的探索,这导致劳动力成本高昂,限制现实世界的可扩展性。为此,我们提出了UniIntervene,一种能动干预模型,能检测无效探索并自主恢复高价值状态的政策,接管大部分干预从人工操作手中完成。具体来说,UniIntervene首先进行未来条件作用值估计,预测当前动作的潜在后果并评估其诱导值,从而提供更稳定的进展信号。基于此,时间价值风险批评者汇总近期价值动态,并在估计价值持续停滞或下降时触发干预。当需要干预时,UniIntervene会从过去干预事件的记忆中提取高价值恢复目标,并通过目标条件恢复策略生成可执行的纠正措施。通过这种方式,UniIntervene将干预从被动的人为纠正转变为价值意识的恢复过程,实现高效的现实现实强化学习。在各种真实世界操作任务中的大量实验表明,UniIntervene相比最先进的HiL-RL基线,平均成功率提升8.6%,同时减少57%的人类干预。
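The UniIntervene abstract describes a temporal value-risk critic that triggers intervention when the estimated value shows sustained stagnation or degradation. A minimal sketch of such a trigger follows, assuming a fixed sliding window and a gain threshold; the class name, window size, and threshold are hypothetical and much simpler than a learned critic.

```python
from collections import deque


class StagnationTrigger:
    """Fire when the value estimate fails to improve over a window."""

    def __init__(self, window=5, min_gain=0.01):
        self.values = deque(maxlen=window)  # most recent value estimates
        self.min_gain = min_gain            # required gain across the window

    def update(self, value):
        """Record a new value estimate; return True if intervention
        should be triggered (stagnation or degradation over the window)."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history yet
        return self.values[-1] - self.values[0] < self.min_gain


trigger = StagnationTrigger(window=3)
decisions = [trigger.update(v) for v in [0.1, 0.2, 0.3, 0.3, 0.3]]
```

In this toy run the trigger stays off while the value rises and fires only once the whole window is flat.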
Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
可验证环境是乐高积木:面向推理泛化的递归组合
- Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.12373
- Pdf link: https://arxiv.org/pdf/2606.12373
- Abstract
Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
- 中文摘要
具有可验证环境的强化学习(RL)已成为增强大型语言模型(LLM)推理能力的有力方法。虽然先前研究表明扩大环境数量能提升强化学习性能,但现有的手动或逐个构建方法受限于线性扩展,从而阻碍了推理泛化能力的规模化。本文介绍了RACES(Recursive Automated Composition for Environment Scaling),该框架将可验证环境概念化为可递归组装的可组合构建块。关键见解是,当一个环境的陪域(输出类型)与另一个环境的定义域(输入类型)匹配时,它们可以自动融合成一个新的可验证环境,从而实现递归组合。RACES实现了300个独立环境,并定义了一组组合运算符(SEQUENTIAL、PARALLEL、SORT和SELECT),这些运算符可以诱导不同的推理模式。大量实验表明,在这些复合环境中进行强化学习训练,始终能增强推理泛化能力。具体来说,RACES在六个基准测试中将DeepSeek-R1-Distill-Qwen-14B平均提升3.1分(从48.2提升至51.3),并将Qwen3-14B的性能从58.8提升至61.1,这些基准在训练环境构建过程中未曾见过。此外,RACES仅用50个基础环境就实现了与在300个独立环境上训练相当的性能,展现了显著的环境利用效率。
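The RACES abstract states that two verifiable environments can be fused when the codomain of one matches the domain of the other. The sketch below illustrates a SEQUENTIAL-style composition under that rule. The `Env` class, the `sequential` function, and the two toy environments are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Env:
    name: str
    domain: type                  # input type
    codomain: type                # output type
    solve: Callable[[Any], Any]   # reference solver backing the verifier

    def verify(self, x, answer):
        """Check a candidate answer against the reference solver."""
        return answer == self.solve(x)


def sequential(first, second):
    """Fuse two environments when first's output type feeds second."""
    if first.codomain is not second.domain:
        raise TypeError(
            f"{first.name} outputs {first.codomain.__name__}, "
            f"but {second.name} expects {second.domain.__name__}"
        )
    return Env(
        name=f"{first.name}>>{second.name}",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


# Two toy base environments (assumed for illustration only).
digit_sum = Env("digit_sum", int, int, lambda n: sum(int(c) for c in str(n)))
to_binary = Env("to_binary", int, str, lambda n: format(n, "b"))

composed = sequential(digit_sum, to_binary)  # int -> int -> str
```

The composed environment is itself an `Env`, so it can be composed again, which is what makes the construction recursive. Here `composed.verify(123, "110")` holds because the digit sum of 123 is 6, which is `110` in binary.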
APPO: Agentic Procedural Policy Optimization
APPO:代理程序策略优化
- Authors: Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12384
- Pdf link: https://arxiv.org/pdf/2606.12384
- Abstract
Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.
- 中文摘要
代理强化学习(RL)的最新进展显著提升了大型语言模型代理的多轮工具使用能力。然而,大多数现有方法在粗粒度的启发式单元(如工具调用边界或固定工作流)上分配信用,这使得识别哪些中间决策影响后续结果变得困难。在本研究中,我们从两个视角研究代理强化学习:在哪里分支,以及分支后如何分配信用。我们的初步分析显示,有影响力的决策点广泛分布在生成序列中,而非集中在工具调用处,而仅凭令牌熵无法可靠反映其对最终结果的影响。基于这些观察,我们提出了代理程序策略优化(APPO),将分支和信用分配从粗粒度交互单元转移到序列中的细粒度决策点。APPO利用分支评分选择分支位置,该评分结合了令牌不确定性与策略诱导的后续续写似然增益,从而实现更有针对性的探索,同时过滤掉虚假的高熵位置。它还引入了过程级的优势缩放,以更好地在分支轨迹之间分配信用。在13个基准测试上的实验表明,APPO在保持高效工具调用和行为可解释性的同时,持续将强大的代理强化学习基线提升近4分。
ATLAS: Active Theory Learning for Automated Science
ATLAS:自动化科学的主动理论学习
- Authors: Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, Kevin J. Miller
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12386
- Pdf link: https://arxiv.org/pdf/2606.12386
- Abstract
Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses--instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs)--and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS's potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.
- 中文摘要
通过机制建模推进科学理解,需要提出正确的实验问题,以产生最大限度的信息量。为了在认知科学中实现这一自动化追求,我们引入了ATLAS(自动科学主动理论学习),这是一个用于数据驱动的可解释行为模型发现的主动学习框架。ATLAS在生成机制假设——作为多样稀疏神经网络(解缠RNNs)集合的实例化——与设计能够最优区分它们的实验之间进行迭代。我们在强盗任务中从强化学习代理的行为中恢复该方法进行了测试。ATLAS设计了一系列具有质性新颖性质的实验,时间结构根据潜在药物特性量身定制。在这些实验中训练的模型会根据一套全面的机制建模指标进行评估,这些指标捕捉了行为、结构和计算上的相似性。与随机实验相比,ATLAS在所有指标上的样本效率提升了5-10倍,其表现也得到了专家设计的文献实验验证。这些计算机模拟结果展示了ATLAS加速认知科学及其他依赖机制模型发现科学领域的人类可解读洞见的潜力。
Keyword: diffusion policy
Adversarial Attacks on Learned Policies for Surgical Robotic Tasks
对手术机器人任务学习策略的对抗性攻击
- Authors: Shutong Jin, Ziyang Chen, Preethi Satish, Paavan Gupta, Florian T. Pokorny, Ken Goldberg
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.11535
- Pdf link: https://arxiv.org/pdf/2606.11535
- Abstract
Learning-based policies are being considered to augment the dexterity of human surgeons in robot-assisted surgery. Can the end-to-end mapping from visual observations to robot actions be vulnerable to adversarial attacks, potentially leading to patient injury? In this paper, we present the first study of adversarial threats to learning-based policies in surgical robotics. We investigate two threat modes: (a) disruptive attacks, where imperceptible visual perturbations interrupt policy execution, and (b) steering attacks, where such perturbations steer policy actions toward attacker-specified directions. We formulate three adversarial attack methods, each with increasing access to policy information, and evaluate their impact on two surgical subtasks: debridement and suturing. Our evaluation covers three end-to-end policy architectures: ACT, Diffusion Policy, and Pi0. In addition, we introduce a new class of photometric adversarial attacks that mimic natural visual changes, such as lighting variations, to generate effective yet visually plausible perturbations. Results from 560 physical experiments using phantoms for debridement and suturing suggest that state-of-the-art policies can be significantly disrupted, resulting in an average 61% reduction in surgical subtask success rates. Project page: this https URL
- 中文摘要
基于学习的政策正在考虑提升人类外科医生在机器人辅助手术中的灵活性。从视觉观察到机器人动作的端到端映射是否容易受到对抗性攻击,从而可能导致患者受伤?本文首次介绍了外科机器人中基于学习的策略面临的对抗性威胁研究。我们研究两种威胁模式:(a)扰动攻击,即不可察觉的视觉扰动中断策略执行;(b)引导攻击,这些扰动引导策略行动朝攻击者指定的方向前进。我们制定了三种对抗性攻击方法,每种方法都能增加对政策信息的获取,并评估它们对两个外科子任务:清创和缝合的影响。我们的评估涵盖了三种端到端的政策架构:ACT、扩散政策和Pi0。此外,我们还引入了一类新的光度对抗攻击,模拟自然的视觉变化,如光照变化,以产生有效且视觉上合理的扰动。560次使用幻肢进行清创和缝合的物理实验结果表明,最先进的政策可以被显著颠覆,导致手术子任务成功率平均降低61%。项目页面:此 https URL
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning
通过Real2Sim2Real触觉政策学习实现盲目灵巧抓取
- Authors: Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.11767
- Pdf link: https://arxiv.org/pdf/2606.11767
- Abstract
Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.
- 中文摘要
使用灵巧手进行盲抓是一项关键的操作能力。然而,由于触觉的仿真到现实差距以及稀疏触觉信号的表达能力有限,为真实机器人学习此类仅依赖触觉的策略仍然具有挑战性。为弥合这一差距,我们提出了一种仅依赖触觉的盲抓框架,可部署在实体多指机械手上。我们的方法结合了三个关键组成部分。首先,我们引入Real2Sim触觉校准流程,构建了一个经过接触校准的数字孪生模拟器,能够重现真实触觉信号。其次,我们使用布局感知的触觉编码器提升稀疏触觉观测的表达能力,该编码器通过自监督预训练融入传感器几何先验。第三,为了提升对未见物体的泛化能力,我们在校准后的模拟器中训练针对特定物体的强化学习专家,并将它们成功的抓取轨迹聚合为以触觉为条件的扩散策略(Diffusion Policy)。我们在一台配备分布式触觉传感的实体LEAP Hand上评估了该方法,涵盖10个已见物体和10个未见物体。部署的策略在全部20个物体上实现了27%的真实世界抓取成功率,且无需真实世界抓取演示或视觉输入。仿真消融实验表明,布局感知的触觉预训练提升了抓取性能,而传感层面的评估则证实Real2Sim校准提高了仿真与硬件之间触觉接触事件的一致性。这些结果共同表明,接触事件校准、几何感知的触觉表征学习和基于扩散的策略聚合,为在真实灵巧机械手上实现仅依赖触觉的盲抓提供了一条有效路径。项目页面:此 http URL。
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics
环境扩散策略:机器人学中基于次优数据的模仿学习
- Authors: Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12365
- Pdf link: https://arxiv.org/pdf/2606.12365
- Abstract
We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
- 中文摘要
我们提出了环境扩散策略(Ambient Diffusion Policy),这是一种简单且有原则的方法,用于机器人学中基于次优数据的模仿学习。高质量、针对特定任务的机器人数据收集成本高且耗时,而包含较低质量或分布外演示的次优数据集则非常丰富。现有在机器人领域对两类数据源进行共训练的方法,往往无法区分次优样本中有意义的特征和有害的特征。相比之下,我们的方法为机器人领域的共训练引入了一个新的维度,即依赖噪声水平的数据使用,从而只提取有用的特征。环境扩散策略将训练期间次优数据的贡献限制在仅高扩散时间和低扩散时间。为了严格论证我们的方法,我们首先观察到机器人动作数据呈现谱幂律。这使最优扩散策略具有两个可供我们利用的重要性质:全局到局部的层级结构和局部性。我们用一个简化模型对上述讨论进行了理论形式化。我们的实验在六个任务上,针对四种次优动作数据(噪声轨迹、仿真到现实差距、任务不匹配和大规模数据混合)验证了环境扩散策略。结果表明,它能够有效地从任意来源的次优数据中学习。值得注意的是,当扩展到Open X-Embodiment(一个数据质量参差不齐、存在非结构化分布偏移的大型数据集)时,其表现比现有共训练基线高出多达33%。总体而言,环境扩散策略提高了次优演示的实用性,并扩展了机器人领域可用的数据源范围。
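- Illustrative sketch (not from the paper)
The abstract states that suboptimal data contributes to training only at high and low diffusion times. The following is a minimal sketch of that noise-dependent data usage, not the authors' implementation. It assumes diffusion times normalized to [0, 1]. The names `usage_mask` and `masked_loss` and the thresholds `t_low` and `t_high` are hypothetical and do not come from the paper.

```python
import numpy as np


def usage_mask(t, is_clean, t_low=0.2, t_high=0.8):
    """Return a boolean mask of samples allowed to contribute to the loss.

    Clean (high-quality) samples are always used. Suboptimal samples are
    used only when their diffusion time is at most t_low or at least t_high.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    t = np.asarray(t, dtype=float)
    is_clean = np.asarray(is_clean, dtype=bool)
    in_allowed_band = (t <= t_low) | (t >= t_high)
    return is_clean | in_allowed_band


def masked_loss(per_sample_loss, t, is_clean, t_low=0.2, t_high=0.8):
    """Average the per-sample losses over the samples selected by usage_mask."""
    per_sample_loss = np.asarray(per_sample_loss, dtype=float)
    mask = usage_mask(t, is_clean, t_low, t_high)
    if not mask.any():
        return 0.0  # no sample is eligible at these diffusion times
    return float(per_sample_loss[mask].mean())


# Three suboptimal samples at low, middle, and high diffusion times,
# followed by one clean sample at a middle diffusion time.
t = [0.1, 0.5, 0.9, 0.5]
is_clean = [False, False, False, True]
losses = [1.0, 100.0, 3.0, 5.0]

mask = usage_mask(t, is_clean)
loss = masked_loss(losses, t, is_clean)
```

In this toy batch the suboptimal sample at the middle diffusion time is excluded, so its large loss does not affect the average. How the paper actually chooses the time ranges is not stated in the abstract.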