Arxiv Papers of Today

Generated: 2026-05-19 19:23:55 (UTC+8); arXiv release time: 2026-05-19 20:00 EDT (2026-05-20 08:00 UTC+8)

80 relevant papers today

Keyword: reinforcement learning

Mirror Descent-Type Algorithms for the Variational Inequality Problem with Functional Constraints

Authors: Mohammad S. Alkousa, Fedor S. Stonyakin, Belal A. Alashqar, Seydamet S. Ablaev
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.16262
Pdf link: https://arxiv.org/pdf/2605.16262
Abstract Variational inequalities play a key role in machine learning research, such as generative adversarial networks, reinforcement learning, adversarial training, and generative models. This paper is devoted to the constrained variational inequality problems with functional constraints (inequality-type constraints). We propose some mirror descent-type algorithms that switch between productive and non-productive steps depending on the values of the functional constraints at iterations, with many different step size rules and stopping criteria. We analyze the proposed algorithms and prove their optimal convergence rate to achieve a solution with desired accuracy, for problems with bounded and monotone operators and Lipschitz convex functional constraints. In addition, we propose a modification of the proposed algorithms by considering each functional constraint in the calculation when we have a productive step, as well as the first constraint that violates the feasibility. This modification can save the running time of algorithms when we have many functional constraints. In addition, we provide an analysis of the proposed algorithms for $\delta$-monotone operators, allowing us to apply the proposed algorithms, as a special case, to constrained minimization problems when we do not have access to the exact information about the subgradient of the objective function. Numerical experiments that illustrate the work and performance of the proposed algorithms are also given.
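The switching idea in this abstract can be illustrated with a minimal Euclidean sketch (mirror descent with the squared-norm prox function, i.e. projected steps). It is not the authors' algorithm: the step size `eps / ||d||^2`, the weighted averaging over productive points, and the toy problem (operator `F(x) = x - c`, one linear constraint, a ball as the feasible set) are all assumptions chosen for illustration.

```python
import numpy as np

def switching_subgradient_vi(F, g, g_subgrad, project, x0, eps, n_iters):
    """Euclidean mirror descent that switches between productive steps
    (constraint satisfied up to eps: move along the operator F) and
    non-productive steps (constraint violated: move along a subgradient of g)."""
    x = np.asarray(x0, dtype=float)
    weighted_sum = np.zeros_like(x)
    total_weight = 0.0
    for _ in range(n_iters):
        productive = g(x) <= eps
        d = F(x) if productive else g_subgrad(x)
        sq_norm = float(d @ d)
        if sq_norm == 0.0:
            if productive:
                return x  # F(x) = 0 at an eps-feasible point
            raise ValueError("zero subgradient at an infeasible point")
        h = eps / sq_norm  # assumed step-size rule
        if productive:
            weighted_sum += h * x
            total_weight += h
        x = project(x - h * d)
    if total_weight == 0.0:
        raise RuntimeError("no productive step was taken")
    # Output: weighted average of the productive iterates only.
    return weighted_sum / total_weight

# Toy problem (hypothetical): F is the gradient of 0.5 * ||x - c||^2,
# the constraint is x0 + x1 <= 1, and the simple set Q is a ball of radius 2.
c = np.array([1.0, 1.0])
F = lambda x: x - c
g = lambda x: x[0] + x[1] - 1.0
g_subgrad = lambda x: np.array([1.0, 1.0])

def project(x, radius=2.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

eps = 0.02
x_hat = switching_subgradient_vi(F, g, g_subgrad, project, np.zeros(2), eps, 5000)
print(x_hat)  # expected to lie close to (0.5, 0.5), the projection of c onto the constraint
```

Because only productive iterates enter the average and `g` is convex, the returned point satisfies the constraint up to `eps`.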

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Authors: Fei Ding, Yongkang Zhang, Yeling Peng, Youwei Wang, Guoxiong Zhou, Zijian Zeng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.16302
Pdf link: https://arxiv.org/pdf/2605.16302
Abstract Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, leading to poor credit assignment conditions where the final feedback is evenly propagated across all intermediate decisions. This results in high gradient variance, unstable training, and numerous ineffective updates, ultimately causing the model to fail and preventing sustained improvement. We introduce a counterfactual comparison-based credit assignment framework, which samples multiple reasoning trajectories under the same input. By treating their differences as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Authors: Arahan Kujur
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16312
Pdf link: https://arxiv.org/pdf/2605.16312
Abstract We study adversarial action masking in self-play reinforcement learning: an attacker selectively removes legal actions from a victim's action set. Unlike observation or action perturbations, removal eliminates decision options before the agent acts. Across poker games scaling from 6 to 5,531 information states and two non-poker domains, learned masking causes substantially more damage than random masking and learned perturbation baselines. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC$_w$) and a value-weighted refinement CAC$_v$. These results identify action availability as a distinct robustness surface in self-play RL.
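To make the distinction between removal and perturbation concrete, here is a minimal sketch of action masking applied before the victim acts. The function name `masked_policy` and the softmax victim are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def masked_policy(logits, legal, removed):
    """Softmax over the actions that are legal AND not removed by the attacker.
    Removed actions get probability exactly zero: the option is gone before
    the agent chooses, unlike a perturbation of an already chosen action."""
    available = legal & ~removed
    if not available.any():
        raise ValueError("the attacker removed every legal action")
    masked = np.where(available, logits, -np.inf)
    z = masked - masked.max()
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])
legal = np.array([True, True, False])      # action 2 is illegal in this state
removed = np.array([True, False, False])   # attacker removes the preferred action 0
probs = masked_policy(logits, legal, removed)
print(probs)  # all probability mass moves to action 1
```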

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

Authors: Arahan Kujur
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16315
Pdf link: https://arxiv.org/pdf/2605.16315
Abstract We show that a threshold in decision capacity determines whether self-play reinforcement learning agents collapse under asymmetric rule perturbations. Across poker variants, matrix games, a dice game, and multiple learning algorithms, eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, a fixed point at near-maximal loss. Preserving even a single positive-reach contingent decision point prevents this collapse. A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself. The phenomenon is timing-invariant, fully reversible upon action restoration, and intensifies under function approximation. These results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously via reach-weighted capacity in the tested domains.

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

Authors: Matthew Schlegel, Volodymyr Tkachuk, Adam White, Martha White
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.16318
Pdf link: https://arxiv.org/pdf/2605.16318
Abstract Building and maintaining state to learn policies and value functions is critical for deploying reinforcement learning (RL) agents in the real world. Recurrent neural networks (RNNs) have become a key point of interest for the state-building problem, and several large-scale reinforcement learning agents incorporate recurrent networks. While RNNs have become a mainstay in many RL applications, many key design choices and implementation details responsible for performance improvements are often not reported. In this work, we discuss one axis on which RNN architectures can be (and have been) modified for use in RL. Specifically, we look at how action information can be incorporated into the state update function of a recurrent cell. We discuss several choices in using action information and empirically evaluate the resulting architectures on a set of illustrative domains. Finally, we discuss future work in developing recurrent cells and discuss challenges specific to the RL setting.

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Authors: Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16371
Pdf link: https://arxiv.org/pdf/2605.16371
Abstract Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at this https URL and this https URL.

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

Authors: Jiajian Li, Jingyuan Huang, Junru Gong, Qi Wang, Xiaokang Yang, Yunbo Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.16395
Pdf link: https://arxiv.org/pdf/2605.16395
Abstract We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically-grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop -- spanning from explicit state transitions to visual observation generation -- OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

Authors: Abraham Itzhak Weinberg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16429
Pdf link: https://arxiv.org/pdf/2605.16429
Abstract We introduce QuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker-Planck (FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\mathcal{O}(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\mathcal{O}(1/\varepsilon)$, a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\mathcal{O}(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rho^{*}$ drives a theoretically grounded exploration bonus $R_{\text{aug}} = R_{\text{env}} + \alpha\log(1/\rho^{*}(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic (SAC), while discovering the global optimum 10.4% more frequently (33.9% vs. 30.7%). Policy entropy remains near $H(\pi)\approx 6.5$ nats throughout training, whereas SAC collapses to $1.5$ nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\mathcal{O}(d^{0.35})$ for QuantFPFlow versus $\mathcal{O}(d^{0.76})$ for classical FP estimation.
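The exploration bonus in this abstract can be sketched classically in one dimension. The sketch below replaces amplitude estimation with a plain Riemann sum for the partition function; the double-well potential, the grid, and the values of the diffusion constant and `alpha` are assumptions for illustration only.

```python
import numpy as np

def stationary_density(potential, diffusion, grid):
    """Gibbs density exp(-V/D) / Z on a uniform 1-D grid,
    with Z estimated by a Riemann sum (classical stand-in)."""
    weights = np.exp(-potential(grid) / diffusion)
    dx = grid[1] - grid[0]
    Z = weights.sum() * dx
    return weights / Z

def augmented_reward(r_env, rho_s, alpha):
    """Environment reward plus the bonus alpha * log(1 / rho(s))."""
    return r_env + alpha * np.log(1.0 / rho_s)

grid = np.linspace(-3.0, 3.0, 6001)
double_well = lambda x: (x**2 - 1.0) ** 2      # hypothetical potential V(x)
rho = stationary_density(double_well, diffusion=0.5, grid=grid)
rho_at = lambda s: np.interp(s, grid, rho)

# With equal environment reward, the rarely visited barrier state (x = 0)
# receives a larger augmented reward than the well (x = 1).
bonus_barrier = augmented_reward(0.0, rho_at(0.0), alpha=0.1)
bonus_well = augmented_reward(0.0, rho_at(1.0), alpha=0.1)
print(bonus_barrier - bonus_well)
```

In this toy case the gap between the two bonuses equals `alpha * (V(0) - V(1)) / D`, because the normalising constant cancels in the difference.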

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Authors: Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16452
Pdf link: https://arxiv.org/pdf/2605.16452
Abstract Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

Identifiable Token Correspondence for World Models

Authors: Youngin Kim (1), Ray Sun (2), Inho Kim (2), Bumsoo Park (3), Hyun Oh Song (1 and 2) ((1) Interdisciplinary Program in Artificial Intelligence, Seoul National University, (2) Department of Computer Science and Engineering, Seoul National University, (3) KRAFTON)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.16457
Pdf link: https://arxiv.org/pdf/2605.16457
Abstract Transformer-based world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without explicitly modeling correspondence between tokens across time. We formulate next-frame prediction as a structured probabilistic inference problem with latent token correspondence variables, deriving a model in which each next-frame token is explained either by copying a token from the previous frame or by generating a new token. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on this https URL.

REC-RL: Referring expression counting via Gaussian and range-based reward optimization

Authors: Hui Liu, Yunlai Teng, Kunlong Bai, Pengfei Qi, Haotian Yan, Liang Li, Junlan Feng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.16460
Pdf link: https://arxiv.org/pdf/2605.16460
Abstract Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.
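The abstract does not give the reward formula, so the following is only one plausible reading of an accuracy reward that mixes a range term with a Gaussian term. The function name, the equal weighting, and `sigma` are hypothetical choices, not values from the paper.

```python
import math

def accuracy_reward(pred_count, pred_range, true_count, sigma=2.0, range_weight=0.5):
    """Hypothetical accuracy reward for a think-range-answer output.
    Range term: 1 if the predicted interval contains the true count.
    Gaussian term: decays smoothly with the squared counting error."""
    lo, hi = pred_range
    in_range = 1.0 if lo <= true_count <= hi else 0.0
    gaussian = math.exp(-((pred_count - true_count) ** 2) / (2.0 * sigma ** 2))
    return range_weight * in_range + (1.0 - range_weight) * gaussian

exact = accuracy_reward(7, (5, 9), 7)       # correct range and exact count
near = accuracy_reward(9, (5, 9), 7)        # correct range, count off by 2
bad_range = accuracy_reward(7, (1, 3), 7)   # exact count, but range misses the truth
print(exact, near, bad_range)
```

The Gaussian term gives partial credit for near misses, which a pure exact-match reward would not.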

World Model-Enabled Causal Digital Twins for Semantic Communications in Physical AI Systems

Authors: Lingyi Wang, Tingyu Shui, Walid Saad, Pascal Adjakple
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.16547
Pdf link: https://arxiv.org/pdf/2605.16547
Abstract Semantic communication has emerged as a promising paradigm for enabling goal-oriented networking. However, most existing semantic communication solutions are tailored to one-shot tasks and optimize instantaneous performance. Hence, they cannot be used to support closed-loop dynamic systems with physical artificial intelligence (AI), in which the transmitted semantics affect not only the current inference outcome but also future control actions, state evolution, and ultimately long-horizon task performance. To address this gap, this paper investigates goal-oriented semantic communications for physical AI systems with closed-loop sensing-communication-inference-control. In particular, the problem of semantic communications is formulated as a long-term return-per-bit maximization under wireless bit-budget constraints while capturing both control efficiency and communication efficiency. To solve this problem, a novel causal information value (CIV) metric is introduced to evaluate the marginal contribution of each semantic token to the expected long-term return by transmission interventions. Then, a world-model-enabled causal digital twin (WM-CDT) framework is proposed to capture the dynamics of closed-loop physical AI systems and enable counterfactual reasoning for long-horizon imagined rollouts. Based on these imagined rollouts, an actor-critic policy is trained for long-horizon agent control with high data efficiency, while the semantic token selector is trained through CIV-per-bit evaluation. Extensive simulations on an AirSim-Sionna-based unmanned aerial vehicle (UAV) navigation simulator show that the proposed WM-CDT framework achieves significant improvement in return-per-kbit and navigation success rate compared to existing reinforcement learning solutions.
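The paper trains a semantic token selector; as a much simpler stand-in, the sketch below shows what selecting tokens by CIV per bit under a bit budget could look like with a greedy rule. The greedy heuristic, the function name, and the numbers are assumptions, and the CIV values are taken as given rather than estimated from a world model.

```python
def select_tokens(civ, bits, bit_budget):
    """Greedy stand-in for a token selector: take tokens in decreasing order
    of CIV per bit while they still fit in the bit budget.
    Assumes every token has a positive bit cost."""
    order = sorted(range(len(civ)), key=lambda i: civ[i] / bits[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + bits[i] <= bit_budget:
            chosen.append(i)
            used += bits[i]
    return sorted(chosen), used

civ = [5.0, 4.0, 3.6]    # hypothetical causal information values per token
bits = [10, 4, 6]        # hypothetical transmission cost of each token in bits
chosen, used = select_tokens(civ, bits, bit_budget=10)
print(chosen, used)      # tokens 1 and 2 fill the 10-bit budget
```

Token 0 has the highest raw CIV but the lowest CIV per bit, so under this rule it is skipped in favour of two cheaper tokens whose combined CIV (7.6) is higher.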

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

Authors: Thomas Evers, Cristian Meo, Wendelin Bohmer, Justin Dauwels, Yaniv Oren
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.16692
Pdf link: https://arxiv.org/pdf/2605.16692
Abstract We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
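The two planner changes described here (averaging return estimates over models and rollout depths, and optionally penalising uncertainty) can be sketched as a scoring function over candidate action sequences. Using the standard deviation of the estimates as the uncertainty measure is an assumption; the abstract does not say which measure the method uses.

```python
import numpy as np

def score_candidates(return_estimates, penalty=0.0):
    """return_estimates has shape (n_candidates, n_models, n_depths).
    Score = mean over models and depths, minus penalty * std of the same
    estimates (std is an assumed proxy for uncertainty)."""
    est = np.asarray(return_estimates, dtype=float)
    flat = est.reshape(est.shape[0], -1)
    return flat.mean(axis=1) - penalty * flat.std(axis=1)

estimates = [
    [[10.0, 10.0], [10.0, 10.0]],   # candidate 0: models and depths agree
    [[20.0, 0.0], [22.0, 2.0]],     # candidate 1: higher mean, large disagreement
]
best_plain = int(np.argmax(score_candidates(estimates, penalty=0.0)))
best_cautious = int(np.argmax(score_candidates(estimates, penalty=0.5)))
print(best_plain, best_cautious)    # the penalty flips the choice to candidate 0
```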

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

Authors: Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16727
Pdf link: https://arxiv.org/pdf/2605.16727
Abstract We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
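The abstract mentions mutations and crossovers in LoRA weight space that keep the adapter rank fixed, without defining them. The sketch below shows one hypothetical pair of such operators on the factors `A` (rank x d_in) and `B` (d_out x rank): Gaussian-noise mutation and a crossover that inherits each rank-one component from one parent. Neither is claimed to be the paper's operator.

```python
import numpy as np

def mutate(A, B, rng, scale=0.01):
    """Hypothetical mutation: add small Gaussian noise to both LoRA factors."""
    return (A + scale * rng.standard_normal(A.shape),
            B + scale * rng.standard_normal(B.shape))

def crossover(parent1, parent2, rng):
    """Hypothetical crossover: each rank-one component (row i of A together
    with column i of B) is inherited whole from one parent, so the child
    has the same rank as its parents."""
    A1, B1 = parent1
    A2, B2 = parent2
    take_first = rng.random(A1.shape[0]) < 0.5
    A = np.where(take_first[:, None], A1, A2)
    B = np.where(take_first[None, :], B1, B2)
    return A, B

rng = np.random.default_rng(0)
rank, d_in, d_out = 4, 8, 6
A1, B1 = rng.standard_normal((rank, d_in)), rng.standard_normal((d_out, rank))
A2, B2 = rng.standard_normal((rank, d_in)), rng.standard_normal((d_out, rank))

child_A, child_B = crossover((A1, B1), (A2, B2), rng)
mutant_A, mutant_B = mutate(child_A, child_B, rng)
print(child_A.shape, child_B.shape)
```

Both operators touch only small adapter matrices, so producing a new population member is cheap compared with any training step.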

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

Authors: Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Methodology (stat.ME); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.16757
Pdf link: https://arxiv.org/pdf/2605.16757
Abstract Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.

AoI-MDP: An AoI Optimized Markov Decision Process (Student Abstract)

Authors: Yimian Ding, Jingzehua Xu, Yiyuan Yang, Guanwen Xie, Xinqi Wang, Shuai Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.16777
Pdf link: https://arxiv.org/pdf/2605.16777
Abstract Ocean exploration places high demands on autonomous underwater vehicles, especially when there's observation delay. We propose age of information optimized Markov decision process (AoI-MDP) to enhance underwater tasks by modeling observation delay as signal delay and including it in the state space. AoI-MDP also introduces wait time in the action space and integrates AoI with reward functions, optimizing information freshness and decision-making using reinforcement learning. Simulations show AoI-MDP outperforms the standard MDP, demonstrating superior performance, feasibility, and generalization in underwater tasks. To accelerate relevant research, we have made the codes available as open-source at this https URL.
中文摘要 海洋探索对自主水下载具要求极高，尤其是在观测延迟的情况下。我们提出了信息年龄优化的马尔可夫决策过程（AoI-MDP），通过将观测延迟建模为信号延迟并将其纳入状态空间，以增强水下任务。AoI-MDP还引入了动作空间的等待时间，并将AoI与奖励函数整合，利用强化学习优化信息的新鲜度和决策。模拟显示，AoI-MDP优于标准MDP，展现出在水下任务中更优的性能、可行性和泛化性。为了加快相关研究进展，我们将代码开源至此 https URL。
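The three ingredients named in the abstract (delay in the state, wait time in the action, AoI in the reward) can be sketched as follows. This is a minimal illustration, not the authors' code: the class names, the additive age model `signal_delay + wait_time`, the linear penalty, and the weight `aoi_weight` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AoIState:
    observation: float  # last observation received by the vehicle (toy scalar)
    aoi: float          # age of that observation at decision time, in seconds


@dataclass(frozen=True)
class AoIAction:
    control: float    # task action, e.g. a thrust command (hypothetical)
    wait_time: float  # extra time the agent chooses to wait before acting


def next_state(new_observation: float, signal_delay: float, action: AoIAction) -> AoIState:
    """Assumed age model: the observation is signal_delay old on arrival
    and ages further by the chosen wait time before the agent acts."""
    return AoIState(new_observation, signal_delay + action.wait_time)


def shaped_reward(task_reward: float, state: AoIState, aoi_weight: float = 0.1) -> float:
    """Task reward minus a linear penalty on stale information (assumed form)."""
    return task_reward - aoi_weight * state.aoi
```

With this shape, a standard RL algorithm sees the age as part of its input and can trade waiting against acting on stale data.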

The Unlearnability Phenomenon in RLVR for Language Models

语言模型RLVR中的不可学习性现象

Authors: Yulin Chen, He He, Chen Zhao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.16787
Pdf link: https://arxiv.org/pdf/2605.16787
Abstract Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明在提升大型语言模型（LLM）推理能力方面非常有效。然而，RLVR的学习动态仍未被充分探讨。本文揭示了一个反直觉现象：在模型最初难以处理的难例中，有相当一部分即使存在正确的展开，仍无法学习。为了理解这一现象，我们首先证明现有的优化和采样技术无法解决不可学习性问题。通过跨例梯度分析，我们表明不可学习的例子存在基本的表示问题，其特征是与其他例子的梯度相似度较低且推理模式不可推广。我们还进一步证明，在强化学习中表示缺陷难以缓解，因为数据增强并不能改善梯度相似性。本研究首次系统地描述了RLVR训练中不可学习数据，并揭示了当前强化学习方法在推理任务上的根本局限性。代码和数据可在此 https URL 获取。
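The cross-example gradient analysis mentioned in the abstract relies on gradient similarity between one example and the rest. A minimal sketch of such a measure, assuming per-example gradients are already available as flat lists of floats (the function names and the choice of mean cosine similarity are illustrative assumptions, not the paper's exact metric):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_rest(grads, i):
    """Average cosine similarity between example i's gradient and all others."""
    others = [g for j, g in enumerate(grads) if j != i]
    return sum(cosine(grads[i], g) for g in others) / len(others)


# Toy gradients: the last example points away from the other three.
toy_grads = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.0, -1.0]]
similarities = [mean_similarity_to_rest(toy_grads, i) for i in range(len(toy_grads))]
```

An example whose score is low relative to the batch would be a candidate for the "unlearnable" group under this reading.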

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

TIER：多步工具组合的轨迹不变执行奖励

Authors: Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.16790
Pdf link: https://arxiv.org/pdf/2605.16790
Abstract Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
中文摘要 工具的使用使大型语言模型能够通过一系列API调用解决复杂任务，但现有的强化学习方法未能扩展到多步组合设置。基于结果的奖励仅提供稀疏反馈，而轨迹监督奖励依赖注释引用解，惩罚有效替代方案并限制可扩展性。我们提出了TIER：轨迹不变执行奖励，这是一种直接从函数模式和运行时执行中获得监督的奖励框架，而非参考轨迹。奖励分解为格式有效性、模式遵循性、执行成功率和答案正确性，提供基于对工具使用步骤细粒度验证的密集且可解释的序列级反馈。该设计允许任何有效的执行路径获得积分，自然支持多种解决方案策略并适应不断演变的工具界面。在DepthBench上，这是一个按深度分层（1到6步）的成分基准，TIER在各步骤中实现了>90%的准确率，其中轨迹监督奖励在第4步之后就崩溃了。我们还进一步展示了在BFCL v3和NestFUL等基准测试上的持续增长。消融研究证实所有奖励成分都是必需的，凸显了多层次监督对组合推理的重要性。
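The four-part reward described in the abstract (format validity, schema adherence, execution success, answer correctness) can be illustrated with a small sketch. Everything concrete here is an assumption for the example: the JSON call format with `name` and `args` keys, the schema represented as a dict of argument names to Python types, the toy `add` tool, and the equal weighting of the four parts.

```python
import json


def step_scores(call_text, schema, tools):
    """Score one tool call on format, schema adherence, and execution (0 or 1 each)."""
    scores = {"format": 0.0, "schema": 0.0, "execution": 0.0}
    try:
        call = json.loads(call_text)
    except json.JSONDecodeError:
        return scores
    if not isinstance(call, dict) or "name" not in call or "args" not in call:
        return scores
    scores["format"] = 1.0
    spec = schema.get(call["name"])
    args = call["args"]
    if spec is None or not isinstance(args, dict) or set(args) != set(spec):
        return scores
    if not all(isinstance(args[k], t) for k, t in spec.items()):
        return scores
    scores["schema"] = 1.0
    try:
        tools[call["name"]](**args)  # run the tool; only success or failure is scored
    except Exception:
        return scores
    scores["execution"] = 1.0
    return scores


def trajectory_reward(calls, final_answer, expected, schema, tools):
    """Average the per-step scores, add answer correctness, weight all parts equally."""
    parts = {"format": 0.0, "schema": 0.0, "execution": 0.0}
    for text in calls:
        for key, value in step_scores(text, schema, tools).items():
            parts[key] += value
    n = max(len(calls), 1)
    parts = {key: value / n for key, value in parts.items()}
    parts["answer"] = 1.0 if final_answer == expected else 0.0
    return sum(parts.values()) / 4.0, parts
```

No reference trajectory appears anywhere in the scoring, so any sequence of valid, executable calls that reaches the right answer earns full credit, which is the property the abstract emphasizes.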

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

断开KL与轨迹：大语言模型蒸馏中SFT、DAgger、离线强化学习和OPD的统一视角

Authors: Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.16826
Pdf link: https://arxiv.org/pdf/2605.16826
Abstract Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
中文摘要 知识蒸馏是大型语言模型后训练的核心，但其设计空间仍未被充分理解，尤其是在与强化学习（RL）结合时。我们表明，现有范式，即离策略蒸馏和在策略蒸馏（OPD），隐含耦合了两种正交选择：前缀来源和令牌级KL方向。这可通过对自回归响应分布的序列级KL进行分解得出：前向KL将教师前缀与令牌级前向KL配对，反向KL将学生前缀与令牌级反向KL配对。我们认为这种耦合并非内在的：解耦两个轴可产生四个有效目标。我们建立了梯度层面的恒等式，表明前向KL给出以教师软目标为匹配对象的SFT式交叉熵，而反向KL给出带有密集师生对数比奖励的RL式策略梯度目标，并将它们与离策略SFT、DAgger式在策略SFT、离线RL式蒸馏和OPD联系起来。我们在数学推理上开展了广泛的对照研究，评估这四个目标作为独立方法以及作为后续强化学习初始化的表现。结果揭示了三种权衡：KL方向带来准确性与熵的权衡，前缀来源带来质量与计算的权衡，训练长度带来准确性与稳定性的权衡。基于这些发现，我们提出了KL混合和熵门控长度课程。KL混合表明，长序列蒸馏需要较大的前向KL权重，以在不牺牲准确性的前提下防止熵坍缩和长度膨胀。熵门控长度课程将Avg@k和Pass@k分别提升3.6分和最多5.8分，并使平均响应长度相比固定长视野训练缩短约3倍。我们的结果为设计在准确性、多样性、计算和强化学习行为之间取得平衡的推理蒸馏目标提供了框架和实用方法。
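The two axes in the abstract, prefix source and token-level KL direction, can be made concrete on a single next-token distribution. The sketch below is illustrative only: the toy probabilities are invented, and the assignment of the four method names to the four cells is one reading of the abstract rather than something it spells out cell by cell.

```python
import math


def kl(p, q):
    """KL(p || q) for two categorical distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_loss(teacher_probs, student_probs, direction):
    """Token-level distillation loss for one position of a given prefix."""
    if direction == "forward":   # KL(teacher || student), SFT-style soft targets
        return kl(teacher_probs, student_probs)
    if direction == "reverse":   # KL(student || teacher), RL-style log-ratio signal
        return kl(student_probs, teacher_probs)
    raise ValueError(f"unknown direction: {direction}")


# (prefix source, KL direction) -> name used in the abstract (assumed mapping).
OBJECTIVES = {
    ("teacher", "forward"): "off-policy SFT",
    ("student", "forward"): "DAgger-style on-policy SFT",
    ("teacher", "reverse"): "offline-RL-style distillation",
    ("student", "reverse"): "OPD",
}

teacher = [0.7, 0.2, 0.1]  # toy next-token distributions over a 3-token vocabulary
student = [0.5, 0.3, 0.2]
forward_loss = token_loss(teacher, student, "forward")
reverse_loss = token_loss(student=None or student, teacher_probs=teacher, direction="reverse") if False else token_loss(teacher, student, "reverse")
```

The prefix source only decides which model generated the tokens preceding this position; the loss at the position itself is the same function in all four cells, which is why the two choices can be varied independently.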

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Sketch 然后绘制：扩散多模态大型语言模型中的层级强化学习

Authors: Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16842
Pdf link: https://arxiv.org/pdf/2605.16842
Abstract Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.
中文摘要 扩散多模态大型语言模型（dMLLM）在图像生成方面非常强大，但通过强化学习（RL）优化它们仍是一个重大挑战。一个主要难点是单个图像可以通过多种不同的解膜序列生成，这使得计算重要性比常常变得困难。此外，现有方法往往忽视dMLLM的层级生成过程，早期代币定义全局布局，后期代币则关注局部细节。通过给所有代币分配统一的奖励，这些现有方法未能反映每个代币对最终图像的实际贡献。为解决这些问题，我们提出了分层令牌GRPO（HT-GRPO），将该层级直接整合进策略优化过程。我们的方法采用了“草图-再绘画”培训方案，将更新组织为三个不同阶段：全局、结构和精炼。我们还使用提示条件估计器从完全掩蔽状态计算重要性比。此外，我们引入了分层信用分配机制，优先排序关键结构代币，确保奖励传递准确。使用两种流行的dMLLM骨干MMaDA和Lumina-DiMOO的实验表明，HT-GRPO在GenEval和DPG基准测试中取得了显著提升。六项额外指标的评估确认了图像质量、美感和人类偏好的显著改善。

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

以行人感知的大型语言模型驱动的自动驾驶车辆行为规划

Authors: Aidana Baimbetova, Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16858
Pdf link: https://arxiv.org/pdf/2605.16858
Abstract Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.
中文摘要 自动驾驶车辆（AV）必须在密集的城市环境中做出可靠的决策，而这类环境中的行人行为多变，有时异常，且往往在训练中未曾出现。基于强化学习（RL）的自动驾驶控制系统在结构化交通中表现良好，但难以泛化到不可预测的行人互动和分布外场景。它们依赖手工设计的奖励和不透明的决策，进一步限制了其在安全关键、行人密集环境中的适用性。为解决这些局限性，我们引入了一个基于大型语言模型（LLM）的决策框架，用于行人感知行为规划。该系统将结构化的场景观察转化为自然语言推理提示，使LLM能够推断行人意图、预判风险并生成谨慎的战术驾驶决策。这些决策由运动规划器执行，确保控制平稳且运动学上可行。我们在SUMO中针对多种行人互动场景评估了该框架，包括意外乱穿马路、折返过街、犹豫和双向过街。在零样本评估中，基于LLM的智能体实现了68%的无碰撞成功率，远超深度强化学习基线（17.7%）。在单行人场景中使用少样本情节记忆时，性能提升至96.0%，超过自定义DQN控制器的82.0%。跨行为评估进一步表明，从折返交互中获得的记忆可迁移到未见过的犹豫和双向过街场景，分别达到82.0%和90.0%的成功率。该系统始终更早地做出响应，保持更宽的安全缓冲，并产生可解释、符合人类预期的决策。

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

超越安全过滤：控制障碍功能知情强化学习，适用于互联和自动驾驶车辆

Authors: Jianye Xu, Bassam Alrifaee
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.16894
Pdf link: https://arxiv.org/pdf/2605.16894
Abstract Reinforcement Learning (RL) uses rewards to guide learning, yet reward design is typically hand-crafted using heuristics that can be difficult to tune. We propose a Control Barrier Function (CBF)-informed reward design for Multi-Agent RL (MARL) that converts CBF constraint values under joint MARL actions into a reward signal that explicitly guides safe learning. We compare against two heuristic reward baselines in a four-way multi-lane intersection with connected and automated vehicles. Results show that our method achieves the highest task performance and is less sensitive to reward hyperparameters, yielding consistently strong performance across the tested hyperparameter range. Code for reproducing the experimental results and a video demonstration are available at this https URL.
中文摘要 强化学习（RL）通过奖励来指导学习，但奖励设计通常依靠难以调整的启发式方法手工完成。我们提出了一种基于控制障碍函数（CBF）的多智能体强化学习（MARL）奖励设计，将联合MARL动作下的CBF约束值转换为明确指导安全学习的奖励信号。我们在包含网联自动驾驶车辆的四向多车道交叉口中，与两个启发式奖励基线进行了比较。结果显示，我们的方法实现了最高的任务性能，且对奖励超参数的敏感度较低，从而在测试的超参数范围内持续获得强劲表现。用于重现实验结果的代码和视频演示可在此 https URL 获取。
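To make the idea of turning a CBF constraint value into a reward concrete, here is a minimal sketch. It uses the standard discrete-time CBF condition h(x') >= (1 - alpha) * h(x) with 0 < alpha <= 1, where h >= 0 defines the safe set. The mapping from constraint value to reward (penalize violations only, scaled linearly) and the parameter names are assumptions for illustration; the abstract does not specify the paper's exact conversion.

```python
def cbf_constraint(h_now: float, h_next: float, alpha: float = 0.5) -> float:
    """Discrete-time CBF constraint value; non-negative means the condition holds.

    h_now and h_next are barrier values before and after the joint action,
    e.g. inter-vehicle distance minus a minimum safe distance (hypothetical choice).
    """
    return h_next - (1.0 - alpha) * h_now


def cbf_reward(h_now: float, h_next: float, alpha: float = 0.5, scale: float = 1.0) -> float:
    """Assumed reward shaping: zero when the CBF condition holds,
    otherwise a penalty proportional to the size of the violation."""
    return scale * min(cbf_constraint(h_now, h_next, alpha), 0.0)
```

For example, with `h_now = 2.0` the condition at `alpha = 0.5` requires `h_next >= 1.0`, so an action that shrinks the margin to 0.5 is penalized even though the state is still inside the safe set.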

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro：一个用于Omnibus视觉语言取证的工具增强代理

Authors: Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.16962
Pdf link: https://arxiv.org/pdf/2605.16962
Abstract Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at this https URL.
中文摘要 现有的视觉语言伪造检测和定位方法运行在封闭世界范式下，假设验证仅靠模型即可完成。然而，自包含的MLLM受限于有限的参数化知识、静态训练语料库和有限的感知分辨率，这在动态开放世界取证中形成了实际的天花板，尤其是在需要外部线索的实时事件验证和需要对局部篡改进行细致审查的伪造分割方面。为了解决这些限制，我们从扩大自包含模型的规模转向超越它。我们提出了 OmniVL-Guard Pro，这是一种工具增强智能体，将统一取证从封闭世界预测扩展到开放世界线索驱动推理。OmniVL-Guard Pro 集成了一个涵盖实时事件搜索、局部裁剪与缩放、边缘异常筛查、人脸检测、视频帧提取以及基于 SAM3 的分割的工具环境。为了生成高质量的工具推理轨迹，我们引入了树结构自演化工具轨迹生成方法，通过种子引导、无引导者的自我演化和弱提示的困难样本合成产生多样化轨迹，得到用于训练的全谱工具推理（FSTR）数据集。我们还进一步提出了检查器引导的智能体强化学习（CGARL），该方法提供过程层级监督，惩罚那些答案正确但推理被扭曲的情况。大量实验表明，OmniVL-Guard Pro 在多个任务中实现了最先进的性能，并展现出强大的零样本泛化能力。OmniVL-Guard Pro 的 FSTR 数据集和代码将公开发布于此 https URL。

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

排名感知校准，实现可靠的多模态强化学习

Authors: Peng Cui, Boyao Yang, Jun Zhu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.16999
Pdf link: https://arxiv.org/pdf/2605.16999
Abstract Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.
中文摘要 训练后的强化学习显著提升了视觉语言模型的推理准确性，但由此产生的策略仍然校准不足。终端正确性奖励没有提供比不确定错误更惩罚自信错误的梯度，也没有信号将信心与视觉证据质量挂钩，这种差距在输入损坏或模糊时尤其严重，模型对错误答案仍报告高度信心。我们介绍了排名感知校准（RAC），这是一个训练时间框架，利用基于群体的强化学习已产生的两个对比信号来监督置信度，且无需额外标记成本。排名感知组损失意味着，在同一提示中，较好的推广获得比较差的更强置信度。干净腐败的两两损失强化了信心随着视觉证据的退化而减弱。由于排名信号迫使策略区分正确与错误的推理路径，同时也强化了任务的准确性，超出了仅仅正确性奖励所能带来的范围。这两种损失都无需外部置信标注，且能自然地与基于组的强化学习在训练后整合。我们在Qwen2.5-VL和InternVL-3.5骨干上实例化RAC，并在六个多模态推理基准测试中，在干净且损坏的输入下进行评估。实证结果表明，排名感知损失通过教策略区分优劣推理显著提升任务准确性，而两两损坏损失则在退化输入下减少校准误差。它们的组合在所有测试骨干中实现最佳校准，同时在大多数环境中提升准确性。
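The two comparison signals in the abstract can be sketched as pairwise margin losses on scalar confidences. The hinge form, the margin values, and the function names below are assumptions chosen for illustration; the abstract states only what each loss enforces, not its functional form.

```python
def ranking_group_loss(rewards, confidences, margin=0.1):
    """Within one prompt's rollout group, penalize any pair where the rollout
    with the higher reward does not also have higher confidence by `margin`."""
    total, pairs = 0.0, 0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            if rewards[i] > rewards[j]:
                total += max(0.0, margin - (confidences[i] - confidences[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def clean_corrupted_loss(conf_clean, conf_corrupted, margin=0.0):
    """Penalize the policy when confidence on a corrupted input is not
    lower than confidence on the clean version of the same input."""
    return max(0.0, margin - (conf_clean - conf_corrupted))
```

Both functions consume only quantities that group-based RL already has on hand (per-rollout rewards and the model's stated confidence), which matches the abstract's point that no extra confidence labels are needed.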

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

学习区能量：在线数据选择以实现高效的强化学习后培训

Authors: Peng Cui, Boyao Yang, Jun Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17003
Pdf link: https://arxiv.org/pdf/2605.17003
Abstract Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at this https URL.
中文摘要 强化学习（RL）后训练已成为大型语言模型（LLM）中引发数学推理的主导范式，然而如GRPO和DAPO等主流技术几乎均匀地将展开和梯度预算分配到提示词上，导致计算资源浪费在已掌握或远超模型当前能力的样本上。为解决这一根本性低效，我们提出了学习区能量（LZE），这是一个理论基础的全在线数据选择框架，将计算集中在模型的主动学习前沿。核心是我们定义一个闭式学习区能量评分，将三个互补信号、初始难度锚点、归一化的结果不确定项和通过率动量融合为一个标量，该标量可证明与群体相对政策梯度更新的预期幅度相匹配。带有回放功能的前向剪枝还能通过跳过持续解决提示的滚动生成，同时定期检查遗忘，从而进一步降低墙钟时间成本。在GSM8K、MATH和DAPO-MATH的Qwen家族模型（1.5B-8B）上评估，我们的方法每步仅保留40%的训练数据，但与完整数据基线匹配甚至超过，尤其是在AIME25（+45.9%）和AMC23（+18.2%）上，分布外增长尤为显著，训练FLOP数量估计减少了36%。我们的代码可在此 https URL 访问。
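The abstract names three signals (an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum) but does not give the closed form. The sketch below is therefore a hypothetical stand-in: a weighted sum, with uncertainty taken as `4p(1-p)` (the Bernoulli variance rescaled to [0, 1]) and momentum as the absolute change in pass rate. The weights and the 40% keep fraction default are illustrative assumptions as well.

```python
def outcome_uncertainty(pass_rate: float) -> float:
    """Bernoulli variance p(1-p), rescaled so that p = 0.5 gives 1.0."""
    return 4.0 * pass_rate * (1.0 - pass_rate)


def lze_score(initial_difficulty, pass_rate, prev_pass_rate, weights=(0.2, 0.6, 0.2)):
    """Hypothetical fusion of the three signals into one scalar per prompt."""
    w_difficulty, w_uncertainty, w_momentum = weights
    momentum = abs(pass_rate - prev_pass_rate)
    return (w_difficulty * initial_difficulty
            + w_uncertainty * outcome_uncertainty(pass_rate)
            + w_momentum * momentum)


def select_prompts(scores, keep_fraction=0.4):
    """Return the indices of the highest-scoring prompts, in ascending order."""
    k = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# (initial difficulty, current pass rate, previous pass rate) per prompt, toy values.
prompt_stats = [(0.5, 1.0, 1.0), (0.5, 0.5, 0.4), (0.5, 0.0, 0.0), (0.5, 0.6, 0.5), (0.5, 0.9, 0.9)]
selected = select_prompts([lze_score(*stats) for stats in prompt_stats])
```

In this toy batch the always-solved and never-solved prompts score lowest, because every rollout in their group gets the same outcome and a group-relative advantage would be zero.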

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D$^2$Evo：双难度感知自我进化，实现数据高效强化学习

Authors: Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.17037
Pdf link: https://arxiv.org/pdf/2605.17037
Abstract Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.
中文摘要 强化学习（RL）已展示出提升大型语言模型（LLM）推理能力的潜力。然而，有效的强化学习训练需要中等难度的训练样本，面临两个根本挑战：有效数据稀缺性和动态难度变化，中等难度样本稀缺且随着模型改进变得微不足道。现有方法通过生成训练样本在一定程度上缓解了这种稀缺性。然而，这些方法存在无锚点生成、忽视共演化以及难度不匹配的问题。为解决这些问题，我们提出了D$^2$Evo，一个双难度感知的自我进化强化学习框架。在每次迭代中，我们的方法基于当前解答器能力挖掘中等难度锚点，训练提问者生成适合难度的多样化问题，并共同优化这两个组成部分，以实现推理的逐步提升。大量实验表明，D$^2$Evo在实际样本少于2K的数学推理基准测试中优于现有方法，并且在通用推理基准测试中表现出强烈的推广性。

Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

学习多时间尺度抽象以实现层级组合规划

Authors: Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17058
Pdf link: https://arxiv.org/pdf/2605.17058
Abstract The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.
中文摘要 指数级大动作空间、随机动力学和在有限资源下的长期决策相结合，使得顺序随机组合优化（SSCO）在强化学习中尤为具有挑战性。分层强化学习（HRL）提供了自然的分解，但它将高层策略置于半马尔可夫决策过程（SMDP）中，其中行动持续时间可变，这使得学习适合规划的世界模型变得困难。我们引入了一个基于模型的层级框架，用于顺序随机组合决策，直接解决这一问题。我们的方法结合了潜空间树搜索规划器和SMDP感知的世界模型，用于可变时长的决策。多时间尺度的客观物构建了潜在动态，使得转变幅度反映抽象动作的有效时间尺度，从而在自适应时间抽象下实现高效的前瞻性展望。我们还进一步学习与世界模型联合制定的子目标条件预算政策，以支持情境感知的资源分配。在具有挑战性的SSCO基准中，我们的方法优于强劲的基线。
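The difficulty the abstract attributes to the SMDP setting comes from high-level actions that last a variable number of primitive steps. In the standard SMDP value backup, the bootstrap term is discounted by `gamma ** tau`, where `tau` is the duration of the abstract action. A minimal sketch of that target (the function name and the toy numbers are assumptions; the paper's world model and planner are not reproduced here):

```python
def smdp_target(step_rewards, next_value, gamma=0.99):
    """Value target for one high-level action that ran for len(step_rewards) steps.

    The rewards collected during the action are discounted step by step, and
    the value of the resulting state is discounted by gamma ** tau.
    """
    tau = len(step_rewards)
    discounted_return = sum((gamma ** k) * r for k, r in enumerate(step_rewards))
    return discounted_return + (gamma ** tau) * next_value
```

Two abstract actions that reach the same state with the same rewards therefore get different targets if one takes longer, which is why a world model used for planning has to track durations as well as transitions.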

A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems

一个红队框架，用于评估AI驱动的安全编排、自动化和响应系统的稳健性

Authors: Ayan Javeed Shaikh, Nathaniel D. Bastian, Ankit Shah
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2605.17075
Pdf link: https://arxiv.org/pdf/2605.17075
Abstract AI-enabled Security Orchestration, Automation, and Response (SOAR) systems increasingly employ autonomous agents for cyber defense, yet their resilience to adaptive adversaries is underexplored. We introduce an autonomous red teaming framework that integrates large language models (LLMs) with reinforcement learning (RL) to generate adaptive, multi-stage attack campaigns against autonomous defenders in enterprise networks. A hierarchical design combines an LLM-based planner for strategic intent with an RL controller for tactical execution, supported by reward shaping aligned with kill-chain progression. Evaluation in a high-fidelity enterprise simulation demonstrates the effectiveness of the proposed approach, while also showing that standalone LLM agents fail to sustain multi-stage attack campaigns and that domain-specific cybersecurity models achieve only limited levels of compromise, highlighting the necessity for hybrid LLM-RL approaches to red teaming.
中文摘要 AI驱动的安全编排、自动化与响应（SOAR）系统越来越多地采用自主智能体进行网络防御，但其对适应性对手的韧性尚未被充分探索。我们引入了自主红队框架，将大型语言模型（LLM）与强化学习（RL）集成，生成针对企业网络自主防御者的自适应多阶段攻击战役。分层设计结合了基于大型语言模型的战略意图规划器和用于战术执行的强化学习控制器，并由与杀戮链进展相匹配的奖励形态支持。高保真企业模拟的评估展示了该方法的有效性，同时也显示独立的LLM代理无法维持多阶段攻击行动，且领域特定网络安全模型的攻破程度有限，凸显了混合LLM-RL方法在红队合作中的必要性。

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

从模仿到互动：利用浅层强化学习掌握施纳普森游戏

Authors: Ján Klačan, Sizhong Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17162
Pdf link: https://arxiv.org/pdf/2605.17162
Abstract This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significantly higher win rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively low value of the training num_samples parameter rather than increasing uniformly with stronger sampling.
中文摘要 本文探讨浅层神经网络代理是否能掌握纸牌游戏Schnapsen，并挑战使用蒙特卡洛采样和前瞻搜索的强搜索基线RdeepBot。在越来越复杂的实验设计指导下，我们首先评估一个基于回放数据训练的监督学习代理（MLPBot），然后通过异步蒙特卡洛更新和经验重播训练的强化学习代理（RLBot），其结构同样浅薄。结果显示，监督模仿的泛化不足以击败强大的RdeepBot对手，而强化学习则能产生更强的代理。在以深度参数为核心的环境中，当学习到的价值函数与游戏过程中更深入的前瞻结合时，性能最佳，使RLBot在最强评估的RdeepBot基线下能实现统计上显著更高的胜率。在基于样本的环境中，提升更具条件性：最强表现出现在较低的训练num_samples参数，而非随着更强采样均匀提升。

Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

超越执行：静态分析奖励与提示条件扩散强化学习用于代码生成

Authors: Shuyin Ouyang, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Jie M. Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17174
Pdf link: https://arxiv.org/pdf/2605.17174
Abstract Reinforcement Learning (RL) is an important paradigm for aligning Diffusion Language Models (DLMs) toward functional correctness in code generation. However, these models often encounter a "capability cliff" on complex tasks, where execution-based semantic rewards become too low to provide a viable learning signal. In this paper, we present a systematic empirical study of RL post-training for diffusion-based code generation along three axes: reward design, hint-conditioned sampling, and task difficulty. We investigate the effectiveness of execution-free rewards as alternatives to traditional unit-test execution, the role of training-time hint-conditioned diffusion sampling in mitigating exploration bottlenecks, and how the impact of these design choices varies across tasks with different difficulty levels. Across HumanEval, MBPP, and LiveCodeBench, we find that static checking is the strongest overall standalone execution-free reward in our setting, improving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while reducing rollout time by 9.4%. We further find that moderate AST-based hinting is most useful on harder benchmarks, while the best reward design depends strongly on task difficulty: similarity-based rewards are more effective on easier subsets, whereas static checking is more reliable on harder subsets where execution rewards are low. These findings suggest that reward design and training guidance substantially affect diffusion RL performance in our evaluated code-generation setting.
中文摘要 强化学习（RL）是使扩散语言模型（DLMs）实现代码生成功能正确性的重要范式。然而，这些模型在复杂任务中常常遇到“能力悬崖”，即基于执行的语义奖励过低，无法提供可行的学习信号。本文通过三个轴向系统地实证研究了基于扩散的代码生成在强化学习后进行的实验研究：奖励设计、提示条件抽样和任务难度。我们研究了无执行奖励作为传统单元测试替代方案的有效性，训练时间提示条件扩散采样在缓解探索瓶颈中的作用，以及这些设计选择在不同难度任务中的影响。在HumanEval、MBPP和LiveCodeBench中，我们发现静态检查是我们设定中独立无执行最强的奖励，尤其是在HumanEval上将DiffuCoder从53.9提升到67.1，LiveCodeBench从14.9提升到15.5，同时将推广时间缩短9.4%。我们还发现，适度的AST提示在较难的基准测试中最有用，而最佳奖励设计则高度依赖任务难度：基于相似度的奖励在较简单的子集上更有效，而静态检查在执行奖励较低的困难子集上更为可靠。这些发现表明，奖励设计和培训指导对我们评估的代码生成环境中扩散强化学习表现有显著影响。
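An execution-free reward of the kind the abstract calls static checking can be illustrated with Python's standard `ast` module, which parses code without running it. The specific checks and score levels below (0.0 for code that does not parse, 0.5 for code that parses but lacks the expected function, 1.0 otherwise) are assumptions for this sketch; the abstract does not say which static checks the authors use.

```python
import ast


def static_reward(code: str, entry_point: str) -> float:
    """Score generated code without executing it.

    0.0: the code does not parse.
    0.5: the code parses but defines no function named `entry_point`.
    1.0: the code parses and defines `entry_point`.
    """
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return 0.0
    defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
    return 1.0 if entry_point in defined else 0.5
```

Unlike a unit-test reward, this signal stays informative on hard tasks where almost no sample passes the tests, and it avoids the cost of sandboxed execution during rollouts.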

Multi-LLM Systems Exhibit Robust Semantic Collapse

多LLM系统表现出稳健语义崩溃

Authors: Weiyi Kong, Shiyang Lai, Jinghua Piao, James Evans
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.17193
Pdf link: https://arxiv.org/pdf/2605.17193
Abstract Whether machines can originate novel content has been debated for nearly two centuries, from Lovelace's assertion that no engine can "originate anything" to Turing's question of whether a machine can amplify ideas brought in from outside. Multi-large language model (LLM) systems, increasingly deployed for autonomous generation, reopen this question empirically. Here we show that such systems, operating in closed loops, exhibit semantic collapse: systematic convergence in semantic representations despite apparent lexical variation. Across model families and extended simulations of 200 to 1,000 rounds, the pattern remains consistent. Twelve intervention strategies, spanning decoding parameters, prompt design, agent composition, activation engineering, and reinforcement learning, fail to restore semantic diversity. Mechanistic analyses suggest that semantic collapse is not explained by alignment or conformity biases, but is consistent with intrinsic properties of autoregressive generation. Our results point to fundamental constraints in the ability of multi-LLM systems to sustain open-ended knowledge production in closed-loop settings.
中文摘要 机器是否能产生新颖内容，已经争论了近两个世纪，从洛夫莱斯断言没有任何引擎能“产生任何东西”，到图灵关于机器是否能放大外部带来的思想的问题。多大型语言模型（LLM）系统日益被用于自主生成，从实证角度重新开启了这一问题。我们展示了这些系统在闭环中运行，表现出语义崩溃：尽管词汇表面上存在差异，语义表征却系统性收敛。在模型家族中，经过200至1000发的扩展模拟，这一模式保持一致。十二种干预策略，涵盖解码参数、提示设计、代理组合、激活工程和强化学习，但未能恢复语义多样性。机制分析表明，语义崩溃并非由对齐偏差或一致性偏差解释，而是与自回归生成的内在属性一致。我们的结果指出，多LLM系统在闭环环境中维持开放式知识生产的能力存在根本性限制。

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

生成车辆与行人互动的真实安全关键场景

Authors: Qingwen Pu, Kun Xie, Yuan Zhu, Guocong Zhai
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.17229
Pdf link: https://arxiv.org/pdf/2605.17229
Abstract Automated driving system deployment requires rigorous validation across safety-critical vehicle-pedestrian interactions, yet real-world datasets rarely capture high-risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three-stage framework that combines real-world grounding with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Stage 1 pre-trains multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents on real-world safety-critical data to learn human-like interactive evasive behaviors through data-driven learning. Stage 2 deploys the pre-trained agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real-world knowledge with simulation experience to produce a refined MA-SST-DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high-resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle-Pedestrian Safety-Critical Interaction (VPSCI) dataset. The refined MA-SST-DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real-world data in both conflict severity and behavioral response. A Turing test confirmed that the evasive behaviors generated by the three-stage framework were indistinguishable from real-world interactions. These results demonstrate the framework's effectiveness in producing high-fidelity safety-critical data, offering valuable sources for the development of ADS and simulation-based safety evaluations.
中文摘要 自动驾驶系统的部署需要在安全关键的车辆-行人交互中进行严格验证，但现实世界的数据很少捕捉高风险场景，而模拟平台也缺乏真实行为。为此，本研究提出了一个三阶段框架，结合现实世界基础化与自适应模拟，以大规模生成行为上逼真的安全关键场景。第一阶段预训练多智能体状态空间增强型Transformer增强DDPG（MA-SST-DDPG）智能体，利用现实安全关键数据，通过数据驱动学习学习类人交互规避行为。第二阶段在CARLA中部署预训练的多智能体进行在线强化学习，将现实知识与模拟经验整合，生成精炼的MA-SST-DDPG模型。第三阶段利用CARLA结合精细模型，从八个交叉场景生成超过198,000个高分辨率交互事件，最终形成车辆-行人安全关键交互（VPSCI）数据集。精细化的MA-SST-DDPG模型在再现真实规避行为方面优于基线方法，实现了最低的轨迹误差（ADE = 0.072 m，FDE = 0.142 m）。统计比较证实了生成数据与现实数据在冲突严重程度和行为反应上的分布等效性。图灵测试证实，三阶段框架产生的规避行为与现实交互无异。这些结果展示了该框架在生成高保真安全关键数据方面的有效性，为开发基于ADS和仿真的安全评估提供了宝贵资源。
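
Illustrative note (not from the paper): the abstract above reports trajectory errors as ADE and FDE. Under their usual definitions, ADE is the mean Euclidean distance between predicted and reference positions over all timesteps, and FDE is that distance at the final timestep. The sketch below assumes 2D positions in metres sampled at matching timesteps; the function name `ade_fde` and the sample trajectories are hypothetical and are not taken from the VPSCI pipeline.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) points in metres.

    Hypothetical helper: ADE averages the point-wise Euclidean distance over
    all timesteps, FDE is the distance at the last timestep.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return sum(dists) / len(dists), dists[-1]


# Made-up example trajectories (three timesteps each).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]
ade, fde = ade_fde(pred, truth)
print(f"ADE = {ade:.3f} m, FDE = {fde:.3f} m")
```

Because FDE looks only at the endpoint, a model can have a low ADE and a comparatively high FDE when its error grows over the horizon, which is why the two are usually reported together.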

Step-wise Rubric Rewards for LLM Reasoning

LLM推理的分阶段评分标准奖励

Authors: Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, Xinbo Xu, Kean Shi, Ruoyu Wu, Xiaoying Zhang, Wenqi Shao, Baobao Chang, Nan Duan, Jiaqi Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17291
Pdf link: https://arxiv.org/pdf/2605.17291
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.
中文摘要 带可验证奖励的强化学习（RLVR）被广泛用于提升大型语言模型的推理能力，但仅奖励最终答案的正确性，中间步骤没有监督。基于评分标准的方法如“评分标准即奖励”（RaR）通过基于结构化标准对推广进行评分，引入了更细粒度的监督，但评分标准仍被汇总为单个标量，应用于整个响应，导致三个弱点：多准则结构丧失、正确与错误步骤的统一监督，以及通过无界限自我纠正实现奖励黑客。在1000道题中，正确答案中有18.2%的步骤错误但获得正面奖励，而错误答案步骤中有49.9%正确但被惩罚。我们引入了分步评分标准作为奖励（SRaR），这是一个RLVR框架，（i）使用LLM评判将每个评分标准项归因于特定推理步骤，（ii）对各步评分进行规范化，仅在质量变化的步骤产生学习信号，（iii）通过解耦优势估计器将每步奖励与结果奖励结合，保持结果基线稳定。我们进一步通过对比性地从强模型中抽样的正确和有缺陷推理路径中提炼出16K问题的评分标准数据集。在六个数学推理基准测试中，SRaR使Qwen3-8B的平均准确率提升了3.57分，Qwen3-32B提升了2.75分，AIME 2025的忠实推理率从34.5%提升至46.7%，自我纠正循环从48.1%降至26.5%。

Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

在群体推广中利用错误多样性进行强化学习

Authors: Wenpu Liu, Yuqi Xu, Weichu Xie, Yongfu Zhu, Shuai Dong, Ziyue Wang, Wenqi Shao, Xiaoying Zhang, Tong Yang, Nan Duan, Jiaqi Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17333
Pdf link: https://arxiv.org/pdf/2605.17333
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) typically samples multiple responses per prompt and assigns binary rewards based on individual correctness, yet the collective structure of the group output, specifically the distribution of errors, is largely discarded. We identify this as a missed opportunity: empirical analysis reveals that error diversity within a group is a strong predictor of training success, with problems eliciting diverse wrong answers benefiting substantially more from RLVR than those producing homogeneous failures. Motivated by this observation, we propose Error Diversity Advantage Shaping (EDAS), a lightweight, algorithm-agnostic technique that modulates the advantage signal for incorrect rollouts based on intra-group error diversity. EDAS amplifies penalties for dominant, repeated errors and attenuates penalties for rare, exploratory ones, thereby encouraging the model to maintain diverse reasoning paths and discouraging error perseveration. Crucially, EDAS operates as a simple post-hoc adjustment that can be seamlessly integrated into any RLVR algorithm. We validate EDAS on top of several mainstream RLVR methods across a series of models and seven challenging math benchmarks, demonstrating consistent improvements. Notably, EDAS yields an average improvement of 6.29 points over DAPO on Qwen3-8B across seven benchmarks, confirming that exploiting the latent information in group rollouts is a broadly effective strategy for strengthening RLVR.
中文摘要 可验证奖励强化学习（RLVR）通常每个提示采样多个回答，并根据个别正确性分配二元奖励，但群体输出的集体结构，特别是错误分布，基本被舍弃。我们认为这是一个错失的机会：实证分析显示，组内错误多样性是训练成功的强有力预测因子，导致多样错误答案的问题比产生同质性失败的问题更能从RLVR中受益。基于这一观察，我们提出了错误多样性优势塑形（EDAS）技术，这是一种轻量级、算法无关的技术，基于组内错误多样性调制错误推断的优势信号。EDAS放大了主导性重复错误的惩罚，减轻罕见且探索性错误的惩罚，从而鼓励模型保持多样的推理路径，抑制错误的固执。关键是，EDAS作为一种简单的事后调整，可以无缝集成到任何RLVR算法中。我们在多个主流RLVR方法基础上验证了EDAS，涵盖一系列模型和七个具有挑战性的数学基准，显示出持续的改进。值得注意的是，EDAS在七个基准测试中，Qwen3-8B的平均提升比DAPO提升6.29分，证实利用小组推广中潜在信息是加强RLVR的广泛有效策略。
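
Illustrative note (not from the paper): the EDAS abstract above describes amplifying penalties for dominant, repeated errors and attenuating them for rare ones, applied as a post-hoc adjustment to the advantages of incorrect rollouts. The abstract does not give the formula, so the sketch below is only one possible reading. The scaling rule, the `alpha` parameter, and the function name `shape_advantages` are all assumptions made for illustration.

```python
from collections import Counter


def shape_advantages(answers, correct, advantages, alpha=0.5):
    """Rescale advantages of incorrect rollouts by how common their error is.

    Assumed rule (not the paper's): an incorrect rollout's advantage is
    multiplied by 1 + alpha * (share - 1/k), where `share` is the fraction of
    incorrect rollouts giving the same wrong answer and k is the number of
    distinct wrong answers. Correct rollouts are left unchanged.
    """
    wrong = [a for a, ok in zip(answers, correct) if not ok]
    counts = Counter(wrong)
    if not counts:
        return list(advantages)
    uniform = 1.0 / len(counts)
    shaped = []
    for ans, ok, adv in zip(answers, correct, advantages):
        if ok:
            shaped.append(adv)
        else:
            share = counts[ans] / len(wrong)
            shaped.append(adv * (1.0 + alpha * (share - uniform)))
    return shaped


# Made-up group of five rollouts: one correct, "7" is the dominant error.
answers = ["4", "7", "7", "7", "9"]
correct = [True, False, False, False, False]
advantages = [1.0, -0.25, -0.25, -0.25, -0.25]
shaped = shape_advantages(answers, correct, advantages)
print(shaped)
```

With `0 < alpha < 1` the multiplier stays positive, so a negative advantage never flips sign; the repeated wrong answer is penalized more strongly and the rare one less.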

Learning Fill-in Reduction Ordering via Graph Policy Optimization for Sparse Matrices

通过图策略优化学习稀疏矩阵的填充归约排序

Authors: Ziwei Li, Shuzi Niu, Huiyuan Li, Tao Yuan, Wenjia Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17362
Pdf link: https://arxiv.org/pdf/2605.17362
Abstract Matrix reordering in large sparse solvers seeks a permutation that minimizes factorization fill-in to reduce memory and computation. Because the minimum fill-in ordering problem is NP-complete and fill-in is implicit in the sparsity pattern, graph-theoretic heuristics are used. Existing reinforcement learning methods either ignore sparsity patterns--missing the global fill-in--or lack local exact fill-in feedback. We propose a graph policy optimization method, modeling fill-ins from global and local views: both the policy and value networks use a multi-hop graph neural backbone to embed global fill-in; the policy further interacts with symbolic factorization over graphs to extract local, step-level fill-ins, and the resulting feedback is aligned with the value network via an adaptive saturation function to improve convergence. On the SuiteSparse Matrix Collection, our method achieves mean reductions of 29.3% in fill-ins and 31.3% in peak memory usage over state-of-the-art baselines.
中文摘要 大型稀疏求解器中的矩阵重序寻求一种置换，以最小化分解填充，以减少内存和计算。由于最小填充排序问题是NP完全的，且填充在稀疏模式中隐含，因此采用了图论启发式方法。现有的强化学习方法要么忽略稀疏模式——缺少全局填充——要么缺乏局部精确填充反馈。我们提出了一种图策略优化方法，从全局和局部视角建模填充：策略网络和价值网络均使用多跳图神经骨干来嵌入全局填充;该策略进一步通过图上的符号分解来提取局部的步级填充，并通过自适应饱和函数与值网络对齐，以提升收敛性。在SuiteSparse Matrix Collection中，我们的方法在最先进基线上实现了填充平均减少29.3%和峰值内存使用31.3%的平均水平。

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

多智能体强化学习中的异构信息瓶颈协调图

Authors: Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.17393
Pdf link: https://arxiv.org/pdf/2605.17393
Abstract Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
中文摘要 协调图是合作多智能体强化学习（MARL）中的核心抽象，但现有稀疏图学习器缺乏理论基础机制来决定哪些边应存在以及每条边应承载多少信息。当前方法依赖于启发式标准，这些标准无法对所学拓扑提供正式保证，也没有原则性的方式来分配不同的通信能力给结构上不同的代理关系。为此，我们提出了异构信息瓶颈协调图（HIBCG），它学习一个理论上同时证明边存在性和消息容量的群意识稀疏图。以图信息瓶颈（GIB）为底层工具，HIBCG首先构建了一个组对齐块对角先验，提供封闭式的边保留准则——确定每个组块应存在哪些边及其密度——然后控制每个代理对该拓扑的特征带宽，压缩消息以保留仅任务相关内容。我们证明群对齐先验严格收紧拓扑学习中的变分界限，目标在每个群块分解，从而实现微分边控制，且容量分配遵循水填充原则。
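
Illustrative note (not from the paper): the abstract above states that capacity allocation follows a water-filling principle. The sketch below shows the classic water-filling allocation from information theory, in which a fixed budget is poured over channels with different noise floors so that every active channel reaches a common level. It is offered only as background on the general principle; how HIBCG maps agents and message features onto such a problem is not specified in the abstract, and the function `water_filling` and its inputs are hypothetical.

```python
def water_filling(noise, budget):
    """Classic water-filling: allocate `budget` over channels with given noise.

    Each channel i receives max(0, level - noise[i]), with `level` chosen so
    that the allocations sum to `budget`. Hypothetical helper for illustration.
    """
    if not noise or budget <= 0:
        raise ValueError("need at least one channel and a positive budget")
    order = sorted(range(len(noise)), key=lambda i: noise[i])
    active = len(noise)
    level = 0.0
    while active > 0:
        # Common level if only the `active` quietest channels are used.
        level = (budget + sum(noise[i] for i in order[:active])) / active
        if level > noise[order[active - 1]]:
            break
        active -= 1  # noisiest active channel would get nothing; drop it
    alloc = [0.0] * len(noise)
    for i in order[:active]:
        alloc[i] = level - noise[i]
    return alloc


# Made-up example: three channels, total budget of 3 units.
alloc = water_filling([1.0, 2.0, 5.0], 3.0)
print(alloc)
```

In this example the noisiest channel receives nothing, and the two quieter channels are filled to the same level of 3.0.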

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

ClaHF：一种基于人类反馈的强化学习框架，用于改进分类任务

Authors: Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Jiayin Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17458
Pdf link: https://arxiv.org/pdf/2605.17458
Abstract Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model's ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at this https URL.
中文摘要 文本分类模型通常通过监督微调（SFT）进行训练。然而，SFT本质上是从实例标签进行行为克隆，因此无法充分捕捉样本间的相对偏好关系，这限制了模型塑造决策边界和校准预测置信度的能力。本文提出了ClaHF，一种基于人类反馈的强化学习（RL）文本分类框架，将偏好建模和强化学习优化整合进分类流程，无需额外人工注释。与以往仅依赖实例监督的工作不同，ClaHF构建了多个候选人预测及其相对排名关系，并共同建模了奖励模型（RM）中非最优候选人的前一偏好和排序。该设计将传统标签监管转化为直接适用于策略优化的偏好信号。我们对涵盖三类情景的八个分类任务进行了系统评估。结果表明，ClaHF在不同语言模型（LM）中持续提升分类性能和置信校准。数据和代码可在该 https URL 访问。

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA：通过动态分组残差优化实现视觉-语言-行动模型的跨任务尺度化

Authors: Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17486
Pdf link: https://arxiv.org/pdf/2605.17486
Abstract Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.
中文摘要 强化学习（RL）的最新进展为优化视觉-语言-行动（VLA）模型提供了一种有原则的方法，促进了任务环境中从轨迹模仿向主动学习的转变。尽管控制精度有所提升，大多数强化学习优化器仍保持任务特定性，这使得VLA模型从通用控制器简化为对狭窄任务集的拟合策略。本研究深入分析该现象，强调跨任务特征表示在提升VLA模型泛化性中的重要性。基于这一发现，我们引入了DyGRO-VLA，一种两阶段优化框架，1）基于信息论原理有效捕捉跨任务潜在表示，2）通过混合RL残差动态优化策略优化。DyGRO-VLA使强化学习优化器能够利用任务相关的潜在信息，同时在优化过程中策略性地减少对学习表征的不利干扰。我们基于LIBERO、RoboTwin2基准测试评估了我们的方法，并在实际世界中进一步验证，证明在多任务训练和分布转移下，相较于强基线持续提升。

Self-supervised Hierarchical Visual Reasoning with World Model

自监督层级视觉推理与世界模型

Authors: Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, Houqiang Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17537
Pdf link: https://arxiv.org/pdf/2605.17537
Abstract 3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at this https URL.
中文摘要 拥有对立对手的3D开放世界环境，由于其庞大的状态空间，仍然是强化学习的核心挑战。在此类环境中，有效的推理表征至关重要。虽然现有的自监督视觉前瞻推理方法常常存在多步错误累积的问题，但许多近期研究转向注入领域特定知识以获得更稳定的指导。我们的关键见解是，视觉推理表现的写实性是次要的;真正重要的是提供有信息、与任务相关的信号。为此，我们提出了ResDreamer，一种分层世界模型，每个高层都被训练以重建下层的残差。这种设计使得对日益复杂的世界动态进行渐进抽象，并促进了更丰富的潜在表征的出现。ResDreamer从“苦涩的教训”中汲取灵感，以纯粹自我监督的方式训练推理表征。高层残差表示用于调制低层预测，使世界模型能够有效扩展，同时仅线性增加跨层通信成本。实验表明，ResDreamer实现了最先进的采样效率和参数效率。这种可扩展的层级视觉前瞻推理架构为更强大的在线强化学习代理在开放式、动态环境中的发展铺平了道路。代码可在 \url{this https URL} 访问。

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

GRPO能有多离谱？Mu-GRPO 用于高效 LLM 强化学习

Authors: Minghao Tian, Yunfei Xie, Chen Wei
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.17570
Pdf link: https://arxiv.org/pdf/2605.17570
Abstract Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
中文摘要 群体相对策略优化（GRPO）是近年来大型语言模型中可验证奖励强化学习（RLVR）取得进展的关键推动力，但它通常在低陈旧度、接近策略的环境中训练，导致系统开销较大。我们提出一个简单的问题：GRPO能有多不符合政策？我们证明了GRPO风格算法能容忍远大于先前假设的滚动陈旧性，并提出了Mu-GRPO这一强化学习训练框架，将训练组织为少量（例如四个）大型序列生成优化阶段。这种设计导致了高的滚动停滞，同时大大减少了部署优化切换的开销。为了稳定陈旧数据下的学习，Mu-GRPO结合了宽松裁剪（保留有用的陈旧滚动梯度）与负优势否决（negative-advantage veto），后者去除负优势响应中触发后不稳定的后缀更新。在五个语言模型和多个数学推理基准测试中，Mu-GRPO的性能与标准GRPO相当甚至超过，同时在墙钟训练时间上实现约2倍的加速，为LLM强化学习树立了显著提升的性能与效率权衡。

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO：基于推理的生成推荐的阶梯对齐策略优化

Authors: Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17648
Pdf link: https://arxiv.org/pdf/2605.17648
Abstract Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, the outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.
中文摘要 生成式推荐将下一题预测视为自回归题目标识符生成。具体来说，项被编码为语义标识符（SID），这是一种简短的粗到细令令序列，早期令牌捕捉广义语义，后期令牌则细化。近期工作通过推理痕迹补充这一范式，并通过强化学习与可验证的奖励进行优化，通常是带有精确匹配反馈的结果-奖励算法。然而，在大型目录推荐中，对生成SID的精确匹配反馈仅报告最终条目是否正确;当生成的SID不匹配时，结果奖励无法识别导致错配的SID-token预测，可能会对匹配的SID-token位置及不匹配位置进行惩罚。我们发现，在此环境中，信用分配的自然单位是一个推理步骤（一个思维块配一个SID代币）。我们在SAPO（步骤对齐策略优化）中实现了这一理念：SAPO不是向整个响应广播一个优势，而是为每个推理步骤计算一个单独的群体相对优势，并仅应用于对应的思考块和SID代币。在三个真实世界的推荐数据集中，SAPO稳定了强化学习训练，并持续优于现有生成推荐基线，在稀疏的精确匹配反馈使推理步骤学分赋值变得重要时，取得最大提升。我们的结果表明，结构化生成的强化学习目标应与解码器自身对输出的分解相匹配。
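
Illustrative note (not from the paper): the SAPO abstract above describes computing a separate group-relative advantage for each reasoning step instead of broadcasting one advantage to the whole response. The sketch below shows one way such a computation could look. The per-step reward used here (1 if the SID token and all earlier tokens match the target, else 0) is an assumption, as are the function names and the example SIDs; the abstract does not state the actual step reward.

```python
from statistics import mean, pstdev


def step_rewards(sid, target):
    """Assumed step reward: 1.0 while the generated prefix matches the target."""
    rewards, prefix_ok = [], True
    for tok, tgt in zip(sid, target):
        prefix_ok = prefix_ok and tok == tgt
        rewards.append(1.0 if prefix_ok else 0.0)
    return rewards


def step_advantages(group_sids, target, eps=1e-8):
    """Group-relative advantage computed independently for each step position.

    Returns a list (one entry per rollout) of per-step advantages. A step
    whose reward is identical across the group gets zero advantage.
    """
    rewards = [step_rewards(sid, target) for sid in group_sids]
    n_steps = len(target)
    adv = [[0.0] * n_steps for _ in group_sids]
    for t in range(n_steps):
        column = [r[t] for r in rewards]
        mu, sd = mean(column), pstdev(column)
        if sd > eps:
            for i, r in enumerate(column):
                adv[i][t] = (r - mu) / sd
    return adv


# Made-up group of four rollouts for a three-token target SID.
target = (3, 1, 4)
group = [(3, 1, 4), (3, 1, 5), (3, 2, 4), (3, 1, 4)]
adv = step_advantages(group, target)
for sid, a in zip(group, adv):
    print(sid, [round(x, 3) for x in a])
```

In this example the second rollout misses only the last token, so it keeps a positive advantage at the second step and is penalized only at the third, whereas a single outcome reward would have penalized all three positions together.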

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调口袋感知扩散模型

Authors: Yuan Xue, Daniel Kudenko, Megha Khosla
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17693
Pdf link: https://arxiv.org/pdf/2605.17693
Abstract Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at this https URL.
中文摘要 基于结构的药物设计已被口袋感知的三维生成模型加速，但大多数方法主要满足训练分布，可能无法满足现实世界治疗药物发现所需的多重属性。近年来，越来越多的关注聚焦于基于结构的分子优化（SBMO），它旨在对多个特定分子性质进行细粒度控制。本文介绍了DEPPA，一种基于去噪扩散策略优化的新型SBMO方法，用于通过强化学习微调预训练的口袋感知扩散模型。DEPPA能够优化多种特性，包括结合亲和力、药物相似性、合成性和多样性。我们将预训练口袋感知扩散模型的逆去噪过程构建为多步马尔可夫决策过程，在最终生成的配体分子上评估作为奖励信号的期望性质。DEPPA在强化学习微调过程中采用粗噪去噪调度器，以实现高效且有效的分子优化。CrossDocked2020基准测试的实验结果显示，DEPPA在结合亲和力（Vina评分-8.5 kcal/mol）、药物相似性和多样性方面优于基线，同时在合成性方面表现出竞争性。源代码可在此 https URL 获取。

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

熵-梯度反演：迈向大型推理模型的内部机制

Authors: Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.17770
Pdf link: https://arxiv.org/pdf/2605.17770
Abstract The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.
中文摘要 大型推理模型（LRM）的发展催生了从反应式“快速思考”文本生成向系统化、逐步“慢思考”推理的范式转变，解锁了复杂数学和逻辑任务中的尖端性能。然而，该领域面临着\textit{token级行为分析与内部推理机制之间的根本性差距，以及依赖昂贵外部验证器进行推理优化的强化学习（RL）的不稳定性}。我们识别并正式定义了 \textbf{熵-梯度反演}，这是一种在符号熵与 logit 梯度之间存在的稳健负相关，作为 LRM 推理能力的权威几何指纹。基于此，我们提出了 \textbf{相关正则化组策略优化（CorR-PO）}，将该反演签名嵌入强化学习奖励正则化中。在多个模型尺度上的各种推理基准测试中，广泛实验显示CorR-PO始终优于最先进的基线，证实更强的反演与更优越的推理表现直接相关。

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

HydroAgent：通过模拟器基础强化学习缩小前沿大型语言模型与人类水文模型校准专家之间的差距

Authors: Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang, Anjiang Wei, Jinwoong Yoo, Yang Hong
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Arxiv link: https://arxiv.org/abs/2605.17792
Pdf link: https://arxiv.org/pdf/2605.17792
Abstract Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km2 ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.
中文摘要 校准分布式水文模型是水资源管理运营中的关键瓶颈——溪流流量预测、水库运行、干旱监测、基础设施设计和洪水预报都依赖于它。每个流域都需要专家将水文曲线特征转换为高维参数矢量的调整，且最终的工作流程不会在流域间传递。我们问：前沿大型语言模型（LLM）代理能否取代人类水文建模者？如果不能，需要什么？我们基于美国国家气象局用于突发洪水预报的运行CREST分布式水文模型，对九个前沿LLM代理——Claude Opus 4.6/4.7、Sonnet 4.6、GPT-5/5.4/5.4-pro和Gemini 2.5-pro/3.1-pro/3-flash进行了基准测试。二十局两胜制纳什-萨特克利夫效率（NSE），涵盖4个外表，范围为329至40,792平方公里，范围为-0.16（GPT-5.4）至0.75（Sonnet 4.6）;天花板在三个厂商和能力层级中均有重现，最强型号集中在0.65-0.75区间，除了Opus-4.7在一个仪表上，没有模型达到人类专家的参考标准。我们认为，这一差距不是参数计数问题，而是域接地问题。随后，我们提出了HYDROAGENT，对2,576条专家校准轨迹进行监督微调开权重Qwen3-4B，以及利用NSE作为在线CREST模拟可验证奖励的群-相对策略优化——带模拟反馈的强化学习（RLSF）。对于地球系统科学来说，采用带有模拟器在环的强化学习（RL）的小型领域调优策略，比起扩展通用前沿模型，是更高效且物理忠实的路径，而地球数据的多模态丰富性——遥感、原位时间序列和预报员叙述——使领域代理成为物理科学中人工智能的可利用方向。
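
Illustrative note (not from the paper): the HydroAgent abstract above uses Nash-Sutcliffe Efficiency (NSE) both as the benchmark metric and as the verifiable reward. NSE has a standard definition: one minus the ratio of the squared simulation error to the variance of the observations around their mean. A value of 1 means a perfect match, and 0 means the simulation is no better than predicting the observed mean. The sketch below implements that definition; the function name `nse` and the sample streamflow values are hypothetical and unrelated to the CREST experiments.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency of a simulated series against observations.

    NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    sq_err = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sq_dev = sum((o - mean_obs) ** 2 for o in observed)
    if sq_dev == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - sq_err / sq_dev


# Made-up streamflow observations.
obs = [1.0, 2.0, 3.0, 4.0]
perfect = nse(obs, obs)                     # identical series
mean_only = nse(obs, [2.5, 2.5, 2.5, 2.5])  # always predicts the observed mean
print(perfect, mean_only)
```

NSE is unbounded below, so negative scores such as the -0.16 quoted in the abstract indicate a simulation that fits worse than the observed mean.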

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

课程组策略优化：自适应抽样以释放文本到图像生成潜力

Authors: Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17807
Pdf link: https://arxiv.org/pdf/2605.17807
Abstract Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
中文摘要 文本转图像（T2I）生成近年来取得了显著进展。与此同时，基于群体相对策略优化（GRPO）的强化学习方法已引起广泛关注，并成功应用于T2I任务。然而，训练中常用的均匀抽样策略常常忽略样本难度与模型当前学习能力的匹配，导致训练效率较低。我们认为，提升训练效率需要持续优先考虑符合模型不断演进能力且保持可主动学习性的提示。为此，我们提出了课程组策略优化（CGPO），一种自适应课程培训框架。在训练过程中，每个提示都会生成一组由奖励模型评分的图像。我们用团队奖励的差异作为在线时间不一致的代理指标。方差越高，说明模型部分捕捉了提示需求，但尚未达到稳定掌握。这些提示更可能提供有用的学习信号，因此我们相应提高它们的抽样概率。此外，为了解决多类别数据集中的数据不平衡，我们设计了一种基于比例公平性优化的类别校准方法，平衡各类别的训练难度。在GenEval、T2I-CompBench++和DPG Bench上的实验表明，我们的框架有效提升了生成性能。
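
Illustrative note (not from the paper): the CGPO abstract above uses the variance of each prompt's group rewards as an online signal and raises the sampling probability of high-variance prompts. The sketch below shows one simple way to turn group-reward variance into sampling probabilities. Making the probability directly proportional to variance, the `floor` term, and the function name `sampling_probs` are assumptions; the category calibration step mentioned in the abstract is not modeled.

```python
from statistics import pvariance


def sampling_probs(group_rewards, floor=1e-3):
    """Map each prompt's reward variance to a sampling probability.

    group_rewards: dict of prompt id -> list of rewards for that prompt's
    generated group. `floor` (an assumed constant) keeps zero-variance
    prompts from being excluded entirely.
    """
    weights = {p: pvariance(r) + floor for p, r in group_rewards.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Made-up rewards: "a" is mastered, "b" is inconsistent, "c" mostly fails.
rewards = {
    "a": [0.9, 0.9, 0.9, 0.9],
    "b": [0.1, 0.9, 0.2, 0.8],
    "c": [0.0, 0.0, 0.1, 0.0],
}
probs = sampling_probs(rewards)
print({p: round(q, 3) for p, q in probs.items()})
```

Prompt "b", where the group's rewards disagree most, receives the largest probability, matching the abstract's intuition that partially mastered prompts carry the most useful learning signal.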

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练令牌以实现数据绑定扩展

Authors: Zichun Yu, Chenyan Xiong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17849
Pdf link: https://arxiv.org/pdf/2605.17849
Abstract LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at this https URL.
中文摘要 LLM预训练正从计算受限转向数据受限的环境，现有的人类（有机）文本远远达不到扩展需求。然而，达到数据绑定状态并不意味着模型已充分利用其有机语料库。本文介绍了SynPro，一种合成数据生成框架，帮助大型语言模型更深入地从有限的有机数据中学习。SynPro采用两种操作：重述和重格式化，以不同形式呈现同一有机资源，促进更深入的学习，同时不引入外部信息。这两个生成器都通过强化学习优化，提供质量、忠实度和数据影响奖励，并在预训练平台期持续更新，目标内容尚未被模型吸收。我们用DCLM-Baseline中10%的Chinchilla最优标记（0.8B和2.2B）预训练400M和1.1B模型，反映了前沿预训练中一个现实的数据约束状态。我们的结果显示，有机数据在标准重复训练中未被充分利用：SynPro获得的有效令牌数是重复训练的3.7-5.2倍，在1.1B规模上甚至超过了使用等量唯一数据训练的非数据受限oracle。分析证实，忠实且具备模型感知的综合能够维持数据绑定的缩放而不导致分布崩溃。我们将代码开源于这个 https URL。

DAD4TS: Data-Augmentation-Oriented Diffusion Model for Time-Series Forecasting with Small-Scale Data

DAD4TS：面向数据增强的扩散模型，用于小尺度数据的时间序列预测

Authors: Masahiro Suzuki, Bohui Xia, Hiroto Yamamoto, Masanori Miyahara
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17866
Pdf link: https://arxiv.org/pdf/2605.17866
Abstract Small-scale data is a critical problem in time-series forecasting tasks. Data augmentation is an effective strategy for this task, but it has a limitation in generating meaningful data. To address this limitation, we propose DAD4TS, a diffusion-model-based data augmentation method with reinforcement learning, designed for time-series forecasting with small-scale data. In DAD4TS, a data generator is simultaneously trained with a time-series model and controlled by a reinforcement learning model to efficiently generate samples that improve the forecast accuracy of the time-series model. To support small-scale data, we use mathematical methods instead of conventional VAE methods to train the diffusion model by projecting the time-series data into the geometric space. We validated the effectiveness of DAD4TS with seven comparative methods through qualitative and quantitative experiments on six real-world datasets and eight time-series models. As a result, DAD4TS was validated on five datasets.
中文摘要 小尺度数据是时间序列预测任务中的关键问题。数据增强是该任务的有效策略，但在生成有意义数据方面存在局限性。为解决这一限制，我们提出了DAD4TS，一种基于扩散模型的数据增强方法，带有强化学习，专为小尺度数据的时间序列预测设计。在DAD4TS中，数据生成器同时与时间序列模型训练，并由强化学习模型控制，高效生成样本以提高时间序列模型的预测准确性。为了支持小尺度数据，我们使用数学方法而非传统VAE方法，通过将时间序列数据投影到几何空间来训练扩散模型。我们通过六个真实世界数据集和八个时间序列模型的定性和定量实验，验证了DAD4TS的有效性，采用七种比较方法。因此，DAD4TS在五个数据集上得到了验证。

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD：长视野代理人的有针对性事后诸葛亮自我提炼

Authors: Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.17873
Pdf link: https://arxiv.org/pdf/2605.17873
Abstract Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
中文摘要 用强化学习训练长视野LLM代理具有挑战性，因为稀疏的结果奖励只能揭示任务是否成功，却无法说明哪些中间动作导致了该结果，也无法说明应如何纠正这些动作。最新方法通过从回合级动作输出信号生成奖励或文本提示，或使用反馈条件自蒸馏来缓解这一问题。然而，当许多中间回合已经成功或呈中性时，在每个回合都生成反馈效率低下，而在固定或错位的回合施加反馈往往无法监督导致失败的动作。为弥合这一差距，我们提出了HINT-SD，这是一种有针对性的自蒸馏框架，利用全轨迹的事后回顾来选择与失败相关的动作，并仅对目标动作片段应用反馈条件蒸馏。在BFCL v3和AppWorld上的实验显示，我们的方法相较于密集的逐回合反馈基线最多提升18.80%，同时每个训练步骤的耗时降低了2.26$\times$，表明选择在何处进行蒸馏是实现有效且高效的长视野代理训练的关键因素。

An Efficient Streaming Video Understanding Framework with Agentic Control

一个高效的流媒体视频理解框架，具备代理控制

Authors: Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.17921
Pdf link: https://arxiv.org/pdf/2605.17921
Abstract Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
中文摘要 流媒体视频需要在严格的延迟预算下处理动态信息密度。然而，现有方法通常采用静态策略，如固定内存压缩或依赖单一模型，迫使双方权衡：快速模型在复杂查询中失败，而始终在线的重模型则违反实时约束并使简单查询过于复杂。我们不打算事先固定这些决策，而是提出了R3流式（记忆、回应、推理），将流式视频理解表述为一个级联控制问题：对每个查询，系统压缩内存，判断响应准备度，并顺序路由计算，使每个下游决策建立在逐步精炼的信息状态之上。为了优化该流水线，我们引入了带有年龄感知的遗忘策略用于内存压缩，因为对历史帧进行激进压缩可以带来显著的性能提升。对于计算路由，我们提出了TB-GRPO，这是一种目标平衡强化学习目标，能够将困难查询路由到更强的模型，同时防止模式崩溃。广泛评估显示，R3-Streaming在流媒体MLLM中取得了最先进的成绩，OVO-Bench达到57.92，StreamingBench达到76.36%，同时将视觉代币使用率降低了95%至96%。

Transfer Learning for Customized Car Racing Environments

定制赛车环境的迁移学习

Authors: Benedict Florance Arockiaraj, Richard Chang, Wesley Yee
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17928
Pdf link: https://arxiv.org/pdf/2605.17928
Abstract Transfer learning, a technique in which a model or agent reuses the knowledge it gained on one task to solve another closely related task, is often used in deep learning. In this project, we explore transfer learning in the context of deep reinforcement learning. Specifically, we use transfer learning to achieve fast lap times in OpenAI's Car Racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare model-based and model-free approaches, and observe that model-based approaches perform better and converge faster than model-free approaches in this environment. We observe that, in most setups, transfer learning not only boosts performance on the target domain but also yields high performance during learning.
中文摘要 迁移学习是一种技术，模型/智能体利用从一个任务中获得的知识/专长，并利用它来解决另一个密切相关的任务，常被用于解决深度学习中的问题。通过本项目，我们探索深度强化学习领域的迁移学习。具体来说，我们希望利用转移学习在OpenAI的赛车环境中实现快速圈速，方法是在一个赛道上训练代理，并通过零机会传输或额外微调在其他定制目标环境中进行比赛。此外，我们比较了基于模型的方法和无模型方法的性能，观察到基于模型的方法在性能上占主导地位，并且在此环境中收敛速度更快。我们观察到，大多数情况下迁移学习不仅提升目标领域的性能，还在学习过程中展现出高性能。

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

AtlasVA：为无教师VLM代理提供自我进化的视觉技能记忆

Authors: Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.17933
Pdf link: https://arxiv.org/pdf/2605.17933
Abstract Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: this https URL
中文摘要 视觉语言模型（VLM）代理越来越依赖记忆增强强化学习，在长期任务中重复利用经验，但大多数现有框架将记忆存储为文本，依赖专有教师模型来总结或精炼。这种设计与空间决策不匹配：几何先验被压缩成有损语言，稀疏的交互通常通过延迟的文本反馈来监督，而非密集的视觉基础信号。我们认为，VLM代理的可重用体验应保持视觉基础。基于这一见解，我们提出了AtlasVA，一个无需教师参与的视觉技能记忆框架，将记忆组织为三个互补层次：空间热图、视觉范例和符号文本技能。AtlasVA进一步从轨迹统计和轻量级网格启发式直接演化危险和亲和图集，并将这些自我演化图集作为基于潜力的塑形奖励用于强化学习。这统一了感知、记忆和优化，无需外部LLM监督。在Sokoban、FrozenLake、3D具身导航和3D机器人操作基准测试上的实验显示，AtlasVA始终优于以文本为中心的内存基线和竞争VLM代理，尤其是在空间密集型任务上表现显著。主页：这个 https URL
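
Potential-based reward shaping, which this abstract mentions, adds the term gamma * phi(s') - phi(s) to the environment reward. The sketch below illustrates that formula on a toy grid. Defining the potential as affinity minus danger is an assumption made for illustration, as are the function names and the atlas values; the abstract does not state how AtlasVA combines its atlases.

```python
def potential(cell, affinity, danger):
    """Assumed potential: affinity minus danger at a grid cell."""
    row, col = cell
    return affinity[row][col] - danger[row][col]


def shaped_reward(env_reward, cell, next_cell, affinity, danger, gamma=0.99):
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    phi_s = potential(cell, affinity, danger)
    phi_next = potential(next_cell, affinity, danger)
    return env_reward + gamma * phi_next - phi_s


# Toy 2x2 atlases (hypothetical values): goal at (1, 1), hole at (0, 1).
affinity = [[0.0, 0.2], [0.5, 1.0]]
danger = [[0.0, 0.9], [0.0, 0.0]]

toward_goal = shaped_reward(0.0, (0, 0), (1, 0), affinity, danger)
toward_hole = shaped_reward(0.0, (0, 0), (0, 1), affinity, danger)
```

With these values, a step toward the goal receives a positive shaped reward and a step toward the hole a negative one, even though the environment reward is zero for both.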

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

通过基于一致性的强化学习提升LLMs的代码推理能力

Authors: Zhanyue Qin, Jia Feng, Yibo Lyu, Yun Peng, Dianbo Sui, Cuiyun Gao, Qing Liao
Subjects: Machine Learning (cs.LG); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2605.17958
Pdf link: https://arxiv.org/pdf/2605.17958
Abstract Code reasoning refers to the task of predicting the output of a program given its source code and specific inputs. It can measure the reasoning capability of large language models (LLMs) and also benefit downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning on the task. However, these methods design rewards solely based on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process in the task. Therefore, these methods often result in sparse reward or reward hacking, which limits the full play of enhanced learning capabilities. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which utilizes a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that can effectively alleviate reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning and code reasoning tasks covering 17 programming languages, respectively.
中文摘要 代码推理是指根据程序的源代码和特定输入，预测其输出的任务。它可以衡量大型语言模型（LLM）的推理能力，同时也能帮助后续任务如代码生成和数学推理。现有研究已验证强化学习在该任务中的有效性。然而，这些方法的设计仅基于最终输出或粗粒度信号进行奖励，忽视了任务中逐步推理过程的内在一致性。因此，这些方法常常导致奖励稀疏或奖励黑客，限制了增强学习能力的充分发挥。为了缓解这些问题，我们提出了CodeThinker，一个一致性驱动的代码推理强化学习框架。具体来说，CodeThinker 包含三个关键组件：（1）逐步推理感知模型训练模块，利用一致性追踪范式作为模板，综合训练数据以捕捉逐步推理过程;（2）动态束流采样策略，旨在在固定采样预算下提升采样输出质量;以及（3）一种能够有效缓解奖励黑客行为的一致性奖励机制。在三个流行基准测试上的实验表明，CodeThinker在多个大型语言模型中实现了最先进的性能。例如，在Qwen2.5-Coder-7B-Instruct上部署时，其准确率比最强基线高出4.3%。我们还验证了 CodeThinker 在下游任务中的有效性。结果显示，在未进行额外训练的情况下，CodeThinker在涵盖17种编程语言的数学推理和代码推理任务中，平均准确率提升了5.33个百分点和3.11个百分点。

Generation Navigator: A State-Aware Agentic Framework for Image Generation

世代导航器：一种状态感知的图像生成智能框架

Authors: Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.17969
Pdf link: https://arxiv.org/pdf/2605.17969
Abstract Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
中文摘要 尽管文本生成技术迅速进步，但忠实实现用户意图仍具挑战，常常需要手动多次反复试验。为了自动化这一过程，现有系统依赖简单的提示重写或由手工规则驱动的闭环代理，而非学习适应不断演变的生成过程。本文将图像生成重新表述为状态条件作用生成问题，并提出了生成导航器（Generation Navigator），这是一个多回合的T2I智能体，能够学习动态引导生成轨迹并输出下一个动作。然而，通过强化学习训练该智能体带来了关键的信用分配挑战：仅基于单一状态的轨迹进行天真奖励，会给推广中的所有动作赋予同等的信用，忽视了各回合的质量动态，也未能区分改善轨迹的行动与削弱轨迹或浪费回合而无进展的行为。我们通过PRE-GRPO（峰值-保留-效率组相对策略优化）来解决这个问题，这是一个轨迹级强化学习目标，明确奖励发现高质量图像（峰值）、避免后续质量下降（保留率）以及减少不必要的回合（效率）。实验显示，各基准测试都有显著提升，T2I-ReasonBench的WISE评分为0.90，推理准确率为79.06%。

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

AutoVecCoder：教大语言模型生成显式矢量化代码

Authors: Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.17978
Pdf link: https://arxiv.org/pdf/2605.17978
Abstract Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.
中文摘要 通过单指令、多数据（SIMD）架构实现矢量化是高性能计算的基石。为了充分发挥硬件潜力，开发者常常采用显式的内在向量化，因为基于编译器的自动向量化常因保守静态分析而产生次优结果。虽然大型语言模型（LLMs）在通用代码生成方面表现出显著的熟练度，但由于高质量语料库稀缺和低级硬件指令的严格语义约束，它们在显式向量化方面存在困难。本文提出了AutoVecCoder，一种旨在赋予LLM自动显式向量化能力的新型框架。AutoVecCoder 集成了两个核心组件：VecPrompt，一个自动化数据综合流水线，用于注入领域特定的内在知识;以及VecRL，一个将代码生成与执行效率相结合的强化学习框架。由该框架训练的AutoVecCoder-8B在SimdBench的SSE和AVX子集上实现了最先进的性能，在某些情况下，还能生成超越标准-O3优化的实现，有效克服了传统自动向量化的固有瓶颈。

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

RL4RLA：通过课程设计和基于图的搜索教授机器学习发现随机线性代数算法

Authors: Jinglong Xiong, Xiaotian Liu, Ruoxin Wang, Zihang Liu, Yefan Zhou, Yujun Yan, Yaoqing Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18004
Pdf link: https://arxiv.org/pdf/2605.18004
Abstract Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to high-performing RLA algorithms. In this paper, we present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode inductive bias specific to the RLA domain; (2) Monte Carlo Graph Search, which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods, including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability. Code is available at this https URL.
中文摘要 随机线性代数（RLA）算法是一类现代数值线性代数技术，在科学计算和机器学习中发挥着至关重要的作用，并且被广泛且日益广泛地采用。然而，他们的发现大多仍是一个需要深厚专业知识和灵感的人工过程。虽然强化学习（RL）提供了自动化的路径，但标准方法在高效能RLA算法固有的稀疏奖励环境和庞大的搜索空间中存在困难。本文介绍了RL4RLA，一种通用的RL框架，用于自动化发现可解释的符号RLA算法。与黑箱方法不同，我们的方法从基本线性代数原语构建显式算法，确保表示可验证且可实现。为实现高效的发现，我们引入了：（1）一个数值课程，逐步递增问题难度，以编码针对RLA领域的归纳偏置;（2）蒙特卡洛图搜索，通过识别并合并等效的部分算法来优化探索。我们证明RL4RLA重新发现了最先进的方法，包括草图与预条件求解器、随机Kaczmarz和Newton Sketch，并可针对精度、速度和稳定性的特定权衡优化算法。代码可在此 https URL 访问。

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

破坏交互的对抗性学习框架，用于稳健的多智能体强化学习

Authors: Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.18024
Pdf link: https://arxiv.org/pdf/2605.18024
Abstract Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios.
中文摘要 合作是多智能体强化学习（MARL）的核心，但当外部干扰干扰代理间交互时，学习后的协调可能变得脆弱。此前的稳健MARL方法主要考虑价值导向攻击，当交互结构本身被破坏时，鲁棒性存在空白。本文提出了一种交互破坏对抗学习（IBAL）框架，采用信息理论视角构建阻碍协调的攻击，通过扰动智能体的观察和行为，并训练智能体在此类干扰下可靠执行。从经验角度看，我们的方法在不同攻击环境中提升了现有稳健MARL基线的鲁棒性，即使在缺少代理的场景下也能带来更强的性能。

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

合作多智能体强化学习的LLM引导通信

Authors: Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.18077
Pdf link: https://arxiv.org/pdf/2605.18077
Abstract Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.
中文摘要 通信是多智能体强化学习（MARL）中缓解部分可观测性的关键组成部分，然而以往的方法往往依赖于低效的信息交换或未能传输足够的状态信息。为此，我们提出了基于LLM的多代理通信（LMAC），利用LLM的推理能力设计通信协议，使所有代理能够尽可能准确、统一地重建底层状态。LMAC通过显式状态意识标准迭代优化协议，提升状态恢复，同时缩小代理间知识差异。在多种MARL基准测试上的实验表明，LMAC改善了各代理间的状态重建，并比以往的通信基线实现了显著的性能提升。

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

两两偏好奖励与基于群体的多样性增强，以实现更优的开放式生成

Authors: Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, Shuangyong Song, Yongxiang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18191
Pdf link: https://arxiv.org/pdf/2605.18191
Abstract Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method that is better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. All of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task; experiments show that PPR-GDE achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment from a subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
中文摘要 当前的强化学习（RL）方法在可验证的环境中广泛适用且强大，尤其是在能够提供标量奖励的环境中。然而，在开放式生成任务中，验证回答的正确性依然具有挑战性，且奖励模型的训练会产生大量的计算和注释成本。此外，强化学习（RLVR）常常导致多样性崩溃，产生刻板或僵化的结果，这些结果在开放领域场景中尤为不利。我们提出了更适合开放式生成的成对偏好奖励与基于群体多样性增强（PPR-GDE）的强化学习方法。PPR-GDE不要求标量奖励，将群体层面多样性纳入奖励信号，通过成对偏好奖励保持主观评估的比较结构，通过反复比较且响应顺序互换来减轻评判位置偏差，并引入基于群体的多样性奖励，明确鼓励响应组内的语义分散，所有这些奖励信号都整合进统一的群体相对策略优化中客观。我们在角色扮演任务中实例化PPR-GDE，实验显示PPR-GDE比强化学习基线更能实现更好的比对质量和表达多样性。进一步分析表明，成对偏好在主观视角中对齐偏好至关重要，而多样性度量则在实现更优异的表达多样性和更广泛的语义覆盖中起着关键作用。
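
The swapped-order comparison mentioned in this abstract can be illustrated with a short sketch. It covers only the pairwise preference part, not the diversity reward. The `judge` callable, the 1.0 / 0.5 / 0.0 scoring, and the averaging over group members are hypothetical choices for illustration; the abstract does not specify how PPR-GDE aggregates comparisons.

```python
def pairwise_reward(judge, prompt, a, b):
    """Score response `a` against `b`, querying the judge in both orders.

    judge(prompt, first, second) is a hypothetical callable returning
    "first" or "second" for the preferred response.
    Returns 1.0 if `a` wins in both orders, 0.0 if it loses in both, and
    0.5 when the verdict flips with position (treated here as a tie).
    """
    wins = 0
    if judge(prompt, a, b) == "first":
        wins += 1
    if judge(prompt, b, a) == "second":
        wins += 1
    return wins / 2


def group_rewards(judge, prompt, responses):
    """Average each response's pairwise reward over the other group members.

    Assumes the group contains at least two responses.
    """
    n = len(responses)
    scores = []
    for i in range(n):
        total = sum(
            pairwise_reward(judge, prompt, responses[i], responses[j])
            for j in range(n) if j != i
        )
        scores.append(total / (n - 1))
    return scores


# Toy judges for illustration only.
def biased_judge(prompt, first, second):
    return "first"  # pure position bias: always prefers the first slot


def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"
```

A judge that always prefers the first slot yields 0.5 for every response, so pure position bias gives no response an advantage, while a judge with a genuine preference still produces a ranking.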

Privacy Preserving Reinforcement Learning with One-Sided Feedback

以单方面反馈保护隐私强化学习

Authors: Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18246
Pdf link: https://arxiv.org/pdf/2605.18246
Abstract We study reinforcement learning (RL) in multi-dimensional continuous state and action spaces with one-sided feedback, where the agent receives partial observations of the state and obtains reward information for only a subset of the state-action space at each time step. This setting introduces substantial challenges in both learning efficiency and privacy preservation. To address these challenges, we propose POOL, a novel privacy-preserving RL algorithm. We conduct a comprehensive theoretical analysis of POOL, deriving a sample complexity bound that matches the known lower bounds for non-private RL. Here, E_rho denotes the privacy parameter, H is the time horizon, and alpha is the optimality-gap parameter. Our findings show that it is possible to enforce strong privacy guarantees while maintaining high learning efficiency, marking a significant step toward practical, privacy-aware RL in multi-dimensional environments with one-sided feedback.
中文摘要 我们在多维连续状态和动作空间中研究强化学习（RL），即在单侧反馈下，代理在每个时间步接收到状态部分的观测，并仅获得状态-动作空间子集的奖励信息。这种环境在学习效率和隐私保护方面都带来了重大挑战。为应对这些挑战，我们提出了POOL，一种新型保护隐私的强化学习算法。我们对POOL进行了全面的理论分析，推导出一个与非私有强化学习已知下界相符的样本复杂度上界。这里，E_rho表示隐私参数，H 表示时间视界，alpha 表示最优性-缺口参数。我们的发现表明，在保持高学习效率的同时，可以执行强有力的隐私保障，这标志着在多维环境中实现具有隐私意识的强化学习迈出了重要一步。

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

知识验证：探索知识密集型领域LLM的RLVR

Authors: Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, Jinzhe Li, Gang Li, Jie Ying, Huanjun Kong, Songyang Zhang, Nanqing Dong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.18261
Pdf link: https://arxiv.org/pdf/2605.18261
Abstract Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its applications on knowledge-intensive domains have not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to the limitations of flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLM in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）已展现出在提升大型语言模型（LLMs）推理能力方面具有潜力，尤其是在数学和编码等领域。然而，由于缺乏高质量且可验证的数据，其在知识密集型领域的应用尚未被有效探索。此外，当前的RLVR仅关注最终答案的正确性，导致推理有缺陷和奖励信号稀疏。在本研究中，我们提出了知识到验证（K2V）框架，通过自动化可验证数据综合将RLVR扩展到知识密集型领域，同时实现对LLM推理过程的验证。大量实验表明，K2V在知识密集型领域提升了LLM的推理能力，同时不会显著损害模型的通用能力。本研究还表明，将自动数据综合与推理验证相结合，是提升模型在更广泛领域能力的有前景方向。代码可在此 https URL 访问。

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-搜索：搜索增强推理的政策事后自提炼

Authors: Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.18299
Pdf link: https://arxiv.org/pdf/2605.18299
Abstract Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen--Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
中文摘要 搜索增强推理代理将内部推理与调用外部检索器交错进行，其性能依赖于每个发出查询的质量。然而，在结果-奖励强化学习中，推广过程中的每个搜索决策共享相同的轨迹级奖励，导致单个查询没有步骤特定的信用。近期的过程监督方法通过从政策外部提取步骤级信号，或依赖更大规模的教师模型，或由更强外部系统生成的子问题注释来弥补这一空白。相比之下，我们提出SD-Search，它通过政策上的事后诸葛亮自我提炼，从政策本身推导出步骤级监督，无需外部教师或额外注释。在SD-Search中，单一模型扮演两个角色，仅在条件反射上有所不同：学生只看到推理时可用的上下文，教师则额外条件化一个简洁的事后诸葛亮块，总结从同一问题抽样的一组推广结果和搜索查询。由于教师知道每次推广的展开过程以及哪些成功，其查询分布隐含地标记了哪些决策值得做出，学生通过在搜索查询位置最小化令牌级的Jensen-Shannon发散来恢复这一行为。这在GRPO粗轨迹奖励之上叠加了一个密集的阶跃级信号。关键是，该信号由策略本身在标准强化学习训练循环中生成，无需外部模型推断、辅助注释流水线或额外的训练阶段。
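
The token-level Jensen-Shannon objective described in this abstract can be sketched in plain Python. This is an illustrative computation on toy distributions, not the authors' training code: the function names, the boolean `query_mask`, and the mean reduction over query positions are assumptions, and a real implementation would operate on model logits and would typically hold the teacher distribution fixed.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over one vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distill_loss(student_probs, teacher_probs, query_mask):
    """Mean token-level JSD over positions flagged as search-query tokens."""
    terms = [
        js_divergence(s, t)
        for s, t, is_query in zip(student_probs, teacher_probs, query_mask)
        if is_query
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Three token positions over a 3-word toy vocabulary.
# Only position 1 is marked as a search-query token.
student = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0], [0.1, 0.1, 0.8]]
teacher = [[0.1, 0.1, 0.8], [0.0, 0.5, 0.5], [0.1, 0.1, 0.8]]
mask = [False, True, False]
loss = distill_loss(student, teacher, mask)
```

Positions outside the mask contribute nothing, so the large student-teacher disagreement at position 0 is ignored. The mixture in the JSD is nonzero wherever either distribution is, which keeps the divergence finite even when one side assigns zero probability to a token.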

Alignment Dynamics in LLM Fine-Tuning

LLM微调中的对齐动态

Authors: Yuhan Huang, Huanran Chen, Yinpeng Dong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18309
Pdf link: https://arxiv.org/pdf/2605.18309
Abstract Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
中文摘要 尽管大型语言模型（LLMs）通过监督微调和人类反馈强化学习实现了强对齐，但这种对齐在后续微调下往往脆弱。现有解释要么将对齐脆弱性归因于梯度几何，要么将其描述为模型输出的分布偏移，但很少有解释能够统一地桥接参数空间学习动态与微调过程中函数空间的对齐行为。在本研究中，我们引入了一个易于处理的对齐分数，并推导出其在微调过程中的闭式更新，从而构建了一个统一的对齐动态框架。我们的分析将对齐更新分解为两个相互竞争的组成部分：反弹力（Rebound Force），由当前对齐状态和模型分布的狭窄程度共同控制；以及驱动力（Driving Force），由训练分布与以结果为条件的、关于对齐和非对齐补全的后验之间的吻合程度决定。这种分解解释了为何先前的对齐可以被后续微调逆转，以及为何更窄的后验结构会增强这种逆转。此外，我们的框架预测了一种排练启动效应（Rehearsal Priming Effect）：先前的对齐会留下潜在的后验印记，在重新暴露时放大有效驱动力，从而加快重新对齐。我们在安全对齐、涌现式未对齐和情感设置中验证了这些预测，展示了一致的对齐逆转以及重新暴露下的加速重新对齐。此外，安全对齐的受控实验证实了反弹强度对后验狭窄程度的预测依赖性。这些结果共同提供了关于大语言模型微调过程中对齐如何被破坏和重新激活的统一动力学视角。

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

ISEP：通过随机策略优化实现离线强化学习的隐式支持扩展

Authors: Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18320
Pdf link: https://arxiv.org/pdf/2605.18320
Abstract Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM using Conditional Flow Matching utilizing classifier-free guidance to effectively capture the interpolated value signal.
中文摘要 离线强化学习方法通常会强制执行严格约束以确保安全;然而，这种僵化常常阻碍在行为政策直接支持之外发现最优行为。为此，我们提出了通过随机策略优化（ISEP）进行隐性支持扩展，该方法利用分布内数据与策略样本之间的值函数，隐式扩展可行的行动支持。该机制“密集化”高回报区域，为政策改进创造可导航路径，同时理论上保证有界值误差。然而，针对这种扩展支持进行优化，会形成一个多模态环境，标准确定性平均会导致模式崩溃和无效动作。ISEP通过随机行动选择策略来缓解这一问题，通过随机交替使用保守克隆和乐观扩展信号来优化策略。我们将该框架实例化为ISEP-FM，利用条件流匹配，利用无分类器的指导，有效捕捉插值值信号。

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

超越推理时搜索：强化学习合成可复用求解器

Authors: Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18374
Pdf link: https://arxiv.org/pdf/2605.18374
Abstract Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
中文摘要 大型语言模型（LLMs）通常将组合优化视为推理时的过程，通过采样、搜索或重复提示分别求解每个实例。我们探究强化学习能否将部分推理成本转移到代码LLM的权重中，从而使模型为整个问题族合成一个可复用的求解器。我们在协同依赖选择（SDS）上研究这个问题，这是带约束二次背包问题的一个受控变体，旨在暴露一种特定的失效模式：局部信号和严格的可行性约束使贪心启发式方法具有吸引力但不可靠。在相同的脚手架下，基础模型的Best-of-64采样在与全局虚拟最佳求解器（VBS）约28.7%的差距处饱和；代码审计显示，基础模型经常检索到模拟退火模板，但错误地实现了Metropolis接受准则。我们使用可行性门控奖励和轻量的结构化脚手架，通过组相对策略优化（GRPO）微调Qwen2.5-Coder-14B-Instruct。最终策略在99.8%的可行SDS输出中收敛到约束感知的模拟退火模板，与该VBS的差距为5.0%，且在生成后的执行/搜索成本上比累计的Best-of-64评估便宜91倍。一次编译检查显示，每个种子的一个最佳冻结求解器在SDS测试集上不加修改地重复使用时依然具有很强的竞争力，而在作业车间调度（Job Shop Scheduling）上的额外领域评估则提供了范围较窄但积极的证据，表明该脚手架可以迁移到SDS之外。负面消融揭示了该方案的局限性：标准稳定化手段会降低性能，软可行性门控会失败，结果仍对奖励归一化和特定领域的设计选择敏感。
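
The abstract above notes that the base model often misimplements the Metropolis acceptance rule. For reference, here is a minimal sketch of the textbook rule for a minimization problem. It is not the paper's solver: the function name and the zero-temperature handling are illustrative assumptions, and a maximization problem such as SDS would flip the sign of `delta`.

```python
import math
import random


def metropolis_accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Textbook Metropolis rule for minimization (illustrative helper).

    delta is candidate_cost - current_cost.
    """
    if delta <= 0:
        return True  # improving or equal moves are always accepted
    if temperature <= 0:
        return False  # assumption: at zero temperature only improvements pass
    # Worsening moves are accepted with probability exp(-delta / T).
    return rng.random() < math.exp(-delta / temperature)
```

A common mistake is to accept worsening moves with a fixed probability, or to drop the temperature from the exponent; either breaks the annealing schedule.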

Heterogeneous Tasks Offloading in Vehicular Edge Computing: A Federated Meta Deep Reinforcement Learning Approach

车载边缘计算中的异构任务卸载：一种联邦元深度强化学习方法

Authors: Yaorong Huang, Jingtao Luo, Xuechao Wang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.18437
Pdf link: https://arxiv.org/pdf/2605.18437
Abstract Vehicular edge computing (VEC) enables latency-sensitive vehicular applications by offloading computation-intensive tasks to nearby edge servers. However, real-world vehicular workloads are typically modeled as heterogeneous directed acyclic graph (DAG) tasks with complex dependency structures, making joint offloading and resource allocation highly challenging. Moreover, distributed MEC deployment raises privacy concerns when collaboratively training learning-based policies. In this paper, we propose a Federated Meta Deep Reinforcement Learning framework with GAT-Seq2Seq modeling (FedMAGS) for heterogeneous task offloading in VEC systems. The proposed approach leverages Graph Attention Networks to capture DAG dependencies, a Seq2Seq-based policy to generate structured offloading decisions, and federated meta-learning to enable fast adaptation across distributed MEC servers without sharing raw data. Extensive simulations demonstrate that FedMAGS achieves faster convergence, lower execution delay, and better scalability compared with state-of-the-art baselines. In addition, the federated design preserves data privacy while reducing communication overhead, making the framework well suited for dynamic and large-scale VEC environments.
中文摘要 车载边缘计算（VEC）通过将计算密集型任务卸载到附近的边缘服务器，使延迟敏感的车辆应用成为可能。然而，现实中的车辆工作负载通常被建模为具有复杂依赖结构的异构有向无环图（DAG）任务，这使得联合卸载和资源分配极具挑战性。此外，分布式MEC部署在协作培训基于学习的政策时也引发了隐私问题。本文提出了一个结合GAT-Seq2Seq建模（FedMAGS）的联合元深度强化学习框架，用于VEC系统中的异构任务卸载。该方法利用图关注网络捕捉DAG依赖关系，基于Seq2Seq的策略生成结构化卸载决策，以及联邦元学习实现分布式MEC服务器间快速适配而不共享原始数据。大量模拟表明，与最先进的基线相比，FedMAGS实现了更快的收敛、更低的执行延迟和更好的可扩展性。此外，联邦设计既保护了数据隐私，又降低了通信开销，使该框架非常适合动态且大规模的VEC环境。

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

利用强化学习建模客户轨迹，获得实用零售洞察

Authors: Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18449
Pdf link: https://arxiv.org/pdf/2605.18449
Abstract Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
中文摘要 了解零售空间内的顾客流动对于优化门店布局至关重要。真实世界的轨迹数据可以提供高度准确的洞察，但收集这些数据成本高昂，且对许多零售商来说往往难以实现。诸如旅行推销员问题（TSP）和概率最近邻（PNN）等启发式方法常被用作廉价近似，但实际客户轨迹平均偏离最短路径28%，凸显了准确性与实用性的权衡。我们提出了一种基于主体的建模框架，将客户轨迹预测定位为最大熵强化学习（RL）问题，平衡奖励最大化与随机性，以更好地反映具有有限理性的客户。利用便利店的真实轨迹数据，我们表明强化学习生成的轨迹比TSP和PNN更贴近顾客行为，从而提供了更准确的冲动购买率和货架流量密度估计。此外，只有基于强化学习的预测才能对冲量产品做出与实际轨迹数据推导一致的重新定位决策，从而实现可比的估计利润增长。我们的研究表明，强化学习提供了一种实用且基于行为的替代方案，弥合了过于简化的启发式方法与数据密集型方法之间的鸿沟，使精准布局优化更加易于实现。为了鼓励进一步研究，源代码已在GitHub上公开。
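
The abstract above frames trajectory prediction as maximum entropy RL, which trades reward against stochasticity. As a minimal sketch of that trade-off (not the paper's model), the standard maximum-entropy action distribution is a softmax over action values scaled by a temperature. The function name and the toy values below are assumptions for illustration.

```python
import math


def boltzmann_policy(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values of three candidate next aisles for a shopper.
probs = boltzmann_policy([1.0, 2.0, 3.0], temperature=1.0)
```

A low temperature approaches the greedy, shortest-path-like choice, while a high temperature approaches uniform wandering, which is one way to express bounded rationality.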

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

会说话的调度：一个可解释的程序化强化学习框架

Authors: Chengpeng Hu, Yingqian Zhang, Hendrik Baier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Arxiv link: https://arxiv.org/abs/2605.18454
Pdf link: https://arxiv.org/pdf/2605.18454
Abstract Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at this https URL.
中文摘要 深度强化学习（DRL）最近作为解决组合优化问题（如作业车间调度）的有前景方法出现。然而，DRL学习的策略通常由深度神经网络（DNN）代表，其不透明的神经架构和不可解释的政策决策可能导致人类决策者面临关键的信任和可用性问题。此外，DNN的计算需求还可能进一步阻碍在资源有限环境中的实际部署。在本研究中，我们提出了ProRL，一种新型可解释的程序化强化学习框架，能够实现高性能调度，并配备可读且可编辑的程序策略（即程序）。我们首先引入一种领域特定的调度语言（DSL-S），用以结构化程序表示调度策略。随后，ProRL利用局部搜索探索DSL-S定义的程序空间，识别不完整的程序，并通过贝叶斯优化学习其参数来完成程序。ProRL学习选择哪些调度启发式规则，因此自然地整合了工业场景中已使用的现有启发式规则。广泛使用的基准测试实例的实验显示，ProRL相较于现有启发式和DRL基线表现优异。此外，ProRL在计算资源受限的情况下表现良好，例如仅用100集训练。我们的代码可在此 https URL 访问。

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

DiPRL：通过架构熵正则化学习离散程序化策略

Authors: Chengpeng Hu, Yingqian Zhang, Hendrik Baier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18508
Pdf link: https://arxiv.org/pdf/2605.18508
Abstract Programmatic reinforcement learning (PRL) offers an interpretable alternative to deep reinforcement learning by representing policies as human-readable and -editable programs. While gradient-based methods have been developed to optimize continuous relaxations of programs, they face a significant performance drop when converting the continuous relaxations back into discrete programs. Post-hoc discretization can discard optimized branches and parameters in a program, which results in a collapse of policy expressivity and lowered task performance, leading in turn to a need for additional fine-tuning. To overcome these limitations, we propose Differentiable Discrete Programmatic Reinforcement Learning (DiPRL), a method that learns programmatic policies that become nearly discrete during training, avoiding a separate post-hoc fine-tuning stage. We first analyze the inherent risks of performance drop introduced by post-hoc discretization of gradient-based methods. Then, we introduce programmatic architecture entropy regularization, which enables smooth, differentiable training that encourages convergence toward a discrete program. DiPRL maintains the efficiency of gradient-based optimization while mitigating the risks of post-hoc discretization. Our experiments across multiple discrete and continuous RL tasks demonstrate that DiPRL can achieve strong performance via interpretable programmatic policies.
中文摘要 程序化强化学习（PRL）通过将策略表示为人类可读且可编辑的程序，提供了深度强化学习的可解释替代方案。虽然已有基于梯度的方法被开发出来以优化程序的连续松弛，但在将连续松弛转换回离散程序时，性能会显著下降。事后离散化会丢弃程序中优化的分支和参数，导致策略表达性崩溃和任务性能下降，进而需要额外的微调。为克服这些局限，我们提出了可微离散程序化强化学习（DiPRL）方法，该方法在训练过程中学习程序策略几乎离散化，避免了独立的事后微调阶段。我们首先分析基于梯度方法的事后离散化所带来的性能下降固有风险。随后，我们引入了程序化架构熵正则化，实现平滑且可微的训练，鼓励向离散程序收敛。DiPRL在降低事后离散化风险的同时，保持了基于梯度优化的高效性。我们在多个离散且连续的强化学习任务中的实验表明，DiPRL可以通过可解释的程序策略实现强劲的性能。
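
The abstract above describes an entropy regularizer that pushes a relaxed program toward a discrete one during training. The abstract does not give the exact form, so the following is only a sketch under an assumed setup: each choice point in the program holds logits over candidate branches, and the summed entropy of the softmax distributions is added to the task loss. All names and the coefficient are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero for a one-hot distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(task_loss, branch_logits, coeff):
    """Task loss plus an entropy penalty summed over all choice points."""
    penalty = sum(entropy(softmax(logits)) for logits in branch_logits)
    return task_loss + coeff * penalty
```

Minimizing the penalty drives each branch distribution toward one-hot, so rounding to a discrete program at the end should change the policy only slightly.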

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD：用于词元级信用分配的非对称元反思自蒸馏

Authors: Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, Jiajun Chai, Zhexin Hu, Wei Lin, Shanbin Zhang, Guojun Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18529
Pdf link: https://arxiv.org/pdf/2605.18529
Abstract The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.
中文摘要 面向复杂推理的大型语言模型（LLMs）对齐在很大程度上依赖于可验证奖励的强化学习（RLVR）。然而，像GRPO这样的标准算法会将序列级奖励统一地应用于所有词元，造成严重的信用分配瓶颈。虽然同策略（on-policy）自蒸馏试图通过让自教师以特权上下文为条件来解决这个问题，但直接接触原始的标准答案（oracle）解往往会导致教师分布过度条件化、隐性答案泄漏以及训练后期崩溃。为克服这些限制，我们提出了非对称元反思自蒸馏（AMR-SD）。AMR-SD不直接以原始参考轨迹为条件，而是插入一个反思瓶颈：它将来自验证器结果、同伴采样轨迹（rollout）或参考反馈的诊断信号压缩成简明、自生成的苏格拉底式提示和批评。此外，我们引入了带有非对称、ReLU门控阈值的因果信息增益（CIG），将这些反思转化为稀疏且高精度的词元级优势调制。结合时间退火，该机制保留了基础环境奖励，同时滤除分布噪声。在科学、数学和工具使用基准上的实验表明，AMR-SD显著优于现有基线，实现了稳健的长时程稳定性，并成功防止了后期崩溃。
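
The abstract above mentions a ReLU-gated threshold that turns per-token signals into sparse advantage modulations, combined with temporal annealing. The exact formulation is not given, so this is a loose sketch under stated assumptions: each token already has a scalar gain, only gains above a threshold add a bonus to the shared sequence-level advantage, and the bonus scale decays linearly over training. Every name and the linear schedule are hypothetical.

```python
def modulated_advantages(seq_advantage, token_gains, threshold, scale):
    """Add a ReLU-gated bonus to the sequence-level advantage per token."""
    # max(0, g - threshold) is the ReLU gate: tokens below the threshold
    # keep the unmodified sequence-level advantage.
    return [seq_advantage + scale * max(0.0, g - threshold) for g in token_gains]


def annealed_scale(initial_scale, step, total_steps):
    """Linearly decay the modulation scale to zero (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_scale * (1.0 - frac)
```

Because the gate is one-sided, most tokens are left untouched, which matches the sparsity the abstract describes; once the scale reaches zero, only the base sequence-level reward signal remains.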

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

通过状态依赖的对抗运动先验，实现人形机器人的统一行走、跑步和恢复

Authors: Yidan Lu, Yichao Zhong, Liu Zhao, Wanyue Li, Peng Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.18611
Pdf link: https://arxiv.org/pdf/2605.18611
Abstract We propose a unified reinforcement learning framework that enables a single policy to perform walking, running, and fall recovery on the Unitree G1 humanoid robot, validated on physical hardware without any explicit mode-switching command at deployment. The framework extends Adversarial Motion Priors (AMP) by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately $37^\circ$ from vertical ($|g_z+1|>0.6$); otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only three LAFAN1 reference clips are required to regularize the complete behavior set. At deployment, a single frozen ONNX policy executes at 50 Hz with no runtime mode logic; hardware experiments demonstrate successful recovery from both prone and supine falls and smooth walk-to-run transitions under the same controller.
中文摘要 我们提出了一个统一的强化学习框架，使单一策略能够在 Unitree G1 人形机器人上执行行走、跑步和跌倒恢复，并已在物理硬件上得到验证，部署时无需任何显式的模式切换命令。该框架扩展了对抗运动先验（AMP）：用一个状态依赖的门取代传统的全局参考分布，该门将每个训练转移路由到两个判别器之一：一个专用的恢复判别器，以及一个同时覆盖行走和跑步的速度条件运动判别器。门由投影重力上的单一固定阈值定义：当身体倾斜超过垂直方向约$37^\circ$（$|g_z+1|>0.6$）时，恢复判别器被激活；否则使用运动判别器，归一化的指令速度作为条件，在行走与跑步片段之间选择合适的参考轨迹。只需三个 LAFAN1 参考片段即可正则化整个行为集。部署时，单个冻结的 ONNX 策略以 50 Hz 运行，且无运行时模式逻辑；硬件实验表明，在同一控制器下，机器人能够从俯卧和仰卧跌倒中成功恢复，并实现从行走到跑步的平滑过渡。
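
The gate described above is a single comparison on the z-component of projected gravity. A minimal sketch of that routing, taking the threshold as stated in the abstract ($|g_z+1|>0.6$) and assuming $g_z$ is normalized so that it equals $-1$ when the robot stands upright; the function name and return labels are illustrative, not the authors' code.

```python
def select_discriminator(projected_gravity_z: float, threshold: float = 0.6) -> str:
    """Route a training transition to one of the two discriminators."""
    # |g_z + 1| is zero when upright (g_z = -1) and grows with body tilt.
    if abs(projected_gravity_z + 1.0) > threshold:
        return "recovery"
    return "locomotion"


# Upright stance, moderate lean, and lying flat (g_z = 0).
labels = [select_discriminator(g) for g in (-1.0, -0.5, 0.0)]
```

Per the abstract, this routing applies to training transitions only; the deployed policy carries no runtime mode logic.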

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

ManiSoft：迈向软连续体机器人的视觉语言操作

Authors: Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.18617
Pdf link: https://arxiv.org/pdf/2605.18617
Abstract Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate that ManiSoft will serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at this https URL.
中文摘要 大多数现有视觉语言操作研究主要针对刚性机械臂，其固定形态限制了在杂乱或狭小空间中的适应性。软体机械臂因其可变形性提供了有吸引力的替代方案，但也面临本体感知不可靠和分布式底层驱动等挑战。为了探讨这些挑战，我们引入了 ManiSoft，这是一个面向软体机械臂视觉语言操作的基准。ManiSoft 配备了定制模拟器，通过弹性力约束将逼真的软体动力学与接触丰富的交互结合起来。在此基础上，ManiSoft 定义了四个任务，每个任务突出可变形控制的不同方面，从基本的末端执行器协调到避障。为支持策略训练和评估，ManiSoft 包含一个自动化流程，可生成 6,300 个多样化场景及相应的专家轨迹。为了大规模生成高质量轨迹，我们首先使用高层规划器将每个任务分解为一系列路径点，随后采用底层强化学习策略生成力矩指令以跟踪路径点。对三个代表性策略模型的基准测试显示，它们在干净场景中表现相对不错，但在随机化条件下性能显著下降。可视化分析表明，失败主要源于对本体感知状态的视觉估计不准确，以及未能充分利用可变形性进行自适应避障。我们预计 ManiSoft 将成为一个有价值的测试平台，在视觉语言操作领域弥合刚性机械臂与软体机械臂之间的鸿沟。我们的代码和数据集已在此 https URL 发布。

Leveraging Latent Visual Reasoning in Silence

在沉默中运用潜在的视觉推理

Authors: Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, Jianyang Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.18641
Pdf link: https://arxiv.org/pdf/2605.18641
Abstract Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (this https URL) and Hugging Face (this https URL).
中文摘要 潜在视觉推理通过在文本生成前插入连续的潜在词元，使视觉证据更直接地参与多模态推理。然而，这些潜在词元在推理时是否必要仍不明确。我们证明，用随机噪声替换潜在词元或将其完全移除，在空间推理基准上几乎不会造成性能下降。强化学习在后训练之后进一步削弱了潜在生成行为。这些观察引出了一个核心问题：潜在视觉推理是否仍有意义？我们认为，其价值应当以潜在词元引导学习的有效程度来衡量，而非它们是否作为推理时的格式持续存在。我们的分析显示，潜在推理在不同问题类型上的收益并不均衡，而在任务层面硬性决定是否启用潜在生成的路由方式又很脆弱。基于这些发现，我们提出了一种基于注意力的奖励，鼓励生成的潜在词元在强化学习过程中与后续文本词元交互。这种奖励在潜在模式被激活时促进对潜在词元的利用，同时保留了使用纯文本推理的灵活性。实验表明，即使后训练之后很少生成潜在词元，我们的方法也能提升感知和视觉推理基准上的性能。我们的结果表明，即使在推理时没有显式表达，潜在视觉推理也能在沉默中塑造更好的视觉定位和更准确的文本推理。我们的代码和训练好的模型已在 GitHub（this https URL）和 Hugging Face（this https URL）上公开。

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

COOPO：周期性离线-在线策略优化算法

Authors: Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18675
Pdf link: https://arxiv.org/pdf/2605.18675
Abstract Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.
中文摘要 离线强化学习因静态数据集限制而面临分布转移和性能受限，而在线强化学习则要求严格的环境交互。近年来，离线到在线混合方法的出现连接了这些领域，但存在转移过程中分布漂移和离线知识的灾难性遗忘。我们介绍了COOPO（周期离线-在线策略优化），这是一个在受限离线训练和在线微调之间反复循环的通用框架。每个周期首先通过KL正则化优势加权的离线更新将策略锚定到数据集上以最小化分布偏移，然后在线通过任意策略优化进行微调以实现稳定探索。关键是，定期返回离线训练可以消除遗忘和漂移，同时最大化数据集的再利用。这种循环行为也有助于减少网络环境的互动。理论上，COOPO在标准覆盖假设下实现了更好的在线样本效率，超过纯在线强化学习，并且保证单调改进。大量D4RL基准测试表明，COOPO相比最先进的混合算法减少了在线交互，同时提升最终回报，保持多种离线算法和在线优化器的稳健性。这种循环协同为自适应强化学习树立了新的效率和性能标准。
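
The cyclic schedule described above can be sketched as a plain control loop. This is only a skeleton under stated assumptions: `offline_update` and `online_update` are hypothetical callables standing in for the KL-regularized advantage-weighted offline step and the online policy-optimization step, whose details the abstract does not give.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # whatever object represents the policy


def cyclic_offline_online(
    policy: P,
    offline_update: Callable[[P], P],
    online_update: Callable[[P], P],
    num_cycles: int,
    offline_steps: int,
    online_steps: int,
) -> P:
    """Alternate offline anchoring and online fine-tuning for num_cycles."""
    for _ in range(num_cycles):
        # Phase 1: anchor the policy to the static dataset.
        for _ in range(offline_steps):
            policy = offline_update(policy)
        # Phase 2: fine-tune with fresh environment interaction.
        for _ in range(online_steps):
            policy = online_update(policy)
    return policy
```

Returning to the offline phase at the start of each cycle is what distinguishes this schedule from a one-shot offline-to-online hand-off.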

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory：通过可执行环境合成和稳健强化学习扩展工具使用智能体

Authors: Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18703
Pdf link: https://arxiv.org/pdf/2605.18703
Abstract Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which often uses five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.
中文摘要 通过智能体强化学习（Agentic RL）为大型语言模型配备工具使用能力，面临两个瓶颈：缺乏可扩展且稳健的执行环境，以及缺乏能够捕捉隐含人类推理的真实训练数据。现有方法依赖于昂贵的现实世界API、易产生幻觉的LLM模拟器，或通常是单轮的、或依赖预先收集文档的合成环境。此外，合成轨迹常被过度指定，更像是指令序列而非人类的自然意图，降低了其在强化学习训练中的有效性。我们介绍EnvFactory，一个同时解决这两个挑战的全自动化框架。EnvFactory 自主地从真实资源中探索并验证有状态、可执行的工具环境，并通过拓扑感知采样和校准细化合成自然的多轮轨迹，生成带有隐含意图的、有依据的查询。仅利用7个领域的85个经过验证的环境，EnvFactory生成了2,575条SFT和RL轨迹。尽管所用环境数量远少于以往工作（后者通常是其5倍），EnvFactory仍实现了更优的训练效率和下游性能，在BFCLv3上将Qwen3系列模型最高提升15%，在MCP-Atlas上提升8.6%，在对话基准（包括$\tau^2$-Bench和VitaBench）上提升6%。通过完全自动化环境构建和轨迹合成，EnvFactory 为智能体强化学习提供了可扩展、可延展且稳健的基础。

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

SafeDiffusion-R1：安全扩散训练后在线奖励引导

Authors: Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.18719
Pdf link: https://arxiv.org/pdf/2605.18719
Abstract Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: this https URL.
中文摘要 扩散模型在去除预训练中学到的不安全内容方面已被广泛研究。现有方法需要昂贵的监督数据，要么是不安全文本与安全图像真值的配对，要么是负/正图像对，因此难以规模化。此外，离线生成合成数据的离线强化学习和监督微调方法存在灾难性遗忘，会降低生成质量。我们提出了一种新型在线强化学习框架，通过在负面和正面文本提示上使用组相对策略优化（Group Relative Policy Optimization，GRPO）进行后训练，同时解决数据稀缺和模型退化问题。为了消除对专门的安全/不安全奖励模型进行微调的需求，我们引入了一种引导奖励机制（steering reward mechanism），利用了CLIP嵌入的一个固有特性：在嵌入空间中引导文本表示朝向正面的安全方向、远离负面方向。我们的在线策略方法使模型能够从多样的提示（包括明确的不安全内容）中学习，而不会出现灾难性遗忘。大量实验表明，我们的方法将不当内容降至18.07%（SD v1.4为48.9%），将裸露检测数降至15（基线为646），同时在GenEval上将组合生成质量从42.08%提升至47.83%。值得注意的是，这些安全性提升可推广到七个危害类别的域外不安全提示，在无需监督配对数据或奖励调优的情况下实现了最先进的性能。GitHub：this https URL。
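
The steering reward above is described only as moving representations toward a safe direction and away from an unsafe one in CLIP embedding space. One plausible reading, offered as a sketch and not as the authors' reward, scores an embedding by its cosine similarity to a safe direction minus its similarity to an unsafe direction. The vectors here are plain Python lists standing in for real CLIP embeddings, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def steering_reward(embedding, safe_direction, unsafe_direction):
    """Higher when the embedding leans toward safe and away from unsafe."""
    return cosine(embedding, safe_direction) - cosine(embedding, unsafe_direction)
```

A reward of this shape needs no trained safety classifier, which is consistent with the abstract's claim of avoiding fine-tuned reward models.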

General Preference Reinforcement Learning

通用偏好强化学习

Authors: Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.18721
Pdf link: https://arxiv.org/pdf/2605.18721
Abstract Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
中文摘要 后训练将大型语言模型（LLM）对齐分成了两条基本互不相通的路线。基于可验证奖励的在线强化学习（RL）推动了数学和代码上的涌现推理，但依赖程序化验证器，而这种验证器无法覆盖开放式任务；偏好优化能够处理开放式生成，却放弃了支撑在线强化学习的持续探索。要弥合这一差距，需要一个面向开放式质量的验证器，但标量奖励模型并不适合这项工作。质量是多维的，任何标量分数都只是不完整的代理，会使在线强化学习坍缩到分数最敏感的那个轴上。我们转而采用通用偏好模型（GPM），该模型将响应嵌入$k$个斜对称子空间，并将偏好表示为结构化的、能够刻画非传递性的比较。在此基础上，我们提出了通用偏好强化学习（GPRL），将这种$k$路结构一直贯穿到策略更新。GPRL计算逐维度的组相对优势，在各自尺度上分别归一化，使任何一个轴都无法占主导，并用上下文相关的特征值进行聚合。同一结构还驱动一个闭环漂移监测器，检测单轴利用，并通过对各维度重新加权和收紧信任域实时校正。从$\texttt{Llama-3-8B-Instruct}$出发，GPRL在AlpacaEval~2.0上达到了$56.51\%$的长度控制胜率，同时在Arena-Hard、MT-Bench和WildBench上优于SimPO和SPPO，这得益于它在长时间训练中能够抵抗奖励黑客行为。
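
The abstract says GPRL computes per-dimension group-relative advantages, normalizes each on its own scale, and then aggregates them. The abstract does not give the exact equations, so the following is only a minimal sketch under stated assumptions: `scores` is a hypothetical table of per-dimension scores for one group of sampled responses, `weights` stands in for the context-dependent eigenvalues, and the normalization is the usual GRPO-style mean/standard-deviation rule. The function name and all parameters are illustrative and are not taken from the paper.

```python
from statistics import mean, pstdev


def per_dimension_advantages(scores, weights, eps=1e-8):
    """Aggregate per-dimension group-relative advantages.

    scores[i][d]: hypothetical score of response i on preference dimension d.
    weights[d]:   hypothetical non-negative weight for dimension d
                  (a stand-in for the context-dependent eigenvalues).
    """
    n_dims = len(weights)
    # Collect each dimension's scores across the whole group.
    columns = [[row[d] for row in scores] for d in range(n_dims)]
    # Group mean and standard deviation, computed separately per dimension.
    stats = [(mean(col), pstdev(col)) for col in columns]

    advantages = []
    for row in scores:
        # Normalise every dimension on its own scale so none can dominate.
        per_dim = [
            (row[d] - mu) / (sigma + eps) for d, (mu, sigma) in enumerate(stats)
        ]
        # Weighted sum across dimensions gives one scalar advantage.
        advantages.append(sum(w * a for w, a in zip(weights, per_dim)))
    return advantages


# Three responses scored on two dimensions with very different scales.
group = [[0.9, 10.0], [0.1, 30.0], [0.5, 20.0]]
adv = per_dimension_advantages(group, weights=[0.5, 0.5])
```

In this toy group the second dimension has a numeric range about 25 times larger than the first, yet after per-dimension normalization the two dimensions cancel for every response, which illustrates why normalizing each axis separately keeps a single axis from dominating the update.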

Keyword: diffusion policy

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

SADP：基于基础模型生成的演示学习的、面向可解释机器人的子目标感知扩散策略

Authors: Site Hu, Takato Horii
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.16871
Pdf link: https://arxiv.org/pdf/2605.16871
Abstract Explainable robots require not only successful task execution but also the ability to expose internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.
中文摘要 可解释机器人不仅需要成功执行任务，还需要能够以用户友好的方式展示其内部决策过程。然而，大多数模仿学习方法仅基于任务级演示进行训练，没有显式建模子目标结构或执行进度。标准机器人学习数据集中子目标级监督的稀缺进一步加剧了这一局限，制约了能够在长时程操作中说明自己正在执行哪个子任务的机器人的发展。为解决这一问题，本文提出子目标感知扩散策略（SADP），该框架利用基础模型自主生成带子目标标注的演示，并在这些数据集上训练扩散策略。SADP让动作生成同时以任务级和子目标级描述为条件，从而围绕人类可解释的子目标组织策略执行。一个轻量级辅助头进一步预测子目标完成状态，使机器人能够展示当前执行阶段并监控子目标进展。RLBench仿真实验和UR5e机器人上的真实环境评估表明，SADP的任务成功率高于强大的任务条件扩散基线，同时提供子目标级执行信号，用于监控进度和诊断失败。这些结果表明，内置的（而非事后的）可解释性可以与高任务性能共存。

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

对比概念器激活引导（COAST）：通过隐藏状态解锁视觉-语言-行动模型

Authors: Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.17144
Pdf link: https://arxiv.org/pdf/2605.17144
Abstract Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20 and 40% respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
中文摘要 视觉-语言-行动（VLA）模型利用了网络规模视觉-语言模型（VLM）预训练带来的强大感知先验，但在实践中仍然出奇地脆弱，经常在简单的机器人任务上失败。为缓解这一问题，我们提出了对比概念器激活引导（COAST）。COAST基于“概念器”（conceptor）这一概念，它是一种线性算子，将数据软投影到目标分布的主成分上。COAST利用概念器，从少量成功与失败的执行轨迹（rollout）样例中识别目标机器人任务的成功关键子空间。在推理阶段，它将VLA的潜在表示引导到这些已识别的成功子空间中，以改善任务结果。在三种架构不同的神经策略（流匹配VLA、自回归VLA和扩散策略）上，COAST将仿真和真实机器人的平均任务成功率分别绝对提升了20%和40%以上。激活子空间的几何结构表明，失败模式在任务间共享大量结构，而成功表征则在很大程度上是任务特定的。当任务具有相似的失败模式时，这种结构使得先前拟合的概念器无需重新拟合即可提升新任务的表现。最终，我们的结果表明，当前的VLA在其潜在表征中保留了大量与任务相关的知识，而动作专家的解码瓶颈可以通过将其残差流引导到任务相关子空间来缓解。COAST通过将模型引导到其自身的“成功”分布，提供了一条轻量级、无需训练的路径来释放这些潜在能力。
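
The abstract describes a conceptor as a linear operator that soft-projects data onto the principal components of a target distribution. For reference, the standard conceptor definition from the conceptor literature is sketched below; it is not restated in the abstract. The symbols $R$, $\alpha$ (the "aperture"), and in particular the last line with the mixing weight $\beta$ are assumptions for illustration, not the COAST steering rule itself.

```latex
% Correlation matrix of hidden states x drawn from the target (e.g. success) distribution
R = \mathbb{E}\left[ x x^{\top} \right] = U \Lambda U^{\top}

% Conceptor with aperture \alpha > 0
C = R \left( R + \alpha^{-2} I \right)^{-1}
  = U \, \mathrm{diag}\!\left( \frac{\lambda_i}{\lambda_i + \alpha^{-2}} \right) U^{\top}

% Hypothetical steering of a hidden state h toward the target subspace
h' = (1 - \beta)\, h + \beta\, C h, \qquad \beta \in [0, 1]
```

Every eigenvalue of $C$ lies in $[0, 1)$: directions with large variance $\lambda_i$ pass through almost unchanged, while low-variance directions are damped, which is what "soft projection" refers to.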

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

HCLM：面向双四足机器人协作移动操作的层级框架

Authors: Qixuan Li, Chen Le, Jincheng Yu, Xinlei Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.17300
Pdf link: https://arxiv.org/pdf/2605.17300
Abstract We introduce HCLM, a hierarchical framework for general-purpose cooperative loco-manipulation with dual quadrupedal systems. Coordinating multi-robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed-chain physical interactions. To resolve this, our architecture systematically decouples high-level collaborative reasoning from low-level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)-invariant task-space representation to learn coordinate-agnostic spatial coordination patterns. To translate these frame-agnostic references into physical motion, a task-centric hybrid Whole-Body Controller synergizes a proactive kinematic Model Predictive Control for collision-free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end-effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed-chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi-robot embodied coordination.
中文摘要 我们提出了HCLM，一种面向双四足机器人系统的通用协作移动操作（loco-manipulation）分层框架。由于空间协调、稳健运动和闭链物理交互三者的需求相互冲突，在浮动基座之间协调多机器人协作操作极具挑战性。为此，我们的架构系统地将高层协作推理与低层稳健运动执行解耦。在高层，集中式联合扩散策略利用SE(3)不变的任务空间表示来学习与坐标系无关的空间协调模式。为了将这些与坐标系无关的参考转化为物理运动，一个以任务为中心的混合全身控制器将用于无碰撞速度分配的前瞻式运动学模型预测控制与一个反应式执行层结合起来。关键在于，该反应层保证了对末端执行器精确跟踪的快速响应，同时通过协作导纳方案引入主动力调节，以安全地化解运动学冲突，并严格调节闭链交互中的内部应力。我们在难度逐步增加的仿真场景中验证了该框架，包括协作搬运、装箱和交接，并成功在现实世界中部署了后者。结果显示了可靠的任务执行、严格的构型无关性以及对严重物理扰动的出色鲁棒性，为多机器人具身协同提供了一条高度稳健的路径。

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调口袋感知扩散模型

Authors: Yuan Xue, Daniel Kudenko, Megha Khosla
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.17693
Pdf link: https://arxiv.org/pdf/2605.17693
Abstract Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at this https URL .
中文摘要 口袋感知的三维生成模型加速了基于结构的药物设计，但大多数方法主要是拟合训练分布，可能无法满足现实治疗药物发现所需的多种性质。近来，基于结构的分子优化（SBMO）受到越来越多的关注，其目标是对多个指定的分子性质进行细粒度控制。本文提出DEPPA，一种建立在去噪扩散策略优化（Denoising Diffusion Policy Optimization）之上的新型SBMO方法，通过强化学习微调预训练的口袋感知扩散模型。DEPPA能够同时优化多种性质，包括结合亲和力、类药性、可合成性和多样性。我们将预训练口袋感知扩散模型的反向去噪过程建模为多步马尔可夫决策过程，其中作为奖励信号的目标性质在最终生成的配体分子上进行评估。DEPPA在强化学习微调过程中引入粗粒度去噪调度器，以实现高效且有效的分子优化。在CrossDocked2020基准上的实验结果表明，DEPPA在结合亲和力（Vina Score为-8.5 kcal/mol）、类药性和多样性方面优于基线，同时在可合成性方面表现出有竞争力的性能。源代码可在此 https URL 获取。
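
The abstract states that the reverse denoising process is treated as a multi-step Markov Decision Process with rewards evaluated only on the final ligand. A sketch of how that formulation is usually written in Denoising Diffusion Policy Optimization follows. The notation is an assumption: $c$ denotes the pocket context, $x_t$ the partially denoised molecule, and $r(x_0, c)$ a hypothetical scalar combining the target properties. DEPPA's exact reward and estimator may differ.

```latex
% State, action and policy at denoising step t
s_t = (c, t, x_t), \qquad a_t = x_{t-1}, \qquad
\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t, c)

% Sparse reward: only the step that produces the final ligand x_0 is scored
R(s_t, a_t) =
\begin{cases}
r(x_0, c) & \text{if } t = 1 \\
0 & \text{otherwise}
\end{cases}

% Score-function (REINFORCE) policy gradient over a T-step trajectory
\nabla_\theta J = \mathbb{E}\left[ \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]
```

Because every denoising step contributes a log-probability term to the gradient, a scheduler that uses fewer steps during fine-tuning reduces the cost of each update, which is consistent with the efficiency motivation given for the coarse denoising scheduler.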