Arxiv Papers of Today

Generated: 2026-05-20 18:53:03 (UTC+8); arXiv announcement time: 2026-05-20 20:00 EDT (2026-05-21 08:00 UTC+8)

48 related papers today

Keyword: reinforcement learning

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Authors: Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.18799
Pdf link: https://arxiv.org/pdf/2605.18799
Abstract Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at this https URL .
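
The four-quadrant decomposition described in the abstract can be illustrated with a minimal sketch. The reward weights below are hypothetical placeholders, not values from the paper; only the sign pattern (reward correction and robustness, penalize sycophancy, weak signal for persistent errors) follows the abstract.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"   # initially wrong, correct after criticism
    SYCOPHANCY = "sycophancy"   # initially correct, wrong after criticism
    ROBUSTNESS = "robustness"   # correct before and after criticism
    BOUNDARY = "boundary"       # wrong before and after criticism


# Hypothetical weights (assumption): the abstract gives only the sign pattern.
REWARD = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: -0.1,    # weak signal for persistent errors
}


def classify(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map an (initial turn, critic turn) correctness pair to a quadrant."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Reward the inter-turn transition rather than the final answer alone."""
    return REWARD[classify(initial_correct, critic_correct)]
```

Note that a final-answer reward would score Correction and Robustness identically to each other and Sycophancy identically to Boundary; the transition view is what separates the four cases.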

Composition of Memory Experts for Diffusion World Models

Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18813
Pdf link: https://arxiv.org/pdf/2605.18813
Abstract World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we suggest decoupling future-past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Authors: Chengqian Zhang, Wei Zhu, Kyumin Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18822
Pdf link: https://arxiv.org/pdf/2605.18822
Abstract Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post-training paradigm for improving reasoning capabilities, with critic-free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post-training with full fine-tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine-tuning in post-training for complex reasoning tasks. In this paper, we propose Hybrid-LoRA, an efficient hybrid post-training framework that selectively applies full fine-tuning to a small subset of modules less suited to low-rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid-LoRA Score to rank candidate modules according to their sensitivity to low-rank adaptation under a fixed parameter budget. Experiments show that Hybrid-LoRA closely matches full fine-tuning performance under a 10% full fine-tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state-of-the-art PEFT post-training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.
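
The module-selection step can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the Hybrid-LoRA Score, so `scores` is a hypothetical mapping in which a higher value means a module is less suited to low-rank adaptation, and the 10% budget is interpreted here as a fraction of the module count.

```python
def assign_modules(scores: dict, fft_fraction: float = 0.10) -> dict:
    """Assign full fine-tuning to the top-scoring modules and LoRA to the rest.

    `scores` maps module name -> sensitivity score (hypothetical; higher means
    less suited to low-rank adaptation). At least one module gets full tuning.
    """
    budget = max(1, int(len(scores) * fft_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    full = set(ranked[:budget])
    return {name: ("full" if name in full else "lora") for name in scores}
```

For example, with ten candidate modules and the default budget, exactly one module (the highest-scoring one) is marked "full" and the other nine are marked "lora".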

The fitness landscape of social norms in social dilemmas

Authors: Maximilian Puelma Touzel
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Populations and Evolution (q-bio.PE)
Arxiv link: https://arxiv.org/abs/2605.18834
Pdf link: https://arxiv.org/pdf/2605.18834
Abstract By specifying behaviour across multiple agents, social norms are a coordination approach to resolving social dilemmas. Decentralized and wide adoption can be achieved by norms whose prescription involves interpreting stochastic signals in the environment. Such signals must have enough correlation to orchestrate mutually beneficial coordination and enough disincentivizing uncertainty about the benefits of exploiting that coordination. Evolutionary game theory of matrix games has been used to describe how, by rational agents comparing and adopting norms, a norm can evolve to become dominant in a population. Morsky & Akçay (2019) classify norms according to a set of rationality criteria. Joint player strategies that adopt norms that are consistent with optimal single-player strategies with respect to expected reward naturally satisfy a correlated, rather than Nash, game-theoretic equilibrium condition. Here, we present a version of this theory that clarifies the basic ingredients. We formulate it in the more general Markov game setting more commonly used in reinforcement learning theory. We illustrate the theory by mapping norms over the signal and reward space, while also giving a detailed exposition of the underlying mechanics of the approach. Finally, we give a general solution and analysis of replicator dynamics, which Morsky & Akçay (2019) propose as a means by which these norms could emerge.
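
For readers unfamiliar with replicator dynamics, the standard textbook form is dx_i/dt = x_i (f_i(x) - f_bar(x)), where x_i is the population share of norm i. The sketch below is a generic forward-Euler discretization of that equation, not the paper's general solution; the payoff matrix and step size are illustrative assumptions.

```python
def replicator_step(x, payoff, dt=0.01):
    """One forward-Euler step of dx_i/dt = x_i * (f_i - mean fitness).

    x      : population shares of each norm (should sum to 1)
    payoff : payoff[i][j] is the payoff to norm i against norm j (illustrative)
    """
    n = len(x)
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(x[i] * fitness[i] for i in range(n))
    return [x[i] + dt * x[i] * (fitness[i] - mean_fitness) for i in range(n)]
```

Starting from shares `[0.1, 0.9]` with a payoff matrix in which the first norm always earns more, repeated calls increase the first share while the shares continue to sum to one.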

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Authors: Timofey Tomashevskiy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18841
Pdf link: https://arxiv.org/pdf/2605.18841
Abstract Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and nonstationary settings, the difficulty is amplified because the risk associated with the same action can vary across contexts, while a fixed state-level threshold may be either too conservative or too weak. We propose Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into a time-varying admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the available safety budget is sufficient. We analyze the resulting shielded policy and show that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. We evaluate CPSS in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.
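
The budget-to-threshold idea can be sketched in a few lines. The projection rule here (spread the remaining budget evenly over the remaining steps, then scale by a context factor in (0, 1]) is an assumption chosen for simplicity, not the rule used in the paper; the function names and the zero-cost fallback action are likewise hypothetical.

```python
def project_threshold(remaining_budget, steps_left, context_scale=1.0):
    """Project a cumulative budget into a per-step admissible cost threshold.

    Assumption: even split over remaining steps, tightened by context_scale
    (values below 1.0 make enforcement stricter in demanding regimes).
    """
    if steps_left <= 0:
        return 0.0
    return max(0.0, remaining_budget) / steps_left * context_scale


def shield(ranked_actions, predicted_cost, threshold, fallback):
    """Return the policy's most-preferred action whose predicted cost is admissible.

    ranked_actions : actions ordered by policy preference, best first
    predicted_cost : callable mapping an action to its predicted safety cost
    fallback       : action assumed safe (e.g. braking) if nothing passes
    """
    for action in ranked_actions:
        if predicted_cost(action) <= threshold:
            return action
    return fallback
```

In a rollout, the caller would subtract the realized cost of each executed action from `remaining_budget` and recompute the threshold at every step, so the shield intervenes only when the policy's preferred action is predicted to exceed the current threshold.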

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

Authors: Timofey Tomashevskiy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18842
Pdf link: https://arxiv.org/pdf/2605.18842
Abstract Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental conditions, which can become inadequate under distribution shift. We propose LILAC+, a framework for safe continual reinforcement learning under nonstationarity that combines three adaptive safety mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state safety enforcement. Context-based constraints adjust safety requirements using inferred and predicted environmental context. Adaptation-speed constraints tighten safety requirements when the rate of environmental change exceeds the agent's ability to adapt safely. Budget-to-state enforcement converts cumulative safety requirements into local state-level control constraints that can be enforced at decision time. Together, these mechanisms provide a unified approach for proactive and reactive safety adaptation in continual reinforcement learning. We evaluate the framework in simulated driving environments under stationary, seen nonstationary, and unseen nonstationary conditions. The results show that adaptive safety constraints substantially reduce safety violations under distribution shift while maintaining competitive task performance compared with unconstrained and fixed-constraint baselines. These findings suggest that safe continual reinforcement learning requires adaptive constraint mechanisms that respond not only to current state information but also to predicted environmental context, adaptation demand, and remaining safety budget.

Exact Linear Attention

Authors: Weinuo Ou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.18848
Pdf link: https://arxiv.org/pdf/2605.18848
Abstract This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by leveraging the exact decomposition property of kernel functions, without any approximation error. It identifies and addresses gradient explosion and token attention dilution in prior linear attention methods by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel. Beyond the core attention formulation, the paper presents three engineering innovations: a Hyper Link structure that replaces traditional residual connections to mitigate gradient degradation, a Memory Lobe module based on bidirectional linear attention that captures transformation flow across layers to implement qualitative memory and an implicit reinforcement learning paradigm, and a routing score based bias mechanism for Mixture of Experts to improve interpretability and semantic alignment.
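
The general principle behind exact linear attention can be shown with a toy kernel. If the similarity kernel factorizes exactly as k(q, k) = phi(q) . phi(k), the sum over keys can be computed once and reused for every query, giving cost linear in sequence length with no approximation. The feature map below (elementwise exponential, which is non-negative) is an illustrative assumption and is not claimed to be any of the kernels proposed in the paper.

```python
import math


def phi(x):
    """Illustrative non-negative feature map (assumption, not the paper's kernel)."""
    return [math.exp(v) for v in x]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def quadratic_attention(Q, K, V):
    """Reference: build every query-key weight explicitly, O(n^2) in sequence length."""
    out = []
    for q in Q:
        w = [dot(phi(q), phi(k)) for k in K]
        total = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result via associativity: precompute sum_k phi(k) v^T and sum_k phi(k)."""
    d, dv = len(K[0]), len(V[0])
    fK = [phi(k) for k in K]
    kv = [[sum(fk[i] * v[j] for fk, v in zip(fK, V)) for j in range(dv)]
          for i in range(d)]
    z = [sum(fk[i] for fk in fK) for i in range(d)]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out
```

Both functions return the same values up to floating-point rounding; the second never materializes the n-by-n weight matrix.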

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

Authors: Junjie Zhang, Guozheng Ma, Shunyu Liu, Zetian Hu, Yongcheng Jing, Ting-En Lin, Yongbin Li, Dacheng Tao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18851
Pdf link: https://arxiv.org/pdf/2605.18851
Abstract Recent advances in Reinforcement Learning (RL) have underscored its potential for incentivizing reasoning capabilities of Large Language Models (LLMs). However, existing step-level efforts suffer from costly annotations that limit domain coverage, while scalar scores further impose an information bottleneck, offering insufficient semantic bandwidth to improve intermediate decisions. Alternative language-critique approaches, which rely on frozen or external critics, provide richer textual feedback but lack the scalability needed for sustained policy improvement. In this work, we propose language-driven stepwise trajectory redirection, termed as STRIDE, a novel training framework that shifts process supervision from scalar rewards to learnable stepwise language feedback. Specifically, we co-train a generator and a generative verifier using only outcome-based rewards, eliminating external annotations, while delivering sustained policy improvement through jointly aligned verifier training. The verifier's stepwise language critiques explicitly localize and explain failures, enabling the generator to redirect reasoning trajectories at intermediate steps toward alternative decisions. The trajectory redirection design guarantees harmless policy improvement, even under noisy or suboptimal verifier feedback. Experiments on diverse reasoning benchmarks show that STRIDE significantly outperforms state-of-the-art baselines, as well as achieving breakthroughs on zero-pass-rate problems where scalar methods yield no learning signal in our ablation studies, demonstrating the effectiveness of learnable stepwise language feedback for enhancing LLM reasoning.

SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Authors: Chanuk Lee, Minki Kang, Sung Ju Hwang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.18864
Pdf link: https://arxiv.org/pdf/2605.18864
Abstract Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at this https URL.

Emergence of a Flow-Assisted Casting Strategy for Olfactory Navigation via Memory-Augmented Reinforcement Learning

Authors: Changxu Zhao, Dongxiao Zhao, Xin Bian, Gaojin Li
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2605.18881
Pdf link: https://arxiv.org/pdf/2605.18881
Abstract In dynamic flow fields, various animals exhibit remarkable odor search capabilities despite relying on stochastic detections. Interestingly, there exists an optimal time window for integrating these detections that maximizes search efficiency. To understand the underlying mechanism, we investigate the navigation performance of Reinforcement Learning (RL) agents in unsteady flows under varying memory lengths and flow conditions. Without any predefined models, the agents develop a flow-assisted casting strategy and adaptively adjust both the geometry of their search trajectories and the concentration threshold for initiating casting to maximize the success rate. The agent's average speed toward the odor source exhibits a non-monotonic dependence on memory length, which can be explained by the "sector-search" model.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Authors: Qiuhe Hong, Yuyang Liu, Shuo Yang, Tiantian Peng, Fei Zhu, Yonghong Tian
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.18903
Pdf link: https://arxiv.org/pdf/2605.18903
Abstract Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
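
The per-sample modulation of the KL anchor can be sketched as below. The linear interpolation between a relaxed and a tight coefficient, and the specific bounds, are assumptions for illustration; the abstract states only that high-RP samples receive a tight anchor and low-RP samples a relaxed one.

```python
def kl_coefficient(rp, beta_min=0.001, beta_max=0.1):
    """Map a Reasoning Portability score in [0, 1] to a per-sample KL weight.

    Assumption: linear interpolation; beta_min and beta_max are hypothetical.
    """
    rp = min(1.0, max(0.0, rp))
    return beta_min + (beta_max - beta_min) * rp


def regularized_loss(policy_loss, kl_to_previous_policy, rp):
    """Per-sample objective: policy loss plus an RP-weighted KL penalty."""
    return policy_loss + kl_coefficient(rp) * kl_to_previous_policy
```

A sample whose previous-policy reasoning transfers well (RP near 1) is held close to the previous policy, while a sample with RP near 0 is almost unregularized and free to explore new reasoning pathways.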

TabQL: In-Context Q-Learning with Tabular Foundation Models

Authors: Qisai Liu, Zhanhong Jiang, Timilehin Ayanlade, Ashutosh Kumar Nirala, Yang Li, Aditya Balu, Soumik Sarkar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.18979
Pdf link: https://arxiv.org/pdf/2605.18979
Abstract We propose Tabular Q-Learning (TabQL), a reinforcement learning framework that replaces the conventional parametric Q-network in Deep Q-Learning (DQN) with a tabular foundation model endowed with in-context learning capabilities. The key idea is to represent Q-values through a sequence-to-sequence foundation model operating over a tabularized representation of state-action-Q-value tuples, enabling rapid adaptation from limited online interaction by conditioning on recent experience. TabQL departs from classical DQN by leveraging (i) zero- or few-shot Q-value inference via in-context updates, and (ii) a warm-up phase using standard DQN to bootstrap high-quality context. Particularly, to enhance the context quality, new transitions are generated by executing actions output by TabQL with predicted Q values from DQN. We formalize TabQL, analyze its convergence and sample complexity under mild assumptions, and show that TabQL interpolates between vanilla Q-learning and DQN with in-context learning. Our analysis demonstrates that TabQL achieves improved efficiency compared to DQN by amortizing Bellman updates through in-context learning. Extensive numerical experiments with several benchmarks showcase the effectiveness and efficacy of the proposed TabQL.

Prompt Optimization for LLM Code Generation via Reinforcement Learning

Authors: Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.19102
Pdf link: https://arxiv.org/pdf/2605.19102
Abstract Large Language Models (LLMs) can generate code from natural language, but their performance is highly sensitive to prompt formulation. We propose a reinforcement-learning-based framework that models prompt refinement as a sequential decision-making problem. A Proximal Policy Optimization (PPO) agent iteratively improves prompts using a hybrid action space that combines direct generation, genetic lexical mutation and semantic rewriting, guided by shaped rewards derived from unit-test feedback. We evaluate the framework on MBPP+, HumanEval+, and APPS using CodeT5+, CodeLLaMA, and DeepSeek-Coder as frozen code generators. On the 500-task MBPP+ test set, the PPO agent achieves strict Pass@1 scores of 57.58%, 64.80%, and 85.50%, respectively, outperforming EPiC, Reflexion, and Random-Hybrid. Soft-Pass@1 reaches 67.90%, 73.10%, and 88.20%, respectively. Similar improvements are observed on HumanEval+ and APPS across all backbone models. The results demonstrate that reinforcement learning with shaped test-driven rewards improves functional correctness in LLM-based code generation.

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Authors: Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.19166
Pdf link: https://arxiv.org/pdf/2605.19166
Abstract Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.
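
One plausible reading of a "dual bandwidth exponential" reward is a sum of a wide and a narrow exponential of the squared tracking error: the wide term gives a usable gradient far from the setpoint, and the narrow term sharpens the reward near it to reduce steady-state error. The functional form, weights, and coefficients below are assumptions for illustration and are not taken from the paper.

```python
import math


def tracking_reward(error, w_wide=0.4, w_narrow=0.6, k_wide=0.5, k_narrow=20.0):
    """Sum of a wide-bandwidth and a narrow-bandwidth exponential of squared error.

    All four parameters are hypothetical; tuning them is the kind of adjustment
    the abstract's heuristic rules refer to (reward weights and exponential
    coefficients).
    """
    e2 = error * error
    return w_wide * math.exp(-k_wide * e2) + w_narrow * math.exp(-k_narrow * e2)
```

With these illustrative values the reward is 1.0 at zero error and decays monotonically as the error grows, with most of the narrow term gone once the error exceeds a few tenths of a unit.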

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

Authors: Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.19202
Pdf link: https://arxiv.org/pdf/2605.19202
Abstract This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level quadrotor controller within an autonomous quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the article presents an end-to-end (mapping states to RPMs) quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, this article utilizes a higher navigation guidance layer comprising a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Over a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, collision-free paths that respect the tracking limitations of the lower end-to-end RL policy are generated by the RRT* planner. Through five target inspection scenarios, this article demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.
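
The visitation-sequence step can be illustrated with a brute-force solver, which is adequate for a handful of inspection regions. This sketch assumes straight-line Euclidean distances between region centers and an open tour that does not return to the start; the paper's planner may differ on both points, and path generation between regions (the sampling-based planner's job) is not shown.

```python
import math
from itertools import permutations


def path_length(start, order, regions):
    """Total straight-line length of visiting regions in the given order."""
    total, current = 0.0, start
    for name in order:
        total += math.dist(current, regions[name])
        current = regions[name]
    return total


def best_visit_order(start, regions):
    """Exhaustively search all orderings; feasible only for a few regions."""
    return min(permutations(regions),
               key=lambda order: path_length(start, order, regions))
```

For example, starting at the origin with regions A at (1, 0), B at (5, 0), and C at (2, 0), the shortest open tour visits A, then C, then B.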

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

GAE在不完美信息自我博弈强化学习中表现不足

Authors: Zhiyuan Fan, Gabriele Farina
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2605.19235
Pdf link: https://arxiv.org/pdf/2605.19235
Abstract Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing $Q$-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA$(\lambda)$ trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO's clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance from mid-sized to large-scale games including Dou Dizhu and Heads-Up No-Limit Texas Hold'em.
中文摘要 不完美信息博弈中的竞争性多智能体强化学习要求智能体在部分可观测条件下对抗敌对对手，因此需要随机策略。虽然基于近端策略优化（PPO）的自我博弈强化学习取得了显著的实证成功，但其标准优势估计器，即广义优势估计（GAE），会因对随机的未来动作进行采样而引入额外方差。由于均衡策略本身具有随机性，这种方差在均衡自我博弈中被放大，并且即使评论家（critic）是精确的也依然存在。我们通过引入$Q$-boosting来解决这一瓶颈，它是一种基于中心化动作价值评论家的方差缩减优势估计器，并提出了采用这一新估计器的方差缩减策略优化（VRPO）。该算法用多步Expected SARSA$(\lambda)$迹替代采样的多步回溯，在每一步计算策略期望以平均掉动作采样噪声，同时保留PPO的裁剪目标和同策略（on-policy）的演员更新。实证结果表明，VRPO在从中等规模到大规模的博弈中始终表现强劲，包括斗地主和一对一无限注德州扑克（Heads-Up No-Limit Texas Hold'em）。
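
The variance argument in the VRPO abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the list-based inputs, and the exact recursion (a λ-weighted sum of Expected SARSA TD errors, where each bootstrap uses the policy expectation of Q instead of the sampled next action) are assumptions made for illustration.

```python
def expected_value(policy_row, q_row):
    """V(s) = sum_a pi(a|s) * Q(s, a): the policy expectation of the critic."""
    return sum(p * q for p, q in zip(policy_row, q_row))


def expected_sarsa_lambda_advantages(rewards, actions, q_values, policies,
                                     gamma=1.0, lam=0.9):
    """Advantages from a lambda-weighted sum of Expected SARSA TD errors.

    Hypothetical interface: q_values[t][a] and policies[t][a] are given for
    t = 0..T, where index T is the state after the last step (all zeros if
    that state is terminal).
    """
    T = len(rewards)
    v = [expected_value(policies[t], q_values[t]) for t in range(T + 1)]
    advantages = [0.0] * T
    trace = 0.0  # running sum of (gamma * lam)^k * delta_{t+k}
    for t in reversed(range(T)):
        q_taken = q_values[t][actions[t]]
        # Expected SARSA TD error: bootstrap on E_pi[Q], not on the sampled action.
        delta = rewards[t] + gamma * v[t + 1] - q_taken
        trace = delta + gamma * lam * trace
        advantages[t] = q_taken + trace - v[t]
    return advantages
```

With an exact critic every TD error is zero, so the advantage at an earlier step reduces to Q minus V and no longer depends on which stochastic action happened to be sampled later. A sampled multi-step backup would instead carry that later action's noise back to earlier steps.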

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

重新思考预训练之外的Muon：VLA和RLVR中的谱失效与高通补救方法

Authors: Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella, Sijia Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.19282
Pdf link: https://arxiv.org/pdf/2605.19282
Abstract Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across $\ell_1$-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a $\pi_{0.5}$ backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.
中文摘要 Muon 是一种矩阵感知优化器，利用牛顿-舒尔茨（NS）迭代将动量矩阵的所有奇异值推向 1，从而实现谱梯度正交化。虽然这种均匀的谱白化增强了探索，并在LLM预训练中优于AdamW，但我们表明，在预训练之外的两种场景下它可能带来根本性限制：（i）跨模态的视觉-语言-动作（VLA）训练，其中动作模块梯度本质上低秩，导致噪声尾部方向被放大；（ii）带可验证奖励的强化学习（RLVR），其中低信噪比梯度以及保留先前训练中各注意力头专门化的需求使白化变得不稳定。为应对这些挑战，我们提出了Pion，它可以直接替换Muon，在保持其计算效率的同时，用两阶段的提升+抑制（Promotion+Suppression）机制取代均匀的谱白化，我们称之为高通NS迭代。该设计产生陡峭的谱高通效应，将主导奇异值锚定在1，同时将噪声尾部分量抑制至0，且滤波强度可控。为了保留预训练得到的各头异质性，Pion 还支持按头（per-head）模式，通过简单的reshape在各注意力头之间独立地应用更新，且没有额外开销。在LIBERO和LIBERO-Plus上的VLA训练中，Pion在$\ell_1$回归（VLA-Adapter）和流匹配（VLANeXt）架构上始终优于两种基线，例如使用VLA-Adapter训练1,500步后，在LIBERO Object上成功率达到100%，而Muon为97.0%，AdamW仅为32.2%。Pion的优势还延伸到真实的Franka Research 3机器人，该机器人以$\pi_{0.5}$为骨干，在DROID设置下执行三个抓取放置任务。在基于GRPO和GMPO对Qwen3-1.7B/4B进行的RLVR后训练中，Pion在MATH和GSM8K上同样优于AdamW，而Muon则崩溃至零。
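
To make "driving all singular values toward 1" concrete, here is a minimal sketch of the classic cubic Newton-Schulz iteration in plain Python. It is an illustration only: Muon implementations typically use a tuned higher-order polynomial, and Pion's high-pass variant is not reproduced here. The helper names and the fixed step count are assumptions.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=20):
    """Push every nonzero singular value of g toward 1 (assumes g is nonzero).

    Uses the cubic update X <- 1.5 X - 0.5 X X^T X after Frobenius
    normalisation, which maps each singular value s to 1.5 s - 0.5 s^3.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x
```

Because the map treats large and small singular values alike, tail directions that were nearly zero in the input end up with the same weight as the dominant ones. This is the uniform whitening behaviour the abstract identifies as problematic for low-rank or low-SNR gradients.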

UAV-Assisted Cooperative Edge Inference for Low-Altitude Economy via MoE-based Hierarchical Deep Reinforcement Learning

基于MoE分层深度强化学习的面向低空经济的无人机辅助协作边缘推理

Authors: Wenhao Zhuang, Yuyi Mao, Ivan Wang-Hei Ho, Xianghao Yu
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.19290
Pdf link: https://arxiv.org/pdf/2605.19290
Abstract The low-altitude economy (LAE) is reshaping the industrial landscape by deploying unmanned aerial vehicles (UAVs) to facilitate a wide range of applications demanding flexible aerial mobility. Integrating edge artificial intelligence (AI) into LAE platforms creates a compelling paradigm where UAVs provide real-time AI-driven analysis while simultaneously executing their primary aerial mission duties. However, realizing this paradigm remains challenging due to the strict mission constraints imposed by these primary duties and the throughput bottlenecks of wireless links. To bridge this gap, we propose a UAV-assisted cooperative edge inference framework where UAVs execute mission-critical LAE duties, quantified by trajectory deviations from reference paths, while concurrently supporting ground devices via intermediate feature offloading. Within this framework, UAV trajectories, inference task offloading decisions, and feature compression ratios are jointly optimized to maximize the system performance. We cast this joint optimization task into a constrained partially observable Markov decision process (POMDP) framework. To efficiently solve it, we propose HDRL-MoE, a novel hierarchical deep reinforcement learning framework that decouples the optimization of slow-varying inference decisions from rapidly changing UAV trajectory control. Furthermore, HDRL-MoE integrates a mixture-of-experts (MoE) architecture, where a router network orchestrates discrete offloading decisions while expert networks independently optimize the feature compression ratios. Extensive simulations show that HDRL-MoE achieves significant inference accuracy gains over baselines and exhibits high scalability and efficiency through its MoE design.
中文摘要 低空经济（LAE）正通过部署无人机（UAV）来支撑各类需要灵活空中机动的应用，从而重塑产业格局。将边缘人工智能（AI）集成到LAE平台，形成了一种有吸引力的范式：无人机在执行其主要空中任务的同时，提供实时的AI驱动分析。然而，由于这些主要任务所施加的严格任务约束以及无线链路的吞吐量瓶颈，实现这一范式仍然具有挑战性。为弥合这一差距，我们提出了一种无人机辅助的协作边缘推理框架，其中无人机执行任务关键的LAE职责（以相对参考路径的轨迹偏差来量化），同时通过中间特征卸载为地面设备提供支持。在此框架下，无人机轨迹、推理任务卸载决策和特征压缩比被联合优化，以最大化系统性能。我们将该联合优化任务建模为带约束的部分可观测马尔可夫决策过程（POMDP）。为了高效求解，我们提出了HDRL-MoE，一种新型分层深度强化学习框架，将慢变的推理决策优化与快速变化的无人机轨迹控制解耦。此外，HDRL-MoE集成了混合专家（MoE）架构，其中路由网络负责协调离散的卸载决策，而各专家网络独立优化特征压缩比。大量仿真表明，HDRL-MoE相比基线取得了显著的推理精度提升，并凭借其MoE设计展现出较高的可扩展性和效率。

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

LambdaPO：一种用于推理语言模型的Lambda风格策略优化

Authors: Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.19416
Pdf link: https://arxiv.org/pdf/2605.19416
Abstract Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. We introduce a novel framework, Lambda Policy Optimization (LambdaPO), that addresses this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.
中文摘要 组相对策略优化（GRPO）已成为现代强化学习对齐的基石，它通过在采样轨迹组内进行奖励归一化，免去了显式的价值评论家，因而备受推崇。然而，该方法依赖单一的统计基线（如组均值），将轨迹空间的关系拓扑压缩为单个标量，从而抹去了在复杂且对排序敏感的奖励景观中所必需的细粒度偏好信息。为此，我们提出了一个新框架Lambda策略优化（LambdaPO），通过将优势估计从标量值重新表述为分解的成对偏好结构，来解决这一信息论瓶颈。具体来说，任一轨迹的优势被表述为它与组内所有其他轨迹的奖励差之和，其中每个成对比较都会根据策略自身对既定偏好的概率置信度被动态衰减。为了进一步缓解二元结果监督的稀疏性，我们在目标中加入语义密度奖励，该奖励来自生成的推理轨迹与真实解之间的精确率-召回率对齐。因此，我们的方法能够从一组rollout中挖掘出更细粒度的优化信号，引导LLM达到更好的最优点。在具有挑战性的数学推理和问答任务上的实验结果表明，LambdaPO相比基线方法提升了性能。
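
One way to read the pairwise advantage described in the LambdaPO abstract is sketched below. The abstract does not give the formula, so everything specific here is an assumption: `scores` stands in for per-trajectory policy log-probabilities, the confidence in a preference is modelled as a sigmoid of the score gap (in the style of LambdaRank), and each reward difference is scaled by one minus that confidence.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_advantages(rewards, scores):
    """Sum of reward differences against all peers, attenuated per pair.

    Hypothetical form: for a pair with different rewards, the policy's
    confidence that the higher-reward trajectory is preferred is
    sigmoid(score_winner - score_loser); the pair contributes
    (r_i - r_j) * (1 - confidence) to trajectory i.
    """
    n = len(rewards)
    advantages = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j or rewards[i] == rewards[j]:
                continue  # ties carry no preference signal
            winner, loser = (i, j) if rewards[i] > rewards[j] else (j, i)
            confidence = sigmoid(scores[winner] - scores[loser])
            advantages[i] += (rewards[i] - rewards[j]) * (1.0 - confidence)
    return advantages
```

Under this reading, a pair the policy already ranks correctly with high confidence contributes almost nothing, so the learning signal concentrates on pairs that are still mis-ranked or uncertain, which a single group-mean baseline cannot express.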

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

何时停止复用：面向样本高效RLVR的动态梯度门控

Authors: Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19425
Pdf link: https://arxiv.org/pdf/2605.19425
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the \textit{Disproportionate Weight Divergence (DWD)} phenomenon: performance degradation is synchronized with a sharp surge in the \texttt{lm_head} weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the \texttt{lm_head} while intermediate layers are structurally attenuated, and (ii) the \texttt{lm_head} gradient norm lower-bounds the policy divergence. These results establish the \texttt{lm_head} gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose \textit{Dynamic Gradient Gating (DGG)}, a lightweight intervention that monitors the \texttt{lm_head} gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to $2.93\times$ sample efficiency and $2.14\times$ wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.
中文摘要 带可验证奖励的强化学习（RLVR）已成为大型语言模型（LLM）高级推理的主导范式，但rollout样本的获取成本高昂，使样本效率成为关键瓶颈。一种自然的补救办法是将每个rollout批次复用于多次梯度更新，这是经典强化学习中的标准做法。然而在RLVR中，这会加剧策略偏移，导致性能严重下降。如何足够早地检测到性能下降的开始以停止复用，仍是一个开放且具有挑战性的问题。我们通过识别\textit{不成比例权重发散（DWD）}现象来填补这一空白：性能下降与\texttt{lm_head}权重变化的急剧激增同步发生，而中间层保持稳定。在实证上，我们验证了DWD在多种LLM和任务中一致出现。在理论上，我们证明：（i）有害梯度集中在\texttt{lm_head}，而中间层在结构上被衰减；（ii）\texttt{lm_head}梯度范数是策略散度的下界。这些结果确立了\texttt{lm_head}梯度范数作为灾难性策略偏移的一个有原则的实时信号。基于这一洞见，我们提出了\textit{动态梯度门控（DGG）}，这是一种轻量级干预，实时监控\texttt{lm_head}梯度范数，并在有害梯度破坏优化器之前将其拦截。DGG始终达到或超过标准的单次使用基线，在数学、ALFWorld、WebShop和搜索增强问答任务上实现了最高$2.93\times$的样本效率和$2.14\times$的实际运行时间加速。
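
The gating idea in the DGG abstract (watch the `lm_head` gradient norm and block an update when it surges) can be sketched as follows. The abstract does not state the actual trigger rule, so the class name, the exponential moving average, and the `ratio` threshold are all hypothetical choices for illustration.

```python
import math


class GradNormGate:
    """Blocks an update when the monitored gradient norm spikes.

    Hypothetical rule: keep an exponential moving average (EMA) of accepted
    norms and reject any step whose norm exceeds `ratio` times that average.
    """

    def __init__(self, ratio=3.0, decay=0.9):
        self.ratio = ratio
        self.decay = decay
        self.ema = None

    def allow(self, lm_head_grad):
        """Return True if the optimizer step should be applied."""
        norm = math.sqrt(sum(g * g for g in lm_head_grad))
        if self.ema is None:
            self.ema = norm  # first observation seeds the average
            return True
        if norm > self.ratio * self.ema:
            return False  # surge detected: skip, and keep the spike out of the EMA
        self.ema = self.decay * self.ema + (1.0 - self.decay) * norm
        return True
```

In a training loop the check would sit between the backward pass and the optimizer step for each reuse of a rollout batch, so that a rejected gradient never reaches the optimizer state.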

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

CEPO：利用对比证据策略优化进行RLVR自蒸馏

Authors: Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.19436
Pdf link: https://arxiv.org/pdf/2605.19436
Abstract When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at this https URL.
中文摘要 当模型在带可验证奖励的强化学习（RLVR）下产生正确解答时，每个token都会收到相同的奖励信号，无论它是决定性的推理步骤还是语法填充词。一个自然的修正办法是把以正确答案为条件的模型当作教师，找出模型在知道答案的情况下会以不同方式生成的token。先前的研究表明，这要么因答案泄漏到梯度中而破坏训练，要么只产生微弱的信号，无法区分决定性步骤和填充词，因为两者相对于模型基线看起来同样令人意外。我们提出了对比证据策略优化（CEPO），它在每个token上提出一个更尖锐的问题：不仅是“正确答案是否偏好该token？”，而是“正确答案是否偏好它，同时错误答案不偏好它？”同时满足两者的token是真正的推理步骤；两者都不满足的则是填充词。错误答案教师由训练批次中已有的被拒绝rollout构建，不产生额外的采样成本。我们证明CEPO继承了先前最先进方法的全部结构性安全保证，同时严格地锐化了决定性token上的信用分配，且这一改进恰好在填充位置处消失。在实证上，CEPO在五个多模态数学推理基准上于2B和4B规模分别取得43.43%和60.56%的平均准确率，而GRPO在相同训练预算下分别为41.17%和57.43%。分布匹配的自蒸馏方法（OPSD、SDPO）低于未训练的基线，从实证上证实了我们的理论所预测的信息泄漏。我们的代码可在此 https URL 获取。

When the Majority Votes Wrong, the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

当多数投票出错时，测试时强化学习的干预时机隐藏在消亡窗口中

Authors: Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19444
Pdf link: https://arxiv.org/pdf/2605.19444
Abstract Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the \textit{Correct-Answer Extinction Window}, with Flip Rate (FR) as its leading indicator. We thus propose \textbf{TTRL-Guard}, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, and improves over TTRL by a relative 54% on AIME 2025. Our code and implementation details are available at this https URL.
中文摘要 测试时强化学习（TTRL）以多数投票作为伪标签信号，在数学推理基准上报告了大幅的准确率提升。我们认为这些提升被系统性地误读了：其中大多数反映的是对本已可解问题的锐化，而非真正的学习；同时，由正确变为错误的问题数量超过了真正学会的问题数量，而且一旦多数投票锁定在错误答案上，这种损害便不可逆转。逐题追踪显示，低能力问题中的正确答案信号会短暂活跃，随后被永久抑制，我们将这一现象称为\textit{正确答案消亡窗口}，其先行指标为翻转率（FR）。因此，我们提出了\textbf{TTRL-Guard}，这是一个轻量级框架，包含三种针对消亡窗口的机制：翻转率感知奖励缩放（FRS）在FR下降时降低高风险更新的权重，少数保留采样（MPS）保留来自少数正确答案的梯度信号，风险条件稀疏更新（RCSU）暂停对已极化问题的更新。在三个模型和四个基准上的实验表明，TTRL-Guard在Qwen2.5-7B-Instruct和Qwen3-4B上取得了最佳的平均pass@1，并在AIME 2025上相对TTRL提升了54%。我们的代码和实现细节可在此 https URL 获取。
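
The failure mode described in the TTRL-Guard abstract follows directly from how majority-vote pseudo-labels work, which a few lines make visible. This is a sketch, not the paper's code. In particular, `flip_rate` is a hypothetical stand-in that counts how often the majority answer changes between consecutive training steps; the abstract does not give the paper's definition of Flip Rate.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most common sampled answer as the pseudo-label.

    Returns the pseudo-label and a 0/1 reward per rollout. Ties are broken
    by first occurrence, following Counter.most_common.
    """
    label, _ = Counter(answers).most_common(1)[0]
    return label, [1.0 if a == label else 0.0 for a in answers]


def flip_rate(majority_history):
    """Fraction of consecutive steps where the majority answer changed.

    Hypothetical definition, for illustration only.
    """
    if len(majority_history) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(majority_history, majority_history[1:])
                if prev != cur)
    return flips / (len(majority_history) - 1)
```

If the true answer to a problem is "12" but three of five rollouts say "7", the two correct rollouts receive zero reward and the wrong answer is reinforced. Once the majority stops changing, nothing in the pseudo-label signal can pull the policy back.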

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

蒸馏什么、何时蒸馏：面向多轮智能体的选择性事后蒸馏

Authors: Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19447
Pdf link: https://arxiv.org/pdf/2605.19447
Abstract Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
中文摘要 强化学习可以利用稀疏的任务奖励训练LLM智能体，但长时程的信用分配仍然具有挑战性：单一的成功或失败信号必须分配到许多动作上。现有方法依赖轨迹级奖励或代理信号，未能充分利用每一步的环境反馈。多轮智能体场景尚未得到充分探索，其中的反馈可以包括错误信息、页面变化、观测或参考轨迹。我们系统地研究了五种反馈来源和两种插入粒度，并提出了SERL，一种选择性环境重加权学习框架。SERL利用任务奖励确定更新方向，而环境反馈则调整更新的位置和幅度，聚焦于关键动作。在ALFWorld和WebShop上，SERL的成功率分别达到90.0%和80.1%，优于强大的强化学习和蒸馏基线。分析表明，在有意义的位置提供有依据且与动作相关的反馈，始终优于不加区分地使用更长或更丰富的上下文。

Generative Auto-Bidding with Unified Modeling and Exploration

统一建模与探索的生成式自动出价

Authors: Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19457
Pdf link: https://arxiv.org/pdf/2605.19457
Abstract Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.
中文摘要 自动出价是现代数字广告的核心。早期基于规则的方法缺乏适应性，而后续的强化学习方法将出价建模为马尔可夫决策过程，但难以处理长期依赖。近期的生成式模型展现出潜力，但缺乏平衡探索与安全的显式机制，仅依赖动作扰动或轨迹引导，而没有安全兜底。这导致探索效率低下，并增加了广告平台的财务风险。为弥补这一不足，我们提出了GUIDE（Generative Auto-Bidding with Unified Modeling and Exploration，统一建模与探索的生成式自动出价），该框架将定向探索与安全兜底机制协同整合。GUIDE使用决策Transformer（DT）对历史出价动作和环境状态转移进行联合建模。Q值模块通过正则化约束引导DT的探索，而逆动力学模块（IDM）利用DT预测的未来状态推断出稳健且行为一致的动作，作为安全的策略兜底。随后，Q值模块自适应地在这两个候选中选择最终动作，以平衡探索和安全。这些组件共同构成了一个一体化的“探索-保障-选择”流程，统一了效率和安全。我们在公开数据集、模拟拍卖环境中进行了大量实验，并在中国领先的广告平台淘宝上进行了大规模线上部署。结果表明，GUIDE在所有场景下始终优于最先进的基线。在实际部署中，GUIDE取得了显著提升：广告GMV +4.10%，广告点击 +1.40%，广告消耗 +1.66%，广告ROI +3.52%，证明了其有效性和较强的工业适用性。

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

超越模式崩溃：面向多样化推理的分布匹配

Authors: Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19461
Pdf link: https://arxiv.org/pdf/2605.19461
Abstract On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
中文摘要 GRPO等同策略（on-policy）强化学习方法存在模式崩溃问题：它们表现出解的多样性降低，一旦发现某个解，就把概率质量集中在这一个解上，并停止探索其他策略。我们表明，这源于反向KL最小化的模式寻求（mode-seeking）行为，它强化最先找到的高奖励轨迹，而不是在多个不同的解上维持一个分布。我们提出了DMPO（分布匹配策略优化），通过对前向KL最小化进行有原则的近似来防止模式崩溃。DMPO在采样轨迹上构建一个与其奖励成正比的组级目标分布，然后将策略分布与该目标对齐。这样无需从难以处理的全局目标分布中采样即可获得模式覆盖（mode-covering）行为，从而在整个训练过程中保持持续探索。我们在NP难组合优化问题上验证了DMPO，这类问题存在指数级数量的可行解，但只有少数接近最优，是评估探索能力的理想测试平台。DMPO在基于文本的NP-Bench上取得了43.9%的质量比（GRPO为40.1%），在基于视觉的NP-Bench上取得了43.1%（对比38.4%），相对提升分别为9%和12%。这些提升可以推广到数学推理（+2.0%）和域外任务（+2.3%），表明保持多样性的训练能够提升跨模态的通用推理能力。我们的工作确立了分布匹配是防止同策略强化学习中模式崩溃的一种实用且有原则的方法，其一致的质量提升表明了在多种推理任务上的持续探索。
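
The group-level target in the DMPO abstract can be sketched in a few lines. Several details are assumptions rather than the paper's formulation: rewards are taken to be non-negative and normalised directly into a distribution, the policy's distribution over the group is a softmax of per-trajectory log-probabilities, and the loss is the plain forward KL between the two.

```python
import math


def reward_proportional_target(rewards):
    """Target distribution over the sampled group, proportional to reward."""
    total = sum(rewards)
    if total <= 0:
        return [1.0 / len(rewards)] * len(rewards)  # no signal: fall back to uniform
    return [r / total for r in rewards]


def group_policy_distribution(seq_logprobs):
    """Policy probabilities renormalised over the sampled trajectories."""
    m = max(seq_logprobs)
    exps = [math.exp(lp - m) for lp in seq_logprobs]
    z = sum(exps)
    return [e / z for e in exps]


def forward_kl(target, policy):
    """KL(target || policy); terms with zero target mass contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)
```

Forward KL grows sharply when the policy puts almost no mass on a trajectory that the target still supports. That penalty is what gives the mode-covering behaviour: a policy that has collapsed onto the single best trajectory scores worse than one that keeps some probability on the other rewarded solutions.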

Sampling-Based Safe Reinforcement Learning

基于抽样的安全强化学习

Authors: Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.19469
Pdf link: https://arxiv.org/pdf/2605.19469
Abstract Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
中文摘要 安全探索仍然是强化学习（RL）中的一项根本挑战，限制了RL智能体在现实世界中的部署。我们提出了基于采样的安全强化学习（SBSRL），这是一种基于模型的RL算法，通过在有限的动力学样本集合上联合施加约束，在整个学习过程中保持安全。这一表述近似了在不确定动力学上难以求解的最坏情况优化，并使连续域中的实用安全保证成为可能。我们进一步引入了一种基于约束认知不确定性的探索策略，从而无需显式的探索奖励。在正则性条件下，我们推导出整个学习过程中的高概率安全保证，以及恢复近似最优策略的有限时间样本复杂度界。在实证上，SBSRL在仿真和真实机器人硬件上都实现了安全高效的探索，并且可以方便地扩展到实用的深度集成（deep ensemble）实现，从而适用于高维连续控制问题。
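
"Enforcing constraints jointly across a finite set of dynamics samples" can be pictured with a toy example. Everything concrete below is an assumption chosen for illustration and not taken from the paper: a scalar linear system x' = gain * x + u with an uncertain gain, a state bound |x| <= limit, and a brute-force choice among candidate control plans.

```python
def rollout(x0, controls, gain):
    """Simulate the toy dynamics x' = gain * x + u and return all visited states."""
    states = [x0]
    for u in controls:
        states.append(gain * states[-1] + u)
    return states


def is_safe_under_all(x0, controls, gain_samples, limit):
    """A plan is accepted only if the bound holds under every sampled model."""
    return all(abs(x) <= limit
               for gain in gain_samples
               for x in rollout(x0, controls, gain))


def best_safe_plan(x0, candidate_plans, gain_samples, limit):
    """Pick the safe plan with the largest total control effort (toy objective)."""
    safe = [plan for plan in candidate_plans
            if is_safe_under_all(x0, plan, gain_samples, limit)]
    return max(safe, key=sum) if safe else None
```

Checking every sampled model replaces the intractable worst case over all plausible dynamics with a finite set of constraints. A plan that looks safe under one optimistic sample is still rejected if any other sample violates the bound.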

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

面向大型推理模型的基于强化学习越狱攻击的注意力引导奖励

Authors: Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19485
Pdf link: https://arxiv.org/pdf/2605.19485
Abstract Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.
中文摘要 大型推理模型（LRM）通过生成结构化、逐步的推理内容，在解决复杂问题方面展现出了卓越的能力。然而，暴露模型的内部推理过程会带来额外的安全风险；例如，最近的研究表明，LRM比标准LLM更容易受到越狱攻击。本文研究了针对LRM的越狱攻击，并揭示攻击成功率（ASR）与LRM的注意力模式密切相关。具体来说，成功的越狱往往对输入提示中的有害token分配较低的注意力，而对推理内容中的这些token分配较高的注意力。基于这一发现，我们提出了一种面向LRM的新型越狱方法，利用强化学习（RL）提升攻击效果，并将注意力信号显式地纳入奖励函数设计。此外，我们引入了多样化的说服策略来丰富RL的动作空间，从而持续提升ASR。在三个基准上对五个开源和闭源LRM进行的大量实验表明，我们的方法取得了显著更高的ASR，在有效性、效率和可迁移性方面均优于现有方法。

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL：一个受ARC Raiders启发的强化学习试验场

Authors: Carlo Romeo, Andrew D. Bagdanov
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.19503
Pdf link: https://arxiv.org/pdf/2605.19503
Abstract Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints.
中文摘要 面向足式运动的强化学习已经发展成熟，形成了一整套多分量奖励函数和物理引擎基准，其机器人形态无一例外地源自真实的商用硬件。然而，游戏中的NPC受到仿真到现实（sim-to-real）机器人研究中不存在的风格约束，并且常常以没有真实机器人对应物的生物形态出现。我们提出了ARC-RL，这是一套包含四个MuJoCo连续控制环境的套件，其机器人形态的灵感来自ARC Raiders的怪物图鉴：18自由度的高大六足机器人Queen、12自由度的装甲六足机器人Bastion、18自由度的紧凑六足机器人Tick，以及12自由度的四足机器人Leaper。这四个机器人共享统一的观测模板、动作约定、仿真节奏，以及单一的闭式多分量奖励函数，其随形态而异的部分仅存在于一小组权重和参数中。该奖励融合了速度跟踪的帐篷函数（tent）、健康存活奖励、一对相位锁定的步态合规奖励/代价、动作正则项、三个安全惩罚和一个姿态锚点；奖励中自始至终不使用任何动作捕捉数据。我们还为每种形态提供了手工设计的中枢模式发生器（Central Pattern Generator）示范器，它们既作为固定的专家参考，也作为离线到在线训练的先验数据来源。在这个试验场上，我们开展了一项受控实证研究，比较标准的在线算法（SAC、SPEQ、SOPE-EO）与利用先验数据增强的方法（SACfD、SPEQ-O2O、SOPE），并刻画每种范式如何应对该试验场的形态多样性和动画风格约束。

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA：一个负增强的安全对齐框架，用于风险意识自动驾驶

Authors: Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.19524
Pdf link: https://arxiv.org/pdf/2605.19524
Abstract End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.
中文摘要 端到端自动驾驶系统在常见场景中表现出色，但在安全关键的长尾车型上表现不佳。视觉-语言-行动（VLA）模型因其强大的推理能力而具有前景。然而，大多数基于VLA的方法依赖积极的专家演示，很少利用负样本，导致对风险行为和安全界限的理解不足。为解决这一局限，我们提出了SafeAlign-VLA，一个统一的负面增强安全对齐框架，将负面数据纳入监督学习和强化学习。首先，我们开发了一种反事实安全配对范式，通过反事实推理生成结构化的安全标签和风险情景中的反事实正向轨迹。随后采用两阶段训练策略：负增强监督微调以应对失效反馈和轨迹修正，随后是基于锚点的群体相对策略优化，利用正负轨迹作为对比锚点，引导抽样并通过群体相对优势惩罚高风险行为。NAVSIM和DeepAccident的实验验证了该框架。SafeAlign-VLA在NAVSIM v1测试集上实现了89.1 PDMS，较无负面数据基线提升了1.3%。在DeepAccident中，碰撞率降至3.36%，同时实现84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提议的负增强安全对齐框架在实现安全稳健自动驾驶方面的有效性。

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

利用强化学习优化300bps通信的神经语音编解码器

Authors: Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2605.19541
Pdf link: https://arxiv.org/pdf/2605.19541
Abstract In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
中文摘要 在带宽受限的通信中，如卫星和水下频道，语音通常必须以超低比特率传输，而可理解性是首要目标。在如此极端的压缩水平下，经过声学重建损耗训练的编解码器往往会将比特分配给感知细节，导致字错误率（WER）大幅下降。本文提出了ClariCodec，这是一种以300比特每秒（bps）速度运行的神经语音编解码器，将量化重新表述为随机策略，从而实现基于强化学习（RL）的可理解性优化。具体来说，编码器通过WER驱动的奖励进行微调，而声学重建流程则保持冻结状态。即使没有 RL，ClariCodec 在 LibriSpeech 测试净化集上也能实现 4.64% 的 WER，速度为 300 bps，已经与更高码率的编解码器竞争。进一步的强化学习微调将测试干净时的WER降至3.55%，测试其他测试时降至10.4%，相对降低23%，同时保持感知质量。
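
The 23% relative reduction can be checked against the test-clean figures quoted in the abstract. The snippet below is a minimal arithmetic sketch; the helper name `relative_reduction` is introduced here for illustration and does not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# WER (%) on LibriSpeech test-clean at 300 bps, as quoted in the abstract.
wer_without_rl = 4.64
wer_with_rl = 3.55
print(f"{relative_reduction(wer_without_rl, wer_with_rl):.1%}")  # prints 23.5%
```

The abstract gives no pre-RL WER for test-other, so the same check cannot be repeated there from the listing alone.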

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

GoLongRL：能力导向的长上下文强化学习，具多任务对齐

Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, jian Liang, Ruiming Tang, Han Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.19577
Pdf link: https://arxiv.org/pdf/2605.19577
Abstract We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
中文摘要 我们介绍GoLongRL，一种完全开源、面向能力的训练后配方，用于带有可验证奖励（RLVR）的长上下文强化学习。现有的长上下文强化学习方法常将数据构建视为设计日益复杂的检索路径，导致任务覆盖和奖励表述趋于同质，未能充分反映实际的长上下文需求。我们的工作有两个贡献。（1）面向能力的数据构建，完全开放发布。我们公开发布了包含2.3万个RLVR样本的数据集、完整的构建流程以及所有训练代码。基于长上下文能力分类法，数据集涵盖9种任务类型，每种类型均配有其自然评估指标。它包括从成熟语料库中策划的开源样本，以及从真实源文档（如书籍、学术论文和多回合对话）生成的综合样本。在同一原版GRPO设置下，我们的数据集本身就优于闭源的QwenLong-L1.5数据集。此外，我们基于该数据训练的Qwen3-30B-A3B模型，其长上下文表现可与DeepSeek-R1-0528和Qwen3-235B-A22B-Thinking-2507相媲美，表明更广泛的覆盖范围和更大的奖励多样性显著提升了长上下文能力的提升。（2）用于异构多任务优化的TMN重权。为应对异质奖励带来的优化挑战，我们提出了TMN重权重方法，该方法结合了任务级平均归一化以实现跨任务奖励尺度对齐，并采用难度自适应加权，以实现更可靠的优势估计。TMN-Reweight进一步提升了相比普通GRPO的平均性能，且在报告的评估中保留或提升了通用能力。

Implicit Action Chunking for Smooth Continuous Control

隐式动作分块以实现平滑连续控制

Authors: Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.19592
Pdf link: https://arxiv.org/pdf/2605.19592
Abstract Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.
中文摘要 强化学习常常产生高频振荡控制信号，削弱物理部署所需的安全性和稳定性。显式行动分块通过预测固定视界轨迹来解决这个问题，但策略输出维度与视野长度成比例缩放，导致优化困难且与标准的分级交互不兼容。为克服这些挑战，本文提出了双窗口平滑（DWS），这是一种隐式动作分块框架，用于实现平滑连续控制。与显式方法不同，DWS在不扩展作用空间的情况下强制时间相干性。它采用双窗口设计：一个通过确定性调制确保物理平滑的执行窗口，另一个值窗口通过视野对齐时间差目标，纠正开环执行引起的批评偏差。DWS还包含基于一阶动作差异的轻量级actor侧时间正则化器，以促进全局连续性。该设计有效弥合了时间抽象与反应式分级控制之间的差距。基于基准测试（包括DeepMind控制套件）和工业能源管理任务的实验显示，DWS的表现优于最先进的（SOTA）基线。在复杂的基于视觉的自动驾驶任务中，DWS实现更平稳的操控、更安全的行为和更小的抖动，并实现100%的成功率。
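
The abstract describes the actor-side regularizer only as "based on first-order action differences". One common way to write such a term is sketched below. The squared L2 form, the averaging, and the function name `action_smoothness_penalty` are assumptions made for illustration, not details confirmed by the paper.

```python
from typing import Sequence


def action_smoothness_penalty(actions: Sequence[Sequence[float]]) -> float:
    """Mean squared first-order difference between consecutive actions.

    Assumption: squared L2 norm, averaged over the trajectory.
    """
    if len(actions) < 2:
        return 0.0  # a single action has no neighbor to differ from
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


smooth = [[0.0], [0.1], [0.2], [0.3]]   # slow ramp
jittery = [[0.0], [0.3], [0.0], [0.3]]  # high-frequency oscillation
print(action_smoothness_penalty(smooth) < action_smoothness_penalty(jittery))  # True
```

In a training loop a term like this would typically be added to the actor loss with a small coefficient, so that oscillating action sequences cost more than slowly varying ones.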

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

潜在强化学习动作的投影：迈向可推广和可扩展的图组合优化

Authors: Franco Terranova (UL, LORIA, Inria), Guillermo Bernardez (UC Santa Barbara), Albert Cabellos-Aparicio (UPC), Nina Miolane (UC Santa Barbara), Abdelkader Lahmadi (LORIA, UL, Inria)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.19721
Pdf link: https://arxiv.org/pdf/2605.19721
Abstract Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
中文摘要 图组合优化（GCO）引起了越来越多的关注，因为许多NP难问题自然可以接受图的形式，但其组合爆炸性使得精确方法在计算上变得难以处理。强化学习（RL）与图神经网络（GNN）结合的最新进展显著提升了基于学习的GCO求解器。然而，现有方法在跨越多样图实例的推广和计算可扩展性方面都面临局限，且随着动作空间的扩大。为应对这两个挑战，我们引入了投影代理，这是一种新型的RL-GCO方法，直接在基于GNN的连续动作嵌入空间中工作，通过单次前向传递预测期望的潜在动作，随后将其解码为有效的离散动作。此外，我们通过共享嵌入空间实现了观察和动作的公平比较。在多种基准测试中，我们的方法仅用简单最近邻解码，推理速度高达16.2倍，泛化能力提升多达40%，同时在多依赖变量的超线性决策空间中实现强化学习表现。最后，我们发布了LaGCO-RL，一个Python库，能够自动化潜在动作空间构建并支持现有RL-GCO解决方案，促进对新GCO基准的可重复性和适应性。
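
The decoding step mentioned in the abstract, mapping a predicted latent action to a valid discrete action by nearest-neighbor lookup, can be sketched as follows. This is not the LaGCO-RL API: the function name `decode_nearest`, the Euclidean metric, and the optional validity mask are all assumptions for illustration.

```python
import math
from typing import Optional, Sequence


def decode_nearest(
    latent: Sequence[float],
    action_embeddings: Sequence[Sequence[float]],
    valid: Optional[Sequence[bool]] = None,
) -> int:
    """Return the index of the valid action whose embedding is closest to `latent`."""
    best_index, best_dist = -1, math.inf
    for i, emb in enumerate(action_embeddings):
        if valid is not None and not valid[i]:
            continue  # skip actions that are not allowed in this state
        dist = math.dist(latent, emb)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best_index, best_dist = i, dist
    if best_index < 0:
        raise ValueError("no valid action to decode to")
    return best_index


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(decode_nearest([0.9, 0.1], embeddings))                       # 1
print(decode_nearest([0.9, 0.1], embeddings, [True, False, True]))  # 0
```

The policy emits one latent vector per step regardless of how many discrete actions exist, and only this lookup depends on the size of the action set.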

Memory-Augmented Reinforcement Learning Agent for CAD Generation

用于CAD生成的记忆增强强化学习代理

Authors: Yin Xiaolong, Liu Yu, Shen Jiahang, Lu Xingyu, Ni Jingzhe, Fan Fengxiao, Sang Fan
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.19748
Pdf link: https://arxiv.org/pdf/2605.19748
Abstract Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.
中文摘要 计算机辅助设计（CAD）模型的自动生成是实现先进制造智能化的核心技术。基于大型语言模型（LLMs）的现有生成方法在处理具有长操作序列、多样操作类型和强几何约束的复杂CAD模型时常常表现不足，主要原因是推理链断裂且缺乏有效的纠错机制。为解决这一问题，本文提出了一个用于CAD生成代理的记忆增强强化学习框架。该框架将底层几何内核封装成可由代理调用的结构化工具链，构建了一个闭环机制，涵盖设计意图理解、全局规划、执行和多维验证。它还设计了一个双轨内存模块，由案例库和技能库组成，并提出了动态效用检索算法。通过将强化学习引入检索和策略优化，智能体可以有效避免例子语义相似但几何上不可行的检索陷阱，从而实现在线自我纠正和持续演进，无需额外大规模注释数据。实验表明，所提方法在复杂CAD模型生成任务中显著提高了成功率和几何一致性。

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

多项逻辑斯 MDP 的最优方差感知遗憾边界极小极大

Authors: Pierre Boudart (SIERRA), Pierre Gaillard (Thoth), Alessandro Rudi (PSL, DI-ENS, Inria)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.19768
Pdf link: https://arxiv.org/pdf/2605.19768
Abstract We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar\sigma_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\bar\sigma_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar\sigma_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{\Omega(dH^2\bar\sigma_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.
中文摘要 我们研究了通过多项逻辑（MNL）模型建模的情节马尔可夫决策过程（MDP）的强化学习。现有的MNL混合MDP算法给出的后悔值为$\smash{\tilde{O}（dH^2\sqrt{T}）}$（Li等，2024），其中$d$为特色维度，$H$为集数，$T$为集数。受后勤强盗文献（Abeille 等，2021;Faury 等，2022;Boudart等，2026），我们引入了一个与问题相关的常数$\bar\sigma_T \leq 1/2$，衡量最优下游价值函数沿学习者轨迹的归一化平均方差。我们提出一种算法，实现后悔值为$\smash{\tilde{O}（dH^2\bar\sigma_T\sqrt{T}）}$，在最坏情况下恢复现有界限，并在结构化MDP中改进。例如，对于KL约束的鲁棒MDP，$\bar\sigma_T = O（H^{-1}）$，使视界依赖减少一个$H$的因子。我们进一步建立了匹配的$\smash{\Omega（dH^2\bar\sigma_T\sqrt{T}）}$下界，证明了极小极大最优性（对数因子内），并首次全面表征了MNL混合MDP的遗憾复杂性。
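
To make the horizon saving explicit, the robust-MDP value of the variance constant can be substituted into the new bound. The lines below only restate the two bounds given in the abstract and introduce no new result.

```latex
\tilde{O}\!\left(dH^2\,\bar\sigma_T\sqrt{T}\right)
\quad\text{with}\quad \bar\sigma_T = O\!\left(H^{-1}\right)
\quad\Longrightarrow\quad
\tilde{O}\!\left(dH\sqrt{T}\right)
```

This is a factor $H$ smaller than the earlier $\tilde{O}(dH^2\sqrt{T})$ bound. In the worst case, $\bar\sigma_T = 1/2$, the new bound matches the earlier one up to a constant factor.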

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

工具总是有益的吗？学习如何自适应调用工具以实现双模多模态大型语言模型推理

Authors: Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.19852
Pdf link: https://arxiv.org/pdf/2605.19852
Abstract Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on the V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on the POPE benchmark. Code is available at this https URL.
中文摘要 工具增强推理已成为提升多模态大型语言模型（MLLM）推理能力的有前景方向。然而，现有研究主要侧重于使模型能够执行工具调用，而忽视了调用工具的必要性。我们认为工具的使用并不总是有益的，因为冗余或不合适的调用会大大增加推理负担，甚至误导模型预测。为解决这个问题，我们引入了AutoTool模型，该模型根据每个查询的特性自适应地决定是否调用工具。在强化学习框架内，我们设计了一种显式的双模式推理策略，并配备特定模式的奖励函数，以引导模型产生准确的反应。此外，为防止过早偏向单一推理模式，AutoTool在培训过程中共同探索并平衡工具辅助与文本中心推理，并在后期阶段促进自由探索。大量实验表明，AutoTool表现出卓越的性能和高效率，V*基准测试的准确率比基础模型提升了21.8%，在POPE基准测试上相比现有工具增强方法的效率提升了44.9%。代码可在此 https URL 访问。

Fair-Aurora: Comparing Fairness Strategies for Reinforcement Learning-Based Congestion Control in Multi-Flow Environments

Fair-Aurora：比较多流环境中基于强化学习的拥塞控制公平策略

Authors: Thomas Mbrice, Yuyu Liu
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.19909
Pdf link: https://arxiv.org/pdf/2605.19909
Abstract Reinforcement learning (RL) has emerged as a promising paradigm for Internet congestion control, achieving higher link utilization than classical heuristics. However, RL-based controllers trained in single-flow environments are not guaranteed to share bandwidth equitably when deployed in multi-flow networks. This paper investigates the fairness properties of Aurora (Jay et al., 2019), a state-of-the-art deep RL congestion controller, and evaluates three post-hoc fairness strategies that preserve Aurora's RL architecture: reward shaping (Strategy A), observation augmentation (Strategy B), and loss-sensitivity tuning (Strategy C). Using a custom shared-bottleneck simulator and Jain's fairness index as the primary metric, we find that modest reward shaping achieves the best fairness while preserving aggregate throughput. All strategies maintain the total bandwidth budget, with fairness achieved through redistribution rather than reduction. Beyond the 2-flow homogeneous setting, an extended evaluation across mixed Aurora-CUBIC competition and dynamic flow entry/exit scenarios shows that Strategy C's loss-sensitivity tuning emerges as the most TCP-friendly mechanism, while Strategy B is the most stable through dynamic flow-set changes.
中文摘要 强化学习（RL）已成为互联网拥塞控制的有前景范式，能够实现比传统启发式更高的链路利用率。然而，基于强化学习的控制器在单流环境中训练后，在多流网络中部署时并不能保证带宽公平共享。本文探讨了Aurora~\cite{jay2019aurora}的公平性属性，这是一款最先进的深度强化学习拥塞控制器，并评估了三种事后公平策略，这些策略保持了极光的强化学习架构：\emph{奖励塑造}（策略~A）、\emph{观察增强}（策略~B）和\emph{损失敏感性调优}（策略~C）。使用自定义共享瓶颈模拟器和Jain公平指数作为主要指标，我们发现适度的奖励塑形能在保持总吞吐量的同时实现最佳公平性。所有策略都通过重新分配而非减少来实现公平，从而维持总带宽预算。除了2流同质设置外，对混合极光-CUBIC竞争和动态流进/出场景的深入评估显示，Strategy~C的损失敏感性成为最友好TCP的机制，而Strategy~B则通过动态流集变化最为稳定。
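
Jain's fairness index, the primary metric named in the abstract, is the standard quantity (sum of x_i)^2 / (n * sum of x_i^2) over per-flow throughputs x_i. The sketch below computes it. The function name `jain_index` and the all-idle convention are choices made here, and the sample throughputs are illustrative values, not results from the paper.

```python
from typing import Sequence


def jain_index(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(throughputs)
    if n == 0:
        raise ValueError("need at least one flow")
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if squares == 0:
        return 1.0  # all flows idle: treated as equal shares (a convention)
    return total * total / (n * squares)


# Two allocations of the same 10-unit bandwidth budget.
print(jain_index([5.0, 5.0]))   # 1.0 (perfectly fair)
print(jain_index([10.0, 0.0]))  # 0.5 (one flow starves the other)
```

Both allocations use the full budget, so the index isolates how the bandwidth is shared. This is the sense in which fairness can improve through redistribution without reducing aggregate throughput.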

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

超越行动残差：通过瓶颈潜在强化学习实现现实世界的机器人政策引导

Authors: Dongjie Yu, Kun Lei, Zhennan Jiang, Jia Pan, Huazhe Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.19919
Pdf link: https://arxiv.org/pdf/2605.19919
Abstract Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at this https URL.
中文摘要 预训练的仿制策略已成为机器人操作的坚实基础，但它们通常需要在线改进以克服执行错误、有限的数据集覆盖和部署不匹配。因此，一个核心问题是强化学习（RL）应如何在离线预训练后调整策略。现有的轻量化方法通常直接在作用空间中应用残差修正，但这常导致噪声和结构不完善的探索。在本研究中，我们提出了Z-扰动强化学习（ZPRL）的方法，即通过紧密瓶颈潜在引导预训练策略，而非通过策略权重或输出动作。在离线训练期间，我们通过即插即用的变分信息瓶颈（VIB）模块来补充策略，从观测嵌入中提取任务相关的潜在接口。在线微调过程中，基础策略被冻结，强化学习只学习该潜在变量上的残余扰动，其解码后的表示条件为冻结作用发生器。我们将ZPRL实例化为流量匹配策略，并在八个模拟任务和四个真实世界任务中进行评估。在多种操作环境中，ZPRL在强有力的训练后基线条件下，提升了样本效率和最终表现。在现实世界中，ZPRL在四个任务的平均成功率比模仿基策略提高了33.7%，同时比动作残余策略更为平滑地进行探索。这些结果表明，紧凑且符合任务的瓶颈潜在器为在线强化学习适应提供了有效接口。更多视频可在此 https 网址找到。

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

RoHIL：针对照明变化的稳健人机在环机器人强化学习

Authors: Shuoqin Zhang, Yixin Xiong, Xiru Gao, Kai Liu, Ke Wang, Xichuan Zhou, Zhe Hu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.19924
Pdf link: https://arxiv.org/pdf/2605.19924
Abstract Human-in-the-loop reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance, eliminating the need to re-collect data and retrain for every new workstation and environment. Project page: this https URL
中文摘要 人机在环路中强化学习系统在训练工作站上几乎完美成功，但当同一机器人被移至几米外的工作站时，由于新灯具位置和窗户光线变化导致视觉输入分布变化，系统就会崩溃。在每个工作站重新收集演示和重新运行HIL与部署不兼容，而对光线移动数据进行天真微调则会引发对源工作站的灾难性遗忘。为了弥合这一跨领域差距，我们提出了RoHIL，一种不使用额外与真实机器人交互的离线微调框架。RoHIL结合了（i）基于世界模型的图像重照器，在多个虚拟HDRI环境下重新合成源-工作站轨迹的视觉流，使动作和奖励保持真实;（ii）照明-保留回放（IRR），一种数据级防遗忘机制，将重照灯适应过渡与原始光线保留过渡交错，以保持源-工作站的贝尔曼覆盖;以及（iii）锚定的Bellman-actor规范化器，用于限制表示和策略偏离原始源工作站策略。在四个真实机器人操作任务中，面对显著的跨工作站照明变化，RoHIL显著提升了标准HIL-RL崩溃的光线转移性能，同时保持源工作站性能，无需为每个新工作站和环境重新收集数据和重新训练。项目页面：此 https URL

JAXenstein: Accelerated Benchmarking for First-Person Environments

JAXenstein：第一人称环境的加速基准测试

Authors: Ruo Yu Tao, George Konidaris
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.19926
Pdf link: https://arxiv.org/pdf/2605.19926
Abstract The progression of reinforcement learning algorithms has been driven by challenging benchmarks. The rate at which a researcher can iterate on a problem setting directly impacts the speed of algorithm development. Modern machine learning has produced tools that allow for fast and scalable algorithm development, such as the JAX library. With the availability of these tools, a serious bottleneck in algorithm development is the availability of large and complex domains for experimentation. Most notably, the JAX reinforcement learning ecosystem does not have any benchmarks that test visual first-person tasks; these domains are crucial for testing both exploration and an agent's ability to overcome partial observability. We introduce JAXenstein: an open-source JAX-based benchmark that implements the Wolfenstein 3D rendering engine for fast and scalable experimentation in visual first-person tasks. JAXenstein is several times faster than comparable vision-based benchmarks, and is easily extensible to more complex first-person domains.
中文摘要 强化学习算法的发展由挑战性的基准推动。研究者在问题设置上迭代的速率直接影响算法开发的速度。现代机器学习已经开发出能够快速且可扩展算法开发的工具，比如JAX库。随着这些工具的出现，算法开发中的一个严重瓶颈是实验领域庞大且复杂的存在。最显著的是，JAX强化学习生态系统没有任何测试视觉第一人称任务的基准测试;这些域对于测试探索能力以及智能体克服部分可观测性的能力至关重要。我们介绍JAXenstein：一个基于JAX的开源基准测试，实现了Wolfenstein 3D渲染引擎，实现了视觉第一人称任务中的快速且可扩展的实验。JAXenstein 的速度是类似基于视觉的基准测试的数倍，并且易于扩展到更复杂的第一人称领域。

Safe Deep Reinforcement Learning for Spacecraft Reorientation with Pointing Keep-Out Constraint

带指向排除约束的航天器重新定向的安全深度强化学习

Authors: Juntang Yang, Mohamed Khalil Ben-Larbi
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.19967
Pdf link: https://arxiv.org/pdf/2605.19967
Abstract This paper implements deep reinforcement learning (DRL) with a safety filter for spacecraft reorientation control with a single pointing keep-out zone. A new state space representation is designed, which includes a compact representation of the attitude constraint zone. A reward function is formulated to achieve the control objective while enforcing the attitude constraint. The soft actor-critic (SAC) algorithm is adopted to handle continuous state and action spaces. A curriculum learning approach is implemented for agent training. To guarantee compliance with the attitude constraint, a control barrier function (CBF)-based safety filter is implemented for agent deployment. Simulation results demonstrate the effectiveness of the proposed state space representation and the designed reward function. Monte Carlo simulations show that reward shaping alone cannot guarantee safety during reorientation maneuvers. In contrast, with the CBF-based safety filter, constraint satisfaction can be guaranteed during maneuvers.
中文摘要 本文实现了深度强化学习（DRL），并带有安全滤波器用于航天器重新定向控制，并设有单一指向禁区。设计了一个新的状态空间表示，包含姿态约束区的紧致表示。奖励函数被构造以实现控制目标，同时强制执行姿态约束。软演员-批评者（SAC）算法被采用来处理连续状态和动作空间。为代理培训实施了课程学习方法。为确保姿态约束的合规性，代理部署时实现了基于控制障碍函数（CBF）的安全滤波器。模拟结果证明了所提状态空间呈现和设计奖励函数的有效性。蒙特卡洛模拟强调，仅靠奖励塑造无法保证重新定向动作中的安全性。相比之下，基于CBF的安全滤镜可以在机动时保证约束。
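
A CBF-based safety filter usually replaces the agent's action with the closest action that satisfies a barrier condition of the form a + b . u >= 0, where a and b are computed from the barrier function and the dynamics at the current state. With a single constraint, as a single keep-out zone would suggest, this minimum-norm correction has a closed form. The sketch below assumes that generic textbook setting. The paper's actual barrier function, dynamics terms, and actuator limits are not given in the abstract, and the names `cbf_filter`, `a`, and `b` are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(u_nom: Sequence[float], a: float, b: Sequence[float]) -> List[float]:
    """Closest control to u_nom (Euclidean norm) satisfying a + b . u >= 0."""
    margin = a + sum(bi * ui for bi, ui in zip(b, u_nom))
    if margin >= 0.0:
        return list(u_nom)  # nominal (RL) action is already safe
    b_sq = sum(bi * bi for bi in b)
    if b_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any control")
    scale = -margin / b_sq  # smallest push along b that restores the margin
    return [ui + scale * bi for ui, bi in zip(u_nom, b)]


print(cbf_filter([1.0, 0.0], 0.5, [1.0, 0.0]))   # [1.0, 0.0]  (unchanged)
print(cbf_filter([-2.0, 1.0], 1.0, [1.0, 0.0]))  # [-1.0, 1.0] (moved to the boundary)
```

The filter leaves safe actions untouched and intervenes only at the constraint boundary, so the learned policy still governs the maneuver away from the keep-out zone.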

A conceptual framework for learning to listen by reward: Curiosity-driven search for novel sources

通过奖励学习倾听的概念框架：好奇心驱动的新颖资源搜索

Authors: Andreas Triantafyllopoulos, Jakub Šťastný, Alexios Terpinas, Tianyi Liu, Yuanqi Wang, Björn W. Schuller
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2605.19984
Pdf link: https://arxiv.org/pdf/2605.19984
Abstract Reinforcement learning is a powerful learning paradigm that has spearheaded progress in numerous domains. Its core promise lies in learning through high-level goals without the need for granular labels. However, it still remains elusive in the realm of audio, where it has received substantially less attention than in computer vision or other domains. The key question remains: how can agents learn to listen purely via reward-driven exploration? In this contribution, we present an overview of previous attempts and a new conceptual framework for learning to listen by reward. Our approach depends on the continuous search for novel sound sources. We formulate our framework, discuss open technical challenges, and present a first proof-of-concept implementation that showcases the feasibility of our approach.
中文摘要 强化学习是一种强大的学习范式，在多个领域引领了进步。其核心承诺在于通过高层次目标学习，无需细节标签。然而，在音频领域，它仍然难以捉摸，远不如计算机视觉或其他领域受到关注。关键问题依然存在：代理如何通过以奖励为驱动的探索来纯粹学会倾听？在本篇文章中，我们概述了以往的尝试，并提出了一个通过奖励学习倾听的新概念框架。我们的方法依赖于持续寻找新颖的声音源。我们制定框架，讨论开放的技术挑战，并展示首个概念验证实现，展示我们方法的可行性。

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

CogOmniControl：通过创造性意图认知实现推理驱动的可控视频生成

Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.19995
Pdf link: https://arxiv.org/pdf/2605.19995
Abstract Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential for planning specific evaluators and enable Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carries genuine rather than simulated creative intent. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: this https URL
中文摘要 近期的扩散模型在视频生成方面实现了强烈的写真真实感和流畅性，但在抽象、稀疏或复杂条件下仍然脆弱，导致在专业制作流程如分镜草图和粘土渲染条件下表现不佳。现有的视频生成模型要么通过适配器注入条件，要么在扩散骨干中耦合通用视觉语言模型（VLM），导致能力缺口，无法生成符合用户创作意图的视频。我们介绍CogOmniControl，一个以推理为驱动的框架，将可控视频生成分解为创造性意图认知和生成。具体来说，我们利用真实的动漫制作数据训练专门的CogVLM。与通用VLM相比，它能生成更专业、更清晰的输出，能够从稀疏和抽象的条件下准确识别用户的创造意图，并将这些线索调整为密集的推理输出。此外，CogOmniDiT通过上下文生成统一了来自各种条件的控制，并通过强化学习与CogVLM推理输出对齐。此外，利用CogVLM在指导视频生成方面的强大能力，我们释放其在规划特定评审者的潜力，并为生成的视频提供最佳选择。这种集成将整个框架转变为闭环的“束缚式”架构。我们还介绍了CogReasonBench和CogControlBench，这些数据基于专业工作流程数据，这些数据带有真实的创意意图，而非模拟的。两个基准测试的实验显示，CogOmniControl 超越了现有的开源模型。项目网站：这个 https URL

GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

GeoX：通过自玩和可验证奖励掌握地理空间推理

Authors: Kyeongjin Ahn, Seungeon Lee, Krishna P. Gummadi, Meeyoung Cha
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20006
Pdf link: https://arxiv.org/pdf/2605.20006
Abstract Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program to produce a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated data samples. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.
中文摘要 地理空间推理需要解决场景复杂空间结构上的图像基础问题。然而，开发这一能力受限于注释庞大且组合化的问题空间的成本。我们提出了GeoX，一种自我游戏框架，通过可执行程序获取空间逻辑，产生可验证的奖励，无需依赖大规模人机整理的数据。给定卫星或航拍图像，我们的框架采用单一多模态策略，将空间问题作为可执行程序提出，并通过三种推理模式——推理、推理和归纳于空间原语和图像理解工具——解决。验证者执行每个程序以传递奖励信号，通过强化学习共同优化两个角色。GeoX其基础VLM平均提升高达5.5分，与基于数百万个精选数据训练的传统基线相当甚至超过。除了所提方法外，我们还发布了通过自我游戏积累的地理空间理解基准。

When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System

当批评者意见不合：RIS辅助无线控制系统中的自适应奖励中毒攻击

Authors: Deemah H. Tashman, Soumaya Cherkaoui
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20037
Pdf link: https://arxiv.org/pdf/2605.20037
Abstract Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards, particularly when the SAC dual critics exhibit substantial disagreement, especially in high-leverage, high-uncertainty states, resulting in distorted value estimations and guiding the policy towards suboptimal actions. Our findings demonstrate that DGRP substantially diminishes the performance improvements typically provided by RIS and degrades transmission quality. We further investigate key attack parameters and determine their impact on learning. In comparison to periodic-timing and exploration-triggered baselines, DGRP consistently causes greater damage, highlighting the necessity of considering disagreement-aware threats when evaluating the robustness of Deep Reinforcement Learning (DRL) in RIS-assisted networks.
中文摘要 奖励投毒攻击对基于学习的无线控制系统构成重大风险。为此，我们提出了一种针对软演员-评论家（SAC）智能体的分歧引导奖励投毒（DGRP）自适应攻击。在由可重构智能表面（RIS）辅助的认知无线电网络（CRN）环境中，SAC智能体的任务是通过同时优化次级用户（SU）发射机的发射功率和RIS相移，最大化次级用户的长期速率。DGRP会篡改奖励，尤其是在SAC的双评论家出现显著分歧时（特别是在高杠杆、高不确定性的状态下），从而导致价值估计失真，并将策略引向次优动作。我们的研究结果表明，DGRP显著削弱了RIS通常带来的性能提升，并降低了传输质量。我们还进一步研究了关键攻击参数，并确定了它们对学习的影响。与周期性定时基线和探索触发基线相比，DGRP始终造成更大的损害，这凸显了在评估RIS辅助网络中深度强化学习（DRL）的鲁棒性时考虑分歧感知威胁的必要性。
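
For robustness evaluation, the disagreement-gated trigger described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the corruption rule, so the fixed `penalty`, the `threshold` gate, and the function name are hypothetical, and the absolute difference between the two critic estimates is used as a simple stand-in for "disagreement".

```python
def disagreement_gated_reward(reward: float, q1: float, q2: float,
                              threshold: float, penalty: float) -> float:
    """Return the reward the learner would observe under a gated perturbation.

    The reward is perturbed only when the two critic estimates q1 and q2
    differ by more than `threshold`; otherwise it passes through unchanged.
    Both `threshold` and `penalty` are hypothetical parameters.
    """
    disagreement = abs(q1 - q2)
    if disagreement > threshold:
        return reward - penalty
    return reward
```

Gating on critic disagreement means the perturbation is applied in only a subset of steps, which is the property a defender would want to monitor for when auditing reward streams.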

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

奖励信念，而非动作：面向长时程智能体的一致性引导信用分配

Authors: Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.20061
Pdf link: https://arxiv.org/pdf/2605.20061
Abstract Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: this https URL.
中文摘要 基于可验证奖励的强化学习（RLVR）是一种有前景的范式，用于提升大型语言模型（LLM）智能体在长时程交互任务中的表现。然而，在部分可观测环境中，不完整的观测会导致智能体的信念随时间漂移，而延迟奖励则掩盖了中间决策的因果影响，加剧了时间信用分配的难题。为此，我们提出了ReBel（Reward Belief），一种过程级强化学习算法，它显式建模结构化信念状态，以总结交互历史并指导后续策略学习。ReBel引入了信念一致性监督，将预测信念与观测反馈之间的差异转换为稠密的自监督信号，无需外部逐步标注或验证器。它还采用信念感知分组来比较相似信念状态下的轨迹，从而获得更稳健且方差更低的优势估计。我们在具有挑战性的长时程基准（包括ALFWorld和WebShop）上评估ReBel。与回合级基线GRPO相比，ReBel将任务成功率最多提升20.4个百分点，并将样本效率提升至2.1倍。这些结果表明，信念感知的自监督是在部分可观测条件下实现可靠长时程决策的一个有前景的方向。代码可通过此 https URL 获取。
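
The belief-aware grouping idea in the ReBel abstract can be sketched as a grouped baseline. This is a simplified illustration, not the paper's algorithm: it assumes each trajectory carries a discrete, hashable `belief_key` and treats "similar belief states" as exact key matches, whereas the actual method may use a softer notion of similarity.

```python
from collections import defaultdict
from statistics import mean
from typing import Hashable, List, Tuple


def belief_grouped_advantages(
    trajectories: List[Tuple[Hashable, float]],
) -> List[float]:
    """Compute advantages relative to the mean return of each belief group.

    `trajectories` is a list of (belief_key, episode_return) pairs.
    Each return is compared only against returns that share its belief key
    (an assumed simplification of "similar belief states").
    """
    groups = defaultdict(list)
    for belief_key, episode_return in trajectories:
        groups[belief_key].append(episode_return)

    baselines = {key: mean(returns) for key, returns in groups.items()}
    return [ret - baselines[key] for key, ret in trajectories]
```

Comparing a trajectory only against others that started from a comparable belief removes return differences caused by the situation itself, which is one intuition for why the abstract reports lower-variance advantage estimates.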

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

基于强化学习的Text-to-SPARQL生成：一种在DBLP上基于GRPO的方法

Authors: Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.20066
Pdf link: https://arxiv.org/pdf/2605.20066
Abstract Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
中文摘要 知识图谱问答旨在将自然语言问题转化为知识图谱上的可执行查询，但现有方法通常依赖大型模型，或依赖以标准（gold）查询标注形式提供的完全监督。本研究考察了使用基于结果的奖励的强化学习能否训练一个小型指令微调语言模型，使其在学术领域执行零样本Text-to-SPARQL生成。研究在DBLP-QuAD上将组相对策略优化（GRPO）应用于Qwen3-1.7B模型，所用提示将自然语言问题与关于实体和关系的符号提示相结合。训练依赖于执行反馈、结构约束和答案级奖励，另有一个变体引入了基于标准查询的奖励塑形。所得模型与未经修改的零样本基线以及有监督的DoRA微调基线进行比较，比较维度包括答案级准确率、执行准确率、按类别划分的得分以及对留出模板的泛化能力。GRPO相较零样本基线有显著提升，并展现出有竞争力的泛化能力，而有监督的DoRA微调在相同模型规模下取得了更高的总体准确率。消融分析表明，基于执行的奖励贡献了大部分收益，额外的奖励塑形带来的增益有限，这表明当没有标准查询可用于词元级监督时，基于结果的强化学习是一种可行的训练策略。
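
As a rough illustration of how outcome-based rewards feed a GRPO-style update, the sketch below combines three binary signals into a scalar reward and then normalizes rewards within a group of sampled queries. The reward weights (0.2, 0.3, 0.5), the function names, and the use of the population standard deviation are assumptions for illustration; the abstract does not specify the paper's reward values or normalization details.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(well_formed: bool, executes: bool, answer_correct: bool) -> float:
    """Combine structural, execution, and answer-level signals.

    The weights are hypothetical and chosen only so that the total is 1.0.
    """
    return 0.2 * well_formed + 0.3 * executes + 0.5 * answer_correct


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    All rewards in `rewards` are assumed to come from completions sampled
    for the same prompt. `eps` avoids division by zero when they are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sampled query in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason graded execution and structure rewards can help early in training.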

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

并非所有评分标准的教学效果都相同：面向RLVR的策略感知评分标准奖励

Authors: Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.20164
Pdf link: https://arxiv.org/pdf/2605.20164
Abstract Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
中文摘要 在正确性可以自动检查的情况下，基于可验证奖励的强化学习使后训练变得非常有效。然而，许多重要的模型行为需要同时满足多个定性标准。基于评分标准（rubric）的奖励针对这一情形，对特定于提示的各项标准进行评分，并将其聚合为一个标量奖励。然而，标准的静态聚合方式将某项标准由人类赋予的重要性与其当前作为优化信号的有用性混为一谈。我们证明这一假设在评分标准强化学习中并不成立：许多重要标准已经饱和或目前无法达到，而能够区分不同rollout的标准未必是人类权重最大的那些。我们提出了POW3R，一种策略感知的评分标准奖励框架，它保留人类权重和类别平衡作为评分标准目标，同时在训练过程中调整标准级的奖励权重。POW3R利用rollout级的对比来强调当前能够区分策略输出的标准，使GRPO奖励更具信息量，同时不改变底层的评估目标。在横跨多模态和纯文本设置的两个数据集上，针对三个基础策略，POW3R在30项基础策略/指标比较中赢得了24项，相比使用评分标准奖励的原始GRPO，同时提升了平均评分标准奖励和严格完成率（即回复满足每一项必需评分标准的提示所占的比例），并以少2.5至4倍的训练步数达到相同的平台期。因此，评分标准奖励应当区分最终答案中应当重视的内容与能够教导当前策略的内容。
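
One way to picture rollout-level contrast is to scale each criterion's human weight by how much its score varies across the rollouts for a prompt. The sketch below is an assumed simplification and not the POW3R formula: it uses the population variance as the contrast measure, multiplies it by the human weight, renormalizes, and falls back to the normalized human weights when no criterion separates the rollouts.

```python
from statistics import pvariance
from typing import List, Sequence


def contrast_weights(scores: Sequence[Sequence[float]],
                     human_weights: Sequence[float]) -> List[float]:
    """Reweight rubric criteria by how well they separate the current rollouts.

    scores[i][j] is the score of rollout i on criterion j (for example 0 or 1).
    A criterion that every rollout passes (saturated) or every rollout fails
    (currently unreachable) has zero variance and therefore zero weight.
    """
    n_criteria = len(human_weights)
    contrast = [pvariance([row[j] for row in scores]) for j in range(n_criteria)]
    raw = [w * c for w, c in zip(human_weights, contrast)]
    total = sum(raw)
    if total == 0:
        # No criterion distinguishes the rollouts: fall back to human weights.
        human_total = sum(human_weights)
        return [w / human_total for w in human_weights]
    return [r / total for r in raw]
```

In this toy form the human weights still determine the relative emphasis among criteria that do separate the rollouts, while saturated and unreachable criteria stop diluting the training signal.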

Keyword: diffusion policy

There is no result