Arxiv Papers of Today

Generated: 2026-05-08 17:27:04 (UTC+8); Arxiv release time: 2026-05-08 20:00 EDT (2026-05-09 08:00 UTC+8)

71 relevant papers today

Keyword: reinforcement learning

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Authors: Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.05226
Pdf link: https://arxiv.org/pdf/2605.05226
Abstract The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

Topology-Driven Anti-Entanglement Control for Soft Robots

Authors: Haoyang Le, Shengxuan Wang, Mohan Chen, Shuo Feng
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05236
Pdf link: https://arxiv.org/pdf/2605.05236
Abstract Soft robots play an increasingly prominent role in precision manufacturing within complex, constrained environments, and anti-entanglement control based on multi-agent reinforcement learning has become an active research topic. A core open problem is coordinating multiple robots to complete unwinding operations in highly constrained environments. Existing distributed training frameworks face observability challenges in environments with dense obstacles and unstable conditions, which leads to poor learning results. This paper proposes a topology-driven Multi-Agent Reinforcement Learning (TD-MARL) framework that coordinates multi-robot systems to avoid entanglement. Specifically, the critic network is trained centrally, so that each agent can perceive the strategies of the other agents through a shared topological state, which alleviates the training instability caused by complex interactions; distributed execution removes the need for inter-robot communication resources and improves system reliability; and an integrated topological safety layer uses topological invariants to accurately assess and mitigate entanglement risk, keeping the policy from getting stuck in local difficulties. Finally, extensive experiments in a realistic simulation environment show that the method outperforms current advanced deep reinforcement learning (DRL) methods in convergence and anti-entanglement performance.

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

Authors: David Leeftink, Max Hinne, Marcel van Gerven
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05373
Pdf link: https://arxiv.org/pdf/2605.05373
Abstract A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.
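
For reference, the Pontryagin minimum principle mentioned in this abstract can be stated in its standard textbook form as below. The symbols are generic notation assumed here rather than taken from the paper: x is the state, u the control, \lambda the co-state, \ell the running cost, and f the dynamics. The paper's own formulation may differ.

```latex
\begin{aligned}
H(x, u, \lambda) &= \ell(x, u) + \lambda^{\top} f(x, u), \\
\dot{x}(t) &= \frac{\partial H}{\partial \lambda} = f\bigl(x(t), u(t)\bigr), \\
\dot{\lambda}(t) &= -\frac{\partial H}{\partial x}\bigl(x(t), u(t), \lambda(t)\bigr), \\
u^{*}(t) &= \arg\min_{u} H\bigl(x(t), u, \lambda(t)\bigr).
\end{aligned}
```

Read this way, the abstract's claim is that the recurrent hidden state plays the role of \lambda and the readout layer plays the role of the final minimization over u.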

Two-Stage Learned Decomposition for Scalable Routing on Multigraphs

Authors: Filip Rydin, Morteza Haghir Chehreghani, Balázs Kulcsár
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05389
Pdf link: https://arxiv.org/pdf/2605.05389
Abstract Most neural methods for Vehicle Routing Problems (VRPs) are limited to Euclidean settings or simple graphs. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade-offs (e.g., distance vs time). Few methods are designed for such formulations and those that do exist face major scalability issues. We mitigate these scalability issues via a Node-Edge Policy Factorization (NEPF) approach, which splits the routing policy into a node permutation stage and an edge selection stage. To enable the decomposition, we introduce a pre-encoding edge aggregation scheme and a non-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state-of-the-art in terms of solution quality, while being significantly faster in training and inference.
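
The node/edge split described in this abstract can be illustrated with a toy sketch. Everything here is an assumption for illustration only: the multigraph layout, the names `select_edges` and `time_weight`, and the greedy weighted-cost rule, which stands in for the learned edge-selection policy. The node order is likewise a fixed stand-in for the output of the learned node permutation stage.

```python
from typing import Dict, List, Tuple

# Hypothetical multigraph: (u, v) -> list of parallel edges as (distance, time).
Edge = Tuple[float, float]
MultiGraph = Dict[Tuple[int, int], List[Edge]]


def select_edges(order: List[int], graph: MultiGraph, time_weight: float) -> List[int]:
    """Stage 2 stand-in: for a fixed node order, pick one parallel edge per leg.

    A greedy weighted-cost rule replaces the learned edge policy here.
    """
    choices = []
    legs = zip(order, order[1:] + order[:1])  # close the tour back to the start
    for u, v in legs:
        costs = [d + time_weight * t for d, t in graph[(u, v)]]
        choices.append(costs.index(min(costs)))  # index of the cheapest parallel edge
    return choices


graph: MultiGraph = {
    (0, 1): [(4.0, 10.0), (6.0, 5.0)],  # short-but-slow vs. long-but-fast
    (1, 2): [(3.0, 3.0)],
    (2, 0): [(5.0, 9.0), (8.0, 2.0)],
}
node_order = [0, 1, 2]  # stage 1 stand-in: a fixed permutation of the nodes

print(select_edges(node_order, graph, time_weight=0.0))  # distance only
print(select_edges(node_order, graph, time_weight=1.0))  # distance plus time
```

Once the node order is fixed, each leg's edge choice is independent in this toy objective, which is why the second stage does not need to be autoregressive here.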

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

Authors: Mahyar Alinejad, Yue Wang, Amrit Singh Bedi, George Atia
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05478
Pdf link: https://arxiv.org/pdf/2605.05478
Abstract Transfer learning in reinforcement learning (RL) seeks to accelerate learning in new tasks by leveraging knowledge from related sources. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations through three components: (i) deterministic finite automata generated from natural language task descriptions using large language models, (ii) semantic embedding-based aggregation of multiple source policies weighted by cross-task similarity, and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains spanning resource management, navigation, and control, LANTERN achieves 40-60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources. These results demonstrate that multi-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings.
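
Component (ii), similarity-weighted aggregation of source policies, might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's method: cosine similarity, the softmax weighting, the `temperature` value, and the names `cosine` and `aggregate` are all hypothetical choices, and the embeddings are made-up two-dimensional vectors.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def aggregate(
    target_emb: List[float],
    source_embs: List[List[float]],
    source_action_probs: List[List[float]],
    temperature: float = 0.1,
) -> Tuple[List[float], List[float]]:
    """Mix source action distributions using softmax weights over task similarity."""
    sims = [cosine(target_emb, e) for e in source_embs]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_actions = len(source_action_probs[0])
    mixed = [
        sum(w * probs[a] for w, probs in zip(weights, source_action_probs))
        for a in range(n_actions)
    ]
    return weights, mixed


weights, mixed = aggregate(
    target_emb=[1.0, 0.0],
    source_embs=[[1.0, 0.1], [0.0, 1.0]],      # first source is closely aligned
    source_action_probs=[[0.9, 0.1], [0.2, 0.8]],
)
print(weights, mixed)
```

A poorly aligned source receives a weight near zero, which is one simple way to get the robustness to poorly aligned sources that the abstract reports.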

Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

Authors: Dillon Sandhu, Ronald Parr
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05481
Pdf link: https://arxiv.org/pdf/2605.05481
Abstract We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.

Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

Authors: Nandiraju Gireesh, Yuanliang Ju, He Wang
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.05544
Pdf link: https://arxiv.org/pdf/2605.05544
Abstract Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal: near contact events the agent needs short chunks for reactive control, while during free-space motion long chunks provide better credit assignment. The natural solution is to train critics for several chunk sizes and select the best one at each state, but naive comparison of learned critic values systematically collapses to the shortest chunk due to discount-scale mismatch, and degrades to noise in low-value states. We propose Adaptive Q-Chunking (AQC), which resolves both failures by comparing the advantage of each chunk size relative to a per-horizon baseline, normalized by the discount factor. This criterion converts biased wrong answers into unbiased near-random choices when no genuine signal exists, and becomes discriminative when a particular scale enables better planning. We prove theoretical bounds on the advantage selector's noise immunity and on the value dominance of adaptive chunking over any fixed chunk size. We demonstrate that AQC achieves state-of-the-art offline and online success rates on OGBench and Robomimic, and can be applied to enhance the performance of large-scale VLA models that predict action sequences, significantly boosting performance on RoboCasa-GR1 tasks.

SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

Authors: Hyobin Park, Taeseop Kim, Dong-Geol Choi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05546
Pdf link: https://arxiv.org/pdf/2605.05546
Abstract Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scientific literature is more challenging: the relationships among multi-modal elements within and across documents are rarely made explicit in text, which makes automatic generation of relational reasoning questions difficult and weakens the reliability of reward signals. We propose SPARK (Self-Play with Asymmetric Reward from Knowledge Graphs), a framework that automatically constructs a unified knowledge graph (KG) from multi-document scientific literature and uses it as the structural basis for self-play. KG paths over multimodal nodes serve as a source for generating relational reasoning questions, and structured facts stored in the KG provide a basis for verifiable reward computation. A single small vision-language model (sVLM) alternates between Proposer and Solver roles under information asymmetry against a fixed KG, a design that we believe can be naturally extended toward online adaptation in future work. We evaluate SPARK on public benchmarks and a self-constructed cross-document multi-hop QA dataset. Results show that SPARK consistently outperforms flat-corpus-based self-play baselines, and the performance gap widens as hop count increases, suggesting that KG-structure grounding contributes to relational multi-hop reasoning beyond what unstructured corpus grounding can provide.

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Authors: Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05566
Pdf link: https://arxiv.org/pdf/2605.05566
Abstract Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
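
The two mechanisms in this abstract can be sketched in a few lines. The first function shows why an all-fail group yields zero advantages under group-wise reward standardization; the second shows a prompt-space perturbation in the spirit of LoPE. The vocabulary list, the prefix length, the separator, and the names `group_relative_advantages` and `perturb_prompt` are assumptions for illustration; the abstract does not specify how the paper assembles or sizes its prefixes.

```python
import random
from statistics import mean, pstdev

# Assumed, abbreviated Lorem Ipsum vocabulary (the paper's list is not given here).
LOREM_VOCAB = [
    "lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
    "adipiscing", "elit", "sed", "do", "eiusmod", "tempor",
]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def perturb_prompt(prompt, n_words, rng):
    """Prepend a randomly assembled Lorem Ipsum prefix to the prompt."""
    prefix = " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))
    return f"{prefix}\n\n{prompt}"


# All rollouts fail: every advantage is zero, so the query gives no gradient signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

# Resample the same question behind a task-irrelevant prefix.
print(perturb_prompt("Solve x + 1 = 2.", n_words=8, rng=random.Random(0)))
```

The question text itself is left unchanged; only the context in front of it varies between resampling attempts.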

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

Authors: Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05626
Pdf link: https://arxiv.org/pdf/2605.05626
Abstract Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
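
To make the reported metrics concrete, the sketch below computes recall and a Missed Intervention Rate from per-turn SPEAK/SILENT labels. The abstract does not define MIR, so the definition used here is an assumption: the fraction of warranted SPEAK turns on which the model stayed silent, which makes MIR equal to 1 minus recall. That reading is roughly in line with the reported figures, but the paper may define it differently. The function name `speak_metrics` and the example labels are made up.

```python
def speak_metrics(y_true, y_pred):
    """Precision, recall, and (assumed) MIR for the SPEAK class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == "SPEAK" and p == "SPEAK" for t, p in pairs)
    fn = sum(t == "SPEAK" and p == "SILENT" for t, p in pairs)
    fp = sum(t == "SILENT" and p == "SPEAK" for t, p in pairs)
    warranted = tp + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / warranted if warranted else 0.0,
        # Assumed definition: missed warranted interventions / all warranted ones.
        "mir": fn / warranted if warranted else 0.0,
    }


y_true = ["SPEAK", "SPEAK", "SILENT", "SPEAK", "SILENT", "SPEAK"]
y_pred = ["SPEAK", "SILENT", "SILENT", "SILENT", "SPEAK", "SPEAK"]
print(speak_metrics(y_true, y_pred))
```

An over-conservative model shows up here as a high `mir` alongside an acceptable `precision`, which is the failure pattern the abstract attributes to SFT-only training.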

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

Authors: Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.05680
Pdf link: https://arxiv.org/pdf/2605.05680
Abstract This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.

Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

Authors: Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2605.05711
Pdf link: https://arxiv.org/pdf/2605.05711
Abstract Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at this https URL.

LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing

Authors: Hao Guo, Kaixiang Xv, Ziwu Ge, Lei Yang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.05727
Pdf link: https://arxiv.org/pdf/2605.05727
Abstract Collaborative edge computing uses edge nodes in different locations to execute tasks, necessitating dynamic task offloading decisions to maintain low latency and high reliability, especially under unpredictable node failures. Although deep reinforcement learning (DRL) and large language models (LLMs) have shown promise for task offloading, DRL often suffers from high sample inefficiency and local optima, whereas LLMs struggle with real-time decision-making. To address these limitations, we propose LeDRL, a hybrid decision framework that couples a lightweight LLM with self-attention-enhanced DRL for real-time task offloading. LeDRL constructs structured, context-aware prompts capturing node status, task semantics, and link dynamics to derive high-level strategy priors. These are selectively processed by a self-attention-based alignment module for context-aware policy optimization. A reflective evaluator distills semantic feedback from past trajectories to guide future prompts, enabling more informative and temporally generalizable LLM queries. Extensive experiments show that LeDRL outperforms baselines in task success rate, convergence speed, and real-time responsiveness across diverse network scales, achieving over 17% improvement in success rate. Furthermore, we deploy LeDRL on Jetson-based edge devices using our prototype system CoEdgeSys, demonstrating its robustness and feasibility under resource constraints. Our code is available at: this https URL.

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback

Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Arxiv link: https://arxiv.org/abs/2605.05739
Pdf link: https://arxiv.org/pdf/2605.05739
Abstract Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff's $\alpha = 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $\rho = 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen's $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.
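
Two pieces of this abstract lend themselves to a quick check. The first function reproduces the stated relative MAPE reduction from the two reported values. The second is a purely hypothetical form for the judge-derived penalty: the abstract says only that deficient per-dimension scores become a penalty term added to the SAC reward, so the hinge below a threshold, the linear `weight`, the score scale, and the name `shaped_reward` are all assumptions.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def shaped_reward(base_reward, dim_scores, threshold, weight):
    """Hypothetical shaping: penalize each dimension scoring below `threshold`."""
    penalty = sum(max(0.0, threshold - score) for score in dim_scores.values())
    return base_reward - weight * penalty


# One-day MAPE goes from 0.61% to 0.54%, as reported in the abstract.
print(round(relative_reduction(0.61, 0.54) * 100, 1))  # percent, one decimal place

# Made-up judge scores: only "routing" falls below the assumed threshold of 4.0.
scores = {"routing": 3.0, "risk_calibration": 5.0}
print(shaped_reward(1.0, scores, threshold=4.0, weight=0.1))
```

Dimensions at or above the threshold contribute nothing, so the penalty targets only the behaviors the judges flagged as deficient.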

Confidence is the key: how conformal prediction enhances the generative design of permeable peptides

Authors: Laura van Weesep, Sunay Chankeshwara, Leonardo De Maria, Florian David, Ola Engkvist, Gökçe Geylan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05770
Pdf link: https://arxiv.org/pdf/2605.05770
Abstract Generative models coupled with reinforcement learning (RL), such as REINVENT and PepINVENT, have emerged as a powerful framework for de novo molecular design. During the ideation process these generative frameworks utilize various predictive models as part of the optimization objectives. However, the utility of the predictive models can be limited by their domain of applicability. When RL is used to explore the chemical space with predictive models, it can suggest molecules that lie outside the predictor's domain of applicability. As a result, the predictions may become less reliable, potentially steering designs into high reward but also high uncertainty chemical spaces. This is particularly pronounced for cyclic peptides which show therapeutic promise due to their modifiability and large interaction surfaces but are understudied compared to small molecules. While passive membrane permeation in cyclic peptides has attracted interest, identifying optimal permeable designs remains challenging yet crucial for targeting intracellular sites. We present an RL-guided generative framework that designs permeable cyclic peptides using an uncertainty-aware permeability predictor as the scoring component. To address predictive uncertainty, especially impacted by novel chemistry, we integrate conformal prediction (CP) as our uncertainty quantification method. CP assesses designs based on the calibrated model under a user-defined confidence level. We demonstrate that rewarding generated peptides with CP-informed predictions improves both reliability and efficiency of peptide optimization process. This also discourages exploration outside the predictor's applicability domain. This approach bridges the gap between predictive uncertainty and RL-guided exploration, showing how generative modelling and conformal prediction can be combined for the first time.
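
As a rough illustration of scoring designs under a user-defined confidence level, the sketch below uses generic split (inductive) conformal prediction for a regression target. This is a stand-in, not the paper's pipeline: the abstract does not say which conformal variant or nonconformity score is used, and the reward rule, the threshold, the example values, and the names `conformal_quantile`, `conformal_interval`, and `cp_reward` are all hypothetical.

```python
import math


def conformal_quantile(cal_residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute calibration residuals."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this confidence level
    return sorted(cal_residuals)[k - 1]


def conformal_interval(prediction, q):
    """Symmetric prediction interval around a point prediction."""
    return prediction - q, prediction + q


def cp_reward(prediction, q, threshold):
    """Hypothetical scoring rule: reward only if the whole interval clears the threshold."""
    lower, _ = conformal_interval(prediction, q)
    return 1.0 if lower >= threshold else 0.0


# Made-up absolute residuals |y - y_hat| on a held-out calibration set.
residuals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_quantile(residuals, alpha=0.2)  # 80% confidence level

print(q)
print(cp_reward(-5.0, q, threshold=-6.0))  # confidently above the threshold
print(cp_reward(-5.5, q, threshold=-6.0))  # point estimate passes, interval does not
```

The second design has a passing point prediction but an interval that crosses the threshold, so it earns no reward. That is one simple way a generator can be discouraged from high-reward, high-uncertainty regions.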
中文摘要 生成模型结合强化学习（RL），如REINVENT和PepINVENT，已成为新分子设计的强大框架。在构思过程中，这些生成框架利用各种预测模型作为优化目标的一部分。然而，预测模型的实用性可能受限于其适用范围。当使用强化学习（RL）通过预测模型探索化学空间时，它可以推荐出超出预测变量适用范围的分子。因此，预测可能变得不那么可靠，可能导致设计进入高回报但高不确定性的化学空间。这在环肽中尤为明显，它们因可调节性和大相互作用面而展现出治疗潜力，但与小分子相比研究不足。虽然环状肽中的被动膜渗透引起了关注，但确定最佳渗透设计仍具挑战性，但对于靶向细胞内位点至关重要。我们提出了一个基于强化学习的生成框架，设计具有不确定性感知的通透性预测变量作为评分成分的环状肽。为了解决预测不确定性，尤其是受新化学影响，我们采用了共形预测（CP）作为不确定性量化方法。CP基于校准模型在用户定义置信水平下评估设计。我们证明，用CP为基础的预测奖励生成肽，可以提升肽优化过程的可靠性和效率。这也阻碍了对预测变量适用范围之外的探索。该方法弥合了预测不确定性与强化学习引导探索之间的鸿沟，首次展示了生成建模与共形预测如何结合。
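
As a rough illustration of how a conformal interval could feed a reward, here is a minimal split-conformal sketch in Python. It is not the authors' implementation: the function names, the absolute-residual score, and the hard-threshold reward are all assumptions made for illustration only.

```python
import math


def conformal_half_width(calibration_residuals, confidence):
    """Return a split-conformal interval half-width.

    calibration_residuals: absolute errors |y - y_hat| on held-out data.
    confidence: target coverage in (0, 1), e.g. 0.9.
    """
    n = len(calibration_residuals)
    # Finite-sample corrected rank; clipped to n as a simplification
    # (strictly, the interval is unbounded when the rank exceeds n).
    rank = min(n, math.ceil((n + 1) * confidence))
    return sorted(calibration_residuals)[rank - 1]


def cp_informed_reward(prediction, half_width, threshold):
    """Hypothetical reward: 1.0 only if the interval's lower end clears the threshold."""
    lower_bound = prediction - half_width
    return 1.0 if lower_bound >= threshold else 0.0
```

Under this reading, a design with a high point prediction but a wide interval earns no reward, which is one way to discourage exploration outside the predictor's applicability domain.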

A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration

自适应数据拟合Q迭代的测度论有限样本理论

Authors: Manuel Haussmann, Mustafa Mert Çelikok, Melih Kandemir
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05791
Pdf link: https://arxiv.org/pdf/2605.05791
Abstract While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error sources of a learning process; (ii) deterministic error propagation analysis addresses the approximation error via concentrability coefficients without a finite-sample analysis of the estimation error; and (iii) PAC generalization bounds characterize the estimation errors of simplified topologies. We bridge these traditions with a unified theoretical framework for fitted Q-iteration (FQI) on general measurable Borel spaces. Our main result provides a finite-sample, adaptive-data performance bound by chaining measure-theoretic probability with Bellman-operator contraction in Banach spaces. We prove that sequential Rademacher complexity controls Bellman-regression generalization under policy-dependent data collection. We further extend this analysis to provide the first cumulative, pathwise online regret guarantee for FQI in continuous spaces. These results lay the necessary foundations for the formal analysis of many modern deep RL algorithms.
中文摘要 虽然强化学习（RL）有望彻底革新复杂非线性机器人系统的控制，但模型无策略深度强化学习的启发式成功与其基础理论之间存在深刻差距，后者主要局限于表格或可线性化的环境。我们将这一差距归因于三种传统的涌现孤立：（i）测度论MDP基于一般空间的基础，将分析限制在精确动态规划，忽视学习过程中的所有误差源;（ii）确定性误差传播分析通过选光系数解决近似误差，而无需对估计误差进行有限样本分析;以及（iii） PAC 推广界限描述简化拓扑的估计误差。我们用一个统一的拟合Q迭代（FQI）理论框架连接这些传统，适用于一般可测的Borel空间。我们的主要结果通过在巴拿赫空间中将测度论概率与贝尔曼算子收缩链对应，提供了有限样本、自适应数据的性能约束。我们证明了顺序Rademacher复杂度控制着策略依赖数据收集下的Bellman回归泛化。我们进一步扩展该分析，提供了连续空间中FQI的首个累积、路径在线遗憾保证。这些结果为许多现代深度强化学习算法的形式化分析奠定了必要基础。
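
For readers unfamiliar with fitted Q-iteration, the sketch below shows its simplest special case in Python: a tabular setting where the least-squares regression step reduces to averaging Bellman targets per state-action pair. The paper treats general Borel spaces and adaptive data; none of that is captured here, and the transition tuple layout is an assumption.

```python
from collections import defaultdict


def fitted_q_iteration(transitions, actions, gamma=0.9, iterations=50):
    """Tabular FQI: repeatedly regress Q onto one-step Bellman targets.

    transitions: list of (state, action, reward, next_state, done) tuples.
    """
    q = defaultdict(float)
    for _ in range(iterations):
        sums, counts = defaultdict(float), defaultdict(int)
        for s, a, r, s_next, done in transitions:
            bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
            sums[(s, a)] += r + gamma * bootstrap
            counts[(s, a)] += 1
        # In the tabular case the least-squares fit is the per-pair mean target.
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q
```

For a two-step chain with a terminal reward of 1 and `gamma=0.9`, the first state's value converges to 0.9 after two iterations.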

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

使用行为树和大型语言模型（LLM）进行组合任务的奖励塑造和动作掩蔽

Authors: Nicholas Potteiger, Ankita Samaddar, Taylor T. Johnson, Xenofon Koutsoukos
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05795
Pdf link: https://arxiv.org/pdf/2605.05795
Abstract Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking; however, none of these approaches fully addresses reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop the masking reward behavior tree (MRBT), a symbolic structure used as a reactive and modular reward and action mask function. We design an MRBT template and derive logical specifications to construct and verify MRBTs for a sequence of object-interaction subtasks. Further, we develop an automated pipeline that uses an LLM to generate MRBTs robust to varying task objects, an SMT solver to verify correctness of specifications, and a neurosymbolic RL loop to train agents on compositional tasks. Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking. We further highlight three advantages of MRBTs: transferability, modularity, and verifiability.
中文摘要 将复杂任务分解为一系列更简单的子任务，可以提高自主智能体的学习效率。强化学习（RL）可用于优化代理策略以完成子任务，但需要明确定义的子任务奖励和动作掩蔽带来的收益。近期工作利用大型语言模型（LLMs）自动化奖励塑造和动作掩蔽，但这些模型尚未完全解决对子任务失败的反应性以及组合任务中对不同对象的模块化问题。为克服这些挑战，我们开发了掩蔽奖励行为树（MRBT），这是一种符号结构，作为反应性且模块化的奖励与行动掩蔽功能。我们设计MRBT模板，并推导逻辑规范，用于构造和验证一系列对象交互子任务的MRBTs。此外，我们开发了一套自动化流水线，利用大型语言模型生成对不同任务对象具有鲁棒性的MRBTs，SMT求解器验证规范正确性，以及神经符号强化学习循环用于训练代理完成组合任务。实验证明，5个MRBT的生成和完善持续提升，相较基线和无动作掩蔽的MRBT持续提升训练效率和任务成功率。我们还进一步强调了MRBT的三大优势：可转移性、模块化性和可验证性。
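
Action masking, which the abstract relies on, can be sketched generically in Python as a softmax restricted to allowed actions. This is a standard construction and not the MRBT itself; in the paper the mask would come from the behavior tree, whereas here it is just a list of booleans supplied by the caller.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits where mask[i] is True for allowed actions.

    Disallowed actions receive exactly zero probability.
    """
    allowed = [l for l, m in zip(logits, mask) if m]
    if not allowed:
        raise ValueError("at least one action must be allowed")
    top = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because masked actions get zero probability, the policy can never sample them, regardless of how large their raw logits are.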

Unified Value Alignment for Generative Recommendation in Industrial Advertising

工业广告生成式推荐的统一价值对齐

Authors: Xinxun Zhang, Yuling Xiong, Jiale Zhou, Zhengkai Guo, Zhennan Pang, Junbang Huo, Jingwen Wang, Xuyang Sun, Enming Zhang, Jiaguang Jin, Changping Wang, Yi Li, Jun Zhang, Xiao Yan, Jiawei Jiang, Jie Jiang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.05803
Pdf link: https://arxiv.org/pdf/2605.05803
Abstract Generative Recommendation (GR) reformulates recommendation as a next-token generation problem and has shown promise in industrial applications. However, extending GR to industrial advertising is non-trivial because the system must optimize not only user interest but also commercial value. Existing GR pipelines remain largely semantics-centric, making it difficult to align value signals across tokenization, decoding, and online serving. To address this issue, we propose UniVA, a Unified Value Alignment framework for advertising recommendation. We first introduce a Commercial SID tokenizer that injects value-related attributes into SID construction, yielding value-discriminative item representations. We then develop a Generation-as-Ranking SID Decoder jointly optimized by supervised learning and eCPM-aware reinforcement learning, which fuses value scores into next-item SID generation to perform generation and ranking in one decoding process. Finally, we design a value-guided personalized beam search that reuses generation-as-ranking logits as online value guidance and applies a personalized trie tree to constrain decoding to request-valid SID paths. Experiments on the Tencent WeChat Channels advertising platform show that UniVA achieves a 37.04% improvement in offline Hit Rate@100 over the baseline and a 1.5% GMV lift in online A/B tests.
中文摘要 生成式推荐（GR）将推荐重新表述为下一词元（next-token）生成问题，并在工业应用中展现出潜力。然而，将GR扩展到工业广告并非易事，因为系统不仅要优化用户兴趣，还要优化商业价值。现有的GR流程仍主要以语义为中心，这使得价值信号难以在词元化、解码和在线服务之间对齐。为解决这一问题，我们提出了UniVA，一个面向广告推荐的统一价值对齐框架。我们首先引入商业SID分词器，将价值相关属性注入SID构建，从而得到具有价值判别性的物品表示。随后，我们开发了一种由监督学习和eCPM感知强化学习联合优化的“生成即排序”SID解码器，将价值评分融入下一物品SID生成，在一次解码过程中同时完成生成和排序。最后，我们设计了一种价值引导的个性化束搜索，复用“生成即排序”的logits作为在线价值引导，并应用个性化trie树将解码限制在对当前请求有效的SID路径上。在腾讯微信视频号广告平台上的实验显示，UniVA的离线Hit Rate@100比基线提升了37.04%，在线A/B测试的GMV提升了1.5%。
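
The trie-constrained beam search mentioned in the abstract can be illustrated with a small Python sketch. This is a generic construction, not UniVA's decoder: `score_fn` is a hypothetical stand-in for the model's per-token value-guided score, and SID paths are represented as plain tuples of integers.

```python
def build_trie(valid_paths):
    """Build a nested-dict trie from an iterable of token tuples."""
    root = {}
    for path in valid_paths:
        node = root
        for token in path:
            node = node.setdefault(token, {})
    return root


def allowed_next_tokens(trie, prefix):
    """Return the set of tokens that keep the prefix on a valid path."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node)


def constrained_beam_search(trie, score_fn, depth, beam_width):
    """Beam search that only expands tokens permitted by the trie."""
    beams = [((), 0.0)]
    for _ in range(depth):
        candidates = [
            (prefix + (token,), score + score_fn(prefix, token))
            for prefix, score in beams
            for token in allowed_next_tokens(trie, prefix)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

Every returned beam is guaranteed to be a prefix of some valid path, since expansion never leaves the trie.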

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

长视野Q学习：通过n步不等式实现的准确价值学习

Authors: Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05812
Pdf link: https://arxiv.org/pdf/2605.05812
Abstract Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
中文摘要 非策略、基于价值的强化学习方法如Q学习具有吸引力，因为它们可以从任意经验中学习，包括由旧策略或其他代理收集的数据。然而，在实际操作中，自助法使长视野学习变得脆弱：后期状态的估计误差会通过时间差分（TD）更新向后传播，并可能随时间累积。我们提出了长视野Q学习（LQL），在学习最优动作值函数时引入了针对复利误差的原则性备份。LQL基于先验的最优性紧缩观察：任何实现的动作序列都会限制最优策略在预期中能达到的下界，因此提前采取最优行动不应比在切换到最优行为前遵循观察到的动作多步更差。我们的贡献是通过利用铰链损耗惩罚违反这些界限的行为，将这种不等式转化为Q学习的实用稳定机制。重要的是，LQL利用已生成的TD误差网络输出计算这些惩罚，无需辅助网络，也无需相对于Q学习进行额外的前向传递。结合多种在线和离线到在线基准测试的先进方法，LQL在相似运行时持续优于1步TD和n步TD学习。
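
One way to read the n-step inequality in the abstract is: Q(s_t, a_t) should be at least the discounted sum of the next n observed rewards plus the discounted best Q-value at step t+n. The Python sketch below computes that lower bound and a hinge penalty on its violation. The function names and the scalar interface are assumptions; the paper's actual loss (weighting, batching, handling of stochastic transitions) may differ.

```python
def n_step_lower_bound(rewards, gamma, max_q_at_horizon):
    """Discounted n-step return plus the bootstrapped value at the horizon."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * max_q_at_horizon


def hinge_violation(q_start, rewards, gamma, max_q_at_horizon):
    """Penalty that is positive only when Q(s_t, a_t) falls below the bound."""
    bound = n_step_lower_bound(rewards, gamma, max_q_at_horizon)
    return max(0.0, bound - q_start)
```

With rewards `[1.0, 1.0]`, `gamma=0.5`, and a horizon value of 4.0, the bound is 2.5, so a current estimate of 2.0 incurs a penalty of 0.5 while an estimate of 3.0 incurs none.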

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

AGPO：面向可验证推理与京东搜索广告相关性的非对称组策略优化

Authors: Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting, Ming Pang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05826
Pdf link: https://arxiv.org/pdf/2605.05826
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
中文摘要 带可验证奖励的强化学习（RLVR）在提升大型语言模型（LLMs）推理性能方面取得了显著成功。然而，最新研究显示，虽然现有RLVR方法提高了对正确路径的采样效率，但并未激发出根本上全新的推理模式。相反，训练后模型的推理能力边界通常比其基础模型更窄，基础模型在大采样量下能达到更高的覆盖率。在本研究中，我们提出了非对称组策略优化（AGPO）来抵消这种边界收缩。AGPO采用以负向为主的强化策略来抑制错误的推理路径，从而保持基础模型的探索能力。对于正向强化，AGPO采用组优势机制，基于组内方差缩放正向更新，使模型能够专注于罕见的正确路径，同时抑制来自平凡路径的更新。我们在五个数学基准上的实验表明，AGPO取得了最先进的准确率，同时在大采样规模下持续提升pass@$k$性能。在搜索广告相关性优化的大规模工业应用中，AGPO有效提升了数据标注的质量，从而显著提升了下游学生模型的性能。
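
The abstract reports pass@$k$ without defining it. A commonly used unbiased estimator, given n samples of which c are correct, is 1 - C(n-c, k) / C(n, k). The Python sketch below implements that formula; whether the authors use exactly this estimator is an assumption.

```python
import math


def pass_at_k(num_samples, num_correct, k):
    """Unbiased pass@k estimate from num_samples draws with num_correct successes."""
    if not 0 < k <= num_samples:
        raise ValueError("k must satisfy 0 < k <= num_samples")
    # math.comb returns 0 when k exceeds num_samples - num_correct,
    # which makes the estimate exactly 1.0 in that case.
    failures_only = math.comb(num_samples - num_correct, k)
    return 1.0 - failures_only / math.comb(num_samples, k)
```

The estimate is the probability that a random size-k subset of the n samples contains at least one correct answer, which is why coverage at large k can favor a base model that produces rare correct paths.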

Measuring Learning Progress via Gradient-Momentum Coupling

通过梯度-动量耦合测量学习进展

Authors: Samuel Blad, Martin Längkvist, Amy Loutfi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05856
Pdf link: https://arxiv.org/pdf/2605.05856
Abstract Measuring learning progress is essential for curiosity-driven exploration in reinforcement learning, but widely used signals such as prediction error often fail to distinguish meaningful, learnable patterns from random noise. This paper proposes Gradient-Momentum Coupling (GMC), a signal derived from optimization dynamics that quantifies how useful each sample's gradient is for ongoing learning by measuring its per-parameter normalized absolute product with the momentum from previous gradients. By leveraging momentum's natural filtering of noise and oscillations, GMC identifies samples that contribute to ongoing parameter updates. Controlled experiments demonstrate noise robustness and emergent curriculum learning, with the signal prioritizing tasks by learning speed rather than difficulty. Experiments on MiniGrid suggest that replacing prediction error with GMC within existing curiosity-driven architectures can improve robustness to observation noise.
中文摘要 测量学习进展对于强化学习中基于好奇心的探索至关重要，但广泛使用的信号如预测误差往往无法区分有意义且可学习的模式与随机噪声。本文提出了梯度-动量耦合（GMC），这是一种基于优化动力学的信号，通过测量每个样品的标准化绝对积与前一梯度动量，量化每个样本梯度对持续学习的有用性。通过利用动量对噪声和振荡的自然滤波，GMC识别出有助于持续参数更新的样本。受控实验展示了噪声鲁棒性和自发课程学习，信号通过学习速度而非难度优先处理任务。MiniGrid上的实验表明，在现有的好奇心驱动架构中用GMC替代预测误差，可以提高对观测噪声的鲁棒性。
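
The abstract describes GMC only as a "per-parameter normalized absolute product" of a sample's gradient with the momentum. The Python sketch below shows one possible reading: the sum of elementwise absolute products, divided by the product of the two vector norms. The exact normalization in the paper is not stated in the abstract, so this formula and the function names are assumptions.

```python
import math


def update_momentum(momentum, gradient, beta=0.9):
    """Exponential moving average of past gradients."""
    return [beta * m + (1.0 - beta) * g for m, g in zip(momentum, gradient)]


def coupling_score(gradient, momentum, eps=1e-12):
    """Assumed GMC-style score in [0, 1]: overlap of |gradient| and |momentum|."""
    product = sum(abs(g * m) for g, m in zip(gradient, momentum))
    g_norm = math.sqrt(sum(g * g for g in gradient))
    m_norm = math.sqrt(sum(m * m for m in momentum))
    return product / (g_norm * m_norm + eps)
```

Under this reading, a sample whose gradient touches the same parameters the optimizer has been moving scores close to 1, while a gradient concentrated on parameters with no accumulated momentum scores 0.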

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

托卡马克旋转剖面控制的离线强化学习

Authors: Rohit Sonker, Hiro Josep Farre Kaga, Jiayu Chen, Andrew Rothstein, Ian Char, Ricardo Shousha, Egemen Kolemen, Jeff Schneider
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.05857
Pdf link: https://arxiv.org/pdf/2605.05857
Abstract Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stability, confinement, and transport. While the average rotation can be controlled, controlling the full profile is challenging due to high dimensionality, response to multiple actuators, and dependence on plasma conditions. Learning-based control methods, such as reinforcement learning (RL), provide a potential solution to this challenging problem, with the ability to model complex interactions leading to effective multi-input multi-output control. However, learning such policies is challenging due to the lack of accurate simulators that can model the rotation profile dynamics. In this work, we investigate the use of offline RL and offline model-based RL algorithms for rotation profile control, training them solely on historical data from the DIII-D tokamak. Our final method uses probabilistic models of plasma dynamics to generate rollouts for RL training. We deploy this policy on the DIII-D tokamak and observe promising real-world results. We conclude by highlighting key challenges and insights from training and deploying an RL policy on a complex physical device while using only limited past data.
中文摘要 托卡马克仍然是实现实用聚变能源的领先候选者，但这些装置内部许多重要的控制问题仍然困难或未解决。其中一个挑战是控制等离子体旋转曲线，这对稳定性、约束和运输有着强烈影响。虽然平均旋转可以控制，但由于高维度、对多个执行器的响应以及对等离子体条件的依赖，控制完整轮廓具有挑战性。基于学习的控制方法，如强化学习（RL），能够模拟复杂交互，从而实现有效的多输入多输出控制，为这一挑战性问题提供了潜在解决方案。然而，由于缺乏能够模拟旋转曲线动态的精确模拟器，学习此类策略具有挑战性。本研究研究利用离线强化学习和基于离线模型的强化学习算法进行旋转曲线控制，仅基于DIII-D托卡马克的历史数据进行训练。我们的最终方法利用等离子体动力学的概率模型生成强化学习训练的展开。我们将该政策应用于DIII-D托卡马克，并观察到有希望的实际效果。最后，我们强调了在复杂物理设备上训练和部署强化学习策略时所面临的关键挑战和见解，同时仅使用有限的过去数据。

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

SOPE：基于先前数据稳定在线强化学习的非策略评估

Authors: Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05863
Pdf link: https://arxiv.org/pdf/2605.05863
Abstract Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.
中文摘要 将先前数据纳入在线强化学习可以加快培训进程，但通常会在高计算成本和漫长、多阶段的培训流程之间做出艰难权衡。虽然固定长度稳定阶段的计算效率远高于静态更新计划，但需要任务相关的手动调优，存在浪费先前知识或严重过拟合的风险。为此，我们提出了SOPE算法，该算法利用演员对齐的非策略策略评估（OPE）信号作为自动提前停止机制，动态控制离线训练阶段的长度。通过根据当前政策行动分配对批评者进行未完成验证分配的评估，SOPE在分配外福利饱和时立即停止梯度更新，消除了手动调整计划的需求。SOPE基于Minari基准测试套件中的25个连续控制任务进行评估，基线性能提升了高达45.6%，同时将所需的TFLOPs降低了最多22倍，从而在采样效率与计算效率之间取得平衡。这些发现表明，自适应、以评估为驱动的更新计划比依赖静态、穷尽的更新计划更有效。
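
The idea of stopping an offline phase once a validation signal saturates can be sketched with a generic patience rule in Python. This is not SOPE's actual criterion: the abstract does not specify the stopping rule, so `patience`, `min_delta`, and the plain list of scores standing in for the actor-aligned OPE signal are all assumptions.

```python
def offline_phase_length(validation_signal, patience=3, min_delta=1e-3):
    """Return how many updates run before the signal stops improving.

    validation_signal: list of held-out scores, one per update (higher is better).
    """
    best = float("-inf")
    stale = 0
    for step, value in enumerate(validation_signal, start=1):
        if value > best + min_delta:
            best = value
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step  # stop: no improvement for `patience` updates
    return len(validation_signal)
```

A rule of this kind replaces a hand-tuned, fixed-length stabilization phase with a length that adapts to each task.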

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

思考，然后评分：视频奖励建模中的解耦推理与评分

Authors: Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wang, Kuien Liu, Xiang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.05922
Pdf link: https://arxiv.org/pdf/2605.05922
Abstract Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
中文摘要 生成式视频模型的最新进展越来越多地受到训练后和测试时间缩放的驱动，而这两者都极度依赖于视频奖励模型（RM）的质量。理想的奖励模型应能准确预测符合人类偏好的奖励，适用于多种场景。然而，现有范式面临一个根本困境：\textit{判别有效值}回归直接奖励多模态大型语言模型（MLLMs）提取的特征，且没有显式推理，使其容易走捷径学习，并高度依赖大规模数据尺度来泛化。相比之下，带有思维链（CoT）推理的\textit{生成RM}展现出更优的解释性和泛化潜力，因为它们利用细粒度语义监督内化人类偏好背后的理据。然而，由于推理和评分耦合在单一的自回归推理链中，它们存在固有的优化瓶颈。为了利用CoT推理的泛化优势，同时减轻耦合推理和评分的训练不稳定性，我们引入了DeScore，一种训练高效且可推广的视频奖励模型。DeScore采用解耦的“思考后评分”范式：MLLM先生成显式CoT，随后是一个专用的判别评分模块，该模块由可学习的查询标记和预测最终奖励的回归头组成。DeScore通过两阶段框架进行优化：（1）采用判别式冷启动，采用随机掩罩机制以确保稳健的评分能力;（2）双目标强化学习阶段，独立优化CoT推理质量并校准最终奖励，确保高质量推理直接转化为更优的模型性能。

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

Near-Policy：通过异步生成和选择性打包加速在策略蒸馏

Authors: Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.05940
Pdf link: https://arxiv.org/pdf/2605.05940
Abstract Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the $\Delta$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $\Delta$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Code will be released soon.
中文摘要 自回归模型的标准知识蒸馏常常存在分布不匹配的问题。虽然政策化方法通过利用学生生成的输出来缓解这一问题，但它们依赖计算量高的强化学习（RL）框架。为提高效率，我们提出了近政策蒸馏（NPD）方法，这是一种异步方法，将学生生成与培训脱钩。这种重新表述使得带序列填充的监督微调（SFT）成为可能。然而，异步更新不可避免地会引入策略延迟和采样噪声，导致行为从近策略向非策略偏移。为了在不牺牲效率的前提下抵消这一问题，NPD集成了稀疏的学生更新和$\Delta$-IFD过滤机制，这是一种启发式样本选择机制，能够经验稳定优化轨迹。通过过滤极端分布外的样本，$\Delta$-IFD防止噪声主导梯度，确保更新保持在安全的近距离学习区内。实证上，NPD框架比政策基线快8.1倍，且比SFT高出8.09%。关键是，通过有效缩小后续强化学习的探索空间，我们的方法使openPangu-Embedded-1B达到68.73%的先进得分，优于远大于Qwen3-1.7B。代码将很快发布。
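
Sequence packing, which the abstract cites as the efficiency gain from decoupling generation and training, can be sketched in Python with a first-fit-decreasing heuristic. The abstract does not describe NPD's packing or selection rule, so this generic bin-packing routine and its names are assumptions.

```python
def pack_sequences(lengths, capacity):
    """Greedily pack sequence lengths into bins of at most `capacity` tokens.

    Uses first-fit decreasing: longest sequences are placed first, each into
    the first bin with enough remaining room.
    """
    bins = []  # each bin is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError("sequence longer than the packing capacity")
        for current in bins:
            if sum(current) + length <= capacity:
                current.append(length)
                break
        else:
            bins.append([length])
    return bins
```

Packing four sequences of lengths 5, 3, 4, and 2 into a capacity of 8 yields two bins instead of four padded rows, which is where the throughput benefit comes from.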

Foundation Twins: A New Generation of Power Systems Digital Twins using Foundation AI Models

Foundation Twins：基于基础人工智能模型的新一代电力系统数字孪生

Authors: Pedro P. Vergara
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.05952
Pdf link: https://arxiv.org/pdf/2605.05952
Abstract Power systems are inherently multi-timescale systems, with different physical phenomena and decision-making processes spanning multiple timescales, time horizons, and geographic scopes. I envision power systems digital twins (DTs) as powerful modeling and simulation tools that can accelerate and improve decision-making across different time scales and geographic scopes. However, until now, research has not delivered such a vision, and power systems DTs remain a concept distant from implementation. This is not a regular research paper. This is a position paper that outlines my vision for developing a new generation of power systems DTs that leverage recent advances in artificial intelligence (AI) and machine learning (ML). I call these Foundation Twins. Foundation Twins combines the generalization features of foundation models with the decision-making capabilities of reinforcement learning (RL) architectures to deliver the envisioned power systems DTs.
中文摘要 电力系统本质上是多时间尺度的系统，具有跨越多个时间尺度、时间视野和地理范围的不同物理现象和决策过程。我设想电力系统数字孪生（DT）作为强大的建模和仿真工具，能够加速和改善不同时间尺度和地理范围内的决策。然而，直到现在，研究尚未实现这一愿景，电力系统DT仍是一个遥远的概念，难以实现。这不是一篇普通的研究论文。这是一份立场文件，阐述了我开发新一代电力系统DT的愿景，这些系统利用了人工智能（AI）和机器学习（ML）的最新进展。我称这些为基础双胞胎。基础双子结合了基础模型的泛化特性与强化学习（RL）架构的决策能力，实现了设想中的电力系统DT。

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

超越均匀信用分配：面向RLVR的选择性资格迹

Authors: Chaoli Mou, Zhan Zhuang, Xinning Chen, Yu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05965
Pdf link: https://arxiv.org/pdf/2605.05965
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a "uniform credit assignment" assumption that indiscriminately broadcasts trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we initially introduce P-trace as a sample-efficient, critic-free eligibility traces method, upon which we build S-trace, implementing a sparse eligibility traces mechanism to further mitigate variance and achieve fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we contextualize the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace not only outperforms GRPO, showing gains of 0.49% on Qwen3-1.7B and 3.16% on Qwen3-4B, and maintaining a robust 2.98% improvement when scaled further to Qwen3-8B in average pass@16, but notably achieves this with simultaneously higher sample and token efficiency.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的关键方法。然而，广泛使用的无批评算法如群相对策略优化（GRPO）要求“统一学分分配”假设，这会无差别地传播轨迹层级优势，从而因未能区分关键推理步骤而阻碍学习效率。为解决这一限制，我们提出了选择性资格痕迹（S-trace）。基于部分信任区域保持的直觉，我们最初引入P-trace作为一种样本高效、无批评的资格追踪方法，在此基础上构建了S-trace，实施了稀疏资格追踪机制，进一步降低方差并通过选择性掩蔽低熵代币实现细粒度的信用分配。理论上，我们将近期的组序列政策优化（GSPO）方法置于无批评资格追踪框架内，将其视为统一信用分配下运行的资格追踪方法的特殊实例。实验表明，S-trace不仅优于GRPO，Qwen3-1.7B提升0.49%，Qwen3-4B提升3.16%;进一步扩展至Qwen3-8B平均pass@16时仍保持2.98%的强劲提升，且同时在采样和令牌效率上均有显著提升。
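
The contrast between uniform credit assignment and masking low-entropy tokens can be illustrated with a small Python sketch. It covers only the masking step: the eligibility-trace weighting of P-trace and S-trace is not described in enough detail in the abstract to reproduce, and the threshold parameter is an assumption.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_advantages(token_distributions, trajectory_advantage, entropy_threshold):
    """Broadcast the trajectory advantage only to tokens with high enough entropy.

    Tokens below the threshold are masked (advantage 0.0); uniform credit
    assignment would instead give every token the same advantage.
    """
    return [
        trajectory_advantage if token_entropy(p) >= entropy_threshold else 0.0
        for p in token_distributions
    ]
```

In this simplified view, near-deterministic tokens contribute no gradient, so updates concentrate on positions where the model actually made a choice.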

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

BehaviorGuard：深度强化学习的在线后门防御

Authors: Yinbo Yu, Xueyu Yin, Jiadai Wang, Chunwei Tian, Sai Xu, Qi Zhu, Daoqiang Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.05977
Pdf link: https://arxiv.org/pdf/2605.05977
Abstract Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks both in single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.
中文摘要 后门攻击对深度强化学习（DRL）构成严重威胁。当前的防御通常依赖奖励异常来逆向工程触发器和模型微调以消除后门。然而，复杂的扳机模式削弱了其坚固性，微调成本高昂，限制了实用性。因此，我们将防御关注转向触发无关的后门输出行为，并提出了BehaviorGuard，这是一个基于在线行为的DRL后门检测和缓解框架。具体来说，我们发现无论如何攻击，后门策略都会持续调整动作分布，以确保激活的可靠，即使在没有触发条件的情况下，也在高分位数区域和分布尾部留下可检测的痕迹。基于此，我们设计了一种新颖指标，捕捉动作分布中的行为漂移，以便在运行时识别并抑制后门动作。据我们所知，这是首个能够反制单代理和多代理日程学习攻击的在线后门防御。通过多种基准测试和不同的后门攻击进行评估，BehaviorGuard在效能和效率上始终超越以往方法。

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

4DThinker：用四维图像思考动态空间理解

Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.05997
Pdf link: https://arxiv.org/pdf/2605.05997
Abstract Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at this https URL.
中文摘要 单眼视频的动态空间推理对于连接视觉智能与物理世界至关重要，但视觉语言模型（VLM）仍面临挑战。以往的方法要么完全以文本形式口述时空推理，这对于复杂动力学来说本质上冗长且不够精确;要么依赖外部几何模块，增加推理复杂度，却不促进内在的模型能力。本文介绍了4DThinker，这是首个使VLM能够通过动态潜在心理意象“以4D思考”的框架，即内部模拟场景在连续隐秘空间中的演变。具体来说，我们首先引入了一种可扩展、无注释的数据生成流程，从原始视频中综合四维推理数据。随后，我们提出了动态影像微调（DIFT），它联合监督文本标记和四维潜能，以将模型扎根于动态视觉语义之中。基于此，4D强化学习（4DRL）通过基于结果的奖励进一步处理复杂的推理任务，将策略梯度限制为文本代币，以确保稳定的优化。多项动态空间推理基准测试的广泛实验表明，4DThinker始终优于强基线，为VLM中的4D推理提供了新的视角。我们的代码可在此 https URL 访问。

Optimal Transport for LLM Reward Modeling from Noisy Preference

基于噪声偏好进行LLM奖励建模的最优传输

Authors: Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Yuan Lu, Hao Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06036
Pdf link: https://arxiv.org/pdf/2605.06036
Abstract Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
中文摘要 奖励模型是人类反馈强化学习（RLHF）的基础，但现实世界的数据不可避免地会被噪声偏好所破坏。传统训练目标往往会对这些误差进行过度拟合，而现有的去噪方法往往依赖于同质噪声假设，无法捕捉语言偏好的复杂性。为应对这些挑战，我们提出了以最优运输为基础的SelectiveRM框架。我们首先设计了联合一致性差异，以使模型预测的分布与偏好数据保持一致。此外，为了解决严格质量守恒的限制，该限制迫使模型拟合离群值，我们采用了通过部分输运实现质量松弛机制。这使得能够自主排除具有噪声偏好且与语义一致性相矛盾的样本。理论上，我们证明了SelectiveRM优化了未观察到的清洁风险的更紧密上限。大量实验验证了我们的方法在多种基准测试中显著优于最先进基线。

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

基于新颖性的思维树搜索，用于LLM推理与规划

Authors: Leon Hamm, Zlatan Ajanovic
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06040
Pdf link: https://arxiv.org/pdf/2605.06040
Abstract Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible "paths" of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
中文摘要 尽管链思考、思维树或强化学习等进步提升了大型语言模型在推理和规划任务中的表现，但它们仍然脆弱，许多领域尚未达到人类水平，且常常面临较高的时间和代币成本。受基于宽度搜索在规划中的成功启发，我们探讨新颖性概念如何转移到语言领域，以及它如何提升思维树推理能力。思维树依赖于构建可能的“路径”，由连续的想法或想法组成。这些是通过反复提示LLM生成的。在我们的论文中，提出了一个可测量的新颖性概念，描述了一个新节点（思维）与搜索树中之前看到的节点的唯一性。新颖性通过提示LLM并利用预训练中嵌入的常识来估算。该指标可用于修剪分支，缩小搜索范围。虽然这种方法每个状态引入更多提示，但通过修剪和缩小整体树大小可以降低整体代币成本。该程序通过多个基于语言的规划和一般推理基准进行测试和比较。

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

羽必须聚集的请求：批量大小与前缀同质性在LLM推断中

Authors: Saksham Rathi, Preeti, Mythili Vutukuru
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06046
Pdf link: https://arxiv.org/pdf/2605.06046
Abstract Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches (where all requests share a common prefix) can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2-10× higher end-to-end throughput compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.
中文摘要 大型语言模型中的自回归令牌生成受内存限制，因为它需要“关注”所有先前令牌的键值张量（KV缓存）。先前的工作旨在通过批量处理多个请求，并在GPU内存限制下最大化批处理大小，从而提高该解码过程的效率。我们工作的关键观察是，在前缀共享工作负载中，较小且前缀同质的批次——所有请求共享共同前缀——由于KV缓存访问时空间和时间上的局域性更好，可以比更大、异构批处理获得更高的译码吞吐量。然而，先进推理引擎中的前缀感知调度器仅为了减少KV缓存内存占用，最大化批次重用，并未阻止在较小且同质的批次中形成，这些批次本可表现更好。此外，我们表明现有调度器中的共享前缀检测依赖于基数树遍历，导致CPU开销巨大，通常与GPU执行时间相当。本文介绍了Feather，一种前缀感知调度器，利用强化学习（RL）学习批量大小与前缀同质性之间的最佳权衡。我们还介绍了分块哈希树（CHT），这是一种轻量级数据结构，能够快速检测前缀并高效地选择强化学习调度器请求，避免昂贵的树遍历。我们将 Feather 集成到 vLLM 和 SGLang，评估显示 Feather 的端到端吞吐量比现有调度器高出 2-10 倍，且在工作负载前缀共享不足时的表现不比现状差。Feather通过减少KV缓存访问总数实现这些提升，超过了具有相同目标的前缀感知注意力核的性能。
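
The idea of forming prefix-homogeneous batches without walking a radix tree can be illustrated with a small Python sketch. This is not the paper's Chunked Hash Tree, whose structure the abstract does not describe. It only assumes that prompts are split into fixed-size token chunks and that a cumulative hash per chunk identifies the prefix up to that chunk. The names `CHUNK`, `chunk_hashes`, and `group_by_prefix` are hypothetical.

```python
from collections import defaultdict

CHUNK = 4  # hypothetical chunk size, in tokens


def chunk_hashes(tokens, chunk=CHUNK):
    """Return one cumulative hash per full chunk.

    hashes[i] depends on every token in tokens[:(i + 1) * chunk], so two
    requests with equal hashes[i] share that prefix (up to hash collisions).
    """
    hashes, running = [], ()
    for start in range(0, len(tokens) - chunk + 1, chunk):
        running = hash((running, tuple(tokens[start:start + chunk])))
        hashes.append(running)
    return hashes


def group_by_prefix(requests, min_chunks=1):
    """Bucket request ids whose first `min_chunks` chunks are identical.

    Requests shorter than `min_chunks` full chunks land in the `None` bucket.
    """
    groups = defaultdict(list)
    for req_id, tokens in requests.items():
        hs = chunk_hashes(tokens)
        key = hs[min_chunks - 1] if len(hs) >= min_chunks else None
        groups[key].append(req_id)
    return dict(groups)


requests = {
    "a": [1, 2, 3, 4, 9],
    "b": [1, 2, 3, 4, 7, 7],
    "c": [5, 6, 7, 8],
    "d": [1, 2],
}
batches = group_by_prefix(requests)
```

Each bucket is a candidate homogeneous batch. Whether to run a small bucket on its own or merge it into a larger heterogeneous batch is the tradeoff the abstract says Feather learns with RL, and that decision is outside this sketch.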

Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark

复杂纸牌游戏的因果强化学习：万智牌的基准测试

Authors: Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06066
Pdf link: https://arxiv.org/pdf/2605.06066
Abstract Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM-predicted intervention effects, and per-factor credit traces, making causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability first-class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal-world-model PPO variant, and an architecture-matched scalar control. We propose Causal Graph-Factored Advantage PPO (CGFA-PPO) as a reference causal agent that uses SCM parents of win probability as factor-aligned critic targets with an intervention-calibration loss. All comparisons use paired seeds, paired-bootstrap confidence intervals, and Holm-Bonferroni correction within pre-registered families. Masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline; per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference-baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG-Causal-RL gives causal-RL, world-model, and LLM-agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM-grounded policy auditability.
中文摘要 因果强化学习（RL）缺乏针对复杂系统的基准，这些系统结合了顺序决策、隐藏信息、大型掩蔽动作空间和显式因果结构。我们介绍MTG-Causal-RL，这是一个基于《万智牌》构建的Gymnasium基准测试，采用3077维部分观察、478次行动掩盖离散行动空间、五种竞争性标准原型、三种奖励方案，以及一个针对战略变量的手工定制结构因果模型（SCM）。每期节目都揭示了因果变量、SCM预测的干预效应和各因素的信用追踪，使因果信用分配、省略一除外的跨原型转移和政策可审计性指标成为一流的指标。我们采用一组参考基线：随机、启发式、掩蔽PPO、因果世界模型PPO变体和架构匹配标量对照。我们提出因果图分解优势PPO（CGFA-PPO）作为参考因果代理，使用胜利概率的SCM父细胞作为因子对齐的批判目标，并存在干预-校准损失。所有比较均使用配对种子、配对自举置信区间和Holm-Bonferroni校正，适用于预注册家族。蒙面PPO和CGFA-PPO的分发中获胜率达到竞争，并超过随机基线;每因子校准轨迹和“留一”转移间隙揭示了单纯标量胜率无法覆盖的诊断结构。我们公开发布基准测试、参考基线结果和完整评估方案。通过将战略丰富且部分观察的领域与明确的因果接口和统计协议相结合，MTG-因果-RL、世界模型和大型语言模型-代理研究提供了一个共享的测试平台，解决当前基准无法共同提出的问题：掩蔽行动空间下的因果归属分配、跨原型的结构转移以及基于SCM的政策审计性。

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Arena作为离线奖励：扩散模型的高效细粒度偏好优化

Authors: Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang, Qingyi Gu, Zhen Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.06070
Pdf link: https://arxiv.org/pdf/2605.06070
Abstract Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for an image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on a truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on the Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
中文摘要 来自人类反馈的强化学习（RLHF）有效促进文本与图像（T2I）扩散模型的偏好对齐。为了提高计算效率，直接偏好优化（DPO）——避免显式奖励建模——已被广泛研究。然而，其依赖二元反馈限制了对被拒绝对的粗粒度建模，导致优化效果不理想。本文提出了ArenaPO，利用Arena分数作为离线奖励，提供精细反馈，从而实现高效且细粒度的优化，而无需奖励模型。这使得ArenaPO既能享受传统RLHF的丰富回报，也能提升DPO的高效性。具体来说，我们首先构建一个模型 Arena，其中每个模型的能力以高斯分布表示，并通过遍历注释的两对偏好来推断这些能力。每个输出图像都被视为对应能力分布中的样本。然后，对于图像对，基于两种能力分布和观察到的两两偏好，利用基于截断正态分布的潜在变量推断估计绝对质量差距，该正态分布在训练过程中作为细粒度反馈。它不需要奖励模型，且可离线计算，因此不会增加额外的训练开销。我们在Pick-a-Pic v2和HPD v3数据集上进行了ArenaPO培训，显示ArenaPO始终优于现有基线数据。
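
The truncated-normal step can be made concrete with a small Python sketch. It is one plausible reading of the abstract, not the paper's exact estimator: it assumes the two image qualities are independent Gaussian draws, so their difference is Gaussian, and that observing "winner preferred over loser" truncates that difference to positive values. The function names are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_gap(mu_w, var_w, mu_l, var_l):
    """E[q_w - q_l | q_w > q_l] for independent Gaussian qualities.

    The difference d = q_w - q_l is N(mu, sigma^2) with mu = mu_w - mu_l and
    sigma^2 = var_w + var_l; the mean of d truncated to d > 0 is
    mu + sigma * pdf(mu / sigma) / cdf(mu / sigma).
    """
    mu = mu_w - mu_l
    sigma = math.sqrt(var_w + var_l)
    z = mu / sigma
    return mu + sigma * normal_pdf(z) / normal_cdf(z)


# Equal capabilities: the preference alone still implies a positive gap.
gap_equal = expected_gap(0.0, 0.5, 0.0, 0.5)
# Upset: the weaker model's image was preferred, so the implied gap is smaller.
gap_upset = expected_gap(-1.0, 0.5, 0.0, 0.5)
```

Under these assumptions the estimate depends only on the stored capability parameters and the preference label, which is consistent with the abstract's claim that the feedback can be computed offline.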

Milestone-Guided Policy Learning for Long-Horizon Language Agents

面向长期视野语言代理的里程碑引导策略学习

Authors: Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06078
Pdf link: https://arxiv.org/pdf/2605.06078
Abstract While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves a 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at this https URL.
中文摘要 虽然长期的智能体任务需要语言智能体执行数十个顺序决策，但用强化学习训练此类智能体仍然具有挑战性。我们识别出两个根本原因：信用错误归因，即正确早期操作因终端故障而被惩罚;以及样本低效，即稀少的成功轨迹导致几乎完全丢失学习信号。我们引入了一个里程碑引导的政策学习框架BEACON，利用长期任务的组成结构确保学分分配的精确。BEACON在里程碑边界划分轨迹，在分段内应用时间奖励塑形以认可部分进展，并在双重尺度上估算优势，以防止远方失误影响局部行动的评估。在ALFWorld、WebShop和ScienceWorld上，BEACON持续优于GRPO和GiGPO。值得注意的是，在长期ALFWorld任务中，BEACON实现了92.9%的成功率，几乎是GRPO的53.5%的两倍，同时有效样本利用率从23.7%提升至82.0%。这些结果确立了里程碑锚定学分分配作为训练长视野语言代理的有效范式。代码可在此 https URL 访问。

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD：通过结构化自我蒸馏增强视频推理

Authors: Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06094
Pdf link: https://arxiv.org/pdf/2605.06094
Abstract Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
中文摘要 由于序列层级奖励稀疏且缺乏细粒度的学分分配，复杂推理的VideoLLMs依然具有挑战性，尤其是在长时间、时间基础的推理轨迹中。虽然带可验证奖励的强化学习（RLVR）提供了可靠的监督，但它未能捕捉代币层级的贡献，导致学习效率低下。相反，现有的自我蒸馏方法提供密集的监督，但缺乏结构和诊断特异性，且常与强化学习相互作用不稳定。在本研究中，我们提出了VISD，一种结构化的自我蒸馏框架，为视频推理引入具有诊断意义的特权信息。VISD采用视频感知的评判模型，将推理质量分解为多个维度，包括答案正确性、逻辑一致性和时空基础，并利用这种结构化反馈指导教师代币级监督政策。为了稳定地将密集监督与强化学习整合，我们引入了方向幅度解耦机制，其中由奖励计算的推广层优势决定更新方向，而结构化特权信号调制代币级更新幅度。该设计实现语义对齐且细粒度的学分分配，提升推理的忠实度和培训效率。此外，VISD还结合了课程排程和基于EMA的教师稳定，支持对长视频序列的稳健优化。在多种基准测试上的实验显示，VISD始终优于强基线，提升了答案的准确性和时空基础质量。值得注意的是，VISD在优化步骤收敛速度上几乎快了2倍，凸显了结构化自我监督在提升视频大型语言模型性能和样本效率方面的有效性。

Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

超越自回归RTG：决策变换器中通过注入外部顺序建模进行条件化

Authors: Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li, Qirui Zheng, Xionghui Yang, Chucai Wang, Wenxin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06104
Pdf link: https://arxiv.org/pdf/2605.06104
Abstract Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes future rewards, containing far less information than typical state or action vectors, yet it consumes the same computational budget per token. Worse, the self-attention cost of Transformers grows quadratically with sequence length, so including RTG as a separate token adds unnecessary overhead. We propose SlimDT, which removes RTG from the autoregressive sequence. Instead, we inject RTG information into the state representations before the sequential modeling step, allowing the Transformer to process only a compact (state, action) sequence. This reduces the sequence length by one-third, directly improving inference efficiency. On the D4RL benchmark, SlimDT surpasses standard DT across various tasks and achieves performance comparable to existing state-of-the-art methods. Decoupling a sparse conditioning signal from an information-rich sequence thus yields both computational gains and higher task performance.
中文摘要 决策变换器（DT）将离线强化学习表述为自回归序列建模，通过预测一系列返回（RTG）、状态和动作令牌的动作，取得了有希望的结果。然而，RTG是一个标量，汇总未来奖励，包含的信息远少于典型的状态或动作向量，但每个代币消耗的计算预算相同。更糟的是，变形金刚的自我关注成本与序列长度成平方增长，因此将RTG作为独立代币增加了不必要的开销。我们提出了SlimDT，它将RTG从自回归序列中移除。相反，我们在顺序建模步骤前向状态表示注入RTG信息，使变换器只能处理紧凑的（状态、动作）序列。这使序列长度缩短了三分之一，直接提升了推理效率。在D4RL基准测试中，SlimDT在多项任务上超越标准DT，并实现与现有最先进方法相当的性能。因此，将稀疏调控信号与信息丰富的序列解耦，既能带来计算收益，也能提升任务性能。
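
The one-third reduction follows from simple token counting, and the quadratic attention cost makes the saving larger than the length reduction alone. The Python sketch below works through that arithmetic. It assumes a context of `timesteps` steps with three tokens per step (RTG, state, action) for standard DT and two tokens per step (state, action) once RTG is moved out of the sequence. The helper names are hypothetical, and the cost model counts only pairwise attention scores.

```python
from fractions import Fraction


def sequence_length(timesteps, tokens_per_step):
    """Number of tokens the Transformer attends over."""
    return timesteps * tokens_per_step


def attention_cost(length):
    """Pairwise attention scores per layer, proportional to length squared."""
    return length * length


timesteps = 20  # hypothetical context length, in environment steps

dt_len = sequence_length(timesteps, 3)    # (RTG, state, action) per step
slim_len = sequence_length(timesteps, 2)  # (state, action) per step

length_ratio = Fraction(slim_len, dt_len)
cost_ratio = Fraction(attention_cost(slim_len), attention_cost(dt_len))

print(f"sequence length: {dt_len} -> {slim_len} (ratio {length_ratio})")
print(f"attention cost ratio: {cost_ratio}")
```

Under this simplified cost model the self-attention term drops to 4/9 of its original value, independent of `timesteps`. Per-token costs such as the feed-forward layers scale linearly and drop to 2/3.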

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

调度与校准：基于工具的多任务强化学习，适用于代码大型语言模型

Authors: Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06111
Pdf link: https://arxiv.org/pdf/2605.06111
Abstract Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a Hierarchical Utility-Routed Data Scheduling module that hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data, and 2) an Adaptive Utility-Calibrated Policy Optimization module that dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.
中文摘要 带有可验证奖励的强化学习（RL）已被证明在编程后训练LLM中非常有效，但部署专门任务的专家会带来随任务数量增加而增加的成本，这促使统一的多任务RL（MTRL）方法得以实现。然而，现有的MTRL方法对所有编码任务都一视同仁，依赖固定数据课程和共享优化策略，最终限制了多任务训练的效果。为解决这些局限性，我们提出了ASTOR，这是一个通过uTility驱动的coORdination实现的多tASk代码强化学习框架。ASTOR以任务效用为中心，即捕捉每个任务学习潜力和跨任务协同的信号，包含两个耦合模块：1）分层实用路由数据调度模块，分层分配训练预算并优先处理信息提示，引导训练朝向最有价值的数据;2）自适应效用校准策略优化模块，动态扩展每个任务的基层级正则化，将更新约束匹配到当前训练状态。在两种广泛使用的LLM上，涵盖四种代表性编码任务的实验表明，ASTOR在所有任务中持续提升单一模型，表现优于最佳任务专属专家9.0%-9.5%，并以7.5%-12.8%的优势超越最强的MTRL基线。

Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

策略引导的逐步模型路由以实现成本效益推理

Authors: Wenwen Si, Insup Lee, Osbert Bastani
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06116
Pdf link: https://arxiv.org/pdf/2605.06116
Abstract Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.
中文摘要 推理时间计算极大提升了大型语言模型（LLMs）在复杂推理任务中的表现，但这种策略可能带来较高的推理成本。一种解决方案是将中间思维链（CoT）状态路由到不同规模的语言模型;然而，现有方法依赖于手工设计的路由策略，限制性能，或训练大型流程奖励模型，而这些模型在许多应用中可能不可行。我们将逐步模型路由构建为一个受限决策问题，通过训练一个小规模控制策略，结合强化学习和阈值校准来调整性能与效率权衡。我们在三个数学基准测试（GSM8K、MATH500和OmniMath）上验证了方法，涵盖开放和封闭模型。我们的方法在准确性与成本的权衡上持续提升，相较于手工制作的方法，同时实现与需要训练大型流程奖励模型的方法相当的权衡。
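
The role of a calibrated threshold in stepwise routing can be sketched in Python. This is an illustration under stated assumptions, not the paper's procedure: it assumes the control policy emits a scalar score per reasoning step (higher means the step more likely needs the large model), and that the threshold is chosen on held-out scores so that at most a given fraction of steps is sent to the large model. `calibrate_threshold` and `route` are hypothetical names.

```python
def calibrate_threshold(scores, max_large_fraction):
    """Pick a threshold so at most `max_large_fraction` of held-out steps
    have a score strictly above it (and are therefore routed to the large model).
    """
    ordered = sorted(scores)
    allowed_large = int(max_large_fraction * len(ordered))  # floor
    keep_small = len(ordered) - allowed_large
    if keep_small <= 0:
        return float("-inf")  # budget allows routing every step to the large model
    return ordered[keep_small - 1]


def route(score, threshold):
    """Send a step to the large model only when its score exceeds the threshold."""
    return "large" if score > threshold else "small"


# Hypothetical held-out scores from the control policy.
held_out = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 0.95]
tau = calibrate_threshold(held_out, max_large_fraction=0.5)
decisions = [route(s, tau) for s in held_out]
```

Sweeping `max_large_fraction` traces out an accuracy-cost curve, which is the kind of tradeoff the abstract says the method tunes.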

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

害虫思维者：通过强化学习学习，学会像昆虫学家一样思考和推理

Authors: Xueheng Li, Yu Wang, Tao Hu, Ji Huang, Ke Cao, Qize Yang, Rui Li, Jie Zhang, Chengjun Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.06121
Pdf link: https://arxiv.org/pdf/2605.06121
Abstract Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain's unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
中文摘要 害虫引起的作物损失对全球粮食安全和可持续农业发展构成重大威胁。尽管多模态大型语言模型（MLLM）的最新进展显示出视觉理解和智能农业的强大潜力，但由于该领域存在高度的物种间复杂性、物种内变异性以及专家注释数据稀缺等独特挑战，其直接应用于害虫识别的应用仍然有限。在本研究中，我们介绍了Pest-Thinker，一种知识驱动强化学习（RL）框架，使MLLM能够在细粒度害虫形态学上进行推理。我们首先构建了两个高清害虫基准，QFSD和AgriInsect，涵盖了多样的物种和专家注释的形态特征。利用这些数据集，我们综合了思维链（CoT）推理轨迹，通过监督微调（SFT）促进对害虫特异视觉线索的结构化学习。随后，我们采用了群相对策略优化（GRPO）和一种新颖的特征奖励，引导模型聚焦于可观察的形态学证据，并通过LLM作为评判者策略进行评估。大量实验表明，Pest-Thinker 显著提升了领域内外的形态学理解，标志着智能农业害虫分析迈向专家级视觉推理的一步。数据集和源代码在接受后可公开。

AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

AffectGPT-RL：揭示强化学习在开放词汇情感识别中的作用

Authors: Zheng Lian, Fan Zhang, Lan Chen, Yazhou Zhang, Rui Liu, Jinyang Wu, Haoyu Chen, Xiaobai Li, Xiaojiang Peng, Bin He, Jianhua Tao
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2605.06126
Pdf link: https://arxiv.org/pdf/2605.06126
Abstract Open-Vocabulary Multimodal Emotion Recognition (OV-MER) aims to predict emotions without being constrained by predefined label spaces, thereby enabling fine-grained emotion understanding. Unlike traditional discriminative methods, OV-MER leverages generative models to capture the full spectrum of emotions and employs emotion wheels (EWs) for metric calculation. Previous approaches primarily rely on token-level loss during training. However, this objective is misaligned with the metrics used in OV-MER, and these metrics cannot be directly optimized via gradient backpropagation. To address this limitation, we turn our attention to reinforcement learning, as this strategy can optimize non-differentiable objectives. We term this framework AffectGPT-RL. Furthermore, we conduct extensive experiments to elucidate the role of reinforcement learning in this task, revealing the necessity of the reasoning process, the impact of different rewards, and the generalizability to other emotion tasks such as sentiment analysis and basic emotion recognition. Experimental results demonstrate that AffectGPT-RL yields significant performance improvements on OV-MER. Beyond this task, we also achieve remarkable performance gains on basic emotion recognition, attaining state-of-the-art results on MER-UniBench. To the best of our knowledge, this is the pioneering work exploring the role of reinforcement learning in OV-MER, providing valuable guidance for subsequent researchers. Our code is provided in the supplementary material and will be released to facilitate future research.
中文摘要 开放词汇多模态情绪识别（OV-MER）旨在预测情绪，不受预设标签空间的限制，从而实现细致的情感理解。与传统的判别方法不同，OV-MER利用生成模型捕捉情绪的全谱，并采用情绪轮（EW）进行度量计算。以往的方法主要依赖训练中的代币级损失。然而，这一目标与OV-MER中使用的指标不一致，且这些指标无法通过梯度反向传播直接优化。为了解决这一限制，我们将注意力转向强化学习，因为该策略可以优化不可微目标。我们称这个框架为AffectGPT-RL。此外，我们进行了大量实验，阐明强化学习在该任务中的作用，揭示了推理过程的必要性、不同奖励的影响，以及其可推广到情感分析和基础情绪识别等其他情绪任务的应用性。实验结果表明，AffectGPT-RL在OV-MER上显著提升了性能。除了这项任务，我们还在基本情绪识别方面取得了显著的性能提升，在MER-UniBench上取得了最先进的效果。据我们所知，这是探索强化学习在OV-MER中作用的开创性工作，为后续研究者提供了宝贵指导。我们的代码已在补充材料中提供，并将发布以促进未来研究。

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

技能1：通过强化学习实现技能增强智能体的统一进化

Authors: Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06130
Pdf link: https://arxiv.org/pdf/2605.06130
Abstract A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
中文摘要 持久技能库允许语言模型代理在各任务中重复使用成功的策略。维护此类库需要三种耦合功能。代理人选择相关技能，执行时加以利用，并从经验中提炼出新技能。现有方法要么单独优化这些能力，要么用不同的奖励来源，导致部分且冲突的进化。我们提出了Skill1框架，该框架训练单一策略，共同进化技能选择、利用和提炼，朝着共同的任务-结果目标发展。该策略生成查询以搜索技能库，重新排序候选人以选择一个，解决基于该技能的任务，并从轨迹中提取新技能。所有学习都源自单一的任务-结果信号。其低频趋势特征选择和高频变异特征特征提取。在ALFWorld和WebShop上的实验显示，Skill1优于以往基于技能和强化学习的基线。训练动态证实了这三种能力的共同进化，消融显示移除任何信用信号会降低进化。

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

列表策略优化：基于群组的RLVR作为LLM响应单纯形的目标投影

Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06139
Pdf link: https://arxiv.org/pdf/2605.06139
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为大型语言模型（LLMs）训练后激励推理能力的标准方法。在现有配方中，基于群体的策略梯度较为普遍，它采样每个提示的一组响应，并通过群体相对优势信号更新策略。这项工作表明，这些优化策略共享一个共同的几何结构：每个策略都隐式定义响应单纯形上的目标分布，并通过一阶近似向该分布投影。基于这一见解，我们提出了列表策略优化（LPO）来显式进行目标投影，通过将近端强化学习目标限制在响应单纯形中，从而揭开隐含目标的神秘面纱，然后通过精确发散最小化来预测策略。该框架提供了（i）列表目标的单调改进，具有有界、零和自纠正梯度;（ii）通过解耦投影步骤，在散度选择上具有不同结构性质的灵活性。在多样化的推理任务和大型语言模型骨干上，LPO在匹配目标下持续提升典型政策梯度基线的训练表现，同时本质上保持优化稳定性和响应多样性。
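
The two ingredients the abstract refers to, group-relative advantages and a target distribution over the sampled responses, can be sketched in Python. The first function is the usual standardized group advantage used in GRPO-style methods. The second uses the standard closed-form maximizer of a KL-regularized objective on a finite set, with each target probability proportional to the old probability times exp(advantage / beta). Whether LPO's target takes exactly this form is an assumption here, and `beta` and both function names are hypothetical.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_regularized_target(old_probs, advantages, beta=1.0):
    """Maximizer of sum(q * A) - beta * KL(q || old_probs) over the simplex.

    The solution is q_i proportional to old_probs_i * exp(A_i / beta).
    """
    weights = [p * math.exp(a / beta) for p, a in zip(old_probs, advantages)]
    total = sum(weights)
    return [w / total for w in weights]


rewards = [1.0, 0.0, 0.0, 1.0]   # verifiable 0/1 rewards for four responses
advantages = group_relative_advantages(rewards)
target = kl_regularized_target([0.25] * 4, advantages, beta=1.0)
```

The advantages sum to zero within the group by construction, and the target moves probability mass toward the rewarded responses while staying on the simplex.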

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

通过控制最大化统一目标条件强化学习与无监督技能学习

Authors: Alireza Modirshanechi, Benjamin Eysenbach, Peter Dayan, Eric Schulz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.06145
Pdf link: https://arxiv.org/pdf/2605.06145
Abstract Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.
中文摘要 无监督预训练推动了目标条件强化学习（GCRL）的实证进展，但其理论基础仍鲜为人知。特别是，一类有影响力的方法——互信息技能学习（MISL），发现了行为上多样化的技能，这些技能后来可用于后续的目标达成。然而，为什么通过MISL学到的技能应该支持实现目标，仍是一个理论上的谜团。一个微妙的挑战是，GCRL和MISL都是总称：不同的GCRL任务使用不同的标准来衡量达成目标，而不同的MISL方法则优化不同的行为多样性概念。我们解决了这一挑战，并将GCRL和MISL统一为控制最大化的实例。我们识别出三种典型的GCRL表述，并证明它们在根本上不等价：即使在相同环境中，它们也可能诱导不兼容的最优策略。尽管如此，它们都有一个共同的解释：一个表现良好的目标条件政策，是指其未来轨迹对所命令目标高度敏感，且具体的敏感性概念由GCRL表述决定。我们指出，MISL目标可以被理解为类似于目标敏感性的技能敏感度度量，我们表明MISL目标受限于特定配方的下游目标敏感性。这些界限确立了MISL方法与下游GCRL任务之间的精确对应关系：对于每一种GCRL表述，都存在匹配的MISL目标，且更多样化的技能能提供更高的下游目标敏感度。因此，我们的结果为强化学习预训练奠定了理论基础，并具有重要的实际意义，例如建议用户关注特定下游任务类别时应采用哪些预训练目标。

AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

AdaGamma：强化学习中时间适应的状态依赖折扣

Authors: Yaomin Wang, Jianting Pan, Ran Tian, Xiaoyang Li, Yu Zhang, Hengle Qin, Tianshu YU
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06149
Pdf link: https://arxiv.org/pdf/2605.06149
Abstract The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
中文摘要 强化学习中的折扣因子控制有效规划视野和自助法的强度，但大多数深度强化学习方法在所有状态上使用单一固定值。虽然状态依赖性折扣在概念上很有吸引力，但天真的深度行为者-批判者实现可能会变得不稳定，并退化到TD-错误崩溃的方向。我们提出了AdaGamma，一种实用的深度actor-critic状态相关折现方法，学习状态依赖的折现函数，并结合返回一致性目标以正则化诱导备份结构。在理论方面，我们分析了由状态相关贴现诱导的贝尔曼算子，并在适当条件下建立了其基本良定性质。从实证角度看，AdaGamma 集成到 SAC 和 PPO，连续控制基准测试持续提升，并在 JD 物流平台上的在线 A/B 测试中取得了统计学显著的提升。这些结果表明，当状态依赖折现与防止退化目标操作的回报一致性目标结合时，可以在深度强化学习中发挥有效作用。
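
To make the idea of state-dependent discounting concrete, the following is a minimal sketch of a return computed with a discount that depends on the state instead of a single constant. It only illustrates the recursion G_t = r_t + gamma(s_{t+1}) * G_{t+1}; the function name `state_dependent_returns`, the argument `gamma_fn`, and the choice to evaluate the discount at the next state are assumptions made for illustration, and the abstract does not specify AdaGamma's actual backup or its return-consistency objective.

```python
from typing import Callable, Hashable, List, Sequence


def state_dependent_returns(
    rewards: Sequence[float],
    next_states: Sequence[Hashable],
    gamma_fn: Callable[[Hashable], float],
) -> List[float]:
    """Backward recursion G_t = r_t + gamma(s_{t+1}) * G_{t+1}, with G_T = 0.

    gamma_fn is a hypothetical stand-in for a learned discount function.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # return after the final step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma_fn(next_states[t]) * g
        returns[t] = g
    return returns


# Toy trajectory: state 1 is discounted more heavily than the others.
toy_returns = state_dependent_returns(
    rewards=[1.0, 1.0, 1.0],
    next_states=[0, 1, 2],
    gamma_fn=lambda s: 0.5 if s == 1 else 0.9,
)
```

With a constant `gamma_fn` the same function reduces to the usual discounted return, which is one way to sanity-check a state-dependent implementation.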

Entropy-Regularized Adjoint Matching for Offline RL

离线强化学习的熵正则化伴随匹配

Authors: Abdelghani Ghanem, Mounir Ghogho
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06156
Pdf link: https://arxiv.org/pdf/2605.06156
Abstract Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a popularity bias that can suppress high-reward actions in low-density regions, and creates a support binding that restricts off-manifold exploration. Existing workarounds, such as appending residual Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose Maximum Entropy Adjoint Matching (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a Mixture Behavior Prior that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
中文摘要 将表达式生成策略（如流匹配模型）集成到离线强化学习（RL）中，使智能体能够捕捉复杂的多模态行为。虽然Q学习结合伴随匹配（QAM）通过连续伴随方法稳定策略优化，但它本质上仍受固定行为分布约束。这种依赖性会引发一种\textit{受欢迎偏差}，可以抑制低密度区域的高回报行为，并形成一个\textit{支持绑定}，限制离流形探索。现有的变通方法，如附加 \textit{residual} 高斯策略，常常重新引入与单峰分布相关的表达性瓶颈。在本研究中，我们提出了\textit{最大熵伴随匹配}（ME-AM），这是一个统一框架，解决了连续流表述中的这些限制。ME-AM包含两种机制：（1）镜像下降熵最大化目标，减轻流行度偏差，便于从离线数据集中提取最优策略;（2）\textit{混合行为先验}，数学上扩大几何支持范围，涵盖分布外的高奖励区域。通过探索这一扩展几何，ME-AM识别了稳健作用，同时保持生成向量场的绝对连续性。从实证角度看，ME-AM在多样化的稀疏奖励连续控制环境中，相较于以往最先进（SOTA）方法表现出竞争或更优的性能。

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

OPSD压缩了RLVR所教的内容：推理模型的后强化学习压缩阶段

Authors: Jaehoon Kim, Dongha Lee
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06188
Pdf link: https://arxiv.org/pdf/2605.06188
Abstract On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
中文摘要 政策自提炼（OPSD）最近作为可验证奖励强化学习（RLVR）的替代方案出现，承诺通过自学者基于特权上下文进行代币级学分分配，带来更高的准确性和更短的响应。然而，这一承诺并未适用于以思考为驱动的数学推理，报告的准确性提升会缩小，有时甚至变成负数。我们假设事后诸葛亮监督可以在短期思维障碍输出中指定更好的代币级替代方案，但在长思考能力追踪中，它更容易识别冗余，而非提供更好的替代品。为测试此点，我们分别应用OPSD来校正和错误的展开组，从而可以单独观察压缩和修正。我们的结果表明，在思维驱动的数学推理中，OPSD最可靠的表现是压缩机制而非修正机制：仅在正确展开上训练能保持准确性，同时大幅缩短响应时间;而仅在错误的部署上训练则损害准确性。基于这些发现，我们提出了一个修订后的训练后流程，用于思维驱动的数学推理：先是SFT，再先RLVR，再到OPSD。

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

A$^2$TGPO：具备自适应回合层裁剪的代理转组策略优化

Authors: Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06200
Pdf link: https://arxiv.org/pdf/2605.06200
Abstract Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
中文摘要 针对代理大型语言模型（LLMs）的强化学习通常依赖于稀疏的轨迹级结果奖励，这使得评估单个工具调用在多回合交互中的贡献变得困难。现有的过程积分分配方法要么依赖单独的外部过程奖励模型引入额外消耗，要么依赖基于树状的结构性推广，仅重新分配结果信号，同时限制轨迹多样性。一个有前景的替代方案利用政策每回合预测的真实概率变化，称为信息增益（IG），作为一种无需外部评估者的内在过程信号。然而，先前在强化学习训练循环中利用IG信号的工作面临三个系统性挑战：跨跨异质位置环境的转弯进行归一化可能导致单个转弯的相对位置扭曲，累计可变项数会导致优势幅度随轨迹深度漂移，以及固定削波范围对截波信号差异极大的转弯策略更新具有相同规定。本文提出A$^2$TGPO（带自适应回合层裁剪的代理回合组策略优化），该方案保留IG作为内在信号，但重新设计了IG的归一化、累积和消耗方式：（i）回合组归一化：在每个（提示、回合索引）组内规范IG，使每次回合仅与相同交互深度的对等组比较;（ii）方差重标度贴现累计：将累计归一化IG除以累积项的平方根，以保持各转局面优势幅度可比;以及（iii）自适应转弯级别削波：根据归一化的IG调制每回合的削波范围，扩大信息性转弯的更新区域，非信息化的转弯时收窄更新区域。
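
The three components listed in the abstract can be sketched roughly as follows. This is a reading of the abstract only, not the authors' implementation: the function names, the use of the population standard deviation, the `discount` parameter, and especially the tanh-based clipping rule in `adaptive_clip` are hypothetical choices made for illustration.

```python
import math
from typing import List


def turn_group_normalize(ig: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """ig[i][k] is the information gain of rollout i at turn k (same prompt).

    Each value is standardized only against rollouts that reached turn k.
    """
    out = [[0.0] * len(r) for r in ig]
    max_turns = max(len(r) for r in ig)
    for k in range(max_turns):
        peers = [r[k] for r in ig if len(r) > k]
        mean = sum(peers) / len(peers)
        std = math.sqrt(sum((x - mean) ** 2 for x in peers) / len(peers))
        for i, r in enumerate(ig):
            if len(r) > k:
                out[i][k] = (r[k] - mean) / (std + eps)
    return out


def rescaled_advantages(norm_ig: List[float], discount: float = 1.0) -> List[float]:
    """A_t = (sum_{k>=t} discount^(k-t) * norm_ig[k]) / sqrt(number of terms)."""
    n = len(norm_ig)
    adv = [0.0] * n
    acc = 0.0
    for t in reversed(range(n)):
        acc = norm_ig[t] + discount * acc
        adv[t] = acc / math.sqrt(n - t)
    return adv


def adaptive_clip(norm_ig_t: float, base: float = 0.2, scale: float = 0.5) -> float:
    """Hypothetical rule: widen the clip range for high normalized IG, narrow it for low."""
    return base * (1.0 + scale * math.tanh(norm_ig_t))


normalized = turn_group_normalize([[0.1, 0.4], [0.3, 0.2]])
advantages = rescaled_advantages([1.0, 1.0, 1.0, 1.0])
```

Dividing by the square root of the number of accumulated terms keeps the advantage at early turns from growing linearly with the remaining trajectory depth, which is the drift the abstract describes.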

Soft Deterministic Policy Gradient with Gaussian Smoothing

软确定性策略梯度与高斯平滑

Authors: Hyunjun Na, Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06228
Pdf link: https://arxiv.org/pdf/2605.06228
Abstract Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.
中文摘要 确定性政策梯度（DPG）被广泛用于持续控制;然而，它本质上依赖于批评者在政策更新期间对行动的可区分性。在涉及稀疏或离散奖励的实际控制问题中，这一假设被打破，导致策略梯度不明确和学习不稳定。为应对这些挑战，我们提出了基于高斯平滑化的平滑贝尔曼方程的原则性备选方案。具体来说，我们基于一个平滑化的贝尔曼方程定义了一个新颖的行动值函数，并推导出软确定性策略梯度（Soft-DPG）。我们的表述消除了对批判作用梯度的显式依赖，确保即使是非光滑的Q函数，梯度也保持良好定义。我们将该框架实例化为深度强化学习算法，称为软深度确定性策略梯度（Soft DDPG）。对标准连续控制基准及其离散化奖励变体的实证评估表明，软DDPG在高密度奖励环境中依然具有竞争力，并在大多数离散化奖励环境中提供明显收益，而标准DDPG对不规则批评环境更为敏感。
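
As a small illustration of why Gaussian smoothing helps with non-differentiable critics, the sketch below uses the standard identity that the gradient of Q_sigma(a) = E[Q(a + sigma * eps)], with eps drawn from a standard normal, equals E[Q(a + sigma * eps) * eps] / sigma, so no derivative of Q is needed. The function name, the scalar action, the baseline subtraction, and the Monte Carlo sample count are assumptions for illustration; the abstract does not give the exact Soft-DPG estimator or its smoothed Bellman equation.

```python
import random
from typing import Callable


def smoothed_action_gradient(
    q: Callable[[float], float],
    action: float,
    sigma: float,
    num_samples: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of d/da E[q(a + sigma * eps)] for eps ~ N(0, 1).

    Uses only evaluations of q. Subtracting q(action) as a baseline leaves
    the expectation unchanged (E[eps] = 0) and reduces variance.
    """
    rng = random.Random(seed)
    baseline = q(action)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        total += (q(action + sigma * eps) - baseline) * eps
    return total / (num_samples * sigma)


def step_reward(a: float) -> float:
    """A discontinuous toy critic whose ordinary gradient is zero almost everywhere."""
    return 1.0 if a > 0.0 else 0.0


grad_at_zero = smoothed_action_gradient(step_reward, action=0.0, sigma=0.5)
```

For this step function the smoothed value is the normal CDF evaluated at a / sigma, so the exact smoothed gradient at a = 0 is 1 / (sigma * sqrt(2 * pi)), about 0.80 for sigma = 0.5, whereas the unsmoothed gradient carries no learning signal.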

Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence

Safactory：一个可扩展的可信自主智能代理工厂

Authors: Xinquan Chen, Zhenyun Yin, Shan He, Bin Huang, Shanzhe Lei, Pengcheng Shi, Kun Cai, Bei Chen, Bangwei Liu, Zeyu Kang, Chao Huang, Yang Zhang, Wenjie Li, Ruijun Ge, Yajie Wang, Tianshun Fang, Tianyang Xu, Yiwen Cong, Meng Jin, Gaolei Li, Xuansheng Wu, Linhan Liu, Zijing He, An Li, Yan Teng, Xin Tan, ChaoChao Lu, Ji He, Jie Li, Chunfeng Song, Jinya Xu, Fan Song, Shujie Wang, Jianmin Qian, Jie Hou, Xuhong Wang, Yingchun Wang, Hui Wang, Xia Hu
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.06230
Pdf link: https://arxiv.org/pdf/2605.06230
Abstract As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agentic infrastructure remains fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present Safactory, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a Parallel Simulation Platform for trajectory generation, a Trustworthy Data Platform for trajectory storage and experience extraction, and an Autonomous Evolution Platform for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.
中文摘要 随着大型模型从对话式助手向自主智能体发展，长期决策、工具使用和真实环境交互的挑战日益增加。现有的代理基础设施在评估、数据管理和代理演进等方面仍处于碎片化状态，使得在连续闭环中系统性发现风险和改进模型变得困难。在本报告中，我们介绍了\textbf{Safactory}，一个可扩展的智能代理工厂，用于可信赖的自主智能。Safactory 集成了三个紧密耦合的平台：\textbf{并行模拟平台}用于轨迹生成，\textbf{可信数据平台}用于轨迹存储和经验提取，以及\textbf{自主进化平台}用于异步强化学习和策略提取。据我们所知，Safactory是首个提出下一代可信自主智能统一进化流程的框架。

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

重新思考强化学习对大型语言模型推理：它是策略选择稀疏，而非能力学习

Authors: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06241
Pdf link: https://arxiv.org/pdf/2605.06241
Abstract Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
中文摘要 强化学习已成为提升大型语言模型推理能力的标准，但越来越多的证据表明，强化学习并未教授新的策略;它将概率质量重新分配到基模型已有的解中。在本研究中，我们探讨：如果强化学习仅仅引导模型走向其已知路径，那么强化学习优化循环本身是否必要？通过跨多个模型家族的代币级分析和强化学习算法，我们发现强化学习的有益影响是一种稀疏且可预测的修正，集中在模型不确定选择哪个分支的高熵决策点。只有1-3%的代币位置受影响，升标代币总是在基础模型的前五备选中，针对这些位置的针对性修正能因果地恢复RL准确率提升的大部分，而随机修正则失败。基础模型自身的熵在没有任何强化学习训练模型的情况下识别这些位置，整个修正是低维的，仅能在极少数模型参数中表示。这些发现将推理改进重新定义为政策选择稀疏，而非能力获取。我们将这一见解转化为ReasonMaxxer，这是一种极简的无强化学习方法，仅在熵门控决策点应用对比损失，使用数百个基础模型展开，无需在线生成。在三种模型家族、六个量表和六个数学推理基准中，ReasonMaxxer 能够匹敌甚至超过完整的强化学习表现，同时只需数十个问题和单一 GPU 训练分钟，训练成本降低了大约三个数量级。
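
The notion of an entropy-gated decision point can be illustrated with a short sketch that computes the Shannon entropy of each next-token distribution and keeps the positions above a threshold. The function names and the threshold value are hypothetical; the abstract does not state how ReasonMaxxer sets its gate or how its contrastive loss is defined.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gated_positions(
    distributions: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Indices of token positions whose predictive entropy exceeds the threshold."""
    return [
        i for i, dist in enumerate(distributions) if token_entropy(dist) > threshold
    ]


# Toy sequence of four positions over a three-token vocabulary.
toy_distributions = [
    [1.0, 0.0, 0.0],        # fully confident
    [0.5, 0.5, 0.0],        # two-way branch
    [0.98, 0.01, 0.01],     # nearly confident
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
]
gated = entropy_gated_positions(toy_distributions, threshold=0.5)
```

Only the two uncertain positions (indices 1 and 3) pass the gate here, which mirrors the abstract's claim that corrections concentrate on a small fraction of positions.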

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

用工具教思维模型推理：工具集成推理的完整流程配方

Authors: Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang, Yuchen Fan, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng, Yun Luo, Ganqu Cui
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06326
Pdf link: https://arxiv.org/pdf/2605.06326
Abstract Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
中文摘要 工具整合推理（TIR）提供了一种直接的方式，将思维模型扩展到仅靠文本推理的限制之外。矛盾的是，我们观察到工具赋能的评估即使强思维模型几乎不实际调用工具，也可能降低推理性能。本文探讨如何在不牺牲无工具推理能力的前提下，将自然工具使用行为注入强思维模型，并提出了全面的TIR方案。我们强调：（i） TIR监督微调（SFT）的有效性取决于教师路径的可学习性，教师应优先考虑本质上适合工具增强解决方案的问题;（ii）控制工具使用轨迹的比例可以减轻仅文本推理能力的灾难性遗忘;（iii）优化pass@k和响应长度而非训练损失，可以在保留强化学习（RL）探索余地的同时最大化TIR SFT的收益;（iv）基于适当的SFT初始化和明确防止模式崩溃的保护措施，建立一个稳定且可验证奖励的RL（RLVR）阶段，提供了一个简单但极为有效的解决方案。当应用于4B和30B尺度的Qwen3思维模型时，我们的配方得出的模型在开源模型中实现了广泛基准测试中的最先进性能，例如4B和30B在AIME 2025中分别达到96.7%和99.2%。

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

统一的对-GRPO家族：从隐式到显式偏好约束，实现稳定且通用的强化学习对齐

Authors: Hao Yu
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/2605.06375
Pdf link: https://arxiv.org/pdf/2605.06375
Abstract Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants, including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF, UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
中文摘要 通过人类偏好强化学习（RLHF）进行大型语言模型（LLM）对齐，在主流成对偏好学习范式中存在策略更新不稳定、梯度方向模糊、可解释性差以及高梯度方差的问题。为系统性解决这些局限性，我们建立了以Pair-GRPO家族为核心的基于偏好的强化学习优化统一理论框架，包含两个紧密耦合的变体：软对GRPO和硬对GRPO。软对-GRPO是群相对策略优化（GRPO）的最小修改，用二元成对偏好奖励替代组归一化标量奖励，保留GRPO的截断代理和KL正则化结构。我们证明了一个临界梯度等价定理：在当前策略的一阶泰勒展开下，软对GRPO梯度是标准GRPO梯度的正标量倍数，解释了尽管舍弃连续奖励幅度，其经验稳定性依然存在。在此基础上，我们提出了硬对GRPO变体，这是一种先进变体，引入显式局部概率约束和受约束KL拟合优化，以进一步抑制梯度噪声和全局策略漂移。我们为这两种变体提供全面的理论保证——包括单调策略改进、确定性梯度方向、梯度方差缩减和动态步长收敛。对标准LLM比对基准基准（HH-RLHF、UltraFeedback）和MuJoCo连续控制任务HalfCheetah-v4的广泛实验表明，我们的Pair-GRPO家族在比对质量、人类偏好胜率、训练稳定性以及对通用强化学习的推广性方面，始终优于最先进的基线。消融研究验证了每个核心成分的关键贡献。

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

部分可观测马尔可夫势博弈中纳什均衡的独立学习，具有解耦动力学

Authors: Philip Jordan, Maryam Kamgarpour
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.06377
Pdf link: https://arxiv.org/pdf/2605.06377
Abstract We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.
中文摘要 我们研究部分可观测马尔可夫博弈（POMGs）中的纳什均衡学习，这是一种多智能体强化学习框架，其中智能体无法完全观察底层状态。此前在该环境下的工作依赖于集中化或信息共享，且样本和计算复杂度呈指数级增长。我们关注一类具有独立状态转移的POMG子类，其中代理通过其奖励保持耦合，并假设完全观测的马尔可夫博弈是一个马尔可夫势博弈。对于本课，我们提出一种独立学习算法，玩家仅观察自己的行为和观察，且不进行交流，共同收敛到近似的纳什均衡。由于部分可观测性，最优策略通常可能依赖于完整的动作-观察历史。在滤波器稳定性假设下，我们证明基于有限历史窗口的策略提供了足够的近似保证。这使得我们能够用一个近势的代理马尔可夫博弈来近似POMG，从而实现了在底层POMG中实现独立纳什均衡学习的准多项式样本和计算复杂度。

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

非对称策略上提炼：在令牌层面桥接利用与模仿

Authors: Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06387
Pdf link: https://arxiv.org/pdf/2605.06387
Abstract On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are [...]. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
中文摘要 策略内提炼（OPD）通过代币级教师反馈，按照自身轨迹训练学生，且通常优于非策略提炼和标准强化学习。然而，我们发现其标准优势加权策略梯度存在三个结构性弱点，包括高方差更新、零优势区域梯度消失以及纠正信号时的探索瓶颈。因此，建议采用非对称策略蒸馏（AOPD），在非正优势区域用局部发散最小化替代无效的负强化，同时保持正强化学习。数学推理基准测试显示，AOPD持续优于标准OPD，强初始化和弱初始化平均提升分别为4.09 / 8.34。AOPD在培训期间保持更高的政策熵，在顺序工具使用适应中保持能力更佳。

Distributed Online Learning for Time-Critical Communication in 6G Industrial Subnetworks

6G工业子网中时间关键通信的分布式在线学习

Authors: Samira Abdelrahman, Hossam Farag, Gilberto Berardinelli
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.06437
Pdf link: https://arxiv.org/pdf/2605.06437
Abstract 6G industrial in-X subnetworks are expected to support highly time-critical alarm reporting in large-scale environments characterized by mobility, bursty event-driven traffic, and limited radio resources. In such settings, conventional medium access solutions are ill-suited to guarantee reliable delivery of critical traffic, e.g., emergency alarms, within strict deadlines, especially when multiple subnetworks become simultaneously active after a common alarm event, a scenario widely referred to as medium access with a shared message. This paper proposes a distributed deep reinforcement learning (DRL)-based medium access control protocol for timely alarm transmission in time-critical industrial subnetworks. The proposed method enables each local access point (LAP) to learn, in an online manner, to infer contention conditions from a broadcast contention-signature signal and to autonomously select a transmission pattern over the available channels using a lightweight deep neural network and an $\epsilon$-greedy policy. Simulation results demonstrate that the proposed approach consistently achieves a higher probability of in-time alarm delivery than benchmark random-access schemes, while exhibiting better scalability with increasing network density. For instance, the proposed method improves the probability of in-time alarm delivery by at least 7% with a network size of 40 subnetworks, while the gain increases to 21% when the number of subnetworks increases to 60.
中文摘要 6G工业in-X子网预计将在大规模环境中支持高度时间关键的报警报告，这些环境以移动性、突发事件驱动流量和有限的无线电资源为特征。在这种情况下，传统的介质接入解决方案难以保证关键流量（如紧急报警）在严格期限内的可靠传递，尤其是在多个子网络在共同警报事件后同时激活时，这种情况被广泛称为带有共享消息的媒介接入。本文提出了一种基于分布式深度强化学习（DRL）的介质访问控制协议，用于在时间关键的工业子网络中实现及时报警传输。该方法使每个本地接入点（LAP）能够在线学习，从广播的争用签名信号推断争用条件，并利用轻量级深度神经网络和（ephsilon）贪婪策略自主选择可用信道上的传输模式。模拟结果表明，所提方法在及时报警传递的概率上始终优于基准随机访问方案，同时随着网络密度的提升，其可扩展性也更佳。例如，该方法在网络规模为40个子网时，至少能提高7%的及时报警传递概率，而当子网络数量增加到60个时，增益可提升至21%。
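
The epsilon-greedy rule mentioned in the abstract is a standard exploration policy and can be sketched in a few lines. Here `pattern_values` stands in for the per-pattern scores a local access point's network might output; that name, the number of patterns, and the fixed random seed are illustrative assumptions and are not taken from the paper.

```python
import random
from typing import Sequence


def epsilon_greedy(
    pattern_values: Sequence[float], epsilon: float, rng: random.Random
) -> int:
    """With probability epsilon pick a random pattern, otherwise the best-scored one."""
    if rng.random() < epsilon:
        return rng.randrange(len(pattern_values))
    return max(range(len(pattern_values)), key=lambda i: pattern_values[i])


rng = random.Random(0)
scores = [0.1, 0.7, 0.3, 0.2]  # hypothetical scores for four transmission patterns
greedy_choice = epsilon_greedy(scores, epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy(scores, epsilon=1.0, rng=rng)
```

Setting epsilon to 0 always exploits the current estimate, while epsilon equal to 1 samples uniformly; online schemes typically use a small or decaying value in between.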

Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

多阶段规划与基础政策的时间同构

Authors: Magnus Victor Boock, Abdullah Akgül, Mustafa Mert Çelikok, Melih Kandemir
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06470
Pdf link: https://arxiv.org/pdf/2605.06470
Abstract We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite-dimensional implementations, we show that global hitting-time error is bounded by one-step transition error amplified by the environment's transient spectral radius. Furthermore, we provide finite-sample guarantees accounting for approximation, statistical complexity, and trajectory-label mismatch. Derived from this theory, we curate Isomorphic Embedding Learning (IEL) as a new goal-agnostic foundation policy learning algorithm that anchors a HILP-style consistency objective with explicit hitting-time regression to ensure that the learned geometry reflects actual decision-time progress. This asymmetric and compositional structure enables robust graph-based multi-stage planning for long-horizon navigation. Our experiments demonstrate that IEL improves the state of the art of learning foundation policies from offline maze locomotion data. Our code can be found on this https URL
中文摘要 我们提出了一种新的算子理论表示学习框架，用于离线强化学习，能够从时间观测中恢复受控马尔可夫过程的有向时间几何。虽然现有技术常常产生对称距离或无法满足三角形不等式，但我们的框架学习了希尔伯特空间位移几何，其中预期撞击时间作为潜在位移的线性泛函实现。我们证明该表示在潜在线性闭包下存在，并且在有界线性同构下是唯一可识别的。对于有限维实现，我们证明了全局击中时间误差被一步跃迁误差所限制，并被环境瞬态频谱半径放大。此外，我们提供了有限样本保证，考虑了近似、统计复杂性和轨迹标签不匹配。基于该理论，我们策划了同构嵌入学习（IEL）作为一种新的目标无关基础策略学习算法，它通过显式击中时间回归锚定HILP式一致性目标，确保所学几何反映实际决策时间的进展。这种非对称和组合结构使得基于图的多阶段规划能够稳健地实现长视野导航。我们的实验表明，IEL能够提升基于离线迷宫运动数据的学习基础政策政策的技术水平。我们的代码可以在这个 https URL 中找到

Operator-Guided Invariance Learning for Continuous Reinforcement Learning

连续强化学习中的算符引导不变性学习

Authors: Zuyuan Zhang, Fei Xu Yu, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06500
Pdf link: https://arxiv.org/pdf/2605.06500
Abstract Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose VPSD-RL (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton-Jacobi-Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.
中文摘要 具有连续时间和状态/动作空间的强化学习（RL）通常数据密集且易受干扰变异和变化影响，促使利用价值保持结构的方法来稳定和提升学习。大多数现有方法只关注特例，如规定的对称性和精确等变性，却未探讨如何发现需要非线性算符在连续状态/作用系统之间进行变换和映射的更一般结构，这些系统具有同构的值函数。我们提出了 \textbf{VPSD-RL}（用于强化学习的价值保持结构发现）。它将连续强化学习建模为一种受控扩散，并通过李群作用和相关的拉回算符定义了保持值的映射。我们证明，当拉回价值函数并推动前进的动作与受控生成器和奖励函数交换时，正好存在一个保持价值的结构。此外，当Hamilton-Jacobi-Bellman错配较小时，可以找到具有严格保证的近似保值结构。该框架通过搜索相关的李群算子，发现精确且近似的保值结构。VPSD-RL适用于可微漂移、扩散和奖励模型;通过确定方程的残差最小化学习无穷小生成元;用常微分方程流对它们进行指数化以获得有限变换;并通过过渡增强和变换一致性正则化将它们集成为连续强化学习。我们表明，有界生成器/奖励不匹配意味着最优值函数沿近似轨道的定量稳定性，灵敏度受有效视界控制，并在连续控制基准测试中观察到数据效率和鲁棒性提升。

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

MARBLE：扩散强化的多方面奖励平衡

Authors: Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li, Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06507
Pdf link: https://arxiv.org/pdf/2605.06507
Abstract Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97x the training speed of baseline training.
中文摘要 强化学习的微调已成为将扩散模型与人类偏好对齐的主流方法。然而，图像评估本质上是一项多维任务，需要同时优化多个评估标准。现有做法通过每个奖励训练一个专业模型、优化加权和奖励$R（x）=\sum_k w_k R_k（x）$，或通过手工调整阶段安排顺序微调来处理多重奖励。这些方法要么无法形成一个可以对所有奖励联合训练的统一模型，要么需要大量手动调优的顺序训练。我们发现失败源于使用朴素加权和奖励聚合法。这种方法存在样本层面的不匹配，因为大多数推广是专业样本，对某些奖励维度信息丰富，但对其他维度则无关紧要;因此，加权求和会削弱他们的监督。为解决这一问题，我们提出了MARBLE（多方面奖励BaLancE），这是一个梯度空间优化框架，该框架为每个奖励维护独立的优势估计量，计算每个奖励的策略梯度，并通过解决二次规划问题，将其协调成单一更新方向，无需手动调整奖励权重。我们还提出了一种摊销式表述，利用DiffusionNFT中仿射损失结构，将每步成本从K+1次后向传递降至接近单次奖励的基线成本，同时对平衡系数进行EMA平滑处理，以稳定更新对瞬态单批次波动的抵抗。在SD3.5中等版中，有五个奖励，MARBLE同时提升了所有五个奖励维度，将最差对齐奖励的梯度余弦值从80%的迷你批次中加权加权下为负值变为持续正值，训练速度是基线训练的0.97倍。
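
Illustrative sketch (not from the paper): the abstract above contrasts a fixed weighted sum with a quadratic program over per-reward gradients. The abstract does not spell out MARBLE's QP, so the code below assumes the simplest well-known instance, the two-gradient min-norm problem, and adds a plain EMA over the resulting coefficients. The gradient vectors, the weights 0.2/0.8, and all function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine(coeffs, grads):
    """Return sum_k coeffs[k] * grads[k] as a plain list."""
    return [sum(c * g[i] for c, g in zip(coeffs, grads)) for i in range(len(grads[0]))]


def min_norm_coeffs(g1, g2):
    """Solve min over a in [0, 1] of ||a*g1 + (1-a)*g2||^2 (a two-variable QP)."""
    diff = [x - y for x, y in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:             # identical gradients: any a is optimal
        return [0.5, 0.5]
    a = -dot(diff, g2) / denom   # unconstrained minimiser
    a = min(1.0, max(0.0, a))    # project onto [0, 1]
    return [a, 1.0 - a]


def ema(prev, new, beta=0.9):
    """Exponential moving average of the balancing coefficients."""
    return [beta * p + (1.0 - beta) * n for p, n in zip(prev, new)]


# Toy per-reward gradients that partly conflict (hypothetical numbers).
g1, g2 = [1.0, 0.0], [-0.5, 1.0]

weighted = combine([0.2, 0.8], [g1, g2])   # fixed hand-picked weights
coeffs = min_norm_coeffs(g1, g2)           # weights chosen by the QP
balanced = combine(coeffs, [g1, g2])
smoothed = ema([0.5, 0.5], coeffs)         # damp batch-to-batch jumps

print(dot(weighted, g1), dot(balanced, g1), dot(balanced, g2))
```

In this toy case the hand-weighted direction has a negative inner product with g1, while the min-norm direction has a positive inner product with both gradients. This only illustrates the general idea of gradient-space balancing and says nothing about the paper's actual results.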

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

关于隐性奖励过拟合与RLVR中的低秩动态

Authors: Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06523
Pdf link: https://arxiv.org/pdf/2605.06523
Abstract Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR does not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in an RLVR-trained model behaves like a heavy-tailed distribution. (3) The left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the finding that RLVR is, in essence, optimizing sampling efficiency. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
中文摘要 近期广泛研究表明，通过可验证奖励强化学习（RLVR）获得的增强推理能力主要集中在第一级组件中。基于这一观察，我们采用了周期性1级替换，发现了一个反直觉现象：RLVR可能对训练数据集表现出隐性奖励过拟合。具体来说，即使模型在训练过程中奖励相对较低，也能在测试集中取得令人满意的性能。此外，我们描述了强化学习训练的三个不同特性：（1）RLVR中的有效秩一成分除数学推理能力外，不保留其他模型知识。（2）RLVR的基本工作原理是优化特定的奇异频谱。RLVR训练模型中几乎所有线性层的奇异值分布表现为重尾分布。（3）与秩1组分相关的左侧奇异矢量在训练过程中表现出更强的比对倾向，这呼应了RLVR本质上优化采样效率的发现。综合来看，我们的发现和分析进一步揭示了RLVR如何塑造模型参数，并为改进现有强化学习范式或其他训练范式以实现持续学习提供了潜在见解。
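
Illustrative sketch (not from the paper): the abstract above refers to the rank-1 component of a weight matrix. The code below shows, under stated assumptions, how a leading rank-1 component can be extracted with power iteration in plain Python. The matrix `delta` is a made-up stand-in for a weight update, and the paper's Periodic Rank-1 Substitution procedure is not reproduced here.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(A, iters=100):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value.

    Assumes A is non-zero and the start vector is not orthogonal to v.
    """
    v = [1.0 / math.sqrt(len(A[0]))] * len(A[0])
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v))
        n = norm(w)
        v = [x / n for x in w]
    u = matvec(A, v)
    sigma = norm(u)
    return sigma, [x / sigma for x in u], v


def rank1(sigma, u, v):
    """Build the matrix sigma * u v^T."""
    return [[sigma * ui * vj for vj in v] for ui in u]


def frobenius(A):
    return math.sqrt(sum(x * x for row in A for x in row))


# Hypothetical 2x3 "weight update": nearly rank-1, with one perturbed entry.
delta = [[3.0, 4.0, 5.0], [6.0, 8.0, 10.5]]
sigma, u, v = top_singular_triplet(delta)
approx = rank1(sigma, u, v)
residual = frobenius([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(delta, approx)])
print(sigma, residual / frobenius(delta))
```

For real model layers a library SVD would be the natural choice; power iteration is used here only to keep the sketch dependency-free.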

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

ROSE：通过协作弹性推广Agentic RL的服务GPU推广

Authors: Wei Gao, Yuheng Zhao, Dilxat Muhtar, Dakai An, Xuchun Shang, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Weixun Wang, Ju Huang, Teng Ma, Siran Yang, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.06534
Pdf link: https://arxiv.org/pdf/2605.06534
Abstract Agentic reinforcement learning (RL) has emerged as a key driver for improving the multi-step reasoning and tool-use capabilities of LLMs. However, its efficiency is bottlenecked by long-tail rollouts with multi-turn environment interactions, making static GPU provisioning a poor fit: overprovisioning wastes GPUs on stragglers, while underprovisioning increases contention and slows training. We observe that production serving clusters routinely leave substantial GPU compute and memory headroom. Based on this observation, we argue for cooperative elasticity: opportunistically repurposing underutilized serving GPUs to execute rollouts. Realizing cooperative elasticity is non-trivial because it must preserve serving Service Level Objectives (SLOs) under bursty traffic and minimize communication overhead. To address these challenges, we present ROSE, a cooperative, resource-elastic post-training system that safely harvests idle compute and memory on serving GPUs to accelerate agentic RL rollouts. ROSE consists of three components: (1) an SLO-safe co-serving executor that improves rollout throughput while preserving serving SLOs through efficient GPU memory and compute sharing; (2) a cross-cluster weight transfer engine that leverages weight shards and sparsity for fast weight synchronization across clusters; and (3) an elastic rollout scheduler that dynamically provisions cooperative capacity and routes trajectory rollouts across dedicated rollout GPUs and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves average end-to-end throughput by 1.20-3.31× compared with state-of-the-art resource-fixed and elastic baselines.
中文摘要 智能强化学习（RL）已成为提升LLM多步推理和工具使用能力的关键驱动力。然而，其效率被多回合环境互动的长尾部署所限制，使静态GPU配置不适合：过度配置会浪费GPU给落后GPU使用，而配置不足则增加争用并减慢训练速度。我们观察到，生产服务集群通常会留下大量GPU计算和内存余量。基于这一观察，我们主张合作弹性：机会主义地重新利用未充分利用的服务GPU来执行部署。实现协同弹性并非简单，因为它必须在突发流量下保留服务层目标（SLO）并最小化通信开销。为应对这些挑战，我们推出了ROSE，一个协作式、资源弹性的训练后系统，能够安全地收集服务GPU上的空闲计算和内存，加速agentic强化学习的部署。ROSE 由三个组成部分组成：（1）一个 SLO 安全的共服务执行器，通过高效的 GPU 内存和计算共享，提升部署吞吐量同时保留服务 SLO;（2）跨集群权重转移引擎，利用权重碎片和稀疏性实现簇间快速权重同步;以及（3）一个弹性部署调度器，动态配置协作容量，并将轨迹部署路由到专用的部署GPU和机会性服务GPU之间。跨多个模型规模和集群尺度的实验显示，ROSE相比最先进的固定和弹性基线，平均端到端吞吐量提升了1.20-3.31倍。

Delay-Robust Deep Reinforcement Learning for Ranging-Free Channel Access under Mobility in Underwater Acoustic Networks

在水下声学网络中，延迟强健深度强化学习实现在移动性下实现无测距信道接入

Authors: Huaisheng Ye, Xiaowen Ye, Liqun Fu
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.06536
Pdf link: https://arxiv.org/pdf/2605.06536
Abstract Long propagation delays in underwater acoustic networks (UWANs) cause spatio-temporal uncertainty, constraining channel utilization in medium access control (MAC) protocols. Node mobility within autonomous underwater vehicle scenarios exacerbates these challenges by introducing dynamic propagation delays and varying spatial topologies. We present MobiU-MAC, a deep reinforcement learning (DRL)-based MAC protocol for mobile node access in UWANs that maximizes throughput via autonomous learning. MobiU-MAC incorporates CHILL-STER, a novel DRL algorithm optimized for UWANs that is both ranging-free and delay-robust. CHILL-STER employs a credit horizon-limited $\lambda$-return (CHILL-Return) mechanism to achieve stable learning under asynchronous delayed rewards, while the companion spatio-temporal experience replay (STER) mechanism addresses topological changes arising from node mobility. This work also demonstrates theoretically that DRL attains optimal policy learning equivalent to a standard Markov decision process under long propagation delays without requiring ranging. Performance evaluations indicate that MobiU-MAC outperforms existing DRL-based MAC protocols for UWANs by leveraging the maximum system delay boundary without ranging overhead, supporting the effectiveness of the proposed theory and algorithm in complex underwater dynamic environments.
中文摘要 水下声学网络（UWAN）中的长传播延迟会导致时空不确定性，限制中介接入控制（MAC）协议中的信道利用率。自主水下飞行器场景中的节点机动性通过引入动态传播延迟和多变的空间拓扑，加剧了这些挑战。我们介绍了MobiU-MAC，这是一种基于深度强化学习（DRL）的MAC协议，用于UWAN中的移动节点访问，通过自主学习最大化吞吐量。MobiU-MAC采用了CHILL-STER，这是一种针对UWAN优化的新型延迟学习（DRL）算法，既无测距又具备延迟鲁棒性。CHILL-STER 采用了信用视野限制的 $\lambda$-return（CHILL-Return）机制，在异步延迟奖励下实现稳定学习，而伴随的时空经验重放（STER）机制则处理节点移动性引发的拓扑变化。这项工作还在理论上证明，DRL在长传播延迟下能够实现与标准马尔可夫决策过程等价的最优策略学习，而无需量区间。性能评估表明，MobiU-MAC通过利用最大系统延迟边界且无距离开销，优于现有基于DRL的UWAN协议，支持该理论和算法在复杂水下动态环境中的有效性。

Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning

在不确定性下遗传回路的顺序设计与强化学习

Authors: Michal Kobiela, Diego A. Oyarzún, Michael U. Gutmann
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06552
Pdf link: https://arxiv.org/pdf/2605.06552
Abstract The design of biological systems is hindered by uncertainty arising from both intrinsic stochasticity of biomolecular reactions and variability across laboratory or experimental conditions. In this work, we present a sequential framework to optimize genetic circuits under both forms of uncertainty. By employing simulator models based on differential equations or Markov jump processes alongside a reinforcement learning (RL) policy-based approach, our method suggests experiments that adapt to unknown laboratory conditions while accounting for inherent stochasticity. While previous Bayesian methods address uncertainty through iterative experiment-inference-optimization cycles, they typically require computationally expensive inference and optimization steps after each experimental round, leading to delays. To overcome this bottleneck, we propose an amortized approach trained up-front across a distribution of possible uncertain parameters. This strategy sidesteps the need for explicit parameter inference during the design cycle, enabling immediate, observation-based adaptation. We demonstrate our framework on models for heterologous gene expression and a repressilator circuit, showing that it efficiently handles both molecular noise and cross-laboratory variability.
中文摘要 生物系统的设计受到生物分子反应内在随机性以及实验室或实验条件下变异性带来的不确定性所阻碍。本研究提出了一个序列框架，以优化遗传回路在这两种不确定性条件下的表现。通过结合基于微分方程或马尔可夫跳法的模拟模型，结合基于强化学习（RL）的策略方法，我们的方法建议实验能够适应未知的实验室条件，同时考虑固有的随机性。以往的贝叶斯方法通过迭代实验-推断-优化循环来解决不确定性，但通常每轮实验后需要计算量较高的推断和优化步骤，导致延迟。为克服这一瓶颈，我们提出一种摊销方法，前期训练于可能不确定参数分布。该策略避免了设计周期中显式参数推断的需求，实现即时的基于观察的适应。我们展示了异源基因表达模型和抑制回路的框架，证明其高效处理分子噪声和跨实验室变异性。

Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

协调问题：合作多智能体强化学习的评估

Authors: Maria Ana Cardei, Matthew Landers, Afsaneh Doryab
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06557
Pdf link: https://arxiv.org/pdf/2605.06557
Abstract Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value-based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task-completion efficiency. We find that in commitment-constrained task allocation, performance under scale is shaped not only by nominal action-space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination-aware evaluation as a necessary complement to return-based benchmarking for cooperative MARL.
中文摘要 合作多智能体强化学习（MARL）基准测试通常强调汇总结果，如回报、成功率或完成时间。虽然这些指标至关重要，但往往无法揭示代理之间的协调机制，尤其是在代理、任务和联合分配选择组合性扩展的环境中。我们提出一种协调意识的评估视角，补充流程层级的诊断。我们通过STAT实现这一视角，STAT是一个受控承诺约束的空间任务分配测试平台，系统性地变化代理、任务和环境大小，同时保持观察访问和任务规则固定。我们评估了六种具有代表性的基于价值的MARL方法，涵盖不同集中程度。我们的结果表明，相似的回报趋势可能反映了不同的协调机制，包括冗余分配、任务多样性和任务完成效率的差异。我们发现，在承诺约束的任务分配中，规模内的表现不仅受名义动作空间大小影响，还受分配压力、决策机会稀疏以及相互依赖代理间冗余选择的影响。我们的发现激励协调感知评估作为基于回报基准的必要补充，用于合作MARL。

SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

SNAPO：通过可微仿真实现最优控制的平滑神经伴随策略优化

Authors: Dmitri Goloubentsev, Natalija Karpichina
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
Arxiv link: https://arxiv.org/abs/2605.06570
Pdf link: https://arxiv.org/pdf/2605.06570
Abstract Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instances exactly but scales exponentially in state dimensions. Black-box reinforcement learning handles high-dimensional states but trains slowly and produces no sensitivities. We introduce SNAPO (Smooth Neural Adjoint Policy Optimization), a framework that embeds a neural policy inside a known, differentiable simulator, replaces hard constraints with smooth approximations, and computes exact gradients of the objective with respect to all policy parameters and all inputs in a single adjoint pass. We demonstrate SNAPO on three domains: natural gas storage (training in under a minute, 365 forward curve sensitivities at no additional cost per sensitivity), pension fund asset-liability management (6.5x-200x sensitivity speedup over bump-and-revalue, scaling with the number of risk factors), and pharmaceutical manufacturing (cross-unit sensitivities through a 4-unit process chain, with 20 ICH Q8 regulatory sensitivities from 5 adjoint passes in 74.5 milliseconds). All sensitivities are produced by the same backward pass that trains the policy, at a cost proportional to one reverse pass regardless of how many sensitivities are computed.
中文摘要 许多现实问题都需要在不确定性下做出顺序决策：何时注入或提取储气，如何每月重新平衡养老金组合，如何通过制药反应堆链进行温度曲线。动态规划精确地解决小实例，但在状态维度上呈指数级扩展。黑箱强化学习处理高维状态，但训练速度缓慢且不产生灵敏度。我们引入了SNAPO（平滑神经伴随策略优化），这是一个将神经策略嵌入已知且可微的模拟器中的框架，用光滑近似替代硬约束，并在单次伴随路径中计算目标针对所有策略参数和所有输入的精确梯度。我们在三个领域演示了SNAPO技术：天然气储存（在不到一分钟内训练，365个前向曲线敏感度，且无额外灵敏度成本）、养老基金资产负债管理（提升灵敏度6.5倍至200倍，提升至200倍，提升至提升风险因素数量）和制药制造（通过4单元工艺链实现跨单元敏感性，5次伴随通过74.5毫秒内完成20个ICH Q8监管敏感性）。所有敏感度均由训练策略的同一个反向传递产生，其成本与一次反向传递成正比，无论计算多少敏感度。

ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

ReActor：用于物理感知运动重定向的强化学习

Authors: David Müller, Agon Serifi, Sammy Christen, Ruben Grandia, Espen Knoop, Moritz Bächer
Subjects: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.06593
Pdf link: https://arxiv.org/pdf/2605.06593
Abstract Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot's morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an approximate gradient for the upper-level loss. Our framework requires only a sparse set of semantic rigid-body correspondences and eliminates the need for manual tuning by identifying optimal values for a parameterization expressive enough to preserve characteristic motion across different embodiments. Moreover, by integrating retargeting directly with physics simulation, we produce physically plausible motions that facilitate robust imitation learning. We validate our method in simulation and on hardware, demonstrating challenging motions for morphologies that differ significantly from a human, including retargeting onto a quadruped.
中文摘要 将人类运动学参考运动重新定位到机器人的形态上仍然是一项艰巨的挑战。现有方法常常产生物理不一致，如脚滑、自我碰撞或动态不可行的运动，阻碍后续模仿学习。我们提出了一个双层优化框架，结合参考运动适应机器人形态，同时通过强化学习训练跟踪策略。为了使优化可操作，我们推导了上层损耗的近似梯度。我们的框架只需少量语义刚体对应，通过识别参数化的最优值，消除手动调校，使其表达足够丰富，从而保留不同身体体间的特征运动。此外，通过将重定向直接与物理仿真结合，我们产生了物理上合理的运动，促进了稳健的模仿学习。我们在模拟和硬件上验证了我们的方法，展示了与人类显著不同的形态的挑战性运动，包括将重定向到四足动物身上。

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

多智能体强化学习的跨模态导航

Authors: Shuo Liu, Xinzichen Li, Christopher Amato
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.06595
Pdf link: https://arxiv.org/pdf/2605.06595
Abstract Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
中文摘要 健全的具身导航依赖于互补的感官线索。然而，高质量且对齐良好的多模态数据在实际操作中往往难以获得。训练单一模型同样具有挑战性，因为丰富的多模态输入会诱导复杂的表征，并大幅扩大政策空间。轻量级模态专用代理之间的跨模态协作提供了可扩展的范式。它支持灵活部署和并行执行，同时保持每种模态的强度。本文提出 \textbf{CRONA}，一个针对 \textbf{Cro}ss-Modal \textbf{Na}vigation 的多智能体强化学习（MARL）框架。CRONA通过利用控制相关的辅助信念和具有全局状态的集中多模态批评者，提升协作。视觉-声学导航任务的实验表明，多智能体方法相比单智能体基线显著提升了性能和效率。我们发现，在显著线索下进行短距离导航的同质协作和有限模态已足够;不同代理之间具有互补模式的异质协作通常高效且有效;在大型复杂环境中导航需要更丰富的多模态感知和更大的模型容量。

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

强化学习能教大型语言模型长视野推理吗？表现力是关键

Authors: Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06638
Pdf link: https://arxiv.org/pdf/2605.06638
Abstract Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics, from simple implication-only logic ("if-then") to more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and that curriculum-based training substantially improves scaling efficiency.
中文摘要 强化学习（RL）已被应用于提升大型语言模型（LLM）推理能力，但由于缺乏受控、可扩展的环境，系统性研究对训练如何随任务难度扩展而受阻。我们介绍了ScaleLogic，一种综合逻辑推理框架，提供对两个难度轴的独立控制：所需的证明规划深度（即视野）和底层逻辑的表现力。我们提出的框架支持广泛的逻辑：从简单的仅含意逻辑（“如果-那么”）到更具表现力的一阶推理，包括合取（“和”）、析取（“或”）、否定（“非”）和全称量化（“对所有”）。利用该框架，我们证明强化学习训练计算$T$在推理深度$D$（$T \propto D^{\gamma}$， $R^{2} > 0.99$）上遵循幂律，且缩放指数$\gamma$随逻辑表达性单调增长，从$1.04$增至$2.60$。在下游数学和一般推理基准测试中，更具表现力的训练设置不仅带来更大的性能提升（最高可达+10.66美元点数），并且相比表现力较低的设置实现更高的计算效率，表明模型所训练的内容，而不仅仅是训练了多少，还会影响下游的传输。我们还进一步证明幂律关系在多种强化学习方法中成立，基于课程的训练显著提升了扩展效率。
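
Illustrative sketch (not from the paper): the abstract above reports a power law $T \propto D^{\gamma}$ with $R^{2} > 0.99$. A relation of this form is commonly estimated by ordinary least squares in log-log space, since $\log T = \log c + \gamma \log D$. The code below assumes that procedure and uses synthetic data generated from $T = 3 D^{2}$. The numbers are not the paper's measurements, and the paper may compute $R^{2}$ differently.

```python
import math


def fit_power_law(depths, compute):
    """Least-squares fit of log T = log c + gamma * log D; returns (c, gamma, r2)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    log_c = my - gamma * mx
    # R^2 is computed on the log-transformed values.
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return math.exp(log_c), gamma, 1.0 - ss_res / ss_tot


# Synthetic example: compute grows quadratically with depth.
depths = [2, 4, 8, 16]
compute = [3.0 * d ** 2 for d in depths]
c, gamma, r2 = fit_power_law(depths, compute)
print(c, gamma, r2)
```

Under this reading, the reported exponents $1.04$ and $2.60$ would mean that doubling the depth multiplies the required compute by roughly $2^{1.04} \approx 2.1$ for the least expressive logic and $2^{2.60} \approx 6.1$ for the most expressive one.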

Recursive Agent Optimization

递归代理优化

Authors: Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.06639
Pdf link: https://arxiv.org/pdf/2605.06639
Abstract We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model's context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
中文摘要 我们介绍递归代理优化（RAO），这是一种用于训练递归代理的强化学习方法：这些代理能够递归地生成并委派子任务给新的实例。递归代理实现了一种推理时间尺度算法，使智能体能够自然地扩展到更长的上下文，并通过分治法推广到更难的问题。RAO提供了一种训练模型的方法，以最好地利用这种递归推理，教代理何时以及如何委托和通信。我们发现，以这种方式训练的递归智能体训练效率更高，能够扩展到超出模型上下文窗口的任务，能够推广到比智能体训练对象更难的任务，并且相较单智能体系统更短的壁钟时间。

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

StraTA：通过战略轨迹抽象激励代理强化学习

Authors: Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.06642
Pdf link: https://arxiv.org/pdf/2605.06642
Abstract Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
中文摘要 大型语言模型（LLM）越来越多地被用作交互代理，但由于现有方法大多是纯反应式的，在长期决策中优化它们仍然困难，这削弱了探索和信用分配的效果。在本研究中，我们提出了战略轨迹抽象（StraTA），这是一个简单的框架，将显式轨迹级策略引入代理强化学习（RL）中。StraTA从初始任务状态抽样紧凑的战略，基于该策略进行后续行动的条件，并结合分层的GRPO式部署设计训练策略生成和行动执行，进一步增强多样化战略部署和关键自我判断。在ALFWorld、WebShop和SciWorld上的实验显示，StraTA在强基线条件下持续提升样本效率和最终性能。StraTA在ALFWorld的成功率为93.1%，在WebShop为84.2%。在SciWorld上，StraTA获得了63.5%的整体评分，优于前沿的闭源模型。

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

超越负面推广：仅正向的政策优化，隐含负梯度

Authors: Mingwei Xu, Hao Fang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.06650
Pdf link: https://arxiv.org/pdf/2605.06650
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and their combinatorial vastness makes penalizing a few sampled negatives unlikely to yield a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution. POPO further stabilizes policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, it replaces the KL divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all levels. Our experiments demonstrate that POPO achieves performance comparable to, or even superior to, GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO's components.
中文摘要 由于确定性验证，带有可验证奖励的强化学习（RLVR）成为提升大型语言模型（LLMs）推理能力的主导范式。社区见证了从近端策略优化（PPO）向群体相对策略优化（GRPO）的快速转变，后者GRPO通过简单估计对分组的正负推广简化了复杂的优势估计。然而，我们注意到负面推广可能不允许失败严重程度的等级，且组合的庞大性使得惩罚少数采样的负面在稀疏二元奖励下难以覆盖有意义的奖励信号。在本研究中，我们提出了仅积极政策优化（POPO），这是一种全新的RLVR框架，其学习可以完全通过在线积极推广实现。具体来说，POPO利用对正向推广集进行有界重要性抽样。因此，梯度指导不使用不相交的负向滚出。我们证明，通过扩展重分布强化正概率，隐性负梯度可以自然出现。接下来，POPO通过两种机制稳定了策略优化。首先，它采用基于动量的适应律的暹罗政策网络，实现政策的稳定演变。其次，我们将KL散度替换为暹罗表示空间中的有界相似性惩罚项。我们利用公开且成熟的文本大型语言模型（如Qwen家族）在所有层级数学基准中进行了广泛实验。我们的实验表明，POPO的性能与GRPO相当甚至更优。值得注意的是，我们显示POPO在AIME 2025中凭借Qwen-Math-7B实现36.67%，优于GRPO 30.00%。我们的消融和扫除研究进一步展示了POPO组分的必要性和稳健性。
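
Illustrative sketch (not from the paper): the abstract above takes GRPO's estimate over grouped positive and negative rollouts as its point of departure. The code below shows the commonly described GRPO-style group-relative advantage, in which each reward is standardized within its group. It assumes binary rewards and a population standard deviation (implementations differ on this). It does not implement POPO, whose details are not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# One correct rollout and three incorrect ones under a binary verifier.
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)

# A group with no correct rollout carries no learning signal at all.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(adv, flat)
```

In this example all three failed rollouts receive the same negative advantage, which is one way to read the abstract's remark that negative rollouts admit no gradation of failure severity under sparse binary rewards.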

Keyword: diffusion policy

There are no results