Arxiv Papers of Today

Generated: 2026-02-04 16:48:09 (UTC+8); arXiv release time: 2026-02-04 20:00 EST (2026-02-05 09:00 UTC+8)

77 related papers today

Keyword: reinforcement learning

GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

GraphDancer：通过课程强化学习训练大型语言模型在图谱上进行探索和推理

Authors: Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02518
Pdf link: https://arxiv.org/pdf/2602.02518
Abstract Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graph-structured knowledge poses two key challenges: (1) navigating structured, schema-defined relations requires precise function calls rather than similarity-based retrieval, and (2) answering complex questions often demands multi-hop evidence aggregation through iterative information seeking. We propose GraphDancer, a reinforcement learning (RL) framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. To make RL effective for moderate-sized LLMs, we introduce a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories using an easy-to-hard biased sampler. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with either a 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code and models can be found at this https URL.
中文摘要 大型语言模型（LLMs）越来越依赖外部知识来提升事实性，但许多现实世界的知识源却以异构图的形式组织，而非纯文本。对此类图结构知识进行推理面临两个关键挑战：（1）导航结构化、模式定义的关系需要精确的函数调用，而非基于相似性的检索;（2）回答复杂问题通常需要通过迭代信息寻求进行多跳证据聚合。我们提出了GraphDancer，这是一个强化学习（RL）框架，通过交错推理和函数执行来教LLM如何导航图。为了使强化学习对中等规模的大型语言模型有效，我们引入了一套基于图的课程，利用易至硬偏差采样器根据信息寻求轨迹的结构复杂度来安排训练。我们通过仅训练一个领域，测试未见域和分布外问题类型，来评估GraphDancer的多域基准测试。尽管仅使用3B骨干，GraphDancer仍优于配备14B骨干或GPT-4o-mini的基线，展现了图探索和推理技能的强大跨域推广能力。我们的代码和模型可在此 https URL 找到。
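An easy-to-hard biased sampler of the kind the abstract mentions could look roughly like the sketch below. The difficulty buckets, the `progress` variable, and the exponential weighting controlled by `sharpness` are illustrative assumptions, not the schedule used in the paper.

```python
import math
import random

def curriculum_weights(num_levels, progress, sharpness=1.0):
    """Sampling probabilities over difficulty levels 0 (easy) .. num_levels - 1 (hard).

    progress in [0, 1] is the fraction of training completed. The distribution
    peaks at the level nearest progress * (num_levels - 1), but every level
    keeps nonzero probability so easy examples are never dropped entirely.
    """
    target = progress * (num_levels - 1)
    scores = [math.exp(-sharpness * abs(level - target)) for level in range(num_levels)]
    total = sum(scores)
    return [score / total for score in scores]

def sample_level(num_levels, progress, rng=random):
    """Draw one difficulty level according to the current curriculum weights."""
    weights = curriculum_weights(num_levels, progress)
    return rng.choices(range(num_levels), weights=weights, k=1)[0]
```

Early in training (`progress` near 0) most draws come from the easiest bucket; as `progress` approaches 1 the mass shifts toward the hardest one.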

Formulating Reinforcement Learning for Human-Robot Collaboration through Off-Policy Evaluation

通过离策略评估构建面向人机协作的强化学习问题

Authors: Saurav Singh, Rodney Sanchez, Alexander Ororbia, Jamison Heard
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02530
Pdf link: https://arxiv.org/pdf/2602.02530
Abstract Reinforcement learning (RL) has the potential to transform real-world decision-making systems by enabling autonomous agents to learn from experience. Deploying RL in real-world settings, especially in the context of human-robot interaction, requires defining state representations and reward functions, which are critical for learning efficiency and policy performance. Traditional RL approaches often rely on domain expertise and trial-and-error, necessitating extensive human involvement as well as direct interaction with the environment, which can be costly and impractical, especially in complex and safety-critical applications. This work proposes a novel RL framework that leverages off-policy evaluation (OPE) for state space and reward function selection, using only logged interaction data. This approach eliminates the need for real-time access to the environment or human-in-the-loop feedback, greatly reducing the dependency on costly real-time interactions. The proposed approach systematically evaluates multiple candidate state representations and reward functions by training offline RL agents and applying OPE to estimate policy performance. The optimal state space and reward function are selected based on their ability to produce high-performing policies under OPE metrics. Our method is validated on two environments: the Lunar Lander environment by OpenAI Gym, which provides a controlled setting for assessing state space and reward function selection, and a NASA-MATB-II human subjects study environment, which evaluates the approach's real-world applicability to human-robot teaming scenarios. This work enhances the feasibility and scalability of offline RL for real-world environments by automating critical RL design decisions through a data-driven OPE-based evaluation, enabling more reliable, effective, and sustainable RL formulation for complex human-robot interaction settings.
中文摘要 强化学习（RL）有潜力通过使自主智能体从经验中学习，改变现实世界的决策系统。在现实环境中部署强化学习，尤其是在人机交互的背景下，需要定义状态表示和奖励函数，这对学习效率和策略执行至关重要。传统的强化学习方法通常依赖领域专业知识和反复试验，需要大量人类参与以及与环境的直接互动，这在复杂且安全关键的应用中可能成本高昂且不切实际。本研究提出了一种新颖的强化学习框架，利用非策略评估（OPE）进行状态空间和奖励函数选择，仅使用记录的交互数据。这种方法消除了对环境的实时访问或人机反馈的需求，大大减少了对昂贵实时交互的依赖。该方法通过训练离线强化学习代理并应用OPE估计策略性能，系统地评估多个候选状态表示和奖励函数。最优状态空间和奖励函数的选择基于其在OPE指标下制定高效政策的能力。我们的方法在两种环境中得到了验证：由OpenAI Gym开发的月球着陆器环境，提供一个受控环境用于评估状态空间和奖励函数选择;以及NASA-MATB-II人类受试者研究环境，评估该方法在人机团队合作场景中的实际适用性。这项工作通过基于数据的OPE评估自动化关键强化学习设计决策，提升了离线强化学习在现实环境中的可行性和可扩展性，从而为复杂的人机交互环境提供更可靠、高效和可持续的强化学习表述。
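The abstract does not say which OPE estimator is used. As a minimal sketch, assuming ordinary per-trajectory importance sampling and a logged-data layout invented here for illustration, selecting among candidate designs might look like this:

```python
def importance_sampling_value(trajectories, gamma=0.99):
    """Ordinary importance-sampling estimate of a target policy's return.

    Each trajectory is a list of (reward, target_prob, behavior_prob) tuples,
    where the probabilities are those of the logged action under the target
    and behavior policies. The reward is assumed to be a common evaluation
    signal shared by all candidates, so estimates are comparable.
    """
    estimates = []
    for trajectory in trajectories:
        weight = 1.0
        discounted_return = 0.0
        for t, (reward, target_prob, behavior_prob) in enumerate(trajectory):
            weight *= target_prob / behavior_prob
            discounted_return += (gamma ** t) * reward
        estimates.append(weight * discounted_return)
    return sum(estimates) / len(estimates)

def select_best_candidate(candidates, gamma=0.99):
    """Return the name of the candidate design with the highest OPE estimate.

    candidates maps a (hypothetical) design name to the logged trajectories
    re-scored under the policy trained with that design.
    """
    return max(candidates, key=lambda name: importance_sampling_value(candidates[name], gamma))
```

Ordinary importance sampling is unbiased but can have high variance on long trajectories, which is one reason practical pipelines often prefer weighted or doubly robust estimators.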

Hypersonic Flow Control: Generalized Deep Reinforcement Learning for Hypersonic Intake Unstart Control under Uncertainty

高超声速流动控制：不确定性下高超声速进气道不起动控制的广义深度强化学习

Authors: Trishit Mondal, Ameya D. Jagtap
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2602.02531
Pdf link: https://arxiv.org/pdf/2602.02531
Abstract The hypersonic unstart phenomenon poses a major challenge to reliable air-breathing propulsion at Mach 5 and above, where strong shock-boundary-layer interactions and rapid pressure fluctuations can destabilize inlet operation. Here, we demonstrate a deep reinforcement learning (DRL)-based active flow control strategy to control unstart in a canonical two-dimensional hypersonic inlet at Mach 5 and Reynolds number $5\times 10^6$. The in-house CFD solver enables high-fidelity simulations with adaptive mesh refinement, resolving key flow features, including shock motion, boundary-layer dynamics, and flow separation, that are essential for learning physically consistent control policies suitable for real-time deployment. The DRL controller robustly stabilizes the inlet over a wide range of back pressures representative of varying combustion chamber conditions. It further generalizes to previously unseen scenarios, including different back-pressure levels, Reynolds numbers, and sensor configurations, while operating with noisy measurements, thereby demonstrating strong zero-shot generalization. Control remains robust in the presence of noisy sensor measurements, and a minimal, optimally selected sensor set achieves comparable performance, enabling practical implementation. These results establish a data-driven approach for real-time hypersonic flow control under realistic operational uncertainties.
中文摘要 高超声速不起动现象对马赫5及以上的可靠吸气式推进构成重大挑战，强烈的激波-边界层相互作用和快速压力波动可能破坏进气道的稳定运行。本文展示了一种基于深度强化学习（DRL）的主动流动控制策略，用于在马赫5、雷诺数 $5\times 10^6$ 条件下控制典型二维高超声速进气道的不起动。自研CFD求解器支持带自适应网格细化的高保真模拟，能够解析激波运动、边界层动力学和流动分离等关键流动特征，这些特征对于学习物理一致且适合实时部署的控制策略至关重要。DRL控制器能在代表不同燃烧室条件的宽范围背压下稳健地稳定进气道。它还能推广到此前未见的场景，包括不同的背压水平、雷诺数和传感器配置，并且在含噪声测量下运行，从而展示了强大的零样本泛化能力。在传感器测量存在噪声的情况下，控制依然稳健，并且一组最小的、经优化选择的传感器即可达到相当的性能，从而便于实际应用。这些结果确立了一种数据驱动的方法，可在现实运行不确定性下实现实时高超声速流动控制。

CADENT: Gated Hybrid Distillation for Sample-Efficient Transfer in Reinforcement Learning

CADENT：门控混合蒸馏用于强化学习中的样本高效转移

Authors: Mahyar Alinejad, Yue Wang, George Atia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02532
Pdf link: https://arxiv.org/pdf/2602.02532
Abstract Transfer learning promises to reduce the high sample complexity of deep reinforcement learning (RL), yet existing methods struggle with domain shift between source and target environments. Policy distillation provides powerful tactical guidance but fails to transfer long-term strategic knowledge, while automaton-based methods capture task structure but lack fine-grained action guidance. This paper introduces Context-Aware Distillation with Experience-gated Transfer (CADENT), a framework that unifies strategic automaton-based knowledge with tactical policy-level knowledge into a coherent guidance signal. CADENT's key innovation is an experience-gated trust mechanism that dynamically weighs teacher guidance against the student's own experience at the state-action level, enabling graceful adaptation to target domain specifics. Across challenging environments, from sparse-reward grid worlds to continuous control tasks, CADENT achieves 40-60% better sample efficiency than baselines while maintaining superior asymptotic performance, establishing a robust approach for adaptive knowledge transfer in RL.
中文摘要 迁移学习有望降低深度强化学习（RL）的高样本复杂度，但现有方法在源与目标环境之间的领域转换方面存在困难。策略提炼提供了强大的战术指导，但未能传递长期战略知识;而基于自动机的方法则捕捉任务结构，但缺乏细致的行动指导。本文介绍了带有经验门控转移的上下文感知蒸馏（CADENT），这是一个将基于战略自动机的知识与战术政策级知识统一为连贯指导信号的框架。CADENT的核心创新是一种体验门槛信任机制，能够动态权衡教师指导与学生自身在国家行动层面的体验，从而实现对目标领域具体性的优雅调整。在从稀疏奖励网格世界到连续控制任务等具有挑战性环境的环境中，CADENT在保持优越渐近性能的同时，比基线提升了40-60%的采样效率，建立了强化学习中自适应知识转移的稳健方法。

Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

超越对齐：通过流形重塑策略优化扩展推理能力

Authors: Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02545
Pdf link: https://arxiv.org/pdf/2602.02545
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). However, recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model's low-rank bias manifold. In this work, we challenge this accessibility boundary hypothesis by demonstrating that the latent reasoning space can be fundamentally expanded through targeted geometric interventions. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs. MRPO operates in two stages: first, we employ Spectral Orthogonal Exploration (SOE) to eject the policy initialization into the null space of the bias manifold; second, we integrate an Effective Rank regularization term into the policy optimization objective. This approach incentivizes the discovery and maintenance of high-dimensional reasoning trajectories against the entropy-reducing tendency of standard RL. Empirically, our 4B-parameter method achieves state-of-the-art performance on mathematical tasks, significantly outperforming larger models (e.g., Qwen3-32B) and expanding the capability boundary beyond standard GRPO. Our code is available at this https URL
中文摘要 带可验证奖励的强化学习（RLVR）在提升大型语言模型（LLMs）推理能力方面取得了显著成功。然而，近期研究质疑强化学习是否真正扩展了推理能力，还是仅仅对应了现有潜在能力，认为探索仍局限于预训练模型的低秩偏置流形中。在本研究中，我们通过证明潜在推理空间可以通过有针对性的几何干预从根本上扩展，挑战这一可及边界假说。我们提出了流形重塑策略优化（MRPO），这是一种几何框架，旨在从根本上重构LLMs的推理空间。MRPO分为两个阶段：首先，我们采用谱正交探索（SOE）将策略初始化弹射到偏置流形的零空间;其次，我们将有效排名正则化项整合到策略优化目标中。这种方法激励人们发现并维持高维推理轨迹，以抵御标准强化学习的熵减少倾向。从经验上看，我们的4B参数方法在数学任务中实现了最先进的性能，显著优于大型模型（如Qwen3-32B），并将能力边界扩展到标准GRPO之外。我们的代码可在此 https 网址获取
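The abstract does not define its Effective Rank term. A common entropy-based definition, assumed here purely for illustration, takes the exponential of the Shannon entropy of the normalized singular values. How such a quantity would be weighted inside the MRPO objective is not stated and is not shown.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    The singular values are normalized into a probability distribution; the
    effective rank is exp(entropy). It equals the number of singular values
    when they are all equal and approaches 1 when a single one dominates.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A regularizer built on this quantity would reward representations whose spectrum is spread over many directions, which matches the abstract's stated goal of maintaining high-dimensional reasoning trajectories.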

BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

BatCoder：通过反向翻译实现的自我监督双向代码-文档学习

Authors: Jingwen Xu, Yiyang Lu, Zisu Huang, Changze Lv, Xiaohua Wang, Shizheng Li, Zhibo Xu, Zhengkang Guo, Zhengyuan Wang, Muzhao Tian, Xuanjing Huang, Xiaoqing Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.02554
Pdf link: https://arxiv.org/pdf/2602.02554
Abstract Training LLMs for code-related tasks typically depends on high-quality code-documentation pairs, which are costly to curate and often scarce for niche programming languages. We introduce BatCoder, a self-supervised reinforcement learning framework designed to jointly optimize code generation and documentation production. BatCoder employs a back-translation strategy: a documentation is first generated from code, and then the generated documentation is used to reconstruct the original code. The semantic similarity between the original and reconstructed code serves as an implicit reward, enabling reinforcement learning to improve the model's performance both in generating code from documentation and vice versa. This approach allows models to be trained using only code, substantially increasing the available training examples. Evaluated on HumanEval and MBPP with a 7B model, BatCoder achieved 83.5% and 81.0% pass@1, outperforming strong open-source baselines. Moreover, the framework demonstrates consistent scaling with respect to both training corpus size and model capacity.
中文摘要 用于代码相关任务的LLM训练通常依赖高质量的代码-文档对，而这些对策划成本高昂，且在小众编程语言中往往稀缺。我们介绍BatCoder，一个自监督强化学习框架，旨在联合优化代码生成和文档生成。BatCoder 采用反向翻译策略：先从代码生成文档，然后用生成的文档重建原始代码。原始代码与重建代码之间的语义相似性作为隐性奖励，使强化学习能够提升模型在从文档中生成代码以及反之皆然的性能。这种方法允许仅用代码训练模型，显著增加了可用的训练样本。在HumanEval和MBPP上以7B模型进行评估，BatCoder实现了83.5%和81.0%的pass@1，优于强劲的开源基线。此外，该框架在训练语料库大小和模型容量方面均表现出一致的扩展性。
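For readers unfamiliar with the pass@1 metric reported above, the widely used unbiased pass@k estimator is sketched below. Whether BatCoder's evaluation draws more than one sample per problem is not stated in the abstract, so the sample counts in any usage are assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples that pass the tests,
    k: budget being evaluated (k <= n). Returns the probability that at least
    one of k samples drawn without replacement from the n is correct.
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to c / n, the fraction of passing samples; the benchmark score is the mean over all problems.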

Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

用参数空间噪声学习探索：深入探讨参数空间噪声用于可验证奖励的强化学习

Authors: Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02555
Pdf link: https://arxiv.org/pdf/2602.02555
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
中文摘要 带可验证奖励的强化学习（RLVR）提升了LLM推理能力，但越来越多的证据表明探索空间存在：它常常重新加权现有的解迹，而非发现新策略，限制了在大抽样预算下（例如256时通过）下的收益。我们通过PSN-RLVR解决了这一限制，该方法在推出生成前扰动政策参数，以诱导时间一致的轨迹级探索，从而比行动空间噪声更好地保持长视野的思维链连贯性。为减少由此产生的抽样与更新不匹配，我们采用截断重要性抽样（TIS）。为了避免昂贵的基于KL的自适应噪声控制，我们提出了一种计算效率高的实时自适应噪声调度器，由一个轻量级代理驱动，结合语义多样性和归一化自确定性。PSN-GRPO基于广泛使用的RLVR方法GRPO，持续扩展多个数学推理基准和模型族的有效推理能力边界，在大采样预算下实现更高的通过k，优于以往探索导向的RLVR方法（如通过k的训练），同时保持正交且可组合以获得额外收益。
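Truncated importance sampling, which the abstract uses to correct the mismatch between the perturbed policy that generates rollouts and the policy being updated, can be sketched as follows. The per-token granularity and the `clip` value of 2.0 are assumptions made for illustration, not settings taken from the paper.

```python
import math

def truncated_is_weights(logp_target, logp_behavior, clip=2.0):
    """Importance weights min(pi_target / pi_behavior, clip) for each token.

    logp_target: log-probabilities under the policy being updated.
    logp_behavior: log-probabilities under the policy that generated the
    rollout (here, the parameter-perturbed policy).
    """
    weights = []
    for log_target, log_behavior in zip(logp_target, logp_behavior):
        ratio = math.exp(log_target - log_behavior)
        weights.append(min(ratio, clip))  # cap large ratios to bound variance
    return weights
```

Capping the ratio trades a small bias for bounded variance: tokens the noisy policy sampled with very low probability cannot dominate the gradient.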

QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

QuantLRM：通过微调信号对大型推理模型进行量化

Authors: Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02581
Pdf link: https://arxiv.org/pdf/2602.02581
Abstract Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term "protecting both ends". Upon hypothesis validation, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying the average quadratic values with the count of zero weight updates of channels, we compute channel importance that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (including supervised, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers a consistent improvement for LRMs quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. Also supporting non-fine-tuned LRMs, QuantLRM gathers effective signals via pseudo-fine-tuning, which greatly enhances its applicability.
中文摘要 仅权重量化对于压缩大型语言模型（LLM）非常重要。受经典幅度修剪精神的启发，我们研究推理激励微调过程中权重更新的幅度是否能为大型推理模型（LRM）的量子化提供有价值的信号。我们假设微调过程中最小和最大的权重更新比中等大小的更重要，我们称之为“保护两端”。假设验证后，我们引入QuantLRM，即通过微调信号对LRMs进行权重量子化。我们在权重更新上拟合简单的受限二次函数，以保护两端。通过将平均二次值乘以信道零权重更新的计数，我们计算出比激活信息或二阶信息更高效的信道重要性。我们运行QuantLRM对多个微调模型（包括监督式、直接偏好优化和强化学习微调）进行四种推理基准测试（AIME-120、FOLIO、时间序列和GPQA-Diamond）的量子化，实证发现QuantLRM在LRM量化方面持续提升，强化学习微调模型平均提升6.55%。QuantLRM同样支持非微调的LRM，通过伪微调收集有效信号，大大提升了其适用性。

ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

ContextEvolve：多智能体上下文压缩以实现系统代码优化

Authors: Hongyuan Su, Yu Zheng, Yong Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02597
Pdf link: https://arxiv.org/pdf/2602.02597
Abstract Large language models are transforming systems research by automating the discovery of performance-critical algorithms for computer systems. Although LLMs generate plausible code, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test-time reinforcement learning offers high search efficiency but requires parameter updates infeasible under API-only access, while existing training-free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi-agent framework that achieves RL-level search efficiency under strict parameter-blind constraints by decomposing optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code-to-language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL, mapping to state representation, policy gradient, and experience replay, and enables principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%. Codes for our work are released at this https URL
中文摘要 大型语言模型通过自动化发现计算机系统性能关键算法，正在改变系统研究。尽管大型语言模型生成的代码合理，但能够满足系统严格正确性和性能要求的解决方案仍需要迭代优化。测试时强化学习具有较高的搜索效率，但需要参数更新，这在仅靠API访问下不可行;而现有无训练的进化方法则存在上下文利用效率低下和无向搜索的问题。我们介绍了ContextEvolve，这是一个多智能体框架，通过将优化上下文分解为三个正交维度，在严格参数盲约束下实现强化学习级别的搜索效率：摘要代理通过代码到语言的抽象凝聚语义状态，导航代理通过轨迹分析提炼优化方向，采样代理通过优先级范例检索来策划经验分布。这种编排形成了与强化学习映射到状态表示、策略梯度以及文本潜在空间中实现重放的原则性优化的函数同构。在ADRS基准测试中，ContextEvolve的表现优于最先进基线33.3%，同时减少了29.0%的代币消耗。我们的工作代码会在此 https URL 发布

BinaryPPO: Efficient Policy Optimization for Binary Classification

二元PPO：二元分类的高效策略优化

Authors: Punya Syon Pandey, Zhijing Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02708
Pdf link: https://arxiv.org/pdf/2602.02708
Abstract Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain-specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in-depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence-based reward design provides a robust alternative to SFT for binary classification. Our code is available at this https URL.
中文摘要 监督微调（SFT）是二元分类任务（如毒性检测、事实性验证和因果推断）的标准方法。然而，SFT在实际环境中常常表现不佳，比如标签噪声、类不平衡或监督稀疏。我们介绍了BinaryPPO，一种离线强化学习大型语言模型（LLM）框架，将二元分类重新表述为奖励最大化问题。我们的方法利用了近端策略优化（PPO）的一个变体，采用置信加权奖励函数，惩罚不确定或错误的预测，使模型能够从静态数据集中学习稳健的决策策略，而无需在线交互。在八个领域特定基准和多个不同架构的模型中，BinaryPPO 的准确率提升了 40-60 个百分点，达到高达 99%，远远优于监督基准。我们深入分析了奖励塑造、优势规模化和政策稳定性在实现这一改进中的作用。总体而言，我们证明基于置信度的奖励设计为二元分类提供了SFT的有力替代方案。我们的代码可在此 https URL 访问。

Maximum Likelihood Reinforcement Learning

最大似然强化学习

Authors: Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02710
Pdf link: https://arxiv.org/pdf/2602.02710
Abstract Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods in all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe MaxRL to scale better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness based settings.
中文摘要 强化学习是基于采样的模型训练方法，采用二元结果反馈，如导航、代码生成和数学问题解决。在这种环境下，模型隐含地诱导了对正确推广的可能性。然而，我们观察到强化学习并未最大化这种似然，而是只优化了低阶近似。受这一观察启发，我们引入了最大似然强化学习（MaxRL），这是一个基于抽样的框架，用强化学习技术近似最大似然。MaxRL通过定义一组计算索引的样本目标，在分配额外抽样计算时，在标准强化学习和精确最大似然之间插值，解决了不可微抽样的挑战。所得目标具有简单、无偏的策略梯度估计，并在无限计算极限下收敛至最大似然优化。实证显示，我们证明MaxRL在所有测试的模型和任务中都以帕累托方式为主导，测试时间的扩展效率提升高达20倍，相比GRPO训练的对应方法。我们还观察到MaxRL能更好地扩展，增加数据并进行计算。我们的结果表明，MaxRL是一个有前景的框架，用于在基于正确性的环境中扩展强化学习训练。
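One way to read the claim that standard RL optimizes only a lower-order approximation of the likelihood is through the series log p = -sum_{k>=1} (1 - p)^k / k, where p is the probability of a correct rollout. Truncating at order 1 gives p - 1, which has the same gradient as the expected binary reward p. The sketch below only checks this series numerically; it is an interpretation offered as an assumption, and the paper's compute-indexed objectives may be constructed differently.

```python
import math

def truncated_log(p, order):
    """Order-T truncation of log(p) expanded in powers of (1 - p), for 0 < p <= 1."""
    return -sum((1.0 - p) ** k / k for k in range(1, order + 1))

def approximation_gap(p, order):
    """How far the truncated series sits above the exact log-likelihood log(p)."""
    return truncated_log(p, order) - math.log(p)
```

The gap is always non-negative and shrinks as the order grows, fastest when p is close to 1, which is consistent with low-order objectives being least faithful to maximum likelihood on hard problems with low pass rates.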

Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion

层级实体中心强化学习与分解子目标扩散

Authors: Dan Haramati, Carl Qi, Tal Daniel, Amy Zhang, Aviv Tamar, George Konidaris
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02722
Pdf link: https://arxiv.org/pdf/2602.02722
Abstract We propose a hierarchical entity-centric framework for offline Goal-Conditioned Reinforcement Learning (GCRL) that combines subgoal decomposition with factored structure to solve long-horizon tasks in domains with multiple entities. Achieving long-horizon goals in complex environments remains a core challenge in Reinforcement Learning (RL). Domains with multiple entities are particularly difficult due to their combinatorial complexity. GCRL facilitates generalization across goals and the use of subgoal structure, but struggles with high-dimensional observations and combinatorial state-spaces, especially under sparse reward. We employ a two-level hierarchy composed of a value-based GCRL agent and a factored subgoal-generating conditional diffusion model. The RL agent and subgoal generator are trained independently and composed post hoc through selective subgoal generation based on the value function, making the approach modular and compatible with existing GCRL algorithms. We introduce new variations to benchmark tasks that highlight the challenges of multi-entity domains, and show that our method consistently boosts performance of the underlying RL agent on image-based long-horizon tasks with sparse rewards, achieving over 150% higher success rates on the hardest task in our suite and generalizing to increasing horizons and numbers of entities. Rollout videos are provided at: this https URL
中文摘要 我们提出了一种层级实体中心框架，用于离线目标条件强化学习（GCRL），结合子目标分解与分解结构，以解决多实体领域中的长期任务。在复杂环境中实现长期目标仍然是强化学习（RL）的核心挑战。具有多个实体的领域因其组合复杂性而特别困难。GCRL促进了跨目标的泛化和子目标结构的使用，但在高维观测和组合状态空间方面存在困难，尤其是在奖励稀疏的情况下。我们采用了两层级结构，由基于价值的GCRL代理和一个分解的子目标生成条件扩散模型组成。强化学习代理和子目标生成器分别独立训练，并通过基于价值函数的选择性子目标生成事后组合，使该方法模块化，并兼容现有的GCRL算法。我们引入了基准测试任务的新变体，凸显了多实体领域的挑战，并证明我们的方法在基于图像的长视野任务中持续提升底层强化学习代理的性能，奖励稀疏，在我们套件中最难的任务中成功率提高超过150%，并推广到增加视野和实体数量。推广视频可在以下链接提供：此 https URL

From Tokens to Numbers: Continuous Number Modeling for SVG Generation

从词元到数值：面向SVG生成的连续数值建模

Authors: Michael Ogezi, Martin Bell, Freda Shi, Ethan Smith
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.02820
Pdf link: https://arxiv.org/pdf/2602.02820
Abstract For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster-based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first-class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model's inputs with the data's continuous nature, removing discretization artifacts introduced by token-based encoding. We then train a multimodal transformer on 2 million raster-to-SVG samples, followed by fine-tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high-quality vector generation, with potential for broader applications. We make our code available at this http URL.
中文摘要 对于某些图像生成任务，矢量图形如可扩展矢量图形（SVG）提供了明显优势，如灵活性提升、尺寸效率和编辑简便性，但仍不及基于光栅的方法被广泛探索。一个核心挑战是，SVG中占很大比例的数值几何参数被低效地编码成长序列的代币。这会减慢训练速度，降低准确率，并损害泛化能力。为解决这些问题，我们提出了连续数建模（CNM）方法，将数字直接建模为一类连续值，而非离散符号。该表述通过将模型输入与数据的连续性对齐，恢复了表示的数学优雅性，消除了基于标记编码引入的离散化伪影。随后，我们用200万个光栅转SVG样本训练多模态变换器，随后通过强化学习利用感知反馈进行微调，进一步提升视觉质量。我们的方法相比其他方法，训练速度提升超过30%，同时保持更高的感知真实度。这项工作确立了CNM作为一种实用且高效的高质量载体生成方法，具有更广泛应用潜力。我们会将代码提供这个 http URL。

Adaptive Linear Path Model-Based Diffusion

基于模型的自适应线性路径扩散

Authors: Yutaka Shimizu, Masayoshi Tomizuka
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02831
Pdf link: https://arxiv.org/pdf/2602.02831
Abstract The interest in combining model-based control approaches with diffusion models has been growing. Although we have seen many impressive robotic control results in difficult tasks, the performance of diffusion models is highly sensitive to the choice of scheduling parameters, making parameter tuning one of the most critical challenges. We introduce Linear Path Model-Based Diffusion (LP-MBD), which replaces the variance-preserving schedule with a flow-matching-inspired linear probability path. This yields a geometrically interpretable and decoupled parameterization that reduces tuning complexity and provides a stable foundation for adaptation. Building on this, we propose Adaptive LP-MBD (ALP-MBD), which leverages reinforcement learning to adjust diffusion steps and noise levels according to task complexity and environmental conditions. Across numerical studies, Brax benchmarks, and mobile-robot trajectory tracking, LP-MBD simplifies scheduling while maintaining strong performance, and ALP-MBD further improves robustness, adaptability, and real-time efficiency.
中文摘要 将基于模型的控制方法与扩散模型结合的兴趣日益增长。尽管我们在艰难任务中见过许多令人印象深刻的机器人控制成果，扩散模型的性能对调度参数的选择极为敏感，使参数调优成为最关键的挑战之一。我们引入了基于流的线性路径模型扩散（LP-MBD），用流量匹配启发的线性概率路径替代了保持方差的计划。这带来了几何上可解释且解耦的参数化，降低调优复杂度，并为适应提供稳定基础。基于此，我们提出了自适应LP-MBD（ALP-MBD），利用强化学习根据任务复杂性和环境条件调整扩散步骤和噪声水平。在数值研究、Brax基准测试和移动机器人轨迹跟踪方面，LP-MBD简化了调度，同时保持了强劲的性能，ALP-MBD进一步提升了鲁棒性、适应性和实时效率。

Causal Flow Q-Learning for Robust Offline Reinforcement Learning

因果流Q-Learning用于稳健的离线强化学习

Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02847
Pdf link: https://arxiv.org/pdf/2602.02847
Abstract Expressive policies based on flow-matching have been successfully applied in reinforcement learning (RL) more recently due to their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume that there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when a mismatch exists between the demonstrator's and the learner's sensory capabilities, leading to implicit confounding biases in offline data. We address the challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes policies' worst-case performance that may arise due to confounding biases. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120% that of confounding-unaware, state-of-the-art offline RL methods.
中文摘要 基于流匹配的表达式策略近年来在强化学习（RL）中被成功应用，因为它们能够从离线数据中模拟复杂动作分布。这些算法建立在标准策略梯度之上，假设数据中不存在未测量的混杂因素。然而，当基于像素的演示存在演示者与学习者的感官能力不匹配时，这种条件不一定成立，导致离线数据中隐含的混杂偏差。我们通过从因果角度探讨离线强化学习中混淆观察的问题来应对这一挑战。我们开发了一种新型因果离线强化学习目标，优化了因混杂偏差可能产生的策略最坏情况表现。基于这一新目标，我们引入了一种实用实现，通过混淆演示学习表达式流匹配策略，利用深度判别器评估目标策略与名义行为策略之间的差异。跨越25个基于像素的任务的实验表明，我们提出的混杂稳健增强程序的成功率是无干扰且最先进的离线强化学习方法的120%。

Latent Perspective-Taking via a Schrödinger Bridge in Influence-Augmented Local Models

在影响增强局部模型中通过薛定谔桥实现潜在透视获取

Authors: Kevin Alcedo, Pedro U. Lima, Rachid Alami
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.02857
Pdf link: https://arxiv.org/pdf/2602.02857
Abstract Operating in environments alongside humans requires robots to make decisions under uncertainty. In addition to exogenous dynamics, they must reason over others' hidden mental-models and mental-states. While Interactive POMDPs and Bayesian Theory of Mind formulations are principled, exact nested-belief inference is intractable, and hand-specified models are brittle in open-world settings. We address both by learning structured mental-models and an estimator of others' mental-states. Building on the Influence-Based Abstraction, we instantiate an Influence-Augmented Local Model to decompose socially-aware robot tasks into local dynamics, social influences, and exogenous factors. We propose (a) a neuro-symbolic world model instantiating a factored, discrete Dynamic Bayesian Network, and (b) a perspective-shift operator modeled as an amortized Schrödinger Bridge over the learned local dynamics that transports factored egocentric beliefs into other-centric beliefs. We show that this architecture enables agents to synthesize socially-aware policies in model-based reinforcement learning, via decision-time mental-state planning (a Schrödinger Bridge in belief space), with preliminary results in a MiniGrid social navigation task.
中文摘要 与人类共处的环境中，机器人必须在不确定性中做出决策。除了外生动力学，他们还必须推理他人隐藏的心理模型和心理状态。虽然交互式POMDP和贝叶斯心智理论的表述具有原则性，但精确嵌套信念推断难以处理，手工指定的模型在开放世界中较为脆弱。我们通过学习结构化心理模型和对他人心理状态的估计来应对这两点。基于基于影响力的抽象，我们实现了一个影响增强局部模型，将具有社会意识的机器人任务分解为局部动态、社会影响和外生因素。我们提出：（a）一个神经符号世界模型，实例化一个因式分解的离散动态贝叶斯网络，（b）一个视角转换算符，作为在已学习的局部动力学上的摊销薛定谔桥，将因式分解的自我中心信念转化为他者中心信念。我们展示了该架构使智能体能够通过基于模型的强化学习综合社会意识政策，通过决策时间心理状态规划（信念空间中的薛定谔桥），并在MiniGrid社会导航任务中获得初步结果。

IMAGINE: Intelligent Multi-Agent Godot-based Indoor Networked Exploration

IMAGINE：基于Godot的智能多智能体室内网络化探索

Authors: Tiago Leite, Maria Conceição, António Grilo
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02858
Pdf link: https://arxiv.org/pdf/2602.02858
Abstract The exploration of unknown, Global Navigation Satellite System (GNSS) denied environments by an autonomous communication-aware and collaborative group of Unmanned Aerial Vehicles (UAVs) presents significant challenges in coordination, perception, and decentralized decision-making. This paper implements Multi-Agent Reinforcement Learning (MARL) to address these challenges in a 2D indoor environment, using high-fidelity game-engine simulations (Godot) and continuous action spaces. Policy training aims to achieve emergent collaborative behaviours and decision-making under uncertainty using Network-Distributed Partially Observable Markov Decision Processes (ND-POMDPs). Each UAV is equipped with a Light Detection and Ranging (LiDAR) sensor and can share data (sensor measurements and a local occupancy map) with neighbouring agents. Inter-agent communication constraints include limited range, bandwidth and latency. Extensive ablation studies evaluated MARL training paradigms, reward function, communication system, neural network (NN) architecture, memory mechanisms, and POMDP formulations. This work jointly addresses several key limitations in prior research, namely reliance on discrete actions, single-agent or centralized formulations, assumptions of a priori knowledge and permanent connectivity, inability to handle dynamic obstacles, short planning horizons and architectural complexity in Recurrent NNs/Transformers. Results show that the scalable training paradigm, combined with a simplified architecture, enables rapid autonomous exploration of an indoor area. The implementation of Curriculum-Learning (five increasingly complex levels) also enabled faster, more robust training. This combination of high-fidelity simulation, MARL formulation, and computational efficiency establishes a strong foundation for deploying learned cooperative strategies in physical robotic systems.
中文摘要 由自主通信感知且协作的无人机（UAV）组探索未知的全球导航卫星系统（GNSS）无法覆盖的环境，在协调、感知和去中心化决策方面带来了重大挑战。本文通过高保真游戏引擎仿真（Godot）和连续动作空间，实现多智能体强化学习（MARL）以解决二维室内环境中的这些挑战。政策培训旨在利用网络分布式部分可观测马尔可夫决策过程（ND-POMDPs）实现在不确定性下涌现的协作行为和决策。每架无人机都配备了光探测与测距（LiDAR）传感器，并可与邻近代理共享数据（传感器测量和本地占用地图）。代理间通信的限制包括范围、带宽和延迟的限制。广泛的消融研究评估了MARL训练范式、奖励函数、通信系统、神经网络（NN）架构、记忆机制及POMDP的表述。本研究共同解决了先前研究中的若干关键局限性，即依赖离散动作、单智能体或集中式表述、先验知识和永久连接的假设、无法处理动态障碍、规划期短以及循环神经网络/变换器中的架构复杂性。结果表明，可扩展的训练范式结合简化架构，使室内区域能够实现快速自主探索。课程学习（五个日益复杂的层级）的实施也使培训更快、更有力。高保真模拟、MARL表述和计算效率的结合，为在物理机器人系统中部署已学习的协作策略奠定了坚实基础。

Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

离线强化学习的流形约束能量转移模型

Authors: Zeyu Fang, Zuyuan Zhang, Mahdi Imani, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02900
Pdf link: https://arxiv.org/pdf/2602.02900
Abstract Model-based offline reinforcement learning is brittle under distribution shift: policy improvement drives rollouts into state-action regions weakly supported by the dataset, where compounding model error yields severe value overestimation. We propose Manifold-Constrained Energy-based Transition Models (MC-ETM), which train conditional energy-based transition models using a manifold projection-diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives by perturbing latent codes and running Langevin dynamics in latent space with the learned conditional energy, sharpening the energy landscape around the dataset support and improving sensitivity to subtle out-of-distribution deviations. For policy optimization, the learned energy provides a single reliability signal: rollouts are truncated when the minimum energy over sampled next states exceeds a threshold, and Bellman backups are stabilized via pessimistic penalties based on Q-value-level dispersion across energy-guided samples. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk. Empirically, MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage.
中文摘要 基于模型的离线强化学习在分布转移下较为脆弱：策略改进推动了数据集支持较弱的状态-动作区域，累计模型误差导致严重的价值高估。我们提出了流形受限能量转移模型（MC-ETM），利用流形投影-扩散负采样器训练条件能量转移模型。MC-ETM通过扰动潜在码并在潜空间运行朗之文动力学，生成近流形硬负，利用所学条件能量在潜空间运行，从而增强数据集支持周围的能量景观，提高对细微分布外偏差的敏感度。在策略优化中，学习到的能量提供单一的可靠性信号：当采样的下一态的最低能量超过阈值时，展开被截断，贝尔曼备份则通过基于Q值水平在能量引导样本中的悲观惩罚来稳定。我们通过混合悲观MDP表述形式化MC-ETM，并推导出一个保守的性能界限，将支持内评估误差与截断风险区分开来。从经验角度看，MC-ETM在标准离线控制基准测试上提高了多步动态保真度，并在不规则动力学和稀疏数据覆盖下获得更高的归一化收益。
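The two uses of the learned energy described in the abstract above (rollout truncation and pessimistic Bellman backups) can be illustrated with a minimal sketch. The function names, the use of the population standard deviation as the dispersion measure, and the `penalty_coef` parameter are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import statistics


def should_truncate(energies, threshold):
    """Stop the rollout when even the lowest-energy sampled next state
    exceeds the reliability threshold (hypothetical helper)."""
    return min(energies) > threshold


def pessimistic_target(reward, q_values, gamma=0.99, penalty_coef=1.0):
    """Bellman target that subtracts a dispersion penalty.

    q_values: Q estimates at the sampled next states.
    The penalty here is the population standard deviation, which is an
    assumption; the paper may use a different dispersion measure.
    """
    mean_q = statistics.fmean(q_values)
    dispersion = statistics.pstdev(q_values)
    return reward + gamma * (mean_q - penalty_coef * dispersion)


# Low-energy sample present: keep rolling out.
print(should_truncate([0.2, 1.5], threshold=1.0))
# Disagreeing Q estimates lower the target.
print(pessimistic_target(0.0, [1.0, 3.0], gamma=1.0))
```

When the Q estimates agree, the penalty vanishes and the target reduces to the usual one-step backup.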

Spatiotemporal Decision Transformer for Traffic Coordination

时空决策变换器用于交通协调

Authors: Haoran Su, Yandong Sun, Hanxiao Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02903
Pdf link: https://arxiv.org/pdf/2602.02903
Abstract Traffic signal control is a critical challenge in urban transportation, requiring coordination among multiple intersections to optimize network-wide traffic flow. While reinforcement learning has shown promise for adaptive signal control, existing methods struggle with multi-agent coordination and sample efficiency. We introduce MADT (Multi-Agent Decision Transformer), a novel approach that reformulates multi-agent traffic signal control as a sequence modeling problem. MADT extends the Decision Transformer paradigm to multi-agent settings by incorporating: (1) a graph attention mechanism for modeling spatial dependencies between intersections, (2) a temporal transformer encoder for capturing traffic dynamics, and (3) return-to-go conditioning for target performance specification. Our approach enables offline learning from historical traffic data, with architecture design that facilitates potential online fine-tuning. Experiments on synthetic grid networks and real-world traffic scenarios demonstrate that MADT achieves state-of-the-art performance, reducing average travel time by 5-6% compared to the strongest baseline while exhibiting superior coordination among adjacent intersections.
中文摘要 交通信号控制是城市交通中的关键挑战，需要多个路口协调以优化全网交通流量。虽然强化学习在自适应信号控制方面展现出潜力，但现有方法在多智能体协调和样本效率方面存在困难。我们介绍了MADT（多智能体决策变换器），这是一种新颖的方法，将多智能体交通信号控制重新表述为序列建模问题。MADT 将决策变换器范式扩展到多智能体环境，整合了：（1）用于建模交叉点间空间依赖的图关注机制，（2）用于捕捉交通动态的时序变换器编码器，以及（3）用于目标性能指定的回归条件。我们的方法支持从历史交通数据中离线学习，架构设计便于潜在的在线微调。在合成网格网络和真实交通场景上的实验表明，MADT 实现了最先进的性能，平均行驶时间比最强基线缩短了 5-6%，同时在相邻交叉口之间展现出更优越的协调性。

Notes on the Reward Representation of Posterior Updates

关于后验更新的奖励表示的注记

Authors: Pedro A. Ortega
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.02912
Pdf link: https://arxiv.org/pdf/2602.02912
Abstract Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighing of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring one reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
中文摘要 现代控制与强化学习中的许多想法将决策视为推理：从基线分布出发，信号到达时更新。我们问，什么时候可以把这变成字面意义而非隐喻。我们研究一个特例，即KL正则化软更新恰好是单一固定概率模型中的贝叶斯后验，因此更新变量是信息传输的真实通道。在这种体制下，行为改变仅由该渠道携带的证据驱动：更新必须可解释为对基线的证据重新权衡。这给出了一个清晰的识别结果：后验更新决定了相对的、依赖情境的激励信号，从而改变行为，但它们并不能唯一决定绝对奖励，因为绝对奖励在情境特定基线前仍然模糊不清。要求在不同更新方向之间重复使用一个延续值，增加了连接不同条件顺序奖励描述的相干约束。
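The identification claim in the abstract above can be made concrete with a short worked derivation. The notation is assumed for illustration and is not taken from the paper: $\pi_0$ is the baseline policy, $r$ a reward, $\beta > 0$ an inverse temperature, $y$ the observed signal, and $c(x)$ an arbitrary context-dependent offset.

```latex
% Assumed notation: \pi_0 baseline, r reward, \beta inverse temperature, y signal.
\begin{align}
\pi(a \mid x) &= \frac{\pi_0(a \mid x)\,\exp\{\beta\, r(x,a)\}}
                      {\sum_{a'} \pi_0(a' \mid x)\,\exp\{\beta\, r(x,a')\}}
  && \text{(KL-regularized soft update)} \\
P(a \mid x, y) &= \frac{P(a \mid x)\, P(y \mid x, a)}
                       {\sum_{a'} P(a' \mid x)\, P(y \mid x, a')}
  && \text{(Bayesian posterior)} \\
\beta\, r(x,a) &= \log P(y \mid x, a) + c(x)
  && \text{(the two coincide when } \pi_0 = P(\cdot \mid x)\text{)}
\end{align}
```

Because $c(x)$ cancels in the normalization, any choice of it produces the same update. This is one way to read the statement that posterior updates fix the relative incentive signal but leave absolute rewards ambiguous up to context-specific baselines.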

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

拉格朗日量如何通过扩散模型引导安全强化学习？

Authors: Xiaoyuan Cheng, Wenxuan Yuan, Boyang Li, Yuanchao Xu, Yiming Yang, Hao Liang, Bei Peng, Robert Loftin, Zhuo Sun, Yukun Hu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02924
Pdf link: https://arxiv.org/pdf/2602.02924
Abstract Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline settings for reward maximization, with limited consideration of safety in online settings. To address this gap, we propose Augmented Lagrangian-Guided Diffusion (ALGD), a novel algorithm for off-policy safe RL. By revisiting optimization theory and energy-based model, we show that the instability of primal-dual methods arises from the non-convex Lagrangian landscape. In diffusion-based safe RL, the Lagrangian can be interpreted as an energy function guiding the denoising dynamics. Counterintuitively, direct usage destabilizes both policy generation and training. ALGD resolves this issue by introducing an augmented Lagrangian that locally convexifies the energy landscape, yielding a stabilized policy generation and training process without altering the distribution of the optimal policy. Theoretical analysis and extensive experiments demonstrate that ALGD is both theoretically grounded and empirically effective, achieving strong and stable performance across diverse environments.
中文摘要 扩散策略采样使强化学习（RL）能够表示超越次优单模高斯策略的多模态动作分布。然而，现有基于扩散的强化学习方法主要侧重于离线环境以最大化奖励，而在线环境中的安全性考虑较少。为弥补这一空白，我们提出了增强拉格朗日引导扩散（ALGD）算法，这是一种用于非策略安全强化学习的新算法。通过回顾优化理论和基于能量的模型，我们表明原始对偶方法的不稳定性源于非凸拉格朗日景观。在基于扩散的安全强化学习中，拉格朗日量可以被解释为引导去噪动力学的能量函数。反直觉的是，直接使用会破坏政策制定和培训的稳定性。ALGD通过引入增强拉格朗日量解决了这一问题，该拉格朗日量局部凸化能量景观，从而实现策略生成和训练过程的稳定，同时不改变最优策略的分布。理论分析和大量实验表明，ALGD既有理论基础，又在实证上有效，能够在多样环境中实现强大且稳定的性能。

Human-Centric Traffic Signal Control for Equity: A Multi-Agent Action Branching Deep Reinforcement Learning Approach

以人为本的交通信号控制实现公平：一种多智能体行动分支深度强化学习方法

Authors: Xiaocai Zhang, Neema Nassir, Lok Sang Chan, Milad Haghani
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02959
Pdf link: https://arxiv.org/pdf/2602.02959
Abstract Coordinating traffic signals along multimodal corridors is challenging because many multi-agent deep reinforcement learning (DRL) approaches remain vehicle-centric and struggle with high-dimensional discrete action spaces. We propose MA2B-DDQN, a human-centric multi-agent action-branching double Deep Q-Network (DQN) framework that explicitly optimizes traveler-level equity. Our key contribution is an action-branching discrete control formulation that decomposes corridor control into (i) local, per-intersection actions that allocate green time between the next two phases and (ii) a single global action that selects the total duration of those phases. This decomposition enables scalable coordination under discrete control while reducing the effective complexity of joint decision-making. We also design a human-centric reward that penalizes the number of delayed individuals in the corridor, accounting for pedestrians, vehicle occupants, and transit passengers. Extensive evaluations across seven realistic traffic scenarios in Melbourne, Australia, demonstrate that our approach significantly reduces the number of impacted travelers, outperforming existing DRL and baseline methods. Experiments confirm the robustness of our model, showing minimal variance across diverse settings. This framework not only advocates for a fairer traffic signal system but also provides a scalable solution adaptable to varied urban traffic conditions.
中文摘要 在多模态走廊上协调交通信号具有挑战性，因为许多多智能体深度强化学习（DRL）方法仍以车辆为中心，难以适应高维离散动作空间。我们提出了MA2B-DDQN，一种以人为中心的多智能体动作分支双深度Q网络（DQN）框架，明确优化旅行者层面的公平性。我们的关键贡献是一种动作分支离散控制表述，将走廊控制分解为（i）局部、每个交叉点的动作，分配接下来两个阶段之间的绿灯时间，以及（ii）选择这两个阶段总时长的单一全局动作。这种分解使得在离散控制下实现可扩展的协调，同时降低了联合决策的有效复杂性。我们还设计了以人为中心的奖励机制，惩罚走廊内滞留人数，包括行人、车辆乘员和公共交通乘客。在澳大利亚墨尔本，针对七种现实交通情景进行的广泛评估表明，我们的方法显著减少了受影响的旅客数量，优于现有的日程车（DRL）和基线方法。实验验证了模型的稳健性，显示在不同情境下方差极小。该框架不仅倡导更公平的交通信号系统，还提供了可扩展的解决方案，适应多样化的城市交通状况。
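The action-branching decomposition described in the abstract above (one local split action per intersection plus a single global duration action) can be sketched as follows. The option lists, function names, and the output-size comparison are hypothetical and only illustrate why branching reduces the size of the discrete action space; they are not the paper's formulation.

```python
def decode_action(split_indices, duration_index, split_options, duration_options):
    """Turn branched discrete actions into per-intersection green times.

    split_indices: one index per intersection (local branch).
    duration_index: a single index shared by the corridor (global branch).
    Returns a list of (green_phase_1, green_phase_2) in seconds.
    """
    total = duration_options[duration_index]
    plans = []
    for idx in split_indices:
        share = split_options[idx]  # fraction of the total given to phase 1
        plans.append((share * total, (1.0 - share) * total))
    return plans


def output_sizes(n_intersections, n_splits, n_durations):
    """Number of Q-value outputs: branched heads versus one joint head."""
    branched = n_intersections * n_splits + n_durations
    joint = (n_splits ** n_intersections) * n_durations
    return branched, joint


print(decode_action([0, 2], 1, [0.25, 0.5, 0.75], [40, 60]))
print(output_sizes(5, 3, 4))
```

With five intersections, three split options, and four durations, the branched network needs 19 outputs where a joint action head would need 972.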

Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control

具身感知的通才-专才蒸馏，实现统一的人形机器人全身控制

Authors: Quanquan Peng, Yunfeng Lin, Yufei Xue, Jiangmiao Pang, Weinan Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02960
Pdf link: https://arxiv.org/pdf/2602.02960
Abstract Humanoid Whole-Body Controllers trained with reinforcement learning (RL) have recently achieved remarkable performance, yet many target a single robot embodiment. Variations in dynamics, degrees of freedom (DoFs), and kinematic topology still hinder a single policy from commanding diverse humanoids. Moreover, obtaining a generalist policy that not only transfers across embodiments but also supports richer behaviors, extending beyond simple walking to squatting and leaning, remains especially challenging. In this work, we tackle these obstacles by introducing EAGLE, an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple heterogeneous humanoids without per-robot reward tuning. During each cycle, embodiment-specific specialists are forked from the current generalist, refined on their respective robots, and new skills are distilled back into the generalist by training on the pooled embodiment set. Repeating this loop until performance convergence produces a robust Whole-Body Controller validated on robots such as Unitree H1, G1, and Fourier N1. We conducted experiments on five different robots in simulation and four in real-world settings. Through quantitative evaluations, EAGLE achieves high tracking accuracy and robustness compared to other methods, marking a step toward scalable, fleet-level humanoid control. See more details at this https URL
中文摘要 经过强化学习（RL）训练的人形全身控制器最近取得了显著性能，但许多仅针对单一机器人的体现。动态、自由度（DoF）和运动学拓扑的差异仍然阻碍单一政策指挥多样化的人形生物。此外，要获得一项不仅跨身体体能传递，还能支持更丰富的行为——超越简单走路到深蹲、倾斜——依然极具挑战性。在本研究中，我们通过引入EAGLE来应对这些障碍，EAGLE是一种迭代的通用-专家提炼框架，能够生成一个统一策略，控制多个异构类人生物，而无需每个机器人的奖励调优。在每个周期中，具身专属专家会从当前的通用者中分叉出来，在各自的机器人上进行精炼，并通过在合并的具身集上训练，将新技能提炼回通用专家。重复此循环直到性能趋同，即可生成在Unitree H1、G1和傅里叶N1等机器人上验证的稳健全体控制器。我们在模拟中对五个不同的机器人进行了实验，在现实环境中进行了四个实验。通过定量评估，EAGLE实现了比其他方法更高的跟踪精度和鲁棒性，标志着迈向可扩展的舰队级类人生物控制迈出了一步。更多详情请见此 https 网址

Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

Co2PO：面向多智能体强化学习的协调约束策略优化

Authors: Shrenik Patel, Christine Truong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.02970
Pdf link: https://arxiv.org/pdf/2602.02970
Abstract Constrained multi-agent reinforcement learning (MARL) faces a fundamental tension between exploration and safety-constrained optimization. Existing leading approaches, such as Lagrangian methods, typically rely on global penalties or centralized critics that react to violations after they occur, often suppressing exploration and leading to over-conservatism. We propose Co2PO, a novel MARL communication-augmented framework that enables coordination-driven safety through selective, risk-aware communication. Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that proactively forecasts potential violations over an extended temporal horizon. By integrating these forecasts into a constrained optimization objective, Co2PO allows agents to anticipate and navigate collective hazards without the performance trade-offs inherent in traditional reactive constraints. We evaluate Co2PO across a suite of complex multi-agent safety benchmarks, where it achieves higher returns compared to leading constrained baselines while converging to cost-compliant policies at deployment. Ablation studies further validate the necessity of risk-triggered communication, adaptive gating, and shared memory components.
中文摘要 受限多智能体强化学习（MARL）面临探索与安全约束优化之间的根本张力。现有的主流方法，如拉格朗日方法，通常依赖全局惩罚或集中批评者，在违规发生后作出反应，常常压制探索，导致过度保守。我们提出了Co2PO，一种新型MARL通信增强框架，通过选择性、风险意识的沟通实现协调驱动的安全。Co2PO引入了共享黑板架构，用于广播位置意图和收益信号，由学习的危害预测器控制，能主动预测在更长的时间视野内潜在违规。通过将这些预测整合进受限优化目标，Co2PO使代理能够预判并应对集体危害，而无需承担传统反应式约束固有的性能权衡。我们通过一系列复杂的多智能体安全基准评估Co2PO，其在部署时达到了比领先受限基线更高的回报，同时趋向成本合规的策略。消融研究进一步验证了风险触发交流、适应性门控和共享记忆成分的必要性。

Learning Fast Monomial Orders for Gröbner Basis Computations

学习格罗布纳基计算中的快速单项式序

Authors: R. Caleb Bunch, Alperen A. Ergür, Melika Golestani, Jessie Tong, Malia Walewski, Yunus E. Zeytuncu
Subjects: Symbolic Computation (cs.SC); Machine Learning (cs.LG); Commutative Algebra (math.AC); Algebraic Geometry (math.AG)
Arxiv link: https://arxiv.org/abs/2602.02972
Pdf link: https://arxiv.org/pdf/2602.02972
Abstract The efficiency of Gröbner basis computation, the standard engine for solving systems of polynomial equations, depends on the choice of monomial ordering. Despite a near-continuum of possible monomial orders, most implementations rely on static heuristics such as GrevLex, guided primarily by expert intuition. We address this gap by casting the selection of monomial orderings as a reinforcement learning problem over the space of admissible orderings. Our approach leverages domain-informed reward signals that accurately reflect the computational cost of Gröbner basis computations and admits efficient Monte Carlo estimation. Experiments on benchmark problems from systems biology and computer vision show that the resulting learned policies consistently outperform standard heuristics, yielding substantial reductions in computational cost. Moreover, we find that these policies resist distillation into simple interpretable models, providing empirical evidence that deep reinforcement learning allows the agents to exploit non-linear geometric structure beyond the scope of traditional heuristics.
中文摘要 格罗布纳基计算的效率，作为求解多项式方程组的标准引擎，取决于单项式序的选择。尽管单项式顺序几乎是连续的，大多数实现仍依赖静态启发式方法，如GrevLex，主要依靠专家直觉。我们通过将单项式排序的选择视为可接受排序空间中的强化学习问题来弥补这一空白。我们的方法利用了领域知情的奖励信号，准确反映了格罗布纳基计算的计算成本，并实现了高效的蒙特卡洛估计。系统生物学和计算机视觉的基准问题实验表明，所学策略始终优于标准启发式，显著降低了计算成本。此外，我们发现这些策略难以被简化为简单可解释的模型，提供了实证证据，表明深度强化学习使智能体能够利用超越传统启发式的非线性几何结构。

Structuring Value Representations via Geometric Coherence in Markov Decision Processes

通过几何相干性在马尔可夫决策过程中构建价值表示

Authors: Zuyuan Zhang, Zeyu Fang, Tian Lan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.02978
Pdf link: https://arxiv.org/pdf/2602.02978
Abstract Geometric properties can be leveraged to stabilize and speed up reinforcement learning. Existing examples include encoding symmetry structure, geometry-aware data augmentation, and enforcing structural restrictions. In this paper, we take a novel view of RL through the lens of order theory and recast value function estimation as learning a desired poset (partially ordered set). We propose GCR-RL (Geometric Coherence Regularized Reinforcement Learning), which computes a sequence of super-poset refinements (by refining posets in previous steps and learning additional order relationships from temporal difference signals), thus ensuring geometric coherence across the sequence of posets underpinning the learned value functions. Two novel algorithms, one based on Q-learning and one on actor-critic, are developed to efficiently realize these super-poset refinements. Their theoretical properties and convergence rates are analyzed. We empirically evaluate GCR-RL in a range of tasks and demonstrate significant improvements in sample efficiency and stable performance over strong baselines.
中文摘要 几何属性可用于稳定和加速强化学习。现有的例子包括对称结构编码、几何感知数据增强以及结构限制的强制执行。本文通过序理论视角对强化学习提出了新颖的视角，并将价值函数估计重新定义为学习所需的偏序集（poset）。我们提出了GCR-RL（几何相干正则化强化学习），通过在前几步细化偏序集并从时间差分信号中学习额外的序关系，计算一系列超偏序集细化，从而确保支撑所学值函数的偏序集序列中的几何一致性。通过 Q-learning 和 actor-critic 开发了两种新颖算法，以高效实现这些超偏序集的细化。分析了它们的理论性质和收敛率。我们实证评估GCR-RL在多种任务中的表现，并在强基线下显著提升了样本效率和稳定表现。

CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

CPMobius：无数据强化学习的迭代教练-球员推理

Authors: Ran Li, Zeyuan Liu, Yinghao chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Zhiyuan Liu, Maosong Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.02979
Pdf link: https://arxiv.org/pdf/2602.02979
Abstract Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.
中文摘要 大型语言模型（LLMs）在复杂推理方面展现出强大潜力，但其进展仍受制于大量高质量的人为策划任务和标签，无论是通过监督微调（SFT）还是基于推理特定数据的强化学习（RL）。这种依赖使得以监督为主的培训模式变得越来越难以为继，实践中已显示出可扩展性减弱的迹象。为克服这一限制，我们引入了CPMöbius（CPMobius），这是一种教练-球员协作范式，用于无数据的推理模型强化学习。与传统的对抗性自我对弈不同，CPMöbius受现实世界人类体育协作和多代理协作启发，将教练和球员视为独立但合作的角色。教练会根据球员的能力提出指导，并根据球员表现的变化获得奖励，而玩家则因解决教练产生的越来越具启发性的任务而获得奖励。这一合作优化循环旨在直接提升玩家的数学推理能力。令人惊讶的是，CPMöbius在不依赖任何外部训练数据的情况下实现了显著改进，优于现有的无监督方法。例如，在Qwen2.5-Math-7B-Instruct上，我们的方法整体准确率提升+4.9，分布外平均提升+5.4，整体准确率比RENT高+1.5，OOD准确率R-0高+4.2。

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

视频-OPD：通过策略上蒸馏实现多模态大型语言模型的高效后期训练，实现时间视频基础

Authors: Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, Jian Luan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.02994
Pdf link: https://arxiv.org/pdf/2602.02994
Abstract Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
中文摘要 强化学习因其策略优化，已成为时序视频基础（TVG）一种有原则的训练后范式，但现有基于GRPO的方法仍受限于稀疏的奖励信号和巨大的计算开销。我们提出了Video-OPD，这是一个高效的TVG后期培训框架，灵感来自近期政策提炼的进展。视频-OPD优化直接从当前策略中抽样的轨迹，从而保持训练与推理分布之间的对齐，而前沿教师通过反向KL发散目标提供密集的代币级监督。该表述保留了缓解分布偏移的关键策略属性，同时将稀疏的剧集级反馈转化为细粒度的分步学习信号。基于视频-开放性，我们引入了教师验证分歧聚焦（TVDF），这是一套轻量级培训课程，反复优先考虑既对教师可靠又对学生信息量最大的发展轨迹，从而提升培训效率。实证结果表明，视频-OPD持续优于GRPO，同时实现显著更快的收敛速度和更低的计算成本，确立了策略提纯作为TVG传统强化学习的有效替代方案。
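The dense, token-level supervision described in the abstract above can be illustrated with a reverse KL computation between student and teacher next-token distributions. This is a minimal pure-Python sketch under assumed inputs (plain logit lists per token position); the function names are hypothetical, and it does not reproduce Video-OPD's actual training loop or the TVDF curriculum.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The expectation is taken under the student's distribution, so the
    student is penalized for mass it places where the teacher places little.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        ps * (math.log(ps) - math.log(pt))
        for ps, pt in zip(p_s, p_t)
        if ps > 0.0
    )


def sequence_loss(student_seq, teacher_seq):
    """Mean per-token reverse KL over a trajectory sampled from the student."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


print(sequence_loss([[0.0, 0.0], [1.0, 2.0]], [[0.0, 2.0], [1.0, 2.0]]))
```

Every token position contributes a loss term, in contrast to a single episode-level reward. In the on-policy setting the logits would be evaluated on trajectories sampled from the student itself.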

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

CoBA-RL：面向能力的预算分配用于大型语言模型中的强化学习

Authors: Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen, Yueqing Sun, Zishan Xu, Yu Yang, Tianhao Hu, Qi Gu, Hui Su, Xunliang Cai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03048
Pdf link: https://arxiv.org/pdf/2602.03048
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLMs. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, failing to capture the model's dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model's evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources to samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
中文摘要 带可验证奖励的强化学习（RLVR）已成为增强LLM的关键方法，而像群相对策略优化（GRPO）这样的标准框架通常采用统一的推广预算，导致资源效率低下。此外，现有的自适应方法常常依赖实例级指标，如任务通过率，无法捕捉模型的动态学习状态。为解决这些限制，我们提出了CoBA-RL，一种强化学习算法，旨在根据模型不断发展的能力自适应分配推广预算。具体来说，CoBA-RL利用能力导向价值函数将任务映射到潜在的训练收益，并采用基于堆的贪婪策略，高效地自我校准计算资源分配到高训练值样本。大量实验表明，我们的方法有效协调了探索与利用之间的权衡，在多个具有挑战性的基准测试中持续带来泛化改进。这些发现强调了量化样本训练价值和优化预算分配对于提升LLM训练后效率至关重要。
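The heap-based greedy allocation mentioned in the abstract above can be sketched with Python's `heapq`. The per-task values are taken as given, and the diminishing-returns rule (the gain of the next rollout is `value / (count + 1)`) is an assumption made purely for illustration; the abstract does not define the Capability-Oriented Value function, so this is not the paper's algorithm.

```python
import heapq


def allocate_budget(task_values, total_budget, min_per_task=1):
    """Greedily hand out rollouts to the task with the largest marginal gain.

    task_values: hypothetical training value per task (higher is better).
    Returns the number of rollouts assigned to each task.
    """
    n = len(task_values)
    alloc = [min_per_task] * n
    remaining = total_budget - min_per_task * n
    if remaining < 0:
        raise ValueError("budget too small for the per-task minimum")

    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-v / (alloc[i] + 1), i) for i, v in enumerate(task_values)]
    heapq.heapify(heap)

    for _ in range(remaining):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        # Assumed diminishing returns: each extra rollout is worth less.
        heapq.heappush(heap, (-task_values[i] / (alloc[i] + 1), i))
    return alloc


print(allocate_budget([0.0, 3.0], total_budget=5))
```

Each of the remaining rollouts costs one pop and one push, so the allocation runs in O(B log n) for budget B and n tasks, compared with rescanning all tasks for every rollout.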

TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

TMS：轨迹混合监督，针对无奖励、按政策进行的SFT

Authors: Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, Tianlong Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03073
Pdf link: https://arxiv.org/pdf/2602.03073
Abstract Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to Supervision Mismatch: the divergence between the model's evolving policy and static training labels. We address this trade-off with Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes Policy-Label Divergence (PLD), preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy-retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.
中文摘要 强化学习（RL）和监督式微调（SFT）是提升大型语言模型（LLM）在下游任务中表现的两种主导范式。虽然强化学习通常比SFT更能保留更广泛的模型能力（保持），但代价也很大：复杂的奖励工程、不稳定性以及昂贵的策略采样。相比之下，SFT高效但脆弱，常因监督不匹配（模型策略演变与静态训练标签之间的分歧）而出现灾难性遗忘。我们用轨迹混合监督（TMS）来解决这一权衡，这是一个无奖励的框架，通过从模型自身的历史检查点创建动态课程，近似强化学习的政策内效益。TMS最小化策略标签散度（PLD），防止标准SFT中导致遗忘的模式崩溃。跨推理（MATH、GSM8K）和指令跟随基准测试的实验表明，TMS有效地改变了准确性-保持帕累托前沿。虽然强化学习仍是留存的黄金标准，但TMS远远优于标准和迭代SFT，弥合了与强化学习的差距，无需奖励模型或验证器。机理分析证实PLD漂移准确预测遗忘，TMS成功减轻了这种漂移。

ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

ReMiT：基于强化学习的迭代大型语言模型演化中期训练

Authors: Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, Weinan Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03075
Pdf link: https://arxiv.org/pdf/2602.03075
Abstract Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, where insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
中文摘要 大型语言模型（LLMs）的标准训练流程通常是单向的，从预训练逐步推进到训练后。然而，双向过程的可能性——即从培训后获得的洞见回溯性地改善预培训基础——尚未被充分探索。我们的目标是建立一个自我强化的飞轮：一个循环，强化学习（RL）调优的模型强化基础模型，进而提升后续训练后的表现，无需专门培训的教师或参考模型。为此，我们分析训练动态，并将训练中段（退火）阶段视为模型能力的关键转折点。这一阶段通常发生在预培训结束时，使用高质量语料库，且学习速度迅速下降。基于这一见解，我们引入了ReMiT（强化学习引导中途训练）。具体来说，ReMiT利用强化学习调优模型的推理先验，在训练中期动态重权，优先处理对推理至关重要的标记。从实证角度看，ReMiT在10个训练前基准测试中平均提升了3%，涵盖数学、代码和一般推理，并且在训练后流程中，这些提升持续超过2%。这些结果验证了迭代反馈循环，使LLMs能够持续且自我强化地演进。

Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

神经预测-校正器：通过强化学习解决同伦问题

Authors: Jiayao Mai, Bangyan Liao, Zhenjun Zhao, Yingping Zeng, Haoang Li, Javier Civera, Tailin Wu, Yi Zhou, Peidong Liu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.03086
Pdf link: https://arxiv.org/pdf/2602.03086
Abstract The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.
中文摘要 同伦范式是解决复杂问题的通用原理，广泛应用于鲁棒优化、全局优化、多项式根求和抽样等多个领域。这些问题的实际求解器通常遵循预测-校正器（PC）结构，但依赖手工设计的步长和迭代终止启发式方法，这些方法往往不够优化且针对特定任务。为此，我们将这些问题统一在一个框架下，从而设计出通用的神经求解器。基于这一统一观点，我们提出了神经预测-校正器（NPC），它用自动学习的策略取代了手工设计的启发式方法。NPC将策略选择构建为顺序决策问题，并利用强化学习自动发现高效策略。为进一步提升泛化，我们引入了摊销训练机制，使一类问题实现一次性离线训练，并高效地在线推理新实例。对四个代表性同伦问题的实验表明，我们的方法能够有效地推广到未见的实例。它在效率上持续优于经典和专业基线，同时展现出跨任务的卓越稳定性，凸显了将同伦方法统一到单一神经框架中的价值。

Training and Simulation of Quadrupedal Robot in Adaptive Stair Climbing for Indoor Firefighting: An End-to-End Reinforcement Learning Approach

室内消防自适应楼梯攀爬中的四足机器人训练与模拟：端到端强化学习方法

Authors: Baixiao Huang, Baiyu Huang, Yu Hou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03087
Pdf link: https://arxiv.org/pdf/2602.03087
Abstract Quadruped robots are used for primary searches during the early stages of indoor fires. A typical primary search involves quickly and thoroughly looking for victims under hazardous conditions and monitoring flammable materials. However, situational awareness in complex indoor environments and rapid stair climbing across different staircases remain the main challenges for robot-assisted primary searches. In this project, we designed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize both navigation and locomotion. In the first stage, the quadrupeds, Unitree Go2, were trained to climb stairs in Isaac Lab's pyramid-stair terrain. In the second stage, the quadrupeds were trained to climb various realistic indoor staircases in the Isaac Lab engine, with the learned policy transferred from the previous stage. These indoor staircases are straight, L-shaped, and spiral, to support climbing tasks in complex environments. This project explores how to balance navigation and locomotion and how end-to-end RL methods can enable quadrupeds to adapt to different stair shapes. Our main contributions are: (1) A two-stage end-to-end RL framework that transfers stair-climbing skills from abstract pyramid terrain to realistic indoor stair topologies. (2) A centerline-based navigation formulation that enables unified learning of navigation and locomotion without hierarchical planning. (3) Demonstration of policy generalization across diverse staircases using only local height-map perception. (4) An empirical analysis of success, efficiency, and failure modes under increasing stair difficulty.
中文摘要 四足机器人被用于室内火灾初期的初步搜索。典型的初步搜索需要在危险条件下迅速而彻底地搜寻受困者，并监测易燃物。然而，复杂室内环境中的态势感知以及在不同楼梯上的快速攀爬，仍是机器人辅助初步搜索的主要挑战。在本项目中，我们设计了一种两阶段的端到端深度强化学习（RL）方法，以同时优化导航和运动。第一阶段，四足机器人Unitree Go2在Isaac Lab的金字塔楼梯地形中接受爬楼梯训练。第二阶段，四足机器人在Isaac Lab引擎中接受攀爬各种真实室内楼梯的训练，所学策略由前一阶段迁移而来。这些室内楼梯包括直梯、L形梯和螺旋梯，以支持复杂环境中的攀爬任务。本项目探讨如何平衡导航与运动，以及端到端RL方法如何使四足机器人适应不同的楼梯形状。我们的主要贡献包括：（1）一个两阶段的端到端RL框架，将爬楼梯技能从抽象的金字塔地形迁移到真实的室内楼梯拓扑。（2）一种基于中心线的导航表述，无需分层规划即可统一学习导航和运动。（3）仅利用局部高度图感知，展示了策略在多种楼梯上的泛化。（4）对楼梯难度递增条件下的成功率、效率和失败模式的实证分析。

Test-time Recursive Thinking: Self-Improvement without External Feedback

测试时递归思维：无外部反馈的自我提升

Authors: Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, Weizhu Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03094
Pdf link: https://arxiv.org/pdf/2602.03094
Abstract Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
中文摘要 现代大型语言模型（LLMs）在推理能力上表现出快速提升，这主要得益于具有可验证奖励的强化学习（RL）。在这里，我们探讨这些大型语言模型是否可以在无需额外培训的情况下自我提升。我们确定了这类系统面临的两个核心挑战：（i）高效生成多样且高质量的候选解，（ii）在缺乏真实监督的情况下可靠地选择正确答案。为应对这些挑战，我们提出了测试时递归思维（TRT），这是一种迭代自我提升框架，其生成基于部署专属策略、积累的知识和自我生成的验证信号。使用TRT，开源模型在AIME-25/24上达到100%准确率，而在LiveCodeBench最难的问题中，闭源模型在无外部反馈的情况下提升10.4-14.8个百分点。

One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

一个模型，所有角色：面向对话社交智能的多轮、多智能体自博弈强化学习

Authors: Bowen Jiang, Taiwei Shi, Ryo Kamoi, Yuan Yuan, Camillo J. Taylor, Longqi Yang, Pei Zhou, Sihao Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03109
Pdf link: https://arxiv.org/pdf/2602.03109
Abstract This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.
中文摘要 本文介绍了OMAR（One Model, All Roles，一个模型，所有角色），这是一个强化学习框架，使人工智能能够通过多轮、多智能体的对话自博弈来发展社交智能。与依赖静态单轮优化的传统范式不同，OMAR允许单一模型同时扮演对话中的所有参与者，直接从动态社交互动中学习实现长期目标并遵循复杂的社会规范。为了确保长对话中的训练稳定性，我们实现了分层优势估计，分别计算轮次级和词元级优势。在SOTOPIA社交环境和狼人杀策略游戏中的评估显示，我们训练的模型发展出细粒度的、涌现的社交智能，如同理心、说服和寻求妥协，证明了即使在竞争场景下学习协作也是有效的。虽然我们发现了奖励投机（reward hacking）等实际挑战，但结果表明，丰富的社交智能可以在没有人类监督的情况下涌现。我们希望这项工作能激励更多关于群体对话中人工智能社交智能的研究。

Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

量化进化策略：以低精度成本实现量化大型语言模型的高精度微调

Authors: Yinggan Xu, Risto Miikkulainen, Xin Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03120
Pdf link: https://arxiv.org/pdf/2602.03120
Abstract Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high-precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning method on arithmetic reasoning tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at this https URL .
中文摘要 训练后量化（PTQ）对于在内存受限的设备上部署大型语言模型（LLM）至关重要，但它使模型变得静态且难以微调。标准的微调范式，包括强化学习（RL），从根本上依赖反向传播和高精度权重来计算梯度。因此，它们不能用于量化模型，因为量化模型的参数空间是离散且不可微的。虽然进化策略（ES）提供了无需反向传播的替代方案，但量化参数的优化仍可能因梯度消失或梯度不准确而失败。本文提出了量化进化策略（QES），这是一种直接在量化空间中进行全参数微调的优化范式。QES基于两项创新：（1）整合累积误差反馈以保留高精度梯度信号，（2）利用无状态种子重放将内存占用降至低精度推理的水平。QES在算术推理任务上显著优于最先进的零阶微调方法，使量化模型的直接微调成为可能。因此，它为完全在量化空间中扩展LLM提供了可能。源代码可在此 https URL 获取。

Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

短链，深度思考：通过分割合并优化平衡推理效率与段内能力

Authors: Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03141
Pdf link: https://arxiv.org/pdf/2602.03141
Abstract While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.
中文摘要 虽然大型推理模型（LRM）通过生成长推理链展现了解决复杂任务的出色能力，但这种对冗长生成的依赖导致了显著的延迟和计算开销。为应对这些挑战，我们提出了CoSMo（Consistency-Guided Split-Merge Optimization，一致性引导的分割-合并优化），该框架旨在消除结构冗余，而非不加区分地限制词元数量。具体来说，CoSMo采用一种分割-合并算法，通过合并冗余片段和在逻辑断层处进行分割来动态优化推理链，以确保连贯性。随后，我们采用结构对齐的强化学习，并引入新颖的片段级预算，监督模型在整个训练过程中保持高效的推理结构。在多个基准和骨干模型上的大量实验表明，CoSMo取得了更优的性能：与推理效率基线相比，平均将准确率提升3.3个点，同时将片段使用量降低28.7%。

Self-Hinting Language Models Enhance Reinforcement Learning

自我提示语言模型增强强化学习

Authors: Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.03143
Pdf link: https://arxiv.org/pdf/2602.03143
Abstract Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at this https URL.
中文摘要 组相对策略优化（Group Relative Policy Optimization，GRPO）最近成为一种实用的方案，用于将大型语言模型与可验证的目标对齐。然而，在终端奖励稀疏的情况下，GRPO常常停滞，因为同一组内的rollout经常获得相同的奖励，导致相对优势坍缩、更新消失。我们提出了带有特权监督的自提示对齐GRPO（SAGE），这是一个同策略（on-policy）强化学习框架，在训练过程中注入特权提示，在终端验证器奖励不变的前提下重塑rollout分布。对于每个提示$x$，模型采样一个紧凑的提示$h$（例如计划或分解），然后生成以$(x,h)$为条件的解$\tau$。关键是任务奖励$R(x,\tau)$保持不变；提示只是在有限采样下增加组内结果的多样性，防止GRPO优势在稀疏奖励下坍缩。测试时，我们设置$h=\varnothing$，部署不含任何特权信息的无提示策略。此外，采样多样化的自提示可作为一种自适应课程，比来自初始策略或更强外部模型的固定提示更有效地追踪学习者的瓶颈。在6个基准上使用3个LLM的实验显示，SAGE持续优于GRPO，平均在Llama-3.2-3B-Instruct上提升2.0，在Qwen2.5-7B-Instruct上提升1.2，在Qwen3-4B-Instruct上提升1.3。代码可在此 https URL 获取。
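
The stall described in this abstract can be illustrated with a minimal sketch of group-relative advantages. It assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout in the group fails: all advantages are zero,
# so the policy gradient for this prompt vanishes.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: advantages are non-zero and carry signal.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 0.0])
```

Under this sketch, anything that makes outcomes within a group differ (as the hints are said to do) restores a usable learning signal without changing the reward itself.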

Intelligent Front-End Personalization: AI-Driven UI Adaptation

智能前端个性化：AI驱动的用户界面适应

Authors: Mona Rajhans
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.03154
Pdf link: https://arxiv.org/pdf/2602.03154
Abstract Front-end personalization has traditionally relied on static designs or rule-based adaptations, which fail to fully capture user behavior patterns. This paper presents an AI driven approach for dynamic front-end personalization, where UI layouts, content, and features adapt in real-time based on predicted user behavior. We propose three strategies: dynamic layout adaptation using user path prediction, content prioritization through reinforcement learning, and a comparative analysis of AI-driven vs. rule-based personalization. Technical implementation details, algorithms, system architecture, and evaluation methods are provided to illustrate feasibility and performance gains.
中文摘要 前端个性化传统上依赖静态设计或基于规则的适应，这些无法完全捕捉用户行为模式。本文提出了一种由人工智能驱动的动态前端个性化方法，界面布局、内容和功能根据预测的用户行为实时调整。我们提出了三种策略：利用用户路径预测进行动态布局适应、通过强化学习实现内容优先级排序，以及对AI驱动与基于规则的个性化进行比较分析。提供技术实现细节、算法、系统架构和评估方法，以展示可行性和性能提升。

StepScorer: Accelerating Reinforcement Learning with Step-wise Scoring and Psychological Regret Modeling

StepScorer：通过分步评分和心理后悔建模加速强化学习

Authors: Zhe Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03171
Pdf link: https://arxiv.org/pdf/2602.03171
Abstract Reinforcement learning algorithms often suffer from slow convergence due to sparse reward signals, particularly in complex environments where feedback is delayed or infrequent. This paper introduces the Psychological Regret Model (PRM), a novel approach that accelerates learning by incorporating regret-based feedback signals after each decision step. Rather than waiting for terminal rewards, PRM computes a regret signal based on the difference between the expected value of the optimal action and the value of the action taken in each state. This transforms sparse rewards into dense feedback signals through a step-wise scoring framework, enabling faster convergence. We demonstrate that PRM achieves stable performance approximately 36% faster than traditional Proximal Policy Optimization (PPO) in benchmark environments such as Lunar Lander. Our results indicate that PRM is particularly effective in continuous control tasks and environments with delayed feedback, making it suitable for real-world applications such as robotics, finance, and adaptive education where rapid policy adaptation is critical. The approach formalizes human-inspired counterfactual thinking as a computable regret signal, bridging behavioral economics and reinforcement learning.
中文摘要 强化学习算法常因奖励信号稀疏而收敛缓慢，尤其是在反馈延迟或稀少的复杂环境中。本文介绍了心理后悔模型（PRM），这是一种通过在每个决策步骤后加入基于后悔的反馈信号来加速学习的新方法。PRM不等待终端奖励，而是根据每个状态下最优动作的期望值与所采取动作的值之差计算后悔信号。这通过逐步评分框架将稀疏奖励转化为密集的反馈信号，从而加快收敛。我们证明，在Lunar Lander等基准环境中，PRM达到稳定性能的速度比传统的近端策略优化（PPO）快约36%。我们的结果表明，PRM在连续控制任务和延迟反馈环境中尤为有效，适合机器人、金融和自适应教育等需要快速策略调整的现实应用。该方法将受人类启发的反事实思维形式化为可计算的后悔信号，连接了行为经济学与强化学习。
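
The regret signal described in this abstract can be sketched as below. The abstract only states that regret is the gap between the optimal action's expected value and the taken action's value; using the negated regret as the dense per-step feedback, and the function names, are assumptions made for illustration.

```python
def step_regret(action_values, action_taken):
    """Regret of one decision: best available value minus the chosen action's value."""
    return max(action_values) - action_values[action_taken]


def dense_feedback(trajectory):
    """Map (action_values, action_taken) pairs to per-step negated regrets.

    Negating the regret to form a reward-like signal is an assumption.
    """
    return [-step_regret(values, action) for values, action in trajectory]


trajectory = [
    ([1.0, 3.0, 2.0], 1),  # chose the best action: regret 0
    ([0.5, 0.0, 4.0], 0),  # chose a poor action: regret 3.5
]
signals = dense_feedback(trajectory)
```

Each step gets feedback immediately, which is the sense in which a sparse terminal reward becomes a dense signal.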

Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

提示增强扩展GRPO在数学推理上的训练

Authors: Wenquan Lu, Hai Huang, Randall Balestriero
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03190
Pdf link: https://arxiv.org/pdf/2602.03190
Abstract Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 44.5 per-benchmark accuracy and 51.3 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at this https URL.
中文摘要 组相对策略优化（GRPO）等强化学习算法在提升大型语言模型的数学推理能力方面展现出强大潜力。然而，以往的研究一致观察到强化后训练过程中存在熵坍缩现象，其特征是策略熵单调下降，最终导致训练不稳定和崩溃。因此，大多数现有方法将训练限制在较短的周期内（通常为5-20个epoch），限制了持续探索并阻碍策略的进一步改进。此外，几乎所有先前工作在训练期间都依赖单一、固定的推理提示或模板。在本研究中，我们提出了提示增强，这是一种训练策略，指示模型在多样化的模板和格式下生成推理轨迹，从而提升rollout的多样性。我们表明，在没有KL正则化项的情况下，提示增强可以在固定数据集下稳定地延长训练时长，并使模型能够容忍低熵状态而不发生过早崩溃。实验上，使用提示增强在MATH Level 3-5数据集上训练的Qwen2.5-Math-1.5B模型取得了最先进的性能，在AIME24、AMC、MATH500、Minerva和OlympiadBench等标准数学推理基准上达到44.5的按基准平均准确率和51.3的按题目平均准确率。代码和模型检查点可在此 https URL 获取。

Reinforcement Learning with Promising Tokens for Large Language Models

面向大型语言模型的有前景词元强化学习

Authors: Jing-Cheng Pang, Liang Lu, Xian Tang, Kun Jiang, Sijie Wu, Kai Zhang, Xubin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03195
Pdf link: https://arxiv.org/pdf/2602.03195
Abstract Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Standard approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. However, this formulation includes the massive tail of contextually irrelevant tokens in the action space, which could distract the policy from focusing on decision-making among the truly reasonable tokens. In this work, we verify that valid reasoning paths could inherently concentrate within a low-rank subspace. Based on this insight, we introduce Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation. Specifically, RLPT leverages the semantic priors of the base model to identify a dynamic set of promising tokens and constrains policy optimization exclusively to this refined subset via masking. Theoretical analysis and empirical results demonstrate that RLPT effectively reduces gradient variance, stabilizes the training process, and improves sample efficiency. Experiment results on math, coding, and telecom reasoning show that RLPT outperforms standard RL baselines and integrates effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).
中文摘要 强化学习（RL）已成为对齐和优化大型语言模型（LLMs）的关键范式。标准方法将LLM视为策略，直接在整个词表空间上应用强化学习。然而，这种表述使动作空间中包含了大量与上下文无关的长尾词元，可能使策略无法专注于在真正合理的词元之间做决策。在本研究中，我们验证了有效的推理路径本质上可以集中在一个低秩子空间内。基于这一见解，我们提出了“有前景词元强化学习”（RLPT），这是一个通过将策略性决策与词元生成解耦来缓解动作空间问题的框架。具体来说，RLPT利用基础模型的语义先验来识别一组动态的有前景词元（promising tokens），并通过掩码将策略优化严格限制在这一精炼子集内。理论分析和实验结果表明，RLPT有效降低梯度方差，稳定训练过程，并提高样本效率。在数学、代码和电信推理上的实验结果表明，RLPT优于标准强化学习基线，并能有效适配不同的模型规模（4B和8B）和强化学习算法（GRPO和DAPO）。
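
A minimal sketch of constraining a policy to a masked token subset follows. The abstract does not say how the promising set is chosen; selecting the top-k tokens by base-model logit is an assumption made here purely for illustration, as are all function and variable names.

```python
import math


def promising_token_mask(base_logits, k):
    """Indices of the k tokens the base model scores highest (assumed selection rule)."""
    ranked = sorted(range(len(base_logits)), key=lambda i: base_logits[i], reverse=True)
    return set(ranked[:k])


def masked_softmax(policy_logits, allowed):
    """Softmax over the allowed tokens only; masked tokens get probability 0."""
    peak = max(policy_logits[i] for i in allowed)  # subtract max for numerical stability
    exps = [
        math.exp(logit - peak) if i in allowed else 0.0
        for i, logit in enumerate(policy_logits)
    ]
    total = sum(exps)
    return [e / total for e in exps]


base_logits = [2.0, 0.1, 1.5, -3.0, -5.0]
policy_logits = [1.0, 0.0, 2.0, 4.0, 0.5]
allowed = promising_token_mask(base_logits, k=2)
probs = masked_softmax(policy_logits, allowed)
```

Token 3 has the largest policy logit but lies outside the allowed set, so it receives zero probability and contributes nothing to the policy update in this sketch.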

From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning

从标量奖励到潜在趋势：塑造基于模型的强化学习的潜在景观

Authors: Yao-Hui Li, Zeyu Wang, Xin Li, Wei Pang, Yingfang Yuan, Zhengkun Chen, Boya Zhang, Riashat Islam, Alex Lamb, Yonggang Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03201
Pdf link: https://arxiv.org/pdf/2602.03201
Abstract Model-based reinforcement learning (MBRL) achieves high sample efficiency by simulating future trajectories with learned dynamics and reward models. However, its effectiveness is severely compromised in sparse reward settings. The core limitation lies in the standard paradigm of regressing ground-truth scalar rewards: in sparse environments, this yields a flat, gradient-free landscape that fails to provide directional guidance for planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense rewards.
中文摘要 基于模型的强化学习（MBRL）通过模拟学习的动态和奖励模型来实现高采样效率。然而，在奖励稀疏的环境中，其效果会严重受损。核心局限在于标准的回归基层真值标量奖励范式：在稀疏环境中，这会产生一个平坦、无梯度的景观，无法为规划提供方向性指导。为应对这一挑战，我们提出了“利用乐观潜力估计塑造景观”（SLOPE）这一新颖框架，将奖励建模从预测标量转向构建信息性潜力景观。SLOPE采用乐观分布回归来估算高置信度上限，放大罕见成功信号并确保探索梯度充足。对5个基准测试中30+任务的评估显示，SLOPE在完全稀疏、半稀疏和密集奖励中持续优于领先基线。

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

ForesightKV：通过学习长期贡献优化推理模型中的KV缓存驱逐

Authors: Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03203
Pdf link: https://arxiv.org/pdf/2602.03203
Abstract Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.
中文摘要 近来，大型语言模型（LLMs）通过生成长推理轨迹展现出了显著的推理能力。然而，随着序列长度的增加，键值（KV）缓存线性增长，带来大量内存和计算开销。现有的KV缓存驱逐方法通过丢弃较不重要的KV对来缓解这一问题，但往往无法捕捉复杂的KV依赖，导致性能下降。为了更好地平衡效率和性能，我们提出了ForesightKV，一个基于训练的KV缓存驱逐框架，能够学习预测在长文本生成过程中应驱逐哪些KV对。我们首先设计了Golden Eviction算法，利用未来的注意力分数识别每一步最优的待驱逐KV对。随后，这些轨迹和每一步的分数通过带有成对排序损失（Pairwise Ranking Loss）的监督训练进行蒸馏。此外，我们将缓存驱逐表述为马尔可夫决策过程，并应用GRPO算法来缓解低熵词元上语言建模损失的显著上升。在三个推理模型上进行的AIME2024和AIME2025基准实验表明，ForesightKV仅用一半的缓存预算即可始终优于以往方法，同时从监督学习和强化学习两种方法中获得协同收益。
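
The idea of choosing evictions from future attention scores can be sketched as follows. This is not the paper's Golden Eviction algorithm: summing each cached position's attention over future steps and dropping the lowest-scoring positions is an assumed scoring rule, and the function and argument names are hypothetical.

```python
def golden_eviction(future_attention, cached_positions, budget):
    """Pick cached positions to evict, keeping the `budget` with the most future attention.

    future_attention[t][pos] is the attention that future step t pays to cached
    position pos. Summing over t is an assumed importance measure.
    """
    importance = {
        pos: sum(row[pos] for row in future_attention)
        for pos in cached_positions
    }
    n_evict = max(0, len(cached_positions) - budget)
    by_importance = sorted(cached_positions, key=lambda pos: importance[pos])
    return sorted(by_importance[:n_evict])


future_attention = [
    [0.70, 0.05, 0.20, 0.05],
    [0.60, 0.10, 0.25, 0.05],
]
evicted = golden_eviction(future_attention, cached_positions=[0, 1, 2, 3], budget=2)
```

Because the scores come from future steps, labels like these are only available offline on recorded traces, which is consistent with the abstract's plan to distill them into a learned predictor.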

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

手风琴思维：高效且易读的大型语言模型推理的自我调节步骤摘要

Authors: Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi, Yiwei Wang, Xiaodan Liang, Jing Tang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03249
Pdf link: https://arxiv.org/pdf/2602.03249
Abstract Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinker demonstrates that, with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency-token overhead without compromising solution quality. It achieves a 3x throughput while maintaining accuracy on a 48GB GPU memory configuration, and the structured step summaries provide a human-readable account of the reasoning process.
中文摘要 通过长思维链（Chain-of-Thought）扩展测试时计算，带来了显著的推理能力提升，但由于KV缓存的线性增长和注意力的二次复杂度，这一做法面临实际限制。本文介绍了Accordion-Thinking（手风琴思维），这是一个端到端框架，LLMs通过动态摘要学会自我调节推理步骤的粒度。这种机制支持一种折叠（Fold）推理模式：模型定期总结其思考过程并丢弃先前的思考内容，以减少对历史词元的依赖。我们应用强化学习进一步激励这一能力，并揭示了一个关键洞见：高效的折叠模式与完整的展开（Unfold）模式之间的准确率差距随着训练逐渐缩小，最终消失。这一现象表明模型学会了将关键推理信息编码为紧凑的摘要，从而实现对推理上下文的有效压缩。我们的Accordion-Thinker表明，借助学到的自我压缩，LLMs能够以极小的依赖词元开销处理复杂推理任务，且不影响解答质量。它在48GB GPU内存配置下实现了3倍吞吐量并保持准确率，同时结构化的步骤摘要为推理过程提供了人类可读的说明。

Periodic Regularized Q-Learning

周期正则化Q-学习

Authors: Hyukjun Yang, Han-Dong Lim, Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03301
Pdf link: https://arxiv.org/pdf/2602.03301
Abstract In reinforcement learning (RL), Q-learning is a fundamental algorithm whose convergence is guaranteed in the tabular setting. However, this convergence guarantee does not hold under linear function approximation. To overcome this limitation, a significant line of research has introduced regularization techniques to ensure stable convergence under function approximation. In this work, we propose a new algorithm, periodic regularized Q-learning (PRQ). We first introduce regularization at the level of the projection operator and explicitly construct a regularized projected value iteration (RP-VI), subsequently extending it to a sample-based RL algorithm. By appropriately regularizing the projection operator, the resulting projected value iteration becomes a contraction. By extending this regularized projection into the stochastic setting, we establish the PRQ algorithm and provide a rigorous theoretical analysis that proves finite-time convergence guarantees for PRQ under linear function approximation.
中文摘要 在强化学习（RL）中，Q-学习是一种基本算法，其收敛性在表格环境中是有保障的。然而，这种收敛保证在线性函数近似下不成立。为克服这一限制，一系列重要研究引入了正则化技术，以确保函数近似下的稳定收敛。在本研究中，我们提出了一种新算法——周期正则化Q-学习（PRQ）。我们首先在投影算子层面引入正则化，并显式构造正则化投影值迭代（RP-VI），随后将其扩展为基于样本的强化学习算法。通过适当正则化投影算子，所得的投影值迭代就变成了收缩。通过将正则化投影扩展到随机环境中，我们建立了PRQ算法，并提供了严谨的理论分析，证明了线性函数近似下PRQ的有限时间收敛性保证。

medR: Reward Engineering for Clinical Offline Reinforcement Learning via Tri-Drive Potential Functions

medR：通过Tri-Drive潜在函数实现临床离线强化学习的奖励工程

Authors: Qianyi Xu, Gousia Habib, Feng Wu, Yanrui Du, Zhihui Chen, Swapnil Mishra, Dilruk Perera, Mengling Feng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03305
Pdf link: https://arxiv.org/pdf/2602.03305
Abstract Reinforcement Learning (RL) offers a powerful framework for optimizing dynamic treatment regimes (DTRs). However, clinical RL is fundamentally bottlenecked by reward engineering: the challenge of defining signals that safely and effectively guide policy learning in complex, sparse offline environments. Existing approaches often rely on manual heuristics that fail to generalize across diverse pathologies. To address this, we propose an automated pipeline leveraging Large Language Models (LLMs) for offline reward design and verification. We formulate the reward function using potential functions consisting of three core components: survival, confidence, and competence. We further introduce quantitative metrics to rigorously evaluate and select the optimal reward structure prior to deployment. By integrating LLM-driven domain knowledge, our framework automates the design of reward functions for specific diseases while significantly enhancing the performance of the resulting policies.
中文摘要 强化学习（RL）提供了一个强大的框架，用于优化动态治疗方案（DTR）。然而，临床强化学习在奖励工程中根本上被限制：在复杂且稀疏的离线环境中，如何定义安全有效引导政策学习的信号。现有方法常依赖手动启发式方法，无法在不同病理中推广。为此，我们提出了一个利用大型语言模型（LLMs）进行离线奖励设计和验证的自动化流水线。我们利用潜在函数构建奖励函数，该函数由三个核心组成部分组成：生存、信心和能力。我们还引入了定量指标，以在部署前严格评估并选择最佳奖励结构。通过整合由LLM驱动的领域知识，我们的框架自动化了针对特定疾病的奖励函数设计，同时显著提升了最终策略的性能。

Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

熵门控选择性策略优化：大型语言模型混合训练中的令牌级梯度分配

Authors: Yuelin Hu, Zhengxue Cheng, Wei Liu, Li Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03309
Pdf link: https://arxiv.org/pdf/2602.03309
Abstract Hybrid training methods for large language models combine supervised fine tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy Gated Selective Policy Optimization (EGSPO), a three stage framework that extends sample level mixing with token level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy gated gradient allocation: a predictive entropy module routes high entropy tokens to full PPO updates to encourage exploration, and low entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
中文摘要 大型语言模型的混合训练方法将基于专家演示的监督微调（SFT）与基于模型rollout的强化学习（RL）相结合，通常在样本层面进行。我们提出了熵门控选择性策略优化（EGSPO），这是一个三阶段框架，在样本级混合的基础上增加了令牌级梯度调制。第一阶段为SFT专家学习，使用专家演示和纯SFT损失建立可靠的热身策略。第二阶段为RL rollout生成，从当前策略中采样轨迹，并计算每个令牌的预测熵。第三阶段为EGSPO机制，应用熵门控梯度分配：预测熵模块将高熵令牌路由至完整的PPO更新以鼓励探索，将低熵令牌路由至衰减的PPO更新以降低方差并保留已有知识。关键的是，这两个分支都包含优势函数A_t，确保错误轨迹始终收到一致的负向学习信号，防止强化自信的错误。EGSPO在数学推理基准上取得了一致的提升，相比CHORD phi基线，在AIME上提升3.8%，在MATH上提升2.9%，而额外计算开销仅为3.4%。
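
As a rough illustration of the entropy gate described in the EGSPO abstract above, the sketch below computes per-token predictive entropy and scales a clipped PPO surrogate by a gate weight. It is a minimal sketch, not the authors' implementation: the threshold `tau`, the attenuation factor `alpha`, the hard two-way gate, and both function names are assumptions introduced here, since the abstract does not state how the gate is parameterized.

```python
import numpy as np

def token_entropy(logits):
    """Per-token predictive entropy (in nats) from logits of shape [T, V]."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(logp) * logp).sum(axis=-1)

def entropy_gated_ppo_loss(ratio, advantage, entropy,
                           tau=1.0, alpha=0.2, clip=0.2):
    """Clipped PPO surrogate with a per-token entropy gate.

    tau (entropy threshold) and alpha (attenuation for low-entropy tokens)
    are hypothetical hyperparameters chosen for this illustration.
    """
    # Standard PPO clipped surrogate, computed per token.
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    # High-entropy tokens get the full update, low-entropy tokens an
    # attenuated one; the advantage is present in both branches.
    gate = np.where(entropy >= tau, 1.0, alpha)
    return -(gate * surrogate).mean()
```

With two tokens that both carry advantage -1 and ratio 1, one at maximal entropy over four classes and one sharply peaked, the gate weights are 1.0 and 0.2, so the loss is 0.6. Both tokens still contribute a negative signal, which is the property the abstract emphasizes.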

MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning

MedSAM-Agent：通过多回合智能体强化学习赋能交互式医学图像分割

Authors: Shengyuan Liu, Liuxin Bao, Qi Yang, Wanting Geng, Boyun Zheng, Chenxin Li, Wenting Chen, Houwen Peng, Yixuan Yuan
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03320
Pdf link: https://arxiv.org/pdf/2602.03320
Abstract Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available at this https URL.
中文摘要 医学图像分割正从任务特定模型向可泛化的框架发展。最新研究将多模态大型语言模型（MLLM）用作自主智能体，利用带可验证奖励的强化学习（RLVR）来协调Segment Anything Model（SAM）等专用工具。然而，这些方法通常依赖单轮、僵化的交互策略，且在训练过程中缺乏过程级监督，这阻碍了它们充分发挥交互式工具的动态潜力，并导致冗余操作。为弥合这一差距，我们提出了MedSAM-Agent框架，将交互式分割重新表述为多步自主决策过程。首先，我们引入了一种混合提示策略，用于生成专家策划的轨迹，使模型能够内化类人的决策启发式和自适应细化策略。此外，我们开发了一个两阶段训练流程，将多轮、端到端的结果验证与具有临床保真度的过程奖励设计相结合，以促进交互的简约性和决策效率。在6种医学模态和21个数据集上的大量实验表明，MedSAM-Agent实现了最先进的性能，有效地将自主医学推理与稳健的迭代优化统一起来。代码可在此 https URL 获取。

MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis

MentalSeek-Dx：迈向现实世界精神病诊断的渐进假设演绎推理

Authors: Xiao Sun, Yuming Yang, Junnan Zhu, Jiang Zhong, Xinyu Zhou, Kaiwen Wei
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03340
Pdf link: https://arxiv.org/pdf/2602.03340
Abstract Mental health disorders represent a burgeoning global public health challenge. While Large Language Models (LLMs) have demonstrated potential in psychiatric assessment, their clinical utility is severely constrained by benchmarks that lack ecological validity and fine-grained diagnostic supervision. To bridge this gap, we introduce MentalDx Bench, the first benchmark dedicated to disorder-level psychiatric diagnosis within real-world clinical settings. Comprising 712 de-identified electronic health records annotated by board-certified psychiatrists under ICD-11 guidelines, the benchmark covers 76 disorders across 16 diagnostic categories. Evaluation of 18 LLMs reveals a critical paradigm misalignment: strong performance at coarse diagnostic categorization contrasts with systematic failure at disorder-level diagnosis, underscoring a gap between pattern-based modeling and clinical hypothetico-deductive reasoning. In response, we propose MentalSeek-Dx, a medical-specialized LLM trained to internalize this clinical reasoning process through supervised trajectory construction and curriculum-based reinforcement learning. Experiments on MentalDx Bench demonstrate that MentalSeek-Dx achieves state-of-the-art (SOTA) performance with only 14B parameters, establishing a clinically grounded framework for reliable psychiatric diagnosis.
中文摘要 心理健康障碍是一个日益严峻的全球公共卫生挑战。尽管大型语言模型（LLMs）在精神病学评估方面展现出潜力，但其临床效用受到现有基准的严重制约，这些基准缺乏生态效度和细粒度的诊断监督。为弥合这一差距，我们推出了MentalDx Bench，这是首个专注于真实临床环境中疾病级精神病诊断的基准。该基准包含712份去标识化的电子健康记录，由获得专业委员会认证的精神科医生依据ICD-11指南标注，涵盖16个诊断类别中的76种疾病。对18个大型语言模型的评估揭示了一个关键的范式错位：模型在粗粒度诊断分类上表现优异，却在疾病级诊断上系统性失败，凸显了基于模式的建模与临床假设演绎推理之间的差距。对此，我们提出了MentalSeek-Dx，这是一种医学专用的大型语言模型，通过监督轨迹构建和基于课程的强化学习来内化这一临床推理过程。在MentalDx Bench上的实验表明，MentalSeek-Dx仅用14B参数即可实现最先进的（SOTA）性能，为可靠的精神病诊断建立了一个以临床为基础的框架。

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL：通过后期编辑引导强化学习提升机器翻译

Authors: Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03352
Pdf link: https://arxiv.org/pdf/2602.03352
Abstract Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
中文摘要 强化学习（RL）在基于LLM的机器翻译方面展现出良好前景，近期的GRPO等方法取得了显著提升；然而，面向翻译的强化学习仍面临两方面挑战：蒙特卡洛回报估计带来的噪声学习信号，以及偏向全局探索而非细粒度局部优化的巨大轨迹空间。我们提出了PEGRL，这是一个两阶段强化学习框架，利用后编辑作为辅助任务来稳定训练并引导整体优化。在每次迭代中，采样得到的翻译输出被用来构建后编辑输入，使后编辑阶段的回报估计能够受益于以当前翻译行为为条件，同时兼顾全局探索和细粒度局部优化。针对任务的加权方案进一步平衡翻译目标与后编辑目标的贡献，从而得到一个有偏但样本效率更高的估计量。在英语→芬兰语、英语→土耳其语和英语↔中文上的实验显示，该方法相比强化学习基线取得了一致的提升；在英语→土耳其语上，COMET-KIWI指标的表现可与先进的基于LLM的系统（DeepSeek-V3.2）相当。

An Approximate Ascent Approach To Prove Convergence of PPO

一种近似上升方法以证明PPO收敛性

Authors: Leif Doering, Daniel Schmidt, Moritz Melcher, Sebastian Kassing, Benedikt Wille, Tilman Aach, Simon Weissmann
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2602.03386
Pdf link: https://arxiv.org/pdf/2602.03386
Abstract Proximal Policy Optimization (PPO) is among the most widely used deep reinforcement learning algorithms, yet its theoretical foundations remain incomplete. Most importantly, convergence and understanding of fundamental PPO advantages remain widely open. Under standard theory assumptions we show how PPO's policy update scheme (performing multiple epochs of minibatch updates on multi-use rollouts with a surrogate gradient) can be interpreted as approximated policy gradient ascent. We show how to control the bias accumulated by the surrogate gradients and use techniques from random reshuffling to prove a convergence theorem for PPO that sheds light on PPO's success. Additionally, we identify a previously overlooked issue in truncated Generalized Advantage Estimation commonly used in PPO. The geometric weighting scheme induces infinite mass collapse onto the longest $k$-step advantage estimator at episode boundaries. Empirical evaluations show that a simple weight correction can yield substantial improvements in environments with strong terminal signal, such as Lunar Lander.
中文摘要 近端策略优化（PPO）是使用最广泛的深度强化学习算法之一，但其理论基础仍不完整。最重要的是，PPO的收敛性以及对其基本优势的理解在很大程度上仍是开放问题。在标准理论假设下，我们展示了PPO的策略更新方案（在重复使用的rollout上以代理梯度执行多个epoch的小批量更新）如何被解释为近似的策略梯度上升。我们展示了如何控制代理梯度累积的偏差，并利用随机重排（random reshuffling）技术证明了PPO的一个收敛定理，该定理有助于解释PPO的成功。此外，我们还发现了PPO中常用的截断广义优势估计（GAE）中一个此前被忽视的问题：几何加权方案会在回合边界处导致无穷质量坍缩到最长的$k$步优势估计量上。实证评估表明，在终止信号较强的环境（如Lunar Lander）中，一个简单的权重校正即可带来显著改进。
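
One reading of the truncated-GAE remark in the abstract above can be checked numerically. Truncated GAE over n steps is the weighted sum of the k-step advantage estimators with weights (1 - λ)λ^(k-1) for k < n, and the entire remaining tail of the geometric series, λ^(n-1), placed on the longest estimator. The sketch below uses only this textbook identity. The function names are hypothetical, and the paper's own weight correction is not reproduced because the abstract does not specify it.

```python
import numpy as np

def truncated_gae_weights(n, lam):
    """Weights on the 1..n-step advantage estimators when GAE stops at n steps."""
    k = np.arange(1, n + 1)
    w = (1.0 - lam) * lam ** (k - 1)
    # The tail of the infinite geometric series has nowhere else to go,
    # so all of its mass lands on the longest (n-step) estimator.
    w[-1] = lam ** (n - 1)
    return w

def k_step_advantages(deltas, gamma):
    """A^(k) = sum_{l<k} gamma^l * delta_l for k = 1..n (deltas are TD errors)."""
    discounts = gamma ** np.arange(len(deltas))
    return np.cumsum(discounts * deltas)

def gae_from_deltas(deltas, gamma, lam):
    """Usual truncated GAE form: sum_l (gamma * lam)^l * delta_l."""
    powers = (gamma * lam) ** np.arange(len(deltas))
    return float((powers * deltas).sum())
```

For n = 3 and λ = 0.95 the weights are 0.05, 0.0475, and 0.9025, so a short rollout that ends at an episode boundary is dominated by its longest estimator.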

Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

面向长时程离线目标条件强化学习的目标链层级策略

Authors: Jinwoo Choi, Sang-Hyun Lee, Seung-Woo Seo
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03389
Pdf link: https://arxiv.org/pdf/2602.03389
Abstract Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
中文摘要 离线目标条件强化学习在长时程任务上仍然具有挑战性。虽然层级方法通过分解任务来缓解这一问题，但大多数现有方法依赖于相互独立的高层和低层网络，并且只生成单个中间子目标，因此不足以应对需要协调多个中间决策的复杂任务。为解决这一局限，我们借鉴思维链范式，提出了目标链层级策略（CoGHP），这是一个新颖的框架，在统一架构内将层级决策重新表述为自回归序列建模。给定一个状态和最终目标，CoGHP自回归地生成一系列潜在子目标，随后生成原始动作，其中每个潜在子目标都充当一个推理步骤，作为后续预测的条件。为了高效实现这一点，我们率先采用MLP-Mixer骨干网络，它支持跨令牌通信，并能捕捉状态、目标、潜在子目标和动作之间的结构关系。在具有挑战性的导航和操作基准测试中，CoGHP持续优于强大的离线基线，在长时程任务上展现出更好的性能。

Enhancing Navigation Efficiency of Quadruped Robots via Leveraging Personal Transportation Platforms

通过利用个人交通平台提升四足机器人的导航效率

Authors: Minsung Yoon, Sung-Eui Yoon
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.03397
Pdf link: https://arxiv.org/pdf/2602.03397
Abstract Quadruped robots face limitations in long-range navigation efficiency due to their reliance on legs. To ameliorate the limitations, we introduce a Reinforcement Learning-based Active Transporter Riding method (RL-ATR), inspired by humans' utilization of personal transporters, including Segways. The RL-ATR features a transporter riding policy and two state estimators. The policy devises adequate maneuvering strategies according to transporter-specific control dynamics, while the estimators resolve sensor ambiguities in non-inertial frames by inferring unobservable robot and transporter states. Comprehensive evaluations in simulation validate proficient command tracking abilities across various transporter-robot models and reduced energy consumption compared to legged locomotion. Moreover, we conduct ablation studies to quantify individual component contributions within the RL-ATR. This riding ability could broaden the locomotion modalities of quadruped robots, potentially expanding the operational range and efficiency.
中文摘要 四足机器人由于依赖腿部运动，在长距离导航效率上存在局限。为缓解这些局限，我们受人类使用赛格威（Segway）等个人运输工具的启发，提出了基于强化学习的主动运输工具骑乘方法（RL-ATR）。RL-ATR包含一个运输工具骑乘策略和两个状态估计器。该策略根据运输工具特有的控制动力学制定合适的操纵策略，而估计器通过推断不可观测的机器人和运输工具状态，解决非惯性参考系中的传感器歧义。仿真中的综合评估验证了该方法在多种运输工具-机器人模型上的熟练指令跟踪能力，以及相比腿式运动更低的能耗。此外，我们还进行了消融研究，以量化RL-ATR中各个组件的贡献。这种骑乘能力可以拓宽四足机器人的运动方式，并有望扩大其作业范围、提升效率。

Learning-based Initialization of Trajectory Optimization for Path-following Problems of Redundant Manipulators

基于学习的轨迹优化初始化：面向冗余机械臂的路径跟随问题

Authors: Minsung Yoon, Mincheul Kang, Daehyung Park, Sung-Eui Yoon
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.03418
Pdf link: https://arxiv.org/pdf/2602.03418
Abstract Trajectory optimization (TO) is an efficient tool to generate a redundant manipulator's joint trajectory following a 6-dimensional Cartesian path. The optimization performance largely depends on the quality of initial trajectories. However, the selection of a high-quality initial trajectory is non-trivial and requires a considerable time budget due to the extremely large space of the solution trajectories and the lack of prior knowledge about task constraints in configuration space. To alleviate the issue, we present a learning-based initial trajectory generation method that generates high-quality initial trajectories in a short time budget by adopting example-guided reinforcement learning. In addition, we suggest a null-space projected imitation reward to consider null-space constraints by efficiently learning kinematically feasible motion captured in expert demonstrations. Our statistical evaluation in simulation shows the improved optimality, efficiency, and applicability of TO when we plug in our method's output, compared with three other baselines. We also show the performance improvement and feasibility via real-world experiments with a seven-degree-of-freedom manipulator.
中文摘要 轨迹优化（TO）是一种高效的工具，用于生成冗余机械臂沿六维笛卡尔路径运动的关节轨迹。优化性能在很大程度上取决于初始轨迹的质量。然而，选择高质量的初始轨迹并非易事，且由于解轨迹空间极大、又缺乏关于构型空间中任务约束的先验知识，需要相当可观的时间预算。为缓解这一问题，我们提出了一种基于学习的初始轨迹生成方法，通过采用示例引导的强化学习，在较短的时间预算内生成高质量的初始轨迹。此外，我们提出了一种零空间投影模仿奖励，通过高效学习专家演示中所包含的运动学可行动作来考虑零空间约束。我们在仿真中的统计评估显示，与其他三个基线相比，接入我们方法的输出后，TO的最优性、效率和适用性均有所提升。我们还通过七自由度机械臂的真实世界实验展示了性能的提升和方法的可行性。

CRL-VLA: Continual Vision-Language-Action Learning

CRL-VLA：持续视觉-语言-行动学习

Authors: Qixin Zeng, Shuo Zhang, Hongyin Zhang, Renjie Wang, Han Zhao, Libang Zhao, Runze Li, Donglin Wang, Chao Huang
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.03445
Pdf link: https://arxiv.org/pdf/2602.03445
Abstract Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.
中文摘要 终身学习对于开放世界环境中的具身智能体至关重要，其中强化学习微调已成为使视觉-语言-行动（VLA）模型通过环境交互掌握灵巧操作的重要范式。因此，持续强化学习（CRL）是在终身机器人场景中部署VLA模型的一条有前景的路径，但平衡稳定性（保留旧技能）和可塑性（学习新技能）对现有方法而言仍是巨大挑战。我们提出了CRL-VLA，这是一个具有严格理论界的VLA模型持续后训练框架。我们推导出一个统一的性能界，将稳定性-可塑性权衡与目标条件优势的幅度联系起来，并以策略散度进行缩放。CRL-VLA通过非对称调节解决了这一困境：约束先前任务上的优势幅度，同时允许新任务上的优势幅度受控增长。这通过一种简单而有效的双critic架构实现，该架构采用新颖的目标条件价值表述（GCVF），其中冻结的critic锚定语义一致性，而可训练的估计器驱动适应。在LIBERO基准上的实验表明，CRL-VLA有效协调了这些相互冲突的目标，在抗遗忘和前向适应两方面均优于基线。

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

超越方差：通过罕见事件放大和双向配对实现提示高效的RLVR

Authors: Xin Sheng, Jiaxin Li, Yujuan Pang, Ran Peng, Yong Ma
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03452
Pdf link: https://arxiv.org/pdf/2602.03452
Abstract Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose positive-negative pairing: at each update, we sample a hard-but-solvable prompt $q^{+}$ and an easy-but-brittle prompt $q^{-}$ (high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME 2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.
中文摘要 带有可验证奖励的强化学习（RLVR）对于在确定性结果的推理任务上训练大型语言模型非常有效。以往研究表明，RLVR在提示很少的情况下也能奏效，但提示选择往往仅基于训练准确率的方差，导致优化方向不稳定、迁移效果较弱。我们从机制层面重新审视提示选择，并主张一个有效的小批量应同时提供（i）可靠的正向锚点和（ii）来自罕见失败的显式负向学习信号。基于这一原则，我们提出了正负配对（positive-negative pairing）：每次更新时，我们采样一个困难但可解的提示$q^{+}$和一个容易但脆弱的提示$q^{-}$（成功率高但并非完美），二者在多次rollout下分别具有低和高的经验成功率。我们进一步引入加权GRPO（Weighted GRPO），该方法在配对层面对二元结果重新加权，并利用组归一化优势将$q^{+}$上的罕见成功放大为鲜明的正向引导，同时将$q^{-}$上的罕见失败转化为强烈的负向惩罚。这种双向信号为成功和失败都提供了有信息量的学习反馈，在不抑制探索的前提下提升了样本效率。在Qwen2.5-Math-7B上，每次更新仅使用一个配对小批量，就持续优于通过常用的基于方差的选择启发式选取两个提示的GRPO基线：AIME 2025 Pass@8从16.8提升到22.2，AMC23 Pass@64从94.0提升到97.0，同时与使用1209个训练提示池训练的大规模RLVR相比仍具竞争力。在Qwen2.5-Math-7B-Instruct上也观察到了类似的提升。
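
The amplification effect that the abstract above attributes to group-normalized advantages can be seen with plain mean/std normalization over one prompt's rollouts. The sketch below is illustrative only: it shows the commonly used GRPO-style normalization on binary rewards, not the pair-level reweighting of Weighted GRPO, which the abstract does not spell out. The function name and the `eps` guard are assumptions.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Mean/std-normalised advantages over the rollouts of a single prompt.

    With binary rewards and empirical success rate p, a success gets
    sqrt((1 - p) / p) and a failure gets -sqrt(p / (1 - p)).
    """
    r = np.asarray(rewards, dtype=float)
    std = r.std()  # population std (ddof=0)
    if std < eps:
        # All rollouts agree: the group carries no learning signal.
        return np.zeros_like(r)
    return (r - r.mean()) / std
```

With eight rollouts, a single success on a hard prompt receives an advantage of sqrt(7), roughly 2.65, while each failure receives about -0.38. Symmetrically, a single failure on an easy prompt receives about -2.65. A prompt solved in every rollout (or in none) yields all-zero advantages, which is why the easy prompt has to be brittle rather than perfectly solved.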

IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning

IntentRL：通过强化学习训练面向开放式深度研究的主动用户意图智能体

Authors: Haohao Luo, Zexi Li, Yuexiang Xie, Wenhao Zhang, Yaliang Li, Ying Shen
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03468
Pdf link: https://arxiv.org/pdf/2602.03468
Abstract Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge by autonomously retrieving and synthesizing evidence from large web corpora into long-form reports, enabling a long-horizon agentic paradigm. However, unlike real-time conversational assistants, DR is computationally expensive and time-consuming, creating an autonomy-interaction dilemma: high autonomy on ambiguous user queries often leads to prolonged execution with unsatisfactory outcomes. To address this, we propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research. To overcome the scarcity of open-ended research data, we introduce a scalable pipeline that expands a few seed samples into high-quality dialogue turns via a shallow-to-deep intent refinement graph. We further adopt a two-stage reinforcement learning (RL) strategy: Stage I applies RL on offline dialogues to efficiently learn general user-interaction behavior, while Stage II uses the trained agent and a user simulator for online rollouts to strengthen adaptation to diverse user feedback. Extensive experiments show that IntentRL significantly improves both intent hit rate and downstream task performance, outperforming the built-in clarify modules of closed-source DR agents and proactive LLM baselines.
中文摘要 深度研究（DR）智能体通过自主检索大型网络语料库中的证据并将其综合为长篇报告，将大型语言模型（LLM）扩展到参数化知识之外，实现了一种长时程的智能体范式。然而，与实时对话助手不同，DR计算成本高且耗时，由此产生了自主性与交互性的两难：在模糊的用户查询上采取高自主性，往往导致执行时间冗长而结果不理想。为此，我们提出了IntentRL，这是一个训练主动式智能体在开始长时程研究之前澄清潜在用户意图的框架。为了克服开放式研究数据的稀缺，我们引入了一个可扩展的流程，通过由浅入深的意图细化图，将少量种子样本扩展为高质量的对话轮次。我们进一步采用两阶段强化学习（RL）策略：第一阶段在离线对话上应用强化学习，以高效学习通用的用户交互行为；第二阶段则使用训练好的智能体和用户模拟器进行在线rollout，以增强对多样化用户反馈的适应能力。大量实验表明，IntentRL显著提升了意图命中率和下游任务性能，优于闭源DR智能体内置的澄清模块以及主动式LLM基线。

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

骨架与肉体解耦：高效多模态表推理，结合解缠比对和结构感知指导

Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03491
Pdf link: https://arxiv.org/pdf/2602.03491
Abstract Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures.
中文摘要 由于布局复杂且结构内容信息紧密耦合，大型视觉语言模型（LVLM）在表格图像上的推理仍然具有挑战性。现有解决方案通常依赖昂贵的监督培训、强化学习或外部工具，限制了效率和可扩展性。这项工作探讨了一个关键问题：如何在最少注释且无外部工具的情况下，将LVLM适配到表格推理中？具体来说，我们首先介绍了DiSCo，一种解缠结构-内容对齐框架，在多模态对齐过程中明确将结构抽象与语义基础分离，高效地将LVLM适配到表格结构中。基于DiSCo，我们进一步介绍了Table-GLS，一种全局到局部结构引导的推理框架，通过结构化探索和证据基础推理实现表格推理。跨越多种基准测试的大量实验表明，我们的框架高效提升了 LVLM 的表理解和推理能力，尤其是在未被看见的表结构上。

Reparameterization Flow Policy Optimization

重参数化流策略优化

Authors: Hai Zhong, Zhuoran Li, Xun Wang, Longbo Huang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03501
Pdf link: https://arxiv.org/pdf/2602.03501
Abstract Reparameterization Policy Gradient (RPG) has emerged as a powerful paradigm for model-based reinforcement learning, enabling high sample efficiency by backpropagating gradients through differentiable dynamics. However, prior RPG approaches have been predominantly restricted to Gaussian policies, limiting their performance and failing to leverage recent advances in generative models. In this work, we identify that flow policies, which generate actions via differentiable ODE integration, naturally align with the RPG framework, a connection not established in prior work. However, naively exploiting this synergy proves ineffective, often suffering from training instability and a lack of exploration. We propose Reparameterization Flow Policy Optimization (RFO). RFO computes policy gradients by backpropagating jointly through the flow generation process and system dynamics, unlocking high sample efficiency without requiring intractable log-likelihood calculations. RFO includes two tailored regularization terms for stability and exploration. We also propose a variant of RFO with action chunking. Extensive experiments on diverse locomotion and manipulation tasks, involving both rigid and soft bodies with state or visual inputs, demonstrate the effectiveness of RFO. Notably, on a challenging locomotion task controlling a soft-body quadruped, RFO achieves almost $2\times$ the reward of the state-of-the-art baseline.
中文摘要 重参数化策略梯度（RPG）已成为基于模型的强化学习的强大范式，它通过可微动力学反向传播梯度，实现了高样本效率。然而，以往的RPG方法主要局限于高斯策略，这限制了其性能，也未能利用生成模型的最新进展。在本研究中，我们发现通过可微ODE积分生成动作的流策略（flow policy）与RPG框架天然契合，而这一联系在以往工作中尚未被建立。然而，简单直接地利用这种协同效应并不奏效，常常出现训练不稳定和探索不足的问题。我们提出了重参数化流策略优化（RFO）。RFO通过在流生成过程和系统动力学中联合反向传播来计算策略梯度，无需进行难以处理的对数似然计算即可获得高样本效率。RFO包含两个分别针对稳定性和探索的定制正则化项。我们还提出了一种带有动作分块（action chunking）的RFO变体。在多种运动和操作任务上的大量实验（涉及刚体和软体，以及状态或视觉输入）证明了RFO的有效性。值得注意的是，在一项控制软体四足动物的高难度运动任务中，RFO获得的奖励几乎是最先进基线的$2\times$。

Learning to Reason Faithfully through Step-Level Faithfulness Maximization

通过步骤级忠实度最大化学习忠实推理

Authors: Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03507
Pdf link: https://arxiv.org/pdf/2602.03507
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）在需要多步推理的任务上的表现。然而，大多数RLVR流程依赖于基于结果的稀疏奖励，对中间步骤几乎不提供监督，从而助长过度自信和虚假推理，进而加剧幻觉。为此，我们提出了FaithRL，一种直接优化推理忠实度的通用强化学习框架。我们形式化了一个忠实度最大化目标，并从理论上证明优化该目标能够缓解过度自信。为实例化这一目标，我们引入了几何奖励设计和忠实度感知的优势调制机制，后者通过惩罚缺乏依据的步骤、同时保留有效的部分推导来分配步骤级信用。在多种骨干模型和基准测试上，FaithRL持续降低幻觉率，同时保持（且常常提升）答案正确率。进一步分析证实，FaithRL提高了逐步推理的忠实度，并具有稳健的泛化能力。我们的代码可在此 https URL 获取。

CMR: Contractive Mapping Embeddings for Robust Humanoid Locomotion on Unstructured Terrains

CMR：用于非结构化地形上鲁棒人形机器人运动的收缩映射嵌入

Authors: Qixin Zeng, Hongyin Zhang, Shangke Lyu, Junxi Jin, Donglin Wang, Chao Huang
Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03511
Pdf link: https://arxiv.org/pdf/2602.03511
Abstract Robust disturbance rejection remains a longstanding challenge in humanoid locomotion, particularly on unstructured terrains where sensing is unreliable and model mismatch is pronounced. While perception information, such as a height map, enhances terrain awareness, sensor noise and sim-to-real gaps can destabilize policies in practice. In this work, we provide theoretical analysis that bounds the return gap under observation noise when the induced latent dynamics are contractive. Furthermore, we present the Contractive Mapping for Robustness (CMR) framework, which maps high-dimensional, disturbance-prone observations into a latent space where local perturbations are attenuated over time. Specifically, this approach couples contrastive representation learning with Lipschitz regularization to preserve task-relevant geometry while explicitly controlling sensitivity. Notably, the formulation can be incorporated into modern deep reinforcement learning pipelines as an auxiliary loss term with minimal additional technical effort required. Further, our extensive humanoid experiments show that CMR potently outperforms other locomotion algorithms under increased noise.
中文摘要 鲁棒的抗扰动能力一直是人形机器人运动中的长期挑战，尤其是在非结构化地形上，此时感知不可靠且模型失配明显。虽然高度图等感知信息能够增强地形感知，但传感器噪声和仿真到现实（sim-to-real）的差距在实践中可能使策略失稳。在本研究中，我们给出了理论分析，在诱导的潜在动力学具有收缩性时，对观测噪声下的回报差距进行了界定。此外，我们提出了面向鲁棒性的收缩映射（CMR）框架，将高维、易受扰动的观测映射到一个潜在空间，局部扰动在其中随时间衰减。具体来说，该方法将对比表示学习与Lipschitz正则化相结合，在保持任务相关几何结构的同时显式控制敏感度。值得注意的是，该表述可以作为辅助损失项纳入现代深度强化学习流水线，几乎不需要额外的技术投入。此外，我们大量的人形机器人实验表明，在噪声增大的情况下，CMR明显优于其他运动算法。

Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning

并非所有负面样本都一样：大型语言模型从合理的推理中学习得更好

Authors: Zixiang Di, Jinyi Han, Shuo Zhang, Ying Liao, Zhi Li, Xiaofeng Ji, Yongqi Wang, Zheming Yang, Ming Gao, Bingdong Li, Jie Wang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03516
Pdf link: https://arxiv.org/pdf/2602.03516
Abstract Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.
中文摘要 从负样本中学习对提升大型语言模型（LLM）的推理能力具有巨大潜力，但现有方法将所有错误回答视为具有同等信息量，忽视了样本质量的关键作用。为此，我们提出了合理负样本（PNS）方法，该方法合成的高质量负样本具有符合预期的格式和结构连贯性，但最终给出错误答案。PNS通过反向强化学习（reverse RL）训练一个专用模型，训练由一个复合奖励引导，该奖励结合了格式合规性、准确性反转、奖励模型评估和思维链评估，生成的回答与正确解几乎无法区分。我们进一步在七个数学推理基准上、针对三种骨干模型，验证了PNS可作为偏好优化的即插即用数据源。结果表明，PNS持续优于其他负样本合成方法，相比RL训练的模型平均提升2.03%。

AffordanceGrasp-R1: Leveraging Reasoning-Based Affordance Segmentation with Reinforcement Learning for Robotic Grasping

AffordanceGrasp-R1：利用基于推理的可供性分割与强化学习实现机器人抓取

Authors: Dingyi Zhou, Mu He, Zhuowei Fang, Xiangtong Yao, Yinlong Liu, Alois Knoll, Hu Cao
Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.03547
Pdf link: https://arxiv.org/pdf/2602.03547
Abstract We introduce AffordanceGrasp-R1, a reasoning-driven affordance segmentation framework for robotic grasping that combines a chain-of-thought (CoT) cold-start strategy with reinforcement learning to enhance deduction and spatial grounding. In addition, we redesign the grasping pipeline to be more context-aware by generating grasp candidates from the global scene point cloud and subsequently filtering them using instruction-conditioned affordance masks. Extensive experiments demonstrate that AffordanceGrasp-R1 consistently outperforms state-of-the-art (SOTA) methods on benchmark datasets, and real-world robotic grasping evaluations further validate its robustness and generalization under complex language-conditioned manipulation scenarios.
中文摘要 我们提出了AffordanceGrasp-R1，一种面向机器人抓取的推理驱动可供性（affordance）分割框架，它将思维链（CoT）冷启动策略与强化学习相结合，以增强推理和空间定位能力。此外，我们重新设计了抓取流水线，先从全局场景点云生成抓取候选，再使用以指令为条件的可供性掩码对其进行过滤，使其更具上下文感知能力。大量实验表明，AffordanceGrasp-R1在基准数据集上持续优于最先进（SOTA）方法，真实世界的机器人抓取评估进一步验证了其在复杂的语言条件操作场景下的鲁棒性和泛化能力。

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

从人类偏好中学习面向DeepResearch报告生成的查询特定评分标准

Authors: Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03619
Pdf link: https://arxiv.org/pdf/2602.03619
Abstract Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
中文摘要 目前，由于缺乏可验证的奖励信号，训练和评估DeepResearch生成的报告仍然充满挑战。因此，基于评分标准的评估已成为一种常见做法。然而，现有方法要么依赖粒度不足的粗略预定义评分标准，要么依赖人工构建的查询特定评分标准，后者成本高且难以扩展。本文提出一条流程，用于训练与人类偏好对齐的查询特定评分标准生成器，专为DeepResearch报告生成量身定制。我们首先构建一个DeepResearch风格的查询数据集，其中标注了人类对成对报告的偏好，并通过强化学习训练评分标准生成器，其混合奖励结合了人类偏好监督和基于LLM的评分标准评估。为了更好地处理长程推理，我们进一步引入了用于报告生成的多智能体马尔可夫状态（MaMs）工作流。实证结果表明，我们提出的评分标准生成器比现有评分标准设计策略提供了更具区分度、与人类更一致的监督。此外，当集成到MaMs训练框架中时，配备我们评分标准生成器的DeepResearch系统在DeepResearch Bench上持续优于所有开源基线，并达到与领先闭源模型相当的性能。

TRE: Encouraging Exploration in the Trust Region

TRE：鼓励在信任域内进行探索

Authors: Chao Huang, Yujing Lu, Quangang Li, Shenghe Wang, Yan Wang, Yueyang Zhang, Long Xia, Jiashu Zhao, Zhiyuan Sun, Daiting Shi, Tingwen Liu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03635
Pdf link: https://arxiv.org/pdf/2602.03635
Abstract Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at this https URL.
中文摘要 熵正则化是强化学习（RL）中增强探索的标准技术，但它在大型语言模型（LLM）中效果微乎其微，甚至会降低性能。我们将这种失败归因于词表庞大、生成长度较长的大型语言模型所固有的累积尾部风险。在这种环境下，标准的全局熵最大化会不加区分地将概率质量稀释到大量无效token构成的尾部，而非聚焦于合理的候选token，从而破坏连贯的推理。为此，我们提出了信任域熵（Trust Region Entropy，简称TRE）方法，鼓励严格在模型的信任域内进行探索。在数学推理（MATH）、组合搜索（Countdown）和偏好对齐（HH）任务上的大量实验表明，TRE始终优于普通PPO、标准熵正则化及其他探索基线。我们的代码可在此 https URL 获取。
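The abstract does not say how the trust region is defined, so the following is only an illustrative sketch under a loud assumption: the "trust region" is taken to be the k most probable tokens, and the entropy is computed on the distribution renormalized over that set instead of the full vocabulary. The function names and the top-k choice are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_entropy(logits, k):
    """Entropy of the distribution renormalized over the k most likely tokens.

    ASSUMPTION: top-k stands in for the model's trust region. The paper may
    define the region differently; this is only a sketch of the general idea.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    top = sorted(softmax(logits), reverse=True)[:k]
    mass = sum(top)
    return entropy([p / mass for p in top])
```

Under this assumption, a bonus built on `top_k_entropy` can never reward moving probability mass into the tail, because tokens outside the top k do not enter the computation at all; a bonus on the global `entropy` can.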

Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG

RAG中历史感知稠密检索器的强化微调

Authors: Yicheng Zhang, Zhen Qin, Zhaomin Wu, Wenqi Zhang, Shuiguang Deng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03645
Pdf link: https://arxiv.org/pdf/2602.03645
Abstract Retrieval-augmented generation (RAG) enables large language models (LLMs) to produce evidence-based responses, and its performance hinges on the matching between the retriever and LLMs. Retriever optimization has emerged as an efficient alternative to fine-tuning LLMs. However, existing solutions suffer from objective mismatch between retriever optimization and the goal of RAG pipeline. Reinforcement learning (RL) provides a promising solution to address this limitation, yet applying RL to retriever optimization introduces two fundamental challenges: 1) the deterministic retrieval is incompatible with RL formulations, and 2) state aliasing arises from query-only retrieval in multi-hop reasoning. To address these challenges, we replace deterministic retrieval with stochastic sampling and formulate RAG as a Markov decision process, making retriever optimizable by RL. Further, we incorporate retrieval history into the state at each retrieval step to mitigate state aliasing. Extensive experiments across diverse RAG pipelines, datasets, and retriever scales demonstrate consistent improvements of our approach in RAG performance.
中文摘要 检索增强生成（RAG）使大型语言模型（LLM）能够生成基于证据的回答，其性能取决于检索器与LLM之间的匹配。检索器优化已成为微调LLM的高效替代方案。然而，现有方案存在检索器优化目标与RAG流水线目标不匹配的问题。强化学习（RL）为解决这一局限提供了有前景的途径，但将RL应用于检索器优化带来了两个根本性挑战：1）确定性检索与RL的形式化不兼容；2）多跳推理中仅基于查询的检索会导致状态混叠。为应对这些挑战，我们用随机采样替代确定性检索，并将RAG建模为马尔可夫决策过程，使检索器可以通过RL进行优化。此外，我们在每个检索步骤中将检索历史纳入状态，以缓解状态混叠。在多种RAG流水线、数据集和检索器规模上的大量实验表明，我们的方法能够持续提升RAG性能。
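The two ideas named in the abstract, stochastic retrieval and a history-aware state, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax-over-scores policy, the temperature parameter, and the helper names are all assumptions made for the example.

```python
import math
import random


def retrieval_policy(scores, temperature=1.0):
    """Turn retriever similarity scores into a sampling distribution.

    ASSUMPTION: a temperature-scaled softmax; the paper may use another form.
    """
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_document(scores, rng, temperature=1.0):
    """Sample one document index and return it with its log-probability.

    The log-probability is what a policy-gradient update would need;
    deterministic top-1 retrieval provides no such quantity.
    """
    probs = retrieval_policy(scores, temperature)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, math.log(probs[idx])


def make_state(query, history):
    """State = query plus documents retrieved so far (hypothetical helper).

    With the query alone, two different hops of a multi-hop question would
    map to the same state; including the history tells them apart.
    """
    return (query, tuple(history))


rng = random.Random(0)
idx, logp = sample_document([2.0, 1.0, 0.5], rng)
state = make_state("who founded X?", ["doc_%d" % idx])
```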

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

Search-R2：通过Actor-Refiner协作增强搜索集成推理

Authors: Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, Irwin King
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03647
Pdf link: https://arxiv.org/pdf/2602.03647
Abstract Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
中文摘要 搜索集成推理使语言智能体能够通过主动查询外部来源来超越静态的参数化知识。然而，通过强化学习训练这些智能体受到多尺度信用分配问题的阻碍：现有方法通常依赖稀疏的轨迹级奖励，无法区分高质量推理和侥幸猜中，导致冗余或误导性的搜索行为。为此，我们提出了Search-R2，一种新型的Actor-Refiner协作框架，通过有针对性的干预来增强推理，两个组件在训练过程中联合优化。我们的方法将生成过程分解为生成初始推理轨迹的Actor，以及通过“截断并重新生成”（cut-and-regenerate）机制有选择地诊断和修复有缺陷步骤的Meta-Refiner。为提供细粒度的监督，我们引入了混合奖励设计，将结果正确性与一个稠密的过程奖励相结合，后者量化检索证据的信息密度。理论上，我们将Actor-Refiner的交互形式化为平滑混合策略，并证明选择性修正相对于强基线能带来严格的性能提升。在多个通用问答和多跳问答数据集上的大量实验表明，Search-R2在不同模型规模下始终优于强大的RAG基线和基于RL的基线，以极小的开销实现了更高的推理准确性。

Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

重新思考重排序器：面向鲁棒检索增强生成的边界感知证据选择

Authors: Jiashuo Sun, Pengcheng Jiang, Saizhuo Wang, Jiajun Fan, Heng Wang, Siru Ouyang, Ming Zhong, Yizhu Jiao, Chengsong Huang, Xueqiang Xu, Pengrui Han, Peiran Li, Jiaxin Huang, Ge Liu, Heng Ji, Jiawei Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.03689
Pdf link: https://arxiv.org/pdf/2602.03689
Abstract Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator's Goldilocks Zone -- evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly available at this https URL.
中文摘要 检索增强生成（RAG）系统在真实的检索噪声下依然脆弱，即使所需证据出现在前K个结果中也是如此。一个关键原因是检索器和重排序器仅针对相关性进行优化，常常选择直接泄露答案的浅显段落，或缺乏回答问题所需关键信息的证据，而不考虑这些证据是否适合生成器。我们提出了BAR-RAG，它将重排序器重新定义为边界感知的证据选择器，瞄准生成器的“恰到好处区间”（Goldilocks Zone）：这类证据对生成器而言既不是过于简单，也不是根本无法回答，而是具有挑战性但足以支持推理，因此能提供最强的学习信号。BAR-RAG利用生成器反馈，通过强化学习训练选择器，并采用两阶段流水线，在由此产生的证据分布下微调生成器，以缓解训练与推理之间的分布不匹配。在知识密集型问答基准上的实验表明，BAR-RAG在噪声检索下持续提升端到端性能，相比强RAG基线和重排序基线平均提升10.3%，同时显著提高了鲁棒性。代码在此 https URL 公开发布。

Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

通过对比动态分支采样训练多轮搜索智能体

Authors: Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03719
Pdf link: https://arxiv.org/pdf/2602.03719
Abstract Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at this https URL.
中文摘要 智能体强化学习使大型语言模型能够执行复杂的多轮规划和工具使用。然而，由于结果奖励稀疏且仅在轨迹层面给出，长程场景下的学习仍然具有挑战性。以往基于树的方法试图缓解这一问题，但往往存在方差高、计算效率低的问题。通过对搜索智能体的实证分析，我们发现了一个共同模式：性能差异主要源于靠近轨迹尾部的决策。基于这一观察，我们提出了分支相对策略优化（BranPO），这是一种无需价值函数的方法，能够在没有稠密奖励的情况下提供步骤级对比监督。BranPO在靠近尾部处截断轨迹，并重新采样不同的后续，从而在共享前缀上构建对比后缀，减少长程rollout中的信用分配歧义。为了进一步提升效率并稳定训练，我们引入了难度感知分支采样，以针对不同任务调整分支频率，并引入冗余步骤掩码以抑制无信息的动作。在多个问答基准上的大量实验表明，BranPO始终优于强基线，在不增加总体训练预算的情况下，在长程任务上取得了显著的准确率提升。我们的代码可在此 https URL 获取。

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

RegionReasoner：基于区域的多轮视觉推理

Authors: Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.03733
Pdf link: https://arxiv.org/pdf/2602.03733
Abstract Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
中文摘要 大型视觉语言模型在视觉推理方面取得了显著进展，但大多数现有系统依赖单步或纯文本推理，限制了其在多个视觉上下文中迭代细化理解的能力。为解决这一局限，我们引入了一个新的多轮视觉推理基准，其训练集和测试集涵盖检测和分割任务，支持在迭代推理场景下进行系统评估。我们还提出了RegionReasoner，一种强化学习框架，它要求每条推理轨迹显式引用对应的参考边界框，以此强制推理有据可依，同时通过全局-局部一致性奖励保持语义连贯。该奖励从全局场景描述和区域级描述中提取关键对象和名词，并将其与推理轨迹对齐，以确保各推理步骤之间的一致性。RegionReasoner采用结合定位忠实度和全局-局部语义对齐的结构化奖励进行优化。检测和分割任务上的实验表明，RegionReasoner-7B与我们新提出的基准RegionDial-Bench一起，显著提升了多轮推理准确率、空间定位精度和全局-局部一致性，为这一新兴研究方向建立了强有力的基线。

Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

推理缓存：通过短程强化学习实现长程上的持续改进

Authors: Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03773
Pdf link: https://arxiv.org/pdf/2602.03773
Abstract Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5M tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we also show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, due to the improved summary-conditioned generation abilities learned through training.
中文摘要 能够在训练预算之外持续改进的大型语言模型（LLM）可以通过测试时的自适应来解决越来越难的问题，我们将这一特性称为外推。然而，标准强化学习（RL）在固定的问题分布和训练预算下运行，这限制了测试时出现分布偏移时的外推能力。为此，我们引入了RC，一种迭代解码算法，在训练和推理过程中取代标准的自回归解码。RC利用LLM在回答生成能力与总结能力之间的不对称性，构建在迭代中持续改进的推理链。经过训练使用RC的模型能够在比训练时长一个数量级以上的推理长度上进行外推并持续改进。实证上，使用16k token的训练预算以RC训练一个4B模型，在测试时使用50万token的情况下，可将HMMT 2025上的表现从40%提升至近70%，优于同等规模的模型和许多更大的推理LLM。最后，我们还表明，得益于训练中学到的更强的摘要条件生成能力，用RC训练的模型能够更有效地利用现有脚手架（scaffold）来进一步扩展测试时性能。

Efficient Estimation of Kernel Surrogate Models for Task Attribution

任务归因中核代理模型的高效估计

Authors: Zhenshuo Zhang, Minxuan Duan, Hongyang R. Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.03783
Pdf link: https://arxiv.org/pdf/2602.03783
Abstract Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task's performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than 2% relative error without repeated retraining. Experiments across multiple domains -- including math reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a 40% improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.
中文摘要 现代人工智能体（如大型语言模型）同时在多种任务上训练，包括翻译、代码生成、数学推理和文本预测。一个关键问题是量化每个训练任务如何影响目标任务的表现，我们称之为任务归因。直接的方法是留一法重训练，即测量去除每个任务的影响，但在大规模下计算上不可行。近期文献中出现了一种替代方法，即构建代理模型来预测目标任务在任意训练任务子集上的表现。此前的研究主要集中于线性代理模型，这类模型能捕捉一阶关系，但忽略了协同、拮抗或异或型效应等非线性交互。本文首先考虑一个用于分析任务归因方法的统一任务加权框架，并通过二阶分析揭示了线性代理模型与影响函数之间的新联系。然后，我们引入了核代理模型，它能更有效地表示二阶任务交互。为了高效学习核代理模型，我们开发了一种基于梯度的估计过程，利用预训练模型的一阶近似；实证上，这能在无需反复重训练的情况下获得相对误差小于2%的准确估计。在多个领域（包括Transformer中的数学推理、上下文学习和多目标强化学习）上的实验证明了核代理模型的有效性。与线性代理模型和影响函数基线相比，它们与留一法真值的相关性高出25%。在用于下游任务选择时，核代理模型在上下文学习的示例选择和多目标强化学习基准上带来了40%的提升。

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

连接在线与离线强化学习：面向多轮代码生成的上下文老虎机学习

Authors: Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.03806
Pdf link: https://arxiv.org/pdf/2602.03806
Abstract Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at this https URL.
中文摘要 近来，在真实任务（如多轮代码生成）上使用强化学习（RL）训练大型语言模型（LLM）引起了广泛的研究兴趣。虽然在线RL通常优于离线RL，但其较高的训练成本和不稳定性阻碍了广泛采用。本文基于多轮代码生成可以表述为一步可恢复马尔可夫决策过程这一观察，提出了基于离线轨迹的上下文老虎机学习（Cobalt），这是一种结合在线和离线RL优势的新方法。Cobalt首先使用参考LLM收集代码生成轨迹，并将其划分为部分轨迹作为上下文提示。然后，在在线老虎机学习过程中，训练LLM通过单步代码生成来完成每个部分轨迹提示。Cobalt优于基于GRPO和VeRPO的两个多轮在线RL基线，并在LiveCodeBench上将R1-Distill 8B和Qwen3 8B的绝对Pass@1分数分别最多提升9.0和6.2。此外，我们分析了LLM的上下文内奖励黑客（reward hacking）行为，并用扰动轨迹增强Cobalt训练以缓解这一问题。总体而言，我们的结果表明，Cobalt是多轮代码生成等迭代决策任务的一种有前景的解决方案。我们的代码和数据可在此 https URL 获取。
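The step of dividing collected trajectories into partial trajectories that serve as contextual prompts can be pictured with a small sketch. The data layout (a task string plus a list of `(code, feedback)` turns) and the function name are assumptions for illustration; the abstract does not specify how Cobalt represents or truncates trajectories.

```python
def partial_trajectories(task, turns):
    """Split one offline trajectory into bandit contexts.

    ASSUMPTION: each context is the task plus the first t turns, for
    t = 0 .. len(turns) - 1, so the policy is asked for exactly one more
    code attempt given that prefix.
    """
    return [{"task": task, "history": list(turns[:t])} for t in range(len(turns))]


# A hypothetical three-turn trajectory collected from a reference model.
trajectory = [
    ("def f(x): return x", "test 2 failed"),
    ("def f(x): return x + 1", "test 3 failed"),
    ("def f(x): return abs(x) + 1", "all tests passed"),
]
contexts = partial_trajectories("implement f", trajectory)
```

Each element of `contexts` is then an independent single-step problem, which is what lets the online phase be treated as bandit learning instead of multi-turn RL.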

SymPlex: A Structure-Aware Transformer for Symbolic PDE Solving

SymPlex：一种用于符号偏微分方程求解的结构感知Transformer

Authors: Yesom Park, Annie C. Lu, Shao-Ching Huang, Qiyang Hu, Y. Sungtaek Ju, Stanley Osher
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03816
Pdf link: https://arxiv.org/pdf/2602.03816
Abstract We propose SymPlex, a reinforcement learning framework for discovering analytical symbolic solutions to partial differential equations (PDEs) without access to ground-truth expressions. SymPlex formulates symbolic PDE solving as tree-structured decision-making and optimizes candidate solutions using only the PDE and its boundary conditions. At its core is SymFormer, a structure-aware Transformer that models hierarchical symbolic dependencies via tree-relative self-attention and enforces syntactic validity through grammar-constrained autoregressive decoding, overcoming the limited expressivity of sequence-based generators. Unlike numerical and neural approaches that approximate solutions in discretized or implicit function spaces, SymPlex operates directly in symbolic expression space, enabling interpretable and human-readable solutions that naturally represent non-smooth behavior and explicit parametric dependence. Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods.
中文摘要 我们提出了SymPlex，一种强化学习框架，用于在无法获得真实表达式的情况下发现偏微分方程（PDE）的解析符号解。SymPlex将符号PDE求解表述为树结构的决策过程，并仅使用PDE及其边界条件来优化候选解。其核心是SymFormer，一种结构感知的Transformer，它通过树相对自注意力建模层级化的符号依赖关系，并通过语法约束的自回归解码保证句法有效性，克服了基于序列的生成器表达能力有限的问题。与在离散化或隐式函数空间中近似解的数值方法和神经方法不同，SymPlex直接在符号表达式空间中运行，能够得到可解释、人类可读的解，这些解可以自然地表示非光滑行为和显式的参数依赖。实证结果表明，基于深度学习的符号方法能够精确恢复非光滑和参数化的PDE解。

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

理解并利用权重更新稀疏性以实现通信高效的分布式强化学习

Authors: Erfan Miahi, Eugene Belilovsky
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.03839
Pdf link: https://arxiv.org/pdf/2602.03839
Abstract Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.
中文摘要 强化学习（RL）是大型语言模型（LLM）后训练的关键组成部分。然而，在带宽受限的分布式RL中，可扩展性常常受制于策略权重从训练节点到推理工作节点的同步，尤其是在普通商用网络或去中心化环境中。虽然近期研究表明RL更新只修改模型参数的一小部分，但这些观察通常基于粗粒度的检查点差异。我们在单步和多步粒度上对权重更新稀疏性进行了系统的实证研究，考察其随训练动态、离策略延迟和模型规模的变化。我们发现更新稀疏度始终很高，在实际相关的设置中经常超过99%。利用这一结构，我们提出了PULSE（Patch Updates via Lossless Sparse Encoding，基于无损稀疏编码的补丁更新），这是一种简单而高效的无损权重同步方法，仅传输被修改参数的索引和值。PULSE对传输错误具有鲁棒性，并避免了加性增量方案固有的浮点漂移。在带宽受限的去中心化环境中，我们的方法实现了超过100倍（14 GB降至约108 MB）的通信量缩减，同时与全量权重同步相比保持逐比特一致的训练动态和性能。通过利用这一结构，PULSE使去中心化RL训练能够接近中心化训练的吞吐量，将保持高GPU利用率所需的权重同步带宽从20 Gbit/s降至0.2 Gbit/s。

Keyword: diffusion policy

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

拉格朗日函数如何通过扩散模型引导安全强化学习？

Authors: Xiaoyuan Cheng, Wenxuan Yuan, Boyang Li, Yuanchao Xu, Yiming Yang, Hao Liang, Bei Peng, Robert Loftin, Zhuo Sun, Yukun Hu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.02924
Pdf link: https://arxiv.org/pdf/2602.02924
Abstract Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline settings for reward maximization, with limited consideration of safety in online settings. To address this gap, we propose Augmented Lagrangian-Guided Diffusion (ALGD), a novel algorithm for off-policy safe RL. By revisiting optimization theory and energy-based model, we show that the instability of primal-dual methods arises from the non-convex Lagrangian landscape. In diffusion-based safe RL, the Lagrangian can be interpreted as an energy function guiding the denoising dynamics. Counterintuitively, direct usage destabilizes both policy generation and training. ALGD resolves this issue by introducing an augmented Lagrangian that locally convexifies the energy landscape, yielding a stabilized policy generation and training process without altering the distribution of the optimal policy. Theoretical analysis and extensive experiments demonstrate that ALGD is both theoretically grounded and empirically effective, achieving strong and stable performance across diverse environments.
中文摘要 扩散策略采样使强化学习（RL）能够表示多模态动作分布，超越次优的单峰高斯策略。然而，现有基于扩散的RL方法主要关注离线设置下的奖励最大化，对在线设置中的安全性考虑有限。为弥补这一空白，我们提出了增广拉格朗日引导扩散（ALGD），这是一种用于离策略安全RL的新算法。通过重新审视优化理论和基于能量的模型，我们表明原始-对偶方法的不稳定性源于非凸的拉格朗日景观。在基于扩散的安全RL中，拉格朗日函数可以被解释为引导去噪动力学的能量函数。与直觉相反，直接使用它会破坏策略生成和训练的稳定性。ALGD通过引入增广拉格朗日函数解决了这一问题，它使能量景观局部凸化，从而在不改变最优策略分布的前提下稳定策略生成和训练过程。理论分析和大量实验表明，ALGD既有理论依据，又在实证上有效，能够在多种环境中取得强大且稳定的性能。
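For readers unfamiliar with the term, the textbook augmented Lagrangian for a single inequality constraint is shown below, written for a constrained RL problem. The notation (reward return $J_r$, cost return $J_c$, cost budget $d$, multiplier $\lambda \ge 0$, penalty weight $\rho > 0$) is assumed for illustration; the abstract does not give the exact form that ALGD uses, which may differ.

```latex
% Constrained problem (assumed notation):
%   maximize J_r(\pi)  subject to  J_c(\pi) \le d
%
% Plain Lagrangian, linear in the constraint violation:
\mathcal{L}(\pi, \lambda) = -J_r(\pi) + \lambda \bigl( J_c(\pi) - d \bigr)

% Standard augmented Lagrangian for an inequality constraint:
\mathcal{L}_\rho(\pi, \lambda) = -J_r(\pi)
  + \frac{1}{2\rho} \Bigl( \max\bigl\{ 0,\; \lambda + \rho \bigl( J_c(\pi) - d \bigr) \bigr\}^{2} - \lambda^{2} \Bigr)

% Associated multiplier update:
\lambda \leftarrow \max\bigl\{ 0,\; \lambda + \rho \bigl( J_c(\pi) - d \bigr) \bigr\}
```

The quadratic term is what adds curvature around the constraint boundary, which matches the abstract's description of locally convexifying the energy landscape.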