Arxiv Papers of Today

Generated: 2026-01-29 16:43:27 (UTC+8); arXiv release time: 2026-01-29 20:00 EST (2026-01-30 09:00 UTC+8)

33 relevant papers today

Keyword: reinforcement learning

Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

Authors: Yi Hu, Jiaqi Gu, Ruxin Wang, Zijun Yao, Hao Peng, Xiaobao Wu, Jianhui Chen, Muhan Zhang, Liangming Pan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.19928
Pdf link: https://arxiv.org/pdf/2601.19928
Abstract Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. Finally, we discuss under-explored challenges to outline a roadmap for future mechanistic studies, including the need for applied interpretability, improved methodologies, and a unified theoretical framework.

E2HiL: Entropy-Guided Sample Selection for Efficient Real-World Human-in-the-Loop Reinforcement Learning

Authors: Haoyuan Deng, Yuanjiang Xue, Haoyang Du, Boyang Zhou, Zhenyu Wu, Ziwei Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19969
Pdf link: https://arxiv.org/pdf/2601.19969
Abstract Human-in-the-loop guidance has emerged as an effective approach for enabling faster convergence in online reinforcement learning (RL) of complex real-world manipulation tasks. However, existing human-in-the-loop RL (HiL-RL) frameworks often suffer from low sample efficiency, requiring substantial human interventions to achieve convergence and thereby leading to high labor costs. To address this, we propose a sample-efficient real-world human-in-the-loop RL framework named E2HiL, which requires fewer human interventions by actively selecting informative samples. Specifically, stable reduction of policy entropy enables an improved trade-off between exploration and exploitation with higher sample efficiency. We first build influence functions of different samples on the policy entropy, which is efficiently estimated by the covariance of action probabilities and soft advantages of policies. Then we select samples with moderate values of influence functions, where shortcut samples that induce sharp entropy drops and noisy samples with negligible effect are pruned. Extensive experiments on four real-world manipulation tasks demonstrate that E2HiL achieves a 42.1% higher success rate while requiring 10.1% fewer human interventions compared to the state-of-the-art HiL-RL method, validating its effectiveness. The project page providing code, videos, and mathematical formulations is linked from the arXiv abstract page.
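
The selection rule in the E2HiL abstract (estimate each sample's influence on policy entropy from a covariance of action probabilities and advantages, then keep the moderate ones) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the per-sample influence is assumed to be the centered product of log-probability and advantage, and the quantile thresholds `low_q` and `high_q` are hypothetical parameters.

```python
import numpy as np

def select_moderate_influence(log_probs, advantages, low_q=0.2, high_q=0.8):
    """Return (kept indices, per-sample influence).

    Assumption: a sample's influence on policy entropy is approximated by its
    contribution to the covariance between log-probabilities and advantages.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Centered product: one covariance term per sample.
    influence = (log_probs - log_probs.mean()) * (advantages - advantages.mean())
    magnitude = np.abs(influence)
    lo, hi = np.quantile(magnitude, [low_q, high_q])
    # Prune near-zero ("noisy") and very large ("shortcut") contributions.
    keep = (magnitude >= lo) & (magnitude <= hi)
    return np.flatnonzero(keep), influence
```

Quantile thresholds are used here only because they need no tuning to the scale of the advantages; the paper may define "moderate" differently.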

Distributional value gradients for stochastic environments

Authors: Baptiste Debes, Tinne Tuytelaars
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20071
Pdf link: https://arxiv.org/pdf/2601.20071
Abstract Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.
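
The abstract describes a sample-based framework built on Max-sliced Maximum Mean Discrepancy. As a simplified point of reference, the sketch below computes the ordinary (not max-sliced) squared MMD between two sample sets with a Gaussian kernel. The kernel choice and `bandwidth` are assumptions, and the max-sliced variant would additionally search over projection directions, which is not shown.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between rows of a and rows of b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

In a distributional Bellman setting, `x` would hold samples from the predicted return distribution and `y` samples from the bootstrapped target; that mapping is an interpretation of the abstract, not a detail it states.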

Techno-economic optimization of a heat-pipe microreactor, part II: multi-objective optimization analysis

Authors: Paul Seurin, Dean Price
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Arxiv link: https://arxiv.org/abs/2601.20079
Pdf link: https://arxiv.org/pdf/2601.20079
Abstract Heat-pipe microreactors (HPMRs) are compact and transportable nuclear power systems exhibiting inherent safety, well-suited for deployment in remote regions where access is limited and reliance on costly fossil fuels is prevalent. In prior work, we developed a design optimization framework that incorporates techno-economic considerations through surrogate modeling and reinforcement learning (RL)-based optimization, focusing solely on minimizing the levelized cost of electricity (LCOE) by using a bottom-up cost estimation approach. In this study, we extend that framework to a multi-objective optimization that uses the Pareto Envelope Augmented with Reinforcement Learning (PEARL) algorithm. The objectives include minimizing both the rod-integrated peaking factor ($F_{\Delta h}$) and LCOE, subject to safety and operational constraints. We evaluate three cost scenarios: (1) high-cost axial and drum reflectors, (2) a low-cost axial reflector, and (3) low-cost axial and drum reflectors. Our findings indicate that reducing the solid moderator radius, pin pitch, and drum coating angle, all while increasing the fuel height, effectively lowers $F_{\Delta h}$. Across all three scenarios, four key strategies consistently emerged for optimizing LCOE: (1) minimizing the axial reflector contribution when costly, (2) reducing control drum reliance, (3) substituting expensive tri-structural isotropic (TRISO) fuel with axial reflector material priced at the level of graphite, and (4) maximizing fuel burnup. While PEARL demonstrates promise in navigating trade-offs across diverse design scenarios, discrepancies between surrogate model predictions and full-order simulations remain. Further improvements are anticipated through constraint relaxation and surrogate development, constituting an ongoing area of investigation.
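
The two-objective setup in the abstract (minimize both $F_{\Delta h}$ and LCOE) means candidate designs are compared by Pareto dominance rather than by a single score. The sketch below filters a list of designs down to the non-dominated set. It illustrates the concept only and is not the PEARL algorithm; the design tuples in the usage note are made-up numbers.

```python
def pareto_front(points):
    """Return the non-dominated points, with both objectives minimized.

    Each point is a tuple (f_dh, lcoe). A point is dominated when another
    point is no worse in both objectives and strictly better in at least one.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front
```

For example, with the hypothetical designs `[(1.30, 120.0), (1.25, 140.0), (1.40, 110.0), (1.35, 130.0)]`, the last design is dropped because the first one is better in both objectives, while the other three remain as trade-offs.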

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Authors: Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20088
Pdf link: https://arxiv.org/pdf/2601.20088
Abstract This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
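
The core of QAD as described is a KL divergence loss between a full-precision teacher and a quantized student. A minimal NumPy sketch of that loss on raw logits follows. The direction KL(teacher || student) is an assumption, since the abstract only says "a KL divergence loss", and quantization itself is not modeled here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_distillation_loss(teacher_logits, student_logits):
    """Mean over positions of KL(teacher || student) between token distributions."""
    t = log_softmax(np.asarray(teacher_logits, dtype=float))
    s = log_softmax(np.asarray(student_logits, dtype=float))
    # sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))
```

In training, `student_logits` would come from the NVFP4-quantized model and gradients would flow only through the student; the loss is zero exactly when both distributions agree.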

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

Authors: Darshan Deshpande, Anand Kannappan, Rebecca Qian
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20103
Pdf link: https://arxiv.org/pdf/2601.20103
Abstract Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 in its highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.

In-Context Reinforcement Learning From Suboptimal Historical Data

Authors: Juncheng Dong, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20116
Pdf link: https://arxiv.org/pdf/2601.20116
Abstract Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
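
The weighted maximum likelihood step in the DIT abstract can be illustrated with a small sketch. The abstract does not give the weight formula, so the exponential form `exp(advantage / temperature)` used below is an assumption borrowed from advantage-weighted regression, and `temperature` and `max_weight` are hypothetical parameters.

```python
import numpy as np

def weighted_mle_loss(log_probs, advantages, temperature=1.0, max_weight=20.0):
    """Advantage-weighted negative log-likelihood over a batch of actions.

    log_probs:  log pi(a_i | context_i) under the policy being trained.
    advantages: estimated advantages of the logged actions.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Larger advantage -> larger weight; clip to keep the loss stable.
    weights = np.minimum(np.exp(advantages / temperature), max_weight)
    return float(-(weights * log_probs).sum() / weights.sum())
```

With all advantages equal, the weights are uniform and the loss reduces to plain imitation learning, which is the baseline the abstract says performs poorly on suboptimal data.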

A Reinforcement Learning Based Universal Sequence Design for Polar Codes

Authors: David Kin Wai Ho, Arman Fazeli, Mohamad M. Mansour, Louay M. A. Jalloul
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20118
Pdf link: https://arxiv.org/pdf/2601.20118
Abstract To advance Polar code design for 6G applications, we develop a reinforcement learning-based universal sequence design framework that is extensible and adaptable to diverse channel conditions and decoding strategies. Crucially, our method scales to code lengths up to $2048$, making it suitable for use in standardization. Across all $(N,K)$ configurations supported in 5G, our approach achieves competitive performance relative to the NR sequence adopted in 5G and yields up to a 0.2 dB gain over the beta-expansion baseline at $N=2048$. We further highlight the key elements that enabled learning at scale: (i) incorporation of physical law constrained learning grounded in the universal partial order property of Polar codes, (ii) exploitation of the weak long term influence of decisions to limit lookahead evaluation, and (iii) joint multi-configuration optimization to increase learning efficiency.
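
For context on the beta-expansion baseline mentioned in the abstract, the sketch below builds a reliability sequence by the polarization-weight rule: each channel index is scored by the sum of beta**j over the set bits j of its binary expansion, and indices are sorted by that score. The value beta = 2**(1/4) is the one commonly quoted for this construction and is treated as an assumption here. This is the baseline only, not the paper's learned sequence.

```python
def beta_expansion_sequence(n_bits, beta=2 ** 0.25):
    """Channel indices 0..2**n_bits-1, ordered from least to most reliable."""
    def weight(index):
        # Sum beta**j over every bit position j that is set in the index.
        return sum(((index >> j) & 1) * beta ** j for j in range(n_bits))
    return sorted(range(2 ** n_bits), key=weight)

def information_set(n_bits, k):
    """Indices of the k most reliable channels, in ascending index order."""
    seq = beta_expansion_sequence(n_bits)
    return sorted(seq[len(seq) - k:])
```

For N = 8 (`n_bits=3`) this gives the order 0, 1, 2, 4, 3, 5, 6, 7, and `beta_expansion_sequence(11)` covers the N = 2048 length discussed in the abstract.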

Rewarding Intellectual Humility: Learning When Not To Answer In Large Language Models

Authors: Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20126
Pdf link: https://arxiv.org/pdf/2601.20126
Abstract Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure ($-1$, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs $\approx -0.25$ to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is linked from the arXiv abstract page.
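
The ternary reward structure ($-1$, r_abs, 1) from the abstract maps directly onto a small scoring function. The sketch below is one possible reading: the exact-match comparison and the abstention phrase check are hypothetical simplifications, and real answer verification would be benchmark-specific.

```python
def ternary_reward(answer, gold, r_abs=0.0, abstain_phrase="i don't know"):
    """Return +1 for a correct answer, r_abs for abstention, -1 otherwise."""
    normalized = answer.strip().lower()
    if normalized == abstain_phrase:
        return r_abs
    return 1.0 if normalized == gold.strip().lower() else -1.0
```

Setting `r_abs` between the wrong-answer and right-answer rewards makes abstaining preferable to guessing only when the model's chance of being correct is low; the abstract reports that values around -0.25 to 0.3 worked well on multiple-choice tasks.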

Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

Authors: Zhipeng Zhang, Wenting Ma, Kai Li, Meng Guo, Lei Yang, Wei Yu, Hongji Cui, Yichen Zhang, Mo Zhang, Jinzhe Lin, Zhenjie Yao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20193
Pdf link: https://arxiv.org/pdf/2601.20193
Abstract Robust reinforcement learning methods typically focus on suppressing unreliable experiences or corrupted rewards, but they lack the ability to reason about the reliability of their own learning process. As a result, such methods often either overreact to noise by becoming overly conservative or fail catastrophically when uncertainty accumulates. In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery. Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong robustness baselines.

Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

Authors: Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20209
Pdf link: https://arxiv.org/pdf/2601.20209
Abstract Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.

Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

Authors: Hang Zhang, Ruheng Wang, Yuelyu Ji, Mingu Kwak, Xizhi Wu, Chenyu Li, Li Zhang, Wenqi Shi, Yifan Peng, Yanshan Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20221
Pdf link: https://arxiv.org/pdf/2601.20221
Abstract Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce $\method$, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts training data distribution. Across four medical reasoning benchmarks, $\method$ achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA by 32.0% relative to the base generator in particular. Crucially, $\method$ demonstrates an 8× reduction in sampling budget requirement compared to prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.

Proactive SFC Provisioning with Forecast-Driven DRL in Data Centers

Authors: Parisa Fard Moshiri, Poonam Lohan, Burak Kantarci, Emil Janulewicz
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20229
Pdf link: https://arxiv.org/pdf/2601.20229
Abstract Service Function Chaining (SFC) requires efficient placement of Virtual Network Functions (VNFs) to satisfy diverse service requirements while maintaining high resource utilization in Data Centers (DCs). Conventional static resource allocation often leads to overprovisioning or underprovisioning due to the dynamic nature of traffic loads and application demands. To address this challenge, we propose a hybrid forecast-driven Deep Reinforcement Learning (DRL) framework that combines predictive intelligence with SFC provisioning. Specifically, we leverage DRL to generate datasets capturing DC resource utilization and service demands, which are then used to train deep learning forecasting models. Using Optuna-based hyperparameter optimization, the best-performing models, Spatio-Temporal Graph Neural Network, Temporal Graph Neural Network, and Long Short-Term Memory, are combined into an ensemble to enhance stability and accuracy. The ensemble predictions are integrated into the DC selection process, enabling proactive placement decisions that consider both current and future resource availability. Experimental results demonstrate that the proposed method not only sustains high acceptance ratios for resource-intensive services such as Cloud Gaming and VoIP but also significantly improves acceptance ratios for latency-critical categories: Augmented Reality increases from 30% to 50%, while Industry 4.0 improves from 30% to 45%. Consequently, the prediction-based model achieves significantly lower E2E latencies, with reductions of 20.5%, 23.8%, and 34.8% for VoIP, Video Streaming, and Cloud Gaming, respectively. This strategy ensures more balanced resource allocation and reduces contention.
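
The proactive placement idea in the abstract (combine several forecasts, then pick a data center using both current and predicted utilization) can be sketched as below. Everything concrete here is an assumption for illustration: the ensemble is a plain average, utilization is a fraction of a unit capacity, and the selection rule picks the DC with the most headroom after taking the worse of current and forecast load.

```python
def ensemble_forecast(predictions):
    """Average per-DC utilization forecasts from several models.

    predictions: list of lists, one inner list of per-DC values per model.
    """
    n_models = len(predictions)
    return [sum(values) / n_models for values in zip(*predictions)]

def select_dc(current_util, forecast_util, demand, capacity=1.0):
    """Return the index of the DC with the most headroom, or None if none fits."""
    best, best_headroom = None, -1.0
    for dc, (now, future) in enumerate(zip(current_util, forecast_util)):
        # Be conservative: plan against the higher of current and predicted load.
        headroom = capacity - max(now, future)
        if headroom >= demand and headroom > best_headroom:
            best, best_headroom = dc, headroom
    return best
```

A purely reactive rule would look only at `current_util`; using the forecast as well can steer a request away from a DC that is idle now but expected to fill up.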

Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

Authors: Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20305
Pdf link: https://arxiv.org/pdf/2601.20305
Abstract Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.

CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Authors: Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20327
Pdf link: https://arxiv.org/pdf/2601.20327
Abstract Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
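
The Best-of-N setting highlighted in the abstract is where a pointwise reward model is convenient: each candidate is scored independently and the top-scoring one is kept, so N candidates need N scoring calls rather than pairwise comparisons. A minimal sketch follows; `score_fn` stands in for the reward model and is a hypothetical callable.

```python
def best_of_n(candidates, score_fn):
    """Return (best candidate, its score); ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scored = [(score_fn(c), -i) for i, c in enumerate(candidates)]
    best_score, neg_index = max(scored)
    return candidates[-neg_index], best_score
```

For a quick check, any function returning a number can play the role of the scorer, for example `best_of_n(["a", "abc", "ab"], len)` selects `"abc"`.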

PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

Authors: Zhuang Chen, Dazhen Wan, Zhangkai Zheng, Guanqun Bi, Xiyao Xiao, Binghang Li, Minlie Huang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20330
Pdf link: https://arxiv.org/pdf/2601.20330
Abstract While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs' performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.
中文摘要 虽然大型语言模型在心理健康护理中展现出前景，但由于咨询具有非结构化和纵向的性质，评估其治疗能力仍具挑战性。我们认为，当前的评估范式存在一种无锚定缺陷，导致两种不稳定性：过程漂移，即缺乏引导的来访者模拟偏离具体的咨询目标；以及标准漂移，即静态逐点评分缺乏稳定性，难以实现可靠判断。为此，我们引入了PsychePass，一个统一框架，通过轨迹锚定的锦标赛校准LLM的治疗能力。我们首先在模拟中锚定交互轨迹，由来访者精确控制灵活多变的咨询过程，以探查多方面能力。随后，我们通过高效的瑞士制锦标赛在评判中锚定对战轨迹，利用动态的成对对战得到稳健的Elo评分。除了排名，我们还展示了锦标赛轨迹可以转化为可信的奖励信号，使同策略（on-policy）强化学习能够提升LLM的性能。大量实验验证了PsychePass的有效性及其与人类专家判断的高度一致性。
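
The Elo ratings mentioned in this abstract are derived from pairwise battle outcomes. A minimal sketch of the standard Elo update is given below; the K-factor of 32, the 400-point scale, and the function names are conventional or illustrative choices, not details taken from PsychePass.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k (assumed value) controls how far a single result moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # Whatever A gains, B loses, so the rating sum is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated systems; A wins one battle.
    print(elo_update(1500.0, 1500.0, 1.0))
```

In a Swiss-system tournament, each round would pair systems with similar current ratings and apply this update once per battle.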

MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

MARE：通过视觉语言模型实现可解释的深度伪造检测的多模态对齐与强化

Authors: Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.20433
Pdf link: https://arxiv.org/pdf/2601.20433
Abstract Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.
中文摘要 深度伪造检测是一个广泛研究的话题，对于打击恶意内容的传播至关重要，现有方法主要将问题建模为分类或空间定位。生成模型的快速发展对深度伪造检测提出了新的要求。本文提出通过视觉语言模型实现可解释的深度伪造检测的多模态比对与强化，称为MARE，旨在提升视觉语言模型（VLMs）在深度伪造检测与推理中的准确性和可靠性。具体来说，MARE设计了全面的奖励函数，结合了人类反馈强化学习（RLHF），以激励生成符合人类偏好的文本空间对齐推理内容。此外，MARE引入了伪造解缠模块，用于从高层面部语义中捕捉内在伪造痕迹，从而提升其真实性检测能力。我们对MARE产生的推理内容进行了全面评估。定量和定性实验结果都表明，MARE在准确性和可靠性方面达到了最先进的性能。

PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

PEARL：多跳工具使用的计划探索与自适应强化学习

Authors: Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20439
Pdf link: https://arxiv.org/pdf/2601.20439
Abstract Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
中文摘要 大型语言模型在使用外部工具方面展现出巨大潜力，但在复杂的多轮工具调用中面临重大挑战。它们常表现出规划能力薄弱、工具幻觉、参数生成错误，并且难以实现稳健的交互。为解决这些问题，我们提出了PEARL，一个新颖的框架，旨在增强LLM在复杂工具使用中的规划与执行能力。PEARL采用两阶段方法：离线阶段，智能体探索工具以学习有效的使用模式和失败条件；以及在线强化学习阶段。在在线阶段，专用的规划器（Planner）通过组相对策略优化（GRPO）进行训练，并配备精心设计的奖励函数，为规划质量提供区分明确的信号。在ToolHop和T-Eval基准上的实验显示，PEARL显著优于现有方法，在ToolHop上取得了56.5%的新的最先进成功率，同时保持了较低的调用错误率。我们的工作标志着在解决工具使用的复杂规划挑战方面的重要进展，有助于开发更稳健、更可靠的基于LLM的智能体。

Fair Recourse for All: Ensuring Individual and Group Fairness in Counterfactual Explanations

公平救济：确保反事实解释中的个人和群体公平

Authors: Fatima Ezzeddine, Obaida Ammar, Silvia Giordano, Omran Ayoub
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20449
Pdf link: https://arxiv.org/pdf/2601.20449
Abstract Explainable Artificial Intelligence (XAI) is becoming increasingly essential for enhancing the transparency of machine learning (ML) models. Among the various XAI techniques, counterfactual explanations (CFs) hold a pivotal role due to their ability to illustrate how changes in input features can alter an ML model's decision, thereby offering actionable recourse to users. Ensuring that individuals with comparable attributes and those belonging to different protected groups (e.g., demographic) receive similar and actionable recourse options is essential for trustworthy and fair decision-making. In this work, we address this challenge directly by focusing on the generation of fair CFs. Specifically, we start by defining and formulating fairness at: 1) individual fairness, ensuring that similar individuals receive similar CFs, 2) group fairness, ensuring equitable CFs across different protected groups and 3) hybrid fairness, which accounts for both individual and broader group-level fairness. We formulate the problem as an optimization task and propose a novel model-agnostic, reinforcement learning based approach to generate CFs that satisfy fairness constraints at both the individual and group levels, two objectives that are usually treated as orthogonal. As fairness metrics, we extend existing metrics commonly used for auditing ML models, such as equal choice of recourse and equal effectiveness across individuals and groups. We evaluate our approach on three benchmark datasets, showing that it effectively ensures individual and group fairness while preserving the quality of the generated CFs in terms of proximity and plausibility, and quantify the cost of fairness in the different levels separately. Our work opens a broader discussion on hybrid fairness and its role and implications for XAI and beyond CFs.
中文摘要 可解释人工智能（XAI）对于提升机器学习（ML）模型的透明度正变得越来越重要。在各种XAI技术中，反事实解释（CF）发挥着关键作用，因为它们能够说明输入特征的变化如何改变机器学习模型的决策，从而为用户提供可操作的救济（recourse）。确保具有相似属性的个体以及属于不同受保护群体（如人口统计群体）的个体获得相似且可操作的救济选项，对于可信和公正的决策至关重要。在本研究中，我们直接应对这一挑战，聚焦于公平CF的生成。具体来说，我们首先在以下层面定义并形式化公平性：1）个体公平，确保相似个体获得相似的CF；2）群体公平，确保不同受保护群体之间的CF是公平的；3）混合公平，兼顾个体及更广泛的群体层面的公平。我们将该问题表述为优化任务，并提出一种新颖的、模型无关的、基于强化学习的方法，以生成同时满足个体和群体层面公平约束的CF，而这两个目标通常被视为正交。作为公平性指标，我们扩展了常用于审计机器学习模型的现有指标，如个体和群体间的救济选择平等和效果平等。我们在三个基准数据集上评估了我们的方法，表明它有效确保了个体和群体的公平性，同时在接近性和合理性方面保持了所生成CF的质量，并分别量化了不同层面的公平性代价。我们的工作开启了关于混合公平性及其对XAI乃至CF之外领域的作用和影响的更广泛讨论。

Inequality in Congestion Games with Learning Agents

拥塞博弈中学习智能体带来的不平等

Authors: Dimitris Michailidis, Sennay Ghebreab, Fernando P. Santos
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20578
Pdf link: https://arxiv.org/pdf/2601.20578
Abstract Who benefits from expanding transport networks? While designed to improve mobility, such interventions can also create inequality. In this paper, we show that disparities arise not only from the structure of the network itself but also from differences in how commuters adapt to it. We model commuters as reinforcement learning agents who adapt their travel choices at different learning rates, reflecting unequal access to resources and information. To capture potential efficiency-fairness tradeoffs, we introduce the Price of Learning (PoL), a measure of inefficiency during learning. We analyze both a stylized network -- inspired by the well-known Braess's paradox, yet with two source nodes -- and an abstraction of a real-world metro system (Amsterdam). Our simulations show that network expansions can simultaneously increase efficiency and amplify inequality, especially when faster learners disproportionately benefit from new routes before others adapt. These results highlight that transport policies must account not only for equilibrium outcomes but also for the heterogeneous ways commuters adapt, since both shape the balance between efficiency and fairness.
中文摘要 谁从扩展交通网络中受益？虽然这些干预旨在改善流动性，但也可能制造不平等。本文指出，差异不仅源于网络结构本身，还源于通勤者适应网络的方式差异。我们将通勤者建模为强化学习主体，他们会以不同的学习率调整出行选择，反映资源和信息获取的不平等。为了捕捉潜在的效率与公平权衡，我们引入了学习价格（Price of Learning，PoL），这是衡量学习过程中低效率的指标。我们分析了一个风格化的网络——灵感来自著名的布拉斯悖论，但采用了双源节点——以及一个现实世界地铁系统的抽象（阿姆斯特丹）。我们的模拟显示，网络扩展可以同时提高效率，同时加剧不平等，尤其是在学习速度较快的人优先从新路径中获益时，其他人还未适应。这些结果凸显了交通政策不仅要考虑均衡结果，还要考虑通勤者适应的异质性，因为这两者共同决定效率与公平之间的平衡。
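
For readers unfamiliar with Braess's paradox, the classic textbook instance can be worked through numerically. The sketch below uses the usual single-source example (4000 drivers, a congestible edge costing flow/100 minutes, a fixed edge costing 45 minutes); these numbers and function names are illustrative assumptions and are not the two-source network studied in the paper.

```python
def travel_time_without_shortcut(n_drivers: int) -> float:
    """Equilibrium travel time when the two routes are symmetric.

    Route 1: congestible edge (flow / 100) then fixed edge (45).
    Route 2: fixed edge (45) then congestible edge (flow / 100).
    By symmetry, drivers split evenly at equilibrium.
    """
    flow_per_route = n_drivers / 2
    return flow_per_route / 100 + 45


def travel_time_with_shortcut(n_drivers: int) -> float:
    """Equilibrium travel time after adding a zero-cost link between the routes.

    Each congestible edge costs at most n_drivers / 100 = 40 < 45 here, so every
    driver prefers congestible edge -> shortcut -> congestible edge.
    """
    return n_drivers / 100 + 0 + n_drivers / 100


if __name__ == "__main__":
    n = 4000
    print("without shortcut:", travel_time_without_shortcut(n), "minutes")
    print("with shortcut:   ", travel_time_with_shortcut(n), "minutes")
```

Adding the free link raises everyone's equilibrium travel time from 65 to 80 minutes, which is the counterintuitive effect that motivates studying how commuters adapt to network expansions.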

Ranking-aware Reinforcement Learning for Ordinal Ranking

序数排序的排名感知强化学习

Authors: Aiming Hao, Chen Zhu, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20585
Pdf link: https://arxiv.org/pdf/2601.20585
Abstract Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.
中文摘要 序数回归和排序具有挑战性，因为其固有的序数依赖关系是传统方法难以建模的。我们提出了排序感知强化学习（RARL），这是一种新颖的强化学习框架，能够显式地学习这些关系。RARL的核心是一个统一目标，它协同整合了回归和排序学习（L2R），使两项任务相互促进。这一目标由一个排序感知的可验证奖励驱动，该奖励联合评估回归精度和排序准确性，从而通过策略优化直接更新模型。为进一步提升训练效果，我们引入了响应变异操作（RMO），通过注入受控噪声来改善探索并防止在鞍点处停滞。在三个不同基准上的大量实验验证了RARL的有效性。

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

越难越好：通过难度感知的GRPO和多方面问题重构提升数学推理能力

Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20614
Pdf link: https://arxiv.org/pdf/2601.20614
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）为增强大型模型的数学推理能力提供了一种稳健的机制。然而，我们发现，无论从算法还是数据角度看，现有方法都系统性地缺乏对更具挑战性问题的重视，尽管这些问题对完善尚未充分发展的能力十分重要。从算法角度看，广泛使用的组相对策略优化（GRPO）存在隐性不平衡，即对于更难的问题，策略更新的幅度更低。从数据角度看，增强方法主要通过改写问题来提高多样性，而没有系统性地增加内在难度。为解决这些问题，我们提出了一个双对偶的MathForge框架，从两个角度针对更难的问题来提升数学推理能力，该框架包括难度感知组策略优化（DGPO）算法和多方面问题重构（MQR）策略。具体来说，DGPO首先通过难度平衡的组优势估计来纠正GRPO中的隐性不平衡，并通过难度感知的问题级加权进一步优先处理更难的问题。与此同时，MQR从多个方面重构问题，以提升难度，同时保持原始的标准答案不变。总体而言，MathForge形成了一个协同循环：MQR扩展了数据前沿，DGPO则有效地从增强数据中学习。大量实验表明，MathForge在各种数学推理任务中显著优于现有方法。代码和增强数据都在这个 https URL 上。
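
One way to see an imbalance of the kind described for GRPO is to compute the standard group-normalized advantages for binary rewards and sum their magnitudes. This is a minimal sketch of the commonly used GRPO advantage, assuming population standard deviation and a group size of 8; it is not the paper's DGPO, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-relative advantages: (r_i - group mean) / group std.

    Uses the population standard deviation (an assumption; implementations vary).
    Returns zeros when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def total_abs_advantage(n_correct: int, group_size: int = 8) -> float:
    """Sum of |advantage| over a group with n_correct successes (binary rewards)."""
    rewards = [1.0] * n_correct + [0.0] * (group_size - n_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))


if __name__ == "__main__":
    for n_correct in range(0, 9):
        print(n_correct, round(total_abs_advantage(n_correct), 4))
```

With binary rewards and success rate p in a group of size G, the total works out to 2G·sqrt(p(1-p)), so it peaks at p = 0.5 and shrinks as questions become harder (p near 0) or easier (p near 1).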

P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

P2S：面向通用领域推理问答的概率过程监督

Authors: Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun, Yi Wang, Xiaozhong Liu, Kun Kuang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20649
Pdf link: https://arxiv.org/pdf/2601.20649
Abstract While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT's suffix, given the model's current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical Question Answering benchmarks show that P2S significantly outperforms strong baselines.
中文摘要 虽然可验证奖励强化学习（RLVR）在数学和编程等结构化领域推动了LLM推理的发展，但由于缺乏可验证的奖励信号，其在通用领域推理任务中的应用仍然充满挑战。为此，出现了如参考概率奖励强化学习（RLPR）等方法，利用生成最终答案的概率作为奖励信号。然而，这些以结果为导向的方法忽视了对推理过程本身至关重要的逐步监督。为弥补这一空白，我们引入了概率过程监督（P2S），一种新颖的自监督框架，无需单独的奖励模型或人工标注的推理步骤，即可提供细粒度的过程奖励。在强化学习过程中，P2S合成并过滤出高质量的参考推理链（gold-CoT）。我们方法的核心是为每个推理步骤计算路径忠实度奖励（PFR），该奖励由在给定模型当前推理前缀的条件下生成gold-CoT后缀的条件概率得出。关键是，该PFR可以灵活地与任何基于结果的奖励结合，通过提供密集的指导直接解决奖励稀疏问题。在阅读理解和医学问答基准上的大量实验显示，P2S显著优于强基线。

GPO: Growing Policy Optimization for Legged Robot Locomotion and Whole-Body Control

GPO：面向腿式机器人运动与全身控制的增长式策略优化

Authors: Shuhao Liao, Peizhuo Li, Xinrong Yang, Linnan Chang, Zhaoxin Fan, Qing Wang, Lei Shi, Yuhong Cao, Wenjun Wu, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.20668
Pdf link: https://arxiv.org/pdf/2601.20668
Abstract Training reinforcement learning (RL) policies for legged robots remains challenging due to high-dimensional continuous actions, hardware constraints, and limited exploration. Existing methods for locomotion and whole-body control work well for position-based control with environment-specific heuristics (e.g., reward shaping, curriculum design, and manual initialization), but are less effective for torque-based control, where sufficiently exploring the action space and obtaining informative gradient signals for training is significantly more difficult. We introduce Growing Policy Optimization (GPO), a training framework that applies a time-varying action transformation to restrict the effective action space in the early stage, thereby encouraging more effective data collection and policy learning, and then progressively expands it to enhance exploration and achieve higher expected return. We prove that this transformation preserves the PPO update rule and introduces only bounded, vanishing gradient distortion, thereby ensuring stable training. We evaluate GPO on both quadruped and hexapod robots, including zero-shot deployment of simulation-trained policies on hardware. Policies trained with GPO consistently achieve better performance. These results suggest that GPO provides a general, environment-agnostic optimization framework for learning legged locomotion.
中文摘要 由于高维连续动作、硬件限制和有限的探索，为腿式机器人训练强化学习（RL）策略仍具挑战性。现有的运动和全身控制方法在借助环境特定的启发式方法（如奖励塑造、课程设计和手动初始化）时，对基于位置的控制效果良好，但对于基于力矩的控制效果较差，因为在后者中充分探索动作空间并获得有信息量的梯度信号要困难得多。我们提出了增长策略优化（GPO），这是一个训练框架，它应用随时间变化的动作变换，在早期阶段限制有效动作空间，从而促进更有效的数据收集和策略学习，随后逐步扩大该空间，以增强探索并实现更高的期望回报。我们证明该变换保持了PPO更新规则，并且仅引入有界且逐渐消失的梯度畸变，从而确保训练稳定。我们在四足和六足机器人上评估了GPO，包括将仿真训练的策略零样本部署到硬件上。采用GPO训练的策略始终取得更好的性能。这些结果表明GPO为学习腿式运动提供了一个通用的、与环境无关的优化框架。
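
The idea of a time-varying action transformation that starts with a restricted action range and gradually expands it can be sketched as follows. The linear schedule, the initial scale of 0.2, the tanh squashing, and all names here are assumptions for illustration only; the abstract does not specify the transformation GPO actually uses.

```python
import math


def action_scale(step: int, total_steps: int, initial_scale: float = 0.2) -> float:
    """Hypothetical schedule: grows linearly from initial_scale to 1.0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_scale + (1.0 - initial_scale) * progress


def transform_action(raw_action: float, step: int, total_steps: int,
                     max_torque: float = 1.0) -> float:
    """Squash the raw policy output, then shrink it by the current scale.

    Early in training the reachable torque range is
    [-initial_scale * max_torque, initial_scale * max_torque];
    by the end it is the full [-max_torque, max_torque].
    """
    return action_scale(step, total_steps) * max_torque * math.tanh(raw_action)


if __name__ == "__main__":
    for step in (0, 250, 500, 1000):
        print(step, round(transform_action(3.0, step, 1000), 4))
```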

Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models

正向-无标签强化学习蒸馏，适用于本地小型模型

Authors: Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Xiaobo Xia, Ming-Kun Xie, Dong-Dong Wu, Biao Liu, Yuheng Jia, Xin Geng, Masashi Sugiyama, Tat-Seng Chua
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20687
Pdf link: https://arxiv.org/pdf/2601.20687
Abstract Due to constraints on privacy, cost, and latency, on-premise deployment of small models is increasingly common. However, most practical pipelines stop at supervised fine-tuning (SFT) and fail to reach the reinforcement learning (RL) alignment stage. The main reason is that RL alignment typically requires either expensive human preference annotation or heavy reliance on high-quality reward models with large-scale API usage and ongoing engineering maintenance, both of which are ill-suited to on-premise settings. To bridge this gap, we propose a positive-unlabeled (PU) RL distillation method for on-premise small-model deployment. Without human-labeled preferences or a reward model, our method distills the teacher's preference-optimization capability from black-box generations into a locally trainable student. For each prompt, we query the teacher once to obtain an anchor response, locally sample multiple student candidates, and perform anchor-conditioned self-ranking to induce pairwise or listwise preferences, enabling a fully local training loop via direct preference optimization or group relative policy optimization. Theoretical analysis justifies that the induced preference signal by our method is order-consistent and concentrates on near-optimal candidates, supporting its stability for preference optimization. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
中文摘要 由于隐私、成本和延迟的限制，小型模型的本地部署越来越普遍。然而，大多数实用的流水线停留在监督微调（SFT）阶段，未能进入强化学习（RL）对齐阶段。主要原因是强化学习对齐通常需要昂贵的人工偏好标注，或高度依赖高质量的奖励模型，而后者需要大规模使用API和持续的工程维护，这两者都不适合本地环境。为弥合这一差距，我们提出了一种用于本地小模型部署的正-无标记（PU）强化学习蒸馏方法。在没有人工标注偏好或奖励模型的情况下，我们的方法将教师的偏好优化能力从黑箱生成结果中蒸馏到可本地训练的学生模型中。对于每个提示，我们只向教师查询一次以获得锚点响应，在本地采样多个学生候选回答，并进行以锚点为条件的自排序以诱导成对或列表式偏好，从而通过直接偏好优化或组相对策略优化实现完全本地化的训练循环。理论分析证明，我们方法诱导的偏好信号是顺序一致的，并集中于接近最优的候选回答，支持其用于偏好优化的稳定性。实验表明，我们的方法在低成本环境下能够持续取得强劲的性能。

One Step Is Enough: Dispersive MeanFlow Policy Optimization

一步就够了：色散平均流策略优化

Authors: Guowei Zou, Haitao Wang, Hejun Wu, Yukun Qian, Yuhang Wang, Weibing Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.20701
Pdf link: https://arxiv.org/pdf/2601.20701
Abstract Real-time robotic control demands fast action generation. However, existing generative policies based on diffusion and flow matching require multi-step sampling, fundamentally limiting deployment in time-critical scenarios. We propose Dispersive MeanFlow Policy Optimization (DMPO), a unified framework that enables true one-step generation through three key components: MeanFlow for mathematically-derived single-step inference without knowledge distillation, dispersive regularization to prevent representation collapse, and reinforcement learning (RL) fine-tuning to surpass expert demonstrations. Experiments across RoboMimic manipulation and OpenAI Gym locomotion benchmarks demonstrate competitive or superior performance compared to multi-step baselines. With our lightweight model architecture and the three key algorithmic components working in synergy, DMPO exceeds real-time control requirements (>120Hz) with 5-20x inference speedup, reaching hundreds of Hertz on high-performance GPUs. Physical deployment on a Franka-Emika-Panda robot validates real-world applicability.
中文摘要 实时机器人控制需要快速的动作生成。然而，现有的基于扩散和流匹配的生成式策略需要多步采样，从根本上限制了其在时间关键场景中的部署。我们提出了色散平均流策略优化（DMPO），这是一个统一框架，通过三个关键组成部分实现真正的一步生成：MeanFlow，用于经数学推导的单步推断而无需知识蒸馏；色散正则化，用于防止表示坍缩；以及强化学习（RL）微调，用于超越专家演示。在RoboMimic操作和OpenAI Gym运动基准上的实验显示，与多步基线相比，其性能具有竞争力或更优。凭借我们轻量化的模型架构和三个关键算法组件的协同作用，DMPO超过了实时控制要求（>120Hz），推理速度提升5-20倍，在高性能GPU上可达数百赫兹。在Franka-Emika-Panda机器人上的实际部署验证了其现实适用性。

Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions

使强化学习智能体的行为适应变化的动作空间和奖励函数

Authors: Raul de la Rosa, Ivana Dusparic, Nicolas Cardozo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20714
Pdf link: https://arxiv.org/pdf/2601.20714
Abstract Reinforcement Learning (RL) agents often struggle in real-world applications where environmental conditions are non-stationary, particularly when reward functions shift or the available action space expands. This paper introduces MORPHIN, a self-adaptive Q-learning framework that enables on-the-fly adaptation without full retraining. By integrating concept drift detection with dynamic adjustments to learning and exploration hyperparameters, MORPHIN adapts agents to changes in both the reward function and on-the-fly expansions of the agent's action space, while preserving prior policy knowledge to prevent catastrophic forgetting. We validate our approach using a Gridworld benchmark and a traffic signal control simulation. The results demonstrate that MORPHIN achieves superior convergence speed and continuous adaptation compared to a standard Q-learning baseline, improving learning efficiency by up to 1.7x.
中文摘要 强化学习（RL）智能体在环境条件非固定的现实应用中常常遇到困难，尤其是在奖励函数变化或可用动作空间扩大时。本文介绍了MORPHIN，一种自适应的Q学习框架，能够实现无需完全重新训练的即时适应。通过将概念漂移检测与对学习和探索超参数的动态调整相结合，MORPHIN 使智能体适应奖励函数和智能体动作空间的动态扩展，同时保留先前的策略知识以防止灾难性遗忘。我们通过Gridworld基准测试和交通信号控制模拟验证了我们的方法。结果表明，MORPHIN 相比标准 Q 学习基线实现了更优越的收敛速度和持续适应性，学习效率提升了高达 1.7 倍。
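
Expanding a tabular Q-learning agent's action space without discarding what it has already learned can be sketched as below. This only illustrates the general idea of keeping existing Q-values and appending entries for new actions; the dictionary layout, the zero initialization, and the function names are assumptions, and MORPHIN's drift detection and hyperparameter adjustment are not modeled.

```python
def expand_action_space(q_table, n_new_actions, init_value=0.0):
    """Return a new Q-table with n_new_actions extra columns per state.

    Existing Q-values are copied unchanged so prior policy knowledge is kept;
    new actions start at init_value (an assumed choice).
    """
    return {
        state: list(values) + [init_value] * n_new_actions
        for state, values in q_table.items()
    }


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One standard Q-learning update, applied in place."""
    best_next = max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])


if __name__ == "__main__":
    q = {"s0": [1.0, 2.0], "s1": [0.0, 0.0]}
    q = expand_action_space(q, n_new_actions=1)
    q_update(q, "s0", action=2, reward=1.0, next_state="s1")
    print(q)
```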

GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

GraphAllocBench：偏好条件多目标策略学习的灵活基准

Authors: Zhiheng Jiang, Yunzhe Wang, Ryan Marr, Ellen Novoseller, Benjamin T. Files, Volkan Ustun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20753
Pdf link: https://arxiv.org/pdf/2601.20753
Abstract Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics -- Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) -- that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: this https URL
中文摘要 多目标强化学习（MORL）中的偏好条件策略学习（PCPL）旨在通过让策略以用户指定的目标偏好为条件，来近似多样化的帕累托最优解。这使得单一模型能够在运行时生成位于帕累托前沿上或其附近的策略，从而灵活地适应任意权衡。然而，现有的PCPL基准大多局限于玩具任务和固定环境，限制了其真实性和可扩展性。为弥补这一空白，我们推出了GraphAllocBench，这是一个灵活的基准，建立在一个受城市管理启发的新型基于图的资源分配沙盒环境之上，我们称该环境为CityPlannerEnv。GraphAllocBench 提供了丰富的问题集，具有多样化的目标函数、不同的偏好条件和高维可扩展性。我们还提出了两个新的评估指标，即非支配解比例（PNDS）和排序评分（OS），它们直接反映偏好一致性，同时补充了广泛使用的超体积（hypervolume）指标。通过多层感知器（MLP）和图感知模型的实验，我们表明GraphAllocBench揭示了现有MORL方法的局限性，并为在复杂的高维组合分配任务中使用图神经网络等基于图的方法铺平了道路。除了预设的问题集外，GraphAllocBench 还使用户能够灵活调整目标、偏好和分配规则，使其成为推进 PCPL 的多功能且可扩展的基准。代码：这个 https URL
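
As background for a metric such as the Proportion of Non-Dominated Solutions, the fraction of solutions that no other solution Pareto-dominates can be computed as follows. This is a generic sketch assuming every objective is maximized; the abstract does not give the exact PNDS definition, so the function names and this reading of the metric are assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fraction(points):
    """Fraction of points that are not Pareto-dominated by any other point."""
    if not points:
        return 0.0
    kept = [p for p in points if not any(dominates(q, p) for q in points)]
    return len(kept) / len(points)


if __name__ == "__main__":
    # Each tuple is the (objective_1, objective_2) return of one policy.
    returns = [(1, 2), (2, 1), (0, 0), (2, 2)]
    print(non_dominated_fraction(returns))
```

In the example, (2, 2) dominates the other three points, so only one of four solutions lies on the Pareto front.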

Less is More: Clustered Cross-Covariance Control for Offline RL

少即是多：离线强化学习的聚类交叉协方差控制

Authors: Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20765
Pdf link: https://arxiv.org/pdf/2601.20765
Abstract A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance induced bias within each update. We prove that buffer partitioning preserves the lower bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy constrained offline reinforcement learning. Empirically, our method showcases higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.
中文摘要 离线强化学习中的一个根本挑战是分布偏移。稀缺的数据或以分布外（OOD）区域为主的数据集会加剧这一问题。我们的理论分析和实验表明，标准平方误差目标会诱导有害的TD交叉协方差。这种效应在OOD区域被放大，使优化产生偏差并损害策略学习。为对抗这一机制，我们开发了两种互补策略。其一是分区缓冲区采样，它将更新限制在局部的回放分区内，减弱不规则的协方差效应，并对齐更新方向，由此形成一种易于与现有实现集成的方案，即TD的聚类交叉协方差控制（C^4）。其二，我们还引入了一个显式的基于梯度的校正惩罚，用于在每次更新中抵消协方差引起的偏差。我们证明缓冲区划分保持了最大化目标的下界性质，并且这些约束缓解了极端OOD区域中的过度保守，同时不改变策略约束离线强化学习的核心行为。从实证角度看，我们的方法比以往方法展现出更高的稳定性和高达30%的回报提升，尤其是在小数据集和强调OOD区域的划分上。
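
Partitioned buffer sampling, in the sense of drawing each update batch from a single localized replay partition, might look like the sketch below. It assumes cluster labels for the transitions have already been computed by some clustering step; how C^4 forms its partitions is not stated in the abstract, and all names here are hypothetical.

```python
import random
from collections import defaultdict


def partitioned_sample(transitions, cluster_ids, batch_size, rng):
    """Sample one batch from a single, randomly chosen replay partition.

    transitions: list of stored transitions (any objects).
    cluster_ids: one precomputed cluster label per transition (assumed given).
    Returns (chosen_cluster, batch), sampling with replacement inside the partition.
    """
    partitions = defaultdict(list)
    for transition, cluster in zip(transitions, cluster_ids):
        partitions[cluster].append(transition)
    chosen = rng.choice(sorted(partitions))
    pool = partitions[chosen]
    return chosen, [rng.choice(pool) for _ in range(batch_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(6))            # stand-ins for transitions
    labels = [0, 0, 0, 1, 1, 1]      # stand-in cluster labels
    print(partitioned_sample(data, labels, batch_size=4, rng=rng))
```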

SERA: Soft-Verified Efficient Repository Agents

SERA：软验证高效仓库代理

Authors: Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.20789
Pdf link: https://arxiv.org/pdf/2601.20789
Abstract Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2's Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.
中文摘要 开放权重的编码智能体相较于闭源系统应具有根本优势：它们可以针对私有代码库进行专门化，直接在其权重中编码仓库特定的信息。然而，训练的成本和复杂性使这一优势一直停留在理论上。我们证明它现在已经变得可行。我们提出软验证高效仓库智能体（SERA），这是一种高效的编码智能体训练方法，能够快速且低成本地创建专门针对私有代码库的智能体。仅通过监督微调（SFT），SERA 在完全开源（开放数据、方法、代码）的模型中取得了最先进的结果，同时性能可媲美 Devstral-Small-2 等前沿开放权重模型。在达到同等性能的前提下，创建SERA模型的成本是强化学习的1/26，是以往合成数据方法的1/57。我们的方法，软验证生成（SVG），可从单一代码仓库生成数千条轨迹。结合其成本效益，这使得针对私有代码库的专门化成为可能。除了仓库专门化之外，我们还将SVG应用于更大规模的代码库语料，生成超过20万条合成轨迹。我们利用该数据集详细分析了训练编码智能体的缩放规律、消融实验和混杂因素。总体而言，我们相信我们的工作将极大加速对开放编码智能体的研究，并展示能够针对私有代码库进行专门化的开源模型的优势。我们将SERA作为Ai2 Open Coding Agents系列的第一个模型发布，同时发布我们的全部代码、数据和Claude Code集成，以支持研究社区。

Reinforcement Learning via Self-Distillation

通过自我蒸馏进行强化学习

Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.20802
Pdf link: https://arxiv.org/pdf/2601.20802
Abstract Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
中文摘要 大型语言模型越来越多地在代码和数学等可验证领域中通过强化学习进行后训练。然而，当前的可验证奖励强化学习（RLVR）方法仅从每次尝试的标量结果奖励中学习，造成严重的信用分配瓶颈。许多可验证的环境实际上提供了丰富的文本反馈，如运行时错误或评判结果，用以解释一次尝试为何失败。我们将此设定形式化为带丰富反馈的强化学习，并提出了自蒸馏策略优化（SDPO），它将token化的反馈转化为密集的学习信号，无需任何外部教师或显式奖励模型。SDPO将以反馈为条件的当前模型视为自我教师，并将其结合反馈后的下一token预测蒸馏回策略中。通过这种方式，SDPO利用了模型在上下文中事后识别自身错误的能力。在科学推理、工具使用以及LiveCodeBench v6上的竞赛编程中，SDPO相比强RLVR基线提升了样本效率和最终准确率。值得注意的是，在仅返回标量反馈的标准RLVR环境中，SDPO通过将成功的rollout用作失败尝试的隐式反馈，同样优于基线。最后，在测试时对单个问题应用SDPO可以加速在困难的二元奖励任务上的发现，以三分之一的尝试次数达到与best-of-k采样或多轮对话相同的发现概率。

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

通过失败前缀条件化在饱和问题上训练推理模型

Authors: Minwu Kim, Safal Shrestha, Keith Ross
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.20829
Pdf link: https://arxiv.org/pdf/2601.20829
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
中文摘要 可验证奖励强化学习（RLVR）显著提升了大型语言模型（LLM）的推理能力，但随着问题趋于饱和，训练往往陷入停滞。我们认为核心挑战在于有信息量的失败难以触及：学习信号确实存在，但在标准rollout中很少出现。为此，我们提出失败前缀条件化，这是一种从饱和问题中学习的简单而有效的方法。我们的方法不从原始问题出发，而是以罕见的错误推理轨迹所得到的前缀作为训练条件，从而重新分配探索，使模型接触到容易出错的状态。我们观察到，失败前缀条件化带来的性能提升与在中等难度问题上训练相当，同时保持了词元（token）效率。此外，我们分析了模型的鲁棒性，发现该方法减轻了在误导性失败前缀下的性能退化，但在遵循正确的早期推理方面有轻微的折损。最后，我们证明，在训练过程中刷新失败前缀的迭代方法能够在性能进入平台期后带来额外收益。总体而言，我们的结果表明，失败前缀条件化为在饱和问题上延续RLVR训练提供了一条有效途径。
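
The data-construction step described in this abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say how prefixes are cut, so the fixed `prefix_fraction`, the whitespace-joined token format, and the function name `build_failure_prefix_prompts` are assumptions rather than the authors' procedure.

```python
def build_failure_prefix_prompts(question, rollouts, prefix_fraction=0.5):
    """Turn rare incorrect rollouts into new training prompts.

    rollouts: list of (tokens, is_correct) pairs sampled for one question.
    Each incorrect rollout is truncated to its first `prefix_fraction` of
    tokens (hypothetical rule) and appended to the question, so training
    starts from a failure-prone state instead of the bare question.
    """
    prompts = []
    for tokens, is_correct in rollouts:
        if is_correct or not tokens:
            continue  # only incorrect, non-empty trajectories yield prefixes
        cut = max(1, int(len(tokens) * prefix_fraction))
        prompts.append(question + " " + " ".join(tokens[:cut]))
    return prompts


# Toy saturated problem: three rollouts, only one of them wrong.
rollouts = [
    (["x", "=", "2", "so", "answer", "4"], True),
    (["x", "=", "3", "so", "answer", "9"], False),
    (["x", "=", "2", "thus", "4"], True),
]
prompts = build_failure_prefix_prompts("Q: what is x squared?", rollouts)
```

The iterative variant mentioned in the abstract would correspond to re-sampling rollouts from the updated policy and calling this step again to refresh the prefixes.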

End-to-end example-based sim-to-real RL policy transfer based on neural stylisation with application to robotic cutting

基于神经风格化的端到端模拟到现实的强化学习策略转移，并应用于机器人切割

Authors: Jamie Hathaway, Alireza Rastegarpanah, Rustam Stolkin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.20846
Pdf link: https://arxiv.org/pdf/2601.20846
Abstract Whereas reinforcement learning has been applied with success to a range of robotic control problems in complex, uncertain environments, reliance on extensive data - typically sourced from simulation environments - limits real-world deployment due to the domain gap between simulated and physical systems, coupled with limited real-world sample availability. We propose a novel method for sim-to-real transfer of reinforcement learning policies, based on a reinterpretation of neural style transfer from image processing to synthesise novel training data from unpaired unlabelled real world datasets. We employ a variational autoencoder to jointly learn self-supervised feature representations for style transfer and generate weakly paired source-target trajectories to improve physical realism of synthesised trajectories. We demonstrate the application of our approach based on the case study of robot cutting of unknown materials. Compared to baseline methods, including our previous work, CycleGAN, and conditional variational autoencoder-based time series translation, our approach achieves improved task completion time and behavioural stability with minimal real-world data. Our framework demonstrates robustness to geometric and material variation, and highlights the feasibility of policy adaptation in challenging contact-rich tasks where real-world reward information is unavailable.
中文摘要 虽然强化学习已成功应用于复杂且不确定环境中的各种机器人控制问题，但它依赖大量数据（通常来自仿真环境），加之仿真系统与物理系统之间存在领域差距、真实世界样本有限，限制了其实际部署。我们提出了一种新的强化学习策略模拟到现实迁移方法，该方法重新诠释了图像处理中的神经风格迁移，从未配对、未标注的真实世界数据集中合成新的训练数据。我们使用变分自编码器联合学习用于风格迁移的自监督特征表示，并生成弱配对的源-目标轨迹，以提升合成轨迹的物理真实性。我们以机器人切割未知材料的案例研究展示了该方法的应用。与基线方法（包括我们之前的工作、CycleGAN以及基于条件变分自编码器的时间序列转换）相比，我们的方法仅需极少的真实世界数据即可改善任务完成时间和行为稳定性。我们的框架对几何和材料变化表现出鲁棒性，并表明在无法获得真实世界奖励信息的高难度、接触密集型任务中进行策略适应是可行的。
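
For readers unfamiliar with neural style transfer, the classic image-domain style loss compares Gram matrices of feature maps. The sketch below applies that idea to time-series features, assuming one feature vector per time step. It illustrates the general technique the abstract refers to, not the loss used in this paper; the feature values and function names are hypothetical.

```python
def gram_matrix(features):
    """Channel-by-channel Gram matrix of a trajectory, averaged over time.

    features: list of T time steps, each a list of C feature values.
    """
    steps = len(features)
    channels = len(features[0])
    return [
        [sum(f[i] * f[j] for f in features) / steps for j in range(channels)]
        for i in range(channels)
    ]


def style_loss(source_features, target_features):
    """Mean squared difference between the two Gram matrices."""
    gs = gram_matrix(source_features)
    gt = gram_matrix(target_features)
    channels = len(gs)
    total = sum(
        (gs[i][j] - gt[i][j]) ** 2
        for i in range(channels)
        for j in range(channels)
    )
    return total / (channels * channels)


# Hypothetical 3-step, 2-channel features from a simulated and a real trajectory.
sim = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
real = [[2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
mismatch = style_loss(sim, real)
```

Because the Gram matrix sums over time steps, it captures channel co-activation statistics and ignores ordering, which is why style losses of this kind do not require paired source and target samples.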

Keyword: diffusion policy

There is no result