Arxiv Papers of Today

Generated: 2026-01-13 16:37:20 (UTC+8); Arxiv release time: 2026-01-13 20:00 EST (2026-01-14 09:00 UTC+8)

74 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

Authors: Hanyu Li, Jiangshan Duo, Bofei Gao, Hailin Zhang, Sujian Li, Xiaotie Deng, Liang Zhao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06052
Pdf link: https://arxiv.org/pdf/2601.06052
Abstract Chain-of-thought reasoning in large language models often creates an "overthinking trap," leading to excessive computational cost and latency for unreliable accuracy gains. Prior work has typically relied on global, static controls that risk penalizing necessary reasoning. We introduce a sample-level, soft reinforcement learning compression method that penalizes inefficiently long rollouts, but only on problems that the model has already mastered and for which it has already produced a more concise rollout. Our experiments show that this method reduces average response length by 20-40% with comparable or higher accuracy. Crucially, the compression exhibits strong cross-domain generalization; a model trained on math spontaneously shortens responses on unseen tasks like code, instruction following, and general knowledge QA, with stable or improved accuracy. We demonstrate a stable post-training curriculum (accuracy-compression-accuracy) that can ultimately produce models that are more accurate and reason more concisely, arguing that such a compression method should be a standard phase in developing efficient reasoning models.
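
As a rough illustration of the sample-level idea in the abstract above, the sketch below penalizes long correct rollouts only when the problem looks mastered and a shorter correct rollout exists in the same group. The function name, the mastery threshold, and the penalty formula are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List


def compressed_rewards(correct: List[bool], lengths: List[int],
                       mastery_threshold: float = 0.8,
                       penalty_scale: float = 0.5) -> List[float]:
    """Hypothetical per-problem rewards with a soft length penalty.

    correct[i] / lengths[i] describe the i-th rollout sampled for one problem.
    """
    base = [1.0 if c else 0.0 for c in correct]
    accuracy = sum(base) / len(base)
    # Leave unmastered problems untouched so necessary reasoning is not penalized.
    if accuracy == 0.0 or accuracy < mastery_threshold:
        return base
    # Shortest correct rollout acts as the "more concise" reference (assumption).
    shortest = min(n for c, n in zip(correct, lengths) if c)
    rewards = []
    for c, n, r in zip(correct, lengths, base):
        if c and n > shortest:
            # Soft penalty grows with the fraction of tokens beyond the reference.
            r -= penalty_scale * (n - shortest) / n
        rewards.append(r)
    return rewards
```

With four correct rollouts of lengths 100, 200, 100, and 400, only the second and fourth are penalized; a problem solved in one of four rollouts keeps plain 0/1 rewards.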

Deep Q-Network Based Resilient Drone Communication: Neutralizing First-Order Markov Jammers

Authors: Andrii Grekhov, Volodymyr Kharchenko, Vasyl Kondratiuk
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06095
Pdf link: https://arxiv.org/pdf/2601.06095
Abstract A deep reinforcement learning based solution for anti-jamming communications using Frequency Hopping Spread Spectrum technology in a 16-channel radio environment is presented. A Deep Q-Network based transmitter continuously selects the next frequency-hopping channel while facing a first-order reactive jammer, which uses observed transition statistics to predict and interrupt transmissions. Through self-training, the proposed agent learns a uniform random frequency-hopping policy that effectively neutralizes the predictive advantage of the jammer. In the presence of Rayleigh fading and additive noise, the impact of forward error correction with Bose-Chaudhuri-Hocquenghem (BCH) codes is systematically evaluated, demonstrating that even moderate redundancy significantly reduces packet loss. Extensive visualization of the learning dynamics, channel utilization distribution, epsilon-greedy decay, cumulative reward, BER and SNR evolution, and detailed packet loss tables confirms convergence to a near-optimal anti-jamming strategy. The results provide a practical framework for autonomous resilient communications in modern electronic warfare scenarios.
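
To see why a uniform random hopping policy blunts a first-order predictor, the toy simulation below pits two fixed hopping policies against a jammer that always jams the most frequently observed successor of the current channel. It is only a sketch: the jammer model, the function names, and the step count are assumptions, and it includes neither the DQN agent nor fading, noise, or BCH coding.

```python
import random

NUM_CHANNELS = 16  # matches the 16-channel setting in the abstract


def jam_rate(policy, steps=20000, seed=0):
    """Fraction of hops hit by a toy first-order Markov jammer."""
    rng = random.Random(seed)
    # counts[a][b] = how often the transmitter hopped from channel a to b.
    counts = [[0] * NUM_CHANNELS for _ in range(NUM_CHANNELS)]
    prev, jammed = 0, 0
    for _ in range(steps):
        row = counts[prev]
        # Jammer guesses the most frequent successor of the current channel.
        guess = max(range(NUM_CHANNELS), key=row.__getitem__)
        nxt = policy(prev, rng)
        jammed += guess == nxt
        row[nxt] += 1  # jammer updates its transition statistics
        prev = nxt
    return jammed / steps


def sequential(prev, rng):
    """Predictable policy: always hop to the next channel."""
    return (prev + 1) % NUM_CHANNELS


def uniform(prev, rng):
    """Uniform random hopping: the next channel is independent of the past."""
    return rng.randrange(NUM_CHANNELS)
```

In this toy setting the sequential hopper is jammed on nearly every hop once the jammer has seen one full cycle, while the uniform hopper is jammed about one time in sixteen, which is the best the jammer can do against an unpredictable transmitter.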

The Impact of Post-training on Data Contamination

Authors: Muhammed Yusuf Kocyigit, Caglar Yildirim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06103
Pdf link: https://arxiv.org/pdf/2601.06103
Abstract We present a controlled study of how dataset contamination interacts with the post-training stages now standard in large language model training pipelines. Starting from clean checkpoints of Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), we inject five copies of GSM8K and MBPP test items into the first 2B tokens of an otherwise 25B token extended pre-training dataset. We then compare the contaminated and clean models both immediately after pre-training and again after two popular post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL) with group relative policy optimization (GRPO). The applied post-training steps do not have any contamination. Across math and coding benchmarks, we find three consistent patterns: (i) Contamination causes performance spikes that are gradually diminished with continued pre-training. After even 25B tokens the apparent performance inflation of contamination can become close to zero. (ii) Both SFT and GRPO resurface the leaked information, but with different external validity: SFT inflates scores only on the contaminated tasks, whereas GRPO also inflates performance on uncontaminated counterparts (GSMPlus, HumanEval). (iii) Model scale amplifies these tendencies: larger SFT models memorize more, while larger GRPO models translate leakage into more generalizable capabilities. Our results underscore the need for contamination audits after post-training and suggest that RL-based post-training, although not immune, can help alleviate contamination-related over-estimation problems.

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

Authors: Tarun Raheja, Nilay Pochhi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.06108
Pdf link: https://arxiv.org/pdf/2601.06108
Abstract Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of alternatives -- Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), Kahneman-Tversky Optimization (KTO), Simple Preference Optimization (SimPO), and many others -- has left practitioners without clear guidance on method selection. This survey provides a theoretical unification of preference learning methods, revealing that the apparent diversity reduces to principled choices along three orthogonal axes: (I) Preference Model (what likelihood model underlies the objective), (II) Regularization Mechanism (how deviation from reference policies is controlled), and (III) Data Distribution (online vs. offline learning and coverage requirements). We formalize each axis with precise definitions and theorems, establishing key results including the coverage separation between online and offline methods, scaling laws for reward overoptimization, and conditions under which direct alignment methods fail. Our analysis reveals that failure modes -- length hacking, mode collapse, likelihood displacement -- arise from specific, predictable combinations of design choices. We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection. The framework transforms preference learning from an empirical art into a theoretically grounded discipline.
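
As a small illustration of how two of the methods named above differ mainly in the loss applied to the same quantity, the sketch below computes the commonly published forms of the DPO and IPO objectives from one log-ratio margin. The function names and default hyperparameters are assumptions, and the formulas follow the original DPO and IPO formulations rather than anything stated in this survey's abstract.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Policy-vs-reference log-ratio of the chosen response minus the rejected one."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(margin, beta=0.1):
    """DPO: logistic (Bradley-Terry) loss, -log(sigmoid(beta * margin))."""
    x = beta * margin
    # Numerically stable evaluation of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def ipo_loss(margin, tau=0.1):
    """IPO: squared loss pulling the margin toward the fixed target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2
```

The DPO loss keeps shrinking as the margin grows, while the IPO loss is minimized at a finite margin, which is one way to read the "preference model" and "regularization" axes in concrete terms.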

COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

Authors: Canming Xia, Peixi Peng, Guang Tan, Zhan Su, Haoran Xu, Zhenxian Liu, Luntong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06122
Pdf link: https://arxiv.org/pdf/2601.06122
Abstract Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves the stability of training by quantifying the inconsistency of sampling actions via return signals of RL. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.

A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control

Authors: Wonhyeok Choi, Minwoo Choi, Jungwan Woo, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.06133
Pdf link: https://arxiv.org/pdf/2601.06133
Abstract Diffusion policies have emerged as a powerful approach for robotic control, demonstrating superior expressiveness in modeling multimodal action distributions compared to conventional policy networks. However, their integration with online reinforcement learning remains challenging due to fundamental incompatibilities between diffusion model training objectives and standard RL policy improvement mechanisms. This paper presents the first comprehensive review and empirical analysis of current Online Diffusion Policy Reinforcement Learning (Online DPRL) algorithms for scalable robotic control systems. We propose a novel taxonomy that categorizes existing approaches into four distinct families -- Action-Gradient, Q-Weighting, Proximity-Based, and Backpropagation Through Time (BPTT) methods -- based on their policy improvement mechanisms. Through extensive experiments on a unified NVIDIA Isaac Lab benchmark encompassing 12 diverse robotic tasks, we systematically evaluate representative algorithms across five critical dimensions: task diversity, parallelization capability, diffusion step scalability, cross-embodiment generalization, and environmental robustness. Our analysis identifies key findings regarding the fundamental trade-offs inherent in each algorithmic family, particularly concerning sample efficiency and scalability. Furthermore, we reveal critical computational and algorithmic bottlenecks that currently limit the practical deployment of online DPRL. Based on these findings, we provide concrete guidelines for algorithm selection tailored to specific operational constraints and outline promising future research directions to advance the field toward more general and scalable robotic learning systems.

HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants

Authors: Hailong Li, Feifei Li, Wenhui Que, Xingyu Fan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06152
Pdf link: https://arxiv.org/pdf/2601.06152
Abstract Large language models (LLMs) power many interactive systems such as chatbots, customer-service agents, and personal assistants. In knowledge-intensive scenarios requiring user-specific personalization, conventional retrieval-augmented generation (RAG) pipelines exhibit limited memory capacity and insufficient coordination between retrieval mechanisms and user-specific conversational history, leading to redundant clarification, irrelevant documents, and degraded user experience. Inspired by the hippocampus-neocortex memory mechanism, we propose HiMeS, an AI-assistant architecture that fuses short-term and long-term memory. Our contributions are fourfold: (1) A short-term memory extractor is trained end-to-end with reinforcement learning to compress recent dialogue and proactively pre-retrieve documents from the knowledge base, emulating the cooperative interaction between the hippocampus and prefrontal cortex. (2) A partitioned long-term memory network stores user-specific information and re-ranks retrieved documents, simulating distributed cortical storage and memory reactivation. (3) On a real-world industrial dataset, HiMeS significantly outperforms a cascaded RAG baseline on question-answering quality. (4) Ablation studies confirm the necessity of both memory modules and suggest a practical path toward more reliable, context-aware, user-customized LLM-based assistants.

TIR-Flow: Active Video Search and Reasoning with Frozen VLMs

Authors: Hongbo Jin, Siyi Xie, Jiayu Ding, Kuanwei Lin, Ge Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.06176
Pdf link: https://arxiv.org/pdf/2601.06176
Abstract While Large Video-Language Models (Video-LLMs) have achieved remarkable progress in perception, their reasoning capabilities remain a bottleneck. Existing solutions typically resort to a heavy "data engineering" paradigm -- synthesizing large-scale Chain-of-Thought (CoT) datasets followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). This pipeline primarily optimizes probability sampling efficiency and aligns output distributions, but fails to activate the intrinsic intelligence required for dynamic visual exploration. In this work, we propose TIR-Flow, a novel framework that shifts the paradigm from passive processing to active video searching and reasoning without additional data or parameter updating. Concretely, our framework operates through three synergistic modules: HDD decomposes complex queries into a set of verifiable sub-tasks; HAP actively directs visual attention to gather high-resolution evidence for hypothesis validation; EBA maintains a persistent workspace to accumulate and update the discovered clues for logical reasoning. Extensive experiments on seven benchmarks demonstrate that TIR-Flow significantly outperforms recent strong baselines, delivering an average performance boost of 5.9%, with gains reaching 10.5% on Egoschema. Our analysis confirms that empowering frozen VLMs with System-2-like active perception is a scalable path toward solving long-horizon video reasoning.

TimeGNN-Augmented Hybrid-Action MARL for Fine-Grained Task Partitioning and Energy-Aware Offloading in MEC

Authors: Wei Ai, Yun Peng, Yuntao Shou, Tao Meng, Keqin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06191
Pdf link: https://arxiv.org/pdf/2601.06191
Abstract With the rapid growth of IoT devices and latency-sensitive applications, the demand for both real-time and energy-efficient computing has surged, placing significant pressure on traditional cloud computing architectures. Mobile edge computing (MEC), an emerging paradigm, effectively alleviates the load on cloud centers and improves service quality by offloading computing tasks to edge servers closer to end users. However, the limited computing resources, non-continuous power provisioning (e.g., battery-powered nodes), and highly dynamic systems of edge servers complicate efficient task scheduling and resource allocation. To address these challenges, this paper proposes a multi-agent deep reinforcement learning algorithm, TG-DCMADDPG, and constructs a collaborative computing framework for multiple edge servers, aiming to achieve joint optimization of fine-grained task partitioning and offloading. This approach incorporates a temporal graph neural network (TimeGNN) to model and predict time series of multi-dimensional server state information, thereby reducing the frequency of online interactions and improving policy predictability. Furthermore, a multi-agent deterministic policy gradient algorithm (DC-MADDPG) in a discrete-continuous hybrid action space is introduced to collaboratively optimize task partitioning ratios, transmission power, and priority scheduling strategies. Extensive simulation experiments confirm that TG-DCMADDPG achieves markedly faster policy convergence, superior energy-latency optimization, and higher task completion rates compared with existing state-of-the-art methods, underscoring its robust scalability and practical effectiveness in dynamic and constrained MEC scenarios.

Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness

Authors: Edward C. Cheng, Jeshua Cheng, Alice Siu
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06223
Pdf link: https://arxiv.org/pdf/2601.06223
Abstract This paper presents a conceptual and operational framework for developing and operating safe and trustworthy AI agents based on a Three-Pillar Model grounded in transparency, accountability, and trustworthiness. Building on prior work in Human-in-the-Loop systems, reinforcement learning, and collaborative AI, the framework defines an evolutionary path toward autonomous agents that balances increasing automation with appropriate human oversight. The paper argues that safe agent autonomy must be achieved through progressive validation, analogous to the staged development of autonomous driving, rather than through immediate full automation. Transparency and accountability are identified as foundational requirements for establishing user trust and for mitigating known risks in generative AI systems, including hallucinations, data bias, and goal misalignment, such as the inversion problem. The paper further describes three ongoing work streams supporting this framework: public deliberation on AI agents conducted by the Stanford Deliberative Democracy Lab, cross-industry collaboration through the Safe AI Agent Consortium, and the development of open tooling for an agent operating environment aligned with the Three-Pillar Model. Together, these contributions provide both conceptual clarity and practical guidance for enabling the responsible evolution of AI agents that operate transparently, remain aligned with human values, and sustain societal trust.

Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

Authors: Miao Pan, Wangjie Gan, Jintao Chen, Wenqi Zhang, Bing Sun, Jianwei Yin, Xuhong Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.06224
Pdf link: https://arxiv.org/pdf/2601.06224
Abstract While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework comprising three core modules. First, we enhance visual localization by introducing dedicated planning and captioning stages before the reasoning phase, employing a quality-based caption reward to ensure accurate initial anchoring. Second, to improve exploration, we categorize samples based on the mean and variance of their reward distributions, prioritizing samples with high variance to focus the model on diverse and informative data. Finally, to mitigate sample interference, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss to push overly similar pairs apart and pull dissimilar ones closer, thereby guiding gradient interactions toward a balanced range. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
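
The second module described above, prioritizing samples by the variance of their reward distributions, can be pictured with a short sketch. The function name, the dictionary input format, and the use of population variance are assumptions for illustration only; the abstract does not give the exact categorization rule.

```python
from statistics import mean, pvariance


def prioritize_by_reward_variance(reward_groups):
    """Rank prompts by the variance of their rollout rewards, highest first.

    reward_groups maps a (hypothetical) prompt id to the rewards of its rollouts.
    Returns a list of (prompt_id, mean_reward, reward_variance) tuples.
    """
    stats = [(name, mean(r), pvariance(r)) for name, r in reward_groups.items()]
    # High variance means the rollouts disagree, so the prompt is informative.
    return sorted(stats, key=lambda s: s[2], reverse=True)
```

A prompt whose rollouts all receive the same reward has zero variance and ends up last, since it carries no contrast between good and bad outputs.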

Walk the PLANC: Physics-Guided RL for Agile Humanoid Locomotion on Constrained Footholds

Authors: Min Dai, William D. Compton, Junheng Li, Lizhi Yang, Aaron D. Ames
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.06286
Pdf link: https://arxiv.org/pdf/2601.06286
Abstract Bipedal humanoid robots must precisely coordinate balance, timing, and contact decisions when locomoting on constrained footholds such as stepping stones, beams, and planks -- even minor errors can lead to catastrophic failure. Classical optimization and control pipelines handle these constraints well but depend on highly accurate mathematical representations of terrain geometry, making them prone to error when perception is noisy or incomplete. Meanwhile, reinforcement learning has shown strong resilience to disturbances and modeling errors, yet end-to-end policies rarely discover the precise foothold placement and step sequencing required for discontinuous terrain. These contrasting limitations motivate approaches that guide learning with physics-based structure rather than relying purely on reward shaping. In this work, we introduce a locomotion framework in which a reduced-order stepping planner supplies dynamically consistent motion targets that steer the RL training process via Control Lyapunov Function (CLF) rewards. This combination of structured footstep planning and data-driven adaptation produces accurate, agile, and hardware-validated stepping-stone locomotion on a humanoid robot, substantially improving reliability compared to conventional model-free reinforcement-learning baselines.

How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?

Authors: Yufeng Wang, Lu Wei, Lin Liu, Hao Xu, Haibin Ling
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06289
Pdf link: https://arxiv.org/pdf/2601.06289
Abstract Mass spectrometry (MS) is a powerful analytical technique for identifying small molecules, yet determining complete molecular structures directly from tandem mass spectra (MS/MS) remains a long-standing challenge due to complex fragmentation patterns and the vast diversity of chemical space. Recent progress in large language models (LLMs) has shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. In this work, we introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. We formalize expert chemists' reasoning steps -- such as double bond equivalent (DBE) analysis, neutral loss identification, and fragment assembly -- into structured prompts and assess multiple state-of-the-art LLMs (Claude-3.5-Sonnet, GPT-4o-mini, and Llama-3 series) in a zero-shot setting using the MassSpecGym dataset. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions. These findings highlight both the interpretive potential and the current limitations of LLM-based reasoning for molecular elucidation, providing a foundation for future work that combines domain knowledge and reinforcement learning to achieve chemically grounded AI reasoning.
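
For readers unfamiliar with the double bond equivalent (DBE) step mentioned above, the snippet below applies the standard textbook formula DBE = (2C + 2 + N - H - X) / 2, where X counts halogens and oxygen does not contribute. The function name and signature are illustrative choices, not part of the paper's prompting framework.

```python
def double_bond_equivalents(c, h, n=0, halogens=0):
    """Rings plus pi bonds implied by a molecular formula (oxygen and sulfur are ignored)."""
    return (2 * c + 2 + n - h - halogens) / 2


# Benzene, C6H6: one ring plus three double bonds.
benzene_dbe = double_bond_equivalents(c=6, h=6)
# Caffeine, C8H10N4O2: two rings plus four double bonds.
caffeine_dbe = double_bond_equivalents(c=8, h=10, n=4)
```

Here `benzene_dbe` evaluates to 4.0 and `caffeine_dbe` to 6.0, which is the kind of constraint a chemist uses to rule out candidate structures before looking at fragments.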

Future-as-Label: Scalable Supervision from Real-World Outcomes

Authors: Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skothiem
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06336
Pdf link: https://arxiv.org/pdf/2601.06336
Abstract Many real-world prediction problems lack labels observable at prediction time, creating a temporal gap between prediction and outcome that yields supervision only after events resolve. To address this setting, we extend reinforcement learning with verifiable rewards to temporally resolved real-world prediction, and use it to train language models to make probabilistic forecasts under causally masked information with retrospective evaluation using proper scoring rules. Supervision is derived solely from post-resolution outcomes, preserving delayed-reward semantics. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.
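
The Brier score cited above is a proper scoring rule for probabilistic forecasts: the mean squared difference between the forecast probability and the 0/1 outcome, with lower being better. The sketch below shows the computation on made-up forecasts; the function name and the example numbers are illustrative and unrelated to the paper's data.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical forecasts for three resolved events (1 = happened, 0 = did not).
outcomes = [1, 0, 1]
hedged = brier_score([0.5, 0.5, 0.5], outcomes)  # always says 50%
sharp = brier_score([0.9, 0.2, 0.7], outcomes)   # confident and mostly right
```

Always answering 50% scores exactly 0.25, so any forecaster that is both confident and well calibrated should land below that value, as `sharp` does here.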

Dynamic Incentivized Cooperation under Changing Rewards

Authors: Philipp Altmann, Thomy Phan, Maximilian Zorn, Claudia Linnhoff-Popien, Sven Koenig
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.06382
Pdf link: https://arxiv.org/pdf/2601.06382
Abstract Peer incentivization (PI) is a popular multi-agent reinforcement learning approach where all agents can reward or penalize each other to achieve cooperation in social dilemmas. Despite their potential for scalable cooperation, current PI methods heavily depend on fixed incentive values that need to be appropriately chosen with respect to the environmental rewards and thus are highly sensitive to their changes. Therefore, they fail to maintain cooperation under changing rewards in the environment, e.g., caused by modified specifications, varying supply and demand, or sensory flaws -- even when the conditions for mutual cooperation remain the same. In this paper, we propose Dynamic Reward Incentives for Variable Exchange (DRIVE), an adaptive PI approach to cooperation in social dilemmas with changing rewards. DRIVE agents reciprocally exchange reward differences to incentivize mutual cooperation in a completely decentralized way. We show how DRIVE achieves mutual cooperation in the general Prisoner's Dilemma and empirically evaluate DRIVE in more complex sequential social dilemmas with changing rewards, demonstrating its ability to achieve and maintain cooperation, in contrast to current state-of-the-art PI methods.
中文摘要 同伴激励（PI）是一种流行的多智能体强化学习方法，所有智能体可以相互奖励或惩罚，以实现社会困境中的合作。尽管具有可扩展合作潜力，当前的PI方法高度依赖固定激励值，且需根据环境回报适当选择，因此对其变化高度敏感。因此，即使相互合作的条件保持不变，它们也无法在环境中奖励变化的情况下维持合作，例如因规范修改、供需变化或感官缺陷。本文提出了动态奖励激励用于变量交换（DRIVE），这是一种适应性PI方法，用于在奖励变化的社会困境中合作。DRIVE代理互惠交换奖励差异，以完全去中心化的方式激励相互合作。我们展示了DRIVE如何在一般囚徒困境中实现相互合作，并通过实证方式评估DRIVE在更复杂且连续性社会困境中奖励变化的表现，证明其实现和维持合作的能力，这与当前最先进的PI方法形成对比。

Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs

轻量级且安全：通过轻量级大型语言模型生成安全脚本语言

Authors: Keyang Zhang, Zeyu Chen, Xuan Feng, Dongliang Fang, Yaowen Zheng, Zhi Li, Limin Sun
Subjects: Cryptography and Security (cs.CR); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2601.06419
Pdf link: https://arxiv.org/pdf/2601.06419
Abstract The security of scripting languages such as PowerShell is critical given their powerful automation and administration capabilities, often exercised with elevated privileges. Today, securing these languages still demands substantial human effort to craft and enforce rules, imposing heavy burdens on typical administrators and creating critical production risks (e.g., misoperations that shut down servers). Large language models (LLMs) have demonstrated strong capabilities in code generation, vulnerability detection, and automated repair for languages like Python and JavaScript. However, their ability to assist with generating secure scripting-language code remains largely underexplored. In this paper, we present SecGenEval-PS, a benchmark designed to systematically evaluate LLMs on secure scripting generation, security analysis, and automated repair. Our results show that both proprietary and open-source models fall short in these areas. For instance, over 60% of PowerShell scripts produced by GPT-4o and o3-mini are insecure without structured [...]. To bridge this gap, we propose PSSec, a framework that combines data synthesis with fine-tuning to enhance model security capabilities. We develop a self-debugging agent that integrates static analyzers with the reasoning abilities of advanced LLMs to synthesize large-scale structured triplets of insecure scripts, violation analyses, and corresponding repairs. We then fine-tune lightweight LLMs (as small as 1.7B parameters) using supervised fine-tuning (SFT) and reinforcement learning (RL), enabling security-aware reasoning and the generation of secure PowerShell [...] multiple LLM families, including GPT and Qwen, PSSec-trained models match or surpass general-purpose large models on PowerShell security tasks while reducing inference cost by more than an order of magnitude.
中文摘要 鉴于PowerShell等脚本语言强大的自动化和管理能力，通常以更高权限行使，其安全性至关重要。如今，保护这些语言仍需大量人力制定和执行规则，给普通管理员带来沉重负担，并带来关键的生产风险（例如导致服务器关闭的作失误）。大型语言模型（LLM）在代码生成、漏洞检测以及Python和JavaScript等语言的自动修复方面展现出强大的能力。然而，它们协助生成安全脚本语言代码的能力仍然鲜有充分探索。本文介绍了SecGenEval-PS，这是一个旨在系统评估大型语言模型在安全脚本生成、安全分析和自动修复方面的基准测试。我们的结果显示，专有和开源模型在这些方面都存在不足。例如，GPT-4o和o3-mini生成的超过60%的PowerShell脚本如果没有结构化，http URL弥补了这一差距，我们提出了PSSec框架，该框架结合了数据综合与微调，以增强模型安全性。我们开发了一种自调试代理，将静态分析器与高级大型语言模型的推理能力整合，合成大规模结构化的三重组不安全脚本、违规分析及相应修复。随后，我们通过监督微调（SFT）和强化学习（RL）对轻量级LLM（参数最小至17亿亿）进行微调，实现安全意识推理和安全PowerShell生成。多个LLM家族，包括GPT和Qwen，\textit{PSSec}训练模型在PowerShell安全任务中可匹敌甚至超越通用大型模型，同时降低推理成本超过一个数量级。

Deep Reinforcement Learning based Control Design for Aircraft Recovery from Loss-of-Control Scenario

基于深度强化学习的控制设计，用于飞机从失控场景中恢复

Authors: Imran Sayyed, Aayush Konar, Nandan Kumar Sinha
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.06439
Pdf link: https://arxiv.org/pdf/2601.06439
Abstract Loss-of-control (LOC) remains a leading cause of fixed-wing aircraft accidents, especially in post-stall and flat-spin regimes where conventional gain-scheduled or logic-based recovery laws may fail. This study formulates spin recovery as a continuous-state, continuous-action Markov Decision Process and trains a Proximal Policy Optimization (PPO) agent on a high-fidelity six-degree-of-freedom F-18/HARV model that includes nonlinear aerodynamics, actuator saturation and rate coupling. A two-phase potential-based reward structure first penalizes large angular rates and then enforces trimmed flight. After 6,000 simulated episodes, the policy generalizes to unseen upset initializations. Results show that the learned policy successfully arrests the angular rates and stabilizes the angle of attack. The controller's performance in recovering from spin conditions is found to be satisfactory when compared with a state-of-the-art sliding mode controller. The findings demonstrate that deep reinforcement learning can deliver interpretable, dynamically feasible manoeuvres for real-time loss-of-control mitigation and provide a pathway for flight-critical RL deployment.
中文摘要 失控（LOC）仍是固定翼飞机事故的主要原因，尤其是在失速后和平旋状态下，这些情况下传统的增益计划或逻辑恢复定律可能失效。本研究将自旋恢复表述为连续状态、连续作用的马尔可夫决策过程，并在高保真度六自由度F-18/HARV模型上训练一个近端策略优化（PPO）代理，该模型包含非线性空气动力学、执行器饱和和速率耦合。基于电位的两相奖励结构首先惩罚较大的角速率，然后强制执行修剪飞行。经过6000次模拟事件后，政策通用性对未见的初始化感到不满。结果显示，所学策略成功地阻止了角率并稳定了攻角。该控制器性能在自旋状态恢复方面表现令人满意，并与先进的滑动模式控制器进行了比较。研究结果表明，深度强化学习能够提供可解释、动态可行的机动，实现实时失控缓解，并为飞行关键的强化学习部署提供路径。
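A minimal sketch of potential-based reward shaping, assuming the standard form F(s, s') = γΦ(s') − Φ(s). The abstract only states that the reward is potential-based and has two phases; the specific potential below, the state fields `rates` and `alpha`, the threshold, and the trim angle are all hypothetical choices made for illustration, not the paper's design.

```python
import math

GAMMA = 0.99           # discount factor (assumed)
RATE_THRESHOLD = 0.1   # rad/s; hypothetical switch point between the two phases


def potential(state, alpha_trim=0.05):
    """Hypothetical two-phase potential: damp angular rates first, then seek trim."""
    p, q, r = state["rates"]
    rate_norm = math.sqrt(p * p + q * q + r * r)
    if rate_norm > RATE_THRESHOLD:
        # Phase 1: always below -1, so it never outranks a phase-2 state.
        return -1.0 - rate_norm
    # Phase 2: in [-1, 0], best when the angle of attack sits at the trim value.
    return -min(abs(state["alpha"] - alpha_trim), 1.0)


def shaping_reward(state, next_state):
    """Shaping term added to the environment reward for one transition."""
    return GAMMA * potential(next_state) - potential(state)


def discounted_shaping_sum(trajectory):
    """Discounted sum of shaping terms along a list of consecutive states."""
    return sum(
        (GAMMA ** t) * shaping_reward(s, s_next)
        for t, (s, s_next) in enumerate(zip(trajectory, trajectory[1:]))
    )


if __name__ == "__main__":
    spin = {"rates": (1.0, 0.0, 0.0), "alpha": 0.5}
    slower = {"rates": (0.5, 0.0, 0.0), "alpha": 0.5}
    trimmed = {"rates": (0.0, 0.0, 0.0), "alpha": 0.05}
    print(shaping_reward(spin, slower))                    # positive: rates decreased
    print(discounted_shaping_sum([spin, slower, trimmed]))
```

The discounted shaping terms telescope to γ^T Φ(s_T) − Φ(s_0), so the total bonus depends only on the start and end states. This is the property that lets potential-based shaping guide learning without changing which policies are optimal.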

Coupling Smoothed Particle Hydrodynamics with Multi-Agent Deep Reinforcement Learning for Cooperative Control of Point Absorbers

结合平滑粒子流体力学与多智能体深度强化学习，实现点吸收器的协作控制

Authors: Yi Zhan, Iván Martínez-Estévez, Min Luo, Alejandro J.C. Crespo, Abbas Khayyer
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.06485
Pdf link: https://arxiv.org/pdf/2601.06485
Abstract Wave Energy Converters, particularly point absorbers, have emerged as one of the most promising technologies for harvesting ocean wave energy. Nevertheless, achieving high conversion efficiency remains challenging due to the inherently complex and nonlinear interactions between incident waves and device motion dynamics. This study develops an optimal adaptive damping control model for the power take-off (PTO) system by coupling Smoothed Particle Hydrodynamics (SPH) with multi-agent deep reinforcement learning. The proposed framework enables real-time communication between high-fidelity SPH simulations and intelligent control agents that learn coordinated policies to maximise energy capture. In each training episode, the SPH-based environment provides instantaneous hydrodynamic states to the agents, which output continuous damping actions and receive rewards reflecting power absorption. The Multi-Agent Soft Actor Critic algorithm is employed within a centralised-training and decentralised-execution scheme to ensure stable learning in continuous, multi-body systems. The entire platform is implemented in a unified GPU-accelerated C++ environment, allowing long-horizon training and large-scale three-dimensional simulations. The approach is validated through a series of two-dimensional and three-dimensional benchmark cases under regular and irregular wave conditions. Compared with constant PTO damping, the learned control policy increases overall energy capture by 23.8% and 21.5%, respectively, demonstrating the strong potential of intelligent control for improving the performance of wave energy converter arrays. The developed three-dimensional GPU-accelerated multi-agent platform for computational hydrodynamics is extendable to other fluid-structure interaction engineering problems that require real-time, multi-body coordinated control.
中文摘要 波浪能转换器，尤其是点吸收器，已成为收集海洋波浪能量最有前景的技术之一。然而，由于入射波与器件运动动力学之间本质复杂且非线性的相互作用，实现高转换效率仍具挑战性。本研究通过将平滑粒子流体动力学（SPH）与多智能体深度强化学习结合，开发了功率输出（PTO）系统的最佳自适应阻尼控制模型。该框架实现了高精度SPH模拟与智能控制代理之间的实时通信，智能控制代理学习协调策略以最大化能量捕获。在每次训练过程中，基于SPH的环境为代理提供瞬时的流体动力学状态，它们输出连续阻尼作用，并获得反映能量吸收的奖励。多智能体软演员批评算法被采用集中训练和去中心化执行方案，以确保连续多体系统中的稳定学习。整个平台在统一的GPU加速C++环境中实现，支持长视野训练和大规模三维仿真。该方法通过一系列二维和三维基准案例在正则波和不规则波条件下进行验证。与恒定PTO阻尼相比，学习控制策略分别提升了23.8%和21.5%的整体能量捕获，展示了智能控制在提升波能变换阵列性能方面的强大潜力。在计算流体力学领域开发的三维GPU加速多智能体平台，可扩展到其他需要实时、多体协调控制的流体结构交互工程问题。

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

ArenaRL：通过基于锦标赛的相对排名，为开放式代理人调整强化学习

Authors: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06487
Pdf link: https://arxiv.org/pdf/2601.06487
Abstract Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the resulting seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to that of full pairwise comparison, which has O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
中文摘要 强化学习显著提升了LLM代理在可验证结果任务中的表现，但在拥有庞大解空间的开放式代理任务（如复杂的旅行规划）上仍显困难。由于缺乏客观的真实性，当前的强化学习算法主要依赖奖励模型，为单个反应分配标量分数。我们认为，这种按点评分存在固有的歧视崩溃：奖励模型难以区分不同轨迹中的微妙优势，导致群体内的分数被压缩到狭窄范围内。因此，有效奖励信号被奖励模型中的噪声所主导，导致优化停滞。为此，我们提出了ArenaRL，一种强化学习范式，从点数评分转向组内相对排名。ArenaRL引入了一种过程感知的两两评估机制，利用多层次评分标准为轨迹分配细粒度的相对分数。此外，我们构建了组内对抗竞技场，并设计了基于比赛的排名方案以获得稳定的优势信号。实证结果证实，构建的种子单消除方案在仅O（N^2）复杂度下，几乎实现与完全两对比较的优势估计准确度相当，且复杂度仅为O（N），在效率与精度之间取得最佳平衡。此外，为了解决开放式智能体缺乏全周期基准的问题，我们构建了Open-Travel和Open-DeepResearch这两个高质量基准，涵盖SFT、RL培训和多维评估的全面流程。大量实验表明，ArenaRL远超标准强化学习基线，使LLM代理能够为复杂的现实任务生成更稳健的解决方案。
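The O(N) versus O(N^2) claim in the abstract above follows from a counting argument: every single-elimination match removes exactly one trajectory, so N trajectories need N − 1 pairwise judgments, against N(N − 1)/2 for a full round robin. The sketch below illustrates only that count. The `better` callback stands in for the paper's pairwise judge, the first-versus-last pairing assumes the items arrive sorted by some seed score, and turning rounds won into centred advantages is an illustrative choice rather than the authors' scheme.

```python
def single_elimination(items, better):
    """Run a bracket over `items`; return (rounds_won per item, comparisons used).

    `better(a, b)` is a hypothetical pairwise judge returning True if a beats b.
    Items are assumed to be pre-sorted by seed, so first meets last in each round.
    """
    alive = list(range(len(items)))
    rounds_won = [0] * len(items)
    comparisons = 0
    while len(alive) > 1:
        survivors = []
        i, j = 0, len(alive) - 1
        while i < j:
            a, b = alive[i], alive[j]
            comparisons += 1
            winner = a if better(items[a], items[b]) else b
            rounds_won[winner] += 1
            survivors.append(winner)
            i += 1
            j -= 1
        if i == j:
            survivors.append(alive[i])  # odd field: the middle seed gets a bye
        alive = survivors
    return rounds_won, comparisons


def centred_advantages(rounds_won):
    """Illustrative advantage signal: rounds won minus the group mean."""
    mean = sum(rounds_won) / len(rounds_won)
    return [w - mean for w in rounds_won]


if __name__ == "__main__":
    quality = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6]  # made-up latent quality
    wins, used = single_elimination(quality, lambda a, b: a > b)
    n = len(quality)
    print(used, "comparisons vs", n * (n - 1) // 2, "for full pairwise")
    print(centred_advantages(wins))
```

A bracket yields a coarser ranking than a round robin, since many losers tie on rounds won. That is presumably why the abstract reports an empirical check of advantage accuracy rather than assuming it.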

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Spec-o3：一种工具增强视觉语言代理，通过自动光谱检测筛选稀有天体候选

Authors: Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao
Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Arxiv link: https://arxiv.org/abs/2601.06498
Pdf link: https://arxiv.org/pdf/2601.06498
Abstract Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on expert visual inspection, a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. Spec-o3 is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, Spec-o3 establishes a new state of the art, boosting the macro-F1 score from 28.3 to 76.5 with a 7B parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision-making. Code, data, and models are available at the Project HomePage.
中文摘要 由于深度学习分类器的泛化性和可解释性有限，稀有天体候选物的最终筛选仍依赖专家的目视检查——这是一个人工密集的过程。在此过程中，天文学家利用专用工具分析光谱并构建可靠的星表。然而，这种做法已成为主要瓶颈，因为它根本无法适应现代光谱巡天数据洪流的规模化。为弥合这一空白，我们提出了Spec-o3，一种工具增强的视觉语言代理，通过交织多模态思维链推理执行天文学家对齐的光谱检查。Spec-o3采用两阶段培训后方案进行训练：冷启动监督的专家检查轨迹微调，随后是基于结果的强化学习，完成罕见类型验证任务。经过LAMOST五项稀有物体识别任务的评估，Spec-o3确立了新的最先进技术，将宏F1得分从28.3提升至76.5，并以7B参数为基础模型，并超越了专有VLM和专用深度模型。关键是，该代理在调查转变（从LAMOST到SDSS/DESI）中对未见检查任务表现出强烈的泛化能力。专家评估确认其推理路径连贯且物理一致，支持透明且可信的决策。代码、数据和模型可在 \href{this https URL}{Project HomePage} 获取。
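The headline metric in the abstract above is macro-F1, which averages per-class F1 scores with equal weight and therefore penalizes a model that ignores rare classes. The sketch below is a plain implementation of that definition on made-up labels. It is not the paper's evaluation code, and the label names are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["rare", "common", "common", "common"]   # hypothetical labels
    always_common = ["common"] * 4                   # a classifier that ignores "rare"
    accuracy = sum(t == p for t, p in zip(truth, always_common)) / len(truth)
    print(accuracy, macro_f1(truth, always_common))
```

In this toy case accuracy is 0.75 while macro-F1 is about 0.43, because the missed rare class contributes an F1 of zero to the average.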

Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODASER) for Safe Reinforcement Learning in Optimal Control

自组织双缓冲自适应聚类经验重放（SODASER），用于最优控制下的安全强化学习

Authors: Roya Khalili Amirabadi, Mohsen Jalaeian Farimani, Omid Solaymani Fard
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.06540
Pdf link: https://arxiv.org/pdf/2601.06540
Abstract This paper proposes a novel reinforcement learning framework, named Self-Organizing Dual-buffer Adaptive Clustering Experience Replay (SODACER), designed to achieve safe and scalable optimal control of nonlinear systems. The proposed SODACER mechanism consists of a Fast-Buffer for rapid adaptation to recent experiences and a Slow-Buffer equipped with a self-organizing adaptive clustering mechanism to maintain diverse and non-redundant historical experiences. The adaptive clustering mechanism dynamically prunes redundant samples, optimizing memory efficiency while retaining critical environmental patterns. The approach integrates SODACER with Control Barrier Functions (CBFs) to guarantee safety by enforcing state and input constraints throughout the learning process. To enhance convergence and stability, the framework is combined with the Sophia optimizer, enabling adaptive second-order gradient updates. The proposed SODACER-Sophia architecture ensures reliable, effective, and robust learning in dynamic, safety-critical environments, offering a generalizable solution for applications in robotics, healthcare, and large-scale system optimization. The proposed approach is validated on a nonlinear Human Papillomavirus (HPV) transmission model with multiple control inputs and safety constraints. Comparative evaluations against random and clustering-based experience replay methods demonstrate that SODACER achieves faster convergence, improved sample efficiency, and a superior bias-variance trade-off, while maintaining safe system trajectories, validated via the Friedman test.
中文摘要 本文提出了一种新颖的强化学习框架，名为自组织双缓冲自适应聚类体验重放（SODACER），旨在实现非线性系统的安全且可扩展的最优控制。提出的SODACER机制包括一个快速缓冲器，用于快速适应近期经验，另一个是配备自组织自适应聚类机制的慢缓冲器，以维持多样且非冗余的历史经验。自适应聚类机制动态修剪冗余样本，优化记忆效率，同时保留关键环境模式。该方法将SODASER与控制障碍函数（CBF）集成，通过在整个学习过程中强制执行状态和输入约束，确保安全。为增强收敛性和稳定性，该框架与Sophia优化器结合，实现自适应的二阶梯度更新。所提SODACER-Sophia架构确保在动态且安全关键环境中实现可靠、高效且稳健的学习，为机器人、医疗和大规模系统优化等应用提供了通用解决方案。该方法在具有多重控制输入和安全约束的非线性人瘤病毒（HPV）传播模型上得到了验证。与随机和基于聚类的经验回放方法的比较评估表明，SODACER在保持系统轨迹安全的同时实现了更快的收敛速度、提升了样本效率和更优的偏差-方差权衡，这一点通过弗里德曼检验得到了验证。

ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

ArrowGEV：通过学习时间之箭实现视频事件的扎根

Authors: Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, Jie Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.06559
Pdf link: https://arxiv.org/pdf/2601.06559
Abstract Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
中文摘要 视频中的事件基础化是视频分析中的一项基本能力。虽然视觉语言模型（VLM）越来越多地用于此任务，但现有方法主要训练模型，仅将事件与前向视频中的时间戳关联起来。这种范式阻碍了VLM捕捉事件固有的时间结构和方向性，从而限制了鲁棒性和推广性。为了解决这一限制，受物理学中时间箭头（时间箭头）启发，该箭头描述时间过程的内在方向性，我们提出了ArrowGEV强化学习框架，明确建模事件的时间方向性，以提升VLM中的事件基础和时间方向性的理解。具体来说，我们将事件分为时间敏感型（例如，放下一个袋子）和时间敏感型（例如，左手拿着毛巾）。前者表示那些事件的反转会显著改变其意义，而后者则在反转后语义保持不变。对于时间敏感事件，ArrowGEV引入奖励，鼓励VLM区分前向和后向视频，而对于时间敏感事件，则强制双方保持一致的接地。大量实验表明，ArrowGEV不仅提升了接地精度和时间方向性识别，还增强了视频的整体理解和推理能力。

Object-Centric World Models Meet Monte Carlo Tree Search

以对象为中心的世界模型遇见蒙特卡洛树搜索

Authors: Rodion Vakhitov, Leonid Ugadiarov, Aleksandr Panov
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.06604
Pdf link: https://arxiv.org/pdf/2601.06604
Abstract In this paper, we introduce ObjectZero, a novel reinforcement learning (RL) algorithm that leverages the power of object-level representations to model dynamic environments more effectively. Unlike traditional approaches that process the world as a single undifferentiated input, our method employs Graph Neural Networks (GNNs) to capture intricate interactions among multiple objects. These objects, which can be manipulated and interact with each other, serve as the foundation for our model's understanding of the environment. We trained the algorithm in a complex setting teeming with diverse, interactive objects, demonstrating its ability to effectively learn and predict object dynamics. Our results highlight that a structured world model operating on object-centric representations can be successfully integrated into a model-based RL algorithm utilizing Monte Carlo Tree Search as a planning module.
中文摘要 本文介绍了ObjectZero，一种新型强化学习（RL）算法，利用对象级表示的力量更有效地建模动态环境。与将世界视为单一无差分输入的方式不同，我们的方法采用图神经网络（GNN）捕捉多个对象之间的复杂交互。这些物体可以作并相互交互，是我们模型理解环境的基础。我们在一个充满多样互动对象的复杂环境中训练该算法，展示了其有效学习和预测物体动态的能力。我们的结果表明，基于对象为中心表示的结构化世界模型可以成功集成到基于模型的强化学习算法中，利用蒙特卡洛树搜索作为规划模块。

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER：面向开放式编程任务的知识对齐学生错误模拟器

Authors: Zhangqi Duan, Nigel Fernandez, Andrew Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2601.06633
Pdf link: https://arxiv.org/pdf/2601.06633
Abstract Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that, at the per-student-problem pair level, our method outperforms baselines on code and error prediction, and at the per-problem level, it outperforms baselines on error coverage and simulated code diversity.
中文摘要 开放式任务，如计算机科学教育中常见的编码问题，提供了对学生知识的详细洞察。然而，训练大型语言模型（LLMs）以模拟和预测学生在应对这些问题时可能的错误可能具有挑战性：它们常常存在模式崩溃的问题，无法完全捕捉学生回答中语法、风格和解决方案方法的多样性。在本研究中，我们提出了KASER（知识对齐学生错误模拟器），这是一种将错误与学生知识对齐的新方法。我们提出了一种基于强化学习的混合奖励训练方法，反映了学生代码预测的三个方面：i）代码与真实数据的相似性，ii）错误匹配，iii）代码预测多样性。在两个真实数据集上，我们进行了两个评估层次，结果表明：在每学生问题对层面，我们的方法在代码和错误预测上优于基线;在每个问题层面，我们的方法在错误覆盖率和模拟代码多样性方面优于基线。

Reinforcement Learning-Guided Dynamic Multi-Graph Fusion for Evacuation Traffic Prediction

强化学习引导动态多图融合用于疏散交通预测

Authors: Md Nafees Fuad Rafi, Samiul Hasan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06664
Pdf link: https://arxiv.org/pdf/2601.06664
Abstract Real-time traffic prediction is critical for managing transportation systems during hurricane evacuations. Although data-driven graph-learning models have demonstrated strong capabilities in capturing the complex spatiotemporal dynamics of evacuation traffic at a network level, they mostly consider a single dimension (e.g., travel-time or distance) to construct the underlying graph. Furthermore, these models often lack interpretability, offering little insight into which input variables contribute most to their predictive performance. To overcome these limitations, we develop a novel Reinforcement Learning-guided Dynamic Multi-Graph Fusion (RL-DMF) framework for evacuation traffic prediction. We construct multiple dynamic graphs at each time step to represent heterogeneous spatiotemporal relationships between traffic detectors. A dynamic multi-graph fusion (DMF) module is employed to adaptively learn and combine information from these graphs. To enhance model interpretability, we introduce an RL-based intelligent feature selection and ranking (RL-IFSR) method that learns to mask irrelevant features during model training. The model is evaluated using a real-world dataset of 12 hurricanes affecting Florida from 2016 to 2024. For an unseen hurricane (Milton, 2024), the model achieves 95% accuracy (RMSE = 293.9) in predicting the next 1-hour traffic flow. Moreover, the model can forecast traffic flow for up to the next 6 hours with 90% accuracy (RMSE = 426.4). The RL-DMF framework outperforms several state-of-the-art traffic prediction models. Furthermore, ablation experiments confirm the effectiveness of the dynamic multi-graph fusion and RL-IFSR approaches for improving model performance. This research provides a generalized and interpretable model for real-time evacuation traffic forecasting, with significant implications for evacuation traffic management.
中文摘要 实时交通预测对于飓风疏散期间的交通系统管理至关重要。尽管数据驱动的图学习模型在捕捉网络层面疏散交通的复杂时空动态方面表现出强大能力，但它们主要考虑单一维度（例如旅行时间或距离）来构建底层图。此外，这些模型往往缺乏可解释性，难以洞察哪些输入变量对其预测表现贡献最大。为克服这些局限，我们开发了一种新型强化学习引导的动态多图融合（RL-DMF）框架，用于疏散交通预测。我们在每个时间步构建多个动态图，以表示交通检测器间的异构时空关系。采用动态多图融合（DMF）模块，自适应地学习和组合这些图中的信息。为提升模型可解释性，我们引入基于强化学习的智能特征选择与排序（RL-IFSR）方法，学习在模型训练过程中掩盖无关特征。该模型基于2016年至2024年影响佛罗里达的12个飓风的真实数据集进行评估。对于未见飓风（Milton， 2024），该模型在预测下一个1小时的交通流量时实现了95%的准确率（RMSE = 293.9）。此外，该模型还能以90%的准确率预测未来6小时的交通流量（RMSE = 426.4）。RL-DMF框架的表现优于多个最先进的交通预测模型。此外，消融实验证实了动态多图融合和RL-IFSR方法提升模型性能的有效性。本研究为实时疏散交通预测提供了一个通用且可解释的模型，对撤离交通管理具有重要意义。

Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget

可塑性与刚性：低级适配器对微预算推理的影响

Authors: Zohaib Khan, Omer Tafveez, Zoha Hayat Bhatti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06677
Pdf link: https://arxiv.org/pdf/2601.06677
Abstract Recent advances in mathematical reasoning typically rely on massive scale, yet the question remains: can strong reasoning capabilities be induced in small language models ($\leq 1.5$B) under extreme constraints? We investigate this by training models on a single A40 GPU (48GB) for under 24 hours using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). We find that the success of this "micro-budget" regime depends critically on the interplay between adapter capacity and model initialization. While low-rank adapters ($r=8$) consistently fail to capture the complex optimization dynamics of reasoning, high-rank adapters ($r=256$) unlock significant plasticity in standard instruction-tuned models. Our best result achieved an impressive 40.0% Pass@1 on AIME 24 (an 11.1% absolute improvement over baseline) and pushed Pass@16 to 70.0%, demonstrating robust exploration capabilities. However, this plasticity is not universal: while instruction-tuned models utilized the budget to elongate their chain-of-thought and maximize reward, heavily math-aligned models suffered performance collapse, suggesting that noisy, low-budget RL updates can act as destructive interference for models already residing near a task-specific optimum.
中文摘要 近年来数学推理的进展通常依赖于大规模计算，但问题依然存在：在极端约束下，能否在小型语言模型（$\leq1.5\text{B}$）中诱导出强大的推理能力？我们通过在单个A40 GPU（48GB）上，使用可验证奖励强化学习（RLVR）和低秩适应（LoRA）训练模型，时间不到24小时进行研究。我们发现，这种“微预算”体系的成功关键在于适配器容量与模型初始化之间的相互作用。低秩适配器（$r=8$）始终无法捕捉推理的复杂优化动态，而高秩适配器（$r=256$）则在标准指令调优模型中释放了显著的可塑性。我们的最佳成绩在AIME 24上取得了令人印象深刻的40.0%Pass@1（比基线绝对提升11.1%），并将Pass@16推至70.0%，展现了强大的探索能力。然而，这种可塑性并非普遍存在：指令调优模型利用预算延长思维链并最大化奖励，而高度数学化的模型则遭遇性能崩溃，表明噪声低预算的强化学习更新可能对已接近任务最优的模型造成破坏性干扰。
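To give a sense of what "adapter capacity" means for the two ranks compared above, the sketch below counts trainable parameters under the standard LoRA factorization, where a frozen weight of shape d_out × d_in receives an update B·A with B of shape d_out × r and A of shape r × d_in. The 1536 × 1536 layer size is a hypothetical example chosen for round numbers, not a dimension taken from the paper.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters added by one LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the frozen dense weight the adapter is attached to."""
    return d_out * d_in


if __name__ == "__main__":
    d = 1536  # hypothetical square projection size, for illustration only
    for r in (8, 256):
        added = lora_params(d, d, r)
        share = added / full_params(d, d)
        print(f"r={r}: {added} adapter params ({share:.1%} of the dense layer)")
```

The adapter size grows linearly with rank, so r=256 carries 32 times the trainable parameters of r=8 on any layer. For this hypothetical layer that is one third of the dense weight rather than roughly one percent.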

Characterising Toxicity in Generative Large Language Models

生成大型语言模型中的毒性特征

Authors: Zhiyao Zhang, Yazan Mash'Al, Yuhan Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06700
Pdf link: https://arxiv.org/pdf/2601.06700
Abstract In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing text processing and text generation. This has come about through transformer-based decoder-only architectures, which have become ubiquitous in NLP due to their impressive text processing and generation capabilities. Despite these breakthroughs, language models (LMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses. We will collectively refer to these as "toxic" outputs. Although methods like reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. Therefore, this paper examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors -- both lexical and syntactic -- that influence the production of such outputs in generative models.
中文摘要 近年来，注意力机制的出现显著推动了自然语言处理（NLP）领域的发展，彻底革新了文本处理和文本生成。这得益于基于变换器、仅依赖解码器的架构，这些架构因其出色的文本处理和生成能力，在自然语言处理中无处不在。尽管取得了这些突破，语言模型（LM）仍然容易产生不想要的输出：不当、冒犯性或其他有害的反应。我们将这些统称为“有毒”输出。尽管已有如人类反馈强化学习（RLHF）等方法被开发出来，以使模型输出与人类价值观保持一致，但这些防护措施通常可以通过精心设计的提示来规避。因此，本文考察了LLMs在提示时产生的有毒内容的程度，以及影响生成模型中此类输出的语言因素——词汇和句法因素。

On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning

通过测试时间强化学习实现的实时VLA适配

Authors: Changyu Liu, Yiyang Liu, Taowen Wang, Qiao Zhuang, James Chenhao Liang, Wenhao Yang, Renjing Xu, Qifan Wang, Dongfang Liu, Cheng Han
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.06748
Pdf link: https://arxiv.org/pdf/2601.06748
Abstract Vision-Language-Action models have recently emerged as a powerful paradigm for general-purpose robot learning, enabling agents to map visual observations and natural-language instructions into executable robotic actions. Though popular, they are primarily trained via supervised fine-tuning or training-time reinforcement learning, requiring explicit fine-tuning phases, human interventions, or controlled data collection. Consequently, existing methods remain unsuitable for challenging simulated- or physical-world deployments, where robots must respond autonomously and flexibly to evolving environments. To address this limitation, we introduce Test-Time Reinforcement Learning for VLAs (TT-VLA), a framework that enables on-the-fly policy adaptation during inference. TT-VLA formulates a dense reward mechanism that leverages step-by-step task-progress signals to refine action policies during test time while preserving the SFT/RL-trained priors, making it an effective supplement to current VLA models. Empirical results show that our approach enhances overall adaptability, stability, and task success in dynamic, previously unseen scenarios under simulated and real-world settings. We believe TT-VLA offers a principled step toward self-improving, deployment-ready VLAs.
中文摘要 视觉-语言-行动模型近年来成为通用机器人学习的强大范式，使智能体能够将视觉观察和自然语言指令映射到可执行的机器人动作中。虽然很受欢迎，但它们主要通过监督微调或训练时间强化学习进行训练，需要明确的微调阶段、人工干预或受控数据收集。因此，现有方法仍不适合挑战性的模拟或物理世界部署，机器人必须自主且灵活地响应不断变化的环境。为解决这一限制，我们引入了VLA测试时强化学习（TT-VLA）框架，该框架支持推理过程中的即时策略调整。TT-VLA构建了一个密集的奖励机制，利用逐步任务进展信号在测试期间优化动作策略，同时保留SFT/RL训练的先验，是当前VLA模型的有效补充。实证结果表明，我们的方法在模拟和现实环境中的动态、前所未有的场景中提升了整体适应性、稳定性和任务成功率。我们认为TT-VLA为实现自我改进、部署准备的VLA迈出了有原则的一步。

GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

GanitLLM：通过课程进行难度感知孟加拉数学推理-GRPO

Authors: Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06767
Pdf link: https://arxiv.org/pdf/2601.06767
Abstract We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
中文摘要 我们展示了一个名为 GanitLLM（以孟加拉语数学单词“Ganit”命名）的孟加拉数学推理模型，以及一个新的难度感知孟加拉语数学语料库和基于课程的 GRPO 流水线。孟加拉语是世界上使用最广泛的语言之一，但现有的大型语言模型要么先用英语推理再翻译，要么在多步孟加拉语数学上失败，部分原因是强化学习的配方针对高资源语言，在资源匮乏环境中在奖励稀缺下崩溃。为此，我们构建了Ganit，这是一个经过严格筛选和去污的孟加拉语数学数据集，其难度标签源自强评估器模型的pass@k。基于该数据集，我们提出了Curriculum-GRPO，结合了多阶段训练（SFT + GRPO）与难度感知抽样以及格式、数值正确性和孟加拉语推理的可验证奖励。在 Bn-MGSM 和 Bn-MSVAMP 上，GanitLLM-4B 分别比其 Qwen3-4B 基础提升了 +8 和 +7 点的准确率，同时孟加拉语推理标记的比例从14%提升到超过88%，平均解题长度从943个词减少到193个词。
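The difficulty tags described above are derived from an evaluator model's pass@k. A minimal sketch of how such a tag could be computed follows, assuming the widely used unbiased estimator pass@k = 1 − C(n − c, k) / C(n, k) for n samples with c correct. The abstract does not state which estimator, which k, or which cut-offs the authors use, so the `difficulty_tag` thresholds are purely hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when fewer than k incorrect samples exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def difficulty_tag(score, easy_at=0.8, medium_at=0.3):
    """Map a pass@k score to a coarse label; both thresholds are hypothetical."""
    if score >= easy_at:
        return "easy"
    if score >= medium_at:
        return "medium"
    return "hard"


if __name__ == "__main__":
    # Made-up evaluator results: (samples drawn, samples correct) per problem.
    for n, c in [(16, 15), (16, 3), (16, 0)]:
        score = pass_at_k(n, c, k=4)
        print(n, c, round(score, 3), difficulty_tag(score))
```

Tags of this kind are what a curriculum sampler would consume, for example by drawing mostly easy problems early in training and shifting toward hard ones later.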

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

不再有陈旧反馈：开放世界代理学习的共同进化批评者

Authors: Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06794
Pdf link: https://arxiv.org/pdf/2601.06794
Abstract Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and to provide feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
中文摘要 批判引导强化学习（RL）已成为训练LLM代理的强大范式，通过用自然语言反馈增强稀疏的结果奖励。然而，现有方法常依赖静态或离线批评模型，这些模型未能随着政策演变而适应。在策略强化学习中，代理的错误模式随时间变化，导致静止批评者变得陈旧，并反馈效用递减。为此，我们引入了ECHO（事后诸葛亮引导优化的进化批评者）框架，通过同步的共进循环共同优化策略和批评者。ECHO采用级联推广机制，批评者为初始轨迹生成多个诊断，随后通过策略优化以实现群体结构化优势估计。我们通过一个具饱和感应的增益塑造目标来解决学习平台期的挑战，该目标奖励批评者在高绩效轨迹中带来渐进式改进。通过采用双轨GRPO更新，ECHO确保评论员的反馈与不断演变的政策保持一致。实验结果表明，ECHO在开放世界环境中能够实现更稳定的训练和更高的长期任务成功率。

GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

GDEPO：群组双重动态与均衡右优势策略优化，增强训练数据利用以实现样本约束强化学习

Authors: Zhengqing Yan, Xinyang Liu, Yi Zhang, Fan Guo, Yao Liu, Junchen Wan, Kang Song
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06795
Pdf link: https://arxiv.org/pdf/2601.06795
Abstract Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which apply extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
中文摘要 自动定理证明（ATP）是人工智能（AI）中的一个根本性挑战，要求用精益等形式语言构建机器可验证的证明，以评估人工智能推理能力。强化学习（RL），尤其是高性能的群相对策略优化（GRPO）算法，已成为该任务的主流方法。然而，在ATP场景下，GRPO面临两个关键问题：当使用复合奖励时，其相对优势估计可能与形式验证者的二元反馈相冲突;与此同时，如果找不到有效证据，其静态采样策略可能会丢弃整批数据，导致模型更新为零贡献，且数据浪费显著。为解决这些局限性，我们提出了群对偶动态与平等右优势策略优化（GDEPO），该方法包含三种核心机制：1）动态额外采样，即对无效批次进行重采样，直到发现有效证明;2）等右优势，将优势函数的符号（基于正确性）与其大小（由辅助奖励调制）解耦，以确保政策更新的稳定和正确;3）动态的额外迭代，对最初失败但最终成功的样本施加额外的梯度步骤，以加速复杂案例的学习。在三种不同难度的数据集（MinF2F测试、MathOlympiadBench、PutnamBench）上进行的实验证实了GDEPO的有效性，而消融研究则验证了其协同成分的必要性。该方法提升了数据利用率和优化效率，为ATP提供了全新的训练范式。
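
The "equal-right advantage" mechanism is described only at a high level: the sign of the advantage comes from verifier correctness, and the magnitude is modulated by auxiliary rewards. One way this decoupling could look is sketched below. The function name, the `floor` parameter, and the z-score magnitude are assumptions for illustration and not the paper's formula.

```python
from statistics import mean, pstdev


def equal_right_advantages(correct, aux_rewards, floor=0.1, eps=1e-8):
    """Sign from the verifier, magnitude from auxiliary rewards (hypothetical form).

    correct:     list of bools, True if the formal verifier accepted the proof
    aux_rewards: list of floats, auxiliary (non-verifier) reward per sample
    floor:       minimum magnitude so every sample still contributes a gradient
    """
    mu = mean(aux_rewards)
    sigma = pstdev(aux_rewards)
    advantages = []
    for ok, reward in zip(correct, aux_rewards):
        # Magnitude depends only on how unusual the auxiliary reward is in the group.
        magnitude = floor + abs(reward - mu) / (sigma + eps)
        # Sign depends only on correctness, so a failed proof is never reinforced.
        advantages.append(magnitude if ok else -magnitude)
    return advantages


if __name__ == "__main__":
    print(equal_right_advantages([True, False, False], [0.2, 0.9, 0.5]))
```

In this sketch a failed proof with a high auxiliary reward still receives a negative advantage, which is the conflict with plain group-relative normalization that the abstract points to.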

Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

与Delta思维：通过差异视觉推理策略激励强化学习

Authors: Shujian Gao, Yuan Wang, Jiangtao Yan, Zuxuan Wu, Yu-Gang Jiang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06801
Pdf link: https://arxiv.org/pdf/2601.06801
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical perception-reasoning decoupling. Existing paradigms, driven by text-centric outcome rewards and reasoning carried out in the language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain or, surprisingly, improve performance even when visual inputs are entirely removed. This reveals that these models degenerate into blind reasoners, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy (DVRP). DVRP introduces intrinsic supervision via visual triplets, comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing visual sensitivity) while minimizing divergence from perturbed inputs (ensuring visual robustness). By aligning reasoning variations strictly with the Delta of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
中文摘要 带可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。然而，将RLVR适配到多模态领域时，存在一个关键的“感知-推理解耦”问题。现有范式由以文本为中心的结果奖励驱动，并在语言媒介中进行推理，无意中鼓励模型绕过视觉感知。我们通过盲测实验验证了这一点：即使完全去除视觉输入，最先进的策略仍能维持甚至意外地提升性能。这表明这些模型退化成了“盲目推理者”，利用语言先验来生成看似合理的答案，而非关注视觉证据。为此，我们提出了 Thinking with Deltas（以差异思考），这是一个由差分视觉推理策略（DVRP）驱动的框架。DVRP通过视觉三元组引入内在监督，这些三元组包括原始输入、掩蔽输入和扰动输入。它优化模型以最大化与掩蔽输入的推理差异（强制视觉敏感性），同时最小化与扰动输入的推理差异（确保视觉鲁棒性）。通过严格将推理变化与视觉信息的差异（Delta）对齐，DVRP从本质上增强了视觉理解能力，并在通用和医学基准测试上显著优于最先进方法，无需外部标注或辅助工具。

Code Evolution for Control: Synthesizing Policies via LLM-Driven Evolutionary Search

代码演化以实现控制：通过大型语言模型驱动的进化搜索综合策略

Authors: Ping Guo, Chao Li, Yinglan Feng, Chaoning Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06845
Pdf link: https://arxiv.org/pdf/2601.06845
Abstract Designing effective control policies for autonomous systems remains a fundamental challenge, traditionally addressed through reinforcement learning or manual engineering. While reinforcement learning has achieved remarkable success, it often suffers from high sample complexity, reward shaping difficulties, and produces opaque neural network policies that are hard to interpret or verify. Manual design, on the other hand, requires substantial domain expertise and struggles to scale across diverse tasks. In this work, we demonstrate that LLM-driven evolutionary search can effectively synthesize interpretable control policies in the form of executable code. By treating policy synthesis as a code evolution problem, we harness the LLM's prior knowledge of programming patterns and control heuristics while employing evolutionary search to explore the solution space systematically. We implement our approach using EvoToolkit, a framework that seamlessly integrates LLM-driven evolution with customizable fitness evaluation. Our method iteratively evolves populations of candidate policy programs, evaluating them against task-specific objectives and selecting superior individuals for reproduction. This process yields compact, human-readable control policies that can be directly inspected, modified, and formally verified. This work highlights the potential of combining foundation models with evolutionary computation for synthesizing trustworthy control policies in autonomous systems. Code is available at this https URL.
中文摘要 为自主系统设计有效的控制策略仍是一个根本性的挑战，传统上通过强化学习或手工工程来解决。虽然强化学习取得了显著成功，但它常常存在高样本复杂度、奖励塑造困难以及产生难以解读或验证的不透明神经网络策略。而手工设计则需要丰富的领域专业知识，难以跨越多样化任务进行扩展。在本研究中，我们展示了基于LLM驱动的进化搜索能够有效综合可解释的控制策略，以可执行代码的形式呈现。通过将策略综合视为代码演化问题，我们利用LLM对编程模式和控制启发式的既有知识，同时运用进化搜索系统性探索解空间。我们采用EvoToolkit实现这一方法，该框架无缝整合了基于LLM的演进与可定制的适应度评估。我们的方法通过迭代演进候选政策项目的群体，根据任务特定目标进行评估，并挑选更优秀的个体进行繁殖。这一过程产生了简洁、易于人类阅读的控制策略，可以直接检查、修改并正式验证。这项工作强调了将基础模型与进化计算结合，在自主系统中综合可信控制策略的潜力。代码可在此 https URL 访问。

A Brain-like Synergistic Core in LLMs Drives Behaviour and Learning

大型语言模型中类似大脑的协同核心驱动行为和学习

Authors: Pedro Urbina-Rodriguez, Zafeirios Fountas, Fernando E. Rosas, Jun Wang, Andrea I. Luppi, Haitham Bou-Ammar, Murray Shanahan, Pedro A. M. Mediano
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06851
Pdf link: https://arxiv.org/pdf/2601.06851
Abstract The independent evolution of intelligence in biological and artificial systems offers a unique opportunity to identify its fundamental computational principles. Here we show that large language models spontaneously develop synergistic cores -- components where information integration exceeds individual parts -- remarkably similar to those in the human brain. Using principles of information decomposition across multiple LLM model families and architectures, we find that areas in middle layers exhibit synergistic processing while early and late layers rely on redundancy, mirroring the informational organisation in biological brains. This organisation emerges through learning and is absent in randomly initialised networks. Crucially, ablating synergistic components causes disproportionate behavioural changes and performance loss, aligning with theoretical predictions about the fragility of synergy. Moreover, fine-tuning synergistic regions through reinforcement learning yields significantly greater performance gains than training redundant components, yet supervised fine-tuning shows no such advantage. This convergence suggests that synergistic information processing is a fundamental property of intelligence, providing targets for principled model design and testable predictions for biological intelligence.
中文摘要 生物与人工系统中智能的独立进化为识别其基本计算原理提供了独特机会。我们展示了大型语言模型自发发展协同核心——信息整合超过单个部分的组成部分——与人脑中的核心极为相似。利用跨多个大型语言模型家族和架构的信息分解原理，我们发现中间层区域表现出协同处理，而早期和晚期层依赖冗余，这反映了生物大脑中的信息组织。这种组织通过学习产生，在随机初始化的网络中并不存在。关键是，削弱协同成分会导致行为变化和性能损失，这与理论预测中协同脆弱性相符。此外，通过强化学习微调协同区域带来的性能提升显著优于训练冗余组件，但监督微调则没有此优势。这一趋同表明协同信息处理是智能的基本特性，为有原则的模型设计和生物智能的可测试预测提供了目标。

Personality-Aware Reinforcement Learning for Persuasive Dialogue with LLM-Driven Simulation

基于LLM驱动的模拟的说服性对话中的人格感知强化学习

Authors: Donghuo Zeng, Roberto Legaspi, Kazushi Ikeda
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.06877
Pdf link: https://arxiv.org/pdf/2601.06877
Abstract Effective persuasive dialogue agents adapt their strategies to individual users, accounting for the evolution of their psychological states and intentions throughout conversations. We present a personality-aware reinforcement learning approach comprising three main modules: (1) a Strategy-Oriented Interaction Framework, which serves as an agenda-based strategy controller that selects strategy-level actions and generates responses via Maximal Marginal Relevance (MMR) retrieval to ensure contextual relevance, diversity, and scalable data generation; (2) Personality-Aware User Representation Learning, which produces an 81-dimensional mixed-type embedding predicted at each turn from recent exchanges and appended to the reinforcement learning state; and (3) a Dueling Double DQN (D3QN) model and Reward Prediction, in which the policy is conditioned on dialogue history and turn-level personality estimates and trained using a composite reward incorporating agreement intent, donation amount, and change-of-mind penalties. We use an agenda-based LLM simulation pipeline to generate diverse interactions, and personality estimates are inferred from the generated utterances. Experiments on the PersuasionForGood (P4G) dataset augmented with simulated dialogues reveal three main findings: (i) turn-level personality conditioning improves policy adaptability and cumulative persuasion rewards; (ii) LLM-driven simulation enhances generalization to unseen user behaviors; and (iii) incorporating a change-of-mind penalty reduces post-agreement retractions while slightly improving donation outcomes. These results demonstrate that structured interaction, dynamic personality estimation, and behaviorally informed rewards together yield more effective persuasive policies.
中文摘要 有效的说服性对话代理会根据个体调整策略，考虑他们在对话中心理状态和意图的演变。我们提出了一种人格感知强化学习方法，包含三个主要模块：（1）战略导向互动框架，作为基于议程的策略控制器，通过最大边际相关性（MMR）检索选择策略级行动并生成响应，以确保上下文相关性、多样性和可扩展的数据生成;（2）人格感知用户表征学习，生成一个81维混合类型嵌入，每次预测基于近期交换并附加于强化学习状态;以及（3）双重DQN（D3QN）模型和奖励预测，其中政策基于对话历史和回合级人格估计，并使用包含协议意图、捐赠金额和改变主意惩罚的复合奖励进行训练。我们使用基于议程的大型语言模型模拟流水线生成多样化交互，并从生成的话语中推断出人格估计。在PersuasionForGood（P4G）数据集上进行的实验，结合模拟对话，揭示了三个主要发现：（i）轮流级人格条件反射提升政策适应性和累积说服奖励;（ii）基于LLM的仿真增强了对未被察觉用户行为的泛化;以及（iii）加入改变主意的惩罚，减少协议后撤稿，同时略微改善捐赠结果。这些结果表明，结构化互动、动态人格估计和行为知情奖励共同能产生更有效的说服政策。
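
The abstract mentions Maximal Marginal Relevance (MMR) retrieval for response selection. A minimal sketch of the standard greedy MMR procedure is shown below, assuming precomputed similarity scores. The function name, the `lam` trade-off parameter, and the toy inputs are illustrative and do not come from the paper.

```python
def mmr_select(query_sim, pairwise_sim, k, lam=0.7):
    """Greedy MMR: balance relevance to the query against redundancy.

    query_sim:    query_sim[i] is the similarity of candidate i to the query
    pairwise_sim: pairwise_sim[i][j] is the similarity between candidates i and j
    k:            number of candidates to select
    lam:          1.0 means pure relevance, lower values favor diversity
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    q = [0.9, 0.85, 0.3]
    p = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
    print(mmr_select(q, p, k=2, lam=0.5))
```

With `lam=0.5` the second pick skips the near-duplicate candidate in favor of a less relevant but more diverse one, which is the behavior that supports the "relevance and diversity" goal stated in the abstract.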

Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

分布清晰度：大型语言模型中强化学习友好性的隐藏驱动因素

Authors: Shaoning Sun, Mingzhu Cai, Huang He, Bingjin Chen, Siqi Bao, Yujiu Yang, Hua Wu, Haifeng Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06911
Pdf link: https://arxiv.org/pdf/2601.06911
Abstract Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis (from phenomenon to mechanism to interpretation), we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient ($S$) and demonstrate that (1) high $S$ correlates strongly with RL performance; (2) low $S$ is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-$S$ samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.
中文摘要 语言模型家族在受益于强化学习的能力上存在显著差异：在相同训练条件下，Qwen等模型取得了显著进步，而像Llama这样的模型则仅有有限的提升。作为对以数据为中心的方法的补充，我们揭示这种差异反映了一种隐藏的结构性质：概率空间中的分布清晰度。通过三阶段分析（从现象到机制再到解释），我们发现，对强化学习友好的模型在为正确与错误回答分配概率时表现出类内紧密性和类间分离性。我们用 Silhouette 系数（$S$）来量化这种清晰度，并证明（1）高 $S$ 与强化学习表现高度相关；（2）低 $S$ 与严重的逻辑错误和推理不稳定性有关。为了验证这一特性，我们引入了一种轮廓感知重加权策略，在训练过程中优先考虑低 $S$ 样本。六个数学基准测试的实验显示，所有模型家族均有一致的提升，在 AIME24 上提升最高达 5.9 分。我们的研究确立了分布清晰度是强化学习友好性背后一个基本且可训练的特性。
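
The Silhouette Coefficient used above has a standard definition: for each point, $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to points in the same class and $b$ is the mean distance to the nearest other class. The sketch below applies it to one-dimensional scores with a correct/incorrect label. Treating each response as a single scalar (for example a sequence log-probability) is an assumption made here for illustration; the abstract does not specify the feature space.

```python
def silhouette_coefficient(values, labels):
    """Mean silhouette over 1-D points with two class labels."""
    n = len(values)
    scores = []
    for i in range(n):
        same = [abs(values[i] - values[j])
                for j in range(n) if j != i and labels[j] == labels[i]]
        other = [abs(values[i] - values[j])
                 for j in range(n) if labels[j] != labels[i]]
        if not same or not other:
            # Convention: singleton or missing classes contribute 0.
            scores.append(0.0)
            continue
        a = sum(same) / len(same)    # intra-class compactness
        b = sum(other) / len(other)  # inter-class separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


if __name__ == "__main__":
    # Hypothetical log-probabilities; label 1 = correct, 0 = incorrect.
    print(silhouette_coefficient([-0.1, -0.2, -3.0, -3.1], [1, 1, 0, 0]))
```

Values near 1 indicate well-separated correct and incorrect responses, and values near or below 0 indicate overlap. A reweighting scheme could upweight prompts with low values, though the exact weighting used in the paper is not given in the abstract.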

TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

TreePS-RAG：基于树的过程监督，用于智能RAG中的强化学习

Authors: Tianhua Zhang, Kun Li, Junan Li, Yunxiang Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.06922
Pdf link: https://arxiv.org/pdf/2601.06922
Abstract Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
中文摘要 能动检索增强生成（RAG）将问题回答表述为推理与信息检索之间的多步交互，最近通过基于结果的监督强化学习（RL）得到了发展。虽然有效，但仅依赖稀疏的最终奖励限制了分级的信用分配，并为中级推理和行动提供了薄弱的指导。近期的努力探索了过程级监督，但通常依赖离线构建的训练数据，这可能导致分布偏移，或需要昂贵的中间注释。我们介绍TreePS-RAG，一个在线、基于树的强化学习框架，用于代理RAG，支持分级分配，同时保留标准的仅结果奖励。我们的核心见解是将代理式RAG推理建模为一棵展开树，每个推理步骤自然映射到一个节点。这种树状结构允许通过蒙特卡洛估计其后继结果的阶梯效用进行估计，从而在无需中间标签的情况下获得细粒度的工艺优势。为了使这一范式切实可行，我们引入了一种高效的在线树构建策略，在有限的计算预算下保持探索多样性。其部署成本与Search-R1等强基线相当，七个多跳及通用质量保证基准测试的实验显示，TreePS-RAG在结果监督和流程监督的强化学习方法中，持续且显著地优于结果监督和领先过程监督的强化学习方法。
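
The key idea in the abstract is that a step's utility can be estimated by Monte Carlo averaging over the outcomes of its descendants in a rollout tree. A minimal sketch follows. The `Node` class, the use of a plain mean over leaf rewards, and the "child value minus parent value" advantage are assumptions for illustration and may differ from the paper's estimator.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One reasoning or retrieval step; only leaves carry an outcome reward."""
    reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves below (or at) this node."""
    if not node.children:
        return [node.reward]
    rewards = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Monte Carlo value: mean outcome reward over descendant leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def step_advantages(parent: Node) -> List[float]:
    """Advantage of each child step relative to its parent's value."""
    baseline = value(parent)
    return [value(child) - baseline for child in parent.children]


if __name__ == "__main__":
    step_a = Node(children=[Node(reward=1.0), Node(reward=1.0)])
    step_b = Node(children=[Node(reward=0.0), Node(reward=1.0)])
    root = Node(children=[step_a, step_b])
    print(step_advantages(root))
```

Only final outcome rewards are used, yet each intermediate step still receives its own signal, which matches the abstract's claim of process-level credit without intermediate labels.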

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

X-Coder：通过全合成任务、解决方案和测试推动竞技编程的发展

Authors: Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.06953
Pdf link: https://arxiv.org/pdf/2601.06953
Abstract Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.
中文摘要 竞争性编程对代码大型语言模型来说是巨大的挑战，因为它需要复杂的推理和高逻辑复杂度。然而，当前的Code LLM仍然高度依赖真实世界的数据，这限制了它们的可扩展性。本文探讨了一种完全合成的方法：通过完全生成的任务、解决方案和测试案例训练代码大型语言模型，从而在不依赖真实世界数据的情况下赋能代码推理模型。为此，我们利用基于特征的综合技术提出了一种名为SynthSmith的新型数据综合流水线。SynthSmith在生成多样化且具有挑战性的任务方面展现出强大潜力，同时提供经过验证的解决方案和测试，支持监督式微调和强化学习。基于所提出的合成SFT和RL数据集，我们介绍了X-Coder模型系列，该系列在LiveCodeBench v5上达到了62.9 avg@8，v6上达到55.8，尽管参数仅有7B，却优于DeepCoder-14B-Preview和AReal-boba2-14B。深入分析显示，缩放定律在我们的合成数据集中成立，我们探讨哪些维度更适合扩展。我们还进一步介绍了以代码为中心的强化学习，并通过详细的消融和分析突出影响性能的关键因素。我们的发现表明，扩展高质量合成数据并采用分阶段训练，可以大幅提升代码推理能力，同时减少对现实编码数据的依赖。

MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

MEDVISTAGYM：通过工具集成强化学习实现医学图像思维的可扩展培训环境

Authors: Meng Lu, Yuxing Lu, Yuchen Zhuang, Megan Mullins, Yang Xie, Guanghua Xiao, Charles Fleming, Wenqi Shi, Xuan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07107
Pdf link: https://arxiv.org/pdf/2601.07107
Abstract Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.
中文摘要 视觉语言模型（VLMs）在一般图像理解方面表现优异，但在通过迭代视觉互动进行多步推理时，在处理医学图像时难以思考。医学VLM通常依赖静态视觉嵌入和单次推断，防止模型在推理过程中重新审视、验证或完善视觉证据。虽然工具集成推理提供了一条有前景的道路，但开源的VLM缺乏学习多模态医学推理中有效工具选择、调用和协调的培训基础设施。我们介绍MedVistaGym，一个可扩展且互动的培训环境，激励工具集成的视觉推理进行医学图像分析。MedVistaGym 为VLM提供判断何时及使用何种工具、定位任务相关图像区域，并将单个或多个子图像证据整合进交织的多模态推理中，实现统一的可执行智能训练接口。利用MedVistaGym，我们训练MedVistaGym-R1通过轨迹采样和端到端强化学习，将工具使用与代理推理相结合。在六项医疗VQA基准中，MedVistaGym-R1-8B的基准数据比同等工具增强基线高出19.10%，提升至24.21%，证明结构化的能动性训练——而非单纯工具访问——能够有效实现医学图像分析的工具整合推理。

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

通过强大的大型语言模型赋能多代理强化学习框架提升云网络韧性

Authors: Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07122
Pdf link: https://arxiv.org/pdf/2601.07122
Abstract While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules--ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration--performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense. We will release our framework to the community, facilitating the advancement of robust and autonomous defense in cloud networks.
中文摘要 虽然虚拟化和资源池赋予云网络结构灵活性和弹性扩展性，但它们不可避免地扩大了攻击面，并挑战了网络韧性。基于强化学习（RL）的防御策略已被开发出来，以优化资源部署和隔离策略，在对抗条件下实现，通过维护和恢复网络可用性来增强系统韧性。然而，现有方法缺乏鲁棒性，需要重新训练以适应网络结构、节点规模、攻击策略和攻击强度的动态变化。此外，缺乏人工在环（HITL）支持限制了解释性和灵活性。为解决这些局限性，我们提出了CyberOps-Bots，一种由大型语言模型（LLMs）赋能的分层多智能体强化学习框架。CyberOps-Bots 受 MITRE ATT&CK 的战术-技术模型启发，采用两层架构：（1）一个高级 LLM 代理，包含四个模块——反应规划、基于 IPDRR 的感知、长期短期记忆和动作/工具集成——执行全局意识、人类意图识别和战术规划;（2）低级别强化学习代理通过异构分离预训练开发，在局部网络区域内执行原子防御行动。这种协同效应既保持了LLM的适应性和可解释性，又确保了强化学习的可靠执行。在真实云数据集上的实验显示，与最先进的算法相比，CyberOps-Bots在不重新训练的情况下切换场景时，网络可用性保持率高出68.5%，启动性能提升34.7%。据我们所知，这是首个建立支持HITL的稳健LLM-RL云防御框架的研究。我们将向社区发布框架，促进云网络中稳健自主防御的发展。

ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning

ENTRA：基于熵的冗余避免在大型语言模型推理中

Authors: Ruichu Cai, Haopeng Du, Qingwen Lin, Yutong Chen, Zijian Li, Boyan Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07123
Pdf link: https://arxiv.org/pdf/2601.07123
Abstract Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks. This leads to substantial computational overhead with limited performance gain, primarily due to redundant verification and repetitive generation. While prior work typically constrains output length or optimizes correctness, such coarse supervision fails to guide models toward concise yet accurate inference. In this paper, we propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance. ENTRA first estimates the token-level importance using a lightweight Bidirectional Importance Estimation (BIE) method, which accounts for both prediction confidence and forward influence. It then computes a redundancy reward based on the entropy of low-importance tokens, normalized by its theoretical upper bound, and optimizes this reward via reinforcement learning. Experiments on mathematical reasoning benchmarks demonstrate that ENTRA reduces output length by 37% to 53% with no loss, and in some cases gains, in accuracy. Our approach offers a principled and efficient solution to reduce overthinking in LRMs, and provides a generalizable path toward redundancy-aware reasoning optimization.
中文摘要 大型推理模型（LRM）常常存在过度思考的问题，即使是简单任务也会产生不必要的冗长推理链。这导致计算开销巨大，性能提升有限，主要由于冗余验证和重复生成。虽然以往工作通常限制输出长度或优化正确性，但这种粗糙监督无法引导模型实现简洁而准确的推断。本文提出了ENTRA，一种基于熵的训练框架，在保持性能的同时抑制冗余推理。ENTRA首先使用轻量级双向重要性估计（BIE）方法估算代币级重要性，该方法同时考虑预测置信度和前瞻影响。然后，它基于低重要性代币的熵计算冗余奖励，并由理论上界归一化，并通过强化学习优化该奖励。数学推理基准测试的实验表明，ENTRA能将输出长度缩短37%至53%，且准确率无损失，甚至在某些情况下有所提升。我们的方法为减少长老模型中的过度思考提供了原则性且高效的解决方案，并为实现冗余意识推理优化提供了可通用的路径。
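
The redundancy signal described above rests on token entropy normalized by its theoretical upper bound. For a distribution over $V$ outcomes that bound is $\log V$, so the normalized entropy lies in $[0, 1]$. The sketch below computes that quantity and averages it over tokens flagged as low-importance. The importance scores, the `threshold`, and the averaging step are hypothetical stand-ins, since the abstract does not give the BIE method or the exact reward formula.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by its upper bound log(len(probs))."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def mean_low_importance_entropy(token_dists, importance, threshold=0.5):
    """Average normalized entropy over tokens whose importance is below threshold.

    token_dists: one next-token probability distribution per generated token
    importance:  one importance score per token (hypothetical, stands in for BIE)
    """
    picked = [normalized_entropy(dist)
              for dist, score in zip(token_dists, importance)
              if score < threshold]
    return sum(picked) / len(picked) if picked else 0.0


if __name__ == "__main__":
    dists = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1]]
    print(mean_low_importance_entropy(dists, importance=[0.1, 0.2, 0.9]))
```

How this quantity is turned into a reward (its sign and scaling) is left open here because the abstract does not state it.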

ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval System

ReinPool：强化学习池，用于检索系统的多向量嵌入

Authors: Sungguk Cha, DongWook Kim, Mintae Kim, Youngsub Han, Byoung-Ki Jeon, Sangyeob Lee
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.07125
Pdf link: https://arxiv.org/pdf/2601.07125
Abstract Multi-vector embedding models have emerged as a powerful paradigm for document retrieval, preserving fine-grained visual and textual details through token-level representations. However, this expressiveness comes at a staggering cost: storing embeddings for every token inflates index sizes by over 1000× compared to single-vector approaches, severely limiting scalability. We introduce ReinPool, a reinforcement learning framework that learns to dynamically filter and pool multi-vector embeddings into compact, retrieval-optimized representations. By training with an inverse retrieval objective and NDCG-based rewards, ReinPool identifies and retains only the most discriminative vectors without requiring manual importance annotations. On the Vidore V2 benchmark across three vision-language embedding models, ReinPool compresses multi-vector representations by 746-1249× into single vectors while recovering 76-81% of full multi-vector retrieval performance. Compared to static mean pooling baselines, ReinPool achieves 22-33% absolute NDCG@3 improvement, demonstrating that learned selection significantly outperforms heuristic aggregation.
中文摘要 多向量嵌入模型已成为文档检索的强大范式，通过词元级表示保留细粒度的视觉和文本细节。然而，这种表达力代价巨大：为每个词元存储嵌入会使索引规模相比单向量方法膨胀超过1000倍，严重限制了可扩展性。我们提出了 ReinPool，这是一个强化学习框架，能够学习动态过滤和池化多向量嵌入，形成紧凑且面向检索优化的表示。通过采用逆检索目标和基于NDCG的奖励进行训练，ReinPool仅识别并保留最具判别性的向量，无需人工重要性标注。在Vidore V2基准测试中，针对三种视觉语言嵌入模型，ReinPool将多向量表示压缩746至1249倍为单向量，同时恢复了76%至81%的完整多向量检索性能。与静态平均池化基线相比，ReinPool实现了22%至33%的绝对NDCG@3提升，证明学习式选择显著优于启发式聚合。
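
Both the reward and the reported metric above are based on NDCG. A minimal sketch of NDCG@k with linear gain is shown below for reference. Some implementations use an exponential gain ($2^{rel} - 1$) instead, and which variant the paper uses is not stated in the abstract.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, linear gain."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering.

    relevances: graded relevance of each retrieved item, in ranked order
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The single relevant document is ranked third.
    print(ndcg_at_k([0, 0, 1], k=3))
```

Here the ideal ordering is built from the same relevance list, which assumes every relevant document appears somewhere in it.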

Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning

通过强化学习生成易于合成的小分子荧光团支架

Authors: Ruhi Sayana, Kate Callon, Jennifer Xu, Jonathan Deutsch, Steven Chu, James Zou, John Janetzko, Rabindra V. Shivnaraine, Kyle Swanson
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07145
Pdf link: https://arxiv.org/pdf/2601.07145
Abstract Developing new fluorophores for advanced imaging techniques requires exploring new chemical space. While generative AI approaches have shown promise in designing novel dye scaffolds, prior efforts often produced synthetically intractable candidates due to a lack of reaction constraints. Here, we developed SyntheFluor-RL, a generative AI model that employs known reaction libraries and molecular building blocks to create readily synthesizable fluorescent molecule scaffolds via reinforcement learning. To guide the generation of fluorophores, SyntheFluor-RL employs a scoring function built on multiple graph neural networks (GNNs) that predict key photophysical properties, including photoluminescence quantum yield, absorption, and emission wavelengths. These outputs are dynamically weighted and combined with a computed pi-conjugation score to prioritize candidates with desirable optical characteristics and synthetic feasibility. SyntheFluor-RL generated 11,590 candidate molecules, which were filtered to 19 structures predicted to possess dye-like properties. Of the 19 molecules, 14 were synthesized and 13 were experimentally confirmed. The top three were characterized, with the lead compound featuring a benzothiadiazole chromophore and exhibiting strong fluorescence (PLQY = 0.62), a large Stokes shift (97 nm), and a long excited-state lifetime (11.5 ns). These results demonstrate the effectiveness of SyntheFluor-RL in the identification of synthetically accessible fluorophores for further development.
中文摘要 开发用于先进成像技术的新荧光团需要探索新的化学空间。尽管生成式人工智能方法在设计新型染料支架方面展现出潜力，但以往的尝试往往因缺乏反应约束而产生合成上难以处理的候选方案。在这里，我们开发了SyntheFluor-RL，一种生成式AI模型，利用已知的反应文库和分子构建模块，通过强化学习创建易于合成的荧光分子支架。为指导荧光团的生成，SyntheFluor-RL采用基于多图神经网络（GNN）的评分函数，预测关键的光物理特性，包括光致发光的量子产额、吸收和发射波长。这些输出经过动态加权，并结合计算出的π共轭分数，以优先考虑具有理想光学特性和合成可行性的候选。SyntheFluor-RL生成了11,590个候选分子，经过过滤，筛选出19个预测具有染料样特性的结构。在这19个分子中，有14个已合成，13个经过实验验证。前三者被鉴定，先导化合物含有苯并噻二唑色团，表现出强烈荧光（PLQY = 0.62）、较大的斯托克斯位移（97纳米）以及较长的激发态寿命（11.5纳秒）。这些结果证明了SyntheFluor-RL在识别合成可及荧光团以供进一步开发方面的有效性。

Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling

奖励创造力：一种以人为本的生成奖励模型，用于讲故事中的强化学习

Authors: Zhaoyan Li, Hang Lei, Yujia Wang, Lanbo Liu, Hao Liu, Liang Yu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07149
Pdf link: https://arxiv.org/pdf/2601.07149
Abstract While Large Language Models (LLMs) can generate fluent text, producing high-quality creative stories remains challenging. Reinforcement Learning (RL) offers a promising solution but faces two critical obstacles: designing reliable reward signals for subjective storytelling quality and mitigating training instability. This paper introduces the Reinforcement Learning for Creative Storytelling (RLCS) framework to systematically address both challenges. First, we develop a Generative Reward Model (GenRM) that provides multi-dimensional analysis and explicit reasoning about story preferences, trained through supervised fine-tuning on demonstrations with reasoning chains distilled from strong teacher models, followed by GRPO-based refinement on expanded preference data. Second, we introduce an entropy-based reward shaping strategy that dynamically prioritizes learning on confident errors and uncertain correct predictions, preventing overfitting on already-mastered patterns. Experiments demonstrate that GenRM achieves 68% alignment with human creativity judgments, and RLCS significantly outperforms strong baselines including Gemini-2.5-Pro in overall story quality. This work provides a practical pipeline for applying RL to creative domains, effectively navigating the dual challenges of reward modeling and training stability.
中文摘要 虽然大型语言模型（LLMs）能够生成流畅的文本，但制作高质量的创意故事依然充满挑战。强化学习（RL）提供了一个有前景的解决方案，但面临两个关键障碍：设计可靠的主观叙事质量奖励信号，以及减轻训练不稳定性。本文引入了创造性讲故事强化学习（RLCS）框架，以系统地应对这两个挑战。首先，我们开发了一个生成奖励模型（GenRM），通过监督微调演示进行多维分析和显式推理，基于强教师模型提炼的推理链，随后基于GRPO对扩展偏好数据进行细化。其次，我们引入基于熵的奖励塑造策略，动态优先学习于自信误差和不确定正确预测上，防止对已掌握模式的过度拟合。实验显示，GenRM与人类创造力判断的高度契合度达到68%，RLCS在整体故事质量上显著优于包括Gemini-2.5-Pro在内的强基线。这项工作为将强化学习应用于创造性领域提供了实用的流程，有效应对奖励建模和训练稳定性的双重挑战。

Agents of Diffusion: Enhancing Diffusion Language Models with Multi-Agent Reinforcement Learning for Structured Data Generation (Extended Version)

扩散代理：通过多智能体强化学习增强扩散语言模型以实现结构化数据生成（扩展版）

Authors: Aja Khanal, Kaushik T. Ranade, Rishabh Agrawal, Kalyan S. Basu, Apurva Narayan
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.07152
Pdf link: https://arxiv.org/pdf/2601.07152
Abstract Generating high-quality structured data, such as JSON records, remains a fundamental challenge for large language models (LLMs), particularly when semantic richness must coexist with strict schema adherence. While autoregressive LLMs offer strong structural consistency, they often struggle with semantic variation and output diversity. In contrast, diffusion language models (DLMs) introduce powerful mechanisms for semantic richness and bidirectional decoding, yet lack the inductive biases needed for reliable structure preservation. We present Agents of Diffusion (AoD), a novel framework that unifies the generative flexibility of DLMs with the reasoning capabilities of autoregressive models through language-mediated reinforcement learning. AoD frames structured text generation as a multi-agent alignment process, where a prompt optimization agent collaborates with a judge agent to iteratively guide a DLM using natural language feedback. This approach enables controllable, schema-consistent generation without modifying model parameters or relying on handcrafted constraints. AoD advances the state of controllable generation by demonstrating that diffusion models, when supervised by cooperative agents, can achieve both high semantic novelty and structural fidelity. Across multiple structured data benchmarks, AoD consistently outperforms diffusion and autoregressive baselines, establishing a new path forward for structure-aware, diversity-enhanced text synthesis.
中文摘要 生成高质量结构化数据（如JSON记录）仍是大型语言模型（LLMs）面临的根本挑战，尤其是在语义丰富性必须与严格模式遵循并存时。虽然自回归大型语言模型提供了强大的结构一致性，但它们常常在语义变异和输出多样性方面遇到困难。相比之下，扩散语言模型（DLMs）引入了强大的语义丰富性和双向解码机制，但缺乏可靠结构保存所需的归纳偏置。我们提出了扩散代理（Agents of Diffusion，简称 AoD），这是一个新颖框架，通过语言介导的强化学习，将 DLM 的生成灵活性与自回归模型的推理能力相结合。AoD将结构化文本生成框架为多代理对齐过程，提示优化代理与裁判代理协作，利用自然语言反馈迭代引导DLM。这种方法能够实现可控、模式一致的生成，无需修改模型参数或依赖手工约束。AoD通过证明扩散模型在协作代理监督下，能够实现高度语义新颖性和结构忠实度，从而推动可控生成的进展。在多个结构化数据基准测试中，AoD持续优于扩散和自回归基线，为结构感知和多样性增强的文本综合开辟了新路径。

AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

AscendKernelGen：基于LLM的神经处理单元内核生成的系统研究

Authors: Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, Dongyang Tao, Xiansong Huang, Fan Xu, Feidiao Yang, Yao Lu, Chang-Dong Wang, Yutong Lu, Weicheng Xue, Bin Zhou, Yonghong Tian
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07160
Pdf link: https://arxiv.org/pdf/2601.07160
Abstract To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.
中文摘要 为了满足日益增长的计算效率需求，神经处理单元（NPU）已成为现代人工智能基础设施中的关键。然而，要充分发挥其潜力，需要使用厂商专用的领域特定语言（DSL）开发高性能计算内核，这项任务需要深厚的硬件专业知识且劳动强度高。虽然大型语言模型（LLM）在通用代码生成方面展现出潜力，但它们在NPU领域面临严格的约束和训练数据的稀缺性。我们的初步研究显示，最先进的通用大型语言模型无法为Ascend NPU生成函数式复杂核，成功率几乎为零。为应对这些挑战，我们提出了AscendKernelGen，一个用于NPU内核开发的生成评估集成框架。我们介绍了Ascend-CoT，一个高质量数据集，结合了基于真实内核实现的思维链推理，以及KernelGen-LM，一个通过监督微调和强化学习并结合执行反馈训练的领域自适应模型。此外，我们设计了 NPUKernelBench，这是一个综合基准测试，用于评估不同复杂度水平下的编译、正确性和性能。实验结果表明，我们的方法显著弥合了通用大型语言模型与硬件特定编码之间的鸿沟。具体来说，复杂的二级核的编译成功率从0%提升到95.5%（Pass@10），而功能正确度则达到64.3%，而基线则完全失败。这些结果凸显了领域特定推理和严谨评估在自动化加速器感知代码生成中的关键作用。
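
The abstract above reports compilation success as Pass@10 but does not say how the figure is computed. As a hedged illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation. The function name `pass_at_k` and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes,
    given that c of the n generated samples passed.

    Assumption: samples are drawn without replacement from the n
    generations; this is the usual estimator, not necessarily the
    one used by the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that all k draws fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 20 generations for one kernel, 3 compile.
    print(pass_at_k(n=20, c=3, k=10))
```

A benchmark-level score would typically average this per-problem estimate over all kernels in the benchmark.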

Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

基于流的任务推断和特征过度泛化的自适应纠正的离线元强化学习

Authors: Min Wang, Xin Li, Mingzhong Wang, Hasnaa Bennis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07164
Pdf link: https://arxiv.org/pdf/2601.07164
Abstract Offline meta-reinforcement learning (OMRL) combines the strengths of learning from diverse datasets in offline RL with the adaptability to new tasks of meta-RL, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers extrapolation errors due to out-of-distribution (OOD) actions, compromised by broad task distributions and Markov Decision Process (MDP) ambiguity in meta-RL setups. Existing research indicates that the generalization of the $Q$ network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the $Q$ value into feature and weight components, observing that while decomposition enhances adaptability and convergence in the case of high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed $Q$ values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term "feature overgeneralization". To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.
中文摘要 离线元强化学习（OMRL）结合了离线强化学习中从多样化数据集学习的优势，以及对元强化学习新任务的适应性，承诺强化学习代理能够安全高效地获取知识。然而，OMRL仍因分布外（OOD）动作而存在外推误差，且受任务分布宽泛和元强化学习（meta-RL）中马尔可夫决策过程（MDP）模糊性影响。现有研究表明，$Q$网络的推广会影响离线强化学习中的外推误差。本文通过将$Q$价值分解为特征和权重组成部分，研究了这一关系，观察到虽然分解在高质量数据中增强了适应性和收敛性，但在复杂任务中常常导致策略退化或崩溃。我们观察到，分解后的$Q$值在特征遇到OOD样本时会带来较大的估计偏差，我们称之为“特征过度泛化”。为解决这一问题，我们提出了FLORA，通过建模特征分布并估计其不确定性来识别OOD样本。FLORA集成了返回反馈机制，以自适应地调整特征组件。此外，为了学习精确的任务表示，FLORA通过可逆变换链显式建模复杂任务分布。我们通过理论和实证证明，FLORA在不同环境中相比基线实现了快速适应和元政策改进。

Structured Reasoning for Large Language Models

大型语言模型的结构化推理

Authors: Jinyi Han, Zixiang Di, Zishang Jiang, Ying Liao, Jiaqing Liang, Yongqi Wang, Yanghua Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07180
Pdf link: https://arxiv.org/pdf/2601.07180
Abstract Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
中文摘要 大型语言模型（LLM）通过生成长链思考实现了强劲的性能，但较长的路径总会引入冗余或无效的推理步骤。一个典型的行为是，即使他们已经答对了，也常常进行不必要的核对和修改。这一局限源于推理轨迹的无结构性质以及缺乏针对关键推理能力的有针对性监督。为此，我们提出了结构化推理（SCR）框架，将推理轨迹拆分为显式、可评估和可训练的组成部分。我们主要采用生成-验证-修订范式实现SCR。具体来说，我们构建结构化训练数据，并应用动态终止监督来指导模型决定何时终止推理。为避免不同推理能力的学习信号干扰，我们采用渐进式两阶段强化学习策略：第一阶段针对初始生成和自我验证，第二阶段侧重于修订。对三种骨干模型的广泛实验表明，SCR显著提升了推理效率和自我验证能力。此外，与现有推理范式相比，它最多可将输出令牌长度缩短50%。

Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

整合还是适应？PRISM：通过梯度集中解耦SFT和RL数据

Authors: Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07224
Pdf link: https://arxiv.org/pdf/2601.07224
Abstract While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22×. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
中文摘要 虽然混合监督微调（SFT）随后是强化学习（RL）已成为训练LLM代理的标准范式，但这些阶段之间有效的数据分配机制仍然大多缺乏探索。当前的数据仲裁策略常常依赖于表面启发式，未能准确诊断内在的学习需求。由于SFT通过模仿实现模式巩固，而强化学习通过探索驱动结构适应，数据与这些功能角色错位会导致严重的优化干扰。我们提出了PRISM，这是一个基于模式理论的动态感知框架，基于数据与模型现有知识的认知冲突程度进行仲裁。通过分析梯度的空间几何结构，PRISM识别出触发高空间集中度的数据为高冲突信号，需要强化学习进行结构重构。相比之下，产生散漫更新的数据则被路由到SFT以实现高效整合。WebShop 和 ALFWorld 上的大量实验表明，PRISM 实现了帕累托改进，超越了最先进的混合方法，同时将计算成本降低了最多 3.22 倍。我们的发现表明，基于内部优化方案进行数据解耦对于可扩展且稳健的代理对齐至关重要。

Group Pattern Selection Optimization: Let LRMs Pick the Right Pattern for Reasoning

群模式选择优化：让LRMs选择正确的模式进行推理

Authors: Hanbin Wang, Jingwei Song, Jinpeng Li, Fei Mi, Lifeng Shang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07238
Pdf link: https://arxiv.org/pdf/2601.07238
Abstract Large reasoning models (LRMs) exhibit diverse high-level reasoning patterns (e.g., direct solution, reflection-and-verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a limited set of dominant patterns. Through a systematic analysis, we identify substantial accuracy variance across these patterns on mathematics and science benchmarks, revealing that a model's default reasoning pattern is often sub-optimal for a given problem. To address this, we introduce Group Pattern Selection Optimization (GPSO), a reinforcement learning framework that extends GRPO by incorporating multi-pattern rollouts, verifier-guided optimal pattern selection per problem, and attention masking during optimization to prevent the leakage of explicit pattern suffixes into the learned policy. By exploring a portfolio of diverse reasoning strategies and optimizing the policy on the most effective ones, GPSO enables the model to internalize the mapping from problem characteristics to optimal reasoning patterns. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance gains across various model backbones and benchmarks, effectively mitigating pattern sub-optimality and fostering more robust, adaptable reasoning. All data and codes are available at this https URL.
中文摘要 大型推理模型（LRM）表现出多样的高层次推理模式（如直接解、反思与验证和多解探索），但主流训练方案隐含地偏向有限的主导模式。通过系统分析，我们识别出数学和科学基准测试中这些模式间存在显著的准确性差异，揭示了模型默认推理模式往往对特定问题不尽如人意。为此，我们引入了群模式选择优化（GPSO），这是一个强化学习框架，通过整合多模式展开、验证者引导的最优模式选择每个问题，以及优化过程中的注意力掩蔽，扩展了GRPO，以防止显式模式后缀泄漏到学习策略中。通过探索多样化的推理策略组合并优化最有效的策略，GPSO使模型能够内化从问题特征到最优推理模式的映射。大量实验表明，GPSO在各种模型骨干和基准测试中实现了持续且显著的性能提升，有效缓解模式次优性，促进了更稳健、适应性的推理。所有数据和代码均可在此 https URL 获取。
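
The abstract above describes GPSO as an extension of GRPO. For readers unfamiliar with GRPO, the sketch below shows only the group-relative advantage commonly associated with it: each rollout's reward is standardized against the other rollouts sampled for the same prompt. It does not implement GPSO's multi-pattern rollouts, verifier-guided pattern selection, or attention masking. The name `group_relative_advantages` and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for a single prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifier rewards for four rollouts of one problem.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When all rollouts in a group score the same, every advantage is zero, so that group contributes no learning signal.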

The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

置信二分法：工具使用智能体中校准偏差的分析与缓解

Authors: Weihao Xuan, Qingcheng Zeng, Heli Qi, Yunze Xiao, Junjue Wang, Naoto Yokoya
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07264
Pdf link: https://arxiv.org/pdf/2601.07264
Abstract Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.
中文摘要 基于大型语言模型（LLM）的自主智能体正在迅速发展以处理多回合任务，但确保其可信度仍是关键挑战。这种可信度的一个基本支柱是校准，指的是代理表达自信并可靠反映其实际表现的能力。虽然静态模型的校准已被充分验证，但其在工具集成的代理工作流中的动态性仍未被充分探索。本研究系统地研究工具使用代理的口头校准，揭示了工具类型驱动的基本置信度二分法。具体来说，我们的试点研究指出，证据工具（如网页搜索）系统性地因检索信息中的固有噪声而导致严重的过度自信，而验证工具（如代码解释器）则可以通过确定性反馈来支撑推理并减少校准错误。为了稳健地提升各类工具的校准，我们提出了一个强化学习（RL）微调框架，结合整体奖励设计基准，共同优化任务准确性和校准。我们证明，训练有素的代理不仅能实现更优异的校准，还能从本地训练环境稳健地泛化到噪声网络环境，以及数学推理等不同领域。我们的结果凸显了工具使用智能体的领域特定校准策略的必要性。更广泛地说，这项工作为构建能够可靠传达高风险现实部署中不确定性的自我意识代理奠定了基础。
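
The abstract above defines calibration as confidence that reliably reflects actual performance, but it does not name a metric. As one common way to quantify that gap, the sketch below computes Expected Calibration Error (ECE) over equal-width confidence bins. Treating ECE as the metric is an assumption, and the function name, bin count, and example data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: stated confidences in [0, 1], one per answer.
    correct:     1 if the matching answer was right, else 0.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical agent: always says 0.9 but is right 3 times out of 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

An overconfident agent, such as the one in the usage example, shows up as a positive gap between stated confidence and measured accuracy.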

ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

ReasonTabQA：来自真实工业场景的表格问答综合基准

Authors: Changzai Pan, Jie Zhang, Kaiwen Wei, Chenshuo Pan, Yu Zhao, Jingwang Huang, Jian Yang, Zhenhe Wu, Haoyang Zeng, Xiaoyan Gu, Weichao Sun, Yanbo Zhai, Yujie Mao, Zhuoru Jiang, Jiang Zhong, Shuangyong Song, Yongxiang Li, Zhongjiang He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07280
Pdf link: https://arxiv.org/pdf/2601.07280
Abstract Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.
中文摘要 大型语言模型（LLMs）的最新进展显著推动了基于表格的问题回答（TableQA）。然而，现有的TableQA基准测试常忽视工业场景的复杂性，这些场景的特点是多表结构、嵌套头和大规模规模。这些环境需要通过深度结构推理进行强有力的表格推理，带来了一个重大挑战，目前的方法论尚未充分解决。为弥合这一差距，我们推出了ReasonTabQA，这是一个涵盖能源和汽车等30个行业领域、1932个表格的大型双语基准测试。ReasonTabQA为最终答案和显式推理链提供高质量注释，支持思考和无思考范式。此外，我们介绍了TabCodeRL，一种利用表感知可验证奖励来引导逻辑推理路径生成的强化学习方法。对ReasonTabQA和4个TableQA数据集的广泛实验表明，虽然TabCodeRL在开源大型语言模型上带来了显著的性能提升，但ReasonTabQA持续存在的性能差距凸显了现实工业TableQA的固有复杂性。

LRAS: Advanced Legal Reasoning with Agentic Search

LRAS：高级法律推理与代理检索

Authors: Yujin Zhou, Chuxue Cao, Jinluan Yang, Lijun Wu, Conghui He, Sirui Han, Yike Guo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07296
Pdf link: https://arxiv.org/pdf/2601.07296
Abstract While Large Reasoning Models (LRMs) have demonstrated exceptional logical capabilities in mathematical domains, their application to the legal field remains hindered by the strict requirements for procedural rigor and adherence to legal logic. Existing legal LLMs, which rely on "closed-loop reasoning" derived solely from internal parametric knowledge, frequently suffer from lack of self-awareness regarding their knowledge boundaries, leading to confident yet incorrect conclusions. To address this challenge, we present Legal Reasoning with Agentic Search (LRAS), the first framework designed to transition legal LLMs from static and parametric "closed-loop thinking" to dynamic and interactive "Active Inquiry". By integrating Introspective Imitation Learning and Difficulty-aware Reinforcement Learning, LRAS enables LRMs to identify knowledge boundaries and handle legal reasoning complexity. Empirical results demonstrate that LRAS outperforms state-of-the-art baselines by 8.2-32%, with the most substantial gains observed in tasks requiring deep reasoning with reliable knowledge. We will release our data and models for further exploration soon.
中文摘要 尽管大型推理模型（LRM）在数学领域展现出卓越的逻辑能力，但其在法律领域的应用仍受限于严格的程序严谨性和法律逻辑的严格要求。现有的法律大型语言模型依赖于仅源自内部参数知识的“闭环推理”，常常缺乏对其知识边界的自我意识，导致得出自信但错误的结论。为应对这一挑战，我们介绍了带有代理搜索的法律推理（LRAS），这是首个旨在将法律大型语言模型从静态和参数化的“闭环思维”转变为动态交互“主动探究”的框架。通过整合内省模仿学习和难度感知强化学习，LRAS使LRM能够识别知识边界并处理法律推理的复杂性。实证结果表明，LRAS的表现比最先进的基线高出8.2%至32%，其中在需要深度推理和可靠知识的任务中，提升最为显著。我们将很快发布数据和模型，供进一步探索。

Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

模仿人类认知，掌握多图像推理：一种提升视觉理解的元行动框架

Authors: Jianghao Yin, Qingbin Li, Kun Sun, Cheng Ding, Jie Wang, Qin Chen, Jie Zhou, Nan Wang, Changqing Li, Pei Wu, Jian Xu, Zheming Yang, Liang He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.07298
Pdf link: https://arxiv.org/pdf/2601.07298
Abstract While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
中文摘要 虽然多模态大型语言模型（MLLM）在单图像理解方面表现出色，但在多图像推理场景中表现显著下降。多图像推理面临根本性挑战，包括图像之间的复杂相互关系以及图像集中中分散的关键信息。受人类认知过程启发，我们提出了认知启发元行动框架（CINEMA），这是一种新颖的方法，将多图像推理分解为五个结构化元行动：全局、聚焦、提示、思考和回答，明确建模人类自然采用的连续认知步骤。对于冷启动训练，我们引入了一种基于检索的树采样策略，生成高质量的元动作轨迹，以引导模型并结合推理模式。在强化学习过程中，我们采用两阶段范式：探索阶段采用多样性保持策略以避免熵崩溃，随后是退火利用阶段，利用DAPO逐步强化利用。为了训练模型，我们构建了一个包含57k冷启动和58k强化学习实例的数据集，涵盖多图像、多帧和单图像任务。我们对多图像推理基准、视频理解基准和单图基准进行了广泛评估，在多个关键基准测试中实现了竞争的先进性能。我们的模型在MUIR和MVMath基准测试中超过GPT-4o，并在视频理解基准测试中显著优于专业视频推理模型，展示了我们基于人类认知的推理框架的有效性和可推广性。

Heterogeneous Multi-Expert Reinforcement Learning for Long-Horizon Multi-Goal Tasks in Autonomous Forklifts

自主叉车中长时程多目标任务的异构多专家强化学习

Authors: Yun Chen, Bowei Huang, Fan Guo, Kang Song
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07304
Pdf link: https://arxiv.org/pdf/2601.07304
Abstract Autonomous mobile manipulation in unstructured warehouses requires a balance between efficient large-scale navigation and high-precision object interaction. Traditional end-to-end learning approaches often struggle to handle the conflicting demands of these distinct phases. Navigation relies on robust decision-making over large spaces, while manipulation needs high sensitivity to fine local details. Forcing a single network to learn these different objectives simultaneously often causes optimization interference, where improving one task degrades the other. To address these limitations, we propose a Heterogeneous Multi-Expert Reinforcement Learning (HMER) framework tailored for autonomous forklifts. HMER decomposes long-horizon tasks into specialized sub-policies controlled by a Semantic Task Planner. This structure separates macro-level navigation from micro-level manipulation, allowing each expert to focus on its specific action space without interference. The planner coordinates the sequential execution of these experts, bridging the gap between task planning and continuous control. Furthermore, to solve the problem of sparse exploration, we introduce a Hybrid Imitation-Reinforcement Training Strategy. This method uses expert demonstrations to initialize the policy and Reinforcement Learning for fine-tuning. Experiments in Gazebo simulations show that HMER significantly outperforms sequential and end-to-end baselines. Our method achieves a task success rate of 94.2% (compared to 62.5% for baselines), reduces operation time by 21.4%, and maintains placement error within 1.5 cm, validating its efficacy for precise material handling.
中文摘要 在非结构化仓库中实现自主移动操作需要在高效大规模导航与高精度物体交互之间取得平衡。传统的端到端学习方法常常难以应对这些不同阶段的冲突需求。导航依赖于在大空间内进行稳健的决策，而操作则需要对局部细节的高度敏感度。强迫单个网络同时学习这些不同目标，常常会导致优化干扰，即改进一项任务会削弱另一项。为解决这些局限性，我们提出了一个专为自动叉车设计的异构多专家强化学习（HMER）框架。HMER 将长期任务分解为由语义任务规划器控制的专用子策略。这种结构将宏观导航与微观操作区分开来，使每位专家能够专注于其特定的行动空间，不受干扰。规划器协调这些专家的顺序执行，弥合任务规划与连续控制之间的差距。此外，为解决稀疏探索问题，我们引入了混合模仿-强化训练策略。该方法利用专家演示初始化策略，并使用强化学习进行微调。Gazebo 仿真实验显示，HMER显著优于顺序和端到端基线。我们的方法实现了94.2%的任务成功率（相比基线的62.5%），操作时间减少了21.4%，并将放置误差控制在1.5厘米以内，验证了其在精确物料搬运中的有效性。

Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

分段优势估计：增强PPO用于长上下文LLM训练

Authors: Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, Bo Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07320
Pdf link: https://arxiv.org/pdf/2601.07320
Abstract Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating $n$-step advantages at every token (as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE first partitions the generated sequence into coherent sub-segments using low-probability tokens as heuristic boundaries. It then selectively computes variance-reduced advantage estimates only from these information-rich segment transitions, effectively filtering out noise from intermediate tokens. Our experiments demonstrate that SAE achieves superior performance, with marked improvements in final scores, training stability, and sample efficiency. These gains are shown to be consistent across multiple model sizes, and a correlation analysis confirms that our proposed advantage estimator achieves a higher correlation with an approximate ground-truth advantage, justifying its superior performance.
中文摘要 用于推理任务的大型语言模型（LLMs）训练越来越多地由可验证奖励强化学习（RLVR）驱动，其中近端策略优化（PPO）为稳定的策略更新提供了原则性框架。然而，PPO的实际应用受限于稀疏奖励RLVR区间的优势估计不可靠。这一问题源于RLVR中奖励稀疏导致中间价值预测不准确，进而在每个代币上通过广义优势估计（GAE）汇总时引入显著偏差。为此，我们引入了分段优势估计（SAE），以减轻GAE在RLVR中可能产生的偏向。我们的关键见解是，在每个代币上汇总$n$步的优势（如GAE）是不必要的，且常常带来过大偏差，因为单个代币携带的信息极少。相反，SAE首先将生成的序列划分为连贯的子段，使用低概率标记作为启发式边界。然后，它仅从这些信息丰富的片段转换中选择性地计算方差约简优势估计，有效过滤掉中间标记中的噪声。我们的实验表明，SAE在最终得分、训练稳定性和样本效率方面均有显著提升，表现更优越。这些增益在多个模型规模中保持一致，相关分析证实我们提出的优势估计器与近似真实优势的相关性更高，证明其优越性能的合理性。
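
To make the baseline in the abstract above concrete, the sketch below implements standard token-level Generalized Advantage Estimation (GAE), plus a helper that marks low-probability tokens as candidate segment boundaries. The helper is only a guess at the heuristic the abstract mentions: its name and the `threshold` value are hypothetical, and the SAE estimator itself is not reproduced because the abstract does not specify it.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard GAE: a discounted sum of one-step TD errors at every token.

    rewards: per-token rewards, length T (sparse in RLVR: mostly zeros).
    values:  value predictions, length T + 1 (last entry is the bootstrap).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def low_prob_boundaries(token_probs, threshold=0.2):
    """Hypothetical heuristic: indices of tokens sampled with probability
    below `threshold`, treated as candidate segment boundaries."""
    return [i for i, p in enumerate(token_probs) if p < threshold]


if __name__ == "__main__":
    # Sparse terminal reward with an uninformative (all-zero) value model.
    print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
    print(low_prob_boundaries([0.9, 0.1, 0.8, 0.05]))
```

In the GAE recursion every token's estimate depends on the value predictions at all later tokens, which is where inaccurate intermediate values can enter.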

Reward Modeling from Natural Language Human Feedback

自然语言人类反馈中的奖励建模

Authors: Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun, Yujiu Yang, Yongbin Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.07349
Pdf link: https://arxiv.org/pdf/2601.07349
Abstract Reinforcement Learning with Verifiable reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward. However, in this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. Consequently, these spurious successes introduce substantial noise into the reward signal, thereby impairing the effectiveness of reinforcement learning. To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent in binary tasks. Specifically, we compute the similarity between GRM-generated and human critiques as the training reward, which provides more accurate reward signals than outcome-only supervision. Additionally, considering that human critiques are difficult to scale up, we introduce Meta Reward Model (MetaRM) which learns to predict process reward from datasets with human critiques and then generalizes to data without human critiques. Experiments on multiple benchmarks demonstrate that our method consistently outperforms state-of-the-art GRMs trained with outcome-only reward, confirming the superiority of integrating natural language over binary human feedback as supervision.
中文摘要 基于偏好数据的可验证奖励强化学习（RLVR）已成为训练生成奖励模型（GRMs）的主流方法。通常在成对奖励任务中，GRMs生成以批判和偏好标签结尾的推理链，RLVR则依赖偏好标签的正确性作为训练奖励。然而，本文证明，这种二元分类任务使GRM容易在没有合理批判的情况下猜测正确结果。因此，这些虚假的成功会在奖励信号中引入大量噪声，从而削弱强化学习的有效性。为解决这一问题，我们提出了自然语言人类反馈的奖励建模（RM-NLHF），利用自然语言反馈获得过程奖励信号，从而缓解二元任务固有的有限解空间问题。具体来说，我们将GRM生成的批评与人类批评的相似性计算为训练奖励，这比仅结果的监督提供了更准确的奖励信号。此外，考虑到人类批评难以大规模化，我们引入了元奖励模型（MetaRM），该模型学习从带有人类批评的数据集中预测过程奖励，并推广到无人类批评的数据。多个基准测试的实验表明，我们的方法持续优于仅以结果为奖励训练的先进GRMs，证实了整合自然语言优于二元人类反馈作为监督的优势。

OpenTinker: Separating Concerns in Agentic Reinforcement Learning

OpenTinker：在智能强化学习中分离关注点

Authors: Siqi Zhu, Jiaxuan You
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2601.07376
Pdf link: https://arxiv.org/pdf/2601.07376
Abstract We introduce OpenTinker, an infrastructure for reinforcement learning (RL) of large language model (LLM) agents built around a separation of concerns across algorithm design, execution, and agent-environment interaction. Rather than relying on monolithic, end-to-end RL pipelines, OpenTinker decomposes agentic learning systems into lightweight, composable components with clearly defined abstraction boundaries. Users specify agents, environments, and interaction protocols, while inference and training are delegated to a managed execution runtime. OpenTinker introduces a centralized scheduler for managing training and inference workloads, including LoRA-based and full-parameter RL, supervised fine-tuning, and inference, over shared resources. We further discuss design principles for extending OpenTinker to multi-agent training. Finally, we present a set of RL use cases that demonstrate the effectiveness of the framework in practical agentic learning scenarios.
中文摘要 我们介绍OpenTinker，这是一个针对大型语言模型（LLM）代理进行强化学习（RL）的基础设施，围绕算法设计、执行和代理-环境交互的关注点分离而构建。OpenTinker 不依赖单一的端到端强化学习流水线，而是将代理学习系统分解为轻量级、可组合的组件，并具有明确的抽象边界。用户指定代理、环境和交互协议，而推理和训练则委托给托管执行时。OpenTinker 引入了集中调度器，用于管理训练和推理工作负载，包括基于 LoRA 和全参数的强化学习、监督式微调以及基于共享资源的推理。我们还进一步讨论了将 OpenTinker 扩展到多智能体训练的设计原则。最后，我们展示了一套强化学习（RL）的用例，展示了该框架在实际代理学习场景中的有效性。

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

关于训练后监督微调与强化学习的非脱钩

Authors: Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2601.07389
Pdf link: https://arxiv.org/pdf/2601.07389
Abstract Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under SFT optimality, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in post-training.
中文摘要 大型语言模型的后训练通常会将监督微调（SFT）与强化学习（RL）交错进行。这两种方法的目标不同：SFT最小化模型输出与专家回答之间的交叉熵损失，而强化学习则最大化来自人类偏好或基于规则验证者的奖励信号。现代推理模型广泛采用交替进行SFT和RL训练的做法。然而，目前尚无理论上能解释它们是否可以解耦。我们证明，解耦无论顺序都不可能：（1）SFT-然后RL耦合：RL在SFT最优情况下增加SFT损失;（2）RL-然后SFT耦合：SFT降低RL所获得的奖励。Qwen3-0.6B的实验证实了预测的退化，验证了SFT和RL无法分离而不损失训练后性能
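
The SFT-then-RL coupling claim can be made tangible with a toy calculation. This is not the paper's proof or setting: the two-action policy, the expert distribution, and the rewards below are invented for illustration. The sketch only shows that, starting from a policy that exactly matches the expert (the SFT optimum), any shift toward the higher-reward action raises the cross-entropy against the expert.

```python
import math


def cross_entropy(expert, policy):
    """SFT loss: cross-entropy of the policy against the expert distribution."""
    return -sum(p * math.log(q) for p, q in zip(expert, policy))


def expected_reward(policy, rewards):
    """RL objective: expected reward under the policy."""
    return sum(q * r for q, r in zip(policy, rewards))


# Toy setup (all numbers are assumptions for illustration).
expert = [0.7, 0.3]       # expert action distribution
rewards = [1.0, 0.0]      # reward for each action
sft_policy = [0.7, 0.3]   # SFT optimum: policy equals the expert
rl_policy = [0.8, 0.2]    # policy after an RL step toward the rewarded action

sft_loss_before = cross_entropy(expert, sft_policy)
sft_loss_after = cross_entropy(expert, rl_policy)
reward_before = expected_reward(sft_policy, rewards)
reward_after = expected_reward(rl_policy, rewards)
```

Here the reward rises from 0.7 to 0.8 while the SFT loss rises from roughly 0.611 to roughly 0.639, since cross-entropy against a fixed distribution is minimized only when the policy equals it.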

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

基于结果的优势重塑，用于数学推理中的细粒度信用分配

Authors: Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07408
Pdf link: https://arxiv.org/pdf/2601.07408
Abstract Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, with both significantly outperforming a strong GRPO baseline and pushing the boundaries of critic-free LLM reasoning.
中文摘要 群体相对策略优化（GRPO）已成为一种有前景的无批判强化学习范式，用于推理任务。然而，标准GRPO采用粗粒度的信用分配机制，将群体级奖励均匀地传递到序列中的每个代币，忽略了各个推理步骤的不同贡献。我们通过引入基于结果的优势重塑（OAR）来解决这一限制，这是一种细粒度的信用分配机制，根据每个代币对模型最终答案的影响程度重新分配优势。我们通过两种互补策略实现OAR：（1）OAR-P，通过反事实的代币扰动估计结果敏感性，作为高保真度归因信号;（2）OAR-G，使用输入梯度敏感度代理，通过单次回向传递近似影响信号。这些重要性信号与保守的双层优势重塑方案相结合，抑制低影响的代币，提升关键代币，同时保持整体优势质量。大量数学推理基准的实证结果表明，虽然OAR-P设定了性能上限，而OAR-G则实现了相当的提升且计算开销极小，两者都远超强GRPO基准，推动了无批评LLM推理的边界。
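
A mass-preserving, two-level reshaping of this kind might look like the sketch below. The abstract does not give the actual Bi-Level rule, so the median threshold, the `low` and `high` weights, and the function name `reshape_advantages` are assumptions; the only property taken from the abstract is that low-impact tokens are suppressed, pivotal ones are boosted, and the total advantage is unchanged.

```python
def reshape_advantages(seq_advantage, importance, low=0.5, high=1.5):
    """Redistribute a sequence-level advantage across tokens.

    Tokens whose importance is at or above the median get weight `high`,
    the rest get `low`; weights are rescaled to average 1 so the summed
    advantage equals the uniform case (seq_advantage * number of tokens).
    Threshold and weights are illustrative assumptions.
    """
    if not importance:
        return []
    threshold = sorted(importance)[len(importance) // 2]
    raw = [high if score >= threshold else low for score in importance]
    scale = len(raw) / sum(raw)  # forces the mean weight to 1
    return [seq_advantage * weight * scale for weight in raw]


# Four tokens with hypothetical importance scores.
token_advantages = reshape_advantages(1.0, [0.1, 0.9, 0.2, 0.8])
uniform_mass = 1.0 * 4  # what standard GRPO would assign in total
```

In this example the two high-importance tokens receive 1.5 each and the other two 0.5 each, so the total stays at 4.0, the same mass that uniform assignment would give.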

Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

解谜：面向离线多智能体强化学习的局部到全局世界模型

Authors: Sijia Li, Xinran Li, Shibo Chen, Jun Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07463
Pdf link: https://arxiv.org/pdf/2601.07463
Abstract Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model-based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non-stationarity, and complexity of multi-agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local-to-global (LOGO) world model, a novel framework that leverages local predictions, which are easier to estimate, to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent-wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state-action space. To ensure reliable policy learning, we further introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing the propagation of approximation error to policies. In contrast to conventional ensemble-based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.
中文摘要 离线多智能体强化学习（MARL）旨在利用预先收集的数据集解决多智能体系统中的协作决策问题。现有的离线MARL方法主要限制了数据集分布内的训练，导致过于保守的政策，难以超越数据支持的推广。虽然基于模型的方法通过扩展由学习世界模型生成的合成数据扩展原始数据集提供了有前景的解决方案，但多智能体系统的高维度、非平稳性和复杂性使得准确估计离线MARL中的转变和奖励函数变得困难。鉴于直接建模联合动力学的困难，我们提出了一种局部到全球（LOGO）世界模型，这是一种利用更易估计的局部预测来推断全局状态动态的新框架，从而提高预测准确性，同时隐式捕捉代理间的依赖关系。利用训练过的世界模型，我们生成合成数据以补充原始数据集，扩展有效状态-动作空间。为确保策略学习的可靠性，我们进一步引入了一种不确定性感知采样机制，通过预测不确定性自适应地加权合成数据，减少了近似误差传播到策略。与传统的基于集成的方法不同，我们的方法只需额外的编码器来估算不确定性，显著降低计算开销，同时保持准确性。在8个场景和8个基线上的大量实验表明，我们的方法超越了标准离线MARL基准测试的先进基线，建立了基于模型的基础基线，实现可推广的离线多智能体学习。
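
Uncertainty-aware sampling of synthetic transitions could be sketched as follows. The abstract does not specify the weighting function, so the exponential down-weighting, the `temperature` parameter, and both function names are assumptions; the sketch only captures the stated idea that synthetic data with higher prediction uncertainty should be drawn less often.

```python
import math
import random


def uncertainty_weights(uncertainties, temperature=1.0):
    """Map per-sample uncertainties to sampling probabilities.

    Uses exp(-u / temperature), normalized to sum to 1 (an assumed form).
    """
    scores = [math.exp(-u / temperature) for u in uncertainties]
    total = sum(scores)
    return [score / total for score in scores]


def sample_synthetic(transitions, uncertainties, k, seed=0):
    """Draw k synthetic transitions, favoring low-uncertainty ones."""
    rng = random.Random(seed)
    weights = uncertainty_weights(uncertainties)
    return rng.choices(transitions, weights=weights, k=k)


# Three hypothetical synthetic transitions with their predicted uncertainty.
weights = uncertainty_weights([0.1, 1.0, 3.0])
batch = sample_synthetic(["t0", "t1", "t2"], [0.1, 1.0, 3.0], k=5)
```

A lower `temperature` concentrates sampling on the most confident predictions, while a higher one approaches uniform sampling over the synthetic set.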

Graph Inference Towards ICD Coding

图推断对ICD编码的应用

Authors: Xiaoxiao Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07496
Pdf link: https://arxiv.org/pdf/2601.07496
Abstract Automated ICD coding involves assigning standardized diagnostic codes to clinical narratives. The vast label space and extreme class imbalance continue to challenge precise prediction. To address these issues, LabGraph is introduced -- a unified framework that reformulates ICD coding as a graph generation task. By combining adversarial domain adaptation, graph-based reinforcement learning, and perturbation regularization, LabGraph effectively enhances model robustness and generalization. In addition, a label graph discriminator dynamically evaluates each generated code, providing adaptive reward feedback during training. Experiments on benchmark datasets demonstrate that LabGraph consistently outperforms previous approaches on micro-F1, micro-AUC, and P@K.
中文摘要 自动化ICD编码涉及为临床叙述分配标准化诊断代码。庞大的标签空间和极端的阶级不平衡持续挑战着准确的预测。为解决这些问题，引入了LabGraph——一个统一框架，将ICD编码重新表述为图生成任务。通过结合对抗域适应、基于图的强化学习和微扰正则化，LabGraph有效提升了模型的鲁棒性和泛化性。此外，标签图判别器动态评估每个生成的代码，在训练过程中提供自适应奖励反馈。基准数据集上的实验表明，LabGraph在微F1、微AUC和P@K上持续优于以往方法。

Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

控制具有覆盖增强的潜在动作的多模态会话代理

Authors: Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07516
Pdf link: https://arxiv.org/pdf/2601.07516
Abstract Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Although it substantially improves generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we instead learn a compact latent action space for RL fine-tuning. Specifically, we adopt the learning-from-observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions, which can in turn be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector to transform text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent-action-based method outperforms competitive baselines on two conversation tasks across various RL algorithms.
中文摘要 视觉语言模型越来越多地被用作多模态会话代理（MCA），用于多样化的会话任务。近年来，强化学习（RL）被广泛探索用于将MCA适配到各种人机交互场景。尽管在泛化性能上有显著提升，通过强化学习微调MCA仍面临处理极大文本令牌空间的挑战。为此，我们学习了一个紧凑的潜在作用空间用于强化学习的微调。具体来说，我们采用了“观察学习”机制来构建潜在作用空间的代码手册，利用未来的观测数据来估算当前潜在作用，进一步用于重建未来的观测数据。然而，成对图像-文本数据的稀缺阻碍了学习具有足够覆盖范围的码本。因此，我们利用成对图像-文本数据和纯文本数据构建潜在动作空间，使用跨模态投影仪将文本嵌入转换为图像-文本嵌入。我们将跨模态投影仪初始化为配对图像-文本数据，并在仅有大量文本数据上进行新颖的周期一致性丢失训练，以增强其鲁棒性。我们证明，基于潜伏动作的方法在两种对话任务中，跨越多种强化学习算法，优于竞争基线。

Stagewise Reinforcement Learning and the Geometry of the Regret Landscape

分阶段强化学习与遗憾景观的几何

Authors: Chris Elliott, Einar Urdshals, David Quarel, Matthew Farrugia-Roberts, Daniel Murfet
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07524
Pdf link: https://arxiv.org/pdf/2601.07524
Abstract Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to deep reinforcement learning, proving that the concentration of the generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that Bayesian phase transitions in reinforcement learning should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over SGD training manifest as "opposing staircases" where regret decreases sharply while the LLC increases. Notably, the LLC detects phase transitions even when estimated on a subset of states where the policies appear identical in terms of regret, suggesting it captures changes in the underlying algorithm rather than just performance.
中文摘要 奇异学习理论将贝叶斯学习描述为准确性与复杂性之间不断演变的权衡，随着样本量增加，质的不同解之间会发生转变。我们将该理论扩展到深度强化学习，证明广义后效策略的集中度受局部学习系数（LLC）支配，这是后悔函数几何的不变量。该理论预测，强化学习中的贝叶斯阶段转变应从高遗憾率的简单策略发展到低遗憾的复杂策略。我们在网格世界环境中实证验证了这一预测，该环境呈现分阶段政策发展：SGD培训的阶段转变表现为“对立阶梯”，即后悔急剧减少而LLC增加。值得注意的是，LLC即使在估计部分策略在遗憾度上看似相同的状态上，也能检测到相变，表明它捕捉的是底层算法的变化，而不仅仅是性能。

GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation

带有状态变异的GRPO：改进基于LLM的硬件测试计划生成

Authors: Dimple Vijay Kochar, Nathaniel Pinckney, Guan-Ting Liu, Chia-Tung Ho, Chenhui Deng, Haoxing Ren, Brucek Khailany
Subjects: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07593
Pdf link: https://arxiv.org/pdf/2601.07593
Abstract RTL design often relies heavily on ad-hoc testbench creation early in the design cycle. While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that separates test plan generation from testbench execution. Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates on generating stimuli that pass golden RTL designs. To improve LLM-generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations. Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide rich learning signals. Training on this curated dataset, our 7B parameter model achieves a 33.3% golden test pass rate and a 13.9% mutation detection rate, representing a 17.6% absolute improvement over baseline and outperforming much larger general-purpose models. These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows.
中文摘要 RTL设计通常在设计周期早期大量依赖临时测试平台的创建。虽然大型语言模型（LLM）在RTL代码生成方面展现出潜力，但它们推理硬件规格和生成有针对性测试计划的能力仍大多未被充分探索。我们首次系统地研究了RTL验证刺激生成的LLM推理能力，建立了一个两阶段框架，将测试计划生成与测试平台执行拆解。我们的基准测试显示，包括DeepSeek-R1和Claude-4.0-Sonnet在内的最先进模型，在生成通过黄金RTL设计的刺激时，成功率仅为15.7%至21.7%。为提升LLM生成刺激，我们开发了一套综合训练方法，结合监督微调与一种新型强化学习方法——带有状态突变的GRPO（GRPO-SMu），通过变换输入突变增强探索能力。我们的方法利用基于树的分支突变策略构建包含等价树和突变树的训练数据，超越线性突变方法，提供丰富的学习信号。在这套精心策划的数据集上训练，我们的7B参数模型实现了33.3%的黄金测试通过率和13.9%的突变检测率，比基线绝对提升了17.6%，并且优于更大规模的通用模型。这些结果表明，专业训练方法能够显著提升硬件验证任务中的LLM推理能力，为半导体设计工作流程中的自动化子单元测试奠定基础。

Clipped Affine Policy: Low-Complexity Near-Optimal Online Power Control for Energy Harvesting Communications over Fading Channels

截剪仿射策略：低复杂度近优在线功率控制，用于衰落信道上的能量收割通信

Authors: Hao Wu, Shengtian Yang, Huiguo Gao, Diao Wang, Jun Chen, Guanding Yu
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.07622
Pdf link: https://arxiv.org/pdf/2601.07622
Abstract This paper investigates online power control for point-to-point energy harvesting communications over wireless fading channels. A linear-policy-based approximation is derived for the relative-value function in the Bellman equation of the power control problem. This approximation leads to two fundamental power control policies: optimistic and robust clipped affine policies, both taking the form of a clipped affine function of the battery level and the reciprocal of the channel signal-to-noise ratio coefficient. They are essentially battery-limited weighted directional waterfilling policies operating between adjacent time slots. By leveraging the relative-value approximation and derived policies, a domain-knowledge-enhanced reinforcement learning (RL) algorithm is proposed for online power control. The proposed approach is further extended to scenarios with energy and/or channel lookahead. Comprehensive simulation results demonstrate that the proposed methods achieve a good balance between computational complexity and optimality. In particular, the robust clipped affine policy (combined with RL, using at most five parameters) outperforms all existing approaches across various scenarios, with less than 2% performance loss relative to the optimal policy.
中文摘要 本文探讨了无线衰落信道上点对点能量采集通信的在线功率控制。针对功率控制问题贝尔曼方程中的相对价值函数，推导出了基于线性策略的近似。该近似导出两种基本功率控制策略：乐观和稳健的截断仿射策略，二者均表现为电池电量和信道信噪比系数倒数的截断仿射函数。它们本质上是在相邻时隙之间运行的电池受限加权定向注水策略。通过利用相对价值近似和导出的策略，提出了一种领域知识增强的强化学习（RL）算法用于在线功率控制。该方法进一步扩展到具有能量和/或信道前瞻的场景。全面的仿真结果表明，所提方法在计算复杂度和最优性之间取得了良好平衡。特别是，稳健截断仿射策略（结合强化学习，最多使用五个参数）在多种场景下优于所有现有方法，相对于最优策略的性能损失低于2%。
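
The shape of a clipped affine policy can be sketched in a few lines. The abstract gives only the functional form (a clipped affine function of battery level and reciprocal SNR), so the coefficient names `a`, `b`, `c`, their example values, the unit-length time slot, and the clipping range [0, battery] are assumptions made for illustration, not the paper's derived parameters.

```python
def clipped_affine_power(battery, snr, a, b, c):
    """Transmit power as a clipped affine function of battery and 1/SNR.

    unclipped = a * battery + b / snr + c
    The result is clipped to [0, battery], assuming unit-length slots so
    that power and energy share units (an assumption of this sketch).
    """
    unclipped = a * battery + b / snr + c
    return min(max(unclipped, 0.0), battery)


# Hypothetical coefficients; b < 0 mimics waterfilling, where poorer
# channels (larger 1/SNR) receive less power.
typical = clipped_affine_power(battery=2.0, snr=4.0, a=0.5, b=-1.0, c=0.1)
bad_channel = clipped_affine_power(battery=0.1, snr=0.1, a=0.5, b=-1.0, c=0.1)
battery_limited = clipped_affine_power(battery=1.0, snr=1e9, a=2.0, b=-1.0, c=0.1)
```

With these values the first call stays in the affine region (about 0.85), the second is clipped to zero because the channel is poor, and the third is capped by the available battery energy.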

Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

平滑算子：平滑可验证奖励激活视觉-语言模型的空间推理能力

Authors: Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.07695
Pdf link: https://arxiv.org/pdf/2601.07695
Abstract Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
中文摘要 视觉语言模型（VLMs）在实现三维场景理解的精确数值预测方面面临关键瓶颈。传统的强化学习（RL）方法主要基于相对排名，常常存在严重的奖励稀疏性和梯度不稳定性，无法有效利用三维物理约束所提供的可验证信号。值得注意的是，在标准GRPO框架中，相对归一化会导致“近距离误差”样本（以小但非零误差为特征）出现优势坍缩。这导致了严重的数据利用瓶颈，优化过程中宝贵的边界样本被丢弃。为此，我们引入了平滑数值奖励激活（SNRA）算子和绝对保持GRPO（AP-GRPO）框架。SNRA采用动态参数化的S形函数，将原始反馈转化为密集、连续的奖励连续谱。同时，AP-GRPO集成了绝对标量梯度，以减轻传统相对排序机制中固有的数值信息损失。通过利用这一方法，我们构建了Numerical3D-50k数据集，包含5万个可验证的3D子任务。实证结果表明，AP-GRPO在保持更高数据效率的同时，实现了与大规模监督方法的性能平等，有效激活VLM中的潜在三维推理，而无需进行架构修改。
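
One way to read the "advantage collapse" of near-miss samples is shown in the sketch below: with a hard pass/fail reward, a group in which every sample misses the tolerance yields identical rewards and therefore zero group-relative advantages, whereas a smooth reward of the error still ranks the samples. The fixed-scale sigmoid, the tolerance, the error values, and all function names are assumptions; the paper's SNRA operator is described as dynamically parameterized and may differ.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def binary_reward(error, tol=0.01):
    """Hard verifiable reward: 1 only if the error is within tolerance."""
    return 1.0 if error <= tol else 0.0


def smooth_reward(error, scale=1.0):
    """Sigmoid-shaped reward: 1 at zero error, decaying smoothly to 0."""
    return 2.0 / (1.0 + math.exp(error / scale))


# A group of four rollouts; the first is a near miss (hypothetical errors).
errors = [0.05, 0.5, 2.0, 5.0]
binary_adv = group_relative_advantages([binary_reward(e) for e in errors])
smooth_adv = group_relative_advantages([smooth_reward(e) for e in errors])
```

In this group every binary advantage is exactly zero, so the near-miss rollout contributes no gradient, while under the smooth reward it receives the largest positive advantage.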

Hiking in the Wild: A Scalable Perceptive Parkour Framework for Humanoids

野外徒步：面向人形机器人的可扩展感知跑酷框架

Authors: Shaoting Zhu, Ziwen Zhuang, Mengjie Zhao, Kun-Ying Lee, Hang Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07718
Pdf link: https://arxiv.org/pdf/2601.07718
Abstract Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present Hiking in the Wild, a scalable, end-to-end perceptive parkour framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable Terrain Edge Detection with Foot Volume Points to prevent catastrophic slippage on edges, and a Flat Patch Sampling strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
中文摘要 在复杂、非结构化的环境中实现稳健的人形机器人徒步，需要从反应式本体感觉转向主动感知。然而，整合外感知仍是一个重大挑战：基于建图的方法存在状态估计漂移；例如，基于激光雷达的方法无法很好地处理躯干抖动。现有的端到端方法常常在可扩展性和训练复杂性方面遇到困难；具体来说，一些之前使用虚拟障碍的工作是逐个案例实现的。在本研究中，我们提出了 Hiking in the Wild，一个可扩展、端到端的感知跑酷框架，专为稳健的人形机器人徒步设计。为确保安全和训练稳定性，我们引入了两个关键机制：结合可扩展的地形边缘检测（Terrain Edge Detection）与足部体积点（Foot Volume Points）的落脚点安全机制，以防止在边缘发生灾难性滑移；以及一种平坦区块采样（Flat Patch Sampling）策略，通过生成可行的导航目标来减轻奖励黑客行为。我们的方法采用单阶段强化学习方案，将原始深度输入和本体感觉直接映射到关节动作，无需依赖外部状态估计。在全尺寸人形机器人上的大量实地实验表明，我们的策略能够以高达2.5米/秒的速度稳健地穿越复杂地形。训练和部署代码已开源，便于在真实机器人上进行可重复的研究和部署，且只需极少的硬件修改。

Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding

从视频证据到推理：通过显式证据定位实现高效视频理解

Authors: Yanxiang Huang, Guohua Gao, Zhaoyang Wei, Jianyuan Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.07761
Pdf link: https://arxiv.org/pdf/2601.07761
Abstract Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
中文摘要 大型视觉语言模型（LVLM）在视频推理中面临一个根本的困境：它们夹在冗长推理的高计算成本和高效且无根基方法带来的幻觉风险之间。为解决这个问题，我们引入了证据链（CoE），这是一个新颖框架，在架构上解耦并共同优化感知基础和推理效率。CoE包含两项核心创新：（1）轻量级证据基础模块（EGM），作为查询引导过滤器，动态识别并提取紧凑的高保真视觉证据集;以及（2）通过强化学习优化的证据锚定协议。关键是，我们设计了一种复合奖励机制，强制过程对齐，迫使模型在推理时严格引用已识别的时间锚点，从而减轻幻觉。为此，我们构建了 CoE-Instruct，这是一个大规模数据集（164k 样本），采用了一种新颖的双注释模式，实现了独立的感知和推理监督。在包括Video-MME、MVBench和VSI-Bench在内的五个基准测试上的广泛实验表明，CoE增强模型确立了新的技术水平。它们在准确性上远超现有方法，证明了CoE是可靠视频理解的强大且实用的范式。

Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

超越单次获取：通过查询规划实现多步骤工具检索

Authors: Wei Fang, James Glass
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.07782
Pdf link: https://arxiv.org/pdf/2601.07782
Abstract LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
中文摘要 运行在庞大动态工具库中的LLM代理依赖高效的检索，但标准的单帧密集检索器在应对复杂请求时却感到困难。这些失败主要源于抽象用户目标与技术文档之间的脱节，以及固定尺寸嵌入在建模组合工具组合时的能力有限。为应对这些挑战，我们提出了TOOLQP，一个轻量级框架，将检索建模为迭代查询规划。TOOLQP 不是单次匹配，而是将指令分解为子任务，并动态生成查询与检索器交互，通过针对特定子任务来弥合语义差距。我们通过合成查询轨迹训练TOOLQP，随后通过可验证奖励强化学习（RLVR）进行优化。实验表明，TOOLQP实现了最先进的性能，展现出卓越的零射中推广能力、在多种检索者的鲁棒性以及下游代理执行的显著提升。

Data-driven control of hydraulic impact hammers under strict operational and control constraints

在严格操作和控制约束下，基于数据驱动的液压冲击锤控制

Authors: Francisco Leiva, Claudio Canales, Michelle Valenzuela, Javier Ruiz-del-Solar
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.07813
Pdf link: https://arxiv.org/pdf/2601.07813
Abstract This paper presents a data-driven methodology for the control of static hydraulic impact hammers, also known as rock breakers, which are commonly used in the mining industry. The task addressed in this work is that of controlling the rock-breaker so its end-effector reaches arbitrary target poses, which is required in normal operation to place the hammer on top of rocks that need to be fractured. The proposed approach considers several constraints, such as unobserved state variables due to limited sensing and the strict requirement of using a discrete control interface at the joint level. First, the proposed methodology addresses the problem of system identification to obtain an approximate dynamic model of the hydraulic arm. This is done via supervised learning, using only teleoperation data. The learned dynamic model is then exploited to obtain a controller capable of reaching target end-effector poses. For policy synthesis, both reinforcement learning (RL) and model predictive control (MPC) algorithms are utilized and contrasted. As a case study, we consider the automation of a Bobcat E10 mini-excavator arm with a hydraulic impact hammer attached as end-effector. Using this machine, both the system identification and policy synthesis stages are studied in simulation and in the real world. The best RL-based policy consistently reaches target end-effector poses with position errors below 12 cm and pitch angle errors below 0.08 rad in the real world. Considering that the impact hammer has a 4 cm diameter chisel, this level of precision is sufficient for breaking rocks. Notably, this is accomplished by relying only on approximately 68 min of teleoperation data to train and 8 min to evaluate the dynamic model, and without performing any adjustments for a successful policy Sim2Real transfer. A demonstration of policy execution in the real world can be found in this https URL.
中文摘要 本文提出了一种数据驱动的方法，用于控制静态液压冲击锤（也称为碎石机），该设备在采矿行业中广泛使用。本研究的任务是控制碎石机，使其末端执行器达到任意目标位姿，这是正常操作中将锤子放置在需要破碎的岩石顶部所必需的。该方法考虑了若干约束条件，如由于传感有限导致的未观测状态变量，以及在关节层面必须使用离散控制接口的严格要求。首先，所提方法解决系统辨识问题，以获得液压臂的近似动力学模型。这通过监督学习完成，仅使用遥操作数据。随后利用所学的动力学模型，获得能够达到目标末端执行器位姿的控制器。在策略综合中，同时使用并对比了强化学习（RL）和模型预测控制（MPC）算法。作为案例研究，我们考虑对一台以液压冲击锤作为末端执行器的Bobcat E10迷你挖掘机机械臂进行自动化。利用该机器，系统辨识和策略综合阶段在仿真和现实世界中都得到了研究。基于强化学习的最佳策略在现实世界中始终能达到目标末端执行器位姿，位置误差低于12厘米，俯仰角误差低于0.08弧度。考虑到冲击锤的凿子直径为4厘米，这种精度足以破碎岩石。值得注意的是，这仅依赖约68分钟的遥操作数据进行训练、8分钟的数据评估动力学模型，且无需为成功的策略Sim2Real迁移进行任何调整。在现实世界中策略执行的演示可见于此 https URL。

Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation

失败感知强化学习：可靠的离线到在线强化学习，具备自我恢复功能，用于现实世界操作

Authors: Huanyu Li, Kun Lei, Sheng Zang, Kaizhe Hu, Yongyuan Liang, Bo An, Xiaoli Li, Huazhe Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.07821
Pdf link: https://arxiv.org/pdf/2601.07821
Abstract Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a new paradigm minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training. Videos and code are available at this https URL.
中文摘要 基于深度强化学习的后训练算法可以推动机器人模型在特定目标上的极限，如泛化性、准确性和鲁棒性。然而，在现实世界探索中不可避免地会发生需要干预的失败（IR故障）（例如机器人洒水或打破易碎玻璃），这阻碍了此类范式的实际应用。为此，我们引入了“失误感知离线到在线强化学习”（FARL），这是一种减少现实强化学习失败的新范式。我们创建了FailureBench基准测试，整合了需要人工干预的常见故障场景，并提出了一个算法，整合了基于世界模型的安全批评器和线下训练的恢复策略，以防止在线探索过程中发生故障。广泛的模拟和实际实验证明了FARL在显著减少红外失效、提升在线强化学习后性能和泛化能力方面的有效性。FARL在真实的强化学习后训练中，平均可降低73.1%的红外故障率，同时提升性能11.3%。视频和代码可在此 https URL 下载。

Video Generation Models in Robotics - Applications, Research Challenges, Future Directions

机器人中的视频生成模型——应用、研究挑战与未来方向

Authors: Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, Anirudha Majumdar
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.07823
Pdf link: https://arxiv.org/pdf/2601.07823
Abstract Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.
中文摘要 视频生成模型已成为物理世界的高保真模型，能够在多模态用户输入的条件下合成高质量视频，捕捉智能体与其环境之间的细粒度交互。它们令人印象深刻的能力解决了基于物理的模拟器长期面临的许多挑战，推动其在机器人学等众多问题领域得到广泛应用。例如，视频模型能够实现照片级真实、物理一致的可变形体仿真，而无需做出代价过高的简化假设，而这正是基于物理的仿真中的一大瓶颈。此外，视频模型可以作为基础世界模型，以细粒度且富有表现力的方式捕捉世界的动态。因此，它们克服了纯语言抽象在描述复杂物理交互时表现力有限的问题。本综述回顾了视频模型及其作为机器人具身世界模型的应用，涵盖模仿学习中的低成本数据生成与动作预测、强化学习中的动力学与奖励建模、视觉规划以及策略评估。此外，我们还强调了阻碍视频模型在机器人中可信集成的重要挑战，包括指令遵循能力差、违反物理规律等幻觉现象和不安全的内容生成，以及数据整理、训练和推理成本高昂等根本性限制。我们提出了应对这些开放研究挑战的潜在未来方向，以激励相关研究，并最终促进更广泛的应用，尤其是在安全关键的场景中。

Keyword: diffusion policy

A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control

在线扩散策略强化学习算法在可扩展机器人控制中的综述

Authors: Wonhyeok Choi, Minwoo Choi, Jungwan Woo, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.06133
Pdf link: https://arxiv.org/pdf/2601.06133
Abstract Diffusion policies have emerged as a powerful approach for robotic control, demonstrating superior expressiveness in modeling multimodal action distributions compared to conventional policy networks. However, their integration with online reinforcement learning remains challenging due to fundamental incompatibilities between diffusion model training objectives and standard RL policy improvement mechanisms. This paper presents the first comprehensive review and empirical analysis of current Online Diffusion Policy Reinforcement Learning (Online DPRL) algorithms for scalable robotic control systems. We propose a novel taxonomy that categorizes existing approaches into four distinct families -- Action-Gradient, Q-Weighting, Proximity-Based, and Backpropagation Through Time (BPTT) methods -- based on their policy improvement mechanisms. Through extensive experiments on a unified NVIDIA Isaac Lab benchmark encompassing 12 diverse robotic tasks, we systematically evaluate representative algorithms across five critical dimensions: task diversity, parallelization capability, diffusion step scalability, cross-embodiment generalization, and environmental robustness. Our analysis identifies key findings regarding the fundamental trade-offs inherent in each algorithmic family, particularly concerning sample efficiency and scalability. Furthermore, we reveal critical computational and algorithmic bottlenecks that currently limit the practical deployment of online DPRL. Based on these findings, we provide concrete guidelines for algorithm selection tailored to specific operational constraints and outline promising future research directions to advance the field toward more general and scalable robotic learning systems.
中文摘要 扩散策略已成为机器人控制的强大方法，与传统策略网络相比，在建模多模态动作分布方面展现出更强的表达能力。然而，由于扩散模型训练目标与标准强化学习策略改进机制之间存在根本性的不兼容，它们与在线强化学习的整合仍然具有挑战性。本文首次对当前面向可扩展机器人控制系统的在线扩散策略强化学习（Online DPRL）算法进行了全面综述和实证分析。我们提出了一种新颖的分类法，依据策略改进机制将现有方法分为四个不同的家族：动作梯度（Action-Gradient）、Q 加权（Q-Weighting）、基于邻近性（Proximity-Based）和时间反向传播（BPTT）方法。通过在涵盖 12 个不同机器人任务的统一 NVIDIA Isaac Lab 基准上进行大量实验，我们从五个关键维度系统地评估了代表性算法：任务多样性、并行化能力、扩散步数可扩展性、跨具身泛化和环境鲁棒性。我们的分析得出了关于各算法家族内在基本权衡的关键发现，特别是在样本效率和可扩展性方面。此外，我们还揭示了目前限制在线 DPRL 实际部署的关键计算和算法瓶颈。基于这些发现，我们提供了针对特定操作约束的算法选择具体指南，并概述了有前景的未来研究方向，以推动该领域向更通用且可扩展的机器人学习系统发展。