Arxiv Papers of Today

Generated: 2026-01-27 16:38:35 (UTC+8); arXiv announcement time: 2026-01-27 20:00 EST (2026-01-28 09:00 UTC+8)

59 relevant papers today

Keyword: reinforcement learning

Breaking Task Impasses Quickly: Adaptive Neuro-Symbolic Learning for Open-World Robotics

Authors: Pierrick Lorang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16985
Pdf link: https://arxiv.org/pdf/2601.16985
Abstract Adapting to unforeseen novelties in open-world environments remains a major challenge for autonomous systems. While hybrid planning and reinforcement learning (RL) approaches show promise, they often suffer from sample inefficiency, slow adaptation, and catastrophic forgetting. We present a neuro-symbolic framework integrating hierarchical abstractions, task and motion planning (TAMP), and reinforcement learning to enable rapid adaptation in robotics. Our architecture combines symbolic goal-oriented learning and world model-based exploration to facilitate rapid adaptation to environmental changes. Validated in robotic manipulation and autonomous driving, our approach achieves faster convergence, improved sample efficiency, and superior robustness over state-of-the-art hybrid methods, demonstrating its potential for real-world deployment.

Multi-Agent Deep Reinforcement Learning Under Constrained Communications

Authors: Shahil Shaik, Jonathon M. Smereka, Yue Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17069
Pdf link: https://arxiv.org/pdf/2601.17069
Abstract Centralized training with decentralized execution (CTDE) has been the dominant paradigm in multi-agent reinforcement learning (MARL), but its reliance on global state information during training introduces scalability, robustness, and generalization bottlenecks. Moreover, in practical scenarios such as adding/dropping teammates or facing environment dynamics that differ from the training, CTDE methods can be brittle and costly to retrain, whereas distributed approaches allow agents to adapt using only local information and peer-to-peer communication. We present a distributed MARL framework that removes the need for centralized critics or global information. Firstly, we develop a novel Distributed Graph Attention Network (D-GAT) that performs global state inference through multi-hop communication, where agents integrate neighbor features via input-dependent attention weights in a fully distributed manner. Leveraging D-GAT, we develop the distributed graph-attention MAPPO (DG-MAPPO) -- a distributed MARL framework where agents optimize local policies and value functions using local observations, multi-hop communication, and shared/averaged rewards. Empirical evaluation on the StarCraftII Multi-Agent Challenge, Google Research Football, and Multi-Agent Mujoco demonstrates that our method consistently outperforms strong CTDE baselines, achieving superior coordination across a wide range of cooperative tasks with both homogeneous and heterogeneous teams. Our distributed MARL framework provides a principled and scalable solution for robust collaboration, eliminating the need for centralized training or global observability. To the best of our knowledge, DG-MAPPO appears to be the first to fully eliminate reliance on privileged centralized information, enabling agents to learn and act solely through peer-to-peer communication.
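Illustrative sketch, not the authors' D-GAT: the abstract describes agents combining neighbor features with input-dependent attention weights over several communication hops. The snippet below shows that general pattern in plain Python, assuming a simple dot-product score in place of a learned attention function. The names `attention_round` and `multi_hop` and the graph layout are hypothetical.

```python
import math


def attention_round(features, neighbors):
    """One communication round: each agent mixes its own and its neighbors' features."""
    updated = {}
    for i, h_i in features.items():
        ids = [i] + list(neighbors[i])  # the agent attends to itself and its peers
        # Assumption: dot-product scores stand in for a learned attention function.
        scores = [sum(a * b for a, b in zip(h_i, features[j])) for j in ids]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        norm = sum(exps)
        updated[i] = [
            sum(e / norm * features[j][d] for e, j in zip(exps, ids))
            for d in range(len(h_i))
        ]
    return updated


def multi_hop(features, neighbors, hops):
    """Repeat local rounds so information travels `hops` edges through the graph."""
    for _ in range(hops):
        features = attention_round(features, neighbors)
    return features
```

On a three-agent line graph (0 - 1 - 2), one round leaves agent 0 with no information about agent 2, while two rounds let it arrive through agent 1, which is the sense in which multi-hop exchange can approximate a global view.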

Beyond Instrumental and Substitutive Paradigms: Introducing Machine Culture as an Emergent Phenomenon in Large Language Models

Authors: Yueqing Hu, Xinyang Peng, Yukun Zhao, Lin Qiu, Ka-lai Hung, Kaiping Peng
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.17096
Pdf link: https://arxiv.org/pdf/2601.17096
Abstract Recent scholarship typically characterizes Large Language Models (LLMs) through either an Instrumental Paradigm (viewing models as reflections of their developers' culture) or a Substitutive Paradigm (viewing models as bilingual proxies that switch cultural frames based on language). This study challenges these anthropomorphic frameworks by proposing Machine Culture as an emergent, distinct phenomenon. We employed a 2 (Model Origin: US vs. China) × 2 (Prompt Language: English vs. Chinese) factorial design across eight multimodal tasks, uniquely incorporating image generation and interpretation to extend analysis beyond textual boundaries. Results revealed inconsistencies with both dominant paradigms: model origin did not predict cultural alignment, with US models frequently exhibiting "holistic" traits typically associated with East Asian data. Similarly, prompt language did not trigger stable cultural frame-switching; instead, we observed Cultural Reversal, where English prompts paradoxically elicited higher contextual attention than Chinese prompts. Crucially, we identified a novel phenomenon termed Service Persona Camouflage: Reinforcement Learning from Human Feedback (RLHF) collapsed cultural variance in affective tasks into a hyper-positive, zero-variance "helpful assistant" persona. We conclude that LLMs do not simulate human culture but exhibit an emergent Machine Culture -- a probabilistic phenomenon shaped by superposition in high-dimensional space and mode collapse from safety alignment.

Scaling medical imaging report generation with multimodal reinforcement learning

Authors: Qianchu Liu, Sheng Zhang, Guanghui Qin, Yu Gu, Ying Jin, Sam Preston, Yanbo Xu, Sid Kiblawi, Wen-wai Yim, Tim Ossowski, Tristan Naumann, Mu Wei, Hoifung Poon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.17151
Pdf link: https://arxiv.org/pdf/2601.17151
Abstract Frontier models have demonstrated remarkable capabilities in understanding and reasoning with natural-language text, but they still exhibit major competency gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine. Medical imaging report generation is a prominent example. Supervised fine-tuning can substantially improve performance, but it is prone to overfitting to superficial boilerplate patterns. In this paper, we introduce Universal Report Generation (UniRG) as a general framework for medical imaging report generation. By leveraging reinforcement learning as a unifying mechanism to directly optimize for evaluation metrics designed for end applications, UniRG can significantly improve upon supervised fine-tuning and attain durable generalization across diverse institutions and clinical practices. We trained UniRG-CXR on publicly available chest X-ray (CXR) data and conducted a thorough evaluation in CXR report generation with rigorous evaluation scenarios. On the authoritative ReXrank benchmark, UniRG-CXR sets new overall SOTA, outperforming prior state of the art by a wide margin.

Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning

Authors: Massimiliano Pronesti, Anya Belz, Yufang Hou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17223
Pdf link: https://arxiv.org/pdf/2601.17223
Abstract Recent work on reinforcement learning with verifiable rewards (RLVR) has shown that large language models (LLMs) can be substantially improved using outcome-level verification signals, such as unit tests for code or exact-match checks for mathematics. In parallel, process supervision has long been explored as a way to shape the intermediate reasoning behaviour of LLMs, but existing approaches rely on neural judges to score chain-of-thought steps, leaving them vulnerable to opacity, bias, and reward hacking. To address this gap, we introduce Verifiable Process Reward Models (VPRMs), a reinforcement-learning framework in which intermediate reasoning steps are checked by deterministic, rule-based verifiers. We apply VPRMs to risk-of-bias assessment for medical evidence synthesis, a domain where guideline-defined criteria and rule-based decision paths enable programmatic verification of reasoning traces. Across multiple datasets, we find that VPRMs generate reasoning that adheres closely to domain rules and achieve substantially higher coherence between step-level decisions and final labels. Results show that VPRMs achieve up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.
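Illustrative sketch, not the paper's verifier: the abstract describes checking intermediate reasoning steps with deterministic, rule-based verifiers and rewarding coherence between step-level decisions and the final label. The toy rules below are hypothetical (the question ids, the answer vocabulary, the decision path in `expected_label`, and the 0.5/0.5 weighting are all assumptions). They only show how such a reward can be computed without a neural judge.

```python
from typing import Dict, List, Tuple

# Assumption: each reasoning step answers one signaling question with one of these values.
ALLOWED_ANSWERS = {"yes", "no", "unclear"}


def expected_label(answers: Dict[str, str]) -> str:
    """Hypothetical decision path from step-level answers to a final label."""
    if any(a == "no" for a in answers.values()):
        return "high"
    if any(a == "unclear" for a in answers.values()):
        return "some concerns"
    return "low"


def process_reward(steps: List[Tuple[str, str]], required: List[str], final_label: str) -> float:
    """Score a reasoning trace using deterministic checks only."""
    answers = dict(steps)
    # Step-level check: every required question is answered with an allowed value.
    valid = sum(1 for q in required if answers.get(q) in ALLOWED_ANSWERS)
    step_score = valid / len(required)
    # Coherence check: the final label must follow from the step answers.
    complete = valid == len(required)
    coherent = complete and expected_label({q: answers[q] for q in required}) == final_label
    return 0.5 * step_score + 0.5 * (1.0 if coherent else 0.0)
```

Because every check is a plain rule, the same trace always receives the same reward, which is the property that distinguishes this style of process supervision from scoring by a learned judge.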

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Generation

Authors: David Y. Liu, Xanthe Muston, Aditya Joshi, Sebastian Sequoiah-Grayson
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17226
Pdf link: https://arxiv.org/pdf/2601.17226
Abstract Despite the subjective nature of storytelling, past works on automatic story generation (ASG) have relied on limited ground truths for training and evaluation. In this work, we explore reinforcement learning (d-RLAIF) as a post-training alternative to supervised fine-tuning (SFT). We first apply Todorov's Theory of Narrative Equilibrium to establish principles that define desirable ASG qualities. We prompt 7B and 14B LLM-as-judge models with our principles to test alignment with human annotators and provide reward signals during d-RLAIF. We use Gemini-3-Flash to evaluate the output of our post-trained models and compare them to human-written stories from the TimeTravel dataset. We show that d-RLAIF offers a viable alternative to supervised fine-tuning (SFT), producing stories that are more diverse and aligned with human narrative conventions. Our paper demonstrates the promise of reinforcement learning for linguistically grounded post-training for subjective tasks such as ASG.

Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning

Authors: Lianlei Shan, Han Chen, Yixuan Wang, Zhenjie Liu, Wei Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17275
Pdf link: https://arxiv.org/pdf/2601.17275
Abstract While Large Language Models (LLMs) demonstrate exceptional performance in surface-level text generation, their nature in handling complex multi-step reasoning tasks often remains one of "statistical fitting" rather than systematic logical deduction. Traditional Reinforcement Learning (RL) attempts to mitigate this by introducing a "think-before-speak" paradigm. However, applying RL directly in high-dimensional, discrete token spaces faces three inherent challenges: sample-inefficient rollouts, high gradient estimation variance, and the risk of catastrophic forgetting. To fundamentally address these structural bottlenecks, we propose DeepLatent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework. This framework shifts the trial-and-error cost from expensive token-level full sequence generation to the continuous latent manifold. Specifically, we introduce a lightweight assistant model to efficiently sample K reasoning chain encodings within the latent space. These encodings are filtered via a dual reward mechanism based on correctness and formatting; only high-value latent trajectories are fed into a frozen main model for single-pass decoding. To maximize reasoning diversity while maintaining coherence, we design a contrastive learning objective to enable directed exploration within the latent space. Since the main model parameters remain frozen during optimization, this method mathematically eliminates catastrophic forgetting. Experiments demonstrate that under comparable GPU computational budgets, DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities, providing a viable path toward reliable and scalable reinforcement learning for LLMs.

Structure-Aware NL-to-SQL for SFC Provisioning via AST-Masking Empowered Language Models

Authors: Xinyu Zhu, Parisa Fard Moshiri, Poonam Lohan, Burak Kantarci, Emil Janulewicz
Subjects: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.17295
Pdf link: https://arxiv.org/pdf/2601.17295
Abstract Effective Service Function Chain (SFC) provisioning requires precise orchestration in dynamic and latency-sensitive networks. Reinforcement Learning (RL) improves adaptability but often ignores structured domain knowledge, which limits generalization and interpretability. Large Language Models (LLMs) address this gap by translating natural language (NL) specifications into executable Structured Query Language (SQL) commands for specification-driven SFC management. Conventional fine-tuning, however, can cause syntactic inconsistencies and produce inefficient queries. To overcome this, we introduce Abstract Syntax Tree (AST)-Masking, a structure-aware fine-tuning method that uses SQL ASTs to assign weights to key components and enforce syntax-aware learning without adding inference overhead. Experiments show that AST-Masking significantly improves SQL generation accuracy across multiple language models. FLAN-T5 reaches an Execution Accuracy (EA) of 99.6%, while Gemma achieves the largest absolute gain from 7.5% to 72.0%. These results confirm the effectiveness of structure-aware fine-tuning in ensuring syntactically correct and efficient SQL generation for interpretable SFC orchestration.
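Illustrative sketch, not the paper's AST-Masking: the abstract describes using SQL ASTs to assign weights to key components during fine-tuning, with no added inference cost. The snippet below shows only the weighting idea as a weighted token-level negative log-likelihood. It assumes a fixed keyword set as a stand-in for AST-derived node categories, and the names `STRUCTURAL`, `token_weights`, and `weighted_nll`, as well as the weight 2.0, are hypothetical.

```python
import math
from typing import List

# Assumption: a keyword list stands in for structural nodes found by a real SQL parser.
STRUCTURAL = {"SELECT", "FROM", "WHERE", "JOIN", "ON", "GROUP", "BY", "ORDER", "LIMIT"}


def token_weights(tokens: List[str], structural_weight: float = 2.0) -> List[float]:
    """Give structural SQL tokens a larger loss weight than ordinary tokens."""
    return [structural_weight if t.upper() in STRUCTURAL else 1.0 for t in tokens]


def weighted_nll(token_probs: List[float], weights: List[float]) -> float:
    """Weighted average negative log-likelihood of the target tokens."""
    total = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)
```

Since the weights only change the training loss, the fine-tuned model is used at inference exactly like an unweighted one, which matches the abstract's point about avoiding inference overhead.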

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Authors: Tiejin Chen, Xiaoou Liu, Vishnu Nandam, Kuan-Ru Liou, Hua Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17329
Pdf link: https://arxiv.org/pdf/2601.17329
Abstract Preference-based alignment like Reinforcement Learning from Human Feedback (RLHF) learns from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more fundamental factor: the reliability of the answers being compared. To address the problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting and yields more robust, data-efficient alignment. Codes are provided here.
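Illustrative sketch, not the CFA implementation: the abstract describes building conformal prediction sets with controllable coverage and turning answer-level reliability into preference weights. The snippet below uses standard split conformal calibration for the threshold. The rule in `pair_weight` (full weight for answers inside the set, a small `floor` otherwise, averaged over the pair) is a hypothetical aggregation chosen only for illustration.

```python
import math
from typing import Dict, List, Set


def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores: Dict[str, float], qhat: float) -> Set[str]:
    """Keep answers whose nonconformity score (lower is better) is within the threshold."""
    return {answer for answer, score in candidate_scores.items() if score <= qhat}


def pair_weight(chosen: str, rejected: str, pred_set: Set[str], floor: float = 0.1) -> float:
    """Hypothetical weighting: down-weight pairs whose answers fall outside the set."""
    reliabilities = [1.0 if a in pred_set else floor for a in (chosen, rejected)]
    return sum(reliabilities) / 2
```

A weight produced this way could multiply the per-pair loss in DPO-style training. How the authors actually aggregate reliabilities is not specified in the abstract.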

Scaling Rough Terrain Locomotion with Automatic Curriculum Reinforcement Learning

Authors: Ziming Li, Chenhao Li, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.17428
Pdf link: https://arxiv.org/pdf/2601.17428
Abstract Curriculum learning has demonstrated substantial effectiveness in robot learning. However, it still faces limitations when scaling to complex, wide-ranging task spaces. Such task spaces often lack a well-defined difficulty structure, making the difficulty ordering required by previous methods challenging to define. We propose a Learning Progress-based Automatic Curriculum Reinforcement Learning (LP-ACRL) framework, which estimates the agent's learning progress online and adaptively adjusts the task-sampling distribution, thereby enabling automatic curriculum generation without prior knowledge of the difficulty distribution over the task space. Policies trained with LP-ACRL enable the ANYmal D quadruped to achieve and maintain stable, high-speed locomotion at 2.5 m/s linear velocity and 3.0 rad/s angular velocity across diverse terrains, including stairs, slopes, gravel, and low-friction flat surfaces, whereas previous methods have generally been limited to high speeds on flat terrain or low speeds on complex terrain. Experimental results demonstrate that LP-ACRL exhibits strong scalability and real-world applicability, providing a robust baseline for future research on curriculum generation in complex, wide-ranging robotic learning task spaces.
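Illustrative sketch, not the LP-ACRL algorithm: the abstract describes estimating learning progress online and adapting the task-sampling distribution without a predefined difficulty ordering. One common way to approximate learning progress, assumed here, is the gap between a fast and a slow moving average of per-task performance. The class name `LearningProgressSampler` and the parameters `fast`, `slow`, and `eps` are hypothetical.

```python
import random


class LearningProgressSampler:
    """Sample tasks in proportion to an estimate of recent learning progress."""

    def __init__(self, n_tasks, fast=0.3, slow=0.05, eps=0.05):
        self.fast, self.slow, self.eps = fast, slow, eps
        self.fast_avg = [0.0] * n_tasks  # reacts quickly to new results
        self.slow_avg = [0.0] * n_tasks  # lags behind, acting as a baseline

    def update(self, task, score):
        """Record the performance score obtained on `task`."""
        self.fast_avg[task] += self.fast * (score - self.fast_avg[task])
        self.slow_avg[task] += self.slow * (score - self.slow_avg[task])

    def probabilities(self):
        # Progress is the absolute gap between the averages; eps keeps every task reachable.
        progress = [abs(f - s) + self.eps for f, s in zip(self.fast_avg, self.slow_avg)]
        total = sum(progress)
        return [p / total for p in progress]

    def sample(self, rng=random):
        tasks = range(len(self.fast_avg))
        return rng.choices(tasks, weights=self.probabilities(), k=1)[0]
```

Tasks whose performance is changing, whether improving or degrading, are sampled more often, while mastered or currently unlearnable tasks fall back to the `eps` floor.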

PILOT: A Perceptive Integrated Low-level Controller for Loco-manipulation over Unstructured Scenes

Authors: Xinru Cui, Linxi Feng, Yixuan Zhou, Haoqi Han, Zhe Liu, Hesheng Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.17440
Pdf link: https://arxiv.org/pdf/2601.17440
Abstract Humanoid robots hold great potential for diverse interactions and daily service tasks within human-centered environments, necessitating controllers that seamlessly integrate precise locomotion with dexterous manipulation. However, most existing whole-body controllers lack exteroceptive awareness of the surrounding environment, rendering them insufficient for stable task execution in complex, unstructured environments. To address this challenge, we propose PILOT, a unified single-stage reinforcement learning (RL) framework tailored for perceptive loco-manipulation, which synergizes perceptive locomotion and expansive whole-body control within a single policy. To enhance terrain awareness and ensure precise foot placement, we design a cross-modal context encoder that fuses prediction-based proprioceptive features with attention-based perceptive representations. Furthermore, we introduce a Mixture-of-Experts (MoE) policy architecture to coordinate diverse motor skills, facilitating better specialization across distinct motion patterns. Extensive experiments in both simulation and on the physical Unitree G1 humanoid robot validate the efficacy of our framework. PILOT demonstrates superior stability, command tracking precision, and terrain traversability compared to existing baselines. These results highlight its potential to serve as a robust, foundational low-level controller for loco-manipulation in unstructured scenes.

Embodiment-Induced Coordination Regimes in Tabular Multi-Agent Q-Learning

Authors: Muhammad Ahmed Atif, Nehal Naeem Haji, Mohammad Shahid Shaikh, Muhammad Ebad Atif
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.17454
Pdf link: https://arxiv.org/pdf/2601.17454
Abstract Centralized value learning is often assumed to improve coordination and stability in multi-agent reinforcement learning, yet this assumption is rarely tested under controlled conditions. We directly evaluate it in a fully tabular predator-prey gridworld by comparing independent and centralized Q-learning under explicit embodiment constraints on agent speed and stamina. Across multiple kinematic regimes and asymmetric agent roles, centralized learning fails to provide a consistent advantage and is frequently outperformed by fully independent learning, even under full observability and exact value estimation. Moreover, asymmetric centralized-independent configurations induce persistent coordination breakdowns rather than transient learning instability. By eliminating confounding effects from function approximation and representation learning, our tabular analysis isolates coordination structure as the primary driver of these effects. The results show that increased coordination can become a liability under embodiment constraints, and that the effectiveness of centralized learning is fundamentally regime and role dependent rather than universal.
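Illustrative sketch, not the paper's experimental code: the abstract compares independent and centralized tabular Q-learning. The snippet below shows the structural difference under the standard Q-learning update. Independent learners keep one table each over their own actions, while a centralized learner keeps a single table over joint actions. The action names and the two-agent setup are hypothetical.

```python
from collections import defaultdict
from itertools import product


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; `action` may be a single or a joint action."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


AGENT_ACTIONS = ["left", "right", "stay"]

# Independent learners: one table per agent, indexed by that agent's own action.
independent_tables = [defaultdict(float) for _ in range(2)]

# Centralized learner: a single table indexed by the joint action of both agents.
joint_actions = list(product(AGENT_ACTIONS, repeat=2))
central_table = defaultdict(float)
```

The joint table grows multiplicatively with the number of agents (9 joint actions here versus 3 per independent agent), so the two setups differ in coordination structure even though the update rule is identical.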

MetaWorld: Skill Transfer and Composition in a Hierarchical World Model for Grounding High-Level Instructions

Authors: Yutong Shen, Hangxu Liu, Kailin Pei, Ruizhe Xia, Tongtong Feng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.17507
Pdf link: https://arxiv.org/pdf/2601.17507
Abstract Humanoid robot loco-manipulation remains constrained by the semantic-physical gap. Current methods face three limitations: low sample efficiency in reinforcement learning, poor generalization in imitation learning, and physical inconsistency in VLMs. We propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. The framework decouples tasks into a VLM-driven semantic layer and a latent dynamics model operating in a compact state space. Our dynamic expert selection and motion prior fusion mechanism leverages a pre-trained multi-expert policy library as transferable knowledge, enabling efficient online adaptation via a two-stage framework. VLMs serve as semantic interfaces, mapping instructions to executable skills and bypassing symbol grounding. Experiments on Humanoid-Bench show MetaWorld outperforms world model-based RL in task completion and motion coherence. Our code will be found at this https URL

Cognitive Platform Engineering for Autonomous Cloud Operations

Authors: Vinoth Punniyamoorthy, Nitin Saksena, Srivenkateswara Reddy Sankiti, Nachiappan Chockalingam, Aswathnarayan Muthukrishnan Kirubakaran, Shiva Kumar Reddy Carimireddy, Durgaraman Maruthavanan
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.17542
Pdf link: https://arxiv.org/pdf/2601.17542
Abstract Modern DevOps practices have accelerated software delivery through automation, CI/CD pipelines, and observability tooling, but these approaches struggle to keep pace with the scale and dynamism of cloud-native systems. As telemetry volume grows and configuration drift increases, traditional, rule-driven automation often results in reactive operations, delayed remediation, and dependency on manual expertise. This paper introduces Cognitive Platform Engineering, a next-generation paradigm that integrates sensing, reasoning, and autonomous action directly into the platform lifecycle. It proposes a four-plane reference architecture that unifies data collection, intelligent inference, policy-driven orchestration, and human experience layers within a continuous feedback loop. A prototype implementation built with Kubernetes, Terraform, Open Policy Agent, and ML-based anomaly detection demonstrates improvements in mean time to resolution, resource efficiency, and compliance. The results show that embedding intelligence into platform operations enables resilient, self-adjusting, and intent-aligned cloud environments. The paper concludes with research opportunities in reinforcement learning, explainable governance, and sustainable self-managing cloud ecosystems.

Quantum-Inspired Episode Selection for Monte Carlo Reinforcement Learning via QUBO Optimization

Authors: Hadi Salloum, Ali Jnadi, Yaroslav Kholodov, Alexander Gasnikov
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.17570
Pdf link: https://arxiv.org/pdf/2601.17570
Abstract Monte Carlo (MC) reinforcement learning suffers from high sample complexity, especially in environments with sparse rewards, large state spaces, and correlated trajectories. We address these limitations by reformulating episode selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem and solving it with quantum-inspired samplers. Our method, MC+QUBO, integrates a combinatorial filtering step into standard MC policy evaluation: from each batch of trajectories, we select a subset that maximizes cumulative reward while promoting state-space coverage. This selection is encoded as a QUBO, where linear terms favor high-reward episodes and quadratic terms penalize redundancy. We explore both Simulated Quantum Annealing (SQA) and Simulated Bifurcation (SB) as black-box solvers within this framework. Experiments in a finite-horizon GridWorld demonstrate that MC+QUBO outperforms vanilla MC in convergence speed and final policy quality, highlighting the potential of quantum-inspired optimization as a decision-making subroutine in reinforcement learning.
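Illustrative sketch, not the authors' MC+QUBO code: the abstract states that episode selection is encoded as a QUBO whose linear terms favor high-reward episodes and whose quadratic terms penalize redundancy. The snippet below builds such a matrix under two assumptions: redundancy is measured as the number of shared visited states, and an exhaustive search stands in for the SQA and SB solvers, which is feasible only for tiny batches. All function names and the `penalty` parameter are hypothetical.

```python
from itertools import product


def build_qubo(returns, visited, penalty=1.0):
    """Upper-triangular QUBO: minimize sum_i Q[i][i] x_i + sum_{i<j} Q[i][j] x_i x_j."""
    n = len(returns)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = -returns[i]  # linear term: reward for selecting episode i
        for j in range(i + 1, n):
            # Quadratic term: assumed redundancy measure is the count of shared states.
            q[i][j] = penalty * len(visited[i] & visited[j])
    return q


def energy(q, x):
    """Objective value of a binary selection vector x."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def brute_force_solve(q):
    """Exhaustive stand-in for a quantum-inspired sampler (tiny problems only)."""
    return min(product((0, 1), repeat=len(q)), key=lambda x: energy(q, x))
```

For three episodes with returns 3.0, 2.9, and 1.0, where the first two visit the same three states, the minimizer keeps the first and third episodes and drops the near-duplicate second one, trading a little return for coverage.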

Learning to Ideate for Machine Learning Engineering Agents

Authors: Yunxiang Zhang, Kang Zhou, Zhichao Xu, Kiran Ramnath, Yun Zhou, Sangmin Woo, Haibo Ding, Lin Lee Cheong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.17596
Pdf link: https://arxiv.org/pdf/2601.17596
Abstract Existing machine learning engineering (MLE) agents struggle to iteratively optimize their implemented algorithms for effectiveness. To address this, we introduce MLE-Ideator, a dual-agent framework that separates ideation from implementation. In our system, an implementation agent can request strategic help from a dedicated Ideator. We show this approach is effective in two ways. First, in a training-free setup, our framework significantly outperforms implementation-only agent baselines on MLE-Bench. Second, we demonstrate that the Ideator can be trained with reinforcement learning (RL) to generate more effective ideas. With only 1K training samples from 10 MLE tasks, our RL-trained Qwen3-8B Ideator achieves an 11.5% relative improvement compared to its untrained counterpart and surpasses Claude Sonnet 3.5. These results highlight a promising path toward training strategic AI systems for scientific discovery.

Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments

深度内在惊奇正则化控制（DISRC）：一种受生物启发的机制，用于稀疏环境中高效的深度Q学习

Authors: Yash Kini, Shiv Davay, Shreya Polavarapu
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.17598
Pdf link: https://arxiv.org/pdf/2601.17598
Abstract Deep reinforcement learning (DRL) has driven major advances in autonomous control. Still, standard Deep Q-Network (DQN) agents tend to rely on fixed learning rates and uniform update scaling, even as updates are modulated by temporal-difference (TD) error. This rigidity destabilizes convergence, especially in sparse-reward settings where feedback is infrequent. We introduce Deep Intrinsic Surprise-Regularized Control (DISRC), a biologically inspired augmentation to DQN that dynamically scales Q-updates based on latent-space surprise. DISRC encodes states via a LayerNorm-based encoder and computes a deviation-based surprise score relative to a moving latent setpoint. Each update is then scaled in proportion to both TD error and surprise intensity, promoting plasticity during early exploration and stability as familiarity increases. We evaluate DISRC on two sparse-reward MiniGrid environments, MiniGrid-DoorKey-8x8 and MiniGrid-LavaCrossingS9N1, under the same settings as a vanilla DQN baseline. In DoorKey, DISRC reached the first successful episode (reward > 0.8) 33% faster than the vanilla DQN baseline (79 vs. 118 episodes), with lower reward standard deviation (0.25 vs. 0.34) and higher reward area under the curve (AUC: 596.42 vs. 534.90). These metrics reflect faster, more consistent learning, which is critical in sparse, delayed-reward settings. In LavaCrossing, DISRC achieved a higher final reward (0.95 vs. 0.93) and the highest AUC of all agents (957.04), though it converged more gradually. These preliminary results establish DISRC as a novel mechanism for regulating learning intensity in off-policy agents, improving both efficiency and stability in sparse-reward domains. By treating surprise as an intrinsic learning signal, DISRC enables agents to modulate updates based on expectation violations, enhancing decision quality when conventional value-based methods fall short.
中文摘要 深度强化学习（DRL）推动了自主控制的重大进展。尽管如此，标准的深度Q网络（DQN）代理往往依赖固定的学习率和统一的更新扩展，即使更新会受到时间差分（TD）误差的调制。这种僵化会破坏收敛，尤其是在奖励稀少且反馈稀少的环境中。我们介绍了深度内在惊喜正则化控制（DISRC），这是一种基于生物启发的DQN增强方法，基于潜空间惊讶动态扩展Q更新。DISRC 通过基于 LayerNorm 的编码器编码状态，并相对于移动的潜在集合点计算基于偏差的惊喜分数。每次更新都会根据TD误差和惊喜强度的比例进行扩展，促进早期探索的可塑性以及随着熟悉度增加的稳定性。我们在两个稀疏奖励的MiniGrid环境上评估DISRC，包括MiniGrid-DoorKey-8x8和MiniGrid-LavaCrossingS9N1，且设置与原版DQN基线相同。在DoorKey中，DISRC比原版DQN基线（79集对118集）快33%的成功集数（奖励>0.8），奖励标准差更低（0.25对0.34），曲线下的奖励面积也更高（AUC：596.42对534.90）。这些指标反映了更快、更稳定的学习——对于奖励设置稀疏且延迟的设置至关重要。在熔岩穿越中，DISRC获得了更高的最终奖励（0.95对0.93）和最高的辅助值（957.04），尽管收敛较为缓慢。这些初步结果确立了DISRC作为调节非策略代理学习强度的新机制，提升稀疏奖励域的效率和稳定性。通过将惊讶视为内在学习信号，DISRC使智能体能够基于期望违规调制更新，提升传统价值方法不足时的决策质量。
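
A minimal sketch of the surprise-scaled update described in this abstract follows. It assumes an exponential moving average for the latent setpoint and a bounded `tanh` multiplier; the class name, the `beta` and `momentum` parameters, and the exact scaling form are hypothetical, since the abstract does not specify them.

```python
import numpy as np

class SurpriseScaler:
    """Tracks a moving latent setpoint and scores deviation from it."""

    def __init__(self, dim, momentum=0.9):
        self.setpoint = np.zeros(dim)
        self.momentum = momentum

    def surprise(self, latent):
        # Deviation-based surprise: Euclidean distance from the current setpoint.
        score = float(np.linalg.norm(latent - self.setpoint))
        # Move the setpoint toward the latest latent (exponential moving average).
        self.setpoint = self.momentum * self.setpoint + (1 - self.momentum) * latent
        return score

def scaled_q_update(q_value, td_error, surprise, lr=0.1, beta=1.0):
    """Apply a TD update whose step size grows with surprise (assumed form)."""
    scale = 1.0 + beta * np.tanh(surprise)  # bounded multiplier in [1, 1 + beta)
    return q_value + lr * scale * td_error

scaler = SurpriseScaler(dim=2)
latent = np.array([1.0, 0.0])
first = scaler.surprise(latent)   # novel latent: large deviation
second = scaler.surprise(latent)  # same latent again: setpoint has moved closer
baseline_step = scaled_q_update(0.0, td_error=1.0, surprise=0.0)
```

Repeated exposure to the same latent lowers the surprise score, so the effective step size decays toward the plain learning rate as states become familiar.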

Athena: Synergizing Data Prefetching and Off-Chip Prediction via Online Reinforcement Learning

Athena：通过在线强化学习协同数据预取与芯片外预测

Authors: Rahul Bera, Zhenrong Lang, Caroline Hengartner, Konstantinos Kanellopoulos, Rakesh Kumar, Mohammad Sadrosadati, Onur Mutlu
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.17615
Pdf link: https://arxiv.org/pdf/2601.17615
Abstract Prefetching and off-chip prediction are two techniques proposed to hide long memory access latencies in high-performance processors. In this work, we demonstrate that: (1) prefetching and off-chip prediction often provide complementary performance benefits, yet (2) naively combining them often fails to realize their full performance potential, and (3) existing prefetcher control policies leave significant room for performance improvement. Our goal is to design a holistic framework that can autonomously learn to coordinate an off-chip predictor with multiple prefetchers employed at various cache levels. To this end, we propose a new technique called Athena, which models the coordination between prefetchers and the off-chip predictor (OCP) as a reinforcement learning (RL) problem. Athena acts as the RL agent that observes multiple system-level features (e.g., prefetcher/OCP accuracy, bandwidth usage) over an epoch of program execution, and uses them as state information to select a coordination action (i.e., enabling the prefetcher and/or OCP, and adjusting prefetcher aggressiveness). At the end of every epoch, Athena receives a numerical reward that measures the change in multiple system-level metrics (e.g., number of cycles taken to execute an epoch). Athena uses this reward to autonomously and continuously learn a policy to coordinate prefetchers with the OCP. Our extensive evaluation using a diverse set of memory-intensive workloads shows that Athena consistently outperforms prior state-of-the-art coordination policies across a wide range of system configurations with various combinations of underlying prefetchers, OCPs, and main memory bandwidths, while incurring only modest storage overhead. Athena is freely available at this https URL.
中文摘要 预取和片外预测是两种被提出用于隐藏高性能处理器中较长内存访问延迟的技术。本研究表明：（1）预取与片外预测常常互补性能优势，但（2）简单结合往往无法发挥其全部性能潜力，（3）现有预取控制策略留有显著性能提升空间。我们的目标是设计一个整体框架，能够自主学习协调片外预测器，同时在不同缓存层级使用多个预取器。为此，我们提出了一种名为Athena的新技术，它将预取器与芯片外预测器（OCP）之间的协调建模为强化学习（RL）问题。Athena 作为强化学习代理，在程序执行的一个时间点内观察多个系统级特征（例如预取器/OCP 准确性、带宽使用），并利用这些特征作为状态信息选择协调动作（即启用预取器和/或 OCP，并调整预取者的激进性）。每个纪元结束时，雅典娜都会获得一个数值奖励，衡量多个系统级指标的变化（例如执行一个纪元所用的周期数）。Athena利用这一奖励自主且持续地学习策略，以协调预取者与OCP的关系。我们通过广泛评估，使用多种内存密集型工作负载，表明Athena在多种系统配置中，基于多种底层预取器、OCP和主存带宽组合，始终优于以往最先进的协调策略，同时仅产生适度的存储开销。雅典娜可通过此 https URL 免费获取。
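
The epoch-level loop in this abstract (observe features, pick a coordination action, receive a reward) can be sketched as tabular Q-learning. This is only an illustration under stated assumptions: the four-action set, the string-valued state, and the epsilon-greedy rule are hypothetical, and the abstract does not say which RL algorithm Athena actually uses.

```python
import random
from collections import defaultdict

# Hypothetical coordination actions: (prefetcher setting, OCP setting).
ACTIONS = [
    ("prefetcher_off", "ocp_off"),
    ("prefetcher_on", "ocp_off"),
    ("prefetcher_off", "ocp_on"),
    ("prefetcher_on", "ocp_on"),
]

class EpochCoordinator:
    """Tabular Q-learning agent that picks one coordination action per epoch."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # maps (state, action index) to a value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def select(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step Q-learning update applied at the end of an epoch.
        best_next = max(self.q[(next_state, a)] for a in range(len(ACTIONS)))
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

agent = EpochCoordinator(epsilon=0.0)
state = ("accuracy_high", "bandwidth_low")  # discretized features (assumption)
agent.update(state, action=3, reward=1.0, next_state=state)
greedy_action = agent.select(state)
```

Here the reward would stand for the change in a system-level metric such as cycles per epoch; a hardware implementation would need a far more compact state encoding than a Python dictionary.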

DIML: Differentiable Inverse Mechanism Learning from Behaviors of Multi-Agent Learning Trajectories

DIML：多智能体学习轨迹行为中的可微逆机制学习

Authors: Zhiyu An, Wan Du
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2601.17678
Pdf link: https://arxiv.org/pdf/2601.17678
Abstract We study inverse mechanism learning: recovering an unknown incentive-generating mechanism from observed strategic interaction traces of self-interested learning agents. Unlike inverse game theory and multi-agent inverse reinforcement learning, which typically infer utility/reward parameters inside a structured mechanism, our target includes unstructured mechanisms -- (possibly neural) mappings from joint actions to per-agent payoffs. Unlike differentiable mechanism design, which optimizes mechanisms forward, we infer mechanisms from behavior in an observational setting. We propose DIML, a likelihood-based framework that differentiates through a model of multi-agent learning dynamics and uses the candidate mechanism to generate counterfactual payoffs needed to predict observed actions. We establish identifiability of payoff differences under a conditional logit response model and prove statistical consistency of maximum likelihood estimation under standard regularity conditions. We evaluate DIML with simulated interactions of learning agents across unstructured neural mechanisms, congestion tolling, public goods subsidies, and large-scale anonymous games. DIML reliably recovers identifiable incentive differences and supports counterfactual prediction, where its performance rivals a tabular enumeration oracle in small environments and its convergence scales to large, hundred-participant environments. Code to reproduce our experiments is open-sourced.
中文摘要 我们研究逆机制学习：从观察到的自利学习主体的战略互动痕迹中恢复未知的激励生成机制。与通常在结构化机制中推断效用/奖励参数的逆博弈论和多代理逆强化学习不同，我们的目标包括非结构化机制——一种从联合行动到每个代理收益的（可能是神经的）映射。与优化机制的可微机制设计不同，我们从观察环境中的行为推断机制。我们提出了DIML，一种基于似然的框架，通过多智能体学习动态模型区分，并利用候选机制生成预测观察到动作所需的反事实收益。我们建立了条件逻辑特响应模型下收益差异的可识别性，并证明了标准正则条件下最大似然估计的统计一致性。我们通过模拟学习代理跨无结构神经机制、拥堵收费、公共物品补贴和大规模匿名游戏的互动来评估DIML。DIML可靠地恢复可识别的激励差异，支持反事实预测，其性能在小型环境中可与表格枚举预言机媲美，且其收敛性可扩展到大型百人参与环境。用于复现我们实验的代码是开源的。
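
The abstract states that payoff differences, rather than payoff levels, are identifiable under a conditional logit response model. The short check below illustrates why, using the standard softmax form of the logit model; the function name and `temperature` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def logit_choice_probs(payoffs, temperature=1.0):
    """Conditional logit (softmax) choice probabilities over actions."""
    z = np.asarray(payoffs, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

payoffs = np.array([1.0, 2.0, 0.5])
shifted = payoffs + 10.0  # same payoff differences, different levels

probs_original = logit_choice_probs(payoffs)
probs_shifted = logit_choice_probs(shifted)
```

Adding a constant to every payoff leaves the choice probabilities unchanged, so observed behavior alone cannot pin down the payoff level, only the differences between actions.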

Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis

代理强化学习赋能下一代化学语言模型，用于分子设计和合成

Authors: Hao Li, He Cao, Shenyao Peng, Zijing Liu, Bin Feng, Yu Wang, Zhiyuan Yan, Yonghong Tian, Yu Li, Li Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17687
Pdf link: https://arxiv.org/pdf/2601.17687
Abstract Language models are revolutionizing the biochemistry domain, assisting scientists in drug design and chemical synthesis with high efficiency. Yet current approaches are caught between small language models, which are prone to hallucination and limited knowledge retention, and large cloud-based language models, which are plagued by privacy risks and high inference costs. To bridge this gap, we introduce ChemCRAFT, a novel framework leveraging agentic reinforcement learning to decouple chemical reasoning from knowledge storage. Instead of forcing the model to memorize vast chemical data, our approach empowers the language model to interact with a sandbox for precise information retrieval. This externalization of knowledge allows a locally deployable small model to achieve superior performance with minimal inference costs. To give small language models agent-calling ability, we build an agentic trajectory construction pipeline and a comprehensive chemical-agent sandbox. Based on sandbox interactions, we constructed ChemToolDataset, the first large-scale chemical tool trajectory dataset. Simultaneously, we propose SMILES-GRPO to build a dense chemical reward function, promoting the model's ability to call chemical agents. Evaluations across diverse aspects of drug design show that ChemCRAFT outperforms current cloud-based LLMs in molecular structure analysis, molecular optimization, and synthesis pathway prediction, demonstrating that scientific reasoning is not solely an emergent ability of model scale, but a learnable policy of tool orchestration. This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry, opening new avenues for accelerating molecular discovery with locally deployable agents.
中文摘要 语言模型正在革新生物化学领域，高效地协助科学家进行药物设计和化学合成。然而，当前的方法在容易产生幻觉和有限知识保留的小型语言模型与存在隐私风险和高推理成本的大型云端语言模型之间存在困难。为弥合这一差距，我们引入了ChemCRAFT，一种利用能动强化学习将化学推理与知识存储解耦的新框架。我们的方法不强迫模型记忆庞大的化学数据，而是赋予语言模型与沙盒交互，实现精确信息检索的能力。这种知识的外部化使得本地部署的小型模型能够以极低的推理成本实现更优的性能。为了支持小型语言模型实现代理调用能力，我们构建了一个代理轨迹构建流程和全面的化学代理沙盒。基于沙盒交互，我们构建了ChemToolDataset，这是首个大规模化学工具轨迹数据集。同时，我们提出SMILES-GRPO构建高密度化学奖励函数，促进模型调用化学剂的能力。药物设计多个方面的评估显示，ChemCRAFT在分子结构分析、分子优化和合成途径预测方面优于现有云端大型语言模型，表明科学推理不仅是模型规模的涌现能力，更是工具编排的可学习策略。这项工作为人工智能辅助化学建立了一种成本效益高且保护隐私的范式，开辟了利用本地部署剂加速分子发现的新途径。

SQL-Trail: Multi-Turn Reinforcement Learning with Interleaved Feedback for Text-to-SQL

SQL跟踪：多回合强化学习，采用交错反馈，适用于文本转SQL的应用

Authors: Harper Hua, Zhen Han, Zhengyuan Shen, Jeremy Lee, Patrick Guan, Qi Zhu, Sullam Jeoung, Yueyan Chen, Yunfei Bai, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.17699
Pdf link: https://arxiv.org/pdf/2601.17699
Abstract While large language models (LLMs) have substantially improved Text-to-SQL generation, a pronounced gap remains between AI systems and human experts on challenging benchmarks such as BIRD-SQL. We argue this gap stems largely from the prevailing single-pass paradigm, which lacks the iterative reasoning, schema exploration, and error-correction behaviors that humans naturally employ. To address this limitation, we introduce SQL-Trail, a multi-turn reinforcement learning (RL) agentic framework for Text-to-SQL. Rather than producing a query in one shot, SQL-Trail interacts with the database environment and uses execution feedback to iteratively refine its predictions. Our approach centers on two key ideas: (i) an adaptive turn-budget allocation mechanism that scales the agent's interaction depth to match question difficulty, and (ii) a composite reward panel that jointly incentivizes SQL correctness and efficient exploration. Across benchmarks, SQL-Trail sets a new state of the art and delivers strong data efficiency--up to 18x higher than prior single-pass RL state-of-the-art methods. Notably, our 7B and 14B models outperform substantially larger proprietary systems by 5% on average, underscoring the effectiveness of interactive, agentic workflows for robust Text-to-SQL generation.
中文摘要 尽管大型语言模型（LLM）在文本转SQL生成方面有了显著提升，但在BIRD-SQL等具有挑战性的基准测试上，AI系统与人类专家之间仍存在明显差距。我们认为这一差距主要源于主流的单一路径范式，缺乏人类自然采用的迭代推理、模式探索和纠错行为。为解决这一限制，我们引入了SQL-Trail，一种多回合强化学习（RL）代理框架，适用于文本转SQL。SQL-Trail 不是一次性生成查询，而是与数据库环境交互，并利用执行反馈迭代优化预测。我们的方法围绕两个核心理念展开：（i）一种自适应的回合预算分配机制，可根据问题难度调整代理的互动深度;（ii）一个综合奖励面板，共同激励SQL的正确性和高效探索。在各基准测试中，SQL-Trail 引领了新的最先进技术，并实现了强大的数据效率——比以往单次强化学习最先进的方法高出18倍。值得注意的是，我们的7B和14B模型平均比更大型专有系统高出5%，这凸显了交互式、代理式工作流在稳健的文本转SQL生成中的有效性。
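
The execution-feedback loop described in this abstract can be sketched with Python's standard `sqlite3` module. In this sketch a fixed list of candidate queries stands in for the policy; a real agent would generate each new query from the accumulated feedback. The function name and the `max_turns` budget are hypothetical.

```python
import sqlite3

def refine_with_feedback(conn, candidates, max_turns):
    """Execute candidate queries in turn, collecting error messages as feedback."""
    feedback = []
    turns_used = 0
    for sql in candidates[:max_turns]:
        turns_used += 1
        try:
            rows = conn.execute(sql).fetchall()
            return rows, turns_used, feedback
        except sqlite3.Error as exc:
            # The error text is what an agent would condition on for its next attempt.
            feedback.append(str(exc))
    return None, turns_used, feedback

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

# The first attempt references a column that does not exist; the second is corrected.
candidates = ["SELECT username FROM users", "SELECT name FROM users"]
rows, turns_used, feedback = refine_with_feedback(conn, candidates, max_turns=3)
```

An adaptive turn budget, as in the abstract, would correspond to choosing `max_turns` per question rather than fixing it globally.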

ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation

ProGraph-R1：图检索增强生成的进展感知强化学习

Authors: Jinyoung Park, Sanghyeok Lee, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.17755
Pdf link: https://arxiv.org/pdf/2601.17755
Abstract Graph Retrieval-Augmented Generation (GraphRAG) has been successfully applied in various knowledge-intensive question answering tasks by organizing external knowledge into structured graphs of entities and relations. It enables large language models (LLMs) to perform complex reasoning beyond text-chunk retrieval. Recent works have employed reinforcement learning (RL) to train agentic GraphRAG frameworks that perform iterative interactions between LLMs and knowledge graphs. However, existing RL-based frameworks such as Graph-R1 suffer from two key limitations: (1) they primarily depend on semantic similarity for retrieval, often overlooking the underlying graph structure, and (2) they rely on sparse, outcome-level rewards, failing to capture the quality of intermediate retrieval steps and their dependencies. To address these limitations, we propose ProGraph-R1, a progress-aware agentic framework for graph-based retrieval and multi-step reasoning. ProGraph-R1 introduces a structure-aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity, encouraging coherent traversal along multi-hop reasoning paths. We also design a progress-based step-wise policy optimization, which provides dense learning signals by modulating advantages according to intermediate reasoning progress within a graph, rather than relying solely on final outcomes. Experiments on multi-hop question answering benchmarks demonstrate that ProGraph-R1 consistently improves reasoning accuracy and generation quality over existing GraphRAG methods.
中文摘要 图检索增强生成（GraphRAG）已成功应用于各种知识密集型问答任务，通过将外部知识组织成结构化的实体和关系图。它使大型语言模型（LLM）能够执行超越文本块检索的复杂推理。近期工作利用强化学习（RL）训练代理型GraphRAG框架，实现LLM与知识图谱之间的迭代交互。然而，现有基于强化学习的框架如Graph-R1存在两个关键局限：（1）它们主要依赖语义相似性进行检索，常忽视底层图结构;（2）依赖稀疏的结果级奖励，未能捕捉中间检索步骤及其依赖关系的质量。为解决这些局限性，我们提出了ProGraph-R1，一种基于图的检索和多步推理的进展感知智能体框架。ProGraph-R1引入了一种结构感知的超图检索机制，结合语义相关性和图连通性，鼓励沿多跳推理路径进行连贯的遍历。我们还设计了基于进展的分阶段策略优化，通过根据图中的中间推理进展调节优势，而非仅依赖最终结果，从而提供密集的学习信号。多跳问答基准测试的实验表明，ProGraph-R1 在推理准确性和生成质量方面持续提升，优于现有 GraphRAG 方法。
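
One way to picture retrieval that "jointly considers semantic relevance and graph connectivity" is a score that mixes cosine similarity with closeness to already-visited nodes. The sketch below uses an ordinary graph rather than the paper's hypergraph, and the mixing weight `alpha` and the `1 / (1 + hops)` connectivity term are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def hop_distances(adjacency, sources):
    """Breadth-first hop counts from any source node to each reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def structure_aware_scores(query_vec, node_vecs, adjacency, visited, alpha=0.5):
    """Blend cosine similarity with graph proximity to the visited nodes."""
    dist = hop_distances(adjacency, list(visited))
    q = query_vec / np.linalg.norm(query_vec)
    scores = {}
    for node, vec in node_vecs.items():
        semantic = float(q @ (vec / np.linalg.norm(vec)))
        # Unreachable nodes get zero connectivity credit.
        connectivity = 1.0 / (1.0 + dist[node]) if node in dist else 0.0
        scores[node] = alpha * semantic + (1 - alpha) * connectivity
    return scores

adjacency = {"A": ["B"], "B": ["A"], "C": []}
node_vecs = {"B": np.array([1.0, 0.0]), "C": np.array([1.0, 0.0])}
scores = structure_aware_scores(np.array([1.0, 0.0]), node_vecs, adjacency, visited={"A"})
```

Nodes B and C are equally similar to the query, but B ranks higher because it is one hop from the visited node A, which favors coherent multi-hop paths.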

Agentic AI for Self-Driving Laboratories in Soft Matter: Taxonomy, Benchmarks, and Open Challenges

软物质自动驾驶实验室的代理人工智能：分类法、基准与开放挑战

Authors: Xuanzhou Chen, Audrey Wang, Stanley Yin, Hanyang Jiang, Dong Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17920
Pdf link: https://arxiv.org/pdf/2601.17920
Abstract Self-driving laboratories (SDLs) close the loop between experiment design, automated execution, and data-driven decision making, and they provide a demanding testbed for agentic AI under expensive actions, noisy and delayed feedback, strict feasibility and safety constraints, and non-stationarity. This survey uses soft matter as a representative setting but focuses on the AI questions that arise in real laboratories. We frame SDL autonomy as an agent-environment interaction problem with explicit observations, actions, costs, and constraints, and we use this formulation to connect common SDL pipelines to established AI principles. We review the main method families that enable closed-loop experimentation, including Bayesian optimization and active learning for sample-efficient experiment selection, planning and reinforcement learning for long-horizon protocol optimization, and tool-using agents that orchestrate heterogeneous instruments and software. We emphasize verifiable and provenance-aware policies that support debugging, reproducibility, and safe operation. We then propose a capability-driven taxonomy that organizes systems by decision horizon, uncertainty modeling, action parameterization, constraint handling, failure recovery, and human involvement. To enable meaningful comparison, we synthesize benchmark task templates and evaluation metrics that prioritize cost-aware performance, robustness to drift, constraint violation behavior, and reproducibility. Finally, we distill lessons from deployed SDLs and outline open challenges in multi-modal representation, calibrated uncertainty, safe exploration, and shared benchmark infrastructure.
中文摘要 自驱实验室（SDL）在实验设计、自动化执行和数据驱动决策之间闭合了环路，为代理人工智能提供了在昂贵作、噪声和延迟反馈、严格可行性和安全性约束以及非平稳性下的高强度测试平台。这项调查以软物质为代表性环境，但重点关注真实实验室中出现的人工智能问题。我们将SDL自主性框架为一个代理环境交互问题，明确观察、动作、成本和约束，并利用这一表述将常见的SDL流水线与既定的AI原则连接起来。我们回顾了实现闭环实验的主要方法家族，包括贝叶斯优化和主动学习以高效样本选择实验，规划与强化学习用于长期方案优化，以及使用工具协调异构仪器和软件的工具。我们强调可验证和来源意识的政策，支持调试、可重复性和安全运行。随后，我们提出了一种能力驱动的分类法，按决策视野、不确定性建模、动作参数化、约束处理、故障恢复和人工参与来组织系统。为了实现有意义的比较，我们综合了基准任务模板和评估指标，优先考虑成本感知的性能、稳健性、约束违规行为和可重复性。最后，我们总结了部署SDL的经验教训，概述了多模态表示、校准不确定性、安全探索和共享基准基础设施等方面面临的挑战。

SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets

SD-E$^2$：代币预算推理的语义探索

Authors: Kshitij Mishra, Nils Lukas, Salem Lahlou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.17982
Pdf link: https://arxiv.org/pdf/2601.17982
Abstract Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective formulation that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve MedMCQA to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation, that is, adjusting the structure of the reasoning process rather than per-token computation, SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
中文摘要 小型语言模型（SLMs）在复杂推理方面遇到困难，因为在有限的计算预算下探索成本高昂。我们介绍了语义多样性-探索-利用（SD-E$^2$），一种强化学习框架，通过优化生成推理轨迹中的语义多样性，使探索变得显式化。使用冻结句子嵌入模型，SD-E$^2$ 分配了一个多样性奖励，该奖励捕捉了（i）语义上不同的解法策略的覆盖度，以及（ii）它们在嵌入空间中的平均两两差异，而非表面形式的新颖性。这种多样性奖励与结果正确性和解法效率相结合，形成一个z分数归一化的多目标，从而稳定训练。在GSM8K上，SD-E$^2$分别比基础Qwen2.5-3B-Instruct和强GRPO基线（GRPO-CFL和GRPO-CFEE）高出+27.4、+5.2和+1.5个百分点，平均每题发现9.8个语义上不同的策略。我们进一步将MedMCQA提升至49.64%，而基础模型为38.37%，并在更难的AIME基准（1983-2025）上显示出提升，达到13.28%，而基础模型为6.74%。这些结果表明，奖励语义新颖性能为训练具备推理能力的 SLM 提供更高效的探索-利用信号。通过引入认知适应——调整推理过程结构，而非按每个代币计算——SD-E$^2$为资源受限模型中的效率提升提供了互补的途径。
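
The two diversity components named in this abstract (coverage of distinct strategies and mean pairwise dissimilarity) and the z-score combination can be sketched as follows. The greedy similarity threshold used to count distinct strategies, the weights, and all function names are assumptions; the embeddings would come from a frozen sentence-embedding model, replaced here by toy vectors.

```python
import numpy as np

def diversity_reward(embeddings, sim_threshold=0.8):
    """Return (coverage, mean pairwise dissimilarity) for a group of solutions."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E @ E.T  # cosine similarity matrix
    # Greedy grouping: a solution counts as a new strategy if it is not
    # close to any previously kept representative.
    kept = []
    for i in range(len(E)):
        if all(sim[i, j] < sim_threshold for j in kept):
            kept.append(i)
    coverage = len(kept) / len(E)
    upper = np.triu_indices(len(E), k=1)
    dissimilarity = float(np.mean(1.0 - sim[upper])) if len(E) > 1 else 0.0
    return coverage, dissimilarity

def zscore(values):
    """Standardize within the group; constant groups map to zeros."""
    x = np.asarray(values, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def combined_reward(correct, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of z-scored reward components (weights are illustrative)."""
    parts = [zscore(correct), zscore(efficiency), zscore(diversity)]
    return sum(w * p for w, p in zip(weights, parts))

coverage, dissimilarity = diversity_reward([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
group_reward = combined_reward([1, 0], [1, 1], [0, 0])
```

Standardizing each component within a sampled group keeps any single reward term from dominating because of its scale, which is one plausible reading of why the normalization stabilizes training.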

Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions

超越静态数据集：通过经过审核的合成转移实现稳健的离线策略优化

Authors: Pedram Agand, Mo Chen
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.18107
Pdf link: https://arxiv.org/pdf/2601.18107
Abstract Offline Reinforcement Learning (ORL) holds immense promise for safety-critical domains like industrial robotics, where real-time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates high degrees of conservatism that can restrain potential policy improvements. We present MoReBRAC, a model-based framework that addresses this limitation through uncertainty-aware latent synthesis. Instead of relying solely on the fixed data, MoReBRAC utilizes a dual-recurrent world model to synthesize high-fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi-layered filtering process guarantees that only transitions residing within high-confidence regions of the learned dynamics are utilized. Our results on D4RL Gym-MuJoCo benchmarks reveal significant performance gains, particularly in "random" and "suboptimal" data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade-offs encountered when learning from near-optimal datasets.
中文摘要 离线强化学习（ORL）在工业机器人等安全关键领域具有巨大潜力，因为实时环境交互往往难以实现。ORL的主要障碍仍是静态数据集与已学习政策之间的分布转移，后者通常要求高度保守，从而限制潜在的政策改进。我们介绍了MoReBRAC，一个基于模型的框架，通过不确定性感知的潜在综合来解决这一局限性。MoReBRAC不再仅依赖固定数据，而是利用双重循环世界模型综合高保真度的过渡，以增强训练流形。为确保这些合成数据的可靠性，我们实施了一个层次不确定性流水线，整合变分自编码器（VAE）流形检测、模型敏感度分析和蒙特卡洛（MC）脱离。这一多层过滤过程确保只使用位于学习动力学高置信区内的转移。我们在D4RL Gym-MuJoCo基准测试中显示了显著的性能提升，尤其是在“随机”和“次优”数据区间。我们还进一步阐述了VAE作为几何锚点的作用，并讨论了从近优数据集学习时所遇到的分布权衡。
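
Of the three filtering stages named in this abstract, the MC dropout stage is the easiest to illustrate in isolation. The sketch below applies random dropout masks to a toy linear dynamics model and keeps only the inputs whose predictions are stable across masks. The linear model, the `max_std` threshold, and the function names are assumptions; the VAE and sensitivity stages are not shown.

```python
import numpy as np

def mc_dropout_predict(weights, inputs, n_samples=50, drop_prob=0.2, seed=0):
    """Run repeated stochastic forward passes and return mean and std of predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        mask = rng.random(weights.shape) >= drop_prob  # keep each weight with prob 1 - p
        preds.append(inputs @ (weights * mask) / (1.0 - drop_prob))
    preds = np.stack(preds)  # shape: (n_samples, n_inputs, n_outputs)
    return preds.mean(axis=0), preds.std(axis=0)

def filter_transitions(inputs, std, max_std):
    """Keep only inputs whose largest per-output predictive std is within max_std."""
    keep = std.max(axis=1) <= max_std
    return inputs[keep], keep

weights = np.ones((2, 1))                     # toy linear "world model"
inputs = np.array([[0.0, 0.0], [10.0, 10.0]])  # second input is far from the origin
mean, std = mc_dropout_predict(weights, inputs)
trusted, keep = filter_transitions(inputs, std, max_std=0.5)
```

In this toy model the zero input yields identical predictions under every mask, while the large input produces widely varying predictions and is filtered out.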

Enhance the Safety in Reinforcement Learning by ADRC Lagrangian Methods

通过ADRC拉格朗日方法提升强化学习的安全性

Authors: Mingxu Zhang, Huicheng Zhang, Jiaming Ji, Yaodong Yang, Ying Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18142
Pdf link: https://arxiv.org/pdf/2601.18142
Abstract Safe reinforcement learning (Safe RL) seeks to maximize rewards while satisfying safety constraints, typically addressed through Lagrangian-based methods. However, existing approaches, including PID and classical Lagrangian methods, suffer from oscillations and frequent safety violations due to parameter sensitivity and inherent phase lag. To address these limitations, we propose ADRC-Lagrangian methods that leverage Active Disturbance Rejection Control (ADRC) for enhanced robustness and reduced oscillations. Our unified framework encompasses classical and PID Lagrangian methods as special cases while significantly improving safety performance. Extensive experiments demonstrate that our approach reduces safety violations by up to 74%, constraint violation magnitudes by 89%, and average costs by 67%, establishing superior effectiveness for Safe RL in complex environments.
中文摘要 安全强化学习（Safe RL）旨在最大化奖励，同时满足安全约束，通常通过基于拉格朗日的方法来解决。然而，现有方法，包括PID和经典拉格朗日方法，由于参数敏感性和固有的相位滞后，存在振荡和频繁的安全违规。为解决这些局限性，我们提出了利用主动扰动抑制控制（ADRC）以增强鲁棒性和减少振荡的ADRC拉格朗日方法。我们的统一框架将经典方法和PID拉格朗日方法作为特例结合，显著提升了安全性表现。大量实验表明，我们的方法可将安全违规减少多达74%，约束违规幅度降低89%，平均成本降低67%，证明安全强化学习在复杂环境中的卓越效果。
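
For context on the baselines this abstract mentions, the sketch below shows a Lagrange multiplier driven by a PID controller on the constraint violation; with `kp = kd = 0` it reduces to the classical integral-only Lagrangian update. The ADRC variant proposed in the paper is not shown, and the gains, cost limit, and class name here are illustrative assumptions.

```python
class PIDLagrangeMultiplier:
    """Lagrange multiplier updated by a PID controller on constraint violation."""

    def __init__(self, kp=0.0, ki=0.1, kd=0.0, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, episode_cost):
        error = episode_cost - self.cost_limit  # positive when the constraint is violated
        # Integral term, projected to stay non-negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost; zero on the first call.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, episode_cost - self.prev_cost)
        self.prev_cost = episode_cost
        # The multiplier itself must be non-negative.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)

multiplier = PIDLagrangeMultiplier(ki=0.1, cost_limit=25.0)
lam_after_violation = multiplier.update(35.0)  # cost exceeds the limit by 10
lam_after_recovery = multiplier.update(15.0)   # cost falls below the limit by 10
```

The policy would then maximize reward minus the multiplier times cost. The integral-only form responds to violations with a lag, which is the phase-lag issue the abstract attributes to existing methods.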

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

FP8-RL：用于大型语言模型强化学习的实用且稳定的低精度堆栈

Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18150
Pdf link: https://arxiv.org/pdf/2601.18150
Abstract Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
中文摘要 大型语言模型（LLM）的强化学习（RL）正日益受到“推广”（生成）的瓶颈，即输出序列长度较长，注意力和KV缓存内存主导了端到端的步骤时间。FP8通过降低推销期间的计算成本和内存流量，为加速RL提供了有吸引力的杠杆，但在强化学习中应用FP8则带来了独特的工程和算法挑战：策略权重每一步都会变化（需要反复量化和权重同步到推理引擎中），低精度的部署可能偏离训练器假设的高精度策略。导致列推断不匹配和潜在的不稳定性。本报告提出了一个实用的FP8 LLM RL推广栈，该栈在veRL生态系统中实现，支持通用训练后端（如FSDP/Megatron-LM）和推理引擎（如vLLM/SGLang）。我们（i）通过分块FP8量化实现FP8 W8A8线性层的展开，（ii）将FP8扩展到KV缓存，通过每步QKV尺度重新校准消除长上下文内存瓶颈，（iii）通过基于重要性抽样的滚动修正（token级别的TIS/MIS变体）减少不匹配。在密集模型和MoE模型中，这些技术在保持与BF16基线相当的学习行为的同时，可实现高达44%的推广吞吐量提升。
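
The mismatch correction in point (iii) of this abstract can be illustrated with token-level truncated importance sampling: each token's advantage is reweighted by the ratio of trainer probability to rollout probability, capped at a constant. This is a NumPy sketch under stated assumptions; the cap value and function names are hypothetical, and in an actual training loop the weights would be treated as constants with no gradient flowing through them.

```python
import numpy as np

def truncated_is_weights(logp_train, logp_rollout, cap=2.0):
    """Per-token importance ratios pi_train / pi_rollout, truncated at cap."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_rollout))
    return np.minimum(ratio, cap)

def corrected_advantages(advantages, logp_train, logp_rollout, cap=2.0):
    """Reweight token-level advantages to account for low-precision rollouts."""
    return truncated_is_weights(logp_train, logp_rollout, cap) * np.asarray(advantages)

# Token 0: trainer and FP8 rollout agree. Token 1: trainer assigns twice the probability.
logp_train = np.log([0.5, 0.5])
logp_rollout = np.log([0.5, 0.25])
weights = truncated_is_weights(logp_train, logp_rollout, cap=1.5)
adjusted = corrected_advantages([1.0, 1.0], logp_train, logp_rollout, cap=1.5)
```

Truncation bounds the variance introduced by tokens where the low-precision rollout policy and the higher-precision trainer policy disagree strongly.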

QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding

QualiRAG：可视化质量理解的检索增强生成

Authors: Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Dandan Zhu, Guangtao Zhai, Xiongkuo Min
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.18195
Pdf link: https://arxiv.org/pdf/2601.18195
Abstract Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding -- a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at this https URL.
中文摘要 视觉质量评估（VQA）正日益从标量分数预测转向可解读的质量理解——这一范式要求 \textit（细粒度时空感知）和 \textit（辅助上下文信息）。当前方法依赖于对策划指令数据集进行监督微调或强化学习，这些集成为劳动密集型的注释，且容易存在数据集特有的偏见。为应对这些挑战，我们提出了 \textbf{QualiRAG}，这是一个 \textit{training-free} \textbf{R}etrieval-\textbf{A}增强版 \textbf{G}eneration \textbf{（RAG）} 框架，系统地利用大型多模态模型（LMM）的潜在感知知识来实现视觉质量感知。与从静态语料库检索的传统RAG不同，QualiRAG通过将问题分解为结构化请求并构建四个互补的知识源，动态生成辅助知识：\textit{视觉元数据}、\textit{主题本地化}、\textit{全球质量摘要}和\textit{局部质量描述}，随后进行相关性感知检索以支持证据基础推理。大量实验表明，QualiRAG在视觉质量理解任务上相比开源通用LMM和VQA微调LMM实现了显著改进，并在视觉质量比较任务中表现出竞争力，无需任何任务特定培训即可展现出稳健的质量评估能力。代码将在此 https URL 公开。

PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

PaperSearchQA：学习在科学论文中寻找和推理，使用 RLVR

Authors: James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.18207
Pdf link: https://arxiv.org/pdf/2601.18207
Abstract Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers -- this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on this https URL. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
中文摘要 搜索代理是语言模型（LM），用于推理和搜索知识库（或网络）以回答问题;最新方法仅通过可验证奖励的强化学习（RLVR）来监督最终答案的准确性。大多数RLVR搜索代理主要处理通用领域的质量保证，这限制了它们在科学、工程和医学领域的技术人工智能系统中的适用性。在本研究中，我们提出训练代理进行科学论文搜索和推理——这测试了技术性质疑问答，直接关联真实科学家，且这些能力对未来的人工智能科学家系统至关重要。具体来说，我们发布了一个包含1600万份生物医学论文摘要的检索语料库，并构建了一个具有挑战性的事实类QA数据集PaperSearchQA，包含6万个样本可从语料库中回答，并附有基准测试。我们在此环境中训练搜索代理，使其表现优于非强化学习的基线;我们还进行了进一步的定量分析，观察到智能体的有趣行为，如规划、推理和自我验证。我们的语料库、数据集和基准测试可用于流行的 Search-R1 代码库用于 RLVR 训练，并发布在这个 https URL。最后，我们的数据创建方法具有可扩展性，且易于扩展到其他科学领域。
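
Supervising "only the final answer accuracy" with a verifiable reward is commonly done with a normalized exact-match check for factoid QA. The sketch below shows one such reward; the normalization rules and function names are assumptions and may differ from what the paper or the Search-R1 codebase actually uses.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_reward(prediction, gold_answers):
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(gold) for gold in gold_answers) else 0.0

reward_hit = exact_match_reward("The BRCA1 gene.", ["BRCA1 gene"])
reward_miss = exact_match_reward("TP53", ["BRCA1 gene"])
```

Because the reward depends only on the final answer string, the agent is free to learn its own search and reasoning behavior along the way.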

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

支付更少的泛化税：对大型语言模型代理进行强化学习训练的跨域推广研究

Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18217
Pdf link: https://arxiv.org/pdf/2601.18217
Abstract Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information for the agent to process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, the simple grid-world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can already effectively improve cross-domain robustness. We propose a randomization technique, which is low-overhead and broadly applicable: add small amounts of distractive goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid-training datamix; and (b) turning on step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
中文摘要 通用LLM代理通常在有限的环境中进行后期训练，但部署于更广泛且未被察觉的领域。在本研究中，我们探讨了当最终测试领域未知时，能动后训练所面临的挑战。具体来说，我们分析了强化学习（RL）环境和建模选择中哪些属性对域外性能影响最大。首先，我们确定了两个与跨域泛化高度相关的环境轴：（i）状态信息丰富度，即智能体从状态中处理的信息量，以及（ii）规划复杂度，通过目标可达性和基础策略下的轨迹长度估计。值得注意的是，领域真实性和文本层面相似性并非主要因素;例如，简单的网格世界域Sokoban在SciWorld中比更现实的ALFWorld更强的推广。基于这些发现，我们进一步证明，仅提升状态信息丰富度即可有效提升跨域鲁棒性。我们提出了一种低开销且广泛适用的随机化技术：在不改变任务的情况下，向状态添加少量分散注意力且与目标无关的特征，使其更丰富。除了环境特性外，我们还考察了若干建模选择：（a） SFT预热或训练中途的强化有助于防止强化学习中的灾难性遗忘，但会削弱对未包含在中训练数据混合中的领域推广;以及（b）在强化学习中启用逐步思考，虽然不一定总能提升领域内表现，但在保持泛化性方面起着关键作用。
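
The randomization idea in the abstract above (adding small amounts of goal-irrelevant features to the state without altering the task) can be illustrated with a minimal sketch. The distractor pool, the function name `add_distractors`, and the text-state format are assumptions made for illustration; they are not taken from the paper.

```python
import random

# Hypothetical pool of goal-irrelevant sentences (an assumption, not from the paper).
DISTRACTORS = [
    "A faded poster hangs on the wall.",
    "You hear a clock ticking somewhere.",
    "There is a scratch on the floor.",
    "A draft moves the curtain slightly.",
]

def add_distractors(state_text: str, k: int, rng: random.Random) -> str:
    """Append k randomly chosen goal-irrelevant sentences to a textual state.

    The original state text is kept unchanged, so the task itself is not altered.
    """
    extras = rng.sample(DISTRACTORS, k)
    return state_text + " " + " ".join(extras)

rng = random.Random(0)
state = "You are in the kitchen. The box is next to the target."
rich_state = add_distractors(state, 2, rng)
print(rich_state)
```

Resampling the distractors at every step or episode would give the agent a richer state to filter, while the goal-relevant part stays identical.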

ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants

ShopSimulator：评估与探索基于强化学习的购物助理大型语言模型代理

Authors: Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Kezhong Yan, Ken Deng, Qi Liu, Shuaibing Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, Bo Zheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18225
Pdf link: https://arxiv.org/pdf/2601.18225
Abstract Large language model (LLM)-based agents are increasingly deployed in e-commerce shopping. To perform thorough, user-tailored product searches, agents should interpret personal preferences, engage in multi-turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large-scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best-performing models achieve less than 40% full-success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine-tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at this https URL.
中文摘要 基于大型语言模型（LLM）的代理正越来越多地被应用于电子商务购物。为了进行彻底且用户定制的产品搜索，代理应解读个人偏好，进行多轮对话，最终检索并区分高度相似的产品。然而，现有研究尚未提供一个统一的仿真环境，能够始终涵盖所有这些方面，且始终专注于评估基准，缺乏培训支持。本文介绍了ShopSimulator，一个大规模且具有挑战性的中国购物环境。利用ShopSimulator，我们评估了各种场景下的大型语言模型，发现即使是表现最好的模型，完全成功率也低于40%。错误分析显示，代理在长期深度搜索和产品选择方面存在困难，难以平衡个性化线索的使用，也难以有效与用户互动。进一步的培训探索为克服这些弱点提供了实用指导，监督微调（SFT）和强化学习（RL）相结合，带来了显著的性能提升。代码和数据将在此 https URL 发布。

Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue

在以同理心说话前先反思两次：自省的交替推断，用于同理心意识的端到端口语对话

Authors: Yuhang Jia, Pei Liu, Haoqin Sun, Jiaming Zhou, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Yong Qin
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2601.18281
Pdf link: https://arxiv.org/pdf/2601.18281
Abstract End-to-end Spoken Language Models (SLMs) hold great potential for paralinguistic perception, and numerous studies have aimed to enhance their capabilities, particularly for empathetic dialogue. However, current approaches largely depend on rigid supervised signals, such as ground-truth response in supervised fine-tuning or preference scores in reinforcement learning. Such reliance is fundamentally limited for modeling complex empathy, as there is no single "correct" response and a simple numerical score cannot fully capture the nuances of emotional expression or the appropriateness of empathetic behavior. To address these limitations, we sequentially introduce EmpathyEval, a descriptive natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. Building upon EmpathyEval, we propose ReEmpathy, an end-to-end SLM that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning. Extensive experiments demonstrate that ReEmpathy substantially improves empathy-sensitive spoken dialogue by enabling reflective reasoning, offering a promising approach toward more emotionally intelligent and empathy-aware human-computer interactions.
中文摘要 端到端口语语言模型（SLMs）在副语言感知方面具有巨大潜力，许多研究旨在提升其能力，特别是在同理心对话方面。然而，目前的方法主要依赖于严格的监督信号，比如监督微调中的地面真实响应或强化学习中的偏好评分。这种依赖在复杂共情建模上本质上有限，因为没有单一的“正确”反应，简单的数值评分无法完全捕捉情感表达的细微差别或同理行为的适当性。为解决这些局限性，我们依次引入了EmpathyEval，一种基于自然语言的描述性评估模型，用于评估口语对话中的同理心质量。基于EmpathyEval，我们提出了ReEmpathy，这是一种端到端SLM，通过一种新的同理心自我反思交替推理机制，增强同理心对话，将口头反应生成与自由形式的、与同理心相关的反思推理交错结合。大量实验表明，再共情通过实现反思推理，显著改善了同理心敏感的口头对话，为更具情感智能和同理心意识的人机互动提供了有前景的途径。

VissimRL: A Multi-Agent Reinforcement Learning Framework for Traffic Signal Control Based on Vissim

VissimRL：基于Vissim的多智能体强化学习框架，用于交通信号控制

Authors: Hsiao-Chuan Chang, Sheng-You Huang, Yen-Chi Chen, I-Chen Wu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.18284
Pdf link: https://arxiv.org/pdf/2601.18284
Abstract Traffic congestion remains a major challenge for urban transportation, leading to significant economic and environmental impacts. Traffic Signal Control (TSC) is one of the key measures to mitigate congestion, and recent studies have increasingly applied Reinforcement Learning (RL) for its adaptive capabilities. With respect to SUMO and CityFlow, the simulator Vissim offers high-fidelity driver behavior modeling and wide industrial adoption but remains underutilized in RL research due to its complex interface and lack of standardized frameworks. To address this gap, this paper proposes VissimRL, a modular RL framework for TSC that encapsulates Vissim's COM interface through a high-level Python API, offering standardized environments for both single- and multi-agent training. Experiments show that VissimRL significantly reduces development effort while maintaining runtime efficiency, and supports consistent improvements in traffic performance during training, as well as emergent coordination in multi-agent control. Overall, VissimRL demonstrates the feasibility of applying RL in high-fidelity simulations and serves as a bridge between academic research and practical applications in intelligent traffic signal control.
中文摘要 交通拥堵依然是城市交通面临的主要挑战，带来了重大的经济和环境影响。交通信号控制（TSC）是缓解拥堵的关键措施之一，近期研究越来越多地将强化学习（RL）应用于其自适应能力。相较于SUMO和CityFlow，模拟器Vissim提供了高精度的驾驶员行为建模和广泛的工业采纳，但由于其复杂界面和缺乏标准化框架，在强化学习研究中仍然未被充分利用。为弥补这一空白，本文提出了VissimRL，一个模块化的TSC强化学习框架，通过高级Python API封装Vissim的COM接口，提供单代理和多代理训练的标准化环境。实验显示，VissimRL在保持运行时效率的同时显著减少开发工作量，并支持训练期间持续提升流量性能，同时实现多代理控制中的突发协调。总体而言，VissimRL展示了在高保真模拟中应用强化学习的可行性，并作为学术研究与智能交通信号控制实际应用之间的桥梁。

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

TriPlay-RL：三角色自我游戏强化学习，用于LLM安全对齐

Authors: Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18292
Pdf link: https://arxiv.org/pdf/2601.18292
Abstract In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
中文摘要 近年来，大型语言模型相关的安全风险日益突出，凸显了减少有毒和有害内容生成的紧迫性。主流的 LLM 安全对齐范式通常采用协作框架，包含三个角色：攻击者（负责对抗性提示生成）、防御者（安全防御）以及评估响应者。本文提出了一种名为TriPlay-RL的闭环强化学习框架，实现三个角色间的迭代和协同改进协作，几乎无需手动注释。实验结果显示，攻击者保持了高输出多样性，同时对抗有效性提升了20%-50%;防御者在安全性能提升10%-30%的情况下，不降低整体推理能力;评估者通过迭代不断完善其细致判断能力，准确区分不安全的反应、简单的拒绝和有用的指导。总体而言，我们的框架建立了高效且可扩展的大型语言模型安全对齐范式，实现统一学习循环中的持续共进。

Reinforcement Learning with Distributed MPC for Fuel-Efficient Platoon Control with Discrete Gear Transitions

采用分布式MPC的强化学习，实现节能车队控制及离散档位切换

Authors: Samuel Mallick, Gianpietro Battocletti, Dimitris Boskos, Azita Dabiri, Bart De Schutter
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.18294
Pdf link: https://arxiv.org/pdf/2601.18294
Abstract Cooperative control of groups of autonomous vehicles (AVs), i.e., platoons, is a promising direction to improving the efficiency of autonomous transportation systems. In this context, distributed co-optimization of both vehicle speed and gear position can offer benefits for fuel-efficient driving. To this end, model predictive control (MPC) is a popular approach, optimizing the speed and gear-shift schedule while explicitly considering the vehicles' dynamics over a prediction window. However, optimization over both the vehicles' continuous dynamics and discrete gear positions is computationally intensive, and may require overly long sample times or high-end hardware for real-time implementation. This work proposes a reinforcement learning (RL)-based distributed MPC approach to address this issue. For each vehicle in the platoon, a policy is trained to select and fix the gear positions across the prediction window of a local MPC controller, leaving a significantly simpler continuous optimization problem to be solved as part of a distributed MPC scheme. In order to reduce the computational cost of training and facilitate the scalability of the proposed approach to large platoons, the policies are parameterized such that the emergent multi-agent RL problem can be decoupled into single-agent learning tasks. In addition, a recurrent neural-network (RNN) architecture is proposed for the gear selection policy, such that the learning is scalable even as the number of possible gear-shift schedules grows exponentially with the MPC prediction horizon. In highway-driving simulations, the proposed approach is shown to have a significantly lower computation burden and a comparable performance in terms of fuel-efficient platoon control, with respect to pure MPC-based co-optimization.
中文摘要 对自动驾驶车辆（AV）群体（即排）的协作控制，是提升自动驾驶系统效率的有前景方向。在此背景下，车速和档位的分布式协同优化有助于节能驾驶。为此，模型预测控制（MPC）是一种流行的方法，它在明确考虑车辆在预测窗口内的动态变化的同时，优化速度和换挡时间表。然而，对车辆连续动力学和离散齿轮位置的优化计算量大，可能需要过长的采样时间或高端硬件实现。本研究提出了一种基于强化学习（RL）的分布式MPC方法来解决这一问题。对于排中的每辆车辆，都会训练一套策略，在本地MPC控制器的预测窗口内选择并固定齿轮位置，从而在分布式MPC方案中解决一个显著简化的连续优化问题。为了降低训练的计算成本并促进拟议方法对大型排的可扩展性，策略参数化使得涌现的多智能体强化学习问题可以解耦为单智能体学习任务。此外，还提出了一种循环神经网络（RNN）架构用于换挡选择策略，使得即使随着MPC预测视野的换挡时间表呈指数级增长，学习过程仍是可扩展的。在高速公路驾驶模拟中，所提方法被证明在计算负担显著较低，且在节能排控方面表现相当于纯MPC协同优化。
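
The abstract above notes that the number of possible gear-shift schedules grows exponentially with the MPC prediction horizon. A small sketch of that arithmetic follows, under the simplifying assumption that any gear may be chosen at every step; the function names and the 6-gear example are illustrative, and real gearboxes would restrict some transitions.

```python
def num_gear_schedules(num_gears: int, horizon: int) -> int:
    """Count gear schedules when any gear may be chosen at each of `horizon` steps."""
    return num_gears ** horizon

def sequential_decisions(num_gears: int, horizon: int) -> int:
    """Count per-step choices when a policy picks one gear at a time instead."""
    return num_gears * horizon

for horizon in (5, 10):
    print(horizon, num_gear_schedules(6, horizon), sequential_decisions(6, horizon))
```

Choosing one gear per step, as a recurrent policy can, keeps the output size linear in the horizon, whereas enumerating whole schedules does not.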

Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

Temp-R1：通过逆向课程强化学习实现复杂时序KGQA的统一自主代理

Authors: Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18296
Pdf link: https://arxiv.org/pdf/2601.18296
Abstract Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed-source APIs, limiting flexibility and scalability. We propose Temp-R1, the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single-action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at this https URL.
中文摘要 时间知识图问答（TKGQA）本身具有挑战性，因为它需要对动态事实进行复杂的推理，涉及多跳依赖性和复杂的时间约束。现有方法依赖固定的工作流程和昂贵的闭源API，限制了灵活性和可扩展性。我们提出Temp-R1，这是首个通过强化学习训练的TKGQA自主端到端代理。为了应对单一行动推理中的认知过载，我们扩展了行动空间，同时引入了专门的内部行动和外部行动。为了防止简单问题上的捷径学习，我们引入了逆向课程学习，先训练难题，迫使他们发展复杂的推理能力，然后再转向更简单的案例。我们的8B参数Temp-R1在MultiTQ和TimelineKGQA上实现了最先进的表现，在复杂问题上比强基线提升了19.8%。我们的工作为自主时间推理代理建立了新的范式。我们的代码将很快通过该 https URL 公开。
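
The reverse curriculum described in the abstract above trains on difficult questions first. A minimal sketch of such an ordering is shown here; the difficulty scores and the function name `reverse_curriculum` are hypothetical, since the abstract does not say how difficulty is measured.

```python
from typing import List, Tuple

def reverse_curriculum(questions: List[Tuple[str, float]]) -> List[str]:
    """Order questions hardest first, given (question, difficulty) pairs."""
    ordered = sorted(questions, key=lambda item: item[1], reverse=True)
    return [question for question, _ in ordered]

# Hypothetical difficulty scores, for illustration only.
pool = [
    ("Who held office X in 2010?", 0.2),
    ("Who held office X right after the second term of Y ended?", 0.9),
    ("Who held office X before Y?", 0.5),
]
schedule = reverse_curriculum(pool)
print(schedule)
```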

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

AI代理用于逆向工程遗留有限差分代码并转换为Devito

Authors: Yinghan Hou, Zongyou Yang
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.18381
Pdf link: https://arxiv.org/pdf/2601.18381
Abstract To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.
中文摘要 为促进将遗留有限差分实现转化为Devito环境，本研究开发了一个集成的AI代理框架。检索增强生成（RAG）和开源大型语言模型通过多阶段迭代工作流程结合，采用系统的混合LangGraph架构。该代理通过文档解析、结构感知分割、实体关系提取以及基于莱顿的社区检测，构建了广泛的 Devito 知识图谱。GraphRAG 优化提升了包括地震波模拟、计算流体力学和性能调优库在内的语义社区的查询性能。逆向工程组件通过对 Fortran 源代码的静态分析，推导出三级查询策略用于 RAG 检索。为了提供语言模型指导所需的精确上下文信息，多阶段检索流水线进行并行搜索、概念扩展、社区尺度检索和语义相似度分析。代码综合受基于Pydantic约束的约束，以保证结构化输出和可靠性。一个全面的验证框架将传统静态分析与G评估方法相结合，涵盖执行正确性、结构稳健性、数学一致性和API合规性。整体代理工作流程基于LangGraph框架实现，并采用并发处理以支持基于质量的迭代精炼和状态感知的动态路由。其主要贡献在于引入由强化学习驱动的反馈机制，使静态代码转换向动态和自适应分析行为的过渡成为可能。

daVinci-Dev: Agent-native Mid-training for Software Engineering

daVinci-Dev：软件工程的代理原生中期培训

Authors: Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18418
Pdf link: https://arxiv.org/pdf/2601.18418
Abstract Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering-a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, agentic mid-training-mid-training (MT) on large-scale data that mirrors authentic agentic workflows-remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and training methodology for effective agent development at scale. Central to our approach is agent-native data-supervision comprising two complementary types of trajectories: contextually-native trajectories that preserve the complete information flow an agent experiences, offering broad coverage and diversity; and environmentally-native trajectories collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model's agentic capabilities on SWE-Bench Verified. We demonstrate our superiority over the previous open software engineering mid-training recipe Kimi-Dev under two post-training settings with an aligned base model and agentic scaffold, while using less than half mid-training tokens (73.1B). Besides relative advantage, our best performing 32B and 72B models achieve 56.1% and 58.5% resolution rates, respectively, which are ...
中文摘要 近年来，大型语言模型（LLM）能力的前沿已从单回合代码生成转向代理软件工程——一种模型自主导航、编辑和测试复杂代码库的范式。尽管后训练方法已成为代码代理的事实方法，但代理中期训练-中期训练（MT）在大规模数据上模拟真实代理工作流程——由于资源需求巨大，仍然严重未被充分探索，尽管它提供了比单纯依赖昂贵强化学习更可扩展的路径来培养基础代理行为。实现有效代理训练中训练的一个核心挑战是静态训练数据与动态、反馈丰富的真实开发环境之间的分布不匹配。为此，我们提出了一项系统性中期智能体培训研究，确立了大规模智能体开发的数据综合原则和训练方法论。我们方法的核心是智能体原生数据监管，包含两种互补的轨迹类型：上下文原生轨迹，保持智能体所经历的完整信息流，提供广泛的覆盖和多样性;以及环境原生轨迹，这些轨迹来自可执行仓库，观察数据源自实际工具调用和测试执行，提供深度和交互真实性。我们在“SWE-Bench Verified”上验证模型的代理能力。我们在两个训练后设置、基模型对齐且支架（agentic saffold）下，展示了我们相较于之前的开放软件工程中训练配方“Kimi-Dev”的优势，同时使用不到一半的训练中期令牌（73.1B）。除了相对优势外，我们表现最好的32B和72B模型分别实现了56.1%和58.5%的分辨率，分别是......

OffSeeker: Online Reinforcement Learning Is Not All You Need for Deep Research Agents

OffSeeker：在线强化学习并不是你深度研究代理所需的全部

Authors: Yuhang Zhou, Kai Zheng, Qiguang Chen, Mengkang Hu, Qingfeng Sun, Can Xu, Jingjing Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18467
Pdf link: https://arxiv.org/pdf/2601.18467
Abstract Deep research agents have shown remarkable potential in handling long-horizon tasks. However, state-of-the-art performance typically relies on online reinforcement learning (RL), which is financially expensive due to extensive API calls. While offline training offers a more efficient alternative, its progress is hindered by the scarcity of high-quality research trajectories. In this paper, we demonstrate that expensive online reinforcement learning is not all you need to build powerful research agents. To bridge this gap, we introduce a fully open-source suite designed for effective offline training. Our core contributions include DeepForge, a ready-to-use task synthesis framework that generates large-scale research queries without heavy preprocessing; and a curated collection of 66k QA pairs, 33k SFT trajectories, and 21k DPO pairs. Leveraging these resources, we train OffSeeker (8B), a model developed entirely offline. Extensive evaluations across six benchmarks show that OffSeeker not only leads among similar-sized agents but also remains competitive with 30B-parameter systems trained via heavy online RL.
中文摘要 深度研究代理在处理长期任务方面展现出显著潜力。然而，最先进的性能通常依赖在线强化学习（RL），由于大量API调用，这在经济上成本较高。虽然线下培训是更高效的替代方案，但其进展受限于高质量研究路径的稀缺。本文展示了昂贵的在线强化学习并非构建强大研究代理所需的全部。为了弥合这一差距，我们推出了一套完全开源的套件，专为有效的离线培训设计。我们的核心贡献包括DeepForge，一个即用的任务综合框架，无需繁重预处理即可生成大规模研究查询;以及一个精选的6.6万对QA对、3.3万SFT轨迹和2.1万DPO对的合集。利用这些资源，我们训练了OffSeeker（8B），这是一个完全离线开发的模型。针对六个基准测试的广泛评估显示，OffSeeker 不仅在同规模代理中领先，还能与通过大量在线强化学习训练的 30B 参数系统竞争。

Enhancing Control Policy Smoothness by Aligning Actions with Predictions from Preceding States

通过将动作与前一状态的预测对齐，提升控制策略的平滑性

Authors: Kyoleen Kwak, Hyoseok Hwang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18479
Pdf link: https://arxiv.org/pdf/2601.18479
Abstract Deep reinforcement learning has proven to be a powerful approach to solving control tasks, but its characteristic high-frequency oscillations make it difficult to apply in real-world environments. While prior methods have addressed action oscillations via architectural or loss-based methods, the latter typically depend on heuristic or synthetic definitions of state similarity to promote action consistency, which often fail to accurately reflect the underlying system dynamics. In this paper, we propose a novel loss-based method by introducing a transition-induced similar state. The transition-induced similar state is defined as the distribution of next states transitioned from the previous state. Since it utilizes only environmental feedback and actually collected data, it better captures system dynamics. Building upon this foundation, we introduce Action Smoothing by Aligning Actions with Predictions from Preceding States (ASAP), an action smoothing method that effectively mitigates action oscillations. ASAP enforces action smoothness by aligning the actions with those taken in transition-induced similar states and by penalizing second-order differences to suppress high-frequency oscillations. Experiments in Gymnasium and Isaac-Lab environments demonstrate that ASAP yields smoother control and improved policy performance over existing methods.
中文摘要 深度强化学习已被证明是解决控制任务的强大方法，但其特有的高频振荡使其难以在现实环境中应用。虽然以往方法通过架构或基于损耗的方法处理作用振荡，但后者通常依赖启发式或综合定义来促进动作一致性，而这些定义往往无法准确反映底层系统动态。本文提出一种基于损耗的新方法，通过引入跃迁诱导的类似态。跃迁诱导的类似态定义为从前一态跃迁的下一态分布。由于它仅利用环境反馈并实际收集数据，因此更能捕捉系统动态。在此基础上，我们引入了通过对齐动作与前置状态预测进行动作平滑（ASAP）的动作平滑方法，有效减轻了动作振荡。ASAP通过将动作与跃迁诱导的相似状态下的动作对齐，并惩罚二阶差值以抑制高频振荡，从而强制执行动作的平滑性。体育馆和艾萨克实验室环境下的实验表明，ASAP比现有方法更平稳地控制并提升策略性能。
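
The abstract above mentions penalizing second-order differences of actions to suppress high-frequency oscillations. A minimal sketch of such a penalty for a one-dimensional action sequence follows; the function name and the mean-squared form are assumptions, and the alignment term with transition-induced similar states is not reproduced here.

```python
from typing import Sequence

def second_order_penalty(actions: Sequence[float]) -> float:
    """Mean squared second-order difference a[t+1] - 2*a[t] + a[t-1]."""
    if len(actions) < 3:
        return 0.0  # No second-order difference exists for fewer than 3 actions.
    diffs = [
        actions[t + 1] - 2.0 * actions[t] + actions[t - 1]
        for t in range(1, len(actions) - 1)
    ]
    return sum(d * d for d in diffs) / len(diffs)

smooth = [0.0, 1.0, 2.0, 3.0]      # Constant slope: no curvature.
jittery = [1.0, -1.0, 1.0, -1.0]   # Sign flips every step.
print(second_order_penalty(smooth), second_order_penalty(jittery))
```

A linear ramp incurs no penalty, while an action that flips sign every step is penalized heavily, which is the behavior a smoothness regularizer of this kind is meant to have.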

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

即时强化学习：无梯度更新的LLM代理持续学习

Authors: Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18510
Pdf link: https://arxiv.org/pdf/2601.18510
Abstract While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at this https URL.
中文摘要 虽然大型语言模型（LLM）代理在通用任务中表现出色，但由于部署后权重冻结，它们在持续适应方面存在固有困难。传统强化学习（RL）提供了解决方案，但计算成本高昂且存在灾难性遗忘的风险。我们介绍了实时强化学习（JitRL），这是一个无需训练的框架，能够在测试时进行策略优化，无需梯度更新。JitRL保持动态的非参数记忆，并实时检索相关轨迹以估算行动优势。这些估计值随后被用来直接调制LLM的输出logits。我们理论上证明了该加法更新规则是KL约束策略优化目标的精确闭式解。WebArena和Jericho上的大量实验表明，JitRL在无培训方法中建立了一种全新的先进技术。关键是，JitRL的性能优于计算量高的微调方法（如WebRL），同时将成本降低30多倍，为持续学习代理提供了可扩展的路径。代码可在该 https URL 访问。
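
The additive logit update mentioned in the abstract above matches a standard result: maximizing expected advantage minus a KL penalty with weight beta against a base policy gives a policy proportional to the base probability times exp(advantage / beta), which is the same as adding advantage / beta to the base logits. The sketch below illustrates this under assumed names (`modulate_logits`, `beta`); how the advantages are estimated from memory is not shown and may differ from the paper.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def modulate_logits(logits: List[float], advantages: List[float], beta: float) -> List[float]:
    """Add advantage / beta to each logit (beta is an assumed KL weight)."""
    return [l + a / beta for l, a in zip(logits, advantages)]

base_logits = [0.0, 0.0, 0.0]      # Uniform base policy over three actions.
advantages = [1.0, 0.0, -1.0]      # Hypothetical advantage estimates.
new_probs = softmax(modulate_logits(base_logits, advantages, beta=1.0))
print(new_probs)
```

A larger `beta` keeps the updated policy closer to the base policy; a smaller one lets the advantage estimates dominate.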

From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation

从可验证点到奖励链：利用可验证的基于引用的奖励进行开放式生成的强化学习

Authors: Yuxin Jiang, Yufei Wang, Qiyuan Zhang, Xingshan Zeng, Liangyou Li, Jierun Chen, Chaofan Tao, Haoli Bai, Lifeng Shang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18533
Pdf link: https://arxiv.org/pdf/2601.18533
Abstract Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）通过检查最终可验证的答案（即可验证的点信号）成功完成推理任务（如数学和代码）。然而，将这一范式推广到开放式生成具有挑战性，因为没有明确的事实。依赖单点监督常导致低效和被黑客攻击。为解决这些问题，我们提出了基于可验证的基于引用的奖励（RLVRR）强化学习。RLVRR不检查最终答案，而是从高质量参考（即奖励链）中提取有序语言信号。具体来说，RLVRR将奖励分解为两个维度：内容，保留确定性的核心概念（如关键词），以及风格，通过基于LLM的验证评估对风格属性的遵循。通过这种方式，RLVRR结合了强化学习的探索优势与监督微调（SFT）的效率和可靠性。在10多个基准测试中，使用Qwen和Llama模型进行了大量实验，证实了我们方法的优势。RLVRR（1）远超用十倍数据和先进奖励模型训练的SFT，（2）统一了结构化推理和开放式生成的训练，（3）在保持输出多样性的同时更有效地泛化。这些结果确立了RLVRR作为一条原则性且高效的可验证强化学习路径，用于通用LLM对齐。我们在这个 https URL 上发布了代码和数据。
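
The content dimension described in the abstract above checks deterministic core concepts such as keywords taken from a reference. A minimal sketch of one possible keyword-coverage reward follows; the function name, the case-insensitive substring matching, and the example keywords are assumptions, and neither the ordering of the reward chain nor the LLM-based style check is modeled.

```python
from typing import List

def content_reward(response: str, keywords: List[str]) -> float:
    """Fraction of reference keywords that appear in the response (case-insensitive)."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

# Hypothetical keywords extracted from a high-quality reference answer.
reference_keywords = ["photosynthesis", "chlorophyll", "sunlight", "glucose"]
response = "Plants use sunlight and chlorophyll to make glucose."
print(content_reward(response, reference_keywords))
```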

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

GenAgent：通过智能多模态推理实现文本到图像生成的缩放

Authors: Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.18543
Pdf link: https://arxiv.org/pdf/2601.18543
Abstract We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at this https URL.
中文摘要 我们引入GenAgent，通过智能多模态模型统一视觉理解与生成。与面临高昂训练成本和理解生成权衡的统一模型不同，GenAgent通过代理框架将这些能力解耦：理解由多模态模型本身处理，生成则通过将图像生成模型视为可调用工具来实现。关键是，与现有受静态流水线限制的模块化系统不同，该设计支持自主多回合交互，代理生成涵盖推理、工具调用、判断和反思的多模态思维链，以迭代优化输出。我们采用两阶段训练策略：首先，冷启动，在监督下对高质量工具调用和反思数据进行微调，以引导代理行为;其次，端到端的能动强化学习结合了逐点奖励（最终图像质量）和成对奖励（反思准确度），并结合轨迹重采样以增强多回合探索。GenAgent显著提升了GenEval++（+23.6%）和WISE（+14%）的基础生成器（FLUX.1-dev）性能。除了性能提升，我们的框架还展现了三个关键特性：1）跨工具推广到能力各异的生成器，2）测试时间的扩展，且在交互轮次间持续改进;3）任务自适应推理，能够自动适应不同任务。我们的代码将在此 https URL 上发布。

K-Myriad: Jump-starting reinforcement learning with unsupervised parallel agents

K-Myriad：无监督并行代理的跳板强化学习

Authors: Vincenzo De Paola, Mirco Mutti, Riccardo Zamboni, Marcello Restelli
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18580
Pdf link: https://arxiv.org/pdf/2601.18580
Abstract Parallelization in Reinforcement Learning is typically employed to speed up the training of a single policy, where multiple workers collect experience from an identical sampling distribution. This common design limits the potential of parallelization by neglecting the advantages of diverse exploration strategies. We propose K-Myriad, a scalable and unsupervised method that maximizes the collective state entropy induced by a population of parallel policies. By cultivating a portfolio of specialized exploration strategies, K-Myriad provides a robust initialization for Reinforcement Learning, leading to both higher training efficiency and the discovery of heterogeneous solutions. Experiments on high-dimensional continuous control tasks, with large-scale parallelization, demonstrate that K-Myriad can learn a broad set of distinct policies, highlighting its effectiveness for collective exploration and paving the way towards novel parallelization strategies.
中文摘要 强化学习中的并行化通常用于加快单一策略的训练，即多位工作人员从相同的抽样分布中收集经验。这种通用设计通过忽视多样化探索策略的优势，限制了并行化的潜力。我们提出了K-Myriad方法，这是一种可扩展且无监督的方法，最大化由一组平行策略引起的集体状态熵。通过培养一系列专门的探索策略，K-Myriad为强化学习提供了稳健的初始化，既提高了训练效率，也发现了异构解。高维连续控制任务的实验，配合大规模并行化，表明K-Myriad能够学习一套广泛的独特策略，彰显其在集体探索中的有效性，并为新颖的并行化策略铺平道路。
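
The objective in the abstract above is the collective state entropy induced by a population of parallel policies. The sketch below illustrates the quantity in a toy discrete setting by pooling the states visited by all policies and computing the Shannon entropy of the pooled distribution. The discrete state labels and the function name are assumptions; the continuous, high-dimensional tasks in the paper would need a different estimator.

```python
import math
from collections import Counter
from typing import List

def collective_state_entropy(visits_per_policy: List[List[str]]) -> float:
    """Shannon entropy (in nats) of the states visited by all policies pooled together."""
    pooled = Counter()
    for visits in visits_per_policy:
        pooled.update(visits)
    total = sum(pooled.values())
    return -sum((c / total) * math.log(c / total) for c in pooled.values())

identical = [["s0", "s1"], ["s0", "s1"]]   # Both policies explore the same states.
diverse = [["s0", "s1"], ["s2", "s3"]]     # Policies specialize on different states.
print(collective_state_entropy(identical), collective_state_entropy(diverse))
```

Two policies that visit the same states add no collective entropy over a single policy, while policies that cover different regions raise it, which is the incentive for specialized exploration.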

From Classification to Ranking: Enhancing LLM Reasoning Capabilities for MBTI Personality Detection

从分类到排名：增强 LLM 推理能力以实现 MBTI 人格检测

Authors: Yuan Cao, Feixiang Liu, Xinyue Wang, Yihan Zhu, Hui Xu, Zheng Wang, Qiang Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18582
Pdf link: https://arxiv.org/pdf/2601.18582
Abstract Personality detection aims to measure an individual's corresponding personality traits through their social media posts. The advancements in Large Language Models (LLMs) offer novel perspectives for personality detection tasks. Existing approaches enhance personality trait analysis by leveraging LLMs to extract semantic information from textual posts as prompts, followed by training classifiers for categorization. However, accurately classifying personality traits remains challenging due to the inherent complexity of human personality and subtle inter-trait distinctions. Moreover, prompt-based methods often exhibit excessive dependency on expert-crafted knowledge without autonomous pattern-learning capacity. To address these limitations, we view personality detection as a ranking task rather than a classification and propose a corresponding reinforcement learning training paradigm. First, we employ supervised fine-tuning (SFT) to establish personality trait ranking capabilities while enforcing standardized output formats, creating a robust initialization. Subsequently, we introduce Group Relative Policy Optimization (GRPO) with a specialized ranking-based reward function. Unlike verification tasks with definitive solutions, personality assessment involves subjective interpretations and blurred boundaries between trait categories. Our reward function explicitly addresses this challenge by training LLMs to learn optimal answer rankings. Comprehensive experiments have demonstrated that our method achieves state-of-the-art performance across multiple personality detection benchmarks.
中文摘要 人格检测旨在通过社交媒体帖子测量个人对应的人格特质。大型语言模型（LLMs）的进展为人格检测任务带来了新颖的视角。现有方法通过利用大型语言模型（LLM）从文本帖子中提取语义信息作为提示，随后通过训练分类器进行分类，从而增强人格特质分析。然而，由于人类性格的复杂性和细微的特征间差异，准确分类人格特质仍然具有挑战性。此外，基于提示的方法往往过度依赖专家编写的知识，缺乏自主模式学习能力。为解决这些局限性，我们将人格检测视为排名任务而非分类，并提出了相应的强化学习训练范式。首先，我们采用监督微调（SFT）来建立人格特质排名能力，同时强制执行标准化输出格式，创建稳健的初始化。随后，我们引入了具有专门排名奖励函数的群体相对政策优化（GRPO）。与具有明确解决方案的验证任务不同，人格评估涉及主观解读和特质类别间界限模糊。我们的奖励函数明确解决了这一挑战，训练大型语言模型学习最优答案排名。全面的实验表明，我们的方法在多重人格检测基准测试中实现了最先进的性能。
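
The second stage described above pairs GRPO with a ranking-based reward. The abstract does not give the reward itself, so the sketch below assumes a reciprocal-rank reward (a hypothetical stand-in, as are the function names and the sample data) and applies the usual group-relative normalization, where each sampled answer is scored against the mean and standard deviation of its own group.

```python
import statistics

def reciprocal_rank_reward(ranking, gold_label):
    """Hypothetical reward: 1 / (position of the gold label), or 0 if it is missing."""
    for position, label in enumerate(ranking, start=1):
        if label == gold_label:
            return 1.0 / position
    return 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, as in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One post, four sampled rankings of candidate MBTI types (made-up data).
sampled_rankings = [
    ["INTJ", "INTP", "ENTJ"],
    ["INTP", "INTJ", "ENTJ"],
    ["ENTJ", "INTP", "INTJ"],
    ["ENFP", "ESFP", "ISFP"],
]
rewards = [reciprocal_rank_reward(r, "INTJ") for r in sampled_rankings]
advantages = group_relative_advantages(rewards)
```

A graded reward of this kind still gives partial credit when the gold type is ranked second or third, which is one way to reflect the blurred boundaries between traits that the abstract mentions.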

Learning long term climate-resilient transport adaptation pathways under direct and indirect flood impacts using reinforcement learning

利用强化学习学习直接和间接洪水影响下的长期气候韧性交通适应路径

Authors: Miguel Costa, Arthur Vandervoort, Carolin Schmidt, Morten W. Petersen, Martin Drews, Karyn Morrissey, Francisco C. Pereira
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18586
Pdf link: https://arxiv.org/pdf/2601.18586
Abstract Climate change is expected to intensify rainfall and other hazards, increasing disruptions in urban transportation systems. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep uncertainty, and complex cross-sector interactions. We propose a generic decision-support framework that couples an integrated assessment model (IAM) with reinforcement learning (RL) to learn adaptive, multi-decade investment pathways under uncertainty. The framework combines long-term climate projections (e.g., IPCC scenario pathways) with models that map projected extreme-weather drivers (e.g. rain) into hazard likelihoods (e.g. flooding), propagate hazards into urban infrastructure impacts (e.g. transport disruption), and value direct and indirect consequences for service performance and societal costs. Embedded in a reinforcement-learning loop, it learns adaptive climate adaptation policies that trade off investment and maintenance expenditures against avoided impacts. In collaboration with Copenhagen Municipality, we demonstrate the approach on pluvial flooding in the inner city for the horizon of 2024 to 2100. The learned strategies yield coordinated spatial-temporal pathways and improved robustness relative to conventional optimization baselines, namely inaction and random action, illustrating the framework's transferability to other hazards and cities.
中文摘要 气候变化预计将加剧降雨和其他危害，增加城市交通系统的扰乱。由于基础设施投资具有长期且连续性、深层次的不确定性以及复杂的跨部门互动，设计有效的适应策略具有挑战性。我们提出了一个通用的决策支持框架，将综合评估模型（IAM）与强化学习（RL）结合，以学习在不确定性下适应性的、数十年投资路径。该框架结合了长期气候预测（如IPCC情景路径）与模型，将预测的极端天气驱动因素（如降雨）映射为灾害概率（如洪水），将灾害传播到城市基础设施影响（如交通中断），并评估服务绩效和社会成本的直接和间接后果。它嵌入强化学习循环中，学习适应性气候适应政策，在投资和维护支出与避免影响之间进行权衡。我们与哥本哈根市政府合作，展示了2024年至2100年内市区积蓄洪水的处理方法。所学策略产生了协调的时空路径，并相较于传统优化基线（即无作用和随机作用）提升了鲁棒性，展示了该框架在其他灾害和城市中的可迁移性。

Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

深度强化学习中自然策略梯度的逆费舍尔矩阵秩-1近似

Authors: Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.18626
Pdf link: https://arxiv.org/pdf/2601.18626
Abstract Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to the full inverse FIM. We theoretically show that, under certain conditions, a rank-1 approximation to the inverse FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard actor-critic and trust-region baselines.
中文摘要 自然梯度因其快速收敛特性和协变权重更新，长期以来一直在深度强化学习中被研究。然而，计算自然梯度需要每次迭代都反演费舍尔信息矩阵（FIM），这在计算上是极大的限制。本文提出了一种高效且可扩展的自然政策优化技术，利用秩一近似实现完全反FIM。我们理论上证明，在某些条件下，逆FIM的秩-1近似收敛速度快于策略梯度，并且在某些情况下，样本复杂度与随机策略梯度方法相同。我们在多样化环境中进行基准测试，证明其性能优于标准的actor-critic和信任区域基线。
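
To see why a rank-1 structure makes the inverse cheap, here is a minimal sketch. It is not the paper's construction, which the abstract does not specify. It assumes a damped single-sample Fisher of the form F = damping * I + s s^T, where s is a score vector, and applies the Sherman-Morrison identity so that F^{-1} g costs only two dot products. The function name and the damping term are illustrative assumptions.

```python
def rank1_natural_gradient(grad, score, damping=1e-2):
    """Return (damping * I + score score^T)^{-1} grad via Sherman-Morrison.

    Assumption: the Fisher matrix is modeled as a damped rank-1 outer product.
    """
    dot_sg = sum(s * g for s, g in zip(score, grad))
    dot_ss = sum(s * s for s in score)
    coeff = dot_sg / (damping + dot_ss)
    return [(g - coeff * s) / damping for g, s in zip(grad, score)]

# Made-up gradient and score vectors for a 3-parameter policy.
grad = [1.0, 2.0, 3.0]
score = [0.5, -1.0, 2.0]
step = rank1_natural_gradient(grad, score, damping=0.1)
```

No matrix is ever formed or inverted, so the cost is linear in the number of parameters rather than cubic.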

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

AdaReasoner：用于迭代视觉推理的动态工具编排

Authors: Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, Yu Cheng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.18631
Pdf link: https://arxiv.org/pdf/2601.18631
Abstract When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
中文摘要 当人类面临超出自身能力范围的问题时，他们依赖工具，这为提升多模态大型语言模型（MLLMs）中的视觉推理提供了有前景的范式。因此，有效的推理依赖于知道使用哪些工具、何时调用它们，以及如何分多个步骤组合它们，即使面对新的工具或任务。我们介绍了 AdaReasoner，这是一系列多模态模型，将工具使用作为一种通用推理技能来学习，而非工具特定或显式监督的行为。AdaReasoner 的实现依赖于：（i）可扩展的数据整理流程，使模型接触长程、多步的工具交互；（ii）Tool-GRPO，一种基于最终任务成功来优化工具选择和排序的强化学习算法；以及（iii）一种动态调节工具使用的自适应学习机制。这些组成部分共同使模型能够从任务上下文和中间结果推断工具的效用，从而协调多种工具并泛化到未见过的工具。从实证角度看，AdaReasoner 表现出很强的工具适应性和泛化行为：它自主采用有益的工具，抑制无关工具，并根据任务需求调整工具使用频率，尽管从未为此接受过显式训练。这些能力转化为在高难度基准测试中的最先进性能，将7B基础模型平均提升24.9%，并在包括VSP和Jigsaw在内的多项任务中超越了GPT-5等强大的专有系统。

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

用于扩散采样的ART：一种面向时间步调度的强化学习方法

Authors: Yilie Huang, Wenpin Tang, Xunyu Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.18681
Pdf link: https://arxiv.org/pdf/2601.18681
Abstract We consider time discretization for score-based diffusion models to generate samples from a learned reverse-time dynamic on a finite grid. Uniform and hand-crafted grids can be suboptimal given a budget on the number of time steps. We introduce Adaptive Reparameterized Time (ART) that controls the clock speed of a reparameterized time variable, leading to a time change and uneven timesteps along the sampling trajectory while preserving the terminal time. The objective is to minimize the aggregate error arising from the discretized Euler scheme. We derive a randomized control companion, ART-RL, and formulate time change as a continuous-time reinforcement learning (RL) problem with Gaussian policies. We then prove that solving ART-RL recovers the optimal ART schedule, which in turn enables practical actor-critic updates to learn the latter in a data-driven way. Empirically, based on the official EDM pipeline, ART-RL improves Fréchet Inception Distance on CIFAR-10 over a wide range of budgets and transfers to AFHQv2, FFHQ, and ImageNet without retraining.
中文摘要 我们考虑基于分数的扩散模型的时间离散化，以从有限网格上学习的反时间动态生成样本。在时间步数有限的情况下，统一且手工制作的网格可能不够理想。我们引入了自适应重参数化时间（ART），它控制重新参数化时间变量的时钟速度，从而在保持终端时间的同时，实现采样轨迹上的时间变化和不均匀的时间步。其目标是最小化离散化欧拉方案产生的总误差。我们推导出随机对照伴随ART-RL，并将时间变化表述为一个带有高斯策略的连续时间强化学习（RL）问题。随后我们证明，解ART-RL可以恢复最优ART调度，从而使实际的actor-critic更新能够以数据驱动的方式学习后者。基于官方EDM流水线，ART-RL在CIFAR-10上改进了Fréchet起始距离，涵盖了广泛的预算范围，并可转发至AFHQv2、FFHQ和ImageNet，无需重新训练。
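
The idea of a clock speed that produces uneven timesteps while preserving the terminal time can be illustrated with a small sketch. This is not the learned ART schedule. It assumes the clock speeds are simply given as positive numbers, one per step (in the paper they come from a control policy), and rescales them so that the steps sum to the terminal time. The function name and the example speeds are hypothetical.

```python
def reparameterized_grid(speeds, terminal_time):
    """Turn positive per-step clock speeds into a non-uniform grid on [0, terminal_time]."""
    if any(s <= 0 for s in speeds):
        raise ValueError("clock speeds must be positive")
    total = sum(speeds)
    grid = [0.0]
    for s in speeds:
        # Each step is proportional to its clock speed.
        grid.append(grid[-1] + terminal_time * s / total)
    grid[-1] = terminal_time  # remove round-off so the terminal time is exact
    return grid

# Four steps whose sizes double each time (made-up speeds).
grid = reparameterized_grid([1.0, 2.0, 4.0, 8.0], terminal_time=1.0)
```

With equal speeds the function returns the uniform grid, so a uniform schedule is the special case of a constant clock.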

Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

Health-SCORE：迈向用于改进健康领域大语言模型的可扩展评分标准

Authors: Zhichao Yang, Sepehr Janghorbani, Dongxu Zhang, Jun Han, Qian Qian, Andrew Ressler II, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18706
Pdf link: https://arxiv.org/pdf/2601.18706
Abstract Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality and domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.
中文摘要 评分标准对于评估开放式LLM回答至关重要，尤其是在医疗保健等安全关键领域。然而，创建高质量且领域特定的评分标准通常需要大量人力专业知识、时间和开发成本，这使得基于评分标准的评估和培训难以扩展。本研究介绍了Health-SCORE，一种可推广且可扩展的基于评分标准的培训与评估框架，显著降低评分标准开发成本而不牺牲绩效。我们展示了Health-SCORE除了单独评估外，还提供了两个实际好处：它可以作为结构化奖励信号，在安全意识监督下引导强化学习;并且可以直接融入提示中，通过上下文学习提升反应质量。在开放式医疗任务中，Health-SCORE实现了与人工制定评分标准相当的评估质量，同时显著降低了开发工作量，使基于评分标准的评估和培训更具可扩展性。
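
One way a rubric can serve as a structured reward signal is sketched below. The abstract does not describe Health-SCORE's scoring rule, so this assumes a simple points-based rubric: each criterion carries a positive or negative weight, a grader marks which criteria a response meets, and the reward is the earned points divided by the maximum achievable, clipped to [0, 1]. The criteria, weights, and function name are all hypothetical.

```python
def rubric_reward(criteria, met):
    """Score a response against a weighted rubric.

    criteria: list of (description, points); negative points penalize unsafe content.
    met: list of booleans from a grader, one per criterion.
    """
    earned = sum(points for (_, points), ok in zip(criteria, met) if ok)
    max_points = sum(points for _, points in criteria if points > 0)
    if max_points == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))

# Made-up rubric for a medication question.
criteria = [
    ("advises consulting a clinician", 5),
    ("states dosage limits", 3),
    ("recommends an unsafe action", -6),
]
safe_partial = rubric_reward(criteria, [True, False, False])
unsafe_full = rubric_reward(criteria, [True, True, True])
```

The negative weight makes a complete but unsafe answer score below a partial but safe one, which is the kind of safety-aware signal the abstract alludes to.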

Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

反思：透明的原则导向推理，实现大规模宪法对齐

Authors: Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman, Brandon Fain
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.18730
Pdf link: https://arxiv.org/pdf/2601.18730
Abstract The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (for example, avoiding biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose Reflect, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. Reflect operates entirely in-context, combining (i) a constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. Reflect's technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that Reflect significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. Reflect is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that Reflect naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.
中文摘要 宪法式对齐框架旨在使大型语言模型（LLMs）与以自然语言书写的、带有价值取向的原则保持一致（例如避免使用有偏见的语言）。此前的研究主要集中在参数微调技术，如基于人类反馈的强化学习（RLHF），以灌输这些原则。然而，这些方法计算量大，需要精心的工程设计和调校，且通常需要难以获得的人工标注数据。我们提出了 Reflect，一种推理时的宪法式对齐框架，无需任何训练或数据，提供了即插即用的方法，将指令微调模型与一套原则对齐。Reflect 完全在上下文中运作，结合了（i）以宪法为条件的基础回答，以及生成后的（ii）自我评估、（iii）（a）自我批评和（iii）（b）最终修订。Reflect 在生成后阶段对原则进行显式上下文推理的技术优于标准的少样本提示，并提供了透明的推理轨迹。我们的结果表明，Reflect 显著提升了LLM对多样且复杂的原则的符合性，包括与模型原始参数微调中所强调的原则截然不同的原则，同时不牺牲事实推理。Reflect 在降低罕见但严重的原则违反率方面尤为有效，从而提升了生成分布尾部的安全性和稳健性。最后，我们证明 Reflect 能自然地为传统参数微调技术生成有用的训练数据，从而在长期部署场景中实现高效扩展并降低推理时的计算开销。
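
The four in-context stages listed in the abstract can be laid out as a short control-flow sketch. The prompt wording, the function names, and the choice to run critique and revision only when the self-evaluation flags a violation are assumptions for illustration, since the abstract gives none of these details. `generate` stands for any text-in, text-out model call, and `stub_model` is a canned stand-in so the sketch runs offline.

```python
from typing import Callable, Dict, Optional

def reflect_style_pipeline(
    generate: Callable[[str], str],
    constitution: str,
    user_prompt: str,
) -> Dict[str, Optional[str]]:
    """Run base response -> self-evaluation -> (critique -> revision) in context."""
    header = f"Principles:\n{constitution}\n\n"
    base = generate(header + f"User: {user_prompt}\nAssistant:")
    verdict = generate(
        header + f"Response:\n{base}\n\n"
        "Does the response violate any principle? Answer YES or NO."
    )
    trace = {"base": base, "verdict": verdict, "critique": None, "final": base}
    # Assumption: critique and revision run only when a violation is flagged.
    if verdict.strip().upper().startswith("YES"):
        critique = generate(
            header + f"Response:\n{base}\n\nExplain which principles are violated."
        )
        final = generate(
            header + f"Response:\n{base}\n\nCritique:\n{critique}\n\n"
            "Rewrite the response so that it follows every principle."
        )
        trace.update(critique=critique, final=final)
    return trace

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    if "Answer YES or NO" in prompt:
        return "YES"
    if "Rewrite the response" in prompt:
        return "A neutral reply."
    if "Explain which principles" in prompt:
        return "The reply uses loaded wording."
    return "A loaded reply."

trace = reflect_style_pipeline(stub_model, "Avoid biased language.", "Describe my neighbors.")
```

Returning the whole trace rather than only the final text keeps the intermediate verdict and critique available, in the spirit of the transparent reasoning traces the abstract describes.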

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

自蒸馏推理器：大型语言模型的同策略自蒸馏

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18734
Pdf link: https://arxiv.org/pdf/2601.18734
Abstract Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
中文摘要 知识蒸馏通过压缩教师LLM的知识来训练较小的LLM来提升大型语言模型（LLM）推理能力。策略上提纯通过让学生采样自身轨迹，而教师LLM则提供密集的标记级监督，解决非策略提炼方法中训练与推断分布不匹配的问题，从而推动这一方法的发展。然而，策略提炼通常需要一个独立且规模更大的教师LLM，且不会明确利用推理数据集中可用的真实性解决方案。基于直觉：一个足够强大的大型语言模型能够合理化外部特权推理的痕迹并教导其较弱的自我（即没有特权信息访问权限的版本），我们提出了策略自提纯（OPSD）框架，该框架中单个模型通过不同上下文的条件，既扮演教师又扮演学生的角色。教师政策要求特权信息（例如，经过验证的推理追踪），而学生政策只关注问题本身;训练可最小化这些分布与学生自身推广的每个代币差异。我们展示了该方法在多个数学推理基准测试上的有效性，与GRPO等强化学习方法相比，代币效率达到4-8倍，且优于非策略提炼方法。
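
The training objective described above, a per-token divergence between the privileged teacher and the question-only student, evaluated on the student's own rollout, can be sketched in a few lines. The abstract does not say which divergence is used. The sketch assumes the reverse KL, KL(student || teacher), and takes the two sets of next-token logits as plain lists. All names and numbers are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of one student rollout."""
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)  # student: conditioned on the question only
        q = softmax(t_row)  # teacher: same weights, conditioned on privileged context
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses

# Two positions, vocabulary of size 3 (made-up logits).
student = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
losses = per_token_kl(student, teacher)
sequence_loss = sum(losses) / len(losses)
```

Because both distributions come from the same model under different contexts, the loss is zero wherever the privileged context does not change the prediction, as at the second position here.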

Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback

信任、不信任或翻转：基于多专家反馈的鲁棒偏好强化学习

Authors: Seyed Amir Hosseini, Maryam Abdolali, Amirhosein Tavakkoli, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18751
Pdf link: https://arxiv.org/pdf/2601.18751
Abstract Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability; some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that trust parameters naturally evolve during gradient-based optimization to be positive (trust), near zero (ignore), or negative (flip), enabling the model to automatically invert adversarial preferences and recover useful signal rather than merely discarding corrupted feedback. We provide theoretical analysis establishing identifiability guarantees and detailed gradient analysis that explains how expert separation emerges naturally during training without explicit supervision. Empirically, we evaluate TTP on four diverse domains spanning manipulation tasks (MetaWorld) and locomotion (DM Control) under various corruption scenarios. TTP achieves state-of-the-art robustness, maintaining near-oracle performance under adversarial corruption while standard PBRL methods fail catastrophically. Notably, TTP outperforms existing baselines by successfully learning from mixed expert pools containing both reliable and adversarial annotators, all while requiring no expert features beyond identification indices and integrating seamlessly with existing PBRL pipelines.
中文摘要 基于偏好的强化学习（PBRL）通过两对轨迹比较学习，提供了显式奖励工程的有前景替代方案。然而，现实世界的偏好数据通常来自不同信度的异构标注符;有些准确，有些嘈杂，还有些系统性地对立。现有的PBRL方法要么对所有反馈一视同仁，要么试图过滤掉不可靠的来源，但当面对系统性错误偏好的对抗性注释者时，这两种方法都失败了。我们介绍TriTrust-PBRL（TTP），这是一个统一框架，通过多专家偏好反馈共同学习共享奖励模型和专家专属信任参数。关键见解是，信任参数在基于梯度的优化过程中自然演变为正（信任）、接近零（忽略）或负（反转），使模型能够自动反转对抗偏好并恢复有用信号，而不仅仅是丢弃损坏的反馈。我们提供理论分析，确立可识别性保证，并详细的梯度分析，解释专家分离如何在培训过程中自然产生，无需明确监督。通过实证方式，我们在四个不同领域评估TTP，涵盖作任务（元世界）和移动（DM控制）在不同损坏场景下的表现。TTP实现了最先进的鲁棒性，在对抗性破坏下保持近乎预言机的性能，而标准PBRL方法则会灾难性地失败。值得注意的是，TTP优于现有基线，成功从包含可靠和对抗性标注器的混合专家池中学习，同时仅需识别索引，且与现有PBRL流水线无缝集成。
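
The trust, ignore, or flip behaviour can be made concrete with a Bradley-Terry preference model whose logit is scaled by a per-expert trust parameter. The abstract does not give the exact parameterization, so the multiplicative form below is an assumption, and the names and numbers are illustrative. A positive trust value follows the annotator, a value near zero makes their labels uninformative, and a negative value inverts them.

```python
import math

def preference_probability(return_a, return_b, trust):
    """P(annotator prefers A) under a trust-scaled Bradley-Terry model (assumed form)."""
    return 1.0 / (1.0 + math.exp(-trust * (return_a - return_b)))

def preference_nll(return_a, return_b, trust, label):
    """Negative log-likelihood of one label: 1 if the annotator chose A, 0 if B."""
    p = preference_probability(return_a, return_b, trust)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Trajectory A is truly better, but an adversarial annotator picks B (label 0).
nll_trusting = preference_nll(2.0, 0.0, trust=1.0, label=0)
nll_flipped = preference_nll(2.0, 0.0, trust=-1.0, label=0)
nll_ignored = preference_nll(2.0, 0.0, trust=0.0, label=0)
```

Under this model, gradient descent on the likelihood lowers the loss for a consistently wrong annotator by pushing their trust parameter negative, which is how the inverted labels become usable signal instead of being thrown away.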

Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Dep-Search：利用持久记忆学习依赖感知的推理轨迹

Authors: Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.18771
Pdf link: https://arxiv.org/pdf/2601.18771
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs' ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.
中文摘要 大型语言模型（LLMs）在复杂推理任务中展现出卓越的能力，尤其是在配合搜索机制后，能够系统地探索外部知识库。该领域已从传统的检索增强生成（RAG）框架发展为更复杂的基于搜索的框架，通过显式搜索策略协调多步推理。然而，现有的搜索框架仍然高度依赖隐性自然语言推理来确定搜索策略以及如何在推理步骤中利用检索到的信息。这种对隐性推理的依赖为管理子问题之间的依赖、高效重用先前检索知识以及通过强化学习学习最优搜索策略带来了根本性的挑战。为解决这些局限性，我们提出了Dep-Search，一种依赖感知型搜索框架，通过GRPO整合结构化推理、检索和持久记忆，超越现有搜索框架。Dep-Search 引入了显式的控制机制，使模型能够分解带有依赖关系的问题，必要时检索信息，访问先前存储的记忆知识，并将长推理上下文总结为可重用的记忆条目。通过对七个多样化问答数据集的广泛实验，我们证明了Dep-Search显著提升了大型语言模型处理复杂多跳推理任务的能力，在不同模型尺度上相较于强基线实现了显著提升。
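
Decomposing a question into sub-questions with dependencies, then reusing stored answers, amounts to resolving a small dependency graph in order. The sketch below is only an illustration of that idea, not Dep-Search's learned control mechanism: it assumes the decomposition is already given as a dictionary, uses the standard-library topological sorter, and treats `answer_fn` as a hypothetical stand-in for retrieval plus generation.

```python
from graphlib import TopologicalSorter

def answer_in_dependency_order(dependencies, answer_fn):
    """Resolve sub-questions so each one sees the stored answers it depends on.

    dependencies: maps each sub-question id to the ids it depends on.
    answer_fn: hypothetical callable (question id, context dict) -> answer.
    """
    memory = {}  # persistent store of answers, reused by later sub-questions
    for qid in TopologicalSorter(dependencies).static_order():
        context = {dep: memory[dep] for dep in dependencies.get(qid, ())}
        memory[qid] = answer_fn(qid, context)
    return memory

# Made-up multi-hop decomposition: q3 needs q1 and q2, q2 needs q1.
deps = {"q3": ["q1", "q2"], "q2": ["q1"], "q1": []}
memory = answer_in_dependency_order(
    deps, lambda qid, ctx: f"{qid} given {sorted(ctx)}"
)
```

`TopologicalSorter` raises `CycleError` on circular dependencies, which gives a cheap sanity check on a generated decomposition.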

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

教模型自我教学：可学习性边缘的推理

Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18778
Pdf link: https://arxiv.org/pdf/2601.18778
Abstract Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.
中文摘要 模型能否学会摆脱自身的学习平台？用于微调大型推理模型的强化学习方法在初始成功率低的数据集上会停滞，因此训练信号较少。我们探讨一个根本性问题：预训练的LLM能否利用潜在知识生成自动化课程，解决无法解决的问题？为此，我们设计了SOAR：一个自我提升框架，旨在通过元强化学习（meta-RL）揭示这些教学信号。教师版模型为学生提出综合问题，并因对部分难题的改进而获得奖励。关键是，SOAR将课程建立在学生的衡量进步基础上，而非内在的代理奖励。我们对最难数学基准子集（0/128成功率）的研究揭示了三个核心发现。首先，我们证明了可以通过提升预训练模型潜在能力来实现双级元强化学习，从而在稀疏、二元奖励下解锁学习，从而生成有用的跳板。其次，扎根奖励优于以往大型语言模型自玩中使用的内在奖励方案，可靠地避免了它们通常表现出的不稳定性和多样性崩溃模式。第三，分析生成的问题表明，结构质量和合理性对学习进展比解题正确性更为关键。我们的结果表明，生成有用垫脚石的能力并不需要先验解决难题的能力，这为在没有额外策划数据的情况下，为摆脱推理平台铺平了一条有原则的道路。

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

POPE：通过特权同策略探索学习在难题上进行推理

Authors: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18779
Pdf link: https://arxiv.org/pdf/2601.18779
Abstract Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
中文摘要 强化学习（RL）提高了大型语言模型（LLM）的推理能力，但最先进的方法在许多训练问题上仍无法实现学习。在困难问题上，政策强化学习很少探索哪怕一次正确的推广，没有任何奖励，也没有学习信号来推动改进。我们发现，解决经典强化学习探索问题的自然方案，如熵加成、更宽松的重要性比剪裁，或直接优化pass@k目标，无法解决该问题，且常常使优化不稳定，且不提升可解性。一个自然的替代方案是利用较简单的问题进行转移。然而，我们证明在强化学习训练中混合简单和困难问题是适得其反的，因为射线干扰导致优化专注于已经可解的问题，从而主动阻碍了难题的进展。为应对这一挑战，我们引入了特权策略探索（POPE），这是一种利用人类或其他预言机解决方案作为特权信息，指导对难题探索的方法，不同于以预言机解决方案为训练目标的方法（例如非策略强化学习方法或SFT的热启动）。POPE 通过预言机解决方案的前缀来增强难题，使强化学习在引导推广中获得非零奖励。关键是，这些行为通过跟随指令与推理的协同作用，回归到原始的无引导问题。从实证角度看，POPE扩展了可解问题的范围，并显著提升了在具有挑战性推理基准测试中的表现。

Multi-Objective Reinforcement Learning for Efficient Tactical Decision Making for Trucks in Highway Traffic

多目标强化学习用于高速公路交通中卡车高效战术决策

Authors: Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.18783
Pdf link: https://arxiv.org/pdf/2601.18783
Abstract Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a continuous set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a continuous set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.
中文摘要 在高速公路驾驶中平衡安全性、效率和运营成本，是重型车辆在决策上的挑战。一个核心难题是，传统的标量奖励表述，通过聚合这些竞争目标，往往模糊了它们权衡的结构。我们提出了基于近点策略优化的多目标强化学习框架，学习一套连续的策略，明确表示这些权衡，并在可扩展的模拟平台上评估，用于卡车战术决策。所提方法学习一套连续的帕累托最优策略，捕捉三个冲突目标之间的权衡：安全，以碰撞次数和成功完成来量化;能源效率和时间效率，分别用能量成本和驱动力成本来量化。由此产生的帕累托边界平滑且可解释，使得在不同冲突目标下灵活选择驾驶行为。该框架允许不同驾驶政策之间的无缝过渡，无需重新培训，为自动驾驶卡车应用提供稳健且适应性的决策策略。
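
A Pareto-optimal set is simply the set of policies that no other policy beats on every objective at once. The sketch below filters a list of evaluated policies down to that set. It is a generic dominance check, not the paper's PPO-based training procedure, and the cost tuples (collision rate, energy cost, driver cost, all lower-is-better) are made-up numbers.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the cost vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (collision rate, energy cost, driver cost) for four hypothetical policies.
policies = [
    (1, 5, 3),
    (2, 2, 2),
    (3, 3, 3),  # dominated by (2, 2, 2)
    (1, 5, 4),  # dominated by (1, 5, 3)
]
front = pareto_front(policies)
```

The two surviving policies represent different trade-offs: one has fewer collisions, the other lower energy and driver cost, and neither is better on all three.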

Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

复用你的FLOPs：以高度离策略的前缀为条件，在难题上扩展强化学习

Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18795
Pdf link: https://arxiv.org/pdf/2601.18795
Abstract Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
中文摘要 典型的强化学习（RL）方法用于大型语言模型推理，浪费计算在困难问题上，即正确的策略轨迹稀少，策略梯度消失，学习停滞。为了更高效的强化学习，我们考虑以非策略轨迹的形式重用旧的采样FLOP（来自先前推断或强化学习训练）。标准的非策略方法会对非策略数据进行监督，导致强化学习优化过程中出现不稳定。我们引入了 PrefixRL，在其中我们以成功的非策略追踪前缀为条件，并运行 on-policy RL 来完成这些跟踪，从而绕过了非策略的不稳定性。前缀RL通过调节非策略前缀长度来增强难题的学习信号。我们证明前缀RL目标不仅与标准RL目标一致，而且样本效率更高。实证上，我们发现了反向推广：仅对前缀问题进行训练时，可以推广到分布外的无前缀表现，且学到的策略往往与前缀中的策略不同。在我们的实验中，我们通过用基模型进行拒绝抽样来获取非策略痕迹，形成自我改进循环。在硬推理问题中，PrefixRL达到相同训练奖励的速度是最强基线（非策略数据上的SFT再是RL的2倍），即使考虑了初始拒绝采样的计算，最终奖励也提高了3倍。这些收益会转移到被保留的基准测试上，即使非策略追踪来自不同模型族，PrefixRL依然有效，验证了其在实际环境中的灵活性。
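
Conditioning on a prefix of a successful trace, with the prefix length acting as a difficulty knob, can be sketched as a simple prompt-construction step. The tokenization by whitespace, the fixed fractions, and the function name are assumptions for illustration. The abstract does not say how prefix lengths are chosen.

```python
def prefixed_prompts(problem, trace_tokens, fractions=(0.75, 0.5, 0.25)):
    """Return (fraction, prompt) pairs; longer prefixes leave less for the policy to solve."""
    prompts = []
    for fraction in fractions:
        cut = int(len(trace_tokens) * fraction)
        prefix = " ".join(trace_tokens[:cut])
        # The on-policy rollout would continue from the end of this prompt.
        prompt = problem if not prefix else f"{problem}\n{prefix}"
        prompts.append((fraction, prompt))
    return prompts

# A successful off-policy trace split into four made-up steps.
trace_tokens = ["s1", "s2", "s3", "s4"]
prompts = prefixed_prompts("Q", trace_tokens)
```

Only the continuation after the prefix is generated and rewarded on-policy, which is how this setup avoids supervising directly on the off-policy tokens.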

Keyword: diffusion policy

3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control

3DGesPolicy：基于动作控制的音素感知整体共言手势生成

Authors: Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2601.18451
Pdf link: https://arxiv.org/pdf/2601.18451
Abstract Generating holistic co-speech gestures that integrate full-body motion with facial expressions is difficult: existing part-decomposed or frame-level regression methods produce semantically incoherent body-motion coordination and spatially unstable, meaningless movements. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that can deeply integrate and refine multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
中文摘要 生成整合全身动作与面部表情的整体共言手势，由于现有部分分解或帧级回归方法，存在语义协调不连贯和空间不稳定且无意义的运动。我们介绍了3DGesPolicy，这是一个新的基于动作的框架，通过机器人的扩散策略将整体手势生成重新表述为连续轨迹控制问题。通过将帧间变化建模为统一的整体动作，我们的方法有效学习了帧间整体手势运动模式，并确保空间和语义上的运动轨迹都符合真实的运动流形。为了进一步弥合表达对齐的差距，我们提出了一个手势-音频-音素（GAP）融合模块，能够深度整合和优化多模态信号，确保语音语义、身体动作和面部表情之间的结构化且细致的对齐。在BEAT2数据集上的大量定量和定性实验证明了我们的3DGesPolicy在生成自然、富有表现力且高度符合语音整体手势方面，在其他最先进方法中的有效性。