Generated: 2026-01-22 16:36:41 (UTC+8); arXiv release time: 2026-01-22 20:00 EST (2026-01-23 09:00 UTC+8)
28 related papers today
Keyword: reinforcement learning
Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design
- Authors: Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.14283
- Pdf link: https://arxiv.org/pdf/2601.14283
- Abstract
Currently, the field of structure-based drug design (SBDD) is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, an option that is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code used for benchmarking is available at this https URL
Large Language Model-Powered Evolutionary Code Optimization on a Phylogenetic Tree
- Authors: Leyi Zhao, Weijie Huang, Yitong Guo, Jiang Bian, Chenghong Wang, Xuhong Zhang
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.14523
- Pdf link: https://arxiv.org/pdf/2601.14523
- Abstract
Optimizing scientific computing algorithms for modern GPUs is a labor-intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)-assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome-based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM-agent system that reframes GPU-oriented algorithm optimization as an In-Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory-conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt-based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first-class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross-lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi-island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: this https URL
Towards Execution-Grounded Automated AI Research
- Authors: Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.14525
- Pdf link: https://arxiv.org/pdf/2601.14525
- Abstract
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
Report for NSF Workshop on AI for Electronic Design Automation
- Authors: Deming Chen, Vijay Ganesh, Weikai Li, Yingyan (Celine) Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
- Arxiv link: https://arxiv.org/abs/2601.14541
- Pdf link: https://arxiv.org/pdf/2601.14541
- Abstract
This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI, spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, etc., can facilitate EDA and shorten design turnaround. The workshop covered four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in the physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, security/reliability challenges, etc. The report recommends that NSF foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found at this https URL.
Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education
- Authors: Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park, Younghoon Jeon, Sungmin Cho, Junbo Koh, Yeil Jeong, Gyeonggeon Lee
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.14560
- Pdf link: https://arxiv.org/pdf/2601.14560
- Abstract
Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model's internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model's reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model's factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor's thinking process.
Learning Consistent Taxonomic Classification through Hierarchical Reasoning
- Authors: Zhenghong Li, Kecheng Zheng, Haibin Ling
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2601.14610
- Pdf link: https://arxiv.org/pdf/2601.14610
- Abstract
While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model's reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
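The notion of hierarchical consistency in the abstract above can be illustrated with a small check. This is a minimal sketch, not the paper's metric: it assumes a prediction counts as consistent when every predicted coarser level equals the true ancestor chain of the predicted leaf. The `parent` map, the function names, and the example taxa are hypothetical.

```python
def ancestors(leaf, parent):
    """Return the ancestor chain of `leaf`, coarsest level first."""
    path = []
    node = leaf
    while node in parent:          # walk upward until we reach a root
        node = parent[node]
        path.append(node)
    return path[::-1]

def is_consistent(predicted_levels, parent):
    """predicted_levels lists one label per level, coarsest first, leaf last.

    Assumption: consistency means the coarser predictions match the
    ancestors implied by the predicted leaf.
    """
    *coarse, leaf = predicted_levels
    return coarse == ancestors(leaf, parent)

# Hypothetical toy taxonomy (child -> parent).
parent = {
    "Panthera leo": "Panthera",
    "Panthera": "Felidae",
    "Canis lupus": "Canis",
    "Canis": "Canidae",
}
```

A prediction such as `["Canidae", "Panthera", "Panthera leo"]` has a correct leaf but an inconsistent family, which is the kind of error the abstract describes.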
SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation
- Authors: Xichen Zhang, Ziyi He, Yinghao Zhu, Sitong Wu, Shaozuo Yu, Meng Chu, Wenhu Zhang, Haoru Tan, Jiaya Jia
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.14615
- Pdf link: https://arxiv.org/pdf/2601.14615
- Abstract
Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise due to data misalignment. This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong Sim-to-Real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
- Authors: Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Semih Yavuz, Caiming Xiong, Shafiq Joty
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2601.14652
- Pdf link: https://arxiv.org/pdf/2601.14652
- Abstract
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
FARE: Fast-Slow Agentic Robotic Exploration
- Authors: Shuhao Liao, Xuxin Lv, Jeric Lew, Shizhe Zhang, Jingsong Liang, Peizhuo Li, Yuhong Cao, Wenjun Wu, Guillaume Sartoretti
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2601.14681
- Pdf link: https://arxiv.org/pdf/2601.14681
- Abstract
This work advances autonomous robot exploration by integrating agent-level semantic reasoning with fast local control. We introduce FARE, a hierarchical autonomous exploration framework that integrates a large language model (LLM) for global reasoning with a reinforcement learning (RL) policy for local decision making. FARE follows a fast-slow thinking paradigm. The slow-thinking LLM module interprets a concise textual description of the unknown environment and synthesizes an agent-level exploration strategy, which is then grounded into a sequence of global waypoints through a topological graph. To further improve reasoning efficiency, this module employs a modularity-based pruning mechanism that reduces redundant graph structures. The fast-thinking RL module executes exploration by reacting to local observations while being guided by the LLM-generated global waypoints. The RL policy is additionally shaped by a reward term that encourages adherence to the global waypoints, enabling coherent and robust closed-loop behavior. This architecture decouples semantic reasoning from geometric decision making, allowing each module to operate at its appropriate temporal and spatial scale. In challenging simulated environments, our results show that FARE achieves substantial improvements in exploration efficiency over state-of-the-art baselines. We further deploy FARE on hardware and validate it in a complex, large-scale $200\,\mathrm{m} \times 130\,\mathrm{m}$ building environment.
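The waypoint-adherence reward term mentioned in the abstract can be pictured as a shaping bonus added to the exploration reward. The abstract does not give its form, so the sketch below assumes a simple progress-based bonus (how much closer the robot moved to the current waypoint); the function name, the `weight` value, and the 2D positions are all hypothetical.

```python
import math

def shaped_reward(explore_reward, prev_pos, pos, waypoint, weight=0.1):
    """Exploration reward plus an assumed bonus for moving toward the waypoint.

    progress > 0 when the step reduced the distance to the waypoint,
    progress < 0 when the robot moved away from it.
    """
    progress = math.dist(prev_pos, waypoint) - math.dist(pos, waypoint)
    return explore_reward + weight * progress

# Moving 1 m toward a waypoint 3 m away adds weight * 1.0 to the reward.
r_toward = shaped_reward(1.0, (0.0, 0.0), (1.0, 0.0), (3.0, 0.0))
r_away = shaped_reward(1.0, (0.0, 0.0), (-1.0, 0.0), (3.0, 0.0))
```

A small `weight` keeps the local policy free to react to observations while still biasing it toward the global plan, which matches the division of labor the abstract describes.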
Beyond Error-Based Optimization: Experience-Driven Symbolic Regression with Goal-Conditioned Reinforcement Learning
- Authors: Jianwen Sun, Xinrui Li, Fuqing Li, Xiaoxuan Shen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.14693
- Pdf link: https://arxiv.org/pdf/2601.14693
- Abstract
Symbolic Regression aims to automatically identify compact and interpretable mathematical expressions that model the functional relationship between input and output variables. Most existing search-based symbolic regression methods typically rely on the fitting error to inform the search process. However, in the vast expression space, numerous candidate expressions may exhibit similar error values while differing substantially in structure, leading to ambiguous search directions and hindering convergence to the underlying true function. To address this challenge, we propose a novel framework named EGRL-SR (Experience-driven Goal-conditioned Reinforcement Learning for Symbolic Regression). In contrast to traditional error-driven approaches, EGRL-SR introduces a new perspective: leveraging precise historical trajectories and optimizing the action-value network to proactively guide the search process, thereby achieving a more robust expression search. Specifically, we formulate symbolic regression as a goal-conditioned reinforcement learning problem and incorporate hindsight experience replay, allowing the action-value network to generalize common mapping patterns from diverse input-output pairs. Moreover, we design an all-point satisfaction binary reward function that encourages the action-value network to focus on structural patterns rather than low-error expressions, and concurrently propose a structure-guided heuristic exploration strategy to enhance search diversity and space coverage. Experiments on public benchmarks show that EGRL-SR consistently outperforms state-of-the-art methods in recovery rate and robustness, and can recover more complex expressions under the same search budget. Ablation results validate that the action-value network effectively guides the search, with both the reward function and the exploration strategy playing critical roles.
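The contrast between an error-driven signal and the all-point satisfaction binary reward, together with hindsight relabeling, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the tolerance `tol`, the function names, and the toy expressions are hypothetical.

```python
import numpy as np

def all_point_reward(expr, xs, ys, tol=1e-6):
    """Return 1.0 only if expr reproduces every target output within tol, else 0.0."""
    preds = np.array([expr(x) for x in xs], dtype=float)
    targets = np.asarray(ys, dtype=float)
    return 1.0 if np.all(np.abs(preds - targets) <= tol) else 0.0

def hindsight_goal(expr, xs):
    """Relabel the goal as the outputs the expression actually produced,
    so a failed search trajectory becomes a successful one for that goal."""
    return np.array([expr(x) for x in xs], dtype=float)

xs = [0.0, 1.0, 2.0]
ys = [x * x + 1.0 for x in xs]            # hypothetical target: x^2 + 1

exact = lambda x: x * x + 1.0
close = lambda x: x * x + 1.01            # low error, but not the target

reward_exact = all_point_reward(exact, xs, ys)
reward_close = all_point_reward(close, xs, ys)
reward_relabel = all_point_reward(close, xs, hindsight_goal(close, xs))
```

The near-miss expression earns no reward against the original goal even though its fitting error is small, yet the same trajectory is a valid positive example once the goal is relabeled with its own outputs.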
CoScale-RL: Efficient Post-Training by Co-Scaling Data and Computation
- Authors: Yutong Chen, Jiandong Gao, Ji Wu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.14695
- Pdf link: https://arxiv.org/pdf/2601.14695
- Abstract
Training Large Reasoning Models (LRMs) is usually unstable and unpredictable, especially on hard problems or with weak foundation models. We find that the current post-training scaling strategy leaves room for improvement in these cases. We propose CoScale-RL, a novel scaling strategy with better data and computational efficiency. We first scale up solutions to make problems solvable. The core idea is to collect multiple solutions for each problem, rather than simply enlarging the dataset. Then, we scale up rollout computation to stabilize Reinforcement Learning. We further leverage a model merge technique called Re-distillation to sustain or even improve computational efficiency when scaling up. Our method significantly improves data and computational efficiency, with an average $3.76\times$ accuracy improvement on four benchmarks. CoScale-RL is able to improve an LRM's ability boundary without an extensive SFT dataset. Our method provides a new scaling direction to further improve LRMs' reasoning ability.
DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
- Authors: Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.14700
- Pdf link: https://arxiv.org/pdf/2601.14700
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
Proximal Policy Optimization with Evolutionary Mutations
- Authors: Casimir Czworkowski, Stephen Hornish, Alhassan S. Yasin
- Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.14705
- Pdf link: https://arxiv.org/pdf/2601.14705
- Abstract
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm known for its stability and sample efficiency, but it often suffers from premature convergence due to limited exploration. In this paper, we propose POEM (Proximal Policy Optimization with Evolutionary Mutations), a novel modification to PPO that introduces an adaptive exploration mechanism inspired by evolutionary algorithms. POEM enhances policy diversity by monitoring the Kullback-Leibler (KL) divergence between the current policy and a moving average of previous policies. When policy changes become minimal, indicating stagnation, POEM triggers an adaptive mutation of policy parameters to promote exploration. We evaluate POEM on four OpenAI Gym environments: CarRacing, MountainCar, BipedalWalker, and LunarLander. Through extensive fine-tuning using Bayesian optimization techniques and statistical testing using Welch's t-test, we find that POEM significantly outperforms PPO on three of the four tasks (BipedalWalker: t=-2.0642, p=0.0495; CarRacing: t=-6.3987, p=0.0002; MountainCar: t=-6.2431, p<0.0001), while performance on LunarLander is not statistically significant (t=-1.8707, p=0.0778). Our results highlight the potential of integrating evolutionary principles into policy gradient methods to overcome exploration-exploitation tradeoffs.
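The stagnation trigger described in the abstract (monitor the KL divergence between the current policy and a moving average of previous policies, then mutate when it becomes small) can be sketched for a single discrete action distribution. This is a simplified illustration, not the POEM implementation: the threshold, the Gaussian mutation, the decay rate, and all function names are assumptions, and a real policy would be state-conditional.

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def maybe_mutate(logits, avg_probs, rng, kl_threshold=1e-3, sigma=0.5, decay=0.9):
    """Return (logits, updated moving average, mutated flag).

    Assumption: stagnation means KL(current || moving average) < kl_threshold,
    and mutation is additive Gaussian noise on the policy parameters.
    """
    logits = np.asarray(logits, dtype=float)
    kl = kl_divergence(softmax(logits), avg_probs)
    mutated = kl < kl_threshold
    if mutated:
        logits = logits + rng.normal(0.0, sigma, size=logits.shape)
    new_avg = decay * np.asarray(avg_probs, dtype=float) + (1.0 - decay) * softmax(logits)
    return logits, new_avg / new_avg.sum(), mutated

rng = np.random.default_rng(0)
start = np.zeros(3)
# Policy identical to its moving average: KL is 0, so a mutation is triggered.
_, _, stalled = maybe_mutate(start, softmax(start), rng)
# Policy far from its moving average: still changing, so no mutation.
kept, avg, moving = maybe_mutate(start, np.array([0.98, 0.01, 0.01]), rng)
```

In a full training loop this check would run between PPO updates, with the moving average refreshed after each one.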
Case-Guided Sequential Assay Planning in Drug Discovery
- Authors: Tianchi Chen, Jan Bima, Sean L. Wu, Otto Ritter, Bingjia Yang, Xiang Yu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
- Arxiv link: https://arxiv.org/abs/2601.14710
- Pdf link: https://arxiv.org/pdf/2601.14710
- Abstract
Optimally sequencing experimental assays in drug discovery is a high-stakes planning problem under severe uncertainty and resource constraints. A primary obstacle for standard reinforcement learning (RL) is the absence of an explicit environment simulator or transition data $(s, a, s')$; planning must rely solely on a static database of historical outcomes. We introduce the Implicit Bayesian Markov Decision Process (IBMDP), a model-based RL framework designed for such simulator-free settings. IBMDP constructs a case-guided implicit model of transition dynamics by forming a nonparametric belief distribution using similar historical outcomes. This mechanism enables Bayesian belief updating as evidence accumulates and employs ensemble MCTS planning to generate stable policies that balance information gain toward desired outcomes with resource efficiency. We validate IBMDP through comprehensive experiments. On a real-world central nervous system (CNS) drug discovery task, IBMDP reduced resource consumption by up to 92% compared to established heuristics while maintaining decision confidence. To rigorously assess decision quality, we also benchmarked IBMDP in a synthetic environment with a computable optimal policy. Our framework achieves significantly higher alignment with this optimal policy than a deterministic value iteration alternative that uses the same similarity-based model, demonstrating the superiority of our ensemble planner. IBMDP offers a practical solution for sequential experimental design in data-rich but simulator-poor domains.
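The idea of forming a nonparametric belief from similar historical outcomes can be illustrated with kernel-weighted case retrieval. The abstract does not specify the similarity measure, so this sketch assumes a Gaussian kernel on feature distance; the function name, the `bandwidth` parameter, and the toy cases are hypothetical.

```python
import numpy as np

def belief_over_outcomes(query, case_features, case_outcomes, bandwidth=1.0):
    """Belief over outcomes for `query`, weighted by similarity to past cases.

    Assumption: similarity is a Gaussian kernel on squared Euclidean distance.
    Returns a dict mapping each outcome label to its normalized weight.
    """
    feats = np.asarray(case_features, dtype=float)
    d2 = np.sum((feats - np.asarray(query, dtype=float)) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * bandwidth ** 2))
    weights = weights / weights.sum()     # normalize to a probability distribution
    belief = {}
    for w, outcome in zip(weights, case_outcomes):
        belief[outcome] = belief.get(outcome, 0.0) + float(w)
    return belief

# Hypothetical historical database: two cases near the query, one far away.
cases = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]]
outcomes = ["active", "active", "inactive"]
belief = belief_over_outcomes([0.0, 0.0], cases, outcomes)
```

As new assay results arrive, the query features change and the belief shifts toward the outcomes of the cases that now look most similar, which is one way to read the belief updating the abstract describes.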
- 中文摘要
在药物发现中对实验检测进行最优排序,是一个在严重不确定性和资源约束下的高风险规划问题。标准强化学习(RL)面临的主要障碍是缺乏显式的环境模拟器或转移数据 $(s, a, s')$;规划必须完全依赖静态的历史结果数据库。我们提出隐式贝叶斯马尔可夫决策过程(IBMDP),这是一个为此类无模拟器场景设计的基于模型的强化学习框架。IBMDP利用相似的历史结果形成非参数信念分布,从而构建案例引导的隐式转移动态模型。该机制使贝叶斯信念能够随证据积累而更新,并采用集成MCTS规划生成稳定的策略,在面向期望结果的信息增益与资源效率之间取得平衡。我们通过全面的实验验证了IBMDP。在一项真实的中枢神经系统(CNS)药物发现任务中,IBMDP在保持决策置信度的同时,相比既有启发式方法将资源消耗最多降低了92%。为了严格评估决策质量,我们还在一个最优策略可计算的合成环境中对IBMDP进行了基准测试。与使用相同相似度模型的确定性价值迭代方案相比,我们的框架与该最优策略的一致性显著更高,证明了我们集成规划器的优越性。IBMDP为数据丰富但缺乏模拟器的领域中的序贯实验设计提供了实用的解决方案。
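The idea of forming a nonparametric belief from similar historical cases can be sketched as a kernel-weighted vote over a static database. This is only an illustration under stated assumptions: the Gaussian kernel, the `bandwidth` parameter, the discrete outcome encoding, and the function name `case_based_belief` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def case_based_belief(query, case_features, case_outcomes, n_outcomes,
                      bandwidth=1.0):
    """Kernel-weighted outcome distribution from similar historical cases.

    query:         feature vector of the current state, shape (d,)
    case_features: historical feature matrix, shape (n, d)
    case_outcomes: integer outcome label per historical case, shape (n,)
    """
    query = np.asarray(query, dtype=np.float64)
    feats = np.asarray(case_features, dtype=np.float64)
    outcomes = np.asarray(case_outcomes, dtype=np.int64)
    # Gaussian similarity: closer historical cases get larger weight.
    sq_dist = np.sum((feats - query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    belief = np.bincount(outcomes, weights=weights, minlength=n_outcomes)
    total = belief.sum()
    if total == 0.0:
        # No informative neighbours: fall back to a uniform belief.
        return np.full(n_outcomes, 1.0 / n_outcomes)
    return belief / total
```

A planner could sample next-step outcomes from this distribution in place of a simulator call, and recompute it as new assay results extend the query features.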
DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs
DARA:基于强化学习微调大语言模型的上下文决策,实现在线广告中的少样本预算分配
- Authors: Mingxuan Song, Yusen Huo, Bohan Zhou, Shenglin Yin, Zhen Xiao, Jieyi Long, Zhilin Zhang, Chuan Yu
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.14711
- Pdf link: https://arxiv.org/pdf/2601.14711
- Abstract
Optimizing the advertiser's cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with precise adaptability required by AIGB tasks. Extensive experiments on both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.
- 中文摘要
在人工智能生成出价(AIGB)范式下,如何在预算约束下优化广告主赢得曝光的累计价值,是在线广告中的一个复杂挑战。广告主通常有个性化目标,但历史交互数据有限,由此形成传统强化学习(RL)方法难以有效应对的少样本场景。大型语言模型(LLM)利用其上下文学习能力从有限数据中泛化,为AIGB提供了有前景的替代方案。然而,它们缺乏细粒度优化所需的数值精度。为解决这一局限,我们提出GRPO-Adaptive,这是一种高效的LLM后训练策略,通过在训练过程中动态更新参考策略来同时提升推理能力和数值精度。在此基础上,我们进一步提出DARA,一种新颖的双阶段框架,将决策过程分解为两个阶段:一个通过上下文提示生成初始方案的少样本推理器,以及一个通过反馈驱动的推理来细化这些方案的细粒度优化器。这种分离使DARA能够将LLM的上下文学习优势与AIGB任务所需的精确适应性结合起来。在真实世界和合成数据环境上的大量实验表明,在预算约束下,我们的方法在广告主累计价值方面始终优于现有基线。
PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning
PCL-Reasoner-V1.5:通过离线强化学习推进数学推理
- Authors: Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.14716
- Pdf link: https://arxiv.org/pdf/2601.14716
- Abstract
We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.
- 中文摘要
我们介绍PCL-Reasoner-V1.5,一个用于数学推理的320亿参数大型语言模型(LLM)。该模型基于Qwen2.5-32B构建,先经过监督微调(SFT),再通过强化学习(RL)进一步优化。核心创新是我们提出的离线强化学习方法,它相比GRPO等标准在线强化学习方法具有更好的训练稳定性和效率。我们的模型在基于Qwen2.5-32B进行后训练的模型中达到了最先进的性能,在AIME 2024上的平均准确率为90.9%,在AIME 2025上为85.6%。我们的工作表明,离线强化学习是推进大型语言模型推理能力的一种稳定而高效的范式。所有实验均在华为Ascend 910C NPU上进行。
CI4A: Semantic Component Interfaces for Agents Empowering Web Automation
CI4A:面向智能体的语义组件接口,赋能Web自动化
- Authors: Zhi Qiu, Jiazheng Sun, Chenxiao Xia, Jun Zheng, Xin Peng
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.14790
- Pdf link: https://arxiv.org/pdf/2601.14790
- Abstract
While Large Language Models demonstrate remarkable proficiency in high-level semantic planning, they remain limited in handling fine-grained, low-level web component manipulations. To address this limitation, extensive research has focused on enhancing model grounding capabilities through techniques such as Reinforcement Learning. However, rather than compelling agents to adapt to human-centric interfaces, we propose constructing interaction interfaces specifically optimized for agents. This paper introduces Component Interface for Agent (CI4A), a semantic encapsulation mechanism that abstracts the complex interaction logic of UI components into a set of unified tool primitives accessible to agents. We implemented CI4A within Ant Design, an industrial-grade front-end framework, covering 23 categories of commonly used UI components. Furthermore, we developed a hybrid agent featuring an action space that dynamically updates according to the page state, enabling flexible invocation of available CI4A tools. Leveraging the CI4A-integrated Ant Design, we refactored and upgraded the WebArena benchmark to evaluate existing SoTA methods. Experimental results demonstrate that the CI4A-based agent significantly outperforms existing approaches, achieving a new SoTA task success rate of 86.3%, alongside substantial improvements in execution efficiency.
- 中文摘要
虽然大型语言模型在高层语义规划方面表现出色,但在处理细粒度、底层的网页组件操作方面仍然受限。为解决这一局限,大量研究致力于通过强化学习等技术提升模型的grounding(定位)能力。然而,我们主张不必强迫智能体去适应以人为中心的界面,而是构建专门为智能体优化的交互接口。本文提出面向智能体的组件接口(CI4A),这是一种语义封装机制,将UI组件复杂的交互逻辑抽象为一组智能体可调用的统一工具原语。我们在工业级前端框架Ant Design中实现了CI4A,覆盖23类常用UI组件。此外,我们开发了一个混合智能体,其动作空间会根据页面状态动态更新,从而能够灵活调用可用的CI4A工具。基于集成了CI4A的Ant Design,我们重构并升级了WebArena基准,用于评估现有的SoTA方法。实验结果表明,基于CI4A的智能体显著优于现有方法,取得了86.3%的新SoTA任务成功率,同时执行效率也大幅提升。
What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study
是什么让低比特量化感知训练对推理型大型语言模型有效?一项系统性研究
- Authors: Keyu Lv, Manyi Zhang, Xiaobo Xia, Jingchen Ni, Shannan Yan, Xianzhi Yu, Lu Hou, Chun Yuan, Haoli Bai
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.14888
- Pdf link: https://arxiv.org/pdf/2601.14888
- Abstract
Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
- 中文摘要
推理模型在编程和数学等复杂任务上表现出色,但其推理过程往往缓慢且token效率低下。为提高推理效率而采用的训练后量化(PTQ)通常会带来较大的准确率下降,在低比特设置下的推理任务中尤其如此。本研究对推理模型的量化感知训练(QAT)进行了系统的实证研究。我们的主要发现包括:(1)无论推理模型是通过监督微调还是强化学习训练得到的,知识蒸馏都是一个稳健的训练目标;(2)PTQ为QAT提供了良好的初始化,在提高准确率的同时降低训练成本;(3)在具备可行冷启动的前提下,强化学习对量化模型仍然可行,并能带来额外收益;(4)将PTQ校准域与QAT训练域对齐可以加速收敛,并且通常能提高最终准确率。最后,我们将这些发现整合为一个优化的工作流程(Reasoning-QAT),并证明它在多个LLM骨干模型和推理数据集上始终优于最先进的PTQ方法。例如,在Qwen3-0.6B上,它在MATH-500上比GPTQ高出44.53%,并且在2比特设置下能够稳定地恢复性能。
Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation
多语言检索增强生成的语言耦合强化学习
- Authors: Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang, Hongliang Li, Jinan Xu, Meng Jiang, Jian-Yun Nie, Kaiyu Huang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.14896
- Pdf link: https://arxiv.org/pdf/2601.14896
- Abstract
Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a unified process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a "one-size-fits-all" strategy is often suboptimal in multilingual settings, as the models are prone to knowledge bias and conflict during the interaction with the search engine. To alleviate these issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt language-coupled group sampling in the rollout module to reduce knowledge bias, and add an auxiliary anti-consistency penalty as a regularizer in the reward models to mitigate knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at this https URL.
- 中文摘要
多语言检索增强生成(MRAG)要求模型能够从多语言文档集合中有效获取并整合有益的外部知识。然而,现有研究大多采用统一的流程,即不同语言中语义等价的查询都通过单轮检索及后续优化来处理。这种“一刀切”的策略在多语言环境中往往并非最优,因为模型在与搜索引擎交互时容易出现知识偏差和知识冲突。为缓解这些问题,我们提出LcRL,一个多语言搜索增强的强化学习框架,将语言耦合的组相对策略优化(Group Relative Policy Optimization)融入策略模型和奖励模型。我们在rollout模块中采用语言耦合的组采样以减少知识偏差,并在奖励模型中加入辅助的反一致性惩罚作为正则项以缓解知识冲突。实验结果表明,LcRL不仅取得了有竞争力的性能,还适用于多种实际场景,例如训练数据受限以及在涵盖大量语言的文档集合上进行检索。我们的代码可在此 https URL 获取。
Improving Regret Approximation for Unsupervised Dynamic Environment Generation
改进无监督动态环境生成的遗憾近似
- Authors: Harry Mead, Bruno Lacerda, Jakob Foerster, Nick Hawes
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.14957
- Pdf link: https://arxiv.org/pdf/2601.14957
- Abstract
Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: this https URL.
- 中文摘要
无监督环境设计(UED)旨在为强化学习(RL)智能体自动生成训练课程,以提升泛化能力和零样本性能。然而,设计有效的课程仍然是一个难题,尤其是在少数环境参数化子集会使所需策略的复杂度显著增加的场景中。现有方法既难以应对困难的信用分配问题,又依赖无法识别高难度关卡的遗憾近似,而这两个问题都会随着环境规模的增大而加剧。我们提出面向UED的动态环境生成(DEGen),为关卡生成器提供更稠密的奖励信号,从而降低信用分配的难度,使UED能够扩展到更大的环境规模。我们还提出一种新的遗憾近似,即最大化负优势(Maximised Negative Advantage,MNA),作为一个显著改进的优化指标,它能更好地识别更具挑战性的关卡。我们通过实验表明,MNA优于现有的遗憾近似,并且与DEGen结合后始终优于现有方法,在环境规模增大时尤其如此。我们已将所有代码公开于此:此 https URL。
Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control
面向大规模流动控制的强化学习算法即插即用基准测试
- Authors: Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.15015
- Pdf link: https://arxiv.org/pdf/2601.15015
- Abstract
Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO and SAC and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning-based flow control, and is available at this https URL.
- 中文摘要
强化学习(RL)在主动流动控制(AFC)中已显示出有前景的结果,但由于现有研究采用各不相同的观测与驱动方案、数值设置和评估协议,该领域的进展仍难以评估。现有的AFC基准试图解决这些问题,但严重依赖外部计算流体力学(CFD)求解器,不完全可微,并且对三维和多智能体的支持有限。为克服这些局限,我们推出FluidGym,这是首个面向AFC中强化学习的独立、完全可微的基准套件。FluidGym完全使用PyTorch构建于GPU加速的PICT求解器之上,运行在单一的Python技术栈中,无需外部CFD软件,并提供标准化的评估协议。我们给出了PPO和SAC的基线结果,并将所有环境、数据集和训练好的模型作为公共资源发布。FluidGym使控制方法的系统比较成为可能,为基于学习的流动控制的未来研究奠定了可扩展的基础,可在此 https URL 获取。
A Curriculum-Based Deep Reinforcement Learning Framework for the Electric Vehicle Routing Problem
基于课程学习的电动汽车路径问题深度强化学习框架
- Authors: Mertcan Daysalilar, Fuat Uyguroglu, Gabriel Nicolosi, Adam Meyers
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.15038
- Pdf link: https://arxiv.org/pdf/2601.15038
- Abstract
The electric vehicle routing problem with time windows (EVRPTW) is a complex optimization problem in sustainable logistics, where routing decisions must minimize total travel distance, fleet size, and battery usage while satisfying strict customer time constraints. Although deep reinforcement learning (DRL) has shown great potential as an alternative to classical heuristics and exact solvers, existing DRL models often struggle to maintain training stability, failing to converge or generalize when constraints are dense. In this study, we propose a curriculum-based deep reinforcement learning (CB-DRL) framework designed to resolve this instability. The framework utilizes a structured three-phase curriculum that gradually increases problem complexity: the agent first learns distance and fleet optimization (Phase A), then battery management (Phase B), and finally the full EVRPTW (Phase C). To ensure stable learning across phases, the framework employs a modified proximal policy optimization algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning-rate scheduling. The policy network is built upon a heterogeneous graph attention encoder enhanced by global-local attention and feature-wise linear modulation. This specialized architecture explicitly captures the distinct properties of depots, customers, and charging stations. Trained exclusively on small instances with N=10 customers, the model demonstrates robust generalization to unseen instances ranging from N=5 to N=100, significantly outperforming standard baselines on medium-scale problems. Experimental results confirm that this curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances where standard DRL baselines fail, effectively bridging the gap between neural speed and operational reliability.
- 中文摘要
带时间窗的电动汽车路径问题(EVRPTW)是可持续物流中的一个复杂优化问题,其路径决策必须在满足严格的客户时间约束的同时,最小化总行驶距离、车队规模和电池消耗。尽管深度强化学习(DRL)作为经典启发式算法和精确求解器的替代方案展现出巨大潜力,现有DRL模型往往难以保持训练稳定性,在约束密集时无法收敛或泛化。本研究提出一个基于课程学习的深度强化学习(CB-DRL)框架,旨在解决这种不稳定性。该框架采用结构化的三阶段课程,逐步提高问题复杂度:智能体先学习距离和车队优化(阶段A),再学习电池管理(阶段B),最后学习完整的EVRPTW(阶段C)。为确保各阶段学习的稳定,框架采用了改进的近端策略优化算法,包括阶段特定的超参数、价值裁剪与优势裁剪,以及自适应学习率调度。策略网络建立在异构图注意力编码器之上,并通过全局-局部注意力和特征级线性调制(feature-wise linear modulation)加以增强。这种专门设计的架构显式地刻画了车场、客户和充电站各自不同的属性。该模型仅在N=10个客户的小规模实例上训练,却能稳健地泛化到N=5至N=100的未见实例,在中等规模问题上显著优于标准基线。实验结果证实,在标准DRL基线失效的分布外实例上,这种课程引导的方法仍能取得较高的可行率和有竞争力的解质量,有效弥合了神经网络求解速度与运营可靠性之间的差距。
Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning
记忆保持不足以掌握强化学习中的记忆任务
- Authors: Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.15086
- Pdf link: https://arxiv.org/pdf/2601.15086
- Abstract
Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift. Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored. To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, i.e. the natural setting where an agent must rely on memory rather than current observations, and use it to compare recurrent, transformer-based, and structured memory architectures. Our experiments reveal that classic recurrent models, despite their simplicity, demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer-based agents, which often fail beyond trivial retention cases. These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating. Our work highlights this overlooked challenge, introduces benchmarks to evaluate it, and offers insights for designing future RL agents with explicit and trainable forgetting mechanisms. Code: this https URL
- 中文摘要
现实世界中的有效决策依赖于既稳定又具适应性的记忆:环境会随时间变化,智能体必须在长时间跨度内保留相关信息,同时在情况变化时更新或覆盖过时的内容。现有的强化学习(RL)基准和记忆增强智能体主要关注记忆保持,而同样关键的记忆重写能力在很大程度上尚未得到探索。为弥补这一空白,我们提出一个基准,明确测试部分可观测条件下(即智能体必须依赖记忆而非当前观测的自然场景)的持续记忆更新,并用它比较循环架构、基于Transformer的架构和结构化记忆架构。我们的实验表明,经典的循环模型尽管简单,但在记忆重写任务中比现代结构化记忆(仅在狭窄条件下成功)和基于Transformer的智能体(一旦超出简单的记忆保持情形便常常失败)表现出更强的灵活性和鲁棒性。这些发现揭示了现有方法的根本局限,并强调需要能在稳定保持与自适应更新之间取得平衡的记忆机制。我们的工作突出了这一被忽视的挑战,提出了用于评估它的基准,并为设计具有显式且可训练遗忘机制的未来强化学习智能体提供了见解。代码:此 https URL
Vehicle Routing with Finite Time Horizon using Deep Reinforcement Learning with Improved Network Embedding
利用深度强化学习和改进的网络嵌入求解有限时间范围的车辆路径问题
- Authors: Ayan Maity, Sudeshna Sarkar
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.15131
- Pdf link: https://arxiv.org/pdf/2601.15131
- Abstract
In this paper, we study the vehicle routing problem with a finite time horizon. In this routing problem, the objective is to maximize the number of customer requests served within a finite time horizon. We present a novel routing network embedding module which creates local node embedding vectors and a context-aware global graph representation. The proposed Markov decision process for the vehicle routing problem incorporates the node features, the network adjacency matrix and the edge features as components of the state space. We incorporate the remaining finite time horizon into the network embedding module to provide a proper routing context to the embedding module. We integrate our embedding module with a policy gradient-based deep Reinforcement Learning framework to solve the vehicle routing problem with finite time horizon. We trained and validated our proposed routing method on real-world routing networks, as well as synthetically generated Euclidean networks. Our experimental results show that our method achieves a higher customer service rate than the existing routing methods. Additionally, the solution time of our method is significantly lower than that of the existing methods.
- 中文摘要
本文研究有限时间范围内的车辆路径问题。在该问题中,目标是在有限时间范围内最大化所服务的客户请求数量。我们提出一种新颖的路径网络嵌入模块,用于生成局部节点嵌入向量和上下文感知的全局图表示。我们为车辆路径问题提出的马尔可夫决策过程将节点特征、网络邻接矩阵和边特征作为状态空间的组成部分。我们将剩余的有限时间范围纳入网络嵌入模块,为其提供合适的路径规划上下文。我们将该嵌入模块与基于策略梯度的深度强化学习框架相结合,以求解有限时间范围的车辆路径问题。我们在真实世界的路网以及合成生成的欧几里得网络上训练并验证了所提出的方法。实验结果表明,我们的方法比现有路径规划方法取得了更高的客户服务率。此外,我们方法的求解时间明显短于现有方法。
CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning
CLEANER:自净化轨迹提升智能体强化学习
- Authors: Tianshi Xu, Yuteng Chen, Meng Li
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.15141
- Pdf link: https://arxiv.org/pdf/2601.15141
- Abstract
Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B-7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available at GitHub
- 中文摘要
智能体强化学习(Agentic RL)使大型语言模型(LLM)能够利用Python解释器等工具解决复杂问题。然而,对于参数规模受限的模型(例如4B-7B),探索阶段常常受到频繁执行失败的困扰,产生含噪轨迹,阻碍策略优化。在标准的基于结果的奖励设置下,这种噪声会导致严重的信用分配问题:错误的动作会随着成功的结果一同被无意强化。现有的缓解手段面临两难:稠密奖励往往引发奖励作弊(reward hacking),而超采样则带来过高的计算成本。为应对这些挑战,我们提出CLEANER。与外部过滤方法不同,CLEANER利用模型固有的自我纠正能力,在数据收集过程中直接消除被错误污染的上下文。其核心是相似度感知自适应回滚(SAAR)机制,它通过事后用成功的自我纠正替换失败步骤,自主构建干净、纯化的轨迹。SAAR基于语义相似度自适应地调节替换粒度,范围从浅层的执行修复到深层的推理替换。通过在这些自净化的轨迹上训练,模型内化的是正确的推理模式,而不是错误恢复循环。在AIME24/25、GPQA和LiveCodeBench上的实验结果显示,平均准确率相比基线分别提升6%、3%和5%。值得注意的是,CLEANER仅用三分之一的训练步数就达到了最先进的性能,凸显了轨迹净化是实现高效智能体强化学习的一种可扩展方案。我们的模型和代码可在GitHub上获取
Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
基于结果的强化学习可证明地引导Transformer学会推理,但前提是数据合适
- Authors: Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.15158
- Pdf link: https://arxiv.org/pdf/2601.15158
- Abstract
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
- 中文摘要
通过基于结果监督的强化学习(RL)训练的Transformer,能够自发地发展出生成中间推理步骤(思维链,Chain-of-Thought)的能力。然而,稀疏奖励究竟如何驱动梯度下降发现这种系统性推理,其机制仍不甚明了。为此,我们分析了单层Transformer在一个合成图遍历任务上的梯度流动力学;该任务不借助思维链(CoT)就无法求解,但存在一个简单的迭代解法。我们证明,尽管训练仅依据最终答案的正确性,梯度流仍会驱动模型收敛到一个结构化、可解释的算法,逐顶点地迭代遍历图。我们刻画了这种涌现所需的分布性质,指出了“简单样例”(即需要较少推理步骤的实例)的关键作用。当训练分布在这些较简单的实例上分配了足够的概率质量时,模型会学到一种可泛化的遍历策略,并能外推到更长的链;当这部分质量消失时,基于梯度的学习将变得不可行。我们通过合成数据上的实验以及真实语言模型在数学推理任务上的实验佐证了理论结果,验证了我们的理论发现同样适用于实际场景。
Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning
知识图谱是隐性奖励模型:路径导出信号使组合推理成为可能
- Authors: Yuval Kansal, Niraj K. Jha
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.15160
- Pdf link: https://arxiv.org/pdf/2601.15160
- Abstract
Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
- 中文摘要
大型语言模型在数学和编程等结构化推理领域已达到接近专家的水平,但它们在专业科学领域进行组合式多跳推理的能力仍然有限。我们提出一种自下而上的学习范式:模型立足于公理化的领域事实,并将它们组合起来解决复杂的、未见过的任务。为此,我们提出一个结合监督微调与强化学习(RL)的后训练流程,其中知识图谱充当隐式奖励模型。通过从知识图谱路径中导出新的奖励信号,我们提供了可验证、可扩展且有据可依的监督,鼓励模型在强化学习过程中组合中间公理,而不是只优化最终答案。我们在医学领域验证了这一方法:在短跳推理路径(1-3跳)上训练一个14B模型,并评估其对复杂多跳查询(4-5跳)的零样本泛化能力。实验表明,路径导出的奖励起到了“组合桥梁”的作用,使我们的模型在最困难的推理任务上显著优于规模大得多的模型以及GPT-5.2和Gemini 3 Pro等前沿系统。此外,我们通过选项打乱压力测试展示了该方法对对抗性扰动的鲁棒性。这项工作表明,将推理过程建立在结构化知识之上,是通向智能推理的一条可扩展且高效的路径。
The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
灵活性陷阱:为何任意顺序限制扩散语言模型中的推理潜力
- Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.15165
- Pdf link: https://arxiv.org/pdf/2601.15165
- Abstract
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: this https URL
- 中文摘要
扩散大型语言模型(dLLM)打破了传统LLM严格的从左到右生成约束,使token可以按任意顺序生成。直观上,这种灵活性意味着其解空间严格包含固定的自回归轨迹,理论上能为数学和编程等通用任务释放更强的推理潜力。因此,许多工作利用强化学习(RL)来激发dLLM的推理能力。本文揭示了一个反直觉的事实:当前形式的任意顺序生成非但没有扩展,反而缩小了dLLM的推理边界。我们发现,dLLM倾向于利用这种顺序灵活性绕过对探索至关重要的高不确定性token,导致解空间过早坍缩。这一观察对现有dLLM强化学习方法的前提提出了挑战:这些方法往往为保留这种灵活性而引入相当大的复杂性,例如处理组合式轨迹和难以计算的似然。我们证明,有意放弃任意顺序、转而应用标准的组相对策略优化(GRPO),能够更好地激发有效推理。我们的方法JustGRPO极为简洁却出人意料地有效(例如在GSM8K上达到89.1%的准确率),同时完全保留了dLLM的并行解码能力。项目页面:此 https URL
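For readers unfamiliar with GRPO, its central step as commonly described is to score a group of sampled completions per prompt and normalize each reward against its own group, which removes the need for a learned value function. The sketch below shows only that normalization; it follows the widely used formulation and is not necessarily the exact variant used in this paper. The function name and the `eps` constant are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within each sampling group.

    rewards: array of shape (n_prompts, group_size), one scalar reward
             per sampled completion.
    Returns advantages of the same shape, with zero mean per group.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    # eps keeps the division finite when every completion in a group
    # receives the same reward (std == 0).
    return (rewards - mean) / (std + eps)
```

With binary correctness rewards, a group in which every sample is right (or every sample is wrong) yields all-zero advantages and contributes no gradient, which is one reason the diversity of sampled trajectories matters for this family of methods.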
Keyword: diffusion policy
There is no result