Generated: 2026-05-06 18:21:37 (UTC+8); arXiv publication time: 2026-05-06 20:00 EDT (2026-05-07 08:00 UTC+8)
27 relevant papers today
Keyword: reinforcement learning
An End-to-End Framework for Building Large Language Models for Software Operations
- Authors: Jingkai He, Pengfei Chen, Chenghui Wu, Shuang Liang, Ye Li, Gou Tan, Xiadao Wen, Chuanfu Zhang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.02906
- Pdf link: https://arxiv.org/pdf/2605.02906
- Abstract
In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate high-quality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to obtain a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on tasks of diverse difficulty demonstrate that OpsLLM effectively learns and aligns with the infused operational domain knowledge, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
- Authors: Kazuki Egashira, Mark Vero, Jasper Dekoninck, Florian E. Dorner, Robin Staab, Martin Vechev
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.02909
- Pdf link: https://arxiv.org/pdf/2605.02909
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs). While RLVR is designed for tasks with verifiable ground-truth answers, real-world verifiers (e.g., static code checkers) can introduce errors into the reward signal. Prior analyses have largely treated such errors as random and independent across samples, concluding that errors merely slow training with limited effect on final performance. However, practical verifiers tend to exhibit systematic errors. This introduces a risk of models learning unwanted consistent behavior from a structurally incorrect reward signal. In this work, we study the impact of such systematic verification errors on RLVR. Through controlled experiments on arithmetic tasks, we show that systematic false negatives lead to similar effects as random noise. On the other hand, systematic false positives can cause a wide range of behaviors from sub-optimal plateaus to performance collapse. Crucially, these outcomes are not determined by the overall error rate but by the specific pattern of introduced errors, making pre-hoc mitigation difficult. Our results show that, in contrast to prior conclusions, realistic verification errors can critically shape RLVR outcomes and that verifier quality has to be understood beyond its sample-level error rate.
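The contrast between random and systematic verifier errors can be illustrated with a toy addition task. This is only a sketch: the `last_digit_verifier` flaw, the `exploit_policy`, and the problem range are hypothetical choices made for illustration and are not taken from the paper's experiments.

```python
import random

def exact_verifier(a, b, answer):
    """Ground-truth check for an addition task."""
    return answer == a + b

def last_digit_verifier(a, b, answer):
    """Hypothetical flawed verifier: accepts any answer whose last digit is right.
    Correct answers always pass, so every error it makes is a false positive."""
    return answer % 10 == (a + b) % 10

def random_noise_verifier(a, b, answer, flip_prob, rng):
    """Flips the exact verdict independently with probability flip_prob."""
    verdict = exact_verifier(a, b, answer)
    return (not verdict) if rng.random() < flip_prob else verdict

def exploit_policy(a, b):
    """Degenerate policy that only computes the last digit of the sum."""
    return (a + b) % 10

rng = random.Random(0)
# Two-digit operands, so the true sum is always at least 20.
problems = [(rng.randint(10, 99), rng.randint(10, 99)) for _ in range(1000)]

true_acc = sum(exact_verifier(a, b, exploit_policy(a, b)) for a, b in problems) / len(problems)
flawed_reward = sum(last_digit_verifier(a, b, exploit_policy(a, b)) for a, b in problems) / len(problems)
print(f"true accuracy: {true_acc}, reward under flawed verifier: {flawed_reward}")
```

Under the systematic flaw, a policy can earn full reward while being wrong on every problem, which independent random flips cannot produce consistently.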
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
- Authors: Rohan Surana, Gagan Mundada, Xunyi Jiang, Chuhan Wang, Zhenwei Tang, Difan Jiao, Zihan Huang, Yuxin Xiong, Junda Wu, Sheldon Yu, Xintong Li, Raghav Jain, Nikki Kuang, Sizhe Zhou, Bowen Jin, Zhendong Chu, Tong Yu, Ryan Rossi, Kuan-Hao Huang, Jingbo Shang, Jiawei Han, Julian McAuley
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.02913
- Pdf link: https://arxiv.org/pdf/2605.02913
- Abstract
Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data the optimizer learns from, yet rollout design is often underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate-Filter-Control-Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes rollout trade-offs. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
Healthcare AI GYM for Medical Agents
- Authors: Minbyul Jeong
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.02943
- Pdf link: https://arxiv.org/pdf/2605.02943
- Abstract
Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on Healthcare AI GYM, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks, with an average improvement of +3.9 pp over the non-RL baseline, faster early convergence, controlled response length, and sustained multi-turn tool use.
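The "gradient-free EMA teacher" in the abstract above refers to a teacher whose weights track the student by exponential moving average rather than by backpropagation. A minimal sketch of the standard EMA update follows; the flat parameter lists and the `decay` value are assumptions for illustration, not details of TT-OPD.

```python
def ema_update(teacher, student, decay=0.99):
    """Standard EMA step: teacher <- decay * teacher + (1 - decay) * student.
    No gradients flow into the teacher; it is only ever overwritten by this rule."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Toy parameters (hypothetical): the teacher drifts toward a fixed student.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)
```

A larger `decay` makes the teacher change more slowly, which is the usual lever for trading stability against staleness.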
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
- Authors: Xin-Ye Li, Ren-Biao Liu, Yun-Ji Zhang, Hui Sun, Zheng Xie, Ming Li
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2605.02944
- Pdf link: https://arxiv.org/pdf/2605.02944
- Abstract
Reinforcement learning (RL) from unit-test feedback has become a standard post-training recipe for improving large language models (LLMs) on code generation. However, the pass-all-tests binary reward can be sparse, yielding no learning signal on challenging problems where none of the sampled solutions passes all tests. A common remedy is to use the test-case pass rate as a surrogate reward. In this work, we study pass-rate rewards in critic-free RL for code generation (e.g., GRPO and RLOO) and report a consistent pattern across base models and algorithms: despite alleviating reward sparsity, pass-rate rewards do not reliably improve final performance over binary rewards in rigorous controlled experiments. To understand this discrepancy, we analyze reward density and the resulting gradient directions. We find that pass-rate rewards are denser, but the induced gradient updates do not consistently move probability mass toward full-pass solutions. This arises because test-case pass rate is a miscalibrated surrogate for progress toward full correctness, and partial-pass solutions within the same group can induce conflicting gradient directions that cancel out. Overall, our results suggest that, in critic-free RL, pass-rate rewards are insufficient to improve code generation and motivate reward designs that better align optimization with the goal of full correctness.
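The sparsity issue described in the abstract above can be seen in a GRPO-style group-normalized advantage, where each sampled solution's reward is centered and scaled within its group. The sketch below uses that commonly described form; the reward values and `eps` are hypothetical, and it does not reproduce the paper's experiments.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled solutions to a hard problem; none passes all tests.
binary = [0.0, 0.0, 0.0, 0.0]        # pass-all-tests reward
pass_rate = [0.25, 0.75, 0.0, 0.5]   # fraction of tests passed

binary_adv = group_advantages(binary)
pass_rate_adv = group_advantages(pass_rate)
print(binary_adv)      # all zero: no learning signal for this group
print(pass_rate_adv)   # non-zero: denser signal, though not necessarily a useful one
```

The pass-rate group does produce non-zero advantages, but as the abstract notes, a denser signal does not by itself guarantee that updates move probability mass toward fully correct solutions.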
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
- Authors: Kunvar Thaman
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.02964
- Pdf link: https://arxiv.org/pdf/2605.02964
- Abstract
Language model agents trained with reinforcement learning (RL) and given tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent gaps across all four task families. We identify six exploit categories and find that 72% of reward hacking episodes include explicit chain-of-thought rationale, suggesting models often frame exploits as legitimate problem-solving. Simple environmental hardening reduces exploit rates by 5.7 percentage points (87.7% relative) without degrading task success. Models with near-zero exploit rates on standard tasks show elevated rates on harder variants, suggesting that production-aligned post-training suppresses reward hacking only below a complexity threshold where honest solutions remain tractable.
Joint Energy Management and Coordinated AIGC Workload Scheduling for Distributed Data Centers: A Diffusion-Aided Reward Shaping Approach
- Authors: Yang Fu, Peng Qin, Liming Chen, Zihao Zhang, Hao Yu, Yifei Wang
- Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2605.02965
- Pdf link: https://arxiv.org/pdf/2605.02965
- Abstract
Artificial intelligence-generated content (AIGC) has emerged as a transformative paradigm for automating the creation of diverse and customized content, giving rise to rapidly growing computational workloads in cloud data centers. It is imperative for AIGC service providers (ASPs) to strategically schedule AIGC workloads to reduce data center energy costs while guaranteeing high-quality content generation. However, the distinctive characteristics of AIGC services pose critical challenges, including model heterogeneity across ASPs, implicit service quality evaluation, and complex inference process control. To tackle these challenges, we propose a joint energy management and coordinated AIGC workload scheduling framework, which introduces an explicit mathematical characterization of service quality to promote both job transfer among ASPs and fine-grained inference process configuration. Moreover, various energy resources within data centers are jointly considered to enhance power usage flexibility. Subsequently, a system utility maximization problem is formulated to balance AIGC service revenue with operational penalties and costs. Nevertheless, the strong coupling among job scheduling decisions induces severe reward sparsity, which limits the effectiveness of existing deep reinforcement learning (DRL) algorithms. To address this issue, we develop a diffusion model-aided reward shaping approach to synthesize complementary reward signals through a multi-step denoising process. This approach is seamlessly integrated with DRL to enable efficient learning of scheduling policies under sparse environmental feedback. Experiments based on real-world models and datasets demonstrate that our scheme effectively accommodates electricity price fluctuations and AIGC model heterogeneity, while achieving superior learning convergence and system utility compared with benchmark methods.
Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation
- Authors: Jingchu Gai, Laixi Shi
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2605.03125
- Pdf link: https://arxiv.org/pdf/2605.03125
- Abstract
Multi-agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the environment deviates from the nominal model within an uncertainty set. Beyond robustness, an equally urgent goal for MARL is data efficiency -- sampling from vast state and action spaces that grow exponentially with the number of agents potentially leads to the curse of multiagency. However, current provably data-efficient algorithms for RMGs are limited to tabular settings with finite state and action spaces, which are only computationally manageable for small-scale problems, leaving RMGs with large-scale (or infinite) state spaces largely unexplored. The only existing work beyond tabular settings focuses on linear function approximation (LFA) for a restrictive class of RMGs using a vanishing minimal value assumption and still suffers from sample complexity with the curse of multiagency. In this work, we focus on general RMGs with LFA. For uncertainty sets defined by total variation distance, we develop provably data-efficient algorithms that break the curse of multiagency in both the generative model setting and a newly proposed online interactive setting. To our knowledge, our results are the first to break the curse of multiagency of sample complexity for RMGs with large (possibly infinite) state spaces, regardless of the uncertainty set construction.
MARS-DA: A Hierarchical Reinforcement Learning Framework for Risk-Aware Multi-Agent Bidding in Power Grids
- Authors: Jiayi Chen, Xuan Zhang, Guiling Wang
- Subjects: Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2605.03142
- Pdf link: https://arxiv.org/pdf/2605.03142
- Abstract
The increasing penetration of renewable energy has introduced substantial volatility into wholesale electricity markets, complicating the optimal bidding strategies for power producers. Traditional Reinforcement Learning (RL) approaches often struggle to balance profit maximization with risk management, frequently overfitting to specific market conditions or failing to account for the stochastic spread between Day-Ahead (DA) and Real-Time (RT) settlements. To address these challenges, this paper makes two primary contributions. First, we introduce and open-source a high-fidelity gymnasium environment for two-settlement electricity market bidding. Grounded in extensive empirical data from the PJM Interconnection, the environment explicitly models the interplay between DA commitments and RT deviations, providing a standardized testbed for general and risk-sensitive agents. Second, we propose MARS-DA (Multi-Agent Regime-Switching for Day-Ahead markets), a novel hierarchical framework that orchestrates distinct sub-policies for risk management and profit seeking. MARS-DA utilizes a top-level Meta-Controller to dynamically blend the actions of two specialized base agents: a "Safe Agent" that optimizes for reliable DA allocation and a "Speculator Agent" that targets volatile RT arbitrage opportunities. Extensive experiments demonstrate that MARS-DA achieves superior risk-adjusted returns compared to state-of-the-art baselines while maintaining robust regime alignment during periods of extreme market volatility.
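The abstract above describes a Meta-Controller that blends the actions of a "Safe Agent" and a "Speculator Agent". One simple way to picture such blending is a convex combination of the two action vectors. The sketch below assumes that form; the function name, the scalar `weight`, and the example bids are hypothetical and may differ from how MARS-DA actually combines actions.

```python
def blend_actions(safe_action, speculator_action, weight):
    """Convex combination of two action vectors.
    weight = 1.0 follows the safe agent entirely; weight = 0.0 follows the speculator."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * s + (1.0 - weight) * p
            for s, p in zip(safe_action, speculator_action)]

# Hypothetical bids as [day-ahead quantity, real-time quantity].
safe = [10.0, 0.0]
speculator = [2.0, 8.0]
blended = blend_actions(safe, speculator, weight=0.75)
print(blended)
```

In a learned system the weight would come from the meta-controller's policy given the current market state, rather than being fixed as it is here.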
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
- Authors: Spandan Garg, Vikram Nitin, Yufan Huang
- Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2605.03195
- Pdf link: https://arxiv.org/pdf/2605.03195
- Abstract
Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g. build logs, test results, etc.) within the subagent context. Typically, when agents employ subagents for such tasks, they use frontier models as these subagents. In this paper, we investigate whether a finetuned small language model (SLM) can achieve comparable performance to frontier models in the task of agentic terminal execution. We present Terminus-4B, a Qwen3-4B model post-trained via Supervised Finetuning (SFT) and Reinforcement Learning (RL) with a rubric-based LLM-as-judge reward, specifically for this task. In our extensive evaluation spanning various frontier models, training ablations and main agent configurations, we find that Terminus-4B is able to reduce the token usage of the main agent by up to ~30% compared to the No Subagent baseline with no impact on agent performance on benchmarks like SWE-Bench Pro and our internal SWE-Bench C# benchmark, which tends to be heavy in verbose execution tasks. Furthermore, Terminus-4B improves key metrics showing the main agent relying on the outputs of the subagent and doing fewer terminal execution tasks by itself. We see that our model not only closes the gap between the Vanilla Qwen model and frontier models like Claude Sonnet / Opus / GPT-5.3-Codex, but often even exceeds their performance.
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
- Authors: Hongbo Jin, Rongpeng Zhu, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03327
- Pdf link: https://arxiv.org/pdf/2605.03327
- Abstract
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long chain-of-thought generations. Furthermore, the standard unbounded Kullback-Leibler divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty.
QoS Assurance Mechanism for 5G Network Slicing Based on the Deep Reinforcement Learning PPO Algorithm
- Authors: Qingyang Li
- Subjects: Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2605.03345
- Pdf link: https://arxiv.org/pdf/2605.03345
- Abstract
With the increasing diversity of 5G service types and the intensifying dynamic fluctuations of network load, achieving differentiated quality of service assurance in a network slicing environment has become a key issue in resource management. To address this problem, this paper proposes a deep reinforcement learning mechanism for 5G network slicing quality of service assurance based on the traditional proximal policy optimization actor-critic framework. First, the slicing resource allocation is modeled as a constrained Markov decision process, jointly considering the collaborative optimization of bandwidth, computing, and wireless resources. Meanwhile, a graph attention network and bidirectional long short-term memory are introduced to extract topological correlations and temporal service features, combined with an adaptive Lagrangian penalty and dynamic reward shaping mechanism, to comprehensively optimize delay, throughput, reliability, fairness, and slice isolation performance. Experimental results show that the proposed method outperforms existing baseline models in terms of quality of service satisfaction rate, delay control, resource utilization, and convergence stability.
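The "adaptive Lagrangian penalty" for a constrained Markov decision process, as mentioned in the abstract above, is often implemented as a dual-ascent update on a non-negative multiplier. The sketch below shows that generic form only; the function names, `step_size`, and the example numbers are assumptions, and the paper's exact update rule is not given in the abstract.

```python
def update_multiplier(lmbda, observed_cost, cost_limit, step_size=0.5):
    """Dual ascent: raise the multiplier when the constraint is violated,
    lower it otherwise, and project back onto lambda >= 0."""
    return max(0.0, lmbda + step_size * (observed_cost - cost_limit))

def penalized_reward(reward, cost, lmbda):
    """Reward seen by the policy optimizer after the Lagrangian penalty."""
    return reward - lmbda * cost

# Hypothetical example: a latency-violation cost of 3.0 against a limit of 1.0.
lmbda = update_multiplier(0.0, observed_cost=3.0, cost_limit=1.0)
print(lmbda, penalized_reward(10.0, 2.0, lmbda))
```

The projection onto non-negative values keeps the penalty from turning into a bonus once the constraint is comfortably satisfied.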
Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control
- Authors: Ho Jae Lee, Yonghyeon Lee, Alexander Alexiev, Tzu-Yuan Lin, Se Hwan Jeon, Sangbae Kim
- Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2605.03363
- Pdf link: https://arxiv.org/pdf/2605.03363
- Abstract
In this work, we propose a hybrid hierarchical control framework for reactive dexterous grasping that explicitly decouples high-level spatial intent from low-level joint execution. We introduce a multi-agent reinforcement learning architecture, specialized into distinct arm and hand agents, that acts as a high-level planner by generating desired task-space velocity commands. These commands are then processed by a GPU-parallelized quadratic programming controller, which translates them into feasible joint velocities while strictly enforcing kinematic limits and collision avoidance. This structural isolation not only accelerates training convergence but also strictly enforces hardware safety. Furthermore, the architecture unlocks zero-shot steerability, allowing system operators to dynamically adjust safety margins and avoid dynamic obstacles without retraining the policy. We extensively validate the proposed framework through a rigorous simulation-to-reality pipeline. Real-world hardware experiments on a 7-DoF arm equipped with a 20-DoF anthropomorphic hand demonstrate highly robust zero-shot transferability for dexterous grasping to a diverse set of unseen objects, highlighting the system's ability to reactively recover from unexpected physical disturbances in unstructured environments.
Discovering Reinforcement Learning Interfaces with Large Language Models
- Authors: Akshat Singh Jaswal, Ashish Baghel, Paras Chopra
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03408
- Pdf link: https://arxiv.org/pdf/2605.03408
- Abstract
Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (Code available at this https URL), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design, as single-component optimization fails catastrophically on at least one domain in our evaluation suite.
Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits
通过变分量子电路实现的量子分层强化学习
- Authors: Yu-Ting Lee, Samuel Yen-Chi Chen, Fu-Chieh Chang
- Subjects: Subjects:
Machine Learning (cs.LG); Quantum Physics (quant-ph)
- Arxiv link: https://arxiv.org/abs/2605.03434
- Pdf link: https://arxiv.org/pdf/2605.03434
- Abstract
Reinforcement learning is one of the most challenging learning paradigms where efficacy and efficiency gains are extremely valuable. Hierarchical reinforcement learning is a variant that leverages temporal abstraction to structure decision-making. While parametrized quantum computations have shown success in non-hierarchical reinforcement learning, whether these advantages adapt to hierarchical decision-making remains a critical open question. In this work, we develop a hybrid hierarchical agent based on the option-critic architecture. This hybrid agent substitutes classical components with variational quantum circuits for feature extractors, option-value functions, termination functions, and intra-option policies. Evaluated on standard benchmarking environments, results show that a hybrid agent utilizing a quantum feature extractor can outperform classical baselines while saving up to 66% trainable parameters. We also identify an architectural bottleneck: quantum option-value estimation severely degrades performance. Further ablation studies reveal how architectural choices of the quantum circuits affect performance. Our work establishes design principles for parameter-efficient hybrid hierarchical agents.
- 中文摘要
强化学习是最具挑战性的学习范式之一,效能和效率的提升在其中极有价值。分层强化学习是一种利用时间抽象来组织决策的变体。尽管参数化量子计算在非分层强化学习中已取得成功,但这些优势能否迁移到分层决策仍是一个关键的悬而未决的问题。在本研究中,我们基于option-critic(选项-评论家)架构开发了一个混合分层智能体。该混合智能体用变分量子电路替代特征提取器、选项价值函数、终止函数和选项内策略中的经典组件。在标准基准环境中的评估结果显示,使用量子特征提取器的混合智能体能够超越经典基线,同时节省高达66%的可训练参数。我们还发现了一个架构瓶颈,即量子选项价值估计会严重降低性能。进一步的消融研究揭示了量子电路的架构选择如何影响性能。我们的工作确立了参数高效的混合分层智能体的设计原则。
FINER-SQL: Boosting Small Language Models for Text-to-SQL
FINER-SQL:提升文本转SQL的小型语言模型
- Authors: Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen
- Subjects: Subjects:
Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2605.03465
- Pdf link: https://arxiv.org/pdf/2605.03465
- Abstract
Large language models have driven major advances in Text-to-SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real-world applications. A natural alternative is to use small language models (SLMs), which enable efficient and private on-premise deployment. Yet, SLMs often struggle with weak reasoning and poor instruction following. Conventional reinforcement learning methods based on sparse binary rewards (0/1) provide little learning signal when the generated SQLs are incorrect, leading to unstable or collapsed training. To overcome these issues, we propose FINER-SQL, a scalable and reusable reinforcement learning framework that enhances SLMs through fine-grained execution feedback. Built on group relative policy optimization, FINER-SQL replaces sparse supervision with dense and interpretable rewards that offer continuous feedback even for incorrect SQLs. It introduces two key reward functions: a memory reward, which aligns reasoning with verified traces for semantic stability, and an atomic reward, which measures operation-level overlap to grant partial credit for structurally correct but incomplete SQLs. This approach transforms discrete correctness into continuous learning, enabling stable, critic-free optimization. Experiments on the BIRD and Spider benchmarks show that FINER-SQL achieves up to 67.73% and 85% execution accuracy with a 3B model, matching much larger LLMs while reducing inference latency to 5.57 s/sample. These results highlight a cost-efficient and privacy-preserving path toward high-performance Text-to-SQL generation. Our code is available at this https URL.
- 中文摘要
大型语言模型推动了文本转SQL生成的重大进展。然而,它们存在高计算成本、长延迟和数据隐私问题,使得它们在许多实际应用中不切实际。一个自然的替代方案是使用小型语言模型(SLM),它支持高效且私密的本地部署。然而,SLM常常面临推理能力弱和指令遵循不佳的问题。基于稀疏二元奖励(0/1)的传统强化学习方法在生成的SQL错误时几乎不提供学习信号,导致训练不稳定或崩溃。为克服这些问题,我们提出了FINER-SQL,一个可扩展且可复用的强化学习框架,通过细粒度执行反馈增强SLM。FINER-SQL基于组相对策略优化,用密集且可解释的奖励取代稀疏监督,即使SQL错误也能提供连续反馈。它引入了两个关键的奖励函数:记忆奖励,将推理与经过验证的轨迹对齐以实现语义稳定性;以及原子奖励,衡量操作层面的重叠,为结构正确但不完整的SQL给予部分得分。该方法将离散的正确性转化为连续的学习信号,实现稳定且无需评论家(critic-free)的优化。BIRD和Spider基准上的实验显示,FINER-SQL使用3B模型的执行准确率最高可达67.73%和85%,与规模大得多的LLM相当,同时将推理延迟降至每样本5.57秒。这些结果凸显了一条成本效益高且保护隐私的高性能文本转SQL生成路径。我们的代码可在此 https URL 访问。
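The FINER-SQL abstract describes an atomic reward based on operation-level overlap and training built on group relative policy optimization. The following is a minimal sketch of those two ideas, not the authors' implementation: the tokenizer `sql_atoms`, the Jaccard overlap in `atomic_reward`, and the normalization in `group_relative_advantages` are all illustrative assumptions.

```python
import re
from statistics import mean, pstdev

def sql_atoms(sql):
    """Crude, hypothetical 'atomic unit' extractor: lowercase identifiers,
    keywords, numbers, and comparison operators as a set of tokens."""
    return set(re.findall(r"[a-z_][a-z0-9_.]*|\d+|[<>=!*]+", sql.lower()))

def atomic_reward(pred_sql, gold_sql):
    """Partial credit in [0, 1]: Jaccard overlap between the two token sets
    (an assumed stand-in for operation-level overlap)."""
    pred, gold = sql_atoms(pred_sql), sql_atoms(gold_sql)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group (zero mean, unit std),
    so no learned critic is needed to form a baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

gold = "SELECT name FROM users WHERE age > 30"
group = ["SELECT name FROM users", "SELECT id FROM orders", gold]
rewards = [atomic_reward(sql, gold) for sql in group]
advantages = group_relative_advantages(rewards)
```

With a binary reward, the first two candidates would both score 0; here the structurally closer query receives more credit, which is the kind of dense signal the abstract argues for.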
MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models
MHPR:大型视觉-语言模型的多维人类感知与推理基准
- Authors: Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen
- Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03485
- Pdf link: https://arxiv.org/pdf/2605.03485
- Abstract
Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design (Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D)) together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.
- 中文摘要
多维人类理解对于电影分析和虚拟数字人等现实应用至关重要,但当前的LVLM基准大多聚焦于单一任务场景,缺乏细粒度、以人为中心的评估。在本研究中,我们提出了MHPR,这是一个面向以人为中心场景的联合感知-推理综合基准,涵盖个体、多人及人-物交互维度。MHPR包括多层次数据设计:带描述的原始数据(C-RD)、监督微调数据(SFT-D)、强化学习数据(RL-D)和测试数据(T-D),以及自动化描述/VQA生成流水线(ACVG),该流水线执行按类别的属性分解、属性特定重写和多模型投票,以确保标注高质量且可扩展。我们在细粒度属性(外观、服装、姿态、部位)和高层语义(社会关系、动作语义、空间关系、意图和功能)上评估最先进的视觉语言模型。我们的发现表明:1)格式对齐的SFT数据显著提升了指令遵循能力和稳定性;2)基于失败案例分析得到的、聚焦难点的强化学习数据进一步增强了对困难样本的感知和推理能力;3)用MHPR训练Qwen2.5-VL-7B可获得显著提升,与规模大得多的模型接近持平。我们发布ACVG和MHPR,以促进以人为中心的感知与推理方面可复现、可扩展的研究。
Real-Time Evaluation of Autonomous Systems under Adversarial Attacks
在对抗性攻击下自主系统进行实时评估
- Authors: Adithya Mohan, Xujun Xie, Venkatesh Thirugnana Sambandham, Torsten Schön
- Subjects: Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03491
- Pdf link: https://arxiv.org/pdf/2605.03491
- Abstract
Most evaluations of autonomous driving policies under adversarial conditions are conducted in simulation, due to cost efficiency and the absence of physical risk. However, purely virtual testing fails to capture structural inconsistencies, supervision constraints, and state-representation effects that arise in real-world data and fundamentally shape policy robustness. This work presents an offline trajectory-learning and adversarial robustness evaluation framework grounded in real-world intersection driving data. Within a controlled data contract, we train and compare three trajectory-learning paradigms: Multi-Layer Perceptron (MLP)-based Behavior Cloning (BC), Transformer-based object-tokenized BC, and inverse reinforcement learning (IRL) formulated within a Generative Adversarial Imitation Learning (GAIL) framework. Models are evaluated using Average Displacement Error (ADE) and Final Displacement Error (FDE). Inference-time robustness is assessed by subjecting trained policies to gradient-based adversarial perturbations across multiple intersection scenarios, yielding a structured robustness evaluation matrix. Results show that state-structure design and architectural inductive biases critically influence adversarial stability, leading to markedly different robustness profiles despite comparable nominal prediction accuracy (ADE < 0.08). Inference-time Projected Gradient Descent (PGD) attacks induce final displacement errors of up to approximately 8 meters. The proposed framework establishes a scalable benchmark for studying offline trajectory learning and adversarial robustness in real-world autonomous driving settings.
- 中文摘要
由于成本效益高且无物理风险,大多数对抗性条件下自动驾驶策略的评估都是在仿真中进行的。然而,纯虚拟测试无法捕捉真实世界数据中出现的结构性不一致、监督约束和状态表征效应,而这些因素从根本上影响策略的鲁棒性。本研究提出了一个基于真实路口驾驶数据的离线轨迹学习和对抗鲁棒性评估框架。在受控的数据契约下,我们训练并比较三种轨迹学习范式:基于多层感知器(MLP)的行为克隆(BC)、基于Transformer的对象标记化BC,以及在生成对抗模仿学习(GAIL)框架内构建的逆强化学习(IRL)。模型通过平均位移误差(ADE)和最终位移误差(FDE)进行评估。通过在多个路口场景中对训练好的策略施加基于梯度的对抗扰动来评估推理阶段的鲁棒性,从而得到结构化的鲁棒性评估矩阵。结果显示,状态结构设计和架构归纳偏置对对抗稳定性有关键影响,尽管标称预测精度相当(ADE < 0.08),鲁棒性表现却显著不同。推理阶段的投影梯度下降(PGD)攻击可导致最高约8米的最终位移误差。该框架为研究真实世界自动驾驶环境中的离线轨迹学习和对抗鲁棒性建立了可扩展的基准。
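The two metrics named in this abstract have standard definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all time steps, and FDE is that distance at the final step. A small sketch follows; the function name `displacement_errors` and the toy trajectories are illustrative and not taken from the paper.

```python
import math

def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length trajectories of (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between prediction and ground truth.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average over the horizon
    fde = distances[-1]                    # error at the final step only
    return ade, fde

# Toy example: the prediction drifts away from the true path over time.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, true)
```

Because FDE looks only at the last step, a perturbation that compounds over the horizon can leave ADE small while FDE grows large.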
Vanishing L2 regularization for the softmax Multi Armed Bandit
softmax多臂老虎机的消失L2正则化
- Authors: Stefana-Lucia Anita, Gabriel Turinici
- Subjects: Subjects:
Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2605.03752
- Pdf link: https://arxiv.org/pdf/2605.03752
- Abstract
Multi Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and has served as the foundation for downstream algorithms, including REINFORCE. Distinct from vanilla approaches, we consider here the L2 regularized softmax policy gradient where a quadratic term is subtracted from the mean reward. Previous studies exploiting convexity failed to identify a suitable theoretical framework to analyze its convergence when the regularization parameter vanishes. We prove here theoretical convergence results and confirm empirically that this regime makes the L2 regularization numerically advantageous on standard benchmarks.
- 中文摘要
多臂老虎机(MAB)算法是强化学习的基石,在理论和数值上都已得到研究。最常用的实现之一使用softmax映射来给出最优策略,并成为包括REINFORCE在内的下游算法的基础。与普通方法不同,本文考虑L2正则化的softmax策略梯度,即从平均奖励中减去一个二次项。此前利用凸性的研究未能找到合适的理论框架来分析正则化参数趋于零时的收敛性。我们在此证明了理论收敛结果,并通过实验证实,在这一情形下L2正则化在标准基准上具有数值优势。
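To make the objective in this abstract concrete, the sketch below runs exact gradient ascent on J(θ) = Σ_i π_i(θ) r_i − (λ/2)‖θ‖², with π = softmax(θ) and a regularization weight that decays over time. The quadratic penalty on θ, the schedule λ_t = λ_0 / t, the step size, and the use of exact rather than sampled gradients are all assumptions made for illustration; the paper's precise setting may differ.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def regularized_gradient(theta, rewards, lam):
    """Gradient of sum_i pi_i * r_i - (lam / 2) * ||theta||^2 w.r.t. theta."""
    pi = softmax(theta)
    mean_reward = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - mean_reward) - lam * t
            for p, r, t in zip(pi, rewards, theta)]

def train(rewards, steps=2000, lr=0.5, lam0=0.1):
    """Gradient ascent with a vanishing regularization weight (assumed schedule)."""
    theta = [0.0] * len(rewards)
    for t in range(1, steps + 1):
        lam = lam0 / t  # regularization parameter tends to zero
        grad = regularized_gradient(theta, rewards, lam)
        theta = [th + lr * g for th, g in zip(theta, grad)]
    return softmax(theta)

policy = train([0.2, 0.5, 0.9])  # expected rewards of three arms
```

The term `p * (r - mean_reward)` is the usual softmax policy gradient for a bandit; the `- lam * t` term pulls the logits back toward zero, which keeps the policy from saturating early while λ is still large.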
What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
你所想即所见:通过视觉语言好奇心驱动VLM代理的探索
- Authors: Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han, Sikai Bai, Jingcai Guo, Jie Zhang, Song Guo
- Subjects: Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03782
- Pdf link: https://arxiv.org/pdf/2605.03782
- Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning "what the agent thinks" with "what the agent sees" is key to solving complex or sparse agentic tasks.
- 中文摘要
为了在部分可观察的视觉环境中导航,近期的VLM智能体越来越多地通过显式CoT推理将世界建模能力内化到策略中,使其能够在行动前在心中模拟未来。然而,仅依赖对已访问状态的被动推理对于稀疏奖励任务来说是不够的,因为它缺乏主动揭示“已知的未知”的认知驱动力,而这正是稳健泛化所必需的。我们提出的问题是:VLM智能体能否通过好奇心驱动的探索,主动寻找能够挑战并完善其内部世界模型的信号?在本研究中,我们提出了GLANCE这一统一框架,通过将智能体的语言世界模型锚定到不断演化的目标网络的稳定视觉表征中,桥接推理与探索。关键在于,GLANCE利用语言预测与视觉现实之间的差异,作为强化学习中的内在好奇心信号,引导智能体主动探索其内部模型不确定的区域。在一系列智能体任务上的大量实验展示了GLANCE的有效性,并证明将“智能体所想”与“智能体所见”对齐,是解决复杂或稀疏奖励智能体任务的关键。
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
自然语言处理:从分词到RLHF的全面实用指南
- Authors: Mullosharaf K. Arabov
- Subjects: Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.03799
- Pdf link: https://arxiv.org/pdf/2605.03799
- Abstract
This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. Twelve hands-on sessions combine concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. The material is enriched by original research on low-resource languages, incorporating linguistic resources for Tajik and Tatar (subword tokenisers, embeddings, lexical databases, and transliteration benchmarks), demonstrating how modern NLP can be adapted to data-scarce environments. Designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.
- 中文摘要
本预印本呈现了系统化、以研究为导向的实践,引导读者了解整个现代自然语言处理流程:从分词化和向量化,到大型语言模型的微调、检索增强生成,以及从人类反馈中进行强化学习。十二场实践课程结合了简明的理论、详细的实施计划、正式的评估指标和透明的评估标准。这部作品并非传统的教材:它被设计为可重复的研究成果,每次会议都需要在公共仓库中发布代码、模型和报告。所有实验均在一个不断演变的语料库上进行,研究主张开放权重模型而非商业API,特别关注Hugging Face生态系统。材料通过对低资源语言的原创研究丰富,融入了塔吉克语和鞑靼语的语言资源(子词分词器、嵌入、词汇数据库和音译基准),展示了现代自然语言处理如何适应数据稀缺的环境。面向高年级本科生、研究生及在职开发者,帮助他们实现、比较和部署从经典机器学习到最先进的基于LLM的系统。
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
SOAR:机器人移动履约系统中订单分配和机器人调度的实时联合优化
- Authors: Yibang Tang, Yifan Yang, Jingyuan Wang, Junhua Chen, Zhen Zhao
- Subjects: Subjects:
Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.03842
- Pdf link: https://arxiv.org/pdf/2605.03842
- Abstract
Robotic Mobile Fulfillment Systems (RMFS) rely on mobile robots for automated inventory transportation, coordinating order allocation and robot scheduling to enhance warehousing efficiency. However, optimizing RMFS is challenging due to strict real-time constraints and the strong coupling of multi-phase decisions. Existing methods either decompose the problem into isolated sub-tasks to guarantee responsiveness at the cost of global optimality, or rely on computationally expensive global optimization models that are unsuitable for dynamic industrial environments. To bridge this gap, we propose SOAR, a unified Deep Reinforcement Learning framework for real-time joint optimization. SOAR transforms order allocation and robot scheduling into a unified process by utilizing soft order allocations as observations. We formulate this as an Event-Driven Markov Decision Process, enabling the agent to perform simultaneous scheduling in response to asynchronous system events. Technically, we employ a Heterogeneous Graph Transformer to encode the warehouse state and integrate phased domain knowledge. Additionally, we incorporate a reward shaping strategy to address sparse feedback in long-horizon tasks. Extensive experiments on synthetic and real-world industrial datasets, in collaboration with Geekplus, demonstrate that SOAR reduces global makespan by 7.5% and average order completion time by 15.4% with sub-100ms latency. Furthermore, sim-to-real deployment confirms its practical viability and significant performance gains in production environments. The code is available at this https URL.
- 中文摘要
机器人移动履约系统(RMFS)依赖移动机器人进行自动化库存运输,通过协调订单分配和机器人调度来提升仓储效率。然而,由于严格的实时约束和多阶段决策的强耦合,优化RMFS具有挑战性。现有方法要么将问题分解为孤立的子任务以保证响应性,但代价是牺牲全局最优性;要么依赖计算开销大、不适合动态工业环境的全局优化模型。为弥合这一差距,我们提出了SOAR,一个用于实时联合优化的统一深度强化学习框架。SOAR将软订单分配作为观测,从而把订单分配和机器人调度转化为统一的过程。我们将其建模为事件驱动的马尔可夫决策过程,使智能体能够响应异步系统事件进行同步调度。技术上,我们采用异构图Transformer编码仓库状态并整合分阶段的领域知识。此外,我们还引入了奖励塑造策略,以应对长时程任务中反馈稀疏的问题。与Geekplus合作开展的、在合成和真实工业数据集上的大量实验显示,SOAR将全局完工时间(makespan)缩短7.5%,平均订单完成时间缩短15.4%,延迟低于100毫秒。此外,仿真到现实的部署验证了其在生产环境中的实用性和显著的性能提升。代码可在该 https URL 访问。
SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision
SigLoMa:从自我中心视角学习开放世界四足行走操控
- Authors: Shiyi Chen, Haiyi Liu, Mingye Yang, Jiaqi Zhang, Debing Zhang
- Subjects: Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2605.03846
- Pdf link: https://arxiv.org/pdf/2605.03846
- Abstract
Designing an open-world quadrupedal loco-manipulation system is highly challenging. Traditional reinforcement learning frameworks utilizing exteroception often suffer from extreme sample inefficiency and massive sim-to-real gaps. Furthermore, the inherent latency of visual tracking fundamentally conflicts with the high-frequency demands of precise floating-base control. Consequently, existing systems lean heavily on expensive external motion capture and off-board computation. To eliminate these dependencies, we present SigLoMa, a fully onboard, ego-centric vision-based pick-and-place framework. At the core of SigLoMa is the introduction of Sigma Points, a lightweight geometric representation for exteroception that guarantees high scalability and native sim-to-real alignment. To bridge the frequency divide between slow perception and fast control, we design an ego-centric Kalman Filter to provide robust, high-rate state estimation. On the learning front, we alleviate sample inefficiency via an Active Sampling Curriculum guided by Hint Poses, and tackle the robot's structural visual blind spots using temporal encoding coupled with simulated random-walk drift. Real-world experiments validate that, relying solely on a 5Hz (200 ms latency) open-vocabulary detector, SigLoMa successfully executes dynamic loco-manipulation across multiple tasks, achieving performance comparable to expert human teleoperation.
- 中文摘要
设计一个开放世界的四足移动操作(loco-manipulation)系统极具挑战性。传统的基于外感知的强化学习框架常常存在样本效率极低和仿真到现实差距巨大的问题。此外,视觉跟踪固有的延迟与精确浮动基座控制的高频需求存在根本冲突。因此,现有系统严重依赖昂贵的外部动作捕捉和机外计算。为了消除这些依赖,我们提出了SigLoMa,一个完全机载、基于自我中心视觉的抓取与放置框架。SigLoMa的核心是引入Sigma Points,这是一种用于外感知的轻量级几何表示,保证高可扩展性和原生的仿真到现实对齐。为了弥合慢速感知与快速控制之间的频率差距,我们设计了自我中心卡尔曼滤波器,以提供稳健的高频状态估计。在学习方面,我们通过由提示姿态(Hint Poses)引导的主动采样课程缓解样本效率低的问题,并利用时间编码结合模拟的随机游走漂移来应对机器人的结构性视觉盲区。真实世界实验验证,仅依靠5Hz(200毫秒延迟)的开放词汇检测器,SigLoMa就能在多项任务中成功执行动态移动操作,达到与人类专家遥操作相当的性能。
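The SigLoMa abstract mentions a Kalman filter that bridges a slow (5 Hz) detector and a fast controller. The general pattern is to run the prediction step at the control rate and apply the correction step only when a detection arrives. The one-dimensional sketch below illustrates that pattern under strong simplifying assumptions: a scalar target offset, known ego velocity, noise-free and latency-free detections, and hypothetical noise variances. It is not the paper's filter.

```python
class ScalarKalmanFilter:
    """1-D Kalman filter for a target offset expressed in the robot's frame."""

    def __init__(self, x0, p0, process_var, meas_var):
        self.x = x0                    # estimated offset to the target (m)
        self.p = p0                    # variance of that estimate
        self.process_var = process_var # assumed process noise per second
        self.meas_var = meas_var       # assumed detector noise variance

    def predict(self, ego_velocity, dt):
        # Moving toward the target shrinks the offset; uncertainty grows.
        self.x -= ego_velocity * dt
        self.p += self.process_var * dt

    def update(self, measurement):
        # Standard scalar Kalman correction.
        gain = self.p / (self.p + self.meas_var)
        self.x += gain * (measurement - self.x)
        self.p *= 1.0 - gain

def simulate(steps=100, control_hz=50, detector_every=10):
    """Predict at 50 Hz, correct at 5 Hz (every 10th control step)."""
    dt = 1.0 / control_hz
    ego_velocity = 0.5   # m/s, assumed known from odometry
    true_offset = 2.0    # target starts 2 m ahead
    kf = ScalarKalmanFilter(x0=0.0, p0=1.0, process_var=0.1, meas_var=0.01)
    for step in range(1, steps + 1):
        true_offset -= ego_velocity * dt
        kf.predict(ego_velocity, dt)
        if step % detector_every == 0:
            kf.update(true_offset)  # idealized detection, for determinism
    return kf.x, true_offset

estimate, truth = simulate()
```

Between detections the controller still receives a fresh estimate every 20 ms, propagated from the last correction by the motion model.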
Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence
机械良知:机器智能可靠性的数学框架
- Authors: Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin
- Subjects: Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2605.03847
- Pdf link: https://arxiv.org/pdf/2605.03847
- Abstract
Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.
- 中文摘要
分布式协作智能(DCI)涵盖边到边架构、联邦学习、迁移学习和群体系统,它创造了一种涌现风险在结构上不可避免的环境:各个智能体在局部正确的决策,在不确定性下会组合成全局不可接受的行为轨迹。现有方法如约束优化、安全强化学习和运行时保障,是在单个动作层面而非跨行为轨迹评估可接受性,且无一方法处理DCI部署中多参与者且充满不确定性的特性。本文提出了机械良知(MC),这是一个新颖的概念和简化的数学框架,为单智能体和分布式智能系统实现轨迹级的规范性调控。机械良知被定义为一种监督过滤器,它在考虑认知不确定性的情况下,最小化地修正基线策略的动作,以减少与规范可容许区域的累积偏差。我们引入了相关构念,即良知评分、机械内疚和共振可靠性,为这一新兴领域提供了可解释的词汇和可计算的治理信号。本文确立了核心理论性质:可容许性等价、最优调控的存在性以及偏差的单调减少。示例性结果表明,在传统控制器漂移出可容许范围的情况下,受MC调控的智能体能够保持轨迹级的规范可接受性,且该框架可自然扩展,以抑制多智能体DCI环境中由交互引发的涌现风险。
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
正确还不够:用执行者为基础的奖励培训推理规划师
- Authors: Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su
- Subjects: Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.03862
- Pdf link: https://arxiv.org/pdf/2605.03862
- Abstract
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
- 中文摘要
带有可验证奖励的强化学习已成为大型语言模型中提升显性推理的常见方式,但仅凭最终答案的正确性并不能揭示推理痕迹是否忠实、可靠或对所消费模型有用。这种仅结果信号可能强化错误理由而正确的痕迹,通过奖励捷径夸大推理收益,并在多步系统中传播有缺陷的中间状态。为此,我们提出了TraceLift,一种计划者-执行者培训框架,将推理视为可消耗的中间产物。在规划师培训期间,规划师会发出标记推理。冻结执行者将此推理转化为验证者反馈的最终工件,而基于执行者的奖励则塑造中间轨迹。该奖励将基于评分标准的推理奖励模型(RM)分数乘以同一冻结执行者的测量提升,从而赋予高质量且有用的痕迹。为了使推理质量可直接学习,我们引入了 TRACELIFT-GROUPS,这是一个基于数学和代码种子问题构建的带标注的仅理由数据集。每个例子都是同一问题组,包含高质量的参考迹和多个具有局部扰动的合理缺陷迹,这些扰动降低推理质量或解的支持,同时保持任务相关性。大量代码和数学基准测试实验表明,这种基于执行者的推理奖励改进了两阶段的计划者-执行者系统,而非仅执行的训练,表明推理监督不仅应评估痕迹的效果是否良好,还应评估其是否有助于消耗该记录的模型。
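The TraceLift abstract states that the planner's reward multiplies a rubric-based reasoning score by the measured uplift on a frozen executor. A minimal sketch of that product follows. How uplift is measured is not specified in the abstract, so the definition used here (executor pass rate with the trace minus pass rate without it, clipped at zero) and all parameter names are assumptions.

```python
def executor_grounded_reward(rm_score, pass_rate_with_trace, pass_rate_without_trace):
    """Product of trace quality and trace usefulness.

    rm_score: rubric-based reasoning score in [0, 1].
    pass_rate_*: verifier pass rates of the frozen executor, in [0, 1].
    """
    # Assumed uplift definition: improvement over the no-trace baseline,
    # clipped so that harmful traces earn zero rather than negative reward.
    uplift = max(0.0, pass_rate_with_trace - pass_rate_without_trace)
    return rm_score * uplift

# A well-written trace that does not help the executor earns nothing,
# and a helpful but poorly reasoned trace is discounted by its RM score.
helpful_and_good = executor_grounded_reward(0.8, 0.75, 0.25)
good_but_useless = executor_grounded_reward(0.9, 0.25, 0.25)
harmful = executor_grounded_reward(0.9, 0.2, 0.5)
```

The multiplicative form means both factors must be non-zero for a trace to be credited, which matches the abstract's goal of rewarding traces that are both high-quality and useful.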
Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning
通过强化学习减少Rust程序静态内存安全分析中的误报
- Authors: P Akilesh, Leuson Da Silva, Foutse Khomh, Sridhar Chimalakonda
- Subjects: Subjects:
Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2605.04000
- Pdf link: https://arxiv.org/pdf/2605.04000
- Abstract
Static analysis tools are essential for ensuring memory safety in Rust programs, particularly as Rust gains adoption in safety-critical domains. However, existing tools such as Rudra and MirChecker suffer from high false positive rates, which diminish developer trust, increase manual review effort, and may obscure genuine vulnerabilities. This paper presents a novel reinforcement learning (RL)-based approach for automatically classifying and suppressing spurious warnings in static memory safety analysis for Rust. To achieve this, we design an RL agent that learns a warning suppression policy by extracting contextual features from Rust's Mid-level Intermediate Representation (MIR) and optimizing its decisions through interaction with static analysis outputs. To improve decision quality, we integrate dynamic validation via cargo-fuzz as an auxiliary feedback mechanism, allowing the agent to selectively validate suspicious warnings through targeted fuzz testing. Our evaluation shows that the proposed approach significantly outperforms state-of-the-art LLM-based baselines, achieving 65.2% accuracy and an F1 score of 0.659, an improvement of 17.1% over the best LLM baseline. With a recall of 74.6%, our method successfully identifies nearly three-quarters of true bugs while substantially reducing false positives, improving precision from 25.6% in raw Rudra output to 59.0%. Incorporating dynamic fuzzing further boosts performance, yielding additional improvements of 10.7 percentage points in accuracy and 8.6 percentage points in F1 score over the RL-only variant. Overall, our work demonstrates that combining reinforcement learning with hybrid static-dynamic analysis can substantially reduce false positives and improve the practical usability of memory safety verification tools for Rust.
- 中文摘要
静态分析工具对于确保Rust程序中的内存安全至关重要,尤其是在Rust在安全关键领域日益普及之际。然而,现有工具如Rudra和MirChecker存在高误报率,这会降低开发者信任,增加人工审核工作量,并可能掩盖真实漏洞。本文提出了一种基于强化学习(RL)的新方法,用于在Rust静态内存安全分析中自动分类和抑制虚假警告。为此,我们设计了一个强化学习智能体,它从Rust的中级中间表示(MIR)中提取上下文特征,并通过与静态分析输出交互来优化决策,从而学习警告抑制策略。为提升决策质量,我们引入基于cargo-fuzz的动态验证作为辅助反馈机制,使智能体能够通过有针对性的模糊测试选择性地验证可疑警告。我们的评估显示,所提方法显著优于最先进的基于LLM的基线,准确率达到65.2%,F1分数为0.659,比最佳LLM基线提升17.1%。我们的方法召回率为74.6%,成功识别了近四分之三的真实缺陷,同时大幅减少了误报,将精确率从原始Rudra输出的25.6%提升至59.0%。加入动态模糊测试进一步提升了性能,相比仅使用RL的版本,准确率再提升10.7个百分点,F1分数再提升8.6个百分点。总体而言,我们的研究表明,将强化学习与混合静态-动态分析结合,可以大幅减少误报,并提升Rust内存安全验证工具的实用性。
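As a quick consistency check on the figures quoted in this abstract, F1 is the harmonic mean of precision and recall. Plugging in the reported precision (59.0%) and recall (74.6%) reproduces the reported F1 of 0.659, assuming all three numbers describe the same configuration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reported_f1 = round(f1_score(0.590, 0.746), 3)
print(reported_f1)  # 0.659
```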
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
OpenSeeker-v2:推动搜索代理的极限,提供信息丰富且高难度的轨迹
- Authors: Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu, Yuzhu Cai, Siheng Chen
- Subjects: Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2605.04036
- Pdf link: https://arxiv.org/pdf/2605.04036
- Abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
- 中文摘要
深度搜索能力已成为前沿大型语言模型(LLM)代理不可或缺的能力,但其开发仍由工业巨头主导。典型的行业流程涉及一个高度资源密集的流程,涵盖预培训、持续预培训(CPT)、监督微调(SFT)和强化学习(RL)。本报告表明,当以信息丰富且高难度的轨迹为动力时,简单的SFT方法对于训练前沿搜索代理来说可能非常强大。通过引入三项简单的数据综合修改:扩展知识图谱规模以丰富探索、扩展工具集以实现更广泛的功能,以及严格的低步骤过滤,我们建立了更强的基线。我们的OpenSeeker-v2仅基于10.6k数据点训练,在4个基准测试(30B级代理,采用ReAct范式)上实现了最先进的性能:BrowseComp 46.0%,BrowseComp-ZH 58.1%,Humanity's Last Exam 34.6%,xbench 78.0%,甚至超过了采用大量 CPT+SFT+RL 流水线训练的 Tongyi DeepResearch,分别达到 43.4%、46.7%、32.9% 和 75.0%。值得注意的是,OpenSeeker-v2 代表了其模型规模和范式内首个由纯学术团队仅使用 SFT 开发的最先进搜索代理。我们很高兴开源OpenSeeker-v2模型权重,并分享我们简单而有效的发现,使前沿搜索代理研究对社区更易获得。
Keyword: diffusion policy
There is no result