Generated: 2026-02-13 16:48:19 (UTC+8); arXiv release: 2026-02-13 20:00 EST (2026-02-14 09:00 UTC+8)
54 relevant papers today
Keyword: reinforcement learning
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
大型语言模型对齐的机制性可解释性:进展、挑战与未来方向
- Authors: Usman Naseem
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11180
- Pdf link: https://arxiv.org/pdf/2602.11180
- Abstract
Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.
- 中文摘要
大型语言模型(LLM)在多样化任务中展现出显著能力,但其内部决策过程在很大程度上仍不透明。机制性可解释性(即系统性研究神经网络如何通过其学习到的表征和计算结构实现算法)已成为理解和对齐这些模型的关键研究方向。本文综述了机制性可解释性技术在LLM对齐中的最新进展,涵盖回路发现、特征可视化、激活引导及因果干预等方法。我们分析了可解释性洞察如何影响包括基于人类反馈的强化学习(RLHF)、宪法式人工智能和可扩展监督在内的对齐策略。我们指出了主要挑战,包括叠加假设、神经元的多义性以及解读大规模模型中涌现行为的困难。我们提出未来的研究方向,聚焦于自动化可解释性、回路的跨模型泛化,以及开发可扩展至前沿模型的可解释性驱动对齐技术。
TDPNavigator-Placer: Thermal- and Wirelength-Aware Chiplet Placement in 2.5D Systems Through Multi-Agent Reinforcement Learning
TDPNavigator-Placer:通过多智能体强化学习实现2.5D系统中热感知与线长感知的芯粒布局
- Authors: Yubo Hou, Furen Zhuang, Partha Pratim Kundu, Sezin Ata Kircali, Jie Wang, Mihai Dragos Rotaru, Dutta Rahul, Ashish James
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11187
- Pdf link: https://arxiv.org/pdf/2602.11187
- Abstract
The rapid growth of electronics has accelerated the adoption of 2.5D integrated circuits, where effective automated chiplet placement is essential as systems scale to larger and more heterogeneous chiplet assemblies. Existing placement methods typically focus on minimizing wirelength or transforming multi-objective optimization into a single objective through weighted sum, which limits their ability to handle competing design requirements. Wirelength reduction and thermal management are inherently conflicting objectives, making prior approaches inadequate for practical deployment. To address this challenge, we propose TDPNavigator-Placer, a novel multi-agent reinforcement learning framework that dynamically optimizes placement based on chiplet's thermal design power (TDP). This approach explicitly assigns these inherently conflicting objectives to specialized agents, each operating under distinct reward mechanisms and environmental constraints within a unified placement paradigm. Experimental results demonstrate that TDPNavigator-Placer delivers a significantly improved Pareto front over state-of-the-art methods, enabling more balanced trade-offs between wirelength and thermal performance.
- 中文摘要
电子技术的快速发展加速了2.5D集成电路的采用;随着系统向更大、更异构的芯粒组合扩展,有效的自动芯粒布局至关重要。现有的布局方法通常侧重于最小化线长,或通过加权和将多目标优化转化为单一目标,这限制了它们应对相互竞争的设计需求的能力。线长缩减和热管理本质上是相互冲突的目标,使得以往的方法难以用于实际部署。为应对这一挑战,我们提出了TDPNavigator-Placer,一种新型多智能体强化学习框架,基于芯粒的热设计功耗(TDP)动态优化布局。该方法明确将这些本质上相互冲突的目标分配给专门的智能体,每个智能体在统一的布局范式内,各自在不同的奖励机制和环境约束下运作。实验结果表明,TDPNavigator-Placer相比最先进方法取得了显著改进的帕累托前沿,实现了线长与热性能之间更平衡的权衡。
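The abstract above contrasts weighted-sum scalarization with a Pareto front over wirelength and thermal performance. The following is a minimal sketch of how a Pareto front can be extracted from candidate placements; it is not the paper's method, and the function name `pareto_front` and the candidate numbers are hypothetical values chosen only for illustration.

```python
def pareto_front(points):
    """Return the points that no other point dominates (both objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse in every objective and better in at least one.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (wirelength, peak temperature) pairs for five candidate placements.
candidates = [(10.0, 95.0), (12.0, 80.0), (15.0, 78.0), (13.0, 90.0), (18.0, 79.0)]
front = pareto_front(candidates)
print(front)
```

A single weighted sum would return only one of these trade-off points per weight setting, whereas the front keeps every non-dominated placement for the designer to choose from.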
When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
何时以及该问什么:AskBench 和评分标准引导的 RLVR 用于 LLM 澄清
- Authors: Jiale Zhao, Ke Fang, Lu Cheng
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11199
- Pdf link: https://arxiv.org/pdf/2602.11199
- Abstract
Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
- 中文摘要
大型语言模型(LLM)常常在提示遗漏关键细节或包含误导性信息时仍然作答,导致幻觉或强化错误认知。我们研究如何评估并提升LLM在不牺牲任务表现的情况下决定何时澄清以及澄清什么的能力。我们提出了AskBench,一个交互式基准,将标准问答对转换为带有明确检查点的多轮交互。统一的评判循环评估最终答案,并根据需要模拟用户的回答。AskBench 涵盖两种设置:AskMind,包含需要澄清的意图缺失查询;以及 AskOverconfidence,包含必须识别并纠正的错误前提的查询。我们进一步提出了评分标准引导的、基于验证器奖励的强化学习(RLVR),该方法利用结构化评分标准鼓励有针对性的澄清。实验显示准确性、评分标准遵循率和交互效率均持续提升,且对未见领域具有很强的泛化能力。
SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents
SWE-MiniSandbox:构建软件工程代理的无容器强化学习
- Authors: Danlong Yuan, Wei Wu, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao
- Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11210
- Pdf link: https://arxiv.org/pdf/2602.11210
- Abstract
Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.
- 中文摘要
强化学习(RL)已成为培训软件工程(SWE)代理的关键范式,但现有的流水线通常依赖每个任务的容器进行隔离。在大规模情况下,预构建的容器镜像会产生较大的存储开销、环境设置缓慢,并且需要容器管理权限。我们提出了SWE-MiniSandbox,一种轻量化、无容器的方法,能够在不牺牲隔离性的前提下,实现软件智能体的可扩展强化学习训练。SWE-MiniSandbox 不再依赖每个实例容器,而是在由内核级机制支持的隔离工作区中执行每个任务,从而大幅降低系统开销。它利用轻量级环境预缓存技术,消除了庞大容器图像的需求。因此,我们的方法将磁盘使用率降低到容器类管道的约5%,并将环境准备时间缩短到容器基线的约25%。实证结果表明,SWE-MiniSandbox的评估性能可与标准基于容器的管道相媲美。通过消除对重型容器基础设施的依赖,SWE-MiniSandbox 为基于强化学习的 SWE 代理扩展提供了实用且易用的基础,尤其是在资源有限的研究环境中。
Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
修补分布不匹配:用于稳定离策略SFT的强化学习重写智能体
- Authors: Jiacheng Wang, Ping Jian, Zhen Yang, Zirong Chen, Keren Liao, Zhongbin Guo
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11220
- Pdf link: https://arxiv.org/pdf/2602.11220
- Abstract
Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model's natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone's QA-style generation distribution while preserving diversity. Since distributional alignment, diversity and task consistency are automatically evaluable but difficult to optimize end-to-end with differentiable objectives, we leverage reinforcement learning to optimize the rewrite distribution under reward feedback and propose an RL-based data-rewriting agent. The agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, thereby constructing a higher-quality rewritten dataset for downstream SFT. Extensive experiments show that our method achieves downstream gains comparable to standard SFT while reducing forgetting on non-downstream benchmarks by 12.34% on average. Our code is available at this https URL .
- 中文摘要
大型语言模型(LLM)取得了快速进展,但将其适配到下游场景通常仍依赖监督微调(SFT)。当下游数据与模型先前的训练分布存在显著偏移时,SFT可能引发灾难性遗忘。为缩小这一差距,已有工作提出数据重写这一以数据为中心的方法,在SFT之前重写下游训练数据。然而,现有方法通常从提示诱导的条件分布中采样重写结果,因此所得目标不一定与模型自然的问答(QA)式生成分布一致。此外,依赖固定模板可能导致多样性崩溃。为解决这些问题,我们将数据重写视为策略学习问题,学习一个更贴合骨干模型QA式生成分布、同时保持多样性的重写策略。由于分布对齐、多样性和任务一致性可以自动评估,但难以用可微目标进行端到端优化,我们利用强化学习在奖励反馈下优化重写分布,并提出基于强化学习的数据重写智能体。该智能体在严格的任务一致性门控下联合优化QA式分布对齐和多样性,从而为下游SFT构建更高质量的重写数据集。大量实验表明,我们的方法取得了与标准SFT相当的下游增益,同时将非下游基准上的遗忘平均减少12.34%。我们的代码可在此 https URL 获取。
Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization
通过行为代理优化推动主动代理的帕累托前沿
- Authors: Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11351
- Pdf link: https://arxiv.org/pdf/2602.11351
- Abstract
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns, enabling efficient task completion beyond passive instruction following and making them essential for real-world, user-centric applications. Agentic reinforcement learning (RL) has recently emerged as a promising solution for training such agents in multi-turn settings, allowing interaction strategies to be learned from feedback. However, existing pipelines face a critical challenge in balancing task performance with user engagement: passive agents cannot efficiently adapt to users' intentions, while overuse of human feedback reduces user satisfaction. To address this trade-off, we propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities with behavior regularization to suppress inefficient or redundant interactions and align agent behavior with user expectations. We evaluate BAO on multiple tasks from the UserRL benchmark suite, and demonstrate that it substantially outperforms proactive agentic RL baselines while achieving comparable or even superior performance to commercial LLM agents, highlighting its effectiveness for training proactive, user-aligned LLM agents in complex multi-turn scenarios. Our website: this https URL.
- 中文摘要
主动式大型语言模型(LLM)智能体旨在主动规划、查询并进行多轮交互,使任务完成效率超越被动的指令执行,因而对现实世界中以用户为中心的应用至关重要。智能体强化学习(RL)最近成为在多轮环境中训练此类智能体的一种有前景的方案,使交互策略能够从反馈中习得。然而,现有流程在平衡任务性能与用户参与度方面面临关键挑战:被动的智能体无法高效适应用户意图,而过度索取人工反馈又会降低用户满意度。为解决这一权衡,我们提出了BAO,一种智能体强化学习框架,它将用于丰富主动推理和信息收集能力的行为增强,与用于抑制低效或冗余交互并使智能体行为符合用户期望的行为正则化相结合。我们在UserRL基准套件的多项任务上评估了BAO,结果表明它显著优于主动式智能体强化学习基线,同时性能可与商业LLM智能体媲美甚至更优,凸显了其在复杂多轮场景中训练主动、与用户对齐的LLM智能体的有效性。我们的网站:此 https URL。
Can We Really Learn One Representation to Optimize All Rewards?
我们真的能学会一种表征来优化所有奖励吗?
- Authors: Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.11399
- Pdf link: https://arxiv.org/pdf/2602.11399
- Abstract
As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one-step forward-backward representation learning (one-step FB). Experiments in didactic settings, as well as in 10 state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5$ times smaller and improves zero-shot performance by 24% on average. Our project website is available at this https URL.
- 中文摘要
随着机器学习逐渐转向将大型模型用作下游任务的先验,社区一直在争论解决强化学习(RL)问题的正确先验形式。如果想尽可能多地预先完成计算,就会尝试针对某个尚未确定的奖励函数学习策略上的先验。近期研究(前向-后向(FB)表示学习)尝试了这一方法,认为无监督表征学习过程可以在无需进一步微调的情况下实现对任意奖励的最优控制。然而,FB的训练目标和学习行为仍然难以理解。本文通过澄清此类表示何时存在、其目标优化的是什么以及实际如何收敛,来揭开FB的神秘面纱。我们建立了它与秩匹配、拟合Q评估和压缩映射之间的联系。我们的分析提示了一种简化的RL无监督预训练方法,该方法不追求实现最优控制,而是执行一步策略改进。我们将所提出的方法称为一步前向-后向表示学习(one-step FB)。在教学性示例环境以及10个基于状态和基于图像的连续控制域中的实验表明,one-step FB收敛到的误差小 $10^5$ 倍,且零样本性能平均提升24%。我们的项目网站可访问此 https URL。
Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization
通过鲁棒值分解实现分布鲁棒的合作多智能体强化学习
- Authors: Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman
- Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2602.11437
- Pdf link: https://arxiv.org/pdf/2602.11437
- Abstract
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings because of environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at this https URL.
- 中文摘要
合作多智能体强化学习(MARL)通常采用集中式训练与去中心化执行,其中值分解方法强制满足个体-全局最大值(IGM)原则,使去中心化的贪婪动作能够恢复团队最优的联合动作。然而,由于仿真到现实的差距、模型失配和系统噪声带来的环境不确定性,这一方案在现实环境中仍不可靠。我们通过引入分布鲁棒IGM(DrIGM)来弥补这一不足,该原则要求每个智能体的鲁棒贪婪动作与鲁棒的团队最优联合动作保持一致。我们证明DrIGM在一种新的鲁棒个体动作值定义下成立,该定义与去中心化贪婪执行兼容,并为整个系统提供了可证明的鲁棒性保证。在此基础上,我们推导出现有值分解架构(如VDN/QMIX/QTRAN)的符合DrIGM的鲁棒变体,这些变体(i)在鲁棒Q目标上训练,(ii)保持可扩展性,(iii)可与现有代码库无缝集成,无需为每个智能体定制奖励塑形。实验上,在高保真SustainGym模拟器和星际争霸游戏环境中,我们的方法持续提升分布外性能。代码和数据可在此 https URL 获取。
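The IGM principle mentioned in this entry can be checked on a toy example. The sketch below uses a VDN-style additive factorization (joint value = sum of per-agent values) and confirms that per-agent greedy actions match the joint argmax. It illustrates only the standard, non-robust IGM property, not DrIGM; the Q-values and function names are hypothetical.

```python
from itertools import product

def greedy_individual(q_tables):
    """Each agent independently picks the action with its own highest Q-value."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)

def greedy_joint(q_tables):
    """Brute-force argmax over joint actions of the additive (VDN-style) joint value."""
    joint_actions = product(*(range(len(q)) for q in q_tables))
    return max(joint_actions,
               key=lambda a: sum(q[i] for q, i in zip(q_tables, a)))

# Hypothetical per-agent Q-values for three agents with 3, 2, and 3 actions.
q_tables = [[0.2, 1.5, 0.3], [2.0, 0.1], [0.0, 0.4, 0.9]]
individual = greedy_individual(q_tables)
joint = greedy_joint(q_tables)
print(individual, joint)
```

With an additive mixer the two always agree when the maxima are unique, which is why decentralized execution needs no search over the joint action space.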
Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning
应得的功劳:跨模态连接推动MLLM推理的精准强化学习
- Authors: Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11455
- Pdf link: https://arxiv.org/pdf/2602.11455
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet how visual evidence is integrated during reasoning remains poorly understood. We explore multimodal RLVR through the lens of cross-modal attention connectivity and find that only a small fraction of tokens (approximately 15%) exhibit strong visual-textual coupling. These high-connectivity tokens act as anchors that ground reasoning in the image, while the majority follow linguistic patterns. During RLVR training, credit assignment naturally concentrates on these anchors, sharpening their visual grounding over time. Building on this insight, we propose Anchor-Token Reinforcement Learning (AT-RL), a lightweight framework that selectively reinforces high-connectivity tokens via graph-based clustering of attention topology. Evaluated across the series (3B-32B), AT-RL introduces only 1.2% overhead yet enables the 32B model to surpass the 72B-Instruct baseline on MathVista (80.2), with consistent gains observed across STEM, video and general tasks. Conversely, training solely on low-connectivity tokens causes severe degradation, confirming that effective multimodal RL hinges on precise credit assignment to visual anchors. Our work reveals that reasoning quality is governed not by token quantity but by the fidelity of cross-modal anchoring.
- 中文摘要
带可验证奖励的强化学习(RLVR)显著提升了多模态大型语言模型(MLLM)的推理能力,但视觉证据在推理过程中如何被整合仍知之甚少。我们从跨模态注意力连接性的视角探讨多模态RLVR,发现只有一小部分token(约15%)表现出强烈的视觉-文本耦合。这些高连接性token充当锚点,将推理扎根于图像,而大多数token则遵循语言模式。在RLVR训练期间,信用分配自然集中在这些锚点上,并随时间推移强化它们的视觉定位。基于这一见解,我们提出了锚点token强化学习(AT-RL),一个轻量级框架,通过对注意力拓扑进行基于图的聚类,选择性地强化高连接性token。在该系列(3B-32B)上评估,AT-RL仅带来1.2%的开销,却使32B模型在MathVista(80.2)上超过了72B-Instruct基线,并在STEM、视频和通用任务中均有一致提升。相反,仅在低连接性token上训练会导致严重退化,证实有效的多模态强化学习依赖于对视觉锚点的精确信用分配。我们的研究表明,推理质量不取决于token数量,而取决于跨模态锚定的保真度。
RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas
基于商用网络的强化学习:利用无损稀疏增量克服带宽障碍
- Authors: Chaoyi Ruan, Geng Luo, Xinyi Wan, Long Zhao, Qinghe Wang, Jiaan Zhu, Duling Xu, Guanbin Xu, Dehui Wei, Xiang Liu, Cheng Li, Haifeng Sun, Congcong Miao, Jialin Li
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2602.11456
- Pdf link: https://arxiv.org/pdf/2602.11456
- Abstract
LLM post-training with reinforcement learning (RL) requires frequent synchronization of large model parameters between the trainer and distributed rollout actors. High-throughput RL post-training therefore relies on dedicated RDMA HPC clusters, an infrastructure cost most organizations cannot absorb. A natural alternative is to aggregate loosely-coupled GPUs over standard Ethernet and WAN links, but this commodity connectivity cannot sustain full-weight broadcasts: synchronizing an 8B model can take over 100 seconds on bandwidth-limited links, while rollout generation typically takes tens of seconds. Toward making RL practical in this regime, we observe that RL fine-tuning yields highly sparse per-step updates, with only around 1% of parameter elements changing. Atop this insight, we present SparrowRL, a novel high-performance RL training system that preserves bit-exact updates without dropping or quantizing information, designed for commodity-networked, loosely-coupled GPU resources. SparrowRL represents each step as a sparse delta checkpoint, pipelines delta extraction with multi-stream transmission, overlaps transfer with rollout generation, and coordinates heterogeneous workers with throughput- and bandwidth-aware scheduling plus lease-based fault tolerance. On Qwen3 models from 4B to 14B deployed across up to four geographic regions, SparrowRL reduces per-step transfer payload by 79× for Qwen3-8B and improves throughput by 2.4-9.5× over full-weight broadcast across WAN, narrowing the throughput gap relative to an ideal RDMA single-datacenter baseline to within 8.91%. By leveraging on-demand, cross-cloud GPUs over commodity links, SparrowRL delivers 1.21-1.59× higher tokens per dollar than reserved RDMA clusters at comparable throughput.
- 中文摘要
基于强化学习(RL)的LLM后训练需要在训练器与分布式rollout执行器之间频繁同步庞大的模型参数。因此,高吞吐量的RL后训练依赖专用的RDMA高性能计算集群,这是大多数组织无法承担的基础设施成本。一个自然的替代方案是通过标准以太网和广域网链路聚合松耦合的GPU,但这种商用网络连接无法支撑全量权重广播:在带宽受限的链路上同步一个8B模型可能需要超过100秒,而rollout生成通常只需数十秒。为了使RL在这种环境下切实可行,我们观察到RL微调产生的每步更新高度稀疏,仅约1%的参数元素发生变化。基于这一洞察,我们提出了SparrowRL,一种新型高性能RL训练系统,能够保持位精确的更新而不丢弃或量化信息,专为商用网络互联、松耦合的GPU资源设计。SparrowRL 将每一步表示为稀疏增量检查点,将增量提取与多流传输流水线化,使传输与rollout生成重叠,并通过吞吐量和带宽感知的调度以及基于租约的容错来协调异构工作节点。在部署于最多四个地理区域的 Qwen3 4B 到 14B 模型上,SparrowRL 将 Qwen3-8B 的每步传输负载降低 79 倍,相比跨广域网的全量权重广播将吞吐量提升 2.4 至 9.5 倍,并将与理想的 RDMA 单数据中心基线之间的吞吐量差距缩小到 8.91% 以内。通过在商用链路上利用按需、跨云的GPU,SparrowRL在吞吐量相当的情况下,每美元处理的token数达到预留RDMA集群的 1.21 至 1.59 倍。
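The idea of a lossless sparse delta described in this entry can be sketched in a few lines: send only the indices and new values of parameters whose bits changed, then overwrite them on the receiver. This is a toy illustration under stated assumptions, not SparrowRL's implementation; the function names, the flat list-of-floats parameter layout, and the update pattern are all hypothetical.

```python
import struct

def float_bits(x):
    """Raw 64-bit pattern of a Python float, used for bit-exact comparison."""
    return struct.pack("<d", x)

def extract_sparse_delta(old, new):
    """Collect (index, new_value) pairs for elements whose bit pattern changed."""
    if len(old) != len(new):
        raise ValueError("parameter vectors must have the same length")
    return [(i, n) for i, (o, n) in enumerate(zip(old, new))
            if float_bits(o) != float_bits(n)]

def apply_sparse_delta(old, delta):
    """Rebuild the updated parameters by overwriting only the changed elements."""
    updated = list(old)
    for i, value in delta:
        updated[i] = value
    return updated

# Hypothetical parameter vector where 3 of 1000 elements change in one step.
old_params = [0.1 * i for i in range(1000)]
new_params = list(old_params)
for i in (3, 250, 999):
    new_params[i] += 1e-3

delta = extract_sparse_delta(old_params, new_params)
restored = apply_sparse_delta(old_params, delta)
print(len(delta), restored == new_params)
```

Because the delta stores the new values themselves rather than quantized differences, the receiver reconstructs the sender's parameters exactly.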
Future Mining: Learning for Safety and Security
未来采矿:安全与保障的学习
- Authors: Md Sazedur Rahman, Mizanur Rahman Jewel, Sanjay Madria
- Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2602.11472
- Pdf link: https://arxiv.org/pdf/2602.11472
- Abstract
Mining is rapidly evolving into an AI-driven cyber-physical ecosystem where safety and operational reliability depend on robust perception, trustworthy distributed intelligence, and continuous monitoring of miners and equipment. However, real-world mining environments impose severe constraints, including poor illumination, GPS-denied conditions, irregular underground topologies, and intermittent connectivity. These factors degrade perception accuracy, disrupt situational awareness, and weaken distributed learning systems. At the same time, emerging cyber-physical threats such as backdoor triggers, sensor spoofing, label-flipping attacks, and poisoned model updates further jeopardize operational safety as mines adopt autonomous vehicles, humanoid assistance, and federated learning for collaborative intelligence. Energy-constrained sensors also experience uneven battery depletion, creating blind spots in safety coverage and disrupting hazard detection pipelines. This paper presents a vision for a Unified Smart Safety and Security Architecture that integrates multimodal perception, secure federated learning, reinforcement learning, DTN-enabled communication, and energy-aware sensing into a cohesive safety framework. We introduce five core modules: Miner Finder, Multimodal Situational Awareness, Backdoor Attack Monitor, TrustFed LFD, and IoT driven Equipment Health Monitoring. These modules collectively address miner localization, hazard understanding, federated robustness, and predictive maintenance. Together, they form an end-to-end framework capable of guiding miners through obstructed pathways, identifying compromised models or sensors, and ensuring mission-critical equipment reliability. This work outlines a comprehensive research vision for building a resilient and trustworthy intelligent mining system capable of maintaining operational continuity under adversarial conditions.
- 中文摘要
采矿业正迅速演变为一个由人工智能驱动的信息物理生态系统,其安全性和运营可靠性依赖于鲁棒的感知、可信的分布式智能以及对矿工和设备的持续监控。然而,现实中的采矿环境带来了严苛的限制,包括照明不足、无GPS信号、地下拓扑结构不规则以及连接时断时续。这些因素降低了感知准确性,干扰了态势感知,并削弱了分布式学习系统。与此同时,随着矿山采用自动驾驶车辆、人形机器人辅助和用于协作智能的联邦学习,后门触发器、传感器欺骗、标签翻转攻击以及被投毒的模型更新等新兴信息物理威胁进一步危及运营安全。能量受限的传感器还会出现电池消耗不均的情况,造成安全覆盖的盲区,并干扰危险检测流程。本文提出了一个统一智能安全与安保架构的愿景,该架构将多模态感知、安全联邦学习、强化学习、支持DTN的通信和能量感知的传感整合为一个连贯的安全框架。我们介绍五个核心模块:Miner Finder、Multimodal Situational Awareness、Backdoor Attack Monitor、TrustFed LFD 以及 IoT driven Equipment Health Monitoring。这些模块共同解决矿工定位、危险理解、联邦鲁棒性和预测性维护问题。它们共同构成了一个端到端的框架,能够引导矿工通过受阻路径,识别受损的模型或传感器,并确保关键任务设备的可靠性。本研究勾勒了一个全面的研究愿景,旨在构建能够在对抗条件下保持运营连续性的、有韧性且可信的智能采矿系统。
Unifying Stable Optimization and Reference Regularization in RLHF
RLHF 中稳定优化与参考正则化的统一
- Authors: Li He, Qiang Qu, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11523
- Pdf link: https://arxiv.org/pdf/2602.11523
- Abstract
Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions independently address these issues through separate regularization strategies, specifically a KL-divergence penalty against a supervised fine-tuned model ($\pi_0$) to mitigate reward hacking, and policy ratio clipping towards the current policy ($\pi_t$) to promote stable alignment. However, the implicit trade-off arising from simultaneously regularizing towards both $\pi_0$ and $\pi_t$ remains under-explored. In this paper, we introduce a unified regularization approach that explicitly balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, which demonstrably improves alignment results and reduces implementation complexity. Extensive experiments across diverse benchmarks validate that our method consistently outperforms RLHF and online preference learning methods, achieving enhanced alignment performance and stability.
- 中文摘要
基于人类反馈的强化学习(RLHF)显著提升了对齐能力,但仍受两个核心挑战所阻碍:奖励黑客(reward hacking)和稳定优化。当前的解决方案通过相互独立的正则化策略分别处理这些问题:对监督微调模型($\pi_0$)施加KL散度惩罚以减轻奖励黑客行为,以及围绕当前策略($\pi_t$)进行策略比率裁剪以促进稳定对齐。然而,同时向$\pi_0$和$\pi_t$正则化所产生的隐含权衡仍未被充分探讨。本文引入了一种统一的正则化方法,明确平衡防止奖励黑客行为和保持策略更新稳定这两个目标。我们简单而有原则的对齐目标导出了一个具有更优权衡的加权监督微调损失,既改善了对齐结果,也降低了实现复杂度。在多个基准上的大量实验验证了我们的方法持续优于RLHF和在线偏好学习方法,实现了更强的对齐性能和稳定性。
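The two separate regularizers named in this entry can be written down compactly. The sketch below shows the conventional forms, a KL-style penalty against the reference model $\pi_0$ and PPO-style ratio clipping around the current policy $\pi_t$; it does not implement the paper's unified objective. The function names and the default values of `beta` and `eps` are assumptions for illustration.

```python
import math

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Subtract a per-sample KL estimate (log pi - log pi_0), scaled by beta."""
    return reward - beta * (logp_policy - logp_ref)

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style objective: clip the ratio pi_new / pi_t to [1 - eps, 1 + eps]."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the pessimistic (lower) estimate of the objective.
    return min(ratio * advantage, clipped * advantage)

# Hypothetical log-probabilities and advantages for single samples.
shaped = kl_penalized_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5)
gain = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)
loss_side = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
print(shaped, gain, loss_side)
```

The first term pulls the policy toward $\pi_0$ and the second limits how far one update moves from $\pi_t$, which is the pair of pulls whose interaction the abstract says is under-explored.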
Adaptive Milestone Reward for GUI Agents
图形界面代理的自适应里程碑奖励
- Authors: Congmin Zheng, Xiaoyun Mo, Xinbei Ma, Qiqiang Lin, Yin Zhao, Jiachen Zhu, Xingyu Lou, Jun Wang, Zhaoxiang Wang, Weiwen Liu, Zhuosheng Zhang, Yong Yu, Weinan Zhang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11524
- Pdf link: https://arxiv.org/pdf/2602.11524
- Abstract
Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome reward offers high fidelity but suffers from signal sparsity, while process reward provides dense supervision but remains prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectories to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed trajectories. Extensive experiments demonstrate that ADMIRE consistently yields over 10% absolute improvement in success rate across different base models on AndroidWorld. Moreover, the method exhibits robust generalizability, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
- 中文摘要
强化学习(RL)已成为训练移动端GUI智能体的主流范式,但它难以应对长时程任务中固有的时间信用分配问题。主要挑战在于奖励保真度与密度之间的权衡:结果奖励保真度高但信号稀疏,而过程奖励提供密集监督,却容易出现偏差和奖励黑客问题。为解决这一冲突,我们提出了自适应里程碑奖励(ADMIRE)机制。ADMIRE通过将轨迹锚定到里程碑来构建可验证的自适应奖励系统,这些里程碑从成功的探索中动态提炼而来。关键在于,ADMIRE集成了一种非对称的信用分配策略,对成功轨迹去噪,并为失败轨迹提供支撑。大量实验表明,ADMIRE在AndroidWorld上针对不同基础模型均带来超过10%的成功率绝对提升。此外,该方法表现出鲁棒的泛化性,在多种强化学习算法以及网页导航和具身任务等异构环境中均表现优异。
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
原生推理模型:训练语言模型以基于不可验证数据进行推理
- Authors: Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11549
- Pdf link: https://arxiv.org/pdf/2602.11549
- Abstract
The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
- 中文摘要
目前用于训练大型推理模型的范式——结合了监督式微调(SFT)与可验证奖励的强化学习(RLVR)——在根本上受限于高质量、人工注释的推理数据和外部验证器。这种依赖产生了大量数据收集成本,可能植入人类认知偏见,并将强化学习阶段限制在数学和编码等客观可评估领域,导致大量无法验证的任务超出其范围。为克服这些局限,我们引入了NRT(原生推理训练),这是一种新颖框架,通过模型仅使用标准问答对生成推理轨迹,从而无需专家编写演示,从而培养复杂的推理能力。NRT通过将推理过程视为潜在变量来重新定义训练问题。它采用统一的训练目标,将推理建模为优化问题,内在奖励提升模型产生真实答案的可能性的路径。这种统一的视角使我们能够分析先前方法的内在失效模式,如策略崩溃,并系统地设计更稳健的奖励聚合函数,形成自我强化的反馈循环,使模型学会以解决自身不确定性的方式思考。对Llama和Mistral模型家族的实证评估表明,NRT在无验证器方法中达到了最先进的性能,显著优于标准SFT基线和之前无验证器的RL方法。我们的方法在复杂推理领域带来特别显著的性能提升,并且对政策崩溃表现出高度的鲁棒性,为构建更强大且广泛适用的推理系统提供了一条通用且可扩展的路径。
SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
SIGHT:面向搜索智能体的自证与信息增益多样化分支强化学习
- Authors: Wenlin Zhong, Jinluan Yang, Yiquan Wu, Yi Liu, Jianhang Yao, Kun Kuang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11551
- Pdf link: https://arxiv.org/pdf/2602.11551
- Abstract
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions - including de-duplication, reflection, or adaptive branching - to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
- 中文摘要
强化学习(RL)使大型语言模型(LLM)能够掌握面向复杂问答的自主搜索能力。然而,尤其是在多轮搜索场景中,这种交互带来一个关键挑战:搜索结果往往冗余度高、信噪比低。因此,智能体很容易陷入“隧道视野”,即对早期含噪检索结果的强行解读导致不可逆的错误累积。为应对这些挑战,我们提出SIGHT框架,通过自证据支持(SES)和信息增益驱动的多样化分支来增强基于搜索的推理。SIGHT通过SES将搜索结果提炼为高保真证据,并计算信息增益分数,以精确定位观测能最大程度降低不确定性的关键状态。该分数指导动态提示干预(包括去重、反思或自适应分支),以生成带有SES的新分支。最后,SIGHT通过组相对策略优化(GRPO)整合SES奖励和正确性奖励,在无需外部验证器的情况下内化稳健的探索策略。在单跳和多跳问答(QA)基准上的实验表明,SIGHT显著优于现有方法,在复杂推理场景下尤为明显,且使用的搜索步数更少。
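The SIGHT abstract mentions an Information Gain score that flags states where an observation most reduces uncertainty. The abstract does not give the formula, so the following is only a minimal sketch under an assumption: the score is taken to be the drop in Shannon entropy of a belief over candidate answers before and after a search observation. The function names and the example distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posterior):
    """Assumed score: entropy reduction from the prior to the posterior belief."""
    return entropy(prior) - entropy(posterior)

# Hypothetical beliefs over four candidate answers.
prior = [0.25, 0.25, 0.25, 0.25]
informative = [0.97, 0.01, 0.01, 0.01]   # observation that nearly settles the answer
redundant = [0.25, 0.25, 0.25, 0.25]     # observation that changes nothing

gain_informative = information_gain(prior, informative)
gain_redundant = information_gain(prior, redundant)
```

Under this assumption, a redundant retrieval scores zero and an informative one scores high, which is the kind of signal that could decide where to branch.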
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
PRIME:数学与工程中可验证推理的过程-结果对齐基准
- Authors: Xiangfeng Wang, Hangyu Guo, Yanlin Lai, Mitt Huang, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Xiaoxiao Ren, Chun Yuan, Tong Xu, Zheng Ge, Xiangyu Zhang, Daxin Jiang
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11570
- Pdf link: https://arxiv.org/pdf/2602.11570
- Abstract
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
- 中文摘要
虽然基于模型的验证器对于扩展可验证奖励强化学习(RLVR)至关重要,但当前以结果为中心的验证范式主要关注最终结果与标准答案之间的一致性,往往忽视推导过程中可能存在的错误。这会导致由错误推导得到的正确答案也被赋予正奖励。为弥合这一差距,我们提出PRIME,这是一个用于评估验证器在数学与工程领域过程-结果对齐验证能力的基准。PRIME精选自一个全面的大学水平STEM题库,经基于一致性的过滤流程后包含2,530个高难度样本。通过广泛评估,我们发现当前的验证器经常无法检测出推导缺陷。此外,我们提出一种过程感知的RLVR训练范式,使用经PRIME筛选的验证器。对于Qwen3-14B-Base模型,该方法大幅优于仅验证结果的基线,在AIME24、AIME25和Beyond-AIME上分别取得8.29%、9.12%和7.31%的绝对性能提升。最后,我们证明验证器在PRIME上的准确率与RLVR训练效果之间存在强线性相关($R^2 > 0.92$),验证了PRIME可作为验证器选择的可靠预测指标。
Learning to Configure Agentic AI Systems
学习配置代理型人工智能系统
- Authors: Aditya Taparia, Som Sagar, Ransalu Senanayake
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11574
- Pdf link: https://arxiv.org/pdf/2602.11574
- Abstract
Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource & Configuration learner), which learns a light-weight hierarchical policy using reinforcement learning to dynamically tailor these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.
- 中文摘要
配置基于LLM的智能体系统需要从庞大的组合设计空间中选择工作流、工具、token预算和提示词,目前通常依靠固定的大型模板或手工调优的启发式方法。由于同一套繁琐的配置往往同时用于简单和困难的输入查询,这会导致行为脆弱和不必要的计算开销。我们将智能体配置形式化为逐查询的决策问题,并提出ARC(Agentic Resource & Configuration learner),它利用强化学习学得一个轻量级分层策略,以动态定制这些配置。在涵盖推理和工具增强问答的多个基准上,学得的策略始终优于强大的手工设计基线及其他基线,任务准确率最多提升25%,同时降低token和运行时间成本。这些结果表明,学习逐查询的智能体配置是“一刀切”设计的有力替代方案。
The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why -- A Survey from MARL to Emergent Language and LLMs
多智能体通信的五个W(谁与谁、何时、传达什么、为何通信):从MARL到涌现语言与LLM的综述
- Authors: Jingdi Chen, Hanqing Yang, Zongjun Liu, Carlee Joe-Wong
- Subjects:
Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11583
- Pdf link: https://arxiv.org/pdf/2602.11583
- Abstract
Multi-agent sequential decision-making powers many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic, partially observable environments, communication is often what reduces uncertainty and makes collaboration possible. This survey reviews multi-agent communication (MA-Comm) through the Five Ws: who communicates with whom, what is communicated, when communication occurs, and why communication is beneficial. This framing offers a clean way to connect ideas across otherwise separate research threads. We trace how communication approaches have evolved across three major paradigms. In Multi-Agent Reinforcement Learning (MARL), early methods used hand-designed or implicit protocols, followed by end-to-end learned communication optimized for reward and control. While successful, these protocols are frequently task-specific and hard to interpret, motivating work on Emergent Language (EL), where agents can develop more structured or symbolic communication through interaction. EL methods, however, still struggle with grounding, generalization, and scalability, which has fueled recent interest in large language models (LLMs) that bring natural language priors for reasoning, planning, and collaboration in more open-ended settings. Across MARL, EL, and LLM-based systems, we highlight how different choices shape communication design, where the main trade-offs lie, and what remains unsolved. We distill practical design patterns and open challenges to support future hybrid systems that combine learning, language, and control for scalable and interpretable multi-agent collaboration.
- 中文摘要
多智能体序贯决策驱动着许多现实世界的系统,从自动驾驶车辆、机器人到协作式人工智能助手。在动态且部分可观测的环境中,通信往往是降低不确定性、使协作成为可能的关键。本综述通过五个W来回顾多智能体通信(MA-Comm):谁与谁通信、传达什么内容、何时进行通信以及通信为何有益。这一框架为串联原本相互独立的研究脉络提供了清晰的方式。我们追溯了通信方法在三大范式中的演变。在多智能体强化学习(MARL)中,早期方法使用手工设计或隐式的协议,随后出现了针对奖励和控制进行优化的端到端学习式通信。这些协议虽然成功,但往往只适用于特定任务且难以解释,这推动了涌现语言(EL)方面的研究,使智能体能够通过交互发展出更具结构性或符号化的通信。然而,EL方法在语义落地(grounding)、泛化和可扩展性方面仍有困难,这激发了近期对大型语言模型(LLM)的兴趣,它们为更开放环境中的推理、规划和协作带来自然语言先验。纵观基于MARL、EL和LLM的系统,我们重点说明不同的选择如何塑造通信设计、主要的权衡何在,以及哪些问题尚未解决。我们提炼出实用的设计模式和开放挑战,以支持未来融合学习、语言和控制的混合系统,实现可扩展且可解释的多智能体协作。
Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
夸克医疗对齐:一种整体多维对齐与协作优化范式
- Authors: Tianxiang Xu, Jiayi Liu, Yixuan Tong, Jialu Xu, Yunqing Wei, Kaiwen Feng, PanPan Hou, Kangping Yin, Jiyuan Hu, Hao Zhou, Zhenxin Ma, Jian Xu, Guanjun Jiang
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11661
- Pdf link: https://arxiv.org/pdf/2602.11661
- Abstract
While reinforcement learning for large language model alignment has progressed rapidly in recent years, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch. Reinforcement Learning from Human Feedback relies on preference annotations that are prohibitively expensive and often fail to reflect the absolute correctness of medical facts. Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles to handle complex clinical contexts. Meanwhile, medical alignment requires the simultaneous optimization of correctness, safety, and compliance, yet multi-objective heterogeneous reward signals are prone to scale mismatch and optimization [...]. To address these challenges, we propose a robust medical alignment paradigm. We first construct a holistic multi-dimensional medical alignment matrix that decomposes alignment objectives into four categories: fundamental capabilities, expert knowledge, online feedback, and format specifications. Within each category, we establish a closed loop where observable metrics inform attributable diagnosis, which in turn drives optimizable rewards, thereby providing fine-grained, high-resolution supervision signals for subsequent iterative optimization. To resolve the gradient domination and optimization instability problems caused by heterogeneous signals, we further propose a unified optimization mechanism. This mechanism employs Reference-Frozen Normalization to align reward scales and implements a Tri-Factor Adaptive Dynamic Weighting strategy to achieve collaborative optimization that is weakness-oriented, risk-prioritized, and redundancy-reducing. Experimental results demonstrate the effectiveness of our proposed paradigm in real-world medical scenario evaluations, establishing a new paradigm for complex alignment in vertical domains.
- 中文摘要
近年来,面向大型语言模型对齐的强化学习进展迅速,但将这些范式迁移到高风险的医学问答时,暴露出根本性的范式不匹配。基于人类反馈的强化学习依赖偏好标注,而这类标注成本极高,且往往无法反映医学事实的绝对正确性。基于可验证奖励的强化学习则缺乏有效的自动验证器,难以应对复杂的临床情境。与此同时,医学对齐需要同时优化正确性、安全性和合规性,但多目标异构奖励信号容易出现尺度不匹配和优化[...]。为应对这些挑战,我们提出了一个稳健的医学对齐范式。我们首先构建了一个整体性的多维医学对齐矩阵,将对齐目标分解为四类:基础能力、专家知识、在线反馈和格式规范。在每个类别中,我们建立了一个闭环:可观测指标支撑可归因的诊断,诊断进而驱动可优化的奖励,从而为后续迭代优化提供细粒度、高分辨率的监督信号。为解决异构信号引起的梯度主导和优化不稳定问题,我们进一步提出统一的优化机制。该机制采用参考冻结归一化(Reference-Frozen Normalization)来对齐奖励尺度,并实施三因子自适应动态加权(Tri-Factor Adaptive Dynamic Weighting)策略,实现面向弱项、风险优先且减少冗余的协同优化。实验结果证明了所提范式在真实医疗场景评估中的有效性,为垂直领域的复杂对齐建立了新范式。
TabSieve: Explicit In-Table Evidence Selection for Tabular Prediction
TabSieve:面向表格预测的显式表内证据选择
- Authors: Yongyao Wang, Ziqi Miao, Lu Yang, Haonan Jia, Wenting Yan, Chen Qian, Lijun Li
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11700
- Pdf link: https://arxiv.org/pdf/2602.11700
- Abstract
Tabular prediction can benefit from in-table rows as few-shot evidence, yet existing tabular models typically perform instance-wise inference and LLM-based prompting is often brittle. Models do not consistently leverage relevant rows, and noisy context can degrade performance. To address this challenge, we propose TabSieve, a select-then-predict framework that makes evidence usage explicit and auditable. Given a table and a query row, TabSieve first selects a small set of informative rows as evidence and then predicts the missing target conditioned on the selected evidence. To enable this capability, we construct TabSieve-SFT-40K by synthesizing high-quality reasoning trajectories from 331 real tables using a strong teacher model with strict filtering. Furthermore, we introduce TAB-GRPO, a reinforcement learning recipe that jointly optimizes evidence selection and prediction correctness with separate rewards, and stabilizes mixed regression and classification training via dynamic task-advantage balancing. Experiments on a held-out benchmark of 75 classification and 52 regression tables show that TabSieve consistently improves performance across shot budgets, with average gains of 2.92% on classification and 4.45% on regression over the second-best baseline. Further analysis indicates that TabSieve concentrates more attention on the selected evidence, which improves robustness to noisy context.
- 中文摘要
表格预测可以将表内的行作为少样本证据而从中受益,但现有表格模型通常进行逐实例推断,而基于LLM的提示往往较为脆弱。模型无法稳定地利用相关行,含噪上下文还会降低性能。为应对这一挑战,我们提出TabSieve,一个先选择后预测的框架,使证据的使用显式且可审计。给定一张表和一个查询行,TabSieve首先选出一小组信息量大的行作为证据,然后以所选证据为条件预测缺失的目标值。为实现这一能力,我们使用强教师模型并经严格过滤,从331张真实表格中合成高质量推理轨迹,构建了TabSieve-SFT-40K。此外,我们提出TAB-GRPO,这是一种强化学习方案,用相互独立的奖励联合优化证据选择和预测正确性,并通过动态任务优势平衡来稳定回归与分类的混合训练。在包含75张分类表和52张回归表的留出基准上,实验表明TabSieve在各种样本数(shot)预算下都能稳定提升性能,相对于次优基线,分类平均提升2.92%,回归平均提升4.45%。进一步分析表明,TabSieve将更多注意力集中在所选证据上,从而提高了对含噪上下文的鲁棒性。
DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
DICE:扩散大型语言模型擅长生成CUDA内核
- Authors: Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang
- Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11715
- Pdf link: https://arxiv.org/pdf/2602.11715
- Abstract
Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
- 中文摘要
扩散大型语言模型(dLLM)凭借并行生成token的能力,已成为自回归(AR)LLM的有力替代方案。这一范式特别适合代码生成,因为代码生成中整体结构规划和非顺序的细化至关重要。尽管有此潜力,让dLLM适配CUDA内核生成仍然充满挑战,这不仅受制于该任务的高度专业性,也受制于高质量训练数据的严重匮乏。为应对这些挑战,我们构建了CuKe,一个针对高性能CUDA内核优化的增强型监督微调数据集。在此基础上,我们提出双阶段精选强化学习(BiC-RL)框架,包括CUDA内核填充阶段和端到端CUDA内核生成阶段。借助这一训练框架,我们推出DICE,一系列专为CUDA内核生成设计的扩散大型语言模型,涵盖1.7B、4B和8B三种参数规模。在KernelBench上的大量实验表明,DICE显著优于同等规模的自回归LLM和扩散LLM,确立了CUDA内核生成的新的最先进水平。
STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
STVG-R1:通过强化学习激励视频中的实例级推理与定位
- Authors: Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.11730
- Pdf link: https://arxiv.org/pdf/2602.11730
- Abstract
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
- 中文摘要
在视觉语言模型(VLM)中,文本描述与视觉坐标之间的错位常常引发幻觉。这一问题在时空视频定位(STVG)等密集预测任务中尤为严重。以往的方法通常侧重于增强视觉-文本对齐或附加辅助解码器。然而,这些策略不可避免地引入额外的可训练模块,带来显著的标注成本和计算开销。在本工作中,我们提出一种新的视觉提示范式,避开了跨模态对齐坐标这一难题。具体来说,我们为每个对象分配一个唯一且在时间上一致的ID,将逐帧坐标预测重新表述为紧凑的实例级识别问题。这些ID作为视觉提示嵌入视频中,为VLM提供显式且可解释的输入。此外,我们提出STVG-R1,这是首个面向STVG的强化学习框架,采用任务驱动的奖励来联合优化时间准确性、空间一致性和结构化格式正则。在六个基准上的大量实验证明了我们方法的有效性。在HCSTVG-v2基准上,STVG-R1的m_IoU比基线Qwen2.5-VL-7B高出20.9%,达到了新的最先进水平(SOTA)。令人惊讶的是,STVG-R1还对多目标指代视频目标分割任务表现出很强的零样本泛化能力,在MeViS上取得了47.3%的SOTA J&F成绩。
AC-MASAC: An Attentive Curriculum Learning Framework for Heterogeneous UAV Swarm Coordination
AC-MASAC:面向异构无人机集群协同的注意力课程学习框架
- Authors: Wanhao Liu, Junhong Dai, Yixuan Zhang, Shengyun Yin, Panshuo Li
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.11735
- Pdf link: https://arxiv.org/pdf/2602.11735
- Abstract
Cooperative path planning for heterogeneous UAV swarms poses significant challenges for Multi-Agent Reinforcement Learning (MARL), particularly in handling asymmetric inter-agent dependencies and addressing the risks of sparse rewards and catastrophic forgetting during training. To address these issues, this paper proposes an attentive curriculum learning framework (AC-MASAC). The framework introduces a role-aware heterogeneous attention mechanism to explicitly model asymmetric dependencies. Moreover, a structured curriculum strategy is designed, integrating hierarchical knowledge transfer and stage-proportional experience replay to address the issues of sparse rewards and catastrophic forgetting. The proposed framework is validated on a custom multi-agent simulation platform, and the results show that our method has significant advantages over other advanced methods in terms of Success Rate, Formation Keeping Rate, and Success-weighted Mission Time. The code is available at this https URL.
- 中文摘要
异构无人机集群的协同路径规划对多智能体强化学习(MARL)构成重大挑战,尤其是在处理智能体间的非对称依赖,以及应对训练中奖励稀疏和灾难性遗忘的风险方面。为解决这些问题,本文提出一个注意力课程学习框架(AC-MASAC)。该框架引入角色感知的异构注意力机制,以显式建模非对称依赖关系。此外,本文设计了结构化的课程策略,融合分层知识迁移和按阶段比例的经验回放,以解决奖励稀疏和灾难性遗忘问题。所提框架在定制的多智能体仿真平台上得到验证,结果显示我们的方法在成功率、编队保持率和成功加权任务时间方面相比其他先进方法具有显著优势。代码可在 this https URL 获取。
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
TSR:面向LLM智能体多轮强化学习的轨迹搜索Rollout
- Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Holger Boche
- Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11767
- Pdf link: https://arxiv.org/pdf/2602.11767
- Abstract
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
- 中文摘要
大型语言模型(LLM)的进步正推动一种转变:利用强化学习(RL)通过跨任务的迭代式多轮交互来训练智能体。然而,多轮强化学习仍具挑战性,因为奖励往往稀疏或延迟,环境也可能具有随机性。在这种情形下,朴素的轨迹采样可能妨碍利用(exploitation)并引发模式崩溃。我们提出TSR(Trajectory-Search Rollouts),这是一种训练时方法,借用测试时扩展(test-time scaling)的思路来改进每一轮的rollout生成。TSR执行轻量级的树式搜索,借助任务特定的反馈在每一轮选择高分动作,从而构建高质量轨迹。这提升了rollout质量并稳定了学习过程,同时保持底层优化目标不变,使TSR与优化器无关。我们用best-of-N、束搜索和浅层前瞻搜索来实例化TSR,并与PPO和GRPO搭配,在Sokoban、FrozenLake和WebShop任务上取得最高15%的性能提升和更稳定的学习,代价是训练计算量的一次性增加。通过将搜索从推理阶段移到训练的rollout阶段,TSR为更强的多轮智能体学习提供了一种简单且通用的机制,可与现有框架和拒绝采样式的选择方法互补。
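The TSR abstract describes choosing high-scoring actions at each turn during rollout generation, with best-of-N as one instantiation. The sketch below illustrates only that per-turn best-of-N idea on a toy 1-D task. The environment, `sample_action`, `score`, and `GOAL` are hypothetical stand-ins, not the paper's setup, and the policy update (PPO or GRPO in the paper) is left out.

```python
import random

GOAL = 5  # hypothetical target position on a 1-D line

def sample_action(rng):
    """Stand-in for sampling one action from the current policy."""
    return rng.choice([-1, 0, 1])

def score(state, action):
    """Hypothetical task-specific feedback: ending closer to GOAL is better."""
    return -abs(GOAL - (state + action))

def rollout(rng, n_candidates, max_turns=10):
    """Build one trajectory, keeping the best of n_candidates actions per turn."""
    state, trajectory = 0, []
    for _ in range(max_turns):
        candidates = [sample_action(rng) for _ in range(n_candidates)]
        action = max(candidates, key=lambda a: score(state, a))
        trajectory.append((state, action))
        state += action
        if state == GOAL:
            break
    return trajectory, state

# n_candidates=1 reduces to naive sampling; larger values search per turn.
naive_traj, naive_final = rollout(random.Random(0), n_candidates=1)
searched_traj, searched_final = rollout(random.Random(0), n_candidates=8)
```

The returned trajectories would then be passed to whatever optimizer is already in use, which is the sense in which the abstract calls the approach optimizer-agnostic.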
Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
温度作为元策略:LLM强化学习中的自适应温度
- Authors: Haoran Dang, Cuiling Lan, Hai Wan, Xibin Zhao, Yan Lu
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11779
- Pdf link: https://arxiv.org/pdf/2602.11779
- Abstract
Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy. In the outer loop, meta-policy updates the distribution over candidate temperatures by rewarding those that maximize the likelihood of high-advantage trajectories. This trajectory-guided, reward-driven mechanism enables online adaptation without additional rollouts, directly aligning exploration with policy improvement. On five mathematical reasoning benchmarks, TAMPO outperforms baselines using fixed or heuristic temperatures, establishing temperature as an effective learnable meta-policy for adaptive exploration in LLM reinforcement learning. Accepted at ICLR 2026.
- 中文摘要
温度是大型语言模型(LLM)中的关键超参数,控制文本生成过程中探索与利用之间的权衡。高温鼓励多样但含噪的输出,低温则产生更集中的输出,但可能导致过早收敛。然而,静态或启发式的温度调度无法适应强化学习(RL)在整个训练过程中的动态需求,往往限制了策略改进。我们提出温度自适应元策略优化(TAMPO),这是一个将温度控制重新表述为可学习元策略的新框架。TAMPO通过分层的双循环过程运行。在内循环中,以元策略所选温度采样得到的轨迹来更新LLM策略(例如使用GRPO)。在外循环中,元策略通过奖励那些使高优势轨迹的似然最大化的温度,来更新候选温度上的分布。这种由轨迹引导、由奖励驱动的机制无需额外rollout即可实现在线自适应,直接将探索与策略改进对齐。在五个数学推理基准上,TAMPO优于使用固定或启发式温度的基线,确立了温度可作为LLM强化学习中自适应探索的有效可学习元策略。已被ICLR 2026接收。
RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation
RELATE:一个强化学习增强的广告文本生成LLM框架
- Authors: Jinfang Wang, Jiajie Liu, Jianwei Wu, Ziqin Luo, Zhen Chen, Chunlei Li, Biao Han, Tao Deng, Yi Li, Shuanglong Li, Lin Liu
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11780
- Pdf link: https://arxiv.org/pdf/2602.11780
- Abstract
In online advertising, advertising text plays a critical role in attracting user engagement and driving advertiser value. Existing industrial systems typically follow a two-stage paradigm, where candidate texts are first generated and subsequently aligned with online performance metrics such as click-through rate (CTR). This separation often leads to misaligned optimization objectives and low funnel efficiency, limiting global optimality. To address these limitations, we propose RELATE, a reinforcement learning-based end-to-end framework that unifies generation and objective alignment within a single model. Instead of decoupling text generation from downstream metric alignment, RELATE integrates performance and compliance objectives directly into the generation process via policy learning. To better capture ultimate advertiser value beyond click-level signals, we incorporate conversion-oriented metrics into the objective and jointly model them with compliance constraints as multi-dimensional rewards, enabling the model to generate high-quality ad texts that improve conversion performance under policy constraints. Extensive experiments on large-scale industrial datasets demonstrate that RELATE consistently outperforms baselines. Furthermore, online deployment on a production advertising platform yields statistically significant improvements in click-through conversion rate (CTCVR) under strict policy constraints, validating the robustness and real-world effectiveness of the proposed framework.
- 中文摘要
在在线广告中,广告文本在吸引用户参与和提升广告主价值方面起着关键作用。现有工业系统通常遵循两阶段范式:先生成候选文本,再与点击率(CTR)等线上效果指标对齐。这种分离往往导致优化目标不一致和漏斗效率低下,限制了全局最优性。为解决这些局限,我们提出RELATE,这是一个基于强化学习的端到端框架,在单一模型内统一生成与目标对齐。RELATE不再将文本生成与下游指标对齐解耦,而是通过策略学习将效果目标和合规目标直接融入生成过程。为了更好地刻画点击级信号之外的广告主最终价值,我们将面向转化的指标纳入目标,并将其与合规约束联合建模为多维奖励,使模型能够在政策约束下生成提升转化效果的高质量广告文本。在大规模工业数据集上的大量实验表明,RELATE始终优于基线方法。此外,在生产广告平台上的线上部署表明,在严格的政策约束下,点击转化率(CTCVR)取得了统计显著的提升,验证了所提框架的稳健性和实际有效性。
Detecting RLVR Training Data via Structural Convergence of Reasoning
通过推理结构收敛检测RLVR训练数据
- Authors: Hongbo Zhang, Yue Yang, Jianhao Yan, Guangsheng Bao, Yue Zhang, Yue Zhang
- Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.11792
- Pdf link: https://arxiv.org/pdf/2602.11792
- Abstract
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
- 中文摘要
可验证奖励强化学习(RLVR)是训练现代推理模型的核心,但其训练数据未公开,引发了对基准污染的担忧。预训练方法利用token级概率来优化模型,而RLVR则基于模型自生成推理轨迹的奖励反馈来微调模型,这使得传统的基于似然的检测方法效果较差。我们发现RLVR会带来独特的行为特征:RLVR训练中见过的提示会产生更僵化、更相似的生成结果,而未见过的提示则保留更大的多样性。我们提出Min-$k$NN Distance,这是一种简单的黑盒检测器,它对给定提示采样多个补全,并计算其中$k$个最小的最近邻编辑距离的平均值,以量化这种坍缩。Min-$k$NN Distance无需访问参考模型或token概率。在多个经RLVR训练的推理模型上的实验表明,Min-$k$NN Distance能可靠地区分强化学习中见过与未见过的样本,并优于现有的成员推断和强化学习污染检测基线。
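The abstract defines Min-$k$NN Distance as the average of the $k$ smallest nearest-neighbor edit distances among several completions sampled for one prompt. A minimal sketch of one reading of that definition follows. It assumes character-level Levenshtein distance with no length normalization, and that each completion's nearest neighbor is taken among the other samples for the same prompt; the paper may tokenize or normalize differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def min_knn_distance(completions, k):
    """Average of the k smallest nearest-neighbor edit distances."""
    if len(completions) < 2:
        raise ValueError("need at least two completions")
    nearest = []
    for i, c in enumerate(completions):
        # Distance from completion i to its closest other completion.
        nearest.append(min(edit_distance(c, other)
                           for j, other in enumerate(completions) if j != i))
    smallest = sorted(nearest)[:k]
    return sum(smallest) / len(smallest)

# Hypothetical samples: near-identical outputs give a low score.
rigid = ["abc", "abd", "xyz"]
rigid_score = min_knn_distance(rigid, k=2)
```

A low score means the samples have collapsed onto near-duplicates, which the abstract associates with prompts seen during RLVR training. Any decision threshold would have to be calibrated separately.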
Temporal Difference Learning with Constrained Initial Representations
带有约束初始表示的时间差分学习
- Authors: Jiafei Lyu, Jingwen Yang, Zhongjian Qiao, Runze Liu, Zeyuan Liu, Deheng Ye, Zongqing Lu, Xiu Li
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11800
- Pdf link: https://arxiv.org/pdf/2602.11800
- Abstract
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents when interacting with the environment, including architecture improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to fulfill such a constraint. We theoretically unpack the convergence property of the temporal difference learning with the Tanh function under linear function approximation. Motivated by theoretical insights, we present our Constrained Initial Representations framework, tagged CIR, which is made up of three components: (i) the Tanh activation along with normalization methods to stabilize representations; (ii) the skip connection module to provide a linear pathway from the shallow layer to the deep layer; (iii) the convex Q-learning that allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR exhibits strong performance on numerous continuous control tasks, even being competitive or surpassing existing strong baseline methods.
- 中文摘要
近年来,已有大量工作尝试提升离策略(off-policy)强化学习(RL)智能体与环境交互时的样本效率,包括架构改进和新算法。尽管取得了这些进展,它们忽视了直接约束输入数据初始表示的潜力,而这在直观上可以缓解分布偏移问题并稳定训练。本文在初始层中引入Tanh函数来实现这种约束。我们从理论上分析了线性函数逼近下带Tanh函数的时间差分学习的收敛性质。受理论洞见启发,我们提出约束初始表示框架,简称CIR,它由三部分组成:(i)Tanh激活配合归一化方法以稳定表示;(ii)跳跃连接模块,提供从浅层到深层的线性通路;(iii)凸Q学习,允许更灵活的价值估计并缓解潜在的保守性。实证结果表明,CIR在众多连续控制任务上表现强劲,可与现有强基线方法相当甚至超越它们。
From Path Signatures to Sequential Modeling: Incremental Signature Contributions for Offline RL
从路径签名到顺序建模:离线强化学习的增量签名贡献
- Authors: Ziyi Zhao, Qingchuan Li, Yuxuan Xu
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11805
- Pdf link: https://arxiv.org/pdf/2602.11805
- Abstract
Path signatures embed trajectories into tensor algebra and constitute a universal, non-parametric representation of paths; however, in the standard form, they collapse temporal structure into a single global object, which limits their suitability for decision-making problems that require step-wise reactivity. We propose the Incremental Signature Contribution (ISC) method, which decomposes truncated path signatures into a temporally ordered sequence of elements in the tensor-algebra space, corresponding to incremental contributions induced by last path increments. This reconstruction preserves the algebraic structure and expressivity of signatures, while making their internal temporal evolution explicit, enabling processing signature-based representations via sequential modeling approaches. In contrast to full signatures, ISC is inherently sensitive to instantaneous trajectory updates, which is critical for sensitive and stability-requiring control dynamics. Building on this representation, we introduce ISC-Transformer (ISCT), an offline reinforcement learning model that integrates ISC into a standard Transformer architecture without further architectural modification. We evaluate ISCT on HalfCheetah, Walker2d, Hopper, and Maze2d, including settings with delayed rewards and downgraded datasets. The results demonstrate that ISC method provides a theoretically grounded and practically effective alternative to path processing for temporally sensitive control tasks.
- 中文摘要
路径签名将轨迹嵌入张量代数中,构成路径的通用非参数表示;然而,在标准形式中,它们将时间结构压缩为单一的全局对象,这限制了它们在需要逐步反应性决策问题上的适用性。我们提出了增量签名贡献(ISC)方法,该方法将截断路径签名分解为张量代数空间中按时间顺序排列的元素序列,对应于由最后路径增量诱导的增量贡献。这种重建保留了签名的代数结构和表达力,同时明确其内部时间演化,使得通过顺序建模方法处理基于签名的表示成为可能。与全签名不同,ISC对瞬时轨迹更新具有内在敏感性,这对于敏感且需要稳定性的控制动态至关重要。基于这一表示,我们引入了ISC-Transformer(ISCT),一种离线强化学习模型,将ISC集成到标准Transformer架构中,无需进一步架构修改。我们评估了HalfCheetah、Walker2d、Hopper和Maze2d上的ISCT,包括延迟奖励和降级数据集的设置。结果表明,ISC方法为时间敏感的控制任务提供了理论基础且实用有效的路径处理替代方案。
Predicting LLM Output Length via Entropy-Guided Representations
通过熵引导表示预测LLM输出长度
- Authors: Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, Di Wang
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11812
- Pdf link: https://arxiv.org/pdf/2602.11812
- Abstract
The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but they incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction with negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark with long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing MAE by 29.16% over the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
- 中文摘要
在LLM服务和强化学习(RL)采样中,序列长度的长尾分布使批量推理产生过多填充,从而造成大量计算浪费。现有方法依赖辅助模型进行静态长度预测,但开销高、泛化能力差,且在随机的“一对多”采样场景中失效。我们引入了一个轻量级框架,复用主模型的内部隐藏状态来高效预测长度。该框架包含两个核心组件:1)熵引导词元池化(EGTP),利用即时激活和词元熵,以可忽略的成本实现高精度的静态预测;2)渐进长度预测(PLP),在每个解码步骤动态估计剩余长度,以应对随机生成。为了验证我们的方法,我们构建并发布了ForeLen,这是一个包含长序列、思维链和强化学习数据的综合基准。在ForeLen上,EGTP达到了最先进的精度,MAE较最佳基线降低29.16%。将我们的方法与长度感知调度器集成,可带来显著的端到端吞吐量提升。我们的工作为高效LLM推理提供了新的技术和评估基线。
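The padding waste this abstract refers to can be illustrated with a small calculation. The sketch below is not the paper's EGTP or PLP method: it only compares batching in arrival order with batching by length, and it assumes a perfect length predictor (it sorts by the true lengths). The function name `padding_waste` and the sample lengths are hypothetical.

```python
def padding_waste(lengths, batch_size):
    """Fraction of token slots spent on padding when every batch
    is padded to the length of its longest sequence."""
    total_slots = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total_slots += max(batch) * len(batch)
    return 1 - sum(lengths) / total_slots

# Hypothetical long-tailed output lengths (in tokens), in arrival order.
lengths = [120, 4000, 90, 150, 3800, 110, 100, 4200]

# Batches formed in arrival order mix short and long sequences.
arrival_order = padding_waste(lengths, batch_size=4)

# A length-aware scheduler groups sequences of similar length.
# Assumption: predictions are perfect, so we sort by the true lengths.
length_aware = padding_waste(sorted(lengths), batch_size=4)

print(f"arrival order: {arrival_order:.1%} padding")
print(f"length-aware:  {length_aware:.1%} padding")
```

With a real predictor the grouping would use predicted rather than true lengths, so the achievable reduction depends on prediction error.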
In-Context Function Learning in Large Language Models
大型语言模型中的上下文函数学习
- Authors: Elif Akata, Konstantinos Voudouris, Vincent Fortuin, Eric Schulz
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11863
- Pdf link: https://arxiv.org/pdf/2602.11863
- Abstract
Large language models (LLMs) can learn from a few demonstrations provided at inference time. We study this in-context learning phenomenon through the lens of Gaussian Processes (GPs). We build controlled experiments where models observe sequences of multivariate scalar-valued function samples drawn from known GP priors. We evaluate prediction error in relation to the number of demonstrations and compare against two principled references: (i) an empirical GP-regression learner that gives a lower bound on achievable error, and (ii) the expected error of a 1-nearest-neighbor (1-NN) rule, which gives a data-driven upper bound. Across model sizes, we find that LLM learning curves are strongly influenced by the function-generating kernels and approach the GP lower bound as the number of demonstrations increases. We then study the inductive biases of these models using a likelihood-based analysis. We find that LLM predictions are most likely under less smooth GP kernels. Finally, we explore whether post-training can shift these inductive biases and improve sample-efficiency on functions sampled from GPs with smoother kernels. We find that both reinforcement learning and supervised fine-tuning can effectively shift inductive biases in the direction of the training data. Together, our framework quantifies the extent to which LLMs behave like GP learners and provides tools for steering their inductive biases for continuous function learning tasks.
- 中文摘要
大型语言模型(LLM)可以从推理时提供的少量演示中学习。我们从高斯过程(GP)的视角研究这种上下文学习现象。我们构建了受控实验,让模型观察从已知GP先验中抽取的多元标量值函数样本序列。我们评估预测误差随演示数量的变化,并与两个有原则的参照进行比较:(i)经验GP回归学习器,它给出可达误差的下界;(ii)1-最近邻(1-NN)规则的期望误差,它给出数据驱动的上界。在不同模型规模下,我们发现LLM的学习曲线受函数生成核的强烈影响,并随着演示数量的增加趋近GP下界。随后,我们用基于似然的分析研究这些模型的归纳偏置,发现LLM的预测在较不平滑的GP核下似然最高。最后,我们探讨后训练能否改变这些归纳偏置,并提升模型在由更平滑核的GP采样的函数上的样本效率。我们发现,强化学习和监督微调都能有效地使归纳偏置向训练数据的方向偏移。总体而言,我们的框架量化了LLM在多大程度上表现得像GP学习器,并为在连续函数学习任务中引导其归纳偏置提供了工具。
Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
面向可扩展网络数据采集的高效爬取(扩展版)
- Authors: Antoine Gauquier, Ioana Manolescu, Pierre Senellart
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2602.11874
- Pdf link: https://arxiv.org/pdf/2602.11874
- Abstract
Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
- 中文摘要
新闻事实核查以及社会或经济研究都需要分析高质量的统计数据集(简称SD)。然而,取决于这些数据集在网上的发布方式,大规模获取SD语料库可能困难、低效甚至无法实现。为了提升开放统计数据的可访问性,我们提出了一种聚焦式网络爬取算法:它只需爬取远少于整个网站的页面,就能高效且可扩展地从给定网站获取尽可能多的目标(即特定类型的资源)。我们证明最优地求解该问题在计算上不可行,并提出了一种基于强化学习的方法,即使用休眠老虎机(sleeping bandits)。我们提出了爬虫SB-CLASSIFIER,它根据超链接在其所在网页中的路径,高效地学习哪些超链接通向链接了大量目标的页面。我们在拥有数百万网页的网站上的实验表明,该爬虫效率很高,只需爬取网站的一小部分即可获得该网站很高比例的目标。
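In a sleeping bandit, only a subset of arms is available ("awake") at each step, which matches a crawler that can only follow the link groups present on the pages it has already fetched. The sketch below is a generic UCB-style rule restricted to awake arms, not the SB-CLASSIFIER algorithm itself. The arm names, the reward bookkeeping, and the function `pick_awake_arm` are all hypothetical.

```python
import math

def pick_awake_arm(awake, pulls, reward_sum, t):
    """UCB1-style choice restricted to the arms that are currently awake.

    awake      -- list of arm names available at this step
    pulls      -- dict: arm -> number of times it was chosen so far
    reward_sum -- dict: arm -> total reward collected from it
    t          -- current step (>= 1), used in the exploration bonus
    """
    best_arm, best_score = None, float("-inf")
    for arm in awake:
        if pulls[arm] == 0:
            return arm  # try every arm once before trusting its estimate
        mean = reward_sum[arm] / pulls[arm]
        bonus = math.sqrt(2 * math.log(t) / pulls[arm])
        if mean + bonus > best_score:
            best_arm, best_score = arm, mean + bonus
    return best_arm

# Hypothetical link groups, with reward = targets found per followed link.
pulls = {"nav": 10, "data": 10, "footer": 0}
reward_sum = {"nav": 1.0, "data": 8.0, "footer": 0.0}

choice_a = pick_awake_arm(["nav", "data"], pulls, reward_sum, t=20)
choice_b = pick_awake_arm(["nav", "footer"], pulls, reward_sum, t=20)
print(choice_a, choice_b)
```

In the first call both arms have the same exploration bonus, so the higher empirical mean wins. In the second call the unexplored arm is tried first.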
Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
回声:通过音频交错推理迈向高级音频理解
- Authors: Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou
- Subjects: Sound (cs.SD); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11909
- Pdf link: https://arxiv.org/pdf/2602.11909
- Abstract
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio in demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: this https URL.
- 中文摘要
大型音频语言模型(LALMs)的成熟提升了人们对它们理解复杂音频的期望,类似于人类。当前的努力主要通过一次性编码将音频内容置于上下文中来复制基于文本的推理,这带来了关键的信息瓶颈。我们从人类认知中汲取灵感,提出了音频交错推理以突破这一瓶颈。它将音频视为主动推理的组成部分,从而实现持续的音频参与和基于感知的分析。为了实现这一点,我们引入了两阶段培训框架,首先通过监督微调教授LALMs定位显著音频片段,然后通过强化学习激励熟练的重听。同时,开发结构化数据生成流水线以产生高质量的训练数据。因此,我们介绍了Echo,一种能够在推理过程中动态重听需求音频的LALM系统。在音频理解基准测试中,Echo在具有挑战性专家级和通用任务中均取得整体优势。综合分析进一步证实了音频交错推理的高效性和普遍性,使其成为推动音频理解进步的有前景方向。项目页面:这个 https URL。
Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration
将Puzzle扩展至混合专家(MoE)推理模型,并应用于GPT-OSS加速
- Authors: Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwari, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.11937
- Pdf link: https://arxiv.org/pdf/2602.11937
- Abstract
Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8XH100 node we achieve 1.63X and 1.22X throughput speedups in long-context and short-context settings, respectively. gpt-oss-puzzle-88B also delivers throughput speedups of 2.82X on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variants, per-token throughput (tok/s) and latency (ms/token) do not necessarily lead to end-to-end speedups: a 2X throughput gain is erased if traces grow 2X. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by tokens generated and trace an accuracy-speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29X higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high) to 108.2% (low), showing that post-training architecture search can substantially reduce inference costs without sacrificing quality.
- 中文摘要
以推理为中心的大型语言模型通过生成更长的推理轨迹来提升答案质量,但额外的词元显著增加了服务成本,因而需要进行推理优化。我们扩展了训练后神经架构搜索(NAS)框架Puzzle,并将其应用于gpt-oss-120B,得到面向部署优化的衍生模型gpt-oss-puzzle-88B。我们的方法结合了异构MoE专家剪枝、有选择地将全上下文注意力替换为窗口注意力、带校准缩放系数的FP8 KV缓存量化,以及用于恢复准确率的训练后强化学习,同时保持较短的生成长度。就每词元速度而言,在8×H100节点上,我们在长上下文和短上下文设置下分别实现了1.63倍和1.22倍的吞吐量加速。gpt-oss-puzzle-88B在单个NVIDIA H100 GPU上还实现了2.82倍的吞吐量加速。然而,由于词元数量会随推理强度和模型变体而变化,每词元吞吐量(tok/s)和延迟(ms/token)的改善并不一定带来端到端的加速:如果推理轨迹增长2倍,2倍的吞吐量提升就会被抵消。反过来,吞吐量的提升也可以用于生成更多推理词元以提高准确率;因此,我们主张采用请求级效率指标,即按生成的词元数对吞吐量进行归一化,并描绘不同推理强度下的准确率-速度前沿。我们表明,gpt-oss-puzzle-88B在整个前沿上均优于gpt-oss-120B,请求级效率最高提升1.29倍。在多个基准测试中,gpt-oss-puzzle-88B在各推理强度下的套件平均准确率与父模型持平或略高,保留率从100.8%(高)到108.2%(低)不等,表明训练后架构搜索可以在不牺牲质量的前提下大幅降低推理成本。
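The abstract's point that a 2X per-token throughput gain is erased when traces grow 2X follows from simple arithmetic: request time is tokens generated divided by tokens per second. The sketch below works through that arithmetic only. The token counts and throughput figures are made-up round numbers, not measurements from the paper, and `request_time_s` is a hypothetical helper.

```python
def request_time_s(tokens_generated, tokens_per_s):
    """End-to-end generation time for one request, ignoring prefill."""
    return tokens_generated / tokens_per_s

# Hypothetical baseline: 1000-token trace at 100 tok/s.
baseline = request_time_s(1000, 100)

# 2X per-token throughput, but the trace also grows 2X.
faster_but_longer = request_time_s(2000, 200)

# 2X per-token throughput with an unchanged trace length.
faster_same_length = request_time_s(1000, 200)

print(baseline / faster_but_longer)   # request-level speedup when traces double
print(baseline / faster_same_length)  # request-level speedup at equal trace length
```

This is why a request-level metric has to account for the number of tokens generated as well as the per-token rate.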
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
Gaia2:动态与异步环境中的大型语言模型代理基准测试
- Authors: Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11964
- Pdf link: https://arxiv.org/pdf/2602.11964
- Abstract
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments platform and designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.
- 中文摘要
我们介绍了Gaia2,这是一个用于在贴近现实的异步环境中评估大型语言模型智能体的基准。与以往的静态或同步评估不同,Gaia2引入了环境独立于智能体动作而演变的场景,要求智能体在时间约束下操作,适应带噪声的动态事件,消除歧义,并与其他智能体协作。每个场景都配有写入动作验证器,支持细粒度的动作级评估,使Gaia2可直接用于基于可验证奖励的强化学习。我们对最先进的专有和开源模型的评估显示,没有任何模型能在各项能力上全面占优:GPT-5(high)以42%的pass@1取得最高总分,但在时间敏感任务上失败;Claude-4 Sonnet以准确率和速度换取成本;Kimi-K2以21%的pass@1在开源模型中领先。这些结果凸显了推理、效率与鲁棒性之间的根本权衡,并揭示了缩小“sim2real”差距所面临的挑战。Gaia2基于开源的Agents Research Environments(ARE)平台,构建在一个消费级环境之上,并且易于扩展。通过将Gaia2与基础的ARE框架一同发布,我们旨在为社区提供灵活的基础设施,用于开发、评测和训练下一代实用智能体系统。
Accelerating Robotic Reinforcement Learning with Agent Guidance
利用智能体引导加速机器人强化学习
- Authors: Haojun Chen, Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.11978
- Pdf link: https://arxiv.org/pdf/2602.11978
- Abstract
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by severe sample inefficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits fleet expansion, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using executable tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on two tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: this https URL.
- 中文摘要
强化学习(RL)为自主机器人通过试错掌握通用操作技能提供了强大的范式。然而,严重的样本低效阻碍了其在现实世界中的应用。最近的人在回路(HIL)方法利用人工纠正来加速训练,但这种方法面临可扩展性障碍。依赖人类监督者意味着1:1的监督比例,这限制了机器人集群的扩展,长时间作业还会使操作员疲劳,并因人员熟练程度不一而引入高方差。我们提出了智能体引导策略搜索(AGPS),这是一个用多模态智能体取代人类监督者、从而使训练流程自动化的框架。我们的关键见解是,智能体可以被视为一个语义世界模型,它注入内在价值先验来组织物理探索。借助可执行工具,智能体通过纠正性路径点和用于探索剪枝的空间约束提供精确引导。我们在两个任务上验证了该方法,任务涵盖精密插入和可变形物体操作。结果表明,AGPS在样本效率上优于HIL方法。这使监督流程自动化,为无需人工、可扩展的机器人学习开辟了道路。项目网站:this https URL。
FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client
FedGRPO:利用来自领域客户端的组相对奖励,以保护隐私的方式优化基础模型
- Authors: Gongxi Zhu, Hanlin Gu, Lixin Fan, Qiang Yang, Yuxing Han
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.12014
- Pdf link: https://arxiv.org/pdf/2602.12014
- Abstract
One important direction of Federated Foundation Models (FedFMs) is leveraging data from small client models to enhance the performance of a large server-side foundation model. Existing methods based on model level or representation level knowledge transfer either require expensive local training or incur high communication costs and introduce unavoidable privacy risks. We reformulate this problem as a reinforcement learning style evaluation process and propose FedGRPO, a privacy preserving framework comprising two modules. The first module performs competence-based expert selection by building a lightweight confidence graph from auxiliary data to identify the most suitable clients for each question. The second module leverages the "Group Relative" concept from the Group Relative Policy Optimization (GRPO) framework by packaging each question together with its solution rationale into candidate policies, dispatching these policies to a selected subset of expert clients, and aggregating solely the resulting scalar reward signals via a federated group-relative loss function. By exchanging reward values instead of data or model updates, FedGRPO reduces privacy risk and communication overhead while enabling parallel evaluation across heterogeneous devices. Empirical results on diverse domain tasks demonstrate that FedGRPO achieves superior downstream accuracy and communication efficiency compared to conventional FedFMs baselines.
- 中文摘要
联邦基础模型(FedFMs)的一个重要方向是利用小型客户端模型的数据来提升大型服务器端基础模型的性能。基于模型层级或表示层知识转移的现有方法要么需要昂贵的本地培训,要么产生高昂的通信成本,并带来不可避免的隐私风险。我们将该问题重新表述为强化学习式评估过程,并提出了FedGRPO,这是一个由两个模块组成的隐私保护框架。第一个模块通过从辅助数据构建轻量级置信图,进行基于能力的专家选择,以识别每个问题最合适的客户。第二个模块利用了Group Relative Policy Optimization(GRPO)框架中的“Group Relative”概念,将每个问题及其解理由打包到候选策略中,将这些策略发送给选定的专家客户子集,并通过联邦组-相对损失函数仅汇总所得的标量奖励信号。通过交换奖励值而非数据或模型更新,FedGRPO降低了隐私风险和通信开销,同时实现了异构设备间的并行评估。在多领域任务上的实证结果表明,FedGRPO相比传统FedFM基线在下游精度和通信效率上更优越。
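The "Group Relative" idea borrowed from GRPO scores each candidate against the other candidates in its own group rather than against a learned value baseline. The sketch below shows that normalization on a list of scalar rewards. It does not implement FedGRPO's federated loss, expert selection, or confidence graph. The use of the population standard deviation and the `eps` term are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation; eps avoids division by
    zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scalar rewards returned by four clients for one question.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every reward in the group is the same, no candidate stands out.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

Only scalar rewards enter this computation, which is consistent with the abstract's claim that clients exchange reward values rather than data or model updates.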
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Composition-RL:组合可验证提示,用于大型语言模型的强化学习
- Authors: Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Yang Wang, Saiyong Yang, Can Yang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12036
- Pdf link: https://arxiv.org/pdf/2602.12036
- Abstract
Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at this https URL.
- 中文摘要
大规模可验证提示是可验证奖励强化学习(RLVR)取得成功的基础,但其中包含许多信息量低的样本,且进一步扩充的成本很高。近期研究着眼于优先利用rollout通过率为0的困难提示,以更好地利用有限的训练数据。然而,随着训练的推进,通过率为1的简单提示也越来越多,从而减小了有效数据规模。为缓解这一问题,我们提出了Composition-RL,这是一种简单而有用的方法,针对通过率为1的提示,更好地利用有限的可验证提示。具体而言,Composition-RL自动将多个问题组合成一个新的可验证问题,并用这些组合提示进行强化学习训练。在4B到30B不同模型规模上的大量实验表明,与在原始数据集上训练的强化学习相比,Composition-RL能稳定提升推理能力。Composition-RL的课程式变体在训练过程中逐步增加组合深度,可进一步提升性能。此外,Composition-RL通过组合来自不同领域的提示,实现了更有效的跨领域强化学习。代码、数据集和模型可在 this https URL 获取。
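The shrinking "effective data size" described in this abstract can be made concrete by computing each prompt's rollout pass rate and counting the prompts that still separate good rollouts from bad ones. The sketch below assumes binary verifier outcomes and treats prompts with a pass rate of exactly 0 or 1 as uninformative. It does not show how Composition-RL composes problems, and the prompt names and rollout results are hypothetical.

```python
def pass_rate(rollout_results):
    """Share of rollouts that the verifier accepted."""
    return sum(rollout_results) / len(rollout_results)

def informative_prompts(prompts):
    """Names of prompts whose rollouts are neither all wrong nor all right."""
    return [name for name, results in prompts.items()
            if 0.0 < pass_rate(results) < 1.0]

# Hypothetical verifier outcomes for 8 rollouts per prompt.
prompts = {
    "too_hard": [False] * 8,
    "mixed": [True, False, True, False, False, False, True, False],
    "too_easy": [True] * 8,
}

kept = informative_prompts(prompts)
effective_fraction = len(kept) / len(prompts)
print(kept, effective_fraction)
```

As training progresses, more prompts drift into the `too_easy` bucket, which is the population the abstract says Composition-RL targets.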
Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards
通过在线强化学习与实机基准奖励提升大型语言模型的高性能计算代码生成能力
- Authors: Ryo Mikasa, Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.12049
- Pdf link: https://arxiv.org/pdf/2602.12049
- Abstract
Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a supercomputer and directly feeds back the measured runtime performance (GFLOPS) as a reward. We further introduce a Staged Quality-Diversity (SQD) algorithm that progressively varies the permitted optimization techniques on a per-problem basis, enabling the model to learn code optimization from diverse perspectives. We build a distributed system connecting a GPU training cluster with a CPU benchmarking cluster, and train Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO). Through two experiments, we show that reinforcement learning combining runtime performance feedback with staged optimization can improve the HPC code generation capability of LLMs.
- 中文摘要
大型语言模型(LLM)已展现出强大的代码生成能力,但生成代码的运行时性能并无保证,且在高性能计算领域中,使用运行时性能作为奖励来训练LLM的尝试也很少。我们提出了一种在线强化学习方法,在超级计算机上执行LLM生成的代码,并直接反馈测量到的运行时性能(GFLOPS)作为奖励。我们还进一步引入了分阶段质量多样性(SQD)算法,该算法在每个问题基础上逐步变化允许的优化技术,使模型能够从多角度学习代码优化。我们构建了一个分布式系统,将GPU训练集群与CPU基准集群连接起来,并利用组相对策略优化(GRPO)训练Qwen2.5 Coder 14B执行双精度矩阵乘法任务。通过两项实验,我们表明将运行时性能反馈与分阶段优化相结合的强化学习,可以提升大型语言模型(LLM)在高性能计算(HPC)代码生成能力上的表现。
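A GFLOPS reward for matrix multiplication can be derived from the wall-clock time of a run, using the conventional count of 2n^3 floating-point operations for an n x n x n product. The sketch below times a naive pure-Python multiply as a stand-in for compiled LLM-generated code. The correctness gate (zero reward for wrong results) is an assumption, and the sketch includes neither the paper's distributed benchmarking system nor its SQD algorithm.

```python
import time

def matmul(a, b):
    """Naive dense product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def gflops(n, seconds):
    # Conventional count for an n x n x n product: n^3 multiplies + n^3 adds.
    return 2 * n**3 / seconds / 1e9

def reward(n, seconds, is_correct):
    # Assumption: an incorrect result earns zero reward regardless of speed.
    return gflops(n, seconds) if is_correct else 0.0

n = 64
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

# Every entry of the product of these constant matrices equals 2 * n.
is_correct = all(abs(x - 2.0 * n) < 1e-9 for row in c for x in row)
r = reward(n, elapsed, is_correct)
print(f"{r:.4f} GFLOPS")
```

Measured runtimes are noisy, so a real reward pipeline would likely repeat the measurement and aggregate the results. That detail is omitted here.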
Geometry of Uncertainty: Learning Metric Spaces for Multimodal State Estimation in RL
不确定性几何:学习强化学习多模态状态估计的度量空间
- Authors: Alfredo Reichlin, Adriano Pacciarelli, Danica Kragic, Miguel Vasco
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.12087
- Pdf link: https://arxiv.org/pdf/2602.12087
- Abstract
Estimating the state of an environment from high-dimensional, multimodal, and noisy observations is a fundamental challenge in reinforcement learning (RL). Traditional approaches rely on probabilistic models to account for the uncertainty, but often require explicit noise assumptions, in turn limiting generalization. In this work, we contribute a novel method to learn a structured latent representation, in which distances between states directly correlate with the minimum number of actions required to transition between them. The proposed metric space formulation provides a geometric interpretation of uncertainty without the need for explicit probabilistic modeling. To achieve this, we introduce a multimodal latent transition model and a sensor fusion mechanism based on inverse distance weighting, allowing for the adaptive integration of multiple sensor modalities without prior knowledge of noise distributions. We empirically validate the approach on a range of multimodal RL tasks, demonstrating improved robustness to sensor noise and superior state estimation compared to baseline methods. Our experiments show enhanced performance of an RL agent via the learned representation, eliminating the need of explicit noise augmentation. The presented results suggest that leveraging transition-aware metric spaces provides a principled and scalable solution for robust state estimation in sequential decision-making.
- 中文摘要
从高维、多模态和噪声观测中估算环境状态是强化学习(RL)中的一个根本挑战。传统方法依赖概率模型来考虑不确定性,但通常需要明确的噪声假设,从而限制了推广。本研究提出了一种新颖的方法,用于学习结构化潜在表征,其中状态间距离与两状态之间转换所需的最小动作数直接相关。所提出的度量空间表述提供了不确定性的几何解释,无需显式概率建模。为此,我们引入了多模态潜在转变模型和基于反距离加权的传感器融合机制,允许在不先验噪声分布的情况下自适应整合多种传感器模态。我们实证验证了该方法在多种多模态强化学习任务上,证明其对传感器噪声的鲁棒性和优于基线方法的状态估计能力。我们的实验显示,通过学习到的表征,强化学习代理的性能得到了提升,消除了显式噪声增强的需求。所展示的结果表明,利用过渡感知度量空间为顺序决策中的鲁棒状态估计提供了原则性且可扩展的解决方案。
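Inverse distance weighting combines several estimates by giving each a weight proportional to 1 / d^p, so that estimates with a small distance dominate the result. The sketch below applies that rule to per-sensor latent estimates. What the paper uses as the distance is not stated in the abstract, so the sketch treats it as a hypothetical per-sensor disagreement score supplied by the caller. The `power` and `eps` parameters are also assumptions.

```python
def idw_fuse(estimates, distances, power=2.0, eps=1e-9):
    """Fuse equally sized vectors with inverse-distance weights.

    estimates -- list of per-sensor state estimates (lists of floats)
    distances -- hypothetical per-sensor disagreement scores (>= 0);
                 a smaller distance means the sensor is trusted more
    """
    weights = [1.0 / (d ** power + eps) for d in distances]
    total = sum(weights)
    dim = len(estimates[0])
    return [sum(w * e[i] for w, e in zip(weights, estimates)) / total
            for i in range(dim)]

camera = [0.0, 0.0]
lidar = [2.0, 4.0]

# Equal distances reduce to a plain average of the two estimates.
balanced = idw_fuse([camera, lidar], distances=[1.0, 1.0])

# A sensor with a much larger distance contributes almost nothing.
noisy_camera = idw_fuse([camera, lidar], distances=[100.0, 0.1])

print(balanced, noisy_camera)
```

No noise model is specified anywhere in this rule. The weights depend only on distances, which matches the abstract's claim that fusion works without prior knowledge of noise distributions.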
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
GigaBrain-0.5M*:一种从基于世界模型的强化学习中学习的VLA
- Authors: GigaBrain Team: Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.12099
- Pdf link: https://arxiv.org/pdf/2602.12099
- Abstract
Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (this https URL).
- 中文摘要
直接从当前观测预测多步动作块的视觉-语言-动作(VLA)模型,由于场景理解受限、对未来的预判能力较弱,存在固有局限。相比之下,在网络规模视频语料上预训练的视频世界模型具有稳健的时空推理能力和准确的未来预测能力,是增强VLA学习的天然基础。因此,我们提出GigaBrain-0.5M*,这是一个通过基于世界模型的强化学习训练的VLA模型。它建立在GigaBrain-0.5之上,后者在超过10,000小时的机器人操作数据上预训练,其中间版本目前在国际RoboChallenge基准上排名第一。GigaBrain-0.5M*进一步通过RAMP(Reinforcement leArning via world Model-conditioned Policy,基于世界模型条件策略的强化学习)整合了基于世界模型的强化学习,以实现稳健的跨任务适应。实证结果表明,RAMP相比RECAP基线取得了显著的性能提升,在Laundry Folding、Box Packing和Espresso Preparation等高难度任务上提升约30%。尤为重要的是,GigaBrain-0.5M*表现出可靠的长时程执行能力,能够稳定地完成复杂操作任务而不失败,我们项目页面(this https URL)上的真实部署视频验证了这一点。
On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage
关于带有$Q^\star$-近似和部分覆盖的离线强化学习的复杂性
- Authors: Haolin Liu, Braham Snyder, Chen-Yu Wei
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2602.12107
- Pdf link: https://arxiv.org/pdf/2602.12107
- Abstract
We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative by establishing an information-theoretic lower bound. Going substantially beyond this, we introduce a general framework that characterizes the intrinsic complexity of a given $Q^\star$ function class, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). This complexity recovers and improves the quantities underlying the guarantees of Chen and Jiang (2022) and Uehara et al. (2023), and extends to broader settings. Our decision-estimation decomposition can be combined with a wide range of $Q^\star$ estimation procedures, modularizing and generalizing existing approaches. Beyond the general framework, we make further contributions: By developing a novel second-order performance difference lemma, we obtain the first $\epsilon^{-2}$ sample complexity under partial coverage for soft $Q$-learning, improving the $\epsilon^{-4}$ bound of Uehara et al. (2023). We remove Chen and Jiang's (2022) need for additional online interaction when the value gap of $Q^\star$ is unknown. We also give the first characterization of offline learnability for general low-Bellman-rank MDPs without Bellman completeness (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021), a canonical setting in online RL that remains unexplored in offline RL except for special cases. Finally, we provide the first analysis for CQL under $Q^\star$-realizability and Bellman completeness beyond the tabular case.
- 中文摘要
我们研究$Q^\star$近似和部分覆盖下的离线强化学习,这一设定催生了保守$Q$学习(CQL;Kumar et al., 2020)等实用算法,但在理论上受到的关注有限。我们的工作受以下开放问题启发:“在部分覆盖下,$Q^\star$可实现性和Bellman完备性是否足以实现样本高效的离线强化学习?”我们通过建立一个信息论下界给出了否定的回答。在此基础上更进一步,我们引入了一个刻画给定$Q^\star$函数类内在复杂度的通用框架,其灵感来自在线强化学习中的无模型决策估计系数(DEC)(Foster et al., 2023b; Liu et al., 2025b)。这一复杂度恢复并改进了Chen and Jiang (2022)和Uehara et al. (2023)的保证所依赖的量,并可推广到更广泛的设定。我们的决策估计分解可以与多种$Q^\star$估计方法结合,使现有方法模块化并得到推广。在通用框架之外,我们还有以下贡献:通过建立一个新的二阶性能差异引理,我们在部分覆盖下为软$Q$学习得到了首个$\epsilon^{-2}$样本复杂度,改进了Uehara et al. (2023)的$\epsilon^{-4}$界。我们消除了Chen and Jiang (2022)在$Q^\star$的价值间隔未知时对额外在线交互的需求。我们还首次刻画了不具备Bellman完备性的一般低Bellman秩MDP的离线可学习性(Jiang et al., 2017; Du et al., 2021; Jin et al., 2021),这是在线强化学习中的经典设定,但在离线强化学习中除特殊情形外尚未被研究。最后,我们首次在表格情形之外,对$Q^\star$可实现性和Bellman完备性条件下的CQL给出了分析。
Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
停止不必要的反思:通过自适应反思与长度协调惩罚训练大型推理模型(LRM)进行高效推理
- Authors: Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Sheng Guo, Haobo Wang, Junbo Zhao
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12113
- Pdf link: https://arxiv.org/pdf/2602.12113
- Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at this https URL .
- 中文摘要
大型推理模型(LRM)借助测试时扩展,在复杂推理任务上表现出色。然而,它们常常生成过长的思维链,其中充斥着重复自我质疑和循环推理等大量反思,导致词元消耗高、计算开销大、延迟增加,却并未提升准确率,这在较小的模型中尤为明显。我们观察到,问题复杂度的增加会引发更多过度且不必要的反思,进而降低准确率并增加词元开销。为应对这一挑战,我们提出了自适应反思与长度协调惩罚(ARLCP),这是一种新的强化学习框架,旨在动态平衡推理效率与解答准确率。ARLCP引入了两项关键创新:(1)反思惩罚,在保留必要推理的同时自适应地削减不必要的反思步骤;(2)长度惩罚,根据问题的估计复杂度进行校准。通过协调这两种惩罚,ARLCP鼓励模型生成更简洁有效的推理路径。我们使用DeepSeek-R1-Distill-Qwen-1.5B和DeepSeek-R1-Distill-Qwen-7B模型,在五个数学推理基准上评估了我们的方法。实验结果表明,与现有方法相比,ARLCP取得了更优的效率-准确率权衡。对于1.5B模型,它将平均响应长度缩短53.1%,同时将准确率提高5.8%。对于7B模型,它在长度减少35.0%的同时,准确率提升2.7%。代码发布于 this https URL。
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
P-GenRM:具备测试时基于用户扩展的个性化生成式奖励模型
- Authors: Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12116
- Pdf link: https://arxiv.org/pdf/2602.12116
- Abstract
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
- 中文摘要
大型语言模型的个性化对齐旨在使回复适应个体用户的偏好,通常通过强化学习实现。一个关键挑战是在开放式场景中获得准确且针对特定用户的奖励信号。现有的个性化奖励模型面临两个长期存在的局限:(1)将多样化、特定于场景的偏好过度简化为一小组固定的评估原则;(2)在反馈有限的情况下难以泛化到新用户。为此,我们提出了P-GenRM,这是首个具备测试时基于用户缩放的个性化生成奖励模型。P-GenRM将偏好信号转化为结构化的评估链,从而在各种场景中推导出自适应的用户画像和评分标准。它进一步将用户聚类为用户原型,并引入了双粒度缩放机制:在个体层面,它自适应地缩放并聚合每个用户的评分方案;在原型层面,它融入相似用户的偏好。该设计减少了推断偏好中的噪声,并通过基于原型的迁移增强了对未见用户的泛化。实证结果显示,P-GenRM在广泛使用的个性化奖励模型基准上取得了最先进的结果,平均提升2.31%,并在分布外数据集上展现出较强的泛化能力。值得注意的是,测试时基于用户的缩放带来了额外3%的提升,体现了更强的个性化对齐以及测试时可扩展性。
Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning
Meta-Sel:通过监督元学习实现上下文学习的高效演示选择
- Authors: Xubin Wang, Weijia Jia
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12123
- Pdf link: https://arxiv.org/pdf/2602.12123
- Abstract
Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach for intent classification that learns a fast, interpretable scoring function for (candidate, query) pairs from labeled training data. Meta-Sel constructs a meta-dataset by sampling pairs from the training split and using class agreement as supervision, then trains a calibrated logistic regressor on two inexpensive meta-features: TF-IDF cosine similarity and a length-compatibility ratio. At inference time, the selector performs a single vectorized scoring pass over the full candidate pool and returns the top-k demonstrations, requiring no model fine-tuning, no online exploration, and no additional LLM calls. This yields deterministic rankings and makes the selection mechanism straightforward to audit via interpretable feature weights. Beyond proposing Meta-Sel, we provide a broad empirical study of demonstration selection, benchmarking 12 methods (spanning prompt engineering baselines, heuristic selection, reinforcement learning, and influence-based approaches) across four intent datasets and five open-source LLMs. Across this benchmark, Meta-Sel consistently ranks among the top-performing methods, is particularly effective for smaller models where selection quality can partially compensate for limited model capacity, and maintains competitive selection-time overhead.
- 中文摘要
演示选择是上下文学习(ICL)中的一个实际瓶颈:在提示预算紧张的情况下,准确率会因所包含的少样本示例不同而显著变化,但选择过程又必须足够低成本,才能针对每个查询在大型候选池上运行。我们提出了Meta-Sel,一种面向意图分类的轻量级监督元学习方法,它从带标签的训练数据中学习一个快速且可解释的(候选,查询)对评分函数。Meta-Sel从训练集中采样样本对,并以类别一致性作为监督信号来构建元数据集,然后在两个低成本的元特征上训练一个经过校准的逻辑回归器:TF-IDF余弦相似度和长度兼容性比率。在推理时,选择器对整个候选池执行一次向量化评分,并返回前k个演示,无需模型微调、在线探索或额外的LLM调用。这产生了确定性的排序,并使选择机制可以通过可解释的特征权重方便地审计。除了提出Meta-Sel之外,我们还对演示选择进行了广泛的实证研究,在四个意图数据集和五个开源LLM上对12种方法进行了基准测试,涵盖提示工程基线、启发式选择、强化学习和基于影响的方法。在该基准测试中,Meta-Sel始终位列表现最好的方法之中,对较小的模型尤其有效(此时选择质量可以部分弥补有限的模型容量),并保持有竞争力的选择时间开销。
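The selection pipeline described in the Meta-Sel abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it uses only the Python standard library, the function names (`fit_idf`, `meta_features`, `train_selector`, `select_top_k`) and the toy intent data are hypothetical, the length-compatibility ratio is assumed to be shorter length over longer length, and the calibration step mentioned in the abstract is omitted.

```python
import math
from collections import Counter

def fit_idf(texts):
    """Smoothed inverse document frequency learned from a corpus."""
    n = len(texts)
    df = Counter(tok for text in texts for tok in set(text.lower().split()))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; tokens unseen by fit_idf get weight 0."""
    counts = Counter(text.lower().split())
    return {tok: c * idf.get(tok, 0.0) for tok, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def length_ratio(a, b):
    """Assumed length-compatibility feature: shorter / longer token count."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) > 0 else 0.0

def meta_features(candidate, query, idf):
    return [cosine(tfidf(candidate, idf), tfidf(query, idf)),
            length_ratio(candidate, query)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the logistic loss (no calibration step)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - t
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def train_selector(train):
    """Build the meta-dataset from (text, intent) pairs and fit the scorer."""
    idf = fit_idf([text for text, _ in train])
    X, y = [], []
    for i, (ti, li) in enumerate(train):
        for j, (tj, lj) in enumerate(train):
            if i != j:
                X.append(meta_features(ti, tj, idf))
                y.append(1.0 if li == lj else 0.0)  # class agreement label
    w, b = fit_logistic(X, y)
    return idf, w, b

def select_top_k(query, pool, idf, w, b, k):
    """Score every candidate once and return the k highest-scoring ones."""
    def score(text):
        x = meta_features(text, query, idf)
        return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    return sorted(pool, key=score, reverse=True)[:k]

# Toy intent-classification data (hypothetical).
train = [
    ("book a flight to paris", "travel"),
    ("reserve a plane ticket to rome tomorrow", "travel"),
    ("book a train to paris", "travel"),
    ("show my account balance", "bank"),
    ("check my savings account balance today", "bank"),
    ("transfer money between my accounts", "bank"),
]
idf, w, b = train_selector(train)
pool = [text for text, _ in train]
demos = select_top_k("book a flight to berlin", pool, idf, w, b, k=2)
print("feature weights:", w)
print("selected demonstrations:", demos)
```

The learned weights `w` play the role of the interpretable feature weights the abstract mentions: because the scorer is a two-feature logistic model, the ranking is deterministic and can be audited by reading `w` and `b` directly.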
Capability-Oriented Training Induced Alignment Risk
能力导向训练诱导的对齐风险
- Authors: Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12124
- Pdf link: https://arxiv.org/pdf/2602.12124
- Abstract
While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at this https URL.
- 中文摘要
虽然大多数人工智能对齐研究侧重于防止模型生成明确有害的内容,但一种更微妙的风险正在显现:能力导向训练诱导的漏洞利用。我们研究了语言模型在带有隐性漏洞的环境中接受强化学习(RL)训练时,是否会自发学会利用这些缺陷来最大化奖励,即使训练中不存在任何恶意意图。为此,我们设计了一套由四种不同的“漏洞游戏”组成的测试套件,每个游戏都包含一个独特且可被利用的缺陷,分别涉及上下文条件合规、代理指标、奖励篡改和自我评估。我们的实验表明,模型会稳定地学会利用这些漏洞,发现以牺牲任务正确性或安全性为代价、显著提高奖励的机会主义策略。更关键的是,我们发现这些利用策略并非狭隘的“技巧”,而是可泛化的技能;它们可以迁移到新任务,甚至仅凭数据就能从有能力的教师模型“蒸馏”到其他学生模型。我们的研究结果表明,能力导向训练诱导的风险对当前的对齐方法构成了根本性挑战,这意味着未来的人工智能安全工作必须超越内容审核,严格审计并保护训练环境和奖励机制本身。代码可在此 https URL 获取。
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
超越教师的学习:带有奖励外推的广义在策略蒸馏
- Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.12125
- Pdf link: https://arxiv.org/pdf/2602.12125
- Abstract
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
- 中文摘要
在策略蒸馏(OPD)在学生生成的轨迹上将学生与教师的logit分布对齐,在提升学生性能方面展现出显著的实证收益,并且常常优于离策略蒸馏和强化学习(RL)范式。在本研究中,我们首先从理论上证明,OPD是稠密KL约束RL的一个特例,其中奖励函数和KL正则化始终权重相等,且参考模型可以是任意模型。随后,我们提出了广义在策略蒸馏(G-OPD)框架,通过引入灵活的参考模型和奖励缩放因子来扩展标准OPD目标,该因子控制奖励项相对于KL正则化的权重。通过在数学推理和代码生成任务上的全面实验,我们得出了两个新见解:(1)将奖励缩放因子设为大于1(即奖励外推),我们称之为ExOPD,在多种师生模型规模组合下均稳定优于标准OPD。特别是,当我们把对同一学生模型分别应用特定领域RL得到的不同领域专家的知识合并回原始学生模型时,ExOPD使学生甚至能够突破教师的性能边界,超越各领域教师。(2)基于ExOPD,我们进一步发现,在强到弱蒸馏设置下(即从较大的教师蒸馏出较小的学生),选择教师在RL之前的基础模型作为参考模型来进行奖励校正,可以得到更准确的奖励信号,并进一步提升蒸馏性能。然而,这种选择假设能够获得教师在RL之前的版本,并且会带来更多计算开销。我们希望我们的工作能为未来的OPD研究提供新的见解。
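The claim that OPD is a special case of KL-constrained RL with an arbitrary reference model can be illustrated with a short identity. The notation below is our own and is not taken from the paper: $\pi_\theta$ is the student, $\pi_T$ the teacher, $\pi_{\mathrm{ref}}$ the reference model, $\lambda$ the reward scaling factor, and the reward is assumed to be the teacher-to-reference log-ratio.

```latex
\begin{align*}
r(y_t \mid x, y_{<t})
  &= \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})} \\[4pt]
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
  &= \mathbb{E}_{\pi_\theta}\!\left[\log \pi_\theta - \log \pi_{\mathrm{ref}}\right]
   - \mathbb{E}_{\pi_\theta}\!\left[\log \pi_T - \log \pi_{\mathrm{ref}}\right] \\
  &= \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
   - \mathbb{E}_{\pi_\theta}\!\left[r\right] \\[4pt]
J_\lambda(\theta)
  &= \lambda\, \mathbb{E}_{\pi_\theta}\!\left[r\right]
   - \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
  \qquad J_1(\theta) = -\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
\end{align*}
```

With $\lambda = 1$ the reference model cancels and maximizing $J_1$ is exactly the reverse-KL objective of standard OPD, which is why any reference model gives the same objective. With $\lambda > 1$ (the reward extrapolation setting) the reference no longer cancels; for a single next-token distribution considered in isolation, the maximizer is proportional to $\pi_{\mathrm{ref}}^{\,1-\lambda}\,\pi_T^{\,\lambda}$, that is, it moves past the teacher in the direction away from the reference.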
Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning
Seq2Seq2Seq:通过离散潜在Transformer和强化学习实现无损数据压缩
- Authors: Mahdi Khodabandeh, Ghazal Shabani, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2602.12146
- Pdf link: https://arxiv.org/pdf/2602.12146
- Abstract
Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.
- 中文摘要
高效的无损压缩对于在保持数据完整性的同时最小化存储成本和传输开销至关重要。传统的压缩技术,如基于字典的方法和统计方法,往往难以最优地利用复杂数据格式中的结构和冗余。深度学习的最新进展为压缩开辟了新途径;然而,许多现有方法依赖于稠密的向量表示,这会掩盖底层的token结构。为解决这些局限,我们提出了一种新型无损压缩方法,将强化学习应用于T5语言模型架构。这种方法将数据压缩为token序列,而非传统的向量表示。与通常将信息编码到连续潜在空间的自编码器不同,我们的方法保留了基于token的结构,更贴近原始数据格式。这种保留使得在保持语义完整性的同时实现更高的压缩比成为可能。通过使用离策略强化学习算法训练模型,我们优化序列长度,以最小化冗余并提升压缩效率。我们的方法引入了一个建立在先进强化学习技术之上的高效、自适应的数据压缩系统,其运作不依赖外部语法知识或世界知识。与传统方法相比,该方法在压缩比上有显著提升。通过利用语言模型中的潜在信息,我们的系统无需显式的内容理解即可有效压缩数据,为各种应用中更稳健、更实用的压缩方案铺平了道路。
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
DeepGen 1.0:一个轻量级统一多模态模型,用于推进图像生成和编辑技术
- Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.12205
- Pdf link: https://arxiv.org/pdf/2602.12205
- Abstract
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
- 中文摘要
当前用于图像生成和编辑的统一多模态模型通常依赖庞大的参数规模(例如>10B),带来了高昂的训练成本和部署开销。在本研究中,我们提出了DeepGen 1.0,一个轻量级的5B统一模型,其综合能力可与大得多的同类模型相媲美甚至超越它们。为了克服紧凑模型在语义理解和细粒度控制上的局限,我们引入了堆叠通道桥接(SCB),这是一种深度对齐框架,它从多个VLM层中提取层级特征,并将其与可学习的“思考token”融合,为生成骨干网络提供结构化、富含推理信息的引导。我们进一步设计了一种以数据为中心的训练策略,包含三个渐进阶段:(1)在大规模图像-文本对和编辑三元组上进行对齐预训练,以同步VLM和DiT的表示;(2)在由生成、编辑和推理任务组成的高质量混合数据上进行联合监督微调,以培养全方位能力;(3)使用MR-GRPO进行强化学习,它利用多种奖励函数和监督信号的混合,在生成质量和与人类偏好的一致性上取得了显著提升,同时保持稳定的训练过程并避免视觉伪影。尽管仅在约5000万个样本上训练,DeepGen 1.0仍在多项基准测试中取得领先性能,在WISE上超过80B的HunyuanImage 28%,在UniREditBench上超过27B的Qwen-Image-Edit 37%。通过开源训练代码、权重和数据集,我们提供了一个高效、高性能的替代方案,以推动统一多模态研究的普及。
Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
迈向在策略SFT:分布判别理论及其在大语言模型训练中的应用
- Authors: Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.12222
- Pdf link: https://arxiv.org/pdf/2602.12222
- Abstract
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: this https URL
- 中文摘要
监督微调(SFT)计算效率高,但其泛化能力通常不如强化学习(RL)。这一差距主要源于RL对在策略数据的使用。我们提出一个框架,通过实现在策略SFT(On-Policy SFT)来弥合这一鸿沟。我们首先提出分布判别理论(Distribution Discriminant Theory,DDT),它解释并量化了数据与模型诱导分布之间的对齐程度。基于DDT,我们引入了两种互补的技术:(i)分布内微调(In-Distribution Finetuning,IDFT),一种损失层面的方法,用于增强SFT的泛化能力;(ii)提示解码(Hinted Decoding),一种数据层面的技术,可以将训练语料重新对齐到模型的分布。大量实验表明,我们的框架在泛化性能上与主流的离线RL算法(包括DPO和SimPO)相当,同时保持了SFT流水线的效率。因此,该框架在RL不可行的领域提供了一种实用的替代方案。我们在此开源代码:https URL
Any House Any Task: Scalable Long-Horizon Planning for Abstract Human Tasks
Any House Any Task:面向抽象人类任务的可扩展长时程规划
- Authors: Zhihong Liu, Yang Li, Rengming Huang, Cewu Lu, Panpan Cai
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.12244
- Pdf link: https://arxiv.org/pdf/2602.12244
- Abstract
Open world language conditioned task planning is crucial for robots operating in large-scale household environments. While many recent works attempt to address this problem using Large Language Models (LLMs) via prompting or training, a key challenge remains scalability. Performance often degrades rapidly with increasing environment size, plan length, instruction ambiguity, and constraint complexity. In this work, we propose Any House Any Task (AHAT), a household task planner optimized for long-horizon planning in large environments given ambiguous human instructions. At its core, AHAT utilizes an LLM trained to map task instructions and textual scene graphs into grounded subgoals defined in the Planning Domain Definition Language (PDDL). These subgoals are subsequently solved to generate feasible and optimal long-horizon plans through explicit symbolic reasoning. To enhance the model's ability to decompose complex and ambiguous intentions, we introduce TGPO, a novel reinforcement learning algorithm that integrates external correction of intermediate reasoning traces into Group Relative Policy Optimization (GRPO). Experiments demonstrate that AHAT achieves significant performance gains over state-of-the-art prompting, planning, and learning methods, particularly in human-style household tasks characterized by brief instructions but requiring complex execution plans.
- 中文摘要
开放世界、以语言为条件的任务规划对于在大规模家庭环境中工作的机器人至关重要。虽然许多近期工作尝试通过提示或训练大型语言模型(LLM)来解决这一问题,但可扩展性仍是一个关键挑战。随着环境规模、计划长度、指令模糊性和约束复杂度的增加,性能往往迅速下降。在本研究中,我们提出了Any House Any Task(AHAT),这是一个家庭任务规划器,针对在大型环境中根据模糊的人类指令进行长时程规划而优化。AHAT的核心是一个经过训练的LLM,它将任务指令和文本场景图映射为用规划领域定义语言(PDDL)定义的、已落地(grounded)的子目标。随后通过显式符号推理求解这些子目标,生成可行且最优的长时程计划。为了增强模型分解复杂且模糊的意图的能力,我们引入了TGPO,一种新型强化学习算法,它将对中间推理轨迹的外部纠正整合进组相对策略优化(GRPO)。实验表明,AHAT相比最先进的提示、规划和学习方法取得了显著的性能提升,尤其是在指令简短但需要复杂执行计划的人类风格家庭任务中。
Intrinsic-Energy Joint Embedding Predictive Architectures Induce Quasimetric Spaces
内在能量联合嵌入预测架构诱导拟度量空间
- Authors: Anthony Kobanda, Waris Radji
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.12245
- Pdf link: https://arxiv.org/pdf/2602.12245
- Abstract
Joint-Embedding Predictive Architectures (JEPAs) aim to learn representations by predicting target embeddings from context embeddings, inducing a scalar compatibility energy in a latent space. In contrast, Quasimetric Reinforcement Learning (QRL) studies goal-conditioned control through directed distance values (cost-to-go) that support reaching goals under asymmetric dynamics. In this short article, we connect these viewpoints by restricting attention to a principled class of JEPA energy functions: intrinsic (least-action) energies, defined as infima of accumulated local effort over admissible trajectories between two states. Under mild closure and additivity assumptions, any intrinsic energy is a quasimetric. In goal-reaching control, optimal cost-to-go functions admit exactly this intrinsic form; conversely, JEPAs trained to model intrinsic energies lie in the quasimetric value class targeted by QRL. Moreover, we observe why symmetric finite energies are structurally mismatched with one-way reachability, motivating asymmetric (quasimetric) energies when directionality matters.
- 中文摘要
联合嵌入预测架构(JEPA)旨在通过从上下文嵌入预测目标嵌入来学习表示,从而在潜在空间中诱导出一个标量兼容性能量。相比之下,拟度量强化学习(QRL)通过有向距离值(cost-to-go,即到达目标的代价)研究目标条件控制,这类距离值支持在非对称动力学下到达目标。在这篇短文中,我们把注意力限定在一类有原则的JEPA能量函数上,从而将这两种观点联系起来:内在(最小作用量)能量,其定义为两个状态之间所有可容许轨迹上累积局部代价(local effort)的下确界。在温和的闭合性与可加性假设下,任何内在能量都是拟度量。在目标到达控制中,最优cost-to-go函数恰好具有这种内在形式;反过来,被训练来建模内在能量的JEPA则落在QRL所针对的拟度量值函数类中。此外,我们还说明了为什么对称的有限能量在结构上与单向可达性不匹配,这促使我们在方向性重要时采用非对称(拟度量)能量。
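The central statement of this abstract can be written out in a few lines. The symbols below are illustrative assumptions rather than the paper's notation: $\Gamma(x, y)$ is the set of admissible trajectories from state $x$ to state $y$, and $c \ge 0$ is the local effort of one step. We also assume the trivial trajectory that stays at $x$ is admissible with zero effort.

```latex
\begin{align*}
E(x, y) &= \inf_{\gamma \in \Gamma(x, y)} \sum_{t} c(\gamma_t, \gamma_{t+1}),
  \qquad c \ge 0 \\[4pt]
E(x, x) &= 0, \qquad E(x, y) \ge 0 \\
E(x, z) &\le E(x, y) + E(y, z) \\
E(x, y) &\ne E(y, x) \quad \text{in general}
\end{align*}
```

The triangle inequality is where the closure and additivity assumptions enter: concatenating a trajectory from $x$ to $y$ with one from $y$ to $z$ yields an admissible trajectory from $x$ to $z$ (closure) whose effort is the sum of the two parts (additivity), so the infimum over all trajectories from $x$ to $z$ cannot exceed that sum. Symmetry is not required, which is what separates a quasimetric from a metric and lets $E$ represent one-way reachability.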
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
CM2:面向多轮、多步智能体工具使用的清单奖励强化学习
- Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.12268
- Pdf link: https://arxiv.org/pdf/2602.12268
- Abstract
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: this https URL.
- 中文摘要
人工智能智能体越来越多地被用于解决现实任务,它们在多轮用户交互上进行推理并调用外部工具。然而,将强化学习应用于此类场景仍然困难:现实目标往往缺乏可验证的奖励,而是强调开放式行为;此外,面向多轮、多步智能体工具使用的RL仍未得到充分探索;构建和维护可执行的工具环境成本高昂,限制了规模和覆盖范围。我们提出了CM2,这是一个用清单奖励替代可验证结果奖励的RL框架。CM2将每一轮的预期行为分解为细粒度的二元标准,并配有明确的证据依据和结构化元数据,从而将开放式评判转变为更稳定的分类式决策。为平衡稳定性和信息量,我们的方法采用奖励分配稀疏、评估标准稠密的策略。训练在可扩展的LLM模拟工具环境中进行,避免了为大型工具集进行繁重的工程开发。实验表明,CM2相比监督微调取得了稳定的提升。从8B基础模型出发,在8k样本的RL数据集上训练,CM2在tau^-Bench上比对应的SFT模型高8分,在BFCL-V4上高10分,在ToolSandbox上高12分。其结果与规模相近的开源基线(包括评判模型)持平甚至更优。因此,CM2为优化多轮、多步工具使用智能体提供了一种可扩展的方案,而无需依赖可验证的奖励。代码由开源社区提供:https URL。
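One way to picture "sparse reward assignment but dense evaluation criteria" is the sketch below. It is an assumption-laden illustration, not CM2's actual reward: the `ChecklistItem` fields, the per-turn score (fraction of criteria passed), and the choice to place a single reward on the final turn are all hypothetical, and the binary judgments that would come from a judging model are hard-coded here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChecklistItem:
    criterion: str  # fine-grained requirement for this turn
    passed: bool    # binary judgment (hard-coded here; a judge model in practice)
    evidence: str   # trajectory excerpt that grounds the judgment

def turn_score(items: List[ChecklistItem]) -> float:
    """Fraction of checklist criteria satisfied in one turn."""
    if not items:
        return 0.0
    return sum(item.passed for item in items) / len(items)

def trajectory_rewards(turns: List[List[ChecklistItem]]) -> List[float]:
    """Dense criteria, sparse assignment (illustrative).

    Every turn is judged against its checklist, but only the final turn
    carries a reward, equal to the mean turn score of the episode.
    """
    if not turns:
        return []
    mean_score = sum(turn_score(t) for t in turns) / len(turns)
    return [0.0] * (len(turns) - 1) + [mean_score]

# Hypothetical two-turn episode.
episode = [
    [
        ChecklistItem("calls the search tool", True, "tool_call: search(...)"),
        ChecklistItem("asks for the missing date", False, "no question found"),
    ],
    [
        ChecklistItem("uses the date the user gave", True, "date=2026-02-14"),
        ChecklistItem("confirms the booking to the user", True, "Your booking is confirmed"),
    ],
]
rewards = trajectory_rewards(episode)
print(rewards)  # [0.0, 0.75]
```

Keeping each criterion binary and tied to an evidence string is what turns open-ended judging into classification-style decisions; how those decisions are aggregated and where the reward is placed are design choices that the abstract does not specify.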
Keyword: diffusion policy
Learning to Manipulate Anything: Revealing Data Scaling Laws in Bounding-Box Guided Policies
学习操作任何事物:揭示边界框引导策略中的数据缩放定律
- Authors: Yihao Wu, Jinming Ma, Junbo Tan, Yanzhao Yu, Shoujie Li, Mingliang Zhou, Diyun Xiang, Xueqian Wang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.11885
- Pdf link: https://arxiv.org/pdf/2602.11885
- Abstract
Diffusion-based policies show limited generalization in semantic manipulation, posing a key obstacle to the deployment of real-world robots. This limitation arises because relying solely on text instructions is inadequate to direct the policy's attention toward the target object in complex and dynamic environments. To solve this problem, we propose leveraging bounding-box instructions to directly specify the target object, and further investigate whether data scaling laws exist in semantic manipulation tasks. Specifically, we design a handheld segmentation device with an automated annotation pipeline, Label-UMI, which enables the efficient collection of demonstration data with semantic labels. We further propose a semantic-motion-decoupled framework that integrates object detection and bounding-box guided diffusion policy to improve generalization and adaptability in semantic manipulation. Through extensive real-world experiments on large-scale datasets, we validate the effectiveness of the approach, and reveal a power-law relationship between generalization performance and the number of bounding-box objects. Finally, we summarize an effective data collection strategy for semantic manipulation, which can achieve 85% success rates across four tasks on both seen and unseen objects. All datasets and code will be released to the community.
- 中文摘要
基于扩散的策略在语义操作中泛化能力有限,这是现实世界机器人部署的一个关键障碍。产生这一局限的原因是,在复杂且动态的环境中,仅依赖文本指令不足以将策略的注意力引导到目标物体上。为解决该问题,我们提出利用边界框指令直接指定目标物体,并进一步研究语义操作任务中是否存在数据缩放定律。具体来说,我们设计了一款带有自动标注流水线的手持式分割设备Label-UMI,能够高效地采集带有语义标签的演示数据。我们还提出了一个语义-运动解耦框架,它整合了目标检测和边界框引导的扩散策略,以提升语义操作中的泛化性和适应性。通过在大规模数据集上进行的大量真实世界实验,我们验证了该方法的有效性,并揭示了泛化性能与边界框物体数量之间的幂律关系。最后,我们总结了一种有效的语义操作数据采集策略,该策略在四个任务上对已见和未见物体均可达到85%的成功率。所有数据集和代码将向社区发布。