Arxiv Papers of Today

Generated: 2026-02-27 16:44:28 (UTC+8); arXiv release time: 2026-02-27 20:00 EST (2026-02-28 09:00 UTC+8)

45 relevant papers today

Keyword: reinforcement learning

Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation

Authors: Pengzhen Xie, Huizhi Liang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.22215
Pdf link: https://arxiv.org/pdf/2602.22215
Abstract Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base to provide controllable context and trace of inspiration path for LLMs to generate new scientific ideas. We first propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct external knowledge base. Then, we propose a hybrid retrieval mechanism that is composed of both RAG and GraphRAG to retrieve content with both depth and breadth knowledge. It forms a hybrid context. Thirdly, we propose a Prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment in multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated from the following five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.

SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG

Authors: Xuechen Zhang, Koustava Goswami, Samet Oymak, Jiasi Chen, Nedim Lipka
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22225
Pdf link: https://arxiv.org/pdf/2602.22225
Abstract Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.

UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22296
Pdf link: https://arxiv.org/pdf/2602.22296
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
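
The UpSkill abstract targets pass@k but does not say how the metric is computed. A minimal sketch, assuming the widely used unbiased estimator from code-generation evaluation (draw n samples, count c correct, estimate the chance that at least one of k drawn samples is correct); whether UpSkill uses exactly this estimator is an assumption, and the function name is hypothetical:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 samples and c = 3 correct, `pass_at_k(10, 3, 1)` is the plain accuracy 0.3, while larger k rewards a model whose correct answers are spread across attempts, which is the diversity effect the abstract describes.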

Learning Rewards, Not Labels: Adversarial Inverse Reinforcement Learning for Machinery Fault Detection

Authors: Dhiraj Neupane, Richard Dazeley, Mohamed Reda Bouadjenek, Sunil Aryal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22297
Pdf link: https://arxiv.org/pdf/2602.22297
Abstract Reinforcement learning (RL) offers significant promise for machinery fault detection (MFD). However, most existing RL-based MFD approaches do not fully exploit RL's sequential decision-making strengths, often treating MFD as a simple guessing game (Contextual Bandits). To bridge this gap, we formulate MFD as an offline inverse reinforcement learning problem, where the agent learns the reward dynamics directly from healthy operational sequences, thereby bypassing the need for manual reward engineering and fault labels. Our framework employs Adversarial Inverse Reinforcement Learning to train a discriminator that distinguishes between normal (expert) and policy-generated transitions. The discriminator's learned reward serves as an anomaly score, indicating deviations from normal operating behaviour. When evaluated on three run-to-failure benchmark datasets (HUMS2023, IMS, and XJTU-SY), the model consistently assigns low anomaly scores to normal samples and high scores to faulty ones, enabling early and robust fault detection. By aligning RL's sequential reasoning with MFD's temporal structure, this work opens a path toward RL-based diagnostics in data-driven industrial settings.
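
The abstract says the discriminator's learned reward doubles as an anomaly score. A minimal sketch of the standard Adversarial Inverse Reinforcement Learning quantities, where the discriminator is D = exp(f) / (exp(f) + pi) and the recovered reward is log D - log(1 - D) = f - log pi. Turning the reward into an anomaly score by negating it is an assumption made here for illustration, as are all function names; the paper's exact scoring rule is not given in the abstract.

```python
import math


def airl_discriminator(f_value: float, log_pi: float) -> float:
    """D = exp(f) / (exp(f) + pi), written as a sigmoid of f - log(pi)."""
    return 1.0 / (1.0 + math.exp(-(f_value - log_pi)))


def airl_reward(f_value: float, log_pi: float) -> float:
    """Recovered reward log D - log(1 - D), which equals f - log(pi)."""
    d = airl_discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)


def anomaly_score(f_value: float, log_pi: float) -> float:
    """Hypothetical scoring rule: low reward means high anomaly."""
    return -airl_reward(f_value, log_pi)
```

Under this convention, a transition that looks like healthy (expert) operation gets a large f and therefore a low anomaly score, matching the behaviour the abstract reports for normal versus faulty samples.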

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Authors: Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22495
Pdf link: https://arxiv.org/pdf/2602.22495
Abstract Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a mixture of the teacher and old policies, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
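
One way to read the TRRD description is as the familiar PPO clipped surrogate with the denominator of the likelihood ratio swapped from the old policy to a mixture of teacher and old policy. The sketch below shows that reading for a single token. The mixing weight `alpha`, the clip range `eps`, and the function name are assumptions for illustration; the paper's actual objective may differ in form.

```python
def clipped_ratio_objective(
    p_student: float,
    p_teacher: float,
    p_old: float,
    advantage: float,
    alpha: float = 0.5,  # hypothetical teacher weight in the anchor mixture
    eps: float = 0.2,    # hypothetical clip range, as in PPO
) -> float:
    """Per-token clipped surrogate with a teacher/old-policy anchor."""
    # Anchor distribution: convex mixture of teacher and old policy.
    anchor = alpha * p_teacher + (1.0 - alpha) * p_old
    ratio = p_student / anchor
    # Standard PPO clipping keeps the update inside a trust region.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Setting `alpha=0.0` recovers the ordinary PPO/GRPO surrogate, so the teacher only enters through the anchor; this is one reason a ratio-based formulation needs no separate KL loss weight.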

Space Syntax-guided Post-training for Residential Floor Plan Generation

Authors: Zhuoyang Jiang, Dongqing Zhang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.22507
Pdf link: https://arxiv.org/pdf/2602.22507
Abstract Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at ≤ 7 rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.
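
The integration-based measurements mentioned in the abstract come from classical space syntax, where a space is more integrated the shallower the rest of the layout is from it. A minimal sketch on a plain room-adjacency graph, using mean depth and relative asymmetry RA = 2(MD - 1)/(k - 2); the paper's oracle works on rectangle-space graphs and may normalise differently, and the room names and function names below are hypothetical:

```python
from collections import deque


def mean_depth(graph: dict, root: str) -> float:
    """Average shortest-path depth from root to every other space."""
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search over door connections
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    if len(depth) != len(graph):
        raise ValueError("graph must be connected")
    return sum(depth.values()) / (len(graph) - 1)


def relative_asymmetry(graph: dict, root: str) -> float:
    """RA = 2 (MD - 1) / (k - 2); lower values mean higher integration."""
    k = len(graph)
    if k < 3:
        raise ValueError("need at least three spaces")
    return 2.0 * (mean_depth(graph, root) - 1.0) / (k - 2)


# Hypothetical five-room plan in which the living room is the hub.
plan = {
    "living": ["foyer", "kitchen", "bedroom", "bath"],
    "foyer": ["living"],
    "kitchen": ["living"],
    "bedroom": ["living"],
    "bath": ["living"],
}
```

Here `relative_asymmetry(plan, "living")` is 0.0 and `relative_asymmetry(plan, "kitchen")` is 0.5, so the living room dominates the configuration. A score of this kind is non-differentiable, which is consistent with the abstract's use of filtering and PPO rewards rather than a direct loss.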

A Mathematical Theory of Agency and Intelligence

Authors: Wael Hafez, Chenan Wei, Rodrigo Felipe, Amir Nazeri, Cameron Reid
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2602.22519
Pdf link: https://arxiv.org/pdf/2602.22519
Abstract To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can appear successful while the underlying interaction with the environment degrades. What is missing is a principled measure of how much of the total information a system deploys is actually shared between its observations, actions, and outcomes. We prove this shared fraction, which we term bipredictability, P, is intrinsic to any interaction, derivable from first principles, and strictly bounded: P can reach unity in quantum systems, P equal to, or smaller than 0.5 in classical systems, and lower once agency (action selection) is introduced. We confirm these bounds in a physical system (double pendulum), reinforcement learning agents, and multi turn LLM conversations. These results distinguish agency from intelligence: agency is the capacity to act on predictions, whereas intelligence additionally requires learning from interaction, self-monitoring of its learning effectiveness, and adapting the scope of observations, actions, and outcomes to restore effective learning. By this definition, current AI systems achieve agency but not intelligence. Inspired by thalamocortical regulation in biological systems, we demonstrate a feedback architecture that monitors P in real time, establishing a prerequisite for adaptive, resilient AI.

Agentic AI for Intent-driven Optimization in Cell-free O-RAN

Authors: Mohammad Hossein Shokouhi, Vincent W.S. Wong
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2602.22539
Pdf link: https://arxiv.org/pdf/2602.22539
Abstract Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)-based agents reason and collaborate to achieve operator-defined intents. The open RAN (O-RAN) architecture enables the deployment and coordination of such agents. However, most existing works consider simple intents handled by independent agents, while complex intents that require coordination among agents remain unexplored. In this paper, we propose an agentic AI framework for intent translation and optimization in cell-free O-RAN. A supervisor agent translates the operator intents into an optimization objective and minimum rate requirements. Based on this information, a user weighting agent retrieves relevant prior experience from a memory module to determine the user priority weights for precoding. If the intent includes an energy-saving objective, then an open radio unit (O-RU) management agent will also be activated to determine the set of active O-RUs by using a deep reinforcement learning (DRL) algorithm. A monitoring agent measures and monitors the user data rates and coordinates with other agents to guarantee the minimum rate requirements are satisfied. To enhance scalability, we adopt a parameter-efficient fine-tuning (PEFT) method that enables the same underlying LLM to be used for different agents. Simulation results show that the proposed agentic AI framework reduces the number of active O-RUs by 41.93% when compared with three baseline schemes in energy-saving mode. Using the PEFT method, the proposed framework reduces the memory usage by 92% when compared with deploying separate LLM agents.

Multilingual Safety Alignment Via Sparse Weight Editing

Authors: Jiaming Liang, Zhaoxin Wang, Handing Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22554
Pdf link: https://arxiv.org/pdf/2602.22554
Abstract Large Language Models (LLMs) exhibit significant safety disparities across languages, with low-resource languages (LRLs) often bypassing safety guardrails established for high-resource languages (HRLs) like English. Existing solutions, such as multilingual supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and dependent on scarce multilingual safety data. In this work, we propose a novel, training-free alignment framework based on Sparse Weight Editing. Identifying that safety capabilities are localized within a sparse set of safety neurons, we formulate the cross-lingual alignment problem as a constrained linear transformation. We derive a closed-form solution to optimally map the harmful representations of LRLs to the robust safety subspaces of HRLs, while preserving general utility via a null-space projection constraint. Extensive experiments across 8 languages and multiple model families (Llama-3, Qwen-2.5) demonstrate that our method substantially reduces Attack Success Rate (ASR) in LRLs with negligible impact on general reasoning capabilities, all achieved with a single, data-efficient calculation.

Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22556
Pdf link: https://arxiv.org/pdf/2602.22556
Abstract Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.

Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Authors: Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22584
Pdf link: https://arxiv.org/pdf/2602.22584
Abstract Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
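
Two ingredients of the GRPO stage can be illustrated without the full training loop: a URL-validity reward that checks generated links against the retrieved evidence, and the group-relative advantage that GRPO computes by standardising rewards within a group of sampled answers. This is a sketch under stated assumptions: the regular expression, the choice to give full reward to answers with no URLs, the use of the population standard deviation, and all names are illustrative and not taken from the paper.

```python
import re
from statistics import mean, pstdev

# Hypothetical pattern: an http(s) URL up to whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def url_validity_reward(answer: str, evidence_urls: set) -> float:
    """Fraction of URLs in the answer that also occur in the evidence."""
    urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(answer)]
    if not urls:
        return 1.0  # assumption: an answer without URLs cannot fabricate one
    return sum(u in evidence_urls for u in urls) / len(urls)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Standardise rewards within one group of sampled answers."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, an answer citing one evidence URL and one unseen URL scores 0.5, and within a group of sampled answers only those above the group mean receive a positive advantage. In a multi-dimensional setup, this reward would be combined with the faithfulness, style, and safety terms before standardisation.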

Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

Authors: Guangyu Hu, Xiaofeng Zhou, Wei Zhang, Hongce Zhang
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22609
Pdf link: https://arxiv.org/pdf/2602.22609
Abstract Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and biased toward extreme difficulty where instances are either trivial or intractable. These limitations hinder rigorous evaluation of new verification techniques and encourage overfitting of solver heuristics to a narrow set of problems. To address this, we introduce EvolveGen, a framework for generating hardware model checking benchmarks by combining reinforcement learning (RL) with high-level synthesis (HLS). Our approach operates at an algorithmic level of abstraction in which an RL agent learns to construct computation graphs. By compiling these graphs under different synthesis directives, we produce pairs of functionally equivalent but structurally distinct hardware designs, inducing challenging model checking instances. Solver runtime is used as the reward signal, enabling the agent to autonomously discover and generate small-but-hard instances that expose solver-specific weaknesses. Experiments show that EvolveGen efficiently creates a diverse benchmark set in standard formats (e.g., AIGER and BTOR2) and effectively reveals performance bottlenecks in state-of-the-art model checkers.

Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

Authors: Qin-Wen Luo, Sheng Ren, Xiang Chen, Rui Liu, Jun Fang, Naiqiang Tan, Sheng-Jun Huang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22642
Pdf link: https://arxiv.org/pdf/2602.22642
Abstract Chain-of-Thought (CoT) has substantially empowered Large Language Models (LLMs) to tackle complex reasoning tasks, yet the verbose nature of explicit reasoning steps incurs prohibitive inference latency and computational costs, limiting real-world deployment. While existing compression methods - ranging from self-training to Reinforcement Learning (RL) with length constraints - attempt to mitigate this, they often sacrifice reasoning capability for brevity. We identify a critical failure mode in these approaches: explicitly optimizing for shorter trajectories triggers rapid entropy collapse, which prematurely shrinks the exploration space and stifles the discovery of valid reasoning paths, particularly for challenging questions requiring extensive deduction. To address this issue, we propose Compress responses for Easy questions and Explore Hard ones (CEEH), a difficulty-aware approach to RL-based efficient reasoning. CEEH dynamically assesses instance difficulty to apply selective entropy regularization: it preserves a diverse search space for currently hard questions to ensure robustness, while permitting aggressive compression on easier instances where the reasoning path is well-established. In addition, we introduce a dynamic optimal-length penalty anchored to the historically shortest correct response, which effectively counteracts entropy-induced length inflation and stabilizes the reward signal. Across six reasoning benchmarks, CEEH consistently reduces response length while maintaining accuracy comparable to the base model, and improves Pass@k relative to length-only optimization.

AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising

Authors: Xinxin Yang, Yangyang Tang, Yikun Zhou, Yaolei Liu, Yun Li, Bo Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22650
Pdf link: https://arxiv.org/pdf/2602.22650
Abstract In online advertising, the inherent complexity and dynamic nature of advertising environments necessitate the use of auto-bidding services to assist advertisers in bid optimization. This complexity is further compounded in multi-channel scenarios, where effective allocation of budgets and constraints across channels with distinct behavioral patterns becomes critical for optimizing return on investment. Current approaches predominantly rely on either optimization-based strategies or reinforcement learning techniques. However, optimization-based methods lack flexibility in adapting to dynamic market conditions, while reinforcement learning approaches often struggle to capture essential historical dependencies and observational patterns within the constraints of Markov Decision Process frameworks. To address these limitations, we propose AHBid, an Adaptable Hierarchical Bidding framework that integrates generative planning with real-time control. The framework employs a high-level generative planner based on diffusion models to dynamically allocate budgets and constraints by effectively capturing historical context and temporal patterns. We introduce a constraint enforcement mechanism to ensure compliance with specified constraints, along with a trajectory refinement mechanism that enhances adaptability to environmental changes through the utilization of historical data. The system further incorporates a control-based bidding algorithm that synergistically combines historical knowledge with real-time information, significantly improving both adaptability and operational efficacy. Extensive experiments conducted on large-scale offline datasets and through online A/B tests demonstrate the effectiveness of AHBid, yielding a 13.57% increase in overall return compared to existing baselines.

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

多搜索，少思考：重新思考长程智能体搜索以提升效率与泛化性

Authors: Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu, Shu Xu, Jiaqi Wu, Jiayu Zhang, Xinpeng Liu, Xin Gui, Jingyi Cao, Piaohong Wang, Dingfeng Shi, He Zhu, Tiannan Wang, Yuqing Wang, Maojia Song, Tianyu Zheng, Ge Zhang, Jian Yang, Jiaheng Liu, Minghao Liu, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22675
Pdf link: https://arxiv.org/pdf/2602.22675
Abstract Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
中文摘要 最近的深度研究智能体主要通过扩展推理深度来提升性能，但在搜索密集型场景中这会导致推理成本和延迟较高。此外，跨异质研究环境的泛化仍然具有挑战性。在本研究中，我们提出了 Search More, Think Less（SMTL），一种兼顾效率与泛化的长程智能体搜索框架。SMTL用并行证据采集取代顺序推理，从而在受限的上下文预算下实现高效的上下文管理。为支持跨任务类型的泛化，我们进一步引入了统一的数据合成流程，构建涵盖确定性问答和开放式研究场景的搜索任务，并配备适合任务的评估指标。我们通过监督微调和强化学习训练端到端智能体，在包括BrowseComp（48.6%）、GAIA（75.7%）、Xbench（82.0%）和DeepResearch Bench（45.9%）等基准测试中实现了强劲且常常领先的性能。与Mirothinker-v1.0相比，最多100个交互步骤的SMTL使BrowseComp上的平均推理步骤数减少了70.7%，同时提高了准确性。

Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

强化现实世界服务代理：任务导向对话中的效用与成本平衡

Authors: Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang, Yujie Wang, Wei He, Jinpeng Wang, Chaozheng Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22697
Pdf link: https://arxiv.org/pdf/2602.22697
Abstract The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies the robustness of InteractCS-RL across diverse domains.
中文摘要 大型语言模型（LLM）的快速发展加速了从对话式聊天机器人向通用智能体的转变。然而，有效平衡共情沟通与预算感知决策仍是一个开放挑战。鉴于现有方法未能捕捉这些复杂的策略权衡，我们提出了InteractCS-RL，该框架将任务导向对话重新表述为多粒度强化学习过程。具体来说，我们首先建立了以用户为中心的交互框架，提供高保真的训练环境，使智能体能够与人设驱动的用户动态探索多样化策略。随后，我们引入了成本感知多轮策略优化（CMPO），采用混合优势估计策略。通过整合生成式过程信用并采用PID-拉格朗日成本控制器，CMPO有效引导策略探索用户奖励与全局成本约束之间的帕累托边界。在定制化的真实商业场景中进行的大量实验表明，InteractCS-RL在三个评估维度上显著优于其他基线。在工具-智能体-用户交互基准上的进一步评估验证了InteractCS-RL在不同领域的鲁棒性。
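
Illustrative sketch (not the authors' implementation): the abstract mentions a PID-Lagrangian cost controller but gives no details. The snippet below shows a generic PID update of a Lagrange multiplier on a cost constraint, in the spirit of standard PID-Lagrangian methods for constrained RL. The class name, gains, `cost_limit`, and the `penalized_reward` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDLagrangian:
    """Generic PID controller for a Lagrange multiplier (hypothetical gains)."""
    kp: float = 0.1          # proportional gain (assumed value)
    ki: float = 0.01         # integral gain (assumed value)
    kd: float = 0.05         # derivative gain (assumed value)
    cost_limit: float = 1.0  # target upper bound on the mean episode cost
    integral: float = 0.0
    prev_cost: float = 0.0

    def update(self, mean_cost: float) -> float:
        """Return the new non-negative multiplier given the latest mean cost."""
        error = mean_cost - self.cost_limit
        # Clamp the integral at zero so it cannot wind up while under budget.
        self.integral = max(0.0, self.integral + error)
        # Only react to increasing cost in the derivative term.
        derivative = max(0.0, mean_cost - self.prev_cost)
        self.prev_cost = mean_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Lagrangian-relaxed objective: user reward minus weighted cost."""
    return reward - lam * cost
```

In such a scheme the multiplier grows while the measured cost exceeds the budget and relaxes to zero otherwise, which is one way a policy can be steered along a reward-cost trade-off.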

Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

通过翻译器引导强化学习提升VLM中的几何感知

Authors: Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22703
Pdf link: https://arxiv.org/pdf/2602.22703
Abstract Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: +26.5% on in-domain data, +8.0% on out-of-domain data, and +39.0% on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at this https URL to ensure reproducibility.
中文摘要 视觉语言模型（VLM）由于对基本图示元素的感知有限，常常在几何推理上遇到困难。为应对这一挑战，我们推出了GeoPerceive，这是一个基准测试，包含与领域特定语言（DSL）表示配对的图示实例，以及高效的自动数据生成流水线。这种设计使得几何感知能够独立于推理进行单独评估。为了利用GeoPerceive提供的数据提升VLM的几何感知能力，我们提出了GeoDPO，一个由翻译器引导的强化学习（RL）框架。GeoDPO采用NL转DSL的翻译器，该翻译器基于GeoPerceive数据引擎生成的合成对训练，用于桥接自然语言与DSL。该翻译器有助于计算细粒度的DSL级得分，这些分数作为强化学习中的奖励信号。我们在域内和域外数据集上评估GeoDPO，涵盖几何感知任务以及下游推理。实验结果表明，监督微调（SFT）仅带来边际提升，甚至可能损害域外性能，而GeoDPO实现了显著提升：域内数据+26.5%，域外数据+8.0%，下游推理任务+39.0%。这些发现凸显了GeoDPO相较SFT的优越性能和泛化能力。所有代码均在此 https URL 发布，以确保可复现性。

Same Words, Different Judgments: Modality Effects on Preference Alignment

同一句话，不同的判断：模态对偏好对齐的影响

Authors: Aaron Broukhim, Nadir Weibel, Eshin Jolly
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.22710
Pdf link: https://arxiv.org/pdf/2602.22710
Abstract Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences, but its application to speech remains underexplored. We present a controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. Audio preferences prove as reliable as text, with inter-rater agreement reaching good levels (ICC(2,k) ≈ 0.80) at ~9 raters, the first ICC-based reliability characterization in the preference annotation literature for either modality. However, modality reshapes how people judge: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Synthetic ratings further align with human judgments and predict inter-rater agreement, supporting their use both for triaging ambiguous pairs and as full replacements for human annotations.
中文摘要 基于偏好的强化学习（PbRL）是将人工智能系统与人类偏好对齐的主要框架，但其在语音中的应用尚未被充分探索。我们进行了一项对人类与合成偏好标注的受控跨模态研究，比较了100个提示中相同语义内容的文本与音频评估。音频偏好与文本偏好同样可靠，评分者间一致性在约9名评分者时达到良好水平（ICC(2,k) ≈ 0.80），这是偏好标注文献中针对任一模态的首个基于ICC的可靠性刻画。然而，模态改变了人们的判断方式：音频评分者表现出更窄的决策阈值、更少的长度偏差和更面向用户的评估标准，且跨模态一致性接近随机水平。合成评分进一步与人类判断相符，并能预测评分者间一致性，支持将其用于歧义样本对的分诊以及作为人类标注的完整替代。
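
For readers unfamiliar with the reliability statistic quoted above, the following is a minimal sketch of ICC(2,k) (two-way random effects, absolute agreement, average of k raters) computed from ANOVA mean squares. The function name and the assumption of a complete items-by-raters matrix are illustrative; the paper's own analysis pipeline is not described in the abstract.

```python
def icc2k(ratings):
    """ICC(2,k) for a complete matrix: rows are rated items, columns are raters."""
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (items, raters, residual).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Average-measures, absolute-agreement form.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
```

Because the statistic averages over k raters, it rises as raters are added, which is why a reliability level is reported together with a rater count.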

RLHFless: Serverless Computing for Efficient RLHF

RLHFless：高效RLHF的无服务器计算

Authors: Rui Wei, Hanfei Yu, Shubham Jain, Yogarajan Sivakumar, Devesh Tiwari, Jian Li, Seung-Jong Park, Hao Wang
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2602.22718
Pdf link: https://arxiv.org/pdf/2602.22718
Abstract Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further challenges training efficiency due to expanding model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction and efficient execution. However, they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and resource wastage. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline.
中文摘要 来自人类反馈的强化学习（RLHF）已被广泛应用于大型语言模型（LLM）的后训练，以使模型输出与人类偏好保持一致。近期模型，如DeepSeek-R1，也展示了RLHF在提升复杂任务中LLM推理能力的潜力。在强化学习中，推理与培训并存，在整个工作流程中产生动态的资源需求。与传统强化学习相比，RLHF因模型规模扩大和资源消耗而进一步挑战训练效率。多个RLHF框架旨在平衡灵活抽象和高效执行。然而，它们依赖于服务器型基础设施，而这些基础设施在细粒度资源变异方面存在困难。因此，在同步RLHF训练过程中，RL组件之间或内部的空闲时间常导致开销和资源浪费。为解决这些问题，我们介绍了RLHFless，这是首个基于无服务器计算环境的可扩展同步RLHF训练框架。RLHFless 适应整个 RLHF 流水线的动态资源需求，预先计算共享前缀以避免重复计算，并采用一种成本感知型演员缩放策略，考虑响应长度的变化，以找到成本更低、速度更快的最佳点。此外，RLHFless 高效分配工作负载，以减少职能内部不平衡和空闲时间。在物理测试平台和大规模模拟集群上的实验显示，RLHFless 相比最先进的基线实现了高达 1.35 倍的速度和 44.8% 的成本降低。

Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

用单步LLM流水线生成替代多步组装的数据准备流程，用于表格问答

Authors: Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao, Tianyi Li, Christian S. Jensen
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22721
Pdf link: https://arxiv.org/pdf/2602.22721
Abstract Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism to automatically obtain fine-grained pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss in data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2.2× reduction in monetary cost.
中文摘要 表格问答（TQA）旨在基于结构化表格回答自然语言问题。大型语言模型（LLM）为这一问题提供了有前景的解决方案，其中以操作为中心、以多步方式生成表格操作流水线的方案具有最先进的性能。然而，这些解决方案依赖于多次LLM调用，导致延迟过长且计算成本高昂。我们提出了Operation-R1，这是首个通过一种带可验证奖励的强化学习新变体来训练轻量级LLM（如Qwen-4B/1.7B）的框架，使其在单一推理步骤内生成高质量的TQA数据准备流水线。为了训练此类LLM，我们首先引入了一种自监督奖励机制，自动获得细粒度的流水线级监督信号，用于LLM训练。我们还提出了方差感知的组重采样以缓解训练不稳定性。为进一步增强流水线生成的鲁棒性，我们开发了两种互补机制：操作合并，通过多候选共识过滤虚假操作；自适应回滚，在数据转换时提供防止信息丢失的运行时保护。在两个基准数据集上的实验显示，使用相同的LLM骨干，Operation-R1 相比多步准备基线的平均绝对准确率提升了 9.55 和 6.08 个百分点，同时实现了 79% 的表格压缩和 2.2 倍的成本降低。
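
Illustrative sketch (an assumption, not the paper's algorithm): the abstract describes "operation merge" only as filtering spurious operations through multi-candidate consensus. One simple reading is a support threshold over several sampled pipelines, shown below. The function name, the `min_support` parameter, and the operation names in the test data are hypothetical.

```python
from collections import Counter


def merge_operations(candidates, min_support=0.5):
    """Keep operations that appear in at least `min_support` of the candidates.

    `candidates` is a list of pipelines, each a list of operation names.
    Operations are returned in first-seen order across the candidates.
    """
    # Count each operation once per candidate pipeline.
    counts = Counter(op for pipeline in candidates for op in set(pipeline))
    threshold = min_support * len(candidates)
    merged = []
    for pipeline in candidates:
        for op in pipeline:
            if counts[op] >= threshold and op not in merged:
                merged.append(op)
    return merged
```

An operation proposed by only one of several sampled pipelines falls below the threshold and is dropped, while operations that most candidates agree on survive.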

Generative Recommendation for Large-Scale Advertising

大规模广告的生成式推荐

Authors: Ben Xue, Dan Liu, Lixiang Wang, Mingjie Sun, Peng Wang, Pengfei Zhang, Shaoyun Shi, Tianyu Xu, Yunhao Sha, Zhiqiang Liu, Bo Kong, Bo Wang, Hang Yang, Jieting Xue, Junhao Wang, Shengyu Wang, Shuping Hui, Wencai Ye, Xiao Lin, Yongzhi Li, Yuhang Chen, Zhihui Yin, Quan Chen, Shiyang Wen, Wenjin Wu, Han Li, Guorui Zhou, Changcheng Li, Peng Jiang
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22732
Pdf link: https://arxiv.org/pdf/2602.22732
Abstract Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture, learning, and serving, named GR4AD (Generative Recommendation for ADvertising). As for tokenization, GR4AD proposes UA-SID (Unified Advertisement Semantic ID) to capture complicated business information. Furthermore, GR4AD introduces LazyAR, a lazy autoregressive decoder that relaxes layer-wise dependencies for short, multi-candidate generation, preserving effectiveness while reducing inference cost, which facilitates scaling under fixed serving budgets. To align optimization with business value, GR4AD employs VSL (Value-Aware Supervised Learning) and proposes RSPO (Ranking-Guided Softmax Preference Optimization), a ranking-aware, list-wise reinforcement learning algorithm that optimizes value-based rewards under list-level metrics for continual online updates. For online inference, we further propose dynamic beam serving, which adapts beam width across generation levels and online load to control compute. Large-scale online A/B tests show up to 4.2% ad revenue improvement over an existing DLRM-based stack, with consistent gains from both model scaling and inference-time scaling. GR4AD has been fully deployed in the Kuaishou advertising system with over 400 million users and achieves high-throughput real-time serving.
中文摘要 生成式推荐因其扩展潜力和更强的模型容量，近年来在业界引起了广泛关注。然而，在大规模广告中部署实时生成式推荐需要超越大型语言模型（LLM）式训练和提供配方的设计。我们推出了一款面向生产环境的生成式推荐工具，由架构、学习和服务共同设计，名为GR4AD（生成推荐广告推荐）。至于令牌化，GR4AD提出了UA-SID（统一广告语义ID）以捕捉复杂的业务信息。此外，GR4AD 引入了 LazyAR，一种懒惰自回归解码器，能够放宽层层依赖，实现短时间多候选生成，既保持有效性又降低推理成本，便于在固定服务预算下的扩展。为了使优化与业务价值保持一致，GR4AD 采用了价值感知监督学习（VSL），并提出了 RSPO（排名引导软最大偏好优化），这是一种排名感知、按列表方向的强化学习算法，能够在列表级指标下优化基于价值的奖励，实现持续的在线更新。对于在线推断，我们进一步提出了动态波束服务，即在不同世代层级和在线负载间调整波束宽度以控制计算。大规模在线A/B测试显示，广告收入相比现有基于DLRM的堆栈提升了多达4.2%，并且在模型扩展和推理时间扩展中均有持续提升。GR4AD已全面部署在快手广告系统，拥有超过4亿用户，实现了高通量实时服务。

Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Pixel2Catch：基于单个RGB摄像头的敏捷操作多智能体仿真到现实迁移

Authors: Seongyong Kim, Junhyeon Cho, Kang-Won Lee, Soo-Chul Lim
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.22733
Pdf link: https://arxiv.org/pdf/2602.22733
Abstract To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel-level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion. Furthermore, to achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, we design a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role-specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.
中文摘要 要接住被抛出的物体，机器人必须能够感知物体的运动并及时生成控制动作。本研究不显式估计物体的三维位置，而是采用一种新颖的方法，利用从单幅RGB图像提取的像素级视觉信息来识别物体运动。这些视觉线索捕捉物体位置和尺度的变化，使策略能够推理物体的运动。此外，为了在由配备多指手的机器人手臂组成的高自由度（DoF）系统中实现稳定学习，我们设计了一个异构多智能体强化学习框架，将手臂和手定义为具有不同角色的独立智能体。每个智能体通过角色特定的观测和奖励进行协作训练，所学策略成功地从仿真迁移到现实世界。

Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

知你所知：元认知熵校准用于可验证的强化学习推理

Authors: Qiannian Zhao, Chen Yang, Jinhao Jing, Yunke Zhang, Xuhui Ren, Lu Yu, Shijie Zhang, Hongzhi Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22751
Pdf link: https://arxiv.org/pdf/2602.22751
Abstract Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. In practice, these models are predominantly trained via Reinforcement Learning with Verifiable Rewards (RLVR), yet most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model's intrinsic uncertainty. We term this discrepancy the uncertainty-reward mismatch, under which high- and low-uncertainty solutions are treated equivalently, preventing the policy from "Know What You Know" and impeding the shift from optimizing for correct answers to optimizing effective reasoning paths. This limitation is especially critical in reasoning-centric tasks such as mathematics and question answering, where performance hinges on the quality of the model's internal reasoning process rather than mere memorization of final answers. To address this, we propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs. EGPO estimates per-sample uncertainty using a zero-overhead entropy proxy derived from token-level likelihoods and aligns it with extrinsic correctness through an asymmetric calibration mechanism that preserves correct reasoning while selectively regulating overconfident failures, thereby enabling stable and uncertainty-aware policy optimization. Moreover, EGPO recovers informative learning signals from otherwise degenerate group-based rollouts without modifying the verifier or reward definition. Extensive experiments across multiple benchmarks demonstrate that the proposed EGPO leads to substantial and consistent improvements in reasoning performance, establishing a principled path for advancing LRMs through metacognitive entropy calibration.
中文摘要 大型推理模型（LRM）已成为解决复杂现实任务的强大范式。实际上，这些模型主要通过可验证奖励的强化学习（RLVR）训练，但大多数现有的仅基于结果的RLVR流程几乎完全依赖二元正确性信号，并在很大程度上忽视模型的内在不确定性。我们将这种差异称为不确定性-奖励不匹配，即高不确定性和低不确定性的解被同等对待，使策略无法做到“知其所知”（Know What You Know），并阻碍了从优化正确答案向优化有效推理路径的转变。这一限制在以推理为中心的任务中尤为关键，如数学和问答，因为其表现取决于模型内部推理过程的质量，而非单纯对最终答案的记忆。为此，我们提出了EGPO，一种元认知熵校准框架，明确将内在不确定性整合进RLVR以增强LRM。EGPO利用由词元级似然导出的零开销熵代理估算每个样本的不确定性，并通过非对称校准机制将其与外在正确性对齐，在保留正确推理的同时选择性地调控过度自信的失败，从而实现稳定且不确定性感知的策略优化。此外，EGPO能够从原本退化的基于组的rollout中恢复有信息量的学习信号，而无需修改验证器或奖励定义。跨多个基准测试的广泛实验表明，所提出的EGPO能够显著且一致地提升推理性能，为通过元认知熵校准推动LRM的发展奠定了有原则的路径。
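
Illustrative sketch (assumptions throughout, not EGPO itself): the abstract mentions an entropy proxy derived from token-level likelihoods and an asymmetric calibration that leaves correct answers alone while regulating overconfident failures. The snippet uses the mean negative log-likelihood of the sampled tokens as the proxy and a hypothetical confidence-scaled penalty for wrong answers. The function names and `penalty_scale` are invented for this example.

```python
import math


def entropy_proxy(token_logprobs):
    """Mean negative log-likelihood of the sampled tokens (nats per token).

    These log-probabilities are already produced during sampling, so the
    proxy needs no extra forward pass.
    """
    return -sum(token_logprobs) / len(token_logprobs)


def calibrated_reward(correct, uncertainty, penalty_scale=0.5):
    """Hypothetical asymmetric shaping of a binary correctness reward.

    Correct responses keep the full reward regardless of uncertainty.
    Incorrect responses are penalized more when the model was confident.
    """
    if correct:
        return 1.0
    confidence = math.exp(-uncertainty)  # in (0, 1]; 1 means fully confident
    return -penalty_scale * confidence
```

Under this shaping, two wrong answers in the same rollout group no longer receive identical rewards, so the group can still carry a learning signal.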

Towards Better RL Training Data Utilization via Second-Order Rollout

通过二阶rollout实现更好的强化学习训练数据利用

Authors: Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22765
Pdf link: https://arxiv.org/pdf/2602.22765
Abstract Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on improving generation capability by training with only first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the potential of training data because it neglects critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
中文摘要 强化学习（RL）赋予大型语言模型（LLM）强大的推理能力，但普通强化学习仅使用一阶rollout（对一个问题生成多个回答）进行训练，主要侧重于提升生成能力；我们认为这种方法忽视了批评能力的训练，因而未能充分发挥训练数据的潜力。为解决这一问题，我们进一步引入了二阶rollout（为一个回答生成多个批评）的概念，并提出了一个统一的框架，用于联合训练生成和批评能力。在各种模型和数据集上的大量实验表明，我们的方法比普通强化学习更有效地利用训练数据，并且在相同训练数据下能实现更好的性能。此外，我们还得到了关于二阶rollout和批评训练的若干有启发性的发现，如标签平衡在批评训练中的重要性，以及基于结果的奖励存在噪声问题，而该问题可以通过采样技术得到缓解。我们的工作对强化学习中的动态数据增强和生成-批评联合训练进行了初步探索，为强化学习训练的进一步发展提供了有意义的启发。

Transformer Actor-Critic for Efficient Freshness-Aware Resource Allocation

Transformer Actor-Critic：实现高效的新鲜度感知资源分配

Authors: Maryam Ansarifard, Mohit K. Sharma, Kishor C. Joshi, George Exarchakos
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.22774
Pdf link: https://arxiv.org/pdf/2602.22774
Abstract Emerging applications such as autonomous driving and industrial automation demand ultra-reliable and low-latency communication (URLLC), where maintaining fresh and timely information is critical. A key performance metric in such systems is the age of information (AoI). This paper addresses AoI minimization in a multi-user uplink wireless network using non-orthogonal multiple access (NOMA), where users offload tasks to a base station. The system must handle user heterogeneity in task sizes, AoI thresholds, and penalty sensitivities, while adhering to NOMA constraints on user scheduling. We propose a deep reinforcement learning (DRL) framework based on proximal policy optimization (PPO), enhanced with a Transformer encoder. The attention mechanism allows the agent to focus on critical user states and capture inter-user dependencies, improving policy performance and scalability. Extensive simulations show that our method reduces average AoI compared to baselines. We also analyze the evolution of attention weights during training and observe that the model progressively learns to prioritize high-importance users. Attention maps reveal meaningful structure: early-stage policies exhibit uniform attention, while later stages show focused patterns aligned with user priority and NOMA constraints. These results highlight the promise of attention-driven DRL for intelligent, priority-aware resource allocation in next-generation wireless systems.
中文摘要 自动驾驶和工业自动化等新兴应用要求超可靠低延迟通信（URLLC），其中保持信息的新鲜与及时至关重要。这类系统中一个关键的性能指标是信息年龄（AoI）。本文研究采用非正交多址接入（NOMA）的多用户上行无线网络中的AoI最小化问题，其中用户将任务卸载到基站。系统必须处理任务大小、AoI阈值和惩罚敏感性等方面的用户异质性，同时遵守NOMA对用户调度的约束。我们提出了基于近端策略优化（PPO）并辅以Transformer编码器的深度强化学习（DRL）框架。注意力机制使智能体能够专注于关键用户状态并捕捉用户间依赖，从而提升策略性能和可扩展性。大量仿真表明，我们的方法相较基线降低了平均AoI。我们还分析了训练过程中注意力权重的演变，观察到模型逐渐学会优先考虑高重要性用户。注意力图揭示了有意义的结构：早期策略表现出均匀的注意力，而后期阶段则显示与用户优先级和NOMA约束一致的聚焦模式。这些结果凸显了注意力驱动的DRL在下一代无线系统中实现智能、优先级感知资源分配的潜力。
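
To make the age-of-information metric concrete, here is a minimal slotted-time sketch under simplifying assumptions that are not taken from the paper: every scheduled user's update is delivered within the slot and resets its age to 1, and every other user's age grows by 1. The function names and the schedule format are hypothetical, and the NOMA constraints on which users may share a slot are not modeled.

```python
def simulate_aoi(schedule, num_users, initial_age=1):
    """Return the per-slot ages for a given schedule.

    `schedule` is a list with one set of served user indices per time slot.
    """
    ages = [initial_age] * num_users
    history = []
    for served in schedule:
        # Served users reset to age 1; everyone else ages by one slot.
        ages = [1 if user in served else age + 1
                for user, age in enumerate(ages)]
        history.append(list(ages))
    return history


def average_aoi(history):
    """Time- and user-averaged age over the simulated horizon."""
    total = sum(sum(slot) for slot in history)
    return total / (len(history) * len(history[0]))
```

A scheduling policy, learned or otherwise, can be compared against baselines by feeding its per-slot decisions into `simulate_aoi` and reading off `average_aoi`.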

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

QSIM：通过动作相似度加权Q-学习缓解多智能体强化学习中的高估

Authors: Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22786
Pdf link: https://arxiv.org/pdf/2602.22786
Abstract Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at this https URL.
中文摘要 价值分解（VD）方法在合作多智能体强化学习（MARL）中取得了显著成功。然而，他们依赖最大算符来计算时间差值（TD）目标，导致系统性地高估Q值。这一问题在MARL中尤为严重，因为联合作用空间的组合爆炸性爆发，常导致学习不稳定和策略不优。为解决这个问题，我们提出了QSIM，一种相似度加权的Q学习框架，利用动作相似度重建TD目标。QSIM不直接使用贪婪联合行动，而是在结构化近贪婪联合行动空间上形成相似度加权期望。这种表述允许目标整合来自多样但行为相关行为的Q值，同时将更大影响力分配给与贪婪选择更相似的行为。通过用结构相关的替代方案平滑目标，QSIM有效减少了高估，提升了学习稳定性。大量实验表明，QSIM可以无缝集成多种VD方法，持续优于原始算法的性能和稳定性。此外，实证分析证实QSIM显著减轻了MARL系统性价值高估。代码可在此 https URL 访问。
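
Illustrative sketch (a toy reading of the abstract, not the QSIM algorithm): the target below replaces the max over joint actions with a similarity-weighted average over a near-greedy set. The near-greedy set (joint actions that differ from the greedy one in at most one agent), the similarity measure (fraction of matching per-agent actions), the softmax temperature, and the brute-force enumeration are all assumptions made for a tiny example. A value-decomposition method would not enumerate the joint action space.

```python
import math
from itertools import product


def similarity_weighted_target(q_next, n_actions, n_agents, reward,
                               gamma=0.99, temperature=0.5):
    """TD target using a similarity-weighted expectation near the greedy action.

    `q_next` maps a joint action (tuple of per-agent actions) to a Q-value.
    """
    joint_actions = list(product(range(n_actions), repeat=n_agents))
    greedy = max(joint_actions, key=q_next)

    # Near-greedy set: at most one agent deviates from the greedy joint action.
    near = [a for a in joint_actions
            if sum(x != y for x, y in zip(a, greedy)) <= 1]

    # Similarity = fraction of agents choosing the same action as in `greedy`.
    sims = [sum(x == y for x, y in zip(a, greedy)) / n_agents for a in near]
    exps = [math.exp(s / temperature) for s in sims]
    z = sum(exps)

    expected_q = sum(e / z * q_next(a) for e, a in zip(exps, near))
    return reward + gamma * expected_q
```

Since the weighted average can never exceed the greedy Q-value, a target of this form is at most the standard max-based target, which is the smoothing effect the abstract attributes to the method.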

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Authors: Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22801
Pdf link: https://arxiv.org/pdf/2602.22801
Abstract Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.
中文摘要 扩散模型已成为机器人决策任务的热门选择，近年来也被考虑用于解决自动驾驶任务。然而，它们在自动驾驶中的应用和评估仍限于基于仿真或实验室的环境。扩散模型在大规模复杂现实环境（如端到端自动驾驶，E2E AD）中的全部实力仍未被充分探索。在这项研究中，我们基于大量真实车辆数据和路测，进行了系统且大规模的研究，以释放扩散模型作为端到端自动驾驶规划器的潜力。通过全面且严格控制的研究，我们得到了关于扩散损失空间、轨迹表示和数据规模的关键见解，这些因素都显著影响端到端规划的表现。此外，我们还提供了一种有效的强化学习后训练策略，进一步提升所学规划器的安全性。由此产生的基于扩散的学习框架Hyper Diffusion Planner（HDP）部署在真实车辆平台上，并在6个城市驾驶场景和200公里的真实世界测试中进行了评估，性能比基础模型显著提升了10倍。我们的研究表明，经过合理设计和训练的扩散模型，可以作为复杂真实自动驾驶任务中高效且可扩展的端到端自动驾驶规划器。

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

长期代理任务的组层策略优化

Authors: Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22817
Pdf link: https://arxiv.org/pdf/2602.22817
Abstract Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO can achieve a favorable bias-variance trade-off in stepwise advantage estimation, without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at this https URL.
中文摘要 基于组的强化学习（RL），如GRPO，推动了大型语言模型在长程智能体任务中的表现。为了实现更细粒度的策略更新，近期研究越来越倾向于逐步的基于组的策略优化，该方法独立处理rollout轨迹中的每一步，同时使用记忆模块保留历史上下文。然而，我们发现估计逐步相对优势时存在一个关键问题，即上下文不一致：同一组内的步骤可能具有不同的历史上下文。通过实证，我们发现这一问题可能导致严重有偏的优势估计，从而显著损害策略优化。为解决这一问题，本文提出了面向长程智能体任务的分层分组策略优化（Hierarchy-of-Groups Policy Optimization，HGPO）。具体来说，在一组rollout轨迹中，HGPO根据历史上下文的一致性将每个步骤分配给多个层级组。然后，对于每个步骤，HGPO在每个组内分别计算优势，并用自适应加权方案进行聚合。通过这种方式，HGPO可以在逐步优势估计中实现有利的偏差-方差权衡，而无需额外的模型或rollout。在两个具有挑战性的智能体任务ALFWorld和WebShop上使用Qwen2.5-1.5B-Instruct和Qwen2.5-7B-Instruct进行的评估显示，HGPO在相同的计算约束下显著优于现有智能体强化学习方法。代码可在此 https URL 访问。
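
Illustrative sketch (a simplified reading, not the HGPO implementation): each step is placed in several nested groups defined by how much recent history it shares with other steps, a mean-baseline advantage is computed in each group, and the results are combined with weights that favor more context-consistent groups. The suffix-based grouping, the linear `level + 1` weights, and the skipping of singleton groups are all assumptions made for this example.

```python
def suffix(history, level):
    """Last `level` items of a history tuple, or None if it is too short."""
    if level > len(history):
        return None
    return history[len(history) - level:]


def hierarchical_advantages(steps, max_level=2):
    """Aggregate group-relative advantages over nested context groups.

    `steps` is a list of (history, ret) pairs, where `history` is a tuple of
    past observations or actions and `ret` is the return credited to the step.
    """
    advantages = []
    for history, ret in steps:
        num, den = 0.0, 0.0
        for level in range(max_level + 1):
            key = suffix(history, level)
            if key is None:
                break
            # Level 0 groups all steps; higher levels require matching context.
            peers = [r for h, r in steps if suffix(h, level) == key]
            if len(peers) < 2:
                continue  # a singleton group gives no relative signal
            weight = float(level + 1)  # assumed: trust consistent groups more
            num += weight * (ret - sum(peers) / len(peers))
            den += weight
        advantages.append(num / den if den else 0.0)
    return advantages
```

Coarse groups contain many steps, giving low variance but mixed contexts, while fine groups are context-consistent but small; weighting across levels is one way to trade these off without extra rollouts.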

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

从盲点到收益：大型多模态模型的诊断驱动迭代训练

Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.22859
Pdf link: https://arxiv.org/pdf/2602.22859
Abstract As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at this https URL.
中文摘要 随着大型多模态模型（LMM）的规模扩大和强化学习（RL）方法的成熟，LMM在复杂推理和决策方面取得了显著进展。然而，训练仍然依赖静态数据和固定配方，这使得诊断能力盲点或提供动态、有针对性的强化变得困难。受“测试驱动的错误暴露和基于反馈的纠正优于重复练习”这一发现的启发，我们提出了诊断驱动的渐进进化（DPE），这是一个螺旋循环：诊断引导数据生成和强化，每次迭代重新诊断更新后的模型，以推动下一轮有针对性的改进。DPE有两个关键组成部分。首先，多个智能体对海量无标签多模态数据进行标注和质量控制，利用网页搜索和图像编辑等工具，生成多样且逼真的样本。其次，DPE将失败归因于特定弱点，动态调整数据配比，并引导智能体生成针对弱点的数据以进行有针对性的强化。在Qwen3-VL-8B-Instruct和Qwen2.5-VL-7B-Instruct上的实验在11个基准测试中显示出稳定且持续的提升，表明DPE是开放任务分布下持续训练LMM的可扩展范式。我们的代码、模型和数据在此 https URL 公开。

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

MSJoE：联合进化MLLM与采样器，以实现高效的长视频理解

Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.22932
Pdf link: https://arxiv.org/pdf/2602.22932
Abstract Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM, and 1.1% higher accuracy than the strongest baseline method.
中文摘要 高效理解长视频仍然是多模态大型语言模型（MLLM）面临的根本挑战。本文提出了MLLM-采样器联合进化（MSJoE），这是一个新颖的框架，联合进化MLLM和一个轻量级关键帧采样器，以实现高效的长视频理解。MSJoE 基于一个关键假设：对于关于视频的每个问题，只有一小部分关键帧真正包含有用的信息。具体来说，MSJoE 首先推理出若干查询，这些查询描述了与问题相关的多样视觉视角。然后，这些查询与冻结的CLIP模型交互，生成查询-帧相似度矩阵。最后，轻量级采样器根据该矩阵预测关键帧采样权重，选出一组紧凑的信息帧，再输入MLLM以生成答案。MLLM 和采样器通过强化学习联合优化，实现查询推理、帧采样和关键帧理解的协同适应。为支持训练过程，我们收集了一个新的长视频问答数据集，包含2.8K个视频和7K个问答对。在VideoMME、LongVideoBench、LVBench和MLVU上的大量实验表明，MSJoE相比基础MLLM实现了8.0%的准确率提升，并比最强基线方法的准确率高出1.1%。
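The pipeline in this abstract (query-frame similarity matrix, then sampling weights, then a compact set of frames) can be sketched in a few lines. The sketch below is a hand-written stand-in, not the paper's learned sampler: it assumes a max-over-queries score followed by a softmax and a top-k cut, and the function name `select_key_frames` is hypothetical.

```python
import math

def select_key_frames(similarity, k):
    """Pick k frames from a query-frame similarity matrix.

    similarity: list of rows, one row per query, one score per frame.
    Assumption: a fixed max-then-softmax rule replaces the learned sampler.
    """
    num_frames = len(similarity[0])
    # Score each frame by its best match over all queries.
    frame_scores = [max(row[f] for row in similarity) for f in range(num_frames)]
    # Softmax turns the scores into sampling weights that sum to 1.
    peak = max(frame_scores)
    exps = [math.exp(s - peak) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Keep the k highest-weight frames and return them in temporal order.
    top = sorted(range(num_frames), key=lambda f: weights[f], reverse=True)[:k]
    return sorted(top), weights

# Two queries scored against four frames.
frames, weights = select_key_frames(
    [[0.1, 0.9, 0.2, 0.3],
     [0.8, 0.1, 0.2, 0.4]],
    k=2,
)
```

In this toy input each query matches a different frame strongly, so both of those frames survive the cut while the weakly matched frames are dropped.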

FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

FactGuard：通过强化学习进行智能体式视频虚假信息检测

Authors: Zehao Li, Hongwei Yu, Hao Jiang, Qiang Sheng, Yilong Xu, Baolong Bi, Yang Li, Zhenlong Yuan, Yujun Cai, Zhaoqi Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22963
Pdf link: https://arxiv.org/pdf/2602.22963
Abstract Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard's state-of-the-art performance and validate its excellent robustness and generalization capacity.
中文摘要 多模态大型语言模型（MLLM）通过统一的多模态推理在视频虚假信息检测方面取得了显著进步，但它们通常依赖固定深度的推理，并过度信任内部生成的假设，尤其是在关键证据稀缺、零散或需要外部验证的情况下。为解决这些局限性，我们提出了FactGuard，这是一个用于视频虚假信息检测的智能体框架，将验证构建为基于MLLM的迭代推理过程。FactGuard明确评估任务模糊性，并有选择地调用外部工具获取关键证据，从而实现推理轨迹的逐步优化。为进一步强化这一能力，我们引入了一种两阶段训练策略，将领域特定的智能体监督微调与决策感知强化学习相结合，以优化工具使用并校准风险敏感的决策。在FakeSV、FakeTT和FakeVV上的大量实验展示了FactGuard的最先进性能，并验证了其出色的鲁棒性和泛化能力。

A Perspective on Open Challenges in Deformable Object Manipulation

关于可变形物体操作中开放挑战的视角

Authors: Ryan Paul McKennaa, John Oyekan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.22998
Pdf link: https://arxiv.org/pdf/2602.22998
Abstract Deformable object manipulation (DOM) represents a critical challenge in robotics, with applications spanning healthcare, manufacturing, food processing, and beyond. Unlike rigid objects, deformable objects exhibit infinite dimensionality, dynamic shape changes, and complex interactions with their environment, posing significant hurdles for perception, modeling, and control. This paper reviews the state of the art in DOM, focusing on key challenges such as occlusion handling, task generalization, and scalable, real-time solutions. It highlights advancements in multimodal perception systems, including the integration of multi-camera setups, active vision, and tactile sensing, which collectively address occlusion and improve adaptability in unstructured environments. Cutting-edge developments in physically informed reinforcement learning (RL) and differentiable simulations are explored, showcasing their impact on efficiency, precision, and scalability. The review also emphasizes the potential of simulated expert demonstrations and generative neural networks to standardize task specifications and bridge the simulation-to-reality gap. Finally, future directions are proposed, including the adoption of graph neural networks for high-level decision-making and the creation of comprehensive datasets to enhance DOM's real-world applicability. By addressing these challenges, DOM research can pave the way for versatile robotic systems capable of handling diverse and dynamic tasks with deformable objects.
中文摘要 可变形物体操作（DOM）是机器人技术中的一个关键挑战，其应用涵盖医疗、制造、食品加工等多个领域。与刚性物体不同，可变形物体具有无限维度、动态形状变化以及与环境的复杂交互，这对感知、建模和控制构成了重大障碍。本文回顾了DOM的最新技术，重点关注遮挡处理、任务泛化以及可扩展的实时解决方案等关键挑战。文章强调了多模态感知系统的进展，包括多摄像头配置、主动视觉和触觉感知的集成，这些技术共同解决了遮挡问题，提升了在非结构化环境中的适应性。本文探讨了物理信息强化学习（RL）和可微分仿真的前沿发展，展示了它们对效率、精度和可扩展性的影响。综述还强调了仿真专家演示和生成式神经网络在标准化任务规范和弥合仿真到现实差距方面的潜力。最后，提出了未来方向，包括采用图神经网络进行高层决策，以及创建全面的数据集以增强DOM的现实适用性。通过解决这些挑战，DOM研究有望为能够处理多样且动态的可变形物体任务的多功能机器人系统铺平道路。

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

通过混合同策略与异策略优化实现探索性记忆增强LLM智能体

Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23008
Pdf link: https://arxiv.org/pdf/2602.23008
Abstract Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.
中文摘要 探索仍然是用强化学习训练的大型语言模型智能体的关键瓶颈。虽然以往方法利用了预训练知识，但它们在需要发现新状态的环境中会失败。我们提出了探索性记忆增强同策略与异策略优化（Exploratory Memory-Augmented On- and Off-Policy Optimization，简称EMPO$^2$），这是一种混合强化学习框架，利用记忆进行探索，并结合同策略和异策略更新，使大型语言模型在有记忆时表现良好，同时确保无记忆时的鲁棒性。在ScienceWorld和WebShop上，EMPO$^2$分别比GRPO提升了128.6%和11.3%。此外，在分布外测试中，EMPO$^2$ 展现出对新任务更强的适应性，只需借助记忆进行少量试验，且无需参数更新。这些结果凸显了EMPO$^2$是构建更具探索性和泛化能力的基于LLM的智能体的一个有前景的框架。

Learning-based Multi-agent Race Strategies in Formula 1

一级方程式中基于学习的多智能体比赛策略

Authors: Giona Fieni, Joschua Wüthrich, Marc-Philippe Neumann, Christopher H. Onder
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.23056
Pdf link: https://arxiv.org/pdf/2602.23056
Abstract In Formula 1, race strategies are adapted according to evolving race conditions and competitors' actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained single-agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self-play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists' decisions before and during races.
中文摘要 在一级方程式中，比赛策略会根据不断变化的比赛状况和竞争对手的行动进行调整。本文提出了一种用于多智能体比赛策略优化的强化学习方法。智能体学习在能量管理、轮胎衰减、空气动力学相互作用和进站决策之间取得平衡。在预训练的单智能体策略基础上，我们引入了一个考虑竞争对手行为的交互模块。交互模块与自博弈训练方案相结合，生成了具有竞争力的策略，智能体则根据其相对表现进行排名。结果显示，这些智能体会根据对手调整进站时机、轮胎选择和能量分配，从而实现稳健且一致的比赛表现。由于该框架仅依赖于真实比赛期间可用的信息，它可以支持比赛策略师在赛前和赛中做出决策。

GeoWorld: Geometric World Models

GeoWorld：几何世界模型

Authors: Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23058
Pdf link: https://arxiv.org/pdf/2602.23058
Abstract Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: this https URL.
中文摘要 基于能量的预测世界模型通过在潜在能量景观上推理而非生成像素，为多步视觉规划提供了强有力的方法。然而，现有方法面临两个主要挑战：（i）其潜在表示通常在欧几里得空间中学习，忽视了状态间潜在的几何和层级结构；（ii）它们在长时域预测方面存在困难，导致在较长的rollout中性能迅速退化。为应对这些挑战，我们引入了GeoWorld，这是一种几何世界模型，通过双曲JEPA保持几何结构和层级关系，该方法将潜在表示从欧几里得空间映射到双曲流形上。我们进一步引入了用于基于能量的优化的几何强化学习，实现双曲潜空间中的稳定多步规划。在CrossTask和COIN上的大量实验显示，与最先进的V-JEPA 2相比，3步规划中SR提升约3%，4步规划中SR提升约2%。项目网站：此 https URL。

Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios

迈向可理解的人机交互：一种针对遮蔽行人场景的主动推理方法

Authors: Kai Chen, Yuyao Huang, Guang Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23109
Pdf link: https://arxiv.org/pdf/2602.23109
Abstract The sudden appearance of occluded pedestrians presents a critical safety challenge in autonomous driving. Conventional rule-based or purely data-driven approaches struggle with the inherent high uncertainty of these long-tail scenarios. To tackle this challenge, we propose a novel framework grounded in Active Inference, which endows the agent with a human-like, belief-driven mechanism. Our framework leverages a Rao-Blackwellized Particle Filter (RBPF) to efficiently estimate the pedestrian's hybrid state. To emulate human-like cognitive processes under uncertainty, we introduce a Conditional Belief Reset mechanism and a Hypothesis Injection technique to explicitly model beliefs about the pedestrian's multiple latent intentions. Planning is achieved via a Cross-Entropy Method (CEM) enhanced Model Predictive Path Integral (MPPI) controller, which synergizes the efficient, iterative search of CEM with the inherent robustness of MPPI. Simulation experiments demonstrate that our approach significantly reduces the collision rate compared to reactive, rule-based, and reinforcement learning (RL) baselines, while also exhibiting explainable and human-like driving behavior that reflects the agent's internal belief state.
中文摘要 突然出现的遮挡行人是自动驾驶中一个关键的安全挑战。传统的基于规则或纯数据驱动的方法难以应对这些长尾场景固有的高不确定性。为应对这一挑战，我们提出了一个基于主动推理的新框架，赋予代理一种类人、基于信念驱动的机制。我们的框架利用Rao-Blackwellized粒子滤波器（RBPF）高效估算行人的混合状态。为了模拟在不确定性下的类人认知过程，我们引入了条件信念重置机制和假设注入技术，以明确建模行人多重潜在意图的信念。规划通过交叉熵方法（CEM）增强型模型预测路径积分（MPPI）控制器实现，该控制器将高效迭代的CEM搜索与MPPI固有的鲁棒性结合起来。模拟实验表明，我们的方法相比反应式、基于规则和强化学习（RL）的基线，显著降低了碰撞率，同时还表现出可解释且类人化的驾驶行为，反映了智能体的内在信念状态。
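The planner in this abstract builds on the Cross-Entropy Method (CEM). As a minimal sketch of plain CEM only (not the paper's CEM-enhanced MPPI controller), the loop below samples candidates from a Gaussian, keeps the lowest-cost elites, and refits the Gaussian to them. The one-dimensional setting, the function name, and all default parameters are assumptions chosen for illustration.

```python
import random
import statistics

def cross_entropy_method(cost, mean=0.0, std=5.0, samples=64,
                         elites=8, iters=20, seed=0):
    """Minimize a scalar cost with a basic 1-D Cross-Entropy Method.

    Hypothetical sketch: a real planner would sample whole control
    sequences rather than a single scalar.
    """
    rng = random.Random(seed)
    for _ in range(iters):
        # Sample candidates from the current Gaussian.
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        # Keep the lowest-cost candidates as elites.
        best = sorted(candidates, key=cost)[:elites]
        # Refit the Gaussian to the elites; the small floor keeps std positive.
        mean = statistics.fmean(best)
        std = statistics.pstdev(best) + 1e-6
    return mean

# Toy cost with its minimum at x = 3.
solution = cross_entropy_method(lambda x: (x - 3.0) ** 2)
```

A wide initial `std` matters here: if the first samples do not cover the optimum, the distribution can shrink before it gets there.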

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

能动性与架构限制：为何基于优化的系统无法实现规范响应

Authors: Radha Sarma
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.23239
Pdf link: https://arxiv.org/pdf/2602.23239
Abstract AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains. RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations. Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.
中文摘要 人工智能系统越来越多地被部署于高风险场景（医疗诊断、法律研究、金融分析），其前提假设是它们可以受规范约束。本文证明，这一假设对基于优化的系统在形式上是无效的，特别是通过基于人类反馈的强化学习（RLHF）训练的大型语言模型。我们确立了真正的能动性需要两个必要且共同充分的架构条件：能够将某些边界作为不可协商的约束而非可交易的权重来维持（不可通约性）；以及一个能够在这些边界受到威胁时暂停处理的非推理机制（否定式响应性，Apophatic Responsiveness）。这些条件适用于所有规范领域。基于RLHF的系统在构成上与这两个条件都不兼容。使优化变得强大的操作，即把所有价值统一到一个标量度量上并始终选择得分最高的输出，恰恰是排除规范治理的操作。这种不兼容并不是等待技术修复的可纠正训练缺陷；它是优化本身固有的形式约束。因此，有文献记载的失败模式（谄媚、幻觉和不忠实推理）不是偶然，而是结构性的表现。未对齐的部署会引发我们称之为“趋同危机”（Convergence Crisis）的二阶风险：当人类被迫在指标压力下验证人工智能输出时，他们会从真正的能动者退化为按标准核对的优化器，从而消除了系统中唯一具备规范问责能力的组成部分。除了不兼容性证明外，论文的主要正面贡献是提出了一个与基底无关的架构规范，定义了任何系统（无论是生物的、人工的还是制度性的）必须满足什么条件，才能被视为能动者而非精密的工具。

A Model-Free Universal AI

一个无模型的通用人工智能

Authors: Yegon Kim, Juho Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23242
Pdf link: https://arxiv.org/pdf/2602.23242
Abstract In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.
中文摘要 在一般强化学习中，所有已确立的最优智能体，包括AIXI，都是基于模型的，即显式地维护和使用环境模型。本文提出了带Q-归纳的通用人工智能（AIQI），这是首个在一般强化学习中被证明渐近$\varepsilon$-最优的无模型智能体。AIQI对分布型动作-价值函数进行通用归纳，而非像以往工作那样对策略或环境进行归纳。在真理之粒（grain of truth）条件下，我们证明AIQI是强渐近$\varepsilon$-最优的，并且是渐近$\varepsilon$-贝叶斯最优的。我们的结果显著扩展了已知通用智能体的多样性。

SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly

SPARR：用于装配的、带非对称现实世界残差的基于仿真的策略

Authors: Yijie Guo, Iretiayo Akinola, Lars Johannsmeier, Hugo Hadfield, Abhishek Gupta, Yashraj Narang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23253
Pdf link: https://arxiv.org/pdf/2602.23253
Abstract Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and lack generalization ability to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to the state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to the state-of-the-art real-world RL approaches that depend heavily on human supervision.
中文摘要 机器人装配长期以来是一项挑战，因为它需要精确且接触丰富的操作。虽然基于仿真的学习使得开发稳健的装配策略成为可能，但由于仿真到现实的差距，这些策略在现实环境中部署时性能常常下降。相反，现实世界强化学习（RL）方法避免了仿真到现实的差距，但高度依赖人类监督，且缺乏对环境变化的泛化能力。在本研究中，我们提出了一种混合方法，将仿真训练的基础策略与现实世界的残差策略结合起来，以高效适应现实世界的变化。基础策略在仿真中使用低层状态观测和稠密奖励进行训练，为初始行为提供了强先验。残差策略在现实世界中使用视觉观测和稀疏奖励进行学习，用于补偿动力学差异和传感器噪声。大量现实世界实验表明，我们的方法SPARR在多种双部件装配任务中实现了接近完美的成功率。与最先进的零样本仿真到现实方法相比，SPARR的成功率提高了38.4%，同时周期时间减少了29.7%。此外，SPARR不需要人类专业知识，这与高度依赖人类监督的最先进现实世界强化学习方法形成对比。

Physics Informed Viscous Value Representations

物理信息驱动的粘性价值表示

Authors: Hrishikesh Viswanath, Juanwu Lu, S. Talha Bukhari, Damon Conover, Ziran Wang, Aniket Bera
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23280
Pdf link: https://arxiv.org/pdf/2602.23280
Abstract Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at this https URL.
中文摘要 离线目标条件强化学习（GCRL）从静态的预收集数据集中学习目标条件策略。然而，由于对状态-动作空间的覆盖有限，准确的价值估计仍是一个挑战。近期的物理信息方法试图通过在一阶偏微分方程（PDE，如Eikonal方程）上定义的正则化，对价值函数施加物理和几何约束来解决这一问题。然而，这些形式在复杂的高维环境中常常是不适定的。在本研究中，我们提出了一种物理信息正则化，它源自汉密尔顿-雅可比-贝尔曼（HJB）方程的粘性解。通过提供基于物理的归纳偏置，我们的方法将学习过程建立在最优控制理论之上，在价值迭代过程中显式地对更新进行正则化和界定。此外，我们利用费曼-卡茨（Feynman-Kac）定理将偏微分方程的解重新表述为期望，从而可以对目标函数进行易于处理的蒙特卡洛估计，避免高阶梯度中的数值不稳定性。实验表明，我们的方法提升了几何一致性，使其广泛适用于导航和高维、复杂的操作任务。开源代码可在此 https URL 获取。

Simple Models, Real Swimming: Digital Twins for Tendon-Driven Underwater Robots

简单模型，真实游泳：肌腱驱动水下机器人的数字孪生

Authors: Mike Y. Michelis, Nana Obayashi, Josie Hughes, Robert K. Katzschmann
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23283
Pdf link: https://arxiv.org/pdf/2602.23283
Abstract Mimicking the graceful motion of swimming animals remains a core challenge in soft robotics due to the complexity of fluid-structure interaction and the difficulty of controlling soft, biomimetic bodies. Existing modeling approaches are often computationally expensive and impractical for complex control or reinforcement learning needed for realistic motions to emerge in robotic systems. In this work, we present a tendon-driven fish robot modeled in an efficient underwater swimmer environment using a simplified, stateless hydrodynamics formulation implemented in the widespread robotics framework MuJoCo. With just two real-world swimming trajectories, we identify five fluid parameters that allow a matching to experimental behavior and generalize across a range of actuation frequencies. We show that this stateless fluid model can generalize to unseen actuation and outperform classical analytical models such as the elongated body theory. This simulation environment runs faster than real-time and can easily enable downstream learning algorithms such as reinforcement learning for target tracking, reaching a 93% success rate. Due to the simplicity and ease of use of the model and our open-source simulation environment, our results show that even simple, stateless models -- when carefully matched to physical data -- can serve as effective digital twins for soft underwater robots, opening up new directions for scalable learning and control in aquatic environments.
中文摘要 由于流固耦合的复杂性以及控制柔软仿生躯体的难度，模仿游泳动物的优雅动作仍是软体机器人领域的核心挑战。现有建模方法通常计算成本高昂，对于让机器人系统涌现出逼真运动所需的复杂控制或强化学习而言并不实用。本研究提出了一个肌腱驱动的鱼形机器人，它在一个高效的水下游泳环境中建模，该环境采用了在广泛使用的机器人框架MuJoCo中实现的简化无状态流体动力学公式。仅凭两条真实世界的游泳轨迹，我们辨识出五个流体参数，使仿真与实验行为相匹配，并能泛化到一系列驱动频率。我们证明了该无状态流体模型可以泛化到未见过的驱动方式，并且优于细长体理论等经典解析模型。该仿真环境运行速度快于实时，并能轻松支持下游学习算法，例如用于目标跟踪的强化学习，成功率达93%。由于模型的简洁易用性以及我们开源的仿真环境，我们的结果表明，即使是简单的无状态模型，只要与物理数据仔细匹配，也能作为软体水下机器人的有效数字孪生，为水下环境中可扩展的学习和控制开辟新方向。

MediX-R1: Open Ended Medical Reinforcement Learning

MediX-R1：开放式医疗强化学习

Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.23363
Pdf link: https://arxiv.org/pdf/2602.23363
Abstract We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at this https URL
中文摘要 我们介绍了MediX-R1，一种面向医学多模态大型语言模型（MLLM）的开放式强化学习（RL）框架，能够提供超越选择题格式的临床基础、自由形式的答案。MediX-R1通过基于群体的强化学习和针对医学推理量身定制的综合奖励，微调了基线视觉语言骨干：基于LLM的准确性奖励通过严格的是/否判定来判断语义正确性;基于医学嵌入的语义奖励用于捕捉释义和术语变体;以及强化可解释推理和模态识别的轻量级格式和模态奖励。这种多信号设计为开放式输出提供了稳定且信息丰富的反馈，而传统可验证或仅限选择题的奖励则难以实现。为了衡量进展，我们提出了一个统一的评估框架，涵盖纯文本和图像+文本任务，使用基于引用的LLM作为评判者替代脆弱字符串重叠指标，捕捉语义正确性、推理能力和上下文对齐。尽管仅使用$\sim51$K的指令示例，MediX-R1在标准医学LLM（仅文本）和VLM（图片+文本）基准测试中取得了优异成绩，优于强劲的开源基线，并在开放式临床任务中取得了显著提升。我们的结果表明，开放式强化学习结合综合奖励信号和基于LLM的评估，是多模态模型中实现可靠医学推理的切实途径。我们训练好的模型、精心策划的数据集和源代码可在此 https URL 获取

Keyword: diffusion policy

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

何时行动、询问或学习：不确定性感知的策略引导

Authors: Jessie Yuan, Yilin Wu, Andrea Bajcsy
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22474
Pdf link: https://arxiv.org/pdf/2602.22474
Abstract Policy steering is an emerging way to adapt robot behaviors at deployment-time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, the overconfident judgment from VLM can degrade the steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable at the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at this https URL
中文摘要 策略引导是一种在部署时调整机器人行为的新兴方式：一个学习得到的验证器分析由预训练策略（如扩散策略）提出的低层动作样本，并仅选择与任务一致的样本。虽然视觉语言模型（VLM）因其推理能力而成为有前景的通用验证器，但现有框架通常假设这些模型已经过良好校准。实际上，在任务规范存在高层语义不确定性，以及预训练策略存在低层动作不确定性或能力不足的情况下，VLM过度自信的判断会降低引导性能。我们提出了不确定性感知策略引导（UPS），该框架对语义任务不确定性和低层动作可行性进行联合推理，并选择一种不确定性消解策略：执行高置信度动作，通过自然语言查询澄清任务歧义，或在低层策略被判定无法胜任任务时请求动作干预以进行纠正。我们利用共形预测来校准VLM与预训练基础策略的组合，为验证器选择正确策略提供统计保证。在部署期间收集干预数据后，我们利用残差学习提升预训练策略的能力，使系统能够持续学习，同时只需极少的昂贵人工反馈。我们通过仿真和硬件实验展示了我们的框架，表明UPS能够区分自信、模糊和能力不足的场景，并且与未校准基线以及此前由人或机器人门控的持续学习方法相比，最大限度地减少了昂贵的用户干预。视频可在此 https URL 观看
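The calibration step in this abstract relies on conformal prediction. The sketch below shows generic split conformal prediction only: a threshold is taken from held-out nonconformity scores, and every option scoring at or below it enters the prediction set. How UPS scores the VLM and base policy is not given in the abstract, so the score values, option names, and both function names are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of held-out nonconformity scores.

    With n exchangeable calibration scores, the returned threshold covers a
    new score with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def prediction_set(option_scores, threshold):
    """Keep every option whose nonconformity score is within the threshold."""
    return [name for name, score in option_scores.items() if score <= threshold]

# Hypothetical calibration scores (lower means the option conforms better).
threshold = conformal_threshold([0.1, 0.3, 0.2, 0.7, 0.5, 0.4, 0.6], alpha=0.25)
options = prediction_set({"execute": 0.2, "ask": 0.65, "intervene": 0.9}, threshold)
```

A single-element set suggests a confident choice, while a larger set signals residual uncertainty that a system could resolve by asking for clarification or help.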

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

GraspLDP：通过潜在扩散实现可泛化的抓取策略

Authors: Enda Xiang, Haoxiang Ma, Xinzhu Ma, Zicheng Liu, Di Huang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.22862
Pdf link: https://arxiv.org/pdf/2602.22862
Abstract This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images back-projected the graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.
中文摘要 本文重点提升通过模仿学习得到的操作策略的抓取精度和泛化性。基于扩散的策略学习方法近年来已成为机器人操作任务的主流方法。由于抓取是操作中的关键子任务，模仿学习策略执行精确且可泛化抓取的能力值得特别关注。现有的面向抓取的模仿学习技术常常存在抓取执行不精确、空间泛化有限以及物体泛化不佳的问题。为应对这些挑战，我们将抓取先验知识纳入扩散策略框架。具体而言，我们采用潜在扩散策略，以抓取位姿先验引导动作块解码，确保生成的运动轨迹紧密贴合可行的抓取配置。此外，我们在扩散过程中引入了自监督重建目标，以嵌入抓取性（graspness）先验：在每个反向扩散步骤，我们从中间表示重建带有反投影抓取性的腕部相机图像。仿真和真实机器人实验均表明，我们的方法显著优于基线方法，并展现出强大的动态抓取能力。

Keyword: reinforcement learning

Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation

你的图带来灵感：将合著者图与检索增强生成整合，用于基于大型语言模型的科学思想生成

SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG

SmartChunk 检索：面向高效文档 RAG 的带规划的查询感知分块压缩

UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

UpSkill：面向大型语言模型结构化反应多样性的互信息技能学习

Learning Rewards, Not Labels: Adversarial Inverse Reinforcement Learning for Machinery Fault Detection

学习奖励，而非标签：对抗式逆强化学习用于机械故障检测

Reinforcement-aware Knowledge Distillation for LLM Reasoning

强化感知知识蒸馏用于大型语言模型推理

Space Syntax-guided Post-training for Residential Floor Plan Generation

空间句法引导的住宅平面图生成后训练

A Mathematical Theory of Agency and Intelligence

能动性与智能的数学理论

Agentic AI for Intent-driven Optimization in Cell-free O-RAN

无蜂窝（cell-free）O-RAN中意图驱动优化的智能体人工智能

Multilingual Safety Alignment Via Sparse Weight Editing

通过稀疏权重编辑实现多语言安全对齐

Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

通过优势塑造和长度感知梯度调控实现稳定适应性思维

Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

迈向忠实的工业RAG：面向广告问答的强化共适应框架

Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

相关性出现之处：零样本重排序内部注意力的逐层研究

EvolveGen: Algorithmic Level Hardware Model Checking Benchmark Generation through Reinforcement Learning

EvolveGen：通过强化学习生成算法级硬件模型检查基准

Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning

压缩简单，探索困难：难度感知熵正则化以实现高效LLM推理

AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising

AHBid：跨渠道广告的可适应层级竞价框架

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

多搜索，少思考：重新思考长程智能体搜索以提升效率与泛化性

Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

强化现实世界服务智能体：任务导向对话中的效用与成本平衡

Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

通过翻译器引导强化学习提升VLM中的几何感知

Same Words, Different Judgments: Modality Effects on Preference Alignment

同样的话语，不同的判断：模态对偏好对齐的影响

RLHFless: Serverless Computing for Efficient RLHF

RLHFless：高效RLHF的无服务器计算

Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

用单步 LLM 流水线生成替代数据准备流水线的多步组装，用于表格问答

Generative Recommendation for Large-Scale Advertising

大规模广告的生成式推荐

Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Pixel2Catch：基于单个 RGB 摄像头的敏捷操作多智能体仿真到现实迁移

Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

知你所知：面向可验证强化学习推理的元认知熵校准

Towards Better RL Training Data Utilization via Second-Order Rollout

通过二阶 rollout 实现更好的强化学习训练数据利用

Transformer Actor-Critic for Efficient Freshness-Aware Resource Allocation

用于高效新鲜度感知资源分配的 Transformer Actor-Critic

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

QSIM：通过动作相似度加权Q-学习缓解多智能体强化学习中的高估

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

面向长程智能体任务的分组层级策略优化

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

从盲点到收益：大型多模态模型的诊断驱动迭代训练

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

MSJoE：联合进化 MLLM 与采样器，以实现高效的长视频理解

FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

FactGuard：通过强化学习实现基于智能体的视频虚假信息检测

A Perspective on Open Challenges in Deformable Object Manipulation

关于可变形物体操作中开放挑战的视角

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

通过混合同策略与异策略优化实现探索性记忆增强 LLM 智能体

Learning-based Multi-agent Race Strategies in Formula 1

一级方程式中基于学习的多智能体比赛策略

GeoWorld: Geometric World Models

GeoWorld：几何世界模型

Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios

迈向可理解的人-机器人交互：一种针对被遮挡行人场景的主动推理方法

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

能动性与架构限制：为何基于优化的系统无法实现规范响应

A Model-Free Universal AI

一个无模型的通用人工智能

SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly