Arxiv Papers of Today

Generated: 2026-02-26 16:52:05 (UTC+8); arXiv release time: 2026-02-26 20:00 EST (2026-02-27 09:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following

Authors: Yuancheng Yang, Lin Yang, Xu Wang, Chao Tong, Haihua Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21228
Pdf link: https://arxiv.org/pdf/2602.21228
Abstract As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore, we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs' understanding of implicit reasoning instructions, thereby improving their ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.

Cross domain Persistent Monitoring for Hybrid Aerial Underwater Vehicles

Authors: Ricardo B. Grando, Victor A. Kich, Alisson H. Kolling, Junior C. D. Jesus, Rodrigo S. Guerra, Paulo L. J. Drews-Jr
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21259
Pdf link: https://arxiv.org/pdf/2602.21259
Abstract Hybrid Unmanned Aerial Underwater Vehicles (HUAUVs) have emerged as platforms capable of operating in both aerial and underwater environments, enabling applications such as inspection, mapping, search, and rescue in challenging scenarios. However, the development of novel methodologies poses significant challenges due to the distinct dynamics and constraints of the air and water domains. In this work, we present persistent monitoring tasks for HUAUVs by combining Deep Reinforcement Learning (DRL) and Transfer Learning to enable cross-domain adaptability. Our approach employs a shared DRL architecture trained on LiDAR sensor data (in air) and sonar data (underwater), demonstrating the feasibility of a unified policy for both environments. We further show that the methodology presents promising results, taking into account the uncertainty of the environment and the dynamics of multiple mobile targets. The proposed framework lays the groundwork for scalable autonomous persistent monitoring solutions based on DRL for hybrid aerial-underwater vehicles.

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Authors: Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.21320
Pdf link: https://arxiv.org/pdf/2602.21320
Abstract Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose the Tool-R0 framework for training general-purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluations on different tool-use benchmarks show that Tool-R0 yields a 92.5% relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Authors: Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21346
Pdf link: https://arxiv.org/pdf/2602.21346
Abstract Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
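
The segment-weighted preference idea in the abstract above can be sketched as follows. This is only a minimal illustration, assuming the weights multiply per-segment log-ratio margins inside the usual DPO logistic loss; the abstract does not give the exact formulation, and the names `segment_logratio`, `weighted_dpo_loss`, `w_reason`, `w_answer`, and `beta` are hypothetical.

```python
import math

def segment_logratio(policy_logps, ref_logps):
    """Sum of per-token log(pi_theta / pi_ref) over one segment."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))

def weighted_dpo_loss(chosen, rejected, w_reason=1.0, w_answer=1.0, beta=0.1):
    """DPO-style loss with separate weights per segment (assumed form).

    chosen / rejected: dicts with keys 'reason' and 'answer', each holding a
    (policy_logps, ref_logps) pair of per-token log-probabilities.
    """
    margin = 0.0
    for seg, w in (("reason", w_reason), ("answer", w_answer)):
        margin += w * (segment_logratio(*chosen[seg])
                       - segment_logratio(*rejected[seg]))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable way
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Toy log-probabilities, invented purely for illustration.
chosen = {"reason": ([-1.0, -1.0], [-1.5, -1.5]), "answer": ([-0.5], [-1.0])}
rejected = {"reason": ([-2.0, -2.0], [-1.5, -1.5]), "answer": ([-1.0], [-1.0])}
vanilla = weighted_dpo_loss(chosen, rejected)                # equal weights
reason_heavy = weighted_dpo_loss(chosen, rejected, w_reason=2.0)
```

With both weights set to 1 the sketch reduces to the standard sequence-level DPO loss, so the weights only change how strongly each segment contributes to the preference margin.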

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21420
Pdf link: https://arxiv.org/pdf/2602.21420
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
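
The confidence shift c_i defined in the abstract above is a difference of sequence log-probabilities, which the sketch below computes. How ACE turns c_i into a penalty is not stated in the abstract, so the scaling factor `1 + alpha * max(c_i, 0)` and the names `modulated_advantages` and `alpha` are assumptions made for illustration only.

```python
def confidence_shift(logp_policy, logp_ref):
    """c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)) from sequence log-probs."""
    return logp_policy - logp_ref

def modulated_advantages(advantages, logp_policy, logp_ref, alpha=0.5):
    """Strengthen negative advantages for rollouts the policy became more
    confident in than the reference (c_i > 0).

    The factor (1 + alpha * max(c_i, 0)) is a hypothetical choice, not the
    paper's formula. Positive advantages are left unchanged.
    """
    out = []
    for adv, lp, lr in zip(advantages, logp_policy, logp_ref):
        c = confidence_shift(lp, lr)
        if adv < 0:
            adv = adv * (1.0 + alpha * max(c, 0.0))
        out.append(adv)
    return out

# One correct rollout, one overconfident error, one low-confidence error.
adv = modulated_advantages(
    advantages=[1.0, -1.0, -1.0],
    logp_policy=[-5.0, -4.0, -8.0],
    logp_ref=[-6.0, -6.0, -6.0],
)
# adv == [1.0, -2.0, -1.0]: only the overconfident error is penalized harder
```

The asymmetry is the point of the sketch: errors with c_i <= 0 keep their original penalty, so exploratory but wrong rollouts are not suppressed more than before.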

On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

Authors: Alexander Galozy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21424
Pdf link: https://arxiv.org/pdf/2602.21424
Abstract Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $\epsilon$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors, and in these experiments it precedes degradation under latent prior shift. These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.
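
The first two structural results in the abstract above can be illustrated on a two-action toy example. The sketch assumes that behavioural distance is the total variation distance between the action distributions a policy produces under two internal states at the same observation; the paper's actual definition may differ, and `tv_distance`, `mix`, and `behavioural_distance` are hypothetical helpers.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mix(pi_a, pi_b, lam):
    """Convex combination lam * pi_a + (1 - lam) * pi_b.

    Policies are dicts {internal_state: action_probabilities} for one
    fixed observation.
    """
    return {m: [lam * a + (1 - lam) * b for a, b in zip(pi_a[m], pi_b[m])]
            for m in pi_a}

def behavioural_distance(pi):
    """Assumed probe: compare action choice under internal states 0 and 1."""
    return tv_distance(pi[0], pi[1])

# Two policies that both depend on the internal state, in opposite ways.
pi_1 = {0: [1.0, 0.0], 1: [0.0, 1.0]}
pi_2 = {0: [0.0, 1.0], 1: [1.0, 0.0]}

even_mix = mix(pi_1, pi_2, 0.5)     # dependency cancels out entirely
skewed_mix = mix(pi_1, pi_2, 0.75)  # dependency shrinks but survives
```

Both inputs have distance 1, yet the even mixture has distance 0, a witness that state-dependent policies are not closed under convex aggregation. The skewed mixture stays below the weighted average of the input distances, which is the contraction property.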

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

Authors: Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21435
Pdf link: https://arxiv.org/pdf/2602.21435
Abstract Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at this https URL.

GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

Authors: Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.21492
Pdf link: https://arxiv.org/pdf/2602.21492
Abstract Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non-stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at this https URL
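
The selection rule described in the abstract above, prioritizing training problems whose policy gradients align with validation gradients, can be sketched as follows. The abstract does not say how alignment is scored or how gradients are estimated, so cosine similarity, top-k selection, and the names `cosine` and `select_aligned` are assumptions, and the gradients are tiny hand-written vectors rather than real model gradients.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero gradient carries no directional signal
    return dot / (norm_u * norm_v)

def select_aligned(problem_grads, val_grad, k):
    """Return the ids of the k problems best aligned with val_grad.

    problem_grads: {problem_id: gradient vector}, a hypothetical stand-in
    for per-problem policy-gradient estimates.
    """
    ranked = sorted(problem_grads,
                    key=lambda pid: cosine(problem_grads[pid], val_grad),
                    reverse=True)
    return ranked[:k]

problem_grads = {"p1": [1.0, 0.0], "p2": [-1.0, 0.0], "p3": [1.0, 1.0]}
val_grad = [1.0, 0.2]
chosen_ids = select_aligned(problem_grads, val_grad, k=2)
# chosen_ids == ["p1", "p3"]; "p2" points against the validation gradient
```

Because the policy changes during RL, a scheme like this would need to recompute the scores periodically, which is what makes the resulting curriculum adaptive.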

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Authors: Yongchang Zhang, Xianzheng Ma, Tianyi Liu, Guangquan Zhou, Yang Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21497
Pdf link: https://arxiv.org/pdf/2602.21497
Abstract Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. In contrast, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

Training Generalizable Collaborative Agents via Strategic Risk Aversion

Authors: Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.21515
Pdf link: https://arxiv.org/pdf/2602.21515
Abstract Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.

Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning

Authors: Zheang Huai, Honglong Yang, Xiaomeng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21517
Pdf link: https://arxiv.org/pdf/2602.21517
Abstract AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Authors: Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21534
Pdf link: https://arxiv.org/pdf/2602.21534
Abstract Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Learning Agile and Robust Omnidirectional Aerial Motion on Overactuated Tiltable-Quadrotors

Authors: Wentao Zhang, Zhaoqi Ma, Jinjie Li, Huayi Wang, Haokun Liu, Junichiro Sugihara, Chen Chen, Yicheng Chen, Moju Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21583
Pdf link: https://arxiv.org/pdf/2602.21583
Abstract Tilt-rotor aerial robots enable omnidirectional maneuvering through thrust vectoring, but introduce significant control challenges due to the strong coupling between joint and rotor dynamics. While model-based controllers can achieve high motion accuracy under nominal conditions, their robustness and responsiveness often degrade in the presence of disturbances and modeling uncertainties. This work investigates reinforcement learning for omnidirectional aerial motion control on over-actuated tiltable quadrotors that prioritizes robustness and agility. We present a learning-based control framework that enables efficient acquisition of coordinated rotor-joint behaviors for reaching target poses in the $SE(3)$ space. To achieve reliable sim-to-real transfer while preserving motion accuracy, we integrate system identification with minimal and physically consistent domain randomization. Compared with a state-of-the-art NMPC controller, the proposed method achieves comparable six-degree-of-freedom pose tracking accuracy, while demonstrating superior robustness and generalization across diverse tasks, enabling zero-shot deployment on real hardware.

Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map

Authors: Lei Su, Zhijie Peng, Renyuan Ren, Shengping Mao, Juan Du, Kaifeng Zhang, Xuezhou Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21625
Pdf link: https://arxiv.org/pdf/2602.21625
Abstract Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are too computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and real-world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Authors: Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin, Hengyu Chang, Ancheng Xu, Zhihao Yang, Hamid Alinejad-Rokny, Qiang Qu, Bo Zheng, Min Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.21628
Pdf link: https://arxiv.org/pdf/2602.21628
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
中文摘要 带可验证奖励的强化学习（RLVR）已成为增强多模态大型语言模型（MLLM）推理的主流范式。然而，仅依赖结果监督存在奖励黑客（reward hacking）的风险，即模型学习虚假的推理模式以通过最终答案检验。虽然最新的基于评分标准的方法提供了细粒度的监督信号，但它们存在实例级生成计算成本高的问题，并且因将所有评分标准视为同等可学习而导致训练动态低效。本文提出了基于分层评分标准的课程学习（RuCL），这是一种新颖框架，通过将重点从数据选择转向奖励设计，重新定义了课程学习。RuCL生成通用的评分标准以实现广泛的适用性，并根据模型的能力进行分层。通过在训练过程中动态调整评分标准权重，RuCL引导模型从掌握基础感知到攻克高级逻辑推理。在多个视觉推理基准上的大量实验表明，RuCL相比Qwen2.5-VL-7B模型平均提升+7.83%，达到了60.06%的最先进准确率。
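
The abstract describes rubric weights that shift during training from foundational perception to advanced reasoning. The sketch below illustrates that idea only; it is not the authors' implementation. The two stratum names, the linear schedule, and the functions `stratum_weights` and `rubric_reward` are all assumptions made for illustration, and the abstract does not say how RuCL actually measures competence or sets weights.

```python
from typing import Dict

# Hypothetical strata, ordered from foundational to advanced.
STRATA = ("perception", "reasoning")


def stratum_weights(progress: float) -> Dict[str, float]:
    """Shift weight linearly from foundational to advanced rubrics.

    `progress` is the fraction of training completed, clamped to [0, 1].
    The linear schedule is an assumption, not taken from the paper.
    """
    p = min(max(progress, 0.0), 1.0)
    return {"perception": 1.0 - p, "reasoning": p}


def rubric_reward(scores: Dict[str, float], progress: float) -> float:
    """Weighted sum of per-stratum rubric scores, each assumed in [0, 1]."""
    weights = stratum_weights(progress)
    return sum(weights[s] * scores[s] for s in STRATA)
```

Early in training (`progress` near 0) only the perception rubrics contribute to the reward; late in training the reasoning rubrics dominate.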

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

自我纠正VLA：通过稀疏世界想象实现在线动作精炼

Authors: Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21633
Pdf link: https://arxiv.org/pdf/2602.21633
Abstract Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at this https URL.
中文摘要 标准的视觉-语言-动作（VLA）模型依赖于拟合统计数据先验，限制了其对底层物理动力学的稳健理解。强化学习通过探索增强物理落地能力，但通常依赖于与智能体内部状态相隔离的外部奖励信号。世界动作模型已成为一种有前景的范式，它融合想象与控制以实现预测性规划。然而，它们依赖隐式上下文建模，缺乏显式的自我改进机制。为解决这些问题，我们提出了自我纠正VLA（SC-VLA），通过稀疏想象内在地引导动作精炼来实现自我改进。我们首先通过整合辅助预测头来预测当前任务进度和未来轨迹趋势，从而设计稀疏世界想象，约束策略编码短期物理演变。然后我们引入在线动作精炼模块，重塑与进度相关的稠密奖励，并根据预测的稀疏未来状态调整轨迹方向。在仿真基准和真实环境中具有挑战性的机器人操作任务上的评估表明，SC-VLA实现了最先进的性能，取得了最高的任务吞吐量，与表现最佳的基线相比步数减少16%、成功率提高9%，同时在真实世界实验中提升了14%。代码可在此 https URL 访问。

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

CCCaption：双重奖励强化学习，实现完整且正确的图像描述

Authors: Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21655
Pdf link: https://arxiv.org/pdf/2602.21655
Abstract Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
中文摘要 图像描述仍然是视觉语言理解的一项基础任务，但真值监督仍主要依赖人工标注的参考描述。由于人工标注反映了主观偏好和专业水平，真值描述往往不完整甚至不正确，这反过来限制了描述模型。我们认为，描述质量应从两个客观方面来评估：完整性（描述是否涵盖所有显著的视觉事实？）和正确性（描述相对于图像是否属实？）。为此，我们引入了CCCaption：一个双重奖励强化学习框架，配有专门的微调语料库，显式优化这两种属性，以生成完整（Complete）且正确（Correct）的描述（Captions）。在完整性方面，我们使用多种LVLM将图像拆解成一组视觉查询，并奖励能回答更多此类查询的描述，同时采用动态查询采样策略以提高训练效率。在正确性方面，我们通过验证由描述分解得到的子描述查询的真实性，来惩罚包含幻觉的描述。我们的对称双重奖励优化联合最大化完整性和正确性，引导模型生成更符合这些客观标准的描述。在标准图像描述基准上的大量实验显示出一致的改进，为超越人工标注模仿来训练描述模型提供了一条有原则的路径。
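
The two rewards in the abstract can be read as two fractions: how many visual queries the caption answers, and how many of the caption's own sub-claims hold up against the image. The sketch below assumes those checks have already been run by some external judge and reduced to booleans. The function names and the unweighted sum are assumptions for illustration, not the paper's formulation.

```python
from typing import Sequence


def fraction_true(flags: Sequence[bool]) -> float:
    """Share of True entries; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def dual_reward(answered: Sequence[bool], verified: Sequence[bool]) -> float:
    """Sum of a completeness term and a correctness term.

    answered[i]: whether the caption answers visual query i (completeness).
    verified[j]: whether sub-caption claim j is judged true for the image
                 (correctness). Both lists are assumed to come from an
                 external judge that this sketch does not model.
    """
    return fraction_true(answered) + fraction_true(verified)
```

A caption that covers half of the queries and has half of its claims verified scores 1.0 out of a possible 2.0 under this reading.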

Hierarchical Lead Critic based Multi-Agent Reinforcement Learning

基于层级主导评论家的多智能体强化学习

Authors: David Eckel, Henri Meeß
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.21680
Pdf link: https://arxiv.org/pdf/2602.21680
Abstract Cooperative Multi-Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple perspectives on different hierarchy levels. We propose the Hierarchical Lead Critic (HLC) - inspired by natural emerging distributions in team structures, where following high-level objectives combines with low-level execution. HLC demonstrates that introducing multiple hierarchies, leveraging local and global perspectives, can lead to improved performance with high sample efficiency and robust policies. Experimental results conducted on cooperative, non-communicative, and partially observable MARL benchmarks demonstrate that HLC outperforms single hierarchy baselines and scales robustly with increasing amounts of agents and difficulty.
中文摘要 合作式多智能体强化学习（MARL）解决需要多个智能体协调的复杂任务，但通常局限于局部（独立学习）或全局（集中式学习）视角。本文介绍了一种新的顺序训练方案和MARL架构，可在不同层级上从多个视角学习。我们提出了层级主导评论家（HLC），其灵感来自团队结构中自然出现的分布，即遵循高层目标与低层执行相结合。HLC表明，引入多个层级并利用局部和全局视角，能够以高样本效率和稳健的策略提升性能。在合作式、无通信且部分可观测的MARL基准上的实验结果表明，HLC优于单层级基线，并能随智能体数量和难度的增加而稳健扩展。

Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

通过LLM-RL协作实现两阶段有源配电网电压控制：一种知识-数据混合驱动方法

Authors: Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng, Haotian Liu, Wenchuan Wu
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21715
Pdf link: https://arxiv.org/pdf/2602.21715
Abstract The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data-driven approaches have demonstrated effectiveness in the voltage control problem, they often require extensive trial-and-error exploration and struggle to incorporate heterogeneous information, such as day-ahead forecasts and semantic-based grid codes. Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement learning (RL) agent to achieve two-stage voltage control. In the day-ahead stage, the LLM agent receives coarse region-level forecasts and generates scheduling strategies for on-load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile. Then in the intra-day stage, based on accurate node-level measurements, the RL agent refines terminal voltages by deriving reactive power generation strategies for PV inverters. On top of the LLM-RL collaboration framework, we further propose a self-evolution mechanism for the LLM agent and a pretrain-finetune pipeline for the RL agent, effectively enhancing and coordinating the policies for both agents. The proposed approach not only aligns more closely with practical operational characteristics but also effectively utilizes the inherent knowledge and reasoning capabilities of the LLM agent, significantly improving training efficiency and voltage control performance. Comprehensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.
中文摘要 分布式光伏（PV）日益接入有源配电网（ADN），加剧了运行挑战，因此必须协调多种设备以缓解电压越限并提升电能质量。尽管现有的数据驱动方法在电压控制问题中已证明有效，但它们通常需要大量试错探索，且难以纳入异构信息，如日前预测和基于语义的电网规范。考虑到现实世界ADN的运行场景和需求，本文提出了一种知识-数据混合驱动方法，利用大型语言模型（LLM）智能体与强化学习（RL）智能体之间的动态协作实现两阶段电压控制。在日前阶段，LLM智能体接收粗略的区域级预测，并为有载分接开关（OLTC）和并联电容器（SC）生成调度策略，以调节整体电压分布。然后在日内阶段，基于精确的节点级量测，RL智能体通过推导光伏逆变器的无功出力策略来细化终端电压。在LLM-RL协作框架的基础上，我们还提出了LLM智能体的自我演化机制和RL智能体的预训练-微调流程，有效增强并协调两个智能体的策略。该方法不仅更贴近实际运行特性，还有效利用了LLM智能体的固有知识和推理能力，显著提升了训练效率和电压控制性能。全面的对比和消融研究证明了该方法的有效性。

Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

利用强化学习评估递归数字系统中规律性与可学习性之间的关系

Authors: Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson, Kenny Smith
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.21720
Pdf link: https://arxiv.org/pdf/2602.21720
Abstract Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.
中文摘要 人类递归数字系统（即如英语十进制数字等计数系统）和许多其他语法系统一样，具有高度规则性。沿袭先前将跨语言倾向与学习偏向联系起来的研究，我们探讨规则系统之所以常见，是否因为规则性有助于学习。我们采用强化学习文献中的方法，证实高度规则的人类（或类人类）系统比未见于现实语言但可能存在的不规则系统更容易学习。这种不对称性出现在一个自然的假设之下：递归数字系统的设计目的是从有限数据中泛化，以精确表示所有整数。我们还发现，对于非自然的、高度不规则的系统，规则性对可学习性的影响并不存在，这些系统的可学习性转而受信号长度影响，这表明在可能的数字系统空间的不同区域，不同压力对可学习性的影响可能不同。我们的结果为将可学习性与跨语言普遍程度联系起来的研究增添了证据。

LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations

LessMimic：基于统一距离场表示的长时程人形机器人交互

Authors: Yutang Lin, Jieming Cui, Yixuan Li, Baoxiong Jia, Yixin Zhu, Siyuan Huang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21723
Pdf link: https://arxiv.org/pdf/2602.21723
Abstract Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues--surface distances, gradients, and velocity decompositions--removing the need for motion references, with interaction latents encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80--100% success across object scales from 0.4x to 1.6x on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on 5 task instances trajectories, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
中文摘要 能够在长时程内自主与物理环境交互的人形机器人，是具身智能的核心目标。现有方法依赖参考动作或任务特定奖励，使策略与特定物体几何形状紧密耦合，妨碍了在单一框架内进行多技能泛化。在一个策略内同时实现无参考推理、几何泛化和长时程技能组合的统一交互表示，仍是一个开放的挑战。这里我们展示了距离场（DF）可以提供这样的表示：LessMimic以DF衍生的几何线索（表面距离、梯度和速度分解）作为单一全身策略的条件，消除了对运动参考的需求，交互隐变量通过变分自编码器（VAE）编码，并在强化学习（RL）下使用对抗交互先验（AIP）进行后训练。通过DAgger式蒸馏将DF隐变量与第一人称视角的深度特征对齐，LessMimic进一步无缝迁移到仅依赖视觉的部署，无需动作捕捉（MoCap）基础设施。单一LessMimic策略在PickUp和SitStand任务上，在0.4倍到1.6倍的物体尺度范围内实现了80%至100%的成功率，而基线在这些条件下性能急剧下降；在5个任务实例轨迹上达到62.1%的成功率，并且在多达40个顺序组合的任务中仍然可行。通过将交互建立在局部几何而非演示之上，LessMimic为人形机器人在非结构化环境中实现泛化、技能组合和故障恢复提供了一条可扩展的路径。

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

Explore-on-Graph：通过路径精炼奖励建模激励大型语言模型在知识图谱上自主探索

Authors: Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang, Shijie Zhang, Wei Qiang Zhang, Yongfeng Huang, Haixin Duan, Yunqi Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.21728
Pdf link: https://arxiv.org/pdf/2602.21728
Abstract The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrained LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confined the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers. To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
中文摘要 大型语言模型（LLM）的推理过程在问答任务中常常受到幻觉和事实缺失的困扰。一个有前景的解决方案是将LLM的答案建立在可验证的知识源之上，如知识图谱（KG）。主流的KG增强方法通常通过在生成过程中强制执行规则，或模仿固定演示集合中的路径来约束LLM推理。然而，它们天然地将LLM的推理模式限制在先验经验或微调数据的范围内，限制了其对分布外图推理问题的泛化能力。为解决这一问题，本文提出了Explore-on-Graph（EoG），一种新颖的框架，鼓励LLM在KG上自主探索更多样化的推理空间。为了激励探索和发现新颖的推理路径，我们提出在训练过程中引入强化学习，其奖励是推理路径最终答案的正确性。为了提升探索的效率和意义，我们提出将路径信息作为额外的奖励信号，以细化探索过程并减少无效尝试。在五个KGQA基准数据集上的大量实验表明，据我们所知，我们的方法实现了最先进的性能，不仅优于开源LLM，甚至超越了闭源LLM。

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

通过难度感知组归一化增强多模态大型语言模型的推理能力

Authors: Jinghan Li, Junfeng Fang, Jinda Lu, Yuan Wang, Xiaoyan Guo, Tianyu Zhang, Xiang Wang, Xiangnan He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21743
Pdf link: https://arxiv.org/pdf/2602.21743
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
中文摘要 带可验证奖励的强化学习（RLVR）和组相对策略优化（GRPO）显著提升了大型语言模型的推理能力。然而，将这些方法推广到多模态场景时，面临一个关键挑战：基于标准差（std）的归一化不稳定，容易被奖励几乎为正或为负的极端样本所扭曲。与纯文本LLM不同，多模态模型对这种扭曲特别敏感，因为感知错误和推理错误都会影响其回答。为此，我们用难度来刻画每个样本，难度通过感知复杂度（以视觉熵衡量）和推理不确定性（以模型置信度刻画）来定义。基于这一刻画，我们提出了难度感知组归一化（Durian），按难度等级对样本重新分组，并在每个组内共享标准差。我们的方法在保留GRPO组内区分度的同时，消除了对极端情况的敏感性，从而在多个多模态推理基准上实现了显著的性能提升。
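
To make the normalization idea concrete, the sketch below contrasts with plain GRPO, which divides each centered reward by the standard deviation of its own rollout group. Here every rollout group is still centered on its own mean, but the divisor is a standard deviation pooled over all groups that share a difficulty level. How Durian assigns difficulty levels and how it pools the std are not given in the abstract, so the integer `levels`, the pooling rule, and the function name are assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev
from typing import Dict, List


def shared_std_advantages(
    rewards: List[List[float]],  # rewards[g] = rollout rewards for prompt g
    levels: List[int],           # levels[g] = assumed difficulty bucket of prompt g
    eps: float = 1e-6,
) -> List[List[float]]:
    """Group-centered advantages scaled by a std shared per difficulty level."""
    # Center each prompt group on its own mean, as in GRPO.
    centered = [[r - fmean(group) for r in group] for group in rewards]

    # Pool the centered rewards of all groups in the same difficulty bucket.
    pooled: Dict[int, List[float]] = defaultdict(list)
    for level, group in zip(levels, centered):
        pooled[level].extend(group)

    # mu=0.0 because the pooled values are already centered per group.
    shared = {level: pstdev(values, mu=0.0) for level, values in pooled.items()}

    return [
        [c / (shared[level] + eps) for c in group]
        for level, group in zip(levels, centered)
    ]
```

With a shared divisor, a group whose rewards are nearly identical no longer has its tiny deviations blown up by a near-zero per-group std.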

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

RLHF在奖励偏移和截断KL正则化下的泛化

Authors: Kenton Tang, Yuzhu Chen, Fengxiang He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.21765
Pdf link: https://arxiv.org/pdf/2602.21765
Abstract Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.
中文摘要 大型语言模型中的对齐与适配高度依赖于基于人类反馈的强化学习（RLHF）；然而，对其泛化能力的理论理解尚不成熟，尤其是在所学奖励可能发生偏移、且KL控制项经过估计和截断的情况下。为解决这个问题，我们建立了RLHF的泛化理论，显式考虑了（1）奖励偏移：奖励模型基于来自早期或混合行为策略的偏好数据训练，而RLHF在当前策略自身的rollout上对其进行优化；以及（2）截断KL正则化：KL正则项由采样的对数概率比估计，然后为稳定训练而被截断，从而给RLHF带来误差。我们给出了RLHF的泛化界，表明泛化误差来源于提示和rollout的采样误差、奖励偏移误差以及KL截断误差。我们还讨论了两种特例：（1）以有限空间上的均匀先验初始化RLHF参数，以及（2）通过随机梯度下降训练RLHF，并将其视为奥恩斯坦-乌伦贝克过程。该理论在（1）最优KL截断阈值和（2）提示、rollout和偏好数据之间的预算分配方面具有实践意义。
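
The clipped KL regulariser mentioned in the abstract can be illustrated with a small estimator: average the sampled log-probability ratios between the current policy and the reference policy, after clipping each ratio. The symmetric clipping range, the function names, and the penalised objective below are assumptions for illustration; the abstract does not state the exact form the paper analyses.

```python
from typing import Sequence


def clipped_kl_estimate(
    logp_policy: Sequence[float],  # log pi(y|x) for rollouts sampled from pi
    logp_ref: Sequence[float],     # log pi_ref(y|x) for the same rollouts
    clip: float,                   # assumed symmetric clipping threshold
) -> float:
    """Monte Carlo KL estimate with each log-ratio clipped to [-clip, clip]."""
    ratios = [p - q for p, q in zip(logp_policy, logp_ref)]
    clipped = [max(-clip, min(clip, r)) for r in ratios]
    return sum(clipped) / len(clipped)


def regularised_objective(mean_reward: float, kl: float, beta: float) -> float:
    """Reward minus a KL penalty weighted by beta."""
    return mean_reward - beta * kl
```

Clipping bounds the influence of any single rollout on the penalty, at the cost of biasing the estimate; that bias is the kind of error the abstract's KL clipping term refers to.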

DexRepNet++: Learning Dexterous Robotic Manipulation with Geometric and Spatial Hand-Object Representations

DexRepNet++：利用几何和空间手-物表示学习机器人灵巧操作

Authors: Qingtao Liu, Zhengnan Sun, Yu Cui, Haoming Li, Gaofeng Li, Lin Shao, Jiming Chen, Qi Ye
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21811
Pdf link: https://arxiv.org/pdf/2602.21811
Abstract Robotic dexterous manipulation is a challenging problem due to high degrees of freedom (DoFs) and complex contacts of multi-fingered robotic hands. Many existing deep reinforcement learning (DRL) based methods aim at improving sample efficiency in high-dimensional output action spaces. However, existing works often overlook the role of representations in achieving generalization of a manipulation policy in the complex input space during the hand-object interaction. In this paper, we propose DexRep, a novel hand-object interaction representation to capture object surface features and spatial relations between hands and objects for dexterous manipulation skill learning. Based on DexRep, policies are learned for three dexterous manipulation tasks, i.e. grasping, in-hand reorientation, bimanual handover, and extensive experiments are conducted to verify the effectiveness. In simulation, for grasping, the policy learned with 40 objects achieves a success rate of 87.9% on more than 5000 unseen objects of diverse categories, significantly surpassing existing work trained with thousands of objects; for the in-hand reorientation and handover tasks, the policies also boost the success rates and other metrics of existing hand-object representations by 20% to 40%. The grasp policies with DexRep are deployed to the real world under multi-camera and single-camera setups and demonstrate a small sim-to-real gap.
中文摘要 由于多指机械手具有高自由度（DoF）和复杂的接触，机器人灵巧操作是一个具有挑战性的问题。许多现有的基于深度强化学习（DRL）的方法旨在提高高维输出动作空间中的样本效率。然而，现有研究常常忽视表示在手-物交互的复杂输入空间中对实现操作策略泛化所起的作用。本文提出了DexRep，一种新颖的手-物交互表示，用于捕捉物体表面特征以及手与物体之间的空间关系，以进行灵巧操作技能学习。基于DexRep，我们为三种灵巧操作任务（抓取、手内重定向、双手交接）学习了策略，并进行了大量实验验证其有效性。在仿真中，对于抓取任务，用40个物体学习的策略在5000多个不同类别的未见物体上实现了87.9%的成功率，显著超越了用数千个物体训练的现有工作；对于手内重定向和交接任务，这些策略也将现有手-物表示的成功率和其他指标提升了20%至40%。采用DexRep的抓取策略在多相机和单相机设置下部署到现实世界，表现出较小的仿真到现实差距。

Self-Curriculum Model-based Reinforcement Learning for Shape Control of Deformable Linear Objects

用于可变形线性物体形状控制的自课程基于模型强化学习

Authors: Zhaowei Liang, Song Wang, Zhao Jin, Shirui Wu, Dan Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21816
Pdf link: https://arxiv.org/pdf/2602.21816
Abstract Precise shape control of Deformable Linear Objects (DLOs) is crucial in robotic applications such as industrial and medical fields. However, existing methods face challenges in handling complex large deformation tasks, especially those involving opposite curvatures, and lack efficiency and precision. To address this, we propose a two-stage framework combining Reinforcement Learning (RL) and online visual servoing. In the large-deformation stage, a model-based reinforcement learning approach using an ensemble of dynamics models is introduced to significantly improve sample efficiency. Additionally, we design a self-curriculum goal generation mechanism that dynamically selects intermediate-difficulty goals with high diversity through imagined evaluations, thereby optimizing the policy learning process. In the small-deformation stage, a Jacobian-based visual servo controller is deployed to ensure high-precision convergence. Simulation results show that the proposed method enables efficient policy learning and significantly outperforms mainstream baselines in shape control success rate and precision. Furthermore, the framework effectively transfers the policy trained in simulation to real-world tasks with zero-shot adaptation. It successfully completes all 30 cases with diverse initial and target shapes across DLOs of different sizes and materials. The project website is available at: this https URL
中文摘要 可变形线性物体（DLO）的精确形状控制在工业和医疗等领域的机器人应用中至关重要。然而，现有方法在处理复杂的大变形任务时面临挑战，尤其是涉及反向曲率的任务，且效率和精度不足。为此，我们提出了一个结合强化学习（RL）和在线视觉伺服的两阶段框架。在大变形阶段，引入了使用动力学模型集成的基于模型的强化学习方法，显著提高样本效率。此外，我们设计了一种自课程目标生成机制，通过想象评估动态选择具有高度多样性的中等难度目标，从而优化策略学习过程。在小变形阶段，部署基于雅可比矩阵的视觉伺服控制器以确保高精度收敛。仿真结果表明，该方法能够实现高效的策略学习，并在形状控制成功率和精度上显著优于主流基线。此外，该框架能将仿真中训练的策略以零样本方式有效迁移到现实任务。它成功完成了全部30个案例，这些案例具有多样的初始形状和目标形状，涵盖不同尺寸和材料的DLO。项目网站可访问：此 https URL

LightSim: A Lightweight Cell Transmission Model Simulator for Traffic Signal Control Research

LightSim：一款用于交通信号控制研究的轻量级元胞传输模型模拟器

Authors: Haoran Su, Hanxiao Deng
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.21852
Pdf link: https://arxiv.org/pdf/2602.21852
Abstract Reinforcement learning for traffic signal control is bottlenecked by simulators: training in SUMO takes hours, reproducing results often requires days of platform-specific setup, and the slow iteration cycle discourages the multi-seed experiments that rigorous evaluation demands. Much of this cost is unnecessary, since for signal timing optimization the relevant dynamics are queue formation and discharge, which the Cell Transmission Model (CTM) captures as a macroscopic flow model. We introduce LightSim, a pure Python, pip-installable traffic simulator with Gymnasium and PettingZoo interfaces that runs over 20000 steps per second on a single CPU. Across cross-simulator experiments spanning single intersections, grid networks, arterial corridors, and six real-world city networks, LightSim preserves controller rankings from SUMO for both classical and reinforcement learning strategies while training 3 to 7 times faster. LightSim is released as an open-source benchmark with nineteen built-in scenarios, seven controllers, and full reinforcement learning pipelines, lowering the barrier to signal control research from days to minutes.
中文摘要 用于交通信号控制的强化学习受到模拟器的制约：在SUMO中训练需要数小时，复现结果通常需要数天针对特定平台的配置，而缓慢的迭代周期又使人不愿进行严格评估所需的多种子实验。这些成本大多是不必要的，因为对于信号配时优化而言，相关的动态是排队的形成与消散，而元胞传输模型（CTM）作为一种宏观流模型可以刻画这些动态。我们介绍LightSim，一款纯Python、可通过pip安装的交通模拟器，提供Gymnasium和PettingZoo接口，在单个CPU上每秒可运行超过20000步。在涵盖单交叉口、网格路网、主干道走廊和六个真实城市路网的跨模拟器实验中，LightSim对于经典策略和强化学习策略都保留了SUMO中的控制器排名，同时训练速度提升3到7倍。LightSim作为开源基准发布，内置19个场景、7个控制器和完整的强化学习流程，将信号控制研究的门槛从数天降至几分钟。
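
For readers unfamiliar with the Cell Transmission Model, the sketch below shows one update step on a single road approach split into cells, with a signal at the downstream end. It follows the textbook CTM rule (flow across a boundary is the minimum of what the upstream cell can send and what the downstream cell can receive) and does not use or imitate LightSim's API. The function name, parameters, and the single-approach layout are assumptions for illustration.

```python
from typing import List


def ctm_step(
    n: List[float],   # vehicles in each cell, ordered upstream to downstream
    inflow: float,    # vehicles wanting to enter the first cell this step
    q_max: float,     # max vehicles that can cross a cell boundary per step
    n_max: float,     # jam capacity of a cell (vehicles)
    delta: float,     # backward wave speed divided by free-flow speed
    green: bool,      # whether the signal lets vehicles leave the last cell
) -> List[float]:
    """Advance cell occupancies by one time step."""
    k = len(n)
    # flows[i] enters cell i; flows[k] leaves the last cell through the signal.
    flows = [0.0] * (k + 1)
    flows[0] = min(inflow, q_max, delta * (n_max - n[0]))
    for i in range(1, k):
        # Sending limit of cell i-1 versus receiving limit of cell i.
        flows[i] = min(n[i - 1], q_max, delta * (n_max - n[i]))
    flows[k] = min(n[k - 1], q_max) if green else 0.0
    return [n[i] + flows[i] - flows[i + 1] for i in range(k)]
```

Under a red signal the last cell fills up and the queue spills back cell by cell; under green it discharges at most `q_max` vehicles per step. Queue formation and discharge of this kind are the dynamics the abstract says matter for signal timing.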

Distill and Align Decomposition for Enhanced Claim Verification

蒸馏与对齐分解以增强声明验证

Authors: Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero, Arturo Oncevay, Charese H. Smiley, Xiaomo Liu, Manuela Veloso
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.21857
Pdf link: https://arxiv.org/pdf/2602.21857
Abstract Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.
中文摘要 复杂的声明验证需要将句子分解为可验证的子声明，但现有方法难以使分解质量与验证性能保持一致。我们提出了一种强化学习（RL）方法，使用组相对策略优化（Group Relative Policy Optimization，GRPO）联合优化分解质量和验证器对齐。我们的方法整合了：（i）结构化顺序推理；（ii）在教师蒸馏的示例上进行监督微调；以及（iii）平衡格式合规性、验证器对齐和分解质量的多目标奖励。在六种评估设置中，我们训练的8B分解器将下游验证性能提升至71.75%的宏平均F1，优于基于提示的方法（+1.99、+6.24）和现有强化学习方法（+5.84）。人工评估确认了所生成子声明的高质量。我们的框架通过联合优化验证准确性和分解质量，使较小的语言模型能够实现最先进的声明验证。

ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

ExpLang：通过on-policy思维语言选择改进LLM推理中的探索与利用

Authors: Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li, Shujian Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.21887
Pdf link: https://arxiv.org/pdf/2602.21887
Abstract Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.
中文摘要 当前的大型推理模型（LRM）在经过基于强化学习（RL）的后训练后，在具有挑战性的任务上展现出强大的能力。然而，尽管多语言思维的潜在优势已得到证明，且全球用户需要母语的思维轨迹，以往的研究仍主要聚焦于英语推理，以期获得最强性能。本文提出了ExpLang，一种新颖的LLM后训练流程，支持on-policy的思维语言选择，借助多种语言提升强化学习中的探索与利用。结果显示，在相同训练预算下，我们的方法稳定优于纯英语训练，同时在已见和未见语言上都表现出较高的思维语言遵从度。分析表明，通过在强化学习中将on-policy思维语言选择作为一种动作，ExpLang以多样化的语言偏好有效扩展了强化学习的探索空间，并利用非英语优势改善了强化学习的利用效果。该方法与大多数强化学习算法正交，并为利用多语言能力改进LRM开辟了新视角。

RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning

RADAR：基于LLM的知识图谱推理中，推理作为区分与对齐表征

Authors: Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.21951
Pdf link: https://arxiv.org/pdf/2602.21951
Abstract Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.
中文摘要 知识图谱推理（KGR）用于推断缺失的事实，近期的进展越来越多地利用大型语言模型（LLM）的语义先验和推理能力。然而，主流的生成式范式倾向于记忆表层共现，而非学习真正的关系语义，限制了分布外泛化。为此，我们提出了RADAR，将KGR从生成式模式匹配重新表述为判别式关系推理。我们将KGR重新定义为判别式实体选择，其中强化学习强制实现实体间的相对可分离性，而不止于对词元似然的模仿。利用这种可分离性，推理直接在表示空间中进行，确保与判别式优化的一致性，并绕过生成引起的幻觉。在四个基准上，RADAR在链接预测和三元组分类上相较于强大的LLM基线实现了5-6%的相对提升，同时将中间表示中与任务相关的互信息提高了62.9%，表明其关系推理更稳健、更可迁移。

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

PanoEnv：利用强化学习探索全景环境中的三维空间智能

Authors: Zekai Lin, Xu Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.21992
Pdf link: https://arxiv.org/pdf/2602.21992
Abstract 360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
中文摘要 360度全景图像越来越多地被用于虚拟现实、自动驾驶和机器人技术，以实现整体场景理解。然而，由于几何畸变和三维监督有限，当前的视觉语言模型（VLM）难以在等距柱状投影（ERP）图像上进行三维空间推理。我们引入PanoEnv，这是一个基于合成三维环境构建的大规模VQA基准，包含14.8K个问题，涵盖五个类别（如相对位置、体积比较），并以准确的三维标注（包括深度、分割和边界框）为依据。对14个最先进的VLM进行基准测试的结果显示，它们的三维理解能力有限，整体准确率仅为49.34%，开放式（OE）问题准确率为8.36%。为增强三维推理能力，我们提出了基于组相对策略优化（GRPO）的强化学习后训练框架，采用由真值引导的奖励，其中包含五种几何感知策略，如距离容差和空间一致性。两阶段课程进一步缓解灾难性遗忘：第一阶段在结构化任务（判断题和选择题）上训练，第二阶段在混合的开放式数据上微调，以提升泛化能力。我们的7B模型取得了新的最先进性能，整体准确率提升至52.93%（+3.59%），开放式准确率提升至14.83%，同时保持结构化任务上的表现。它还取得了最高的语义评估得分（Q-Score 6.24，P-Score 5.95），超过32B模型。这些结果表明，PanoEnv-QA和我们基于课程的强化学习框架能够有效地为VLM注入面向全向感知的三维空间智能。
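
The GRPO step mentioned in the abstract above can be illustrated with a minimal sketch. GRPO scores each sampled answer relative to the other answers drawn for the same question, using the group mean and standard deviation instead of a learned value function. The reward below is an assumption made for illustration only: `distance_tolerance_reward` and its `tol` parameter are hypothetical names, and the abstract does not specify the actual shape of the paper's five geometry-aware reward strategies.

```python
from statistics import mean, pstdev


def distance_tolerance_reward(pred: float, truth: float, tol: float = 0.1) -> float:
    """Hypothetical tolerance-based reward (not the paper's exact formula).

    Returns 1.0 when the relative error is within `tol`, then decays
    linearly to 0.0 as the relative error grows from `tol` to 2 * `tol`.
    """
    rel_err = abs(pred - truth) / max(abs(truth), 1e-8)
    if rel_err <= tol:
        return 1.0
    return max(0.0, 1.0 - (rel_err - tol) / tol)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of answers to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled distance estimates for a question whose true answer is 10.0.
    predictions = [10.5, 13.0, 9.8, 11.5]
    rewards = [distance_tolerance_reward(p, 10.0) for p in predictions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so the policy update needs only relative quality within the group.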

System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot

超机动车辆（Ultra Mobility Vehicle）的系统设计：一款可行驶、平衡与跳跃的自行车机器人

Authors: Benjamin Bokser, Daniel Gonzalez, Surya Singh, Aaron Preston, Alex Bahner, Annika Wollschläger, Arianna Ilvonen, Asa Eckert-Erdheim, Ashwin Khadke, Bilal Hammoud, Dean Molinaro, Fabian Jenelten, Henry Mayne, Howie Choset, Igor Bogoslavskyi, Itic Tinman, James Tigue, Jan Preisig, Kaiyu Zheng, Kenny Sharma, Kim Ang, Laura Lee, Liana Margolese, Nicole Lin, Oscar Frias, Paul Drews, Ravi Boggavarapu, Rick Burnham, Samuel Zapolsky, Sangbae Kim, Scott Biddlestone, Sean Mayorga, Shamel Fahmi, Tyler McCollum, Velin Dimitrov, William Moyne, Yu-Ming Chen, Farbod Farshidian, Marco Hutter, David Perry, Al Rizzi, Gabe Nelson
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.22118
Pdf link: https://arxiv.org/pdf/2602.22118
Abstract Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
中文摘要 攀爬车（trials）车手和山地车骑手可以用单轮或双轮完成弹跳、跳跃、平衡和骑行。这种多功能性使他们能够在平坦地形上兼顾速度与能效，在崎岖地形上保持敏捷。受这些运动员的启发，我们展示了机器人平台“超机动车辆”（Ultra Mobility Vehicle，UMV）的设计与控制，该平台将自行车与反作用质量相结合，以最少的驱动自由度实现动态运动。我们采用仿真驱动的设计优化流程来综合空间连杆拓扑结构，重点关注垂直跳跃高度和单轮接触下基于动量的平衡。借助约束强化学习（RL）框架，我们展示了多种运动行为的零样本迁移，包括定车（track-stand）、跳跃、翘头骑行、后轮弹跳和前空翻。这款重23.5公斤的机器人能够高速行驶（8米/秒），并能跳上和越过大型障碍物（高1米，即机器人标称高度的130%）。

SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

SWE-Protégé：学会有选择地与专家协作，使小型语言模型成为软件工程智能体

Authors: Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt, Zijian Wang, John Yang, Samuel Thompson
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.22124
Pdf link: https://arxiv.org/pdf/2602.22124
Abstract Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).
中文摘要 小型语言模型（SLM）在成本、延迟和适应性方面具有显著优势，但在SWE-bench等长程软件工程任务上，由于普遍存在动作循环和解决率低的问题，至今仍落后于大型模型。我们提出了SWE-Protégé，这是一个后训练框架，将软件修复重新定义为专家与门徒的协作问题。在SWE-Protégé中，SLM始终是唯一的决策者，同时学习有选择地向强大的专家模型寻求指导、识别停滞状态，并落实专家反馈。我们的方法将专家增强轨迹上的监督微调与智能体强化学习相结合，后者明确抑制退化性循环和无效的专家协作。我们对Qwen2.5-Coder-7B-Instruct进行了轻量后训练，在SWE-bench Verified上实现了42.4%的Pass@1，比此前SLM的最佳水平提升了25.4%，同时只稀疏地使用专家协助（每个任务约4次调用，占总词元数的11%）。

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

通过乐观原始对偶实现多目标安全LLM对齐的可证明最后迭代收敛

Authors: Yining Li, Peizhong Ju, Ness Shroff
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.22146
Pdf link: https://arxiv.org/pdf/2602.22146
Abstract Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
中文摘要 基于人类反馈的强化学习（RLHF）在使大型语言模型（LLM）与人类偏好对齐方面起着重要作用。虽然带有期望奖励约束的RLHF可以表述为原始-对偶优化问题，但标准的原始-对偶方法仅在分布策略下保证收敛，此时鞍点问题呈凸凹形式。此外，在实际应用的策略参数化下，标准原始-对偶方法的最后一次迭代可能表现出不稳定性或发散。本研究提出了一个面向安全RLHF的通用原始-对偶框架，统一了一大类现有对齐算法，包括safe-RLHF、单次（one-shot）和多次（multi-shot）方法。基于该框架，我们引入了一种乐观原始-对偶（OPD）算法，对原始变量和对偶变量都加入预测性更新，以稳定鞍点动态。我们为所提方法建立了最后迭代收敛保证，既涵盖分布空间中的精确策略优化，也涵盖参数化策略下向最优解邻域的收敛，该邻域的间隙与近似误差和偏差相关。我们的分析表明，乐观性在缓解约束对齐目标固有的振荡方面起着关键作用，从而弥合了约束强化学习与实际RLHF之间的一个关键理论差距。
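
The stabilizing effect of optimistic updates described in the abstract above can be seen on a toy problem. The sketch below is not the paper's OPD algorithm for constrained RLHF: it assumes the simplest possible saddle-point objective, f(x, y) = x * y, minimized over x and maximized over y, with no policy, no reward model, and no projection of the dual variable. The function names are illustrative. Plain gradient descent-ascent spirals away from the saddle point at (0, 0) on this objective, while the optimistic variant, which extrapolates with the previous gradient, contracts toward it in the last iterate.

```python
import math


def grads(x: float, y: float) -> tuple[float, float]:
    """Gradients of f(x, y) = x * y: df/dx = y and df/dy = x."""
    return y, x


def gda(x: float, y: float, eta: float, steps: int) -> tuple[float, float]:
    """Plain simultaneous gradient descent (on x) and ascent (on y)."""
    for _ in range(steps):
        gx, gy = grads(x, y)
        x, y = x - eta * gx, y + eta * gy
    return x, y


def optimistic_gda(x: float, y: float, eta: float, steps: int) -> tuple[float, float]:
    """Optimistic update: step along 2 * (current gradient) - (previous gradient)."""
    prev_gx, prev_gy = grads(x, y)  # first step reduces to a plain GDA step
    for _ in range(steps):
        gx, gy = grads(x, y)
        x, y = x - eta * (2 * gx - prev_gx), y + eta * (2 * gy - prev_gy)
        prev_gx, prev_gy = gx, gy
    return x, y


if __name__ == "__main__":
    for name, method in [("GDA", gda), ("optimistic GDA", optimistic_gda)]:
        x, y = method(1.0, 1.0, eta=0.1, steps=2000)
        print(f"{name}: distance to saddle point = {math.hypot(x, y):.3e}")
```

The extra term acts as a one-step prediction of the next gradient, which damps the rotation that makes the plain last iterate oscillate.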

Improving Parametric Knowledge Access in Reasoning Language Models

改进推理语言模型中的参数化知识获取

Authors: Melody Ma, John Hewitt
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.22193
Pdf link: https://arxiv.org/pdf/2602.22193
Abstract We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple "think step-by-step" cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.
中文摘要 我们研究如何通过推理来访问存储在语言模型参数中的世界知识。例如，要回忆起堪培拉是澳大利亚的首都，先思考主要城市以及专门规划建造的首都这一概念可能会有帮助。虽然推理语言模型通过强化学习训练，能在数学等任务上生成推理轨迹，但它们在访问自身世界知识时未必能很好地推理。我们首先发现，模型默认并不会生成其最佳的世界知识推理：添加一个简单的“逐步思考”提示后，知识回忆有统计学上显著的提升，而数学任务上没有。基于此，我们提出以世界知识问答作为可验证奖励，训练模型对其参数化知识进行推理。在TriviaQA上进行强化学习（+9.9%）后，模型在Natural Questions、HotpotQA、SimpleQA和StrategyQA上的表现也分别提升了4.2%、2.1%、0.6%和3.0%。推理模型在参数化知识访问方面优化不足，但可以很容易地通过训练获得更好的推理能力。

Keyword: diffusion policy

ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation

ADM-DP：通过视觉-触觉-图融合实现自适应动态模态扩散策略，用于多智能体操作

Authors: Enyi Wang, Wen Fan, Dandan Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.21622
Pdf link: https://arxiv.org/pdf/2602.21622
Abstract Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) modulation to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
中文摘要 由于共享工作空间中对协调、抓取稳定性和避碰的综合要求，多智能体机器人操作依然充满挑战。为应对这些挑战，我们提出了自适应动态模态扩散策略（ADM-DP），该框架整合了视觉、触觉和基于图（多智能体姿态）的模态以实现协调控制。ADM-DP引入了四项关键创新。首先，增强型视觉编码器通过特征级线性调制（FiLM）融合RGB和点云特征，以丰富感知。其次，触觉引导的抓取策略利用力敏电阻（FSR）反馈检测接触不足并触发纠正性的抓取细化，提升抓取稳定性。第三，基于图的碰撞编码器利用多个智能体共享的工具中心点（TCP）位置作为结构化运动学上下文，以保持空间感知并减少智能体间的干扰。第四，自适应模态注意力机制（AMAM）根据任务上下文动态地对各模态重新加权，实现灵活融合。为了可扩展性和模块化，我们采用解耦训练范式，各智能体在共享空间信息的同时学习独立策略。这样既保持了智能体间的低依赖性，又保留了集体感知。在七个多智能体任务中，ADM-DP相比最先进的基线实现了12%-25%的性能提升。消融研究显示，在需要多种感知模态的任务中改进最大，这验证了我们的自适应融合策略，并证明了其在多样化操作场景下的稳健性。
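
The FiLM fusion named in the abstract above can be sketched as follows. Feature-wise Linear Modulation scales and shifts each feature of one stream using parameters predicted from a conditioning stream: out = gamma(cond) * features + beta(cond). The abstract does not say which modality conditions which, so this sketch assumes that point-cloud features modulate RGB features. The dimensions, weights, and function names are all hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
Vector = list[float]
Matrix = list[list[float]]


def linear(weight: Matrix, bias: Vector, v: Vector) -> Vector:
    """Affine map weight @ v + bias, written out for plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]


def film(features: Vector, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise scale (gamma) and shift (beta), both predicted from `cond`."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


if __name__ == "__main__":
    rgb_features = [1.0, 2.0]            # assumed 2-dim RGB feature vector
    cloud_features = [1.0, 0.0, -1.0]    # assumed 3-dim point-cloud feature vector
    w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    b_gamma = [1.0, 1.0]
    w_beta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    b_beta = [0.5, -0.5]
    print(film(rgb_features, cloud_features, w_gamma, b_gamma, w_beta, b_beta))
```

In a trained encoder the gamma and beta maps would be learned layers; they are fixed here only so that the modulation is easy to follow by hand.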