Arxiv Papers of Today

Generated: 2026-03-03 16:47:43 (UTC+8); arXiv release time: 2026-03-03 20:00 EST (2026-03-04 09:00 UTC+8)

There are 95 relevant papers today.

Keyword: reinforcement learning

Reinforcement Learning for Control with Probabilistic Stability Guarantee: A Finite-Sample Approach

Authors: Minghao Han, Lixian Zhang, Chenliang Liu, Zhipeng Zhou, Jun Wang, Wei Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00043
Pdf link: https://arxiv.org/pdf/2603.00043
Abstract This paper presents a novel approach to reinforcement learning (RL) for control systems that provides probabilistic stability guarantees using finite data. Leveraging Lyapunov's method, we propose a probabilistic stability theorem that ensures mean square stability using only a finite number of sampled trajectories. The probability of stability increases with the number and length of trajectories, converging to certainty as data size grows. Additionally, we derive a policy gradient theorem for stabilizing policy learning and develop an RL algorithm, L-REINFORCE, that extends the classical REINFORCE algorithm to stabilization problems. The effectiveness of L-REINFORCE is demonstrated through simulations on a Cartpole task, where it outperforms the baseline in ensuring stability. This work bridges a critical gap between RL and control theory, enabling stability analysis and controller design in a model-free framework with finite data.
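
As a rough illustration of what estimating mean square stability from a finite number of sampled trajectories can look like, the sketch below uses a hypothetical scalar system x_{t+1} = (a + sigma * w_t) * x_t with standard Gaussian noise w_t. This toy system, the function name, and all parameter values are assumptions made for illustration; it is not the paper's theorem or the L-REINFORCE algorithm. For this particular system E[x_{t+1}^2] = (a^2 + sigma^2) E[x_t^2], so the second moment decays exactly when a^2 + sigma^2 < 1.

```python
import random


def mean_square_state(a, sigma, horizon, num_traj, x0=1.0, seed=0):
    """Monte Carlo estimate of E[x_T^2] for x_{t+1} = (a + sigma * w_t) * x_t.

    The toy system and its parameters are illustrative assumptions,
    not taken from the paper.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_traj):
        x = x0
        for _ in range(horizon):
            # Multiplicative Gaussian noise on the closed-loop gain.
            x = (a + sigma * rng.gauss(0.0, 1.0)) * x
        total += x * x
    return total / num_traj


# a^2 + sigma^2 = 0.34 < 1: the second moment should shrink toward zero.
stable = mean_square_state(a=0.5, sigma=0.3, horizon=50, num_traj=200)
# a^2 + sigma^2 = 1.30 > 1: the second moment should grow.
unstable = mean_square_state(a=1.1, sigma=0.3, horizon=50, num_traj=200)
print(f"stable estimate:   {stable:.3e}")
print(f"unstable estimate: {unstable:.3e}")
```

Increasing `num_traj` or `horizon` makes the empirical estimate more informative, which loosely mirrors the abstract's statement that the probability of stability grows with the number and length of trajectories.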

Breaking the Factorization Barrier in Diffusion Language Models

Authors: Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00045
Pdf link: https://arxiv.org/pdf/2603.00045
Abstract Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade-off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. Code available at: this https URL

Safe Multi-Agent Deep Reinforcement Learning for Privacy-Aware Edge-Device Collaborative DNN Inference

Authors: Hong Wang, Xuwei Fan, Zhipeng Cheng, Yachao Yuan, Minghui Min, Minghui Liwang, Xiaoyu Xia
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00129
Pdf link: https://arxiv.org/pdf/2603.00129
Abstract As Deep Neural Network (DNN) inference becomes increasingly prevalent on edge and mobile platforms, critical challenges emerge in privacy protection, resource constraints, and dynamic model deployment. This paper proposes a privacy-aware collaborative inference framework, in which adaptive model partitioning is performed across edge devices and servers. To jointly optimize inference delay, energy consumption, and privacy cost under dynamic service demands and resource constraints, we formulate the joint problem as a Constrained Markov Decision Process (CMDP) that integrates model deployment, user-server association, model partitioning, and resource allocation. We propose a Hierarchical Constrained Multi-Agent Proximal Policy Optimization with Lagrangian relaxation (HC-MAPPO-L) algorithm, a safe reinforcement learning-based framework that enhances Multi-Agent Proximal Policy Optimization (MAPPO) with adaptive Lagrangian dual updates to enforce long-term delay constraints. To ensure tractability while maintaining coordination, we decompose the CMDP into three hierarchically structured policy layers: an auto-regressive based model deployment policy, a Lagrangian-enhanced user association and model partitioning policy, and an attention-based resource allocation policy. Extensive experimental results demonstrate that HC-MAPPO-L consistently satisfies stringent delay constraints while achieving a superior balance among energy consumption and privacy cost, outperforming representative baseline algorithms across varying problem scales and resource configurations.
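
The Lagrangian dual update mentioned above can be sketched in a few lines. This is the generic projected dual-ascent step used in Lagrangian-relaxed constrained RL, not HC-MAPPO-L itself; the function names, the learning rate, and the delay numbers are hypothetical.

```python
def dual_update(multiplier, avg_cost, limit, lr):
    """Projected dual ascent: raise the multiplier when the constraint is
    violated (avg_cost > limit), lower it otherwise, and never go below 0."""
    return max(0.0, multiplier + lr * (avg_cost - limit))


def penalized_reward(reward, cost, multiplier):
    """Reward the policy optimizer sees once the constraint is relaxed."""
    return reward - multiplier * cost


# Hypothetical per-iteration average delays against a delay limit of 1.0.
multiplier = 0.0
for avg_delay in [1.5, 1.4, 1.2, 0.9, 0.8]:
    multiplier = dual_update(multiplier, avg_delay, limit=1.0, lr=0.1)
    print(f"avg delay {avg_delay:.1f} -> multiplier {multiplier:.2f}")
```

The multiplier grows while the average delay exceeds the limit and shrinks once the constraint is satisfied, so the penalty on delay adapts during training instead of being tuned by hand.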

Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

Authors: Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00140
Pdf link: https://arxiv.org/pdf/2603.00140
Abstract Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: this https URL.

FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation

Authors: Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2603.00159
Pdf link: https://arxiv.org/pdf/2603.00159
Abstract Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
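
For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other outputs drawn for the same input. The sketch below shows that group-relative normalization only. The reward values are made up, and the use of the population standard deviation plus a small epsilon is an assumption, since implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Outputs better than the group average get a positive advantage,
    worse ones a negative advantage.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Hypothetical composite rewards for four videos sampled from one audio clip.
rewards = [0.62, 0.71, 0.55, 0.80]
for r, adv in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward {r:.2f} -> advantage {adv:+.2f}")
```

Because the baseline is the group mean, no separate value network is needed to compute advantages.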

Bridging Policy and Real-World Dynamics: LLM-Augmented Rebalancing for Shared Micromobility Systems

Authors: Heng Tan, Hua Yan, Yu Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00176
Pdf link: https://arxiv.org/pdf/2603.00176
Abstract Shared micromobility services such as e-scooters and bikes have become an integral part of urban transportation, yet their efficiency critically depends on effective vehicle rebalancing. Existing methods either optimize for average demand patterns or employ robust optimization and reinforcement learning to handle predefined uncertainties. However, these approaches overlook emergent events (e.g., demand surges, vehicle outages, regulatory interventions) or sacrifice performance in normal conditions. We introduce AMPLIFY, an LLM-augmented policy adaptation framework for shared micromobility rebalancing. The framework combines a baseline rebalancing module with an LLM-based adaptation module that adjusts strategies in real time under emergent scenarios. The adaptation module ingests system context, demand predictions, and baseline strategies, and refines adjustments through self-reflection. Evaluations on real-world e-scooter data from Chicago show that our approach improves demand satisfaction and system revenue compared to baseline policies, highlighting the potential of LLM-driven adaptation as a flexible solution for managing uncertainty in micromobility systems.

RLShield: Practical Multi-Agent RL for Financial Cyber Defense with Attack-Surface MDPs and Real-Time Response Orchestration

Authors: Srikumar Nayak
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00186
Pdf link: https://arxiv.org/pdf/2603.00186
Abstract Financial systems run nonstop and must stay reliable even during cyber incidents. Modern attacks move across many services (apps, APIs, identity, payment rails), so defenders must make a sequence of actions under time pressure. Most security tools still use fixed rules or static playbooks, which can be slow to adapt when the attacker changes behavior. Reinforcement learning (RL) is a good fit for sequential decisions, but much of the RL-in-finance literature targets trading and does not model real cyber response limits such as action cost, service disruption, and defender coordination across many assets. This paper proposes RLShield, a practical multi-agent RL pipeline for financial cyber defense. We model the enterprise attack surface as a Markov decision process (MDP) where states summarize alerts, asset exposure, and service health, and actions represent real response steps (e.g., isolate a host, rotate credentials, ratelimit an API, block an account, or trigger recovery). RLShield learns coordinated policies across multiple agents (assets or service groups) and optimizes a risk-sensitive objective that balances containment speed, business disruption, and response cost. We also include a game-aware evaluation that tests policies against adaptive attackers and reports operational outcomes, not only reward. Experiments show that RLShield reduces time-to-containment and residual exposure while keeping disruption within a fixed response budget, outperforming static rule baselines and single-agent RL under the same constraints. These results suggest that multi-agent, cost-aware RL can provide a deployable layer for automated response in financial security operations.

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Authors: Soumya Suvra Ghosal, Youngeun Kim, Zhuowei Li, Ritwick Chaudhry, Linghan Xu, Hongjing Zhang, Jakub Zablocki, Yifan Xing, Qin Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00207
Pdf link: https://arxiv.org/pdf/2603.00207
Abstract Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended reasoning. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference time can degrade performance as models progressively lose attention to visual tokens and increasingly rely on textual priors alone. To address this, prior works use reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context while remaining diverse and globally representative of the image, enabling more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that, under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
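
The abstract does not say how the coreset of visual tokens is chosen. One common way to pick items that are both relevant to a query and diverse among themselves is greedy maximal-marginal-relevance selection, sketched below as a stand-in. The function names, the trade-off weight, and the toy 2-D token vectors are all assumptions, not VisRef's actual procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens(tokens, query, k, trade_off=0.5):
    """Greedily pick k token indices, rewarding similarity to the query and
    penalizing similarity to tokens that were already picked."""
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(tokens[i], query)
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "visual tokens"; the first two are near-duplicates of each other.
tokens = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
query = (1.0, 0.2)
print(select_tokens(tokens, query, k=2))
```

In this toy case the second pick skips the near-duplicate of the first token in favor of the dissimilar one, which is the relevance-versus-diversity behavior the abstract describes.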

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

Authors: Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00296
Pdf link: https://arxiv.org/pdf/2603.00296
Abstract Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
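
A minimal sketch of the two ideas named above: step importance as the log-probability improvement toward the correct answer, and an excess-length penalty mass spread so that low-importance steps pay more. The linear redistribution rule, the function names, and the numbers are assumptions for illustration. The abstract does not give SWAP's actual formulas.

```python
def step_importance(answer_logps):
    """Importance of step k = gain in log-probability of the correct answer
    after adding step k. answer_logps[0] is the value before any step."""
    return [b - a for a, b in zip(answer_logps, answer_logps[1:])]


def redistribute_penalty(importances, penalty_mass):
    """Split penalty_mass across steps in proportion to how far each step
    falls below the most important one (hypothetical rule)."""
    top = max(importances)
    slack = [top - s for s in importances]
    total = sum(slack)
    if total == 0:
        # All steps equally important: share the penalty evenly.
        return [penalty_mass / len(importances)] * len(importances)
    return [penalty_mass * s / total for s in slack]


# Hypothetical log-probabilities of the correct answer after 0..4 steps.
logps = [-5.0, -4.8, -2.0, -1.9, -0.3]
importances = step_importance(logps)
penalties = redistribute_penalty(importances, penalty_mass=10.0)
for k, (imp, pen) in enumerate(zip(importances, penalties), start=1):
    print(f"step {k}: importance {imp:.2f}, penalty {pen:.2f}")
```

Under this rule the most informative step receives no penalty and the total penalty always equals the given mass, so compression pressure is concentrated on steps that barely move the answer probability.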

Conservative Equilibrium Discovery in Offline Game-Theoretic Multiagent Reinforcement Learning

Authors: Austin A. Nguyen, Michael P. Wellman
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.00374
Pdf link: https://arxiv.org/pdf/2603.00374
Abstract Offline learning of strategies takes data efficiency to its extreme by restricting algorithms to a fixed dataset of state-action trajectories. We consider the problem in a mixed-motive multiagent setting, where the goal is to solve a game under the offline learning constraint. We first frame this problem in terms of selecting among candidate equilibria. Since datasets may inform only a small fraction of game dynamics, it is generally infeasible in offline game-solving to even verify that a proposed solution is a true equilibrium. Therefore, we consider the relative probability of low regret (i.e., closeness to equilibrium) across candidates based on the information available. Specifically, we extend Policy Space Response Oracles (PSRO), an online game-solving approach, by quantifying game dynamics uncertainty and modifying the RL objective to skew towards solutions more likely to have low regret in the true game. We further propose a novel meta-strategy solver, tailored for the offline setting, to guide strategy exploration in PSRO. Our incorporation of Conservatism principles from Offline reinforcement learning approaches for strategy Exploration gives our approach its name: COffeE-PSRO. Experiments demonstrate COffeE-PSRO's ability to extract lower-regret solutions than state-of-the-art offline approaches and reveal relationships between algorithmic components, empirical game fidelity, and overall performance.
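
To make "regret (i.e., closeness to equilibrium)" concrete, the sketch below computes the regret of a strategy profile in a fully known two-player general-sum matrix game: the largest gain any player could obtain by unilaterally deviating. The prisoner's dilemma payoffs and function names are illustrative assumptions. In the offline setting of the paper the true payoffs are not available, so this quantity can only be estimated.

```python
def expected_payoff(payoff, row, col):
    """Expected payoff of one player under mixed strategies row and col."""
    return sum(
        p * q * payoff[i][j]
        for i, p in enumerate(row)
        for j, q in enumerate(col)
    )


def regret(payoff_row, payoff_col, row, col):
    """Largest gain available to either player from a unilateral deviation.
    A profile is a Nash equilibrium exactly when this is zero."""
    best_row = max(
        sum(q * payoff_row[i][j] for j, q in enumerate(col))
        for i in range(len(row))
    )
    best_col = max(
        sum(p * payoff_col[i][j] for i, p in enumerate(row))
        for j in range(len(col))
    )
    return max(
        best_row - expected_payoff(payoff_row, row, col),
        best_col - expected_payoff(payoff_col, row, col),
    )


# Prisoner's dilemma, actions ordered (cooperate, defect).
PAYOFF_ROW = [[3, 0], [5, 1]]
PAYOFF_COL = [[3, 5], [0, 1]]
print(regret(PAYOFF_ROW, PAYOFF_COL, [1.0, 0.0], [1.0, 0.0]))  # both cooperate
print(regret(PAYOFF_ROW, PAYOFF_COL, [0.0, 1.0], [0.0, 1.0]))  # both defect
```

Mutual cooperation has positive regret because either player gains by defecting, while mutual defection has zero regret and is the equilibrium.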

Hereditary Geometric Meta-RL: Nonlocal Generalization via Task Symmetries

Authors: Paul Nitschke, Shahriar Talebi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.00396
Pdf link: https://arxiv.org/pdf/2603.00396
Abstract Meta-Reinforcement Learning (Meta-RL) commonly generalizes via smoothness in the task encoding. While this enables local generalization around each training task, it requires dense coverage of the task space and leaves richer task space structure untapped. In response, we develop a geometric perspective that endows the task space with a "hereditary geometry" induced by the inherent symmetries of the underlying system. Concretely, the agent reuses a policy learned at training time by transforming states and actions through actions of a Lie group. This converts Meta-RL into symmetry discovery rather than smooth extrapolation, enabling the agent to generalize to wider regions of the task space. We show that when the task space is inherited from the symmetries of the underlying system, the task space embeds into a subgroup of those symmetries whose actions are linearizable, connected, and compact--properties that enable efficient learning and inference at test time. To learn these structures, we develop a differential symmetry discovery method. This collapses functional invariance constraints and thereby improves numerical stability and sample efficiency over functional approaches. Empirically, on a two-dimensional navigation task, our method efficiently recovers the ground-truth symmetry and generalizes across the entire task space, while a common baseline generalizes only near training tasks.
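
The idea of reusing a trained policy by transforming states and actions through a group action can be illustrated on a toy 2-D navigation task with the rotation group SO(2). Everything below (the base goal, the proportional policy, the function names) is a hypothetical setup, not the paper's method: a policy that steers toward the goal (1, 0) is reused for a goal rotated by theta by rotating the state into the training frame and rotating the resulting action back.

```python
import math


def rotate(vec, theta):
    """Apply the SO(2) rotation by angle theta to a 2-D vector."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


BASE_GOAL = (1.0, 0.0)  # goal of the single (hypothetical) training task


def base_policy(state):
    """Policy for the training task: step straight toward BASE_GOAL."""
    return (BASE_GOAL[0] - state[0], BASE_GOAL[1] - state[1])


def transported_policy(state, theta):
    """Policy for the task whose goal is BASE_GOAL rotated by theta."""
    canonical_state = rotate(state, -theta)          # map into training frame
    canonical_action = base_policy(canonical_state)  # reuse trained policy
    return rotate(canonical_action, theta)           # map action back


# Goal rotated by 90 degrees becomes (0, 1); the agent starts at (0.2, 0.3).
action = transported_policy((0.2, 0.3), math.pi / 2)
print(tuple(round(a, 6) for a in action))
```

Since R(theta) applied to (g - R(-theta) s) equals R(theta) g - s, the transported policy points at the rotated goal without any retraining, which is the kind of nonlocal generalization the abstract refers to.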

HydroShear: Hydroelastic Shear Simulation for Tactile Sim-to-Real Reinforcement Learning

Authors: An Dang, Jayjun Lee, Mustafa Mukadam, X. Alice Wu, Bernadette Bucher, Manikantan Nambi, Nima Fazeli
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00446
Pdf link: https://arxiv.org/pdf/2603.00446
Abstract In this paper, we address the problem of tactile sim-to-real policy transfer for contact-rich tasks. Existing methods primarily focus on vision-based sensors and emphasize image rendering quality while providing overly simplistic models of force and shear. Consequently, these models exhibit a large sim-to-real gap for many dexterous tasks. Here, we present HydroShear, a non-holonomic hydroelastic tactile simulator that advances the state-of-the-art by modeling: a) stick-slip transitions, b) path-dependent force and shear build up, and c) full SE(3) object-sensor interactions. HydroShear extends hydroelastic contact models using Signed Distance Functions (SDFs) to track the displacements of the on-surface points of an indenter during physical interaction with the sensor membrane. Our approach generates physics-based, computationally efficient force fields from arbitrary watertight geometries while remaining agnostic to the underlying physics engine. In experiments with GelSight Minis, HydroShear more faithfully reproduces real tactile shear compared to existing methods. This fidelity enables zero-shot sim-to-real transfer of reinforcement learning policies across four tasks: peg insertion, bin packing, book shelving for insertion, and drawer pulling for fine gripper control under slip. Our method achieves a 93% average success rate, outperforming policies trained on tactile images (34%) and alternative shear simulation methods (58%-61%).

Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

Authors: Yilun Wang, Guangba Yu, Haiyu Huang, Zirui Wang, Yujie Huang, Pengfei Chen, Michael R. Lyu
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.00468
Pdf link: https://arxiv.org/pdf/2603.00468
Abstract The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinct fault cases across 40 root cause types spanning the full Kubernetes stack. Crucially, Cloud-OpsBench serves as an enabling infrastructure for next-generation SRE research: (1) As a Data Engine, it harvests high-quality reasoning trajectories to bootstrap Supervised Fine-Tuning (SFT) for Small Language Models; (2) As a Reinforcement Learning (RL) environment, it transforms high-risk operations into a safe low-latency sandbox for training policy optimization agents; and (3) As a Diagnostic Standard, its process-centric protocol uncovers architectural bottlenecks, guiding the design of robust specialized multi-agent systems for RCA.

Optimal-Horizon Social Robot Navigation in Heterogeneous Crowds

Authors: Jiamin Shi, Haolin Zhang, Yuchen Yan, Shitao Chen, Jingmin Xin, Nanning Zheng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.00507
Pdf link: https://arxiv.org/pdf/2603.00507
Abstract Navigating social robots in dense, dynamic crowds is challenging due to environmental uncertainty and complex human-robot interactions. While Model Predictive Control (MPC) offers strong real-time performance, its reliance on a fixed prediction horizon limits adaptability to changing environments and social dynamics. Furthermore, most MPC approaches treat pedestrians as homogeneous obstacles, ignoring social heterogeneity and cooperative or adversarial interactions, which often causes the Frozen Robot Problem in partially observable real-world environments. In this paper, we identify the planning horizon as a socially conditioned decision variable rather than a fixed design choice. Building on this insight, we propose an optimal-horizon social navigation framework that optimizes MPC foresight online according to inferred social context. A spatio-temporal Transformer infers pedestrian cooperation attributes from local trajectory observations, which serve as social priors for a reinforcement learning policy that optimally selects the prediction horizon under a task-driven objective. The resulting horizon-aware MPC incorporates socially conditioned safety constraints to balance navigation efficiency and interaction safety. Extensive simulations and real-world robot experiments demonstrate that optimal foresight selection is critical for robust social navigation in partially observable crowds. Compared to state-of-the-art baselines, the proposed approach achieves a 6.8% improvement in success rate, reduces collisions by 50%, and shortens navigation time by 19%, with a low timeout rate of 0.8%, validating the necessity of socially optimal planning horizons for efficient and safe robot navigation in crowded environments. Code and videos are available at Under Review.

Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation

Authors: Zhen Zhou, Jian Liu, Biwen Lei, Jing Xu, Haohan Weng, Yiling Zhu, Zhuo Chen, Junfeng Fan, Yunkai Ma, Dazhao Du, Song Guo, Fengshui Jing, Chunchao Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.00526
Pdf link: https://arxiv.org/pdf/2603.00526
Abstract Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on the offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for post-training of 3D mesh generation, which is 3.75× faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes.

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

LOGIGEN：逻辑驱动的可验证代理任务生成

Authors: Yucheng Zeng, Weipeng Lu, Linyun Liu, Shupeng Li, Zitian Qu, Chenghao Zhu, Shaofei Li, Zhengdong Tan, Mengyue Liu, Haotian Zhao, Zhe Zhou, Jianmin Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00540
Pdf link: https://arxiv.org/pdf/2603.00540
Abstract The evolution of Large Language Models (LLMs) from static instruction-followers to autonomous agents necessitates operating within complex, stateful environments to achieve precise state-transition objectives. However, this paradigm is bottlenecked by data scarcity, as existing tool-centric reverse-synthesis pipelines fail to capture the rigorous logic of real-world applications. We introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data based on three core pillars: Hard-Compiled Policy Grounding, Logic-Driven Forward Synthesis, and Deterministic State Verification. Specifically, a Triple-Agent Orchestration is employed: the Architect compiles natural-language policy into database constraints to enforce hard rules; the Set Designer initializes boundary-adjacent states to trigger critical policy conflicts; and the Explorer searches this environment to discover causal solution paths. This framework yields a dataset of 20,000 complex tasks across 8 domains, where validity is strictly guaranteed by checking exact state equivalence. Furthermore, we propose a verification-based training protocol where Supervised Fine-Tuning (SFT) on verifiable trajectories establishes compliance with hard-compiled policy, while Reinforcement Learning (RL) guided by dense state-rewards refines long-horizon goal achievement. On $\tau^2$-Bench, LOGIGEN-32B(RL) achieves a 79.5% success rate, substantially outperforming the base model (40.7%). These results demonstrate that logic-driven synthesis combined with verification-based training effectively constructs the causally valid trajectories needed for next-generation agents.
中文摘要 大型语言模型（LLMs）从静态指令跟随器向自主智能体的演变，要求在复杂且有状态的环境中运行，以实现精确的状态转移目标。然而，这一范式受制于数据稀缺，现有的以工具为中心的逆向合成流水线未能捕捉现实应用的严谨逻辑。我们介绍 LOGIGEN，这是一个逻辑驱动的框架，基于三大核心支柱合成可验证的训练数据：硬编译策略落地、逻辑驱动前向合成和确定性状态验证。具体来说，采用了三智能体编排：Architect 将自然语言策略编译为数据库约束以强制执行硬规则；Set Designer 初始化边界邻近状态以触发关键策略冲突；而 Explorer 则在该环境中搜索以发现因果解路径。该框架生成了涵盖8个领域、2万个复杂任务的数据集，通过检查精确状态等价性严格保证有效性。此外，我们提出了一种基于验证的训练协议，其中对可验证轨迹进行监督微调（SFT）以建立对硬编译策略的遵守，而由密集状态奖励引导的强化学习（RL）则优化长期目标的实现。在$\tau^2$-Bench上，LOGIGEN-32B（RL）实现了79.5%的成功率，远超基础模型（40.7%）。这些结果表明，逻辑驱动合成与基于验证的训练相结合，能够有效构建下一代智能体所需的因果有效轨迹。
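The abstract says task validity is guaranteed "by checking exact state equivalence". A minimal sketch of what such a check could look like follows, assuming a database state is represented as a mapping from table names to lists of row dictionaries. The representation and the names `State`, `canonicalize`, and `states_equivalent` are illustrative assumptions, not taken from the paper.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical representation (not from the paper): a state maps each table
# name to a list of rows, and each row is a column -> value dictionary.
State = Dict[str, List[Dict[str, Any]]]
CanonicalRow = Tuple[Tuple[str, Any], ...]


def canonicalize(state: State) -> Dict[str, Tuple[CanonicalRow, ...]]:
    """Normalize a state so that row order and column order do not matter."""
    canonical = {}
    for table, rows in state.items():
        # Sort columns by name inside each row (column names are unique).
        normalized_rows = [tuple(sorted(row.items())) for row in rows]
        # Sort rows by their repr so mixed value types never get compared.
        canonical[table] = tuple(sorted(normalized_rows, key=repr))
    return canonical


def states_equivalent(final_state: State, expected_state: State) -> bool:
    """Return True only if both states hold exactly the same tables and rows."""
    return canonicalize(final_state) == canonicalize(expected_state)
```

Because the comparison is exact, a trajectory either reaches the expected state or it does not, which is what makes the check deterministic in this sketch.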

Learning to Attack: A Bandit Approach to Adversarial Context Poisoning

学习攻击：对抗性上下文投毒的赌博机方法

Authors: Ray Telikani, Amir H. Gandomi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00567
Pdf link: https://arxiv.org/pdf/2603.00567
Abstract Neural contextual bandits are vulnerable to adversarial attacks, where subtle perturbations to rewards, actions, or contexts induce suboptimal decisions. We introduce AdvBandit, a black-box adaptive attack that formulates context poisoning as a continuous-armed bandit problem, enabling the attacker to jointly learn and exploit the victim's evolving policy. The attacker requires no access to the victim's internal parameters, reward function, or gradient information; instead, it constructs a surrogate model using a maximum-entropy inverse reinforcement learning module from observed context-action pairs and optimizes perturbations against this surrogate using projected gradient descent. An upper confidence bound-aware Gaussian process guides arm selection. An attack-budget control mechanism is also introduced to limit detection risk and overhead. We provide theoretical guarantees, including sublinear attacker regret and lower bounds on victim regret linear in the number of attacks. Experiments on three real-world datasets (Yelp, MovieLens, and Disin) against various victim contextual bandits demonstrate that our attack model achieves higher cumulative victim regret than state-of-the-art baselines.
中文摘要 神经上下文赌博机容易受到对抗性攻击，即对奖励、动作或上下文的细微扰动会引发次优决策。我们介绍AdvBandit，一种黑盒自适应攻击，将上下文投毒表述为连续臂赌博机问题，使攻击者能够同时学习并利用受害者不断演变的策略。攻击者无需访问受害者的内部参数、奖励函数或梯度信息；相反，它利用最大熵逆强化学习模块，从观察到的上下文-动作对构建一个替代模型，并通过投影梯度下降针对该替代模型优化扰动。一个考虑置信上界的高斯过程指导臂的选择。还引入了攻击预算控制机制，以限制被检测的风险和开销。我们提供理论保证，包括攻击者的次线性遗憾，以及受害者遗憾关于攻击次数呈线性的下界。在三个真实世界数据集（Yelp、MovieLens和Disin）上针对各种受害者上下文赌博机的实验表明，我们的攻击模型比最先进的基线实现了更高的受害者累计遗憾。

Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection

学习探索：面向图分布外检测的策略引导离群值合成

Authors: Li Sun, Lanxu Yang, Jiayu Tian, Bowen Fang, Xiaoyan Yu, Junda Ye, Peng Tang, Hao Peng, Philip S. Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00602
Pdf link: https://arxiv.org/pdf/2603.00602
Abstract Detecting out-of-distribution (OOD) graphs is crucial for ensuring the safety and reliability of Graph Neural Networks. In unsupervised graph-level OOD detection, models are typically trained using only in-distribution (ID) data, resulting in incomplete feature space characterization and weak decision boundaries. Although synthesizing outliers offers a promising solution, existing approaches rely on fixed, non-adaptive sampling heuristics (e.g., distance- or density-based), limiting their ability to explore informative OOD regions. We propose a Policy-Guided Outlier Synthesis (PGOS) framework that replaces static heuristics with a learned exploration strategy. Specifically, PGOS trains a reinforcement learning agent to navigate low-density regions in a structured latent space and sample representations that most effectively refine the OOD decision boundary. These representations are then decoded into high-quality pseudo-OOD graphs to improve detector robustness. Extensive experiments demonstrate that PGOS achieves state-of-the-art performance on multiple graph OOD and anomaly detection benchmarks.
中文摘要 检测分布外（OOD）图对于确保图神经网络的安全性和可靠性至关重要。在无监督图级OOD检测中，模型通常仅使用分布内（ID）数据进行训练，导致特征空间刻画不完整且决策边界较弱。尽管合成离群值提供了有前景的解决方案，但现有方法依赖固定的、非自适应的采样启发式（如基于距离或密度的），限制了它们探索有信息量的OOD区域的能力。我们提出了一个策略引导离群值合成（PGOS）框架，用学习到的探索策略取代静态启发式。具体来说，PGOS训练强化学习智能体在结构化潜空间中的低密度区域导航，并采样最能有效细化OOD决策边界的表示。这些表示随后被解码成高质量的伪OOD图，以提高检测器的鲁棒性。大量实验表明，PGOS在多个图OOD和异常检测基准测试上达到了最先进的性能。

From Simulation to Reality: Practical Deep Reinforcement Learning-based Link Adaptation for Cellular Networks

从仿真到现实：面向蜂窝网络的实用深度强化学习链路适配

Authors: Lizhao You, Nanqing Zhou, Guanglong Pang, Jiajie Huang, Yulin Shao, Liqun Fu
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.00689
Pdf link: https://arxiv.org/pdf/2603.00689
Abstract Link Adaptation (LA), which dynamically adjusts the Modulation and Coding Scheme (MCS) to accommodate time-varying channels, is crucial and challenging in cellular networks. Deep reinforcement learning (DRL)-based LA, which learns to make decisions through interaction with the environment, is a promising approach to improving throughput. However, existing DRL-based LA algorithms are typically evaluated in simplified simulation environments, neglecting practical issues such as ACK/NACK feedback delay, retransmission and parallel hybrid automatic repeat request (HARQ). Moreover, these algorithms overlook the impact of DRL execution latency, which can significantly degrade system performance. To address these challenges, we propose Decoupling-DQN (DC-DQN), a new DRL framework that separates traditional DRL's coupled training and inference processes into two modules based on Deep Q Networks (DQN): a real-time inference module and an out-of-decision-loop training module. Based on this framework, we introduce a novel DRL-based LA algorithm, DC-DQN-LA. The algorithm incorporates practical considerations by designing state, action, and reward functions that account for feedback delays, parallel HARQ, and retransmissions. We implemented a prototype using USRP software-defined radios and srsRAN software. Experimental results demonstrate that DC-DQN-LA improves throughput by 40% to 70% in mobile scenarios compared with baseline LA algorithms, while maintaining comparable block error rates, and can quickly adapt to environment changes in mobile-to-static scenarios. These results highlight the efficiency and practicality of the proposed DRL-based LA algorithm.
中文摘要 链路适配（LA）动态调整调制与编码方案（MCS）以适应时变信道，在蜂窝网络中至关重要且具有挑战性。基于深度强化学习（DRL）的链路适配通过与环境交互来学习决策，是一种有前景的提升吞吐量的方法。然而，现有基于DRL的LA算法通常在简化的仿真环境中进行评估，忽略了诸如ACK/NACK反馈延迟、重传和并行混合自动重传请求（HARQ）等实际问题。此外，这些算法忽视了DRL执行延迟的影响，而该延迟会显著降低系统性能。为应对这些挑战，我们提出了Decoupling-DQN（DC-DQN），这是一个新的DRL框架，将传统DRL中耦合的训练和推理过程拆分为基于深度Q网络（DQN）的两个模块：一个实时推理模块和一个决策环路之外的训练模块。基于该框架，我们引入了一种新颖的基于DRL的LA算法DC-DQN-LA。该算法通过设计考虑反馈延迟、并行HARQ和重传的状态、动作和奖励函数，纳入了实际因素。我们使用USRP软件定义无线电和srsRAN软件实现了一个原型。实验结果表明，与基线LA算法相比，DC-DQN-LA在移动场景下吞吐量提升了40%至70%，同时保持了相当的误块率，并且能够在移动到静态场景中快速适应环境变化。这些结果凸显了所提出的基于DRL的LA算法的高效性和实用性。
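To illustrate why delayed ACK/NACK feedback and parallel HARQ processes complicate DRL-based LA, here is a minimal sketch of one way to pair each MCS decision with feedback that arrives later and possibly out of order. The class `DelayedFeedbackBuffer`, its method names, and the reward (delivered bits on ACK, zero on NACK) are illustrative assumptions; the abstract does not specify how DC-DQN-LA does this.

```python
from typing import Dict, List, Tuple

StateVec = Tuple[float, ...]
Transition = Tuple[StateVec, int, float]  # (state, mcs_index, reward)


class DelayedFeedbackBuffer:
    """Hypothetical helper: match MCS decisions to late ACK/NACK feedback."""

    def __init__(self) -> None:
        # One outstanding decision per HARQ process ID (assumed unique).
        self._pending: Dict[int, Tuple[StateVec, int]] = {}
        # Completed transitions, ready for an out-of-loop trainer to consume.
        self.replay: List[Transition] = []

    def record_decision(self, harq_id: int, state: StateVec, mcs_index: int) -> None:
        """Called by the real-time inference path when an MCS is chosen."""
        self._pending[harq_id] = (state, mcs_index)

    def on_feedback(self, harq_id: int, ack: bool, bits: int) -> None:
        """Called when feedback for a HARQ process arrives, possibly out of order."""
        state, mcs_index = self._pending.pop(harq_id)
        reward = float(bits) if ack else 0.0  # assumed reward, not the paper's
        self.replay.append((state, mcs_index, reward))

    def pending_count(self) -> int:
        """Number of decisions still waiting for feedback."""
        return len(self._pending)
```

Keying pending decisions by HARQ process ID lets the inference path keep issuing decisions without waiting for feedback, while a separate trainer reads completed transitions from `replay`.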

Frozen Policy Iteration: Computationally Efficient RL under Linear $Q^{\pi}$ Realizability for Deterministic Dynamics

冻结策略迭代：确定性动力学下线性$Q^{\pi}$可实现性的计算高效强化学习

Authors: Yijing Ke, Zihan Zhang, Ruosong Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.00716
Pdf link: https://arxiv.org/pdf/2603.00716
Abstract We study computationally and statistically efficient reinforcement learning under the linear $Q^{\pi}$ realizability assumption, where any policy's $Q$-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable, or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, under the linear $Q^{\pi}$ realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards and deterministic transitions. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{d^2H^6T})$, where $d$ is the dimensionality of the feature space, $H$ is the horizon length, and $T$ is the total number of episodes. Our regret bound is optimal for linear (contextual) bandits which is a special case of our setting with $H = 1$. Existing policy iteration algorithms under the same setting heavily rely on repeatedly sampling the same state by access to the simulator, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only high-confidence part of the trajectory data and freezing the policy for well-explored states, which ensures that all data used by our algorithm remains effectively on-policy during the whole course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.
中文摘要 我们研究线性$Q^{\pi}$可实现性假设下计算和统计上高效的强化学习，其中任意策略的$Q$函数在给定状态-动作特征表示下都是线性的。该设定下以往的方法要么计算上难以处理，要么需要（局部）访问模拟器。本文提出一种计算高效的在线强化学习算法，名为冻结策略迭代，它在线性$Q^{\pi}$可实现性设定下，适用于具有随机初始状态、随机奖励和确定性转移的马尔可夫决策过程（MDP）。我们的算法实现了 $\widetilde{O}(\sqrt{d^2H^6T})$ 的遗憾界，其中 $d$ 是特征空间的维度，$H$ 是时域长度，$T$ 是总回合数。我们的遗憾界对于线性（上下文）赌博机是最优的，后者是我们设定中 $H = 1$ 的一个特例。相同设定下的现有策略迭代算法高度依赖通过访问模拟器反复采样同一状态，而这在本文研究的具有随机初始状态的在线设定中无法实现。相比之下，我们的新算法通过有策略地只使用轨迹数据中高置信的部分，并对已充分探索的状态冻结策略，绕过了这一限制，从而确保算法使用的所有数据在整个学习过程中都有效地保持同策略（on-policy）。我们进一步展示了方法的通用性，将其扩展到Uniform-PAC设定以及具有有界eluder维数的函数类。
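The optimality claim for linear bandits can be checked by substituting $H = 1$ into the stated regret bound. The comparison with the $\Omega(d\sqrt{T})$ lower bound is a standard result for linear bandits with rich action sets, recalled here as background; the abstract itself does not state it.

```latex
\widetilde{O}\!\left(\sqrt{d^2 H^6 T}\right)\Big|_{H=1}
  = \widetilde{O}\!\left(\sqrt{d^2 T}\right)
  = \widetilde{O}\!\left(d\sqrt{T}\right)
```

This matches the $\Omega(d\sqrt{T})$ rate up to logarithmic factors. For $H > 1$, the bound grows as $H^3$ in the horizon.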

Keyframe-Guided Structured Rewards for Reinforcement Learning in Long-Horizon Laboratory Robotics

长视野实验室机器人强化学习的关键帧引导结构化奖励

Authors: Yibo Qiu, Shu'ang Sun, Haoliang Ye, Ronald X Xu, Mingzhai Sun
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.00719
Pdf link: https://arxiv.org/pdf/2603.00719
Abstract Long-horizon precision manipulation in laboratory automation, such as pipette tip attachment and liquid transfer, requires policies that respect strict procedural logic while operating in continuous, high-dimensional state spaces. However, existing approaches struggle with reward sparsity, multi-stage structural constraints, and noisy or imperfect demonstrations, leading to inefficient exploration and unstable convergence. We propose a Keyframe-Guided Reward Generation Framework that automatically extracts kinematics-aware keyframes from demonstrations, generates stage-wise targets via a diffusion-based predictor in latent space, and constructs a geometric progress-based reward to guide online reinforcement learning. The framework integrates multi-view visual encoding, latent similarity-based progress tracking, and human-in-the-loop reinforcement fine-tuning on a Vision-Language-Action backbone to align policy optimization with the intrinsic stepwise logic of biological protocols. Across four real-world laboratory tasks, including high-precision pipette attachment and dynamic liquid transfer, our method achieves an average success rate of 82% after 40--60 minutes of online fine-tuning. Compared with HG-DAgger (42%) and Hil-ConRFT (47%), our approach demonstrates the effectiveness of structured keyframe-guided rewards in overcoming exploration bottlenecks and providing a scalable solution for high-precision, long-horizon robotic laboratory automation.
中文摘要 实验室自动化中的长时程精密操作，如移液器吸头安装和液体转移，需要策略在连续高维状态空间中运行的同时严格遵守程序逻辑。然而，现有方法难以应对奖励稀疏性、多阶段结构约束以及带噪声或不完美的演示，导致探索效率低下和收敛不稳定。我们提出了一个关键帧引导奖励生成框架，能够自动从演示中提取运动学感知的关键帧，通过基于扩散的预测器在潜空间生成分阶段目标，并构建基于几何进度的奖励以指导在线强化学习。该框架集成了多视角视觉编码、基于潜在相似性的进度跟踪以及在视觉-语言-动作骨干上的人在回路强化微调，使策略优化与生物实验方案的内在分步逻辑保持一致。在四项真实实验室任务中，包括高精度移液器吸头安装和动态液体转移，我们的方法在在线微调40-60分钟后平均成功率为82%。与HG-DAgger（42%）和Hil-ConRFT（47%）相比，我们的方法证明了结构化关键帧引导奖励在克服探索瓶颈和为高精度、长时程机器人实验室自动化提供可扩展解决方案方面的有效性。

RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

RLAR：大型语言模型上多任务强化学习的智能奖励系统

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.00724
Pdf link: https://arxiv.org/pdf/2603.00724
Abstract Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: this https URL.
中文摘要 通过强化学习实现大型语言模型对齐，关键在于奖励函数的质量。然而，静态的领域特定奖励模型通常训练成本高昂，且在强化学习迭代中遇到的非分布场景中泛化能力较差。我们介绍RLAR（来自代理奖励的强化学习），这是一个由智能体驱动的框架，能够动态为单个查询分配定制的奖励函数。具体来说，RLAR将奖励获取转化为动态的工具综合和调用任务。它利用大型语言模型代理自主从互联网检索最优奖励模型，并通过代码生成合成程序验证器。这使得奖励系统能够随着训练期间数据分布的变化而自我演化。实验结果表明，RLAR在数学、编码、翻译和对话任务中，能够持续提升10到60的性能。在RewardBench-V2上，RLAR显著优于静态基线，接近性能上限，通过动态奖励编排展现了更优越的泛化能力。数据和代码可在此链接获取：https URL。

Qwen3-Coder-Next Technical Report

Qwen3-Coder-Next 技术报告

Authors: Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, Fan Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.00729
Pdf link: https://arxiv.org/pdf/2603.00729
Abstract We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.
中文摘要 我们介绍Qwen3-Coder-Next，一种专门用于编码智能体的开放权重语言模型。Qwen3-Coder-Next 是一个拥有800亿参数的模型，在推理过程中仅激活30亿参数，实现了强大的编码能力和高效的推理。本研究探讨强大的训练配方能在多大程度上推动小参数规模模型的能力极限。为此，我们通过大规模合成可验证的编码任务并配以可执行环境来进行智能体训练，从而通过中期训练和强化学习直接从环境反馈中学习。在包括SWE-Bench和Terminal-Bench在内的以智能体为中心的基准测试中，Qwen3-Coder-Next相对于其激活参数量实现了有竞争力的性能。我们发布基础版和指令调优的开放权重版本，以支持研究和实际编码智能体的开发。

MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning

MO-MIX：多目标多代理协作决策，采用深度强化学习

Authors: Tianmeng Hu, Biao Luo, Chunhua Yang, Tingwen Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.00730
Pdf link: https://arxiv.org/pdf/2603.00730
Abstract Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks have several conflicting objectives and may require multiple agents to cooperate, giving rise to multi-objective multi-agent decision-making problems. However, only a few works have addressed this intersection. Existing approaches are limited to separate fields and can only handle multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method on all four evaluation metrics, but also requires less computational cost.
中文摘要 深度强化学习（RL）已被广泛应用于解决复杂的决策问题。在许多现实场景中，任务通常存在多个相互冲突的目标，并可能需要多个智能体协作，这就是多目标多智能体决策问题。然而，针对这一交叉领域的研究很少。现有方法局限于各自独立的领域，只能处理单目标的多智能体决策，或单智能体的多目标决策。本文提出MO-MIX用于解决多目标多智能体强化学习（MOMARL）问题。我们的方法基于集中式训练与去中心化执行（CTDE）框架。代表对目标偏好的权重向量被输入去中心化智能体网络，作为局部动作价值函数估计的条件，同时使用并行架构的混合网络估计联合动作价值函数。此外，还采用探索引导方法以提升最终非支配解的均匀性。实验表明，所提出的方法能够有效解决多目标多智能体协作决策问题，并生成帕累托集的近似。我们的方法不仅在全部四种评估指标上显著优于基线方法，而且计算成本更低。

CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

CHIMERA：用于通用大型语言模型推理的紧凑合成数据

Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.00889
Pdf link: https://arxiv.org/pdf/2603.00889
Abstract Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
中文摘要 大型语言模型（LLMs）近年来展现出了卓越的推理能力，这在很大程度上得益于基于监督微调（SFT）和强化学习（RL）的后期训练，基于高质量推理数据。然而，在开放且可扩展的环境中重现和扩展这些能力受到三个基本数据中心挑战的阻碍：（1）冷启动问题，源于缺乏具有详细且长思考链（CoT）轨迹的种子数据集来初始化推理策略;（2）领域覆盖有限，因为大多数现有的开源推理数据集集中在数学领域，对更广泛的科学学科覆盖有限;以及（3）注释瓶颈，即前沿层推理任务的难度使得可靠的人工注释成本高昂或不可行。为应对这些挑战，我们引入了CHIMERA，这是一个包含9K样本的紧凑合成推理数据集，用于可推广的跨域推理。CHIMERA具有三个关键特性：（1）它提供了丰富且长的CoT推理轨迹，这些轨迹由最先进的推理模型综合;（2）其覆盖范围广泛且结构化，涵盖8个主要科学学科，并通过模型生成的层级分类法组织超过1000个细致主题;（3）采用全自动化、可扩展的评估流程，利用强推理模型交叉验证问题的有效性和答案正确性。我们用 CHIMERA 来对 4B Qwen3 模型进行后期训练。尽管数据集规模适中，所得模型在包括GPQA-Diamond、AIME 24/25/26、HMMT 25和Humanity's Last Exam等一系列具有挑战性的推理基准测试中表现出色，推理性能接近或匹敌了DeepSeek-R1和Qwen3-235B等更大规模模型。

Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning

原则性快速和元知识学习者，用于持续强化学习

Authors: Ke Sun, Hongming Zhang, Jun Jin, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00903
Pdf link: https://arxiv.org/pdf/2603.00903
Abstract Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning (RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. We conduct experiments on various pixel-based and continuous control benchmarks, which show that our proposed dual-learner approach achieves superior continual learning performance relative to baseline methods. The code is released at this https URL.
中文摘要 本研究受人类学习与记忆系统，特别是海马体与大脑皮层相互作用的启发，提出了一个由快速学习者和元学习者组成的双学习者框架，以解决持续强化学习（RL）问题。这两个学习者相互耦合，承担着不同但互补的角色：快速学习者专注于知识迁移，而元学习者则确保知识整合。与通过平均回报最大化来共享知识的传统多任务强化学习方法不同，我们的元学习者通过显式地最小化灾难性遗忘，逐步整合新经验，从而支持快速学习者的高效累积知识迁移。为了促进新环境中的快速适应，我们引入了一种自适应元预热机制，选择性地利用过去的知识。我们在各种基于像素和连续控制的基准测试中进行了实验，结果显示我们提出的双学习者方法的持续学习性能优于基线方法。代码发布在这个 https 网址中。

Minimalist Compliance Control

极简柔顺控制

Authors: Haochen Shi, Songbo Hu, Yifan Hou, Weizhuo Wang, Karen Liu, Shuran Song
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.00913
Pdf link: https://arxiv.org/pdf/2603.00913
Abstract Compliance control is essential for safe physical interaction, yet its adoption is limited by hardware requirements such as force torque sensors. While recent reinforcement learning approaches aim to bypass these constraints, they often suffer from sim-to-real gaps, lack safety guarantees, and add system complexity. We propose Minimalist Compliance Control, which enables compliant behavior using only motor current or voltage signals readily available in modern servos and quasi-direct-drive motors, without force sensors, current control, or learning. External wrenches are estimated from actuator signals and Jacobians and incorporated into a task-space admittance controller, preserving sufficient force measurement accuracy for stable and responsive compliance control. Our method is embodiment-agnostic and plug-and-play with diverse high-level planners. We validate our approach on a robot arm, a dexterous hand, and two humanoid robots across multiple contact-rich tasks, using vision-language models, imitation learning, and model-based planning. The results demonstrate robust, safe, and compliant interaction across embodiments and planning paradigms.
中文摘要 柔顺控制对于安全的物理交互至关重要，但其应用受限于诸如力/力矩传感器等硬件需求。虽然近期的强化学习方法旨在绕过这些限制，但它们常常存在仿真到现实的差距，缺乏安全保障，并增加了系统复杂性。我们提出了极简柔顺控制，仅使用现代舵机和准直驱电机中易于获得的电机电流或电压信号，无需力传感器、电流控制或学习，即可实现柔顺行为。外部力旋量（wrench）由执行器信号和雅可比矩阵估计得到，并被纳入任务空间导纳控制器，保持足够的力测量精度，实现稳定且响应灵敏的柔顺控制。我们的方法与具体机器人形态无关，并可与多种高层规划器即插即用。我们在一个机器人手臂、一只灵巧手和两个人形机器人上验证了我们的方法，涵盖多个接触丰富的任务，采用视觉语言模型、模仿学习和基于模型的规划。结果展示了跨机器人形态和规划范式的稳健、安全且柔顺的交互。

HierKick: Hierarchical Reinforcement Learning for Vision-Guided Soccer Robot Control

HierKick：视觉引导足球机器人控制的层级强化学习

Authors: Yizhi Chen, Zheng Zhang, Zhanxiang Cao, Yihe Chen, Shengcheng Fu, Liyun Yan, Yang Zhang, Jiali Liu, Haoyang Li, Yue Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.00948
Pdf link: https://arxiv.org/pdf/2603.00948
Abstract Controlling soccer robots involves multi-time-scale decision-making, which requires balancing long-term tactical planning and short-term motion execution. Traditional end-to-end reinforcement learning (RL) methods face challenges in complex dynamic environments. This paper proposes HierKick, a vision-guided soccer robot control framework based on dual-frequency hierarchical RL. The framework adopts a hierarchical control architecture featuring a 5 Hz high-level policy that integrates YOLOv8 for real-time detection and selects tasks via a coach model, and a pre-trained 50 Hz low-level controller for precise joint control. Through this architecture, the framework achieves the four steps of approaching, aligning, dribbling, and kicking. Experimental results show that the success rates of this framework are 95.2% in IsaacGym, 89.8% in MuJoCo, and 80% in the real world. HierKick provides an effective hierarchical paradigm for robot control in complex environments, extendable to multi-time-scale tasks, with its modular design and skill reuse offering a new path for intelligent robot control.
中文摘要 控制足球机器人涉及多时间尺度的决策，需要在长期战术规划与短期动作执行之间取得平衡。传统的端到端强化学习（RL）方法在复杂的动态环境中面临挑战。本文提出了HierKick，这是一种基于双频分层强化学习的视觉引导足球机器人控制框架。该框架采用分层控制架构，包含5 Hz高层策略，集成YOLOv8实现实时检测并通过教练模型选择任务，以及预训练的50 Hz低层控制器，实现精确的关节控制。通过该架构，框架实现了接近、对齐、盘带和踢球的四个步骤。实验结果显示，该框架在IsaacGym的成功率为95.2%，MuJoCo为89.8%，在现实世界中为80%。HierKick为复杂环境中的机器人控制提供了有效的层级范式，可扩展至多时间尺度任务，其模块化设计和技能复用为智能机器人控制开辟了新路径。
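The dual-frequency architecture implies that the 5 Hz high-level policy makes one decision for every ten steps of the 50 Hz low-level controller. The loop below is a minimal sketch of that scheduling only. The function `run_control_loop` and the callback names `select_task` and `low_level_step` are hypothetical, and the perception, coach model, and joint controller from the paper are not modeled.

```python
from typing import Callable, List

HIGH_LEVEL_HZ = 5   # high-level policy rate, from the abstract
LOW_LEVEL_HZ = 50   # low-level controller rate, from the abstract
TICKS_PER_DECISION = LOW_LEVEL_HZ // HIGH_LEVEL_HZ  # low-level ticks per decision


def run_control_loop(
    num_ticks: int,
    select_task: Callable[[int], str],
    low_level_step: Callable[[str, int], None],
) -> List[str]:
    """Step the low-level controller every tick; refresh the task at 5 Hz."""
    task = ""
    history: List[str] = []
    for tick in range(num_ticks):
        if tick % TICKS_PER_DECISION == 0:
            # High-level decision: choose which skill to run next.
            task = select_task(tick)
        # Low-level control runs on every tick with the current task.
        low_level_step(task, tick)
        history.append(task)
    return history
```

Holding the task fixed between high-level decisions is what lets a slower perception and task-selection stage coexist with fast joint control in this sketch.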

Stabilizing Policy Optimization via Logits Convexity

通过Logits 凸性稳定策略优化

Authors: Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.00963
Pdf link: https://arxiv.org/pdf/2603.00963
Abstract While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
中文摘要 虽然强化学习（RL）是大型语言模型（LLM）近年来成功的核心，但强化学习优化以不稳定著称，尤其是与监督微调（SFT）相比。本研究从梯度视角探讨SFT与RL之间的稳定性差距，并证明SFT损失相对于模型logits的凸性在实现稳定训练中起着关键作用。我们的理论分析表明，这一特性在优化过程中会诱导有利的梯度方向性。相比之下，广泛采用的策略梯度算法近端策略优化（PPO）采用裁剪的替代目标，缺乏这种稳定特性。基于这一观察，我们提出了Logits凸优化（LCO），这是一种简单但有效的策略优化框架，将所学策略与由原始强化学习目标导出的最优目标对齐，从而模拟logits层面凸性的稳定效果。跨多个模型家族的广泛实验表明，我们的LCO框架持续提升训练稳定性，并在广泛的基准测试中优于传统强化学习方法。
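The convexity property the abstract refers to can be verified directly for the standard token-level cross-entropy, assumed here to be the SFT loss. Writing $z$ for the logits, $y$ for the target token, $p = \mathrm{softmax}(z)$, and $e_y$ for the one-hot vector of $y$ (notation introduced for this sketch, not taken from the paper):

```latex
\begin{align}
\ell(z) &= -z_y + \log \sum_{j} e^{z_j}, \\
\nabla_z \ell &= p - e_y, \qquad
\nabla_z^2 \ell = \operatorname{diag}(p) - p p^{\top}, \\
v^{\top} \nabla_z^2 \ell \, v
  &= \sum_j p_j v_j^2 - \Big(\sum_j p_j v_j\Big)^{2}
   = \operatorname{Var}_{j \sim p}[v_j] \;\ge\; 0 .
\end{align}
```

The Hessian is positive semidefinite for every $v$, so $\ell$ is convex in the logits. This says nothing about convexity in the network parameters, and the abstract's claim is limited to the logits level.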

Intent-Context Synergy Reinforcement Learning for Autonomous UAV Decision-Making in Air Combat

意图-情境协同强化学习用于空战中自主无人机决策

Authors: Jiahao Fu, Feng Yang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.00974
Pdf link: https://arxiv.org/pdf/2603.00974
Abstract Autonomous UAV infiltration in dynamic contested environments remains a significant challenge due to the partially observable nature of threats and the conflicting objectives of mission efficiency versus survivability. Traditional Reinforcement Learning (RL) approaches often suffer from myopic decision-making and struggle to balance these trade-offs in real time. To address these limitations, this paper proposes an Intent-Context Synergy Reinforcement Learning (ICS-RL) framework. The framework introduces two core innovations: (1) An LSTM-based Intent Prediction Module that forecasts the future trajectories of hostile units, transforming the decision paradigm from reactive avoidance to proactive planning via state augmentation; (2) A Context-Analysis Synergy Mechanism that decomposes the mission into hierarchical sub-tasks (safe cruise, stealth planning, and hostile breakthrough). We design a heterogeneous ensemble of Dueling DQN agents, each specialized in a specific tactical context. A dynamic switching controller based on Max-Advantage values seamlessly integrates these agents, allowing the UAV to adaptively select the optimal policy without hard-coded rules. Extensive simulations demonstrate that ICS-RL significantly outperforms baselines (Standard DDQN) and traditional methods (PSO, Game Theory). The proposed method achieves a mission success rate of 88% and reduces the average exposure frequency to 0.24 per episode, validating its superiority in ensuring robust and stealthy penetration in high-dynamic scenarios.
中文摘要 由于威胁仅部分可观测，且任务效率与生存能力这两个目标相互冲突，自主无人机在动态对抗环境中的渗透依然是重大挑战。传统的强化学习（RL）方法常常存在短视决策，难以实时平衡这些权衡。为解决这些局限性，本文提出了意图-情境协同强化学习（ICS-RL）框架。该框架引入了两项核心创新：（1）基于LSTM的意图预测模块，预测敌对单位的未来轨迹，通过状态增强将决策范式从被动规避转变为主动规划；（2）情境分析协同机制，将任务分解为分层子任务（安全巡航、隐身规划和敌方突破）。我们设计了一个由Dueling DQN智能体组成的异构集成，每个智能体专注于特定的战术情境。基于Max-Advantage值的动态切换控制器无缝整合这些智能体，使无人机能够在不依赖硬编码规则的情况下自适应地选择最优策略。大量仿真表明，ICS-RL显著优于基线（标准DDQN）和传统方法（PSO、博弈论）。该方法实现了88%的任务成功率，并将平均暴露频率降至每回合0.24次，验证了其在高动态场景下确保稳健隐蔽突防的优越性。
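One plausible reading of the "dynamic switching controller based on Max-Advantage values" is that each specialized Dueling DQN agent reports per-action advantages for the current state, and the controller executes the action from whichever agent reports the largest one. The sketch below illustrates that reading only. The function name, the dictionary layout, and the agent labels are assumptions, and the paper's actual switching rule may differ.

```python
from typing import Dict, List, Tuple


def select_by_max_advantage(advantages: Dict[str, List[float]]) -> Tuple[str, int]:
    """Return the (agent, action) pair with the largest advantage value.

    `advantages` maps a hypothetical agent label to that agent's per-action
    advantage estimates A(s, a) for the current state.
    """
    best_agent, best_action, best_value = "", -1, float("-inf")
    for agent, values in advantages.items():
        for action, value in enumerate(values):
            if value > best_value:
                best_agent, best_action, best_value = agent, action, value
    return best_agent, best_action


# Example with three agents named after the sub-tasks in the abstract.
example = {
    "safe_cruise": [0.1, 0.0, -0.1],
    "stealth_planning": [0.4, -0.2, -0.2],
    "hostile_breakthrough": [-0.3, 0.9, -0.6],
}
chosen_agent, chosen_action = select_by_max_advantage(example)
```

Comparing advantages rather than raw Q-values removes each agent's state-value offset, which is one reason a dueling architecture is convenient for this kind of arbitration.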

HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

HiMAC：面向长期视野LLM代理的层级宏观微观学习

Authors: Hongbo Jin, Rongpeng Zhu, Jiayu Ding, Wenhao Zhang, Ge Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.00977
Pdf link: https://arxiv.org/pdf/2603.00977
Abstract Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
中文摘要 大型语言模型（LLM）代理最近展现出强大的交互式决策能力，但在需要结构化规划和可靠执行的长期任务中仍然存在根本限制。现有方法主要依赖于平面自回归策略，即在单一令牌序列中生成高层推理和低层动作，导致探索效率低下且错误在较长轨迹中严重传播。在本研究中，我们提出了HiMAC，一种层级的代理强化学习框架，明确将长期决策分解为宏观规划和微观执行。HiMAC将推理建模为结构化蓝图生成过程，随后执行目标条件动作，从而实现基于LLM的代理内稳健的长期规划。为了高效训练这一层级结构，我们引入了一种无批评的层级策略优化范式，通过层级相对优势估计将基于群体的强化学习扩展到双层结构。此外，我们提出了一种迭代共进化训练策略，在规划者探索和执行者适应之间交替进行，减轻了层级学习固有的非平稳性。在ALFWorld、WebShop和Sokoban上的大量实验表明，HiMAC在基于文本和视觉化的环境中，始终优于强提示和强化学习基线，实现了最先进的性能和显著提升的样本效率。我们的结果表明，引入结构化层级，而不仅仅是增加模型规模，是实现稳健长期智能的关键因素。

MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

MM-DeepResearch：一个简单有效的多模态代理搜索基线

Authors: Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01050
Pdf link: https://arxiv.org/pdf/2603.01050
Abstract We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at this https URL
中文摘要 我们旨在开发一款多模态研究代理，具备显式推理与规划、多工具调用及跨模态信息综合能力，使其能够执行深度研究任务。然而，我们观察到开发此类代理面临三大挑战：（1）搜索密集型多模态问答（QA）数据稀缺，（2）缺乏有效的搜索轨迹，（3）使用在线搜索API进行训练的成本高昂。为此，我们首先提出了Hyper-Search，这是一种基于超图的问答生成方法，能够在模态内和跨模态之间建模并连接视觉与文本节点，从而生成需要调用各种搜索工具才能解决的搜索密集型多模态问答对。其次，我们引入了DR-TTS，它首先根据搜索工具类型将涉及搜索的任务分解为多个类别，并为每种工具分别优化专门的搜索工具专家。然后它重新组合工具专家，通过树搜索共同探索搜索轨迹，生成利用各种搜索工具成功解决复杂任务的轨迹。第三，我们构建了一个支持多种搜索工具的离线搜索引擎，无需使用昂贵的在线搜索API即可进行代理强化学习。通过这三种设计，我们开发了MM-DeepResearch，一款强大的多模态深度研究代理，大量结果显示其在各基准测试中具有优势。代码可在此 https URL 获取

Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures

通过显式学习从故障中释放自动驾驶中的VLA潜力

Authors: Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, Fuxi Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01063
Pdf link: https://arxiv.org/pdf/2603.01063
Abstract Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to persistent failures in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause -- whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation, we propose VLA with Explicit Learning from Failures (ELF-VLA), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a Feedback-Guided Refinement. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public NAVSIM benchmark for overall PDMS, EPDMS score and high-level planning accuracy.
中文摘要 自动驾驶的视觉-语言-行动（VLA）模型在强化学习（RL）优化过程中常常遇到性能瓶颈。这种停滞源于先前监督式微调（SFT）限制了探索能力，导致长尾场景中持续失败。在这些关键情况下，所有探索的动作都得出零值驱动得分。这种信息稀少的奖励提示失败，却无法找出根本原因——无论是计划不当、推理有缺陷，还是轨迹执行不当。为解决这一限制，我们提出了带有显式失败学习（ELF-VLA）的框架，该框架通过结构化诊断反馈增强强化学习。我们的方法不再依赖模糊的标量奖励，而是生成详细且可解释的报告，识别具体的失效模式。VLA策略随后利用这些显式反馈生成反馈引导细化。通过将这些修正后的高回报样本注入强化学习训练批次，我们的方法提供了有针对性的梯度，使策略能够解决无引导探索无法解决的关键场景。大量实验表明，我们的方法释放了VLA模型的潜在能力，在公开的NAVSIM基准测试中实现了整体PDMS、EPDMS得分和高级规划准确性的顶尖（SOTA）性能。

How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

强化学习如何解锁几何交错推理中的顿悟时刻

Authors: Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01070
Pdf link: https://arxiv.org/pdf/2603.01070
Abstract Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
中文摘要 解决复杂的几何问题本质上需要交错推理：在构图与进行逻辑推理之间紧密交替进行。尽管近期的多模态大型语言模型（MLLM）在视觉生成和绘图方面展现出强大能力，但我们发现了一个反直觉且未被充分探索的现象。直接对交错的绘图-解答数据应用监督微调（SFT）会导致推理性能相比纯文本基线显著下降。我们认为，这一失败源于SFT的一个根本局限性，即它主要带来分布对齐：模型学会了重现交错绘图的表面格式，但未能内化生成图与推理步骤之间的因果依赖关系。为克服这一局限，我们提出了Faire（交错推理的功能对齐），这是一种强化学习框架，通过施加三个因果约束，从表面模仿转向功能对齐。大量实验表明，Faire会引发模型行为的质变，使绘图被有效地内化，从而在具有挑战性的几何推理基准测试中取得有竞争力的性能。

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

DIVA-GRPO：通过难度自适应变异优势提升多模态推理

Authors: Haowen Gao, Zhenyu Zhang, Liang Pang, Fangda Guo, Hongjian Dou, Guannan Lv, Shaoguo Liu, Tingting Gao, Huawei Shen, Xueqi Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01106
Pdf link: https://arxiv.org/pdf/2603.01106
Abstract Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: this https URL
中文摘要 强化学习（RL）结合群相对策略优化（GRPO）已成为广泛采用的增强多模态大型语言模型（MLLM）推理能力的方法。虽然GRPO支持无需批评者的长链推理，但其在困难问题上奖励稀少，且当群体层面奖励过于稳定，导致对过于简单或困难的问题优势消失。现有方案（样本扩展、选择性利用和间接奖励设计）常常无法保持组内奖励分布的足够方差，从而产生明确的优化信号。为此，我们提出了DIVA-GRPO，一种难度适应变异优势方法，从全局视角调整变异难度分布。DIVA-GRPO动态评估问题难度，采样适当难度等级的变体，并利用难度加权和归一化尺度计算局部和全局群体间的优势。这缓解了奖励稀疏和优势消失，同时提升了训练稳定性。在六个推理基准测试上的广泛实验表明，DIVA-GRPO在训练效率和推理表现方面优于现有方法。代码：这个 https URL
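
The "advantage vanishing" issue mentioned in the abstract can be illustrated with the usual group-relative advantage, where each sampled response's reward is standardised against its own group. The sketch below shows only that baseline computation, not DIVA-GRPO's difficulty-weighted variant; the function name and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every sample fails (too hard) or succeeds (too easy)
# give identical rewards, so every advantage collapses to zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed, all_fail, all_pass)
```

When every advantage in a group is zero, that prompt contributes no policy gradient, which is why methods in this line of work try to keep within-group reward variance away from zero.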

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

DeepResearch-9K：深度研究代理的挑战性基准数据集

Authors: Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01152
Pdf link: https://arxiv.org/pdf/2603.01152
Abstract Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale challenging dataset specifically designed for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework, DeepResearch-R1, that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule-based outcome reward and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K with DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset on this https URL and the code of DeepResearch-R1 on this https URL.
中文摘要 深度研究代理能够执行多步网页探索、定向检索和复杂的问答。尽管能力强大，深度研究代理仍面临两个关键瓶颈：（1）缺乏具有现实世界难度的大规模且具有挑战性的数据集，（2）缺乏可访问的开源数据综合和代理培训框架。为弥合这些空白，我们首先构建了DeepResearch-9K，这是一个大规模且具有挑战性的数据集，专为深度研究场景设计，基于开源多跳问答（QA）数据集，通过低成本自治流水线构建。值得注意的是，它包含（1）个9000个问题，涵盖从L1到L3的三个难度等级;（2）高质量的搜索轨迹，采用了Tongyi-DeepResearch-30B-A3B的推理链，这是一款先进的深度研究代理;以及（3）可验证的答案。此外，我们开发了一个开源训练框架DeepResearch-R1，支持（1）多回合网络交互，（2）不同的强化学习（RL）方法，以及（3）不同的奖励模型，如基于规则的结果奖励和LLM作为评判的反馈。最后，实证结果表明，在DeepResearch-9K下训练的代理在挑战性的深度研究基准上取得了最先进的成绩。我们将DeepResearch-9K数据集发布到该https URL，并将DeepResearch-R1的代码发布到该HTTPS网址。

BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling

BeautyGRPO：通过动态路径指导和细粒度偏好建模实现面部修饰的美学对齐

Authors: Jiachen Yang, Xianhui Lin, Yi Dong, Zebiao Zheng, Xing Liu, Hong Gu, Yanmei Fang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01163
Pdf link: https://arxiv.org/pdf/2603.01163
Abstract Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences.
中文摘要 面部修饰需要去除细微瑕疵，同时保留独特的面部特征，以提升整体美感。然而，现有方法存在一个根本性的权衡。对标记数据的监督学习仅限于像素级标签模仿，无法捕捉复杂的主观人类审美偏好。相反，在线强化学习（RL）在偏好对齐方面表现出色，但其随机探索范式与面部修饰的高保真需求相冲突，且常因积累的随机漂移而引入明显的噪声伪影。为解决这些局限性，我们提出了BeautyGRPO，一种强化学习框架，将面部修饰与人类审美偏好相结合。我们构建了FRPref-10K，这是一个涵盖五个关键修饰维度的细粒度偏好数据集，并训练了一个能够评估细微感知差异的专业奖励模型。为了平衡探索与保真度，我们引入了动态路径引导（DPG）。DPG通过动态计算基于锚点的常微分方程路径，并在每个采样时间步重新规划引导轨迹，稳定随机采样轨迹，有效纠正随机漂移，同时保持受控探索。大量实验表明，BeautyGRPO的表现优于专业面部修饰方法和通用图像编辑模型，实现了更优越的纹理质量、更精准的瑕疵去除，以及更符合人类审美偏好的整体效果。

PARWiS: Winner determination under shoestring budgets using active pairwise comparisons

PARWiS：在极限预算下利用主动两两比较进行赢家判定

Authors: Shailendra Bhandari
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2603.01171
Pdf link: https://arxiv.org/pdf/2603.01171
Abstract Determining a winner among a set of items using active pairwise comparisons under a limited budget is a challenging problem in preference-based learning. The goal of this study is to implement and evaluate the PARWiS algorithm, which uses spectral ranking and disruptive pair selection to identify the best item under shoestring budgets. This work extends PARWiS with a contextual variant (Contextual PARWiS) and a reinforcement learning-based variant (RL PARWiS), comparing them against baselines, including Double Thompson Sampling and a random selection strategy. The evaluation spans synthetic and real-world datasets (Jester and MovieLens), using budgets of 40, 60, and 80 comparisons for 20 items. Performance is measured through recovery fraction, true rank of reported winner, reported rank of true winner, and cumulative regret, alongside the separation metric \(\Delta_{1,2}\). Results show that PARWiS and RL PARWiS outperform baselines across all datasets, particularly in the Jester dataset with a higher \(\Delta_{1,2}\), while performance gaps narrow in the more challenging MovieLens dataset with a smaller \(\Delta_{1,2}\). Contextual PARWiS shows comparable performance to PARWiS, indicating that contextual features may require further tuning to provide significant benefits.
中文摘要 在有限预算下，利用主动的两两比较从一组项目中确定赢家，是基于偏好学习的一个具有挑战性的问题。本研究的目标是实现和评估PARWiS算法，该算法通过谱排名和破坏性配对选择，在极限预算下识别最佳项目。本研究扩展了PARWiS，新增了情境变体（Contextual PARWiS）和基于强化学习的变体（RL PARWiS），并将其与基线进行比较，包括双重汤普森抽样和随机选择策略。该评估涵盖合成数据集和现实数据集（Jester和MovieLens），对20个项目分别使用40、60和80次比较的预算。性能通过恢复分数、报告获胜者的真实排名、真实获胜者的报告排名和累计遗憾来衡量，同时结合分离指标 \(\Delta_{1,2}\)。结果显示，PARWiS 和 RL PARWiS 在所有数据集中都优于基线，尤其是在 \(\Delta_{1,2}\) 较高的 Jester 数据集中；而在更具挑战性、\(\Delta_{1,2}\) 较小的 MovieLens 数据集中，性能差距缩小。Contextual PARWiS的性能与PARWiS相当，表明上下文特征可能需要进一步调优以获得显著优势。
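
To give a sense of how tight the budgets in this abstract are, the short calculation below compares each budget with the number of distinct pairs among 20 items. It is plain arithmetic on the figures quoted above; the variable names are illustrative, and the bound assumes no pair is ever queried twice.

```python
from math import comb

n_items = 20
total_pairs = comb(n_items, 2)  # number of distinct unordered pairs
budgets = [40, 60, 80]

# Upper bound on the fraction of distinct pairs each budget can touch.
coverage = {b: b / total_pairs for b in budgets}

for b, frac in coverage.items():
    print(f"budget {b}: at most {frac:.1%} of {total_pairs} distinct pairs")
```

Even the largest budget can query fewer than half of the possible pairs once, which is why the choice of which pair to compare next matters in this setting.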

Reasoning Boosts Opinion Alignment in LLMs

推理提升了大型语言模型中的观点一致性

Authors: Frédéric Berdoz, Yann Billeter, Yann Vonlanthen, Roger Wattenhofer
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01214
Pdf link: https://arxiv.org/pdf/2603.01214
Abstract Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.
中文摘要 意见建模旨在捕捉个人或群体的政治偏好，从而使数字民主等应用成为可能，模型有助于塑造更公平、更受欢迎的政策。鉴于其多功能性、强大的泛化能力以及在多种文本对文本应用中已取得的成功，大型语言模型（LLMs）是这项任务的自然候选。然而，由于其统计性质和有限的因果理解，当被天真地引导时，往往会产生带有偏见的观点。本研究中，我们研究推理是否能改善意见一致性。受强化学习（RL）推动的数学推理最新进展激励，我们训练模型通过结构化推理产生符合谱面的答案。我们基于三个涵盖美国、欧洲和瑞士政治的数据集来评估我们的方法。结果显示，推理能增强意见建模，并且与强有力的基线竞争，但并未完全消除偏见，这凸显了利用大型语言模型构建忠实政治数字孪生所需的额外机制。通过发布我们的方法和数据集，我们为未来关于LLM意见一致性的研究奠定了坚实基线。

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

认识收益，偶然成本：数学推理中多智能体辩论中的不确定性分解

Authors: Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.01221
Pdf link: https://arxiv.org/pdf/2603.01221
Abstract Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by a substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.
中文摘要 多智能体辩论（MAD）在利用集体智能提升推理能力和减少幻觉方面展现出潜力，但信息交换如何塑造潜在能力仍不明确。从经验角度看，MAD表现出悖论现象，如准确率提升伴随符号熵显著增加，以及同质与异质模型组合间显著的差异。本文提出了MAD的贝叶斯不确定性分析框架，将总预测不确定性分解为可通过辩论语境约简的认识论不确定性和由内部模型噪声诱发的偶然不确定性。在多种模型配置中，我们发现有效辩论的关键在于在受控的偶然成本下实现高的认知收益。基于这一见解，我们设计了一种不确定性引导的多智能体强化学习（MARL）算法，明确优化偶然噪声减少和认知信息利用。实验显示，我们的训练显著提升了辩论后的准确性和稳定性，并增强了超越单一主体强化学习的个体推理能力，为理解和改善MAD提供了统一的贝叶斯不确定性视角。

Learn Hard Problems During RL with Reference Guided Fine-tuning

在强化学习中学习难题，参考引导的微调

Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01223
Pdf link: https://arxiv.org/pdf/2603.01223
Abstract Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: for challenging problems, the LLM fails to sample any correct trajectories, preventing RL from receiving meaningful positive feedback. At the same time, human-written reference solutions often exist alongside the problem (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that utilizes human-written reference solutions to synthesize positive trajectories on hard problems and train on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
中文摘要 用于数学推理的强化学习（RL）可能存在奖励稀疏性：对于具有挑战性的问题，LLM未能采样任何正确的轨迹，阻碍了强化学习获得有意义的正反馈。同时，问题中常常存在人类编写的参考解（例如AoPS中的问题），但直接对这些解进行微调无益，因为模型往往无法模仿超出自身推理分布的人类证明。我们介绍了参考引导微调（ReGFT），这是一种简单有效的方法，利用人类编写的参考解来综合难题的正轨迹，并在强化学习前进行训练。对于每个问题，我们为模型提供部分参考解，并让其生成自己的推理轨迹，确保最终轨迹保持在模型推理空间内，同时仍能受益于参考指导。对这些参考引导轨迹进行微调，可以增加可解问题的数量，并产生一个在强化学习中获得更多积极回报的检查点。在三个基准测试（AIME24、AIME25、BeyondAIME）中，ReGFT持续提升监督准确率，加快DAPO培训进程，并提升了强化学习的最终性能平台。我们的结果表明，ReGFT有效克服了奖励稀疏性，并激活了更强的基于强化学习的数学推理能力。

Can Thinking Models Think to Detect Hateful Memes?

思考型模型能识别仇恨表情包吗？

Authors: Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01225
Pdf link: https://arxiv.org/pdf/2603.01225
Abstract Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
中文摘要 仇恨表情包通常需要构图多模态推理：图像和文字单独看似无害，但它们的相互作用却传递出有害意图。尽管基于思维的多模态大型语言模型（MLLM）近年来推动了视觉语言理解，但其能力在仇恨表情包分析方面仍未被充分探索。我们提出了一种基于强化学习的后培训框架，通过任务特定奖励和新的群体相对策略优化（GRPO）目标，提升基于思维的MLLMs的推理能力。具体来说，我们（i）对现成的MLLM进行仇恨模因理解的系统实证研究，（ii）通过提炼生成弱或伪监督的思维链逻辑来扩展现有仇恨模因数据集，（iii）引入基于GRPO的目标，联合优化模因分类和解释质量，以鼓励细粒度、逐步推理。在Hateful Memes基准测试上的实验显示，我们的方法实现了最先进的性能，准确率和F1提升约1%，解释质量提升约3%。我们将公开代码、数据集扩展和评估资源，以支持可重复性。

Towards Policy-Adaptive Image Guardrail: Benchmark and Method

迈向政策适应性形象护栏：基准与方法

Authors: Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01228
Pdf link: https://arxiv.org/pdf/2603.01228
Abstract Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time. However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with SafeEditBench, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe-unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce SafeGuard-VL, a reinforcement learning-based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.
中文摘要 准确拒绝敏感或有害的视觉内容，即有害的图像护栏，在许多应用场景中至关重要。这项任务必须不断适应不同领域和时间内不断演变的安全政策和内容。然而，传统分类器仅限于固定类别，在引入新政策时需要频繁的再培训。视觉语言模型（VLMs）为动态安全护栏提供了更具适应性和通用性的基础。尽管有此潜力，现有基于VLM的保护方法通常仅在固定安全政策下进行培训和评估。我们发现这些模型对可见策略过于过拟合，无法推广到看不见策略，甚至丧失了基本的指令跟随能力和一般知识。为解决这个问题，本文提出了两个关键贡献。首先，我们用SafeEditBench（一个新的评估套件）对现有VLM的跨策略泛化性能进行了基准测试。SafeEditBench利用图像编辑模型将不安全的图像转换为安全的对应物，生成符合策略的数据集，使每个安全与不安全的图像对在视觉上保持相似，唯独局部区域违反了特定安全规则。人工标注者随后根据五个不同策略提供准确的安全/不安全标签，从而实现对策略感知概括的细致评估。其次，我们介绍了SafeGuard-VL，这是一种基于强化学习的方法，带有可验证奖励（RLVR），用于强健的不安全图像防护栏。SafeGuard-VL不再仅依赖固定策略下的监督微调（SFT），而是明确优化模型，提供基于策略的奖励，促进跨策略的可验证适应性。经过大量实验，验证了我们方法在不同政策中对不安全图像护栏的有效性。

MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

MOSAIC：一个统一平台，用于跨范式比较与评估同质与异构多智能体强化学习、大型语言模型、VLM及人类决策者

Authors: Abdulhamid M. Mousa, Yu Fu, Rakhmonberdi Khajiev, Jalaledin M. Azzabi, Abdulkarim M. Mousa, Peng Yang, Yunusa Haruna, Ming Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01260
Pdf link: https://arxiv.org/pdf/2603.01260
Abstract Reinforcement learning (RL), large language models (LLMs), and vision-language models (VLMs) have been widely studied in isolation. However, existing infrastructure lacks the ability to deploy agents from different decision-making paradigms within the same environment, making it difficult to study them in hybrid multi-agent settings or to compare their behaviour fairly under identical conditions. We present MOSAIC, an open-source platform that bridges this gap by incorporating a diverse set of existing reinforcement learning environments and enabling heterogeneous agents (RL policies, LLMs, VLMs, and human players) to operate within them in ad-hoc team settings with reproducible results. MOSAIC introduces three contributions. (i) An IPC-based worker protocol that wraps both native and third-party frameworks as isolated subprocess workers, each executing its native training and inference logic unmodified, communicating through a versioned inter-process protocol. (ii) An operator abstraction that forms an agent-level interface by mapping workers to agents: each operator, regardless of whether it is backed by an RL policy, an LLM, or a human, conforms to a minimal unified interface. (iii) A deterministic cross-paradigm evaluation framework offering two complementary modes: a manual mode that advances up to N concurrent operators in lock-step under shared seeds for fine-grained visual inspection of behavioural differences, and a script mode that drives automated, long-running evaluation through declarative Python scripts, for reproducible experiments. We release MOSAIC as an open, visual-first platform to facilitate reproducible cross-paradigm research across the RL, LLM, and human-in-the-loop communities.
中文摘要 强化学习（RL）、大型语言模型（LLM）和视觉语言模型（VLM）已被广泛单独研究。然而，现有基础设施无法在同一环境中部署来自不同决策范式的代理，这使得在混合多代理环境中研究它们或在相同条件下公平比较其行为变得困难。我们介绍MOSAIC这一开源平台，它通过整合多样化的现有强化学习环境，使异构代理（强化学习策略、LLM、VLM和人类玩家）能够在其中以临时团队的形式运行，并得到可重复的结果。MOSAIC 有三项贡献。（i）基于IPC的工作者协议，将原生和第三方框架封装为相互隔离的子进程工作者，每个工作者不加修改地执行其原生训练和推理逻辑，并通过版本化的进程间协议进行通信。（ii）操作者（operator）抽象，通过将工作者映射到代理来形成代理级接口：每个操作者，无论其背后是强化学习策略、LLM还是人类，都遵循最小的统一接口。（iii）一种确定性的跨范式评估框架，提供两种互补模式：手动模式，在共享种子下同步推进最多N个并发操作者，用于对行为差异进行细粒度的可视化检查；以及脚本模式，通过声明式Python脚本驱动自动化、长时间运行的评估，用于可重复的实验。我们将MOSAIC作为一个开放、视觉优先的平台发布，以促进强化学习、LLM和人在回路（human-in-the-loop）社区之间可重复的跨范式研究。

Beyond Reward: A Bounded Measure of Agent Environment Coupling

超越奖励：代理与环境耦合的有界衡量

Authors: Wael Hafez, Cameron Reid, Amit Nazeri
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01283
Pdf link: https://arxiv.org/pdf/2603.01283
Abstract Real-world reinforcement learning (RL) agents operate in closed-loop systems where actions shape future observations, making reliable deployment under distribution shifts a persistent challenge. Existing monitoring relies on reward or task metrics, capturing outcomes but missing early coupling failures. We introduce bipredictability (P), the ratio of shared information in the observation-action-outcome loop to the total available information: a principled, real-time measure of interaction effectiveness with provable bounds that is comparable across tasks. An auxiliary monitor, the Information Digital Twin (IDT), computes P and its diagnostic components from the interaction stream. We evaluate SAC and PPO agents on MuJoCo HalfCheetah under eight agent- and environment-side perturbations across 168 trials. Under nominal operation, agents exhibit P = 0.33 ± 0.02, below the classical bound of 0.5, revealing an informational cost of action selection. The IDT detects 89.3% of perturbations versus 44.0% for reward-based monitoring, with 4.4x lower median latency. Bipredictability enables early detection of interaction degradation before performance drops and provides a prerequisite signal for closed-loop self-regulation in deployed RL systems.
中文摘要 现实世界中的强化学习（RL）智能体运行于闭环系统中，动作会塑造未来的观测，这使得在分布偏移下的可靠部署成为持续的挑战。现有的监测依赖奖励或任务指标，能捕捉结果，却会遗漏早期的耦合失效。我们引入双可预测性（P），即观测、动作、结果循环中的共享信息与总可用信息之比，这是一种有原则的、实时的交互有效性度量，具有可证明的界限，且可跨任务比较。辅助监视器“信息数字孪生”（IDT）从交互流中计算P及其诊断分量。我们在MuJoCo HalfCheetah上评估了SAC和PPO智能体，涵盖8种智能体侧和环境侧扰动，共168次试验。在名义运行条件下，智能体表现出P = 0.33 ± 0.02，低于经典界限0.5，揭示了动作选择的信息成本。IDT检测到89.3%的扰动，而基于奖励的监测仅为44.0%，且中位数延迟低4.4倍。双可预测性能够在性能下降之前及早检测到交互退化，并为已部署强化学习系统的闭环自调节提供前提信号。

Integrating LTL Constraints into PPO for Safe Reinforcement Learning

将LTL约束整合进PPO以实现安全强化学习

Authors: Maifang Zhang, Hang Yu, Qian Zuo, Cheng Wang, Vaishak Belle, Fengxiang He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01292
Pdf link: https://arxiv.org/pdf/2603.01292
Abstract This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit-deterministic Büchi automata, and then translated by a logic-to-cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO-LTL can consistently reduce safety violations, while maintaining competitive performance, against the state-of-the-art methods. The code is at this https URL.
中文摘要 本文提出了带线性时序逻辑约束的近端策略优化（PPO-LTL），该框架将以LTL编写的安全约束集成到PPO中，以实现安全强化学习。LTL约束能够严谨地表示复杂的安全要求（例如机器人领域中广泛存在的法规），从而实现对安全要求的系统化监控。对LTL约束的违反由极限确定性Büchi自动机监控，然后通过逻辑到代价（logic-to-cost）机制转换为惩罚信号。这些信号进一步通过拉格朗日方案用于指导策略优化。在Zones和CARLA环境上的大量实验表明，与最先进的方法相比，我们的PPO-LTL能够持续减少安全违规，同时保持有竞争力的性能。代码见此 https URL。
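
The abstract above says penalty signals guide policy optimization "via the Lagrangian scheme" but gives no update rule. The following is a minimal sketch of a generic Lagrangian constraint handler (dual ascent on a multiplier), not the authors' implementation. The names `update_multiplier`, `penalized_return`, `cost_limit`, and `lr` are illustrative assumptions.

```python
def update_multiplier(lmbda, avg_cost, cost_limit, lr=0.1):
    """Dual ascent step: raise the multiplier when the average violation
    cost exceeds the limit, lower it otherwise, and keep it non-negative.
    (Generic scheme; all parameter names are hypothetical.)"""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


def penalized_return(task_return, cost, lmbda):
    """Objective seen by the policy: task return minus weighted violation cost."""
    return task_return - lmbda * cost


# The multiplier grows while the constraint is violated, then decays to zero.
lmbda = 0.0
lmbda_after_violation = update_multiplier(lmbda, avg_cost=2.0, cost_limit=1.0, lr=0.5)
lmbda_after_recovery = update_multiplier(lmbda_after_violation, avg_cost=0.0, cost_limit=1.0, lr=0.5)
```

In a scheme of this kind the policy is optimized on the penalized objective while the multiplier is adapted between policy updates, so the penalty weight does not need to be hand-tuned.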

Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

关于训练前后推理模型中数据质量与协同效应的理论视角

Authors: Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.01293
Pdf link: https://arxiv.org/pdf/2603.01293
Abstract Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: $(i)$ balanced pretraining data can induce latent capabilities later activated during post-training, and $(ii)$ SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
中文摘要 大型语言模型（LLM）在海量数据集上进行预训练，随后通过监督微调（SFT）或强化学习（RL）进行指令微调。最佳实践强调大规模、多样化的预训练数据，而后训练则有所不同：SFT依赖规模较小的高质量数据集，而强化学习更受益于规模，更大量的反馈往往比标签质量更重要。然而，为什么预训练和强化学习需要大型数据集，为什么SFT在较小数据集上表现出色，以及什么样的SFT数据才算高质量，目前仍不清楚。在本研究中，我们从理论上分析了在线性回归的上下文权重预测任务上训练的Transformer。我们的分析揭示了几个关键发现：$(i)$ 均衡的预训练数据可以诱导出潜在能力，这些能力随后在后训练阶段被激活；$(ii)$ SFT从少量对预训练模型具有挑战性的样本中学习效果最佳，而过大的SFT数据集可能会稀释有信息量的预训练信号。相比之下，强化学习在对预训练模型而言不过于困难的大规模数据上最为有效。我们通过在大型非线性Transformer架构上的实验验证了这些理论见解。

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

强化学习什么时候能帮助医疗VLM？理清视觉、SFT和RL的收益

Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01301
Pdf link: https://arxiv.org/pdf/2603.01301
Abstract Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
中文摘要 强化学习（RL）越来越多地被用于医学视觉语言模型（VLM）的后期训练，但目前尚不清楚强化学习是提升了医学视觉推理，还是主要提升了已被监督微调（SFT）引发的行为。我们提出了一项对照研究，将这些效应沿着视觉、SFT和强化学习三个方面进行区分。以MedMNIST为多模态测试平台，我们通过对VLM视觉塔与仅视觉基线进行基准对比，量化推理支持和采样效率（Accuracy@1与Pass@K），并评估强化学习何时弥合支持差距及增益如何跨模态转移。我们发现，当模型已有非平凡支持（高Pass@K）时，RL最有效：它主要提升输出分布，改善Acc@1和采样效率，而SFT则扩展支持，使RL更有效。基于这些发现，我们提出了一种边界感知配方，并通过强化学习在PMC多项选择VQA的一小部分平衡子集上后训练OctoMed初始化模型实现，在六项医疗VQA基准中取得了强劲的平均表现。
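
The abstract above contrasts Accuracy@1 with Pass@K to separate "support" (can the model produce the right answer at all) from sampling efficiency (does it produce it first). A minimal sketch of the two metrics follows, assuming each question comes with a list of per-sample correctness flags. The function names and the toy data are illustrative, not taken from the paper.

```python
def accuracy_at_1(samples_correct):
    """Fraction of questions whose first sampled answer is correct."""
    return sum(1 for s in samples_correct if s[0]) / len(samples_correct)


def pass_at_k(samples_correct, k):
    """Fraction of questions with at least one correct answer
    among the first k samples (empirical Pass@K)."""
    return sum(1 for s in samples_correct if any(s[:k])) / len(samples_correct)


# Toy data: 4 questions, 4 sampled answers each (True = correct).
results = [
    [True, False, True, True],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, False, False],
]
acc1 = accuracy_at_1(results)
pass4 = pass_at_k(results, 4)
```

On this toy data Pass@4 is well above Acc@1, which is the regime the abstract describes as having non-trivial support: correct answers exist in the output distribution but are not yet sampled first.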

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

混合TD3：混合行动空间的过度估计偏差分析与稳定策略优化

Authors: Thanh-Tuan Tran, Thanh Nguyen Canh, Nak Young Chong, Xiem HoangVan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01302
Pdf link: https://arxiv.org/pdf/2603.01302
Abstract Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
中文摘要 离散-连续混合动作空间中的强化学习给机器人操作带来了根本性挑战，因为高层任务决策和低层关节空间执行必须联合优化。现有方法要么将连续分量离散化，要么将离散选择松弛为连续近似，这在高维动作空间和域随机化条件下存在可扩展性限制和训练不稳定的问题。本文提出了Hybrid TD3，它是双延迟深度确定性策略梯度（TD3）的扩展，能够以有原则的方式原生处理参数化的混合动作空间。我们对混合动作场景中的高估偏差进行了严谨的理论分析，推导出双评论家（twin-critic）架构下的形式化界限，并建立了五种算法变体之间完整的偏差排序。基于该分析，我们引入了一个加权截断Q学习目标，该目标对离散动作分布进行边缘化，在实现与标准截断最小化等效的偏差削减的同时提升了策略平滑性。实验结果表明，与最先进的混合动作基线相比，Hybrid TD3具有更优的训练稳定性和有竞争力的性能。
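
The abstract above describes a weighted clipped Q-learning target that marginalizes over the discrete action distribution, without stating the formula. One plausible reading, sketched here as an assumption rather than the paper's definition, takes the element-wise minimum of the two critics for every discrete action and averages it under the policy's discrete-action probabilities. All names below are hypothetical.

```python
import numpy as np


def weighted_clipped_target(reward, done, gamma, probs, q1_next, q2_next):
    """Assumed form: y = r + gamma * (1 - done) * sum_k p_k * min(Q1_k, Q2_k).

    probs    -- policy probabilities over the K discrete actions at s'
    q1_next  -- critic 1 values at s', one per discrete action
    q2_next  -- critic 2 values at s', one per discrete action
    """
    clipped = np.minimum(q1_next, q2_next)      # twin-critic clipping per action
    expected = float(np.dot(probs, clipped))    # marginalize over discrete actions
    return reward + gamma * (1.0 - done) * expected


probs = np.array([0.5, 0.5])
q1 = np.array([1.0, 3.0])
q2 = np.array([2.0, 2.0])
target = weighted_clipped_target(reward=1.0, done=0.0, gamma=0.9,
                                 probs=probs, q1_next=q1, q2_next=q2)
```

Compared with taking the minimum only at the single greedy discrete action, an expectation of this kind changes smoothly as the discrete-action probabilities shift, which is consistent with the abstract's claim about policy smoothness.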

Energy Efficient Traffic Scheduling For Optical LEO Satellite Downlinks

光学LEO卫星下行链路的节能流量调度

Authors: Ethan Fettes, Pablo G. Madoery, Halim Yanikomeroglu, Gunes Karabulut Kurt, Abhishek Naik, Stéphane Martel
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.01334
Pdf link: https://arxiv.org/pdf/2603.01334
Abstract In recent years, the number of satellites in orbit has increased rapidly, with megaconstellations like Starlink providing near-global, delay-sensitive communication services. However, not all satellite communication use cases have stringent delay requirements; services such as Earth observation (EO) and remote Internet of Things (IoT) fall into this category. These relaxed delay quality of service (QoS) objectives allow services to be delivered using sparse constellations, enabled by delay-tolerant networking protocols. In the context of rapidly growing data volumes that must be delivered through satellite networks, a key challenge is having sufficient space-to-ground link capacity. This has led to proposals for using free-space optical (FSO) communications, which offer high data rates. However, FSO communications are highly vulnerable to weather-related disruptions. This results in certain communication opportunities being energy inefficient. Given the energy-constrained nature of satellites, developing schemes to improve energy efficiency is highly desirable. In this work, both static and adaptive schemes were developed to balance maintaining the delivery ratio and maximizing energy efficiency. The proposed schemes fall into the following categories: threshold schemes, heuristic sorting algorithms, and reinforcement learning-based schemes. The schemes were evaluated under a variety of different data volumes and cloud cover distribution configurations as well as a case study using historical weather data. It was found that static schemes suffered from low delivery ratio performance under dynamic conditions when compared to adaptive techniques. However, this performance improvement came at the cost of increased complexity and onboard computations.
中文摘要 近年来，在轨卫星数量迅速增加，Starlink等巨型星座提供近乎覆盖全球的延迟敏感型通信服务。然而，并非所有卫星通信用例都有严格的延迟要求；地球观测（EO）和远程物联网（IoT）等服务就属于这一类别。这些较宽松的延迟服务质量（QoS）目标使得服务可以借助容迟网络协议，通过稀疏星座来交付。在必须经由卫星网络传输的数据量快速增长的背景下，一个关键挑战是拥有足够的星地链路容量。这促使人们提出使用能提供高数据速率的自由空间光（FSO）通信。然而，FSO通信极易受到天气相关中断的影响，这导致某些通信机会的能量效率低下。鉴于卫星能量受限的特性，开发提升能量效率的方案十分必要。在这项工作中，开发了静态和自适应两类方案，以在维持交付率与最大化能量效率之间取得平衡。所提出的方案分为以下几类：阈值方案、启发式排序算法和基于强化学习的方案。这些方案在多种不同的数据量和云量分布配置下进行了评估，并使用历史气象数据进行了案例研究。研究发现，与自适应技术相比，静态方案在动态条件下的交付率较低。然而，这种性能提升是以复杂度和星上计算量增加为代价的。

SubstratumGraphEnv: Reinforcement Learning Environment (RLE) for Modeling System Attack Paths

SubstratumGraphEnv：用于建模系统攻击路径的强化学习环境（RLE）

Authors: Bahirah Adewunmi, Edward Raff, Sanjay Purushotham
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01340
Pdf link: https://arxiv.org/pdf/2603.01340
Abstract Automating network security analysis, particularly the identification of potential attack paths, presents significant challenges, due in part to the sequential, interconnected, and evolutionary nature of system events, which most artificial intelligence (AI) techniques struggle to model effectively. This paper proposes a Reinforcement Learning (RL) environment generation framework that simulates the sequence of processes executed on a Windows operating system, enabling dynamic modeling of malicious processes on a system. This methodology models operating system state and transitions using a graph representation. This graph is derived from open-source System Monitor (Sysmon) logs. To address the variety in system event types, fields, and log formats, a mechanism was developed to capture and model parent-child processes from Sysmon logs. A Gymnasium environment (SubstratumGraphEnv) was constructed to establish the perceptible basis for an RL environment, and a customized PyTorch interface was also built (SubstratumBridge) to translate Gymnasium graphs into Deep Reinforcement Learning (DRL) observations and discrete actions. Graph Convolutional Networks (GCNs) concretize the graph's local and global state, which feed the distinct policy and critic heads of an Advantage Actor-Critic (A2C) model. This work's central contribution lies in the design of a novel deep graphical RL environment that automates translation of sequential user and system events, furnishing crucial context for cybersecurity analysis. This work provides a foundation for future research into shaping training parameters and advanced reward shaping, while also offering insight into which system event attributes are critical to training autonomous RL agents.
中文摘要 自动化网络安全分析，尤其是潜在攻击路径的识别，面临重大挑战，部分原因是系统事件具有顺序性、相互关联性和演化性，大多数人工智能（AI）技术难以对其有效建模。本文提出了一个强化学习（RL）环境生成框架，模拟在Windows操作系统上执行的进程序列，从而实现对系统上恶意进程的动态建模。该方法使用图表示来建模操作系统的状态和转换。该图源自开源的System Monitor（Sysmon）日志。为了应对系统事件类型、字段和日志格式的多样性，开发了一种机制，用于从Sysmon日志中捕获父子进程并对其建模。构建了一个Gymnasium环境（SubstratumGraphEnv），为强化学习环境建立可感知的基础，同时还构建了定制的PyTorch接口（SubstratumBridge），将Gymnasium图转换为深度强化学习（DRL）的观测和离散动作。图卷积网络（GCN）对图的局部和全局状态进行具体表示，并将其输入优势演员-评论家（A2C）模型中相互独立的策略头和评论家头。这项工作的核心贡献在于设计了一个新颖的深度图强化学习环境，该环境能够自动转换顺序的用户和系统事件，为网络安全分析提供关键的上下文。这项工作为未来关于训练参数调整和高级奖励塑形的研究奠定了基础，同时也揭示了哪些系统事件属性对训练自主强化学习智能体至关重要。

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

MIST-RL：通过强化学习实现基于变异的增量套件测试

Authors: Sicheng Zhu, Jiajun Wang, Jiawei Ai, Xin Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.01409
Pdf link: https://arxiv.org/pdf/2603.01409
Abstract Large Language Models (LLMs) often fail to generate correct code on the first attempt, which requires using generated unit tests as verifiers to validate the solutions. Despite the success of recent verification methods, they remain constrained by a "scaling-by-quantity" paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to "scaling-by-utility". We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines. It achieves a 28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, improving downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
中文摘要 大型语言模型（LLM）常常无法在第一次尝试时生成正确的代码，因此需要使用生成的单元测试作为验证器来验证解答。尽管近期的验证方法取得了成功，但它们仍受制于“以数量求扩展”（scaling-by-quantity）的范式。这种蛮力方法存在一个关键局限：它在故障检测上收益递减，同时导致严重的测试冗余。为此，我们提出了MIST-RL（基于变异的强化学习增量测试套件生成）框架，将重点转向“以效用求扩展”（scaling-by-utility）。我们将测试生成表述为一个通过组相对策略优化（GRPO）进行优化的序列决策过程。具体来说，我们引入了一种新颖的增量变异奖励，并结合动态惩罚，激励模型发现新的缺陷，同时抑制功能等价的断言。在HumanEval+和MBPP+上的实验表明，MIST-RL优于最先进的基线：变异得分提高了28.5%，同时测试用例数量减少了19.3%。此外，我们证明这些紧凑且高效用的测试是更优的验证器，在使用10个候选样本时，使HumanEval+上的下游代码重排序准确率比SOTA基线提高了3.05%。源代码和数据在补充材料中提供。
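
The abstract above describes an incremental mutation reward that pays for newly detected faults and penalizes redundant assertions, but gives no formula. The sketch below is one simple way to express that idea and is not the paper's reward: each test is credited with the number of mutants it kills that no earlier test killed, and a fixed penalty (standing in for the paper's dynamic penalties) applies when it kills nothing new. All names are hypothetical.

```python
def incremental_rewards(kill_sets, redundancy_penalty=0.5):
    """Score tests in generation order.

    kill_sets -- one set of killed-mutant ids per generated test
    Returns (per-test rewards, set of all mutants killed by the suite).
    """
    killed = set()
    rewards = []
    for kills in kill_sets:
        new = set(kills) - killed          # mutants not killed by earlier tests
        if new:
            rewards.append(float(len(new)))
        else:
            rewards.append(-redundancy_penalty)  # redundant test adds no coverage
        killed |= new
    return rewards, killed


# Test 2 only re-kills mutant 2, so it is penalized; test 3 finds a new mutant.
rewards, killed = incremental_rewards([{1, 2}, {2}, {3}])
```

Because the credit depends on what earlier tests already killed, a reward of this shape favors a small suite of complementary tests over many overlapping ones, which matches the "scaling-by-utility" framing.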

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

确保底线与提升天花板：基于合并的多模态搜索代理范式

Authors: Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01416
Pdf link: https://arxiv.org/pdf/2603.01416
Abstract Recent advances in Vision-Language Models (VLMs) have motivated the development of multi-modal search agents that can actively invoke external search tools and integrate retrieved evidence through multi-step reasoning. While promising, existing approaches typically rely on large-scale supervised trajectories or expensive reinforcement learning (RL), leading to high training cost, instability, and a severe cold-start problem for standard VLMs. We propose a training-free paradigm to empower VLMs with autonomous search capabilities via cross-modal model merging. By fusing a text-based search agent with a base VLM, we show that multi-modal search capabilities can be effectively composed without any additional multi-modal training data. To mitigate parameter interference during cross-modal integration, we introduce Optimal Brain Merging (OBM), a saliency-aware merging algorithm that identifies task-critical parameters based on their impact on model loss using only a small set of calibration samples. Extensive experiments on search-intensive benchmarks (e.g., InfoSeek, MMSearch) reveal that: (1) Model merging secures a reasonable performance floor as a zero-shot agent, with OBM achieving superior search rates; (2) OBM significantly raises the performance ceiling as a warm-start strategy, achieving faster convergence and higher peak accuracy than standard VLM initialization.
中文摘要 视觉语言模型（VLM）的最新进展推动了多模态搜索智能体的发展，这些智能体能够主动调用外部搜索工具，并通过多步推理整合检索到的证据。虽然前景可观，但现有方法通常依赖大规模监督轨迹或昂贵的强化学习（RL），导致训练成本高、不稳定，并使标准VLM面临严重的冷启动问题。我们提出了一种免训练范式，通过跨模态模型合并赋予VLM自主搜索能力。通过将基于文本的搜索智能体与基础VLM融合，我们表明无需任何额外的多模态训练数据即可有效组合出多模态搜索能力。为减轻跨模态集成中的参数干扰，我们引入了最优脑合并（OBM），这是一种显著性感知的合并算法，仅使用少量校准样本，根据参数对模型损失的影响来识别任务关键参数。在搜索密集型基准（如InfoSeek、MMSearch）上的大量实验表明：（1）模型合并作为零样本智能体能保证合理的性能下限，其中OBM取得了更优的搜索率；（2）OBM作为热启动策略显著提高了性能上限，比标准VLM初始化收敛更快、峰值准确率更高。

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

扩展任务，而非样本：通过多任务基于模型的强化学习掌握人形机器人控制

Authors: Shaohuai Liu, Weirui Ye, Yilun Du, Le Xie
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01452
Pdf link: https://arxiv.org/pdf/2603.01452
Abstract Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at this https URL.
中文摘要 开发能够掌握多样技能的通用机器人仍然是具身人工智能的核心挑战。虽然近期进展强调扩大模型参数和离线数据集的规模，但这类方法在机器人领域受到限制，因为学习需要主动交互。我们认为，有效的在线学习应当扩展任务数量，而不是每个任务的样本数量。这一设定揭示了基于模型的强化学习（MBRL）的结构性优势。由于物理动力学在任务间保持不变，共享的世界模型可以聚合多任务经验，学习鲁棒且与任务无关的表示。相比之下，当不同任务在相似状态下要求相互冲突的动作时，无模型方法会受到梯度干扰。因此，任务多样性对MBRL起到正则化作用，改善了动力学学习和样本效率。我们用EfficientZero-Multitask（EZ-M）实现了这一思想，这是一种用于在线学习的样本高效的多任务MBRL算法。在具有挑战性的全身控制基准HumanoidBench上的评估表明，EZ-M取得了最先进的性能，样本效率显著高于强基线，且无需极端的参数扩展。这些结果确立了任务扩展是可扩展机器人学习的一个关键维度。项目网站见此 https URL。

ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning

ProtRLSearch：一种多轮多模态蛋白质搜索代理，采用强化学习训练的大型语言模型

Authors: Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01464
Pdf link: https://arxiv.org/pdf/2603.01464
Abstract Protein analysis tasks arising in healthcare settings often require accurate reasoning under protein sequence constraints, involving tasks such as functional interpretation of disease-related variants, protein-level analysis for clinical research, and similar scenarios. To address such tasks, search agents are introduced to search protein-related information, providing support for disease-related variant analysis and protein function reasoning in protein-centric inference. However, such search agents are mostly limited to single-round, text-only modality search, which prevents the protein sequence modality from being incorporated as a multimodal input into the search decision-making process. Meanwhile, their reliance on reinforcement learning (RL) supervision that focuses solely on the final answer results in a lack of search process constraints, making deviations in keyword selection and reasoning directions difficult to identify and correct in a timely manner. To address these limitations, we propose ProtRLSearch, a multi-round protein search agent trained with multi-dimensional reward based RL, which jointly leverages protein sequence and text as multimodal inputs during real-time search to produce high quality reports. To evaluate the ability of models to integrate protein sequence information and text-based multimodal inputs in realistic protein query settings, we construct ProtMCQs, a benchmark of 3,000 multiple choice questions (MCQs) organized into three difficulty levels. The benchmark evaluates protein query tasks that range from sequence constrained reasoning about protein function and phenotype changes to comprehensive protein reasoning that integrates multi-dimensional sequence features with signal pathways and regulatory networks.
中文摘要 医疗环境中出现的蛋白质分析任务通常需要在蛋白质序列约束下进行准确推理，包括疾病相关变异的功能解释、临床研究中的蛋白质水平分析等任务。为应对此类任务，引入搜索代理用于搜索蛋白质相关信息，支持疾病相关变异分析和蛋白质功能推理，进行蛋白质中心推断。然而，这类搜索代理大多限于单轮、仅文本的模态搜索，这阻止了蛋白质序列模态作为多模态输入被纳入搜索决策过程。与此同时，他们依赖专注于最终答案的强化学习（RL）监督，导致缺乏搜索过程的约束，使得关键词选择和推理方向的偏差难以及时发现和纠正。为解决这些局限性，我们提出了ProtRLSearch，一款多轮蛋白质搜索代理，采用基于多维奖励的强化学习训练，结合蛋白质序列和文本作为多模态输入，实时搜索生成高质量报告。为了评估模型在现实蛋白质查询环境中整合蛋白质序列信息和基于文本的多模态输入的能力，我们构建了ProtMCQs，这是一个包含3000道选择题（MCQ）的基准测试，这些题目分为三个难度等级。基准测试评估蛋白质查询任务，涵盖从关于蛋白质功能和表型变化的序列约束推理，到整合多维序列特征与信号通路和调控网络的综合蛋白质推理。

Towards Robot Skill Learning and Adaptation with Gaussian Processes

迈向基于高斯过程的机器人技能学习与适应

Authors: A K M Nadimul Haque, Fouad Sukkar, Sheila Sujipto, Cedric Le Gentil, Marc G. Carmichael, Teresa Vidal-Calleja
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01480
Pdf link: https://arxiv.org/pdf/2603.01480
Abstract General robot skill adaptation requires expressive representations robust to varying task configurations. While recent learning-based skill adaptation methods refined via Reinforcement Learning (RL) have shown success, existing skill models often lack sufficient representational capacity for anything beyond minor environmental changes. In contrast, Gaussian Process (GP)-based skill modelling provides an expressive representation with useful analytical properties; however, adaptation of GP-based skills remains underexplored. This paper proposes a novel, robust skill adaptation framework that utilises GPs with sparse via-points for compact and expressive modelling. The model considers the trajectory's poses and leverages its first and second analytical derivatives to preserve the skill's kinematic profile. We present three adaptation methods to cater for the variability between initial and observed configurations. The first is an optimisation agent that adjusts the path's via-points while preserving the demonstration velocity. The second is a behaviour cloning agent trained to replicate output trajectories from the optimisation agent. The third is an RL agent that has learnt to modify via-points whilst maintaining the kinematic profile and enabling online capabilities. Evaluated across three tasks (drawer opening, cube-pushing and bar manipulation) in both simulation and hardware, our proposed methods outperform every benchmark in success rates. Furthermore, the results demonstrate that the GP-based representation enables all three methods to attain high cosine similarity and low velocity magnitude errors, indicating strong preservation of the kinematic profile. Overall, our formulation provides a compact representation capable of adapting to large deviations from a single demonstrated skill.
中文摘要 通用的机器人技能适应需要对不同任务配置具有鲁棒性的高表达力表示。尽管近期通过强化学习（RL）改进的基于学习的技能适应方法已取得成功，但现有技能模型往往缺乏足够的表示能力，只能应对轻微的环境变化。相比之下，基于高斯过程（GP）的技能建模提供了具有实用解析性质的高表达力表示；然而，基于GP的技能的适应仍未得到充分探索。本文提出了一种新颖且鲁棒的技能适应框架，利用带稀疏途经点（via-point）的GP进行紧凑且富有表达力的建模。该模型考虑轨迹的位姿，并利用其一阶和二阶解析导数来保持技能的运动学特征。我们提出了三种适应方法，以应对初始配置与观测配置之间的差异。第一种是优化智能体，在保持示教速度的同时调整路径的途经点。第二种是行为克隆智能体，经训练后复现优化智能体输出的轨迹。第三种是强化学习智能体，它学会在保持运动学特征的同时修改途经点，并具备在线能力。我们在仿真和硬件上对三个任务（开抽屉、推方块和杆件操作）进行了评估，所提方法的成功率优于所有基准方法。此外，结果表明基于GP的表示使三种方法都能获得较高的余弦相似度和较低的速度幅值误差，表明运动学特征得到了很好的保持。总体而言，我们的方法提供了一种紧凑的表示，能够适应相对于单次示教技能的较大偏差。

Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents

多轮强化学习中密集与稀疏信号的协调：工业销售智能体的双时域信用分配

Authors: Haojin Yang, Ai Jian, Xinyue Huang, Yiwei Wang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Jingqing Ruan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01481
Pdf link: https://arxiv.org/pdf/2603.01481
Abstract Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, indicating a substantial improvement for an industrial sales scenario that effectively balances the dual demands of strategic performance and naturalistic language generation.
中文摘要 针对工业销售优化大型语言模型，需要在长期商业目标（如转化率）与即时语言约束（如流畅度和合规性）之间取得平衡。传统的强化学习常常将这些异质目标合并为单一奖励，导致幅值较大的会话级奖励压倒更细微的回合级信号，从而造成训练不稳定或奖励投机（reward hacking）。为解决这一问题，我们提出了双时域信用分配（DuCA）框架，该框架将不同时间尺度上的优化解耦。其核心是时域无关优势归一化（HIAN），它在融合之前分别对来自回合级和会话级奖励的优势进行归一化，确保即时目标和长期目标对策略更新的梯度贡献保持均衡。基于高保真用户模拟器的大量实验表明，DuCA优于最先进的GRPO基线，转化率相对提升6.82%，句间重复减少82.28%，身份检测率降低27.35%，表明其在工业销售场景中取得了显著改进，有效兼顾了策略性能与自然语言生成的双重需求。
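
The abstract above describes HIAN as normalizing turn-level and session-level advantages separately before fusing them. A minimal sketch of that idea follows, assuming a batch of rollouts with one turn-level and one session-level reward each, and a weighted sum as the fusion step. The function names, the weights, and the toy numbers are assumptions, not details from the paper.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization over a batch of rewards."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)


def fuse_advantages(turn_rewards, session_rewards, w_turn=1.0, w_session=1.0):
    """Normalize each horizon separately, then combine (assumed weighted sum)."""
    return w_turn * normalize(turn_rewards) + w_session * normalize(session_rewards)


turn = [0.1, 0.2, 0.3, 0.4]           # small-magnitude per-turn signals
session = [0.0, 0.0, 100.0, 100.0]    # large-magnitude session outcome
fused = fuse_advantages(turn, session)
naive = normalize(np.add(turn, session))  # single merged reward, for contrast
```

In the naive merged reward the session term dominates, so the first two rollouts receive nearly identical advantages despite different turn-level quality. With separate normalization both horizons contribute at the same scale.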

LLM-assisted Semantic Option Discovery for Facilitating Adaptive Deep Reinforcement Learning

LLM辅助的语义选项发现：促进自适应深度强化学习

Authors: Chang Yao, Jinghui Qin, Kebing Jin, Hankz Hankui Zhuo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01488
Pdf link: https://arxiv.org/pdf/2603.01488
Abstract Despite achieving remarkable success in complex tasks, Deep Reinforcement Learning (DRL) still suffers from critical issues in practical applications, such as low data efficiency, lack of interpretability, and limited cross-environment transferability. Moreover, learned policies that generate actions based on states are sensitive to environmental changes and struggle to guarantee behavioral safety and compliance. Recent research shows that integrating Large Language Models (LLMs) with symbolic planning is promising in addressing these challenges. Inspired by this, we introduce a novel LLM-driven closed-loop framework, which enables semantic-driven skill reuse and real-time constraint monitoring by mapping natural language instructions into executable rules and semantically annotating automatically created options. The proposed approach utilizes the general knowledge of LLMs to improve exploration efficiency and adapt to transferable options for similar environments, and provides inherent interpretability through semantic annotations. To validate the effectiveness of this framework, we conduct experiments on two domains, Office World and Montezuma's Revenge. The results demonstrate superior performance in data efficiency, constraint compliance, and cross-task transferability.
中文摘要 尽管在复杂任务中取得了显著成功，深度强化学习（DRL）在实际应用中仍面临关键问题，如数据效率低、缺乏可解释性以及跨环境可迁移性有限。此外，基于状态生成动作的已学习策略对环境变化敏感，难以保证行为的安全性和合规性。最新研究表明，将大型语言模型（LLM）与符号规划相结合，有望应对这些挑战。受此启发，我们引入了一个新颖的LLM驱动的闭环框架，通过将自然语言指令映射为可执行规则，并对自动创建的选项（option）进行语义标注，实现语义驱动的技能复用和实时约束监控。该方法利用LLM的通用知识提升探索效率，并为相似环境适配可迁移的选项，同时通过语义标注提供内在的可解释性。为了验证该框架的有效性，我们在Office World和Montezuma's Revenge两个领域上进行了实验。结果表明，该框架在数据效率、约束合规性和跨任务可迁移性方面表现优异。

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

GAC：通过梯度对齐控制稳定LLM的异步强化学习训练

Authors: Haofeng Xu, Junwei Su, Yukun Tian, Lansong Diao, Zhengping Qian, Chuan Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01501
Pdf link: https://arxiv.org/pdf/2603.01501
Abstract Asynchronous execution is essential for scaling reinforcement learning (RL) to modern large model workloads, including large language models and AI agents, but it can fundamentally alter RL optimization behavior. While prior work on asynchronous RL focuses on training throughput and distributional correction, we show that naively applying asynchrony to policy-gradient updates can induce qualitatively different training dynamics and lead to severe training instability. Through systematic empirical and theoretical analysis, we identify a key signature of this instability: asynchronous training exhibits persistently high cosine similarity between consecutive policy gradients, in contrast to the near-orthogonal updates observed under synchronized training. This stale-aligned gradient effect amplifies correlated updates and increases the risk of overshooting and divergence. Motivated by this observation, we propose GRADIENT ALIGNMENT CONTROL(GAC), a simple dynamics-aware stabilization method that regulates asynchronous RL progress along stale-aligned directions via gradient projection. We establish convergence guarantees under bounded staleness and demonstrate empirically that GAC recovers stable, on-policy training dynamics and matches synchronized baselines even at high staleness.
中文摘要 异步执行对于将强化学习（RL）扩展到现代大型模型工作负载（包括大型语言模型和人工智能代理）至关重要，但它也可能从根本上改变强化学习的优化行为。虽然此前异步强化学习的工作主要关注训练吞吐量和分布校正，但我们表明，简单地将异步应用于策略梯度更新，可能会引发性质不同的训练动态，并导致严重的训练不稳定性。通过系统的实证和理论分析，我们识别出这种不稳定性的一个关键特征：异步训练在连续策略梯度之间表现出持续较高的余弦相似度，这与同步训练下观察到的近正交更新形成对比。这种陈旧对齐（stale-aligned）梯度效应放大了相关更新，增加了过冲和发散的风险。基于这一观察，我们提出了梯度对齐控制（Gradient Alignment Control，简称GAC），这是一种简单的动态感知稳定方法，通过梯度投影调节异步强化学习沿陈旧对齐方向的进展。我们建立了有界陈旧度下的收敛保证，并通过实证证明GAC即使在高陈旧度下也能恢复稳定的同策略（on-policy）训练动态，并与同步基线相当。
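The GAC abstract above describes two ingredients: measuring the cosine similarity between consecutive policy gradients, and projecting away progress along stale-aligned directions. The following is a minimal sketch of that general idea, not the authors' algorithm. The function names, the alignment threshold of 0.5, and the choice to remove the full component along the previous gradient are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two gradient vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def project_out_stale_direction(grad, prev_grad, threshold=0.5, eps=1e-12):
    """Hypothetical control rule: if the new gradient is strongly aligned
    with the previous one, drop its component along the previous gradient.
    The threshold value is an assumption, not taken from the paper."""
    if cosine_similarity(grad, prev_grad) <= threshold:
        return list(grad)  # near-orthogonal update: leave unchanged
    scale = dot(grad, prev_grad) / (dot(prev_grad, prev_grad) + eps)
    return [g - scale * p for g, p in zip(grad, prev_grad)]

# Usage: a highly aligned pair is projected, an orthogonal pair is not.
prev_grad = [1.0, 0.0]
aligned = project_out_stale_direction([0.9, 0.1], prev_grad)
orthogonal = project_out_stale_direction([0.0, 1.0], prev_grad)
```

After projection the aligned gradient keeps only the part orthogonal to the previous step, which is one simple way to damp the correlated updates the abstract identifies as the source of overshooting.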

State-Action Inpainting Diffuser for Continuous Control with Delay

状态-动作修复扩散器，用于带延迟的连续控制

Authors: Dongqi Han, Wei Wang, Enze Zhang, Dongsheng Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01553
Pdf link: https://arxiv.org/pdf/2603.01553
Abstract Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.
中文摘要 信号延迟在交互与感知之间引入了时间间隔，给连续控制与强化学习（RL）带来了根本性的挑战。当前的解决方案主要沿两种截然不同的范式发展而来：无模型方法利用状态增强来保持马尔可夫性质；基于模型的方法则侧重于通过动力学建模推断潜在信念。本文通过引入状态-动作修复扩散器（SAID）来弥合这两种视角，该框架将动力学学习的归纳偏置与策略优化的直接决策能力相结合。通过将问题表述为联合序列修复任务，SAID隐式地捕捉环境动力学，同时直接生成一致的规划，有效地运行于基于模型与无模型范式的交汇处。关键的是，这种生成式表述使SAID能够无缝应用于在线和离线强化学习。在延迟连续控制基准上的大量实验表明，SAID实现了最先进且稳健的性能。我们的研究提出了一种新的方法论，以推进带延迟的强化学习领域。

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

LFPO：掩码扩散模型的无似然策略优化

Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01563
Pdf link: https://arxiv.org/pdf/2603.01563
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimation. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
中文摘要 带可验证奖励的强化学习（RLVR）在改进自回归模型方面取得了显著成功，尤其是在数学推理和代码生成等需要正确性的领域。然而，直接将此类范式应用于扩散大型语言模型（dLLMs）从根本上受限于精确似然计算的难解性，迫使现有方法依赖高方差近似。为弥合这一差距，我们提出了无似然策略优化（LFPO），这是一个原生框架，将向量场流匹配的概念映射到离散词元（token）空间。具体来说，LFPO将对齐表述为几何速度校正，通过对比更新直接优化去噪 logits。该设计有效绕过了似然近似中固有的误差，从而获得精确的梯度估计。此外，LFPO通过从中间步骤预测最终解来强制一致性，有效地拉直概率流，从而以显著更少的迭代次数实现高质量生成。大量实验表明，LFPO不仅在代码和推理基准测试上优于最先进的基线，还通过减少扩散步数将推理速度提升约20%。

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

超越长度缩放：协同生成式奖励模型的广度与深度

Authors: Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01571
Pdf link: https://arxiv.org/pdf/2603.01571
Abstract Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face, and the code is released on GitHub.
中文摘要 生成式奖励模型（GRMs）的最新进展表明，扩展思维链（CoT）推理的长度可显著提升评估的可靠性。然而，当前的研究主要依赖于非结构化的长度扩展，忽视了不同推理机制的不同效力：广度CoT（B-CoT，即多维原则覆盖）和深度CoT（D-CoT，即实质判断的合理性）。为此，我们引入了Mix-GRM，这是一种通过模块化合成流水线将原始推理依据重构为结构化的B-CoT和D-CoT的框架，随后采用监督微调（SFT）和带可验证奖励的强化学习（RLVR）来内化和优化这些机制。全面的实验表明，Mix-GRM在五个基准测试中达到了新的最先进水平，平均领先主流开源奖励模型8.2%。我们的结果显示推理上存在明显分化：B-CoT有利于主观偏好任务，而D-CoT在客观正确性任务中表现优异。因此，推理机制与任务不匹配会直接降低表现。此外，我们证明RLVR充当开关放大器，诱导一种涌现的极化现象，模型自发地分配其推理风格以匹配任务需求。合成的数据和模型发布于 Hugging Face，代码发布于 GitHub。

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

CARE：以基于证据的智能体框架迈向多模态医学推理中的临床问责

Authors: Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01607
Pdf link: https://arxiv.org/pdf/2603.01607
Abstract Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
中文摘要 大型视觉语言模型（VLMs）展现出强大的多模态医学推理能力，但大多数以端到端黑箱方式运行，偏离临床医生基于证据的分阶段工作流程，阻碍临床问责。与之互补的是，专家级视觉定位模型能够准确定位感兴趣区域（ROI），提供明确可靠的证据，提升推理准确性和信任度。本文介绍了CARE，它以基于证据的智能体框架推进多模态医学推理中的临床问责。与在单一通用模型中耦合定位与推理的现有方法不同，CARE将任务分解为协调的子模块，以减少捷径学习和幻觉：一个紧凑的VLM提出相关的医学实体；一个专家级实体指代分割模型产生像素级的ROI证据；一个基于定位的VLM在以ROI提示增强的完整图像上进行推理。这些VLM通过带可验证奖励的强化学习进行优化，使答案与支持证据保持一致。此外，一个VLM协调器负责规划工具调用并审查证据与答案的一致性，提供智能体式控制和最终验证。在标准医学VQA基准上的评估显示，我们的CARE-Flow（无协调器）相比同规模（10B）的最先进（SOTA）模型平均准确率提升了10.9%。借助动态规划和答案审查，我们的CARE-Coord带来进一步提升，比经过大量预训练的SOTA高出5.2%。我们的实验表明，模拟临床工作流程、融合解耦的专业模型和显式证据的智能体框架，能够带来更准确、更可问责的医疗人工智能。

ToolRLA: Fine-Grained Reward Decomposition for Tool-Integrated Reinforcement Learning Alignment in Domain-Specific Agents

ToolRLA：领域特定代理中工具集成强化学习对齐的细粒度奖励分解

Authors: Pengbo Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01620
Pdf link: https://arxiv.org/pdf/2603.01620
Abstract Tool-integrated reasoning agents interleaving natural language deliberation with external API calls show promise for complex multi-step tasks. However, aligning such agents for high-stakes domain-specific deployment is challenging, as existing reinforcement learning uses coarse binary rewards (success/failure) that insufficiently guide nuanced tool invocation in production. We present ToolRLA, a three-stage post-training pipeline (Supervised Fine-Tuning, Group Relative Policy Optimization, Direct Preference Optimization) for domain-specific tool-integrated agents. Its core is a fine-grained reward function with multiplicative correctness decomposition, evaluating tool invocation across four dimensions: format validity, tool selection correctness, invocation efficiency, and domain constraint compliance. Multiplicative composition prioritizes correct tool selection (a prerequisite for meaningful parameter evaluation), while a large negative compliance penalty (λ = 10) ensures regulatory adherence. Deployed on a real-world financial advisory copilot (80+ advisors, 1,200+ daily queries, 15+ heterogeneous APIs), ToolRLA achieves 47% higher end-to-end task completion (62% to 91%), 63% lower tool invocation error (38% to 14%), 93% lower regulatory violation (12% to 0.8%), and sub-2-second latency after three months. Ablation studies confirm fine-grained reward decomposition contributes 7 percentage points over coarse additive rewards; generalizability is validated on ToolBench and API-Bank.
中文摘要 工具集成推理代理将自然语言思考与外部API调用交错使用，在复杂多步任务上展现出潜力。然而，将此类代理对齐以适应高风险的领域特定部署具有挑战性，因为现有强化学习采用粗略的二元奖励（成功/失败），无法充分指导生产环境中细致的工具调用。我们介绍ToolRLA，一个面向特定领域工具集成代理的三阶段后训练流程（监督微调、组相对策略优化、直接偏好优化）。其核心是一个采用乘法正确性分解的细粒度奖励函数，从四个维度评估工具调用：格式有效性、工具选择正确性、调用效率和领域约束合规性。乘法组合优先保证正确的工具选择（这是有意义的参数评估的前提），而较大的负向合规惩罚（λ = 10）则确保法规遵循。部署在真实的金融顾问 copilot（80+ 顾问，1,200+ 每日查询，15+ 异构 API）上，ToolRLA 使端到端任务完成率提升 47%（62% 至 91%），工具调用错误降低 63%（38% 至 14%），法规违规率降低 93%（12% 至 0.8%），并在三个月后实现低于 2 秒的延迟。消融研究证实，细粒度奖励分解比粗粒度加性奖励多贡献 7 个百分点；泛化性已在 ToolBench 和 API-Bank 上得到验证。
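The ToolRLA abstract names a multiplicative correctness decomposition with a compliance penalty weight of λ = 10, but it does not give the formula. The sketch below shows one plausible shape of such a reward, under stated assumptions: the exact combination rule, the `tool_reward` signature, and the 0-to-1 efficiency score are hypothetical. The second helper only rechecks the relative changes quoted in the abstract from its own before and after figures.

```python
LAMBDA = 10.0  # compliance penalty weight quoted in the abstract

def tool_reward(format_valid, tool_correct, efficiency, violates_constraint):
    """Hypothetical reward: the correctness factors multiply, so a wrong
    tool choice zeroes the positive part regardless of the other factors.
    A constraint violation subtracts a fixed penalty of LAMBDA."""
    correctness = float(format_valid) * float(tool_correct) * efficiency
    penalty = LAMBDA if violates_constraint else 0.0
    return correctness - penalty

def relative_change(before, after):
    """Relative change in percent, e.g. for a rate moving from 62 to 91."""
    return (after - before) / before * 100.0

# Usage: wrong tool gives no credit; a violation dominates everything else.
wrong_tool = tool_reward(True, False, 0.9, False)
good_call = tool_reward(True, True, 0.8, False)
violation = tool_reward(True, True, 0.8, True)

completion_gain = relative_change(62.0, 91.0)   # task completion rate
error_drop = relative_change(38.0, 14.0)        # tool invocation error
violation_drop = relative_change(12.0, 0.8)     # regulatory violation
```

With an additive reward, a well-formed call to the wrong tool would still earn partial credit. The multiplicative form removes that credit, which matches the abstract's stated motivation for prioritizing tool selection.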

Learning Thermal-Aware Locomotion Policies for an Electrically-Actuated Quadruped Robot

学习电驱动四足机器人的热感知运动策略

Authors: Letian Qian, Yuhang Wan, Shuhan Wang, Xin Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01631
Pdf link: https://arxiv.org/pdf/2603.01631
Abstract Electrically-actuated quadrupedal robots possess high mobility on complex terrains, but their motors tend to accumulate heat under high-torque cyclic loads, potentially triggering overheat protection and limiting long-duration tasks. This work proposes a thermal-aware control method that incorporates motor temperatures into reinforcement learning locomotion policies and introduces thermal-constraint rewards to prevent temperature exceedance. Real-world experiments on the Unitree A1 demonstrate that, under a fixed 3 kg payload, the baseline policy triggers overheat protection and stops within approximately 7 minutes, whereas the proposed method can operate continuously for over 27 minutes without thermal interruptions while maintaining comparable command-tracking performance, thereby enhancing sustainable operational capability.
中文摘要 电驱动的四足机器人在复杂地形上具有高度机动性，但其电机在高扭矩周期负载下容易积聚热量，可能触发过热保护，限制长时间任务。本研究提出了一种热感知控制方法，将电机温度纳入强化学习的运动策略中，并引入热约束奖励以防止温度超标。在Unitree A1上的实机实验表明，在固定3公斤有效载荷下，基线策略会触发过热保护，并在约7分钟内停止，而所提方法则可在无热中断的情况下连续运行超过27分钟，同时保持相当的指令跟踪性能，从而提升可持续运行能力。

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

学习起草：带有强化学习的自适应推测性解码

Authors: Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01639
Pdf link: https://arxiv.org/pdf/2603.01639
Abstract Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
中文摘要 推测解码使用小型草稿模型生成候选词元，供更大的目标模型验证，从而加速大型语言模型（LLM）的推理。这种技术的有效性取决于起草候选词元与验证它们所花费时间之间的权衡。然而，当前最先进的方法依赖静态时间分配，而近期的动态方法则优化接受长度等代理指标，常常忽视真实时间成本，并将起草和验证阶段孤立处理。为解决这些局限性，我们引入了学习起草（LTD）这一新方法，直接优化每个起草与验证周期的吞吐量。我们将问题表述为强化学习环境，并训练两个协同适应的策略，动态协调起草和验证阶段。这促使两个策略相互适应，并显式地最大化解码效率。我们在五个不同的大型语言模型和四个不同的任务上进行了广泛评估。我们的结果显示，LTD实现了2.24倍到4.32倍的加速比，比最先进的Eagle3方法最高超出36.4%。
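The LTD abstract contrasts a proxy metric, acceptance length, with the true objective, throughput per draft-and-verify cycle. The toy calculation below illustrates why the two can disagree. It is only a sketch: the timing numbers and the function `cycle_throughput` are made up for illustration and do not come from the paper.

```python
def cycle_throughput(accepted_tokens, draft_tokens, draft_ms_per_token, verify_ms):
    """Tokens produced per millisecond for one draft-and-verify cycle.
    All timing inputs are hypothetical measurements."""
    cycle_ms = draft_tokens * draft_ms_per_token + verify_ms
    return accepted_tokens / cycle_ms

# Cycle A drafts 4 tokens and has 3 accepted.
# Cycle B drafts 8 tokens and has 4 accepted, so its acceptance length is longer.
throughput_a = cycle_throughput(accepted_tokens=3, draft_tokens=4,
                                draft_ms_per_token=5.0, verify_ms=30.0)
throughput_b = cycle_throughput(accepted_tokens=4, draft_tokens=8,
                                draft_ms_per_token=5.0, verify_ms=30.0)

# B wins on acceptance length but loses on throughput once drafting time counts.
longer_acceptance_is_slower = throughput_b < throughput_a
```

In this example the extra drafting time in cycle B outweighs the one additional accepted token, which is the kind of mismatch that motivates optimizing throughput directly.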

Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs

上下文链学习：多任务VRP中的动态约束理解

Authors: Shuangchun Gui, Suyu Liu, Xuehe Wang, Zhiguang Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01667
Pdf link: https://arxiv.org/pdf/2603.01667
Abstract Multi-task Vehicle Routing Problems (VRPs) aim to minimize routing costs while satisfying diverse constraints. Existing solvers typically adopt a unified reinforcement learning (RL) framework to learn generalizable patterns across tasks. However, they often overlook the constraint and node dynamics during the decision process, making the model fail to accurately react to the current context. To address this limitation, we propose Chain-of-Context Learning (CCL), a novel framework that progressively captures the evolving context to guide fine-grained node adaptation. Specifically, CCL constructs step-wise contextual information via a Relevance-Guided Context Reformulation (RGCR) module, which adaptively prioritizes salient constraints. This context then guides node updates through a Trajectory-Shared Node Re-embedding (TSNR) module, which aggregates shared node features from all trajectories' contexts and uses them to update inputs for the next step. By modeling evolving preferences of the RL agent, CCL captures step-by-step dependencies in sequential decision-making. We evaluate CCL on 48 diverse VRP variants, including 16 in-distribution and 32 out-of-distribution (with unseen constraints) tasks. Experimental results show that CCL performs favorably against the state-of-the-art baselines, achieving the best performance on all in-distribution tasks and the majority of out-of-distribution tasks.
中文摘要 多任务车辆路径问题（VRPs）旨在满足多样约束的同时最小化路径成本。现有求解器通常采用统一的强化学习（RL）框架，以学习跨任务的可泛化模式。然而，它们在决策过程中常常忽视约束和节点的动态变化，导致模型无法准确响应当前上下文。为解决这一局限，我们提出了上下文链学习（Chain-of-Context Learning，CCL）这一新框架，逐步捕捉不断演变的上下文，以指导细粒度的节点适应。具体来说，CCL通过相关性引导上下文重构（RGCR）模块逐步构建上下文信息，该模块自适应地优先处理显著约束。该上下文随后通过轨迹共享节点重嵌入（TSNR）模块引导节点更新，该模块聚合所有轨迹上下文中的共享节点特征，并用以更新下一步的输入。通过建模强化学习代理不断变化的偏好，CCL捕捉了序列决策中的逐步依赖关系。我们在48个不同的VRP变体上评估CCL，其中包括16个分布内任务和32个分布外任务（含未见约束）。实验结果显示，CCL相较于最先进的基线表现更优，在所有分布内任务和大多数分布外任务中取得最佳性能。

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

MVR：多视角视频奖励塑造用于强化学习

Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01694
Pdf link: https://arxiv.org/pdf/2603.01694
Abstract Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
中文摘要 奖励设计对于通过强化学习解决复杂任务至关重要。近期研究探讨利用视觉语言模型（VLM）产生的图像-文本相似度，以视觉反馈增强任务的奖励。一种常见做法是将VLM分数线性地加到任务奖励或成功奖励上，而未进行显式的奖励塑形，这可能会改变最优策略。此外，这些方法通常依赖单张静态图像，难以处理期望行为涉及复杂动态运动、跨越多个视觉上不同状态的任务。再者，单一视角可能遮挡代理行为的关键方面。为解决这些问题，本文提出了多视角视频奖励塑形（MVR）框架，该框架利用从多个视角拍摄的视频来建模状态与目标任务的相关性。MVR利用冻结的预训练VLM给出的视频-文本相似度，学习状态相关性函数，从而减轻基于图像的方法中固有的对特定静态姿态的偏向。此外，我们引入了一种状态依赖的奖励塑形表述，整合了任务特定奖励和基于VLM的引导，一旦实现期望的运动模式，便自动降低VLM引导的影响。我们通过在HumanoidBench中具有挑战性的人形机器人运动任务和MetaWorld中的操控任务上的广泛实验，验证了所提框架的有效性，并通过消融研究验证了设计选择。
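The MVR abstract notes that adding VLM scores linearly to the task reward, without explicit shaping, can change the optimal policy. For context, the sketch below shows classic potential-based shaping, a standard technique that is known to preserve the optimal policy. It is not the state-dependent formulation proposed in the paper. The potential values standing in for a VLM-derived state relevance score are hypothetical.

```python
def shaped_reward(task_reward, phi_s, phi_next, gamma):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return task_reward + gamma * phi_next - phi_s

def discounted_return(rewards, gamma):
    """Plain discounted sum of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def shaped_return(rewards, potentials, gamma):
    """Discounted return under shaping; potentials has len(rewards) + 1
    entries, one per visited state including the final one."""
    shaped = [shaped_reward(r, potentials[t], potentials[t + 1], gamma)
              for t, r in enumerate(rewards)]
    return discounted_return(shaped, gamma)

# Usage with made-up numbers: phi plays the role of a relevance score.
gamma = 0.9
rewards = [0.0, 0.0, 1.0]
phi = [0.2, 0.5, 0.7, 0.1]

plain = discounted_return(rewards, gamma)
shaped = shaped_return(rewards, phi, gamma)
# The shaping terms telescope, leaving only boundary potentials.
offset = gamma ** len(rewards) * phi[-1] - phi[0]
```

Because the shaping terms telescope, the shaped return differs from the plain return only by boundary terms that do not depend on the actions taken in between. That is the reason this form leaves the ranking of policies unchanged, unlike a raw additive bonus.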

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

跨模态恒等映射：通过强化学习最小化模态转换中的信息丢失

Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01696
Pdf link: https://arxiv.org/pdf/2603.01696
Abstract Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on this http URL code will be released when the paper is accepted.
中文摘要 大型视觉语言模型（LVLM）常常在生成的图片说明中省略或歪曲关键的视觉内容。最小化信息丢失将迫使LVLM专注于图像细节以生成精确描述。然而，由于视觉内容与文本输出之间的模态差距，测量模态转换过程中的信息丢失本身具有挑战性。本文论证，图片说明的质量与使用该说明文字搜索检索图像之间的相似性呈正相关。基于这一见解，我们进一步提出了跨模态身份映射（CIM）——一种强化学习框架，能够增强图片标题，而无需额外注释。具体来说，该方法从两个角度定量评估信息丢失：图库表示一致性和查询图库图像相关性。在这些指标的监督下，LVLM 最大限度地减少信息丢失，旨在实现从图片到说明文字的身份映射。实验结果表明，即使与监督式微调相比，我们的方法在图像标题制作上的表现更优。特别是在COCO-LN500基准测试中，CIM在该http URL代码的关系推理提升了20%，论文被接受后将发布。

TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training

TopoCurate：用于工具使用代理训练的交互拓扑建模

Authors: Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, Fei Wu, Kun Kuang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01714
Pdf link: https://arxiv.org/pdf/2603.01714
Abstract Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories demonstrating reflective recovery, semantic efficiency, and strategic diversity to mitigate covariate shift and mode collapse; for RL, we select tasks with high error branch ratios and strategic heterogeneity, maximizing gradient Signal-to-Noise Ratio to address vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines. We will release the code and data soon for further investigations.
中文摘要 工具使用代理的训练通常依赖基于结果的过滤：在成功轨迹上进行监督微调（SFT），在按通过率选择的任务上进行强化学习（RL）。然而，该范式忽视了交互动态：成功的轨迹可能缺乏错误恢复或存在冗余，而通过率则无法区分具有结构信息的任务与平凡任务。我们提出了 TopoCurate，这是一个交互感知框架，将同一任务的多次试验轨迹投影到统一的语义商拓扑中。通过合并等效的动作-观察状态，该投影将分散的线性轨迹转化为结构化流形，明确捕捉工具调用和环境响应如何驱动有效策略与失败模式之间的分歧。利用这一表示方式，我们引入了双重选择机制：对于SFT，我们优先考虑展现反思式恢复、语义效率和策略多样性的轨迹，以减轻协变量偏移和模式崩溃；对于强化学习，我们选择具有高错误分支比和策略异质性的任务，最大化梯度信噪比，以应对稀疏奖励环境中的信号消失问题。在BFCLv3和Tau2 Bench上的评估显示，TopoCurate相较于最先进基线取得了4.2%（SFT）和6.9%（RL）的稳定提升。我们将很快发布代码和数据，供进一步研究。

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

重新思考大规模强化学习中集成策略梯度的策略多样性

Authors: Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.01741
Pdf link: https://arxiv.org/pdf/2603.01741
Abstract Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at this https URL .
中文摘要 将强化学习扩展到数万个并行环境，需要克服单一策略有限的探索能力。最近提出的基于集成的策略梯度方法利用多个策略收集多样化样本，以促进探索。然而，仅仅拓宽探索空间并不总能提升学习能力，因为过度探索会降低探索质量或损害训练稳定性。本研究从理论上分析了策略间多样性对策略集成学习效率的影响，并提出耦合策略优化（Coupled Policy Optimization），通过策略间的KL约束来调节多样性。该方法能够实现有效探索，并在包括高难度灵巧操作在内的多项任务中，在样本效率和最终性能方面均优于SAPG、PBT和PPO等强基线。此外，对训练期间策略多样性和有效样本量的分析显示，跟随者策略自然地分布在领导者策略周围，展示了结构化且高效的探索行为的涌现。我们的结果表明，在适当调节下进行多样化探索，是在集成策略梯度方法中实现稳定且样本高效学习的关键。项目页面见 https URL 。

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport

带条件拉格朗日最优运输的超参数轨迹推断

Authors: Harry Amad, Mihaela van der Schaar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01771
Pdf link: https://arxiv.org/pdf/2603.01771
Abstract Neural networks (NNs) often have critical behavioural trade-offs that are set at design time with hyperparameters, such as reward weights in reinforcement learning or quantile targets in regression. Post-deployment, however, user preferences can evolve, making initial settings undesirable and necessitating potentially expensive retraining. To circumvent this, we introduce the task of Hyperparameter Trajectory Inference (HTI): to learn, from observed data, how a NN's conditional output distribution changes with its hyperparameters, and construct a surrogate model that approximates the NN at unobserved hyperparameter settings. HTI requires extending existing trajectory inference approaches to incorporate conditions, exacerbating the challenge of ensuring inferred paths are feasible. We propose an approach based on conditional Lagrangian optimal transport, jointly learning the Lagrangian function governing hyperparameter-induced dynamics along with the associated optimal transport maps and geodesics between observed marginals, which form the surrogate model. We incorporate inductive biases based on the manifold hypothesis and least-action principles into the learned Lagrangian, improving surrogate model feasibility. We empirically demonstrate that our approach reconstructs NN outputs across various hyperparameter spectra better than other alternatives.
中文摘要 神经网络（NNs）通常在设计时通过超参数设定关键行为权衡，如强化学习中的奖励权重或回归中的分位数目标。然而，部署后用户偏好可能发生变化，使初始设置变得不理想，可能需要昂贵的重新培训。为避免此问题，我们引入了超参数轨迹推断（HTI）任务：从观测数据中了解NN的条件输出分布如何随超参数变化，并构建一个替代模型，近似NN在未观测超参数设置下。HTI需要扩展现有的轨迹推断方法以纳入条件，这加剧了确保推断路径可行性的挑战。我们提出一种基于条件拉格朗日最优传输的方法，结合学习控制超参数诱导动力学的拉格朗日函数，以及相关的最优传输图和观测边际间的测地线，这些构成了替代模型。我们将基于流形假设和最小作用原则的归纳偏差纳入已学的拉格朗日量，从而提升替代模型的可行性。我们通过实证证明，我们的方法比其他替代方案更能重建不同超参数谱的神经网络输出。

FireRed-OCR Technical Report

FireRed-OCR技术报告

Authors: Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2603.01840
Pdf link: https://arxiv.org/pdf/2603.01840
Abstract We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.
中文摘要 我们介绍了FireRed-OCR，这是一个系统化框架，用于将通用VLM专门化为高性能OCR模型。大型视觉语言模型（VLM）展现出令人印象深刻的通用能力，但在处理复杂文档时常常出现“结构性幻觉”，限制了其在工业OCR应用中的应用价值。本文介绍了FireRed-OCR，这是一个新颖框架，旨在将基于Qwen3-VL的通用VLM转化为像素精确的结构文档解析专家。为了解决高质量结构化数据的稀缺，我们构建了一个“几何+语义”数据工厂。与传统随机抽样不同，我们的流程利用几何特征聚类和多维标签，综合并策划高度平衡的数据集，有效处理长尾布局和稀有文档类型。此外，我们提出了一种三阶段渐进训练策略，引导模型从像素级感知到逻辑结构生成。该课程包括：（1）多任务预对齐，以巩固模型对文档结构的理解;（2）用于标准化全图像Markdown输出的专用SFT;以及（3）格式约束群相对策略优化（GRPO），利用强化学习来强制严格的语法有效性和结构完整性（例如，表格闭合、公式语法）。对 OmniDocBench v1.5 的广泛评估显示，FireRed-OCR 实现了最先进的性能，整体得分为 92.94%，在文本、公式、表格和阅读顺序指标上显著优于 DeepSeek-OCR 2 和 OCRVerse 等强基线。我们将代码和模型权重开源，以促进“通用VLM到专业结构专家”的范式。

SEAR: Sample Efficient Action Chunking Reinforcement Learning

SEAR：样本高效的动作分块强化学习

Authors: C. F. Maximilian Nagy, Onur Celik, Emiliyan Gospodinov, Florian Seligmann, Weiran Liao, Aryan Kaushik, Gerhard Neumann
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.01891
Pdf link: https://arxiv.org/pdf/2603.01891
Abstract Action chunking can improve exploration and value estimation in long-horizon reinforcement learning, but makes learning substantially harder since the critic must evaluate action sequences rather than single actions, greatly increasing approximation and data efficiency challenges. As a result, existing action chunking methods, primarily designed for the offline and offline-to-online settings, have not achieved strong performance in purely online reinforcement learning. We introduce SEAR, an off-policy online reinforcement learning algorithm for action chunking. It exploits the temporal structure of action chunks and operates with a receding horizon, effectively combining the benefits of small and large chunk sizes. SEAR outperforms state-of-the-art online reinforcement learning methods on Metaworld, training with chunk sizes up to 20.
中文摘要 动作分块可以改善长时域强化学习中的探索和价值估计，但会使学习变得困难得多，因为评论家（critic）必须评估动作序列而非单一动作，这大大增加了近似和数据效率方面的挑战。因此，现有的动作分块方法主要为离线和离线到在线设置而设计，在纯在线强化学习中尚未取得强劲表现。我们提出SEAR，一种用于动作分块的离策略在线强化学习算法。它利用动作块的时间结构，并以滚动时域（receding horizon）方式运作，有效地结合了小块和大块的优势。SEAR 在 Metaworld 上优于最先进的在线强化学习方法，训练时的块大小可达 20。
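
To make the receding-horizon idea concrete, the toy sketch below plans a long chunk of actions but executes only its first few before replanning from the new state. It is not the SEAR algorithm: the 1-D environment, the hand-written `plan_chunk` policy, and every parameter name are hypothetical stand-ins chosen so the control loop is easy to follow.

```python
def plan_chunk(position: float, goal: float, chunk_size: int,
               max_step: float = 1.0) -> list[float]:
    """Hypothetical chunk policy: propose `chunk_size` bounded steps toward the goal."""
    chunk = []
    for _ in range(chunk_size):
        step = max(-max_step, min(max_step, goal - position))
        chunk.append(step)
        position += step
    return chunk

def rollout(start: float, goal: float, chunk_size: int,
            execute_steps: int, horizon: int) -> tuple[list[float], int]:
    """Plan a full chunk, execute only the first `execute_steps` actions, then replan."""
    position, replans = start, 0
    trajectory = [position]
    while len(trajectory) - 1 < horizon:
        chunk = plan_chunk(position, goal, chunk_size)
        replans += 1
        for action in chunk[:execute_steps]:
            position += action
            trajectory.append(position)
            if len(trajectory) - 1 >= horizon:
                break
    return trajectory, replans

# Chunks of 20 actions, but only 5 are executed before each replanning step.
trajectory, replans = rollout(start=0.0, goal=5.0, chunk_size=20,
                              execute_steps=5, horizon=10)
```

Setting `execute_steps` equal to `chunk_size` recovers open-loop execution of whole chunks, while `execute_steps=1` replans at every step; values in between trade reactivity against temporal commitment.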

Generative Visual Chain-of-Thought for Image Editing

生成式视觉思维链用于图像编辑

Authors: Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01893
Pdf link: https://arxiv.org/pdf/2603.01893
Abstract Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This joint optimization fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.
中文摘要 现有的图像编辑方法难以判断该在哪里编辑，尤其是在复杂场景和细腻的空间指令下。为解决这一问题，我们提出了生成式视觉思维链（GVCoT）这一统一框架，通过先生成空间线索定位目标区域，然后执行编辑，进行原生视觉推理。与以往仅文本的CoT或依赖工具的视觉CoT范式不同，GVCoT以端到端的方式联合优化推理和编辑阶段生成的视觉标记。这种方式促进了内在空间推理能力的涌现，并使视觉领域线索得到更有效的利用。训练GVCoT的主要挑战在于缺乏带有精确编辑区域标注的大规模编辑数据；为此，我们构建了GVCoT-Edit-Instruct数据集，该数据集包含180万高质量样本，涵盖19个任务。我们采用渐进式训练策略：先通过监督微调，在最终编辑之前的推理轨迹中建立基础定位能力，随后进行强化学习以进一步提升推理和编辑质量。最后，我们提出SREdit-Bench，这是一个旨在在复杂场景和细粒度指称表达下对模型进行全面压力测试的新基准。实验表明，GVCoT在SREdit-Bench和ImgEdit上始终优于最先进的模型。我们希望GVCoT能启发未来面向可解释、精确图像编辑的研究。

Visual Bias in Simulated Users: The Impact of Luminance and Contrast on Reinforcement Learning-based Interaction

模拟用户中的视觉偏见：亮度和对比度对基于强化学习交互的影响

Authors: Hannah Selder, Charlotte Beylier, Nico Scherf, Arthur Fleig
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2603.01901
Pdf link: https://arxiv.org/pdf/2603.01901
Abstract Reinforcement learning (RL) enables simulations of HCI tasks, yet their validity is questionable when performance is driven by visual rendering artifacts distinct from interaction design. We provide the first systematic analysis of how luminance and contrast affect behavior by training 247 simulated users using RL on pointing and tracking tasks. We vary the luminance of task-relevant objects, distractors, and background under no distractor, static distractor, and moving distractor conditions, and evaluate task performance and robustness to unseen luminances. Results show luminance becomes critical with static distractors, substantially degrading performance and robustness, whereas motion cues mitigate this issue. Furthermore, robustness depends on preserving relational ordering between luminances rather than matching absolute values. Extreme luminances, especially black, often yield high performance but poor robustness. Overall, seemingly minor luminance changes can strongly shape learned behavior, revealing critical insights into what RL-driven simulated users actually learn.
中文摘要 强化学习（RL）使HCI任务的模拟成为可能，但当性能由视觉渲染伪影而非交互设计驱动时，其有效性值得怀疑。我们通过在指向和跟踪任务上训练247个使用强化学习的模拟用户，首次系统分析了亮度和对比度如何影响行为。我们在无干扰物、静态干扰物和移动干扰物条件下，改变任务相关物体、干扰物和背景的亮度，并评估任务表现和对未见亮度的鲁棒性。结果显示，在存在静态干扰物时亮度变得至关重要，会显著降低性能和鲁棒性，而运动线索则缓解了这一问题。此外，鲁棒性取决于保持亮度之间的相对顺序，而非匹配绝对值。极端亮度，尤其是黑色，通常带来较高性能但较差的鲁棒性。总体而言，看似微小的亮度变化能强烈影响学习到的行为，为强化学习驱动的模拟用户实际学到了什么提供了关键见解。

Efficient RLVR Training via Weighted Mutual Information Data Selection

通过加权互信息数据选择实现高效的RLVR训练

Authors: Xinyu Zhou, Boyu Zhu, Haotian Zhang, Huiming Wang, Zhijiang Guo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.01907
Pdf link: https://arxiv.org/pdf/2603.01907
Abstract Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models, yet its efficiency critically depends on how training data are selected. Existing online selection strategies predominantly rely on difficulty-based heuristics, favouring datapoints with intermediate success rates, implicitly equating difficulty with informativeness and neglecting epistemic uncertainty arising from limited evidence. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. By modeling data outcomes with Bayesian latent success rates, we show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection. Leveraging this observation, InSight constructs a stable acquisition score based on the mean belief of datapoints' success rather than noisy sampled outcomes, and naturally extends to multi-rollout settings common in reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that InSight consistently achieves state-of-the-art performance and improves training efficiency, including a +1.41 average gain on Planning & Mathematics benchmarks, +1.01 improvement on general reasoning, and up to ~2.2x acceleration, with negligible additional computational overhead.
中文摘要 强化学习（RL）在提升大型语言模型的推理和对齐方面起着核心作用，但其效率关键取决于如何选择训练数据。现有的在线选择策略主要依赖基于难度的启发式方法，偏好成功率中等的数据点，隐含地将难度等同于信息量，忽视了源自有限证据的认知不确定性。我们提出InSight，这是一种基于加权互信息目标的、面向强化学习训练的信息引导数据采样方法（INformation-guided data SamplInG metHod for RL Training）。通过用贝叶斯潜在成功率对数据结果建模，我们表明预期的不确定性降低可分解为互补的难度相关分量和证据相关分量，揭示了仅依据难度进行选择的根本局限性。基于这一观察，InSight基于对数据点成功率的平均信念而非带噪声的采样结果来构建稳定的获取得分，并自然扩展到带有可验证奖励的强化学习（RLVR）中常见的多次 rollout 设置。大量实验表明，InSight始终取得最先进的性能并提升训练效率，包括在规划与数学基准上平均提升+1.41，在通用推理上提升+1.01，加速最高可达约2.2倍，而额外计算开销可忽略不计。
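
A small worked example may help show why difficulty alone can mislead. Assuming a Beta posterior over each datapoint's latent success rate (an assumption for this sketch, as is every name below), the posterior variance factors exactly into a difficulty term, mean × (1 − mean), and an evidence term, 1 / (a + b + 1). This is a textbook property of the Beta distribution, used here only as an analogy; it is not the weighted mutual information score defined in the paper.

```python
def beta_posterior(successes: int, failures: int,
                   prior_a: float = 1.0, prior_b: float = 1.0) -> tuple[float, float]:
    """Beta posterior parameters after observing rollout outcomes (uniform prior by default)."""
    return prior_a + successes, prior_b + failures

def uncertainty_terms(a: float, b: float) -> tuple[float, float, float, float]:
    """Return (mean, difficulty term, evidence term, posterior variance) for Beta(a, b)."""
    mean = a / (a + b)
    difficulty = mean * (1.0 - mean)   # largest when the success rate is near 0.5
    evidence = 1.0 / (a + b + 1.0)     # shrinks as more outcomes are observed
    return mean, difficulty, evidence, difficulty * evidence

# Two datapoints with the same estimated success rate but different amounts of evidence.
few = uncertainty_terms(*beta_posterior(successes=1, failures=1))     # Beta(2, 2)
many = uncertainty_terms(*beta_posterior(successes=50, failures=50))  # Beta(51, 51)
```

Both datapoints have the same mean and therefore the same difficulty term, so a difficulty-only heuristic ranks them equally, yet the posterior for the first is far more uncertain because it rests on two observations rather than a hundred.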

LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

LaST-VLA：在自动驾驶中视觉-语言-行动的潜在时空空间思考

Authors: Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01928
Pdf link: https://arxiv.org/pdf/2603.01928
Abstract While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The model is trained with a progressive SFT strategy that transitions from feature alignment to trajectory generation and is then refined via reinforcement learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.
中文摘要 虽然视觉-语言-行动（VLA）模型通过统一感知与规划彻底革新了自动驾驶，但它们对显式文本思维链（CoT）的依赖导致语义-感知脱钩以及感知-符号冲突。近期向潜在推理的转变试图通过在连续隐空间中思考来绕过这些瓶颈。然而，在没有显式中间约束的情况下，标准潜在CoT通常作为一种与物理无关的表示来运作。为此，我们提出了潜在时空VLA（LaST-VLA）框架，将推理范式从离散符号处理转变为有物理依据的潜在时空CoT。通过实现双特征对齐机制，我们将三维基础模型中的几何约束和世界模型中的动态前瞻直接蒸馏到潜在空间中。模型采用渐进式SFT训练策略，从特征对齐过渡到轨迹生成，并通过基于组相对策略优化（GRPO）的强化学习进一步优化，以确保安全性和规则合规性。LaST-VLA在NAVSIM v1（91.3 PDMS）和NAVSIM v2（87.1 EPDMS）上创下新纪录，同时在SURDS和NuDynamics基准测试中时空推理表现出色。

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

CoVe：通过约束引导验证训练交互式工具使用代理

Authors: Jinpeng Chen, Cheng Gong, Hanbo Li, Ziru Liu, Zichen Tian, Xinyu Fu, Shi Wu, Chenyang Zhang, Wu Zhang, Suiyun Zhang, Dandan Tu, Rui Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01940
Pdf link: https://arxiv.org/pdf/2603.01940
Abstract Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging $\tau^2$-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to $17\times$ its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
中文摘要 开发多轮交互式工具使用代理具有挑战性，因为现实世界的用户需求往往复杂且模糊，但代理必须执行确定性动作以满足这些需求。为弥补这一差距，我们提出了 CoVe（Constraint-Verification，约束验证），这是一个后训练数据合成框架，旨在训练交互式工具使用代理，同时确保数据的复杂性和正确性。CoVe首先定义明确的任务约束，这些约束具有双重作用：指导复杂轨迹的生成，并作为评估轨迹质量的确定性验证器。这使得为监督微调（SFT）创建高质量训练轨迹、为强化学习（RL）导出准确的奖励信号成为可能。我们在具有挑战性的$\tau^2$-bench基准上的评估展示了该框架的有效性。值得注意的是，我们的紧凑型CoVe-4B模型在航空和零售领域分别实现了43.0%和59.4%的成功率；其整体性能显著优于同等规模的强基线，并且与规模高达其$17\times$的模型相比仍具竞争力。这些结果表明，CoVe为合成最先进交互式工具使用代理的训练数据提供了有效且高效的路径。为了支持未来的研究，我们开源了代码、训练好的模型以及用于训练的全部12K条高质量轨迹。

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

CharacterFlywheel：在生产环境中规模化迭代改进引人入胜且可引导的大型语言模型

Authors: Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou, Xiao Li, Zhengping Zhou, Raymond Zeng, Gelin Zhou, Shigan Chu, Ajay Thampi, Wancen Mu, Nathan Shuster, Ketong Wang, Lin Chen, Jason Brewer, Derek Hao Hu, Alexander McCauley, Jason Weston, Sem Park, Na Zhang, Kevin Tang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2603.01973
Pdf link: https://arxiv.org/pdf/2603.01973
Abstract This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.
中文摘要 本报告介绍了CharacterFlywheel，一种用于改进Instagram、WhatsApp和Messenger生产性社交聊天应用中大型语言模型（LLM）的迭代飞轮过程。从LLaMA 3.1开始，我们利用内部和外部真实用户流量的数据，在15代中对模型进行了细化。通过2024年7月至2025年4月的持续部署，我们进行了为期7天的受控A/B测试，显示参与度持续提升：8个新部署模型中有7个显示出较基线的积极提升，表现最强的模型在参与广度提升最高达8.8%，参与深度提升19.4%。我们还观察到可控性显著提升，指令跟进率从59.2%增加到84.8%，指令违规率从26.6%降至5.8%。我们详细介绍了CharacterFlywheel流程，该过程整合了数据整理、奖励建模以估计和插值参与度指标的全景、监督式微调（SFT）、强化学习（RL）以及线下和在线评估，以确保每个优化步骤的可靠进展。我们还讨论了我们对过拟合预防的方法以及如何应对大规模生产动态。这些贡献推动了面向数百万用户的社会应用中LLM的科学严谨性和理解。

Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

过程胜于结果：培养取证推理以实现可泛化的多模态操纵检测

Authors: Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01993
Pdf link: https://arxiv.org/pdf/2603.01993
Abstract Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.
中文摘要 生成式人工智能的最新进展显著提升了多模态媒体操纵的真实性，从而对操纵检测带来了重大挑战。现有的操纵检测与定位方法主要侧重于在结果导向监督下的操纵类型分类，这不仅缺乏可解释性，还倾向于过度拟合表面伪影。本文认为，可泛化的检测需要引入显式的取证推理，而不仅仅是对有限的操纵类型进行分类，后者无法泛化到未见过的操纵模式。为此，我们提出了REFORM，一个以推理为驱动的框架，将学习从结果拟合转向过程建模。REFORM采用三阶段课程，首先引导出取证依据，然后将推理与最终判断对齐，最后通过强化学习完善逻辑一致性。为了支持这一范式，我们引入了ROM，一个具有丰富推理标注的大规模数据集。大量实验表明，REFORM 取得了新的最先进性能并具有更优的泛化能力，在 ROM 上达到 81.52% 的 ACC，在 DGM4 上达到 76.65% 的 ACC，在 MMFakeBench 上达到 74.9 的 F1。

Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards

探索的时间表征：学习无外在奖励的复杂探索行为

Authors: Faisal Mohamed, Catherine Ji, Benjamin Eysenbach, Glen Berseth
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02008
Pdf link: https://arxiv.org/pdf/2603.02008
Abstract Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
中文摘要 在强化学习中进行有效的探索不仅需要追踪智能体曾经去过的地方，还要理解智能体如何感知和表示世界。为了学习强有力的表征，智能体应主动探索那些有助于其了解环境的状态。时间表征可以捕捉解决各种潜在任务所需的信息，同时避免了完整状态重建带来的计算成本。本文提出了一种利用时间对比表征来指导探索的方法，优先考虑未来结果不可预测的状态。我们证明，这种表征能够在运动、操作和具身人工智能任务中学习复杂的探索行为，展现出传统上需要外在奖励的能力和行为。与依赖显式距离学习或情景记忆机制的方法（如基于拟度量的方法）不同，我们的方法直接建立在时间相似性之上，提供了一种更简单但有效的探索策略。

Expanding LLM Agent Boundaries with Strategy-Guided Exploration

通过策略引导探索拓展LLM代理边界

Authors: Andrew Szot, Michael Kirchhof, Omar Attia, Alexander Toshev
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02045
Pdf link: https://arxiv.org/pdf/2603.02045
Abstract Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
中文摘要 强化学习（RL）在将大型语言模型（LLMs）后训练为执行计算机使用、工具调用和编码等任务的代理方面取得了显著成功。然而，探索仍然是面向LLM代理的强化学习中的核心挑战，尤其是当它们运行于观察复杂且结果奖励稀疏的语言动作空间中时。本研究利用LLM用语言对环境进行规划和推理的能力，将探索从低层次动作转向更高层次的语言策略，从而解决LLM代理的探索问题。因此，我们提出了策略引导探索（SGE），它首先生成一个简明的自然语言策略，描述为朝目标取得进展应当做什么，然后基于该策略生成环境动作。通过在策略空间而非动作空间进行探索，SGE促成了针对不同环境结果的结构化且多样化的探索。为了在强化学习期间增加策略多样性，SGE引入了并行探索多样策略的混合温度采样，以及一个策略反思过程，将策略生成建立在以往策略在环境中的结果之上。在用户界面交互、工具调用、编码和具身代理环境中，SGE始终优于以探索为核心的强化学习基线，提升了学习效率和最终性能。我们证明了SGE使智能体能够学会解决对基础模型而言过于困难的任务。

Accelerating PDE Surrogates via RL-Guided Mesh Optimization

通过强化学习引导网格优化加速偏微分方程替代

Authors: Yang Meng, Ruoxi Jiang, Zhuokai Zhao, Chong Liu, Rebecca Willett, Yuxin Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02066
Pdf link: https://arxiv.org/pdf/2603.02066
Abstract Deep surrogate models for parametric partial differential equations (PDEs) can deliver high-fidelity approximations but remain prohibitively data-hungry: training often requires thousands of fine-grid simulations, each incurring substantial computational cost. To address this challenge, we introduce RLMesh, an end-to-end framework for efficient surrogate training under limited simulation budget. The key idea is to use reinforcement learning (RL) to adaptively allocate mesh grid points non-uniformly within each simulation domain, focusing numerical resolution in regions most critical for accurate PDE solutions. A lightweight proxy model further accelerates RL training by providing efficient reward estimates without full surrogate retraining. Experiments on PDE benchmarks demonstrate that RLMesh achieves accuracy competitive with baselines while using substantially fewer simulation queries. These results show that solver-level spatial adaptivity can dramatically improve the efficiency of surrogate training pipelines, enabling practical deployment of learning-based PDE surrogates across a wide range of problems.
中文摘要 参数偏微分方程（PDE）的深度替代模型可以提供高保真近似，但数据需求极大：训练通常需要数千次精细网格仿真，每次都会产生大量计算成本。为应对这一挑战，我们提出了RLMesh，一个在有限仿真预算下高效训练替代模型的端到端框架。核心思想是利用强化学习（RL）在每个仿真域内自适应地非均匀分配网格点，将数值分辨率集中在对精确求解偏微分方程最关键的区域。轻量级代理模型无需完整重新训练替代模型即可提供高效的奖励估计，进一步加速了强化学习训练。偏微分方程基准测试的实验表明，RLMesh 在仿真查询次数大幅减少的情况下，取得了与基线相当的准确性。这些结果表明，求解器级空间自适应性能够显著提升替代模型训练流水线的效率，使基于学习的偏微分方程替代模型能够在各种问题中实际部署。

$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

$π$-StepNFT：在基于流的VLA的在线强化学习中，更广阔的空间需要更精细的步骤

Authors: Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.02083
Pdf link: https://arxiv.org/pdf/2603.02083
Abstract Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $\pi$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $\pi$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
中文摘要 基于流的视觉-语言-行动（VLA）模型在具身控制方面表现出色，但在多步采样过程中似然难以计算，阻碍了在线强化学习。我们提出了 $\pi$-StepNFT（逐步负样本感知微调，Step-wise Negative-aware Fine-Tuning），这是一个无需评论家（critic）和似然的框架，每个优化步骤只需一次前向传递，并且不需要辅助价值网络。我们发现，更广阔的探索空间需要更细粒度的逐步对齐指导。实验上，$\pi$-StepNFT 释放了模型在 LIBERO 上的潜力，并具有有竞争力的少样本鲁棒性。此外，它在 ManiSkill 上实现了更优的泛化，通过防止对多模态特征的过拟合，在 OOD 场景中优于基于价值的基线。这一特性为复杂的现实应用提供了有前景的可扩展解决方案。

Reinforcement Learning-Based Filters for Convection-Dominated Flows: Reference-Free and Reference-Guided Training

基于强化学习的对流主导流滤波器：无参考与参考引导训练

Authors: Anna Ivagnes, Maria Strazzullo, Gianluigi Rozza
Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2603.02086
Pdf link: https://arxiv.org/pdf/2603.02086
Abstract We propose a reinforcement learning (RL) framework for the dynamic selection of the filter parameter in Evolve-Filter (EF) regularization strategies for incompressible turbulent flows. Instead of prescribing the filter radius heuristically, the RL agent learns to adaptively control the filtering intensity in time, balancing numerical stability and physical accuracy. The methodology is assessed on two benchmark problems with fundamentally different dynamics: flow past a cylinder and decaying homogeneous turbulence. Both reference-guided and reference-free reward formulations are investigated. In the reference-guided setting, the agent is trained using direct numerical simulation (DNS) data over a limited time window and then evaluated in extrapolation. In the reference-free setting, the reward relies exclusively on physics-based quantities, without access to reference solutions, i.e., completely eliminating the computational costs related to DNS simulations. The results show that the proposed RL-EF strategies prevent numerical blow-up while avoiding the excessive dissipation typical of standard EF approaches based on a fixed Kolmogorov length scale. The learned policies accurately reproduce the relevant flow dynamics across scales, preserving the correct balance between large-scale structures and small-scale dissipation. Notably, the reference-free reward achieves performance comparable to the reference-guided approach, demonstrating that stable and spectrally consistent filtering strategies can be learned even without DNS data, drastically reducing the computational costs of the training phase. The proposed framework provides a robust and flexible alternative to manually tuned regularization parameters, enabling adaptive, physically consistent control of filtering in turbulent flow simulations.
中文摘要 我们提出了一种强化学习（RL）框架，用于演化滤波器（EF）正则化策略中不可压缩湍流滤波器参数的动态选择。强化学习者不再以启发式方式规定滤波半径，而是学会自适应地随时间控制滤波强度，平衡数值稳定性和物理精度。该方法论基于两个基准问题进行评估，这些问题具有根本不同的动力学：通过圆柱体的流动和衰减均匀湍流。研究了参考引导和无参考奖励的表述。在参考引导的环境中，代理在有限时间窗口内通过直接数值模拟（DNS）数据进行训练，然后通过外推进行评估。在无引用的环境中，奖励完全依赖于基于物理的量，无法访问参考解，即完全消除了与DNS模拟相关的计算成本。结果表明，提出的RL-EF策略在避免基于固定Kolmogorov长度尺度的标准EF方法典型的过度耗散的同时，防止了数值爆破。所学策略准确地重现了跨尺度的相关流动动态，保持了大尺度结构与小尺度耗散之间的正确平衡。值得注意的是，无引用奖励的性能可与引用引导方法相当，表明即使没有DNS数据也能学习稳定且频谱一致的过滤策略，大幅降低训练阶段的计算成本。该框架为手动调优正则化参数提供了一种稳健且灵活的替代方案，使湍流仿真中能够实现自适应且物理一致的过滤控制。

Learning from Synthetic Data Improves Multi-hop Reasoning

从合成数据中学习提升多跳推理能力

Authors: Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02091
Pdf link: https://arxiv.org/pdf/2603.02091
Abstract Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
中文摘要 强化学习（RL）已被证明能显著提升大型语言模型（LLMs）在数学、编程和多跳推理任务中的推理能力。然而，强化学习的微调需要大量高质量且可验证的数据，通常来源于人工注释、前沿大型语言模型生成，或由基于大型语言模型的验证器评分。这三者都有较大的局限性：人工注释数据集规模小且管理成本高昂，LLM生成的数据容易产生幻觉且成本高昂，基于LLM的验证器则不准确且速度缓慢。本研究探讨了一种更经济的替代方案：对规则生成的合成数据进行强化学习微调，用于多跳推理任务。我们发现，尽管合成数据仅包含虚构知识，基于合成数据微调的大型语言模型在流行的现实世界问答基准测试中表现显著更好。通过按题目难度分层表现，我们发现合成数据教会大型语言模型构建知识——这是一种基础且可推广的推理技能。我们的工作强调规则生成合成推理数据作为免费且可扩展的资源，以提升LLM推理能力。

ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation

ACDC：基于动态对比控制的自适应课程规划，用于机器人操作中的目标条件强化学习

Authors: Xuerui Wang, Guangyu Ren, Tianhong Dai, Bintao Hu, Shuangyao Huang, Wenzhang Zhang, Hengyan Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02104
Pdf link: https://arxiv.org/pdf/2603.02104
Abstract Goal-conditioned reinforcement learning has shown considerable potential in robotic manipulation; however, existing approaches remain limited by their reliance on prioritizing collected experience, resulting in suboptimal performance across diverse tasks. Inspired by human learning behaviors, we propose a more comprehensive learning paradigm, ACDC, which integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. More specifically, at the planning level, the AC component schedules the learning curriculum by dynamically balancing diversity-driven exploration and quality-driven exploitation based on the agent's success rate and training progress. At the control level, the DC component implements the curriculum plan through norm-constrained contrastive learning, enabling magnitude-guided experience selection aligned with the current curriculum focus. Extensive experiments on challenging robotic manipulation tasks demonstrate that ACDC consistently outperforms the state-of-the-art baselines in both sample efficiency and final task success rate.
中文摘要 目标条件强化学习在机器人操作中展现出相当大的潜力；然而，现有方法仍受限于对所收集经验进行优先级排序的依赖，导致在多样化任务中表现欠佳。受人类学习行为启发，我们提出了一种更全面的学习范式ACDC，它将多维自适应课程（AC）规划与动态对比（DC）控制相结合，引导智能体沿着设计良好的学习轨迹前进。更具体地说，在规划层面，AC组件根据智能体的成功率和训练进展，动态平衡多样性驱动的探索与质量驱动的利用，从而安排学习课程。在控制层面，DC组件通过范数约束的对比学习来执行课程计划，实现与当前课程重点一致的、由幅值引导的经验选择。在具有挑战性的机器人操作任务上的大量实验表明，ACDC在样本效率和最终任务成功率上始终优于最先进的基线。

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Pencil Puzzle Bench：多步可验证推理的基准

Authors: Justin Waugh
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02119
Pdf link: https://arxiv.org/pdf/2603.02119
Abstract We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours - a demanding test of long-context utilization, not just reasoning.
中文摘要 我们提出了Pencil Puzzle Bench，这是一个通过铅笔谜题评估大型语言模型推理能力的框架。铅笔谜题是一类与NP完全问题密切相关的约束满足问题，具有确定性的步骤级验证。从涵盖94个种类、经验证具有唯一解的62,231个谜题的数据库中，我们选出涵盖20个种类的300个谜题作为基准，并以两种模式评估来自11家提供商的51个模型：直接提问（单次）和代理式（多轮且带有迭代验证）。我们基准的一个关键区别在于，每个中间盘面状态都可以根据各种类特有的约束进行检验，将错误定位到被违反的具体规则，为过程监督和强化学习所需的密集逐步奖励信号提供了基础设施。我们的评估揭示了两个不同的能力轴：（1）推理努力扩展，GPT-5.2从无推理到最大努力提升了81倍；以及（2）代理式迭代，其中Claude Opus 4.6通过迭代检查从0.3%提升到30.0%，而GPT-5.2@xhigh从20.2%提升到56.0%。代理式尝试的中位数为29轮、耗时17分钟，最长的超过1,221轮、耗时14.3小时，这是对长上下文利用能力的严苛考验，而不仅仅是推理。
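
The step-level verification idea can be sketched with a toy puzzle. Below, a partially filled Latin square (no repeated value in any row or column, with 0 marking an empty cell) is checked after every move, and each violation is reported against the specific rule that was broken. The puzzle type, function names, and the ±1 reward are assumptions for illustration and are not taken from the benchmark's code.

```python
def find_violations(board: list[list[int]]) -> list[str]:
    """Check a partially filled Latin square (0 = empty) and name each violated rule."""
    violations = []
    size = len(board)
    for r in range(size):
        filled = [v for v in board[r] if v != 0]
        if len(filled) != len(set(filled)):
            violations.append(f"row {r}: repeated value")
    for c in range(size):
        filled = [board[r][c] for r in range(size) if board[r][c] != 0]
        if len(filled) != len(set(filled)):
            violations.append(f"column {c}: repeated value")
    return violations

def step_reward(board: list[list[int]]) -> float:
    """Hypothetical dense per-move reward: +1 for a consistent board, -1 otherwise."""
    return 1.0 if not find_violations(board) else -1.0

consistent = [[1, 2, 0],
              [0, 0, 0],
              [0, 0, 0]]
after_bad_move = [[1, 2, 0],
                  [1, 0, 0],   # a second 1 placed in column 0
                  [0, 0, 0]]
```

Because the check runs on intermediate states rather than only on the final answer, an error is caught on the move that introduces it, which is what makes a per-move reward signal possible.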

LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

LongRLVR：长上下文强化学习需要可验证的上下文奖励

Authors: Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02146
Pdf link: https://arxiv.org/pdf/2603.02146
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）通过针对事实性结果优化大型语言模型（LLMs），显著提升了其推理能力。然而，这种范式在长上下文场景中表现不佳，因为它依赖内部参数化知识，不适合需要上下文定位（contextual grounding）的任务，即在外部提供的信息中查找并进行推理的能力。我们找出了这一失败的关键原因：仅基于最终答案的奖励过于稀疏，无法有效引导模型识别相关证据。我们形式化地证明，仅基于结果的奖励会使上下文定位过程出现严重的梯度消失，导致学习难以进行。为克服这一瓶颈，我们提出LongRLVR，用密集且可验证的上下文奖励来补充稀疏的答案奖励。该辅助信号直接激励模型选择正确的定位信息，提供稳健的学习梯度，从而解决底层的优化难题。我们使用Qwen和LLaMA模型在具有挑战性的长上下文基准上验证了该方法。LongRLVR在所有模型和基准上均持续且显著地优于标准RLVR，例如，将一个14B模型在RULER-QA上的得分从73.17提升至88.90，在LongBench v2上的得分从39.8提升至46.5。我们的研究表明，显式奖励定位过程是释放大型语言模型在长上下文应用中全部推理潜力的关键且有效的策略。我们的代码可在此 https URL 获取。
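
Illustrative sketch (not from the paper): the LongRLVR abstract describes adding a dense, verifiable context reward to the sparse answer reward, but it does not give the reward's functional form. The example below assumes evidence is identified by chunk IDs, scores the selection with an F1-style overlap, and mixes the two terms with a hypothetical `context_weight`. All of these names and choices are assumptions made for this example.

```python
def context_reward(selected_ids, gold_ids):
    """Dense reward: F1 overlap between selected and gold evidence chunks.

    The F1 form is an assumption; the abstract only says the context
    reward is dense and verifiable.
    """
    selected, gold = set(selected_ids), set(gold_ids)
    hits = len(selected & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def total_reward(answer, reference, selected_ids, gold_ids, context_weight=0.5):
    """Sparse exact-match answer reward plus a weighted context reward."""
    answer_ok = answer.strip().lower() == reference.strip().lower()
    answer_reward = 1.0 if answer_ok else 0.0
    return answer_reward + context_weight * context_reward(selected_ids, gold_ids)
```

With this shape, a rollout that picks some of the right evidence but answers incorrectly still receives a nonzero signal, which is the intuition the abstract gives for avoiding vanishing gradients on the grounding step.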

Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

KL正则化多臂老虎机的近最优遗憾

Authors: Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.02155
Pdf link: https://arxiv.org/pdf/2603.02155
Abstract Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K\log^2T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
中文摘要 最新研究表明，与非正则化设定下经典的$\sqrt{T}$型遗憾相比，采用KL正则化目标的强化学习可以获得更快的收敛速率或对数级遗憾。然而，针对KL正则化目标的在线学习的统计效率仍远未得到完整刻画，即使限定在多臂老虎机（MABs）上也是如此。我们利用一种新的剥离（peeling）论证对KL-UCB进行精细分析，从而为MAB解决这一问题，得到$\tilde{O}(\eta K\log^2T)$的上界：这是第一个对$K$呈线性依赖的高概率遗憾界。这里，$T$是时间范围，$K$是臂数，$\eta^{-1}$是正则化强度，$\tilde{O}$隐藏了除涉及$\log T$的项以外的所有对数因子。我们首次给出的非常数下界$\Omega(\eta K \log T)$证明了该分析近乎紧致，该下界源于精巧的困难实例构造和对贝叶斯先验的定制分解。此外，在弱正则化区间（即$\eta$较大时），我们证明MAB的KL正则化遗憾与$\eta$无关，且量级为$\tilde{\Theta}(\sqrt{KT})$。总体而言，我们的结果为所有$\eta$区间下的KL正则化MAB提供了全面的理解，并给出了关于$K$、$\eta$和$T$的近乎最优的界。
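
Illustrative sketch (not from the paper): the abstract does not write out its objective. A common form of a KL-regularized bandit objective is $J(\pi) = \sum_a \pi(a) r(a) - \eta^{-1}\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$, whose maximizer is the Gibbs policy $\pi^*(a) \propto \pi_{\mathrm{ref}}(a)\exp(\eta r(a))$. The code below assumes that form with known mean rewards. The reference policy `pi_ref` and all function names are hypothetical, and this is not the paper's KL-UCB algorithm, which must also estimate the rewards online.

```python
import math


def kl_regularized_value(pi, rewards, pi_ref, eta):
    """J(pi) = E_pi[r] - (1/eta) * KL(pi || pi_ref); assumes pi_ref > 0 everywhere."""
    expected = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected - kl / eta


def gibbs_policy(rewards, pi_ref, eta):
    """Maximizer of J: pi*(a) proportional to pi_ref(a) * exp(eta * r(a))."""
    logits = [math.log(q) + eta * r for q, r in zip(pi_ref, rewards)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def optimal_value(rewards, pi_ref, eta):
    """Closed-form optimum: (1/eta) * log sum_a pi_ref(a) * exp(eta * r(a))."""
    return math.log(sum(q * math.exp(eta * r) for q, r in zip(pi_ref, rewards))) / eta
```

As $\eta$ grows the penalty weakens and the Gibbs policy concentrates on the best arm, which matches the abstract's description of large $\eta$ as the low-regularization regime.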

Tool Verification for Test-Time Reinforcement Learning

面向测试时强化学习的工具验证

Authors: Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02203
Pdf link: https://arxiv.org/pdf/2603.02203
Abstract Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
中文摘要 测试时强化学习（TTRL）已成为自我演化大型推理模型（LRMs）的一种有前景的范式，它通过多数投票产生自我诱导的奖励，从而在未标注的测试输入上实现在线适应。然而，一个虚假但高频的未经验证共识可能变成有偏且被不断强化的奖励信号，导致错误的模式坍缩。我们用T^3RL（面向测试时强化学习的工具验证）来解决这一失败模式，该方法将测试时工具验证引入奖励估计。具体来说，验证器使用外部工具提供的证据（例如来自代码执行的证据），在验证感知的投票中提高已验证rollout的权重，从而产生更可靠的伪标签用于训练。在不同难度的数学基准（MATH-500、AMC和AIME 2024）和多种骨干模型类型上，T^3RL相较TTRL均有显著提升，且在更难的问题上增益更大。更广泛地说，T^3RL可视为经过验证的在线数据合成，这凸显了测试时工具验证是稳定自我演化的关键机制。
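
Illustrative sketch (not from the paper): the verification-aware voting in the T^3RL abstract can be pictured as a weighted majority vote in which tool-verified rollouts count for more. The abstract does not state the weighting rule, so the `(answer, verified)` rollout representation, the `verified_weight` value, and the function name are hypothetical.

```python
from collections import defaultdict


def verification_aware_vote(rollouts, verified_weight=3.0):
    """Pick a pseudo-label from (answer, verified) pairs.

    Unverified rollouts contribute weight 1.0; verified ones contribute
    `verified_weight` (an assumed constant, not taken from the paper).
    Ties go to the answer that appeared first.
    """
    scores = defaultdict(float)
    for answer, verified in rollouts:
        scores[answer] += verified_weight if verified else 1.0
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Setting `verified_weight=1.0` recovers plain majority voting, so the same function shows how a frequent but unverified answer can win under standard TTRL and lose once verification is weighted in.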

Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training

Reasoning Core：一套可扩展的程序式数据生成套件，用于符号性预训练和后训练

Authors: Valentin Lacombe, Valentin Quesnel, Damien Sileo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02208
Pdf link: https://arxiv.org/pdf/2603.02208
Abstract Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
中文摘要 在可验证的符号数据上训练，是将语言模型的推理前沿拓展到标准预训练语料库所能提供的范围之外的一种有前景的方式。然而，现有的程序化生成器往往依赖固定的谜题或模板，无法在大规模下提供所需的分布广度。我们提出Reasoning Core，这是一套可扩展的套件，能够在多个核心形式化领域中程序化地生成可验证的符号推理数据：随机化领域上的PDDL规划、带等词的一阶逻辑、上下文无关文法的解析与生成、随机贝叶斯网络上的因果推理以及方程组。每个任务都配有外部求解器进行严格验证，并支持连续的难度控制以便进行课程设计。样本可以选择性地包含由求解器导出的推理轨迹，使得从最早的预训练阶段起就可以进行监督训练；同一接口还为强化学习提供可验证的奖励函数。我们的实验表明，将Reasoning Core数据混入预训练可以提升下游推理能力，同时保持或略微提升语言建模质量。零样本评估证实，这些任务对GPT-5等前沿模型具有挑战性。代码和数据均在MIT许可协议下公开。
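
Illustrative sketch (not from the paper): to make "procedurally generated, verifiable, difficulty-controlled" concrete, the toy generator below builds integer linear systems with a planted solution and checks candidate answers by substitution. Reasoning Core itself pairs tasks with external solvers. The planted-solution trick, the use of `n_vars` as a difficulty knob, and all function names are assumptions for this example and do not reflect the suite's actual interface.

```python
import random


def generate_system(n_vars, seed=0, coef_range=5):
    """Return (rows, rhs, solution) for a random integer linear system.

    `n_vars` acts as a simple difficulty knob. The system is built from a
    planted solution, so at least one valid answer always exists.
    """
    rng = random.Random(seed)
    solution = [rng.randint(-coef_range, coef_range) for _ in range(n_vars)]
    rows = [
        [rng.randint(-coef_range, coef_range) for _ in range(n_vars)]
        for _ in range(n_vars)
    ]
    rhs = [sum(a * x for a, x in zip(row, solution)) for row in rows]
    return rows, rhs, solution


def verifiable_reward(rows, rhs, candidate):
    """Return 1.0 if the candidate satisfies every equation, else 0.0."""
    if len(candidate) != len(rows[0]):
        return 0.0
    satisfied = all(
        sum(a * x for a, x in zip(row, candidate)) == b
        for row, b in zip(rows, rhs)
    )
    return 1.0 if satisfied else 0.0
```

Because the check is by substitution, any assignment that satisfies the equations is rewarded, including alternative solutions when a random system happens to be singular.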

Keyword: diffusion policy

Mean-Flow based One-Step Vision-Language-Action

基于Mean-Flow的一步视觉-语言-动作模型

Authors: Yang Chen, Xiaoguang Ma, Bin Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.01469
Pdf link: https://arxiv.org/pdf/2603.01469
Abstract Recent advances in FlowMatching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results elucidate its great potential as a high-efficiency backbone for VLA-based robotic manipulation.
中文摘要 基于流匹配（Flow-Matching）的视觉-语言-动作（VLA）框架的最新进展，在生成高频动作块方面展现了显著优势，尤其适用于高度灵巧的机器人操作任务。尽管取得了这些显著成就，其实际应用仍受限于较长的生成延迟，这源于固有的迭代采样需求和架构限制。为解决这一关键瓶颈，我们提出了基于Mean-Flow的一步VLA方法。具体来说，我们解决了动作生成过程中由噪声引起的问题，从而消除了传统流匹配方法固有的一致性约束。这大大提升了生成效率，并实现了一步动作生成。真实世界的机器人实验表明，所提出的基于Mean-Flow的一步VLA的生成速度分别是SmolVLA和Diffusion Policy的8.7倍和83.9倍。这些结果表明其作为基于VLA的机器人操作的高效骨干具有巨大潜力。

Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy

带动态修正的闭环动作块以实现无训练扩散策略

Authors: Pengyuan Wu, Pingrui Zhang, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.01953
Pdf link: https://arxiv.org/pdf/2603.01953
Abstract Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: this https URL
中文摘要 基于扩散的策略在机器人操作方面取得了显著成果，但在动态场景下常常难以快速适应，导致响应延迟或任务失败。我们提出DCDP，一个动态闭环扩散策略框架，将基于块的动作生成与实时修正相结合。DCDP集成了自监督动态特征编码器、交叉注意力融合和非对称动作编码器-解码器，在动作执行前注入环境动态信息，实现实时闭环动作修正，增强系统在动态场景下的适应性。在动态PushT仿真中，DCDP在无需重新训练的情况下将适应性提升了19%，且只需额外5%的计算量。其模块化设计支持即插即用集成，在动态机器人场景（包括真实世界的操作任务）中同时实现了时间连贯性和实时响应性。项目页面见：此 https URL