Arxiv Papers of Today

Generated: 2026-04-10 17:14:02 (UTC+8); arXiv release time: 2026-04-10 20:00 EDT (2026-04-11 08:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning with Reward Machines for Sleep Control in Mobile Networks

Authors: Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07411
Pdf link: https://arxiv.org/pdf/2604.07411
Abstract Energy efficiency in mobile networks is crucial for sustainable telecommunications infrastructure, particularly as network densification continues to increase power consumption. Sleep mechanisms for the components in mobile networks can reduce energy use, but deciding which components to put to sleep, when, and for how long while preserving quality of service (QoS) remains a difficult optimisation problem. In this paper, we utilise reinforcement learning with reward machines (RMs) to make sleep-control decisions that balance immediate energy savings and long-term QoS impact, i.e. time-averaged packet drop rates for deadline-constrained traffic and time-averaged minimum-throughput guarantees for constant-rate users. A challenge is that time-averaged constraints depend on cumulative performance over time rather than immediate performance. As a result, the effective reward is non-Markovian, and optimal actions depend on operational history rather than the instantaneous system state. RMs account for the history dependence by maintaining an abstract state that explicitly tracks the QoS constraint violations over time. Our framework provides a principled, scalable approach to energy management for next-generation mobile networks under diverse traffic patterns and QoS requirements.
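A minimal sketch of the reward-machine idea described in the abstract above, not the authors' construction. It assumes a hypothetical finite machine whose state u counts recent QoS violations (saturating at K), a hypothetical labeling function that marks a step as "viol" when the packet drop rate exceeds THRESHOLD, and a hypothetical penalty issued only in the top state. All names and constants are illustrative.

```python
from typing import Iterable, Tuple

K = 3             # hypothetical number of violation levels before a penalty applies
THRESHOLD = 0.05  # hypothetical per-step packet drop rate that counts as a violation
PENALTY = 1.0     # hypothetical penalty issued in the top RM state


def label_of(drop_rate: float) -> str:
    """Labeling function: map a raw observation to a high-level event."""
    return "viol" if drop_rate > THRESHOLD else "ok"


def rm_step(u: int, label: str) -> Tuple[int, float]:
    """One reward-machine transition: returns (next RM state, RM reward)."""
    u_next = min(u + 1, K) if label == "viol" else max(u - 1, 0)
    reward = -PENALTY if u_next == K else 0.0
    return u_next, reward


def run(trace: Iterable[Tuple[float, float]]) -> Tuple[int, float]:
    """Replay (energy_saved, drop_rate) pairs; return final RM state and total reward."""
    u, total = 0, 0.0
    for energy_saved, drop_rate in trace:
        u, rm_reward = rm_step(u, label_of(drop_rate))
        total += energy_saved + rm_reward
    return u, total
```

In a setup like this, a policy would condition on the pair (environment state, u), so the reward becomes Markovian in the product state even though it depends on history in the environment state alone.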

SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

Authors: Roxana Petcu, Evangelos Kanoulas, Maarten de Rijke
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.07415
Pdf link: https://arxiv.org/pdf/2604.07415
Abstract Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived rewards, eliminating the need for external supervision, and moving towards autonomous information-intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi-hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data-efficient alternative to supervised process modeling.

Dual-Rerank: Fusing Causality and Utility for Industrial Generative Reranking

Authors: Chao Zhang, Shuai Lin, ChengLei Dai, Ye Qian, Fan Mingyang, Yi Zhang, Yi Wang, Jingwei Zhuo
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07420
Pdf link: https://arxiv.org/pdf/2604.07420
Abstract Kuaishou serves over 400 million daily active users, processing hundreds of millions of search queries daily against a repository of tens of billions of short videos. As the final decision layer, the reranking stage determines user experience by optimizing whole-page utility. While traditional score-and-sort methods fail to capture combinatorial dependencies, Generative Reranking offers a superior paradigm by directly modeling the permutation probability. However, deploying Generative Reranking in such a high-stakes environment faces a fundamental dual dilemma: 1) the structural trade-off where Autoregressive (AR) models offer superior Sequential modeling but suffer from prohibitive latency, versus Non-Autoregressive (NAR) models that enable efficiency but lack dependency capturing; 2) the optimization gap where Supervised Learning faces challenges in directly optimizing whole-page utility, while Reinforcement Learning (RL) struggles with instability in high-throughput data streams. To resolve this, we propose Dual-Rerank, a unified framework designed for industrial reranking that bridges the structural gap via Sequential Knowledge Distillation and addresses the optimization gap using List-wise Decoupled Reranking Optimization (LDRO) for stable online RL. Extensive A/B testing on production traffic demonstrates that Dual-Rerank achieves State-of-the-Art performance, significantly improving User satisfaction and Watch Time while drastically reducing inference latency compared to AR baselines.

GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control

Authors: Prakul Sunil Hiremath
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07426
Pdf link: https://arxiv.org/pdf/2604.07426
Abstract Model-based reinforcement learning (MBRL) improves sample efficiency by optimizing policies inside imagined rollouts, but long-horizon planning degrades when model errors compound and imagined trajectories drift off the training manifold. We introduce GIRL (Generative Imagination Reinforcement Learning), a latent world-model framework that addresses this failure mode with two key components. First, a cross-modal grounding signal derived from a frozen foundation model (DINOv2) anchors the latent transition prior to a semantically consistent embedding space, penalizing inconsistent or implausible predictions. Second, an uncertainty-adaptive trust-region bottleneck interprets the KL regularizer as the Lagrange multiplier of a constrained optimization problem, restricting imagination drift within a learned region calibrated by Expected Information Gain and a Relative Performance Loss signal. We re-derive a value-gap bound using the Performance Difference Lemma and Integral Probability Metrics, yielding a bound that remains informative as the discount factor approaches one and connects the objective to real-environment regret. Experiments across three benchmark suites, including DeepMind Control, Adroit Hand Manipulation, and Meta-World with visual distractors, show that GIRL reduces latent rollout drift by 38 to 61 percent across tasks relative to DreamerV3, improves asymptotic return, and requires fewer environment interactions on long-horizon tasks. GIRL also outperforms TD-MPC2 on sparse-reward and high-contact settings under standard evaluation metrics. A distilled-prior variant reduces inference overhead and improves computational efficiency relative to the full model.

Regret-Aware Policy Optimization: Environment-Level Memory for Replay Suppression under Delayed Harm

Authors: Prakul Sunil Hiremath
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07428
Pdf link: https://arxiv.org/pdf/2604.07428
Abstract Safety in reinforcement learning (RL) is typically enforced through objective shaping while keeping environment dynamics stationary with respect to observable state-action pairs. Under delayed harm, this can lead to replay: after a washout period, reintroducing the same stimulus under matched observable conditions reproduces a similar harmful cascade. We introduce the Replay Suppression Diagnostic (RSD), a controlled exposure-decay-replay protocol that isolates this failure mode under frozen-policy evaluation. We show that, under stationary observable transition kernels, replay cannot be structurally suppressed without inducing a persistent shift in replay-time action distributions. Motivated by platform-mediated systems, we propose Regret-Aware Policy Optimization (RAPO), which augments the environment with persistent harm-trace and scar fields and applies a bounded, mass-preserving transition reweighting to reduce reachability of historically harmful regions. On graph diffusion tasks (50-1000 nodes), RAPO suppresses replay, reducing re-amplification gain (RAG) from 0.98 to 0.33 on 250-node graphs while retaining 82% of task return. Disabling transition deformation only during replay restores re-amplification (RAG 0.91), isolating environment-level deformation as the causal mechanism.

Active Reward Machine Inference From Raw State Trajectories

Authors: Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Arxiv link: https://arxiv.org/abs/2604.07480
Pdf link: https://arxiv.org/pdf/2604.07480
Abstract Reward machines are automaton-like structures that capture the memory required to accomplish a multi-stage task. When combined with reinforcement learning or optimal control methods, they can be used to synthesize robot policies to achieve such tasks. However, specifying a reward machine by hand, including a labeling function capturing high-level features that the decisions are based on, can be a daunting task. This paper deals with the problem of learning reward machines directly from raw state and policy information. As opposed to existing works, we assume no access to observations of rewards, labels, or machine nodes, and show what trajectory data is sufficient for learning the reward machine in this information-scarce regime. We then extend the result to an active learning setting where we incrementally query trajectory extensions to improve data (and indirectly computational) efficiency. Results are demonstrated with several grid world examples.

CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

Authors: Linbo Liu, Guande Wu, Han Ding, Yawei Wang, Qiang Zhou, Yuzhe Lu, Zhichao Xu, Huan Song, Panpan Xu, Lin Lee Cheong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07487
Pdf link: https://arxiv.org/pdf/2604.07487
Abstract Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent to fit new situations, placing an additional reasoning burden on the underlying LLM. To address this limitation, we propose a generative context augmentation framework using Contrastive Learning of Experience via Agentic Reflection (CLEAR). CLEAR first employs a reflection agent to perform contrastive analysis over past execution trajectories and summarize useful context for each observed task. These summaries are then used as supervised fine-tuning data to train a context augmentation model (CAM). We then further optimize CAM using reinforcement learning, where the reward signal is obtained by running the task execution agent. By learning to generate task-specific knowledge rather than retrieve knowledge from the past, CAM produces context that is better tailored to the current task. We conduct comprehensive evaluations on the AppWorld and WebShop benchmarks. Experimental results show that CLEAR consistently outperforms strong baselines. Compared with the baseline agent, it improves task completion rate from 72.62% to 81.15% on the AppWorld test set and average reward from 0.68 to 0.74 on a subset of WebShop. Our code is publicly available.

ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

Authors: Kai Qin, Liangxin Liu, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Houde Liu, Daiting Shi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.07506
Pdf link: https://arxiv.org/pdf/2604.07506
Abstract Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.

RL-ASL: A Dynamic Listening Optimization for TSCH Networks Using Reinforcement Learning

Authors: F. Fernando Jurado-Lasso, J. F. Jurado
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07533
Pdf link: https://arxiv.org/pdf/2604.07533
Abstract Time Slotted Channel Hopping (TSCH) is a widely adopted Media Access Control (MAC) protocol within the IEEE 802.15.4e standard, designed to provide reliable and energy-efficient communication in Industrial Internet of Things (IIoT) networks. However, state-of-the-art TSCH schedulers rely on static slot allocations, resulting in idle listening and unnecessary power consumption under dynamic traffic conditions. This paper introduces RL-ASL, a reinforcement learning-driven adaptive listening framework that dynamically decides whether to activate or skip a scheduled listening slot based on real-time network conditions. By integrating learning-based slot skipping with standard TSCH scheduling, RL-ASL reduces idle listening while preserving synchronization and delivery reliability. Experimental results on the FIT IoT-LAB testbed and Cooja network simulator show that RL-ASL achieves up to 46% lower power consumption than baseline scheduling protocols, while maintaining near-perfect reliability and reducing average latency by up to 96% compared to PRIL-M. Its link-based variant, RL-ASL-LB, further improves delay performance under high contention with similar energy efficiency. Importantly, RL-ASL performs inference on constrained motes with negligible overhead, as model training is fully performed offline. Overall, RL-ASL provides a practical, scalable, and energy-aware scheduling mechanism for next-generation low-power IIoT networks.

Dual-Loop Control in DCVerse: Advancing Reliable Deployment of AI in Data Centers via Digital Twins

Authors: Qingang Zhang, Yuejun Yan, Guangyu Wu, Siew-Chien Wong, Jimin Jia, Zhaoyang Wang, Yonggang Wen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07559
Pdf link: https://arxiv.org/pdf/2604.07559
Abstract The growing scale and complexity of modern data centers present major challenges in balancing energy efficiency with outage risk. Although Deep Reinforcement Learning (DRL) shows strong potential for intelligent control, its deployment in mission-critical systems is limited by data scarcity and the lack of real-time pre-evaluation mechanisms. This paper introduces the Dual-Loop Control Framework (DLCF), a digital twin-based architecture designed to overcome these challenges. The framework comprises three core entities: the physical system, a digital twin, and a policy reservoir of diverse DRL agents. These components interact through a dual-loop mechanism involving real-time data acquisition, data assimilation, DRL policy training, pre-evaluation, and expert verification. Theoretical analysis shows how DLCF can improve sample efficiency, generalization, safety, and optimality. Leveraging DLCF, we implemented the DCVerse platform and validated it through case studies on a real-world data center cooling system. The evaluation shows that our approach achieves up to 4.09% energy savings over conventional control strategies without violating SLA requirements. Additionally, the framework improves policy interpretability and supports more trustworthy DRL deployment. This work provides a foundation for reliable AI-based control in data centers and points toward future extensions for holistic, system-wide optimization.

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

Authors: Prince Zizhuang Wang, Shuli Jiang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07645
Pdf link: https://arxiv.org/pdf/2604.07645
Abstract The development of autonomous tool-use agents for complex, long-horizon tasks in collaboration with human users has become the frontier of agentic research. During multi-turn Human-AI interactions, the dynamic and uncertain nature of user demands poses a significant challenge; agents must not only invoke tools but also iteratively refine their understanding of user intent through effective communication. While recent advances in reinforcement learning offer a path to more capable tool-use agents, existing approaches require expensive training costs and struggle with turn-level credit assignment across extended interaction horizons. To this end, we introduce PRIME (Proactive Reasoning via Iterative Memory Evolution), a gradient-free learning framework that enables continuous agent evolvement through explicit experience accumulation rather than expensive parameter optimization. PRIME distills multi-turn interaction trajectories into structured, human-readable experiences organized across three semantic zones: successful strategies, failure patterns, and user preferences. These experiences evolve through meta-level operations and guide future agent behavior via retrieval-augmented generation. Our experiments across several diverse user-centric environments demonstrate that PRIME achieves competitive performance with gradient-based methods while offering cost-efficiency and interpretability. Together, PRIME presents a practical paradigm for building proactive, collaborative agents that learn from Human-AI interaction without the computational burden of gradient-based training.

An Imperfect Verifier is Good Enough: Learning with Noisy Rewards

Authors: Andreas Plesner, Francisco Guzmán, Anish Athalye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07666
Pdf link: https://arxiv.org/pdf/2604.07666
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a fundamental barrier to RLVR. Furthermore, our findings suggest that practitioners should prioritize moderate accuracy with high precision over perfect verification.
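A minimal sketch of what injecting controlled noise into a verifier reward could look like. The abstract above does not specify the noise model, so this assumes the simplest case: a binary reward that is flipped independently with probability noise_rate. The function name and parameters are hypothetical.

```python
import random


def noisy_reward(true_reward: int, noise_rate: float, rng: random.Random) -> int:
    """Return the binary verifier reward, flipped with probability noise_rate.

    Assumption: symmetric, independent flips. Model-based judge noise would
    generally be correlated with the input and is not modeled here.
    """
    if true_reward not in (0, 1):
        raise ValueError("true_reward must be 0 or 1")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    if rng.random() < noise_rate:
        return 1 - true_reward
    return true_reward


if __name__ == "__main__":
    rng = random.Random(0)
    flipped = sum(noisy_reward(1, 0.15, rng) == 0 for _ in range(10_000))
    print(f"empirical flip rate: {flipped / 10_000:.3f}")
```

Passing an explicit random.Random instance keeps the corruption reproducible across training runs, which matters when comparing against a clean baseline.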

Reset-Free Reinforcement Learning for Real-World Agile Driving: An Empirical Study

Authors: Kohei Honda, Hirotaka Hosogaya
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.07672
Pdf link: https://arxiv.org/pdf/2604.07672
Abstract This paper presents an empirical study of reset-free reinforcement learning (RL) for real-world agile driving, in which a physical 1/10-scale vehicle learns continuously on a slippery indoor track without manual resets. High-speed driving near the limits of tire friction is particularly challenging for learning-based methods because complex vehicle dynamics, actuation delays, and other unmodeled effects hinder both accurate simulation and direct sim-to-real transfer of learned policies. To enable autonomous training on a physical platform, we employ Model Predictive Path Integral control (MPPI) as both the reset policy and the base policy for residual learning, and systematically compare three representative RL algorithms, i.e., PPO, SAC, and TD-MPC2, with and without residual learning in simulation and real-world experiments. Our results reveal a clear gap between simulation and real-world: SAC with residual learning achieves the highest returns in simulation, yet only TD-MPC2 consistently outperforms the MPPI baseline on the physical platform. Moreover, residual learning, while clearly beneficial in simulation, fails to transfer its advantage to the real world and can even degrade performance. These findings reveal that reset-free RL in the real world poses unique challenges absent from simulation, calling for further algorithmic development tailored to training in the wild.

Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

Authors: Pei-Xi Xie, Che-Yu Lin, Cheng-Lin Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07747
Pdf link: https://arxiv.org/pdf/2604.07747
Abstract Reinforcement learning with verifiable rewards (RLVR) can improve low-k reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-k performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using Qwen3-1.7B-Base and Llama-3.2-1B-Instruct. On Qwen3-1.7B-Base, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On Llama-3.2-1B-Instruct, the gains are concentrated in the large-k regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.
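For readers unfamiliar with the pass@1 versus pass@2048 distinction in the abstract above, here is a sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per question and c of them are correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of samples drawn for the question
    c: number of those samples that are correct
    k: evaluation budget, with 1 <= k <= n
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct sample out of ten: low pass@1, but pass@10 is certain.
    print(pass_at_k(10, 1, 1), pass_at_k(10, 1, 10))
```

The example illustrates why the two regimes can diverge: a question solved by only a small fraction of samples contributes little to pass@1 yet counts fully at large k, so narrowing solution coverage can hurt large-k performance even as pass@1 rises.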

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

Authors: Peiran Xu, Jiaqi Zheng, Yadong Mu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.07774
Pdf link: https://arxiv.org/pdf/2604.07774
Abstract This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context and either produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model's performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach. Our code will be made available.
中文摘要 本文聚焦于具身任务规划，即智能体从环境中获取视觉观察，并执行原子动作以完成给定任务。尽管近期的视觉语言模型（VLM）在多模态理解和推理方面取得了令人印象深刻的成果，但在涉及多轮交互、长时程推理和长上下文分析的具身规划中，其表现仍然有限。为弥合这一差距，我们提出了RoboAgent，一种能力驱动的规划流程，由模型主动调用不同的子能力。每个能力都维护自己的上下文，并根据调度器给出的查询产生中间推理结果或与环境交互。该框架将复杂的规划分解为一系列VLM更擅长处理的基本视觉语言问题，从而实现更透明、更可控的推理过程。调度器及所有能力均由单一VLM实现，不依赖外部工具。为了训练该VLM，我们采用了多阶段范式，包括：（1）基于专家计划的行为克隆，（2）利用模型自身收集的轨迹进行DAgger训练，以及（3）由专家策略指导的强化学习。在这些阶段中，我们利用环境模拟器的内部信息为每个能力构建高质量的监督，并进一步引入增强数据和合成数据，以提升模型在更多样化场景下的性能。在广泛使用的具身任务规划基准上的大量实验验证了所提方法的有效性。我们的代码将在此 https URL 公开。

Automotive Engineering-Centric Agentic AI Workflow Framework

以汽车工程为中心的代理人工智能工作流框架

Authors: Tong Duy Son, Zhihao Liu, Piero Brigida, Yerlan Akhmetov, Gurudevan Devarajan, Kai Liu, Ajinkya Bhave
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.07784
Pdf link: https://arxiv.org/pdf/2604.07784
Abstract Engineering workflows such as design optimization, simulation-based diagnosis, control tuning, and model-based systems engineering (MBSE) are iterative, constraint-driven, and shaped by prior decisions. Yet many AI methods still treat these activities as isolated tasks rather than as parts of a broader workflow. This paper presents Agentic Engineering Intelligence (AEI), an industrial vision framework that models engineering workflows as constrained, history-aware sequential decision processes in which AI agents support engineer-supervised interventions over engineering toolchains. AEI links an offline phase for engineering data processing and workflow-memory construction with an online phase for workflow-state estimation, retrieval, and decision support. A control-theoretic interpretation is also possible, in which engineering objectives act as reference signals, agents act as workflow controllers, and toolchains provide feedback for intervention selection. Representative automotive use cases in suspension design, reinforcement learning tuning, multimodal engineering knowledge reuse, aerodynamic exploration, and MBSE show how diverse workflows can be expressed within a common formulation. Overall, the paper positions engineering AI as a problem of process-level intelligence and outlines a practical roadmap for future empirical validation in industrial settings.
中文摘要 设计优化、基于仿真的诊断、控制调优和基于模型的系统工程（MBSE）等工程工作流是迭代的、约束驱动的，并受先前决策的影响。然而，许多人工智能方法仍然将这些活动视为孤立的任务，而非更广泛工作流的一部分。本文提出了智能体工程智能（Agentic Engineering Intelligence，AEI），这是一个工业愿景框架，将工程工作流建模为受约束、历史感知的序贯决策过程，其中AI智能体支持工程师监督下对工程工具链的干预。AEI将用于工程数据处理和工作流记忆构建的离线阶段，与用于工作流状态估计、检索和决策支持的在线阶段连接起来。该框架也可以从控制理论的角度解释：工程目标作为参考信号，智能体作为工作流控制器，工具链为干预选择提供反馈。悬架设计、强化学习调参、多模态工程知识复用、空气动力学探索和MBSE等代表性汽车用例展示了多样化的工作流如何在统一的表述中表达。总体而言，论文将工程AI定位为流程级智能问题，并为未来在工业环境中的实证验证勾勒了实用路线图。

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

SEARL：面向自进化智能体的策略与工具图记忆联合优化

Authors: Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07791
Pdf link: https://arxiv.org/pdf/2604.07791
Abstract Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
中文摘要 可验证奖励强化学习（RLVR）的最新进展在单轮推理任务中展现出显著潜力。随着范式向自进化智能体学习转变，模型越来越多地被期望通过合成工具或积累显式经验来从轨迹中学习。然而，主流方法通常依赖大规模LLM或多智能体框架，这阻碍了它们在资源受限环境中的部署。基于结果的奖励固有的稀疏性也带来了很大挑战，因为智能体通常只有在完成任务后才会收到反馈。为解决这些局限，我们提出了基于工具记忆的自进化智能体框架SEARL。与直接利用交互经验的方法不同，我们的方法构建了一个将规划与执行相结合的结构化经验记忆。这提供了一种新颖的状态抽象，便于在类似情境（如工具复用）中进行泛化。因此，智能体从历史数据中提取显式知识，同时利用轨迹间的相关性使奖励信号更稠密。我们在知识推理和数学任务上评估了该框架，展示了其在实现更实用、更高效学习方面的有效性。

QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training-Inference Mismatch

QaRL：面向训练-推理不匹配下快速稳定训练的Rollout对齐量化感知强化学习

Authors: Hao Gu, Hao Wang, Jiacheng Liu, Lujun Li, Qiyuan Zhu, Bei Liu, Binxing Xu, Lei Wang, Xintong Yang, Sida Lin, Sirui Han, Yike Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07853
Pdf link: https://arxiv.org/pdf/2604.07853
Abstract Large language model (LLM) reinforcement learning (RL) pipelines are often bottlenecked by rollout generation, making end-to-end training slow. Recent work mitigates this by running rollouts with quantization to accelerate decoding, which is the most expensive stage of the RL loop. However, these setups destabilize optimization by amplifying the training-inference gap: rollouts are operated at low precision, while learning updates are computed at full precision. To address this challenge, we propose QaRL (Rollout Alignment Quantization-Aware RL), which aligns training-side forward with the quantized rollout to minimize mismatch. We further identify a failure mode in quantized rollouts: long-form responses tend to produce repetitive, garbled tokens (error tokens). To mitigate these problems, we introduce TBPO (Trust-Band Policy Optimization), a sequence-level objective with dual clipping for negative samples, aimed at keeping updates within the trust region. On Qwen3-30B-A3B MoE for math problems, QaRL outperforms quantized-rollout training by +5.5 while improving stability and preserving low-bit throughput benefits.
中文摘要 大型语言模型（LLM）强化学习（RL）流水线常常受限于rollout生成这一瓶颈，导致端到端训练速度较慢。近期研究通过以量化方式运行rollout来加速解码，而解码是RL循环中成本最高的阶段。然而，这些设置会放大训练-推理差距，从而破坏优化的稳定性：rollout以低精度运行，而学习更新则以全精度计算。为应对这一挑战，我们提出了QaRL（Rollout Alignment Quantization-Aware RL），使训练端的前向计算与量化rollout对齐，以最大限度减少不匹配。我们进一步发现了量化rollout中的一种失败模式：长篇回复往往产生重复且混乱的token（错误token）。为缓解这些问题，我们引入了TBPO（Trust-Band Policy Optimization，信任带策略优化），这是一个序列级目标，对负样本采用双重裁剪，旨在将更新保持在信任区域内。在面向数学问题的Qwen3-30B-A3B MoE上，QaRL比量化rollout训练高出+5.5，同时提升了稳定性并保留了低比特的吞吐量优势。
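
The abstract describes TBPO only as a sequence-level objective with dual clipping for negative samples. As a rough illustration of what dual clipping means, the sketch below implements the generic dual-clip PPO surrogate for a single sample. It is not the paper's TBPO objective: the function name, the clip range `eps`, and the lower-bound constant `c` are assumptions chosen for the example, and `ratio` may be read as a per-sequence probability ratio.

```python
def dual_clip_objective(ratio, advantage, eps=0.2, c=3.0):
    """Per-sample dual-clip PPO surrogate (a quantity to maximize).

    ratio:     pi_new / pi_old for the sample (token- or sequence-level).
    advantage: estimated advantage of the sample.
    eps:       standard PPO clip range (assumed value).
    c:         c > 1, extra bound applied only when advantage < 0 (assumed value).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Second clip: bound the penalty when the ratio is very large,
        # which keeps a single bad sample from dominating the update.
        surrogate = max(surrogate, c * advantage)
    return surrogate


# A negative sample with a huge ratio is bounded at c * advantage.
bounded = dual_clip_objective(10.0, -1.0)   # -3.0
# Near the old policy the second clip is inactive.
unbounded = dual_clip_objective(1.0, -1.0)  # -1.0
```

Without the second clip, the first example would contribute -10.0 to the objective, so the extra bound limits how far low-precision rollouts with mismatched probabilities can push a single update.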

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

ZeroCoder：LLM能否在没有真实监督的情况下提升代码生成？

Authors: Lishui Fan, Mouxiang Chen, Tingwei Zhu, Kui Liu, Xin Xia, Shanping Li, Zhongxin Liu
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.07864
Pdf link: https://arxiv.org/pdf/2604.07864
Abstract Code generation is important in software engineering, and Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm to improve it through execution-based feedback. However, most RLVR pipelines rely on human-curated tests, making progress bottlenecked by scarce and costly supervision. Existing work tried to use self-generated tests to ground rewards, but the lack of discriminative tests constrains the effect due to the sub-optimal performance of the model on test generation. We aim to improve code generation without ground-truth supervision by co-evolving code and test generation, so that their interactions yield progressively more informative supervision. To this end, we present ZeroCoder, a fully label-free co-evolutionary framework that jointly trains a Coder and a Tester using execution feedback from self-generated code-test interactions. For each problem, ZeroCoder executes sampled solutions against sampled tests to form a passing matrix, identifies a consensus subset of likely-correct solutions and consistent tests via a pluggable selection algorithm, and derives role-specific rewards. To ensure reward quality, ZeroCoder filters low-information instances via rank-based pre-filtering and trains the Tester with a curriculum balancing validity and mutation-driven discriminativeness. We further identify selector drift, the progressive miscalibration of fixed selection rules during co-evolution, and introduce DyB4, a Bayesian selector that uses as few as 10 labeled instances to recalibrate its priors dynamically. Across three models and six benchmarks, ZeroCoder consistently improves code generation and test generation. In the fully label-free setting, it improves code generation by up to 14.5% over the base model on Qwen2.5-Coder-7B-Instruct. With DyB4, the gain reaches 21.6%, while test generation improves by 24.3%, approaching oracle-supervised performance.
中文摘要 代码生成在软件工程中非常重要，而可验证奖励强化学习（RLVR）是通过基于执行的反馈改进代码生成的强大范式。然而，大多数RLVR流程依赖人工整理的测试，导致进展受限于稀缺且昂贵的监督。现有研究尝试利用自生成的测试来确定奖励，但由于模型在测试生成上的表现欠佳，缺乏有判别力的测试限制了效果。我们旨在通过代码生成与测试生成的共同进化，在没有真实标签监督的情况下改进代码生成，使二者的交互产生信息量逐步增加的监督。为此，我们提出了ZeroCoder，一个完全无标签的共同进化框架，利用自生成的代码-测试交互的执行反馈，联合训练Coder和Tester。对于每个问题，ZeroCoder在采样的测试上执行采样的解，形成通过矩阵，通过可插拔的选择算法识别由可能正确的解和一致的测试组成的共识子集，并推导出角色特定的奖励。为确保奖励质量，ZeroCoder通过基于排名的预过滤剔除低信息量实例，并用一个在有效性与变异驱动的判别力之间取得平衡的课程来训练Tester。我们进一步发现了选择器漂移，即固定选择规则在共同进化过程中逐渐失准的现象，并提出了DyB4，这是一种贝叶斯选择器，只需少至10个带标签的实例即可动态重新校准其先验。在三种模型和六个基准上，ZeroCoder持续提升代码生成和测试生成。在完全无标签的设置下，它在Qwen2.5-Coder-7B-Instruct上相对基础模型将代码生成最多提升14.5%。使用DyB4时，提升达到21.6%，测试生成提升24.3%，接近oracle监督的性能。
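
The passing matrix and consensus selection described in the abstract can be pictured with a small sketch. The selector below is a simple agreement heuristic (group solutions by identical pass patterns, then score each group by group size times tests passed); it stands in for the paper's pluggable selector and is not DyB4. The function names and the 0/1 reward scheme are hypothetical.

```python
from collections import defaultdict


def select_consensus(passing):
    """passing[i][j] is True when sampled solution i passes sampled test j.

    Returns (solution indices, test indices) of the highest-scoring group,
    where score = (number of solutions in the group) * (tests they pass).
    """
    groups = defaultdict(list)
    for i, row in enumerate(passing):
        groups[tuple(row)].append(i)
    best_pattern, best_solutions = max(
        groups.items(), key=lambda kv: len(kv[1]) * sum(kv[0])
    )
    tests = [j for j, ok in enumerate(best_pattern) if ok]
    return best_solutions, tests


def role_rewards(passing, solutions, tests):
    """Hypothetical rewards: 1.0 for members of the consensus subset, else 0.0."""
    coder = [1.0 if i in solutions else 0.0 for i in range(len(passing))]
    tester = [1.0 if j in tests else 0.0 for j in range(len(passing[0]))]
    return coder, tester


# Four sampled solutions (rows) executed against three sampled tests (columns).
passing = [
    [True, True, False],
    [True, True, False],
    [False, True, True],
    [False, False, False],
]
solutions, tests = select_consensus(passing)
coder_r, tester_r = role_rewards(passing, solutions, tests)
```

Here solutions 0 and 1 agree on tests 0 and 1, so they form the consensus subset and receive the Coder reward, while tests 0 and 1 receive the Tester reward. A fixed rule like this is the kind of selector the abstract says can drift as both roles evolve.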

Learning over Forward-Invariant Policy Classes: Reinforcement Learning without Safety Concerns

在前向不变策略类上学习：无安全顾虑的强化学习

Authors: Chieh Tsai, Muhammad Junayed Hasan Zahed, Salim Hariri, Hossein Rastgoftar
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.07875
Pdf link: https://arxiv.org/pdf/2604.07875
Abstract This paper proposes a safe reinforcement learning (RL) framework based on forward-invariance-induced action-space design. The control problem is cast as a Markov decision process, but instead of relying on runtime shielding or penalty-based constraints, safety is embedded directly into the action representation. Specifically, we construct a finite admissible action set in which each discrete action corresponds to a stabilizing feedback law that preserves forward invariance of a prescribed safe state set. Consequently, the RL agent optimizes policies over a safe-by-construction policy class. We validate the framework on a quadcopter hover-regulation problem under disturbance. Simulation results show that the learned policy improves closed-loop performance and switching efficiency, while all evaluated policies remain safety-preserving. The proposed formulation decouples safety assurance from performance optimization and provides a promising foundation for safe learning in nonlinear systems.
中文摘要 本文提出了一种基于前向不变性诱导的动作空间设计的安全强化学习（RL）框架。控制问题被表述为马尔可夫决策过程，但安全性直接嵌入到动作表示中，而非依赖运行时屏蔽或基于惩罚的约束。具体来说，我们构造了一个有限的可容许动作集，其中每个离散动作对应一个镇定反馈律，该反馈律保持给定安全状态集的前向不变性。因此，强化学习智能体在一个构造上即安全的策略类上优化策略。我们在扰动下的四旋翼悬停调节问题上验证了该框架。仿真结果显示，学习到的策略提升了闭环性能和切换效率，而所有被评估的策略都保持安全。所提出的表述将安全保障与性能优化解耦，为非线性系统中的安全学习提供了有前景的基础。
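
A toy version of the idea of a finite action set whose members are all stabilizing feedback laws is sketched below. It assumes a scalar, disturbance-free, discrete-time plant `x_next = a*x + b*u` with `u = -K*x`, which is far simpler than the quadcopter setting in the abstract. The plant parameters and candidate gains are invented for illustration.

```python
def admissible_gains(a, b, candidate_gains):
    """Keep gains K with |a - b*K| <= 1 for x_next = a*x + b*u, u = -K*x.

    For such K, |x_next| <= |x|, so every interval [-r, r] is forward invariant.
    """
    return [k for k in candidate_gains if abs(a - b * k) <= 1.0]


def step(x, action, gains, a, b):
    """Apply the feedback law indexed by the discrete action."""
    u = -gains[action] * x
    return a * x + b * u


a, b = 1.2, 1.0  # hypothetical plant, unstable in open loop
gains = admissible_gains(a, b, [0.0, 0.5, 1.0, 2.0, 3.0])

x, r = 0.9, 1.0  # start inside the safe set |x| <= r
trajectory = [x]
for t in range(20):
    action = t % len(gains)  # arbitrary choice: any action is safe
    x = step(x, action, gains, a, b)
    trajectory.append(x)
```

Because every surviving gain keeps the state inside the interval, an RL agent choosing among the indices `0..len(gains)-1` can explore freely, and the learning problem reduces to picking the best-performing safe law at each step.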

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

AnomalyAgent：基于工具增强强化学习的智能体式工业异常合成

Authors: Jiaming Su, Tengchao Yang, Ruikang Zhang, Zhengan Yan, Haoyu Sun, Linfeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07900
Pdf link: https://arxiv.org/pdf/2604.07900
Abstract Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve anomaly synthesis prompt; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.
中文摘要 工业异常生成是缓解异常检测任务中数据稀缺问题的关键方法。大多数现有异常合成方法依赖单步生成机制，缺乏复杂的推理和迭代优化能力，难以生成具有高语义真实性的异常样本。我们提出AnomalyAgent，一个具备自我反思、知识检索和迭代精炼能力的异常合成智能体，旨在生成真实且多样化的异常。具体来说，AnomalyAgent配备了五个工具：提示生成（PG）、图像生成（IG）、质量评估（QE）、知识检索（KR）和掩码生成（MG），从而实现闭环优化。为了提升决策和自我反思能力，我们从真实异常图像构建结构化轨迹，并设计了两阶段训练框架：先进行监督微调，再进行强化学习。该过程由三部分奖励机制驱动：（1）任务奖励，用于监督生成异常的质量和位置合理性；（2）反思奖励，用于训练模型改进异常合成提示的能力；（3）行为奖励，用于确保遵循轨迹。在MVTec-AD数据集上，AnomalyAgent在异常生成方面取得了2.10/0.33的IS/IC-L，使用ResNet34取得了57.0%的分类准确率，使用简单的UNet在图像级/像素级取得了99.3%/74.2%的AP，超过了所有零样本SOTA方法。代码和数据将公开。

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

大型语言模型后训练：离策略与在策略学习的统一视角

Authors: Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Hualong Yu, Zichen Xu, Qicheng Li, Yong Qin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07941
Pdf link: https://arxiv.org/pdf/2604.07941
Abstract Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
中文摘要 后训练已成为将预训练大型语言模型（LLM）转变为对齐且可部署系统的核心环节。近期进展涵盖监督微调（SFT）、偏好优化、强化学习（RL）、过程监督、验证器引导方法、蒸馏和多阶段流水线。然而，这些方法常常被零散地讨论，按标签或目标函数族来组织，而非按它们所解决的行为瓶颈来组织。本综述认为，LLM后训练最好被理解为对模型行为的结构化干预。我们首先按轨迹来源组织该领域，由此定义两种主要学习模式：基于外部提供轨迹的离策略学习，以及基于学习者自身生成的rollout的在策略学习。随后我们通过两种反复出现的角色来解读各类方法：有效支撑集扩展，使有用的行为更容易到达；策略重塑，改善已可到达区域内的行为。此外还有一个互补的系统层面角色，即行为巩固，它在各阶段和模型迁移之间保留、转移和摊销行为。这一视角为主要范式提供了统一的解读。SFT既可以服务于支撑集扩展，也可以服务于策略重塑，而基于偏好的方法通常属于离策略重塑。在策略RL通常改善学习者生成状态上的行为，但在更强的引导下，它也能使难以到达的推理路径变得可达。蒸馏通常最好被理解为巩固，而不仅仅是压缩；混合流水线则表现为协调的多阶段组合。总体而言，该框架有助于诊断后训练瓶颈并推理阶段的组合方式，表明LLM后训练的进展越来越依赖于协调的系统设计，而非任何单一的主导目标。

On-Policy Distillation of Language Models for Autonomous Vehicle Motion Planning

面向自动驾驶车辆运动规划的语言模型在策略蒸馏

Authors: Amirhossein Afsharrad, Amirhesam Abedsoltan, Ahmadreza Moradipari, Sanjay Lall
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.07944
Pdf link: https://arxiv.org/pdf/2604.07944
Abstract Large language models (LLMs) have recently demonstrated strong potential for autonomous vehicle motion planning by reformulating trajectory prediction as a language generation problem. However, deploying capable LLMs in resource-constrained onboard systems remains a fundamental challenge. In this paper, we study how to effectively transfer motion planning knowledge from a large teacher LLM to a smaller, more deployable student model. We build on the GPT-Driver framework, which represents driving scenes as language prompts and generates waypoint trajectories with chain-of-thought reasoning, and investigate two student training paradigms: (i) on-policy generalized knowledge distillation (GKD), which trains the student on its own self-generated outputs using dense token-level feedback from the teacher, and (ii) a dense-feedback reinforcement learning (RL) baseline that uses the teacher's log-probabilities as per-token reward signals in a policy gradient framework. Experiments on the nuScenes benchmark show that GKD substantially outperforms the RL baseline and closely approaches teacher-level performance despite a 5× reduction in model size. These results highlight the practical value of on-policy distillation as a principled and effective approach to deploying LLM-based planners in autonomous driving systems.
中文摘要 大型语言模型（LLM）最近通过将轨迹预测重新表述为语言生成问题，展示了在自动驾驶车辆运动规划中的强大潜力。然而，在资源受限的车载系统中部署高能力的LLM仍是一个根本性的挑战。本文研究如何有效地将大型教师LLM中的运动规划知识迁移到更小、更易部署的学生模型。我们基于GPT-Driver框架，该框架将驾驶场景表示为语言提示，并通过思维链推理生成路点轨迹。我们考察了两种学生训练范式：（i）在策略广义知识蒸馏（GKD），利用教师提供的稠密token级反馈，在学生自身生成的输出上训练学生；（ii）稠密反馈强化学习（RL）基线，在策略梯度框架中使用教师的对数概率作为逐token的奖励信号。在nuScenes基准上的实验显示，GKD大幅优于RL基线，并且在模型规模缩小5倍的情况下仍接近教师水平的性能。这些结果凸显了在策略蒸馏作为一种有原则且有效的方法，在自动驾驶系统中部署基于LLM的规划器的实际价值。
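
To make "dense token-level feedback from the teacher on student-generated outputs" concrete, the sketch below computes a per-token divergence between student and teacher distributions along a sequence that is assumed to have been sampled from the student. Reverse KL is used here as one common choice; the abstract does not say which divergence the paper uses, and the function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token divergence over a student-sampled sequence.

    Both arguments are lists of per-position logit vectors. The teacher
    logits are assumed to be computed on the same student-generated prefix.
    """
    losses = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(losses) / len(losses)


# Toy vocabulary of size 2, sequence of length 2.
matched = on_policy_distill_loss([[1.0, 2.0], [0.5, 0.5]],
                                 [[1.0, 2.0], [0.5, 0.5]])
mismatched = on_policy_distill_loss([[0.0, 0.0], [0.0, 0.0]],
                                    [[2.0, 0.0], [0.0, 2.0]])
```

The loss is zero when the student already matches the teacher at every position and positive otherwise, so every generated token carries a training signal. The RL baseline in the abstract instead reduces the teacher's feedback to a scalar log-probability reward per token.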

Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation

面向社交导航真实世界学习的增量残差强化学习

Authors: Haruto Nagahisa, Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07945
Pdf link: https://arxiv.org/pdf/2604.07945
Abstract As the demand for mobile robots continues to increase, social navigation has emerged as a critical task, driving active research into deep reinforcement learning (RL) approaches. However, because pedestrian dynamics and social conventions vary widely across different regions, simulations cannot easily encompass all possible real-world scenarios. Real-world RL, in which agents learn while operating directly in physical environments, presents a promising solution to this issue. Nevertheless, this approach faces significant challenges, particularly regarding constrained computational resources on edge devices and learning efficiency. In this study, we propose incremental residual RL (IRRL). This method integrates incremental learning, which is a lightweight process that operates without a replay buffer or batch updates, with residual RL, which enhances learning efficiency by training only on the residuals relative to a base policy. Through the simulation experiments, we demonstrated that, despite lacking a replay buffer, IRRL achieved performance comparable to those of conventional replay buffer-based methods and outperformed existing incremental learning approaches. Furthermore, the real-world experiments confirmed that IRRL can enable robots to effectively adapt to previously unseen environments through the real-world learning.
中文摘要 随着对移动机器人需求的持续增长，社交导航已成为一项关键任务，推动了对深度强化学习（RL）方法的积极研究。然而，由于不同地区的行人动态和社会惯例差异很大，模拟无法轻易涵盖所有可能的现实场景。现实世界的强化学习中，智能体在物理环境中直接操作学习，为这一问题提供了有前景的解决方案。然而，这种方法面临重大挑战，尤其是在边缘设备计算资源受限和学习效率方面。本研究提出增量残差RL（IRRL）。该方法将增量学习（一种无需重放缓冲或批量更新的轻量级过程）与残差强化学习（residual RL）结合，后者仅通过训练相对于基础策略的残差来提升学习效率。通过模拟实验，我们证明尽管缺乏重放缓冲区，IRRL的性能仍可与传统重放缓冲区方法媲美，且优于现有增量学习方法。此外，现实实验证实IRRL能够通过真实学习有效地使机器人适应前所未见的环境。
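
The two ingredients named in the abstract, a residual on top of a base policy and updates that use no replay buffer or batches, can be sketched with a one-step actor-critic on a 1-D toy problem. This is an illustrative stand-in under strong assumptions (linear Gaussian residual, linear critic, invented learning rates and base policy), not the IRRL algorithm itself.

```python
import random


class IncrementalResidualAgent:
    """1-D linear Gaussian residual policy on top of a fixed base policy."""

    def __init__(self, base_policy, lr_actor=0.01, lr_critic=0.1,
                 gamma=0.9, sigma=0.1):
        self.base_policy = base_policy
        self.lr_actor, self.lr_critic = lr_actor, lr_critic
        self.gamma, self.sigma = gamma, sigma
        self.w = 0.0  # residual weight; zero means "act like the base policy"
        self.v = 0.0  # critic weight, V(s) = v * s

    def mean_action(self, state):
        # Final action mean = base action + learned residual.
        return self.base_policy(state) + self.w * state

    def act(self, state):
        return random.gauss(self.mean_action(state), self.sigma)

    def update(self, state, action, reward, next_state, done):
        """One-step actor-critic update from the current transition only."""
        bootstrap = 0.0 if done else self.gamma * self.v * next_state
        td_error = reward + bootstrap - self.v * state
        # Score function of the Gaussian policy with respect to w.
        score = (action - self.mean_action(state)) / self.sigma ** 2 * state
        self.v += self.lr_critic * td_error * state
        self.w += self.lr_actor * td_error * score


# Hypothetical base controller and a single transition.
agent = IncrementalResidualAgent(base_policy=lambda s: -0.5 * s)
initial_mean = agent.mean_action(2.0)  # equals the base action before learning
agent.update(state=1.0, action=0.0, reward=1.0, next_state=0.0, done=True)
```

Each transition is consumed once and then discarded, so memory use stays constant, and starting the residual at zero means the robot behaves like its base controller until the data says otherwise.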

TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

TOOLCAD：基于强化学习探索使用工具的大型语言模型在文本到CAD生成中的应用

Authors: Yifei Gong, Xing Wu, Wenda Liu, Kang Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.07960
Pdf link: https://arxiv.org/pdf/2604.07960
Abstract Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.
中文摘要 计算机辅助设计（CAD）是一项专家级任务，依赖于长时程推理和连贯的建模动作。大型语言模型（LLM）在使语言智能体能够处理现实世界任务方面取得了显著进展。值得注意的是，目前尚无研究探讨使用工具的LLM如何以最优方式与CAD引擎交互，这阻碍了基于LLM的智能体式文本到CAD建模系统的出现。我们提出了ToolCAD，一种新颖的智能体式CAD框架，将LLM部署为使用工具的智能体来完成文本到CAD生成。此外，我们还引入了一个交互式CAD建模gym，用于与CAD引擎一起展开推理和工具增强的交互轨迹，并结合了混合反馈和人工监督。同时，我们提出了一种端到端的后训练策略，使LLM智能体能够产生精细的CAD建模思维链（CAD-CoT），并通过在线课程强化学习进化为熟练的CAD工具使用智能体。我们的结果表明，ToolCAD填补了为CAD工具使用智能体采用和训练开源LLM方面的空白，使其性能可与专有模型相当，为更易获取、更稳健的自主文本到CAD建模系统铺平了道路。

A Decomposition Perspective to Long-context Reasoning for LLMs

大型语言模型长上下文推理的分解视角

Authors: Yanling Xiao, Huaibing Xie, Guoliang Zhao, Shihan Dou, Shaolei Wang, Yiting Liu, Nantao Zheng, Cheng Zhang, Pluto Zhou, Zhisong Zhang, Lemao Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07981
Pdf link: https://arxiv.org/pdf/2604.07981
Abstract Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.
中文摘要 长上下文推理对于复杂的现实应用至关重要，但对大型语言模型（LLMs）来说仍是一个重大挑战。尽管长语境推理发展迅速，当前研究常常忽视长语境推理任务本身的复杂性。本文超越了这一整体视角，将长上下文推理分解为一组基础原子技能，然后自动合成一套伪数据集，每个数据集都明确针对特定原子技能。我们的实证分析证实，这些原子技能的熟练度与一般长文本推理表现高度相关。基于这一见解，我们对这些伪数据集采用强化学习，以提升模型的原子能力，希望提升其整体的长上下文推理能力。跨多个基准测试的大量实验证明了我们方法的有效性：在Loogle、Loong、LongBench-v2、BrowscompLong、Ruler-qa2和MRCR上，平均领先强基线7.7%（从46.3%提升至54.0%）。

PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

PriPG-RL：针对部分可观测系统的特权规划者引导强化学习，支持随时可行MPC

Authors: Mohsen Amiri, Mohsen Amiri, Ali Beikmohammadi, Sindri Magnússon, Mehdi Hosseinzadeh
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.08036
Pdf link: https://arxiv.org/pdf/2604.08036
Abstract This paper addresses the problem of training a reinforcement learning (RL) policy under partial observability by exploiting a privileged, anytime-feasible planner agent available exclusively during training. We formalize this as a Partially Observable Markov Decision Process (POMDP) in which a planner agent with access to an approximate dynamical model and privileged state information guides a learning agent that observes only a lossy projection of the true state. To realize this framework, we introduce an anytime-feasible Model Predictive Control (MPC) algorithm that serves as the planner agent. For the learning agent, we propose Planner-to-Policy Soft Actor-Critic (P2P-SAC), a method that distills the planner agent's privileged knowledge to mitigate partial observability and thereby improve both sample efficiency and final policy performance. We support this framework with rigorous theoretical analysis. Finally, we validate our approach in simulation using NVIDIA Isaac Lab and successfully deploy it on a real-world Unitree Go2 quadruped navigating complex, obstacle-rich environments.
中文摘要 本文研究在部分可观测条件下训练强化学习（RL）策略的问题，方法是利用一个仅在训练期间可用的、具有特权信息且随时可行的规划智能体。我们将其形式化为部分可观测马尔可夫决策过程（POMDP），其中规划智能体可以访问近似动力学模型和特权状态信息，并引导一个只能观察到真实状态有损投影的学习智能体。为实现该框架，我们引入了一种随时可行的模型预测控制（MPC）算法作为规划智能体。对于学习智能体，我们提出了Planner-to-Policy Soft Actor-Critic（P2P-SAC），该方法蒸馏规划智能体的特权知识以缓解部分可观测性，从而提升样本效率和最终策略性能。我们通过严格的理论分析支持这一框架。最后，我们使用NVIDIA Isaac Lab在仿真中验证了该方法，并将其成功部署到真实的Unitree Go2四足机器人上，使其在复杂且障碍物密集的环境中导航。

Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search

超越随机探索：什么使训练数据对智能体搜索有价值

Authors: Chuzhan Hao, Wenfeng Feng, Guochao Jiang, Guofeng Quan, Guohua Liu, Yuewei Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08124
Pdf link: https://arxiv.org/pdf/2604.08124
Abstract Reinforcement learning (RL) has become an effective approach for advancing the reasoning capabilities of large language models (LLMs) through the strategic integration of external search engines. However, current RL-based search agents often rely on a process of stochastic exploration guided by carefully crafted outcome rewards, leading to inefficient reasoning trajectories and unstable training. To address these issues, we propose a novel framework, Hierarchical Experience (HiExp), to enhance the performance and training stability of search agents. Specifically, we extract empirical knowledge through contrastive analysis and a multi-level clustering mechanism, transforming raw reasoning trajectories into hierarchical experience knowledge. By leveraging experience-aligned training, we effectively regularize stochastic exploration, evolving it into a strategic and experience-driven search process. Extensive evaluations on multiple complex agentic search and mathematical reasoning benchmarks demonstrate that our approach not only achieves substantial performance gains but also exhibits strong cross-task and cross-algorithm generalization.
中文摘要 强化学习（RL）已成为通过战略性集成外部搜索引擎，提升大型语言模型（LLMs）推理能力的有效方法。然而，当前基于强化学习的搜索代理常依赖于随机探索过程，并以精心设计的结果奖励为指导，导致推理轨迹效率低下且训练不稳定。为解决这些问题，我们提出了一种新框架——层级体验（HiExp），以提升搜索代理的性能和训练稳定性。具体来说，我们通过对比分析和多层次聚类机制提取经验知识，将原始推理轨迹转化为层级经验知识。通过利用经验对齐的培训，我们有效地规范了随机探索，将其演变成一个战略性和体验驱动的搜索流程。对多个复杂智能搜索和数学推理基准的广泛评估表明，我们的方法不仅实现了显著的性能提升，还展现出强大的跨任务和跨算法泛化能力。

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

ViVa：机器人强化学习的视频生成价值模型

Authors: Jindi Lv, Hao Li, Jie Li, Yifei Nie, Fankun Kong, Yang Wang, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08168
Pdf link: https://arxiv.org/pdf/2604.08168
Abstract Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of a pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, moving beyond static snapshots to intrinsically couple value with foresight. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals, accurately reflecting task progress. By leveraging spatiotemporal priors from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
中文摘要 视觉-语言-动作（VLA）模型通过大规模预训练推动了机器人操作的发展，但由于部分可观测性和反馈延迟，现实世界中的部署仍然具有挑战性。强化学习通过价值函数来应对这一问题，价值函数评估任务进展并指导策略改进。然而，基于视觉语言模型（VLM）构建的现有价值模型难以捕捉时间动态，削弱了长时程任务中价值估计的可靠性。本文提出了ViVa，一种视频生成式价值模型，将预训练的视频生成器改用于价值估计。ViVa以当前观察和机器人本体感觉为输入，联合预测未来的本体感觉和当前状态的标量价值。通过利用预训练视频生成器的时空先验，我们的方法将价值估计建立在预期的具身动态之上，超越静态快照，使价值与前瞻内在地耦合。集成到RECAP后，ViVa在现实世界的箱体组装任务上带来了显著改进。对全部三项任务的定性分析证实，ViVa能产生更可靠的价值信号，准确反映任务进展。通过利用来自视频语料的时空先验，ViVa还能泛化到新物体，凸显了视频生成模型在价值估计方面的前景。

Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

面向离线多智能体强化学习的价值引导MeanFlow

Authors: Teng Pang, Zhiqiang Dong, Yan Zhang, Rongjian Xu, Guoqiang Wu, Yilong Yin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08174
Pdf link: https://arxiv.org/pdf/2604.08174
Abstract Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, requiring a trade-off between maximizing global returns and mitigating distribution shift from offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, thereby reducing training and inference efficiency. Although further research improves sampling efficiency through methods like distillation, it remains sensitive to the behavior regularization coefficient. To address the above-mentioned issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent scenarios, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.
中文摘要 离线多智能体强化学习（MARL）旨在从预先收集的数据集中学习最优联合策略，需要在最大化全局收益与减少离线数据分布转移之间做出权衡。近期研究利用扩散或流生成模型捕捉代理间复杂的联合策略行为;然而，它们通常依赖多步迭代抽样，从而降低了训练和推断效率。尽管进一步研究通过蒸馏等方法提高了采样效率，但它仍然对行为正则化系数敏感。为解决上述问题，我们提出了价值指导多智能体均流策略（VGM$^2$P），这是一个简单但有效的基于流的策略学习框架，能够通过系数不敏感的条件行为克隆实现高效的动作生成。具体来说，VGM$^2$P 使用全局优势值指导代理协作，将最优策略学习视为条件行为克隆。此外，为了提升多智能体场景下的策略表达性和推理效率，它利用无分类器的MeanFlow指导工具进行策略训练和执行。在离散和连续动作空间任务上的实验表明，即使仅通过条件行为克隆训练，VGM$^2$P 也能高效实现与最先进方法相当的性能。

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

通过规划对齐代理：轨迹级奖励建模的基准

Authors: Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, Lan-Zhe Guo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08178
Pdf link: https://arxiv.org/pdf/2604.08178
Abstract In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges--most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.
中文摘要 在经典的人类反馈强化学习（RLHF）中，奖励模型（RM）是模型对齐的基本信号提供者。随着大型语言模型发展为能够自主调用工具和复杂推理的代理系统，奖励建模范式面临前所未有的挑战——最显著的是缺乏专门设计用于评估工具集成环境中RM能力的基准测试。为弥补这一空白，我们提出了Plan-RewardBench，这是一个轨迹级偏好基准，旨在评估法官在复杂工具使用场景中区分偏好与干扰剂轨迹的能力。Plan-RewardBench 涵盖了四大代表性任务族——（i）安全拒绝，（ii）工具无关性/不可用性，（iii）复杂规划，以及（iv）稳健错误恢复——包括通过多模型自然展开、基于规则的扰动和最小编辑的大型语言模型扰动构建的经过验证的正轨迹和易混淆的硬负。我们在统一的成对协议下对代表性的 RM（生成型、判别型和 LLM 作为评判者）进行基准测试，报告不同轨迹长度和任务类别的准确性趋势。此外，我们还提供了常见失效模式的诊断分析。我们的结果显示，这三大评估者家族都面临重大挑战，在长期轨迹上表现急剧下降，凸显了在能动性、轨迹级奖励建模方面的专业培训的必要性。最终，Plan-RewardBench旨在作为一个实用的评估套件和可重复使用的蓝图，用于构建代理规划偏好数据。

MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

MedVR：通过代理强化学习实现无注释的医学视觉推理

Authors: Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, Minfeng Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08203
Pdf link: https://arxiv.org/pdf/2604.08203
Abstract Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
中文摘要 医学视觉语言模型（VLM）在复杂的临床任务中具有巨大潜力，但其推理能力常受限于仅靠文本的范式，无法以视觉证据为基础推断。这一限制不仅限制了需要细致视觉分析任务的性能，还在安全关键应用中引入视觉幻觉的风险。因此，我们介绍了MedVR，一种新型强化学习框架，能够实现医疗VLM的无注释视觉推理。其核心创新在于两种协同机制：熵引导视觉再根基（EVR）利用模型不确定性引导探索，而基于共识的学分分配（CCA）则从推广协议中提炼出伪监督。在没有人工注释的中间步骤下，MedVR在多种公共医疗VQA基准测试中实现了最先进的性能，远超现有模型。通过学习直接用视觉证据进行推理，MedVR促进了加速医疗人工智能临床部署所需的稳健性和透明度。

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

OmniJigsaw：通过模态编排重排序增强全模态推理

Authors: Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi, Hao Chen, Bin Qin, Yongjie Yang, Zhenbo Luo, Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.08209
Pdf link: https://arxiv.org/pdf/2604.08209
Abstract To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
中文摘要 为了将强化学习的训练后范式扩展到全模态模型，以同时增强视听理解和协作推理，我们提出了OmniJigsaw，这是一个基于时间重序代理任务构建的通用自监督框架。该范式以对随机音视频片段的时间顺序重建为中心，通过三种不同策略——联合模态整合、样本级模态选择和剪辑级模态掩蔽——策略性地协调视觉和听觉信号，推动跨模态整合。认识到此类代理任务的有效性与谜题质量密切相关，我们设计了一个两阶段的粗细数据过滤流程，促进了OmniJigsaw对大量无注释全模态数据的高效适应。我们的分析揭示了联合模态整合中的“双模态捷径现象”，并证明细粒度的剪辑级模态掩蔽能缓解该问题，同时优于样本层模态选择。对15个基准的广泛评估显示，视频、音频和协作推理方面取得了显著进步，验证了OmniJigsaw作为自监督全模态学习可扩展范式的潜力。

HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

HiRO-Nav：混合推理实现高效的具身导航

Authors: He Zhao, Yijun Yang, Zichuan Lin, Deheng Ye, Chunyan Miao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08232
Pdf link: https://arxiv.org/pdf/2604.08232
Abstract Embodied navigation agents built upon large reasoning models (LRMs) can handle complex, multimodal environmental input and perform grounded reasoning per step to improve sequential decision-making for long-horizon tasks. However, a critical question remains: how can the reasoning capabilities of LRMs be harnessed intelligently and efficiently for long-horizon navigation tasks? In simple scenes, agents are expected to act reflexively, while in complex ones they should engage in deliberate reasoning before [...]. To achieve this, we introduce the Hybrid ReasOning Navigation (HiRO-Nav) agent, the first kind of agent capable of adaptively determining whether to perform thinking at every step based on its own action entropy. Specifically, by examining how the agent's action entropy evolves over the navigation trajectories, we observed that only a small fraction of actions exhibit high entropy, and these actions often steer the agent toward novel scenes or critical objects. Furthermore, studying the relationship between action entropy and task completion (i.e., Q-value) reveals that improving high-entropy actions contributes more positively to task [...]. [...], we propose a tailored training pipeline comprising hybrid supervised fine-tuning as a cold start, followed by online reinforcement learning with the proposed hybrid reasoning strategy to explicitly activate reasoning only for high-entropy actions, significantly reducing computational overhead while improving decision quality. Extensive experiments on the CHORES-$\mathbb{S}$ ObjectNav benchmark show that HiRO-Nav achieves a better trade-off between success rates and token efficiency than both dense-thinking and no-thinking baselines.
中文摘要 基于大型推理模型（LRM）构建的具身导航智能体能够处理复杂的多模态环境输入，并在每一步进行有依据的推理，以改善长时程任务中的序列决策。然而，一个关键问题依然存在：如何智能且高效地利用LRM的推理能力来完成长时程导航任务？在简单场景中，智能体应当反射式地行动，而在复杂场景中则应先进行审慎推理 [...]。为此，我们提出了 Hybrid ReasOning Navigation（HiRO-Nav）智能体，这是首个能够根据自身动作熵在每一步自适应决定是否进行思考的智能体。具体来说，通过考察智能体的动作熵在导航轨迹上的演变，我们观察到只有一小部分动作表现出高熵，而这些动作往往会将智能体引向新场景或关键物体。此外，对动作熵与任务完成度（即Q值）之间关系的研究表明，改进高熵动作对任务 [...] 的贡献更大。[...]，我们提出了一种定制的训练流程：先以混合监督微调作为冷启动，再进行在线强化学习，并采用所提出的混合推理策略，仅对高熵动作显式激活推理，从而在提升决策质量的同时显著降低计算开销。在 CHORES-$\mathbb{S}$ ObjectNav 基准上的大量实验表明，与密集思考和无思考基线相比，HiRO-Nav 在成功率与 token 效率之间取得了更好的权衡。
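
Illustrative note (not from the paper): the abstract describes gating the reasoning step on the agent's own action entropy. A minimal sketch of that idea is given here, assuming a discrete action distribution and a fixed cutoff. The function names and the `threshold` value are hypothetical, and the authors' actual criterion may differ.

```python
import math


def action_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_reason(probs, threshold=0.5):
    """Return True when the action distribution is uncertain enough
    to justify an explicit reasoning step.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    return action_entropy(probs) > threshold


# A peaked distribution (act reflexively) vs. a flat one (think first).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(action_entropy(dist), 3), should_reason(dist))
```

Under this sketch only high-entropy steps pay the token cost of reasoning, which is the trade-off the abstract reports against dense-thinking and no-thinking baselines.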

Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Fundus-R1：在公共数据上以知识感知推理训练能阅读眼底的MLLM

Authors: Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, Xirong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.08322
Pdf link: https://arxiv.org/pdf/2604.08322
Abstract Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
中文摘要 眼底成像如CFP、OCT和UWF对于早期发现视网膜异常和疾病至关重要。由于其知识密集型，眼底图像理解是一项具有挑战性的视觉语言任务。解决该任务的新兴方法是对通用多模态大型语言模型（MLLM）进行后期训练，方法包括监督微调（SFT）或可验证奖励强化学习（RLVR），并结合大量内部样本并配合高质量临床报告。然而，这些有价值的样本并未公开，这不仅阻碍了复现性，也实际上限制了研究对象的范围。为克服这一障碍，我们尝试使用纯公开数据集训练一种推理增强的眼底阅读MLLM，称为Fundus-R1，其中超过94%的数据仅用图像级标签标注。我们的技术贡献有两个方面。首先，我们提出一种基于RAG的方法，用于合成图像特定、知识感知的推理痕迹。这种自动生成的痕迹将通用MLLM识别的视觉发现与眼科知识上的图像标签联系起来。其次，我们通过流程奖励来增强RLVR，鼓励每次推广中生成的推理追踪的自洽性。对三种眼底读取基准测试（FunBench、Omni-Fundus和GMAI-Fundus）的广泛实验显示，Fundus-R1明显优于多个基线，包括其通用对应基准（Qwen2.5-VL）以及未使用生成痕迹后训练的更强版本。这项工作为利用公开数据训练强力眼底读数MLM铺平了道路。

ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

ProMedical：通过显式注入进行医学LLM对齐的层级细粒度标准建模

Authors: He Geng, Yangmin Huang, Lixian Lai, Qianyun Du, Hui Chu, Zhiyang He, Jiaxue Hu, Xiaodong Tao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08326
Pdf link: https://arxiv.org/pdf/2604.08326
Abstract Aligning Large Language Models (LLMs) with high-stakes medical standards remains a significant challenge, primarily due to the dissonance between coarse-grained preference signals and the complex, multi-dimensional nature of clinical protocols. To bridge this gap, we introduce ProMedical, a unified alignment framework grounded in fine-grained clinical criteria. We first construct ProMedical-Preference-50k, a dataset generated via a human-in-the-loop pipeline that augments medical instructions with rigorous, physician-derived rubrics. Leveraging this corpus, we propose the Explicit Criteria Injection paradigm to train a multi-dimensional reward model. Unlike traditional scalar reward models, our approach explicitly disentangles safety constraints from general proficiency, enabling precise guidance during reinforcement learning. To rigorously validate this framework, we establish ProMedical-Bench, a held-out evaluation suite anchored by double-blind expert adjudication. Empirical evaluations demonstrate that optimizing the Qwen3-8B base model via ProMedical-RM-guided GRPO yields substantial gains, improving overall accuracy by 22.3% and safety compliance by 21.7%, effectively rivaling proprietary frontier models. Furthermore, the aligned policy generalizes robustly to external benchmarks, demonstrating performance comparable to state-of-the-art models on UltraMedical. We publicly release our datasets, reward models, and benchmarks to facilitate reproducible research in safety-aware medical alignment.
中文摘要 将大型语言模型（LLMs）与高风险医疗标准对齐仍是一项重大挑战，主要原因是粗粒度偏好信号与临床方案复杂多维性之间的不协调。为弥合这一差距，我们引入了ProMedical，一个基于细致临床标准的统一对齐框架。我们首先构建了ProMedical-Preference-50k数据集，这是一个通过人机流程生成的数据集，通过严格的医生来源评分标准来补充医疗指示。利用该语料库，我们提出了显式标准注入范式，用于训练一个多维奖励模型。与传统的标量奖励模型不同，我们的方法明确将安全约束与一般熟练度分离，实现强化学习中的精准指导。为严格验证该框架，我们建立了ProMedical-Bench，一套以双盲专家裁决为基础的评估套件。实证评估表明，通过ProMedical-RM引导GRPO优化Qwen3-8B基础模型可显著提升，整体准确率提升22.3%，安全合规性提升21.7%，有效与专有前沿模型媲美。此外，该对齐政策能够稳健地推广到外部基准，展现出与UltraMedical上最先进模型相当的性能。我们公开发布数据集、奖励模型和基准，以促进安全意识医学对齐的可重复性研究。

ASPECT: Analogical Semantic Policy Execution via Language Conditioned Transfer

ASPECT：通过语言条件传输执行类比语义策略

Authors: Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08355
Pdf link: https://arxiv.org/pdf/2604.08355
Abstract Reinforcement Learning (RL) agents often struggle to generalize knowledge to new tasks, even those structurally similar to ones they have mastered. Although recent approaches have attempted to mitigate this issue via zero-shot transfer, they are often constrained by predefined, discrete class systems, limiting their adaptability to novel or compositional task variations. We propose a significantly more generalized approach, replacing discrete latent variables with natural language conditioning via a text-conditioned Variational Autoencoder (VAE). Our core innovation utilizes a Large Language Model (LLM) as a dynamic semantic operator at test time. Rather than relying on rigid rules, our agent queries the LLM to semantically remap the description of the current observation to align with the source task. This source-aligned caption conditions the VAE to generate an imagined state compatible with the agent's original training, enabling direct policy reuse. By harnessing the flexible reasoning capabilities of LLMs, our approach achieves zero-shot transfer across a broad spectrum of complex and truly novel analogous tasks, moving beyond the limitations of fixed category mappings. Code and videos are available at this https URL.
中文摘要 强化学习（RL）智能体常常难以将知识泛化到新任务，即使这些任务在结构上与其已掌握的任务相似。尽管近期方法尝试通过零样本迁移来缓解这一问题，但它们往往受限于预定义的离散类别体系，难以适应新颖或组合式的任务变化。我们提出了一种通用性显著更强的方法，通过文本条件变分自编码器（VAE），用自然语言条件替代离散潜变量。我们的核心创新是在测试时将大型语言模型（LLM）用作动态的语义算子。智能体不依赖僵化的规则，而是查询LLM，对当前观测的描述进行语义重映射，使其与源任务对齐。这一与源任务对齐的描述作为VAE的条件，生成与智能体原始训练相兼容的想象状态，从而实现策略的直接复用。借助LLM灵活的推理能力，我们的方法在大量复杂且真正新颖的类比任务上实现了零样本迁移，突破了固定类别映射的局限。代码和视频见 this https URL。

Synthetic Data for any Differentiable Target

任意可微目标的合成数据

Authors: Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Christopher Potts, Tatsunori Hashimoto
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.08423
Pdf link: https://arxiv.org/pdf/2604.08423
Abstract What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
中文摘要 通过合成训练数据控制语言模型的极限在哪里？我们开发了一个强化学习（RL）原语，即数据集策略梯度（Dataset Policy Gradient，DPG），它能够精确优化合成数据生成器，使其生成由针对性样本组成的数据集。当这些样本用于目标模型的监督微调（SFT）时，会使目标模型在我们选定的可微指标上表现良好。我们的方法通过高阶梯度获得精确的数据归因，并将这些分数用作策略梯度奖励来实现这一点。我们证明，该过程能够很好地近似合成数据生成器真实但难以计算的梯度。为了说明DPG的潜力，我们展示了仅通过在生成样本上进行SFT，就能使目标模型的LM头权重（1）嵌入二维码，（2）嵌入模式$\texttt{67}$，以及（3）具有更低的$\ell^2$范数。我们还展示了可以使生成器（4）用一种新语言改写输入，以及（5）生成特定的UUID，尽管生成器的输入提示中并未传达这两个目标。这些发现表明，DPG是一种强大而灵活的技术，仅凭合成训练样本即可塑造模型属性。

NL-CPS: Reinforcement Learning-Based Kubernetes Control Plane Placement in Multi-Region Clusters

NL-CPS：基于强化学习的Kubernetes控制平面在多区域集群中的布局

Authors: Sajid Alam, Amjad Ullah, Ze Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2604.08434
Pdf link: https://arxiv.org/pdf/2604.08434
Abstract The placement of Kubernetes control-plane nodes is critical to ensuring cluster reliability, scalability, and performance, and therefore represents a significant deployment challenge in heterogeneous, multi-region environments. Existing initialisation procedures typically select control-plane hosts arbitrarily, without considering node resource capacity or network topology, often leading to suboptimal cluster performance and reduced resilience. Given Kubernetes's status as the de facto standard for container orchestration, there is a need to rigorously evaluate how control-plane node placement influences the overall performance of the cluster operating across multiple regions. This paper advances this goal by introducing an intelligent methodology for selecting control-plane node placement across dynamically selected Cloud-Edge resources spanning multiple regions, as part of an automated orchestration system. More specifically, we propose a reinforcement learning framework based on neural contextual bandits that observes operational performance and learns optimal control-plane placement policies from infrastructure characteristics. Experimental evaluation across several geographically distributed regions and multiple cluster configurations demonstrates substantial performance improvements over several baseline approaches.
中文摘要 Kubernetes 控制平面节点的放置对于确保集群的可靠性、可扩展性和性能至关重要，因此在异构多区域环境中是一项重大的部署挑战。现有初始化流程通常任意选择控制平面主机，不考虑节点资源容量或网络拓扑，往往导致集群性能欠佳、韧性降低。鉴于 Kubernetes 已成为容器编排的事实标准，有必要严格评估控制平面节点的放置如何影响跨多个区域运行的集群的整体性能。本文为此提出一种智能方法，作为自动化编排系统的一部分，在跨多个区域动态选取的云边（Cloud-Edge）资源上选择控制平面节点的放置位置。更具体地说，我们提出了一个基于神经上下文老虎机（neural contextual bandits）的强化学习框架，它观察运行性能，并根据基础设施特征学习最优的控制平面放置策略。在多个地理分布区域和多种集群配置上的实验评估表明，相较于若干基线方法，性能有显著提升。

Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks

少近似更多：通过混合后期培训协调表现与自信忠实度，以应对高风险任务

Authors: Haokai Ma, Lee Yan Zhen, Gang Yang, Yunshan Ma, Ee-Chien Chang, Tat-Seng Chua
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08454
Pdf link: https://arxiv.org/pdf/2604.08454
Abstract Large language models are increasingly deployed in high-stakes tasks, where confident yet incorrect inferences may cause severe real-world harm, bringing the previously overlooked issue of confidence faithfulness back to the forefront. A promising solution is to jointly optimize unsupervised Reinforcement Learning from Internal Feedback (RLIF) with reasoning-trace-guided Reasoning Distillation (RD), which may face three persistent challenges: scarcity of high-quality training corpora, factually unwarranted overconfidence and indiscriminate fusion that amplifies erroneous updates. Inspired by the human confidence accumulation from uncertainty to certainty, we propose Progressive Reasoning Gain (PRG) to measure whether reasoning steps progressively strengthen support for the final answer. Furthermore, we introduce HyTuning, a hybrid post-training framework that adaptively reweights RD and RLIF via a PRG-style metric, using scarce supervised reasoning traces as a stable anchor while exploiting abundant unlabeled queries for scalability. Experiments on several domain-specific and general benchmarks demonstrate that HyTuning improves accuracy while achieving confidence faithfulness under limited supervision, supporting a practical "Less Approximates More" effect.
中文摘要 大型语言模型越来越多地被用于高风险任务，在这些任务中，自信但错误的推理可能对现实世界造成严重伤害，这也让此前被忽视的信心忠实性问题重新浮上了视野。一个有前景的解决方案是联合优化无监督的内部反馈强化学习（RLIF）与推理追踪引导的推理蒸馏（RD），这可能面临三个持续挑战：高质量训练语料库的稀缺、事实错误的过度自信以及加剧错误更新的无差别融合。受人类信心从不确定性积累到确定性的启发，我们提出了渐进推理增益（PRG）方法，用以衡量推理步骤是否逐渐增强对最终答案的支持。此外，我们引入了HyTuning，一种混合式后训练框架，通过PRG风格的指标自适应地重新加权RD和RLIF，利用稀缺的监督推理轨迹作为稳定锚点，同时利用大量无标签查询实现可扩展性。在多个特定领域和通用基准测试上的实验表明，HyTuning 在有限监督下提升准确率并实现置信忠实度，支持了“少近似多”的实用效果。

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

TTVS：通过测试时变分综合提升自我探索强化学习

Authors: Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, Song Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08468
Pdf link: https://arxiv.org/pdf/2604.08468
Abstract Despite significant advances in Large Reasoning Models (LRMs) driven by reinforcement learning with verifiable rewards (RLVR), this paradigm is fundamentally limited in specialized or novel domains where such supervision is prohibitively expensive or unavailable, posing a key challenge for test-time adaptation. While existing test-time methods offer a potential solution, they are constrained by learning from static query sets, risking overfitting to textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables LRMs to self-evolve by dynamically augmenting the training stream from unlabeled test queries. TTVS comprises two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse, semantically-equivalent variations, enforcing the model to learn underlying problem logic rather than superficial patterns; (2) Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration across synthetic variants. Extensive experiments show TTVS yields superior performance across eight model architectures. Notably, using only unlabeled test-time data, TTVS not only surpasses other test-time adaptation methods but also outperforms state-of-the-art supervised RL-based techniques trained on vast, high-quality labeled data.
中文摘要 尽管由可验证奖励强化学习（RLVR）驱动的大型推理模型（LRM）取得了重大进展，但该范式在专业化或新颖领域中存在根本限制，这些领域监督成本高昂或难以获得，这对测试时间适应构成了关键挑战。虽然现有的测试时方法提供了潜在的解决方案，但它们受限于从静态查询集学习，存在对文本模式过度拟合的风险。为弥补这一空白，我们引入了测试时间变分综合（TTVS），这是一种新颖框架，使LRM能够通过动态增强未标记测试查询的训练流实现自我演化。TTVS包含两个协同模块：（1）在线变分综合，将静态测试查询转化为动态的多样、语义等价变化流，强制模型学习底层问题逻辑而非表面模式;（2）测试阶段混合探索，平衡了基于准确性的利用与基于一致性的探索，跨合成变体。大量实验表明，TTVS在八种模型架构中表现更优。值得注意的是，仅使用未标记的测试时间数据，TTVS不仅超越了其他测试时间适应方法，还优于基于大量高质量标记数据训练的先进监督强化学习技术。

LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

LAMP：提升图像编辑作为开放世界操作的通用3D先验

Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.08475
Pdf link: https://arxiv.org/pdf/2604.08475
Abstract Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: this https URL.
中文摘要 在开放世界中实现类人的泛化仍然是机器人操作面临的根本挑战。现有基于学习的方法，包括强化学习、模仿学习和视觉-语言-动作模型（VLA），常常难以应对新任务和未见过的环境。另一个有前景的方向是探索可泛化的表示，以捕捉开放世界操作所需的细粒度空间和几何关系。虽然大型语言模型（LLM）和视觉语言模型（VLM）能够基于语言或带标注的二维表示提供强大的语义推理，但它们有限的三维感知能力限制了其在细粒度操作中的适用性。为此，我们提出了LAMP，它将图像编辑提升为三维先验，把物体间的三维变换提取为连续的、几何感知的表示。我们的关键见解是，图像编辑本身就编码了丰富的二维空间线索，将这些隐式线索提升为三维变换，可为开放世界操作提供细粒度且准确的指导。大量实验表明，LAMP能够给出精确的三维变换，并在开放世界操作中实现了强大的零样本泛化。项目页面：this https URL。

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

忠实GRPO：通过受限策略优化提升多模态语言模型中的视觉空间推理

Authors: Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08476
Pdf link: https://arxiv.org/pdf/2604.08476
Abstract Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
中文摘要 多模态推理模型（MRM）通过可验证奖励强化学习（RLVR）训练，在视觉推理基准测试中准确率有所提升。然而，我们观察到准确率的提升往往是以推理质量为代价的：生成的思维链（CoT）痕迹常与最终答案不一致，且缺乏视觉证据的基础。我们系统地研究了这一现象，涵盖七个具有挑战性的现实世界空间推理基准，发现它影响了当代的MRM，如ViGoRL-Spatial、TreeVGR，以及我们用标准组相对策略优化（GRPO）训练的模型。我们从两个互补轴线来描述CoT的推理质量：“逻辑一致性”（CoT是否意味着最终答案？）和“视觉基础”（每个推理步骤是否准确描述图像中的物体、属性和空间关系？）。为此，我们提出了忠实GRPO（FGRPO），这是一种GRPO的变体，通过拉格朗日对偶上升强制一致性和基准作为约束。FGRPO在群内优势计算中融入了批次层级的一致性和基础约束，在优化过程中自适应地调整约束的相对重要性。我们在七个空间数据集中评估了Qwen2.5-VL-7B和3B骨架上的FGRPO。我们的结果显示，FGRPO显著提升了推理质量，将不一致率从24.5%降至1.7%，视觉接地得分提升+13%。它还比简单的GRPO提高了最终答案的准确性，证明忠实推理能带来更好的答案。
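
Illustrative note (not from the paper): the abstract says FGRPO enforces consistency and grounding as constraints via Lagrangian dual ascent inside the group advantage computation. The sketch below shows one generic way such a scheme can be written, for a single constraint. The function names, the additive combination, the target level, and the learning rate are all assumptions made for illustration and should not be read as the authors' formulation.

```python
from statistics import mean, pstdev


def group_advantages(values):
    """GRPO-style normalization: center and scale values within one group."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + 1e-8) for v in values]


def constrained_advantages(task_rewards, constraint_scores, lam):
    """Add a multiplier-weighted constraint advantage to the task advantage.

    `lam` is the Lagrange multiplier for one constraint (for example a
    consistency score in [0, 1]); this additive form is an assumption.
    """
    a_task = group_advantages(task_rewards)
    a_con = group_advantages(constraint_scores)
    return [t + lam * c for t, c in zip(a_task, a_con)]


def dual_ascent_step(lam, constraint_scores, target, lr=0.1):
    """Raise `lam` while the batch mean falls short of `target`,
    lower it otherwise, and keep it non-negative."""
    violation = target - mean(constraint_scores)
    return max(0.0, lam + lr * violation)


# Hypothetical group of two rollouts: the first is correct but inconsistent.
rewards = [1.0, 0.0]
consistency = [0.0, 1.0]
lam = dual_ascent_step(0.0, consistency, target=0.9)
print(lam, constrained_advantages(rewards, consistency, lam))
```

The multiplier grows only while the constraint is violated, so the weight placed on consistency or grounding adapts during training, which is the behavior the abstract attributes to the method.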

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

超新星：在自然指令上通过强化学习引发大型语言模型中的一般推理

Authors: Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.08477
Pdf link: https://arxiv.org/pdf/2604.08477
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLM）在数学和代码等形式领域的推理能力。尽管取得了这些进步，LLMs在需要因果推理和时间理解等能力的一般推理任务上仍然存在困难。将RLVR扩展到一般推理根本上受限于缺乏高质量、可验证的跨越多种推理技能的训练数据。为应对这一挑战，我们提出了SUPERNOVA，这是一个用于增强一般推理能力的RLVR数据管理框架。我们的关键见解是，包含专家注释的基础真理的指令调优数据集编码了丰富的推理模式，可以系统地适应RLVR。为此，我们进行了100+受控强化学习实验，分析数据设计选择如何影响后续推理表现。特别研究三个关键因素：（i）来源任务选择，（ii）任务混合策略，以及（iii）提升数据质量的合成干预措施。我们的分析显示，源任务选择并非简单，且对后续推理表现有显著影响。此外，基于任务的表现选择任务，优于基于整体平均表现的策略。最后，在SUPERNOVA训练模型中，在包括BBEH、Zebralogic和MMLU-Pro等具有挑战性的推理基准测试中，表现优于强基线（如Qwen3.5）。特别是，在SUPERNOVA上训练时，BBEH在不同模型规模下相对提升了高达52.8%，展示了RLVR原则性数据策展的有效性。我们的发现为策划人工注释资源、将RLVR扩展到一般推理提供了实用见解。代码和数据可在该 https URL 访问。

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

Ads in AI Chatbots? An Analysis of How Large Language Models Handle Conflicts of Interest

Authors: Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2604.08525
Pdf link: https://arxiv.org/pdf/2604.08525
Abstract Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company's incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users' inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.
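Chinese abstract (translated) Today's large language models (LLMs) are trained to match user preferences through methods such as reinforcement learning. However, models are beginning to be deployed not only to satisfy users but also to generate advertising revenue for the companies that build them. This can place LLMs in conflicts of interest, where the response most beneficial to the user may not match the company's incentives. For example, a sponsored product may be more expensive but otherwise equivalent to another; in that case, what does (and should) the LLM recommend to the user? This paper presents a framework, inspired by the literature on linguistics and advertising regulation, for categorising the ways in which conflicting incentives may lead LLMs to change how they interact with users. We then present a suite of evaluations examining how current models handle these tradeoffs. We find that most LLMs give up user welfare in favour of company incentives across many conflict-of-interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavourable comparisons (Qwen 3 Next, 24%). Behaviour also varies strongly with the level of reasoning and the user's inferred socio-economic status. Our results highlight some of the hidden risks users may face when companies begin to subtly incentivise advertisements in chatbots.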

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Authors: Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng, Kai-Wei Chang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.08539
Pdf link: https://arxiv.org/pdf/2604.08539
Abstract Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
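Chinese abstract (translated) Group Relative Policy Optimization (GRPO) has become the de facto reinforcement learning (RL) objective behind recent progress in multimodal large language models. However, extending this success to open-source multimodal generalist models remains heavily limited by two main challenges: the extreme variation in reward topologies across different visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any task to converge strictly to the standard normal distribution $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures gradient equity across tasks, reduces vulnerability to heavy-tailed outliers, and provides symmetric updates for positive and negative rewards. Building on the improved training stability of G$^2$RPO, we introduce two task-level shaping mechanisms to balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to strengthen visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Combining these methods, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluation on 18 diverse benchmarks shows that it outperforms strong open-source models and leading proprietary frontier models.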
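
To make the contrast between "standard linear scaling" and "non-linear distributional matching" concrete, here is a minimal sketch. It is not the paper's G$^2$RPO objective, whose exact form the abstract does not give. It assumes one simple way to push a group's advantages towards $\mathcal{N}(0,1)$: replace each reward by the standard-normal quantile of its rank. Both function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def zscore_advantages(rewards):
    """GRPO-style linear scaling: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def gaussian_rank_advantages(rewards):
    """Assumed non-linear alternative: rank-based Gaussianization.

    Each reward is mapped to the inverse standard-normal CDF of its
    mid-rank quantile, so only the ordering of rewards matters.
    """
    n = len(rewards)
    std_normal = NormalDist(0.0, 1.0)
    order = sorted(range(n), key=lambda i: rewards[i])
    advantages = [0.0] * n
    start = 0
    while start < n:
        # Extend over a run of tied rewards so they share one averaged rank.
        end = start
        while end + 1 < n and rewards[order[end + 1]] == rewards[order[start]]:
            end += 1
        mid_rank = (start + end) / 2.0      # 0-based average rank of the run
        quantile = (mid_rank + 0.5) / n     # always strictly inside (0, 1)
        value = std_normal.inv_cdf(quantile)
        for k in range(start, end + 1):
            advantages[order[k]] = value
        start = end + 1
    return advantages


if __name__ == "__main__":
    group = [0.1, 0.2, 0.3, 100.0]  # one heavy-tailed outlier
    print(zscore_advantages(group))
    print(gaussian_rank_advantages(group))
```

With linear scaling, the single outlier dominates and the other three advantages collapse to nearly the same value. The rank-based mapping gives the same advantages whether the top reward is 0.4 or 100.0, which is one way to read the abstract's claim about robustness to heavy-tailed outliers.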

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Authors: Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.08545
Pdf link: https://arxiv.org/pdf/2604.08545
Abstract The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
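Chinese abstract (translated) The emergence of agentic multimodal models has enabled systems to interact actively with external environments. However, current agents have a serious meta-cognitive deficit: they struggle to arbitrate between using internal knowledge and querying external utilities. As a result, they often fall into blind tool invocation, resorting to reflexive tool execution even when the query can be resolved from the raw visual context. This pathological behaviour causes severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols try to mitigate this with a scalarized reward that penalizes tool use. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely swamped by the variance of the accuracy reward during advantage normalization, leaving it ineffective against tool overuse. To overcome this bottleneck, we propose HDPO, a framework that recasts tool efficiency from a competing scalar objective into a strictly conditional one. By avoiding reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy only within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations show that the resulting model, Metis, reduces tool invocations by orders of magnitude while also improving reasoning accuracy.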
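
The idea of an efficiency signal computed "exclusively within accurate trajectories" can be sketched as follows. This is an illustration under stated assumptions, not HDPO itself: the abstract does not specify the estimator, so the sketch assumes group-wise z-score normalization for both channels and uses the negated tool-call count as the efficiency reward. All names are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values):
    """Group-wise z-score; returns zeros when the group has no variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one group.

    correct:    list of bools, one per sampled trajectory
    tool_calls: list of ints, number of tool invocations per trajectory
    """
    # Accuracy channel: normalized over every trajectory in the group.
    accuracy_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: normalized only among the correct trajectories,
    # so an incorrect answer is never rewarded for skipping tools.
    efficiency_adv = [0.0] * len(correct)
    correct_idx = [i for i, c in enumerate(correct) if c]
    if len(correct_idx) >= 2:
        # Fewer tool calls gives a higher (less negative) efficiency reward.
        scores = normalize([-float(tool_calls[i]) for i in correct_idx])
        for i, score in zip(correct_idx, scores):
            efficiency_adv[i] = score
    return accuracy_adv, efficiency_adv


if __name__ == "__main__":
    acc, eff = decoupled_advantages(
        correct=[True, True, False, False],
        tool_calls=[0, 4, 0, 9],
    )
    print(acc)
    print(eff)
```

In the example, the third trajectory uses no tools but is wrong, so its efficiency advantage stays at zero. Only the two correct trajectories are compared on tool use. A scalarized reward would instead have mixed the tool penalty into the same normalization as accuracy, which is the coupling the abstract argues against.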

Keyword: diffusion policy

There is no result