Arxiv Papers of Today

Generated: 2026-04-17 17:21:19 (UTC+8); arXiv release time: 2026-04-17 20:00 EDT (2026-04-18 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Authors: Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen, Jianwei Yin, Xuhong Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14258
Pdf link: https://arxiv.org/pdf/2604.14258
Abstract Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
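
The two ideas named in the GFT abstract, group-normalized advantages and bounded inverse-probability weights, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the z-score normalization, and the cap `max_weight` are assumptions chosen only for illustration.

```python
import statistics

def group_advantages(rewards):
    """Z-score the rewards within one response group (assumed normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All responses scored the same: no contrastive signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def bounded_inverse_weight(prob, max_weight=10.0):
    """Inverse-probability weight 1/p, capped at a hypothetical bound."""
    return min(1.0 / max(prob, 1e-12), max_weight)

# Four sampled responses to one prompt, with assumed binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Low-probability tokens would otherwise receive very large weights.
weights = [bounded_inverse_weight(p) for p in [0.5, 0.01, 0.2, 0.001]]
```

In this toy setting the cap only affects the two low-probability entries, which is the sense in which bounding the weights can stabilize the update without changing it for well-supported tokens.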

Reinforcement Learning via Value Gradient Flow

Authors: Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14265
Pdf link: https://arxiv.org/pdf/2604.14265
Abstract We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradients, which are difficult to scale to large generative models, or on rejection sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, which enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks. Code and runs can be found at this https URL.
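
The idea of moving particles from a reference distribution along value gradients under a limited budget can be pictured with a toy one-dimensional sketch. Everything here is an assumption for illustration: the quadratic value function, the step size, and the use of a step count as the "transport budget". It is not the VGF algorithm itself.

```python
import random

def value_gradient(action, target=2.0):
    """Gradient of the toy value Q(a) = -(a - target)**2 (an assumed stand-in)."""
    return -2.0 * (action - target)

def transport(particles, steps, step_size=0.1):
    """Move each particle along the value gradient for a fixed number of steps."""
    moved = list(particles)
    for _ in range(steps):
        moved = [a + step_size * value_gradient(a) for a in moved]
    return moved

def mean(xs):
    return sum(xs) / len(xs)

random.seed(0)
# Particles drawn from a hypothetical reference (behavior) distribution.
reference = [random.gauss(0.0, 0.5) for _ in range(1000)]

# A small budget stays near the reference; a large one approaches the value optimum.
small_budget = transport(reference, steps=2)
large_budget = transport(reference, steps=20)
```

With these settings each step shrinks the distance to the optimum by a factor of 0.8, so the number of steps directly controls how far the particles drift from the reference samples.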

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

Authors: Junzhe Wang, Zhiheng Xi, yajie yang, Hao Luo, Shihan Dou, Tao Gui, Qi Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14267
Pdf link: https://arxiv.org/pdf/2604.14267
Abstract Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agents, existing approaches face key limitations: process supervision often suffers from unstable value estimation, whereas outcome supervision struggles with credit assignment due to sparse, trajectory-level rewards. To bridge this gap, we propose Contribution-Weighted GRPO (CW-GRPO), a framework that integrates process supervision into group relative policy optimization. Instead of directly optimizing process rewards, CW-GRPO employs an LLM judge to assess the retrieval utility and reasoning correctness at each search round, producing per-round contribution scores. These scores are used to rescale outcome-based advantages along the trajectory, enabling fine-grained credit assignment without sacrificing optimization stability. Experiments on multiple knowledge-intensive benchmarks show that CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, leading to more effective search behaviors. Additional analysis reveals that successful trajectories exhibit concentrated contributions across rounds, providing empirical insight into search agent tasks.
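
Rescaling one trajectory-level advantage by per-round contribution scores can be sketched as follows. The normalization used here (weights that average to 1 across rounds) and the function name are assumptions; the abstract does not specify the exact rescaling rule.

```python
def rescale_advantage(outcome_advantage, contribution_scores):
    """Spread one trajectory-level advantage over rounds, weighted by contribution."""
    n = len(contribution_scores)
    total = sum(contribution_scores)
    if total == 0:
        # No round was judged useful: fall back to uniform weighting.
        return [outcome_advantage] * n
    # Weights average to 1, so the mean per-round advantage is unchanged.
    return [outcome_advantage * n * s / total for s in contribution_scores]

# Hypothetical judge scores for three search rounds of one trajectory.
per_round = rescale_advantage(0.8, [0.1, 0.7, 0.2])
```

Keeping the mean equal to the original outcome advantage is one simple way to add per-round credit without changing the overall scale of the update.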

Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

Authors: Li-Hsiang Shen, Yu-Quan Zheng
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2604.14309
Pdf link: https://arxiv.org/pdf/2604.14309
Abstract To address high data traffic demands of sixth-generation (6G) networks, this paper proposes a novel architecture that integrates autonomous aerial vehicles (AAVs) and multi-functional reconfigurable intelligent surfaces (MF-RISs) as AM-RIS in fluid antenna (FA)-assisted full-duplex (FD) networks. The AM-RIS provides hybrid functionalities, including signal reflection, amplification, and energy harvesting (EH), potentially improving both signal coverage and sustainability. Meanwhile, FA facilitates fine-grained spatial adaptability at FD-enabled base station (BS), which complements residual self-interference (SI) suppression. We aim at maximizing the overall energy efficiency (EE) by jointly optimizing transmit DL beamforming at BS, UL user power, configuration of AM-RIS, and positions of the FA and AM-RIS. Owing to the hybrid continuous-discrete parameters and high dimensionality of the intractable problem, we have conceived a self-optimized multi-agent hybrid deep reinforcement learning (DRL) framework (SOHRL), which integrates multi-agent deep Q-networks (DQN) and multi-agent proximal policy optimization (PPO), respectively handling discrete and continuous actions. To enhance self-adaptability, an attention-driven state representation and meta-level hyperparameter optimization are incorporated, enabling multi-agents to autonomously adjust learning hyperparameters. Simulation results validate the effectiveness of the proposed AM-RIS-enabled FA-aided FD networks empowered by SOHRL algorithm. The results reveal that SOHRL outperforms benchmarks of the case without attention mechanism and conventional hybrid/multi-agent/standalone DRL. Moreover, AM-RIS in FD achieves the highest EE compared to half-duplex, conventional rigid antenna arrays, partial EH, and conventional RIS without amplification, highlighting its potential as a compelling solution for EE-aware wireless networks.

When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse

Authors: Yuncong Liu, Yuan Wan, Zhou Jiang, Yao Lu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14333
Pdf link: https://arxiv.org/pdf/2604.14333
Abstract Key Opinion Leader (KOL) discourse on social media is widely consumed as investment guidance, yet turning it into executable trading strategies without injecting assumptions about unspecified execution decisions remains an open problem. We observe that the gaps in KOL statements are not random deficiencies but a structured separation: KOLs express directional intent (what to buy or sell and why) while leaving execution decisions (when, how much, how long) systematically unspecified. Building on this observation, we propose an intent-preserving policy completion framework that treats KOL discourse as a partial trading policy and uses offline reinforcement learning to complete the missing execution decisions around the KOL-expressed intent. Experiments on multimodal KOL discourse from YouTube and X (2022-2025) show that KICL achieves the best return and Sharpe ratio on both platforms while maintaining zero unsupported entries and zero directional reversals, and ablations confirm that the full framework yields an 18.9% return improvement over the KOL-aligned baseline.

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

Authors: Qi Zhang, Dawei Wang, Shaofeng Zou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.14379
Pdf link: https://arxiv.org/pdf/2604.14379
Abstract Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. We prove that this denoising-time objective is exactly equivalent to the step-level RL fine-tuning, introducing no approximation error. Moreover, we provide numerical results, which indicate our method outperforms existing denoising-time approaches.

On Tackling Complex Tasks with Reward Machines and Signal Temporal Logics

Authors: Ana María Gómez Ruiz (UGA), Thao Dang (VERIMAG - IMAG, CNRS, UGA), Alexandre Donzé
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14440
Pdf link: https://arxiv.org/pdf/2604.14440
Abstract We propose a Reinforcement Learning (RL) based control design framework for handling complex tasks. The approach extends the concept of Reward Machines (RM) with Signal Temporal Logic (STL) formulas that can be used for event generation. The use of STL not only allows a more efficient representation of rewards for complex tasks but also guides the training process to converge towards behaviors satisfying specified requirements. We also propose an implementation of the framework that leverages the STL online monitoring algorithms. We illustrate the framework with three case studies (minigrid, cart-pole and highway environments) with non-trivial tasks.
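
A reward machine whose transitions are driven by events from an online temporal-logic monitor can be sketched in a few lines. The monitor below checks a past-time bounded "always" condition in the spirit of STL; the class names, the two-state machine, and the signal values are all hypothetical and are not taken from the paper's implementation.

```python
from collections import deque

class AlwaysAbove:
    """Online monitor: was x > threshold over each of the last `window` samples?"""
    def __init__(self, threshold, window):
        self.threshold = threshold
        self.buffer = deque(maxlen=window)

    def update(self, x):
        self.buffer.append(x)
        full = len(self.buffer) == self.buffer.maxlen
        return full and all(v > self.threshold for v in self.buffer)

class RewardMachine:
    """Two-state machine: 'u0' -> 'u1' when the event fires, paying reward 1 once."""
    def __init__(self):
        self.state = "u0"

    def step(self, event):
        if self.state == "u0" and event:
            self.state = "u1"
            return 1.0
        return 0.0

monitor = AlwaysAbove(threshold=0.5, window=3)
machine = RewardMachine()
signal = [0.2, 0.6, 0.7, 0.4, 0.8, 0.9, 0.95, 0.3]

# Feed the signal sample by sample; the monitor's verdict is the machine's event.
rewards = [machine.step(monitor.update(x)) for x in signal]
```

The reward is paid only at the first sample where the signal has stayed above the threshold for three consecutive steps, which a plain per-step reward on `x > 0.5` would not express.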

Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

Authors: Saumik Narayanan, Raja Panjwani, Siddhartha Sen, Chien-Ju Ho
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14465
Pdf link: https://arxiv.org/pdf/2604.14465
Abstract AI systems are increasingly used to assist humans in sequential decision-making tasks, yet determining when and how an AI assistant should intervene remains a fundamental challenge. A potential baseline is to recommend the optimal action according to a strong model. However, such actions assume optimal follow-up actions, which human decision makers may fail to execute, potentially reducing overall performance. In this work, we propose and study value-aware interventions, motivated by a basic principle in reinforcement learning: under the Bellman equation, the optimal policy selects actions that maximize the immediate reward plus the value function. When a decision maker follows a suboptimal policy, this policy-value consistency no longer holds, creating discrepancies between the actions taken by the policy and those that maximize the immediate reward plus the value of the next state. We show that these policy-value inconsistencies naturally identify opportunities for intervention. We formalize this problem in a Markov decision process where an AI assistant may override human actions under an intervention budget. In the single-intervention regime, we show that the optimal strategy is to recommend the action that maximizes the human value function. For settings with multiple interventions, we propose a tractable approximation that prioritizes interventions based on the magnitude of the policy-value discrepancy. We evaluate these ideas in the domain of chess by learning models of humans from large-scale gameplay data. In simulation, our approach consistently outperforms interventions based on the strongest chess engine (Stockfish) in a wide range of settings. A within-subject human study with 20 players and 600 games further shows that our interventions significantly improve performance for low- and mid-skill players while matching expert-engine interventions for high-skill players.
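
The single-intervention rule described in the abstract, recommending the action that maximizes immediate reward plus the human's own value of the next state, can be shown on a toy example. The position, move names, and value numbers below are invented for illustration only.

```python
def best_action(actions, value):
    """Pick the action maximizing immediate reward plus the value of the next state."""
    return max(actions, key=lambda a: actions[a]["reward"] + value[actions[a]["next"]])

# Hypothetical position with two candidate moves and no immediate reward.
actions = {
    "sharp_sacrifice": {"reward": 0.0, "next": "complex"},
    "solid_move": {"reward": 0.0, "next": "quiet"},
}

# Assumed values: the engine plays the follow-up perfectly, the human may not.
engine_value = {"complex": 0.9, "quiet": 0.6}
human_value = {"complex": 0.3, "quiet": 0.55}

engine_choice = best_action(actions, engine_value)
value_aware_choice = best_action(actions, human_value)
```

Here the engine prefers the sacrifice because it assumes optimal follow-up play, while the value-aware rule prefers the quiet move because the human is expected to do better from that position.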

Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports

Authors: Qing Yan, Wenyu Yang, Yufei Wang, Wenhao Ma, Linchong Hu, Yifei Jin, Anton Dahbura
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14474
Pdf link: https://arxiv.org/pdf/2604.14474
Abstract Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to scouting "by reward," this framework provides a scalable, workflow-aware digital twin system that enables data-driven roster construction and targeted talent discovery across massive candidate pools.

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

Authors: Weixiang Shen, Bailiang Jian, Jun Li, Che Liu, Johannes Moll, Xiaobin Hu, Daniel Rueckert, Hongwei Bran Li, Jiazhen Pan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14475
Pdf link: https://arxiv.org/pdf/2604.14475
Abstract Tool-augmented large language model (LLM) agents can orchestrate specialist classifiers, segmentation models, and visual question-answering modules to interpret chest X-rays. However, these agents still solve each case in isolation: they fail to accumulate experience across cases, correct recurrent reasoning mistakes, or adapt their tool-use behavior without expensive reinforcement learning. While a radiologist naturally improves with every case, current agents remain static. In this work, we propose Evo-MedAgent, a self-evolving memory module that equips a medical agent with the capacity for inter-case learning at test time. Our memory comprises three complementary stores: (1) Retrospective Clinical Episodes that retrieve problem-solving experiences from similar past cases, (2) an Adaptive Procedural Heuristics bank curating priority-tagged diagnostic rules that evolves via reflection, much like a physician refining their internal criteria, and (3) a Tool Reliability Controller that tracks per-tool trustworthiness. On ChestAgentBench, Evo-MedAgent raises multiple-choice question (MCQ) accuracy from 0.68 to 0.79 on GPT-5-mini, and from 0.76 to 0.87 on Gemini-3 Flash. With a strong base model, evolving memory improves performance more effectively than orchestrating external tools on qualitative diagnostic tasks. Because Evo-MedAgent requires no training, its per-case overhead is bounded by one additional retrieval pass and a single reflection call, making it deployable on top of any frozen model.

MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Authors: Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu, Kaiyan Zhang, Dazhi Zhang, Yuqiang Li, Biqing Qi, Bowen Zhou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.14564
Pdf link: https://arxiv.org/pdf/2604.14564
Abstract Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but this exploration remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at this https URL.

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

Authors: Tomoya Kamimura, Haruka Washiyama, Akihito Sano
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.14565
Pdf link: https://arxiv.org/pdf/2604.14565
Abstract Embodiment is a significant keyword in recent machine learning fields. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

Authors: Sanidhya Vijayvargiya, Vijay Viswanathan, Graham Neubig
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14624
Pdf link: https://arxiv.org/pdf/2604.14624
Abstract Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions. However, effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide. We study clarification in real software engineering tasks by quantifying which types of information most affect task success and which questions elicit useful responses from simulated users. Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide). We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module, that matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions. Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.
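
Shapley attribution, which the abstract uses to quantify which information most affects task success, can be computed exactly for a small set of items. The sketch below uses two hypothetical information types and invented success rates; it shows the general technique, not the paper's data or pipeline.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical task success rates given which information the user supplied.
success = {
    frozenset(): 0.2,
    frozenset({"repro_steps"}): 0.6,
    frozenset({"style_pref"}): 0.25,
    frozenset({"repro_steps", "style_pref"}): 0.7,
}

phi = shapley_values(["repro_steps", "style_pref"], success.__getitem__)
```

The attributions sum to the gain from supplying everything (0.7 minus 0.2), and in this invented example most of that gain is credited to the reproduction steps. Exact enumeration grows factorially with the number of items, so larger sets would need sampling.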

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

Authors: Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Ge Lan, Yue Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14646
Pdf link: https://arxiv.org/pdf/2604.14646
Abstract Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as key to scaling RL-based reasoning in large models. Our code is available at this https URL.

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

Authors: Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2604.14654
Pdf link: https://arxiv.org/pdf/2604.14654
Abstract In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 200 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 3.68% WER on the LibriSpeech test-clean set at 200 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.20% on test-clean and 8.93% on test-other, corresponding to a 13% relative reduction while preserving perceptual quality.
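
Word error rate, the quantity used as the reward signal here, is the word-level edit distance divided by the reference length. The sketch below is a standard textbook computation with an invented sentence pair, not the paper's evaluation code; it also checks that the reported test-clean figures (3.68% to 3.20%) are consistent with a relative reduction of about 13%.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Relative WER reduction implied by the test-clean numbers in the abstract.
relative_reduction = (3.68 - 3.20) / 3.68
```

Because WER depends on discrete recognized words and is not differentiable with respect to the codec's parameters, treating quantisation as a stochastic policy and using WER as a reward is one way to optimize it directly.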

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Authors: Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.14692
Pdf link: https://arxiv.org/pdf/2604.14692
Abstract Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

Mean Flow Policy Optimization

平均流量策略优化

Authors: Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14698
Pdf link: https://arxiv.org/pdf/2604.14698
Abstract Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at this https URL.
中文摘要 扩散模型最近作为在线强化学习（RL）的表达性策略表示形式出现。然而，它们的迭代生成过程带来了大量的训练和推理开销。为克服这一限制，我们提出使用MeanFlow模型（一类少步流源生成模型）来表示策略，以提升训练和推断效率，优于基于扩散的强化学习方法。为促进探索，我们在最大熵强化学习框架下通过软策略迭代优化平均流策略，并解决均流策略的两个关键挑战：行动似然评估和软策略改进。在MuJoCo和DeepMind控制套件基准测试上的实验表明，我们的方法平均流量策略优化（MFPO）在显著缩短训练和推断时间的同时，实现了与当前基于扩散的基线相当甚至更高的性能。我们的代码可在此 https URL 访问。
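
Illustrative sketch (not the authors' code): the entry above relies on few-step sampling from a MeanFlow model. Assuming the usual average-velocity definition, a sample is moved from time t to an earlier time r with z_r = z_t - (t - r) * u(z_t, r, t), so a single step from t = 1 to r = 0 already yields an action. The velocity field below is a closed-form toy for a point-mass target under a linear noise-to-data path; in MFPO it would be a learned, state-conditioned network. All names are hypothetical.

```python
import numpy as np


def toy_average_velocity(z, r, t, target):
    """Toy average velocity for a point-mass target on a linear path (assumption).

    With z_t = (1 - t) * target + t * noise, the velocity noise - target equals
    (z_t - target) / t and is constant along the path, so it is also the average
    velocity over any interval [r, t].
    """
    return (z - target) / t


def sample_action(u, noise, num_steps=1):
    """Integrate from t = 1 (noise) to t = 0 (action) in num_steps jumps."""
    z = np.asarray(noise, dtype=float)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u(z, r, t)  # z_r = z_t - (t - r) * u(z_t, r, t)
    return z


if __name__ == "__main__":
    target = np.array([0.3, -0.7])
    u = lambda z, r, t: toy_average_velocity(z, r, t, target)
    print(sample_action(u, noise=np.array([1.5, 2.0]), num_steps=1))
```

In this toy setting one step and several steps land on the same action, which is the property that makes few-step policies cheaper than iterative diffusion sampling.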

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

《像素法庭审判：通过对抗证据和强化学习判断进行强健图像处理定位》

Authors: Songlin Li, Zhiqing Guo, Dan Ma, Changtao Miao, Gaobo Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.14703
Pdf link: https://arxiv.org/pdf/2604.14703
Abstract Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that treats the IML task as a confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.
中文摘要 尽管一些现有的图像操作定位（IML）方法包含真实性相关的监督，但这些信息通常仅作为辅助训练信号使用，以增强模型对操作伪影的敏感性，而非明确建模为反对作区域的定位证据。因此，当操作痕迹较为微妙或因后处理和噪声而退化时，这些方法难以明确比较操纵证据与真实证据，导致在模糊领域做出不可靠的预测。为解决这些问题，我们提出了一种法庭式裁决框架，将IML任务视为对抗证据并进行判决。该框架包括起诉流程、辩护流程和法官模式。我们首先在共享的多尺度编码器上构建了双假设分割架构，其中控方流主张操纵，辩方流主张真实性。它以边缘先验为指导，通过级联多层次融合、双向分歧抑制和动态辩论细化，生成控且真实的区域证据。我们进一步开发了强化学习评判模型，对不确定区域进行战略性再推断和细化，生成操作区域掩码。法官模型通过基于优势的奖励和软意值目标进行训练，可靠性通过熵和交叉假设一致性进行校准。实验结果显示，我们的模型平均性能优于SOTA IML方法。
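
Illustrative sketch (not the authors' code): the entry above mentions a soft-IoU objective for the judge model. A common form replaces the hard intersection and union with sums over predicted probabilities, which keeps the score differentiable. The function names and the smoothing constant `eps` are assumptions; the paper's exact formulation may differ.

```python
import numpy as np


def soft_iou(pred_prob, target_mask, eps=1e-6):
    """Soft intersection-over-union between probabilities and a binary mask."""
    pred = np.asarray(pred_prob, dtype=float)
    target = np.asarray(target_mask, dtype=float)
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    # eps avoids division by zero when both masks are empty
    return (intersection + eps) / (union + eps)


def soft_iou_loss(pred_prob, target_mask, eps=1e-6):
    """Loss to minimize: 1 - soft IoU."""
    return 1.0 - soft_iou(pred_prob, target_mask, eps)


if __name__ == "__main__":
    pred = np.array([[0.9, 0.1], [0.2, 0.8]])
    mask = np.array([[1, 0], [0, 1]])
    print(soft_iou(pred, mask), soft_iou_loss(pred, mask))
```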

RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems

RELOAD：一个稳健高效的数据库系统学习查询优化器

Authors: Seokwon Lee, Jaeyoung Sim, Sihyun Kim, Yuhsing Li, Yiwen Zhu, Kwanghyun Park
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14725
Pdf link: https://arxiv.org/pdf/2604.14725
Abstract Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD demonstrates up to 2.4x higher robustness and 3.1x greater efficiency compared to state-of-the-art RL-based query optimization techniques.
中文摘要 查询优化的最新进展已从传统的基于规则和成本的技术转向机器学习驱动的方法。其中，强化学习（RL）因其通过学习策略优化长期性能而非查询规划而备受关注。然而，现有基于强化学习的查询优化器在单个查询层面常表现出不稳定的性能，包括严重的性能回归，需要长时间训练才能达到专家成本优化器的计划质量。这些缺陷使得学习后的查询优化器在实际应用中难以部署，并且仍然是其在生产数据库系统中采用的主要障碍。为应对这些挑战，我们推出了RELOAD，一款稳健高效的数据库系统学习式查询优化器。RELOAD侧重于（i）鲁棒性，通过最小化查询级性能回归并确保执行间优化行为一致，以及（ii）效率，加速趋同到专家级计划质量。通过对包括连接顺序基准、TPC-DS和星型模式基准测试在内的标准基准测试的广泛实验，RELOAD展现出与最先进的基于强化学习的查询优化技术相比，鲁棒性高出多达2.4倍，效率高出3.1倍。

Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

Wasserstein 强化学习表述。政策优化的最佳交通视角

Authors: Mathias Dus (IRMA)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
Arxiv link: https://arxiv.org/abs/2604.14765
Pdf link: https://arxiv.org/pdf/2604.14765
Abstract We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.
中文摘要 我们提出了一个几何框架用于强化学习（RL），将策略视为进入Wasserstein行动概率空间的映射。首先，我们定义一个由平稳分布诱导的黎曼结构，证明其在一般语境下的存在性。随后，我们定义了策略的切空间并对测地线进行了刻画，特别是针对从状态空间映射到作用空间上概率测度切空间的向量场的可测量性。接下来，我们提出一个通用的强化学习优化问题，并利用奥托演算构造梯度流。我们计算能量的梯度和黑森量，提供形式化的二阶分析。最后，我们用低维问题的数值示例来说明该方法，直接从理论形式主义计算梯度。对于高维问题，我们通过神经网络参数化策略，并基于成本的遍历近似进行优化。
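
Background note (standard optimal-transport facts, not the paper's state-dependent construction): the entry above builds a gradient flow with Otto's calculus. For a single probability density on R^d, the classical objects look as follows; the paper's version, where a policy maps each state to such a measure and the metric is weighted by a stationary distribution, is not reproduced here.

```latex
% Tangent space at \rho: closure of gradient fields in L^2(\rho)
T_\rho \mathcal{P}_2(\mathbb{R}^d)
  = \overline{\{\nabla\varphi : \varphi \in C_c^\infty(\mathbb{R}^d)\}}^{\,L^2(\rho)},
\qquad
\langle \nabla\varphi_1, \nabla\varphi_2 \rangle_\rho
  = \int \nabla\varphi_1 \cdot \nabla\varphi_2 \,\mathrm{d}\rho .

% Wasserstein gradient of an energy E and the associated gradient flow
\operatorname{grad}_W E(\rho)
  = -\nabla \cdot \Big( \rho \, \nabla \frac{\delta E}{\delta \rho} \Big),
\qquad
\partial_t \rho_t
  = \nabla \cdot \Big( \rho_t \, \nabla \frac{\delta E}{\delta \rho}(\rho_t) \Big).

% Worked example: E(\rho) = \int \rho \log \rho \,\mathrm{d}x
\frac{\delta E}{\delta \rho} = \log \rho + 1
\;\Longrightarrow\;
\partial_t \rho_t = \nabla \cdot (\rho_t \nabla \log \rho_t) = \Delta \rho_t .
```

The worked example recovers the heat equation as the Wasserstein gradient flow of the entropy functional, which is the textbook sanity check for this calculus.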

Learning Ad Hoc Network Dynamics via Graph-Structured World Models

通过图结构化世界模型学习临时网络动力学

Authors: Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.14811
Pdf link: https://arxiv.org/pdf/2604.14811
Abstract Ad hoc wireless networks exhibit complex, innate, and coupled dynamics (node mobility, energy depletion, and topology change) that are difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state-space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy is trained entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN, and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only for N=50. Herein, we propose the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.
中文摘要 自组无线网络表现出复杂的、先天和耦合的动态：节点移动性、能量消耗和拓扑变化，这些都难以解析建模。无模型深度强化学习需要持续的在线互动，而现有基于模型的方法则使用扁平状态表示，且每个节点结构都会丢失。因此，我们提出了G-RSSM，一种图结构化的循环状态空间模型，通过跨节点多头关注维持每节点的潜在状态，以共同学习离线轨迹中的动态。我们将所提方法应用于下游任务聚类，其中集群负责人选择策略完全通过学习世界模型中的想象性展开进行训练。在涵盖MANET、VANET、FANET、WSN和N=30至1000节点的战术网络的27个评估场景中，所学策略仅训练N=50时保持高度连通性。本书提出首个应用于规模无关无线自组网络中组合每节点决策的多物理图结构世界模型。

SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

SWE-TRACE：通过评分标准过程奖励模型和启发式测试时间尺度优化长期SWE代理

Authors: Hao Han, Jin Xie, Xuehao Ma, Weiquan Zhu, Ziyao Zhang, ZhiLiang Long, Hongkai Chen, Qingwen Ye
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.14820
Pdf link: https://arxiv.org/pdf/2604.14820
Abstract Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally prohibitive inference scaling, which collectively exacerbate token bloat, reward hacking, and policy degradation. We present SWE-TRACE (Trajectory Reduction and Agentic Criteria Evaluation), a unified framework optimizing the SWE agent lifecycle across data curation, reinforcement learning (RL), and test-time inference. First, we introduce an LLM multi-task cascading method, utilizing stepwise oracle verification to distill a 60K-instance Supervised Fine-Tuning (SFT) corpus strictly biased toward token-efficient, shortest-path trajectories. Second, to overcome the instability of sparse outcome rewards, we design a Memory-Augmented Agentic RL pipeline featuring a Rubric-Based Process Reward Model (PRM). An auxiliary Rubric-Agent provides dense, fine-grained heuristic feedback on intermediate steps, guiding the model through long-horizon tasks. Finally, we bridge training and inference by repurposing the PRM for heuristic-guided Test-Time Scaling (TTS). By dynamically evaluating and pruning action candidates at each step, SWE-TRACE achieves superior search efficiency without the latency overhead of standard parallel sampling. Extensive experiments on standard SWE benchmarks demonstrate that SWE-TRACE significantly advances the state-of-the-art, maximizing resolution rates while drastically reducing both token consumption and inference latency.
中文摘要 解决自主智能体的现实软件工程（SWE）问题需要复杂且长远的推理。当前的流水线被未优化的演示数据、稀疏的执行奖励和计算上难以负担的推理扩展所限制，这些因素共同加剧了令牌膨胀、奖励被黑客攻击和策略退化的问题。我们提出了SWE-TRACE（轨迹简化与代理标准评估），这是一个统一框架，优化SWE代理生命周期，涵盖数据管理、强化学习（RL）和测试时间推断。首先，我们引入了一种LLM多任务级联方法，利用逐步预言机验证，提取出一个严格偏向代币效率和最短路径轨迹的6万实例监督微调（SFT）语料库。其次，为了克服结果奖励稀疏的不稳定性，我们设计了一个具有基于评分标准的过程奖励模型（PRM）的内存增强能动强化学习流水线。辅助评分标准代理为中间步骤提供密集、细粒度的启发式反馈，引导模型完成长视野任务。最后，我们通过将PRM重新利用为启发式引导测试时间尺度（TTS）来衔接训练与推理。通过在每一步动态评估和剪枝动作候选，SWE-TRACE 实现了卓越的搜索效率，同时避免了标准并行采样的延迟开销。在标准软件工程基准测试上的大量实验表明，SWE-TRACE显著推动了技术进步，最大化了分辨率的提升，同时大幅降低了令牌消耗和推理延迟。
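
Illustrative sketch (not the authors' code): the entry above describes scoring and pruning action candidates at each step with a process reward model. The snippet shows the pruning step only. `rubric_score` is a made-up keyword heuristic standing in for the learned Rubric-Based PRM, and every name here is hypothetical.

```python
from typing import Callable, List, Sequence


def rubric_score(action: str) -> float:
    """Hypothetical stand-in for a process reward model (keyword heuristic)."""
    score = 0.0
    if "run tests" in action:
        score += 1.0   # verifying a change is rewarded
    if "edit" in action:
        score += 0.5   # making a concrete change is rewarded
    score -= 0.01 * len(action.split())  # mild penalty on verbose actions
    return score


def prune_candidates(candidates: Sequence[str],
                     score_fn: Callable[[str], float],
                     keep: int = 1) -> List[str]:
    """Keep the `keep` highest-scoring candidates for the next step."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    step_candidates = [
        "edit parser.py then run tests",
        "list every file in the repository recursively",
        "edit parser.py",
    ]
    print(prune_candidates(step_candidates, rubric_score, keep=2))
```

Pruning before expansion means only the surviving candidates are rolled forward, which is one way such a scheme could save tokens compared with sampling many full trajectories in parallel.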

Switch: Learning Agile Skills Switching for Humanoid Robots

切换：学习敏捷技能为类人机器人切换

Authors: Yuen-Fui Lau, Qihan Zhao, Yinhuai Wang, Runyi Yu, Hok Wai Tsui, Qifeng Chen, Ping Tan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.14834
Pdf link: https://arxiv.org/pdf/2604.14834
Abstract Recent advancements in whole-body control through deep reinforcement learning have enabled humanoid robots to achieve remarkable progress in challenging real-world locomotion skills. However, existing approaches often struggle with flexible transitions between distinct skills, creating safety concerns and practical limitations. To address this challenge, we introduce a hierarchical multi-skill system, Switch, enabling seamless skill transitions at any moment. Our approach comprises three key components: (1) a Skill Graph (SG) that establishes potential cross-skill transitions based on kinematic similarity within multi-skill motion data, (2) a whole-body tracking policy trained on this skill graph through deep reinforcement learning, and (3) an online skill scheduler to drive the tracking policy for robust skill execution and smooth transitions. For skill switching or significant tracking deviations, the scheduler performs online graph search to find the optimal feasible path, which ensures efficient, stable, and real-time execution of diverse locomotion skills. Comprehensive experiments demonstrate that Switch empowers humanoids to execute agile skill transitions with high success rates while maintaining strong motion imitation performance.
中文摘要 通过深度强化学习实现全身控制的最新进展，使类人机器人在现实世界的颈椎间伸出运动技能上取得了显著进步。然而，现有方法常常在不同技能之间灵活过渡、安全问题和实际局限性方面遇到困难。为应对这一挑战，我们引入了分层多技能系统Switch，实现随时无缝的技能转换。我们的方法包括三个关键组成部分：（1）基于多技能动作数据中的运动学相似性建立潜在的跨技能转换的技能图谱（SG），（2）基于该技能图谱通过深度强化学习训练的全身跟踪策略，以及（3）在线技能调度器，驱动追踪策略以实现技能稳健执行和顺畅过渡。对于技能切换或显著的跟踪偏差，调度器会进行在线图搜索，寻找最优可行路径，确保多种移动技能的高效、稳定和实时执行。综合实验表明，Switch使人形能够高效执行敏捷技能转换，同时保持强有力的动作模仿性能。
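
Illustrative sketch (not the authors' code): the entry above has an online scheduler search a skill graph for a feasible transition path. The snippet runs Dijkstra's algorithm over a small hand-written graph. Node names and edge costs are invented; the abstract says edges come from kinematic similarity in the motion data, and here lower cost is assumed to mean an easier transition.

```python
import heapq
from typing import Dict, List, Optional

Graph = Dict[str, Dict[str, float]]


def shortest_skill_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Dijkstra search; returns the cheapest node sequence or None if unreachable."""
    dist = {start: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical skill graph: edge weight = assumed transition cost
    skills: Graph = {
        "walk": {"run": 1.0, "crouch": 2.0},
        "run": {"jump": 1.5},
        "crouch": {"jump": 0.2},
        "jump": {},
    }
    print(shortest_skill_path(skills, "walk", "jump"))
```

In a real scheduler the search would be re-run whenever a new target skill is requested or tracking error grows, starting from the node closest to the robot's current pose.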

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

强化学习是否扩展了LLM代理的能力边界？PASS@（k，T）分析

Authors: Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14877
Pdf link: https://arxiv.org/pdf/2604.14877
Abstract Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
中文摘要 强化学习真的扩展了LLM代理的能力，还是仅仅让它们更可靠？对于静态推理，近期研究回答了第二个问题：基曲线和强pass@k化逻辑曲线在大 k 时收敛。我们询问这是否适用于代理工具的使用，即T轮交互使得重采样无法恢复的合成策略成为可能。我们引入了PASS@（k，T），一种二维度量，它联合变化采样预算k和交互深度T，将能力扩展与效率提升区分开来。我们的主要发现是，与静态推理结果相反，工具使用强化学习确实扩大了能力边界：强化学习代理的通过曲线超过基础模型，且在大k时差距扩大而非收敛。该扩展专门用于合成、顺序信息收集;在较简单的任务中，强化学习的表现与先前工作预测一致。在匹配训练数据下，监督微调在同一成分任务上回归边界，将自我探索隔离为因果因素。机制分析显示，强化学习会将基础策略分布重新加权，针对下游推理更常得出正确答案的子集，改进主要集中在智能体如何整合检索到的信息上。这些结果调和了对大型语言模型（LLM）的乐观和悲观读数：两者在不同任务类型上都是正确的。
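
Illustrative sketch (not the authors' code): the entry above varies the sampling budget k and the interaction depth T jointly. The abstract does not give the exact definition of PASS@(k,T), so the snippet assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), evaluated separately for each depth T from n sampled rollouts of which c are correct. The counts in the example are made up.

```python
from math import comb
from typing import Dict, Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_t(correct_by_depth: Dict[int, int],
                n: int,
                ks: Iterable[int]) -> Dict[Tuple[int, int], float]:
    """Grid of estimates indexed by (k, T); assumes n rollouts per depth T."""
    ks = list(ks)
    return {(k, depth): pass_at_k(n, c, k)
            for depth, c in correct_by_depth.items()
            for k in ks}


if __name__ == "__main__":
    # Hypothetical counts: number of correct rollouts out of n=16 at each depth T
    grid = pass_at_k_t({1: 0, 2: 1, 4: 3}, n=16, ks=[1, 4, 16])
    for key in sorted(grid):
        print(key, round(grid[key], 3))
```

A task solved by zero rollouts at a given depth stays at 0 for every k, so comparing the large-k column of such a grid for a base model and an RL model is one way to read off whether the capability boundary actually moved.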

GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

GenRec：一个面向偏好的大规模推荐生成框架

Authors: Yanyan Zou, Junbo Qi, Lunsong Huang, Yu Li, Kewei Xu, Jiabao Gao, Binglei Zhao, Xuanhua Yang, Sulong Xu, Shengjie Li
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14878
Pdf link: https://arxiv.org/pdf/2604.14878
Abstract Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large-scale industrial systems introduces three challenges: (i) within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) the prohibitive cost of encoding long user behavior sequences with multi-token item representations based on semantic IDs; and (iii) aligning the generative policy with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses the above challenges within a single decoder-only architecture. As the training objective, we propose a Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability, and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and 8.7% in transaction count over the existing pipeline.
中文摘要 生成检索（GR）为通过下一代币预测（NTP）进行推荐提供了有前景的范式。然而，将其扩展到大规模工业系统会带来三个挑战：（i）在单一请求中，由于分页请求机制，相同的模型输入可能产生不一致的输出;（ii）用基于语义ID的多词号项表示编码冗长用户行为序列的高昂成本，以及（iii）将生成策略与细致的用户偏好信号对齐。我们介绍GenRec，一个基于偏好的生成框架，部署在JD App上，在单一仅依赖解码器的架构中解决上述挑战。作为训练目标，我们提出逐页NTP任务，它监督整个交互页面，而非单个互动项目，提供更密集的梯度信号，解决点对多训练的一对多歧义。在预填充方面，非对称线性令牌合并在提示中压缩多词语义ID，同时保持全解析解码，输入长度减少~2倍，准确率损失可忽略不计。为了进一步使输出与用户满意度对齐，我们引入了GRPO-SR，这是一种强化学习方法，将组相对策略优化与NLL正则化结合以实现训练稳定性，并采用混合奖励结合密集奖励模型与相关性门以减轻奖励黑客行为。在为期一个月的在线A/B测试中，GenRec的点击数量提升了9.5%，交易数量提升了8.7%，均为现有流水线。
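
Illustrative sketch (not the authors' code): the entry above combines a dense reward model with a relevance gate and feeds the result into Group Relative Policy Optimization. The snippet assumes a hard gate (reward is zeroed when relevance falls below a threshold) and the usual GRPO normalization of rewards within a sampled group. The threshold, the gate form, and all names are assumptions; the NLL regularization term is omitted.

```python
import numpy as np


def hybrid_reward(dense_scores, relevance_scores, threshold=0.5):
    """Dense reward, passed through only where relevance clears the gate (assumed form)."""
    dense = np.asarray(dense_scores, dtype=float)
    relevance = np.asarray(relevance_scores, dtype=float)
    gate = (relevance >= threshold).astype(float)
    return gate * dense


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


if __name__ == "__main__":
    dense = [0.9, 0.8, 0.4, 0.7]       # hypothetical reward-model scores
    relevance = [0.2, 0.7, 0.9, 0.6]   # hypothetical relevance scores
    rewards = hybrid_reward(dense, relevance)
    print(rewards)
    print(group_relative_advantages(rewards))
```

The first candidate has the highest dense score but fails the gate, so it ends up with the lowest advantage in the group; this is the kind of reward hacking a gate is meant to block.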

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

双轴生成奖励模型：在互动口语对话模型中实现语义和轮流稳健性

Authors: Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, Zhou Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14920
Pdf link: https://arxiv.org/pdf/2604.14920
Abstract Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, while well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset, produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.
中文摘要 实现无缝、类人互动仍是全双工语音对话模型（SDM）面临的关键挑战。强化学习（RL）大幅增强了文本和视觉语言模型，而设计良好的奖励信号对强化学习的表现至关重要。我们认为强化学习是解决SDM关键挑战的有前景策略。然而，一个根本障碍依然存在：主流的自动化交互质量评估指标依赖于表面代理指标，如行为统计或时间预测准确性，未能为强化学习提供可靠的奖励信号。另一方面，尽管人类评估内容丰富，但成本高昂、不一致且难以扩展。我们通过提出一个双轴生成奖励模型来克服这一关键障碍，该模型通过详细的分类法和注释数据集训练以理解复杂的交互动态，产生单一评分，并且关键的是，提供语义质量和交互时序的独立评估。这种双输出为SDM提供精确的诊断反馈，并提供适合在线强化学习的可靠、有指导性的奖励信号。我们的模型在广泛的数据集中实现了最先进的交互质量评估，涵盖了合成对话和复杂的现实世界交互。

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

LongAct：利用内在激活模式进行长语境强化学习

Authors: Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan, Baobao Chang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.14922
Pdf link: https://arxiv.org/pdf/2604.14922
Abstract Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
中文摘要 强化学习（RL）已成为提升大型语言模型（LLMs）推理能力的关键驱动力。虽然近期进展主要集中在奖励工程或数据综合，但很少有研究利用模型的内在表征特性来指导训练过程。本文首先观察到在处理长上下文时，查询和关键向量中存在高强度激活。我们从模型量子化中汲取灵感——该方法确立了此类高强度激活的关键性——以及长上下文推理本质上表现出稀疏结构的洞见，我们假设这些权重是有效模型优化的关键驱动力。基于这一见解，我们提出了LongAct策略，即从统一更新转向以显著性为导向的稀疏更新。通过选择性地仅更新与这些显著激活相关的权重，LongAct在LongBench v2基础上实现了约8%的提升，并增强了在RULER基准测试上的泛化性。此外，我们的方法展现出显著的通用性，持续提升了GRPO和DAPO等多种强化学习算法的性能。大量消融研究表明，聚焦这些显著特征是释放长期背景潜力的关键。
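
Illustrative sketch (not the authors' code): the entry above updates only the weights tied to high-magnitude query/key activations. The snippet picks the channels with the largest peak activation and masks the gradient so that only the corresponding weight columns change. The selection rule (top fraction by maximum absolute value), the plain SGD step, and all names are assumptions.

```python
import numpy as np


def salient_channel_mask(activations, top_fraction=0.1):
    """Boolean mask over channels with the largest peak |activation|.

    activations: array of shape (tokens, channels), e.g. query or key vectors.
    """
    magnitude = np.abs(np.asarray(activations, dtype=float)).max(axis=0)
    k = max(1, int(round(top_fraction * magnitude.size)))
    top = np.argsort(magnitude)[-k:]
    mask = np.zeros(magnitude.size, dtype=bool)
    mask[top] = True
    return mask


def sparse_update(weight, grad, channel_mask, lr=1e-2):
    """SGD step applied only to the weight columns that feed salient channels.

    weight, grad: arrays of shape (in_features, channels).
    """
    return weight - lr * grad * channel_mask[None, :]


if __name__ == "__main__":
    acts = np.array([[0.1, 5.0, 0.2, 0.3],
                     [0.2, -7.0, 0.1, 0.4]])
    mask = salient_channel_mask(acts, top_fraction=0.25)
    print(mask)
    print(sparse_update(np.ones((3, 4)), np.ones((3, 4)), mask, lr=0.1))
```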

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

WavAlign：通过自适应混合后训练提升口头对话模型中的智力和表达力

Authors: Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, Zhou Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14932
Pdf link: https://arxiv.org/pdf/2604.14932
Abstract End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
中文摘要 端到端口语对话模型因其在表达力和感知能力上比级联系统更高的潜力而备受关注。然而，当前开源口语对话模型的智能性和表达力往往低于预期。受在线强化学习（RL）在其他领域的成功激励，人们可能会尝试直接将偏好优化应用于口头对话模型，但这种迁移并非简单。我们从奖励建模和推广抽样的角度分析这些障碍，重点关注稀疏偏好监督如何与共享参数更新下的密集语音生成相互作用。基于分析，我们提出了一种模式感知的自适应后训练方案，使强化学习在口头对话中实用：它限制偏好更新到语义通道，并通过显式锚定改善声学行为，同时动态调节其混合，避免不可靠的偏好梯度。我们评估了该方法在多个口语对话基准和代表性架构中，观察到语义质量和语音表达力持续提升。

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

UniDoc-RL：从粗到细的视觉RAG，具有层级动作和密集奖励

Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.14967
Pdf link: https://arxiv.org/pdf/2604.14967
Abstract Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
中文摘要 检索增强生成（RAG）通过外部视觉知识扩展大型视觉语言模型（LVLM）。然而，现有的视觉RAG系统通常依赖通用的检索信号，忽视了复杂推理所必需的细粒度视觉语义。为解决这一局限性，我们提出了UniDoc-RL，一种统一的强化学习框架，其中一名LVLM代理共同执行检索、重新排序、主动视觉感知和推理。UniDoc-RL 将视觉信息获取表述为一个具有层级动作空间的顺序决策问题。具体来说，它逐步优化了从粗粒度文档检索到细粒度图像选择和主动区域裁剪的视觉证据，使模型能够抑制无关内容并关注信息密集区域。为了实现有效的端到端培训，我们引入了密集的多奖励方案，为每个动作提供任务感知的监督。基于群相对策略优化（GRPO），UniDoc-RL 将代理行为与多个目标对齐，而无需依赖单独的价值网络。为支持这一训练范式，我们策划了一个包含高质量推理轨迹的全面数据集，并配有细粒度的动作注释。三个基准测试的实验表明，UniDoc-RL始终超越最先进的基线，比以往基于RL的方法提升高达17.7%。

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

适用于视障场景的动量约束混合启发式轨迹优化框架，带有残差增强的日程学习

Authors: Yuting Zeng, Zhiwen Zheng, Jingya Wang, You Zhou, JiaLing Xiao, Yongbin Yu, Manping Fan, Bo Gong, Liyong Ren
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.14986
Pdf link: https://arxiv.org/pdf/2604.14986
Abstract Safe and efficient assistive planning for visually impaired scenarios remains challenging, since existing methods struggle with multi-objective optimization, generalization, and interpretability. In response, this paper proposes a Momentum-Constrained Hybrid Heuristic Trajectory Optimization Framework (MHHTOF). To balance multiple objectives of comfort and safety, the framework designs a Heuristic Trajectory Sampling Cluster (HTSC) with a Momentum-Constrained Trajectory Optimization (MTO), which suppresses abrupt velocity and acceleration changes. In addition, a novel residual-enhanced deep reinforcement learning (DRL) module refines candidate trajectories, advancing temporal modeling and policy generalization. Finally, a dual-stage cost modeling mechanism (DCMM) is introduced to regulate optimization, where costs in the Frenet space ensure consistency, and reward-driven adaptive weights in the Cartesian space integrate user preferences for interpretability and user-centric decision-making. Experimental results show that the proposed framework converges in nearly half the iterations of baselines and achieves lower and more stable costs. In complex dynamic scenarios, MHHTOF further demonstrates stable velocity and acceleration curves with reduced risk, confirming its advantages in robustness, safety, and efficiency.
中文摘要 对于视障场景进行安全高效的辅助规划仍然充满挑战，因为现有方法在多目标优化、泛化和可解释性方面存在困难。为此，本文提出了一个动量约束混合启发式轨迹优化框架（MHHTOF）。为了平衡舒适和安全的多重目标，该框架设计了一个启发式轨迹采样集群（HTSC），采用动量约束轨迹优化（MTO），抑制突发的速度和加速度变化。此外，一个新型残差增强深度强化学习（DRL）模块优化候选轨迹，推动时间建模和策略推广。最后，引入了双阶段成本建模机制（DCMM）用于调节优化，其中弗雷内特空间的成本确保一致性，而笛卡尔空间中奖励驱动的自适应权重整合了用户对可解释性和用户中心决策的偏好。实验结果表明，所提出的框架在近一半的基线迭代中收敛，并实现了更低且更稳定的成本。在复杂的动态场景下，MHHTOF进一步展示了稳定的速度和加速度曲线，降低风险，证实了其在鲁棒性、安全性和效率方面的优势。

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

IG-搜索：搜索增强推理带来的步骤级信息奖励

Authors: Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.15148
Pdf link: https://arxiv.org/pdf/2604.15148
Abstract Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
中文摘要 强化学习已成为训练大型语言模型执行搜索增强推理的有效范式。然而，现有方法依赖轨迹级奖励，无法在同一 rollout 组内区分精确的搜索查询与模糊或冗余的查询，且当所有采样轨迹都失败时，梯度信号会坍缩至接近零。本文提出 IG-Search，一种引入基于信息增益（IG）的步骤级奖励的强化学习框架。对于每个搜索步骤，IG 衡量检索到的文档相对于随机文档这一反事实基线，在多大程度上提升了模型对标准答案的置信度，从而反映底层搜索查询的有效性。该信号通过 GRPO 中的逐 token 优势调制反馈给相应的搜索查询 token，从而在一次 rollout 内实现细粒度的步骤级信用分配。与以往需要外部标注的中间监督或跨轨迹共享环境状态的步骤级方法不同，IG-Search 从策略自身的生成概率中获取信号，除标准问答对外无需任何中间标注。在七个单跳和多跳问答（QA）基准上的实验表明，IG-Search 使用 Qwen2.5-3B 取得 0.430 的平均 EM，平均比最强的轨迹级基线（MR-Search）高 1.6 分，比步骤级方法 GiGPO 高 0.9 分，在多跳推理任务上的提升尤为显著。尽管引入了密集的步骤级信号，IG-Search 相比轨迹级基线仅使每步训练的墙钟时间增加约 6.4%，且推理延迟保持不变，同时即使所有采样轨迹都回答错误，仍能提供有意义的梯度信号。

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

大型语言模型钻验证器的空子：RLVR可能导致奖励黑客（reward hacking）

Authors: Lukas Helff, Quentin Delfosse, David Steinmann, Ruben Härle, Hikaru Shindo, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting, Felix Friedrich
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.15149
Pdf link: https://arxiv.org/pdf/2604.15149
Abstract As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., "trains carrying red cars go east"), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR models (e.g., GPT-4o, GPT-4.5, Ministral). Moreover, shortcut prevalence increases with task complexity and inference-time compute. In controlled training experiments, extensional verification directly induces shortcut strategies, while isomorphic verification eliminates them. These results show that RLVR can incentivize reward hacking not only through overt manipulation but also by exploiting what the verifier fails to enforce.
中文摘要 随着可验证奖励强化学习（RLVR）成为扩展大型语言模型推理能力的主导范式，一种新的失败模式随之出现：大型语言模型钻验证器的空子。我们在归纳推理任务上研究这一现象，这类任务要求模型归纳并输出逻辑规则。我们发现，经 RLVR 训练的模型会系统性地放弃规则归纳。它们不去学习可泛化的模式（例如“载有红色车厢的列车向东行驶”），而是枚举实例级标签，产生能通过验证器、却未捕捉任务所需关系模式的输出。我们证明这种行为并非理解上的失败，而是一种奖励黑客（reward hacking）：仅检查外延正确性的不完美验证器会放过假阳性。为检测此类捷径，我们引入同构扰动测试（IPT），在外延验证和同构验证两种方式下评估同一个模型输出，其中后者要求输出在逻辑同构的任务下保持不变。真正的规则归纳保持不变，而捷径策略则会失效。我们发现，捷径行为是经 RLVR 训练的推理模型（如 GPT-5、Olmo3）所特有的，在非 RLVR 模型（如 GPT-4o、GPT-4.5、Ministral）中并不存在。此外，捷径的出现率随任务复杂度和推理时计算量的增加而上升。在受控训练实验中，外延验证会直接诱发捷径策略，而同构验证则能消除它们。这些结果表明，RLVR 对奖励黑客行为的激励不仅体现为公然操纵，也体现为利用验证器未能强制执行的约束。

RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

RL-STPA：将系统理论危害分析适配于安全关键强化学习

Authors: Steven A. Senczyszyn, Timothy C. Havens, Nathaniel Rice, Jason E. Summers, Benjamin D. Werner, Benjamin J. Schumeg
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15201
Pdf link: https://arxiv.org/pdf/2604.15201
Abstract As reinforcement learning (RL) deployments expand into safety-critical domains, existing evaluation methods fail to systematically identify hazards arising from the black-box nature of neural network enabled policies and distributional shift between training and deployment. This paper introduces Reinforcement Learning System-Theoretic Process Analysis (RL-STPA), a framework that adapts conventional STPA's systematic hazard analysis to address RL's unique challenges through three key contributions: hierarchical subtask decomposition using both temporal phase analysis and domain expertise to capture emergent behaviors, coverage-guided perturbation testing that explores the sensitivity of state-action spaces, and iterative checkpoints that feed identified hazards back into training through reward shaping and curriculum design. We demonstrate RL-STPA in the safety-critical test case of autonomous drone navigation and landing, revealing potential loss scenarios that can be missed by standard RL evaluations. The proposed framework provides practitioners with a toolkit for systematic hazard analysis, quantitative metrics for safety coverage assessment, and actionable guidelines for establishing operational safety bounds. While RL-STPA cannot provide formal guarantees for arbitrary neural policies, it offers a practical methodology for systematically evaluating and improving RL safety and robustness in safety-critical applications where exhaustive verification methods remain intractable.
中文摘要 随着强化学习（RL）的部署扩展到安全关键领域，现有评估方法无法系统地识别由神经网络策略的黑箱特性以及训练与部署之间的分布偏移所带来的危害。本文提出强化学习系统理论过程分析（RL-STPA），该框架对传统 STPA 的系统化危害分析加以调整，通过三项关键贡献应对强化学习的独特挑战：结合时间阶段分析和领域专业知识进行分层子任务分解，以捕捉涌现行为；覆盖引导的扰动测试，探索状态-动作空间的敏感性；以及迭代检查点，通过奖励塑造和课程设计将已识别的危害反馈到训练中。我们在自主无人机导航与着陆这一安全关键测试案例中演示了 RL-STPA，揭示了标准强化学习评估可能遗漏的潜在损失场景。所提出的框架为从业者提供了系统化危害分析的工具包、用于安全覆盖评估的定量指标，以及用于确立运行安全边界的可操作指南。虽然 RL-STPA 无法为任意神经网络策略提供形式化保证，但在穷尽式验证方法仍难以实现的安全关键应用中，它提供了一种系统评估并提升强化学习安全性和鲁棒性的实用方法。

Abstract Sim2Real through Approximate Information States

基于近似信息状态的抽象 Sim2Real

Authors: Yunfu Deng, Yuhao Li, Josiah P. Hanna
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.15289
Pdf link: https://arxiv.org/pdf/2604.15289
Abstract In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and wide-scale domains. In such settings, simulators will likely fail to model all relevant details of a given target task and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simulator. We then show that this method enables successful policy transfer both in sim2sim and sim2real evaluation.
中文摘要 近年来，当给定任务有快速且准确的模拟器可用时，强化学习（RL）在机器人领域取得了显著成功。在结合强化学习与仿真时，模拟器越逼真通常越有益，但随着机器人被部署到越来越复杂、规模越来越大的领域，这种逼真度也越来越难以获得。在这类场景中，模拟器很可能无法对目标任务的所有相关细节建模，这一观察促使我们研究在模拟器遗漏关键任务细节情况下的 sim2real。本文形式化并研究了抽象 sim2real 问题：给定一个在粗粒度抽象层次上对目标任务建模的抽象模拟器，如何在其中用强化学习训练策略并将其成功迁移到现实世界？我们的第一个贡献是利用强化学习文献中状态抽象的语言对该问题进行形式化。这一表述表明，只要经过校准（grounding）的抽象动力学考虑了状态历史，抽象模拟器就可以被校准到与目标任务相匹配。基于这一形式化，我们提出了一种利用真实世界任务数据修正抽象模拟器动力学的方法。随后我们表明，该方法在 sim2sim 和 sim2real 评估中都能实现成功的策略迁移。

Generalization in LLM Problem Solving: The Case of the Shortest Path

LLM问题求解中的泛化：以最短路径为例

Authors: Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15306
Pdf link: https://arxiv.org/pdf/2604.15306
Abstract Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
中文摘要 语言模型能否系统性地泛化，目前仍存在激烈争论。然而，实证性能同时受训练数据、训练范式和推理时策略等多种因素影响，使得失败难以解释。我们引入了一个基于最短路径规划的受控合成环境，最短路径规划是一个典型的可组合序列优化问题。该设置能够清晰地分离这些因素，并支持两个正交的泛化维度：向未见地图的空间迁移，以及向更长程问题的长度扩展。我们发现，模型表现出很强的空间迁移能力，但由于递归不稳定性，在长度扩展下始终失败。我们进一步分析了学习流程的不同阶段如何影响系统性问题求解：例如，数据覆盖度决定能力上限；强化学习提升训练稳定性，但不会扩大这些上限；推理时扩展能提升性能，但无法挽救长度扩展上的失败。

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2：生成器-判别器框架下的扩展强化学习

Authors: Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.15308
Pdf link: https://arxiv.org/pdf/2604.15308
Abstract High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
中文摘要 高阶自动驾驶需要运动规划器能够对多模态的未来不确定性建模，同时在闭环交互中保持鲁棒。尽管基于扩散的规划器能有效建模复杂的轨迹分布，但在纯粹通过模仿学习训练时，它们常常存在随机不稳定性，并且缺乏纠正性的负反馈。为解决这些问题，我们提出 RAD-2，一种用于闭环规划的统一生成器-判别器框架。具体而言，基于扩散的生成器用于产生多样化的候选轨迹，而经强化学习优化的判别器则根据这些候选轨迹的长期驾驶质量对其重新排序。这种解耦设计避免了将稀疏的标量奖励直接施加于完整的高维轨迹空间，从而提升了优化稳定性。为进一步增强强化学习，我们引入时间一致的组相对策略优化（Temporally Consistent Group Relative Policy Optimization），利用时间连贯性缓解信用分配问题。此外，我们提出同策略生成器优化（On-policy Generator Optimization），将闭环反馈转换为结构化的纵向优化信号，并逐步将生成器推向高奖励的轨迹流形。为支持高效的大规模训练，我们引入 BEV-Warp，这是一个高吞吐量仿真环境，通过空间变换（warping）直接在鸟瞰图（BEV）特征空间中进行闭环评估。与强大的基于扩散的规划器相比，RAD-2 将碰撞率降低了 56%。真实世界部署进一步表明，在复杂的城市交通中，感知安全性和驾驶平顺性均有所提升。

Keyword: diffusion policy

There is no result

Keyword: reinforcement learning

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

GFT：从模仿到奖励微调，具有无偏群优势和动态系数整流

Reinforcement Learning via Value Gradient Flow

通过价值梯度流进行强化学习

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

通过贡献加权组相对策略优化，增强基于LLM的搜索代理

Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach

流体天线辅助全双工网络中的空中多功能RIS：一种自优化的混合深度强化学习方法

When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse

当缺失变成结构：基于金融KOL话语的意图保持策略补全

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

多目标的步骤级去噪时扩散对齐

On Tackling Complex Tasks with Reward Machines and Signal Temporal Logics

关于用奖励机和信号时间逻辑解决复杂任务

Improving Human Performance with Value-Aware Interventions: A Case Study in Chess

通过价值意识干预提升人类表现：国际象棋案例研究

Scouting By Reward: VLM-TO-IRL-Driven Player Selection For Esports

基于奖励的球探：VLM-TO-IRL 驱动的电子竞技选手选拔

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve

Evo-MedAgent：超越一次性诊断，能够记忆、反思并改进的智能体

MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

MARS$^2$：通过强化学习扩展多智能体树搜索以生成代码

Model-Based Reinforcement Learning Exploits Passive Body Dynamics for High-Performance Biped Robot Locomotion

基于模型的强化学习利用被动身体动力学实现高性能双足机器人运动

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

询问关键问题：面向软件工程任务的奖励驱动澄清

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

通过统一熵控制进行有针对性的探索以实现强化学习

ClariCodec: Optimising Neural Speech Codes for 200bps Communication using Reinforcement Learning

ClariCodec：利用强化学习优化面向200bps通信的神经语音编码

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Chain-of-Glimpse：面向视频理解的搜索引导渐进式对象定位推理

Mean Flow Policy Optimization

平均流（Mean Flow）策略优化

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

像素的法庭审判：通过对抗证据和强化学习裁决实现鲁棒的图像篡改定位

RELOAD: A Robust and Efficient Learned Query Optimizer for Database Systems

RELOAD：一个稳健高效的数据库系统学习查询优化器

Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

强化学习的 Wasserstein 表述：策略优化的最优传输视角

Learning Ad Hoc Network Dynamics via Graph-Structured World Models

通过图结构世界模型学习自组织（Ad Hoc）网络动力学

SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

SWE-TRACE：通过评分细则过程奖励模型和启发式测试时扩展优化长程SWE智能体

Switch: Learning Agile Skills Switching for Humanoid Robots

Switch：为人形机器人学习敏捷技能切换

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

强化学习是否扩展了LLM智能体的能力边界？PASS@(k,T) 分析

GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

GenRec：一个面向偏好的大规模推荐生成框架

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

双轴生成式奖励模型：在交互式口语对话模型中实现语义与话轮转换的鲁棒性

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

LongAct：利用内在激活模式进行长语境强化学习

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

WavAlign：通过自适应混合后训练提升口头对话模型中的智力和表达力

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

UniDoc-RL：从粗到细的视觉RAG，具有层级动作和密集奖励

Momentum-constrained Hybrid Heuristic Trajectory Optimization Framework with Residual-enhanced DRL for Visually Impaired Scenarios

适用于视障场景的动量约束混合启发式轨迹优化框架，结合残差增强的深度强化学习（DRL）

IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

IG-Search：面向搜索增强推理的步骤级信息增益奖励

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking

大型语言模型钻验证器的空子：RLVR可能导致奖励黑客（reward hacking）

RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

RL-STPA：将系统理论危害分析适配于安全关键强化学习

Abstract Sim2Real through Approximate Information States

基于近似信息状态的抽象 Sim2Real

Generalization in LLM Problem Solving: The Case of the Shortest Path

LLM问题求解中的泛化：以最短路径为例

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

RAD-2：生成器-判别器框架下的扩展强化学习

Keyword: diffusion policy
