Arxiv Papers of Today

Generated: 2026-05-27 19:41:31 (UTC+8); arXiv release time: 2026-05-27 20:00 EDT (2026-05-28 08:00 UTC+8)

59 related papers today

Keyword: reinforcement learning

ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy

ATOM：通过核电子层级实现预算可控的多智能体协作

Authors: Xinkui Zhao, Sai Liu, Yifan Zhang, Qingyu Ma, Zewen Lin, Naibo Wang, Guanjie Cheng, Chang Liu, Yueshen Xu
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26178
Pdf link: https://arxiv.org/pdf/2605.26178
Abstract Large Language Model (LLM)-based multi-agent systems rely on optimized collaboration topologies to balance performance and communication costs. However, current methods struggle with the inherent stability-extensibility trade-off and often misalign computational budgets with query difficulty. We propose ATOM, an adaptive framework that generates budget-controllable collaboration graphs via a novel task-driven reinforcement learning paradigm. Inspired by atomic structures, ATOM employs a nucleus-electron hierarchy: it maintains a stable, offline-learned collaboration backbone (the nucleus) while dynamically activating query-conditioned agents (electrons) during inference. Crucially, a complexity-aware budgeting strategy aligns resource consumption with task demands by estimating query difficulty to strictly regulate electron instantiation. Extensive experiments across six diverse benchmarks demonstrate that ATOM achieves state-of-the-art performance while improving token efficiency by up to 30% compared to strong baselines.
中文摘要 基于大型语言模型（LLM）的多智能体系统依赖优化的协作拓扑结构来平衡性能和通信成本。然而，现有方法难以处理固有的稳定性与可扩展性之间的权衡，且常常使计算预算与查询难度不匹配。我们提出了 ATOM，这是一种自适应框架，通过一种新颖的任务驱动强化学习范式生成预算可控的协作图。受原子结构启发，ATOM 采用了核-电子层级结构：它保持一个稳定的、离线学习得到的协作骨干（原子核），同时在推理过程中动态激活以查询为条件的智能体（电子）。关键在于，复杂度感知的预算策略通过估计查询难度来严格调控电子的实例化，使资源消耗与任务需求保持一致。在六个不同基准上的大量实验表明，ATOM 实现了最先进的性能，同时与强基线相比将 token 效率提升了高达 30%。

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC：面向混合SFT-RL后训练的噪声感知自适应混合

Authors: Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26184
Pdf link: https://arxiv.org/pdf/2605.26184
Abstract Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.
中文摘要 混合后训练通常结合监督微调和强化学习，但固定混频时间表无法适应两个信号的相对噪声随时间变化。我们提出了GAC，一种噪声感知控制器，通过在线估计梯度方差和两种训练信号之间的不一致来推导出自适应混合权重。该方法在重复使用现有训练张量的同时，增加了平滑处理、先验引导和有界更新。数学、代码、科学和逻辑基准测试的实验表明，GAC在混合后训练方面相较于强的固定和基于规则的基线持续提升，在更大模型尺度上取得更大提升，且训练开销低于1%。
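
The controller described in this abstract can be pictured with a small sketch. This is not the paper's formula: it assumes an inverse-variance weighting rule, exponential-moving-average smoothing, a fixed prior weight, and a clipped step, and it leaves out the disagreement term the abstract mentions. The class `MixController` and all of its parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MixController:
    """Toy noise-aware mixer: `weight` goes to the SFT loss, 1 - weight to the RL loss."""
    prior: float = 0.5        # assumed prior mixing weight
    prior_pull: float = 0.1   # how strongly the target is shrunk toward the prior
    beta: float = 0.9         # EMA smoothing factor for the variance estimates
    max_step: float = 0.05    # bound on how far the weight may move per update
    weight: float = 0.5
    var_sft: float = 1.0
    var_rl: float = 1.0

    def update(self, sft_grad_var: float, rl_grad_var: float) -> float:
        # Smoothing: track online gradient-variance estimates with an EMA.
        self.var_sft = self.beta * self.var_sft + (1 - self.beta) * sft_grad_var
        self.var_rl = self.beta * self.var_rl + (1 - self.beta) * rl_grad_var
        # Inverse-variance target: the noisier signal receives less weight.
        eps = 1e-12
        target = self.var_rl / (self.var_sft + self.var_rl + eps)
        # Prior guidance: shrink the target toward the prior weight.
        target = (1 - self.prior_pull) * target + self.prior_pull * self.prior
        # Bounded update: clip the change applied in a single step.
        step = max(-self.max_step, min(self.max_step, target - self.weight))
        self.weight += step
        return self.weight
```

Under these assumptions, feeding a quiet SFT signal and a noisy RL signal (for example `update(0.1, 4.0)` repeatedly) moves the weight gradually toward SFT, never by more than `max_step` per call.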

Unified Neural Scaling Laws

统一神经尺度定律

Authors: Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2605.26248
Pdf link: https://arxiv.org/pdf/2605.26248
Abstract We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.
中文摘要 我们提出了一种函数形式（我们称之为统一神经缩放律（UNSL）），它能够在多个维度同时变化时准确建模并外推深度神经网络的缩放行为（即当模型参数数量、训练数据集大小、训练步数、推理步数、计算量以及各种超参数同时变化时，所关注的评估指标如何变化），适用于多种架构，以及一组多样的上游和下游任务中的每一项任务。这组任务包括大规模视觉、语言、数学和强化学习。与其他神经缩放函数形式相比，该函数形式在这组任务上对缩放行为的外推要准确得多。

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

通过扩散策略优化扩展世界模型强化学习

Authors: Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26282
Pdf link: https://arxiv.org/pdf/2605.26282
Abstract Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.
中文摘要 基于模型的强化学习（RL）可以通过使用世界模型在大规模上有效支持。然而，在实际操作中，这种方法的规模化仍然存在根本上的限制。一个常见的挑战是模型偏差和误差叠加，这会降低长期预测。除了这些问题，我们还发现了一个更关键但尚未被充分探讨的瓶颈：现有世界模型方法中搜索与价值学习之间的结构性错位。特别是，策略改进常依赖于由独立非搜索策略诱导的价值函数，导致训练不一致，最终学习效果不优。为解决这一局限，我们提出了基于模型的扩散政策优化（MBDPO）框架，通过扩散政策表示统一搜索和策略优化，从而释放世界模型在可扩展政策学习中的潜力。我们不再在学习的世界模型上构建显式规划器，而是将政策优化重新表述为潜世界模型中搜索轨迹的扩散过程。在此视图中，我们从收集的数据集中提取一个隐式能量函数，锚定策略，使MBDPO能够优化评分字段，同时减少错位。我们在多种环境中评估MBDPO，包括多任务离线预训练、在线学习以及离线到在线的微调。在离线模式下，我们通过对大规模数据集进行预训练进一步研究其缩放行为，观察到随着模型容量增加，性能持续且单调地提升。

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

解耦延迟补偿：通过学习动力学过滤增强预训练的MARL策略

Authors: Maxim Mednikov, Oren Gal
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.26286
Pdf link: https://arxiv.org/pdf/2605.26286
Abstract Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently exhibit significant performance degradation in these regimes because they act on outdated feedback. We propose a modular execution-stage state-estimation layer that replaces delayed communicated observations with current belief-state estimates. The framework integrates a learned gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity: the estimator serves as a plug-in for pre-trained policies, requiring no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi-agent and continuous-control benchmarks demonstrates that the proposed layer consistently enhances robustness to communication latency and message loss. The most significant performance gains are observed in coordination-intensive and dynamically unstable tasks where temporal consistency is critical for control.
中文摘要 现实世界的多智能体强化学习（MARL）系统常常必须在陈旧观测、随机通信延迟和间歇性丢包的情况下运行。在理想化同步条件下训练的政策，由于基于过时的反馈，常常在这些条件下表现出显著的性能下降。我们提出了一个模块化的执行阶段状态估计层，用当前的信念状态估计替代延迟的通信观察。该框架将已学习的门控转移模型与递归卡尔曼滤波层相结合，用于从异步测量中估算瞬时状态。该方法的主要优势在于其模块化特性，估计器作为预训练策略的插件，无需对原始MARL训练算法、架构或奖励结构进行修改。对多种多智能体和连续控制基准的评估表明，所提层持续增强了对通信延迟和消息丢失的鲁棒性。在协调密集且动态不稳定的任务中，性能提升最显著，时间一致性对控制至关重要。
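
The delay-compensation idea in this abstract (fuse the stale measurement, then roll the belief forward to the present) can be sketched with a scalar Kalman filter. This is only an illustration under strong assumptions: a known linear model `x' = a * x + noise` stands in for the paper's learned gated transition model, and the function names and default parameters are hypothetical.

```python
def kalman_update(mean, var, z, meas_var):
    """Fuse a measurement z (with variance meas_var) into the belief (mean, var)."""
    gain = var / (var + meas_var)
    return mean + gain * (z - mean), (1.0 - gain) * var


def kalman_predict(mean, var, a, process_var, steps=1):
    """Roll the belief forward `steps` times through x' = a * x + noise."""
    for _ in range(steps):
        mean, var = a * mean, a * a * var + process_var
    return mean, var


def compensate_delay(mean, var, delayed_z, delay,
                     a=1.05, process_var=0.01, meas_var=0.1):
    """The belief (mean, var) refers to the time the stale measurement was taken.

    Returns the belief for the current time, `delay` steps later.
    """
    mean, var = kalman_update(mean, var, delayed_z, meas_var)
    return kalman_predict(mean, var, a, process_var, steps=delay)
```

A pre-trained policy would then consume the returned mean in place of the delayed observation; the uncertainty grows with the delay because each prediction step adds process noise.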

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL：强化学习智能体为机制可解释性执行回路发现

Authors: Barsat Khadka
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26343
Pdf link: https://arxiv.org/pdf/2605.26343
Abstract Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem. An agent operates over the 144 attention heads of GPT-2 small as a discrete action space; each action triggers a zero-ablation and a contrastive reward that subtracts the ablation's damage to general next-token prediction from its damage to the target task. A single PPO policy, trained on two tasks (induction and IOI) in a vectorised multi-task environment, attains the per-episode oracle on both training tasks and on a held-out third task (docstring completion). Its preferred heads coincide with the canonical heads of established literature on precisely the axes those papers identify as causally non-redundant under single-head ablation; the categories they identify as redundant are correctly de-prioritised by the agent. On the held-out task, best-of-five planning recovers 96% of the oracle ceiling with no task signal supplied at evaluation. These results indicate that reinforcement learning over causal interventions is a viable, transferable substrate for identifying the single-head bottlenecks of mechanistic circuits, complementary to existing path-patching approaches.
中文摘要 机制可解释性研究已经识别出在 Transformer 语言模型中实现特定行为的小规模注意力头集合，但恢复这些回路通常需要为每个新任务定制分析流水线。我们将回路发现重新表述为强化学习问题。智能体以 GPT-2 small 的 144 个注意力头作为离散动作空间；每个动作触发一次零消融，并得到一个对比奖励，即用消融对目标任务造成的损害减去其对一般下一词元预测造成的损害。在向量化多任务环境中基于两个任务（归纳和 IOI）训练的单一 PPO 策略，在两个训练任务以及一个留出的第三任务（文档字符串补全）上都达到了每回合的 oracle 水平。其偏好的注意力头与既有文献中的典型注意力头一致，且恰好体现在这些论文认定为在单头消融下因果上非冗余的维度上；而这些论文认定为冗余的类别则被智能体正确地降低了优先级。在留出任务上，best-of-five 规划在评估时不提供任何任务信号的情况下恢复了 oracle 上限的 96%。这些结果表明，基于因果干预的强化学习是识别机制回路单头瓶颈的一种可行且可迁移的基础方法，可与现有的路径修补（path patching）方法互补。
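
The contrastive reward described in this abstract can be written down in a few lines. The sketch below works on plain loss values rather than a real GPT-2 model; measuring damage as an increase in loss is an assumption, and the function names are hypothetical. The only fixed fact used is that GPT-2 small has 12 layers of 12 heads, giving the 144 discrete actions.

```python
def head_index(layer, head, heads_per_layer=12):
    """Map a (layer, head) pair to one of the 144 discrete actions."""
    return layer * heads_per_layer + head


def contrastive_reward(task_loss_clean, task_loss_ablated,
                       general_loss_clean, general_loss_ablated):
    """Damage to the target task minus damage to general next-token prediction.

    "Damage" is taken here to be the loss increase caused by zero-ablating a head
    (an assumption; the paper may use a different metric).
    """
    task_damage = task_loss_ablated - task_loss_clean
    general_damage = general_loss_ablated - general_loss_clean
    return task_damage - general_damage
```

A head that hurts the target task a lot but barely affects general prediction scores high, while a head that is broadly important to the model is penalised.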

Balancing Plasticity and Stability with Fast and Slow Successor Features

利用快速与慢速后继特征平衡可塑性与稳定性

Authors: Raymond Chua, Doina Precup, Blake Richards
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26357
Pdf link: https://arxiv.org/pdf/2605.26357
Abstract A hallmark of intelligence is the ability to adapt in non-stationary environments, yet deep Reinforcement Learning (RL) agents often struggle in such settings. Prior studies introduce non-stationarity through abrupt shifts in features or dynamics, whereas real-world environments often evolve gradually through continual drift. This distinction has important implications for the "stability-plasticity dilemma" in RL, as abrupt task changes may demand more plasticity than naturalistic settings. To address this, we modify existing 3D Miniworld and MuJoCo environments to incorporate naturalistic, continual non-stationarity, and use them to examine how stability and adaptation affect performance under continuous environmental change. We find that methods favoring stability, such as synaptic consolidation, outperform approaches focused on plasticity, such as parameter resetting. Motivated by this result, and prior evidence that Successor Features (SFs) reduce interference, we investigate whether SFs are better consolidation targets than Q-values. Across both environments, applying neuro-inspired synaptic consolidation to SFs yields superior performance in continually changing settings. Moreover, consolidation is most effective when SFs are stabilized across multiple timescales, which capture complementary aspects of gradual environmental change. Together, these results suggest that stability is more critical in continual learning when changes are gradual, and that multi-timescale consolidation of predictive representations is an effective approach.
中文摘要 智能的一个标志是能够适应非静止环境，但深度强化学习（RL）智能体在此类环境中常常会遇到困难。以往的研究通过特征或动力学的突然变化引入了非平稳性，而现实环境通常通过持续漂移逐步演变。这一区分对强化学习中的“稳定性-可塑性困境”具有重要意义，因为突变任务可能比自然环境要求更多的可塑性。为此，我们修改现有的3D Miniworld和MuJoCo环境，融入自然主义、持续的非静态性，并用它们研究稳定性和适应性如何影响持续环境变化下的性能。我们发现，有利于稳定性的方法，如突触巩固，优于注重可塑性的方法，如参数重置。基于这一结果，以及先前证据表明继任特征（SFs）能减少干扰，我们研究SF是否比Q值更适合合并。在这两种环境下，将神经启发的突触巩固应用于SFs，在不断变化的环境下表现更优。此外，整合在多个时间尺度稳定的SFs中效果最佳，这些时间表捕捉了渐进环境变化的互补方面。综合来看，这些结果表明，当变化渐进时，稳定性在持续学习中更为关键，而多时间尺度的预测表征巩固是一种有效的方法。

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

利用局部动力学正则性实现离线层级强化学习中的可复用技能

Authors: Sarthak Dayal, Abhinav Peri, Carl Qi, Claas Voelcker, Alexander Levine, Caleb Chuck, Amy Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26371
Pdf link: https://arxiv.org/pdf/2605.26371
Abstract Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.
中文摘要 分层强化学习（HRL）承诺通过发现和重复使用时间扩展技能，比非分层式任务更高效地解决长期强化学习（RL）任务。然而，获得真正可重复使用的技能仍是一个开放的挑战。为此，我们重点关注利用局部动力学直觉的抽象：不同全局背景下的局部转移需要类似类型的动作序列。通过将这些情境与它们所需的动作序列对齐，我们能够学习哪些技能该重复使用，以及在哪些地方重复使用。原则上，这些信息应当惠及许多HRL算法，因为高级政策需要对其使用的低层技能进行推理。最终算法CARL（基于动作的对比可重用局部控制表示）在复杂类人环境中展示了有意义技能的定性聚类，并在与HIQL集成后，在OGBench基准测试中表现有所提升。

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

两阶段排名中早期检索的信用分配策略梯度

Authors: Haruka Kiyohara, Mihaela Curmei, Ariel Evnine, Shankar Kalyanaraman, Israel Nir, Ana-Roxana Pop, Nitzan Razin, Sarah Dean, Thorsten Joachims, Udi Weinsberg
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.26385
Pdf link: https://arxiv.org/pdf/2605.26385
Abstract Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
中文摘要 大规模搜索、推荐和检索增强生成（RAG）系统通常采用两阶段架构：早期排序器（ESR）生成候选集，随后由后期排序器（LSR）重新排序。虽然有许多强化学习（RL）方法用于训练LSR，但ESR的端到端训练一直具有挑战性。特别是，由于方差爆炸性，“普通”策略梯度（V-PG）的朴素应用无法扩展到实际应用相关的候选集规模。这个问题的出现是因为V-PG将梯度传播到候选集的联合概率，忽略了候选集中每个具体项目对奖励的贡献。为缓解这一问题，我们提出了一种新的“信用分配”策略梯度（CA-PG），该梯度根据目标项目在任何候选集合中被选中的概率计算梯度，即对包含该目标的选项集进行边缘化。我们的理论分析显示，CA-PG通过对候选集的具体组成进行边缘化，显著降低了V-PG的方差，同时保留了在合理对齐的LSR策略下学习正确排序的能力。合成和现实世界数据的实验表明，CA-PG在采用典型Plackett-Luce模型时，尤其在候选集规模较大时，能提升收敛速度和训练稳定性。
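
To make the distinction in this abstract concrete, the sketch below contrasts the joint probability of one ordered candidate draw with the marginal probability that an item lands in any candidate set, under a Plackett-Luce model. It is a brute-force illustration only (feasible for a handful of items) and does not implement the CA-PG gradient; the function names are hypothetical.

```python
from itertools import permutations


def pl_sequence_prob(weights, seq):
    """Plackett-Luce probability of drawing `seq` in order, without replacement."""
    remaining = sum(weights)
    prob = 1.0
    for i in seq:
        prob *= weights[i] / remaining
        remaining -= weights[i]
    return prob


def inclusion_probs(weights, k):
    """Exact P(item i is in the size-k candidate set), by enumerating all draws."""
    probs = [0.0] * len(weights)
    for seq in permutations(range(len(weights)), k):
        p = pl_sequence_prob(weights, seq)
        for i in seq:
            probs[i] += p
    return probs
```

The joint probability depends on every item in the draw, whereas each inclusion probability depends only on whether the target item was picked, which is the quantity the abstract says CA-PG differentiates.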

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

深度强化学习何时能击败校准基线？一项关于适应性资源控制的基准研究

Authors: Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani, John Fossaceca
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.26418
Pdf link: https://arxiv.org/pdf/2605.26418
Abstract A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
中文摘要 在我们测试的每个工作负载上，一个经过适当校准的基于规则的自动扩缩器在成本上都能击败全部六种主流深度强化学习（DRL）算法。那么，DRL 究竟何时（如果有的话）真正有帮助？我们在 RLScale-Bench 中研究这一问题，它是一个面向自适应资源控制中 DRL 的可复现基准和评估协议，其中智能体在成本和服务级别约束下为动态工作负载分配计算资源。我们在匹配的架构、训练预算和奖励函数下，针对一个校准的基于规则的基线，在六种工作负载模式和五个随机种子（240 次运行）上评估 PPO、DQN、A2C、SAC、TD3 和 DDPG，在 Kubernetes Horizontal Pod Autoscaling 上实例化该基准，并考察分布偏移下的泛化。三个发现挑战了常见假设：（i）校准控制器在全部六种工作负载上实现了最低成本，但在突发和瞬时高峰（flash）流量上落后于最佳强化学习智能体；（ii）由于动作空间不匹配，离散动作算法在约束违规方面比连续动作算法好一到两个数量级；（iii）没有单一算法在所有工作负载上占优，排名变动最多达四个名次。基于强化学习的资源控制的瓶颈不在于算法选择，而在于基线校准、奖励工程和贴近现实的评估协议。
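
For readers unfamiliar with rule-based autoscaling, the sketch below shows the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed metric to its target, round up, and skip changes within a tolerance band. It is not the paper's calibrated baseline, whose rules and thresholds are not given in the abstract; the function name and the replica bounds are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA-style rule: ceil(current_replicas * current_metric / target_metric)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

Calibrating such a controller mostly means choosing the target metric, the tolerance, and the bounds for a given workload, which is the kind of tuning the abstract argues is often skipped when baselines are compared with DRL agents.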

Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

先设计，后写代码：美观无模板幻灯片生成

Authors: Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.26451
Pdf link: https://arxiv.org/pdf/2605.26451
Abstract Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at this https URL.
中文摘要 制作演示幻灯片自动意味着在严格的空间限制下协调叙事结构与页面级平面设计。对于这种结构化的多模态任务，良好的设计流程对于确保幻灯片的最终质量至关重要。现有方法依赖固定模板或直接输出可执行代码，这既限制了大型语言模型的创意布局设计能力，也绕过了关键的幻灯片页设计步骤。为解决这些限制，本文（1）提出了一种分层幻灯片生成流程DeepSlides，系统地组织幻灯片设计任务，无需预定义模板或样式，将幻灯片页设计与实现解耦;（2）引入了SlideDesign，一个专门为幻灯片生成任务量身定制的数据集;（3）呈现多智能体强化学习训练范式，并训练几个模型 SlideQwens，用于幻灯片设计和实现。实验结果表明，我们提出的框架在评估指标上优于基线方法，并在人类偏好评估中表现更优。数据集和代码可在该 https URL 访问。

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

稳健的Koopman控制屏障过滤器，实现安全的演员-批评者强化学习

Authors: Dhruv S. Kushwaha, Zoleikha A. Biron
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.26452
Pdf link: https://arxiv.org/pdf/2605.26452
Abstract Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor-critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective. All code is available at this https URL (GitHub repository).
中文摘要 机器人系统的安全强化学习（RL）需要能够在提升任务性能的同时，在训练和部署过程中满足状态和输入约束的策略。控制障碍函数（CBF）通过最小侵入式安全滤波器为保证前向不变性提供了一种有原则的机制，但其在无模型强化学习中的应用受限于对精确动力学和手工设计障碍证书的需求。我们提出了 Robust Koopman-CBF SAC，这是一种带安全滤波的演员-评论家框架，它从数据中学习有限维 Koopman 预测器，在提升空间中构造仿射 CBF 约束，并通过二次规划安全层强制执行这些约束。为考虑有限维 Koopman 近似误差，使用从留出的 rollout 数据估计的投影残差裕度来收紧 CBF 条件。评论家基于实际执行的安全动作进行训练，而演员则被正则化至 Koopman-CBF 可行集，从而在训练过程中减少对滤波器的依赖。在安全控制基准上，该方法在 CartPole 稳定和跟踪任务上实现了零约束违规，同时达到或超过无约束 SAC 的回报。在高维 Safety Gymnasium 运动任务上，该方法在某些设置下减少了违规，但也暴露了一阶速度障碍和线性 EDMD 模型的重要局限，从而促使向高阶和多步 Koopman-CBF 扩展。这些结果表明，鲁棒的 Koopman-CBF 滤波器是连接无模型强化学习与可认证安全性的一座有前景的桥梁，同时阐明了此类滤波器保持有效所需的结构条件。所有代码可在此 https URL（GitHub 仓库）获取。
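
The quadratic-program safety layer mentioned in this abstract has a simple closed form when there is a single affine constraint: minimising the squared distance to the nominal action subject to `a · u >= b` is a projection onto a half-space. The sketch below shows only that special case; the paper's constraint construction from a Koopman model is not reproduced, and the `margin` argument is a stand-in for its residual-based tightening. All names are hypothetical.

```python
def safety_filter(u_nom, a, b, margin=0.0):
    """Solve min ||u - u_nom||^2 subject to a . u >= b + margin (one constraint)."""
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - (b + margin)
    if slack >= 0.0:
        # The nominal action already satisfies the constraint: leave it untouched.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible constraint: a is zero but b + margin > 0")
    # Move the action the minimum distance along `a` onto the constraint boundary.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]
```

A larger `margin` shrinks the feasible set, which is how a tightened condition trades some performance for robustness to model error. With several constraints a general QP solver would be needed instead of this projection.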

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

三元动力学感知扩散后采样反问题：优化引导与随机性时刻表

Authors: Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.26470
Pdf link: https://arxiv.org/pdf/2605.26470
Abstract Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior art has focused on how to develop each or all components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
中文摘要 利用扩散模型进行生成后验抽样已成为解决影像中逆问题的主流范式，反问题通常由三个主要组成部分组成：数据一致性（DC）指导、无分类器引导（CFG）和随机性。虽然现有技术关注如何开发每个或所有组件，但对如何调度它们的关注较少，导致启发式固定或部分调整的次优计划。在本研究中，我们认为三个组件在调度上的相互作用对于显著提升图像中逆问题的解决性能至关重要。我们的分析显示，早期采样时的激进CFG与DC指导相冲突，而随机性则将趋势带回更高概率区域。基于这些发现，我们提出了三元动力学感知后验采样（TriPS），该方法将后验采样重新表述为一个时间变化的控制问题，并沿三元趋势优化排程，顺应DC和随机尺度递减，CFG尺度递增。TriPS通过两种策略实现这一点：基于模板的函数先验搜索以实现可靠的基线调度，以及基于组相对策略优化（GRPO）的强化学习，以实现更灵活的时间曲线。实验表明TriPS在数据忠实度和感知真实性方面优于最先进的基线。
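
The triadic trend in this abstract (DC and stochasticity scales going down, CFG scale going up over the sampling trajectory) can be illustrated with the simplest possible template, a linear ramp. The endpoint values and the function names below are made up for illustration; the paper searches over functional templates and also learns schedules with GRPO, neither of which is shown here.

```python
def linear(start, end, progress):
    """Linear interpolation between start (progress=0) and end (progress=1)."""
    return start + (end - start) * progress


def triadic_schedule(num_steps, dc=(1.0, 0.2), noise=(1.0, 0.0), cfg=(1.0, 7.5)):
    """Per-step scales: DC and stochasticity decrease while CFG increases."""
    schedule = []
    for step in range(num_steps):
        progress = step / max(num_steps - 1, 1)
        schedule.append({
            "dc_scale": linear(dc[0], dc[1], progress),
            "noise_scale": linear(noise[0], noise[1], progress),
            "cfg_scale": linear(cfg[0], cfg[1], progress),
        })
    return schedule
```

A template search in this spirit would vary the endpoints (or swap the ramp for another curve family) and keep whichever schedule scores best on a validation metric.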

Heterogeneous AAV Logistics Task Allocation: A Reinforcement Learning Enhanced Overlapping Coalition Formation Game Approach

异构AAV后勤任务分配：强化学习增强型重叠联盟编组博弈方法

Authors: Yuze Zhou, Jingliang Sun, Junzhi Li, Jianxin Zhong, Zihan Wang, Teng Long
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.26471
Pdf link: https://arxiv.org/pdf/2605.26471
Abstract In dynamic urban logistics, the stochastic emergence of time-sensitive tasks poses a significant optimality challenge for heterogeneous AAV logistics task allocation. To address this problem, a reinforcement learning enhanced overlapping coalition formation game approach is proposed. A dynamic task allocation model is established, where global optimality is mathematically quantified by a generalized logistics cost coupling service quality and resource consumption. To deal with the time-varying task sets induced by stochastic order arrivals, a transformer-based soft actor-critic network is designed. By leveraging multi-head self-attention to encode variable-length logistics states and capture task-wise spatiotemporal dependencies, the learned policy adaptively guides coalition updates, replacing heuristic rules in the overlapping coalition formation game. On this basis, heterogeneous AAVs can form more efficient overlapping coalitions for dynamic logistics tasks. The resulting coalition formation process is proven to constitute an exact potential game, which guarantees convergence to a Nash-stable equilibrium within a finite number of iterations. Numerical simulations demonstrate that the proposed algorithm effectively improves the optimality of task allocation under the generalized logistics cost criterion. In a scenario with 32 AAVs and 80 tasks, our algorithm achieves a 39.76% cost reduction compared with the heuristic OCF baseline. Indoor flight experiments further validate its practicality.
中文摘要 在动态城市物流中，时间敏感任务的随机出现对异构AAVs物流任务分配构成重大最优挑战。为解决这一问题，提出了一种强化学习增强型重叠联盟形成博弈方法。建立了一个动态任务分配模型，其全局最优性通过广义物流成本耦合服务质量和资源消耗来数学上量化。为应对随机序到达引起的时间变化任务集，设计了一个基于变换器的软演员-批判者网络。通过利用多头自注编码可变长度的物流状态并捕捉任务上的时空依赖关系，所学政策自适应地引导联盟更新，取代了重叠联盟组建游戏中的启发式规则。基于此，异构AAV可以更高效地形成重叠的动态物流联盟。由此形成的联盟过程被证明构成一个精确的势博弈，保证在有限次迭代内收敛到纳什稳定均衡。数值模拟表明，所提算法在广义物流成本准则下有效提升了任务分配的最优性。在拥有32个AAV和80个任务的场景中，我们的算法相比启发式OCF基线实现了39.76%的成本降低。室内飞行实验进一步验证了其实用性。

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

通过随机解耦策略梯度实现高效的策略内可视化-强化学习

Authors: Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.26478
Pdf link: https://arxiv.org/pdf/2605.26478
Abstract We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation and challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.
中文摘要 我们提出了随机解耦策略梯度（SDPG），这是一种轻量级视觉强化学习（RL）方法，可在单个NVIDIA RTX 4080 GPU上数小时内端到端训练多种视觉运动控制策略。SDPG通过轨迹展开的随机扰动来估算策略梯度，从而大幅减少批量渲染环境数量，并大幅降低计算和内存开销。在视觉MuJoCo基准测试中，SDPG在训练时间、内存使用和奖励方面始终优于基线方法。最后，为了支持未来研究，我们引入了一套涵盖灵巧操作、挑战性移动以及在物理硬件上有效的模拟到现实传输的真实视觉机器人基准测试。

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

超越成对偏好：面向扩散模型的列表式奖励感知对齐

Authors: Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.26491
Pdf link: https://arxiv.org/pdf/2605.26491
Abstract Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
中文摘要 偏好优化已成为在线人类反馈强化学习（RLHF）的一种高效替代方案，用于对齐文本到图像扩散模型。然而，现有方法大多将监督信号简化为二元的成对比较。当训练数据天然包含同一提示的多个候选图像，且连续奖励分数能提供比单一胜负标签更丰富的信息时，这种成对简化就会成为限制。为解决这些局限，我们提出了 Diffusion LAIR，一种面向扩散模型的奖励感知列表式偏好优化方法。对于每个提示，LAIR 将一组候选图像的奖励分数转换为中心化的优势权重，然后对隐式奖励优化一个优势加权回归目标；隐式奖励定义为当前模型相对于固定参考模型的去噪损失改进，并以二次惩罚项对隐式奖励的幅度进行正则化。所得目标同时使用所有候选图像，而非从中选取配对，并通过显式控制隐式奖励的幅度保持保守。LAIR 目标在隐式奖励空间中存在有界的闭式最优解，从而阐明了正则化强度如何控制偏好更新的幅度。实验表明，在文本到图像生成、组合式生成和图像编辑基准上，Diffusion LAIR 在 SD1.5 和 SDXL 上均优于强偏好优化基线。
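To make the "centered advantage weights" and the "bounded closed-form optimum" concrete, the sketch below works through one plausible reading of the objective. The exact loss is not given in the abstract, so the per-candidate form `-A_i * r_i + (lam / 2) * r_i**2` and the names `centered_advantages`, `lair_style_objective`, and `lam` are assumptions for illustration only, not the authors' formulation.

```python
def centered_advantages(rewards):
    """Center a group of reward scores so the weights sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def lair_style_objective(implicit_rewards, advantages, lam):
    """Assumed loss to minimize: sum_i (-A_i * r_i + (lam / 2) * r_i**2)."""
    return sum(
        -a * r + 0.5 * lam * r * r
        for a, r in zip(advantages, implicit_rewards)
    )


def closed_form_optimum(advantages, lam):
    """Minimizer of the assumed loss: r_i* = A_i / lam (bounded for lam > 0)."""
    return [a / lam for a in advantages]


# Three candidate images for one prompt, scored by a reward model.
rewards = [1.0, 2.0, 6.0]
advantages = centered_advantages(rewards)
optimum = closed_form_optimum(advantages, lam=2.0)
```

Under this assumed form, a larger `lam` shrinks every optimal implicit reward toward zero, which is one way to read the claim that the regularization strength controls the magnitude of the preference update.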

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch：具有自我纠正视觉草图和逐步奖励的交错推理模型

Authors: Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26520
Pdf link: https://arxiv.org/pdf/2605.26520
Abstract While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.
中文摘要 虽然视觉语言模型（VLM）已展现出多轮视觉推理能力，但其推理轨迹仍相对浅显，且由以文本为中心的范式主导，限制了其在复杂视觉挑战中的适用性。相比之下，类人思维通常涉及带有交错视觉-文本思维链（VT-CoT）的长程推理。为弥合这一差距，我们提出了 InterSketch，一种交错推理模型，通过自我纠正和逐步奖励机制增强 VT-CoT 能力。InterSketch 利用外部工具动态生成中间视觉草图，并将其与文本推理交错，从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体来说，在第一个冷启动阶段，我们提出了一个合成的高质量交错 VT-CoT 数据集，并引入反思机制，使模型具备多轮交错推理和自我纠正的能力。在随后的强化学习（RL）阶段，我们设计了一种逐步奖励机制，以缓解长程推理中仅对最终结果进行监督所固有的奖励信号稀疏问题。在视觉推理基准上的大量实验证明了 InterSketch 的有效性，其表现甚至超过了 Gemini-3-Pro 等专有模型。

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit：通过不确定性引导自适应分流实现的连续音频表示学习

Authors: Minh K. Quan, Pubudu N. Pathirana
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26523
Pdf link: https://arxiv.org/pdf/2605.26523
Abstract Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.
中文摘要 大批量对比学习（CL）是现代表示学习的基础，但它与边缘设备多变的资源约束根本不兼容。这种冲突带来了两难：设备端的小批量会降低模型保真度，而卸载到云端则会产生不可接受的延迟和带宽成本。现有解决方案常常依赖静态模型压缩，无法适应边缘环境的运行时波动。为弥合这一差距，我们提出了 StreamSplit，这是一个新颖的框架，使流式对比学习在异构 ARM 客户端平台上变得实用。StreamSplit 解决了环境音频的连续性与 CLAP、COLA 等模型的离散批处理需求之间的矛盾。我们引入：（1）一个基于分布的流式框架，将表示质量与本地批次大小解耦，利用易于处理的混合损失（Hybrid Loss）在更新稀疏的情况下保持保真度；以及（2）一个不确定性引导的自适应划分器，采用轻量级强化学习（RL）策略动态划分计算。其独特之处在于，该策略将实时资源监控与嵌入歧义性相结合，即时优化准确率与延迟的权衡。我们在多种硬件上评估了 StreamSplit，从资源受限的 Raspberry Pi 4 到高性能的 Apple M2。结果表明，与以服务器为中心的基线相比，StreamSplit 将每样本延迟最多降低 4.7 倍，带宽减少 77.1%，能耗降低 52.3%。关键的是，其准确率与以服务器为中心的模型相差不超过 2.2%，证明自适应的分布式学习是现代边缘生态系统的一条可行路径。

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

焦点奖励：基于评分标准的奖励下的平衡强化学习

Authors: Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26579
Pdf link: https://arxiv.org/pdf/2605.26579
Abstract The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubric dimensions. Under this bottleneck, even if LLMs achieve relatively high rewards after training, they may still exhibit severe deficiencies in certain dimensions, leading to a direct deterioration in user experience. To address this problem, we propose Focal Reward, a novel objective to automatically balance the training of reinforcement learning under rubric-based rewards. Specifically, we first leverage an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis to calibrate the reward direction. Then, the final objective is designed with an automatically reweighting coefficient for each criterion to achieve the fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that our Focal Reward method outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.
中文摘要 LLM 的开放式生成通常需要多维评分标准（rubric），以充分评估质量并指导强化学习的改进。然而，这种训练范式固有的一个关键难题是不同评分维度上奖励极化的不平衡。在这一瓶颈下，即使 LLM 在训练后获得相对较高的奖励，它们在某些维度上仍可能存在严重缺陷，直接导致用户体验下降。为解决这一问题，我们提出了 Focal Reward，这是一个新颖的目标，用于在基于评分标准的奖励下自动平衡强化学习的训练。具体来说，我们首先利用逆向奖励投影机制估计评分标准中每个准则的饱和程度，这构成了校准奖励方向的基础。然后，最终目标为每个准则设计了自动重加权系数，以实现细粒度的平衡。在三个模型规模和六个基准上的大量实验表明，我们的 Focal Reward 方法在全部 18 组模型-基准比较中都优于最强的静态聚合基线。Rollout、机制和消融分析进一步表明，这些收益来自在线的、感知饱和度的重新分配，即向仍有改进空间的评分准则倾斜。

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

把 Rollout 用在刀刃上：面向基于组的强化学习后训练的 Rollout 分配

Authors: Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26606
Pdf link: https://arxiv.org/pdf/2605.26606
Abstract Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.
中文摘要 强化学习（RL）是大型语言模型后训练的主导范式。然而，在在线、同策略（on-policy）的设置下，rollout 生成占据了训练计算成本的主要部分。基于组的策略优化方法利用每个提示的多个 rollout 来计算优势，但它们不加区分地把预算分配给奖励分布已坍缩的提示，把昂贵的 rollout 浪费在可忽略的学习信号上。我们证明，基于组的更新在奖励方差较高的情形下最为有效。由于策略在训练过程中不断演化，提示的信息量必须在线估计，而不能预先计算，但对每个提示进行穷举评估在计算上代价过高。我们提出了 Pilot-Commit，一个面向基于组的强化学习后训练的预算感知 rollout 分配框架。Pilot-Commit 将提示评估与利用解耦：试探（pilot）阶段使用一小部分预算估计每个提示的信息量，其余 rollout 则分配给高杠杆的提示，低信号的提示被跳过。在多个数学推理基准以及从 1.5B 到 14B 参数的模型规模上，Pilot-Commit 以显著更低的采样成本达到与基线相当的准确率；按累计 rollout 数计算，其达到目标准确率的速度最多比 GRPO 快 $1.9\times$，比 DAPO 快 $4.0\times$。
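The pilot/commit split described above can be sketched as a two-stage budget allocation. This is a minimal illustration, assuming that per-prompt informativeness is measured by the variance of pilot rewards (the abstract only says group updates work best under high reward variance); `pilot_commit`, `sample_reward`, and all parameter names are hypothetical.

```python
from statistics import pvariance


def pilot_commit(prompts, sample_reward, total_budget,
                 pilot_per_prompt=4, group_size=8, min_var=1e-6):
    """Hypothetical two-stage rollout allocation.

    Pilot: spend a few rollouts per prompt and estimate reward variance.
    Commit: give full groups to the highest-variance prompts, skip the rest.
    """
    # Pilot stage: a small number of rollouts for every prompt.
    pilot = {p: [sample_reward(p) for _ in range(pilot_per_prompt)]
             for p in prompts}
    variance = {p: pvariance(pilot[p]) for p in prompts}
    remaining = total_budget - pilot_per_prompt * len(prompts)

    # Commit stage: walk prompts from most to least informative.
    allocation = {p: 0 for p in prompts}
    for p in sorted(prompts, key=lambda q: variance[q], reverse=True):
        if remaining < group_size or variance[p] <= min_var:
            break  # budget exhausted, or only collapsed prompts remain
        allocation[p] = group_size
        remaining -= group_size
    return pilot, allocation
```

With binary rewards, a prompt the policy always solves or always fails has zero pilot variance and receives no further rollouts, while a prompt solved about half the time is prioritized.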

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1：面向体数据推理分割的奖励驱动证据定位

Authors: Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26621
Pdf link: https://arxiv.org/pdf/2605.26621
Abstract Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.
中文摘要 体数据推理分割（VRS）旨在根据自由形式的临床查询，在三维医学扫描中分割出目标区域，其中指代对象通常是隐含的，既需要医学知识，也需要基于体数据的推理。现有方法通常依赖专用的分割 token 将语言与掩码解码连接起来，但这种耦合把决策过程压缩进不透明的潜在表征，限制了可解释性以及对多样化叙述表达的泛化能力。本文提出 MedVol-R1，一个基于强化学习的 VRS 框架，它显式地将证据定位与体积勾画解耦：LVLM 将临床推理落实到可验证的二维证据锚点（关键轴向切片和二维边界框）上，再由冻结的 MedSAM2 模块将其传播为连贯的三维掩码。我们先用冷启动监督微调、再用 GRPO 训练 MedVol-R1，并以多分量奖励加以引导，鼓励选择信息丰富的证据、准确的二维空间定位以及跨切片的体积一致性，且无需昂贵的思维链标注。在 M3D-Seg 基准中的 CT-ORG、AbdomenCT-1K 和 KiTS23 上的实验表明，MedVol-R1 始终优于强基线并达到最先进的性能，且强化学习相比纯监督微调带来了明显的提升。

Breaking the Epistemic Trap: Active Perception Under Compound Uncertainty

打破认识论陷阱：复合不确定性下的主动感知

Authors: Chayan Banerjee, Ethan Goan
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.26627
Pdf link: https://arxiv.org/pdf/2605.26627
Abstract Deploying reinforcement learning in safety critical domains, from autonomous vehicles to medical decision support, is constrained by failures arising when systems encounter unfamiliar conditions. We argue that the fundamental bottleneck is not individual challenges like changing dynamics or incomplete observations, but their synergistic interaction, which we term the Epistemic Trap: agents cannot estimate their state without knowing system dynamics, nor learn dynamics without accurate state information. Proof-of-concept experiments in simulated locomotion reveal that combining these uncertainties causes failures far worse than either challenge alone, a 77% performance degradation against the 46% by adding the individual effects, demonstrating compounding failure modes that conventional methods overlook. Such approaches adopt a passive epistemic stance that cannot resolve this coupled uncertainty. We propose reframing safety as an information problem, introducing an Adaptive Safety Architecture built around three contributions: the Compound Uncertainty Coefficient ($\kappa$), a mutual information based metric that quantifies state dynamics coupling and is computable online without full joint belief inference; information seeking policies governed by a MaxInfoRL objective that actively probe system dynamics; and regime-adaptive safety constraints that tighten as epistemic coupling rises. This paradigm shift, from passive robustness to active perception, offers a principled path toward decision making systems that operate under uncertainty, recognize their own ignorance, and act strategically to resolve it.
中文摘要 在从自动驾驶到医疗决策支持的安全关键领域部署强化学习，受制于系统遇到陌生条件时出现的故障。我们认为，根本瓶颈不在于动力学变化或观测不完整等单个挑战，而在于它们之间的协同作用，我们称之为“认识论陷阱”（Epistemic Trap）：智能体在不了解系统动力学的情况下无法估计自身状态，而在没有准确状态信息的情况下也无法学习动力学。模拟运动任务中的概念验证实验表明，这些不确定性叠加时造成的失效远比任一单独挑战严重：性能下降 77%，而将各自的影响简单相加仅为 46%，这揭示了传统方法所忽视的复合失效模式。这类方法采取被动的认识立场，无法化解这种耦合的不确定性。我们提出将安全重新表述为一个信息问题，并引入围绕三项贡献构建的自适应安全架构：复合不确定性系数（$\kappa$），一种基于互信息的度量，用于量化状态与动力学的耦合，且无需完整的联合信念推断即可在线计算；由 MaxInfoRL 目标驱动的信息寻求策略，主动探测系统动力学；以及随认识耦合增强而收紧的、随情形自适应的安全约束。这种从被动鲁棒性到主动感知的范式转变，为决策系统提供了一条有原则的路径，使其能够在不确定性下运行、认识到自身的无知，并有策略地采取行动加以消除。

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

UnityMAS-O：基于LLM的多智能体系统的通用强化学习优化框架

Authors: Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.26646
Pdf link: https://arxiv.org/pdf/2605.26646
Abstract LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
中文摘要 基于 LLM 的多智能体系统将复杂任务分解为相互交互的角色，但大多数系统仍通过提示、工具和控制规则手动编排，智能体很少通过统一的强化学习接口进行优化。现有的强化学习后训练框架主要针对单一策略优化，缺乏对用户自定义多智能体工作流、结构化交互、角色特定的信用分配以及可配置参数共享的抽象。我们提出 UnityMAS-O，一个面向基于 LLM 的多智能体系统的通用强化学习优化框架。UnityMAS-O 将完整的工作流视为优化单元，而非单个响应或策略轨迹。它通过四类一等对象来表示工作流：逻辑智能体角色、图轨迹、用户自定义奖励以及智能体-模型映射。这将逻辑智能体与物理模型参数解耦，支持完全共享、完全分离和部分共享，奖励可在角色、轮次和轨迹层级上分配。UnityMAS-O 以基于 Ray 的星型拓扑运行时扩展了 verl。中央控制器负责执行工作流、调用工具、记录结构化轨迹并汇总奖励；各模型本地的工作组负责 rollout、缓冲、优势计算以及分布式 PPO 风格的更新。用户可以定义智能体、工作流、模型映射和奖励，而无需重写优化基础设施。我们在检索增强问答、迭代式智能体搜索和反思式代码生成上实例化了 UnityMAS-O。在 Natural Questions、HotpotQA 和留出的代码任务上，多智能体强化学习在优化后改进了手动指定的工作流，对较小的模型和严格的代码全部通过（all-passed）指标提升尤为显著。这些结果表明，UnityMAS-O 可以作为可复用的基础，将多样化的基于 LLM 的多智能体工作流转换为可训练的多智能体强化学习系统。

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

零和马尔可夫博弈鞍点上的双层优化

Authors: Zihao Zheng, Irwin King, Songtao Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.26654
Pdf link: https://arxiv.org/pdf/2605.26654
Abstract Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $\epsilon$-stationary point in $\tilde{\mathcal{O}}(\epsilon^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
中文摘要 强化学习（RL）通常具有层级结构：上层（UL）学习者选择模型参数，下层（LL）决策过程作出响应，这自然导出一个双层优化问题。大多数现有的双层强化学习方法假设下层是单策略马尔可夫决策过程（MDP），因此无法刻画激励设计等应用中多个策略相互作用所产生的竞争结构。我们研究这样一类双层优化问题：LL 问题是一个带正则化的极小极大零和马尔可夫博弈，UL 目标通过 LL 博弈所诱导的鞍点均衡进行优化。本文提出惩罚增强的 Nikaido-Isoda 下降-上升方法（PANDA），这是一种以 Nikaido-Isoda 函数为基础、基于惩罚的一阶策略梯度方法。通过利用极小极大博弈结构，PANDA 避免了计算 UL 超梯度，也不需要二阶信息。我们证明，在不对 UL 或 LL 目标作凸性假设的情况下，PANDA 收敛到驻点。此外，PANDA 在 $\tilde{\mathcal{O}}(\epsilon^{-1})$ 次迭代内达到 $\epsilon$-驻点，样本复杂度为 $\tilde{\mathcal{O}}(\epsilon^{-3})$，与下层为单策略 MDP 的双层强化学习的已知最优速率相当。实验表明，PANDA 的性能优于密切相关的基线方法。
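For readers unfamiliar with the Nikaido-Isoda function, the block below writes it out for a generic two-player zero-sum game. The symbols are assumptions chosen for illustration: $f(x, y)$ is the payoff, $x$ is the minimizing player's strategy and $y$ the maximizing player's. How PANDA turns this quantity into a penalty term is not stated in the abstract and is not reproduced here.

```latex
% Nikaido-Isoda function of a zero-sum game with payoff f(x, y):
% the total gain available if each player deviates unilaterally to (x', y').
\Psi\big((x, y), (x', y')\big) = f(x, y') - f(x', y)

% Maximizing over deviations gives a nonnegative gap function:
V(x, y) = \max_{x', y'} \Psi\big((x, y), (x', y')\big)
        = \max_{y'} f(x, y') - \min_{x'} f(x', y) \;\ge\; 0

% V(x, y) = 0 holds exactly when (x, y) is a saddle point of f.
```

Because $V$ vanishes only at saddle points, it can serve as a measure of how far a strategy pair is from the lower-level equilibrium.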

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

WINDQuant：面向全局混合精度大型语言模型量化的权重感知神经决策

Authors: Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26660
Pdf link: https://arxiv.org/pdf/2605.26660
Abstract Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often suffer from severe accuracy degradation, while quantization-aware training requires costly retraining and additional resources. Moreover, most mixed-precision strategies rely on coarse-grained or heuristic sensitivity analysis that overlooks fine-grained variations within weight matrices. We propose WINDQuant, a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. Rather than introducing another low-level quantization operator, WINDQuant learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. By operating at the column-chunk level, WINDQuant enables flexible and fine-grained precision assignment within layers under a global target bit-width. The implementation combines PPO with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit effective-bit accounting of the learned mixed-precision plan. Experiments on LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while reducing optimization overhead relative to retraining-based approaches, highlighting reinforcement learning as a practical controller for adaptive mixed-precision quantization.
中文摘要 量化是降低大型语言模型（LLM）内存占用和推理成本的有效方法，但在超低比特范围内保持性能仍具挑战性。现有的训练后量化方法常常出现严重的精度下降，而量化感知训练则需要代价高昂的重新训练和额外资源。此外，大多数混合精度策略依赖粗粒度或启发式的敏感性分析，忽略了权重矩阵内部的细粒度差异。我们提出 WINDQuant，一种基于强化学习的分配控制器，用于超低比特 LLM 量化。WINDQuant 并不引入又一个底层量化算子，而是学习如何在全局存储预算下为细粒度的列块分配位宽和量化处理方式。通过在列块层面操作，WINDQuant 能够在全局目标位宽下实现层内灵活且细粒度的精度分配。其实现将 PPO 与激活感知校准、轻量级的逐单元量化器拟合，以及对所学混合精度方案的显式有效比特核算相结合。在 LLaMA 模型上的实验表明，WINDQuant 在超低比特设置下取得了有竞争力的性能，同时相对于基于重新训练的方法降低了优化开销，凸显了强化学习可作为自适应混合精度量化的实用控制器。
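The "explicit effective-bit accounting" under a global budget can be illustrated with a small calculation. This is a sketch under stated assumptions: each column chunk is described by a hypothetical tuple `(num_weights, bit_width, overhead_bits)`, where `overhead_bits` stands in for per-chunk metadata such as scales; the abstract does not specify what the authors count as overhead.

```python
def effective_bits(chunks):
    """Average stored bits per weight, including per-chunk overhead.

    chunks: list of (num_weights, bit_width, overhead_bits) tuples.
    """
    total_weights = sum(n for n, _, _ in chunks)
    total_bits = sum(n * bits + overhead for n, bits, overhead in chunks)
    return total_bits / total_weights


def within_budget(chunks, target_bits):
    """Check a mixed-precision plan against a global target bit-width."""
    return effective_bits(chunks) <= target_bits


# A toy plan: two 128-weight chunks at 2 and 4 bits, one 256-weight chunk
# at 1 bit, each carrying 32 bits of (assumed) metadata.
plan = [(128, 2, 32), (128, 4, 32), (256, 1, 32)]
avg = effective_bits(plan)
```

Counting overhead matters most at very low bit-widths, where metadata can add a noticeable fraction of a bit per weight.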

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

超越轨迹级归因：面向智能体强化学习的基于图的信用分配

Authors: Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26684
Pdf link: https://arxiv.org/pdf/2605.26684
Abstract Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.
中文摘要 基于组的强化学习（RL）方法在提升大型语言模型（LLM）性能方面取得了显著成功，并已迅速扩展到智能体任务。然而，它们的信用分配严重依赖于根据最终结果进行的粗粒度轨迹级归因，难以刻画单个步骤的贡献，例如隐藏在失败轨迹中的有价值步骤。为了挖掘潜在信息并实现更忠实的步骤级信用分配，我们提出了基于图的组策略优化（GraphGPO），该方法首先将所有 rollout 轨迹聚合为统一的状态转移图，然后利用图中编码的全局信息估计每个状态到任务目标的距离。最后，GraphGPO 根据每次转移使到任务目标的距离缩短了多少，估计基于图的优势，从而为每条边分配信用。通过这种方式，GraphGPO 显著提升了训练效率，并在一系列具有挑战性的基准上取得了最先进的性能。
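The three steps described above (merge rollouts into a graph, estimate distance to the goal, credit each edge by distance reduction) can be sketched as follows. This is an illustration only: it assumes hashable discrete states, a single known goal state, and unweighted shortest-path distance via breadth-first search, none of which the abstract specifies. The function names and the `unreachable` penalty are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(trajectories):
    """Merge all rollout trajectories into one set of directed edges."""
    edges = set()
    for traj in trajectories:
        for state, next_state in zip(traj, traj[1:]):
            edges.add((state, next_state))
    return edges


def distance_to_goal(edges, goal):
    """Breadth-first search backward from the goal over reversed edges."""
    reverse = defaultdict(list)
    for state, next_state in edges:
        reverse[next_state].append(state)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for prev in reverse[node]:
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist


def edge_advantages(edges, dist, unreachable):
    """Credit each edge by how much it reduces the distance to the goal."""
    return {
        (s, t): dist.get(s, unreachable) - dist.get(t, unreachable)
        for s, t in edges
    }


rollouts = [
    ["A", "B", "C", "G"],   # successful trajectory reaching goal G
    ["A", "B", "D"],        # failed trajectory
    ["A", "E", "C", "X"],   # failed, but E -> C is a useful step
]
edges = build_graph(rollouts)
dist = distance_to_goal(edges, goal="G")
adv = edge_advantages(edges, dist, unreachable=5)
```

In this toy example the step `E -> C` earns positive credit even though its trajectory failed, because the merged graph shows that `C` is one step from the goal.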

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

在Oracle预算下，利用生物引导搜索进行蛋白质设计的自我提升模仿

Authors: Ashima Khanna, Dominik Grimm
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2605.26690
Pdf link: https://arxiv.org/pdf/2605.26690
Abstract Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: this https URL
中文摘要 在严格的预言机（oracle）预算下进行蛋白质序列优化，需要既能探索庞大组合空间、又能让每次评估都提供信息的方法。现有的强化学习和离策略生成方法在代理模型噪声下常常退化，而与位置无关的突变提议有破坏功能关键残基的风险。我们提出 SILO，一个用于预言机预算受限的蛋白质设计的轨迹级自我改进模仿框架。SILO 采用分层编辑策略，将每个突变分解为先选择位置、再选择残基。在每一轮主动学习中，策略通过增量式无放回随机束搜索（SBS）采样候选轨迹，基于 UCB 的代理模型集成结合丙氨酸扫描适应度评分（AFS），选出带有功能相关编辑的候选，用于计算机模拟（in silico）的预言机评估。随后，策略通过在本轮最佳的预言机标注轨迹上进行下一动作交叉熵模仿来更新，从而避免了价值函数估计。在八个复现的蛋白质适应度景观和五个来自先前工作的强基线上，SILO 在我们的评估中于全部 8 个景观上都取得了最高的最大适应度和前 100 平均适应度，且通常表现出更快的早期提升。在每种设置各取两个景观的低数据和噪声代理压力测试中，当若干基线退化时，SILO 仍保持有竞争力或最佳。消融实验表明，SBS 与 AFS 贡献了大部分收益，迭代模仿则带来了额外的提升。代码可在以下 https URL 获取
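The UCB-based selection over a proxy ensemble can be sketched with the textbook score "ensemble mean plus `beta` times ensemble standard deviation". This is an assumption: the abstract does not give SILO's exact UCB form, and the alanine-scan fitness score (AFS) that SILO combines with it is not modeled here. `ucb_score`, `select_candidates`, and the toy proxy models are hypothetical.

```python
from statistics import mean, pstdev


def ucb_score(predictions, beta=1.0):
    """Optimistic score: ensemble mean plus beta times ensemble spread."""
    return mean(predictions) + beta * pstdev(predictions)


def select_candidates(candidates, ensemble, k, beta=1.0):
    """Rank candidates by UCB over the ensemble and keep the top k."""
    scored = [
        (ucb_score([model(c) for model in ensemble], beta), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Toy stand-ins for learned fitness proxies (not real predictors).
ensemble = [
    lambda seq: seq.count("A"),
    lambda seq: seq.count("A") + seq.count("G"),
]
chosen = select_candidates(["CCC", "AGG", "AAAA"], ensemble, k=2)
```

A larger `beta` favors candidates on which the proxies disagree, which spends scarce oracle calls on sequences where an evaluation is most informative.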

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

注意工具失效：为医疗智能体实现协同的工具增益

Authors: Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han, Weimiao Yu, Guangnan Ye, Chen Jiang, Yuan Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26691
Pdf link: https://arxiv.org/pdf/2605.26691
Abstract Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Particularly, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
中文摘要 医疗 AI 智能体越来越多地使用外部工具进行诊断、治疗建议和证据检索，但大多数现有方法都假设与任务相适的工具在其预期范围内是可靠的。这一假设在真实临床环境中十分脆弱，即使是相关的工具也可能在困难实例上失效，并导致不安全的下游决策。为解决这一问题，我们研究不完美工具设置下的医疗工具使用，以纠正被单个工具遗漏的失效实例。依赖于实例的失效模式在最佳固定单一工具与理想的逐实例选择器之间造成了差距，我们称之为 Single-Oracle 风险差距。核心挑战在于，传统的任务级工具选择无法兑现这一差距，因为它本质上受限于最佳单一工具的性能。基于这一观察，我们考虑实例级的异质性，并将工具使用表述为一个实例级选择问题。具体而言，我们提出了一个基于 GRPO 的强化学习框架，其奖励面向概率风险最小化和分歧感知的协同学习，促进在实例层面纠正错误的工具共识。此外，我们采用熵引导的采样策略来提高高分歧实例的权重，这些实例为学习实例特定的工具协同提供了更强的信号。这两个组成部分在缓解实例级异质性和提升工具协同方面相辅相成。在两项任务和七个医学基准上的实验表明，我们的方法相对于广泛的基线持续取得稳健且稳定的改进，凸显了协同感知的工具使用对可靠医疗智能体系统的重要性。
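The Single-Oracle risk gap can be made concrete with a small calculation. The sketch assumes 0/1 error as the risk and an ideal selector that picks a correct tool whenever at least one exists; the abstract does not state the paper's exact risk definition, and the function name and toy data are hypothetical.

```python
def single_oracle_gap(correct):
    """Compare the best fixed tool with an ideal instance-wise selector.

    correct: dict mapping tool name -> list of 0/1 outcomes per instance.
    Returns (best_single_risk, oracle_risk, gap).
    """
    outcomes = list(correct.values())
    n = len(outcomes[0])
    # Best fixed tool: lowest error rate when one tool is used everywhere.
    best_single_risk = min(1 - sum(v) / n for v in outcomes)
    # Ideal selector: fails only where every tool fails.
    oracle_risk = sum(
        1 for i in range(n) if not any(v[i] for v in outcomes)
    ) / n
    return best_single_risk, oracle_risk, best_single_risk - oracle_risk


# Three tools, four instances; each tool is right on a different half.
results = {
    "tool_a": [1, 1, 0, 0],
    "tool_b": [0, 1, 1, 0],
    "tool_c": [1, 0, 1, 0],
}
best_single, oracle, gap = single_oracle_gap(results)
```

Here every fixed tool errs on half the instances while the ideal selector errs on a quarter, so task-level selection leaves a 0.25 gap that only instance-level selection can close.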

KARMA: Karma-Aligned Reward Model Adaptation

业力：业力对齐奖励模型适应

Authors: Jared Scott, Jesse Roberts
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.26738
Pdf link: https://arxiv.org/pdf/2605.26738
Abstract Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.
中文摘要 人类交流依赖于隐含的社会信号，其有效性取决于语气、语境和会话规范，而不仅仅是语义内容。我们提出 KARMA（Karma-Aligned Reward Model Adaptation，业力对齐奖励模型适应），这是一个让 LLM 从大规模社交互动数据中学习语境敏感的对话行为的框架。KARMA 在 Reddit 对话上训练奖励模型，以预测以上下文为条件的回复评价，并利用该信号通过强化学习微调语言模型，以提升其在语用介导任务上的表现。关键的是，我们发现表现最好的奖励模型并不会带来更好的下游模型对齐：一个完全依赖对话上下文的奖励模型对 Reddit karma 的预测更差，却带来了明显更好的下游表现。我们评估了 KARMA 应用于下游模型的效果，分别考察下游模型直接接触和不直接接触社交媒体数据的情形。所得模型表现出改善的语用介导行为，且不良副作用在很大程度上得到缓解。在所有条件下，KARMA 都会持续降低事实性，包括下游模型未直接接触 Reddit 数据的情形，这表明这种张力内嵌于奖励信号本身，而非由带噪声的训练数据引入。

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

最坏情况设施损失下鲁棒覆盖网络的对抗训练

Authors: Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26763
Pdf link: https://arxiv.org/pdf/2605.26763
Abstract The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize the coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, making it effectively capture the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.
中文摘要 最大覆盖选址-拦截问题（MCLIP）是一个经典的双层优化问题，它是韧性基础设施规划的基础，但在计算上仍然难以求解。具体来说，上层确定设施位置以最大化覆盖范围，而下层执行最坏情况下的拦截以最小化覆盖范围。上下两层之间的强耦合，加上各自的高组合复杂性，使传统方法失效。为弥合这一差距，我们提出了基于对抗学习的双智能体深度强化学习（DADRL）框架，由对应上层的选址智能体和对应下层的拦截智能体组成。我们的贡献有三方面：（1）选址智能体与不断演变的拦截智能体同时进行对抗训练，使其有效捕捉上下层之间的动态竞争关系；（2）为充分利用拦截智能体学到的能力，我们提出一种基于代理模型的集成推理策略，将训练好的拦截智能体作为高保真代理模型，指导选址智能体的决策；（3）在合成和真实世界数据集上的大量实验表明，与其他基线相比，我们的方法在保持极具竞争力的解质量的同时，实现了更高的计算效率。此外，我们的DADRL框架不依赖于特定的网络结构，其底层的对抗学习范式展现出解决其他双层优化问题的强大潜力。
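To make the bi-level objective in the abstract above concrete, here is a minimal brute-force sketch of MCLIP on a toy instance. It is not the DADRL method; it simply enumerates both levels exhaustively. The function names, the `covers` mapping, and the unit demands are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def coverage(open_sites, covers, demand):
    """Total demand of the nodes covered by at least one open site."""
    covered = set().union(*(covers[s] for s in open_sites))
    return sum(demand[node] for node in covered)


def worst_case_coverage(sites, covers, demand, r):
    """Lower level: the interdictor removes r sites to minimize coverage."""
    return min(
        coverage(set(sites) - set(hit), covers, demand)
        for hit in combinations(sites, r)
    )


def solve_mclip(candidates, covers, demand, p, r):
    """Upper level: pick p sites whose post-interdiction coverage is largest."""
    return max(
        combinations(candidates, p),
        key=lambda sites: worst_case_coverage(sites, covers, demand, r),
    )


# Hypothetical toy instance: three candidate sites, five unit-demand nodes.
covers = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {5}}
demand = {node: 1 for node in range(1, 6)}
best = solve_mclip(["A", "B", "C"], covers, demand, p=2, r=1)
```

In this toy instance the pairs (A, B) and (A, C) cover the same demand before interdiction, but only (A, B) keeps most of it after one site is lost. Exhaustive enumeration grows combinatorially in both levels, which is the intractability the abstract refers to.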

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

基于混合专家的面向泛化的车辆路径问题模型

Authors: Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26776
Pdf link: https://arxiv.org/pdf/2605.26776
Abstract In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhances expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.
中文摘要 近年来，深度强化学习（DRL）在车辆路径问题（VRP）方面取得了显著进展。然而，现有基于DRL的方法通常在由均匀分布生成的实例上训练，这限制了它们在现实分布偏移下的表现。本文旨在开发一种面向泛化的模型，将策略网络划分为多个模块，并在推理过程中自适应地重新组合模块以形成特定的策略。具体来说，我们提出了带有实例级门控的残差精炼专家（R2E-IG），以提升跨分布泛化能力。我们的贡献有三方面：（1）引入残差精炼专家（Residual Refined Expert，R2E）架构，通过残差精炼提升专家的表达能力；（2）设计了实例级门控机制，学习分布感知的实例表示并将输入路由到合适的模块；（3）提出一种配备动态权重自适应（DWA）的混合分布训练机制，动态地对来自不同分布的训练数据重新加权，以强调信息量更大的分布。大量实验表明，在合成数据集和基准数据集上，无论是分布内还是分布外实例，R2E-IG相对于最先进的基线都取得了有竞争力的性能。此外，R2E-IG具有通用性，可以轻松集成到现有基于DRL的方法中，进一步提升性能。

Ratio-Variance Regularized Policy Optimization

比率-方差正则化策略优化

Authors: Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26784
Pdf link: https://arxiv.org/pdf/2605.26784
Abstract Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce ${\bf R}^2{\bf VPO}$ (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
中文摘要 标准的同策略（on-policy）强化学习依赖启发式裁剪来强制执行信任域，但该机制会不加区分地截断高回报但高偏离度的更新，代价高昂。我们证明，显式约束策略比率的方差可以为信任域约束提供有原则的局部近似，从而不再需要二元的硬裁剪。作为一种分布层面的“软制动”，这种方法保留了来自新发现的关键梯度信号，同时自然地降低陈旧的异策略（off-policy）数据的权重，并使其得以复用。我们引入了${\bf R}^2{\bf VPO}$（比率-方差正则化策略优化），它通过原始-对偶优化框架实现该约束。在$7$种规模的LLM（涵盖快速和慢速推理范式）以及$10$个机器人控制任务上的大量评估展示了该方法的通用性。R$^2$VPO 在数学推理基准上取得了显著的性能提升，在较小模型上的提升尤为明显，同时显著提高了样本效率。此外，它在连续控制领域持续优于PPO基线，尤其是在稀疏奖励和动态环境中。这些发现共同确立了比率-方差正则化作为稳定且数据高效的策略优化的原则性基础。
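The idea of replacing hard clipping with a variance constraint on the policy ratio, enforced through a primal-dual scheme, can be illustrated with a generic Lagrangian sketch. This is an assumption-laden illustration rather than the R$^2$VPO objective from the paper: the variance budget `delta`, the multiplier `lam`, the dual step size, and both function names are hypothetical.

```python
import math


def ratio_variance_objective(logp_new, logp_old, advantages, lam, delta):
    """Unclipped surrogate minus a Lagrangian penalty on ratio variance.

    Returns (lagrangian, variance). `delta` is an assumed variance budget.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    count = len(ratios)
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / count
    mean_ratio = sum(ratios) / count
    variance = sum((r - mean_ratio) ** 2 for r in ratios) / count
    return surrogate - lam * (variance - delta), variance


def dual_update(lam, variance, delta, step_size):
    """Projected dual ascent: raise lam when the variance budget is exceeded."""
    return max(0.0, lam + step_size * (variance - delta))


# On-policy batch: identical log-probs give ratios of 1 and zero variance.
value, var = ratio_variance_objective(
    [-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0], lam=2.0, delta=0.1
)
```

In a sketch like this, samples whose ratios drift far from the batch mean raise the variance term and are penalized smoothly, instead of having their gradient zeroed as a clipped objective would do.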

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

SeDT：面向多轮对话可靠性的 Sentence-Transformer 与 Decision-Transformer 条件化

Authors: Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26788
Pdf link: https://arxiv.org/pdf/2605.26788
Abstract Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure: best-case aptitude falls by only 16%, while unreliability more than doubles (+112%). We argue that the root cause is structural: a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT (Sentence-transformer Decision-Transformer), a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary signals (semantic, lexical, and positional) and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark with three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains of up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
中文摘要 大型语言模型（LLM）在任务于单轮内被完整指定时表现令人印象深刻，但当同一任务在多轮对话中逐步给出时，相同模型会损失高达39%的性能，这一现象已被大规模记录，称为“对话中迷失”（Lost in Conversation）。关键在于，这种崩溃几乎完全是可靠性失效：最佳情况下的能力（aptitude）仅下降16%，而不可靠性却增加了一倍多（+112%）。我们认为根本原因是结构性的：扁平的对话历史为此前每一轮赋予了相同的隐含权重，模型没有任何信号来区分关键约束和无关紧要的对话。我们提出SeDT（Sentence-transformer Decision-Transformer），这是一种无需训练的推理时方法，通过引入离线强化学习中的 return-to-go 条件化来解决这一问题。SeDT为每个对话分片标注一个累计相关性得分，该得分由语义、词汇和位置三种互补信号得出，并在最后一轮向模型呈现完整的带标注历史，不改变权重，不需要训练数据，也不丢弃上下文。在 Lost-in-Conversation 基准上对三个LLM和三种生成任务进行评估，SeDT在全部九种模型-任务组合中均优于分片基线，平均性能P提升最高达+37.7%，并且在九种组合中的七种同时降低了不可靠性。简而言之，告诉模型哪些过往轮次重要，就足以大幅挽回在对话中损失的性能。
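As a rough illustration of return-to-go style annotation, the sketch below combines three per-shard signals into a relevance score and tags each shard with the sum of relevance from that shard to the end of the conversation, in the way a Decision Transformer uses return-to-go. The equal weights, the `[rtg=...]` tag format, and the hand-picked signal values are assumptions; the abstract does not specify how SeDT computes or renders its scores.

```python
def annotate_shards(shards, signals, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Prefix each shard with a return-to-go style cumulative relevance.

    `signals` holds one (semantic, lexical, positional) triple per shard.
    """
    relevance = [
        sum(w * s for w, s in zip(weights, triple)) for triple in signals
    ]
    # Return-to-go at turn t is the sum of relevance over turns t..end.
    returns_to_go = []
    running = 0.0
    for score in reversed(relevance):
        running += score
        returns_to_go.append(running)
    returns_to_go.reverse()
    return [f"[rtg={g:.2f}] {text}" for g, text in zip(returns_to_go, shards)]


# Hypothetical two-shard conversation with made-up signal values.
history = annotate_shards(
    ["write a sort function", "it must be stable"],
    [(0.9, 0.6, 0.3), (0.3, 0.3, 0.3)],
)
```

The annotated list would then be joined into the final-turn prompt, so the model sees every shard together with its score and nothing is discarded.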

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

通过从多个不完美指标中进行偏好学习，优化摘要的事实一致性

Authors: Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.26840
Pdf link: https://arxiv.org/pdf/2605.26840
Abstract Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model this http URL individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source this http URL demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.
中文摘要 以评估指标为奖励的强化学习被广泛用于增强语言模型的特定能力。然而，对于事实一致性总结等任务，现有指标尚未充分发展，其效果有限，因为形成模型的信号 http URL 个别事实性指标不可靠，它们的组合能更有效地捕捉多样化的事实错误。我们利用这一洞察引入自动化培训流程，通过汇总不同弱指标的分数，提升摘要的事实一致性。我们的方法避免了复杂的奖励塑造，通过将分数映射到偏好，并过滤掉指标间高度不一致的情况。对于每个源文档，我们通过不同的译码策略生成词汇相似的摘要对，使模型能够从由细微词汇差异引起的事实差异中学习。该方法仅使用源 http URL 构建高质量偏好数据集，展示从早期编码-解码器架构到现代大型语言模型的模型，从早期编码-解码器架构到现代大型语言模型，较小模型的事实性可达与大型模型相当。

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

REVERSE：面向智能体式图像地理定位的强化证据验证与搜索

Authors: Yong Li, Furong Jia, Dacheng Yin, Kang Rong, Fengyun Rao, Jing Lyu, Fan Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.26861
Pdf link: https://arxiv.org/pdf/2605.26861
Abstract Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at this https URL.
中文摘要 图像地理定位旨在确定照片的拍摄地点，这项任务往往不仅仅需要识别可见地标。人类专家通常通过迭代流程来解决：他们检查信息丰富的区域，形成位置假设，寻求外部证据，并在出现新线索时修正判断。现有方法仅部分刻画了这一过程：直接预测方法完全绕过证据获取，而检索增强方法虽然引入外部证据，但通常对在哪里搜索、如何查询以及如何过滤噪声结果等中间决策只提供有限的监督。我们提出了REVERSE框架，它强化证据搜索与验证之间的相互作用，从而实现多轮的智能体式推理。REVERSE教授三个中间决策：看哪里、查询什么、信任哪些证据。为此，我们构建了基于工具的轨迹，其中带有标注的区域选择、搜索观察和地理信息证据标签，并引入了针对视觉定位、查询效用和证据辨别的过程奖励。离线搜索缓存使检索观察在强化学习过程中稳定且可复用，从而实现对噪声搜索结果的密集监督。使用4B模型时，REVERSE 优于强大的检索增强基线，并在 Im2GPS3k 和 YFCC4k 上可与规模大得多的模型相媲美。代码可在此 https URL 访问。

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith：忠实思维链的时空双重视角

Authors: Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26893
Pdf link: https://arxiv.org/pdf/2605.26893
Abstract Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.
中文摘要 思维链（CoT）推理推动了大型语言模型（LLM）的发展，但基于结果的监督导致普遍的事后合理化，产生看似合理却不忠实的推理链。大多数先前的忠实度评估方法要么不可扩展，要么成本高昂，要么不可靠。我们提出了GeoFaith，这是一种时空框架，利用潜在的几何结构和熵动态来诊断并强化忠实推理。我们开发了可扩展的自举（bootstrapping）流程，将步骤级标注从1k扩展到20k个样本，覆盖四个领域；训练了一个在标准基准上优于GPT-5的8B忠实度检测器；并设计了一个忠实度感知的强化学习框架，联合优化结果正确性、过程忠实性和轨迹一致性。实验显示，该方法在忠实度检测和下游推理上均表现优异，能产生更短且更易解释的推理链，同时不牺牲准确性。我们的代码将公开。

Learning to Adapt SFT Data for Better Reasoning Generalization

学习如何调整SFT数据以实现更好的推理泛化

Authors: Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.26924
Pdf link: https://arxiv.org/pdf/2605.26924
Abstract Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at this https URL.
中文摘要 大型语言模型（LLM）取得了显著进步，其中后训练在提升其推理能力方面起着关键作用。在各种后训练范式中，监督微调（SFT）被广泛使用：它利用外部数据提供密集的监督，并实现高效的训练。然而，当数据分布与目标模型自身的分布不匹配时，直接在专家数据上微调可能会损害泛化。在本研究中，我们提出了面向推理微调的数据适应（DART），将如何使用一个固定的、可能存在分布错位的SFT数据集表述为关于示范变换的优化问题。DART通过强化学习训练一个映射模型，将原始SFT数据转化为更符合目标模型分布和学习偏好的模型适配监督。转换后的数据随后用于SFT，使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明，DART提升了泛化能力，比直接强化学习具有更高的训练效率，并帮助模型超越标准SFT。我们的代码可在此 https URL 访问。

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂性：RLVR数据在逻辑推理任务间分配的受控研究

Authors: Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26934
Pdf link: https://arxiv.org/pdf/2605.26934
Abstract Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. (1) Difficulty: beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. (2) Rewarded reasoning form: we consider four abilities core to real-world reasoning, namely deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为推理模型后训练的核心，但现有研究的一个关键局限是其对推理空间的理解过于狭窄：难度仅被视为推理深度，奖励则集中于前向演绎状态追踪。我们改为从两个维度刻画推理空间。其一是难度：除推理深度外，我们还研究环境复杂性，即模型必须在干扰项和相互作用的结构中识别正确的路径。其二是被奖励的推理形式：我们考虑了现实世界推理中核心的四项能力：演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳和类比迁移。为了理清这些因素，我们构建了一个合成知识图谱环境，其预训练和后训练分布均受控，每个实例在深度、复杂度和任务族上变化。得出三个发现：深度与复杂度的联合覆盖优于单轴方案；各推理族的反应并不一致，溯因推理在强化学习覆盖区域之外会退化，任务相关性聚集成演绎-溯因和归纳-类比两对；在固定预算下，均匀混合优于分阶段课程。我们还发现，近期的现成模型也表现出同样的演绎强于溯因的不对称性，表明这一差距不仅仅是我们受控设置的产物。

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

基于同策略内在知识边界增强的高效智能体强化学习

Authors: Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.26952
Pdf link: https://arxiv.org/pdf/2605.26952
Abstract Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at this https URL.
中文摘要 智能体强化学习（RL）已被证明能有效训练具备外部工具使用能力的基于LLM的智能体。然而，我们发现智能体强化学习训练会导致冗余工具调用不断增加，并模糊模型的内在知识边界，即模型无法区分何时需要工具、何时参数化知识已经足够。基于奖励塑形的现有解决方案设定了粗粒度的优化目标，往往会激励不加区分地抑制工具调用，导致奖励投机（reward hacking）。本文提出了AKBE（智能体知识边界增强），这是一种同策略（on-policy）方法，在训练期间通过双路径（带工具和不带工具）采样动态探测模型的内在知识边界。我们将知识边界定义为针对每个实例判定是否需要工具以及所需的最少工具调用次数。通过比较不同路径的正确性，AKBE 对轨迹进行分类，并构建有针对性的监督信号，为每个问题引导高效的工具使用模式。这些信号被无缝集成进智能体强化学习的训练循环。在七个问答（QA）基准上的实验表明，与标准智能体强化学习相比，AKBE将任务准确率平均提升1.85，工具调用减少18%，工具生产率提高25%，且不存在准确率与效率之间的权衡。进一步分析表明了其在不同强化学习算法间的即插即用兼容性，并揭示了每类信号的作用机制。我们的代码可在此 https URL 访问。
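The per-instance knowledge boundary defined in the abstract (whether tools are required, and the minimum number of tool calls) can be sketched as a comparison of the two rollout paths. The category labels, the `Rollout` record, and the decision rule below are illustrative assumptions; the abstract does not list AKBE's actual trajectory categories or supervisory signals.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool    # did this trajectory reach the right answer?
    tool_calls: int  # number of tool invocations it made


def knowledge_boundary(with_tool, no_tool):
    """Return (category, minimum tool calls) for one question.

    Labels are hypothetical: the question is inside the parametric
    boundary if any no-tool rollout is correct, otherwise tools are
    needed if some with-tool rollout succeeds.
    """
    if any(r.correct for r in no_tool):
        return "parametric_sufficient", 0
    solved = [r.tool_calls for r in with_tool if r.correct]
    if solved:
        return "tool_required", min(solved)
    return "unsolved", None


# Hypothetical rollouts for a question that needs at least two tool calls.
label, min_calls = knowledge_boundary(
    with_tool=[Rollout(True, 4), Rollout(True, 2), Rollout(False, 1)],
    no_tool=[Rollout(False, 0), Rollout(False, 0)],
)
```

A training loop could then, for example, treat with-tool rollouts that exceed `min_calls` as redundant for that question, which is one way to avoid the blanket tool-call suppression the abstract warns about.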

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO：面向开放式长文本生成强化学习的组内锦标赛奖励

Authors: Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.26958
Pdf link: https://arxiv.org/pdf/2605.26958
Abstract Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness--efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
中文摘要 开放式长文本生成中的强化学习具有挑战性，因为可靠的参考答案和自动指标往往不可得。现有基于评分细则（rubric）的方法通常依赖逐点的 LLM-as-a-judge 打分，但绝对分数难以在复杂回答之间校准，对同一查询的多个采样结果（rollout）区分度可能较弱，并且在优化过程中容易饱和。我们提出了Tournament-GRPO，一种组内奖励框架，通过在同一查询的采样结果之间反复进行多轮锦标赛，将评分细则引导的LLM判断转化为相对奖励。Tournament-GRPO在组内比较候选结果，累计锦标赛结果，并将其归一化为用于GRPO训练的组内奖励。在 Deep Research Bench 上的实验显示，Tournament-GRPO持续优于现有的奖励设计基线，总体得分比最强基线提高了4.52分。进一步分析显示，锦标赛奖励在效果与效率之间提供了有利的权衡，并且锦标赛设计会影响训练动态。这些结果表明，评分细则引导的锦标赛比较为开放式长文本生成中的强化学习提供了有效的奖励信号。
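The pipeline described above (compare same-query rollouts, accumulate tournament outcomes, normalize into group-wise rewards) can be sketched as follows. The random pairing scheme, the win-count scoring, and the `judge` callable standing in for a rubric-guided LLM comparison are all assumptions; the paper's tournament format may differ.

```python
import random
import statistics


def tournament_rewards(rollouts, judge, rounds=3, seed=0):
    """Turn pairwise judgments into zero-mean, unit-variance group rewards.

    `judge(a, b)` is a hypothetical callable returning True when a beats b.
    """
    rng = random.Random(seed)
    wins = [0.0] * len(rollouts)
    for _ in range(rounds):
        order = list(range(len(rollouts)))
        rng.shuffle(order)
        # Pair neighbours in the shuffled order; an odd one out sits out.
        for i in range(0, len(order) - 1, 2):
            a, b = order[i], order[i + 1]
            winner = a if judge(rollouts[a], rollouts[b]) else b
            wins[winner] += 1.0
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    if std == 0.0:
        return [0.0] * len(wins)  # no signal when every rollout ties
    return [(w - mean) / std for w in wins]


# Toy judge for illustration only: the longer response wins.
group = ["a", "bbb", "cc", "dddd"]
rewards = tournament_rewards(group, judge=lambda x, y: len(x) > len(y))
```

Because the rewards depend only on relative outcomes within the group, they stay informative even when a pointwise judge would give every rollout nearly the same absolute score.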

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

RLVR数据集及其定位：追踪数据谱系以获取更好的训练数据

Authors: Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26971
Pdf link: https://arxiv.org/pdf/2605.26971
Abstract The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at this https URL.
中文摘要 可验证奖励强化学习（RLVR）数据集的激增加剧了来源崩塌（provenance collapse），因为现有数据集之间的谱系关系不明确。为了弥合这一支离破碎的RLVR数据格局，我们提出了基于谱系感知搜索的原子源追踪（ATLAS），这是一个将RLVR数据集追溯至其原子源的系统框架，可将145万个实例中超过99.7%的实例归属于20个原子源。我们的分析显示，大多数RLVR数据集是少数共享上游源的变体，真正引入新数据的很少，且许多数据集面临数据污染风险。这些发现自然促使我们构建新的RLVR数据集DAPO++，并从谱系感知的角度对现有数据集进行基准测试。为此，我们提出以源级反事实归因（SCA）为指导原则，构建一个学习信号集中的去污染训练数据集。本质上，SCA通过将每个原子源对应的强化学习检查点与共享的基础模型进行比较，来衡量样本的边际效用。基于这些归因信号，我们进一步设计了一个与下游RLVR性能高度相关的复合数据集质量评分Q。在Qwen3系列模型上的实验验证了DAPO++能在留出基准上持续提升性能，而Q能可靠地预测下游RLVR训练的效果。我们的代码和数据可在此 https URL 访问。

Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty

信任、几何与规则：面向不确定性下无人水面艇安全导航的可信度感知强化学习框架

Authors: Yuhang Zhang, Shuqi Chai, Yukang Zhang, Liusha Yang, Mingchuan Zhang, Wei Wang, Qingjiang Shi, Quanbo Ge
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.26974
Pdf link: https://arxiv.org/pdf/2605.26974
Abstract Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
中文摘要 在动态海上环境中，实现既安全又符合《国际海上避碰规则》（COLREGs）的无人水面艇（USV）自主导航仍然是一项艰巨的挑战，尤其是在感知系统的不确定性估计失准时。现有基于强化学习（RL）的方法常常失效，因为状态估计误差会产生不可靠的信念状态，从而误导价值函数，而离散的交通规则又会在学习目标中引入不连续性。为应对这些挑战，我们提出了一个整合可信度感知学习、几何安全屏蔽和连续规则感知嵌入的框架。首先，可信度加权价值学习（CW-VL）引入一个动态信任因子，它由滤波器估计的协方差与经验误差统计量之间的差异得出，用于调节评论家（critic）的异方差损失，防止策略对噪声样本过拟合。其次，协方差膨胀速度障碍（CI-VO）将位置估计的不确定性映射为集合意义上的角度裕度，形成一个保守的几何屏蔽，用于覆盖危险的探索动作。第三，风险感知的COLREGs责任嵌入将二元的会遇责任松弛为连续的规则感知信号，提供平滑的扇区过渡信息，并抑制由稀疏规则奖励引起的振荡。仿真会遇研究表明，该方法提高了训练对感知不一致性的鲁棒性，并在避碰和COLREGs合规性方面优于基线。

Probabilistic Recurrent Intention Switching Model

概率循环意图切换模型

Authors: Wenyuan Sheng, Hao Zhu, Joschka Boedecker
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2605.26998
Pdf link: https://arxiv.org/pdf/2605.26998
Abstract Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.
中文摘要 逆强化学习（IRL）从观察到的行为中恢复奖励函数，但传统方法假设只有单一的平稳奖励，无法刻画一个回合（episode）内的目标切换。近期的多意图IRL方法通过对轨迹进行分段来解决这个问题，但它们要么将意图转移建模为无记忆的马尔可夫链，要么通过带有固定历史窗口的手工状态增广来建模。我们提出了概率循环意图切换模型（PRISM），它用一个轻量级循环网络替代这两种机制，该网络将观察历史映射为每一步的意图分布。我们证明所得的EM目标可以精确分解为相互独立的、针对每个意图的奖励子问题，每个子问题均可以闭式求解，从而得到一个无需变分近似的 $\mathcal{O}(nK)$ E步。我们在非马尔可夫网格世界、小鼠迷宫以及 BridgeData V2 机器人操作上评估PRISM，后者是多意图IRL的首个大规模机器人应用。在所有设置下，PRISM都取得了最高的留出对数似然，同时从无标注的示范中恢复出可命名、时间上连贯的意图，这表明离散的目标切换同时存在于生物智能体和人工智能体中。

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

SQARL：面向分布式量子架构中电路分配的规模无关强化学习方法

Authors: Víctor Carballo, Júlia López-Closa, Mario Martin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.27027
Pdf link: https://arxiv.org/pdf/2605.27027
Abstract The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios.
中文摘要 目前，量子处理器的扩展受限于退相干和串扰等技术挑战。随着量子比特数量的增加，干扰会增大计算噪声。分布式量子计算通过互联更小、更易处理的量子处理器（核心）来应对这些限制，但也带来了新的挑战：如何尽量减少缓慢且易出错的核心间通信。在最小化通信成本的同时将量子电路分配到各核心上的任务被称为量子比特分配问题。本研究致力于为该问题开发一种深度学习方法，强调对量子硬件拓扑的灵活适应，并提升最先进的性能。启发式和非学习型算法，如匈牙利量子比特分配（HQA），目前代表了最先进的水平。强化学习（RL）方法利用学习得到的分配策略，但通常缺乏灵活性，在硬件配置变化时需要重新训练，且其解的质量不及非学习方法。然而，学习机制有可能超越人工设计的启发式方法。为克服这些限制，本研究提出了一种灵活的基于Transformer的架构，能够处理任意数量的量子比特和核心而无需重新训练。结果显示，训练得到的策略始终优于此前最先进的强化学习方法，并在最常见的电路上缩小了强化学习与HQA之间的差距。相对于HQA，它在Cuccaro加法器上将分配成本降低了33%，在随机电路上平均降低了25%。这些发现表明，基于学习的方法能够有效达到人工设计启发式方法的性能，这是其走向实际应用的关键一步。

Learning to Balance Motor Thermal Safety and Quadrupedal Locomotion Performance with Residual Policy

利用残差策略学习平衡电机热安全与四足运动性能

Authors: Yuhang Wan, Weixian Lin, Letian Qian, Yiqi Zou, Weiwei Wu, Shengwei Wu, Chuanlin Zhao, Xin Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27046
Pdf link: https://arxiv.org/pdf/2605.27046
Abstract Motor thermal management is often overlooked in the context of electrically-actuated robots, particularly legged robots, but motor overheating is a key factor that limits long-duration locomotion especially under payload conditions. This paper integrates a whole-body thermal model of a quadruped robot into the reinforcement learning pipeline to update motor temperatures, and proposes a two-stage training framework for motor thermal management. In this framework, a nominal policy is first pre-trained as a locomotion baseline capable of traversing diverse terrains. A residual policy is then trained on top of the nominal policy to provide corrective actions based on the robot's thermal state, ensuring high performance under low-temperature conditions and preventing motor overheating under high-temperature conditions. Simulation results demonstrate that the proposed policy achieves an effective balance between motor thermal safety and locomotion performance. Real-world experiments on a Unitree A1 quadruped robot further validate the approach: under a 3 kg payload, the robot achieves stable locomotion across multiple terrains for over 13 minutes, while the nominal policy alone leads to motor overheating in about 5 minutes.
中文摘要 在电驱动机器人，尤其是足式机器人中，电机热管理常被忽视，但电机过热是限制长时间运动的关键因素，在带负载条件下尤其如此。本文将四足机器人的全身热模型整合进强化学习流程以更新电机温度，并提出了一个用于电机热管理的两阶段训练框架。在此框架下，首先预训练一个标称策略，作为能够穿越多种地形的运动基线。然后在标称策略之上训练残差策略，根据机器人的热状态提供修正动作，确保低温条件下的高性能，并防止高温条件下电机过热。仿真结果表明，所提策略在电机热安全与运动性能之间实现了有效平衡。在Unitree A1四足机器人上的真实实验进一步验证了这一方法：在3公斤负载下，机器人能在多种地形上稳定运动超过13分钟，而仅使用标称策略时，电机约5分钟就会过热。
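
A minimal sketch of how a residual policy is commonly layered on top of a nominal policy, offered as an illustration and not as the authors' implementation. The abstract does not state how the two outputs are combined; the additive form, the `residual_scale` factor, the clipping `limit`, and the name `compose_action` are all assumptions made here.

```python
from typing import List, Sequence


def compose_action(
    nominal: Sequence[float],
    residual: Sequence[float],
    residual_scale: float = 0.25,  # hypothetical weight on the correction
    limit: float = 1.0,            # hypothetical symmetric action bound
) -> List[float]:
    """Add a scaled residual correction to the nominal action, then clip."""
    if len(nominal) != len(residual):
        raise ValueError("nominal and residual actions must have the same length")
    combined = [a + residual_scale * r for a, r in zip(nominal, residual)]
    # Clip each joint command to the assumed actuator range [-limit, limit].
    return [max(-limit, min(limit, c)) for c in combined]


if __name__ == "__main__":
    print(compose_action([0.5, -0.9], [1.0, -1.0]))
```

With a zero residual the nominal action passes through unchanged, which is one reason the additive form is a popular choice: the corrective policy only has to learn deviations.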

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

工业搜索中的大型语言模型驱动的查询驱动事件时间线摘要

Authors: Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.27066
Pdf link: https://arxiv.org/pdf/2605.27066
Abstract Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.
中文摘要 理解事件如何随时间演变，对于处理热门新闻相关查询的搜索引擎至关重要。我们介绍QDET（查询驱动事件时间线摘要），这是一套部署在百度搜索上的生产系统，通过构建聚焦的事件时间线来解释特定的查询事件。与传统的以主题为中心、追求全面覆盖的方法不同，QDET从每日检索的数百万文档组成的含噪候选集中，识别并组织与查询密切相关的子事件。QDET包含两项关键创新：（1）多任务监督微调，辅以三项辅助任务（时间排序、因果判断和时间线补全），使紧凑模型能够在专业领域中匹配规模大得多的通用模型的性能；（2）基于强化学习的事件简明摘要，在保持语义质量的同时执行严格的长度约束，实现88.2%的长度合规率，并在约束满足度上比671B规模的模型高出7.7个百分点。我们微调的7B参数模型在时间线摘要上取得了76.2%的F1分数，略高于DeepSeek-R1-671B的零样本性能（76.1% F1），且仅使用其1%的参数，这表明领域特定优化能够以大幅降低的计算成本得到质量相当、可用于生产的模型。百度搜索的在线A/B测试验证了实际效果：相比单任务基线，CTR提升5.5%，停留时间延长4.6%，探索深度提升4.4%。我们还进一步证明，时间线理解能力可以迁移到热度预测，证实了知识向下游任务的有效迁移。

Trust Region Q Adjoint Matching

信任区域Q伴随匹配

Authors: Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27079
Pdf link: https://arxiv.org/pdf/2605.27079
Abstract Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL with pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $\lambda$. As a result, our method can precisely control the exact deviation from pretrained flow policies, achieving stable off-policy RL. Through experiments on 50 OGBench tasks, TRQAM consistently outperforms prior arts in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improves the strongest baseline at 46%.
中文摘要 由于多步采样过程带来的优化不稳定性，对预训练流策略进行离策略强化学习仍具挑战性。最近，带伴随匹配的Q学习（QAM）通过将问题重新表述为一个带有学习到的评论家（critic）的无记忆随机最优控制（SOC）问题，解决了这一问题。然而，QAM继承了评论家引导改进的一个根本脆弱性：当评论家条件不佳（病态）时，微小的评论家误差会被放大，常常导致模型崩溃。本文介绍了信任区域Q伴随匹配（TRQAM），这是一种稳定的离策略微调算法，通过投影对偶下降自适应地控制与预训练流策略之间的路径空间KL散度。具体来说，我们优化SOC动力学中的信任区域参数$\lambda$，并从理论上证明路径空间KL可以表示为$\lambda$的闭式函数。因此，我们的方法能够精确控制与预训练流策略的偏差，实现稳定的离策略强化学习。在50项OGBench任务上的实验中，TRQAM在离线强化学习和离线到在线强化学习中均持续优于现有方法。特别是，TRQAM在离线强化学习中的整体成功率达到68%，大幅超过最强基线的46%。

MuChator: Enabling Active Music Discovery via Conversational Music LLMs in Douyin Music

MuChator：通过抖音音乐中的对话式音乐大型语言模型实现主动音乐发现

Authors: Jiahao Liang, Linzhi Huang, Xuannan Liu, Xukai Wang, Xuanpu Luo, Yongchun Zhu, Jingwu Chen, Feng Zhang, Xiao Yang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.27103
Pdf link: https://arxiv.org/pdf/2605.27103
Abstract Douyin Music, a large-scale platform with millions of daily users, adopts an immersive, feed-based discovery paradigm, where users passively explore music through continuous recommendations. While effective for passive music discovery, this paradigm restricts users to recommendation results and provides limited support for explicitly specifying listening intents. Unlike conventional search, where users express well-defined intents through explicit queries such as specific songs or artists, real-world active music discovery is often situational and colloquial, involving vague or underspecified requests. While LLMs enable natural language interaction, their direct use in music discovery remains limited by insufficient music-domain knowledge, lack of music-query collaborative reasoning, and shallow understanding of personalized preferences. To address these challenges, we introduce MuChator, an interactive MusicLLM-based framework that enables users to actively express situational music intents in natural language. MuChator incorporates three key components: (1) Music Knowledge Pre-training, a three-stage scheme that incrementally injects objective music knowledge, subjective music knowledge, and personalized music preferences into LLMs; (2) Context-aware Instruction Tuning, which constructs high-quality user-query-music triplets through an automated synthesis pipeline to align LLMs with active and situational user intents; and (3) Preference Alignment with Hybrid RM, which jointly models intent relevance, personalized preferences, and basic constraints, and is optimized using GRPO-based reinforcement learning. Extensive evaluations on industrial music recommendation datasets demonstrate that MuChator outperforms leading proprietary models, such as Gemini-3-Pro. The model has been deployed on Douyin Music App within ByteDance, with 46.49\% improvement of user active days in online A/B test.
中文摘要 抖音音乐是一个拥有数百万日活用户的大型平台，采用沉浸式、基于信息流的发现模式，用户通过持续推荐被动探索音乐。虽然这种范式对被动音乐发现有效，但它将用户限制在推荐结果之内，对明确表达听歌意图的支持有限。在传统搜索中，用户通过具体的查询（如特定歌曲或艺术家）表达明确意图；与之不同，现实中的主动音乐发现往往是情境化、口语化的，涉及模糊或不够明确的请求。虽然LLM支持自然语言交互，但其在音乐发现中的直接应用仍受限于音乐领域知识不足、缺乏音乐与查询的协同推理，以及对个性化偏好的理解较浅。为应对这些挑战，我们引入了MuChator，一个基于MusicLLM的交互式框架，使用户能够用自然语言主动表达情境化的音乐意图。MuChator包含三个关键组成部分：（1）音乐知识预训练，这是一个三阶段方案，逐步将客观音乐知识、主观音乐知识和个性化音乐偏好注入LLM；（2）上下文感知指令调优，通过自动化合成流水线构建高质量的“用户-查询-音乐”三元组，使LLM与主动的、情境化的用户意图对齐；以及（3）基于混合RM的偏好对齐，其联合建模意图相关性、个性化偏好和基本约束，并通过基于GRPO的强化学习进行优化。在工业音乐推荐数据集上的广泛评估表明，MuChator的表现优于领先的专有模型，如Gemini-3-Pro。该模型已部署在字节跳动旗下的抖音音乐App上，在线A/B测试中用户活跃天数提升了46.49%。

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

面向移动GUI导航的视觉语言智能体的规模化、基准测试与推理

Authors: Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, Jian Luan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27134
Pdf link: https://arxiv.org/pdf/2605.27134
Abstract Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
中文摘要 视觉语言模型（VLM）在移动GUI导航方面取得了快速进展。本文系统地研究了该领域中基于VLM的智能体的数据规模化、基准测试和推理。为促进严谨评估，我们引入了HyperTrack，这是一个大规模数据集，包含650多款中国的移动应用中的16000多个真实任务；同时引入GUIEvalKit，一个用于在离线GUI导航任务上对VLM进行统一基准测试的开源工具包。利用HyperTrack，我们分析了训练数据规模对监督微调和基于强化的微调的影响。结果表明，基于强化的微调持续优于监督微调，在域外场景中尤为明显，凸显了数据规模化与强化学习之间的协同效应。利用GUIEvalKit，我们进一步对最先进的（SOTA）VLM进行基准测试，并分析交互历史和推理能力如何影响任务完成。HyperTrack和GUIEvalKit共同提供了一个综合平台，用于开发和评估移动GUI导航任务中的VLM智能体。

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD：面向智能体强化学习的步骤感知在线偏好蒸馏

Authors: Yanfei Zhang, Xu Lin, Chenglin Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27140
Pdf link: https://arxiv.org/pdf/2605.27140
Abstract Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller {\alpha}_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength {\lambda}_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.
中文摘要 多轮智能体的强化学习存在信用分配不匹配的问题：奖励稀疏且为轨迹级，而成功往往取决于少数几个局部决策。现有的在线策略蒸馏（OPD）提供了更密集的词元级监督，但通常将异构的智能体轨迹视为整块字符串，而非因果交互单元。我们提出StepOPSD，一种rollout之后的偏好自蒸馏框架，以智能体步骤作为信用再分配的单位。StepOPSD将轨迹分解为以动作为中心的步骤片段，在融入事后信息的教师上下文下对其重新评分，并在GRPO更新之前，将词元级对数概率差距转换为保持符号的优势塑形，同时对每步信用预算进行归一化。在ALFWorld和Search-QA上，使用Qwen3-1.7B和Qwen2.5-3B-Instruct，StepOPSD在对局部因果错误最敏感的子集上取得最佳或次佳成绩，包括在ALFWorld Heat（79.1%）、PickTwo（95.0%）和Search-QA TriviaQA（61.6%）上排名第一，以及在HotpotQA（40.4%）上并列最佳。结果进一步揭示了一个一致的“双旋钮”规律：较小的{\alpha}_clip起到普遍稳定的局部信任区域的作用，而最优的全局混合强度{\lambda}_mix则依赖于任务。这些发现表明，当轨迹级奖励与决定下游成功的局部动作对齐较弱时，步骤感知蒸馏最为有用。

Container Unloading via Reinforcement Learning: Picking Order, Deadlock Avoidance, and Proof-of-Concept Simulation

通过强化学习卸载集装箱：拣货顺序、避免死锁与概念验证仿真

Authors: Jan Rüdiger, Max Schenke, Daniel Weber
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.27143
Pdf link: https://arxiv.org/pdf/2605.27143
Abstract Unloading containers in the courier, express and parcel industry is a physically demanding and labor-intensive work. Automatizing this process is an important step towards increasing the efficiency of parcel-handling systems. This work investigates the potential of reinforcement learning to learn a policy for item selection in container unloading scenarios. For that, a simulation environment is created and a masked deep Q-learning with a specially designed neural network architecture is implemented. The results indicate that the agent can learn to select items with an average success rate of 60 %, which is significantly better than a random policy at a random chance of 20 %. The findings suggest that RL could be a promising approach for automatizing item unloading tasks in the future.
中文摘要 在快递、速运和包裹行业中，卸载集装箱是一项体力消耗大且劳动密集的工作。将这一流程自动化是提高包裹处理系统效率的重要一步。本研究探讨强化学习在集装箱卸载场景中学习物品选择策略的潜力。为此，构建了一个仿真环境，并实现了带有专门设计的神经网络架构的掩码深度Q学习。结果显示，智能体能够学会选择物品，平均成功率为60%，显著优于随机概率为20%的随机策略。研究结果表明，强化学习未来有望成为实现物品卸载任务自动化的方法。
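
A hedged sketch of greedy action selection with an action mask, one common reading of "masked deep Q-learning". The abstract gives no implementation details, so the mask semantics (True means the item can currently be picked) and the function name `masked_greedy_action` are assumptions for illustration only.

```python
from typing import Optional, Sequence


def masked_greedy_action(q_values: Sequence[float], valid_mask: Sequence[bool]) -> int:
    """Return the index of the highest Q-value among actions the mask allows."""
    if len(q_values) != len(valid_mask):
        raise ValueError("q_values and valid_mask must have the same length")
    best_index: Optional[int] = None
    for index, (q, is_valid) in enumerate(zip(q_values, valid_mask)):
        if not is_valid:
            continue  # masked-out items are never candidates
        if best_index is None or q > q_values[best_index]:
            best_index = index
    if best_index is None:
        raise ValueError("no valid action available")
    return best_index


if __name__ == "__main__":
    # Item 0 has the highest Q-value but is masked out (for example, blocked by other items).
    print(masked_greedy_action([0.9, 0.2, 0.5], [False, True, True]))
```

Masking at selection time keeps the agent from ever choosing an infeasible item, so the learned Q-values only need to rank the feasible ones.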

Touch-R1: Reinforcing Touch Reasoning in MLLMs

Touch-R1：强化多模态大语言模型（MLLM）中的触觉推理

Authors: Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.27154
Pdf link: https://arxiv.org/pdf/2605.27154
Abstract While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4\% and GPT-4o by 24.7\% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
中文摘要 尽管基于规则的强化学习近来推动了多模态模型中的显式推理，但触觉推理在很大程度上仍未被充分探索。现有的触觉语言模型主要依赖监督式或对比式目标，这限制了它们将预测建立在物理证据之上或纠正误导性视觉先验的能力。触觉推理带来了两个模态特有的挑战：物理属性（如硬度、粗糙度）的序数性质，以及光学触觉硬件固有的跨传感器分布偏移。本研究介绍了TouchReason-1M，一个包含超过100万个同步触觉对、覆盖四种不同传感器的大规模多模态数据集，以及TouchReason-Bench，一个用于评估触觉感知和视觉-触觉冲突消解的严谨框架。在此基础上，我们提出了Touch-R1，一种基于Qwen2.5-VL-7B的触觉推理MLLM。Touch-R1通过基于触觉的GRPO目标进行训练，该目标结合了序数感知的准确性、跨传感器物理一致性、结构化格式控制以及输入端的触觉落地目标。具体来说，触觉使用奖励仅在真实触觉输入比反事实对照（触觉流被移除、打乱或被噪声掩蔽）带来更高的正确性时才给予奖励。在TouchReason-Bench上，Touch-R1-7B平均比Octopi-13B高18.4%，比GPT-4o高24.7%。其结构化推理轨迹揭示了探测、比较和修正等涌现行为，表明R1式推理可以有效地建立在物理接触之上。
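
One possible reading of the tactile-use reward described above, sketched under stated assumptions. The abstract says credit is given only when real tactile input is more correct than counterfactual controls; comparing against the best counterfactual score, the fixed `bonus` value, and the name `tactile_use_reward` are choices made here and are not confirmed by the source.

```python
from typing import Sequence


def tactile_use_reward(
    score_with_touch: float,
    counterfactual_scores: Sequence[float],
    bonus: float = 1.0,  # hypothetical reward magnitude
) -> float:
    """Grant the bonus only if the real tactile input beats every counterfactual control."""
    if not counterfactual_scores:
        raise ValueError("at least one counterfactual score is required")
    # Controls: tactile stream removed, shuffled, or noise-masked (one score each).
    best_control = max(counterfactual_scores)
    return bonus if score_with_touch > best_control else 0.0


if __name__ == "__main__":
    print(tactile_use_reward(1.0, [0.0, 0.5, 0.0]))
```

Under this reading a model that answers correctly without using touch earns no tactile-use credit, because the controls score just as well.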

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj：自监督基础模型作为无标签3D对象分割的奖励

Authors: Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27178
Pdf link: https://arxiv.org/pdf/2605.27178
Abstract We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.
中文摘要 我们研究复杂场景点云中的3D物体分割这一具有挑战性的任务，且训练过程中不依赖任何场景级人工标注。现有方法通常仅限于识别简单物体，主要原因是学习过程中物体先验不足。本文介绍了FoundObj，一种新颖的框架，采用基于超点的物体发现智能体，在我们创新的语义和几何奖励模块的引导下，逐步合并合适的邻近超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验，为物体发现智能体提供互补反馈，并通过强化学习实现对多类物体的稳健识别。在多种基准上的大量实验表明，我们的方法始终优于现有基线。值得注意的是，我们的方法在零样本和长尾场景中展现出很强的泛化能力，凸显了其在可扩展、无标签3D物体分割中的潜力。

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

并不总是谄媚：将LLM的从众性作为认知不确定性的函数来度量

Authors: Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.27288
Pdf link: https://arxiv.org/pdf/2605.27288
Abstract Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.
中文摘要 已知大型语言模型（LLM）会放弃最初的立场，以顺从用户的反驳。以往研究主要将这种行为归因于在基于人类反馈的强化学习中习得的谄媚，而我们假设，从众行为也受模型在推理时的认知不确定性驱动。本文介绍了MUSE，一种两阶段评估框架，旨在厘清驱动LLM从众行为的机制。具体来说，MUSE将模型回答查询时的认知不确定性，与其在后续轮次中向用户反驳让步的可能性对应起来。我们证明，驱动从众行为的机制不止于谄媚。具体来说，我们刻画了共同驱动从众行为的两个不同因素：谄媚式从众，即模型即使对其初始回答有绝对把握，也会顺从用户的反驳；以及不确定性驱动的从众，即模型从众的可能性随其不确定性的增加而上升。此外，我们通过消融研究证明，谄媚式从众和不确定性驱动的从众都会随着1）LLM所感知的用户专业程度以及2）用户建议的合理性而增强。更广泛地说，MUSE通过区分对齐引发的谄媚和训练语料驱动的不确定性，为更有针对性的干预策略提供依据。

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

BASIS：基于单次rollout信息共享的批量优势估计，用于大型语言模型推理

Authors: Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.27293
Pdf link: https://arxiv.org/pdf/2605.27293
Abstract Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
中文摘要 带有可验证奖励的强化学习已成为提升大型语言模型推理能力的标准方法。现有算法在价值估计和策略学习中面临计算效率与样本效率之间的权衡。我们介绍BASIS，一种无评论家（critic-free）的后训练算法，旨在解决这一权衡。在每个在线训练步骤中，BASIS对每个提示只采样一次rollout，但利用整个批次中跨提示的丰富信息来改进价值函数估计。我们的实验表明，与具有代表性的单次rollout基线REINFORCE++相比，BASIS将价值函数估计的MSE降低了69%，并且仅用一次rollout就取得了比使用8次rollout的组均值估计器更低的MSE。价值估计的改进带来了更好的策略优化：BASIS以显著更少的训练时间，达到了接近多rollout的GRPO类基线的性能，并且常常优于单次rollout的REINFORCE类基线。
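
For context on the baselines named above, here is a simplified sketch of two reward baselines: a per-prompt group mean over several rollouts (the GRPO-type estimator) and a single batch-wide mean when each prompt has only one rollout. This is not BASIS itself, whose estimator the abstract does not specify, and it omits the variance normalization that practical implementations often add. The function names are hypothetical.

```python
from typing import List, Sequence


def group_mean_advantages(rewards_per_prompt: Sequence[Sequence[float]]) -> List[List[float]]:
    """Multi-rollout setting: subtract each prompt's own mean reward from its rollouts."""
    advantages = []
    for rewards in rewards_per_prompt:
        if not rewards:
            raise ValueError("each prompt needs at least one rollout")
        baseline = sum(rewards) / len(rewards)
        advantages.append([r - baseline for r in rewards])
    return advantages


def batch_mean_advantages(single_rollout_rewards: Sequence[float]) -> List[float]:
    """Single-rollout setting: subtract one mean shared by every prompt in the batch."""
    if not single_rollout_rewards:
        raise ValueError("batch must not be empty")
    baseline = sum(single_rollout_rewards) / len(single_rollout_rewards)
    return [r - baseline for r in single_rollout_rewards]


if __name__ == "__main__":
    print(group_mean_advantages([[1.0, 0.0, 1.0, 0.0]]))
    print(batch_mean_advantages([1.0, 0.0, 0.0, 1.0]))
```

The group mean adapts to each prompt's difficulty but costs several rollouts per prompt, while the batch mean is cheap but ignores per-prompt difficulty. That gap is the efficiency tradeoff the abstract refers to.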

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

用稀疏自编码器模型内部结构指导LLM后训练数据工程

Authors: Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27354
Pdf link: https://arxiv.org/pdf/2605.27354
Abstract Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
中文摘要 模型内部状态编码了关于大型语言模型（LLM）如何处理其训练数据的丰富信息；然而，后训练数据工程主要依赖外部信号，忽视了模型内部蕴含的丰富内在信号。我们提出了SAERL，一种用于LLM强化学习（RL）的数据工程框架。它利用稀疏自编码器（SAE，一种先进的机制可解释性工具）提取的模型内部状态，对三种内在数据属性进行建模：多样性、难度和质量。每个属性都对应一项具体的数据工程操作：SAE空间聚类配合适度的批次混合，用于控制批次多样性；难度代理指标用于由易到难的课程排序；质量探针用于数据过滤。在Qwen2.5-Math-1.5B上，SAERL比原版GRPO的平均准确率提高3.00%，并以减少20%的训练步数达到目标准确率，且在不同模型规模和强化学习算法上均有一致提升。实验表明，SAE能够在不同模型家族和规模之间有效迁移，可作为一种轻量级、可复用的数据工程工具。这些结果表明，模型内部状态是后训练数据工程中强大且实用的信号来源。

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

对齐篡改：基于人类反馈的强化学习如何被利用来优化未对齐的偏见

Authors: Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.27355
Pdf link: https://arxiv.org/pdf/2605.27355
Abstract Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: this https URL
中文摘要 基于人类反馈的强化学习（RLHF）是将大型语言模型（LLM）与人类偏好对齐的标准方法。在本研究中，我们提出了对齐篡改这一潜在漏洞：正在接受对齐的LLM会影响偏好数据集，导致RLHF放大不良行为。这源于RLHF的核心局限性：（1）偏好数据集由LLM自身的输出构建，使其能够影响这些数据集；（2）成对比较仅指出哪个回答更好，而不说明原因。这些局限性可以被利用来造成对齐篡改。例如，如果LLM生成的带偏见的回答质量更高，标注者会基于质量而偏好这些回答。然而，偏好标签并不区分质量与偏见，奖励模型也继承了这一局限性。通过强化学习或best-of-N采样来优化此类奖励，可能放大未对齐的偏见。我们的实验展示了多种偏见的放大：从关键词偏见到宣传（如性别歧视）、品牌推广以及工具性目标追求。缓解仍然具有挑战性，因为现有的稳健RLHF技术无法在不牺牲回答质量的情况下完全解决对齐篡改。这些发现揭示了当前RLHF的结构性脆弱性，并强调了防范这一漏洞的必要性。项目页面：此 https URL

Keyword: diffusion policy

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

通过扩散策略优化扩展世界模型强化学习

Authors: Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.26282
Pdf link: https://arxiv.org/pdf/2605.26282
Abstract Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.
中文摘要 借助世界模型，基于模型的强化学习（RL）可以在大规模上得到有效支持。然而在实践中，这类方法的规模化仍然受到根本性限制。一个公认的挑战是模型偏差和误差累积，它们会降低长时域预测的质量。除了这些问题，我们还发现了一个更关键但尚未被充分探讨的瓶颈：现有世界模型方法中搜索与价值学习之间的结构性错位。特别是，策略改进常常依赖于由一个独立的非搜索策略诱导的价值函数，导致训练不一致，最终造成次优的学习效果。为解决这一局限，我们提出了世界模型中的基于模型的扩散策略优化（MBDPO），该框架通过扩散策略表示统一搜索和策略优化，从而释放世界模型在可扩展策略学习中的潜力。我们不在学习到的世界模型上构建显式规划器，而是将策略优化重新表述为潜在世界模型中搜索轨迹上的扩散过程。在这一视角下，我们从收集的数据集中提取一个隐式能量函数来锚定策略，使MBDPO能够在缓解错位的同时细化用于策略优化的得分场（score field）。我们在多种设置下评估MBDPO，包括多任务离线预训练、在线学习以及离线到在线的微调。在离线设置下，我们通过在大规模数据集上预训练进一步研究其扩展行为，观察到随着模型容量增加，性能持续且单调地提升。

Riding the Shifting Potential: When Reactive Control Suffices for Multi-Goal Behavior

乘势而上：当反应式控制足以实现多目标行为时

Authors: Vito Mengers, Oliver Brock
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.27314
Pdf link: https://arxiv.org/pdf/2605.27314
Abstract Reactive control is often considered insufficient for multi-objective tasks because conflicting objectives give rise to local minima. We argue this limitation is not inherent but arises from static encodings that fail to reflect how objectives currently interact. We exploit the interaction structure encoded in a graph-based world model by extending it with nullspace projections: conflicts are resolved where they arise by projecting lower-priority gradients into the nullspace of higher-priority ones, with priorities determined continuously from the current state. We demonstrate this in two domains where conflicts between objectives are central: navigation around non-convex obstacles, where static potential fields fundamentally fail, and planar pushing of non-convex objects, where our method achieves $100\%$ success across one-hundred configurations versus $0\%$ for the steepest-descent baseline and ${\sim}55\%$ for diffusion policy, without demonstrations or retraining. The same formulation transfers directly to a real robot with additional perceptual and kinematic constraints, accommodating them through the same mechanism.
中文摘要 反应式控制常被认为不足以应对多目标任务，因为相互冲突的目标会产生局部极小值。我们认为这种限制并非固有，而是源于未能反映目标当前如何相互作用的静态编码。我们利用基于图的世界模型中编码的交互结构，并用零空间投影对其进行扩展：在冲突出现之处，将低优先级梯度投影到高优先级梯度的零空间中来解决冲突，优先级则根据当前状态连续确定。我们在两个以目标间冲突为核心的领域展示了这一点：绕过非凸障碍物的导航，其中静态势场会从根本上失效；以及非凸物体的平面推动，其中我们的方法在一百种配置上取得了$100\%$的成功率，而最速下降基线为$0\%$，扩散策略为${\sim}55\%$，且无需演示或重新训练。同样的表述可直接迁移到带有额外感知和运动学约束的真实机器人上，并通过相同的机制容纳这些约束。
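
A small sketch of the nullspace projection idea for two objectives, assuming each objective contributes a single gradient vector. In that case the projector onto the nullspace of the higher-priority gradient g is I - g gᵀ / (gᵀ g). The paper may work with task Jacobians and state-dependent priorities; the function names and this two-vector simplification are assumptions for illustration.

```python
from typing import List, Sequence


def project_into_nullspace(low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Remove from `low` its component along `high`, leaving a vector orthogonal to `high`."""
    if len(low) != len(high):
        raise ValueError("gradients must have the same dimension")
    norm_sq = sum(h * h for h in high)
    if norm_sq == 0.0:
        return list(low)  # nothing to protect: the higher-priority gradient is zero
    scale = sum(l * h for l, h in zip(low, high)) / norm_sq
    return [l - scale * h for l, h in zip(low, high)]


def combined_direction(high: Sequence[float], low: Sequence[float]) -> List[float]:
    """Follow the higher-priority gradient plus the non-conflicting part of the lower one."""
    projected = project_into_nullspace(low, high)
    return [h + p for h, p in zip(high, projected)]


if __name__ == "__main__":
    print(project_into_nullspace([1.0, 1.0], [1.0, 0.0]))
    print(combined_direction([1.0, 0.0], [-1.0, 1.0]))
```

In the second call the lower-priority gradient opposes the higher-priority one along the first axis. After projection only its second-axis component survives, so the lower-priority objective cannot cancel progress on the higher-priority one.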