Arxiv Papers of Today

Generated: 2026-05-04 18:15:45 (UTC+8); arXiv release time: 2026-05-04 20:00 EDT (2026-05-05 08:00 UTC+8)

30 related papers today

Keyword: reinforcement learning

Exploring LLM biases to manipulate AI search overview

Authors: Roman Smirnov
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.00012
Pdf link: https://arxiv.org/pdf/2605.00012
Abstract Modern large language models (LLMs) are used in many business applications, and in particular in web search systems and in applications that generate overviews of search results (LLM Overview systems). Such systems use an LLM to select the most relevant sources from the search results and to generate an answer to the user's query. Many studies have shown that LLMs exhibit various biases; in an LLM Overview application, both the source selection and the answer generation stages may be affected by them (here we focus mainly on the selection stage). This research investigates the presence of biases in LLM Overview systems and how those biases can be exploited to manipulate LLM Overview results. We train a small language model using reinforcement learning to rewrite search snippets so as to increase their likelihood of being preferred by an LLM Overview. Our experimental setup intentionally restricts the policy to operate only on snippets and limits reward-hacking strategies, reflecting realistic constraints of web search environments. The results show that LLM Overview systems have biases and that, in most cases, reinforcement learning can optimize a snippet's content to manipulate LLM Overview results. We also show that LLM Overview selections are driven by comparative rather than absolute advantages among candidate sources. In addition, we examine safety aspects of LLM Overview manipulation and show that context poisoning attacks can lead to inaccurate or harmful results.

Dynamic-TD3: A Novel Algorithm for UAV Path Planning with Dynamic Obstacle Trajectory Prediction

Authors: Wentao Chen, Jingtang Chen, Mingjian Fu, Tiantian Li, Youfeng Su, Wenxi Liu, Yuanlong Yu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00059
Pdf link: https://arxiv.org/pdf/2605.00059
Abstract Deep reinforcement learning (DRL) finds extensive application in autonomous drone navigation within complex, high-risk environments. However, its practical deployment faces a safety-exploration dilemma: soft penalty mechanisms encourage risky trial-and-error, while most constraint-based methods suffer degraded performance under sensor noise and intent uncertainty. We propose Dynamic-TD3, a physically enhanced framework that enforces strict safety constraints while maintaining maneuverability by modeling navigation as a Constrained Markov Decision Process (CMDP). This framework integrates an Adaptive Trajectory Relational Evolution Mechanism (ATREM) to capture long-range intentions and employs a Physically Aware Gated Kalman Filter (PAG-KF) to mitigate non-stationary observation noise. The resulting state representation drives a dual-criterion policy that balances mission efficiency against hard safety constraints via Lagrangian relaxation. In experiments with aggressive dynamic threats, this approach demonstrates superior collision avoidance performance, reduced energy consumption, and smoother flight trajectories.
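The Lagrangian relaxation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the learning rate, and the sample cost values are hypothetical, and the actual Dynamic-TD3 update may differ. The idea is that the policy maximizes reward minus a multiplier times the constraint violation, while the multiplier grows whenever the expected safety cost exceeds its limit.

```python
def lagrangian_objective(reward_return, cost_return, lam, cost_limit):
    """Scalar objective a policy would maximize for a fixed multiplier lam."""
    return reward_return - lam * (cost_return - cost_limit)


def update_multiplier(lam, cost_return, cost_limit, lr=0.1):
    """Dual ascent step: raise lam when the constraint is violated, clip at 0."""
    return max(0.0, lam + lr * (cost_return - cost_limit))


lam = 0.0
# Hypothetical expected safety costs of a policy that improves over iterations.
for cost in [3.0, 2.5, 1.5, 0.8, 0.5]:
    lam = update_multiplier(lam, cost, cost_limit=1.0)
```

The multiplier rises while the cost is above the limit of 1.0 and starts to decay once the policy satisfies the constraint, which is what lets the penalty adapt instead of being a fixed soft weight.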

XekRung Technical Report

Authors: Jiutian Zeng, Junjie Li, Chengwei Dai, Jie Liang, Zhaoyu Hu, Yiliang Zhang, Ziang Weng, Longtao Huang, Dongjie Zhang, Libin Dong, Yang Ge, Yuanda Wang, Kaiwen Lv Kacuila, Bingyu Zhu, Jing Wang, Jin Xu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00072
Pdf link: https://arxiv.org/pdf/2605.00072
Abstract We present XekRung, a frontier large language model for cybersecurity, designed to provide comprehensive security capabilities. To achieve this, we develop diverse data synthesis pipelines tailored to the cybersecurity domain, enabling the scalable construction of high-quality training data and providing a strong foundation for cybersecurity knowledge and understanding. Building on this foundation, we establish a complete training pipeline spanning continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) to further extend the model's capabilities. We further introduce a multi-dimensional evaluation system to guide the iterative improvement of both domain-specific and general-purpose abilities. Extensive experiments demonstrate that XekRung achieves state-of-the-art performance on cybersecurity-specific benchmarks among models of the same scale, while maintaining strong performance on general benchmarks.

World Model for Robot Learning: A Comprehensive Survey

Authors: Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, Jianfei Yang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.00080
Pdf link: https://arxiv.org/pdf/2605.00080
Abstract World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, data generation, and have advanced rapidly with the rise of foundation models and large-scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot-learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination-based generation to controllable, structured, and foundation-scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Authors: Yikai Wang, Shang Liu, Jose Blanchet
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.00155
Pdf link: https://arxiv.org/pdf/2605.00155
Abstract Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
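The difference between worst-case value (DRO) and worst-case regret (DRRO) can be seen in a toy single-prompt example. The sketch below is an illustration under simplifying assumptions, not the paper's algorithm: it assumes a plain $\ell_1$ ball of radius `eps` around an estimated reward vector `r_hat` (the paper's Wasserstein construction may differ), and it brute-forces the inner maximum over the vertices of that ball instead of using the exact solution or the water-filling policy described in the abstract. All names are hypothetical.

```python
import numpy as np


def worst_case_value(pi, r_hat, eps):
    """DRO-style criterion: min over ||d||_1 <= eps of pi . (r_hat + d).

    By norm duality the minimum equals pi . r_hat - eps * max(pi).
    """
    return float(pi @ r_hat - eps * np.max(pi))


def worst_case_regret(pi, r_hat, eps):
    """Regret-style criterion: max over the same ball of max_a r_a - pi . r.

    The regret is convex in the perturbation, so its maximum over the
    l1 ball is attained at a vertex +/- eps * e_j; we enumerate them.
    """
    worst = -np.inf
    for j in range(len(r_hat)):
        for sign in (1.0, -1.0):
            r = r_hat.copy()
            r[j] += sign * eps
            worst = max(worst, float(np.max(r) - pi @ r))
    return worst


r_hat = np.array([1.0, 0.9, 0.0])   # estimated rewards of three responses
greedy = np.array([1.0, 0.0, 0.0])  # all mass on the top estimated response
mixed = np.array([0.5, 0.5, 0.0])   # mass split over the two close candidates

regret_greedy = worst_case_regret(greedy, r_hat, eps=0.5)
regret_mixed = worst_case_regret(mixed, r_hat, eps=0.5)
```

In this toy case the mixed policy has the lower worst-case regret (0.3 versus 0.4), because it hedges between the two responses whose estimated rewards are nearly tied.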

E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

Authors: Kaiyan Zhao, Borong Zhang, Yiming Wang, Xingyu Liu, Xuetao Li, Yuyang Chen, Xiaoguang Niu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.00159
Pdf link: https://arxiv.org/pdf/2605.00159
Abstract In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E$^2$DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E$^2$DT to be both efficient, by prioritizing sampling quality, such as high-return, high-uncertainty, and underrepresented trajectories, and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage based on inverse frequency. These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E$^2$DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.
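A quality-diversity kernel of the kind described in the abstract can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the E$^2$DT implementation: the kernel form `L[i, j] = q[i] * S[i, j] * q[j]` with cosine similarity `S` is the standard DPP decomposition, the quality scores are made up, and greedy log-determinant maximization is used here as a simple stand-in for actual k-DPP sampling.

```python
import numpy as np


def quality_diversity_kernel(embeddings, quality):
    """L = diag(q) @ S @ diag(q), with S the cosine similarity of embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return quality[:, None] * similarity * quality[None, :]


def greedy_select(kernel, k):
    """Greedily pick k items that maximize log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # no remaining item keeps the determinant positive
            break
        selected.append(best_item)
    return selected


# Hypothetical trajectory-window embeddings: items 0 and 1 are near duplicates.
embeddings = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
quality = np.array([1.0, 0.9, 0.5])
chosen = greedy_select(quality_diversity_kernel(embeddings, quality), k=2)
```

Item 1 has higher quality than item 2 but is almost identical to item 0, so the determinant penalizes it and the selection is `[0, 2]`. This is the trade-off between quality and diversity that the joint kernel is meant to encode.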

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

Authors: Abdulhady Abas Abdullah, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00224
Pdf link: https://arxiv.org/pdf/2605.00224
Abstract Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and RL-free, it treats preferences as flat winner vs. loser signals and is sensitive to noisy or brittle preferences arising from fragile chains of thought. We propose TUR-DPO, a topology- and uncertainty-aware variant of DPO that rewards how answers are derived, not only what they say, by eliciting lightweight reasoning topologies and combining semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and relies only on a fixed or moving reference policy. Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. We further observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.
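For orientation, the standard DPO loss with a per-pair weight can be written as a short Python function. The weighting scheme is an assumption made for illustration: the abstract does not specify how TUR-DPO's calibrated uncertainty signal enters the objective, so `weight` here is simply a hypothetical scalar in [0, 1] that down-weights less trustworthy preference pairs.

```python
import math


def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, weight=1.0):
    """Per-pair DPO loss, scaled by a hypothetical confidence weight.

    logp_w / logp_l: policy log-probabilities of the preferred (winner)
    and rejected (loser) responses; ref_* are the reference-policy values.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return weight * math.log1p(math.exp(-margin))


neutral = weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0)            # margin 0
uncertain = weighted_dpo_loss(-1.0, -1.0, -1.0, -1.0, weight=0.5)
improved = weighted_dpo_loss(-0.5, -2.0, -1.0, -1.0)           # winner favored
```

With a zero margin the loss is log 2; halving the weight halves the pair's contribution, and raising the winner's log-probability relative to the reference lowers the loss.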

Bayesian Optimization in Linear Time

Authors: Jesse Schneider, William J. Welch
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.00237
Pdf link: https://arxiv.org/pdf/2605.00237
Abstract Bayesian optimization is a sequential method for minimizing objective functions that are expensive to evaluate and about which few assumptions can be made. By using all gathered data to train a Gaussian process model for the function and adaptively employing a mixture of global exploration and local exploitation, this method has been used for optimization in many fields including machine learning, automotive engineering and reinforcement learning. However, the standard method suffers from two problems: 1) with cubic computational complexity in the training-set size it eventually becomes computationally infeasible to train the model, and 2) globally modeling the objective function is not necessarily optimal given the local nature of minimization. Using flexible and recursive binary partitioning of the search space, we adapt both the modeling and acquisitive aspects of standard Bayesian optimization to work harmoniously with the partitioning scheme, thereby ameliorating both standard shortcomings. We compare our method against a commonly used Bayesian optimization library on seven challenging test functions, ranging in dimensionality from $6$ to $124$, and show that our method achieves superior optimization performance in all tests. In addition our method has linear computational complexity.

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

Authors: Claire Chen, Yuheng Zhang
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2605.00264
Pdf link: https://arxiv.org/pdf/2605.00264
Abstract Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.

Data Deletion Can Help in Adaptive RL

Authors: Param Budhraja, Aditya Gangrade, Alex Olshevsky, Venkatesh Saligrama
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.00298
Pdf link: https://arxiv.org/pdf/2605.00298
Abstract Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" which assumes knowledge of the true context, then pair it with a context estimator which approximates context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity without requiring any explicit identification of which samples are stale. This reduces robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5x fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the train distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases expected test loss in expectation under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.
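The implicit exponential decay described in the abstract is easy to check with a small simulation. The sketch below is an illustration only: the buffer stores just the round index of each sample, and the round count, buffer sizes, and deletion fraction are hypothetical rather than taken from the paper.

```python
import random


def expected_survival(num_deletions, delete_frac):
    """Chance that a sample is still in the buffer after several deletions."""
    return (1.0 - delete_frac) ** num_deletions


def run_rounds(num_rounds, per_round, delete_frac, seed=0):
    """Collect per_round samples each round, then delete a random fraction."""
    rng = random.Random(seed)
    buffer = []
    for round_id in range(num_rounds):
        buffer.extend([round_id] * per_round)        # tag samples by round
        keep = int(len(buffer) * (1.0 - delete_frac))
        buffer = rng.sample(buffer, keep)            # uniform random deletion
    return buffer


buffer = run_rounds(num_rounds=5, per_round=1000, delete_frac=0.5)
counts = {r: buffer.count(r) for r in range(5)}
```

Samples from the first round have gone through five deletions and samples from the last round through one, so on average the surviving counts fall off geometrically with age even though no sample is ever explicitly marked as stale.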

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Authors: Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.00347
Pdf link: https://arxiv.org/pdf/2605.00347
Abstract Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Authors: Anamika Lochab, Bolian Li, Ruqi Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.00365
Pdf link: https://arxiv.org/pdf/2605.00365
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is linked from the arXiv abstract page.
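The indifference that the abstract attributes to GRPO-style objectives can be seen directly from the group-normalized advantage. The sketch below assumes the common formulation (reward minus group mean, divided by group standard deviation); using the population standard deviation and the small `eps` are implementation assumptions, and the sample rewards are hypothetical.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for one prompt's sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses: three verified correct (reward 1), one wrong.
# The three correct ones may be identical or entirely different solutions;
# the binary reward cannot tell these cases apart.
advantages = grpo_advantages([1.0, 1.0, 1.0, 0.0])
```

Every correct response receives exactly the same advantage, so nothing in this signal favors spreading probability mass across distinct correct solutions. That is the gap the conditional uniformity penalty in UCPO is described as filling.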

AlphaInventory: Evolving White-Box Inventory Policies via Large Language Models with Deployment Guarantees

Authors: Chenyu Huang, Jianghao Lin, Zhengyang Tang, Bo Jiang, Ruoqing Jiang, Benyou Wang, Lai Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00369
Pdf link: https://arxiv.org/pdf/2605.00369
Abstract We study how large language models can be used to evolve inventory policies in online, non-stationary environments. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance on static and highly structured problems such as mathematical discovery, but is not directly suited to online dynamic inventory settings. To this end, we propose AlphaInventory, an end-to-end inventory-policy evolution and inference framework grounded in confidence-interval-based certification. The framework trains a large language model using reinforcement learning, incorporates demand data as well as numerical and textual features beyond demand, and generates white-box inventory policies with statistical safety guarantees for deployment in future periods. We further introduce a unified theoretical interface that connects training, inference, and deployment. This allows us to characterize the probability that AlphaInventory evolves a statistically safe and improved policy, and to quantify the deployment gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, AlphaInventory outperforms classical inventory policies and deep-learning-based methods. In canonical inventory settings, it evolves new policies that improve upon existing benchmarks.

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Authors: Zuyao You, Zhesong Yu, Mingyu Liu, Bilei Zhu, Yuan Wan, Zuxuan Wu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00371
Pdf link: https://arxiv.org/pdf/2605.00371
Abstract In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Authors: Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.00380
Pdf link: https://arxiv.org/pdf/2605.00380
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting the penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL), which decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. The code is linked from the arXiv abstract page.
中文摘要 带有可验证奖励的强化学习（RLVR）提升了大型语言模型（LLM）的推理能力，但由于对正向奖励的过度激励，通常表现出有限的生成多样性。尽管负样本强化（NSR）等方法通过加大负样本的惩罚权重来缓解这一问题，但它们可能会抑制正负回复之间共享的语义分布。为了在不损失多样性的前提下提升推理能力，本文提出了负样本投影残差强化学习（ResRL），将正负回复之间相似的语义分布解耦。我们在理论上将惰性似然位移（Lazy Likelihood Displacement，LLD）与负样本和正样本之间的头部梯度干扰联系起来，并推导出一个只需单次前向传播的代理量，它给出表示对齐的上界，用于指导保守的优势重加权。随后，ResRL将负样本token的隐藏表示投影到基于SVD的低秩正样本子空间上，并利用投影残差来调制负梯度，在保持多样性的同时提升推理能力，并在涵盖数学、代码、智能体任务和函数调用的十二个基准上平均优于强基线。值得注意的是，ResRL在数学推理上的Avg@16比NSR高出9.4%，Pass@128高出7.0%。代码可在此 https URL 访问。
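
The projection-residual step described in the ResRL abstract can be illustrated with a small sketch. It assumes an orthonormal basis for the positive subspace is already available (the abstract obtains this subspace from an SVD, which is omitted here), and the function names `projection_residual` and `residual_ratio` are hypothetical. The abstract does not say how the residual is turned into a gradient weight, so the sketch stops at the residual itself.

```python
import math

def projection_residual(h, basis):
    """Return h minus its projection onto span(basis).

    `basis` is assumed to be a list of orthonormal vectors (an assumption of
    this sketch; the abstract derives the subspace from an SVD).
    """
    residual = list(h)
    for b in basis:
        coeff = sum(hi * bi for hi, bi in zip(h, b))  # inner product <h, b>
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return residual

def residual_ratio(h, basis):
    """Fraction of the norm of h that lies outside the subspace, in [0, 1]."""
    norm_h = math.sqrt(sum(x * x for x in h))
    if norm_h == 0.0:
        return 0.0
    r = projection_residual(h, basis)
    return math.sqrt(sum(x * x for x in r)) / norm_h
```

A ratio near 0 means the negative-token representation lies almost entirely inside the positive subspace; a ratio near 1 means it is nearly orthogonal to it.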

PrefMoE: Robust Preference Modeling with Mixture-of-Experts Reward Learning

PrefMoE：基于混合专家奖励学习的稳健偏好建模

Authors: Ziqin Yuan, Ruiqi Wang, Dezhong Zhao, Baijian Yang, Byung-Cheol Min
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.00384
Pdf link: https://arxiv.org/pdf/2605.00384
Abstract Preference-based reinforcement learning offers a scalable alternative to manual reward engineering by learning reward structures from comparative feedback. However, large-scale preference datasets, whether collected from crowdsourced annotators or generated by synthetic teachers, often contain heterogeneous and partially conflicting supervision, including disagreement across annotators and inconsistency within annotators. Existing reward learning methods typically fit a single reward model to such data, forcing it to average incompatible signals and thereby limiting robustness. To solve this, we propose PrefMoE, a mixture-of-experts reward learning framework for robust preference modeling. PrefMoE learns multiple specialized reward experts and uses trajectory-level soft routing to combine them adaptively, enabling the model to capture diverse latent preference patterns under noisy and heterogeneous preference supervision. A load-balancing regularizer further stabilizes training by preventing expert collapse. Across locomotion benchmarks from D4RL and manipulation tasks from MetaWorld, PrefMoE improves preference prediction robustness and leads to more reliable downstream policy learning than strong single-model baselines.
中文摘要 基于偏好的强化学习通过从比较反馈中学习奖励结构，为人工奖励工程提供了一种可扩展的替代方案。然而，大规模偏好数据集无论是由众包标注者收集还是由合成教师生成，往往包含异质且部分冲突的监督信号，包括标注者之间的分歧和标注者自身的不一致。现有的奖励学习方法通常用单一奖励模型拟合此类数据，迫使其对互不兼容的信号取平均，从而限制了鲁棒性。为解决这一问题，我们提出了PrefMoE，一个用于稳健偏好建模的混合专家（mixture-of-experts）奖励学习框架。PrefMoE学习多个专门化的奖励专家，并利用轨迹级软路由自适应地组合它们，使模型能够在带噪且异质的偏好监督下捕捉多样的潜在偏好模式。负载均衡正则项通过防止专家坍缩进一步稳定训练。在D4RL的运动（locomotion）基准和MetaWorld的操作任务上，PrefMoE提升了偏好预测的鲁棒性，并使下游策略学习比强单模型基线更为可靠。
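
A minimal sketch of trajectory-level soft routing over reward experts, as described in the PrefMoE abstract. The Bradley-Terry preference model used here is a common choice in preference-based RL but is an assumption of this sketch, not something the abstract states; the function names and the gate logits are hypothetical, and the load-balancing regularizer is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_return(step_rewards_per_expert, gate_logits):
    """Combine per-expert trajectory returns with soft routing weights.

    step_rewards_per_expert[k] holds expert k's per-step rewards for one
    trajectory; gate_logits[k] is that trajectory's routing logit for expert k.
    """
    weights = softmax(gate_logits)
    return sum(w * sum(r) for w, r in zip(weights, step_rewards_per_expert))

def preference_probability(return_a, return_b):
    """Bradley-Terry probability that trajectory A is preferred over B.

    Assumed model for illustration; very large return gaps could overflow
    math.exp in this simple form.
    """
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))
```

With equal gate logits the mixture reduces to a plain average of the experts' returns; sharper logits let one expert dominate a given trajectory.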

Model-Based Reinforcement Learning with Double Oracle Efficiency in Policy Optimization and Offline Estimation

在策略优化和离线估计上具有双重预言机（oracle）效率的基于模型的强化学习

Authors: Haichen Hu, Jian Qian, David Simchi-Levi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.00393
Pdf link: https://arxiv.org/pdf/2605.00393
Abstract Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-efficient algorithms, their computational complexity typically scales with the cardinality of the state and action spaces, rendering them intractable for large-scale or continuous environments. In this paper, we address this fundamental limitation by studying offline oracle-efficient episodic RL through the lens of log-barrier and log-determinant regularization. Specifically, for tabular Markov Decision Processes (MDPs), we propose a novel algorithm that achieves the optimal $\tilde{O}(\sqrt{T})$ regret bound while requiring only $O(H\log\log T)$ calls to both the offline statistical estimation and planning oracles when $T$ is known and $O(H\log T)$ calls when $T$ is unknown. Crucially, this oracle complexity is entirely independent of the size of the state and action spaces. This strict independence drastically reduces the planning oracle complexity, representing a substantial improvement over existing offline oracle-efficient algorithms (Qian et al., 2024). Furthermore, we demonstrate the versatility of our framework by generalizing the algorithm to linear MDPs featuring infinite state spaces and arbitrary action spaces. We prove that this generalized approach successfully attains meaningful sub-linear regret. Consequently, our work yields the first doubly oracle-efficient (i.e., efficient with respect to both statistical estimation and policy optimization) regret minimization algorithm capable of solving MDPs with infinite state and action spaces, significantly expanding the boundaries of computationally tractable RL.
中文摘要 大型环境中的强化学习（RL）常常面临严重的计算瓶颈，因为传统的遗憾最小化算法需要反复、昂贵地调用规划预言机（oracle）和统计估计预言机。虽然近期工作探索了离线预言机高效的算法，但其计算复杂度通常随状态空间和动作空间的基数增长，使其在大规模或连续环境中难以处理。本文从对数障碍（log-barrier）和对数行列式（log-determinant）正则化的视角研究离线预言机高效的回合式（episodic）强化学习，以解决这一根本局限。具体来说，对于表格型马尔可夫决策过程（MDP），我们提出了一种新算法，在实现最优的 $\tilde{O}(\sqrt{T})$ 遗憾界的同时，当 $T$ 已知时对离线统计估计预言机和规划预言机均只需 $O(H\log\log T)$ 次调用，当 $T$ 未知时只需 $O(H\log T)$ 次调用。关键在于，这一预言机复杂度完全独立于状态空间和动作空间的大小。这种严格的独立性大幅降低了规划预言机的复杂度，相比现有的离线预言机高效算法（Qian 等，2024）有显著改进。此外，我们将算法推广到具有无限状态空间和任意动作空间的线性MDP，展示了该框架的通用性，并证明这一推广后的方法能够达到有意义的次线性遗憾。因此，我们的工作给出了首个双重预言机高效（即在统计估计和策略优化两方面都高效）的遗憾最小化算法，能够求解具有无限状态空间和动作空间的MDP，显著拓展了计算上可处理的强化学习的边界。

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

超越启发式：3D高斯喷溅的可学习密度控制

Authors: Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.00408
Pdf link: https://arxiv.org/pdf/2605.00408
Abstract While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at this https URL
中文摘要 虽然3D高斯喷溅（3DGS）展现了令人印象深刻的实时渲染性能，但其效果仍受限于对启发式密度控制的依赖。尽管这些手工规则经过多次改进，此类方法本质上仍缺乏适应具有复杂几何结构的多样场景的灵活性。本文提出将密度控制从僵硬的启发式规则转向完全可学习策略的范式转变。具体来说，我们提出了LeGS，该框架将密度控制重新表述为一个通过强化学习（RL）优化的参数化策略网络。我们方法的核心是基于敏感性分析定制的有效奖励函数，它精确量化了单个高斯基元对重建质量的边际贡献。为了保持计算上的可处理性，我们推导出一个闭式解，将奖励计算的复杂度从 $O(N^2)$ 降低到 $O(N)$。在Mip-NeRF 360、Tanks & Temples和Deep Blending数据集上的大量实验表明，LeGS显著优于最先进的方法，在重建质量与效率之间取得了更优的平衡。代码将发布于此 https URL

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

物理原生世界模型：生成世界建模的哈密顿视角

Authors: Sen Cui, Jingheng Ma
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.00412
Pdf link: https://arxiv.org/pdf/2605.00412
Abstract World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
中文摘要 世界模型最近重新成为具身智能、机器人学、自动驾驶和基于模型的强化学习的核心范式。然而，当前的世界模型研究往往由三条部分分离的路线主导：强调视觉未来合成的二维视频生成模型、强调空间重建的三维场景中心模型，以及强调抽象预测表征的类JEPA潜在模型。尽管每条路线都取得了重要进展，但它们仍难以为具身决策提供物理上可靠、动作可控且长时域稳定的预测。本文认为，世界模型的瓶颈不再仅仅在于能否生成逼真的未来，而在于这些未来是否具有物理意义并对行动有用。我们提出Hamiltonian World Models（哈密顿世界模型），作为一种以物理为基础的世界建模视角。其核心思想是将观测编码到结构化的潜在相空间中，通过带有控制项、耗散项和残差项的受哈密顿启发的动力学来演化潜在状态，将预测轨迹解码为未来观测，并利用由此得到的rollout进行规划。我们讨论了哈密顿结构可能如何提升可解释性、数据效率和长时域稳定性，同时指出了在涉及摩擦、接触、非保守力和可变形物体的真实机器人场景中的实际挑战。

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

边部署边学习：面向通用机器人策略的机群规模强化学习

Authors: Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.00416
Pdf link: https://arxiv.org/pdf/2605.00416
Abstract Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
中文摘要 通用机器人策略越来越多地受益于大规模预训练，但仅靠离线数据不足以实现稳健的真实世界部署。已部署的机器人会遇到分布偏移、长尾故障、任务变化以及人工纠正的机会，这些都是固定的演示数据集无法完全覆盖的。我们提出了边部署边学习（Learning While Deploying，LWD），这是一个机群规模的离线到在线强化学习框架，用于对通用视觉-语言-动作（VLA）策略进行持续后训练。LWD从预训练的VLA策略出发，利用在整个机器人机群中收集的自主rollout和人工干预，闭合部署、共享物理经验、策略改进和再部署之间的环路。为了在异构、稀疏奖励的机群数据上稳定学习，LWD将用于稳健价值估计的分布型隐式价值学习（DIVL）与用于在基于流的VLA动作生成器中提取策略的基于伴随匹配的Q学习（Q-learning via Adjoint Matching，QAM）相结合。我们在由16台双臂机器人组成的机群上，针对八个真实世界操作任务验证了LWD，其中包括语义化的杂货补货和3至5分钟的长时域任务。单一的通用策略随着机群经验的积累而不断改进，平均成功率达到95%，其中长时域任务上的提升最大。

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

AEM：面向多轮智能体强化学习的自适应熵调制

Authors: Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00425
Pdf link: https://arxiv.org/pdf/2605.00425
Abstract Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, we derive a practical proxy to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
中文摘要 强化学习（RL）显著提升了大型语言模型（LLM）智能体与环境交互并解决多轮任务的能力。然而，有效的训练依然具有挑战性，因为稀疏且仅基于结果的奖励使得很难将信用分配给智能体动作轨迹中的各个步骤。一种常见的解决方法是引入密集的中间监督，如过程奖励模型或辅助的自监督信号，但这会增加监督和调参的复杂度，且往往难以跨任务和领域泛化。本文提出了AEM，一种无需额外监督的信用分配方法，它在强化学习训练期间自适应地调制熵的动态，以实现更有效的探索-利用权衡。在理论上，我们将熵分析从token层面提升到响应层面，以降低token采样方差，并证明自然梯度下的熵漂移本质上由优势与相对响应惊异度（surprisal）的乘积所支配。具体来说，我们推导出一个实用的代理量来重塑训练动态，实现从探索到利用的自然过渡。在多个基准以及参数规模从1.5B到32B的模型上进行的大量实验证明了AEM的有效性，其中在极具挑战性的SWE-bench-Verified基准上，将AEM集成到一个最先进的基线后取得了1.4%的显著提升。

Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

通过需求感知课程强化学习提升LLM代码生成

Authors: Shouyu Yin, Zhao Tian, Junjie Chen, Shikai Guo
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00433
Pdf link: https://arxiv.org/pdf/2605.00433
Abstract Code generation, which aims to automatically generate source code from given programming requirements, has the potential to substantially improve software development efficiency. With the rapid advancement of large language models (LLMs), LLM-based code generation has attracted widespread attention from both academia and industry. However, as programming requirements become increasingly complex, existing LLMs still exhibit notable performance limitations. To address this challenge, recent studies have proposed training-based curriculum reinforcement learning (CRL) strategies to improve LLM code generation performance. Despite their effectiveness, existing CRL approaches suffer from several limitations, including misaligned requirement difficulty perception, the absence of requirement difficulty optimization, and suboptimal curriculum sampling strategies. In CRL-based code generation, programming requirements serve as the sole input to the model, making their quality and difficulty critical to training effectiveness. Motivated by insights from software requirements engineering, we propose RECRL, a novel requirement-aware curriculum reinforcement learning framework for enhancing LLM-based code generation. RECRL automatically perceives model-specific requirement difficulty, optimizes challenging requirements to improve training data utilization, and employs an adaptive curriculum sampling strategy to construct training batches with smoothly varying difficulty. Extensive experiments on five state-of-the-art LLMs across five widely-used code generation benchmarks by comparing with five state-of-the-art baselines, demonstrate the significant effectiveness of RECRL. For example, RECRL achieves an average Pass@1 improvement of 1.23%-5.62% over all state-of-the-art baselines.
中文摘要 代码生成旨在根据给定的编程需求自动生成源代码，有望显著提升软件开发效率。随着大型语言模型（LLM）的快速发展，基于LLM的代码生成引起了学术界和工业界的广泛关注。然而，随着编程需求日益复杂，现有LLM仍表现出明显的性能局限。为应对这一挑战，近期研究提出了基于训练的课程强化学习（CRL）策略，以提升LLM的代码生成性能。尽管有效，现有CRL方法仍存在若干局限，包括对需求难度的感知存在偏差、缺乏对需求难度的优化，以及课程采样策略欠佳。在基于CRL的代码生成中，编程需求是模型的唯一输入，因此其质量和难度对训练效果至关重要。受软件需求工程的启发，我们提出了RECRL，一种新的需求感知课程强化学习框架，用于增强基于LLM的代码生成。RECRL自动感知特定于模型的需求难度，优化具有挑战性的需求以提高训练数据利用率，并采用自适应课程采样策略来构建难度平滑变化的训练批次。我们在五个广泛使用的代码生成基准上，对五个最先进的LLM进行了大量实验，并与五个最先进的基线进行比较，结果证明了RECRL的显著有效性。例如，RECRL相对于所有最先进基线的平均Pass@1提升为1.23%-5.62%。

A Policy-Driven DRL Framework for System-Level Tradeoff Control in NR-U/Wi-Fi Coexistence

面向NR-U/Wi-Fi共存中系统级权衡控制的策略驱动DRL框架

Authors: Po-Heng Chou, Yi-Fang Yu, Shou-Yu Chen, Chiapin Wang
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.00457
Pdf link: https://arxiv.org/pdf/2605.00457
Abstract The coexistence of NR-U and Wi-Fi in unlicensed spectrum introduces a system-level resource coordination problem, where heterogeneous channel access mechanisms lead to a significant imbalance in spectrum utilization and degraded Wi-Fi performance. To address this challenge, we propose a policy-driven deep reinforcement learning (DRL) framework for adaptive TXOP control, in which the coexistence process is formulated as a Markov decision process (MDP) and a deep Q-network (DQN) learns control policies through online interaction. A key contribution is the introduction of a policy layer via reward design, enabling explicit control of system-level tradeoffs among fairness, throughput, and quality of service (QoS). Three policies, namely absolute fairness, moderate fairness, and utility-based fairness, are developed to achieve different operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared to absolute fairness, moderate fairness improves aggregate throughput by 68.22%, while the utility-based policy further enhances utility by 177.6%. These results demonstrate that policy-driven control provides a flexible and effective solution for managing tradeoffs in heterogeneous coexistence networks.
中文摘要 NR-U和Wi-Fi在免授权频谱中的共存带来了系统级的资源协调问题：异构的信道接入机制导致频谱利用率显著失衡，并使Wi-Fi性能下降。为应对这一挑战，我们提出了一种策略驱动的深度强化学习（DRL）框架，用于自适应TXOP控制，其中共存过程被建模为马尔可夫决策过程（MDP），深度Q网络（DQN）通过在线交互学习控制策略。一项关键贡献是通过奖励设计引入策略层，从而能够显式控制公平性、吞吐量和服务质量（QoS）之间的系统级权衡。我们设计了三种策略，即绝对公平、适度公平和基于效用的公平，以实现不同的工作点。仿真结果表明，在严格公平控制下，所提框架的Jain公平指数高于0.9。与绝对公平相比，适度公平将总吞吐量提升了68.22%，而基于效用的策略则进一步将效用提升了177.6%。这些结果表明，策略驱动的控制为管理异构共存网络中的权衡提供了灵活且有效的解决方案。
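
The Jain fairness index reported in this abstract has a standard closed form, J(x) = (sum of x_i)^2 / (n * sum of x_i^2), which the sketch below computes. The function name and the throughput values in the usage note are illustrative assumptions, not figures from the paper.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all users get equal throughput and 1/n when a single
    user gets everything.
    """
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)
```

For example, `jain_index([10.0, 10.0])` is 1.0, while a heavily skewed split such as `jain_index([19.0, 1.0])` falls well below the 0.9 threshold mentioned in the abstract.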

Recovering Hidden Reward in Diffusion-Based Policies

恢复基于扩散的策略中的隐藏奖励

Authors: Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.00623
Pdf link: https://arxiv.org/pdf/2605.00623
Abstract This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at this https URL.
中文摘要 本文提出了EnergyFlow，这是一个通过参数化一个标量能量函数（其梯度即为去噪场）来统一生成式动作建模与逆强化学习的框架。我们证明，在最大熵最优性条件下，通过去噪分数匹配学到的分数函数能够恢复专家软Q函数的梯度，从而无需对抗训练即可提取奖励。在形式上，我们证明了将学到的场约束为保守场可以降低假设复杂度，并收紧分布外泛化界。我们还进一步刻画了所恢复奖励的可识别性，并给出了分数估计误差向动作偏好传播的上界。在实验上，EnergyFlow在多种操作任务上取得了最先进的模仿性能，同时为下游强化学习提供了有效的奖励信号，其表现优于对抗式IRL方法和基于似然的替代方案。这些结果表明，有效奖励提取所需的结构约束同时也是有利于策略泛化的归纳偏置。代码可在此 https URL 访问。

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

向自己学习点击位置：面向GUI定位的同策略自蒸馏

Authors: Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.00642
Pdf link: https://arxiv.org/pdf/2605.00642
Abstract Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at this https URL.
中文摘要 图形用户界面（GUI）定位（grounding）将自然语言指令映射到目标元素的视觉坐标，是自主GUI智能体的一项核心能力。近期的强化学习方法（如GRPO）取得了强劲的性能，但它们依赖代价高昂的多次rollout，并且在困难样本上面临信号稀疏的问题。这些局限使得同策略自蒸馏（on-policy self-distillation，OPSD）成为一个有前景的替代方案，它能够从单次rollout中提供密集的token级监督。然而，其在GUI定位上的适用性尚未被探索。本文提出了GUI-SD，这是首个专为GUI定位设计的OPSD框架。首先，它利用目标边界框和高斯软掩码为教师构建视觉增强的特权上下文，在不泄露精确坐标的情况下提供有信息量的指导。其次，它采用熵引导的蒸馏，根据数位的重要性和教师的置信度自适应地为token加权，将优化集中在影响最大且最可靠的位置上。在六个有代表性的GUI定位基准上的大量实验表明，GUI-SD在准确率和训练效率上均持续优于基于GRPO的方法和朴素OPSD。代码和训练数据可在此 https URL 访问。

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

用于强化学习中逐状态安全的增广拉格朗日乘子网络

Authors: Jiaming Zhang, Yujie Yang, Yao Lyu, Shengbo Eben Li, Liping Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00667
Pdf link: https://arxiv.org/pdf/2605.00667
Abstract Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessitating neural networks to approximate them as a multiplier network. However, applying standard dual gradient ascent to multiplier networks induces severe training oscillations. This is because the inherent instability of dual ascent is exacerbated by network generalization -- local overshoots and delayed updates propagate to adjacent states, further amplifying policy fluctuations. Existing stabilization techniques are designed for scalar multipliers, which are inadequate for state-dependent multiplier networks. To address this challenge, we propose an augmented Lagrangian multiplier network (ALaM) framework for stable learning of state-wise multipliers. ALaM consists of two key components. First, a quadratic penalty is introduced into the augmented Lagrangian to compensate for delayed multiplier updates and establish the local convexity near the optimum, thereby mitigating policy oscillations. Second, the multiplier network is trained via supervised regression toward a dual target, which stabilizes training and promotes convergence. Theoretically, we show that ALaM guarantees multiplier convergence and thus recovers the optimal policy of the constrained problem. Building on this framework, we integrate soft actor-critic (SAC) with ALaM to develop the SAC-ALaM algorithm. Experiments demonstrate that SAC-ALaM outperforms state-of-the-art safe RL baselines in both safety and return, while also stabilizing training dynamics and learning well-calibrated multipliers for risk identification.
中文摘要 安全是现实世界强化学习（RL）中的首要挑战。将安全要求表述为逐状态（state-wise）约束已成为一种重要范式。用拉格朗日方法处理逐状态约束需要为每个状态设置一个不同的乘子，因此需要用神经网络来近似这些乘子，即乘子网络。然而，对乘子网络应用标准的对偶梯度上升会引发严重的训练振荡。这是因为对偶上升固有的不稳定性被网络的泛化进一步加剧：局部的超调和延迟的更新会传播到相邻状态，进一步放大策略波动。现有的稳定化技术是为标量乘子设计的，不足以应对依赖于状态的乘子网络。为应对这一挑战，我们提出了增广拉格朗日乘子网络（ALaM）框架，用于稳定地学习逐状态乘子。ALaM由两个关键部分组成。首先，在增广拉格朗日函数中引入二次惩罚项，以补偿乘子更新的延迟，并在最优点附近建立局部凸性，从而减轻策略振荡。其次，乘子网络通过朝向对偶目标的监督回归进行训练，这能够稳定训练并促进收敛。在理论上，我们证明ALaM保证乘子收敛，从而恢复约束问题的最优策略。基于该框架，我们将soft actor-critic（SAC）与ALaM相结合，得到SAC-ALaM算法。实验表明，SAC-ALaM在安全性和回报两方面均优于最先进的安全强化学习基线，同时稳定了训练动态，并学到了校准良好、可用于风险识别的乘子。

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

STARE：面向多模态毒性攻击的逐步时间对齐与红队引擎

Authors: Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng, Hongying Zan, Cong Wang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2605.00699
Pdf link: https://arxiv.org/pdf/2605.00699
Abstract Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit diffuse toxicity, whereas adversarial optimization concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. Targeted perturbations of either window selectively suppress different toxicity categories, indicating that this temporal structure is a genuine causal handle rather than a side effect of the hierarchical design. The phenomenon turns toxicity formation from a chaotic process into a small set of predictable vulnerability windows, providing both a potent attack engine and a basis for phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.
中文摘要 对视觉语言模型进行红队测试，对于识别对抗性图文输入触发有毒输出的漏洞至关重要。现有方法将图像生成视为黑盒，只返回最终的毒性评分，而没有回答有毒语义在多步合成过程中何时以及如何出现的问题。我们提出了STARE，这是一种分层强化学习框架，在直接白盒T2I和仅可查询的黑盒VLM设置下，将去噪轨迹本身视为攻击面。通过基于Group Relative Policy Optimization（GRPO）将高层的提示编辑器与低层的T2I微调相耦合，STARE的攻击成功率相比最先进的黑盒和白盒基线提升了68%。更重要的是，这一轨迹层面的视角揭示了优化诱导的阶段对齐（Optimization-Induced Phase Alignment）现象：原始模型表现出弥散的毒性，而对抗性优化则将概念性危害集中到早期的语义阶段，将细节性危害集中到后期的细化阶段。对任一窗口施加有针对性的扰动会选择性地抑制不同的毒性类别，这表明该时间结构是真实的因果抓手，而非分层设计的副作用。该现象将毒性的形成从一个混乱的过程转变为一小组可预测的脆弱性窗口，既提供了强大的攻击引擎，也为阶段感知的安全机制奠定了基础。内容警告：本文包含可能令人反感或不安的有毒内容示例。

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

学习如何记忆以及记忆什么：面向演化记忆的认知启发式两阶段优化

Authors: Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang, Yi Wen, Yimin Deng, Wenlin Zhang, Enhong Chen, Xiangyu Zhao, Tong Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.00702
Pdf link: https://arxiv.org/pdf/2605.00702
Abstract Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-crafted update rules; although reinforcement learning (RL)-based agents learn memory updates, sparse outcome rewards provide weak supervision, resulting in unstable long-horizon optimization. Drawing on memory schema theory and the functional division between prefrontal regions and hippocampus regions, we introduce MemCoE, a cognition-inspired two-stage optimization framework that learns how memory should be organized and what information to update. In the first stage, we propose Memory Guideline Induction to optimize a global guideline via contrastive feedback interpreted as textual gradients; in the second stage, Guideline-Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi-turn RL to learn a guideline-following memory evolution policy. We evaluate on three personalization memory benchmarks, covering explicit/implicit preference and different sizes and noise, and observe consistent improvements over strong baselines with favorable robustness, transferability, and efficiency.
中文摘要 大型语言模型（LLM）智能体需要长期的用户记忆以实现一致的个性化，但有限的上下文窗口妨碍了在长时间交互中跟踪不断演变的偏好。现有的记忆系统主要依赖静态的手工更新规则；尽管基于强化学习（RL）的智能体能够学习记忆更新，但稀疏的结果奖励只能提供较弱的监督，导致长时域优化不稳定。借鉴记忆图式理论以及前额叶区域与海马体区域之间的功能分工，我们提出了MemCoE，这是一个受认知启发的两阶段优化框架，学习记忆应如何组织以及应更新哪些信息。在第一阶段，我们提出记忆准则归纳（Memory Guideline Induction），通过被解释为文本梯度的对比反馈来优化全局准则；在第二阶段，准则对齐的记忆策略优化（Guideline-Aligned Memory Policy Optimization）利用归纳出的准则定义结构化的过程奖励，并进行多轮强化学习，以学习遵循准则的记忆演化策略。我们在三个个性化记忆基准上进行评估，涵盖显式/隐式偏好以及不同的规模和噪声，观察到相对于强基线的一致提升，并具有良好的鲁棒性、可迁移性和效率。

SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

SAVGO：利用余弦相似度学习状态-动作值几何以用于连续控制

Authors: Stavros Orfanoudakis, Pedro P. Vergara
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.00787
Pdf link: https://arxiv.org/pdf/2605.00787
Abstract While representation and similarity learning have improved the sample efficiency of Reinforcement Learning (RL), they are rarely used to shape policy updates directly in the action space. To bridge this gap, a geometry-aware RL algorithm that explicitly incorporates value-based similarity into the policy update, State-Action Value Geometry Optimization (SAVGO), is proposed. In detail, SAVGO learns a joint state-action embedding space in which pairs with similar action-value estimates exhibit high cosine similarity, while dissimilar pairs are mapped to distinct directions. This learned geometry enables the generation of a similarity kernel over candidate actions sampled at each update, allowing policy improvement to be guided directly toward higher-value regions beyond local gradient-based updates. As a result, representation learning, value estimation, and policy optimization are unified within a single geometry-consistent objective, while preserving the scalability of off-policy actor-critic training. The proposed method is evaluated on standard MuJoCo continuous-control benchmarks, demonstrating improvements over strong baselines on challenging high-dimensional tasks. Ablation studies are done to analyze the contributions of value-geometry learning and similarity-based policy updates.
中文摘要 虽然表征学习和相似性学习提高了强化学习（RL）的样本效率，但它们很少被用于直接在动作空间中塑造策略更新。为弥合这一差距，本文提出了一种几何感知的强化学习算法，即状态-动作值几何优化（SAVGO），它将基于值的相似性显式地纳入策略更新。具体来说，SAVGO学习一个联合的状态-动作嵌入空间，其中动作值估计相近的状态-动作对具有较高的余弦相似度，而不相似的对则被映射到不同的方向。这种学到的几何结构使得可以在每次更新时采样的候选动作上构造相似性核，从而使策略改进能够超越基于局部梯度的更新，被直接引导到更高价值的区域。因此，表征学习、价值估计和策略优化被统一到一个几何一致的目标中，同时保留了离策略（off-policy）actor-critic训练的可扩展性。该方法在标准的MuJoCo连续控制基准上进行了评估，在具有挑战性的高维任务上相对于强基线取得了提升。文中还通过消融研究分析了值几何学习和基于相似性的策略更新各自的贡献。
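
To make the idea of a similarity kernel over candidate actions concrete, here is a small sketch. It assumes each candidate state-action pair already has an embedding vector and turns cosine similarities to an anchor embedding into normalized weights with a softmax. The softmax form, the `temperature` parameter, and the function names are assumptions of this sketch; the abstract does not specify the kernel.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

def similarity_weights(anchor, candidates, temperature=0.1):
    """Softmax over cosine similarities between anchor and each candidate.

    Lower temperature concentrates weight on the most similar candidates.
    """
    sims = [cosine_similarity(anchor, c) for c in candidates]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```

Candidates whose embeddings point in the same direction as the anchor receive the largest weights, which is one simple way such a geometry could steer an update toward similar, higher-value actions.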

Keyword: diffusion policy

MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

MSACT：多级空间对准，实现稳定低延迟的精细操作

Authors: Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.00475
Pdf link: https://arxiv.org/pdf/2605.00475
Abstract Real-world fine manipulation, particularly in bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency, generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency, vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT with a pretrained ResNet visual prior, a multistage attention module extracts task-relevant 2D attention points as a local spatial modality for action prediction. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.
中文摘要 真实世界的精细操作，尤其是双臂操作，通常需要低延迟控制和稳定的视觉定位，而收集大规模数据成本高昂，有限的演示又可能导致定位漂移。现有方法做出了不同的权衡：ACT等动作分块策略能够实现低延迟执行和数据效率，但依赖稠密的视觉特征而缺乏显式的空间一致性；Diffusion Policy等生成式方法提升了表达能力，但可能带来迭代采样延迟；视觉-语言-动作方法和基于体素的方法增强了泛化能力和几何基础，但需要更高的计算成本和系统复杂度。我们提出了一个多阶段空间注意力模块，它提取稳定的二维注意力点，并借助时间对齐损失联合预测未来的注意力序列。该方法建立在ACT之上，并使用预训练的ResNet作为视觉先验，由多阶段注意力模块提取与任务相关的二维注意力点，作为动作预测的局部空间模态。为保持一致的物体跟踪，我们引入了一个自监督目标，将预测的注意力序列与未来帧的视觉特征对齐，在无需关键点标注的情况下抑制漂移，并在数据有限时提升视觉到动作映射的稳定性。在ALOHA双臂平台上进行的仿真和真实世界精细操作任务实验评估了任务成功率、注意力漂移、推理延迟以及对视觉干扰的鲁棒性。结果表明，在所测试的条件下，定位稳定性和任务性能均有所提升，同时保持了低延迟推理。