Arxiv Papers of Today

Generated: 2026-02-19 16:49:34 (UTC+8); arXiv release time: 2026-02-19 20:00 EST (2026-02-20 09:00 UTC+8)

22 relevant papers today

Keyword: reinforcement learning

Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

通过目标导向偏好优化，在任务导向对话中解耦策略与执行

Authors: Jingyi Xu, Xingyu Ren, Zhiqiang You, Yumeng Zhang, Zhoupeng Shou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.15854
Pdf link: https://arxiv.org/pdf/2602.15854
Abstract Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
中文摘要 大型语言模型在任务导向对话系统中展现潜力，但现有训练方法往往依赖于代币级似然或偏好优化，这与长期任务成功率不匹配。为此，我们提出了目标导向偏好优化（GOPO），这是一种分层强化学习框架，通过专家代理和客户服务代理将战略规划与响应生成解耦。专家代理在对话轨迹层面优化多回合目标偏好，而客户服务代理则生成严格符合所选策略的回复。我们基于公开基准和电子商务客户服务数据集评估GOPO，并引入了基于真实电子商务互动数据的序列级指标——任务聚焦顺序参与度（TSE）。在Mgshop数据集上，GOPO比PPO和Memento分别提升了7.7%和10.3%，序列级奖励和生成质量持续提升。此外，使用GOPO训练的14B模型的TSE分别比Qwen-235B和GPT-5.2高出2.7%和1.5%。消融研究证实了专家代理在长期优化中的关键作用。GOPO在其他数据集中也持续展现了持续的改进。这项工作为商业场景中的任务导向对话系统建立了新的范式，代码和数据集将公开。

MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

MARVL：通过视觉语言模型实现机器人操作的多阶段指导

Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, Xiangkun Li, ShengHua Wan, Xiaohai Hu, Yuan Lei, Le Gan, De-chuan Zhan
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.15872
Pdf link: https://arxiv.org/pdf/2602.15872
Abstract Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
中文摘要 设计高密度奖励函数对于高效的机器人强化学习（RL）至关重要。然而，大多数密集奖励依赖于人工工程，这从根本上限制了强化学习的可扩展性和自动化。虽然视觉语言模型（VLMs）为奖励设计提供了一条有前景的路径，但天真的VLM奖励常常与任务进展不匹配，难以适应空间定位，并且对任务语义理解有限。为解决这些问题，我们提出了MARVL-多阶段指导，用于通过视觉-语言模型进行机器人作。MARVL对VLM进行空间和语义一致性的微调，并将任务分解为多阶段子任务，并通过任务方向投影以提升轨迹敏感性。在经验上，MARVL在Meta-World基准测试中显著优于现有VLM-奖励方法，在稀疏奖励作任务中展现出更优的样本效率和鲁棒性。

Learning to Drive in New Cities Without Human Demonstrations

在没有人工示范的新城市学习驾驶

Authors: Zilin Wang, Saeed Rahmani, Daphne Cornelisse, Bidipta Sarkar, Alexander David Goldie, Jakob Nicolaus Foerster, Shimon Whiteson
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.15891
Pdf link: https://arxiv.org/pdf/2602.15891
Abstract While autonomous vehicles have achieved reliable performance within specific operating regions, their deployment to new cities remains costly and slow. A key bottleneck is the need to collect many human demonstration trajectories when adapting driving policies to new cities that differ from those seen in training in terms of road geometry, traffic rules, and interaction patterns. In this paper, we show that self-play multi-agent reinforcement learning can adapt a driving policy to a substantially different target city using only the map and meta-information, without requiring any human demonstrations from that city. We introduce NO data Map-based self-play for Autonomous Driving (NOMAD), which enables policy adaptation in a simulator constructed based on the target-city map. Using a simple reward function, NOMAD substantially improves both task success rate and trajectory realism in target cities, demonstrating an effective and scalable alternative to data-intensive city-transfer methods. Project Page: this https URL
中文摘要 尽管自动驾驶车辆在特定运营区域内取得了稳定的性能，但其在新城市的部署仍然成本高昂且速度缓慢。一个关键瓶颈是，在适应新城市的驾驶政策时，需要收集大量人类示范轨迹，这些政策在道路几何形状、交通规则和交互模式方面与培训中所见不同。本文展示了自玩多智能体强化学习可以仅凭地图和元信息，将驱动政策调整到一个截然不同的目标城市，无需该城市进行任何人工演示。我们引入了无数据地图的自动驾驶自玩（NOMAD），使基于目标城市地图构建的模拟器中能够进行政策调整。利用简单的奖励函数，NOMAD显著提升目标城市中的任务成功率和轨迹真实性，展示了一种有效且可扩展的替代数据密集型城市转移方法。项目页面：此 https URL

Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets

利用隐性合作：一种多智能体强化学习方法推动去中心化的本地能源市场

Authors: Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera
Subjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2602.16062
Pdf link: https://arxiv.org/pdf/2602.16062
Abstract This paper proposes implicit cooperation, a framework enabling decentralized agents to approximate optimal coordination in local energy markets without explicit peer-to-peer communication. We formulate the problem as a decentralized partially observable Markov decision problem that is solved through a multi-agent reinforcement learning task in which agents use stigmergic signals (key performance indicators at the system level) to infer and react to global states. Through a 3x3 factorial design on an IEEE 34-node topology, we evaluated three training paradigms (CTCE, CTDE, DTDE) and three algorithms (PPO, APPO, SAC). Results identify APPO-DTDE as the optimal configuration, achieving a coordination score of 91.7% relative to the theoretical centralized benchmark (CTCE). However, a critical trade-off emerges between efficiency and stability: while the centralized benchmark maximizes allocative efficiency with a peer-to-peer trade ratio of 0.6, the fully decentralized approach (DTDE) demonstrates superior physical stability. Specifically, DTDE reduces the variance of grid balance by 31% compared to hybrid architectures, establishing a highly predictable, import-biased load profile that simplifies grid regulation. Furthermore, topological analysis reveals emergent spatial clustering, where decentralized agents self-organize into stable trading communities to minimize congestion penalties. While SAC excelled in hybrid settings, it failed in decentralized environments due to entropy-driven instability. This research proves that stigmergic signaling provides sufficient context for complex grid coordination, offering a robust, privacy-preserving alternative to expensive centralized communication infrastructure.
中文摘要 本文提出了隐性合作，这一框架使去中心化的代理能够在本地能源市场中近似实现最佳协调，而无需显式的点对点通信。我们将该问题表述为一个去中心化的部分可观测马尔可夫决策问题，通过多智能体强化学习任务解决，智能体使用烙印信号（系统级的关键绩效指标）来推断并响应全局状态。通过在IEEE 34节点拓扑上的3x3乘乘设计，我们评估了三种训练范式（CTCE、CTDE、DTDE）和三种算法（PPO、APPO、SAC）。结果确定APPO-DTDE为最优配置，相较理论集中基准（CTCE）达到91.7%的协调得分。然而，效率与稳定性之间存在关键权衡：中心化基准以0.6的点对点交易比率最大化配置效率，而全去中心化方法（DTDE）则展现出更优越的物理稳定性。具体来说，DTDE相比混合架构将电网平衡的方差减少了31%，建立了高度可预测、进口偏向的负载分布，简化了电网调控。此外，拓扑分析揭示了涌现的空间聚类现象，即去中心化的代理自我组织成稳定的交易社区，以最小化拥塞惩罚。虽然SAC在混合环境中表现出色，但在去中心化环境中因熵驱动的不稳定性而失败。本研究证明，标噪信号为复杂的网格协调提供了足够的背景，提供了一种稳健且保护隐私的替代方案，替代昂贵的集中式通信基础设施。

Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

通过多听众软执行在推理中平衡忠实性与性能

Authors: Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani, Mohit Bansal, Elias Stengel-Eskin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16154
Pdf link: https://arxiv.org/pdf/2602.16154
Abstract Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mistake injection AOC -- while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
中文摘要 思维链（CoT）推理有时无法忠实反映大型语言模型（LLM）的真实计算，这限制了其解释LLM如何得出答案的实用性。此外，优化推理的忠实性和可解释性往往会降低任务表现。为解决这一权衡并提升CoT的忠实度，我们提出了多方强化学习方法——多听者推理执行（REMUL）。REMUL基于这样一个假设：其他方可以遵循的推理痕迹会更为忠实。说话模型生成一个推理轨迹，该轨迹被截断后传递给一组听者模型，听者“执行”该轨迹，继续追踪至答案。说话者因提出对听众清晰的推理而获得奖励，并通过掩蔽监督微调进行额外的正确性规范化，以平衡忠实度与表演之间的权衡。在多个推理基准测试（BIG-Bench Extra Hard、MuSR、ZebraLogicBench 和 FOLIO）上，REMUL 持续且显著地提升了三项忠实度指标——提示归因、早期解答曲线面积（AOC）和错误注入 AOC——同时提升了准确性。我们的分析发现，这些增益在训练领域中表现稳健，转化为可读性提升，并与更短且更直接的CoT相关。

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

HiPER：带有显式信用分配的大型语言模型代理的层级强化学习

Authors: Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16165
Pdf link: https://arxiv.org/pdf/2602.16165
Abstract Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
中文摘要 将LLM训练为多回合决策的交互代理仍然具有挑战性，尤其是在长期任务中奖励稀疏且延迟，代理必须执行长时间的作序列才能获得有意义的反馈。大多数现有的强化学习（RL）方法将模型LLM代理视为单一时间尺度的扁平策略，每个回合选择一个动作。在稀疏奖励环境下，这种扁平策略必须在整个轨迹中传播信用，且不显式时间抽象，这常导致优化不稳定和信用分配效率低下。我们提出了HiPER，一种新颖的分层计划-执行强化学习框架，明确区分了高层规划和低层执行。HiPER 将策略分解为一个高级规划器，提出子目标，以及一个低级别执行者，在多个行动步骤中执行这些目标。为了使优化与该结构保持一致，我们引入了一种关键技术——层级优势估计（HAE），它在规划和执行两个层面都精心分配功劳。通过汇总每个子目标执行的收益并协调两层级的更新，HAE提供了无偏梯度估计器，并可证明地降低了与扁平广义优势估计相比的方差。从实证角度看，HiPER 在具有挑战性的交互基准测试中取得了最先进的性能，在 ALFWorld 上达到 97.4% 的成功率，在 WebShop 上使用 Qwen2.5-7B-Instruct 实现了 83.3% 的成功率（比最佳先验方法高出 +6.6% 和 +8.3%），尤其是在需要多个依赖子任务的长期任务上取得显著提升。这些结果凸显了显式层级分解对于多回合LLM代理可扩展强化学习训练的重要性。
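The idea of assigning credit at two levels can be illustrated with a small sketch. This is not the paper's HAE estimator, whose exact form the abstract does not give. It only assumes that per-step rewards are first aggregated over each subgoal's execution, and that standard generalized advantage estimation is then applied once over subgoals (planner level) and once over the steps inside a subgoal (executor level). The value estimates, the segment boundaries, and both function names are hypothetical.

```python
def gae(rewards, values, gamma, lam):
    """Standard generalized advantage estimation.

    `values` must hold len(rewards) + 1 entries (the last one is the bootstrap value).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def segment_returns(rewards, boundaries, gamma):
    """Discounted reward collected while each subgoal was being executed."""
    returns = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        returns.append(sum(gamma ** k * r for k, r in enumerate(rewards[start:end])))
    return returns


# Hypothetical sparse-reward episode: six steps, two subgoals of three steps each.
step_rewards = [0, 0, 0, 0, 0, 1]
boundaries = [0, 3, 6]

# Planner level: one advantage per subgoal (gamma = 1 keeps the arithmetic exact).
subgoal_returns = segment_returns(step_rewards, boundaries, gamma=1.0)
planner_values = [0.5, 0.8, 0.0]  # assumed critic outputs at subgoal boundaries
planner_advantages = gae(subgoal_returns, planner_values, gamma=1.0, lam=1.0)

# Executor level: one advantage per step inside the second subgoal.
executor_values = [0.6, 0.7, 0.9, 0.0]  # assumed per-step critic outputs
executor_advantages = gae(step_rewards[3:6], executor_values, gamma=1.0, lam=1.0)
```

With a discount below 1, the planner-level discount would have to account for the number of steps each subgoal took; the sketch sidesteps this by using gamma = 1.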

Edge Learning via Federated Split Decision Transformers for Metaverse Resource Allocation

通过联邦分裂决策变换器进行元宇宙资源分配的边缘学习

Authors: Fatih Temiz, Shavbo Salehi, Melike Erol-Kantarci
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2602.16174
Pdf link: https://arxiv.org/pdf/2602.16174
Abstract Mobile edge computing (MEC) based wireless metaverse services offer an untethered, immersive experience to users, where superior quality of experience (QoE) needs to be achieved under stringent latency constraints and visual quality demands. To achieve this, MEC-based intelligent resource allocation for virtual reality users needs to be supported by coordination across MEC servers to harness distributed data. Federated learning (FL) is a promising solution, and can be combined with reinforcement learning (RL) to develop generalized policies across MEC servers. However, conventional FL requires transmitting the full model parameters between the MEC servers and the cloud, and suffers performance degradation due to naive global aggregation, especially in heterogeneous multi-radio access technology environments. To address these challenges, this paper proposes Federated Split Decision Transformer (FSDT), an offline RL framework where the transformer model is partitioned between MEC servers and the cloud. Agent-specific components (e.g., MEC-based embedding and prediction layers) enable local adaptability, while shared global layers in the cloud facilitate cooperative training across MEC servers. Experimental results demonstrate that FSDT improves QoE by up to 10% in heterogeneous environments compared to baselines, while offloading nearly 98% of the transformer model parameters to the cloud, thereby reducing the computational burden on MEC servers.
中文摘要 基于移动边缘计算（MEC）的无线元宇宙服务为用户提供无连接、沉浸式体验，在严格的延迟限制和视觉质量要求下，必须实现卓越的体验质量（QoE）。为此，基于MEC的虚拟现实用户智能资源分配需要通过跨MEC服务器协调支持，以利用分布式数据。联邦学习（FL）是一个有前景的解决方案，可以与强化学习（RL）结合，在MEC服务器上制定通用策略。然而，传统的 FL 需要在 MEC 服务器和云端传输完整的模型参数，并且由于朴素的全局聚合，尤其是在异构多无线接入技术环境中，性能会下降。为应对这些挑战，本文提出了联邦分裂决策变换器（FSDT），这是一种离线强化学习框架，其中变换器模型被分区在MEC服务器和云端之间。代理专用组件（例如基于MEC的嵌入层和预测层）支持本地适应性，而云中的共享全局层则促进了跨MEC服务器的协作训练。实验结果表明，FSDT 在异构环境中相比基线提升了高达 10% 的 QoE，同时将近 98% 的变压器模型参数卸载到云端，从而减轻了 MEC 服务器的计算负担。

EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

EnterpriseGym Corecraft：在高保真强化环境中训练通用代理

Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16179
Pdf link: https://arxiv.org/pdf/2602.16179
Abstract We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI's suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on $\tau^2$-Bench Retail, and +6.8% on Toolathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
中文摘要 我们证明，在高保真强化学习环境中训练AI代理，能够产生超越训练分布的泛化能力。我们介绍 \corecraft{}，这是 \textsc{EnterpriseGym} 中的第一个环境，这是 Surge AI 的代理式强化环境套件。\corecraft{} 是一个完全运营的企业级客户支持组织模拟，包含2500多个实体，涵盖14种实体类型，拥有23种独特工具，旨在衡量人工智能代理是否能够执行真实工作所需的多步骤、领域特定工作。像GPT-5.2和Claude Opus 4.6这样的前沿模型，在满足所有专家作者评分标准时，解决的任务不到30%。利用该环境，我们用组相对策略优化（GRPO）和自适应剪裁训练GLM~4.6。经过单一训练阶段后，模型在未完成评估任务的任务通过率从25.37%提升到36.76%。更重要的是，这些涨幅会转移到非发行基准指标：BFCL Parallel的+4.5%，$\tau^2$-Bench零售的+7.4%，以及Toolathlon（Pass@1）的+6.8%。我们认为三种环境特性与观察到的转移一致：以任务为中心的世界构建，优化多样化且具有挑战性的任务;专家编写的评分标准，支持可靠的奖励计算;以及反映现实职业模式的企业工作流程。我们的结果表明，环境质量、多样性和真实性是促成代理可推广能力的关键因素。
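For readers unfamiliar with GRPO, the group-relative part can be sketched in a few lines. This assumes the commonly used formulation in which each rollout's reward is standardized against the other rollouts sampled for the same task. The function name and the sample rewards are illustrative. The adaptive clipping mentioned in the abstract is not described there and is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical pass/fail rubric outcomes for four rollouts of one task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every rollout scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Passing rollouts receive a positive advantage and failing ones a negative advantage, without a learned value function.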

Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning

合作异构多智能体强化学习中的 Graphon 平均场子采样

Authors: Emile Anand, Richard Hoffmann, Sarah Liaw, Adam Wierman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.16196
Pdf link: https://arxiv.org/pdf/2602.16196
Abstract Coordinating large populations of interacting agents is a central challenge in multi-agent reinforcement learning (MARL), where the size of the joint state-action space scales exponentially with the number of agents. Mean-field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions. Recent graphon-based frameworks capture heterogeneity, but are computationally expensive as the number of agents grows. Therefore, we introduce GMFS, a Graphon Mean-Field Subsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling $\kappa$ agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with sample complexity $\mathrm{poly}(\kappa)$ and optimality gap $O(1/\sqrt{\kappa})$. We verify our theory with numerical simulations in robotic coordination, showing that GMFS achieves near-optimal performance.
中文摘要 协调大量交互代理是多智能体强化学习（MARL）中的核心挑战，该领域联合状态-行动空间的规模会随着代理数量呈指数增长。平均场方法通过聚合代理间相互作用来减轻这一负担，但这些方法假设相互作用是均匀的。最新的基于graphon的框架捕捉了异质性，但随着代理数量的增加，计算成本较高。因此，我们引入了$\texttt{GMFS}$，一个$\textbf{G}$raphon $\textbf{M}$ean-$\textbf{F}$ield $\textbf{S}$ubsampling用于具有异构代理交互的可扩展合作MARL框架。通过根据交互强度对$\kappa$的代理进行子抽样，我们近似图子加权均值场，并学习样本复杂度为$\mathrm{poly}（\kappa）$且最优性差距$O（1/\sqrt{\kappa}）$的策略。我们通过机器人协调中的数值模拟验证了我们的理论，证明 $\texttt{GMFS}$ 实现了近乎最优的性能。
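The subsampling step can be sketched as a Monte Carlo estimate. This is a rough illustration, not the GMFS algorithm itself. It assumes scalar agent states, a made-up interaction kernel `graphon`, and sampling with replacement in proportion to interaction strength, so that the plain average of the sampled states estimates the graphon-weighted mean field. All names and numbers are hypothetical.

```python
import math
import random


def graphon(x, y):
    """Hypothetical interaction kernel: nearby latent positions interact more."""
    return math.exp(-3.0 * abs(x - y))


def exact_mean_field(i, positions, states):
    """Graphon-weighted average of all agents' states, as seen by agent i."""
    weights = [graphon(positions[i], p) for p in positions]
    return sum(w * s for w, s in zip(weights, states)) / sum(weights)


def subsampled_mean_field(i, positions, states, kappa, rng):
    """Estimate the same quantity from kappa agents drawn by interaction strength."""
    weights = [graphon(positions[i], p) for p in positions]
    sampled = rng.choices(range(len(states)), weights=weights, k=kappa)
    return sum(states[j] for j in sampled) / kappa


n = 2000
positions = [j / n for j in range(n)]
states = [math.sin(6.0 * p) for p in positions]  # made-up scalar states
rng = random.Random(0)

exact = exact_mean_field(0, positions, states)
estimate = subsampled_mean_field(0, positions, states, kappa=200, rng=rng)
```

The estimate uses 200 sampled agents instead of all 2000, and its error shrinks on the order of $1/\sqrt{\kappa}$, in line with the optimality gap quoted in the abstract.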

Multi-agent cooperation through in-context co-player inference

通过上下文中的协作推理实现多智能体合作

Authors: Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16301
Pdf link: https://arxiv.org/pdf/2602.16301
Abstract Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
中文摘要 实现自利代理之间的合作仍然是多代理强化学习中的根本挑战。最新研究表明，“学习意识”代理之间可以诱导相互合作，这些代理能够考虑并塑造其共同参与者的学习动态。然而，现有方法通常依赖硬编码且常常不一致的共同学习规则假设，或严格区分“天真学习者”在快速时间尺度上更新，“元学习者”观察这些更新。在这里，我们展示了序列模型的上下文学习能力，能够实现协作者学习的意识，而无需硬编码假设或显式的时间尺度分离。我们表明，针对多样化共参与者分布的序列模型训练代理自然会诱导上下文中的最佳响应策略，有效地作为快速的剧集内时间尺度上的学习算法。我们发现，先前研究中识别的合作机制——即勒索易感驱动相互塑造——在此环境中自然出现：情境适应使代理易受勒索影响，而由此产生的相互压力，塑造对手情境内学习动态的相互压力转化为合作行为的学习。我们的结果表明，基于序列模型的标准分散强化学习结合协作者多样性，提供了可扩展的合作行为学习路径。

Dual-Quadruped Collaborative Transportation in Narrow Environments via Safe Reinforcement Learning

通过安全强化学习实现狭窄环境中的双四足协同运输

Authors: Zhezhi Lei, Zhihai Bi, Wenxin Wang, Jun Ma
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.16353
Pdf link: https://arxiv.org/pdf/2602.16353
Abstract Collaborative transportation, where multiple robots collaboratively transport a payload, has garnered significant attention in recent years. While ensuring safe and high-performance inter-robot collaboration is critical for effective task execution, it is difficult to pursue in narrow environments where the feasible region is extremely limited. To address this challenge, we propose a novel approach for dual-quadruped collaborative transportation via safe reinforcement learning (RL). Specifically, we model the task as a fully cooperative constrained Markov game, where collision avoidance is formulated as constraints. We introduce a cost-advantage decomposition method that enforces the sum of team constraints to remain below an upper bound, thereby guaranteeing task safety within an RL framework. Furthermore, we propose a constraint allocation method that assigns shared constraints to individual robots to maximize the overall task reward, encouraging autonomous task-assignment among robots, thereby improving collaborative task performance. Simulation and real-time experimental results demonstrate that the proposed approach achieves superior performance and a higher success rate in dual-quadruped collaborative transportation compared to existing methods.
中文摘要 协作运输，即多台机器人协同运输有效载荷，近年来备受关注。虽然确保安全且高性能的机器人间协作对于有效任务执行至关重要，但在可行区域极其有限的狭窄环境中实现这一目标仍较为困难。为应对这一挑战，我们提出了一种通过安全强化学习（RL）实现双四足协同运输的新方法。具体来说，我们将该任务建模为一个完全合作的约束马尔可夫博弈，其中碰撞避免被表述为约束。我们引入了一种成本效益分解方法，强制团队约束的总和保持在上界以下，从而在强化学习框架内保证任务安全。此外，我们提出了一种约束分配方法，将共享约束分配给单个机器人，以最大化整体任务奖励，鼓励机器人间自主分配任务，从而提升协作任务的表现。模拟和实时实验结果表明，与现有方法相比，所提方法在双四足协同运输中实现了更优的性能和更高的成功率。

Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning

因果引导自动特征工程与多智能体强化学习

Authors: Arun Vignesh Malarkkan, Wangyang Ying, Yanjie Fu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.16435
Pdf link: https://arxiv.org/pdf/2602.16435
Abstract Automated feature engineering (AFE) enables AI systems to autonomously construct high-utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally-guided sequential decision process, bridging causal discovery with reinforcement learning-driven feature construction. Phase I learns a sparse directed acyclic graph over features and the target to obtain soft causal priors, grouping features as direct, indirect, or other based on their causal influence with respect to the target. Phase II uses a cascading multi-agent deep Q-learning architecture to select causal groups and transformation operators, with hierarchical reward shaping and causal group-level exploration strategies that favor causally plausible transformations while controlling feature complexity. Across 15 public benchmarks (classification with macro-F1; regression with inverse relative absolute error), CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes-to-convergence, and delivers competitive time-to-target. Under controlled covariate shifts, CAFE reduces performance drop by ~4x relative to a non-causal multi-agent baseline, and produces more compact feature sets with more stable post-hoc attributions. These findings underscore that causal structure, used as a soft inductive prior rather than a rigid constraint, can substantially improve the robustness and efficiency of automated feature engineering.
中文摘要 自动化特征工程（AFE）使人工智能系统能够自主地从原始表格数据构建高效用表示。然而，现有的AFE方法依赖统计启发式，导致在分布偏移下失效的脆弱性特征。我们介绍了CAFE框架，将AFE重新表述为因果引导的顺序决策过程，桥接因果发现与强化学习驱动的特征构建。第一阶段学习一个针对特征和目标的稀疏有向无环图，以获得软因果先验，根据特征对目标的因果影响将特征分为直接、间接或其他。第二阶段采用级联多智能体深度Q学习架构来选择因果群和转换算子，采用层级奖励塑造和因果群体级探索策略，这些策略有利于因果合理转换，同时控制特征复杂度。在15项公开基准测试（宏观F1分类;反相对绝对误差回归）中，CAFE相较强AFE基线提升高达7%，减少发作到汇聚时间，并提供具有竞争力的目标时间。在受控协变量转移下，CAFE相较于非因果多代理基线，性能下降减少约4倍，并产生更紧凑的特征集和更稳定的事后归因。这些发现强调了因果结构作为软归纳先验而非僵硬约束，可以显著提升自动化特征工程的鲁棒性和效率。
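The Phase I grouping can be pictured with a small graph traversal. This is only a sketch under a simplifying assumption: the learned graph is treated as a fixed edge list, "direct" means a parent of the target, "indirect" means any other ancestor, and "other" means everything else. The paper describes soft causal priors, so this hard partition, the function name, and the toy features are illustrative.

```python
def causal_groups(edges, features, target):
    """Split features into direct causes, indirect causes, and the rest."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    direct = set(parents.get(target, ()))

    # Walk upstream from the direct causes to collect every ancestor of the target.
    ancestors, stack = set(), list(direct)
    while stack:
        node = stack.pop()
        if node in ancestors:
            continue
        ancestors.add(node)
        stack.extend(parents.get(node, ()))

    indirect = ancestors - direct
    other = set(features) - ancestors
    return direct, indirect, other


# Hypothetical DAG over tabular features with target "default".
edges = [("age", "income"), ("zip", "income"), ("income", "default"), ("debt", "default")]
features = ["age", "zip", "income", "debt", "noise"]
direct, indirect, other = causal_groups(edges, features, "default")
```

Here `income` and `debt` land in the direct group, `age` and `zip` in the indirect group, and `noise` in the remainder.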

Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

通过强化学习学习的Hamilton-Jacobi可达性认证

Authors: Prashant Solanki, Isabelle El-Hajj, Jasper J. van Beers, Erik-Jan van Kampen, Coen C. de Visser
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.16475
Pdf link: https://arxiv.org/pdf/2602.16475
Abstract We present a framework to certify Hamilton-Jacobi (HJ) reachability learned by reinforcement learning (RL). Building on a discounted initial-time travel-cost formulation that makes small-step RL value iteration provably equivalent to a forward HJ equation with damping, we convert certified learning errors into calibrated inner/outer enclosures of the strict backward reachable tube. The core device is an additive-offset identity: if $W_\lambda$ solves the discounted travel-cost Hamilton-Jacobi-Bellman (HJB) equation, then $W_\varepsilon:=W_\lambda + \varepsilon$ solves the same PDE with a constant offset $\lambda\varepsilon$. This means that a uniform value error is exactly equal to a constant HJB offset. We establish this uniform value error via two routes: (A) a Bellman operator-residual bound, and (B) an HJB PDE-slack bound. Our framework preserves HJ-level safety semantics and is compatible with deep RL. We demonstrate the approach on a double-integrator system by formally certifying, via satisfiability modulo theories (SMT), a value function learned through reinforcement learning to induce provably correct inner and outer backward-reachable set enclosures over a compact region of interest.
中文摘要 我们提出了一个框架，用以\emph{certify}通过强化学习（RL）学习汉密尔顿-雅各比（HJ）可达性。基于一种折现的初始时间\emph{旅行成本}公式，使小步强化学习值迭代可证明等价于带阻尼的正向Hamilton-Jacobi（HJ）方程，我们将经过认证的学习误差转换为严格向后可达管的校准内外包体。核心装置是一个加法偏移恒等式：如果$W_\lambda$解出折现的旅行成本汉密尔顿--雅各比-贝尔曼（HJB）方程，则$W_\varepsilon：=W_\lambda + \varepsilon$ 以常数偏移求解相同的偏微分方程 $\lambda\varepsilon$。这意味着均匀值误差等于 \emph{exactly} 等于 HJB 的常数偏移量。我们通过两条路径建立该均匀值误差：（A）Bellman算子-残差界限，（B）HJB偏微分方程松弛界限。我们的框架保持了HJ级别的安全语义，并兼容深度强化学习。我们通过通过模理论（SMT）形式化地证明通过强化学习学习的价值函数，演示了该方法，从而在紧致兴趣区域内诱导出可证明正确的内外向可向后到达集合包。
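The additive-offset identity follows in one line once a form for the PDE is fixed. The abstract does not state the equation, so the generic form below is an assumption: the unknown $W$ enters only through the discount term $\lambda W$ and through its derivatives, collected in a Hamiltonian written here as $\mathcal{H}$.

```latex
% Assumed generic form of the discounted HJB equation (not quoted from the paper):
%   \lambda W_\lambda(x) + \mathcal{H}\bigl(x, \nabla W_\lambda(x)\bigr) = 0 .
% A constant shift leaves the gradient unchanged, \nabla W_\varepsilon = \nabla W_\lambda, hence
\begin{align*}
\lambda W_\varepsilon(x) + \mathcal{H}\bigl(x, \nabla W_\varepsilon(x)\bigr)
  &= \lambda \bigl(W_\lambda(x) + \varepsilon\bigr) + \mathcal{H}\bigl(x, \nabla W_\lambda(x)\bigr) \\
  &= \underbrace{\lambda W_\lambda(x) + \mathcal{H}\bigl(x, \nabla W_\lambda(x)\bigr)}_{=\,0} + \lambda\varepsilon
   = \lambda\varepsilon .
\end{align*}
```

The same argument goes through if a time derivative of $W$ appears, since a constant shift does not change it either.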

VIGOR: Visual Goal-In-Context Inference for Unified Humanoid Fall Safety

VIGOR：视觉目标上下文推断，实现统一的人形机器人跌倒安全

Authors: Osher Azulay, Zhengjie Xu, Andrew Scheffer, Stella X. Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.16511
Pdf link: https://arxiv.org/pdf/2602.16511
Abstract Reliable fall recovery is critical for humanoids operating in cluttered environments. Unlike quadrupeds or wheeled robots, humanoids experience high-energy impacts, complex whole-body contact, and large viewpoint changes during a fall, making recovery essential for continued operation. Existing methods fragment fall safety into separate problems such as fall avoidance, impact mitigation, and stand-up recovery, or rely on end-to-end policies trained without vision through reinforcement learning or imitation learning, often on flat terrain. At a deeper level, fall safety is treated as monolithic data complexity, coupling pose, dynamics, and terrain and requiring exhaustive coverage, limiting scalability and generalization. We present a unified fall safety approach that spans all phases of fall recovery. It builds on two insights: 1) Natural human fall and recovery poses are highly constrained and transferable from flat to complex terrain through alignment, and 2) Fast whole-body reactions require integrated perceptual-motor representations. We train a privileged teacher using sparse human demonstrations on flat terrain and simulated complex terrains, and distill it into a deployable student that relies only on egocentric depth and proprioception. The student learns how to react by matching the teacher's goal-in-context latent representation, which combines the next target pose with the local terrain, rather than separately encoding what it must perceive and how it must act. Results in simulation and on a real Unitree G1 humanoid demonstrate robust, zero-shot fall safety across diverse non-flat environments without real-world fine-tuning. The project page is available at this https URL
中文摘要 可靠的坠落恢复对于在杂乱环境中作的人形机器人至关重要。与四足机器人或轮式机器人不同，类人机器人在坠落时会经历高能量冲击、复杂的全身接触和视角大幅变化，因此恢复对持续作至关重要。现有方法将安全分解为独立问题，如防摔、减撞和站立恢复，或依赖端到端策略，这些策略通过强化学习或模仿学习训练，通常在平坦地形上进行。更深层次上，坠落安全被视为庞大的数据复杂性、耦合姿态、动力学和地形，需要详尽覆盖，限制了扩展性和泛化性。我们提出了涵盖跌倒恢复所有阶段的统一安全方案。它基于两个洞见：1）自然的人体跌倒和恢复姿势高度受限，且可通过对齐从平坦地形转移到复杂地形;2）快速的全身反应需要整合的感知-运动表征。我们通过在平坦地形和模拟复杂地形上进行稀疏的人体演示来培训一位特权教师，并将其提炼成一个仅依靠自我中心深度和本体感知的可部署学生。学生通过匹配教师的目标在上下文中的潜在表征来学习如何反应，将下一个目标姿势与当地地形结合起来，而不是单独编码它必须感知的事物和必须如何行动。在模拟和真实的Unitree G1人形模型上，结果展示了在多种非平坦环境中，无需真实微调即可实现零射击的坠落安全。项目页面可在此 https URL 访问。

Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study

参数化量子态准备的强化学习：一项比较研究

Authors: Gerhard Stenzel, Isabella Debelic, Michael Kölle, Tobias Rohe, Leo Sünkel, Julian Hager, Claudia Linnhoff-Popien
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2602.16523
Pdf link: https://arxiv.org/pdf/2602.16523
Abstract We extend directed quantum circuit synthesis (DQCS) with reinforcement learning from purely discrete gate selection to parameterized quantum state preparation with continuous single-qubit rotations $R_x$, $R_y$, and $R_z$. We compare two training regimes: a one-stage agent that jointly selects the gate type, the affected qubit(s), and the rotation angle; and a two-stage variant that first proposes a discrete circuit and subsequently optimizes the rotation angles with Adam using parameter-shift gradients. Using Gymnasium and PennyLane, we evaluate Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C) on systems comprising two to ten qubits and on targets of increasing complexity with $\lambda$ ranging from one to five. Whereas A2C does not learn effective policies in this setting, PPO succeeds under stable hyperparameters (one-stage: learning rate approximately $5\times10^{-4}$ with a self-fidelity-error threshold of 0.01; two-stage: learning rate approximately $10^{-4}$). Both approaches reliably reconstruct computational basis states (between 83% and 99% success) and Bell states (between 61% and 77% success). However, scalability saturates for $\lambda$ of approximately three to four and does not extend to ten-qubit targets even at $\lambda=2$. The two-stage method offers only marginal accuracy gains while requiring around three times the runtime. For practicality under a fixed compute budget, we therefore recommend the one-stage PPO policy, provide explicit synthesized circuits, and contrast with a classical variational baseline to outline avenues for improved scalability.
中文摘要 我们将基于强化学习的有向量子电路合成（DQCS）从纯离散的门选择扩展到带有连续单量子比特旋转 $R_x$、$R_y$ 和 $R_z$ 的参数化量子态制备。我们比较了两种训练方案：一阶段智能体联合选择门类型、受影响的量子比特和旋转角度；两阶段变体先提出离散电路，随后使用 Adam 和参数移位梯度优化旋转角度。利用 Gymnasium 和 PennyLane，我们在包含 2 到 10 个量子比特的系统以及复杂度逐渐增加（$\lambda$ 从 1 到 5）的目标上评估了近端策略优化（PPO）和优势演员-评论家（A2C）。A2C 在此设定下未能学到有效策略，而 PPO 在稳定的超参数下取得成功（一阶段：学习率约为 $5\times10^{-4}$，自保真度误差阈值为 0.01；两阶段：学习率约为 $10^{-4}$）。这两种方法都能可靠地重建计算基态（成功率介于 83% 至 99% 之间）和贝尔态（成功率介于 61% 至 77% 之间）。然而，可扩展性在 $\lambda$ 约为三到四时饱和，即使在 $\lambda=2$ 时也无法扩展到十量子比特目标。两阶段方法仅带来边际精度提升，运行时间却约为一阶段方法的三倍。因此，考虑到固定计算预算下的实用性，我们推荐采用一阶段 PPO 策略，给出显式的合成电路，并与经典变分基线进行对比，以指明提升可扩展性的途径。
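
The parameter-shift gradients mentioned in the abstract can be illustrated on a single rotation. The sketch below is not the paper's code: it hand-simulates $R_y(\theta)|0\rangle$ in plain Python instead of using PennyLane, and it uses plain gradient descent in place of Adam. The function names, the learning rate, and the target state $|1\rangle$ are all assumptions made for illustration.

```python
import math


def expval_z_after_ry(theta: float) -> float:
    """Return <Z> for the state RY(theta)|0>, simulated by hand."""
    # RY(theta)|0> has real amplitudes [cos(theta/2), sin(theta/2)].
    amp0 = math.cos(theta / 2.0)
    amp1 = math.sin(theta / 2.0)
    # <Z> = P(0) - P(1), which equals cos(theta).
    return amp0 ** 2 - amp1 ** 2


def parameter_shift_grad(f, theta: float) -> float:
    """Gradient of f at theta from two shifted evaluations (shift = pi/2)."""
    shift = math.pi / 2.0
    return 0.5 * (f(theta + shift) - f(theta - shift))


# Gradient at one point; analytically d/dtheta cos(theta) = -sin(theta).
grad_at_start = parameter_shift_grad(expval_z_after_ry, 0.3)

# Hypothetical second stage: tune the angle so that <Z> is minimized,
# i.e. prepare |1>. Plain gradient descent stands in for Adam here.
theta_opt = 0.3
learning_rate = 0.4  # illustrative value, not taken from the paper
for _ in range(100):
    theta_opt -= learning_rate * parameter_shift_grad(expval_z_after_ry, theta_opt)
```

For this toy objective the angle moves toward $\pi$, where $\langle Z \rangle = -1$. The shift rule needs only two circuit evaluations per parameter, which is why it suits gradients estimated on a simulator or on hardware.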

Capacity-constrained demand response in smart grids using deep reinforcement learning

智能电网中利用深度强化学习实现容量受限需求响应

Authors: Shafagh Abband Pashaki, Sepehr Maleki, Amir Badiee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16525
Pdf link: https://arxiv.org/pdf/2602.16525
Abstract This paper presents a capacity-constrained incentive-based demand response approach for residential smart grids. It aims to maintain electricity grid capacity limits and prevent congestion by financially incentivising end users to reduce or shift their energy consumption. The proposed framework adopts a hierarchical architecture in which a service provider adjusts hourly incentive rates based on wholesale electricity prices and aggregated residential load. The financial interests of both the service provider and end users are explicitly considered. A deep reinforcement learning approach is employed to learn optimal real-time incentive rates under explicit capacity constraints. Heterogeneous user preferences are modelled through appliance-level home energy management systems and dissatisfaction costs. Using real-world residential electricity consumption and price data from three households, simulation results show that the proposed approach effectively reduces peak demand and smooths the aggregated load profile. This leads to an approximately 22.82% reduction in the peak-to-average ratio compared to the no-demand-response case.
中文摘要 本文提出了一种容量受限的激励型需求响应方法，应用于住宅智能电网。其目标是通过经济激励促使终端用户减少或转移能源消耗，从而维持电网容量限制并防止拥堵。该框架采用分层架构，服务提供商根据批发电价和住宅总负荷调整每小时激励费率。服务提供商和终端用户的经济利益都被明确考虑。采用深度强化学习方法，在显式容量约束下学习最优实时激励费率。异构用户偏好通过电器级家庭能源管理系统和不满意成本进行建模。利用三户家庭的真实住宅用电量和价格数据，仿真结果表明，所提方法有效降低了峰值需求，平滑了总负荷曲线。与无需求响应的情况相比，峰均比降低了约22.82%。
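
The abstract reports its headline result as a reduction in the peak-to-average ratio (PAR). As a minimal sketch of how that metric is usually computed, the snippet below compares two hypothetical hourly load profiles with the same total energy. The numbers are made up for illustration and are not the paper's data; the function name is also an assumption.

```python
def peak_to_average_ratio(load):
    """Peak-to-average ratio of a load profile (any consistent unit)."""
    if not load:
        raise ValueError("load profile must not be empty")
    return max(load) / (sum(load) / len(load))


# Hypothetical aggregated loads in kW; both profiles sum to 12 kW-hours.
baseline_load = [2.0, 2.0, 6.0, 2.0]  # no demand response: sharp peak
shifted_load = [3.0, 3.0, 4.0, 2.0]   # part of the peak moved to other hours

par_baseline = peak_to_average_ratio(baseline_load)
par_shifted = peak_to_average_ratio(shifted_load)

# Relative PAR reduction in percent, the quantity quoted in the abstract.
par_reduction_pct = 100.0 * (par_baseline - par_shifted) / par_baseline
```

Because total consumption is unchanged in this toy case, the whole PAR reduction comes from flattening the peak, which is the effect incentive-based load shifting aims for.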

Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

通过逆受限强化学习对安全强化学习的脆弱性分析

Authors: Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16543
Pdf link: https://arxiv.org/pdf/2602.16543
Abstract Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
中文摘要 安全强化学习（Safe RL）旨在确保策略性能同时满足安全约束。然而，大多数现有的安全强化学习方法假设环境无害，因此容易受到现实环境中常见的对抗性扰动影响。此外，现有基于梯度的对抗攻击通常需要访问策略中的梯度信息，但在现实场景中这往往不切实际。为应对这些挑战，我们提出了一个对抗性攻击框架，以揭示安全强化学习策略的漏洞。通过专家演示和黑箱环境交互，我们的框架学习约束模型和代理（学习者）策略，实现基于梯度的攻击优化，而无需受害者策略的内部梯度或真实的安全约束。我们还提供了理论分析，确立可行性并推导微扰界限。在多个安全强化学习基准测试上的实验证明了我们方法在有限特权访问下的有效性。

RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion

RIDER：带有强化学习引导扩散的3D RNA逆向设计

Authors: Tianmeng Hu, Yongzheng Cui, Biao Luo, Ke Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16548
Pdf link: https://arxiv.org/pdf/2602.16548
Abstract The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using native sequence recovery, which is a limited surrogate for structural fidelity, since different sequences can fold into similar 3D structures and high recovery does not necessarily indicate correct folding. To address this limitation, we propose RIDER, an RNA Inverse DEsign framework with Reinforcement learning that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in native sequence recovery over state-of-the-art methods. Then, we fine-tune the model with an improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.
中文摘要 RNA三维（3D）结构的逆向设计对于合成生物学和治疗中功能性RNA的工程至关重要。虽然近期深度学习方法推动了该领域的发展，但通常采用原生序列恢复来优化和评估，这是结构保真度的有限替代，因为不同序列可以折叠成相似的三维结构，高恢复率并不一定意味着折叠正确。为解决这一限制，我们提出了RIDER框架，这是一种带有强化学习的RNA逆DEsign框架，直接优化三维结构相似性。首先，我们开发并预训练基于目标三维结构的GNN生成扩散模型，使原生序列恢复率比先进方法提升了9%。随后，我们用基于三维自一致性指标的四个任务特定奖励函数，对模型进行改进的策略梯度算法进行微调。实验结果显示，RIDER在所有指标上结构相似度提升了100%以上，并发现了与原生序列不同的设计。

A Scalable Approach to Solving Simulation-Based Network Security Games

一种可扩展的方法用于解决基于仿真的网络安全游戏

Authors: Michael Lanier, Yevgeniy Vorobeychik
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.16564
Pdf link: https://arxiv.org/pdf/2602.16564
Abstract We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
中文摘要 我们介绍MetaDOAR，一款轻量级元控制器，通过学习的分区感知过滤层和Q值缓存，增强了Double Oracle / PSRO范式，使在超大型网络环境中实现可扩展的多代理强化学习。MetaDOAR通过每个节点的结构嵌入学习紧凑的状态投影，快速评分并选择一小部分设备（top-k划分），其中传统低级演员利用批判代理进行聚焦束搜索。选定候选动作通过批量批评转发评估，并存储在由量化状态投影和局部动作标识符键控的LRU缓存中，显著减少冗余的批评计算，同时通过保守的k-hop缓存失效保持决策质量。从经验上看，MetaDOAR在大型网络拓扑上比SOTA基线获得更高的玩家收益，且在内存使用或训练时间方面没有显著的扩展性问题。这一贡献为大规模网络决策问题提供了一条实用且理论驱动的高效层级政策学习路径。
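
The caching idea in the abstract (an LRU cache keyed by a quantized state projection and an action identifier) can be sketched with the standard library. This is an illustration under stated assumptions, not MetaDOAR's implementation: the class name, the quantization step, and the toy critic are hypothetical, and the k-hop cache invalidation described in the abstract is omitted.

```python
from collections import OrderedDict


class QuantizedLRUCache:
    """LRU cache for critic values, keyed by (quantized projection, action id)."""

    def __init__(self, capacity: int, step: float = 0.1):
        self.capacity = capacity
        self.step = step  # quantization bin width (assumed value)
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def __len__(self):
        return len(self._store)

    def _key(self, projection, action_id):
        # Nearby projections fall into the same bin and share one entry.
        bins = tuple(round(x / self.step) for x in projection)
        return (bins, action_id)

    def get_or_compute(self, projection, action_id, critic):
        key = self._key(projection, action_id)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = critic(projection, action_id)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return value


def toy_critic(projection, action_id):
    """Stand-in for an expensive critic forward pass."""
    return sum(projection) + action_id


cache = QuantizedLRUCache(capacity=2)
first = cache.get_or_compute([0.11, 0.52], 0, toy_critic)   # miss
second = cache.get_or_compute([0.12, 0.49], 0, toy_critic)  # same bins: hit
cache.get_or_compute([0.9, 0.9], 1, toy_critic)             # miss
cache.get_or_compute([0.3, 0.3], 2, toy_critic)             # miss, evicts oldest
```

Coarser bins raise the hit rate but return values computed for a slightly different state, so the bin width trades critic computation against decision quality.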

Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

平均奖励马尔可夫决策过程中差分时间差分学习几乎确定收敛

Authors: Ethan Blaser, Jiuqi Wang, Shangtong Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16629
Pdf link: https://arxiv.org/pdf/2602.16629
Abstract The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in the learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
中文摘要 平均奖励是强化学习（RL）中的一个基本性能指标，关注智能体的长期表现。差分时序差分（TD）学习算法是平均奖励强化学习的重大进展，因为它们提供了一种高效的在线方法，可在同策略和异策略设定下学习与平均奖励相关的价值函数。然而，现有的收敛性保证要求学习率中使用与状态访问计数挂钩的局部时钟，而从业者并不使用这种时钟，且它无法推广到表格型设定之外。我们通过证明对任意 $n$，同策略 $n$ 步差分 TD 在使用标准递减学习率且无需局部时钟的情况下几乎必然收敛，从而解决了这一局限。随后，我们推导出三个充分条件，使异策略 $n$ 步差分 TD 在没有局部时钟的情况下同样收敛。这些结果加强了差分 TD 的理论基础，使其收敛性分析更接近实际实现。
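
To make the "no local clock" setting concrete, the sketch below runs tabular one-step differential TD with a learning rate that depends only on the global step counter, not on per-state visit counts. It covers only the $n = 1$ on-policy case, and it is not the authors' code. The two-state deterministic chain, the step-size schedule, and the `eta` parameter are assumptions chosen so that the outcome is easy to check by hand.

```python
def differential_td(num_steps: int = 20000, eta: float = 1.0):
    """One-step tabular differential TD on a toy two-state chain.

    The chain alternates 0 -> 1 -> 0 with rewards 1 and 3, so the true
    average reward is 2 and the differential values satisfy v[1] - v[0] = 1.
    """
    next_state = {0: 1, 1: 0}
    reward = {0: 1.0, 1: 3.0}
    v = [0.0, 0.0]      # differential value estimates
    avg_reward = 0.0    # running estimate of the average reward
    s = 0
    for t in range(1, num_steps + 1):
        # Global-clock step size: a function of t only.
        alpha = 0.5 / t ** 0.6
        s_next = next_state[s]
        # Differential TD error: reward is measured relative to avg_reward.
        delta = reward[s] - avg_reward + v[s_next] - v[s]
        v[s] += alpha * delta
        avg_reward += eta * alpha * delta
        s = s_next
    return avg_reward, v


avg_reward_estimate, values = differential_td()
```

A local-clock variant would instead index `alpha` by the number of visits to the current state, which has no direct analogue once the value function is not tabular.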

Learning to unfold cloth: Scaling up world models to deformable object manipulation

学习展开布料：将世界模型扩展到可变形物体操控

Authors: Jack Rome, Stephen James, Subramanian Ramamoorthy
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.16675
Pdf link: https://arxiv.org/pdf/2602.16675
Abstract Learning to manipulate cloth is both a paradigmatic problem for robotic research and a problem of immediate relevance to a variety of applications ranging from assistive care to the service industry. The complex physics of the deformable object makes this problem of cloth manipulation nontrivial. In order to create a general manipulation strategy that addresses a variety of shapes, sizes, fold and wrinkle patterns, in addition to the usual problems of appearance variations, it becomes important to carefully consider model structure and its implications for generalisation performance. In this paper, we present an approach to in-air cloth manipulation that uses a variation of a recently proposed reinforcement learning architecture, DreamerV2. Our implementation modifies this architecture to utilise surface normals input, in addition to modifying the replay buffer and data augmentation procedures. Taken together, these modifications represent an enhancement to the world model used by the robot, addressing the physical complexity of the object being manipulated by the robot. We present evaluations both in simulation and in a zero-shot deployment of the trained policies in a physical robot setup, performing in-air unfolding of a variety of different cloth types, demonstrating the generalisation benefits of our proposed architecture.
中文摘要 学习操作布料既是机器人研究的典范问题，也是与从辅助护理到服务行业等多种应用直接相关的问题。可变形物体的复杂物理特性使得布料操作问题并不简单。为了制定一套通用的操作策略，能够应对各种形状、尺寸、折叠和褶皱图案，以及外观变化等常见问题，必须仔细考虑模型结构及其对泛化性能的影响。本文提出了一种空中布料操控方法，它采用最近提出的强化学习架构 DreamerV2 的一个变体。我们的实现修改了该架构以利用表面法线输入，同时修改了回放缓冲区和数据增强过程。综合来看，这些修改增强了机器人所用的世界模型，以应对被操作物体的物理复杂性。我们在仿真中以及在物理机器人系统上对训练好的策略进行零样本部署，对多种不同类型的布料进行空中展开，并给出评估结果，展示了所提架构的泛化优势。

Reinforced Fast Weights with Next-Sequence Prediction

强化快速权重与下一序列预测

Authors: Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.16704
Pdf link: https://arxiv.org/pdf/2602.16704
Abstract Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
中文摘要 快速权重架构为基于注意力的 Transformer 提供了一种有前景的长上下文建模替代方案，无论上下文长度如何都能保持恒定的内存开销。然而，它们的潜力受限于下一词元预测（NTP）训练范式。NTP 优化单个词元的预测，忽略了前缀之后多个词元之间的语义连贯性。因此，通过动态更新参数来存储上下文信息的快速权重模型学到的表征并非最优，无法捕捉长距离依赖关系。我们提出 REFINE（Reinforced Fast weIghts with Next sEquence prediction），这是一个强化学习框架，在下一序列预测（NSP）目标下训练快速权重模型。REFINE 根据预测熵选择信息量大的词元位置，生成多词元 rollout，分配自监督的序列级奖励，并通过组相对策略优化（GRPO）优化模型。REFINE 适用于预训练语言模型的整个训练生命周期：训练中期、训练后和测试时训练。我们在 LaCT-760M 和 DeltaNet-1.3B 上的实验表明，REFINE 在大海捞针检索、长上下文问答以及 LongBench 中的多种任务上，始终优于基于 NTP 的监督微调。REFINE 为改进快速权重架构中的长上下文建模提供了一个有效且通用的框架。
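
The abstract says REFINE picks informative token positions by prediction entropy but does not state the exact rule. The sketch below assumes one plausible reading, a top-k selection over per-position Shannon entropy. The function names and the example distributions are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_high_entropy_positions(dists, k):
    """Return the k positions with the highest entropy, in sequence order."""
    ranked = sorted(range(len(dists)), key=lambda i: entropy(dists[i]), reverse=True)
    return sorted(ranked[:k])


# Hypothetical next-token distributions at four positions (vocabulary of 4).
next_token_dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform, maximum entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]

rollout_positions = select_high_entropy_positions(next_token_dists, k=2)
```

Here the uniform position and the flattest remaining one are chosen, so rollouts would start where the model is least certain about what follows the prefix.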

Keyword: diffusion policy

There is no result