Generated: 2026-06-15 21:57:54 (UTC+8); arXiv release time: 2026-06-15 20:00 EDT (2026-06-16 08:00 UTC+8)
30 related papers today
Keyword: reinforcement learning
UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems
- Authors: Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.13683
- Pdf link: https://arxiv.org/pdf/2606.13683
- Abstract
To address the challenge that current dialogue policy planning methods struggle to adapt dynamically to diverse user characteristics, this paper proposes User Portrait based Nested Rollout Policy Adaptation (UP-NRPA), an online framework built on Large Language Models. In contrast to conventional approaches, which depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA customizes dialogue strategies dynamically through an adaptive mechanism. It does so by combining real-time user feedback with the personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. On collaborative and non-collaborative dialogue benchmarks, UP-NRPA achieved a 100% success rate in multiple dialogue tasks. In negotiation tasks in particular, the sale-to-list ratio (SL) increased by 56.41%. These results indicate that UP-NRPA can adapt to diverse user needs without requiring a training mechanism.
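The abstract does not describe how the user portrait enters the search, so the following is only a minimal sketch of plain Nested Rollout Policy Adaptation on a toy problem. The action set, `TARGET` sequence, `toy_score` reward, and hyperparameters are hypothetical stand-ins; in a dialogue setting the score would presumably come from user feedback rather than a fixed target.

```python
import math
import random

ACTIONS = [0, 1, 2]        # hypothetical dialogue-strategy ids
HORIZON = 5                # hypothetical number of dialogue turns
TARGET = [2, 0, 1, 2, 0]   # toy "ideal" sequence, a stand-in for user feedback

def toy_score(seq):
    """Toy reward: number of positions that match TARGET."""
    return sum(a == t for a, t in zip(seq, TARGET))

def rollout(policy, rng):
    """Sample one sequence with a softmax over weights keyed by (step, action)."""
    seq = []
    for step in range(HORIZON):
        weights = [math.exp(policy.get((step, a), 0.0)) for a in ACTIONS]
        seq.append(rng.choices(ACTIONS, weights=weights)[0])
    return toy_score(seq), seq

def adapt(policy, seq, alpha=1.0):
    """Return a copy of the policy shifted toward the given sequence."""
    new = dict(policy)
    for step, chosen in enumerate(seq):
        z = sum(math.exp(policy.get((step, a), 0.0)) for a in ACTIONS)
        for a in ACTIONS:
            p = math.exp(policy.get((step, a), 0.0)) / z
            new[(step, a)] = new.get((step, a), 0.0) - alpha * p
        new[(step, chosen)] = new.get((step, chosen), 0.0) + alpha
    return new

def nrpa(level, policy, rng, iterations=20):
    """Level 0 is a single rollout; higher levels adapt toward their best result."""
    if level == 0:
        return rollout(policy, rng)
    best_score, best_seq = -1, []
    for _ in range(iterations):
        score, seq = nrpa(level - 1, policy, rng, iterations)
        if score >= best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq
```

Calling `nrpa(2, {}, random.Random(0))` runs a two-level search from an empty (uniform) policy. Because the policy is adapted online during search, no offline training step is involved, which is the property the abstract emphasizes.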
Orchestra-o1: Omnimodal Agent Orchestration
- Authors: Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.13707
- Pdf link: https://arxiv.org/pdf/2606.13707
- Abstract
The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.
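The decision-aligned variant (DA-GRPO) is not specified in the abstract. As background only, here is a minimal sketch of the group-relative advantage used in standard GRPO: each sampled response is scored against the mean and standard deviation of its own group. The reward values are made up, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (standard GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for one prompt: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, without a learned value function. If every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.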
Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher
- Authors: Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Derek Li, Ying Wei, Bryan Dai
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.13710
- Pdf link: https://arxiv.org/pdf/2606.13710
- Abstract
Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.
Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response
- Authors: Jose Luis Lima de Jesus Silva
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.13832
- Pdf link: https://arxiv.org/pdf/2606.13832
- Abstract
Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
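The abstract mentions Lagrangian operational-cost control and CVaR tail-risk estimation without giving formulas. The sketch below shows the generic textbook versions of both, not the paper's implementation; the function names, learning rate, and tail level are assumptions, while the budget of 50 and the mean costs 355.4 and 15.5 are taken from the abstract.

```python
def cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of episode costs."""
    ordered = sorted(costs)
    k = max(1, int(round(len(ordered) * (1 - alpha))))
    tail = ordered[-k:]
    return sum(tail) / len(tail)

def update_multiplier(lam, mean_cost, budget, lr=0.01):
    """Dual ascent: raise lambda when the budget is exceeded, clip at zero."""
    return max(0.0, lam + lr * (mean_cost - budget))

def penalized_reward(reward, cost, lam):
    """Reward seen by the policy once the cost constraint is priced in."""
    return reward - lam * cost

BUDGET = 50.0
lam_over = update_multiplier(0.0, 355.4, BUDGET)   # cost far above budget
lam_under = update_multiplier(0.0, 15.5, BUDGET)   # cost within budget
```

With a mean downtime cost of 355.4 against a budget of 50, the multiplier grows and downtime becomes more expensive for the policy; at 15.5 it stays at zero and the policy optimizes reward alone.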
Temporally Consistent Graph Q-Networks for Intelligent Network Control
- Authors: Zacharias Veiksaar, Maxime Bouton
- Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.13848
- Pdf link: https://arxiv.org/pdf/2606.13848
- Abstract
Mobile networks continue to grow in complexity and next generation networks are expected to support both increasing traffic loads and more diverse services. As network complexity rises, optimizing antenna parameters under dynamic or changing objectives becomes increasingly challenging. We propose a novel multi-agent reinforcement learning (MARL) algorithm for high-level control and orchestration of mobile networks. The Temporally Consistent Graph Q-Network (TC-GQN) algorithm learns a self-predicting representation of the whole network that is task-independent and aggregates information from all base-stations. A graph neural network is trained using a global reward function to assign coordinated local actions based on the learned encoding of the global network state. We evaluate the algorithm in a simulated environment to orchestrate an energy-saving feature across multiple sectors and multiple carriers under different quality of service (QoS) constraints. The proposed algorithm outperforms state-of-the-art graph-based baselines and a competitive rule-based controller by improving hardware sleep time while maintaining QoS. Moreover, the learned representation enables rapid adaptation to changing intents.
TetraRL: A Self-Adaptive Runtime for On-Device Deep Reinforcement Learning Systems
- Authors: Zexin Li, Soheil Shirvani, Cong Liu
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2606.13891
- Pdf link: https://arxiv.org/pdf/2606.13891
- Abstract
Autonomous robotic systems, including autonomous vehicles, drones, and mobile robots, increasingly rely on on-device Deep Reinforcement Learning (DRL) to adapt to dynamic environments. Unlike cloud-based solutions, embedded DRL must perform training and inference directly on resource-constrained hardware while maintaining timely decision-making. This creates a fundamental challenge: balancing four tightly coupled objectives, namely real-time performance, task reward, memory utilization, and energy consumption. Optimizing these objectives independently often leads to suboptimal behavior, while conventional multi-objective methods may violate resource constraints and compromise reliability. This paper presents TetraRL, a self-adaptive runtime framework for tetra-objective on-device DRL. TetraRL formulates embedded DRL as a unified optimization problem over real-time, reward, RAM, and reserve (energy) objectives, and employs a preference-conditioned reinforcement learning controller to dynamically navigate the resulting trade-off space. The framework integrates a unified resource-management abstraction, hardware-aware DVFS control, and a runtime Override Layer for robust constraint enforcement. We implement TetraRL on NVIDIA Jetson AGX Orin and Orin Nano platforms and evaluate it across diverse DRL environments. Results show that TetraRL effectively balances all four objectives, achieves competitive trade-offs under varying runtime preferences, and incurs negligible overhead. Moreover, a single trained policy can support runtime-switchable optimization goals, providing a practical foundation for resource-aware and self-adaptive on-device DRL.
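To make the preference-conditioned trade-off and the override idea concrete, here is a minimal sketch under stated assumptions: objective scores are normalized to [0, 1] with higher being better, the preference is a weight per objective, and the override is a simple RAM check. The key names, action labels, and thresholds are hypothetical and are not taken from TetraRL.

```python
def scalarize(objectives, preference):
    """Preference-weighted sum of normalized objective scores."""
    total = sum(preference.values())
    return sum(preference[k] / total * objectives[k] for k in preference)

def override(action, ram_used_mb, ram_limit_mb, safe_action):
    """Hard constraint: fall back to a safe action if the RAM budget is exceeded."""
    return safe_action if ram_used_mb > ram_limit_mb else action

# Hypothetical normalized scores for the four objectives named in the abstract.
objectives = {"realtime": 0.8, "reward": 0.6, "ram": 0.9, "reserve": 0.4}
balanced = {"realtime": 1, "reward": 1, "ram": 1, "reserve": 1}
energy_first = {"realtime": 1, "reward": 1, "ram": 1, "reserve": 5}

balanced_score = scalarize(objectives, balanced)
energy_score = scalarize(objectives, energy_first)
```

Switching the preference vector at runtime changes the scalar objective without retraining, and the override applies regardless of what the learned controller proposes. Here the energy-weighted score is lower than the balanced one because the hypothetical energy reserve score (0.4) is the weakest objective.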
Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding
- Authors: Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.13940
- Pdf link: https://arxiv.org/pdf/2606.13940
- Abstract
Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at this https URL.
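The abstract distinguishes code-set prediction from macro-level performance. The sketch below shows how micro and macro F1 differ for multi-label code sets, which is why missed rare codes hurt macro scores far more than micro scores. The three notes and their ICD-10 codes are illustrative, not data from the paper.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred):
    """gold and pred are lists of code sets, one set per clinical note."""
    codes = sorted(set().union(*gold, *pred))
    tp = dict.fromkeys(codes, 0)
    fp = dict.fromkeys(codes, 0)
    fn = dict.fromkeys(codes, 0)
    for g, p in zip(gold, pred):
        for c in g & p:
            tp[c] += 1
        for c in p - g:
            fp[c] += 1
        for c in g - p:
            fn[c] += 1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codes) / len(codes)
    return micro, macro

gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "N18.3"}]
pred = [{"I10", "E11.9"}, {"I10"}, {"I10"}]   # the rare code N18.3 is missed
micro, macro = micro_macro_f1(gold, pred)
```

One missed rare code lowers micro F1 only to 8/9, but macro F1 drops to 2/3 because every code counts equally. A curriculum that targets missed-code cases would therefore be expected to show up mainly in macro-level metrics.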
Explainable and Trustworthy Speech Emotion Recognition Using Confidence Score and Reinforcement Learning Rectified Speech Emotion Descriptors
- Authors: Youjun Chen, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Shujie Hu, Huimeng Wang, Haoning Xu, Chengxi Deng, Bowen Zhang, Xunying Liu
- Subjects: Sound (cs.SD)
- Arxiv link: https://arxiv.org/abs/2606.14086
- Pdf link: https://arxiv.org/pdf/2606.14086
- Abstract
Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and reinforcement learning (RL) based on-the-fly SED rectification approach for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification approach consistently outperform baselines without data selection or SED rectification. The best performing system, which integrates both components, surpasses the baseline without data selection and SED rectification, achieving SER gains of 2.9% and 3.3% absolute (3.7% and 5.4% relative) on IEMOCAP and MELD benchmarks, respectively.
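The absolute and relative gains quoted in the abstract jointly imply an approximate baseline score, since relative gain equals absolute gain divided by the baseline. The values computed below are derived from the rounded figures in the abstract and are not reported numbers; the function name is hypothetical.

```python
def implied_baseline(abs_gain, rel_gain):
    """Baseline score implied by an absolute gain and the matching relative gain."""
    return abs_gain / rel_gain

# Gains in percentage points (absolute) and as fractions (relative).
iemocap_baseline = implied_baseline(2.9, 0.037)
meld_baseline = implied_baseline(3.3, 0.054)
```

Under these assumptions the baselines come out at roughly 78% on IEMOCAP and 61% on MELD, which explains why the smaller absolute gain corresponds to the smaller relative gain on IEMOCAP.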
Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning
- Authors: Omar Adalat, Edwin Hamel-De le Court, Francesco Belardinelli
- Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2606.14130
- Pdf link: https://arxiv.org/pdf/2606.14130
- Abstract
Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents. Decentralised shields can enforce safety at runtime, but purely factorised permissions often exclude optimal team behaviour that is safe only through coordination. We study deterministic safety guarantees for agents trained and deployed under decentralised execution, recovering team-optimal safe behaviour without centralised runtime control. Agents have a shared global specification $\phi$ in the safety fragment of Linear Temporal Logic ($\mathsf{LTL}_{\mathsf{safe}}$), and select among tuples of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations whose conjunction implies the global specification $\phi$. Each agent may rely on the other agents' local obligations as assumptions because the whole contract tuple is certified simultaneously and allows projection into local action masks. At learning time, a non-stationary multi-armed bandit chooses among a library of local $\mathsf{LTL}_{\mathsf{safe}}$ obligations to select the tuple that optimises team reward, all without forgoing end-to-end safety. We evaluate the approach across 6 environments and 15 algorithmic variants.
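The abstract does not say which non-stationary bandit is used. As one common option, the sketch below uses epsilon-greedy selection with a constant step size, so that recent team rewards outweigh old ones as the agents' policies change. Each arm stands for one pre-certified contract tuple; the class name and parameters are hypothetical.

```python
import random

class RecencyWeightedBandit:
    """Epsilon-greedy bandit with a constant step size (tracks drifting rewards)."""

    def __init__(self, n_arms, step_size=0.2, epsilon=0.1, seed=0):
        self.values = [0.0] * n_arms
        self.step_size = step_size
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        """Pick a contract tuple: explore with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, team_reward):
        """Exponential recency-weighted average of observed team reward."""
        self.values[arm] += self.step_size * (team_reward - self.values[arm])
```

Because every arm is a certified tuple, exploration only changes which safe contract is active, so safety does not depend on the bandit's choices. The constant step size lets the estimate for a previously good tuple decay once it stops paying off.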
Aidos: A Hybrid Optimization Algorithm for Beam Hopping Scheduling in NGSO Mega-Constellations
- Authors: Lingkai Zhao, Zhe Chen, Kun Qiu, Yue Gao
- Subjects: Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2606.14151
- Pdf link: https://arxiv.org/pdf/2606.14151
- Abstract
With the rapid proliferation of non-geostationary orbit (NGSO) mega-constellations, beam hopping (BH) has become indispensable for resource scheduling in multi-satellite, multi-coverage scenarios. By dynamically adjusting spot beam power and pointing within each time slot, BH enables highly efficient spectrum utilization. A principal engineering challenge is the real-time generation of beam hopping time plans (BHTP). Traditional algorithms, such as the round-robin strategy, distribute beams evenly across all service cells in a round-robin fashion. However, real traffic follows a long-tail distribution; the most active 10% of hotspot cells generate more than 50% of the aggregate demand, making uniform allocation inadequate. To address this issue, existing frameworks adopt a genetic algorithm (GA), whose throughput is approximately 80.7% higher than the traditional baseline. Operational satellite footprints encompass more than 1,000 service cells. The GA requires 67.8 s to generate a BHTP for 1,127 cells. With a 550 km LEO satellite providing only a 300 s visibility window, multiple online recomputations are impractical. State-of-the-art algorithms, such as multi-agent deep reinforcement learning (MADRL), fail to converge once the cell count exceeds 200. To overcome these challenges, we propose a novel BH scheduling algorithm Aidos. The algorithm integrates traffic-aware random-key encoding into a multi-objective metaheuristic search, and then applies a sliding-window Beta resampling strategy during adaptive distribution evolution, to improve both the search efficiency and the solution quality of the BHTP. Experiments demonstrate that Aidos improves throughput by 79.2% and reduces latency by 99.45%. Its average computation time is 9.3 s, enabling online replanning within a 300 s satellite overpass window.
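Two points in the abstract lend themselves to a small worked example: random-key encoding, where a vector of real-valued keys is decoded into a discrete schedule by sorting, and the replanning budget implied by the 300 s visibility window. The demand-weighted decoding rule below is an assumption chosen for illustration; the abstract does not specify how traffic awareness enters the encoding.

```python
def decode_random_keys(keys, demand, beams_per_slot):
    """Rank cells by demand-weighted key and illuminate the top ones in this slot."""
    priority = [k * d for k, d in zip(keys, demand)]
    ranked = sorted(range(len(keys)), key=lambda i: priority[i], reverse=True)
    return sorted(ranked[:beams_per_slot])

def replans_per_pass(window_s, compute_s):
    """How many full plan computations fit into one visibility window."""
    return int(window_s // compute_s)

# Hypothetical slot: 4 cells, equal keys, uneven demand, 2 beams available.
lit_cells = decode_random_keys([0.5, 0.5, 0.5, 0.5], [1, 10, 3, 7], 2)

ga_replans = replans_per_pass(300, 67.8)    # genetic algorithm, from the abstract
aidos_replans = replans_per_pass(300, 9.3)  # Aidos, from the abstract
```

Because any real-valued key vector decodes to a valid schedule, a metaheuristic can search the continuous key space freely. On the timing side, the quoted figures allow 4 GA runs per pass versus 32 for Aidos.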
CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward
- Authors: Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.14179
- Pdf link: https://arxiv.org/pdf/2606.14179
- Abstract
We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
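The abstract names a three-tier fuzzy cache and a cache-tier-aware reward but gives no details. The sketch below is one plausible reading, assuming the tiers are exact match, normalized match, and string-similarity match, and that the answer-quality weight shrinks as the cache match gets weaker. The tier definitions, threshold, and weights are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(call):
    """Lowercase and collapse whitespace in a serialized tool call."""
    return " ".join(call.lower().split())

def lookup(cache, call, fuzzy_threshold=0.8):
    """Return (tier, result): 1 = exact, 2 = normalized, 3 = fuzzy, 0 = miss."""
    if call in cache:
        return 1, cache[call]
    norm = normalize(call)
    for key, value in cache.items():
        if normalize(key) == norm:
            return 2, value
    best_key, best_ratio = None, 0.0
    for key in cache:
        ratio = SequenceMatcher(None, normalize(key), norm).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    if best_key is not None and best_ratio >= fuzzy_threshold:
        return 3, cache[best_key]
    return 0, None

# Hypothetical answer-quality weights per cache tier.
ANSWER_WEIGHT = {1: 1.0, 2: 1.0, 3: 0.5, 0: 0.0}

def tier_aware_reward(process_score, answer_score, tier):
    """Down-weight answer quality when the cached observation is unreliable."""
    w = ANSWER_WEIGHT[tier]
    return (process_score + w * answer_score) / (1.0 + w)

cache = {'search(query="weather paris")': "sunny"}
```

On a cache miss the answer term is dropped entirely, so a rollout with a correct tool-call process is not penalized for an answer it could not have produced from the cached environment.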
DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation
- Authors: Miduo Cui, Haochen Wang, Shangqin Mao, Xun Yang, Qianlong Xie, Xingxing Wang, Xuri Ge, Ying Zhou, Zhiwei Xu
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.14192
- Pdf link: https://arxiv.org/pdf/2606.14192
- Abstract
Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer-based methods.
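The decoupling of candidate generation from decision making can be illustrated with a toy example. The value function below is a made-up bimodal curve, standing in for a learned value model, and the candidate lists stand in for sampled and retrieved bids; none of these names or numbers come from DRIVE.

```python
def toy_value(bid):
    """Hypothetical bimodal value: good near 0.5 and near 2.0, poor in between."""
    return max(1.0 - abs(bid - 0.5), 1.2 - abs(bid - 2.0))

def select_bid(sampled_bids, retrieved_bids, value_fn):
    """Pool candidates from both sources and keep the highest-valued one."""
    candidates = list(sampled_bids) + list(retrieved_bids)
    return max(candidates, key=value_fn)

sampled = [0.5, 0.6]     # stand-in for samples from a distributional action head
retrieved = [2.0]        # stand-in for a bid retrieved from historical decisions
chosen = select_bid(sampled, retrieved, toy_value)

averaged = (0.5 + 2.0) / 2   # what a unimodal policy would tend to output
```

The averaged bid of 1.25 scores 0.45 under the toy value function, worse than either mode, which is the collapse the abstract attributes to unimodal formulations. Selecting among explicit candidates avoids it and here picks the retrieved bid of 2.0.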
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
- Authors: Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14249
- Pdf link: https://arxiv.org/pdf/2606.14249
- Abstract
AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.
Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning
- Authors: Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14270
- Pdf link: https://arxiv.org/pdf/2606.14270
- Abstract
Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at this https URL.
Retrospective Progress-Aware Self-Refinement for LLM Agent Training
- Authors: Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.14302
- Pdf link: https://arxiv.org/pdf/2606.14302
- Abstract
LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to $12\%$ absolute success rate gains.
- 中文摘要
基于LLM的智能体通过强化学习训练,能够优化逐步动作预测,但缺乏对任务进展的元认知意识,由此产生的差距阻碍了长程任务上的扩展。一项试点研究显示,在线进展提示会损害表现,而回顾性演示则有帮助,但这种能力无法仅靠结果奖励训练涌现。我们提出RePro(回顾性进展感知训练),这是一种通过“先前向执行、后反思”的rollout范式训练智能体自我生成进展信号的框架:智能体在线执行动作,然后根据已完成的轨迹和已知结果回顾性地重新评估其逐步进展。RePro首先通过回顾热身进行初始化,从极少量的外部演示中学习反思格式,随后通过RePro-PO以复合奖励继续训练,在无需持续外部监督的情况下产生自生成信号。在WebShop、ALFWorld和Sokoban上的实验显示,RePro能提升Qwen系列的性能,绝对成功率提升最高达12%。
ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models
ForceForget:消除强化概念以提升文本转图像模型的安全性
- Authors: Dong Han, Yong Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.14351
- Pdf link: https://arxiv.org/pdf/2606.14351
- Abstract
With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to erase unsafe concepts excessively and to suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining the model's ability to interpret safe semantic meaning, by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embeddings for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms its counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.
- 中文摘要
随着生成式人工智能的发展,文本到图像(T2I)模型具备生成各种内容的能力。然而,T2I模型仍然可能生成不安全的内容。为缓解这一问题,提出了各种概念抹除方法。然而,现有方法往往过度抹除不安全的概念,抑制有害提示中包含的良性概念,这可能对模型效用产生负面影响。本文重点通过强化学习优化概念消除奖励(CER)来消除不安全内容,同时保持模型在安全语义意义解释中的能力。为避免内容过度删除,我们引入了安全适配器,用于部分文本嵌入项目,以高效调节交叉注意力层的概念。在不同数据集上进行的大量实验证明,该方法在减少不安全内容生成的同时,保持了良性图像的高保真度,相较于现有最先进的(SOTA)概念抹除方法。在鲁棒性方面,我们的方法在对抗红队工具时表现优于其他方法。此外,我们展示了该方法在新兴图像对图像(I2I)场景中比其他方法更有效。最后,我们扩展方法,抹去一般概念,如艺术风格和物品。免责声明:本文包含可能冒犯某些读者的性露骨内容讨论。本研究中使用的所有图像均为综合或来自公开数据集。
Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models
弹性查询强化学习:VLA模型的自觉策略执行
- Authors: Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14375
- Pdf link: https://arxiv.org/pdf/2606.14375
- Abstract
Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
- 中文摘要
视觉-语言-动作(VLA)模型是机器人操作的强大动作生成器,但通常采用固定推理和重新规划计划来执行。这种严格性忽略了机器人控制难度的不均:接触丰富或不确定状态可能需要更多计算和更新鲜的反馈,而较简单的状态通常可以通过更少的推理步骤和更长的开环执行来处理。我们提出了弹性查询强化学习(EQRL)框架,使每个VLA策略查询具有弹性。一个轻量级潜调度适配器可以联合选择潜在输入、去噪预算和动作块长度,而无需微调底层的VLA模型。为了使调度难度感知,EQRL对联合潜在调度动作进行批评训练,并从批评者集合分歧中推导出状态难度信号。该信号引导计算趋向困难状态,而学习残差则允许任务驱动的纠正。我们将变量块执行表述为查询级宏操作强化学习,采用块依赖折扣和摊销函数评估次数(NFE)预算。在仿真和真实机器人操作中,EQRL降低摊销推断成本,同时保持或提升任务成功率。
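The two ideas in the EQRL abstract that lend themselves to a small illustration are the difficulty signal derived from critic-ensemble disagreement and the discounting of variable-length action chunks. The sketch below is not the authors' implementation. It assumes disagreement is measured as the standard deviation of the ensemble's Q-values, replaces the learned latent-schedule adaptor with a hypothetical hand-written threshold rule (`pick_schedule`, whose step counts and chunk lengths are made-up values), and assumes "chunk-dependent discounting" means the usual macro-action target that bootstraps with `gamma ** k` after a chunk of `k` steps.

```python
from statistics import pstdev


def difficulty_from_ensemble(q_values):
    """Assumed difficulty signal: spread of Q-value estimates across critics."""
    return pstdev(q_values)


def pick_schedule(difficulty, threshold=0.5):
    """Hypothetical stand-in for the learned adaptor.

    Hard states get more denoising steps and a shorter open-loop chunk;
    easy states get fewer steps and a longer chunk.
    """
    if difficulty > threshold:
        return {"denoise_steps": 8, "chunk_len": 4}
    return {"denoise_steps": 2, "chunk_len": 16}


def chunk_target(rewards, next_value, gamma=0.99):
    """Macro-action TD target for a chunk of k = len(rewards) steps.

    Sums discounted in-chunk rewards, then bootstraps with gamma ** k.
    """
    k = len(rewards)
    in_chunk = sum(gamma ** i * r for i, r in enumerate(rewards))
    return in_chunk + gamma ** k * next_value


if __name__ == "__main__":
    d = difficulty_from_ensemble([0.0, 2.0])
    print(d, pick_schedule(d))
    print(chunk_target([1.0, 1.0], next_value=4.0, gamma=0.5))
```

Because the bootstrap term is discounted by the chunk length, a longer open-loop chunk is not artificially favored over several short ones when the critic compares schedules.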
CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning
CSPO:约束敏感策略优化以实现安全强化学习
- Authors: Ayoub Belouadah, Sylvain Kubler, Yves Le Traon
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14415
- Pdf link: https://arxiv.org/pdf/2606.14415
- Abstract
Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we propose Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, reducing oscillations near the boundary, and preserving the KKT solutions of the original constrained problem. Experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns compared to state-of-the-art primal-dual and penalty-based methods.
- 中文摘要
安全强化学习(Safe RL)旨在最大化期望回报,同时满足安全约束,通常以受限马尔可夫决策过程(CMDPs)建模。虽然原始对偶方法在深度强化学习中扩展性良好,但通常存在约束修正延迟,导致振荡行为和长期安全违规。本文提出了约束敏感策略优化(CSPO),这是一种一阶原始对偶方法,将局部约束敏感性纳入策略更新中。CSPO通过从最短符号距离到安全边界导出的约束敏感修正来补充原始目标,实现更智能的恢复步回安全,补偿延迟的拉格朗日乘数更新,减少边界附近的振荡,并保持原始受限问题的KKT解。导航和移动基准测试的实验表明,CSPO实现了更快的安全恢复和高奖励保值,从而使得与最先进的原始-对偶和基于罚则的方法相比,获得更高的约束收益
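For readers unfamiliar with the primal-dual setting the CSPO abstract builds on, the standard CMDP Lagrangian can be written as below. The symbols ($J_r$ for expected return, $J_c$ for expected cost, $d$ for the cost budget, $\eta$ for the dual step size) are notation assumed here, not taken from the paper, and CSPO's constraint-sensitive correction term is not reproduced because the abstract does not give its form.

```latex
\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le d
\qquad\Longrightarrow\qquad
\max_{\pi} \, \min_{\lambda \ge 0} \; L(\pi, \lambda) = J_r(\pi) - \lambda \bigl( J_c(\pi) - d \bigr)

\lambda \leftarrow \bigl[ \lambda + \eta \bigl( J_c(\pi) - d \bigr) \bigr]_{+}
```

The multiplier $\lambda$ only grows after a violation $J_c(\pi) > d$ has been observed, which is the delayed constraint correction the abstract identifies as the source of oscillation near the safety boundary.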
Causal Object-Centric Models for Planning with Monte Carlo Tree Search
基于因果对象的规划模型,采用蒙特卡洛树搜索
- Authors: Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.14418
- Pdf link: https://arxiv.org/pdf/2606.14418
- Abstract
We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.
- 中文摘要
我们介绍了COMET(因果对象中心高效树搜索模型),这是一种基于模型的强化学习算法,在槽结构潜空间中执行蒙特卡洛树搜索。COMET将一个冻结的无监督对象中心编码器与基于Transformer的世界模型配对,该模型通过一种新颖的动作-槽融合机制将动作与对象绑定,该机制用于槽转移预测。策略头和价值头使用对象因果注意力,通过学习得到的每槽相关性评分调节token之间的交互,使决策集中在任务相关实体上。COMET在MuZero风格的潜在规划中增加了显式的对象级归纳偏置。在来自Object-Centric Visual RL基准、ManiSkill、Robosuite和VizDoom的八个视觉和动态多样化任务中,COMET在训练早期阶段的平均归一化得分高于以对象为中心的基线和整体式基线。
Kine2Go: Kinematic dataset for the Unitree Go2 robot with diverse gaits and motions
Kine2Go:适用于具有多样步态和动作的Unitree Go2机器人的运动学数据集
- Authors: Władysław Pałucki, Paweł Siwak, Krzysztof Ciebiera, Marek Cygan
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.14433
- Pdf link: https://arxiv.org/pdf/2606.14433
- Abstract
The recent popularity of robotics, combined with the steadily decreasing cost of robotic hardware, has lowered the entry barrier to robotics research and enabled rapid advancements in the field. One of the primary examples is the Unitree Go2 quadruped robot, which is often used by researchers in areas such as locomotion, navigation, and control. Many researchers use the Go2 robot in combination with techniques like imitation learning, reinforcement learning, and behavioral cloning to allow machine learning systems to take full control of the robot. At the same time, many of those techniques require demonstration data consisting of the robot's kinematic information and the actions applied to the motors. Obtaining such data is difficult, requires building complex pipelines, and can take significant time. To aid in those kinds of efforts, we present Kine2Go, a dataset of 800 diverse gait and motion kinematic trajectories for the Unitree Go2 robot, derived from 40 distinct policies. Our pipeline accepts data from various quadruped morphologies and translates them to a Go2-compatible format. We then use reinforcement learning to train policies that follow a given motion, and finally we gather data from those policies, which yields robust, perturbed kinematic data with corresponding motor-level actions.
- 中文摘要
机器人技术的近期流行,加上机器人硬件成本的稳步下降,降低了机器人研究的门槛,推动了该领域的快速进展。其中一个主要例子是Unitree Go2四足机器人,该机器人常被运动、导航、控制等领域的研究人员使用。许多研究人员将Go2机器人与模仿学习、强化学习和行为克隆等技术结合使用,使机器学习系统能够完全控制机器人。同时,许多技术需要包括机器人运动学信息和应用于电机的动作在内的演示数据。获取此类数据困难,需要构建复杂的管道,且可能耗费大量时间。为支持此类工作,我们展示了Kine2Go——一个包含800条不同步态运动学数据的Unitree Go2机器人数据集,源自40个不同策略。我们的流程接受各种四足动物形态的数据,并将其转换为兼容Go2的格式。然后我们利用强化学习训练策略,遵循给定动作,最后从这些策略中收集数据,从而获得稳健且受扰动的运动学数据,并伴随相应的运动层面动作。
From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI
从聊天机器人到数字同事:向持久自主人工智能的范式转变
- Authors: Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14502
- Pdf link: https://arxiv.org/pdf/2606.14502
- Abstract
Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.
- 中文摘要
大型语言模型(LLMs)正经历从对话生成器向具备推理、行动、记忆和自我提升能力的集成人工智能系统的根本转变。我们将这次转变概念化为从聊天机器人向数字同事的转变:从对话式回答转向持续工作。我们将这一转变组织在两个紧密耦合的维度上。首先,在认知核心层面,LLM正从聊天机器人时代的“快速思考”系统(由下一代币预测驱动)向利用推理时间计算、思维链推理、反思、过程监督和强化学习支持更有意识和可靠性的思考型LLM发展。其次,在工具增强的任务执行层面,LLM正从调用外部资源的工具调用代理,逐步发展为配备持久工作区、技能、验证循环和治理的OpenClaw式工作站系统(OpenClaw)。“Workspace + Skill”范式使得情节式工具通过状态持久化、可重用过程、任务关闭和体验重用等方式使用同事。我们考察了从指令-响应对到状态-动作-观察轨迹的数据构建转变,以及从静态基准到沙盒化、可审计、自我演进的人工智能生态系统的评估。
Provably Safe, Yet Scalable Reinforcement Learning
可验证安全且可扩展的强化学习
- Authors: Kai S. Yun, Zeyang Li, Navid Azizan
- Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2606.14536
- Pdf link: https://arxiv.org/pdf/2606.14536
- Abstract
Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.
- 中文摘要
安全强化学习(RL)旨在学习在满足约束条件的同时优化奖励的策略。主流方法依赖软约束策略优化,虽然取得了实证成功,但并未为所学策略提供正式的安全保障。相比之下,严格保证的方法通常依赖显式证书函数,其构建需要直接合成和验证控制不变集合,这一过程在状态维度上扩展性差,且常常表现过于保守。本文介绍了可证明安全但可扩展的强化学习(PS2-RL)框架,这是一种新型的两阶段架构,旨在以可扩展的方式学习可证明安全的策略,旨在克服以往方法的关键瓶颈。PS2-RL不直接计算不变集合,而是利用学习到的备份策略前向集成系统动力学,生成隐式控制不变集合。第一阶段,备份策略用我们提出的安全到达值函数训练,该函数描述了不变集构造的最优备份策略。第二阶段,通过可微化的投影层对强化学习策略进行端到端训练,严格执行由学习后的备份策略所带来的安全保障。通过最大化第一阶段隐含控制不变集的体积,第二阶段产生的PS2策略既高效且可扩展,同时保持可证明的安全性。关键是,PS2-RL对底层强化学习算法没有任何限制,并且可以插入任何现有的训练流程中。我们为所提框架建立了理论保证,并在状态维度高达10的机器人控制任务中进行评估,在这一阶段,以往可证明安全的强化学习方法难以实现或变得不切实际。
VISTA: View-Consistent Self-Verified Training for GUI Grounding
VISTA:视图一致的自我验证GUI基础培训
- Authors: Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14579
- Pdf link: https://arxiv.org/pdf/2606.14579
- Abstract
When applying Group Relative Policy Optimization (GRPO) to GUI grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
- 中文摘要
在将组相对策略优化(GRPO)应用于GUI定位时,rollout从单一截图视图中采样;各组往往在困难样本上全部失败,或在简单样本上全部成功,无法产生有用的相对优势。我们提出了VISTA(视图一致性自我验证训练),这是一个基于GRPO的训练框架,它从同一GUI的多个保留目标的视图构建每个比较组。每个视图由裁剪生成,裁剪保持目标元素可见并精确重映射其边界框,因此模型的rollout在语义等价但几何不同的输入之间进行比较。为了稳定短坐标生成,同时不使强化学习变成无条件模仿,VISTA进一步增加了自我验证的跨视图锚点:一个以优势加权损失优化的oracle答案,排除在组基线之外,仅在模型已产生最大奖励rollout时激活。在五个GUI定位基准和多个Qwen骨干模型上,VISTA持续提升定位性能。在ScreenSpot-Pro上,它将Qwen3-VL 4B/8B/30B-A3B从55.5/52.7/53.7提升至63.4/65.8/67.0。鲁棒性分析进一步显示,最差视图准确率更高,预测翻转率更低。
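The failure mode VISTA targets, and the box remapping that its target-preserving crops require, can both be shown in a few lines. This is a minimal sketch rather than the authors' code: it assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation plus a small epsilon), and the function names, the `(x1, y1, x2, y2)` box layout, and the `(left, top, right, bottom)` crop layout are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def remap_box(box, crop):
    """Express an (x1, y1, x2, y2) box in the coordinate frame of a crop.

    `crop` is (left, top, right, bottom) in original pixels and must fully
    contain the box, so the target element stays visible in the new view.
    """
    x1, y1, x2, y2 = box
    left, top, right, bottom = crop
    if not (left <= x1 and top <= y1 and x2 <= right and y2 <= bottom):
        raise ValueError("crop must keep the target element fully visible")
    return (x1 - left, y1 - top, x2 - left, y2 - top)


if __name__ == "__main__":
    # All failures or all successes: every advantage is zero, so no gradient signal.
    print(group_relative_advantages([0, 0, 0, 0]))
    print(group_relative_advantages([1, 1, 1, 1]))
    # A mixed group gives a usable signal.
    print(group_relative_advantages([1, 0, 0, 0]))
    print(remap_box((100, 200, 150, 230), (80, 150, 400, 500)))
```

Sampling each group member from a different crop is one way to make mixed-reward groups more likely, since the same target can be easy in one view and hard in another.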
A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems
氢多能系统中运行阈值检测和可部署调度控制器开发的统计与机器学习框架
- Authors: Shadi Heenatigala, Hasanika Samarasinghe
- Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Computation (stat.CO)
- Arxiv link: https://arxiv.org/abs/2606.14601
- Pdf link: https://arxiv.org/pdf/2606.14601
- Abstract
This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($\epsilon^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.
- 中文摘要
本研究提出了一个统计和机器学习框架,用于利用一年的高分辨率运行数据来表征基于氢的多能系统(H-MES)。统计分析显示,这一过程由可再生能源盈余驱动,太阳辐照度解释了氢气产量等级变异的45.7%,按传统标准影响较大。只有高辐照周期才会触发有意义的电解槽接合,而电力需求则产生较弱的反向抑制效应($\epsilon^2 = 0.126$)。多重回归证实电解器功率为主导线性预测因子,太阳与风的协同作用。值得注意的是,随机森林分析将风力输出在预测重要性上排名第一,尽管其二元相关性较弱(r = 0.167),揭示了参数化方法无法察觉的非线性动力学。序列模型利用强24小时自相关性(r = 0.845)进行操作预测,而强化学习代理则优化了氢气收入分配。其核心贡献是证明统计和机器学习方法在H-MES建模与控制方面是互补的。
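The abstract reports effect sizes as $\epsilon^2$ and as a share of "rank-based variance" without naming the test. A common reading, assumed here and not confirmed by the source, is the epsilon-squared effect size of a Kruskal-Wallis test, $\epsilon^2 = H / (n - 1)$, where $H$ is the test statistic and $n$ the total sample size. The sketch below computes it with the standard library only; it uses average ranks for ties but omits the tie correction to $H$, and the toy data are invented for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_epsilon_squared(groups):
    """Epsilon-squared effect size, H / (n - 1), without tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    return h / (n - 1)


if __name__ == "__main__":
    # Toy example: production under low vs. high irradiance (made-up numbers).
    low, high = [1, 2, 3], [4, 5, 6]
    print(kruskal_epsilon_squared([low, high]))
```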
Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency
自动驾驶安全强化学习:安全与效率统一框架
- Authors: Chufei Yan, Zhihao Cui, Yiyan Lv, Taojie Chen, Ning Bian, Yulei Wang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.14609
- Pdf link: https://arxiv.org/pdf/2606.14609
- Abstract
Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.
- 中文摘要
深度强化学习(DRL)为先进自动驾驶车辆(AV)提供了一条引人注目的决策路径,但其反复试错的特性使得在训练期间保障安全以及在部署时兼顾安全与效率变得困难。我们提出了一个统一的安全强化学习(SRL)框架,整合了安全距离(SD)、奖励机(RM)和专家混合(MoE),称为MoE-RM-SRL。在部署方面,SD和RM共同塑造规则感知奖励,编码高速公路交通法规和分阶段目标,确保安全可靠的行为,同时不牺牲效率。在训练方面,我们引入了由多达11个深度Q网络(DQN)组成的稀疏门控MoE层;基于SD的门控规则激活用于车道保持和变道的最小专家集合,减轻了在异构控制器之间切换(如MPC/基于规则的模块和学习策略)时常见的不稳定性、不连续性和冲击瞬态。我们在CARLA中实现了所提架构,并将其与6自由度驾驶员在环虚拟现实(DiL-VR)平台集成。随机双车道交通实验表明,MoE-RM-SRL相比最先进的基线大幅提升了安全性和效率,该框架可自然扩展到多车道驾驶以及匝道汇入和驶出场景。
HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities
HPSv3++:涵盖扩散模型全谱的奖励模型尺度
- Authors: Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.14657
- Pdf link: https://arxiv.org/pdf/2606.14657
- Abstract
Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for the shifts in quality discrimination that arise from evolving model capabilities and reinforcement learning (RL) iterations, which limits their broader applicability. In this work, we propose HPSv3++, a reward model framework that extends the HPSv3 model to varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-range capabilities. The code is available at this https URL.
- 中文摘要
奖励模型引导文本转图像(T2I)系统朝着符合人类偏好的输出方向发展。然而,典型的奖励模型如HPSv3是基于早期T2I模型的预注释数据训练的,未考虑模型能力演进和强化学习(RL)迭代带来的质量判别性变化,限制了其更广泛的适用性。在本研究中,我们提出了HPSv3++,这是一个奖励模型框架,提升了HPSv3模型在不同T2I能力及其在整个能力-迭代频谱中强化学习迭代变化的高度。具体来说,我们首先介绍了HPDv3++,这是一个212K的双维偏好数据集,采用了新的高能力(Qwen-Image)模型,并由人工监督,进行了文本忠实度和美观质量的注释。随后,我们提出了一个两阶段的培训框架。第一阶段采用数据感知的正交梯度投影,融合了HPDv3++中多样的美学感知,同时保留了HPSv3中原始有效的人类偏好知识。第二阶段进一步利用跨越不同能力水平和强化学习迭代的T2I模型中的无标记数据,并为奖励模型引入了联合能力-迭代条件信号,并结合了标准差驱动的无监督引导机制, 在能力-迭代范围内加强奖励模型。HPSv3++实现了最先进的偏好预测,在HPDv3上表现优于HPSv3的9.8%,在GenAI-Bench上超过5.5%,而在我们提出的HPDv3++中则达到79.1%/88.1%。用于T2I强化学习训练时,它持续提升多种T2I模型的GenEval分数,展示了其广泛的能力。代码可在该 https URL 访问。
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
CORA:通过一致性导向推理对齐分析并弥合多模态RLVR中的思维与答案差距
- Authors: Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.14691
- Pdf link: https://arxiv.org/pdf/2606.14691
- Abstract
Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present during inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.
- 中文摘要
带可验证奖励的强化学习(RLVR)成功激发了大型语言模型的推理能力,推动其扩展到多模态场景。现有方法主要关注改善推理痕迹的视觉覆盖和减轻视觉幻觉,但低估了推理过程与最终答案之间的语义不一致。本文深入探讨了大型视觉语言模型(LVLM)中RLVR的思维-答案不一致,展示了在群体相对政策优化(GRPO)训练过程中收集的推广数据及RLVR评估后输出的详细分析,发现该问题在训练期间持续存在,并在推断过程中依然存在。基于该分析,我们提出了一致性导向推理对齐(CORA),通过轻量化的即插即用一致性奖励模型引入思维答案语义一致性,并进一步整合混合奖励优势分割(HRAS),以稳定协调任务与一致性优化。在代表性的多模态推理基准和主流LVLMs中的大量实验表明,CORA在有效减少思考与答案不一致的同时,提升了任务表现,从而实现更忠实的推理痕迹。
Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning
多目标多代理强化学习的协调偏好
- Authors: Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.14693
- Pdf link: https://arxiv.org/pdf/2606.14693
- Abstract
Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.
- 中文摘要
合作多目标多智能体强化学习(MOMARL)模拟了在多个可能相互冲突目标下的团队决策。在此环境中,冲突不仅发生在目标之间,也发生在具有不同观察、角色和贡献的主体之间。我们提出了偏好协调多代理策略优化(PCMA),它通过学习协调的代理特定偏好,从而实现代理之间的互补权衡。理论上,我们将合作MOMARL表述为团队最优博弈,并证明在适当条件下,偏好多样性可以通过一阶改进分解诱导团队改进。在多个协作MOMA环境和实际交通控制场景上的实验表明,PCMA不仅提升了性能,还提升了权衡协调。
Keyword: diffusion policy
Diffusion Policy Optimization without Drifting Apart
扩散政策优化而不分离
- Authors: Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.13795
- Pdf link: https://arxiv.org/pdf/2606.13795
- Abstract
RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose DiPOD, a diffusion policy optimization framework that maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
- 中文摘要
强化学习后训练对改进扩散策略变得越来越关键,但现有的扩散策略梯度方法往往不稳定,无法实现可靠的策略改进。我们将原因归结为双重漂移现象:优化变分代理目标可能使ELBO与真实对数似然分离,从而使得到的代理策略梯度与期望回报的真实策略梯度不一致。我们提出了DiPOD,这是一种扩散策略优化框架,通过将自蒸馏与策略改进梯度更新交错进行,在整个训练过程中保持界的紧致性。这导出了一个简单实用的算法:在每次扩散策略梯度更新中加入一个on-policy的ELBO正则项。在扩散语言模型后训练和连续控制扩散策略中,DiPOD显著稳定了训练,并达到了比以往方法更高的奖励。
Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera
空间条件扩散策略:学习单一RGB相机的精确稳健操作
- Authors: Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.14535
- Pdf link: https://arxiv.org/pdf/2606.14535
- Abstract
Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.
- 中文摘要
近年来的视觉模仿学习系统广泛采用多摄像机配置,腕部摄像头已成为事实上的标准。然而,从单一全局视图进行操作仍然具有挑战性,因为该策略应捕捉细粒度交互细节,并识别任务相关区域,而非本地手腕视图。为应对这一挑战,我们提出了空间条件扩散政策(SCDP),这是一种基于扩散的视觉运动策略,能够在单摄像头环境中实现精确且稳健的操作。我们的核心思想是,末端效应器轨迹可以作为视觉注意力锚点,反映与任务相关的区域。基于这一理念,SCDP由两个关键组成部分:(i)一个视觉编码器,用于生成多尺度特征图,以捕捉更广泛的上下文和细粒度的视觉特征;(ii)一个空间条件模块,沿扩散环中中间端执行器轨迹点点采样特征。大量仿真实验表明,SCDP始终优于强单视角基线,并实现与多机位基线相当的性能。真实实验进一步展示了对视觉干扰的精确操控和鲁棒性,凸显了单机模拟学习的潜力。