Arxiv Papers of Today

Generated: 2026-05-15 18:27:49 (UTC+8); arXiv release time: 2026-05-15 20:00 EDT (2026-05-16 08:00 UTC+8)

58 relevant papers today

Keyword: reinforcement learning

CA2: Code-Aware Agent for Automated Game Testing

Authors: Valliappan Chidambaram Adaikkappan, Vincent Martineau, Joshua Romoff, David Meger
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13918
Pdf link: https://arxiv.org/pdf/2605.13918
Abstract Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing often misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, 1) State-based and 2) Image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over the non-code-aware baselines, which do not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

Rethinking Molecular OOD Generalization via Target-Aware Source Selection

Authors: Zhuohao Lin, Kun Li, Jiameng Chen, Jiajun Yu, Duanhua Cao, Yizhen Zheng, Wenbin Hu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13932
Pdf link: https://arxiv.org/pdf/2605.13932
Abstract Robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios is a pivotal bottleneck in AI-driven drug discovery. Current scaffold-splitting protocols fail to obstruct microscopic semantic overlap, predisposing models to shortcut learning and overestimating their true extrapolation capability; meanwhile, conventional domain adaptation paradigms suffer under extreme structural shifts, as blindly aligning heterogeneous source libraries injects topological noise and triggers negative transfer. To address these two challenges, scaffold-cluster out-of-distribution performance evaluation benchmark (SCOPE-BENCH), a benchmark built on cluster-level partitioning in an explicit physicochemical descriptor space, is proposed alongside policy optimization for multi-source adaptation (POMA), a framework that formulates knowledge transfer as a retrieve-compose-adapt pipeline: labeled source scaffolds structurally close to the unlabeled target are first identified as proxy targets; a reinforcement learning policy then adaptively selects the optimal source subset from an exponentially large candidate pool; and dual-scale domain adaptation is finally performed at macroscopic topological and microscopic pharmacophore scales. Evaluations show that prediction errors of state-of-the-art 3D molecular models surge by up to 8.0x on SCOPE-BENCH with a mean of 5.9x, while POMA achieves up to an 11.2% reduction in mean absolute error with an average relative improvement of 6.2% across diverse backbone architectures. Code is available at this https URL.

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

Authors: Sinjae Kang, Chanyoung Kim, Kaixin Wang, Li Zhao, Kimin Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13959
Pdf link: https://arxiv.org/pdf/2605.13959
Abstract Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

Authors: Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14026
Pdf link: https://arxiv.org/pdf/2605.14026
Abstract For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: this https URL

Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty

Authors: Hikaru Hoshino, Taiyo Mantani, Eiko Furutani
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.14043
Pdf link: https://arxiv.org/pdf/2605.14043
Abstract The rapid growth of variable renewable energy has increased the need for flexible and efficiently coordinated energy resources. In this context, hybrid resources that combine renewable generation and battery storage within a single market-participating entity have attracted growing attention. Such hybrid resources can have multiple revenue streams, while allocating limited power and energy capacity across multiple electricity markets including energy and ancillary services. This multi-market coordination increases operational complexity and complicates profitability assessment, making optimal system sizing a challenging design problem. In addition, uncertainty in renewable generation and market prices makes it difficult for conventional optimization approaches to determine system designs that remain effective under stochastic operating conditions. To address these challenges, this paper proposes a deep reinforcement learning-based co-optimization framework for hybrid solar-battery resources. The framework embeds system design variables directly into the policy learning process, enabling joint optimization of hybrid system sizing and coordinated multi-market bidding strategies within a unified stochastic formulation. Case studies using historical renewable generation and market data demonstrate the effectiveness of the proposed framework in identifying economically rational hybrid system design considering multi-market operation.

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

Authors: Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.14054
Pdf link: https://arxiv.org/pdf/2605.14054
Abstract Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

Authors: Xubo Lin, Zezhii Deng, Shihao Wang, Grace Hui Yang, Yang Deng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.14057
Pdf link: https://arxiv.org/pdf/2605.14057
Abstract Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce Inquisitive Conversational Agents (ICAs) and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

Authors: Luis Lara, Aristides Milios, Zhi Hao Luo, Aditya Sharma, Ge Ya Luo, Christopher Beckham, Florian Golemo, Christopher Pal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14117
Pdf link: https://arxiv.org/pdf/2605.14117
Abstract An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
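
To make the idea of a verifiable reward concrete, here is a minimal sketch of a rule-based check on a generated floor plan. It is not the paper's reward: the rectangle encoding (x, y, width, height), the function names, the 5% area tolerance, and the example rooms are all hypothetical choices made for illustration.

```python
from itertools import combinations


def overlaps(a, b):
    """True if axis-aligned rectangles (x, y, w, h) share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def plan_reward(rooms, target_areas, tol=0.05):
    """Fraction of rooms that meet their target area, or 0.0 if any rooms overlap.

    Hypothetical scoring rule: an overlapping plan is treated as invalid.
    """
    if any(overlaps(rooms[p], rooms[q]) for p, q in combinations(rooms, 2)):
        return 0.0
    hits = 0
    for name, target in target_areas.items():
        _, _, w, h = rooms[name]
        if abs(w * h - target) <= tol * target:  # within relative tolerance
            hits += 1
    return hits / len(target_areas)


# Hypothetical plan: two rooms that touch along an edge but do not overlap.
rooms = {"kitchen": (0, 0, 3, 4), "bedroom": (3, 0, 4, 4)}
targets = {"kitchen": 12.0, "bedroom": 15.0}
reward = plan_reward(rooms, targets)

# Shifting the bedroom left makes it overlap the kitchen, so the plan is invalid.
overlapping = {"kitchen": (0, 0, 3, 4), "bedroom": (2, 0, 4, 4)}
invalid_reward = plan_reward(overlapping, targets)
```

Because the check is purely algorithmic, the same plan always receives the same score, which is the property that makes such a reward verifiable.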

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

Authors: Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14126
Pdf link: https://arxiv.org/pdf/2605.14126
Abstract Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

Authors: Qisong He, Xinmiao Huang, Jinwei Hu, Zhuoyun Li, Yi Dong, Changshun Wu, Xiaowei Huang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.14174
Pdf link: https://arxiv.org/pdf/2605.14174
Abstract Safe navigation for mobile robots demands policies that remain reliable under the high-consequence perception uncertainty of cluttered environments. Yet most existing safe reinforcement learning (RL) methods assess safety through average cumulative cost. Such metrics can mask dangerous tail-risk behaviors. To address this, we propose a framework that trains risk-sensitive policies through Conditional Value-at-Risk (CVaR) constrained optimization on an off-policy TD3 backbone and evaluates their safety margins post-training through neural network reachability verification. During training, the policy is optimized under CVaR constraints on cumulative costs, promoting sensitivity to high-cost tail outcomes rather than average behavior alone. After training, we compute action reachable sets under bounded observation uncertainty using Taylor Model analysis, yielding a safety rate metric that quantifies the proportion of evaluated states at which the policy's reachable action set remains within prescribed safety margins. A key finding is that policies trained with CVaR constraints maintain larger safety margins from obstacles across evaluated states. This makes them significantly more amenable to formal reachability verification. Experiments across ten navigation scenarios and six baselines show that our method achieves a 98.3% success rate, the highest safety verification rate among all compared methods, while revealing that average cost rankings and reachability-based safety rankings can diverge. This indicates that reachability verification captures risks which are missed by empirical cost metrics alone. We further validate our approach on a physical Clearpath Jackal robot, demonstrating successful sim-to-real transfer.
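
The abstract's point that average cost can hide tail risk can be illustrated with a small sketch of empirical CVaR. It assumes the common definition of CVaR at level alpha as the mean of the worst (1 - alpha) fraction of outcomes. The function name, the sample costs, and the alpha value are hypothetical and are not taken from the paper.

```python
import math


def empirical_cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of costs (higher cost = worse)."""
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(costs, reverse=True)  # worst outcomes first
    # Round before ceil so floating-point noise does not enlarge the tail.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


# Two hypothetical policies with the same average episode cost.
steady = [1.0] * 10          # always a small cost
spiky = [0.0] * 9 + [10.0]   # usually free, occasionally very costly

mean_steady = sum(steady) / len(steady)
mean_spiky = sum(spiky) / len(spiky)
cvar_steady = empirical_cvar(steady, alpha=0.9)
cvar_spiky = empirical_cvar(spiky, alpha=0.9)
```

Both policies have the same mean cost, but the CVaR of the second is ten times larger, which is the kind of difference a constraint on CVaR is meant to expose.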

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Authors: Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.14201
Pdf link: https://arxiv.org/pdf/2605.14201
Abstract Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Authors: Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14212
Pdf link: https://arxiv.org/pdf/2605.14212
Abstract Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creates a frozen-executor ceiling and leaves the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

Authors: Noah Flynn
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2605.14215
Pdf link: https://arxiv.org/pdf/2605.14215
Abstract Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.

PreFT: Prefill-only finetuning for efficient inference

Authors: Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.14217
Pdf link: https://arxiv.org/pdf/2605.14217
Abstract Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9x the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.

Quantum Advantage in Multi Agent Reinforcement Learning

Authors: Simranjeet Singh Dahia, Claudia Szabo
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2605.14235
Pdf link: https://arxiv.org/pdf/2605.14235
Abstract We present an empirical evaluation of quantum entanglement in agent coordination within quantum multi agent reinforcement learning (QMARL). While QMARL has attracted growing interest recently, most prior work evaluates quantum policies without provable baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. We address this directly by evaluating a decentralized QMARL framework with variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, which has a mathematically proven classical performance ceiling of 0.75 win rate, we show that entangled QMARL agents approach the Tsirelson limit of 0.854, providing clear evidence of their quantum advantage. We show that unentangled quantum circuits match the classical baseline, confirming that entanglement and not the quantum circuit itself is the active coordination mechanism. We also explore the effect of specific entanglement structures, as some Bell states enable coordination gains while others actively harm performance. On cooperative navigation (CoopNav), QMARL without entanglement achieves ~2x improvement in success rate over classical MAA2C (~0.85 versus ~0.40), with the hybrid configuration (a quantum actor paired with a classical centralised critic) outperforming both fully classical and fully quantum solutions. We present our experimental analysis and discuss future work.
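
The two CHSH figures quoted in the abstract can be checked with a short sketch. It assumes the standard CHSH game: each player receives a uniform random bit (x for one, y for the other), each outputs a bit, and they win when the XOR of the outputs equals x AND y. The code enumerates all deterministic classical strategies (randomized ones are mixtures of these, so they cannot do better) and evaluates the known quantum optimum cos^2(pi/8). The function and variable names are illustrative.

```python
import itertools
import math


def chsh_win_rate(alice, bob):
    """Win rate of deterministic strategies alice[x], bob[y] over uniform inputs."""
    wins = sum((alice[x] ^ bob[y]) == (x & y) for x in (0, 1) for y in (0, 1))
    return wins / 4


# A deterministic strategy maps each input bit to an output bit: 4 per player.
strategies = list(itertools.product((0, 1), repeat=2))
classical_best = max(
    chsh_win_rate(a, b) for a in strategies for b in strategies
)

# Tsirelson's bound expressed as a win probability for the CHSH game.
tsirelson = math.cos(math.pi / 8) ** 2
```

Here `classical_best` evaluates to 0.75 and `tsirelson` to roughly 0.854, matching the ceiling and limit cited above.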
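
Illustrative note, not taken from the paper: the two CHSH bounds quoted in the abstract can be checked with a few lines of Python. The sketch below enumerates every deterministic classical strategy and evaluates one well-known set of measurement angles on a shared |Phi+> pair. The function names and the choice of angles are assumptions made for this example; the paper's agents learn their own circuits.

```python
import itertools
import math

def chsh_classical_best():
    """Best win rate over all deterministic classical strategies."""
    best = 0.0
    # Each player maps its input bit (0 or 1) to an output bit.
    for a0, a1, b0, b1 in itertools.product((0, 1), repeat=4):
        a, b = (a0, a1), (b0, b1)
        # A round is won when a XOR b equals x AND y.
        wins = sum((a[x] ^ b[y]) == (x & y) for x in (0, 1) for y in (0, 1))
        best = max(best, wins / 4)
    return best

def chsh_entangled(alice_angles, bob_angles):
    """Win rate with a shared |Phi+> pair and real-plane measurement angles."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            # For |Phi+>, the two outcomes agree with probability cos^2(delta).
            p_same = math.cos(alice_angles[x] - bob_angles[y]) ** 2
            total += (1.0 - p_same) if (x & y) else p_same
    return total / 4

classical = chsh_classical_best()
quantum = chsh_entangled((0.0, math.pi / 4), (math.pi / 8, -math.pi / 8))
print(classical, quantum)
```

Mixing deterministic strategies with shared randomness cannot beat the best deterministic one, which is why the enumeration is enough for the classical ceiling. The entangled value equals cos^2(pi/8), roughly 0.854.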

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

Authors: Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.14246
Pdf link: https://arxiv.org/pdf/2605.14246
Abstract Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
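
Illustrative note, not taken from the paper: the decision-time gate described in the abstract can be pictured as a convex blend of two value estimates. The sketch below assumes the optimistic estimate is the ensemble maximum, the conservative estimate is the ensemble minimum, and the predicted risk is used directly as the blend weight. All three choices, and both function names, are assumptions for this example.

```python
def risk_gated_value(q_ensemble, risk):
    """Blend optimistic and conservative ensemble estimates for one action.

    q_ensemble: Q-value estimates from the ensemble members (assumed layout).
    risk: predicted probability of a near-term safety violation, in [0, 1].
    """
    risk = min(1.0, max(0.0, risk))
    optimistic = max(q_ensemble)    # assumed optimistic estimate
    conservative = min(q_ensemble)  # assumed conservative estimate
    return (1.0 - risk) * optimistic + risk * conservative

def select_action(q_ensembles, risks):
    """Pick the action with the highest risk-gated value."""
    gated = [risk_gated_value(q, r) for q, r in zip(q_ensembles, risks)]
    return max(range(len(gated)), key=gated.__getitem__)

# Two candidate actions, two ensemble members each (made-up numbers).
q = [[1.0, 3.0], [1.5, 2.0]]
print(select_action(q, [0.0, 0.0]), select_action(q, [0.9, 0.1]))
```

With zero predicted risk the agent prefers the action with the higher optimistic estimate; once that action is flagged as risky, its value shrinks toward the conservative estimate and the other action wins.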

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Authors: Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S.M.Hadi Sadati, Nikola Fischer, Tom Vercauteren, Alejandro Granados, Thomas C. Booth
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14253
Pdf link: https://arxiv.org/pdf/2605.14253
Abstract Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14269
Pdf link: https://arxiv.org/pdf/2605.14269
Abstract Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Authors: Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu, Shuiyang Mao, Wei Liu, Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.14278
Pdf link: https://arxiv.org/pdf/2605.14278
Abstract Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Authors: Matias Alvo, Daniel Russo, Yash Kanoria
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.14297
Pdf link: https://arxiv.org/pdf/2605.14297
Abstract We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at this http URL.
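
Illustrative note, not taken from the paper: the idea of mixing pathwise and score-function (SF) gradients can be shown on a one-step toy problem. The sketch below assumes a discrete choice k drawn from a softmax, a Gaussian continuous action centred at mu[k], and a made-up quadratic reward. The continuous parameters get a pathwise gradient and the logits get an SF gradient. Because the toy has a single step, the cross term discussed in the abstract does not appear.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mixed_gradient(logits, mu, sigma, targets, bonus, n, rng):
    """Monte Carlo gradient of E[f(k, a)] for a toy one-step problem.

    k ~ Categorical(softmax(logits)), a = mu[k] + sigma * eps,
    f(k, a) = bonus[k] - (a - targets[k])**2  (toy reward, an assumption).
    """
    p = softmax(logits)
    g_logits = [0.0] * len(logits)
    g_mu = [0.0] * len(mu)
    for _ in range(n):
        k = rng.choices(range(len(p)), weights=p)[0]
        a = mu[k] + sigma * rng.gauss(0.0, 1.0)
        f = bonus[k] - (a - targets[k]) ** 2
        # Pathwise term: differentiate f through a = mu[k] + sigma * eps.
        g_mu[k] += -2.0 * (a - targets[k]) / n
        # Score-function term: f * d log p(k) / d logits.
        for j in range(len(p)):
            g_logits[j] += f * ((1.0 if j == k else 0.0) - p[j]) / n
    return g_logits, g_mu

g_logits, g_mu = mixed_gradient(
    [0.0, 0.0], [1.0, 0.0], 0.5, [0.0, 0.0], [0.0, 0.0], 20000, random.Random(0)
)
print(g_logits, g_mu)
```

For these inputs the exact gradients are (-0.25, 0.25) for the logits and (-1, 0) for mu, so the printed estimates should land close to those values.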

Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

Authors: Zuyuan Zhang, Carlee Joe-Wong, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14304
Pdf link: https://arxiv.org/pdf/2605.14304
Abstract Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).

Sub-Band Full Duplex Resource Allocation: A Predictive Deep Reinforcement Learning Approach

Authors: Abhiram D, Aiswarya Rajan, Arin Shemeem, Vipindev Adat Vasudevan, Abdulla P
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.14339
Pdf link: https://arxiv.org/pdf/2605.14339
Abstract This paper presents a predictive deep learning framework for dynamic sub-band allocation in Sub-Band Full Duplex (SBFD) systems, addressing the challenge of balancing uplink (UL) and downlink (DL) performance under highly dynamic traffic conditions. The key contribution lies in integrating a hybrid Bidirectional Long Short-Term Memory (Bi-LSTM) model for traffic forecasting with a Double Deep Q-Network (DDQN) for real-time resource allocation. Using both predicted traffic and current queue states, the proposed system enables proactive scheduling based on traffic demand. Evaluation results show that the prediction model achieves high accuracy in capturing bursty traffic patterns, while the DDQN agent effectively adapts UL/DL split ratios according to traffic variations. The framework improves spectrum utilization, reduces queue buildup, and avoids inefficient static configurations. The proposed approach demonstrates that combining predictive intelligence with reinforcement learning significantly enhances the efficiency and adaptability of SBFD systems, making it a strong candidate for autonomous resource management in future 6G networks.
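
Illustrative note, not taken from the paper: the abstract names a Double Deep Q-Network (DDQN) as the allocation agent. The defining step of Double DQN is its bootstrap target, where the online network picks the next action and the target network scores it. The sketch below shows that step on plain lists; the function name, the argument layout, and the example numbers are assumptions for this example.

```python
def ddqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN target: the online net selects, the target net evaluates."""
    if done:
        return reward
    # Action chosen by the online network at the next state.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # Value of that action according to the target network.
    return reward + gamma * q_target_next[best_action]

# Hypothetical next-state Q-values over three candidate UL/DL split ratios.
y = ddqn_target(1.0, False, 0.9, [0.2, 0.8, 0.5], [1.0, 2.0, 3.0])
print(y)
```

Note that the target uses the value 2.0 (the target network's score for the online network's choice), not 3.0 (the target network's own maximum). Decoupling selection from evaluation in this way is what reduces overestimation relative to plain DQN.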

CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

Authors: Yuyang Wu, Stefano Falletta, Delia McGrath, Sherry Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14344
Pdf link: https://arxiv.org/pdf/2605.14344
Abstract Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner, an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrystalReasoner introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrystalReasoner then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrystalReasoner obtains better performance on diverse metrics, triples the S.U.N. ratio, and achieves better performance for property-conditioned generation. CrystalReasoner also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures. Please see our work at this https URL.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

Authors: Nicholas E. Corrado, Wenyuan Huang, Josiah P. Hanna
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14350
Pdf link: https://arxiv.org/pdf/2605.14350
Abstract Multi-task reinforcement learning (MTRL) aims to train a single agent to efficiently optimize performance across multiple tasks simultaneously. However, jointly optimizing all tasks often yields imbalanced learning: agents quickly solve easy tasks but learn slowly on harder ones. While prior work primarily attributes this imbalance to conflicting task gradients and proposes gradient manipulation or specialized architectures to address it, we instead focus on a distinct and under-explored challenge: imbalanced data allocation. Standard MTRL allocates an equal number of environment interactions to each task, which over-allocates data to easy tasks that require relatively few interactions to solve and under-allocates data to hard tasks that require substantially more experience to solve. To address this challenge, we introduce Distributionally Robust Adaptive Task Sampling (DRATS), an algorithm that adaptively prioritizes sampling tasks furthest from being solved. We derive DRATS by formalizing MTRL as a feasibility problem from which we derive a minimax objective for minimizing the worst-case return gap, the difference between a desired target return and the agent's return on a task. In benchmarks like MetaWorld-MT10 and MT50, DRATS improves data efficiency and increases worst-task performance compared to existing task sampling algorithms.
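
Illustrative note, not taken from the paper: one simple way to prioritize tasks that are furthest from being solved is to sample tasks with probability increasing in their return gap (target return minus current return). The sketch below uses a softmax over gaps, which is a generic exponential-weights rule chosen for illustration and not the update that DRATS derives. The function name and the temperature parameter are assumptions.

```python
import math

def task_sampling_probs(returns, targets, temperature=1.0):
    """Softmax over return gaps: a larger gap means the task is sampled more.

    This is a generic exponential-weights rule used for illustration,
    not the update derived in the paper.
    """
    gaps = [max(0.0, t - r) for r, t in zip(returns, targets)]
    m = max(gaps)  # subtract the max for numerical stability
    weights = [math.exp((g - m) / temperature) for g in gaps]
    total = sum(weights)
    return [w / total for w in weights]

# Three tasks with the same target return but different current returns.
probs = task_sampling_probs(returns=[0.9, 0.5, 0.1], targets=[1.0, 1.0, 1.0])
print(probs)
```

Lowering the temperature concentrates sampling on the single worst task, which moves the rule toward a pure worst-case (minimax) allocation; raising it moves the rule back toward uniform sampling.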

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Authors: Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14366
Pdf link: https://arxiv.org/pdf/2605.14366
Abstract Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
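
Illustrative note, not taken from the paper: an embedding-level semantic reward combined with GRPO can be pictured in two steps. Each sampled output is scored by its similarity to a reference in embedding space, and the scores are then standardized within the sampled group to give advantages. The sketch below assumes cosine similarity as the reward and uses hand-written 2-D vectors in place of real sentence embeddings; both are assumptions for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical embeddings: one reference and three sampled outputs.
reference = [1.0, 0.0]
samples = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rewards = [cosine(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the reward depends only on where an output lands in embedding space, two differently worded outputs with the same meaning receive similar rewards, which is the flexibility the abstract contrasts with token-level imitation.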

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

Authors: JB Lanier, Nathan Monette, Pierre Baldi, Roy Fox
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.14379
Pdf link: https://arxiv.org/pdf/2605.14379
Abstract Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.

Energy-Efficient Quadruped Locomotion with Compliant Feet

Authors: Pramod Pal (1), Shishir Kolathaya (2), Ashitava Ghosal (1 and 3) ((1) Indian Institute of Science Bangalore India, (2) Robert Bosch Centre for Cyber Physical Systems Bangalore India, (3) Ahmedabad University Ahmedabad India)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14411
Pdf link: https://arxiv.org/pdf/2605.14411
Abstract Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments done on a developed quadruped, the energy consumption for the intermediate stiffness spring is about 17% lower than for a very stiff or a very flexible spring incorporated in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.
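
Illustrative note, not taken from the paper: the metric named in the abstract, mechanical energy consumed per meter traveled, can be computed from logged joint torques and joint velocities. The sketch below integrates the absolute joint power over time and divides by distance. Counting negative work as a cost through the absolute value is one common convention and an assumption here, as are the function name and the data layout.

```python
def mechanical_energy_per_meter(torques, joint_velocities, dt, distance):
    """Integrated |torque * angular velocity| over all joints, per meter.

    torques, joint_velocities: per-step lists of per-joint values
    (N*m and rad/s). dt is the step length in seconds.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    energy = 0.0
    for tau_step, omega_step in zip(torques, joint_velocities):
        # Absolute joint power summed over joints, times the step length.
        energy += sum(abs(t * w) for t, w in zip(tau_step, omega_step)) * dt
    return energy / distance  # joules per meter

# Two time steps, two joints (made-up numbers).
cost = mechanical_energy_per_meter(
    torques=[[1.0, -2.0], [0.5, 1.0]],
    joint_velocities=[[2.0, 1.0], [2.0, -1.0]],
    dt=0.1,
    distance=0.5,
)
print(cost)
```

Dividing by distance rather than by time is what allows policies trained with different spring stiffness values to be compared even when they walk at slightly different speeds.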

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

Authors: Wen Wang, Xiangchen Wu, Liang Wang, Hao Hu, Xianping Tao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14416
Pdf link: https://arxiv.org/pdf/2605.14416
Abstract The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
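
Illustrative note, not taken from the paper: in route-first cluster-second heuristics, a single "giant tour" over all customers is built first and then cut into capacity-feasible vehicle routes. The cut can be found exactly by dynamic programming over prefixes of the tour. The sketch below shows that classic split step for plain CVRP; whether the paper's dynamic program takes this form is an assumption, and the function name and toy instance are made up for this example.

```python
def split_tour(tour, demand, capacity, dist):
    """Optimally cut a fixed customer order into capacity-feasible routes.

    tour: customer ids in giant-tour order (the depot is node 0, not listed).
    demand: mapping from customer id to demand.
    dist: function (i, j) -> travel cost.
    Returns (total_cost, routes).
    """
    n = len(tour)
    inf = float("inf")
    best = [0.0] + [inf] * n   # best[j]: min cost to serve tour[:j]
    pred = [0] * (n + 1)       # pred[j]: start index of the last route
    for i in range(n):
        if best[i] == inf:
            continue
        load, cost = 0.0, 0.0
        for j in range(i, n):  # candidate route serves tour[i..j]
            load += demand[tour[j]]
            if load > capacity:
                break
            if j == i:
                cost = dist(0, tour[j])
            else:
                cost += dist(tour[j - 1], tour[j])
            total = best[i] + cost + dist(tour[j], 0)
            if total < best[j + 1]:
                best[j + 1], pred[j + 1] = total, i
    routes, j = [], n
    while j > 0:               # walk predecessors back to recover routes
        i = pred[j]
        routes.append(tour[i:j])
        j = i
    return best[n], routes[::-1]

# Toy instance on a line: depot at 0, customers at positions 1, 2 and 10.
pos = {0: 0.0, 1: 1.0, 2: 2.0, 3: 10.0}
cost, routes = split_tour(
    [1, 2, 3], {1: 1, 2: 1, 3: 1}, capacity=2,
    dist=lambda i, j: abs(pos[i] - pos[j]),
)
print(cost, routes)
```

If a single customer's demand exceeds the capacity, no feasible split exists and the returned cost is infinite. For the toy instance, pairing the far customer with its nearest neighbour is cheaper than sending a vehicle to it alone.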

Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL

Authors: Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zhenyu Xie, Junjie Bai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14434
Pdf link: https://arxiv.org/pdf/2605.14434
Abstract Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
中文摘要 生成式检索通过将分散的多阶段检索过程统一为单一的端到端模型，提供了有前景的替代方案。然而，鉴于庞大且动态的商品目录、严格的延迟要求以及需要将检索与下游排序目标对齐的需求，其在工业电子商务搜索中的实际应用仍然充满挑战。本研究提出一个针对真实召回场景的检索框架，将生成式检索定位为召回阶段的补充，而非端到端的替代。我们的方法CQ-SID（类别与查询约束语义ID）采用类别感知和查询-商品对比学习，结合残差量化VAE将商品编码为层次化语义簇标识符，显著降低了束搜索复杂度。此外，我们开发了EG-GRPO（专家引导的组相对策略优化），这是一种强化学习方法，通过注入真实样本稳定训练，使生成式召回在稀疏奖励下与下游排序保持一致。天猫APP搜索日志上的离线实验显示，CQ-SID在语义和个性化点击命中率上相较RQ-VAE基线分别取得最高26.76%和11.11%的相对提升，同时将束搜索规模减半。EG-GRPO进一步提升了多目标性能。在线A/B测试确认了GMV（+1.15%）和UCTCVR（+0.40%）的提升。生成式召回通道现已在生产中贡献显著，占比超过曝光量的50.25%、点击量的58.96%和购买量的72.63%，展示了生成式检索在真实电商系统中部署的可行路径。
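
The abstract above says items are encoded into hierarchical semantic cluster identifiers via residual quantization. As a rough illustration only, the sketch below shows plain residual quantization against fixed codebooks: each level picks the nearest codeword and passes the leftover residual to the next level, so the code sequence reads coarse to fine. The codebooks, the function name `residual_quantize`, and the toy vectors are hypothetical; the paper's learned RQ-VAE and its contrastive training are not reproduced here.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Return one code index per level plus the reconstruction of x."""
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for book in codebooks:  # each book has shape (num_codes, dim)
        # Pick the codeword closest to what is still unexplained.
        dists = np.linalg.norm(book - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += book[idx]
        residual -= book[idx]
    return codes, recon

# Hypothetical toy codebooks: level 0 is coarse, level 1 refines it.
CODEBOOKS = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]

codes_a, recon_a = residual_quantize([10.9, 10.1], CODEBOOKS)
codes_b, recon_b = residual_quantize([10.1, 10.9], CODEBOOKS)
```

In this toy case the two items share the first code (the same coarse cluster) and differ only at the second level, which is the property that lets a decoder prune its beam by prefix.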

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

黑箱大型语言模型中多步推理和工具使用的提示策略，结合经验迭代提炼

Authors: Krishna Sayana, Ketan Todi, Ambarish Jash
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.14443
Pdf link: https://arxiv.org/pdf/2605.14443
Abstract The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.
中文摘要 向与冻结的“黑盒”大型语言模型（LLM）交互的转变，使提示工程从启发式练习转变为关键的优化挑战。我们提出了一个强化学习（RL）框架，通过经验的迭代蒸馏来训练可学习的提示策略。在该架构中，一个轻量级提示器模型被优化，以使更大的、冻结的工作LLM获得最大的任务特定奖励。通过利用将标量奖励与密集文本批评相结合的对比经验缓冲区，我们的方法有效地将迭代式提示改进摊销到单次策略权重中。我们的实验分析聚焦于Big Bench Extra Hard（BBEH）和Tau-bench套件，涵盖多种多步推理和工具使用任务。我们展示了显著的提升：逻辑密集型推理的表现从55%提升到90%，工具使用任务的表现从74%提升到91%。此外，我们分析了提示的结构演变，展示了该策略如何发现专门的算法启发式。我们提供了与GEPA等最先进进化基线的全面比较，表明迭代蒸馏以更高的样本效率实现了更优的性能。

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

LEMON：通过反事实强化学习学习可执行多智能体编排

Authors: Xudong Chen, Yixin Liu, Hua Wei, Kaize Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14483
Pdf link: https://arxiv.org/pdf/2605.14483
Abstract Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at this https URL.
中文摘要 大型语言模型（LLMs）已成为多智能体系统的坚实基础，但其有效性在很大程度上依赖于编排设计。在不同任务中，角色设计、容量分配和依赖构建共同影响解决方案质量和执行效率。现有方法自动化了设计过程的部分内容，但它们通常部分或顺序地优化这些决策，并依赖执行层级的反馈，而这种反馈只能为局部编排决策提供有限的信用分配。我们提出LEMON（Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning，即通过反事实强化学习学习可执行多智能体编排），这是一个基于LLM的编排器，生成可执行的编排规范。该规范将任务特定角色、定制职责、容量水平和依赖结构整合到一个可部署系统中。为了训练编排器，我们通过局部反事实信号增强编排级别的GRPO目标，该信号编辑角色、容量或依赖字段，并仅将由此产生的奖励对比应用于被编辑的片段。在包括MMLU、GSM8K、AQuA、MultiArith、SVAMP和HumanEval在内的六个推理和编码基准测试上的实验表明，LEMON在被评估的多智能体编排方法中实现了最先进的性能。我们的代码可在此 https URL 访问。

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

ROAD：通过双层优化实现离线到在线强化学习的自适应数据混合

Authors: Letian Yang (1), Xu Liu (1), Yiqiang Lu (2), Jian Liu (2), Weiqiang Wang (2), Shuai Li (1) ((1) Shanghai Jiao Tong University, Shanghai, China, (2) Ant Group, Shanghai, China)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14497
Pdf link: https://arxiv.org/pdf/2605.14497
Abstract Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
中文摘要 离线到在线强化学习利用了离线预训练的稳定性和在线微调的灵活性。一个关键挑战在于离线数据集与不断演变的在线策略之间的非平稳分布偏移。常见方法通常依赖静态混合比率或基于启发式的回放策略，这些策略缺乏对不同环境和训练动态的适应性，导致稳定性与渐近性能之间的权衡不理想。本研究提出带有优化自适应数据混合的强化学习（ROAD），这是一种动态即插即用框架，可自动化数据回放过程。我们发现现有方法存在根本的目标不一致。为此，我们将数据选择问题构建为一个双层优化过程，将数据混合策略解释为在线微调期间支配策略性能（外层）的元决策，而传统的Q学习更新则在内层运行。为了使其易于处理，我们提出了一种使用多臂老虎机机制的实用算法。该算法由一个近似双层梯度的替代目标引导，该目标同时保持离线先验并防止价值高估。我们的实证结果表明，该方法在不同数据集上持续优于现有数据回放方法，无需手动、针对特定情境进行调整，同时实现了更优的稳定性和渐近性能。
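
The abstract above describes choosing the offline/online data mix with a multi-armed bandit. The sketch below is only a generic stand-in for that idea: a UCB1 bandit picks among a fixed menu of offline-data fractions, and each batch is drawn according to the chosen fraction. The class `MixRatioBandit`, the helper `sample_batch`, the candidate ratios, and especially the placeholder reward are assumptions; the paper's surrogate objective approximating the bi-level gradient is not specified in the abstract and is not implemented here.

```python
import math
import random

class MixRatioBandit:
    """UCB1 over a fixed menu of offline-data fractions."""

    def __init__(self, ratios, explore=0.5):
        self.ratios = list(ratios)
        self.explore = explore
        self.counts = [0] * len(self.ratios)
        self.means = [0.0] * len(self.ratios)

    def select(self):
        # Try every arm once before trusting the confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            m + self.explore * math.sqrt(2.0 * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental running mean of the feedback seen for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def sample_batch(offline, online, ratio, batch_size, rng):
    """Draw a batch with roughly `ratio` of it from the offline buffer."""
    n_off = round(ratio * batch_size)
    return rng.choices(offline, k=n_off) + rng.choices(online, k=batch_size - n_off)


rng = random.Random(0)
bandit = MixRatioBandit([0.0, 0.25, 0.5, 0.75])
for _ in range(200):
    arm = bandit.select()
    batch = sample_batch(["off"] * 10, ["on"] * 10, bandit.ratios[arm], 8, rng)
    # Placeholder feedback: pretend a 0.25 offline share works best.
    reward = 1.0 - abs(bandit.ratios[arm] - 0.25) + rng.gauss(0.0, 0.05)
    bandit.update(arm, reward)
```

In a real training loop the placeholder reward would be replaced by whatever signal measures the benefit of the last update; the bandit itself does not depend on how that signal is computed.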

Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

通过深度强化学习实现无桩共享单车系统的全动态再平衡

Authors: Edoardo Scarpel, Alberto Pettena, Matteo Cederle, Federico Chiariotti, Marco Fabris, Gian Antonio Susto
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14501
Pdf link: https://arxiv.org/pdf/2605.14501
Abstract This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.
中文摘要 本文提出了一种完全动态的深度强化学习（DRL）方法，用于无桩共享单车系统的再平衡，克服了周期性、全系统范围干预的局限性。我们通过基于图的模拟器对服务建模，并将再平衡问题表述为马尔可夫决策过程。DRL智能体实时调度单辆卡车，在时空关键性评分的引导下执行局部的取车、投放和充电动作。基于真实世界数据的实验显示，在最小车队规模下，可用性失效显著减少，同时限制了空间不平等和出行荒漠。我们的方法展示了基于学习的再平衡对于高效且可靠的共享微出行的价值。

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

从失败中学习：基于可验证奖励的纠错导向策略优化

Authors: Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.14539
Pdf link: https://arxiv.org/pdf/2605.14539
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的有效范式。然而，RLVR训练常常受限于稀疏的二元奖励和较弱的信用分配，导致优化信号模糊，失败轨迹中蕴含的有用信息未能充分利用。为应对这一挑战，我们提出了修正导向策略优化（CIPO），这是一种简单有效的RLVR扩展，将策略失败的轨迹转换为纠正导向的监督，无需依赖任何外部信号。通过结合模型自身失败尝试得出的修正样本与标准RLVR目标共同优化，CIPO提高了学习效果，同时明确增强模型纠正自身错误的能力。涵盖数学推理和代码生成的11个基准测试的广泛实验表明，CIPO在推理和纠错性能上始终且显著地优于强基线。此外，CIPO带来更强的pass@K收益，表明它提升了模型的内在推理能力，而不仅仅是将概率质量重新分配到已有正确答案上。

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

解决动作瓶颈：基于词元级能量的智能体强化学习

Authors: Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu, Wei-Chieh Huang, Henry Peng Zou, David Wipf, Philip S. Yu, Qitian Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.14558
Pdf link: https://arxiv.org/pdf/2605.14558
Abstract Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
中文摘要 智能体强化学习使用多轮轨迹训练大型语言模型，这些轨迹将长推理轨迹与面向环境的短动作交错在一起。常见的策略梯度方法，如PPO和GRPO，对轨迹中的每个词元（token）一视同仁，从而导致均匀的信用分配。本文批判性地证明，这种均匀的信用分配在很大程度上错误分配了词元级训练信号。从基于能量的建模角度来看，我们表明，词元级训练信号（以其与同一提示下采样的不同rollout的奖励方差之间的相关性来量化）高度集中在动作词元而非推理词元上，尽管动作词元只占轨迹的一小部分。我们称这种现象为动作瓶颈（Action Bottleneck）。基于这一观察，我们提出了一种极其简单的词元重加权方法ActFocus，它降低推理词元上的梯度权重，并结合一种基于能量的再分配机制，进一步提高不确定性较高的动作词元的权重。在四种环境和不同模型规模下，ActFocus始终优于PPO和GRPO，最终步的提升分别最高达65.2和63.7个百分点，且无任何额外的运行时间或内存开销。
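
The abstract above describes downweighting gradients on reasoning tokens relative to action tokens. A minimal sketch of that reweighting, under stated assumptions, is given below: a REINFORCE-style surrogate loss in which action tokens keep weight 1 and reasoning tokens are scaled by a constant. The function name `reweighted_pg_loss`, the `reasoning_weight` parameter, and the normalization by the weight sum are illustrative choices, and the paper's energy-based redistribution over action tokens is omitted because the abstract does not define it.

```python
import numpy as np

def reweighted_pg_loss(logprobs, advantage, is_action, reasoning_weight=0.1):
    """Policy-gradient surrogate with per-token weights (hypothetical helper)."""
    logprobs = np.asarray(logprobs, dtype=float)
    is_action = np.asarray(is_action, dtype=bool)
    # Action tokens keep weight 1; reasoning tokens are scaled down.
    weights = np.where(is_action, 1.0, reasoning_weight)
    # Surrogate: -(advantage * log-prob), averaged by total weight.
    return float(-(weights * advantage * logprobs).sum() / weights.sum())

# Two reasoning tokens followed by one action token.
token_logprobs = [-1.0, -1.0, -2.0]
action_mask = [False, False, True]

uniform = reweighted_pg_loss(token_logprobs, 1.0, action_mask, reasoning_weight=1.0)
focused = reweighted_pg_loss(token_logprobs, 1.0, action_mask, reasoning_weight=0.5)
```

With `reasoning_weight=1.0` the function reduces to the uniform per-token average, which makes it easy to compare the two settings on the same trajectory.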

Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning

天使还是恶魔：探讨可塑性干预对深度强化学习中后门威胁的影响

Authors: Oubo Ma, Ruixiao Lin, Yang Dai, Jiahao Chen, Chunyi Zhou, Linkang Du, Shouling Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2605.14587
Pdf link: https://arxiv.org/pdf/2605.14587
Abstract Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.
中文摘要 大量研究表明，后门攻击对深度强化学习（DRL）构成了严重威胁。然而，以往的研究主要聚焦于普通场景，而可塑性干预已成为现代DRL智能体不可或缺的内置组件。尽管这些干预措施在减轻可塑性损失方面有效，但其对DRL后门漏洞的影响仍未被充分探讨，这种系统性研究的缺失给DRL的实际部署带来了风险。为弥合这一差距，我们实证研究了14,664个案例，整合了代表性干预与攻击场景。我们发现，只有一种干预措施（即SAM）会加剧后门威胁，而其他干预则能缓解这些威胁。病理分析表明，加剧归因于后门梯度放大，而缓解则源于激活通路破坏和表示空间压缩。基于这些发现，我们得出了两个新颖见解：（1）用于稳健后门注入的概念框架SCC，它解析了DRL中干预与后门之间的机制性相互作用；（2）异常的损失景观锐度可作为DRL后门检测的关键指标。

Fast Rates for Inverse Reinforcement Learning

逆强化学习的快速收敛率

Authors: Andreas Schlaginhaufen, Maryam Kamgarpour
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.14599
Pdf link: https://arxiv.org/pdf/2605.14599
Abstract We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
中文摘要 我们针对具有Borel状态和动作空间的有限时域MDP中采用线性奖励类的熵正则化最小-最大逆强化学习（Min-Max-IRL），建立了新的结构性和统计性结果。在结构层面，我们证明了最大似然估计（MLE）与Min-Max-IRL在总体层面等价，并且在确定性动力学下在经验层面也等价。在统计层面，利用Min-Max-IRL损失的伪自协调性，我们证明轨迹级KL散度和Hessian范数下的平方参数误差均以快速率$\mathcal{O}(n^{-1})$衰减，其中$n$为专家轨迹数。我们的保证在模型误设下依然成立，且无需任何探索假设。我们进一步将奖励可识别性结果扩展到一般Borel空间，并推导出软最优值函数关于奖励参数的导数的新结果。

DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes

DRL-STAF：一种用于复杂多变量隐马尔可夫过程状态感知预测的深度强化学习框架

Authors: Manrui Jiang, Jingru Huang, Yong Chen, Chen Zhang
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2605.14632
Pdf link: https://arxiv.org/pdf/2605.14632
Abstract Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.
中文摘要 由于非线性和非平稳观测、潜态转变以及跨序列依赖性，预测多变量隐马尔可夫过程具有挑战性。虽然深度学习方法具有较强的预测精度，但通常缺乏显式状态建模，而隐马尔可夫模型（HMMs）则提供可解释的潜在状态，但在复杂的非线性发射和可扩展性方面存在困难。为解决这些局限性，我们提出了基于深度强化学习的状态感知预测框架DRL-STAF，能够联合预测下一步观测并估计复杂多变量隐马尔可夫过程的对应隐态。具体来说，DRL-STAF利用深度神经网络建模复杂的非线性发射，并通过强化学习估计离散隐态，减少对预定义过渡结构的依赖，实现对多样时间动态的灵活适应。特别是，DRL-STAF减轻了典型多变量基于HMM方法所遇到的状态空间爆炸。大量实验表明，DRL-STAF在大多数情况下优于HMM变体、独立深度学习模型和现有的DL-HMM混合模型，同时还能提供可靠的隐态估计。

Multi-objective application placement in fog computing using graph neural network-based reinforcement learning

利用基于图神经网络的强化学习，雾计算中的多目标应用部署

Authors: Isaac Lera, Carlos Guerrero
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.14649
Pdf link: https://arxiv.org/pdf/2605.14649
Abstract We propose a framework designed to tackle a multi-objective optimization challenge related to the placement of applications in fog computing, employing a deep reinforcement learning (DRL) approach. Unlike other optimization techniques, such as integer linear programming or genetic algorithms, DRL models are applied in real time to solve similar problem situations after training. Our model comprises a learning process featuring a graph neural network and two actor-critics, providing a holistic perspective on the priorities concerning interconnected services that constitute an application. The learning model incorporates the relationships between services as a crucial factor in placement decisions: Services with higher dependencies take precedence in location selection. Our experimental investigation involves illustrative cases where we compare our results with baseline strategies and genetic algorithms. We observed a comparable Pareto set with negligible execution times, measured in the order of milliseconds, in contrast to the hours required by alternative approaches.
中文摘要 我们提出了一个框架，旨在解决与雾计算应用布局相关的多目标优化挑战，采用深度强化学习（DRL）方法。与其他优化技术（如整数线性规划或遗传算法）不同，DRL模型在训练后实时应用于解决类似问题。我们的模型包含一个学习过程，包含一个图神经网络和两个actor-critic，提供了关于构成应用的互联服务优先级的整体视角。学习模型将服务之间的关系纳入放置决策的关键因素：依赖性较高的服务在位置选择中优先。我们的实验研究涉及示例案例，将结果与基线策略和遗传算法进行比较。我们观察到一个可比的帕累托集合，执行时间可忽略不计，测量为毫秒级，这与其他方法所需的小时形成对比。

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

通过与临床世界模型互动，将LLM中的患者动态具体化

Authors: Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14723
Pdf link: https://arxiv.org/pdf/2605.14723
Abstract Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
中文摘要 ICU中的败血症管理需要在快速变化的患者生理状况下做出序贯治疗决策。虽然大型语言模型（LLMs）编码了广泛的临床知识，并能基于指南进行推理，但它们本身并未扎根于以动作为条件的患者动态。我们介绍SepsisAgent，一个由世界模型增强的LLM智能体，用于败血症治疗推荐。SepsisAgent使用学习得到的临床世界模型模拟患者在候选液体-血管升压药干预下的反应，并在确定处方之前遵循“提议-模拟-优化”的工作流程。我们首先证明，仅提供世界模型访问会导致LLM决策表现不稳定，这促使我们进行针对智能体的训练。随后，我们通过三阶段课程训练SepsisAgent：患者动态监督微调、“提议-模拟-优化”行为克隆，以及基于世界模型的智能体强化学习。在MIMIC-IV败血症轨迹上，SepsisAgent在离策略价值上优于所有传统强化学习和基于LLM的基线，同时在指南遵循和不安全动作指标下实现了最佳安全性。进一步分析显示，与临床世界模型的反复交互使智能体能够学习患者演变的规律，即使移除模拟器访问，这些规律依然有用。

Addressing Terminal Constraints in Data-Driven Demand Response Scheduling

数据驱动需求响应调度中的终端限制问题解决

Authors: Maximilian Bloor, Martha White, Ehecatl Antonio del Rio Chanona, Calvin Tsay
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14741
Pdf link: https://arxiv.org/pdf/2605.14741
Abstract Electrified chemical processes are incentivized by exposure to time-varying electricity markets to operate flexibly, but participating in demand response schemes can require satisfying terminal constraints over long horizons. Specifically, terminal constraints may be required when computing optimal schedules in order to preserve dynamic stability. Model-based optimization methods are computationally costly, and data-driven scheduling via reinforcement learning (RL) faces severe credit-assignment challenges. We integrate Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), using learned temporally abstract models over discrete subgoals to propagate value across extended horizons. Using a simulated air separation benchmark, we demonstrate the proposed approach improves sample efficiency over standard DDPG while satisfying terminal storage constraints, mitigating myopic control behavior.
中文摘要 电气化化工过程因面对时变电力市场而有灵活运行的动力，但参与需求响应方案可能需要在较长时域上满足终端约束。具体来说，在计算最优调度时可能需要终端约束以保持动态稳定性。基于模型的优化方法计算成本高，而通过强化学习（RL）进行数据驱动的调度则面临严重的信用分配挑战。我们将目标空间规划（GSP）与深度确定性策略梯度（DDPG）相结合，利用在离散子目标上学习到的时间抽象模型，在较长时域上传播价值。利用模拟的空气分离基准，我们证明了所提方法相比标准DDPG提高了样本效率，同时满足终端存储约束，减轻了短视的控制行为。

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

EARL：迈向统一的分析引导强化学习框架，用于第一人称视角交互推理和像素级定位

Authors: Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.14742
Pdf link: https://arxiv.org/pdf/2605.14742
Abstract Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
中文摘要 从自我中心的视觉理解人类与环境的互动对于辅助机器人和具身智能体至关重要，然而现有的多模态大型语言模型（MLLM）在准确的交互推理和细粒度像素基础方面仍存在困难。为此，本文介绍了EARL，一种以自我为中心分析为导向的强化学习框架，明确将粗交互语义转化为面向查询的回答和基础。具体来说，EARL采用了包括粗粒度解释和细粒度响应的两阶段解析框架。第一阶段整体解读以自我为中心的互动，生成结构化的文本描述。第二阶段根据用户查询生成文本答案和像素级掩码。为连接这两个阶段，我们提取了一个全局交互描述符作为语义先验，并通过一种新型分析引导特征合成器（AFS）集成，用于查询导向推理。为了优化异构输出，包括文本答案、边界框和接地掩码，我们设计了一个多方面的奖励函数，并用GRPO训练反应阶段。Ego-IRGBench 上的实验显示，EARL 在像素接地方面实现了 65.48% 的 cIoU，比以往基于强化学习的方法高出 8.37%，而 EgoHOS 上的 OOD 接地结果显示出对看不见的自我中心接地场景的强烈可迁移性。

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

单智能体和多智能体强化学习中循环神经网络的概率验证

Authors: Luca Marzari, Enrico Marchesini
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14758
Pdf link: https://arxiv.org/pdf/2605.14758
Abstract History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (RNN-ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN-based policies. RNN-ProVe uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
中文摘要 循环神经网络（RNN）诱导的历史依赖策略依赖潜在的隐状态动态，使得部分可观测强化学习（RL）中的验证具有挑战性。现有的RNN验证工具通常依赖限制性的建模假设或对隐状态空间的粗糙过近似，这可能导致结果过于保守或无法得出结论。我们提出了RNN Probabilistic Verification（RNN-ProVe），这是一个概率框架，用于估计基于RNN的策略中出现不良行为的可能性。RNN-ProVe利用策略驱动采样来近似在已训练策略下可行的隐状态集合，并推导统计误差界，从而对行为违规给出误差有界的高置信度估计。在部分可观测的单智能体和协作多智能体任务上的实验表明，RNN-ProVe比现有工具提供了更定量、考虑可行性的概率保证，同时可扩展到循环和多智能体场景。
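
The abstract above mentions statistical error bounds that yield bounded-error, high-confidence estimates of violation probability. The abstract does not say which bound the authors use, so the sketch below substitutes the standard Hoeffding inequality as an assumption: with n >= ln(2/delta) / (2 eps^2) independent samples, the empirical violation rate is within eps of the true rate with probability at least 1 - delta. The names `samples_needed` and `estimate_violation_rate` are hypothetical, and the policy rollout is replaced by a stub that returns a random boolean.

```python
import math
import random

def samples_needed(eps, delta):
    """Hoeffding sample size for error <= eps with confidence >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_violation_rate(sample_is_violation, eps=0.05, delta=0.01):
    """Monte Carlo estimate of the violation probability and the sample count used."""
    n = samples_needed(eps, delta)
    hits = sum(1 for _ in range(n) if sample_is_violation())
    return hits / n, n

# Stub standing in for "roll out the policy and check the property".
rng = random.Random(0)
rate, n_used = estimate_violation_rate(lambda: rng.random() < 0.2)
```

The guarantee only holds if the samples are independent draws from the distribution of interest; how feasible hidden states are sampled under the trained policy is the part this stub leaves out.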

Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning

Peng 的 Q（$λ$）用于离线强化学习中的保守价值估计

Authors: Byeongchan Kim, Min-hwan Oh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.14779
Pdf link: https://arxiv.org/pdf/2605.14779
Abstract We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL). Our algorithm adapts the Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at this https URL.
中文摘要 我们提出了一种无模型的离线多步强化学习（RL）算法，即保守Peng's Q($\lambda$)（CPQL）。我们的算法将Peng's Q($\lambda$)（PQL）算子用于保守价值估计，作为贝尔曼算子的替代方案。据我们所知，这是离线强化学习中首个通过充分利用离线轨迹，从理论和实证上证明采用多步算子进行保守价值估计的有效性的工作。离线强化学习中PQL算子的不动点更接近行为策略的价值函数，因此自然诱导隐式行为正则化。CPQL同时减轻了过度悲观的价值估计，实现了优于（或等于）行为策略的性能，并提供了接近最优的性能保证，这是以往保守方法未能实现的里程碑。D4RL基准测试上的大量数值实验表明，CPQL持续且显著地优于现有的离线单步基线。除了CPQL在离线强化学习中的贡献外，我们提出的方法还为离线到在线学习框架做出了贡献。在离线环境中使用CPQL预训练的Q函数，使在线PQL智能体能够避免微调初期通常观察到的性能下降，并实现稳健的性能提升。我们的代码可在此 https URL 访问。
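
For readers unfamiliar with the multi-step operator named in the abstract above, the sketch below computes plain Peng's Q($\lambda$) targets along one stored trajectory, using the usual backward recursion G_t = r_t + gamma * ((1 - lambda) * max_a Q(s_{t+1}, a) + lambda * G_{t+1}). This shows only the uncorrected multi-step return; the conservative penalty that distinguishes CPQL is not described in the abstract and is not included. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_max_q, dones, gamma=0.99, lam=0.7):
    """Peng's Q(lambda) targets for one trajectory.

    rewards[t]    : reward received at step t
    next_max_q[t] : max_a Q(s_{t+1}, a) from the current Q estimate
    dones[t]      : True if s_{t+1} is terminal
    """
    T = len(rewards)
    targets = np.zeros(T)
    next_return = next_max_q[-1]  # bootstrap value past the last stored step
    for t in reversed(range(T)):
        if dones[t]:
            targets[t] = rewards[t]  # no bootstrapping across episode ends
        else:
            mixed = (1.0 - lam) * next_max_q[t] + lam * next_return
            targets[t] = rewards[t] + gamma * mixed
        next_return = targets[t]
    return targets

# lam = 1 with a terminal last step gives discounted Monte Carlo returns.
mc = peng_q_lambda_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                           [False, False, True], gamma=0.5, lam=1.0)
# lam = 0 gives ordinary one-step Q-learning targets.
one_step = peng_q_lambda_targets([1.0, 1.0, 1.0], [2.0, 2.0, 2.0],
                                 [False, False, False], gamma=0.5, lam=0.0)
```

The two limiting cases make the role of lambda visible: it interpolates between one-step bootstrapped targets and full trajectory returns.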

CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments

CaMeRL：多尺度障碍环境中无人机导航的碰撞感知与记忆增强强化学习

Authors: Hong Hong, Feiyu Liao, Yongheng Liang, Boning Zhang, Haitao Wang, Hejun Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.14810
Pdf link: https://arxiv.org/pdf/2605.14810
Abstract In obstacle avoidance navigation of unmanned aerial vehicles (UAVs), variations in obstacle scale have received strangely less attention than obstacle number or density. Existing methods typically extract purely geometric features from single-frame depth observations. Such representations tend to neglect small obstacles and lose spatial context under occlusions caused by large obstacles, leading to noticeable degradation in environments with multi-scale obstacles. To address this issue, we propose CaMeRL, a Collision-aware and Memory-enhanced Reinforcement Learning framework for UAV navigation. The collision-aware latent representation encodes risk-sensitive depth cues to preserve fine-grained obstacle structures, thereby improving sensitivity to small obstacles. The temporal memory module integrates observations across frames, mitigating partial observability caused by large-obstacle occlusions. We evaluate CaMeRL with multi-scale obstacles, including ultra-small and extra-large obstacle settings. Results show that CaMeRL outperforms state-of-the-art baselines across all scales, with success rate gains of 0.48 and 0.28 in the ultra-small and extra-large settings, respectively. More importantly, CaMeRL achieves reliable navigation in cluttered outdoor environments.
中文摘要 在无人机（UAV）的障碍物规避导航中，障碍物尺度的变化反而比障碍物数量或密度受到的关注更少。现有方法通常从单帧深度观测中提取纯几何特征。此类表示往往忽视小障碍物，并在大障碍物造成的遮挡下失去空间语境，导致多尺度障碍物环境中明显退化。为解决这一问题，我们提出了CaMeRL，一种无人机导航的碰撞感知和内存增强强化学习框架。碰撞感知的潜在表征编码风险敏感深度线索，以保持细粒度障碍物结构，从而提高对小障碍物的灵敏度。时间记忆模块整合跨帧的观测数据，减少了因大型障碍物遮挡而产生的部分可观测性。我们通过多尺度障碍物评估CaMeRL，包括超小型和超大型障碍物设置。结果显示，CaMeRL在所有量表中均优于最先进的基线，在超小和超大量条件下成功率分别提升0.48和0.28。更重要的是，CaMeRL在拥挤的户外环境中实现了可靠的导航。

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

通过闭环验证推理解锁复杂的视觉生成

Authors: Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14876
Pdf link: https://arxiv.org/pdf/2605.14876
Abstract Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $\Delta$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
中文摘要 尽管技术进步迅速，当前的文本转图像（T2I）模型主要依赖单步生成范式，这种范式在复杂语义上存在困难，且参数尺度带来的收益递减。尽管近期多步推理方法展现出前景，但它们受到缺乏验证的无根据规划幻觉、单一的事后反思、长上下文优化不稳定性以及难以抑制的推理延迟所阻碍。为克服这些瓶颈，我们提出了闭环视觉推理（CLVR）框架，这是一个将视觉语言逻辑规划与像素级扩散生成深度结合的综合系统。CLVR引入了自动化数据引擎，支持步级视觉验证，以综合可靠的推理轨迹，并提出代理提示强化学习（PPRL）通过将交错多模态历史提炼为显式奖励信号，解决长期上下文优化的不稳定性，从而准确归因。此外，为了缓解迭代去噪带来的严重延迟瓶颈，我们提出了$\Delta$-空间权重合并（DSWM），这是一种理论上有基础的方法，将比对权重与现成蒸馏先验融合，将每步推断成本降低至仅4个NFE，无需昂贵的重新蒸馏。大量实验表明，CLVR在多个基准测试中优于现有开源基线，并接近专有商业模型的性能，解锁了复杂视觉生成的通用测试时间缩放能力。

Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

Critic驱动的Voronoi量化，用于将深度强化学习策略蒸馏为可解释模型

Authors: Senne Deproost, Denis Steckelmacher, Ann Nowé
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14897
Pdf link: https://arxiv.org/pdf/2605.14897
Abstract Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy, while other RL-specific components such as the action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black-box control policy into regions where a simple class of models can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, which stands in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest-neighbor lookups to assign a linear function to each point in the state space, resulting in a cell-like diagram. We validate our approach on several well-known benchmarks and show that this distillation approaches the original policy using a reasonably sized set of linear functions.
中文摘要 尽管已有许多利用蒸馏解释深度强化学习策略的成功尝试，但平衡性能与可解释性之间的权衡并选择合适的替代模型仍然困难。此外，传统蒸馏仅最小化原始策略与替代策略行为之间的距离，而忽略了动作价值等其他强化学习特有的组成部分。为此，我们引入了一种新的模型无关方法，称为Critic驱动的Voronoi状态划分（Critic-Driven Voronoi State Partitioning），它将黑箱控制策略划分为若干区域，在每个区域内可用梯度下降优化一类简单模型。通过利用原始策略的critic价值网络，我们在价值不足的区域迭代引入新的子策略，并以价值不足作为策略复杂度的度量。该划分是一个Voronoi量化器，利用最近邻查找为状态空间中的每个点分配一个线性函数，形成单元格状的图。我们在多个知名基准测试上验证了我们的方法，并表明该蒸馏方法使用规模合理的一组线性函数即可逼近原始策略。
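
The nearest-neighbor lookup described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the centroids, the per-cell linear policies, and the function names `nearest_cell` and `voronoi_policy` are hypothetical, and the critic-driven step that decides where to add new cells is left out.

```python
import math

def nearest_cell(state, centroids):
    """Return the index of the centroid closest to `state` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(state, centroids[i]))

def voronoi_policy(state, centroids, linear_policies):
    """Evaluate the linear sub-policy a = W s + b owned by the state's Voronoi cell."""
    W, b = linear_policies[nearest_cell(state, centroids)]
    return [sum(w * s for w, s in zip(row, state)) + b_i for row, b_i in zip(W, b)]

# Two hypothetical cells in a 2-D state space, each with a 1-D action.
centroids = [(0.0, 0.0), (1.0, 1.0)]
linear_policies = [
    ([[1.0, 0.0]], [0.0]),   # cell 0: a = s[0]
    ([[0.0, -1.0]], [0.5]),  # cell 1: a = 0.5 - s[1]
]
action_near_origin = voronoi_policy((0.1, 0.2), centroids, linear_policies)
action_near_corner = voronoi_policy((0.9, 0.8), centroids, linear_policies)
```

Each state is routed to exactly one cell, so the surrogate stays piecewise linear and every cell can be inspected on its own.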

Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework

Chrono-Gymnasium：一个开源、兼容Gymnasium的分布式模拟框架

Authors: Bocheng Zou, Harry Zhang, Khailanii Slaton, Jingquan Wang, Derrick Ruan, Huzaifa Mustafa Unjhawala, Radu Serban, Dan Negrut
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.14911
Pdf link: https://arxiv.org/pdf/2605.14911
Abstract High-fidelity physics simulation is essential for closing the sim-to-real gap in robotics and complex mechanical systems. However, the computational overhead of high-fidelity engines often limits their use in data-intensive tasks like Reinforcement Learning (RL) and global optimization. We introduce Chrono-Gymnasium, a distributed computing framework that scales the high-fidelity multi-body dynamics of Project Chrono across large-scale computing clusters. Built upon the Ray framework, Chrono-Gymnasium provides a standardized Gymnasium interface, enabling seamless integration with modern machine learning libraries while providing built-in synchronization and messaging primitives for distributed execution. We demonstrate the framework's capabilities through two distinct case studies: (1) the training of an RL agent for autonomous robotic navigation in complex terrains, and (2) the Bayesian Optimization of a planetary lander's design parameters to ensure landing stability. Our results show that Chrono-Gymnasium reduces wall-clock time for high-fidelity simulations without sacrificing physical accuracy, offering a scalable path for the design and control of complex robotic systems.
中文摘要 高精度物理仿真对于缩小机器人与复杂机械系统中模拟与现实之间的差距至关重要。然而，高保真引擎的计算开销常常限制了它们在强化学习（RL）和全局优化等数据密集型任务中的应用。我们介绍了Chrono-Gymnasium，一个分布式计算框架，将Project Chrono的高保真多体动力学扩展到大规模计算集群。基于Ray框架，Chrono-Gymnasium提供标准化的Gymnasium接口，实现与现代机器学习库的无缝集成，同时内置同步和消息传递原语支持分布式执行。我们通过两个不同的案例研究展示了该框架的能力：（1）训练一个强化学习代理以实现复杂地形中自主机器人导航，以及（2）行星着陆器设计参数的贝叶斯优化以确保着陆稳定性。我们的结果表明，Chrono-Gymnasium在不牺牲物理精度的前提下，为高保真模拟缩短了墙上时钟时间，为复杂机器人系统的设计和控制提供了可扩展的路径。

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Slot-MPC：目标条件模型预测控制，基于对象为中心的表示

Authors: Jonathan Spieler, Angel Villar-Corrales, Sven Behnke
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.14937
Pdf link: https://arxiv.org/pdf/2605.14937
Abstract Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at this https URL.
中文摘要 预测世界模型使智能体能够建模场景动态，并推理其行为的后果。受人类感知启发，以对象为中心的世界模型利用对象级表示捕捉场景动态，这些模型可用于后续应用，如行动规划。然而，大多数以对象为中心的世界模型和强化学习（RL）方法学习的反应式策略在推理时是固定的，限制了对新情境的泛化。我们提出了Slot-MPC，一种以对象为中心的世界建模框架，通过模型预测控制（MPC）实现规划。Slot-MPC利用视觉编码器学习基于槽的表示，这些表示编码场景中的单个对象，并利用这些结构化表示学习一个动作条件的以对象为中心的动力学模型。在推理阶段，学习到的动力学模型通过MPC实现行动规划，使智能体能够适应此前未曾见过的情境。由于学习到的世界模型是可微的，我们可以使用基于梯度的MPC直接优化动作，这比依赖无梯度、基于采样的MPC方法在计算上更高效。模拟机器人操作任务的实验表明，与非对象为中心的世界模型基线相比，Slot-MPC在任务表现和规划效率上都有所提升。在状态动作覆盖有限的离线环境下，我们发现基于梯度的MPC表现优于无梯度、基于采样的MPC。我们的结果表明，明确结构化、以对象为中心的表征为可控且可推广的决策提供了强烈的归纳偏向。代码和更多结果可在此 https URL 获取。
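
To make the idea of gradient-based MPC concrete, here is a toy sketch under strong assumptions. The one-dimensional dynamics `x + a` stand in for the learned, differentiable world model, the quadratic cost and all hyperparameters are hypothetical, and the gradient is written out by hand instead of using automatic differentiation. It is not the Slot-MPC planner.

```python
def rollout(x0, actions):
    """Roll the toy dynamics x_{t+1} = x_t + a_t forward and return the final state."""
    x = x0
    for a in actions:
        x = x + a
    return x

def plan_actions(x0, goal, horizon=5, steps=200, lr=0.05, reg=0.01):
    """Gradient descent on the action sequence.

    Cost: (x_H - goal)^2 + reg * sum(a_t^2), so dJ/da_t = 2 (x_H - goal) + 2 reg a_t.
    """
    actions = [0.0] * horizon
    for _ in range(steps):
        error = rollout(x0, actions) - goal
        actions = [a - lr * (2 * error + 2 * reg * a) for a in actions]
    return actions

plan = plan_actions(x0=0.0, goal=1.0)
final_state = rollout(0.0, plan)
```

In a receding-horizon loop only the first planned action would be executed before replanning from the newly observed state.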

Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication

并非所有符号都相同：语义传播中的重要性感知星座设计

Authors: Albert Shaju, Christo Kurisummoottil Thomas, Mayukh Roy Chowdhury
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2605.14940
Pdf link: https://arxiv.org/pdf/2605.14940
Abstract Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. Contrary to this, in this paper, a joint semantic-physical layer framework is proposed, which is composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware $M$-QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard $M$-QAM, which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
中文摘要 用于目标导向传输的语义通信系统不仅要通过源压缩保护任务相关信息，还要通过物理层映射来保护。现有方法将星座设计和语义编码解耦，使关键符号与无关符号同样容易遭受信道错误。与此相反，本文提出了一个联合语义-物理层框架，由一个向量量化变分自编码器（提取离散潜在概念）、一个语义关键性指示器（SCI）按任务相关性评分每个概念，以及一个基于瞬时信道条件动态选择传输子集的深度强化学习代理组成。在物理层，一个学习到的语义感知型 M -QAM 星座根据联合共现统计和 SCI 分数分配符号位置，这与标准 M -QAM 的均匀间距和灰码不同，后者在不考虑语义内容的情况下最小化平均 BER。我们引入了一种新颖的语义符号脆弱性（SSV）指标和语义保护概率（SPP），以量化任务关键符号暴露于解码错误的风险，并证明任何灰码星座在SCI加权SSV中，只要源表现出非均匀的语义重要性和共现统计量，则严格次优。模拟结果表明，所提星座在调制阶数从4-QAM到1024-QAM的范围内，SPP接近100%，而标准星座在高谱效率下为50%，压缩比为21：1，语义质量高于0.9，且可推广至MNIST、Fashion-MNIST和FSDD且无需修改。

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

基于性能的策略优化，用于自适应窗口处理的推测性解码

Authors: Jie Jiang, Xing Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.14978
Pdf link: https://arxiv.org/pdf/2605.14978
Abstract Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
中文摘要 推测解码通过让轻量级草稿模型提出候选词元的推测窗口，供更大的目标模型并行验证，从而加速LLM推理。在实践中，推测效率常常受限于难以起草的位置：早期的一次不匹配会截断被接受的前缀，并使推测窗口的其余部分失效。大多数基于学习的草稿模型仍然采用词元级监督目标进行优化，尽管推测效用本质上是窗口级且对前缀敏感的。我们提出了PPOW（Performance-Driven Policy Optimization with Adaptive Windowing，带自适应窗口的性能驱动策略优化），这是一种强化学习框架，将草稿模型的优化从词元级模仿转向窗口级优化。PPOW结合了成本感知加速奖励、基于分布的邻近奖励和自适应散度感知窗口机制，后者优先选择置信度加权的草稿-目标散度较高的信息丰富窗口。在统一解码协议下，PPOW在多个模型系列和基准测试中实现了6.29-6.52的平均接受长度，以及3.39-4.36$\times$的加速。这些结果表明，性能驱动的窗口级优化是提高推测解码效率的实用方法。
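
The prefix sensitivity mentioned in this abstract can be seen in a small sketch. It assumes a simplified greedy verifier that accepts a drafted token only when it equals the target model's token at the same position. Practical systems often use probabilistic acceptance instead, and the function name and token lists below are hypothetical.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count draft tokens accepted before the first mismatch with the target."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # an early mismatch invalidates the rest of the window
        accepted += 1
    return accepted

draft = ["the", "cat", "sat", "on", "a", "mat"]
target = ["the", "cat", "lay", "on", "a", "mat"]
n_accepted = accepted_prefix_length(draft, target)
```

Only the first two tokens are accepted here, even though five of the six drafted tokens agree with the target. This is why a token-level accuracy objective can misjudge the real speedup of a drafter.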

Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

基于策略Hessian分解的折扣MDP二阶actor-critic方法

Authors: Sanjeev Manivannan, Shuban V
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.14982
Pdf link: https://arxiv.org/pdf/2605.14982
Abstract We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
中文摘要 我们讨论强化学习（RL）中的折扣奖励设置。为了缓解策略梯度方法中的值近似挑战，已经开发了actor-critic方法，并且已知在适当假设下会收敛到平稳点。然而，这些方法依赖于一阶更新。相比之下，二阶优化提供了原则性的曲率感知更新，已被证明能加速收敛，但其在强化学习中的应用受限于黑森估计的计算复杂性。本研究分析了演员更新的二阶近似，尽可能利用目标的全曲率信息。稳定近似需要将动作值函数视为对策略参数的局部常数，而在策略梯度方法中通常不成立。我们证明，在两时间尺度的actor-critic框架下，这种近似变得合理，因为critic在更快的时间尺度上演进，并且在actor更新时可以被视为准平稳的。基于这一见解，我们构建了一种二阶演员-批评者方法，用于贴现奖励设置，利用黑森向量积（HVP）计算，实现计算高效且稳定的二阶更新。
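
A Hessian-vector product can be computed without ever forming the Hessian, which is what keeps second-order updates affordable. The sketch below uses a central difference of gradients, a generic technique swapped in for illustration. The abstract does not say how the authors compute their HVPs, and the toy objective, `toy_grad`, and `hvp_central_difference` are hypothetical names.

```python
def hvp_central_difference(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v as (g(theta + eps v) - g(theta - eps v)) / (2 eps)."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def toy_grad(theta):
    """Gradient of f(x, y) = x^2 + 3xy + 2y^2, whose Hessian is [[2, 3], [3, 4]]."""
    x, y = theta
    return [2 * x + 3 * y, 3 * x + 4 * y]

# The product with the first basis vector recovers the first Hessian column.
hv = hvp_central_difference(toy_grad, [1.0, -2.0], [1.0, 0.0])
```

The cost is two gradient evaluations per product, independent of the number of parameters, whereas storing the full Hessian grows quadratically with it.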

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

通过随机选择的少样本指导提升带可验证奖励的强化学习

Authors: Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.15012
Pdf link: https://arxiv.org/pdf/2605.15012
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multi-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, even matching the performance they achieve with the full dataset.
中文摘要 带可验证奖励的强化学习（RLVR）在开发具有思维链展开的大型语言模型（LLMs）方面取得了巨大成功，适用于许多任务，如数学和编码。然而，RLVR在难以生成正确展开的复杂问题上，样本效率存在困难。先前的研究建议通过演示引导的RLVR解决这一问题，即在强化学习失效时进行监督微调（SFT）;然而，SFT通常需要大量数据，且获取成本高昂。本文提出FEST，一种FEw-ShoT演示引导的RLVR算法。它仅通过从SFT数据集中随机选出的128个演示，取得了令人信服的结果。我们发现，成功的关键有三个要素：监督信号、策略信号，以及在少数样本SFT数据集上衰减权重，以防止多跨纪元训练的过度拟合。在多个基准测试中，FEST在SFT数据数量远低于基线时表现优于基线，甚至与完整数据集匹配。

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

基于案例的自适应推理与执行校准，用于LLM工具

Authors: Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.15041
Pdf link: https://arxiv.org/pdf/2605.15041
Abstract Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.
中文摘要 工具的使用不仅扩展了大型语言模型的参数化知识，但要实现可靠的执行，需要在适当的推理深度与严格的结构效度之间取得平衡。我们从基于案例的视角来探讨这个问题，提出了CAST，一个案例驱动框架，将历史执行轨迹视为结构化案例。CAST不再重复使用原始样本输出，而是提取案例衍生信号，识别复杂度剖面以估算最佳推理策略，同时结合失败剖面绘制可能的结构性崩溃图谱。该框架将这些知识转化为细粒度的奖励设计和自适应推理，使模型能够在强化学习过程中自主内化基于案例的策略。在BFCLv2和ToolBench上的实验表明，CAST不仅提升了忠实模式的执行，还能减少不必要的思考。该方法在整体执行准确率上可提升高达5.85个百分点，平均推理长度减少26%，显著减少高影响结构错误。最终，这表明历史执行案例如何为校准工具的使用提供可复用的适应知识。

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

DiffusionOPD：扩散模型中在策略蒸馏（On-Policy Distillation）的统一视角

Authors: Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, Zuxuan Wu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.15055
Pdf link: https://arxiv.org/pdf/2605.15055
Abstract Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
中文摘要 强化学习已成为改进基于扩散的文本到图像模型的强大工具，但现有方法主要限于单任务优化。将强化学习扩展到多任务具有挑战性：联合优化存在跨任务干扰和不平衡，而级联强化学习则繁琐且容易出现灾难性遗忘。我们提出了DiffusionOPD，这是一种基于在线策略提炼（OPD）的扩散模型多任务训练范式。DiffusionOPD首先独立培训针对特定任务的教师，然后根据学生自身的推广轨迹，将他们的能力提炼成统一的学生。这使单任务探索与多任务集成脱钩，避免了从零开始联合解决所有任务的优化负担。理论上，我们将OPD框架从离散令牌提升到连续状态马尔可夫过程，推导出一个闭式每步KL目标，通过均值匹配统一了随机SDE和确定性常微分方程的细化。我们通过形式和实证证明，该分析梯度相比传统的PPO式政策梯度提供了更低的方差和更好的一般性。大量实验表明，扩散OPD在训练效率和最终表现上持续超越多奖励强化学习和级联强化学习基线，同时在所有评估基准测试中均取得最先进的成绩。

Learning from Language Feedback via Variational Policy Distillation

通过变分策略蒸馏从语言反馈中学习

Authors: Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Rayhan Joty
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15113
Pdf link: https://arxiv.org/pdf/2605.15113
Abstract Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
中文摘要 可验证奖励强化学习（RLVR）存在稀疏的结果信号，导致复杂推理任务中出现严重的探索瓶颈。最新的策略自提纯方法试图通过利用语言反馈生成密集的代币级监督来解决这个问题。然而，这些方法依赖于固定的被动教师来解读反馈。随着学生政策的改进，教师的零样本评估能力趋于平稳，最终导致后续学习停滞。为克服这一问题，我们提出了变分策略提炼（VPD）框架，将从语言反馈中学习形式化为变分期望-最大化（EM）问题。VPD对这两种策略进行了协进：在E步中，教师通过自适应信任区域更新主动优化轨迹结果，将文本反馈转化为动态改进的目标令牌分布。在M步中，学生在政策推广中内化了这种密集的分布指导。通过不断提升教师从文本批评中提取可操作信号的能力，VPD克服了被动提炼的局限。通过科学推理和代码生成任务的多种诊断反馈评估，VPD持续优于标准RLVR和现有自蒸馏基线。最后，通过对我们关于刚性数学推理和冷启动机制的压力测试，我们阐明了反馈驱动自蒸馏与纯环境驱动强化学习的基本界限。

Self-Distilled Agentic Reinforcement Learning

自我蒸馏的代理强化学习

Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.15155
Pdf link: https://arxiv.org/pdf/2605.15155
Abstract Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
中文摘要 强化学习（RL）已成为训练后LLM代理的核心范式，但其轨迹级奖励信号仅为长视野交互提供粗略监督。策略自蒸馏（OPSD）通过引入来自教师分支的密集令牌级指导，辅以特权上下文，补充了强化学习。然而，将OPSD转移给多回合代理存在问题：叠加多回合不稳定性会破坏监督，而技能条件特权指导则要求对负面教师拒绝进行非对称处理，原因可能源于技能检索或利用不完美。我们引入了SDAR（自蒸馏代理强化学习），将OPSD视为门控辅助目标，同时保持强化学习作为主要优化骨干。SDAR将分离的令牌级信号映射到S形门中，强化教师认可的正差值标记的提炼，并温和地削弱负面教师的拒绝。在ALFWorld、WebShop和Search-QA上的Qwen2.5和Qwen3系列中，SDAR相较GRPO显著提升（ALFWorld为+9.4%，Search-QA为+7.0%，WebShop-Acc为+10.2%），避免了天真GRPO+OPSD的不稳定性，并且在模型尺度上持续优于混合强化学习-OPSD基线。

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

RAVEN：实时自回归视频推断，采用一致性模型GRPO

Authors: Yanzuo Lu, Ronglai Zuo, Jiankang Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.15190
Pdf link: https://arxiv.org/pdf/2605.15190
Abstract Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
中文摘要 因果自回归视频扩散模型支持通过从先前生成内容推算出未来片段来实现实时流媒体生成。从高保真度双向教师中提炼此类生成器，可获得具有竞争力的少步模型，但培训期间遇到的历史分布与推断时产生的历史分布之间存在持续的差距，长期内限制了生成质量。我们介绍了实时自回归视频外推网络（RAVEN），这是一个训练时测试框架，将每个自展开重新打包为一系列干净的历史端点和噪声去噪态的交错序列。该表述使训练注意力与推断时间外推相匹配，并允许下游的块丢失来监督未来预测依赖的历史表示。我们进一步提出了一致性模型群相对策略优化（CM-GRPO），将一致性采样步骤重新表述为条件高斯转移，并直接对该核应用在线强化学习（RL），避免了之前流模型RL中采用的欧拉-丸山辅助过程。实验表明，RAVEN在质量、语义和动态度评估上均优于近期因果视频蒸馏基线，且CM-GRPO与RAVEN结合时提供了进一步的提升。

Keyword: diffusion policy

There is no result