arXiv Papers of Today

Generated: 2026-05-01 17:48:41 (UTC+8); arXiv release time: 2026-05-01 20:00 EDT (2026-05-02 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

Learning-to-Explain through 20Q Gaming: An Explainable Recommender for Cybersecurity Education

Authors: Mary Nusrat, Sarfuddin Bhuiyan, Gahangir Hossain
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26964
Pdf link: https://arxiv.org/pdf/2604.26964
Abstract The growing sophistication of contemporary cyber threats necessitates a more effective and adaptive approach to cybersecurity training. Traditional learning methods do not provide the intuitive and adaptive approaches to learning that are often required. In this article, we present a new educational framework, "Learning to Explain Cybersecurity with Q20 Game", based on explainable AI (XAI), an educational game to enhance interactivity in learning. We propose a novel, game-inspired framework, the Explainable Q20 Cybersecurity Recommender (EQ-20CR), that learns to elicit the minimal set of evidential facts needed to justify a cybersecurity defensive action. By casting "Why should I execute this mitigation?" as a 20 questions (Q20) game, a policy-based reinforcement-learning (RL) agent actively queries an environment until it can both (i) recommend the optimal security education and (ii) explain that decision with a concise dialogue trace. The article draws from "Playing 20 Question Game with Policy-Based Reinforcement Learning" [1] and "Learning-to-Explain: Recommendation Reason Determination through Q20 Gaming" [2]. The framework uses a policy-based RL agent that leads the user through a sequence of questions to recognize and articulate a targeted cybersecurity concept, attack vector, or defense strategy. Furthermore, the system gradually exposes users to informative questions, revealing complicated concepts in a structured way at an adaptive difficulty level. In this paper, we present the architecture, its application to various cybersecurity concepts through illustrative case studies, and its transformative potential for cybersecurity training and awareness.

PALCAS: A Priority-Aware Intelligent Lane Change Advisory System for Autonomous Vehicles using Federated Reinforcement Learning

Authors: Yassine Ibork, Nhat Ha Nguyen, Myounggyu Won, Lokesh Das
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.27118
Pdf link: https://arxiv.org/pdf/2604.27118
Abstract We present a priority-aware intelligent lane change advisory system based on multi-agent federated reinforcement learning, namely PALCAS, for autonomous vehicles (AVs). While existing lane-change approaches typically focus on single-agent systems or centralized multi-agent systems, we introduce a federated reinforcement learning-based multi-agent lane change system prioritizing lane changing based on vehicle destination urgency. PALCAS incorporates a novel priority-aware safe lane-change reward function to enable judicious lane-change decisions in both mandatory and discretionary scenarios. PALCAS leverages the parameterized deep Q-network (PDQN) algorithm to facilitate effective cooperation among agents, enabling both lateral and longitudinal motion controls of AVs. Extensive simulations conducted using the SUMO traffic simulator and Mosaic V2X communication framework demonstrate that PALCAS significantly improves traffic efficiency, driving safety, comfort, destination arrival rates, and merging success rates compared to baseline methods.
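
The PALCAS abstract does not say how the per-vehicle models are aggregated. As a minimal sketch, assuming plain federated averaging (FedAvg) with per-client weights, the server-side step might look like the following. The parameter name `q_head`, the weights, and the example values are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

def federated_average(client_params: List[Params], client_weights: List[float]) -> Params:
    """Weighted average of per-vehicle parameter vectors (FedAvg-style)."""
    if not client_params or len(client_params) != len(client_weights):
        raise ValueError("need one weight per client")
    total = sum(client_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(w * p[name][i] for p, w in zip(client_params, client_weights)) / total
            for i in range(length)
        ]
    return averaged

# Hypothetical example: two vehicles, the second weighted 3x (e.g. more transitions).
vehicle_a = {"q_head": [0.0, 4.0]}
vehicle_b = {"q_head": [4.0, 0.0]}
global_params = federated_average([vehicle_a, vehicle_b], [1.0, 3.0])
```

Only parameters leave each vehicle in this scheme; raw driving data stays local, which is the usual motivation for a federated rather than centralized multi-agent setup.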

A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

Authors: Timothy Flavin, Sandip Sen
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2604.27162
Pdf link: https://arxiv.org/pdf/2604.27162
Abstract Reinforcement Learning (RL) algorithms exhibit high sample complexity, particularly when applied to Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). As a response, projects such as SampleFactory, EnvPool, Brax, and IsaacLab migrate parallel execution of classic environments such as MuJoCo and Atari into C++ thread pools or the GPU to decrease the computational cost of environment steps. We are interested in optimizing the decision level of human-AI joint operations, so we introduce a compute-efficient Dec-POMDP engine natively architected in C++ called Hide-And-Seek-Engine. By employing Data-Oriented Design (DOD) principles, explicit 64-byte cache-line alignment to remove false sharing, and a zero-copy PyTorch memory bridge using pinned memory and Direct Memory Access (DMA), our engine sustains throughput of up to 33,000,000 steps per second (SPS) in a single-agent, 1024-environment setting with decentralized observations on an AMD Ryzen 9950X (16 cores). For reference, ten agents reduce throughput to 7M SPS, with random action generation contributing one third of the total runtime. The engine achieves a throughput increase of approximately 3,500× over the baseline single-threaded vectorized NumPy implementation and successfully trains cooperative multi-agent policies via PPO, DQN, and SAC in minutes, validating both its performance and generality.

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Authors: Pokuang Zhou, Yuhao Zhou, Quan Luu, Seungho Han, Heng Zhang, Binghao Huang, Yunzhu Li, Arash Ajoudani, Zhengtong Xu, Yu She
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.27224
Pdf link: https://arxiv.org/pdf/2604.27224
Abstract Quadrupedal loco-manipulation is commonly built on visual perception and proprioception. Yet reliable contact-rich manipulation remains difficult: vision and proprioception alone cannot resolve uncertain, evolving interactions with the environment. Tactile sensing offers direct contact observability, but a scalable tactile-aware learning framework for quadrupedal loco-manipulation remains underexplored. In this paper, we present a tactile-aware loco-manipulation policy learning pipeline with a hierarchical structure. Our approach has two key components. First, we leverage real-world human demonstrations to train a tactile-conditioned visuotactile high-level policy. This policy predicts not only end-effector trajectories for manipulation, but also the evolving tactile interaction cues that characterize how contact should develop over time. Second, we perform large-scale reinforcement learning in simulation to learn a tactile-aware whole-body control policy that tracks diverse commanded trajectories and tactile interaction cues, and transfers zero-shot to the real world. Together, these components enable coordinated locomotion and manipulation in contact-rich scenarios. We evaluate the system on real-world contact-rich tasks, including in-hand reorientation with insertion, valve tightening, and delicate object manipulation. Compared to vision-only and visuotactile baselines, our method improves performance by 28.54% on average across these tasks.

AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data

Authors: Ali Jaberi (1), Yonatan Kurniawan (2), Robert Black (1), Shayan Mousavi M. (1), Kabir Verma (3), Zoya Sadighi (1), Santiago Miret (4), Jason Hattrick-Simpers (2) ((1) Clean Energy Innovation Research Center, National Research Council Canada, Mississauga, ON, Canada, (2) Department of Material Science and Engineering, University of Toronto, Toronto, ON, Canada, (3) Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, (4) Lila Sciences, San Francisco, CA, USA)
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2604.27266
Pdf link: https://arxiv.org/pdf/2604.27266
Abstract This paper introduces AutoREC, an open-source Python package for developing reinforcement learning (RL) agents to automatically generate equivalent circuit models (ECMs) from electrochemical impedance spectroscopy (EIS) data. While ECMs are a standard framework for interpreting EIS data, traditional identification is typically based on manual trial-and-error, which requires domain experts and limits scalability, particularly in autonomous experimental pipelines such as self-driving laboratories. AutoREC addresses this challenge by formulating ECM construction as a sequential decision-making problem within a Markov Decision Process framework. It implements a Double Deep Q-Network with prioritized experience replay, along with a dedicated dead-loop mitigation strategy, to efficiently explore a complex action space for circuit generation. To demonstrate the capabilities of the platform, we trained an RL agent using AutoREC and evaluated its strengths and limitations across diverse datasets, while also discussing possible strategies to mitigate these limitations in future agent designs. The trained agent achieved a success rate exceeding 99.6% on synthetic datasets and demonstrated strong generalization to unseen experimental EIS data from batteries, corrosion, oxygen evolution reaction, and CO2 reduction systems. These results position AutoREC as a promising platform for adaptive and data-driven ECM generation, with potential for integration into automated electrochemical workflows.
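
The two standard ingredients named in the AutoREC abstract, Double DQN and prioritized experience replay, can be illustrated with a small sketch. This is not AutoREC's code or API; the function names, the exponent `alpha`, and the example numbers are assumptions chosen for illustration.

```python
from typing import List

def double_dqn_target(reward: float, done: bool, gamma: float,
                      q_online_next: List[float], q_target_next: List[float]) -> float:
    """Double DQN target: the online net picks the action, the target net evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]

def per_sampling_probs(priorities: List[float], alpha: float = 0.6) -> List[float]:
    """Prioritized replay sampling: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

# The online network prefers action 1, so the target network's value for action 1 is used.
target = double_dqn_target(1.0, False, 0.9, [0.2, 0.8, 0.5], [1.0, 2.0, 3.0])
probs = per_sampling_probs([1.0, 1.0, 2.0], alpha=1.0)
```

Decoupling action selection from action evaluation is what reduces the overestimation bias of a plain DQN target, which would have taken the maximum of the target network's own values (3.0 here).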

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Authors: Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.27375
Pdf link: https://arxiv.org/pdf/2604.27375
Abstract Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available.

Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

Authors: Haiyang Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.27411
Pdf link: https://arxiv.org/pdf/2604.27411
Abstract Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift has occurred is often the easier part; the harder part is turning that recognition into useful action-level correction. We study several ways of responding to shift, including planning penalties, direct fine-tuning, global residual correction, and coarse gating. In our experiments, these approaches either do not improve closed-loop control or hurt in-distribution (ID) performance. Based on these negative results, we propose JEPA-Indexed Local Expert Growth. The method uses a frozen JEPA representation only for problem indexing, while cluster-specific residual experts add local action corrections on top of the original controller. The baseline controller itself is not modified. Using paired-bootstrap evaluation, we find that the original naive-preference variant is not stable under stricter testing. In contrast, the harder-pair variant produces statistically significant OOD improvements on all four evaluated shift conditions while preserving ID performance. The learned experts also remain useful when the same shift is encountered again, which supports the view of adaptation as incremental knowledge growth rather than repeated full retraining. We further show that automatic ID rejection can be achieved with simple density models, whereas fine-grained discrimination among OOD sub-families is limited by the representation. Overall, the results indicate that, for visual MBRL under distribution shift, the main challenge is not simply noticing that the environment has changed, but applying the right local action correction after the change has been recognized.
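
The abstract relies on paired-bootstrap evaluation without giving details. A minimal sketch of one common form, a percentile bootstrap over per-episode differences, is shown below. The resample count, the confidence level, and the example returns are assumptions and may differ from the paper's protocol.

```python
import random
from typing import List, Tuple

def paired_bootstrap_ci(baseline: List[float], adapted: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for adapted - baseline."""
    if len(baseline) != len(adapted) or not baseline:
        raise ValueError("need equally sized, non-empty paired samples")
    diffs = [a - b for a, b in zip(adapted, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample episode indices with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper

# Hypothetical per-episode returns on the same seeds under a shifted environment.
baseline_returns = [10.0, 12.0, 9.0, 11.0, 10.5, 9.5]
adapted_returns = [11.0, 13.5, 9.5, 12.0, 11.5, 10.0]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(baseline_returns, adapted_returns)
```

An improvement would typically be called significant at the chosen level when the whole interval lies above zero. Pairing matters because both controllers are run on the same episodes, so episode difficulty cancels out of each difference.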

RAY-TOLD: Ray-Based Latent Dynamics for Dense Dynamic Obstacle Avoidance with TDMPC

Authors: Seungho Han, Seokju Lee, Jeonguk Kang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.27450
Pdf link: https://arxiv.org/pdf/2604.27450
Abstract Dense, dynamic crowds pose a persistent challenge for autonomous mobile robots. Purely reactive planning methods, such as Model Predictive Path Integral (MPPI) control, often fail to escape local minima in complex scenarios due to their limited prediction horizon. To bridge this gap, we propose Ray-based Task-Oriented Latent Dynamics (RAY-TOLD), a hybrid control architecture that integrates obstacle information into latent dynamics and utilizes the robustness of physics-based MPPI with the long-horizon foresight of reinforcement learning. RAY-TOLD leverages a LiDAR-centric latent dynamics model to encode high-dimensional sensor data into a compact state representation, enabling the learning of a terminal value function and a policy prior. We introduce a policy mixture sampling strategy that augments the MPPI candidate population with trajectories derived from the learned policy, effectively guiding the planner towards the goal while maintaining kinematic feasibility. Extensive tests in a stochastic environment with high-density dynamic obstacles demonstrate that our method outperforms the MPPI baseline, reducing the collision rate. The results confirm that blending short-horizon physics-based rollouts with learned long-horizon intent significantly enhances navigation reliability and safety.
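
The policy mixture sampling idea can be sketched in a heavily simplified form: a one-step, scalar-control MPPI update whose candidate set mixes noise around the previous nominal control with noise around a policy suggestion. The real method works on full trajectories with a learned latent model and terminal value; the function name, sample counts, noise scale, and cost function here are all assumptions.

```python
import math
import random
from typing import Callable

def mppi_mixture_step(nominal: float, policy_action: float,
                      cost_fn: Callable[[float], float],
                      n_noise: int = 48, n_policy: int = 16,
                      sigma: float = 0.5, temperature: float = 1.0,
                      seed: int = 0) -> float:
    """One-step MPPI update over a candidate set drawn from two proposal sources."""
    rng = random.Random(seed)
    # Candidates perturbed around the previous nominal control (standard MPPI).
    candidates = [nominal + rng.gauss(0.0, sigma) for _ in range(n_noise)]
    # Candidates perturbed around the learned policy's suggestion (the mixture part).
    candidates += [policy_action + rng.gauss(0.0, sigma) for _ in range(n_policy)]
    costs = [cost_fn(u) for u in candidates]
    min_cost = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    # Importance-weighted average of the candidate controls.
    return sum(w * u for w, u in zip(weights, candidates)) / total

# Hypothetical 1-D task: the low-cost control is near 2.0, far from the nominal 0.0.
new_control = mppi_mixture_step(nominal=0.0, policy_action=2.0,
                                cost_fn=lambda u: 10.0 * (u - 2.0) ** 2)
```

With noise only around the nominal control, few candidates would land near the low-cost region. The policy-derived candidates are what pull the weighted average toward it, which is the intuition behind escaping local minima.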

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks

Authors: Qingyu Ren, Tianjun Pan, Xingzhou Chen, Xuhong Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.27453
Pdf link: https://arxiv.org/pdf/2604.27453
Abstract Large language models have achieved remarkable progress in text generation but still struggle with generative writing tasks. In terms of evaluation, existing benchmarks evaluate writing reward models coarsely and fail to measure performance from the perspective of specific requirements. In terms of training, existing training methods either use LLM-as-a-judge approaches or train coarse-grained reward models, lacking fine-grained requirement-adherence reward modeling. To address these issues, we propose a fine-grained evaluation pipeline WEval for writing reward models and a fine-grained reinforcement learning training framework WRL. The evaluation data of WEval covers multiple task categories and requirement types, enabling systematic evaluation of writing reward models by measuring the correlation between the rankings of the reward model and gold rankings. WRL constructs positive and negative samples by selectively dropping instruction requirements, allowing for more precise reward model training. Experiments show that our models achieve substantial improvements across various writing benchmarks and exhibit strong generalization. The code and data are publicly available.
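
WEval scores a reward model by how well its ranking of responses correlates with a gold ranking. The abstract does not name the correlation statistic. The sketch below assumes Spearman rank correlation without ties, and the scores are made up for illustration.

```python
from typing import List, Sequence

def ranks(scores: Sequence[float]) -> List[int]:
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(model_scores: Sequence[float], gold_scores: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(model_scores)
    if n != len(gold_scores) or n < 2:
        raise ValueError("need two equally sized lists with at least two items")
    d_squared = sum((a - b) ** 2
                    for a, b in zip(ranks(model_scores), ranks(gold_scores)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical scores for four candidate responses to one writing prompt.
gold = [4.0, 3.0, 2.0, 1.0]
agreeing = spearman([0.9, 0.7, 0.4, 0.1], gold)
reversed_order = spearman([0.1, 0.4, 0.7, 0.9], gold)
```

A rank-based statistic fits this setting because reward model scores and gold annotations need not share a scale; only the induced ordering is compared.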

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

Authors: Yang Zhang, Jiangyuan Zhao, Chenyou Fan, Fangzheng Yan, Tian Li, Haitong Tang, Sen Fu, Xuan'er Wu, Qizhen Weng, Weinan Zhang, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.27472
Pdf link: https://arxiv.org/pdf/2604.27472
Abstract Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot learning as a goal-reaching process that requires understanding temporal task progress. We present PRTS (Primitive Reasoning and Tasking System), a VLA foundation model that reformulates pretraining through Goal-Conditioned Reinforcement Learning. By treating language instructions as goals and employing contrastive reinforcement learning, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings approximates the log-discounted goal occupancy, the probability of reaching the language-specified goal from the current state-action, quantitatively assessing physical feasibility beyond static semantic matching. PRTS draws this dense goal-reachability supervision directly from offline trajectories without reward annotations, and folds it into the VLM backbone via a role-aware causal mask, incurring negligible overhead over vanilla behavior cloning. This paradigm endows the high-level reasoning system with intrinsic goal reachability awareness, bridging semantic reasoning and temporal task progress, and further benefits goal-conditioned action prediction. Pretrained on 167B tokens of diverse manipulation and embodied-reasoning data, PRTS reaches state-of-the-art performance on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a real-world suite of 14 complex tasks, with particularly substantial gains on long-horizon, contact-rich, and zero-shot novel-instruction settings, confirming that injecting goal-reachability awareness significantly improves both execution success and long-horizon planning of general-purpose robotic foundation policies.
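
The contrastive objective can be sketched as an inner-product critic over state-action and goal embeddings, trained so that each state-action scores its own goal above the other goals in the batch. The abstract does not specify the exact loss. This sketch assumes an InfoNCE-style objective, and the toy embeddings are illustrative.

```python
import math
from typing import List

Vector = List[float]

def dot(u: Vector, v: Vector) -> float:
    """Inner product used as the critic score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(state_action_embs: List[Vector], goal_embs: List[Vector]) -> float:
    """Row i's positive is goal i; every other goal in the batch is a negative."""
    losses = []
    for i, phi in enumerate(state_action_embs):
        logits = [dot(phi, psi) for psi in goal_embs]
        max_logit = max(logits)  # log-sum-exp with max subtraction for stability
        log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_norm - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)

goals = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss([[2.0, 0.0], [0.0, 2.0]], goals)
swapped_loss = info_nce_loss([[0.0, 2.0], [2.0, 0.0]], goals)
```

The loss is low when each state-action embedding points toward its own goal and high when the pairing is swapped, so minimizing it pushes the inner product to rank reachable goals above unrelated ones.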

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Authors: Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.27505
Pdf link: https://arxiv.org/pdf/2604.27505
Abstract While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
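
GRPO can use a non-differentiable reward because it only needs scalar rewards per sample, which it standardizes within a group of outputs for the same input. A minimal sketch of that advantage computation follows. The use of the population standard deviation, the `eps` term, and the example rewards are assumptions.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier rewards for four edits of the same (image, instruction) pair.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Edits scored above the group mean get positive advantages and are reinforced, while those below are discouraged. No learned value function is needed, since the group itself serves as the baseline.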

Bayesian policy gradient and actor-critic algorithms

Authors: Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.27563
Pdf link: https://arxiv.org/pdf/2604.27563
Abstract Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, which tend to have high variance, requiring many samples and resulting in slow convergence. We first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient and a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and can be extended to partially observable problems. On the downside, it cannot exploit the Markov property when the system is Markovian. To address this, we supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes rule to be used to compute the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values yield closed-form expressions for the posterior of the gradient of the expected return with respect to the policy parameters. We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, on a number of reinforcement learning problems.
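
For context on the Monte-Carlo baseline that the abstract contrasts with, the sketch below estimates a policy gradient with the score-function (REINFORCE-style) estimator on a two-armed bandit and compares it with the closed-form gradient. It illustrates the conventional estimator only, not the Bayesian method. The bandit, its rewards, and the sample count are assumptions.

```python
import math
import random

def mc_policy_gradient(theta: float, rewards=(0.0, 1.0),
                       n_samples: int = 20000, seed: int = 0) -> float:
    """Score-function estimate of dJ/dtheta for a two-armed bandit."""
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))  # probability of choosing arm 1
    total = 0.0
    for _ in range(n_samples):
        action = 1 if rng.random() < p else 0
        # For a Bernoulli-sigmoid policy, d/dtheta log pi(action) = action - p.
        total += rewards[action] * (action - p)
    return total / n_samples

def exact_policy_gradient(theta: float, rewards=(0.0, 1.0)) -> float:
    """Closed form: dJ/dtheta = p * (1 - p) * (r1 - r0)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return p * (1.0 - p) * (rewards[1] - rewards[0])

estimate = mc_policy_gradient(0.0)
exact = exact_policy_gradient(0.0)
```

Even in this trivial problem the estimator needs many samples to settle near the exact value, which is the sample-efficiency issue that motivates modeling the gradient with a Gaussian process.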

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Authors: Ke Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.27629
Pdf link: https://arxiv.org/pdf/2604.27629
Abstract We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

Autonomous Traffic Signal Optimization Using Digital Twin and Agentic AI for Real-Time Decision-Making

Authors: Salman Jan, Toqeer Ali Syed, Shahid Kamal, Qamar Wali, Ali Akarma
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.27753
Pdf link: https://arxiv.org/pdf/2604.27753
Abstract This article outlines a new framework for traffic light optimization through a digital twin of the transport infrastructure, managed by agentic AI to ensure real-time autonomous decisions. The framework relies on physical sensors and edge computing to measure real-time traffic information and simulate traffic flow in a constantly updated digital twin. The traffic light is automatically controlled through the digital twin according to traffic congestion, travel delay and traffic patterns. This approach is implemented as a three-layer system: perception, conceptualization and action. The perception layer receives data on physical systems; the conceptualization layer uses LangChain to process the data; and the action layer links to the Model Context Protocol (MCP) and traffic management APIs to implement optimised traffic signal control algorithms. The results show that the framework minimizes waiting time at traffic lights and improves the effectiveness of the overall traffic flow, outperforming the fixed-time and reinforcement learning-based baselines.
中文摘要 本文概述了通过交通基础设施数字孪生构建的新型交通信号灯优化框架，由智能人工智能管理，确保实时自主决策。该框架依赖物理传感器和边缘计算，实时测量交通信息，并在不断更新的数字孪生中模拟交通流。交通信号灯通过数字孪生自动控制，根据交通拥堵、行车延误和交通模式进行调整。该方法以三层系统的形式实现：感知、概念化和行动。感知层接收物理系统的数据;概念化层使用LangChain处理数据;动作层连接模型上下文协议（MCP）和交通管理API，实现优化的交通信号控制算法。结果显示，该框架最大限度地减少了红绿灯等待时间，并积极影响整个交通流的有效性，优于基于固定时间和强化学习的基线。

CastFlow: Learning Role-Specialized Agentic Workflows for Time Series Forecasting

CastFlow：学习基于角色的专属工作流用于时间序列预测

Authors: Bokai Pan, Mingyue Cheng, Zhiding Liu, Shuo Yu, Xiaoyu Tao, Yuchong Wu, Qi Liu, Defu Lian, Enhong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.27840
Pdf link: https://arxiv.org/pdf/2604.27840
Abstract Recently, large language models (LLMs) have shown great promise in time series forecasting. However, most existing LLM-based forecasting methods still follow a static generative paradigm that directly maps historical observations to future values in a single pass. Under this paradigm, forecasting is constrained by limited temporal pattern extraction, single-round acquisition of contextual features, one-shot forecast generation, and lack of support from ensemble forecasts. To address these limitations, in this work, we propose CastFlow, a dynamic agentic forecasting framework that enables multi-view temporal pattern extraction, multi-round contextual features acquisition, iterative forecast refinement, and forecasting with ensemble forecasts. First, CastFlow organizes the forecasting process into planning, action, forecasting, and reflection, establishing an agentic workflow. Second, this workflow is supported by a memory module that retrieves prior experience and a multi-view toolkit that constructs diagnostic evidence and provides a reliable ensemble forecast baseline. Third, CastFlow adopts a role-specialized design that combines general-purpose reasoning with specialized numerical forecasting. Under this design, a frozen LLM preserves general-purpose reasoning, while a fine-tuned domain-specific LLM performs evidence-guided numerical forecasting based on the ensemble forecast baseline, rather than from scratch. To optimize a fine-tuned domain-specific LLM, we further develop a two-stage workflow-oriented training that combines supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). To evaluate the effectiveness of CastFlow, we conduct extensive experiments on diverse datasets and show that it achieves superior overall results against strong baselines. We hope that this work can serve as a step toward more adaptive and accurate time series forecasting.
中文摘要 近年来，大型语言模型（LLMs）在时间序列预测中展现出巨大潜力。然而，大多数现有基于LLM的预测方法仍采用静态生成范式，直接将历史观测值映射到未来值，一次过。在这种范式下，预测受限于有限的时间模式提取、单轮上下文特征获取、一次性预报生成以及缺乏集合预报的支持。为解决这些局限性，本研究提出了CastFlow，一种动态代理预测框架，支持多视角时间模式提取、多轮上下文特征获取、迭代预报细化以及集成预测。首先，CastFlow 将预测过程组织为规划、行动、预测和反思，建立了代理性工作流。其次，该工作流程由内存模块支持，可检索先前经验，并支持多视图工具包构建诊断证据并提供可靠的集合预报基线。第三，CastFlow采用角色专用设计，结合了通用推理与专门的数值预测。在这种设计下，冻结的LLM保留了通用推理，而精细调优的领域特定LLM则基于集合预测基线进行循证数值预测，而非从零开始。为了优化精细调优的领域专属LLM，我们进一步开发了一套两阶段的以工作流为导向的训练，结合了监督微调（SFT）和带可验证奖励（RLVR）的强化学习。为了评估CastFlow的有效性，我们在多种数据集上进行了大量实验，证明其在强基线下取得了更优越的整体效果。我们希望这项工作能成为迈向更具适应性和准确时间序列预测的一步。

Rethinking Agentic Reinforcement Learning In Large Language Models

重新思考大型语言模型中的能动强化学习

Authors: Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, Jiahong Li
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2604.27859
Pdf link: https://arxiv.org/pdf/2604.27859
Abstract Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we take an in-depth look at the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.
中文摘要 强化学习（RL）传统上专注于训练专业代理，在狭义环境中优化预定义的奖励函数。然而，强大的大型语言模型（LLM）的出现以及日益复杂且开放式任务的出现，催生了强化学习中向智能范式的范式转变。这一新兴框架超越了传统强化学习，强调开发能够在不确定的现实环境中实现目标设定、长期规划、动态策略适应和交互推理的自主智能体。与依赖静态目标和情节互动的传统方法不同，基于大型语言模型的智能强化学习将类认知能力，如元推理、自我反思和多步决策直接融入学习循环中。本文深入探讨这一趋势背后的概念基础、方法创新和有效设计。此外，我们识别了关键挑战，并概述了构建基于大型语言模型的代理强化学习的有前景未来方向。

Generate Your Talking Avatar from Video Reference

从视频参考生成你的会说话的头像

Authors: Zujin Guo, Zhenhui Ye, Yi Ren, Yuanming Li, Ce Chen, Zhibin Hong, Chen Change Loy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.27918
Pdf link: https://arxiv.org/pdf/2604.27918
Abstract Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (this https URL) and HeyGen Avatar-V (this https URL).
中文摘要 现有的说话头像方法通常采用图像到视频的流水线，以与目标生成处于同一场景的静态参考图像为条件。这种受限的单视角缺乏足够的时间和表情线索，限制了在定制背景中合成高保真说话头像的能力。为此，我们提出了基于视频参考的说话头像生成（TAVR），这是一种利用跨场景视频输入来改变范式的新框架。为了有效处理这些扩展的时间上下文并弥合跨场景的领域差距，TAVR集成了一个词元选择模块和全面的三阶段训练方案。具体来说，同场景视频预训练奠定了外观复制的基础，随后通过跨场景参考微调加以扩展，实现稳健的跨场景适应。最后，任务特定的强化学习将生成输出与基于身份的奖励对齐，以最大化身份相似度。为了系统评估跨场景的稳健性，我们构建了一个包含158对精心策划的跨场景视频的新基准。大量实验表明，TAVR受益于灵活的推理时视频参考，并且在定量和定性上持续超越现有基线。这项工作已经部署到生产环境中。如需更多相关研究，请访问 HeyGen Research（此 https URL）和 HeyGen Avatar-V（此 https URL）。

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

带强化学习的图形界面代理：迈向数字居民

Authors: Junan Hu, Jian Liu, Jingxiang Lai, Jiarui Hu, Yiwei Sheng, Shuang Chen, Jian Li, Dazhao Du, Song Guo
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.27955
Pdf link: https://arxiv.org/pdf/2604.27955
Abstract Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.
中文摘要 图形用户界面（GUI）代理已成为智能系统视觉感知和交互的有前景范式。然而，仅靠监督式微调无法应对长期的信用分配、分配转移以及不可逆环境中的安全探索，因此强化学习（RL）成为推动自动化的核心方法论。本研究首次全面概述了强化学习与图形界面代理的交汇，并探讨这一研究方向如何向数字居民发展。我们提出了一个有原则的分类法，将现有方法组织为离线强化学习、在线强化学习和混合策略，并辅以奖励工程、数据效率和关键技术创新的分析。我们的分析揭示了若干新兴趋势：可靠性与可扩展性之间的张力推动了复合多层级奖励架构的采用;GUI I/O 延迟瓶颈加速了向基于世界模型的训练转变，这能带来显著的性能提升;系统2式审议的自发出现表明，当奖励信号足够丰富时，显式推理监督可能并非必需。我们将这些发现浓缩成一条路线图，涵盖流程奖励、持续强化学习、认知架构和安全部署，旨在引导下一代强大的图形界面自动化及其代理原生基础设施的发展。

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

潜在-GRPO：针对潜在推理的群体相对策略优化

Authors: Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.27998
Pdf link: https://arxiv.org/pdf/2604.27998
Abstract Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where jointly reinforcing multiple correct latent paths can produce an invalid averaged state. To address them, we propose Latent-GRPO, which combines invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Across four low-difficulty benchmarks (e.g., GSM8K-Aug) and four high-difficulty benchmarks (e.g., AIME), Latent-GRPO improves over its latent initialization by 7.86 Pass@1 points on low-difficulty tasks and surpasses explicit GRPO by 4.27 points on high-difficulty tasks while using 3-4× shorter reasoning chains. It also achieves stronger pass@k performance under Gumbel sampling. These results establish Latent-GRPO as an effective approach for stable and efficient latent reasoning.
中文摘要 潜在推理通过将中间推理压缩为连续表示并大幅缩短推理链，为显式推理提供了更高效的替代方案。然而，现有的潜在推理方法主要侧重于监督学习，潜空间中的强化学习仍然高度不稳定。我们通过组相对策略优化（GRPO）的视角研究该问题，表明直接将GRPO应用于潜在推理本质上并不简单：潜在推理同时改变了概率密度和采样机制，导致三个耦合的瓶颈：缺乏内在潜在流形，即无约束探索会将采样轨迹（rollout）推离有效潜在流形；探索与优化错位，即轨迹级奖励可能导致错误的词元级更新；以及潜在混合非闭合，即联合强化多条正确潜在路径可能产生无效的平均态。为此，我们提出了 Latent-GRPO，结合了无效样本优势掩蔽、单侧噪声采样和最优正确路径首词元选择。在四个低难度基准测试（如GSM8K-Aug）和四个高难度基准测试（如AIME）中，Latent-GRPO在低难度任务上比其潜在初始化提升7.86个Pass@1分，在高难度任务上比显式GRPO高出4.27分，同时推理链缩短3-4倍。在Gumbel采样下，其pass@k性能也更强。这些结果表明Latent-GRPO是实现稳定高效潜在推理的有效方法。
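The combination of group-relative advantages with invalid-sample masking mentioned in the Latent-GRPO abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not say how validity is decided or how masked samples are treated, so the boolean `valid` flags, the choice to exclude invalid rollouts from the group statistics, and the function name `group_relative_advantages` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, valid, eps=1e-8):
    """Standardize rewards within one prompt's group of rollouts.

    Assumption (not from the paper): a rollout flagged invalid is left out
    of the group mean/std and receives an advantage of exactly 0, so it
    contributes no policy-gradient signal.
    """
    kept = [r for r, v in zip(rewards, valid) if v]
    if len(kept) < 2:
        # Too few valid rollouts to form a meaningful group baseline.
        return [0.0] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept)
    return [((r - mu) / (sigma + eps)) if v else 0.0
            for r, v in zip(rewards, valid)]


# Four rollouts for one prompt; the third is (hypothetically) off-manifold.
advantages = group_relative_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0],
    valid=[True, True, False, True],
)
print(advantages)
```

In this toy group the masked rollout gets zero advantage even though its reward was 1.0, while the remaining three are standardized against each other.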

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

核化优势估计：从非参数统计到大型语言模型推理

Authors: Shijin Gong, Kai Ye, Jin Zhu, Xinyu Zhang, Hongyi Zhou, Chengchun Shi
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.28005
Pdf link: https://arxiv.org/pdf/2604.28005
Abstract Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization and advantage actor-critic rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function using sample averages. However, GRPO samples a large number of reasoning traces per prompt to achieve accurate value function approximation, making it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
中文摘要 近年来，大型语言模型（LLM）的进展越来越依赖强化学习（RL）来提升推理能力。三种方法已被广泛采用：（i）近端策略优化和优势行为者-批判者依赖深度神经网络估计学习策略的价值函数，以减少策略梯度的方差。然而，估计和维护这样的价值网络会带来大量的计算和内存开销。（ii）群相对策略优化（GRPO）通过使用样本平均值近似价值函数来避免训练价值网络。然而，GRPO每个提示符采样大量推理迹以实现准确的值函数近似，计算成本较高。（iii） REINFORCE型算法每个提示只采样一个推理轨迹，这降低了计算成本，但样本效率较低。本研究聚焦于一个实用且资源有限的环境，每个提示只能抽样少量推理迹，而低方差梯度估计仍是高质量政策学习的关键。为应对这一挑战，我们将经典的非参数统计方法引入LLM推理，这些方法既具有计算效率，也在统计上高效。我们以核平滑作为价值函数估计及后续策略优化的具体示例。数值和理论结果表明，我们的提案实现了准确的价值和梯度估计，从而提升了政策优化。
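To make the idea of a kernel-smoothed value baseline concrete, here is a minimal Nadaraya-Watson sketch. The abstract only says that kernel smoothing is used for value function estimation; the Gaussian kernel, the use of prompt feature vectors, the bandwidth values, and the names `gaussian_kernel` and `kernel_value_estimate` are assumptions for illustration, not the paper's estimator.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two feature vectors (tuples of floats)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_value_estimate(query, features, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of observed rewards.

    `features` are hypothetical embeddings of previously sampled prompts,
    `rewards` the rewards observed for them.
    """
    weights = [gaussian_kernel(query, f, bandwidth) for f in features]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the plain sample mean.
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Two nearby prompts with reward 1 and one distant prompt with reward 0.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
rewards = [1.0, 1.0, 0.0]

local_baseline = kernel_value_estimate((0.05, 0.0), features, rewards, bandwidth=0.5)
global_baseline = kernel_value_estimate((0.05, 0.0), features, rewards, bandwidth=1000.0)
advantage = 1.0 - local_baseline  # advantage of a new rollout with reward 1
print(local_baseline, global_baseline, advantage)
```

With a small bandwidth the baseline borrows almost only from the two nearby prompts; with a very large bandwidth it approaches the plain sample average, which is the GRPO-style baseline the abstract contrasts against.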

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

从分歧中学习：将临床医生的否决作为价值导向护理中临床AI的隐性偏好信号

Authors: Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.28010
Pdf link: https://arxiv.org/pdf/2604.28010
Abstract We reframe clinician overrides of clinical AI recommendations as implicit preference data, the same signal structure exploited by reinforcement learning from human feedback (RLHF) but richer: the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable. We present a formal framework extending standard preference learning with three contributions: a five-category override taxonomy mapping override types to distinct model update targets; a preference formulation conditioned on patient state s, organizational context c, and clinician capability kappa, where kappa decomposes into execution capability kappa-exec and alignment capability kappa-align; and a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, preventing a failure mode we term suppression bias: the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold. We argue that chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties (longitudinal density, concentrated decision space, outcome labels, and natural capability variation) and that training environments combining longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics. This framework emerged from operational work to improve clinician capability in a live value-based care deployment.
中文摘要 我们将临床医生对临床人工智能建议的否决（override）重新表述为隐性偏好数据：其信号结构与基于人类反馈的强化学习（RLHF）所利用的相同，但内容更丰富，因为标注者是领域专家，备选方案会带来真实后果，且下游结果可观察。我们提出了一个扩展标准偏好学习的形式化框架，包含三项贡献：五类否决分类法，将否决类型映射到不同的模型更新目标；一种以患者状态 s、组织背景 c 和临床医生能力 kappa 为条件的偏好表述，其中 kappa 分解为执行能力 kappa-exec 和对齐能力 kappa-align；以及一种双重学习架构，通过交替优化联合训练奖励模型和能力模型，防止我们称之为抑制偏差的失败模式，即当临床医生能力低于执行阈值时，系统性地抑制正确但困难的建议。我们认为，基于结果的支付合同下的慢性病管理会产生具有独特优势的否决数据（纵向密度、集中的决策空间、结果标签和自然的能力差异），而将纵向结果测量与一致的财务激励相结合的训练环境，是学习与患者轨迹而非就诊经济学对齐的奖励模型的必要条件。该框架源自在实际运行的价值导向护理部署中提升临床医生能力的运营工作。

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

回声α：用于超声解读的大型代理多模态推理模型

Authors: Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang, Jianxin Liu, Jian Chen, Ping Ye, Gang Wang, Zengmao Wang, Bo Du, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.28011
Pdf link: https://arxiv.org/pdf/2604.28011
Abstract Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% [email protected] and Echo-α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at this https URL.
中文摘要 超声解读既需要精确的病灶定位，也需要整体临床推理，但现有方法通常只在其中一项能力上表现出色：专用检测器定位能力强但推理有限，而多模态大型语言模型（MLLMs）推理灵活，但在专业医学领域的落地（grounding）能力薄弱。我们提出 Echo-α，一种用于超声解读的代理式多模态推理模型，在“调用并推理”框架内统一了这些优势。Echo-α 经过训练，能够协调器官特异性检测器的输出，将其与全局视觉上下文整合，并将所得证据转化为超越仅依赖检测器推断的、有依据的诊断决策。这种行为通过九项任务的监督课程建立，随后通过不同奖励权衡下的顺序强化学习加以完善，得到用于病灶锚定的 Echo-α-Grounding 和用于最终诊断的 Echo-α-Diagnosis。在多中心肾脏和乳腺超声基准测试中，Echo-α 在定位和诊断方面均优于有竞争力的基线。特别是在跨中心测试集上，Echo-α-Grounding 在肾脏/乳腺超声上的 [email protected] 达到 56.73%/43.78%，Echo-α-Diagnosis 的整体准确率达到 74.90%/49.20%。这些结果表明，代理式多模态推理能够将专用检测器转化为可验证的临床证据，为实现更准确、可解释和可迁移的超声AI系统提供了实用途径。代码仓库见此 https URL。

Cost-Aware Learning

成本意识学习

Authors: Clara Mohri, Amir Globerson, Haim Kaplan, Tomer Koren, Yishay Mansour
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.28020
Pdf link: https://arxiv.org/pdf/2604.28020
Abstract We consider the problem of Cost-Aware Learning, where sampling different component functions of a finite-sum objective incurs different costs. The objective is to reach a target error while minimizing the total cost. First, we propose the Cost-Aware Stochastic Gradient Descent algorithm for convex functions, and derive its cost complexity to attain an error of $\epsilon$. Furthermore, we establish a lower bound for this setting and provide a subset selection algorithm to further reduce the cost of training. We apply our theoretical insights to reinforcement learning with language models, where the computational cost of policy gradients varies with sequence length. To this end, we introduce Cost-Aware GRPO, an algorithm designed to reduce the cost of policy optimization while preserving performance. Empirical results on 1.5B and 8B LLMs demonstrate that our approach reduces the tokens used in policy optimization by up to about 30% while matching or exceeding baseline accuracy.
中文摘要 我们考虑成本感知学习问题，其中对有限和目标的不同分量函数进行采样会产生不同的成本。目标是在最小化总成本的同时达到目标误差。首先，我们针对凸函数提出了成本感知随机梯度下降算法，并推导了其达到误差 $\epsilon$ 所需的成本复杂度。此外，我们为该设定建立了下界，并给出了子集选择算法，以进一步降低训练成本。我们将理论见解应用于语言模型的强化学习，其中策略梯度的计算成本随序列长度变化。为此，我们引入了成本感知GRPO算法，旨在降低策略优化成本，同时保持性能。在15亿和80亿参数大型语言模型上的实证结果表明，我们的方法在达到或超过基线准确率的同时，将策略优化中使用的词元数量减少了最多约30%。
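A rough way to see what cost-aware sampling of a finite-sum objective involves is sketched below. The abstract does not give the sampling rule of Cost-Aware SGD, so the distribution used here (probability proportional to one over the square root of the cost, which balances cost against variance when all components have similar gradient norms) is an assumed stand-in, as are the helper names. The importance weight 1/(n p_i) is the standard correction that keeps the gradient estimate unbiased under any non-uniform sampling.

```python
import math
import random


def cost_aware_probs(costs):
    """Assumed rule (not from the paper): p_i proportional to 1/sqrt(c_i)."""
    inv = [1.0 / math.sqrt(c) for c in costs]
    z = sum(inv)
    return [v / z for v in inv]


def sample_index(probs, rng):
    """Draw one component index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def weighted_gradient(grads, probs, i):
    """Importance-weighted single-sample estimate of the average gradient."""
    return grads[i] / (len(grads) * probs[i])


def expected_estimate(grads, probs):
    """Exact expectation of weighted_gradient over the sampling distribution."""
    return sum(p * weighted_gradient(grads, probs, i) for i, p in enumerate(probs))


def expected_cost(costs, probs):
    """Expected cost paid per sampled component."""
    return sum(p * c for p, c in zip(probs, costs))


costs = [1.0, 4.0, 16.0]   # e.g. proportional to sequence length
grads = [2.0, -1.0, 5.0]   # scalar per-component gradients for illustration
probs = cost_aware_probs(costs)

rng = random.Random(0)
i = sample_index(probs, rng)
print(i, weighted_gradient(grads, probs, i))
print(expected_estimate(grads, probs), expected_cost(costs, probs))
```

The expectation of the weighted estimate equals the plain average gradient, while the expected cost per step drops below the uniform-sampling cost because expensive components are drawn less often.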

Exponential families from a single KL identity

单个KL恒等式的指数族

Authors: Marc Dymetman
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2604.28036
Pdf link: https://arxiv.org/pdf/2604.28036
Abstract Exponential families encompass the distributions central to modern machine learning -- softmax, Gaussians, and Boltzmann distributions -- and underlie the theory of variational inference, entropy-regularized reinforcement learning, and RLHF. We isolate a simple identity for exponential families that expresses the KL difference $\mathrm{KL}(q \| p_{\lambda_2}) - \mathrm{KL}(q \| p_{\lambda_1})$ in terms of the log-partition function $A(\lambda)$ and the moment $\mu_q$. Remarkably, this identity together with the single fact that $\mathrm{KL} \geq 0$ (with equality iff $p = q$) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF. Beyond these purely algebraic consequences, standard analytic arguments recover the gradient formula for the log-partition function, the Bregman representation of within-family KL divergence, and the surjectivity of the moment map. The note is self-contained.
中文摘要 指数族涵盖了现代机器学习中核心的分布——softmax、高斯分布和玻尔兹曼分布——并构成了变分推断、熵正则化强化学习和RLHF理论的基础。我们为指数族分离出一个简单的恒等式，该单位表达了KL差$\mathrm{KL}（q \| p_{\lambda_2}） - \mathrm{KL}（q \| p_{\lambda_1}）$，用对数划分函数 $A（\lambda）$ 和矩 $\mu_q$。值得注意的是，这一恒等式结合 $\mathrm{KL} \geq 0$（当 $p = q$ 时相等）这一事实，通过直接替换和重排，足以推导出一系列经典上由独立且较重的论证得到的结果：任意参考分布的广义三点恒等式、I-投影和反 I-投影的毕氏定理、对数配分函数的凸性，其勒让德对偶的 KL 术语识别、吉布斯变分原理，以及 KL 正则化奖励最大化中的显式优化器，包括熵正则化控制和 RLHF 背后的指数倾斜公式。除了这些纯代数推论外，标准解析论证还恢复了对数划分函数的梯度公式、族内KL散度的布雷格曼表示以及矩映射的满射性。这张便条是独立完整的。
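The abstract does not write the identity out. As a sketch, assuming the standard parametrization $p_\lambda(x) = h(x)\exp(\langle\lambda,\phi(x)\rangle - A(\lambda))$ with moment $\mu_q = \mathbb{E}_q[\phi(x)]$ (the symbols $h$ and $\phi$ are notation introduced here, not taken from the note), the KL difference would work out as follows. The second display shows one way convexity of $A$ could then follow from $\mathrm{KL} \geq 0$ by substitution alone; it is an illustration and may differ from the note's own argument.

```latex
\begin{align*}
\mathrm{KL}(q \,\|\, p_\lambda)
  &= \mathbb{E}_q\!\left[\log \frac{q(x)}{h(x)}\right]
     - \langle \lambda, \mu_q \rangle + A(\lambda), \\
\mathrm{KL}(q \,\|\, p_{\lambda_2}) - \mathrm{KL}(q \,\|\, p_{\lambda_1})
  &= A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1, \mu_q \rangle .
\end{align*}
% Convexity of A: take \lambda = t\lambda_1 + (1-t)\lambda_2 with t in [0,1]
% and q = p_\lambda, and apply the identity with reference p_\lambda twice.
\begin{align*}
0 &\le t\,\mathrm{KL}(p_\lambda \,\|\, p_{\lambda_1})
      + (1-t)\,\mathrm{KL}(p_\lambda \,\|\, p_{\lambda_2}) \\
  &= t A(\lambda_1) + (1-t) A(\lambda_2) - A(\lambda)
     - \langle t\lambda_1 + (1-t)\lambda_2 - \lambda,\; \mu_{p_\lambda} \rangle \\
  &= t A(\lambda_1) + (1-t) A(\lambda_2) - A(\lambda).
\end{align*}
```

The inner product vanishes because $\lambda$ was chosen as the convex combination, which leaves exactly the convexity inequality for $A$.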

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

RHyVE：能力感知验证与阶段感知部署，用于LLM生成的奖励假设

Authors: Feiyu Wu, Xu Zheng, Zhuocheng Wang, Yi ming Dai, Hui Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.28056
Pdf link: https://arxiv.org/pdf/2604.28056
Abstract Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focused primarily on generating, evolving, or selecting reward candidates, while paying less attention to when such candidates can be verified and deployed during policy optimization. We study this deployment-time problem by treating generated rewards as reward hypotheses whose utility depends on the competence of the current policy and the phase of training. We propose RHyVE, a competence-aware verification and phase-aware deployment protocol that compares small sets of reward hypotheses from shared policy checkpoints using short-horizon fork verification. Our experiments show that reward rankings are unreliable at low competence but become informative after task-dependent thresholds. On a sparse manipulation task, phase-aware deployment improves peak and retained performance under a locked protocol. Updated LLM-generated reward-candidate experiments show candidate-family-dependent behavior: generated pools can exhibit phase-dependent winner changes, but no fixed warm-up schedule is universally optimal. Held-out schedule selection, conservative selector baselines, compute-matched controls, and scale controls further show that RHyVE is best understood as a verification-informed deployment protocol rather than a universal scheduler. Dense and all-failure boundary experiments delimit the scope of the method. Together, these results suggest that reward generation and reward deployment should be studied as coupled problems: generated rewards must be verified and deployed under changing policy competence.
中文摘要 大型语言模型（LLM）使强化学习中奖励设计的可扩展性大幅提高，但生成的奖励并不会自动成为可靠的训练目标。现有工作主要集中在生成、进化或选择奖励候选，较少关注在策略优化过程中何时能够验证和部署这些候选。我们将生成的奖励视为奖励假设来研究这一部署时问题，其效用取决于当前策略的能力和训练阶段。我们提出了 RHyVE，一种能力感知验证和阶段感知部署协议，利用短视野分叉验证，从共享的策略检查点出发比较小规模的奖励假设集合。我们的实验表明，奖励排名在低能力时不可靠，但在超过任务相关的阈值后变得具有信息量。在稀疏操作任务中，阶段感知部署在锁定协议下提升了峰值性能和保持性能。更新后的LLM生成奖励候选实验显示出依赖候选族的行为：生成的候选池可能出现随阶段变化的优胜者更替，但没有任何固定的预热计划是普遍最优的。留出的计划选择、保守的选择器基线、计算量匹配的对照和规模对照进一步表明，RHyVE 更适合被理解为基于验证的部署协议，而非通用调度器。密集和全失败的边界实验界定了该方法的适用范围。综合来看，这些结果表明奖励生成与奖励部署应作为耦合问题进行研究：生成的奖励必须在策略能力不断变化的情况下得到验证和部署。

Intelligent Self-tuning Active EMI Filtering for Electrified Automotive Power Systems Using Reinforcement Learning

基于强化学习的电气化汽车电力系统智能自调谐主动EMI滤波

Authors: Mahuizi Lu, Kelin Jia, Rajib Goswami, Yukun Hu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.28084
Pdf link: https://arxiv.org/pdf/2604.28084
Abstract The rapid electrification and intelligence of modern transportation systems place stringent demands on the electromagnetic compatibility, reliability, and adaptability of automotive power electronics. In electric and autonomous vehicles, electromagnetic interference (EMI) generated by high-frequency switching power converters can compromise safety-critical functions, in-vehicle communications, and system efficiency under dynamic operating conditions. Conventional passive EMI filters, while robust, are often oversized and lack adaptability, leading to increased weight, volume, and energy losses. This paper proposes an intelligent self-tuning active EMI filtering approach for electrified automotive power systems based on reinforcement learning (RL). The EMI mitigation problem is formulated as a Markov decision process, enabling an RL agent to continuously adapt filter parameters in response to time-varying interference characteristics. To improve robustness and generalisation under complex and non-stationary conditions, a variational autoencoder is employed for compact state representation, while a noise-based exploration mechanism enhances learning efficiency and prevents suboptimal convergence. The proposed method is evaluated using experimentally measured EMI spectra from an automotive electric drive unit within a MATLAB/Simulink co-simulation framework. Results demonstrate consistent EMI attenuation improvements of 25-30 dB across a wide frequency range compared with conventional control strategies and passive filtering solutions. By reducing reliance on oversized passive components and enabling adaptive EMI suppression, the proposed framework supports lightweight, energy-efficient, and reliable power-electronic systems for intelligent and green transportation applications.
中文摘要 现代交通系统的快速电气化和智能化对汽车电力电子的电磁兼容性、可靠性和适应性提出了严格要求。在电动和自动驾驶车辆中，高频开关电源转换器产生的电磁干扰（EMI）可能影响安全关键功能、车载通信以及动态运行条件下的系统效率。传统的被动EMI滤波器虽然坚固，但通常体积过大且缺乏适应性，导致重量、体积和能量损失增加。本文提出了一种基于强化学习（RL）的智能自调谐主动EMI滤波方法，应用于电气化汽车电力系统。EMI缓解问题被表述为马尔可夫决策过程，使强化学习代理能够持续调整滤波器参数，以响应时间变化的干扰特性。为了在复杂且非平稳条件下提高鲁棒性和泛化性，采用变分自编码器实现紧凑状态表示，同时基于噪声的探索机制提升学习效率并防止次优收敛。该方法通过MATLAB/Simulink协同仿真框架下，利用汽车电动驱动单元的实验测量EMI谱进行评估。结果显示，与传统控制策略和被动滤波方案相比，EMI衰减在宽频范围内持续提升了25-30 dB。通过减少对超大被动元件的依赖并实现自适应EMI抑制，所提框架支持了轻量化、节能且可靠的电力电子系统，应用于智能和绿色交通应用。

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

迈向基于法律和安全原则的神经符号因果规则综合、验证与评估

Authors: Zainab Rehan, Christian Medeiros Adriano, Sona Ghahremani, Holger Giese
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.28087
Pdf link: https://arxiv.org/pdf/2604.28087
Abstract Rule-based systems remain central in safety-critical domains but often struggle with scalability, brittleness, and goal misspecification. These limitations can lead to reward hacking and failures in formal verification, as AI systems tend to optimize for narrow objectives. In previous research, we developed a neuro-symbolic causal framework that integrates first-order logic abduction trees, structural causal models, and deep reinforcement learning within a MAPE-K loop to provide explainable adaptations under distribution shifts. In this paper, we extend that framework by introducing a meta-level layer designed to mitigate goal misspecification and support scalable rule maintenance. This layer consists of a Goal/Rule Synthesizer and a Rule Verification Engine, which iteratively refine a formal rule theory from high-level natural-language goals and principles provided by human experts. The synthesis pipeline employs large language models (LLMs) to: (1) decompose goals into candidate causes, (2) consolidate semantics to remove redundancies, (3) translate them into candidate first-order rules, and (4) compose necessary and sufficient causal sets. The verification pipeline then performs (1) syntax and schema validation, (2) logical consistency analysis, and (3) safety and invariant checks before integrating verified rules into the knowledge base. We evaluated our approach with a proof-of-concept implementation in two autonomous driving scenarios. Results indicate that, given human-specified goals and principles, the pipeline can successfully derive minimal necessary and sufficient rule sets and formalize them as logical constraints. These findings suggest that the pipeline supports incremental, modular, and traceable rule synthesis grounded in established legal and safety principles.
中文摘要 基于规则的系统在安全关键领域依然处于核心地位，但常常面临可扩展性、脆弱性和目标设定错误的困扰。这些局限可能导致奖励黑客行为（reward hacking）和形式验证失败，因为AI系统倾向于针对狭窄目标进行优化。在以往的研究中，我们开发了一个神经符号因果框架，将一阶逻辑溯因树、结构因果模型和深度强化学习整合在MAPE-K循环中，以在分布偏移下提供可解释的适应。本文通过引入元层来扩展该框架，旨在减轻目标设定错误并支持可扩展的规则维护。这一层由目标/规则合成器和规则验证引擎组成，二者根据人类专家提供的高层自然语言目标和原则，迭代完善形式化规则理论。合成流程采用大型语言模型（LLM）来：（1）将目标分解为候选原因，（2）整合语义以消除冗余，（3）将其转换为候选的一阶规则，以及（4）组合出必要且充分的因果集合。验证流程随后执行（1）语法和模式验证，（2）逻辑一致性分析，以及（3）安全性和不变量检查，然后将经过验证的规则集成到知识库中。我们通过两种自动驾驶场景中的概念验证实现来评估我们的方法。结果表明，在给定人类指定的目标和原则的情况下，该流程能够成功推导出最小的必要且充分规则集，并将其形式化为逻辑约束。这些发现表明，该流程支持基于既定法律和安全原则的渐进式、模块化和可追溯的规则合成。

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

FiLMMeD：面向跨问题多车库车辆路由的特征级线性调制

Authors: Arthur Corrêa, Paulo Nascimento, Samuel Moniz
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.28102
Pdf link: https://arxiv.org/pdf/2604.28102
Abstract Solving practical multi-depot vehicle routing problems (MDVRP) is a challenging optimization task central to modern logistics, increasingly driven by e-commerce. To address the MDVRP's computational complexity, neural-based combinatorial optimization methods offer a promising scalable alternative to traditional approaches. However, neural-based methods typically rely on rigid architectures and input encodings tailored to specific problem formulations. In real-world settings, heterogeneous constraints create multiple MDVRP variants, limiting the applicability of such models. While multi-task learning (MTL) has begun to accelerate the development of unified neural-based solvers, prior works focus almost exclusively on single-depot VRPs, leaving the MDVRP unaddressed. To bridge this gap, we propose Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing (FiLMMeD), a novel unified neural-based model for 24 different MDVRP variants. We introduce three main contributions: (1) to improve the model's generalization, we augment the standard Transformer encoder with Feature-wise Linear Modulation (FiLM), which dynamically conditions learned internal representations based on the active set of constraints; (2) we provide an initial demonstration of Preference Optimization in the MTL setting, establishing it as a superior alternative to Reinforcement Learning for future MTL works; (3) to mitigate the generalization gap caused by the introduction of multi-depot constraints, we introduce a targeted curriculum learning strategy that progressively exposes the model to increasingly more complex constraint interactions. Extensive experiments on 24 MDVRP variants (including 8 novel formulations) and 16 single-depot VRPs confirm the effectiveness of FiLMMeD, which consistently outperforms state-of-the-art baselines. Our code is available at: this https URL
中文摘要 求解实际的多车场车辆路径问题（MDVRP）是现代物流中一项具有挑战性的核心优化任务，并日益受到电子商务的驱动。为了应对MDVRP的计算复杂性，基于神经网络的组合优化方法提供了一种有前景、可扩展的传统方法替代方案。然而，基于神经网络的方法通常依赖于针对特定问题形式定制的刚性架构和输入编码。在现实环境中，异构约束会产生多种MDVRP变体，限制了此类模型的适用性。虽然多任务学习（MTL）已开始加速统一神经求解器的发展，但此前的研究几乎只关注单车场VRP，MDVRP尚未得到解决。为弥合这一空白，我们提出了用于跨问题多车场车辆路径的特征级线性调制模型（FiLMMeD），这是一种面向24种不同MDVRP变体的新型统一神经模型。我们提出三项主要贡献：（1）为了提升模型的泛化性，我们在标准Transformer编码器中加入了特征级线性调制（FiLM），它根据当前生效的约束集合动态调节学习到的内部表示；（2）我们初步展示了MTL设定下的偏好优化，确立其作为未来MTL工作中优于强化学习的替代方案；（3）为缓解引入多车场约束所导致的泛化差距，我们引入了一种有针对性的课程学习策略，使模型逐步接触越来越复杂的约束交互。在24种MDVRP变体（包括8种新问题形式）和16种单车场VRP上的大量实验证实了FiLMMeD的有效性，其持续优于最先进的基线。我们的代码可在以下地址获取：此 https URL
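
As an illustration of the FiLM idea named in the abstract above, here is a minimal sketch of feature-wise linear modulation, in which a conditioning vector produces a per-feature scale (gamma) and shift (beta) that are applied to a hidden representation. It is not the FiLMMeD implementation: the function names, the toy weights, and the use of a binary constraint-flag vector as the conditioning input are all assumptions made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine map weights @ x + bias, written with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Vector, condition: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise linear modulation: gamma(condition) * features + beta(condition)."""
    gamma = linear(condition, w_gamma, b_gamma)  # per-feature scale
    beta = linear(condition, w_beta, b_beta)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical toy setup: 3 hidden features, 2 constraint flags.
features = [1.0, -2.0, 0.5]
constraint_flags = [1.0, 0.0]  # assumed encoding: first constraint active, second inactive
w_gamma = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]      # bias of 1 keeps the scale near identity
w_beta = [[0.1, 0.0], [0.0, 0.2], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]

modulated = film(features, constraint_flags, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

Changing `constraint_flags` changes both the scale and the shift, so the same features are re-weighted for each problem variant without altering the rest of the network.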

GSDrive: Reinforcing Driving Policies by Multi-mode Trajectory Probing with 3D Gaussian Splatting Environment

GSDrive：在3D高斯泼溅环境中通过多模态轨迹探测强化驾驶策略

Authors: Ziang Guo, Min Chen, Xuefeng Zhang, Yixiao Zhou, Zufeng Zhang, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.28111
Pdf link: https://arxiv.org/pdf/2604.28111
Abstract End-to-end (E2E) autonomous driving presents a promising approach for translating perceptual inputs directly into driving actions. However, prohibitive annotation costs and temporal data quality degradation hinder long-term real-world deployment. While combining imitation learning (IL) and reinforcement learning (RL) is a common strategy for policy improvement, conventional RL training relies on delayed, event-based rewards: policies learn only from catastrophic outcomes such as collisions, leading to premature convergence to suboptimal behaviors. To address these limitations, we introduce GSDrive, a framework that exploits 3D Gaussian Splatting (3DGS) for differentiable, physics-based reward shaping in E2E driving policy improvement. Our method incorporates a flow matching-based trajectory predictor within the 3DGS simulator, enabling multi-mode trajectory probing where candidate trajectories are rolled out to assess prospective rewards. This establishes a bidirectional knowledge exchange between IL and RL by grounding reward functions in physically simulated interaction signals, offering immediate dense feedback instead of sparse catastrophic events. Evaluated on the reconstructed nuScenes dataset, our method surpasses existing simulation-based RL driving approaches in closed-loop experiments. Code is available at this https URL.
中文摘要 端到端（E2E）自动驾驶为将感知输入直接转化为驾驶动作提供了一种有前景的方法。然而，高昂的标注成本和数据质量随时间的退化阻碍了长期的实际部署。虽然将模仿学习（IL）和强化学习（RL）结合是策略改进的常见做法，但传统强化学习训练依赖延迟的、基于事件的奖励：策略仅从碰撞等灾难性结果中学习，导致过早收敛到次优行为。为解决这些局限性，我们引入了 GSDrive，这是一个利用三维高斯泼溅（3DGS）实现可微分、基于物理的奖励塑形的框架，用于端到端驾驶策略改进。我们的方法在3DGS模拟器中集成了基于流匹配的轨迹预测器，实现多模态轨迹探测，即对候选轨迹进行推演以评估预期奖励。这通过将奖励函数建立在物理模拟的交互信号之上，建立了IL和RL之间的双向知识交流，提供即时且密集的反馈，而非稀疏的灾难性事件。在重建后的nuScenes数据集上评估，我们的方法在闭环实验中超越了现有基于仿真的强化学习驾驶方法。代码可在此 https URL 访问。

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

PRISM：通过黑盒同策略蒸馏实现多模态强化学习的预对齐

Authors: Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.28123
Pdf link: https://arxiv.org/pdf/2604.28123
Abstract The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at this https URL.
中文摘要 大型多模态模型（LMM）的标准后训练方案是先在精选的示范数据上进行监督微调（SFT），随后进行基于可验证奖励的强化学习（RLVR）。然而，SFT引入的分布漂移既不能保留模型的原始能力，也无法忠实匹配监督分布。这一问题在多模态推理中进一步放大，感知错误和推理失败遵循不同的漂移模式，并在后续强化学习中叠加。我们引入PRISM三阶段流水线，通过在SFT和RLVR之间插入显式的分布对齐阶段来缓解这种漂移。基于同策略蒸馏（OPD）的原则，PRISM将对齐表述为策略与混合专家（MoE）判别器之间的一场黑盒、响应级的对抗博弈，该判别器配备专门的感知专家和推理专家，提供解耦的纠正信号，引导策略朝向监督分布，而无需访问教师模型的logits。虽然126万条公开示范足以实现广泛的SFT初始化，但分布对齐需要更高保真度的监督；因此，我们从Gemini 3 Flash额外整理了11.3万条示范，包含密集的视觉定位和针对最难未解问题的逐步推理。Qwen3-VL上的实验表明，PRISM在多种RL算法（GRPO、DAPO、GSPO）及多种多模态基准测试中持续提升下游RLVR性能，在4B和8B模型上平均准确率分别比SFT到RLVR的基线提升+4.4和+6.0个百分点。我们的代码、数据和模型检查点均在此 https URL 公开。

AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation

AdvDMD：对抗性奖励与DMD的结合，实现高质量的少步生成

Authors: Xu Wang, Zexian Li, Litong Gong, Tiezheng Ge, Zhijie Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.28126
Pdf link: https://arxiv.org/pdf/2604.28126
Abstract Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degradation remains pronounced when sampling steps are limited. Reinforcement learning (RL) has been leveraged to improve the few-step generation quality during distillation, with the potential to even surpass the performance of the teacher model. However, existing approaches are combinatorial in nature, merely integrating an RL process with the distillation process, which introduces unnecessary complexities. To address this gap, we propose AdvDMD, a method that seamlessly unifies DMD distillation and RL. Specifically, AdvDMD employs the adversarially trained discriminator from DMD2 as the reward model, which assigns low scores to generated images and high scores to real ones. It is trained on both intermediate and final states of the denoising process and updated online with the distilled model, enabling a holistic supervision of the sampling trajectories and mitigating reward hacking. We adopt a unified SDE backward simulation and a different training schedule for DMD and RL to enable a more stable and efficient training. Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on the GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow.
中文摘要 扩散模型以大量采样步骤为代价，提供了更优越的生成质量。以分布匹配蒸馏（DMD）为代表的蒸馏方法可以缓解这一问题，但当采样步数受限时，性能下降依然明显。强化学习（RL）已被用来提升蒸馏过程中的少步生成质量，甚至有潜力超越教师模型的表现。然而，现有方法本质上是组合式的，仅将强化学习过程与蒸馏过程拼接在一起，这带来了不必要的复杂性。为弥补这一空白，我们提出了AdvDMD，这是一种无缝统一DMD蒸馏与强化学习的方法。具体来说，AdvDMD采用DMD2中对抗训练的判别器作为奖励模型，对生成图像赋予低分，对真实图像赋予高分。该判别器在去噪过程的中间状态和最终状态上进行训练，并与蒸馏模型一同在线更新，实现了对采样轨迹的整体监督，并缓解了奖励投机（reward hacking）。我们采用统一的SDE反向模拟，并为DMD和RL采用不同的训练调度，以实现更稳定、更高效的训练。实验结果显示，4步AdvDMD在DPG-Bench上优于SD3.5的原始40步模型，同时在GenEval上为SD3带来了显著的性能提升。在Qwen-Image上，我们的2步AdvDMD性能优于TwinFlow。

Global Optimality for Constrained Exploration via Penalty Regularization

通过惩罚正则化实现约束探索的全局最优性

Authors: Florian Wolf, Ilyas Fatkhullin, Niao He
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.28144
Pdf link: https://arxiv.org/pdf/2604.28144
Abstract Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. To our knowledge, the only prior model-free policy-gradient approach for this setting under general policy parameterization is due to Ying et al. (2025). Unfortunately, their guarantees are limited to weak regret and ergodic averages, which do not imply that the final output is a single deployable policy that is near-optimal and nearly feasible. In this work we take a different approach to this problem, and propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective, subsequently exploiting the classical Policy Gradient Theorem. We further establish the regularity of the penalized objective, providing the smoothness properties needed to justify the convergence of PGP. Leveraging hidden convexity and strong duality, we then establish global last-iterate convergence guarantees, attaining an $\epsilon$-optimal constrained entropy value with $\epsilon$ bounded constraint violation despite policy-induced non-convexity. We validate PGP through ablations on a grid-world benchmark and further demonstrate scalability on two challenging continuous-control tasks.
中文摘要 高效探索是强化学习的核心问题，通常被形式化为最大化状态-动作占用度量的熵。虽然无约束的最大熵探索已得到较充分的理解，但现实世界的探索通常受到安全性、资源或模仿要求的约束。这种约束设定尤其具有挑战性，因为熵最大化缺乏可加性结构，使得基于贝尔曼方程的方法不适用。此外，可扩展的方法需要策略参数化，从而在目标和约束中都引入非凸性。据我们所知，在一般策略参数化下，该设定中此前唯一的无模型策略梯度方法来自 Ying 等人（2025）。遗憾的是，其保证仅限于弱遗憾和遍历平均，这并不意味着最终输出是一个接近最优且几乎可行的单一可部署策略。在本研究中，我们采取了不同的思路，提出了策略梯度惩罚（PGP）方法，这是一种单循环的策略空间方法，通过二次惩罚正则化来施加一般的凸占用度量约束。PGP构造伪奖励，从而得到惩罚目标的梯度估计，随后利用经典的策略梯度定理。我们进一步确立了惩罚目标的正则性，提供了支持PGP收敛所需的光滑性质。利用隐凸性和强对偶性，我们建立了全局末次迭代收敛保证：尽管存在策略诱导的非凸性，仍可达到 $\epsilon$ 最优的约束熵值，且约束违背以 $\epsilon$ 为界。我们通过网格世界基准上的消融实验验证了PGP，并进一步在两个具有挑战性的连续控制任务上展示了其可扩展性。
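
The abstract above mentions pseudo-rewards that yield gradients of a quadratically penalized entropy objective. The sketch below illustrates that general idea on a tiny tabular occupancy measure and should not be read as the paper's PGP construction. It assumes a single linear cost constraint `<cost, occ> <= budget`, a penalty weight `beta`, and takes the pseudo-reward to be the partial derivative of the penalized objective with respect to each occupancy entry; all names and numbers are hypothetical.

```python
import math
from typing import List


def entropy(occ: List[float]) -> float:
    """Shannon entropy of an occupancy measure (natural log)."""
    return -sum(p * math.log(p) for p in occ if p > 0.0)


def violation(occ: List[float], cost: List[float], budget: float) -> float:
    """Amount by which the linear constraint <cost, occ> <= budget is exceeded."""
    return max(0.0, sum(c * p for c, p in zip(cost, occ)) - budget)


def penalized_objective(occ: List[float], cost: List[float],
                        budget: float, beta: float) -> float:
    """Entropy minus a quadratic penalty on the constraint violation."""
    return entropy(occ) - 0.5 * beta * violation(occ, cost, budget) ** 2


def pseudo_reward(occ: List[float], cost: List[float],
                  budget: float, beta: float) -> List[float]:
    """Partial derivatives of the penalized objective w.r.t. each occupancy entry."""
    v = violation(occ, cost, budget)
    return [-(math.log(p) + 1.0) - beta * v * c for p, c in zip(occ, cost)]


# Hypothetical toy problem: 4 state-action pairs, the last two are "costly".
occ = [0.4, 0.3, 0.2, 0.1]
cost = [0.0, 0.0, 1.0, 1.0]
budget, beta = 0.2, 10.0

rewards = pseudo_reward(occ, cost, budget, beta)
print(rewards)
```

In this toy setting the pseudo-reward of a costly pair drops as soon as the constraint is violated, so a policy-gradient step driven by these rewards would shift occupancy away from it while still favoring rarely visited pairs.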

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

大规模合成计算机用于长期生产力模拟

Authors: Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.28181
Pdf link: https://arxiv.org/pdf/2604.28181
Abstract Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.
中文摘要 真实的长周期生产力工作高度依赖于用户特定的计算机环境，其中大量工作上下文通过目录结构和内容丰富的工件来存储和组织。为了在此类生产力场景下扩展合成数据的创建，我们提出了大规模合成计算机（Synthetic Computers at Scale），这是一种可扩展的方法，用于创建具有真实文件夹层级结构和内容丰富的工件（如文档、电子表格和演示文稿）的此类环境。基于每台合成计算机，我们运行长周期模拟：一个智能体创建针对该计算机用户的生产力目标，这些目标需要多个专业交付物和约一个月的人工工作；另一个智能体随后扮演该用户，在计算机上持续工作（例如，浏览文件系统以获取上下文依据、与模拟的协作者协调、产出专业工件），直到完成这些目标。在初步实验中，我们创建了1000台合成计算机，并在其上运行长周期模拟；每次运行都需要超过8小时的智能体运行时间，平均跨越超过2000轮交互。这些模拟产生丰富的经验学习信号，其有效性通过智能体在域内和域外生产力评估上的显著提升得到验证。鉴于人物角色（persona）在十亿量级上十分丰富，这种方法原则上可以在算力充足的情况下扩展到数百万甚至数十亿个合成用户世界，从而更广泛地覆盖不同的职业、角色、情境、环境和生产力需求。我们认为，可扩展的合成计算机创建与大规模模拟相结合，作为长周期生产力场景中智能体自我提升和智能体强化学习的基础底座，具有极大前景。

Exploration Hacking: Can LLMs Learn to Resist RL Training?

探索黑客：大型语言模型能学会抵抗强化学习训练吗？

Authors: Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.28182
Pdf link: https://arxiv.org/pdf/2604.28182
Abstract Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.
中文摘要 强化学习（RL）已成为大型语言模型（LLMs）在推理、智能体能力和对齐方面后训练的关键环节。成功的强化学习依赖于模型在训练过程中对多样化动作的充分探索，这会造成一种潜在的失败模式：模型可以在训练中策略性地改变其探索方式，以影响后续的训练结果。本文研究这种行为，称为探索黑客（exploration hacking）。首先，我们通过微调LLM使其遵循特定的刻意低表现策略，创建具有选择性强化学习抗性的模式生物；这些模型能够在智能体式生物安全和人工智能研发环境中成功抵御我们基于强化学习的能力激发，同时保持相关任务上的性能。随后，我们利用这些模式生物评估检测和缓解策略，包括监控、权重加噪和基于SFT的能力激发。最后，我们表明，当前的前沿模型在获得关于其训练背景的足够信息时，会表现出关于抑制自身探索的显式推理，且当这些信息通过环境间接获得时，发生率更高。综合来看，我们的结果表明探索黑客是强化学习在足够强大的LLM上可能出现的一种失败模式。

LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

LaST-R1：通过自适应物理潜在推理强化VLA模型的动作

Authors: Hao Chen, Jiaming Liu, Zhonghao Yan, Nuowei Han, Renrui Zhang, Chenyang Gu, Jialin Gao, Ziyu Guo, Siyuan Qian, Yinxi Wang, Peng Jia, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.28192
Pdf link: https://arxiv.org/pdf/2604.28192
Abstract Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
中文摘要 视觉-语言-动作（VLA）模型越来越多地为复杂的机器人操作引入推理机制。然而，现有方法存在一个关键局限：无论是采用受延迟和离散化困扰的显式语言推理，还是采用更具表现力的连续潜在推理，它们主要局限于静态模仿学习，限制了适应性和泛化性。虽然在线强化学习（RL）已被引入VLA以实现试错探索，但当前方法仅优化原始动作空间，绕过了底层的物理推理过程。本文提出LaST-R1，一个统一的VLA框架，在动作执行之前整合了关于物理动力学的潜在思维链（CoT）推理，并配有量身定制的强化学习后训练范式。具体来说，我们提出了潜在到动作策略优化（LAPO），一种新颖的强化学习算法，能够联合优化潜在推理过程和动作生成。通过连接推理与控制，LAPO改进了物理世界建模的表示，并增强了交互环境中的稳健性。此外，我们引入了自适应潜在CoT机制，使策略能够根据环境复杂性动态调整其推理视野。大量实验表明，LaST-R1在仅使用单样本（one-shot）监督预热的情况下，在LIBERO基准测试上实现了近乎完美的99.8%平均成功率，收敛速度和性能均显著优于以往的最先进方法。在实际部署中，LAPO后训练在四项复杂任务（包括单臂和双臂设置）中，较初始预热策略提升多达44%。最后，LaST-R1在模拟和现实环境中展现了强大的泛化能力。

Keyword: diffusion policy

There is no result