Arxiv Papers of Today

Generated: 2026-04-09 17:11:44 (UTC+8); Arxiv release time: 2026-04-09 20:00 EDT (2026-04-10 08:00 UTC+8)

30 related papers today

Keyword: reinforcement learning

Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

通过强化学习和监督微调，应用驱动的教学知识优化开源大型语言模型

Authors: Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.06385
Pdf link: https://arxiv.org/pdf/2604.06385
Abstract We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models remarkably achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
中文摘要 我们提出了一种创新的多阶段优化策略，结合强化学习（RL）和监督式微调（SFT），以提升大型语言模型（LLM）的教学知识，如EduQwen 32B-RL1、EduQwen 32B-SFT以及可选的第三阶段模型EduQwen 32B-SFT-RL2所示：（1）实现渐进式难度训练、注重挑战性示例并采用扩展推理展开的强化学习优化;（2）随后的SFT阶段，利用强化学习训练模型综合高质量训练数据，采用难度加权抽样;以及（3）可选的第二轮强化学习优化。EduQwen 32B-RL1、EduQwen 32B-SFT 和 EduQwen 32B-SFT-RL2 是一系列基于密集 Qwen3-32B 骨干网的应用驱动开源教学大型语言模型。这些模型在跨域教学知识（CDPK）基准测试上取得了足够高的准确率，能够在交互式教学法基准测试排行榜中建立新的最先进（SOTA）结果，并超越了之前基准领先的Gemini-3 Pro等更大型的专有系统。这些拥有320亿参数的密集模型表明，领域专用优化能够将中型开源LLM转变为真正的教学领域专家，性能优于更大型通用系统，同时保持负责任的教育AI部署所需的透明度、可定制性和成本效益。

A Control Barrier Function-Constrained Model Predictive Control Framework for Safe Reinforcement Learning

一个控制障碍功能约束模型用于安全强化学习的预测控制框架

Authors: Ali Umut Kaypak, Prashanth Krishnamurthy, Farshad Khorrami
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.06463
Pdf link: https://arxiv.org/pdf/2604.06463
Abstract Ensuring safety under unknown and stochastic dynamics remains a significant challenge in reinforcement learning (RL). In this paper, we propose a model predictive control (MPC)-based safe RL framework, called Probabilistic Ensembles with CBF-constrained Trajectory Sampling (PECTS), to address this challenge. PECTS jointly learns stochastic system dynamics with probabilistic neural networks (PNNs) and control barrier functions (CBFs) with Lipschitz-bounded neural networks. Safety is enforced by incorporating learned CBF constraints into the MPC formulation while accounting for the model stochasticity. This enables probabilistic safety under model uncertainty. To solve the resulting MPC problem, we utilize a sampling-based optimizer together with a safe trajectory sampling method that discards unsafe trajectories based on the learned system model and CBF. We validate PECTS in various simulation studies, where it outperforms baseline methods.
中文摘要 在未知和随机动态下确保安全性仍然是强化学习（RL）中的一大挑战。本文提出了基于模型预测控制（MPC）的安全强化学习框架，称为带有CBF约束轨迹采样的概率集合（PECTS），以应对这一挑战。PECTS与概率神经网络（PNN）共同学习随机系统动力学，并与利普希茨有界神经网络学习控制屏障函数（CBF）。安全性通过在MPC表述中融入已学到的CBF约束，同时考虑模型随机性来实现。这使得在模型不确定性下实现概率安全性。为解决由此产生的MPC问题，我们结合基于抽样的优化器和安全轨迹采样方法，基于所学系统模型和CBF剔除不安全的轨迹。我们在各种模拟研究中验证了PECTS，其表现优于基线方法。
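
The trajectory-filtering step described in the PECTS abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dynamics `step` and barrier `h` below are hand-written, deterministic stand-ins for the learned probabilistic model and learned CBF, and the discrete-time condition h(x_{k+1}) >= (1 - alpha) * h(x_k) is one common way to state a CBF constraint, assumed here for illustration.

```python
from typing import Callable, List, Sequence


def rollout(x0: float, controls: Sequence[float],
            step: Callable[[float, float], float]) -> List[float]:
    """Roll a control sequence through a (stand-in) dynamics model."""
    states = [x0]
    for u in controls:
        states.append(step(states[-1], u))
    return states


def satisfies_cbf(states: Sequence[float], h: Callable[[float], float],
                  alpha: float) -> bool:
    """Check h(x_{k+1}) >= (1 - alpha) * h(x_k) at every step of the rollout."""
    return all(h(nxt) >= (1.0 - alpha) * h(cur)
               for cur, nxt in zip(states, states[1:]))


def filter_safe(x0, candidates, step, h, alpha=0.5):
    """Keep only the sampled control sequences whose rollouts pass the CBF check."""
    return [u for u in candidates
            if satisfies_cbf(rollout(x0, u, step), h, alpha)]


# Hypothetical stand-ins: a 1D integrator and a barrier for the safe set |x| <= 1.
def step(x: float, u: float) -> float:
    return x + 0.1 * u


def h(x: float) -> float:
    return 1.0 - x * x


candidates = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [-1.0, 0.0, 1.0]]
safe = filter_safe(0.0, candidates, step, h)
```

In this toy setting the second candidate drives the state to the boundary too quickly and is discarded. A sampling-based MPC optimizer would then score only the surviving sequences.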

Discrete Flow Matching Policy Optimization

离散流匹配策略优化

Authors: Maojiang Su, Po-Chung Hsieh, Weimin Wu, Mingcheng Lu, Jiunhau Chen, Jerry Yao-Chieh Hu, Han Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.06491
Pdf link: https://arxiv.org/pdf/2604.06491
Abstract We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
中文摘要 我们介绍离散流匹配策略优化（DoMinO），这是一个统一的强化学习（RL）微调离散流匹配（DFM）模型的框架，涵盖广泛的策略梯度方法类别。我们的核心思想是将DFM抽样过程视为一个多步马尔可夫决策过程。这一视角为将奖励最大化作为稳健强化目标的微调提供了简单透明的重新表述。因此，它不仅保留了原始的DFM采样器，还避免了许多之前强化学习微调方法中使用的偏置辅助估计器和似然替代。为防止策略崩溃，我们还引入了新的全变异正则化子，使微调分布接近预训练分布。理论上，我们建立了DoMinO离散化误差的上界和正则化子的可解上界。通过实验，我们评估了DoMinO在调控DNA序列设计上的应用。DoMinO比以往最佳的奖励驱动基线实现了更强的预测增强子活性和更好的序列自然性。正则化进一步改善了与自然序列分布的比对，同时保持了强有力的功能表现。这些结果确立了 DoMinO 作为可控离散序列生成的有用框架。
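
For readers unfamiliar with the quantity behind the total-variation regularizers mentioned in the DoMinO abstract, the sketch below computes the plain total-variation distance between two discrete distributions. It does not reproduce the paper's regularizers or their upper bounds. The four-letter DNA alphabet ordering and the example probabilities are assumptions chosen for illustration.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total-variation distance: half the L1 distance between two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support size")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical per-position nucleotide distributions over (A, C, G, T).
pretrained = [0.25, 0.25, 0.25, 0.25]
fine_tuned = [0.5, 0.25, 0.25, 0.0]
penalty = total_variation(pretrained, fine_tuned)
```

A penalty of this kind is zero when the fine-tuned distribution matches the pretrained one and reaches 1 when the two have disjoint support.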

Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs

Hyperfastrl：基于超网络的强化学习，用于统一控制参数化混沌偏微分方程

Authors: Anil Sapkota, Omer San
Subjects: Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.06497
Pdf link: https://arxiv.org/pdf/2604.06497
Abstract Spatiotemporal chaos in fluid systems exhibits severe parametric sensitivity, rendering classical adjoint-based optimal control intractable because each operating regime requires recomputing the control law. We address this bottleneck with hyperFastRL, a parameter-conditioned reinforcement learning framework that leverages Hypernetworks to shift from tuning isolated controllers per-regime to learning a unified parametric control manifold. By mapping a physical forcing parameter $\mu$ directly to the weights of a spatial feedback policy, the architecture cleanly decouples parametric adaptation from spatial boundary stabilization. To overcome the extreme variance inherent to chaotic reward landscapes, we deploy a pessimistic distributional value estimation over a massively parallel environment ensemble. We evaluate three Hypernetwork functional forms, ranging from residual MLPs to periodic Fourier and Kolmogorov-Arnold (KAN) representations, on the Kuramoto-Sivashinsky equation under varying spatial forcing. All forms achieve robust stabilization. KAN yields the most consistent energy-cascade suppression and tracking across unseen parametrizations, while Fourier networks exhibit worse extrapolation variability. Furthermore, leveraging high-throughput parallelization allows us to intentionally trade a fraction of peak asymptotic reward for a 37% reduction in training wall-clock time, identifying an optimal operating regime for practical deployment in complex, parameter-varying chaotic PDEs.
中文摘要 流体系统中的时空混沌表现出极高的参数敏感性，使得基于伴随的经典最优控制变得难以解决，因为每个操作过程都需要重新计算控制定律。我们用参数条件强化学习框架hyperFastRL解决了这一瓶颈，该框架利用超网络从每个方案调整孤立控制器转向学习统一参数控制流形。通过将物理强迫参数{\mu}直接映射到空间反馈策略的权重，该架构干净利落地将参数适应与空间边界稳定解耦。为了克服混沌奖励景观固有的极端变异性，我们在一个大规模并行环境系综上部署了悲观分布价值估计。我们在不同空间强迫下的仓本-西瓦辛斯基方程评估了三种超网络功能形式，范围从残余MLP到周期性的傅里叶和柯尔莫哥洛夫-阿诺德（KAN）表示。所有形式均实现强健稳定。KAN 在未可见参数中实现最一致的能量级联抑制和跟踪，而傅里叶网络的外推变异性较差。此外，利用高通量并行化，我们可以有意地用部分峰值渐近奖励换取37%的训练时钟时间减少，从而确定了在复杂、参数变化的混沌偏微分方程中实际部署的最佳操作模式。

Train-Small Deploy-Large: Leveraging Diffusion-Based Multi-Robot Planning

训练-小部署-大：利用基于扩散的多机器人规划

Authors: Siddharth Singh, Soumee Guha, Qing Chang, Scott Acton
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.06598
Pdf link: https://arxiv.org/pdf/2604.06598
Abstract Learning-based multi-robot path planning methods struggle to scale or generalize to changes, particularly variations in the number of robots during deployment. Most existing methods are trained on a fixed number of robots and may tolerate a reduced number during testing, but typically fail when the number increases. Additionally, training such methods for a larger number of agents can be both time-consuming and computationally expensive. However, analytical methods can struggle to scale computationally or handle dynamic changes in the environment. In this work, we propose to leverage a diffusion-model-based planner capable of handling a dynamically varying number of agents. Our approach is trained on a limited number of agents and generalizes effectively to larger numbers of agents during deployment. Results show that integrating a single shared diffusion-model-based planner with dedicated inter-agent attention computation and temporal convolution enables a train-small, deploy-large paradigm with good accuracy. We validate our method across multiple scenarios and compare the performance with existing multi-agent reinforcement learning techniques and heuristic-control-based methods.
中文摘要 基于学习的多机器人路径规划方法难以实现规模化或泛化，尤其是在部署过程中机器人数量的变化。大多数现有方法训练在固定数量的机器人上，测试时可能容忍减少数量，但通常在数量增加时失败。此外，为更多代理训练此类方法既耗时又计算成本高。然而，分析方法在计算扩展或应对环境中的动态变化方面可能存在困难。在本研究中，我们提出利用一种基于扩散模型的规划器，能够处理动态变化的代理数量。我们的方法针对有限数量的代理进行训练，并在部署期间有效地推广到更多代理。结果表明，将单一共享扩散模型规划器与专用代理间注意力计算和时间卷积集成，能够实现高精度的“训练小”-大部署范式。我们在多个场景下验证了我们的方法，并比较了与现有多智能体强化学习技术和基于启发式控制的方法的表现。

TwinLoop: Simulation-in-the-Loop Digital Twins for Online Multi-Agent Reinforcement Learning

TwinLoop：在线多智能体强化学习的环中模拟数字孪生

Authors: Nan Zhang, Zishuo Wang, Shuyu Huang, Georgios Diamantopoulos, Nikos Tziritas, Panagiotis Oikonomou, Georgios Theodoropoulos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.06610
Pdf link: https://arxiv.org/pdf/2604.06610
Abstract Decentralised online learning enables runtime adaptation in cyber-physical multi-agent systems, but when operating conditions change, learned policies often require substantial trial-and-error interaction before recovering performance. To address this, we propose TwinLoop, a simulation-in-the-loop digital twin framework for online multi-agent reinforcement learning. When a context shift occurs, the digital twin is triggered to reconstruct the current system state, initialise from the latest agent policies, and perform accelerated policy improvement with simulation what-if analysis before synchronising updated parameters back to the agents in the physical system. We evaluate TwinLoop in a vehicular edge computing task-offloading scenario with changing workload and infrastructure conditions. The results suggest that digital twins can improve post-shift adaptation efficiency and reduce reliance on costly online trial-and-error.
中文摘要 去中心化在线学习使网络物理多智能体系统能够运行时适配，但当操作条件变化时，学习策略往往需要大量的反复试验才能恢复性能。为此，我们提出了TwinLoop，一种用于在线多智能体强化学习的环中模拟数字孪生框架。当上下文发生变化时，数字孪生会被触发重建当前系统状态，从最新的代理策略初始化，并通过模拟假设分析加速策略改进，然后将更新后的参数同步回物理系统的代理。我们在车辆边缘计算任务卸载场景下，在工作负载和基础设施条件变化的情况下，评估了TwinLoop。结果表明，数字孪生可以提升班后适应效率，减少对昂贵在线试错的依赖。

The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence

大卫·布莱克韦尔博士定理及其对人工智能的贡献

Authors: Napoleon Paxton
Subjects: General Literature (cs.GL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.06621
Pdf link: https://arxiv.org/pdf/2604.06621
Abstract Dr. David Blackwell was a mathematician and statistician of the first rank, whose contributions to statistical theory, game theory, and decision theory predated many of the algorithmic breakthroughs that define modern artificial intelligence. This survey examines three of his most consequential theoretical results: the Rao-Blackwell theorem, the Blackwell Approachability theorem, and the Blackwell Informativeness theorem (comparison of experiments), and traces their direct influence on contemporary AI and machine learning. We show that these results, developed primarily in the 1940s and 1950s, remain technically live across modern subfields including Markov Chain Monte Carlo inference, autonomous mobile robot navigation (SLAM), generative model training, no-regret online learning, reinforcement learning from human feedback (RLHF), large language model alignment, and information design. NVIDIA's 2024 decision to name their flagship GPU architecture (Blackwell) provides vivid testament to his enduring relevance. We also document an emerging frontier: explicit Rao-Blackwellized variance reduction in LLM RLHF pipelines, recently proposed but not yet standard practice. Together, Blackwell's theorems form a unified framework addressing information compression, sequential decision making under uncertainty, and the comparison of information sources, precisely the problems at the core of modern AI.
中文摘要 大卫·布莱克韦尔博士是一位一流的数学家和统计学家，他对统计理论、博弈论和决策理论的贡献早于定义现代人工智能的许多算法突破。本综述考察了他三项最具影响力的理论成果：Rao Blackwell定理、Blackwell可接近性定理和Blackwell信息性定理（实验比较），并追溯它们对当代人工智能和机器学习的直接影响。我们表明，这些主要在20世纪40至50年代开发的结果，在现代子领域依然有效，包括马尔可夫链蒙特卡洛推断、自主移动机器人导航（SLAM）、生成式模型训练、无遗憾在线学习、基于人类反馈的强化学习（RLHF）、大型语言模型对齐和信息设计等。英伟达2024年决定将其旗舰GPU架构命名为“Blackwell”，生动地证明了他持久的影响力。我们还记录了一个新兴前沿：LLM RLHF流水线中显式的Rao Blackwell化方差缩小，这是最近提出但尚未成为标准做法。布莱克韦尔定理共同构成了一个统一框架，解决信息压缩、不确定性下的顺序决策以及信息源的比较，正是现代人工智能核心问题所在。
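
As background for the first of the three results named in the abstract, the Rao-Blackwell theorem can be stated compactly. This is the standard textbook statement rather than anything taken from the survey itself; the symbols $\hat\theta$ (any estimator of $\theta$ with finite second moment), $T$ (a sufficient statistic), and $\hat\theta_{\mathrm{RB}}$ are notation introduced here.

```latex
\[
  \hat\theta_{\mathrm{RB}} \;=\; \mathbb{E}\bigl[\hat\theta \,\big|\, T\bigr],
  \qquad
  \mathbb{E}\bigl[(\hat\theta_{\mathrm{RB}} - \theta)^2\bigr]
  \;\le\;
  \mathbb{E}\bigl[(\hat\theta - \theta)^2\bigr].
\]
% The variance reduction follows from the law of total variance:
\[
  \operatorname{Var}(\hat\theta)
  \;=\;
  \operatorname{Var}\bigl(\mathbb{E}[\hat\theta \mid T]\bigr)
  \;+\;
  \mathbb{E}\bigl[\operatorname{Var}(\hat\theta \mid T)\bigr]
  \;\ge\;
  \operatorname{Var}(\hat\theta_{\mathrm{RB}}).
\]
```

Conditioning leaves the mean unchanged, so the bias is the same and the improvement in mean squared error comes entirely from the variance term.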

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

重新思考推理中的泛化 SFT：关于优化、数据与模型能力的条件分析

Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.06628
Pdf link: https://arxiv.org/pdf/2604.06628
Abstract A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
中文摘要 LLM培训后期的主流叙事认为，监督微调（SFT）是记忆，而强化学习（RL）则是泛化。我们重新审视这一主张，在长思考链（CoT）监督下推理SFT，发现跨域推广并非缺失，而是条件性，由优化动力学、训练数据和基模型能力共同塑造。一些报告的失败是优化不足的伪影：跨域性能先下降，随后通过延长训练恢复和提升（即“蘸水与恢复”模式），因此短训练检查点可能低估泛化效果。数据质量和结构都很重要：低质量的解决方案会大幅影响泛化，而经过验证的长CoT痕迹则能带来持续的跨域收益。模型能力至关重要：更强的模型甚至能从玩具算术游戏中内化可转移的过程模式（如回溯），而较弱的模型则模仿表面冗长。然而，这一推广是不对称的：推理提升，安全性下降，这使问题从推理SFT是否推广到何种条件下及代价。

KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

KD-MARL：多智能体强化学习中的资源感知知识蒸馏

Authors: Monirul Islam Pavel, Siyi Hu, Muhammad Anwar Masum, Mahardhika Pratama, Ryszard Kowalczyk, Zehong Jimmy Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.06691
Pdf link: https://arxiv.org/pdf/2604.06691
Abstract Real-world deployment of multi-agent reinforcement learning (MARL) systems is fundamentally constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large-scale models that are impractical for edge devices or embedded platforms. Knowledge distillation (KD) offers a promising path toward resource-aware execution, but existing KD methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities. We propose resource-aware Knowledge Distillation for Multi-Agent Reinforcement Learning (KD-MARL), a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic, relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures, allowing each agent's model capacity to match its observation complexity, which is crucial for efficient execution under partial or limited observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD-MARL achieves high performance retention while substantially reducing computational cost. Across standard multi-agent benchmarks, KD-MARL retains over 90 percent of expert performance while reducing computational cost by up to 28.6 times in FLOPs. The proposed approach achieves expert-level coordination and preserves it through structured distillation, enabling practical MARL deployment across resource-constrained onboard platforms.
中文摘要 多智能体强化学习MARL系统的实际部署在计算内存和推理时间有限的基础上受限。虽然专家政策实现了高性能，但它们依赖于昂贵的决策周期和大规模模型，这对边缘设备或嵌入式平台来说并不切实际。知识蒸馏KD为实现资源感知执行提供了有前景的路径，但MARL现有KD方法仅专注于动作模仿，常常忽视协调结构，并假设代理能力统一。我们提出了资源感知型知识蒸馏，用于多智能体强化学习 KD MARL，这是一个两阶段框架，将协调行为从集中式专家转移到轻量级去中心化的学生智能体。学生政策在没有批评者的情况下进行培训，而是依赖精炼的优势信号和结构化的政策监督，以在异质且有限的观察条件下保持协调。我们的方法将专家策略中的动作级行为和结构协调模式转移过来，同时支持异构学生架构，使每个智能体模型的能力能够匹配其观测复杂度，这对于在部分或有限可观测性和有限的机载资源下高效执行至关重要。在SMAC和MPE基准测试上的大量实验表明，KD MARL在大幅降低计算成本的同时实现了高性能保留。在标准多代理基准测试中，KD MARL保留了超过90%的专家性能，同时计算成本降低了多达28.6倍的FLOP。该方法实现专家级协调，并通过结构化蒸馏保持协调，实现跨资源有限的机载平台的实际MARL部署。

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

言行一致：通过多模态代理策略优化弥合推理与行动之间的差距，以图像思考

Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuchen Zhou, Xiaobo Xia, Yuanyu Wan, Lijun Zhang, Tat-Seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.06777
Pdf link: https://arxiv.org/pdf/2604.06777
Abstract Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
中文摘要 多模态大型语言模型（MLLM）的最新进展激励模型通过在多回合推理中主动调用视觉工具来“用图像思考”。常见的基于结果的强化学习（RL）实践忽视了文本合理性常常掩盖执行失败的事实，这意味着模型在其能动推理轨迹内执行不精确或无关的视觉动作时，可能表现出直觉文本推理。这种推理-动作差异引入了噪声，这些噪声在整个多回合推理过程中不断积累，严重削弱了模型的多模态推理能力，甚至可能导致训练崩溃。本文介绍了多模态代理策略优化（MAPO），弥合文本推理与模型在其多模态思维链（MCoT）中生成的视觉动作之间的鸿沟。具体来说，MAPO要求模型为通过工具使用获得的视觉内容生成显式的文本描述。随后，我们采用一种新颖的优势估计，将这些描述与实际观察之间的语义对齐与任务奖励相结合。提供了理论发现来支持MAPO背后的理由，MAPO本质上减少了梯度的方差，并通过大量实验证明我们的方法在多个视觉推理基准测试中实现了更优异的性能。

Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems

多模态车辆到基础设施系统的等变多智能体强化学习

Authors: Charbel Bou Chaaya, Mehdi Bennis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.06914
Pdf link: https://arxiv.org/pdf/2604.06914
Abstract In this paper, we study a vehicle-to-infrastructure (V2I) system where distributed base stations (BSs) acting as road-side units (RSUs) collect multimodal (wireless and visual) data from moving vehicles. We consider a decentralized rate maximization problem, where each RSU relies on its local observations to optimize its resources, while all RSUs must collaborate to guarantee favorable network performance. We recast this problem as a distributed multi-agent reinforcement learning (MARL) problem, by incorporating rotation symmetries in terms of vehicles' locations. To exploit these symmetries, we propose a novel self-supervised learning framework where each BS agent aligns the latent features of its multimodal observation to extract the positions of the vehicles in its local region. Equipped with this sensing data at each RSU, we train an equivariant policy network using a graph neural network (GNN) with message passing layers, such that each agent computes its policy locally, while all agents coordinate their policies via a signaling scheme that overcomes partial observability and guarantees the equivariance of the global policy. We present numerical results carried out in a simulation environment, where ray-tracing and computer graphics are used to collect wireless and visual data. Results show the generalizability of our self-supervised and multimodal sensing approach, achieving more than two-fold accuracy gains over baselines, and the efficiency of our equivariant MARL training, attaining more than 50% performance gains over standard approaches.
中文摘要 本文研究了一种车辆到基础设施（V2I）系统，分布式基站（BS）作为路边单元（RSU）收集多模态（无线和视觉）数据。我们考虑一个去中心化速率最大化问题，每个RSU依赖本地观测数据优化资源，而所有RSU必须协作以确保有利的网络性能。我们将该问题重新构思为分布式多智能体强化学习（MARL）问题，通过结合车辆位置的旋转对称性。为利用这些对称性，我们提出了一种新的自监督学习框架，每个BS代理对齐其多模态观察的潜在特征，以提取车辆在其局部区域的位置。在每个RSU配备这些传感数据后，我们使用带有消息传递层的图神经网络（GNN）训练一个等变策略网络，使每个代理在本地计算其策略，而所有代理通过一种克服部分可观测性并保证全局策略等异性的信令方案协调其策略。我们展示了在仿真环境中进行的数值结果，利用光线追踪和计算机图形收集无线和视觉数据。结果显示，我们的自监督和多模态传感方法具有普遍性，准确率比基线提升超过两倍，且等变MARL训练效率高，性能提升超过50%。

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

FP4探索，BF16列车：通过高效推广规模进行扩散强化学习

Authors: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.06916
Pdf link: https://arxiv.org/pdf/2604.06916
Abstract Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
中文摘要 基于强化学习的后期训练最近成为一种有前景的范式，用于将文本到图像扩散模型与人类偏好对齐。最新研究显示，扩大推广组规模带来了显著的性能提升，表明对齐仍有很大空间。然而，在大规模基础扩散模型（如FLUX.1-12B）上进行扩展会带来沉重的计算负担。为缓解这一瓶颈，我们探讨将FP4量化整合进扩散强化学习推广。然而，我们发现，朴素的量化管道本质上会带来性能下降的风险。为了解决效率与训练完整性之间的困境，我们提出了Sol-RL（光速RL），这是一种创新的FP4赋能的两阶段强化学习框架。首先，我们利用高通量的NVFP4部署来生成庞大的候选池，并提取高度对比的子集。其次，我们以BF16精度重新生成这些选定样本，并专门针对它们优化策略。通过将候选探索与策略优化脱钩，Sol-RL将扩展扩展的算法机制与NVFP4的系统级吞吐量提升整合起来。这种协同的算法-硬件设计有效加快了推广阶段，同时保留了高保真样本用于优化。我们通过实证证明，我们的框架在充分利用FP4运算带来的吞吐量提升的同时，依然保持了BF16精密流水线的训练完整性。在SANA、FLUX.1和SD3.5-L的广泛实验证明，我们的方法在多个指标上实现了卓越的比对性能，同时将训练收敛速度提升高达4.64美元，释放了大规模推广规模的强大优势，成本仅为其一小部分。
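
The first stage of Sol-RL, as the abstract describes it, draws a large candidate pool cheaply and keeps only a highly contrastive subset for high-precision regeneration. The sketch below shows one plausible selection rule: keep the k lowest-reward and k highest-reward candidates. The abstract does not specify the rule, so this is an assumption, and the function name `select_contrastive` and the example rewards are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_contrastive(rewards: Sequence[float], k: int) -> Tuple[List[int], List[int]]:
    """Return (indices of the k lowest rewards, indices of the k highest rewards)."""
    if k <= 0 or 2 * k > len(rewards):
        raise ValueError("need 0 < k <= len(rewards) / 2")
    # Sort candidate indices by reward, ascending; ties keep their original order.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k], order[-k:]


# Hypothetical rewards scored on low-precision (e.g. FP4) rollouts.
pool_rewards = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2]
low, high = select_contrastive(pool_rewards, k=2)
# Only the prompts/seeds at indices `low + high` would then be regenerated
# in higher precision and used for the policy update.
```

Selecting from both tails keeps the reward spread within the retained group large, which is the property a group-relative policy update relies on.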

POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

POS-ISP：任务感知ISP的序列层面流水线优化

Authors: Jiyun Won, Heemin Yang, Woohyeok Kim, Jungseul Ok, Sunghyun Cho
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.06938
Pdf link: https://arxiv.org/pdf/2604.06938
Abstract Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at this https URL
中文摘要 近期研究探索了通过组合预定义模块并调整以适应特定任务目标，优化图像信号处理（ISP）流水线以满足各种任务。然而，联合优化模块序列和参数仍具挑战性。现有方法依赖神经结构搜索（NAS）或逐步强化学习（RL），但NAS存在训练与推断不匹配的问题，而逐步强化学习则导致训练不稳定，且由于分阶段决策，计算开销高。我们提出了POS-ISP，一个序列级强化学习框架，将模块化ISP优化作为一个全局序列预测问题。我们的方法通过一次前向传递预测整个模块序列及其参数，并通过终端任务奖励优化流水线，消除了中间监督和冗余执行的需求。跨多个下游任务的实验表明，POS-ISP在降低计算成本的同时提升任务性能，凸显了序列级优化作为任务感知ISP稳定高效范式的体现。项目页面可在此 https URL 访问。

A First Guess is Rarely the Final Answer: Learning to Search in the Travelling Salesperson Problem

第一次猜测很少是最终答案：学会在旅行推销员问题中寻找答案

Authors: Andoni Irazusta Garmendia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.06940
Pdf link: https://arxiv.org/pdf/2604.06940
Abstract Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, with existing methods still falling short of robust, scalable performance. We argue that a key reason is design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization): a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly $n$ edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning to short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.
中文摘要 大多数针对旅行销售员问题（TSP）的神经求解器都被训练为输出单一解，尽管从业者很少止步于此：在测试时，他们通常会在抽样或事后搜索上花费额外的计算。这自然引出了一个问题：搜索过程本身能否被学习？神经改进方法通过学习对候选解施加局部修改的策略，从而在改进轨迹中累积收益，从而从这一视角出发。然而，TSP的学习改进仍相对不成熟，现有方法仍未能达到稳健且可扩展的性能。我们认为一个关键原因是设计不匹配：许多方法重复使用从单一解方法继承的状态表示、架构选择和训练配方，而非围绕局部搜索机制构建。这种不匹配促使了NICO-TSP（组合优化神经改进）的诞生：一个用于2-OPT的TSP改进框架。NICO-TSP表示当前巡回，使用恰好$n$的边标记与邻域算子对齐，直接计分2-OPT移动而不使用巡回位置编码，并通过两阶段过程训练：模仿学习以实现短视野的最优轨迹，随后是无批评的基于群体的强化学习，进行更长时间的推广。在计算匹配评估中，提升基于搜索步骤和墙钟时间，NICO-TSP持续提供更强且明显更高效的提升，优于先前学习和启发式搜索基线，更可靠地推广到更大的非分布实例，既是经典局部搜索的竞争替代品，也是建设性求解器的强大测试时间优化模块。
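
The 2-opt neighborhood that NICO-TSP scores can be sketched in a few lines. The code below implements the classical move (remove two edges, reconnect by reversing the segment between them) together with a greedy best-improvement step. That greedy step is a classical baseline standing in for the learned policy, not the paper's method, and the four-city instance is a made-up example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tour_length(tour: Sequence[int], pts: Sequence[Point]) -> float:
    """Length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def two_opt_move(tour: List[int], i: int, j: int) -> List[int]:
    """Drop the edges leaving positions i and j (the latter wraps to the start
    when j is the last position) and reconnect by reversing tour[i+1 .. j]."""
    if not 0 <= i < j < len(tour):
        raise ValueError("require 0 <= i < j < len(tour)")
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def best_two_opt_step(tour: List[int], pts: Sequence[Point]) -> List[int]:
    """Greedy baseline: apply the single 2-opt move that shortens the tour most."""
    best, best_len = tour, tour_length(tour, pts)
    for i in range(len(tour) - 1):
        for j in range(i + 1, len(tour)):
            cand = two_opt_move(tour, i, j)
            cand_len = tour_length(cand, pts)
            if cand_len < best_len - 1e-12:
                best, best_len = cand, cand_len
    return best


# Hypothetical instance: the corners of a unit square, visited with a crossing.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossed = [0, 2, 1, 3]
improved = best_two_opt_step(crossed, square)
```

A learned improvement policy would replace the exhaustive scoring loop with a network that ranks the candidate (i, j) pairs directly.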

Sustainable Transfer Learning for Adaptive Robot Skills

适应性机器人技能的可持续迁移学习

Authors: Khalil Abuibaid, Vinit Hegiste, Nigora Gafur, Achim Wagner, Martin Ruskowski
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.06943
Pdf link: https://arxiv.org/pdf/2604.06943
Abstract Learning robot skills from scratch is often time-consuming, while reusing data promotes sustainability and improves sample efficiency. This study investigates policy transfer across different robotic platforms, focusing on a peg-in-hole task using reinforcement learning (RL). Policy training is carried out on two different robots. Their policies are transferred and evaluated for zero-shot, fine-tuning, and training from scratch. Results indicate that zero-shot transfer leads to lower success rates and relatively longer task execution times, while fine-tuning significantly improves performance with fewer training time-steps. These findings highlight that policy transfer with adaptation techniques improves sample efficiency and generalization, reducing the need for extensive retraining and supporting sustainable robotic learning.
中文摘要 从零开始学习机器人技能通常耗时，而重复利用数据有助于可持续性并提高样本效率。本研究探讨了不同机器人平台间的策略转移，重点是利用强化学习（RL）进行钉入式任务。政策培训在两种不同的机器人上进行。他们的政策被转移并评估，进行零射击、微调和从零开始训练。结果显示，零次传输导致较低的成功率和相对更长的任务执行时间，而微调则通过更少的训练步长显著提升了性能。这些发现强调，采用适应技术进行策略转移可以提升样本效率和泛化，减少大量再培训的需求，并支持可持续的机器人学习。

Learning-Based Strategy for Composite Robot Assembly Skill Adaptation

基于学习的复合机器人组装技能适应策略

Authors: Khalil Abuibaid, Aleksandr Sidorenko, Achim Wagner, Martin Ruskowski
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.06949
Pdf link: https://arxiv.org/pdf/2604.06949
Abstract Contact-rich robotic skills remain challenging for industrial robots due to tight geometric tolerances, frictional variability, and uncertain contact dynamics, particularly when using position-controlled manipulators. This paper presents a reusable and encapsulated skill-based strategy for peg-in-hole assembly, in which adaptation is achieved through Residual Reinforcement Learning (RRL). The assembly process is represented using composite skills with explicit pre-, post-, and invariant conditions, enabling modularity, reusability, and well-defined execution semantics across task variations. Safety and sample efficiency are promoted through RRL by restricting adaptation to residual refinements within each skill during contact-rich interactions, while the overall skill structure and execution flow remain invariant. The proposed approach is evaluated in MuJoCo simulation on a UR5e robot equipped with a Robotiq gripper and trained using SAC and JAX. Results demonstrate that the proposed formulation enables robust execution of assembly skills, highlighting its suitability for industrial automation.
中文摘要 由于几何公差严格、摩擦变异性和接触动力学不确定，尤其是在使用位置控制机械臂时，工业机器人在实现接触丰富的机器人技能方面依然充满挑战。本文提出了一种可重复使用且封装的基于技能的孔中钉组装策略，其中适应通过残余强化学习（RRL）实现。组装过程通过复合技能表示，明确了前置、后期和不变条件，实现了模块化、可重用性和在任务变体中明确定义的执行语义。通过RRL，在接触丰富互动中限制对每个技能内残余细化的适应，从而提升安全性和样本效率，同时整体技能结构和执行流程保持不变。该方法在配备Robotiq夹具的UR5e机器人上通过MuJoCo模拟评估，并使用SAC和JAX进行训练。结果表明，所提配方能够稳健地执行组装技能，凸显其在工业自动化中的适用性。
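The residual idea in this abstract, where a learned correction is added on top of a fixed base skill, can be sketched as below. This is only an assumed formulation: the function name, the per-dimension clipping, and the `max_residual` bound are hypothetical and are not taken from the paper.

```python
def residual_action(base_action, residual, max_residual):
    """Combine a fixed base-skill command with a bounded learned correction.

    Clipping each residual component to [-max_residual, max_residual] is an
    assumption made here to illustrate how adaptation can be kept small.
    """
    clipped = [max(-max_residual, min(max_residual, r)) for r in residual]
    return [b + r for b, r in zip(base_action, clipped)]


# The base skill proposes a motion; the policy may only nudge it slightly.
command = residual_action([0.10, 0.20], [0.50, -0.01], max_residual=0.05)
```

Bounding the residual keeps the executed command close to the base skill, which is one common way to limit unsafe exploration during contact.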

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

MAR-GRPO：用于AR扩散混合图像生成的稳定GRPO

Authors: Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.06966
Pdf link: https://arxiv.org/pdf/2604.06966
Abstract Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: this https URL.
中文摘要 强化学习（RL）已被成功应用于自回归（AR）和扩散模型。然而，由于交错推断和噪声较大的对数概率估计，将强化学习扩展到混合AR扩散框架仍具挑战性。本研究中，我们研究了掩蔽自回归模型（MAR），并证明扩散头在训练动力学中起关键作用，常引入噪声梯度，导致不稳定性和早期性能饱和。为解决这一问题，我们提出了一个稳定的强化学习框架用于MAR系统。我们引入了多轨迹期望（MTE），通过对多个扩散轨迹进行平均来估算优化方向，从而减少扩散引起的梯度噪声。为避免过度平滑，我们进一步估计多条轨迹的代币不确定性，并仅对前k%不确定代币进行多轨迹优化。此外，我们还引入了一致性意识的代币选择策略，过滤掉与最终生成内容不匹配的增强现实代币。跨多个基准测试的大量实验表明，我们的方法在视觉质量、训练稳定性和空间结构理解方面，持续优于基线GRPO和前强化学习模型。代码可在以下 https URL 获取。

EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration

EmoMAS：情绪感知多智能体系统，用于高风险边缘部署谈判，采用贝叶斯编排

Authors: Yunbo Long, Yunhan Liu, Liming Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07003
Pdf link: https://arxiv.org/pdf/2604.07003
Abstract Large language models (LLMs) have been widely used for automated negotiation, but their high computational cost and privacy risks limit deployment in privacy-sensitive, on-device settings such as mobile assistants or rescue robots. Small language models (SLMs) offer a viable alternative, yet struggle with the complex emotional dynamics of high-stakes negotiation. We introduce EmoMAS, a Bayesian multi-agent framework that transforms emotional decision-making from reactive to strategic. EmoMAS leverages a Bayesian orchestrator to coordinate three specialized agents: game-theoretic, reinforcement learning, and psychological coherence models. The system fuses their real-time insights to optimize emotional state transitions while continuously updating agent reliability based on negotiation feedback. This mixture-of-agents architecture enables online strategy learning without pre-training. We further introduce four high-stakes, edge-deployable negotiation benchmarks across debt, healthcare, emergency response, and educational domains. Through extensive agent-to-agent simulations across all benchmarks, both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models in negotiation performance while balancing ethical behavior. These results show that strategic emotional intelligence is also the key driver of negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi-agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high-stakes edge deployment.
中文摘要 大型语言模型（LLMs）已被广泛用于自动协商，但其高计算成本和隐私风险限制了在隐私敏感的设备上环境的应用，如移动助手或救援机器人。小型语言模型（SLMs）提供了可行的替代方案，但在高风险谈判中复杂的情感动态中存在困难。我们介绍了EmoMAS，一种贝叶斯多智能体框架，将情绪决策从被动转变为战略性。EmoMAS利用贝叶斯编排器协调三种专门代理：博弈论模型、强化学习模型和心理相干模型。该系统融合了他们的实时洞察，优化情绪状态的转变，同时根据谈判反馈不断更新代理的可靠性。这种智能体混合架构使得无需预训练即可在线进行战略学习。我们还进一步介绍了四个高风险、可边缘部署的谈判基准，涵盖债务、医疗、应急响应和教育领域。通过跨所有基准测试的大量代理间模拟，配备EmoMAS的SLM和LLM在谈判性能上始终超越所有基线模型，同时在伦理行为上保持平衡。这些结果表明，战略情商也是谈判成功的关键驱动力。通过将情感表达视为贝叶斯多智能体优化框架中的战略变量，EmoMAS建立了适用于高风险边缘部署的高效、私密且自适应谈判AI的新范式。

Predictive Representations for Skill Transfer in Reinforcement Learning

强化学习中技能转移的预测表征

Authors: Ruben Vereecken, Luke Dickens, Alessandra Russo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07016
Pdf link: https://arxiv.org/pdf/2604.07016
Abstract A key challenge in scaling up Reinforcement Learning is generalizing learned behaviour. Without the ability to carry forward acquired knowledge an agent is doomed to learn each task from scratch. In this paper we develop a new formalism for transfer by virtue of state abstraction. Based on task-independent, compact observations (outcomes) of the environment, we introduce Outcome-Predictive State Representations (OPSRs), agent-centered and task-independent abstractions that are made up of predictions of outcomes. We show formally and empirically that they have the potential for optimal but limited transfer, then overcome this trade-off by introducing OPSR-based skills, i.e. abstract actions (based on options) that can be reused between tasks as a result of state abstraction. In a series of empirical studies, we learn OPSR-based skills from demonstrations and show how they speed up learning considerably in entirely new and unseen tasks without any pre-processing. We believe that the framework introduced in this work is a promising step towards transfer in RL in general, and towards transfer through combining state and action abstraction specifically.
中文摘要 扩大强化学习规模的一个关键挑战是推广已学行为。如果无法传承所学知识，代理注定只能从零开始学习每一项任务。本文通过状态抽象发展了一种新的转移形式主义。基于对环境的任务无关、紧凑的观察（结果），我们引入了结果预测状态表征（OPSRs），即以代理为中心且与任务无关的抽象，由结果预测组成。我们通过形式和实证方法证明它们具有最优但有限的转移潜力，随后通过引入基于OPSR的技能，即基于选项的抽象动作，这些动作可通过状态抽象在任务间重复使用来克服这一权衡。在一系列实证研究中，我们通过演示学习基于OPSR的技能，并展示了它们如何在完全新颖且未曾预处理的任务中显著加快学习速度。我们认为，本工作引入的框架是实现强化学习转移的有希望的一步，特别是通过结合状态抽象和动作抽象实现迁移。

Production-Ready Automated ECU Calibration using Residual Reinforcement Learning

使用残差强化学习的生产准备自动化ECU校准

Authors: Andreas Kampmeier, Kevin Badalian, Lucas Koch, Sung-Yong Lee, Jakob Andert
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07059
Pdf link: https://arxiv.org/pdf/2604.07059
Abstract Electronic Control Units (ECUs) have played a pivotal role in transforming motorcars of yore into the modern vehicles we see on our roads today. They actively regulate the actuation of individual components and thus determine the characteristics of the whole system. In this, the behavior of the control functions heavily depends on their calibration parameters which engineers traditionally design by hand. This is taking place in an environment of rising customer expectations and steadily shorter product development cycles. At the same time, legislative requirements are increasing while emission standards are getting stricter. Considering the number of vehicle variants on top of all that, the conventional method is losing its practical and financial viability. Prior work has already demonstrated that optimal control functions can be automatically developed with reinforcement learning (RL); since the resulting functions are represented by artificial neural networks, they lack explainability, a circumstance which renders them challenging to employ in production vehicles. In this article, we present an explainable approach to automating the calibration process using residual RL which follows established automotive development principles. Its applicability is demonstrated by means of a map-based air path controller in a series control unit using a hardware-in-the-loop (HiL) platform. Starting with a sub-optimal map, the proposed methodology quickly converges to a calibration which closely resembles the reference in the series ECU. The results prove that the approach is suitable for the industry where it leads to better calibrations in significantly less time and requires virtually no human intervention.
中文摘要 电子控制单元（ECU）在将昔日汽车转变为我们今天道路上现代化车辆方面发挥了关键作用。它们主动调节单个组件的驱动，从而决定整个系统的特性。因此，控制功能的行为高度依赖于工程师传统手工设计的校准参数。这发生在客户期望不断提升和产品开发周期持续缩短的环境中。与此同时，立法要求不断提高，排放标准也在不断严格。考虑到车辆的种类繁多，传统方式正逐渐失去其实用性和经济价值。先前的研究已经证明，通过强化学习（RL）可以自动开发最优控制函数;由于产生的功能由人工神经网络表示，缺乏可解释性，这使得它们在生产载体中难以应用。本文提出了一种可解释的自动化校准过程方法，利用残差强化学习，遵循既定的汽车开发原则。其适用性通过基于地图的航路控制器在串联控制单元中展示，采用硬件在环（HiL）平台。从一个次优映射开始，所提出的方法很快收敛到与系列ECU中参考极为相似的校准结果。结果证明该方法适合行业，能在显著缩短时间内实现更优校准，且几乎无需人工干预

Epistemic Robust Offline Reinforcement Learning

认知稳健离线强化学习

Authors: Abhilash Reddy Chenreddy, Erick Delage
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07072
Pdf link: https://arxiv.org/pdf/2604.07072
Abstract Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.
中文摘要 离线强化学习从固定数据集中学习策略，无需额外的环境交互。这一背景下的一个关键挑战是认知不确定性，源于有限或偏颇的数据覆盖，尤其是在行为政策系统性地避免某些行为时。这可能导致价值估算不准确和概括不可靠。像SAC-N这样的基于集合的方法通过保守估计Q值来缓解这一问题，但它们需要大型集合，且常常将认识论与偶然不确定性混为一谈。为解决这些局限性，我们提出了一个统一且可推广的框架，用Q值上的紧致不确定性集取代离散集合。我们进一步引入了基于Epinet的模型，直接塑造不确定性集合，以优化在稳健Bellman目标下的累积奖励，而无需依赖集合。我们还提出了在风险敏感行为策略下评估离线强化学习算法的基准，并证明我们的方法在基于集合的基线下，无论是表格状态域还是连续状态域，都能实现更好的鲁棒性和泛化性。
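The ensemble-minimum baseline that the abstract attributes to SAC-N can be illustrated with a small sketch. It shows only the pessimistic value estimate, not the uncertainty-set method the paper proposes; the function names and the toy Q-values are hypothetical.

```python
def conservative_q(q_ensemble):
    """Pessimistic Q-value per action: the minimum over ensemble members."""
    num_actions = len(q_ensemble[0])
    return [min(member[a] for member in q_ensemble) for a in range(num_actions)]


def mean_q(q_ensemble):
    """Average Q-value per action, shown for comparison with the pessimistic estimate."""
    num_actions = len(q_ensemble[0])
    return [sum(member[a] for member in q_ensemble) / len(q_ensemble) for a in range(num_actions)]


def greedy_action(q_values):
    """Index of the action with the highest value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


# Three critics agree on action 0 but disagree strongly on action 1,
# as might happen when the dataset rarely contains action 1.
ensemble = [[1.0, 2.5], [1.1, 0.2], [0.9, 3.0]]
pessimistic_choice = greedy_action(conservative_q(ensemble))
average_choice = greedy_action(mean_q(ensemble))
```

In this toy case the averaged estimate prefers the poorly covered action, while the minimum prefers the action on which the critics agree.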

STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

STRIDE-ED：一个基于策略的共情对话系统逐步推理框架

Authors: Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07100
Pdf link: https://arxiv.org/pdf/2604.07100
Abstract Empathetic dialogue requires not only recognizing a user's emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.
中文摘要 同理心对话不仅需要识别用户的情绪状态，还需要在响应生成过程中做出策略意识、情境敏感的决策。然而，缺乏全面的同理心策略框架、明确的任务对齐多阶段推理以及高质量的策略感知数据，根本限制了现有方法，阻碍了它们有效地将同理心对话建模为复杂、多阶段的认知和决策过程。为应对这些挑战，我们提出了STRIDE-ED，这是一个基于STRategy、可解释且采用DEep的推理框架，通过结构化、策略条件化推理来建模同理心对话。为支持有效学习，我们开发了一套策略感知型数据优化流程，集成基于大型语言模型的注释、多模型一致性加权评估和动态抽样，构建与同理心策略相匹配的高质量训练数据。此外，我们采用两阶段训练范式，结合监督微调与多目标强化学习，更好地将模型行为与目标情绪、同理心策略和反应形式对齐。大量实验表明，STRIDE-ED能够推广到多种开源大型语言模型，并且在自动指标和人工评估方面始终优于现有方法。

Energy Saving for Cell-Free Massive MIMO Networks: A Multi-Agent Deep Reinforcement Learning Approach

无单元大规模MIMO网络的节能：多智能体深度强化学习方法

Authors: Qichen Wang, Keyu Li, Ozan Alp Topal, Özlem Tugfe Demir, Mustafa Ozger, Cicek Cavdar
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07133
Pdf link: https://arxiv.org/pdf/2604.07133
Abstract This paper focuses on energy savings in downlink operation of cell-free massive MIMO (CF mMIMO) networks under dynamic traffic conditions. We propose a multi-agent deep reinforcement learning (MADRL) algorithm that enables each access point (AP) to autonomously control antenna re-configuration and advanced sleep mode (ASM) selection. After the training process, the proposed framework operates in a fully distributed manner, eliminating the need for centralized control and allowing each AP to dynamically adjust to real-time traffic fluctuations. Simulation results show that the proposed algorithm reduces power consumption (PC) by 56.23% compared to systems without any energy-saving scheme and by 30.12% relative to a non-learning mechanism that only utilizes the lightest sleep mode, with only a slight increase in drop ratio. Moreover, compared to the widely used deep Q-network (DQN) algorithm, it achieves a similar PC level but with a significantly lower drop ratio.
中文摘要 本文重点关注在动态流量条件下，无单元群大规模MIMO（CF mMIMO）网络下行运行中的节能效果。我们提出了一种多智能体深度强化学习（MADRL）算法，使每个接入点（AP）能够自主控制天线重配置和高级睡眠模式（ASM）选择。训练过程结束后，拟议框架实现全分布式运行，无需集中控制，允许每个接入点动态调整以应对实时流量波动。模拟结果显示，所提算法相比无节能方案的系统降低了56.23%的功耗（PC），相比仅使用最轻睡眠模式的非学习机制，降低了30.12%，且仅略有增加掉落率。此外，与广泛使用的深度Q网络（DQN）算法相比，它实现了类似的PC级别，但丢弃率显著降低。

Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing

多回合推理大型语言模型用于移动边缘计算中的任务卸载

Authors: Ning Yang, Chuangxin Cheng, Haijun Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07148
Pdf link: https://arxiv.org/pdf/2604.07148
Abstract Emerging computation-intensive applications impose stringent latency requirements on resource-constrained mobile devices. Mobile Edge Computing (MEC) addresses this challenge through task offloading. However, designing effective policies remains difficult due to dynamic task arrivals, time-varying channels, and the spatio-temporal coupling of server queues. Conventional heuristics lack adaptability, while Deep Reinforcement Learning (DRL) suffers from limited generalization and architectural rigidity, requiring retraining when network topology changes. Although Large Language Models (LLMs) offer semantic reasoning capabilities, standard Supervised Fine-Tuning (SFT) yields myopic policies that greedily minimize immediate latency without accounting for long-term system evolution. To address these limitations, we propose COMLLM, a generative framework that enables foresighted decision-making in MEC systems. COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism, which performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. By incorporating these rollouts into the reward design, the framework captures the long-term impact of current decisions on future system states. Experimental results demonstrate that COMLLM achieves near-optimal latency and improved load-balancing fairness. Notably, it exhibits zero-shot topological scalability, allowing a model trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.
中文摘要 新兴的计算密集型应用对资源有限的移动设备施加了严格的延迟要求。移动边缘计算（MEC）通过任务卸载来应对这一挑战。然而，由于任务的动态到达、时间变化的通道以及服务器队列的时空耦合，设计有效的策略仍然很困难。传统启发式缺乏适应性，而深度强化学习（DRL）则存在泛化有限和架构刚性，网络拓扑变化时需要重新训练。尽管大型语言模型（LLMs）提供语义推理能力，标准的监督微调（SFT）却产生了目光短浅的策略，贪婪地最小化即时延迟，却未考虑系统的长期演进。为解决这些局限性，我们提出了COMLLM，一种生成框架，使MEC系统能够实现前瞻性决策。COMLLM将组相对策略优化（GRPO）与前瞻协同仿真（LACS）机制集成，后者在联合建模服务器队列动态的同时执行多步蒙特卡洛推广。通过将这些推广纳入奖励设计，框架捕捉了当前决策对未来系统状态的长期影响。实验结果表明，COMLLM实现了近乎最优的延迟和更好的负载均衡公平性。值得注意的是，它表现出零样本拓扑可扩展性，使得在小尺度网络上训练的模型能够推广到更大且未见的拓扑而无需重新训练，表现优于SFT、DRL和启发式基线。

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

链条中的推理，树中学习：多回合代理策略优化的自我纠正与嫁接

Authors: Yu Li, Sizhe Tang, Tian Lan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07165
Pdf link: https://arxiv.org/pdf/2604.07165
Abstract Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionately impact the reasoning outcome. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. This enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a new notion of variance-reduced relative advantage at the step level. Using the Cognitive Tree, we also develop In-Context Thought Grafting to synthesize corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley-Terry type of surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.
中文摘要 大型语言模型代理的强化学习常常受限于多步推理任务中奖励稀疏。现有方法如群相对策略优化（Group Relative Policy Optimization）将抽样轨迹视为独立链，对每条链中的所有步骤均分配统一的功劳，忽视可能对推理结果产生不成比例影响的关键步骤。本文提出了T-STAR（树结构自学代理整流）框架，该框架能够恢复看似独立轨迹中潜在的相关奖励结构。具体来说，我们通过识别并合并功能相似的步骤/节点，将这些轨迹整合成一个统一的认知树。它支持一种内省式估值机制，将轨迹级奖励反向传播至树中，从而获得步级方差降低相对优势的新概念。利用认知树，我们还开发了情境内思维嫁接技术，通过对比关键分歧点/步骤的成功与失败分支，综合纠正推理。我们提出的外科政策优化方案，利用布拉德利-特里型外科损失集中在这些关键点/步骤的丰富策略梯度信息。在具身、交互、推理和规划基准测试中的大量实验表明，T-STAR 在强基准基础上持续取得改进，尤其是在需要扩展推理链的任务上。
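The baseline behavior criticized in this abstract, where every step of a trajectory receives the same group-relative credit, can be sketched as below. This illustrates the commonly described GRPO-style normalization, not T-STAR itself; the function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_step_credit(advantage, num_steps):
    """Assign the same trajectory-level advantage to every step of that trajectory."""
    return [advantage] * num_steps


# Four rollouts of the same task: two succeed (reward 1), two fail (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
step_credit = uniform_step_credit(advantages[0], num_steps=3)
```

Every step of a successful rollout gets the same positive credit here, including steps that did not matter, which is the limitation the abstract sets out to address with step-level advantages.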

Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization

Smart Commander：舰队级PHM决策优化的分层强化学习框架

Authors: Yong Si, Mingfei Lu, Jing Li, Yang Hu, Guijiang Li, Yueheng Song, Zhaokui Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.07171
Pdf link: https://arxiv.org/pdf/2604.07171
Abstract Decision-making in military aviation Prognostics and Health Management (PHM) faces significant challenges due to the "curse of dimensionality" in large-scale fleet operations, combined with sparse feedback and stochastic mission profiles. To address these issues, this paper proposes Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework designed to optimize sequential maintenance and logistics decisions. The framework decomposes the complex control problem into a two-tier hierarchy: a strategic General Commander manages fleet-level availability and cost objectives, while tactical Operation Commanders execute specific actions for sortie generation, maintenance scheduling, and resource allocation. The proposed approach is validated within a custom-built, high-fidelity discrete-event simulation environment that captures the dynamics of aircraft configuration and support this http URL integrating layered reward shaping with planning-enhanced neural networks, the method effectively addresses the difficulty of sparse and delayed rewards. Empirical evaluations demonstrate that Smart Commander significantly outperforms conventional monolithic Deep Reinforcement Learning (DRL) and rule-based baselines. Notably, it achieves a substantial reduction in training time while demonstrating superior scalability and robustness in failure-prone environments. These results highlight the potential of HRL as a reliable paradigm for next-generation intelligent fleet management.
中文摘要 军用航空预测与健康管理（PHM）决策面临重大挑战，这源于大规模舰队行动中“维度诅咒”，加上反馈稀疏和任务随机性。为解决这些问题，本文提出了Smart Commander，一种新型分层强化学习（HRL）框架，旨在优化顺序维护和后勤决策。该框架将复杂的控制问题分解为两层级结构：战略总司令管理舰队级可用性和成本目标，而战术作战指挥官执行任务生成、维护调度和资源分配等具体行动。该方法在定制的高保真离散事件模拟环境中得到验证，该环境捕捉飞机配置动态并支持该http URL。该方法将分层奖励塑造与规划增强神经网络整合，有效解决了奖励稀疏和延迟的难题。实证评估表明，Smart Commander 在传统单体深度强化学习（DRL）和基于规则的基线上表现显著优异。值得注意的是，它在显著缩短训练时间的同时，在易发生故障的环境中展现出卓越的可扩展性和稳健性。这些结果凸显了HRL作为下一代智能车队管理可靠范式的潜力。

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

BRIDGE：通过强化学习查询对齐实现多模态到文本检索

Authors: Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Abdelrahman Abdallah, Hyun-Soo Kang
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.07201
Pdf link: https://arxiv.org/pdf/2604.07201
Abstract Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query -- raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10 -- exceeding the best text-only retriever (32.2) -- demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. this https URL
中文摘要 多模态检索系统难以解析图像-文本查询，针对纯文本语料库：最佳视觉语言编码器在MM-BRIGHT上仅能实现27.6 nDCG@10，表现逊色。我们认为瓶颈不在于检索器，而在于查询：原始多模态查询以系统性方式纠缠了视觉描述、对话噪声和检索意图，从而系统性地降低了嵌入相似性。我们介绍了 BRIDGE，一个两分量系统，无需多模编码器解决了这种不匹配问题。FORGE（Focused Retrieval Query Generator）是一种通过强化学习训练的查询对齐模型，能够将噪声多模态查询提炼成紧凑、优化检索的搜索字符串。LENS（Language-Enhanced Neural Search）是一款基于推理密集型检索数据进行精细调优的推理增强密集检索器，以处理FORGE产生的富含意图的查询。在MM-BRIGHT（2,803次查询，29个域）上评估，BRIDGE实现了29.7 nDCG@10，超过包括Nomic-Vision（27.6）在内的所有多模编码基线。当FORGE作为即插即用的对齐器应用于Nomic-Vision时，合并后的系统达到了33.3 nDCG@10，超过了最佳纯文本检索器（32.2），表明查询对齐是多模态到文本检索的关键瓶颈。这个 https 网址
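The nDCG@10 metric quoted in this abstract can be computed as in the sketch below. It uses the linear-gain form of DCG and assumes that the input list contains the graded relevance of every judged document in ranked order, so that the ideal ranking can be derived from the same list; the function names are hypothetical. Scores such as 27.6 are presumably this value multiplied by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# The single relevant document is ranked second instead of first.
score = ndcg_at_k([0, 1, 0])
```

Because the discount is logarithmic in rank, moving a relevant document from position 2 to position 1 changes the score far more than moving it from position 10 to position 9.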

Robust Quadruped Locomotion via Evolutionary Reinforcement Learning

通过进化强化学习实现的强健四足行走

Authors: Brian McAteer, Karl Mason
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.07224
Pdf link: https://arxiv.org/pdf/2604.07224
Abstract Deep reinforcement learning has recently achieved strong results in quadrupedal locomotion, yet policies trained in simulation often fail to transfer when the environment changes. Evolutionary reinforcement learning aims to address this limitation by combining gradient-based policy optimisation with population-driven exploration. This work evaluates four methods on a simulated walking task: DDPG, TD3, and two Cross-Entropy-based variants, CEM-DDPG and CEM-TD3. All agents are trained on flat terrain and later tested both on this domain and on a rough terrain not encountered during training. TD3 performs best among the standard deep RL baselines on flat ground with a mean reward of 5927.26, while CEM-TD3 achieves the highest rewards overall during training and evaluation (17611.41). Under the rough-terrain transfer test, performance of the deep RL methods drops sharply: DDPG achieves -1016.32 and TD3 achieves -99.73, whereas the evolutionary variants retain much of their capability. CEM-TD3 records the strongest transfer performance with a mean reward of 19574.33. These findings suggest that incorporating evolutionary search can reduce overfitting and improve policy robustness in locomotion tasks, particularly when deployment conditions differ from those seen during training.
中文摘要 深度强化学习最近在四足行走中取得了显著成果，但模拟训练的策略在环境变化时往往无法传递。进化强化学习旨在通过结合基于梯度的策略优化与种群驱动的探索来解决这一局限。本研究评估了模拟步行任务中的四种方法：DDPG、TD3，以及两种基于交叉熵的变体CEM-DDPG和CEM-TD3。所有特工均接受平坦地形训练，随后在此领域及训练中未遇到的崎岖地形进行测试。TD3在标准深层强化学习基线中平坦地面表现最佳，平均奖励为5927.26，而CEM-TD3在训练和评估中整体奖励最高，为17611.41。在崎岖地形转移测试下，深层强化学习方法的性能急剧下降。DDPG的输出为-1016.32，TD3为-99.73，而进化型保留了大部分能力。CEM-TD3 以平均回报19574.33，创下最强的转运表现。这些发现表明，纳入进化搜索可以减少过拟合并提升运动任务中的策略鲁棒性，尤其是在部署条件与训练期间不同时。
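The population-driven search behind the CEM variants can be sketched with a plain cross-entropy method on a toy objective. This is a generic illustration, not CEM-DDPG or CEM-TD3: there is no gradient-based learner in the loop, and the function name, hyperparameters, and toy score function are all hypothetical.

```python
import random
import statistics


def cross_entropy_method(score, dim, iterations=40, population=100, elite_frac=0.2, seed=0):
    """Maximize `score` by repeatedly refitting a diagonal Gaussian to the best samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(population)]
        elites = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elite set; the small constant
        # keeps the search from collapsing to a single point.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]
    return mean


def toy_score(params):
    """Toy objective with its maximum at (1.0, -0.5); stands in for an episode return."""
    return -((params[0] - 1.0) ** 2) - ((params[1] + 0.5) ** 2)


best_params = cross_entropy_method(toy_score, dim=2)
```

In an evolutionary RL setting the sampled vectors would be policy parameters and the score would be an episode return, so the search needs no gradients from the environment.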

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Android Coach：通过单一状态多行动提升在线代理培训效率

Authors: Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.07277
Pdf link: https://arxiv.org/pdf/2604.07277
Abstract Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
中文摘要 在线强化学习（RL）是提升安卓代理能力的有效方法。然而，由于模拟器延迟较高且现有强化学习算法样本效率低，引导智能体通过在线交互学习成本极高。我们发现当前方法中的一个根本局限性：单一状态单行动范式，该范式通过在线单向推广的一对一状态-动作来更新策略，却未充分探索每个昂贵的模拟器状态。本文提出了Android Coach，这是一种新颖框架，将训练范式转变为单一状态多重操作，允许代理对单一在线状态进行采样和利用多个动作。我们通过学习一个估算动作值的批评者，实现了无需额外模拟器开销的实现。为了确保批评者成为可靠的教练，我们整合了过程奖励模型，并引入基于平均批评者输出的群体优势估计器。大量实验证明了 Android Coach 的有效性和效率：它在 AndroidLab 和 AndroidWorld 上相比 UI-TARS-1.5-7B 分别提升了 7.5% 和 8.3% 的成功率，并且在匹配成功率下，训练效率比单状态单行动方法的 PPO 和 GRPO 高出 1.4 倍。

Keyword: diffusion policy

RichMap: A Reachability Map Balancing Precision, Efficiency, and Flexibility for Rich Robot Manipulation Tasks

RichMap：一张平衡精准、高效和灵活性的可达性地图，适用于丰富的机器人操作任务

Authors: Yupu Lu, Yuxiang Ma, Jia Pan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.06778
Pdf link: https://arxiv.org/pdf/2604.06778
Abstract This paper presents RichMap, a high-precision reachability map representation designed to balance efficiency and flexibility for versatile robot manipulation tasks. By refining the classic grid-based structure, we propose a streamlined approach that achieves performance close to compact map forms (e.g., RM4D) while maintaining structural flexibility. Our method utilizes theoretical capacity bounds on $\mathbb{S}^2$ (or $SO(3)$) to ensure rigorous coverage and employs an asynchronous pipeline for efficient construction. We validate the map against comprehensive metrics, pursuing high prediction accuracy ($>98\%$), low false positive rates ($1\sim2\%$), and fast large-batch query ($\sim$15 $\mu$s/query). We extend the framework applications to quantify robot workspace similarity via maximum mean discrepancy (MMD) metrics and demonstrate energy-based guidance for diffusion policy transfer, achieving up to $26\%$ improvement for cross-embodiment scenarios in the block pushing experiment.
中文摘要 本文介绍了RichMap，一种高精度的可达性地图表示，旨在平衡效率与灵活性，适用于多功能机器人操作任务。通过完善经典的网格结构，我们提出了一种简化方法，在保持结构灵活性的同时，实现接近紧凑地图形式（如RM4D）的性能。我们的方法利用理论容量界限在 $\mathbb{S}^2$（或 $SO(3)$）上，以确保严格的覆盖，并采用异步流水线以实现高效构建。我们通过综合指标验证地图，追求高预测准确率（$>98\%$）、低误报率（$1\sim2\%$）和快速大批量查询（$\sim$15 $\mu$s/查询）。我们扩展了框架应用，通过最大平均差异（MMD）指标量化机器人工作空间相似度，并展示了基于能量的扩散策略转移指导，在块推实验中实现了交叉身体场景高达$26\%$的提升。
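The maximum mean discrepancy mentioned in this abstract can be estimated from two sample sets as in the sketch below. It uses the biased (V-statistic) estimator with a Gaussian kernel; the kernel choice, the bandwidth, the function names, and the toy point sets are assumptions for illustration and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as coordinate sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    k_xx = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf_kernel(a, b, bandwidth) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


# Toy "reachable point" samples for two robots; the second set is shifted by 2 along x.
workspace_a = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
workspace_b = [(x + 2.0, y) for x, y in workspace_a]
same = mmd_squared(workspace_a, workspace_a)
shifted = mmd_squared(workspace_a, workspace_b)
```

A value near zero indicates that the two sample sets look alike under the chosen kernel, while larger values indicate more dissimilar workspaces.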