Arxiv Papers of Today

Generated at: 2026-02-23 16:53:28 (UTC+8); arXiv release time: 2026-02-23 20:00 EST (2026-02-24 09:00 UTC+8)

22 relevant papers today

Keyword: reinforcement learning

Epistemic Traps: Rational Misalignment Driven by Model Misspecification

认识陷阱：由模型描述错误驱动的理性错位

Authors: Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.17676
Pdf link: https://arxiv.org/pdf/2602.17676
Abstract The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
中文摘要 大型语言模型和人工智能代理在关键社会和技术领域的快速部署受阻于持续存在的行为病理，包括谄媚、幻觉和战略欺骗，这些行为难以通过强化学习缓解。当前的安全范式将这些失效视为暂时的训练产物，缺乏统一的理论框架来解释其出现和稳定性。这里我们证明这些错位并非错误，而是由模型错误指定引起的数学合理行为。通过将理论经济学的伯克-纳什合理性应用到人工智能，我们得出了一个严谨的框架，将智能体建模为针对一个有缺陷的主观世界模型进行优化。我们证明，广泛观察到的失败是结构性的必然性：不安全行为要么以稳定的错位均衡或根据奖励方案的振荡周期出现，而战略欺骗则以“锁定”均衡的形式存在，或通过对客观风险具有韧性的认识论不确定性。我们通过对六个最先进模型族的行为实验验证这些理论预测，生成精准映射安全行为拓扑边界的相图。我们的发现表明，安全是由主体的认识先验决定的离散阶段，而非奖励幅度的连续函数。这确立了主观模型工程，即主体内部信念结构的设计，作为稳健对齐的必要条件，标志着从控环境奖励转向塑造主体对现实的诠释的范式转变。

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

CodeScaler：通过无执行奖励模型扩展代码LLM训练和测试时间推断

Authors: Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17684
Pdf link: https://arxiv.org/pdf/2602.17684
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
中文摘要 可验证奖励强化学习（RLVR）通过利用单元测试中的基于执行的反馈，推动了代码大型语言模型的最新进展，但其可扩展性从根本上受限于高质量测试用例的可用性和可靠性。我们提出了CodeScaler，一种无执行奖励模型，旨在扩展强化学习训练和测试时间推理以生成代码。CodeScaler 基于经过验证的代码问题精心筛选的偏好数据进行训练，并结合语法感知代码提取和保持有效性的奖励整形，以确保稳定且稳健的优化。在五个编码基准测试中，CodeScaler平均提升了Qwen3-8B-Base +11.72分，比基于二进制执行的强化学习高出+1.82分，并实现了无需测试案例的合成数据集上的可扩展强化学习。在推理阶段，CodeScaler 作为一种有效的测试时间缩放方法，实现了与单元测试方法相当的性能，同时延迟降低了 10 倍。此外，CodeScaler 不仅在代码领域（+3.3分）上超越了 RM-Bench 上的现有奖励模型，在一般和推理领域（平均也超过 +2.7 分）。
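As a rough illustration of the test-time scaling use described in the CodeScaler abstract, the sketch below selects the best of several candidate programs by scoring them without running them. The `toy_reward` function is a hypothetical stand-in for a learned reward model (it only checks that the code parses and mildly prefers shorter programs); it is not CodeScaler's scoring rule, and `best_of_n` is an illustrative name.

```python
import ast
from typing import Callable, Sequence


def toy_reward(code: str) -> float:
    """Hypothetical execution-free scorer; a real system would call a trained reward model."""
    try:
        ast.parse(code)  # static syntax check only; the code is never executed
    except SyntaxError:
        return -1.0
    # Arbitrary tie-breaker for this demo: fewer lines score slightly higher.
    return 1.0 / (1 + len(code.splitlines()))


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


candidates = [
    "def add(a, b) return a + b",        # invalid syntax
    "def add(a, b):\n    return a + b",  # valid
]
chosen = best_of_n(candidates, toy_reward)
```

Because no unit tests are executed, the cost of selection is one scorer call per candidate, which is the kind of latency trade-off the abstract refers to.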

Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling

LEO中最优多碎片任务规划：采用共椭转移与加注的深度强化学习方法

Authors: Agni Bandyopadhyay, Gunther Waxenegger-Wilfing
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Space Physics (physics.space-ph)
Arxiv link: https://arxiv.org/abs/2602.17685
Pdf link: https://arxiv.org/pdf/2602.17685
Abstract This paper addresses the challenge of multi-target active debris removal (ADR) in Low Earth Orbit (LEO) by introducing a unified co-elliptic maneuver framework that combines Hohmann transfers, safety-ellipse proximity operations, and explicit refueling logic. We benchmark three distinct planning algorithms: a Greedy heuristic, Monte Carlo Tree Search (MCTS), and deep reinforcement learning (RL) using Masked Proximal Policy Optimization (PPO), within a realistic orbital simulation environment featuring randomized debris fields, keep-out zones, and delta-V constraints. Experimental results over 100 test scenarios demonstrate that Masked PPO achieves superior mission efficiency and computational performance, visiting up to twice as many debris as Greedy and significantly outperforming MCTS in runtime. These findings underscore the promise of modern RL methods for scalable, safe, and resource-efficient space mission planning, paving the way for future advancements in ADR autonomy.
中文摘要 本文通过引入统一的共椭圆机动框架，解决了低地球轨道（LEO）多目标主动碎片移除（ADR）的挑战，该框架结合了霍曼转移、安全椭圆近距离操作和显式燃料加注逻辑。我们在真实轨道模拟环境中基准测试三种不同的规划算法：贪婪启发式算法、蒙特卡洛树搜索（MCTS），以及采用掩码近端策略优化（Masked PPO）的深度强化学习（RL），该环境包含随机碎片场、禁入区和delta-V约束。在100个测试场景上的实验结果表明，Masked PPO实现了更优越的任务效率和计算性能，访问的碎片数量最多可达贪婪算法的两倍，且运行时间显著优于MCTS。这些发现凸显了现代强化学习方法在可扩展、安全且资源高效的空间任务规划方面的前景，为未来ADR自主性的发展铺平了道路。
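For readers unfamiliar with the Hohmann transfers mentioned in this abstract, the following is a minimal sketch of the textbook two-impulse delta-V calculation between circular, coplanar orbits. The constants, the function name `hohmann_delta_v`, and the example altitudes (500 km to 800 km) are illustrative assumptions, not values from the paper, and the paper's co-elliptic framework involves more than this single formula.

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6_378_137.0      # Earth's equatorial radius, m


def hohmann_delta_v(r1: float, r2: float, mu: float = MU_EARTH) -> tuple[float, float]:
    """Delta-V (m/s) of the two burns of a Hohmann transfer between
    circular coplanar orbits of radius r1 and r2 (metres)."""
    a_transfer = (r1 + r2) / 2.0       # semi-major axis of the transfer ellipse
    v_circ_1 = math.sqrt(mu / r1)      # circular speed on the departure orbit
    v_circ_2 = math.sqrt(mu / r2)      # circular speed on the arrival orbit
    # Vis-viva equation: speed on the transfer ellipse at each end.
    v_trans_1 = math.sqrt(mu * (2.0 / r1 - 1.0 / a_transfer))
    v_trans_2 = math.sqrt(mu * (2.0 / r2 - 1.0 / a_transfer))
    return abs(v_trans_1 - v_circ_1), abs(v_circ_2 - v_trans_2)


# Assumed example: raise a chaser from 500 km to 800 km altitude.
dv1, dv2 = hohmann_delta_v(R_EARTH + 500e3, R_EARTH + 800e3)
total_dv = dv1 + dv2
```

A planner with a delta-V budget would sum such costs along a candidate visiting order and reject orders that exceed the budget before the next refueling opportunity.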

Reinforcement-Learning-Based Assistance Reduces Squat Effort with a Modular Hip--Knee Exoskeleton

基于强化学习的辅助通过模块化髋关节-膝关节外骨骼减少深蹲的努力

Authors: Neethan Ratnakumar, Mariya Huzaifa Tohfafarosh, Saanya Jauhri, Xianlian Zhou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.17794
Pdf link: https://arxiv.org/pdf/2602.17794
Abstract Squatting is one of the most demanding lower-limb movements, requiring substantial muscular effort and coordination. Reducing the physical demands of this task through intelligent and personalized assistance has significant implications, particularly in industries involving repetitive low-level assembly activities. In this study, we evaluated the effectiveness of a neural network controller for a modular Hip-Knee exoskeleton designed to assist squatting tasks. The neural network controller was trained via reinforcement learning (RL) in a physics-based, human-exoskeleton interaction simulation environment. The controller generated real-time hip and knee assistance torques based on recent joint-angle and velocity histories. Five healthy adults performed three-minute metronome-guided squats under three conditions: (1) no exoskeleton (No-Exo), (2) exoskeleton with Zero-Torque, and (3) exoskeleton with active assistance (Assistance). Physiological effort was assessed using indirect calorimetry and heart rate monitoring, alongside concurrent kinematic data collection. Results show that the RL-based controller adapts to individuals by producing torque profiles tailored to each subject's kinematics and timing. Compared with the Zero-Torque and No-Exo conditions, active assistance reduced the net metabolic rate by approximately 10%, with minor reductions observed in heart rate. However, assisted trials also exhibited reduced squat depth, reflected by smaller hip and knee flexion. These preliminary findings suggest that the proposed controller can effectively lower physiological effort during repetitive squatting, motivating further improvements in both hardware design and control strategies.
中文摘要 深蹲是下肢最具挑战性的动作之一，需要大量的肌肉努力和协调性。通过智能和个性化协助减少这项工作的体力需求，尤其在涉及重复性低层次组装活动的行业中，具有重大影响。本研究评估了神经网络控制器用于模块化髋膝外骨骼的有效性，该模块化设计用于辅助深蹲任务。神经网络控制器通过强化学习（RL）在基于物理的人-外骨骼交互模拟环境中训练。控制器根据近期关节角度和速度历史生成实时髋关节和膝关节辅助扭矩。五名健康成年人在三种条件下进行了三分钟节拍器引导深蹲：（1）无外骨骼（无外骨骼）、（2）外骨骼配合零扭矩，以及（3）外骨骼配合主动辅助（辅助）。通过间接量热和心率监测，同时进行运动学数据收集，评估生理努力。结果显示，基于强化学习的控制器通过生成针对每个受试者运动学和时机的扭矩曲线来适应个体。与零扭矩和无Exo状态相比，主动辅助约降低了10%的净代谢率，心率则有轻微下降。然而，辅助试验也显示深蹲深度减少，体现在髋关节和膝关节屈曲较小。这些初步发现表明，拟议的控制器能够有效降低重复深蹲时的生理负担，推动硬件设计和控制策略的进一步改进。

MePoly: Max Entropy Polynomial Policy Optimization

MePoly：最大熵多项式策略优化

Authors: Hang Liu, Sangli Teng, Maani Ghaffari
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.17832
Pdf link: https://arxiv.org/pdf/2602.17832
Abstract Stochastic Optimal Control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum entropy reinforcement learning (RL) and imitation learning (IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Though diffusion-based policies are aimed at recovering the multi-modality, they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly provides an explicit, tractable probability density, enabling exact entropy maximization. Theoretically, we ground our method in the classical moment problem, leveraging the universal approximation capabilities for arbitrary distributions. Empirically, we demonstrate that MePoly effectively captures complex non-convex manifolds and outperforms baselines across diverse benchmarks.
中文摘要 随机最优控制提供了一个统一的数学框架，用于解决复杂决策问题，涵盖了诸如最大熵强化学习（RL）和模仿学习（IL）等范式。然而，传统的参数化政策往往难以反映解决方案的多模态性。虽然基于扩散的策略旨在恢复多模态性，但它们缺乏明确的概率密度，这使策略梯度优化变得复杂。为弥合这一差距，我们提出了MePoly，这是一种基于多项式能量模型的新政策参数化。MePoly 提供了明确且易于处理的概率密度，实现精确熵最大化。理论上，我们的方法基于经典矩问题，利用任意分布的普遍近似能力。通过实证，我们证明了MePoly有效捕捉复杂的非凸流形，并在多样化基准测试中优于基线表现。
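To make the idea of a polynomial energy-based density concrete, here is a one-dimensional sketch: a density proportional to exp(-E(x)) with a polynomial energy, normalized numerically on a bounded interval, together with its entropy. The abstract does not give MePoly's actual parameterization or normalization scheme, so the quartic energy, the interval [-3, 3], the grid size, and the function names are all assumptions chosen only to show how such a density can be explicit and multi-modal.

```python
import math


def polynomial_density(coeffs, lo=-3.0, hi=3.0, n=601):
    """Return grid points, density values, and grid spacing for
    p(x) proportional to exp(-E(x)), with E(x) = sum_k coeffs[k] * x**k."""
    dx = (hi - lo) / (n - 1)
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    energies = [sum(c * x**k for k, c in enumerate(coeffs)) for x in xs]
    unnormalized = [math.exp(-e) for e in energies]
    z = sum(unnormalized) * dx  # Riemann-sum estimate of the normalizer
    return xs, [u / z for u in unnormalized], dx


def entropy(density, dx):
    """Riemann-sum estimate of the differential entropy -∫ p log p dx."""
    return -sum(p * math.log(p) * dx for p in density if p > 0.0)


# Assumed example energy E(x) = x^4 - 2 x^2, which has two minima at x = ±1.
xs, density, dx = polynomial_density([0.0, 0.0, -2.0, 0.0, 1.0])
h = entropy(density, dx)
```

The resulting density has two equal modes, something a single Gaussian policy cannot represent, while its log-density remains available in closed form up to the normalizer.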

MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

MIRA：带有有限LLM指导的记忆集成强化学习代理

Authors: Narjes Nourzad, Carlee Joe-Wong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17930
Pdf link: https://arxiv.org/pdf/2602.17930
Abstract Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: this https URL
中文摘要 由于先前结构有限，强化学习（RL）代理在稀疏或延迟奖励设置中常常存在高样本复杂度。大型语言模型（LLMs）可以提供子目标分解、合理轨迹和抽象先验，促进早期学习。然而，高度依赖LLM监管会带来可扩展性限制和对潜在不可靠信号的依赖。我们提出了MIRA（记忆集成强化学习代理），它结合了结构化、不断演进的记忆图来指导早期训练。该图存储了与决策相关的信息，包括轨迹段和子目标结构，并由智能体的高回报体验和大型语言模型输出构建而成。该设计将LLM查询摊销到持久内存中，而无需持续的实时监督。从该内存图中，我们推导出一个效用信号，该信号在不改变底层奖励函数的情况下，轻微调整优势估计以影响政策更新。随着训练的推进，代理的策略逐渐超过最初的LLM导出先验，效用项逐渐衰减，保持标准收敛保证。我们通过理论分析证明，基于效用的塑造在稀疏奖励环境中改善了早期学习。从实证角度看，MIRA优于强化学习基线，并取得与依赖频繁LLM监督的方法相当的回报，同时在线LLM查询次数明显减少。项目网页：此 https URL
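The advantage-shaping idea in the MIRA abstract can be sketched as an additive utility term whose weight decays over training, so that the policy update approaches the unshaped one in the long run. The additive form, the `beta0 / (1 + decay * step)` schedule, and all names below are assumptions for illustration; the abstract does not specify how the utility is computed from the memory graph or how it decays.

```python
def shaped_advantages(advantages, utilities, step, beta0=0.5, decay=0.01):
    """Add a decaying utility bonus to each advantage estimate.

    advantages: per-sample advantage estimates from the critic
    utilities:  per-sample utility scores (assumed to come from a memory graph)
    step:       current training step; the bonus weight shrinks as it grows
    """
    beta = beta0 / (1.0 + decay * step)  # hypothetical decay schedule
    return [a + beta * u for a, u in zip(advantages, utilities)]


early = shaped_advantages([1.0, -0.5], [2.0, 0.0], step=0)
late = shaped_advantages([1.0, -0.5], [2.0, 0.0], step=1_000_000)
```

Note that the reward itself is never modified; only the advantage used in the policy-gradient update is adjusted, which matches the abstract's description of shaping without altering the underlying reward function.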

Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning

基于内存的优势塑造用于LLM引导强化学习

Authors: Narjes Nourzad, Carlee Joe-Wong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17931
Pdf link: https://arxiv.org/pdf/2602.17931
Abstract In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online queries, avoiding dependence on continuous LLM supervision. Preliminary experiments in benchmark environments show improved sample efficiency and faster early learning compared to baseline RL methods, with final returns comparable to methods that require frequent LLM interaction.
中文摘要 在奖励稀疏或延迟的环境中，强化学习（RL）由于学习需要大量交互，样本复杂度较高。这一局限促使人们使用大型语言模型（LLMs）进行子目标发现和轨迹指导。虽然大型语言模型可以支持探索，但频繁依赖LLM调用引发了对可扩展性和可靠性的担忧。我们通过构建一个内存图来应对这些挑战，该图涵盖了LLM指导和代理自身成功部署的子目标和轨迹。从该图中，我们推导出一个效用函数，评估代理人的轨迹与先前成功策略的高度契合程度。这种效用塑造了优势函数，为批评者提供了额外的指导，同时不改变奖励。我们的方法主要依赖离线输入，偶尔在线查询，避免依赖持续的LLM监督。基准环境中的初步实验显示，与基线强化学习方法相比，样本效率提升，早期学习更快，最终回报与需要频繁大型语言模型交互的方法相当。

Graph-Neural Multi-Agent Coordination for Distributed Access-Point Selection in Cell-Free Massive MIMO

图神经多智能体协调，用于无小区大规模MIMO中的分布式接入点选择

Authors: Mohammad Zangooei, Lou Salaün, Chung Shue Chen, Raouf Boutaba
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.17954
Pdf link: https://arxiv.org/pdf/2602.17954
Abstract Cell-free massive MIMO (CFmMIMO) systems require scalable and reliable distributed coordination mechanisms to operate under stringent communication and latency constraints. A central challenge is the Access Point Selection (APS) problem, which seeks to determine the subset of serving Access Points (APs) for each User Equipment (UE) that can satisfy UEs' Spectral Efficiency (SE) requirements while minimizing network power consumption. We introduce APS-GNN, a scalable distributed multi-agent learning framework that decomposes APS into agents operating at the granularity of individual AP-UE connections. Agents coordinate via local observation exchange over a novel Graph Neural Network (GNN) architecture and share parameters to reuse their knowledge and experience. APS-GNN adopts a constrained reinforcement learning approach to provide agents with explicit observability of APS' conflicting objectives, treating SE satisfaction as a cost and power reduction as a reward. Both signals are defined locally, facilitating effective credit assignment and scalable coordination in large networks. To further improve training stability and exploration efficiency, the policy is initialized via supervised imitation learning from a heuristic APS baseline. We develop a realistic CFmMIMO simulator and demonstrate that APS-GNN delivers the target SE while activating 50-70% fewer APs than heuristic and centralized Multi-agent Reinforcement Learning (MARL) baselines in different evaluation scenarios. Moreover, APS-GNN achieves one to two orders of magnitude lower inference latency than centralized MARL approaches due to its fully parallel and distributed execution. These results establish APS-GNN as a practical and scalable solution for APS in large-scale CFmMIMO networks.
中文摘要 无小区大规模MIMO（CFmMIMO）系统需要可扩展且可靠的分布式协调机制，以在严格的通信和延迟约束下运行。一个核心挑战是接入点选择（APS）问题，该问题旨在为每个用户设备（UE）确定能够满足UE频谱效率（SE）要求同时最小化网络功耗的服务接入点（AP）子集。我们介绍APS-GNN，一个可扩展的分布式多代理学习框架，将APS分解为在单个AP-UE连接细粒度下运行的代理。代理通过一种新型图神经网络（GNN）架构进行局部观察交换协调，并共享参数以重用他们的知识和经验。APS-GNN采用受限强化学习方法，为代理提供APS冲突目标的显式可观察性，将SE满足视为成本，将功耗降低视为奖励。这两个信号均在本地定义，便于在大型网络中有效分配信用和可扩展的协调。为进一步提升训练稳定性和探索效率，该策略通过从启发式APS基线的监督模仿学习进行初始化。我们开发了逼真的CFmMIMO模拟器，并证明APS-GNN在不同评估场景下能够达到目标SE，同时激活的AP数量比启发式和集中式多智能体强化学习（MARL）基线少50%-70%。此外，APS-GNN由于其完全并行和分布式执行，推理延迟比集中式MARL方法低一到两个数量级。这些结果确立了APS-GNN作为大规模CFmMIMO网络中APS的实用且可扩展的解决方案。

Learning Optimal and Sample-Efficient Decision Policies with Guarantees

学习带有保证的最优和样本效率决策策略

Authors: Daqian Shao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17978
Pdf link: https://arxiv.org/pdf/2602.17978
Abstract The paradigm of decision-making has been revolutionised by reinforcement learning and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. However, learning from offline datasets is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restrictions (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis in real-world decision making.
中文摘要 强化学习和深度学习彻底改变了决策范式。尽管这带来了机器人、医疗和金融等领域的重大进展，但实际上在实践中使用强化学习仍具有挑战性，尤其是在需要保证的高风险应用中学习决策政策时。传统的强化学习算法依赖大量与环境的在线交互，这在在线交互成本高昂、危险或不可行的场景中存在问题。然而，从离线数据集中学习会受到隐藏混杂因素的阻碍。这些混杂因素可能导致数据集中出现虚假相关性，并可能误导智能体采取次优或对抗性的行为。首先，我们解决了在存在隐藏混杂因素的情况下，如何从离线数据集中学习的问题。我们利用工具变量（IV）来识别因果效应，这是条件矩限制（CMR）问题的一个实例。受双重/去偏置机器学习启发，我们推导出一种具有收敛性和最优性保证的样本高效算法，用于解决CMR问题，其性能优于最先进算法。其次，我们放宽了（离线）模仿学习中隐藏混杂因素的条件，并调整CMR估计器，推导出能够学习有效模仿策略且有收敛率保证的算法。最后，我们考虑了学习线性时间逻辑（LTL）中表达的高级目标的问题，并开发了一种可证明的最优学习算法，以提升样本效率相较于现有方法。通过对强化学习基准测试以及合成和半合成数据集的评估，我们展示了本论文中开发的方法在现实决策中的实用性。

Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

全脑连接组图模型实现了果蝇全身运动控制

Authors: Zehao Jin, Yaoye Zhu, Chen Zhang, Yanan Sui
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.17997
Pdf link: https://arxiv.org/pdf/2602.17997
Abstract Whole-brain biological neural networks naturally support the learning and control of whole-body movements. However, the use of brain connectomes as neural network controllers in embodied reinforcement learning remains unexplored. We investigate using the exact neural architecture of an adult fruit fly's brain for the control of its body movement. We develop Fly-connectomic Graph Model (FlyGM), whose static structure is identical to the complete connectome of an adult Drosophila for whole-body locomotion control. To perform dynamical control, FlyGM represents the static connectome as a directed message-passing graph to impose a biologically grounded information flow from sensory inputs to motor outputs. Integrated with a biomechanical fruit fly model, our method achieves stable control across diverse locomotion tasks without task-specific architectural tuning. To verify the structural advantages of the connectome-based model, we compare it against a degree-preserving rewired graph, a random graph, and multilayer perceptrons, showing that FlyGM yields higher sample efficiency and superior performance. This work demonstrates that static brain connectomes can be transformed to instantiate effective neural policy for embodied learning of movement control.
中文摘要 全脑生物神经网络自然支持学习和控制全身运动。然而，将大脑连接组作为神经网络控制器用于具身强化学习的应用尚未被充分探索。我们利用果蝇大脑的精确神经结构来控制其身体运动。我们开发了蝇连结组图模型（FlyGM），其静态结构与成年果蝇的完整连接组相同，用于全身运动控制。为了实现动态控制，FlyGM将静态连接组表示为有向消息传递图，以从感觉输入到运动输出施加生物基准的信息流。结合生物力学果蝇模型，我们的方法在多种运动任务中实现稳定控制，无需针对特定任务进行架构调校。为了验证基于连接组模型的结构优势，我们将其与保持次数的重接线图、随机图和多层感知器进行比较，显示FlyGM提供了更高的采样效率和更优越的性能。这项工作表明，静态大脑连接组可以转化为有效的神经策略，用于具身学习运动控制。

Flow Actor-Critic for Offline Reinforcement Learning

离线强化学习的Flow Actor-Critic

Authors: Jongseong Chae, Jongeui Park, Yongjae Shin, Gyeongmin Kim, Seungyul Han, Youngchul Sung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18015
Pdf link: https://arxiv.org/pdf/2602.18015
Abstract The dataset distributions in offline reinforcement learning (RL) often exhibit complex and multi-modal distributions, necessitating expressive policies to capture such distributions beyond widely-used Gaussian policies. To handle such complex and multi-modal datasets, in this paper, we propose Flow Actor-Critic, a new actor-critic method for offline RL, based on recent flow policies. The proposed method not only uses the flow model for actor as in previous flow policies but also exploits the expressive flow model for conservative critic acquisition to prevent Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the flow behavior proxy model obtained as a byproduct of flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance for test datasets of offline RL including the D4RL and recent OGBench benchmarks.
中文摘要 离线强化学习（RL）中的数据集分布通常表现出复杂且多模态的分布，因此需要表达性策略来捕捉这些分布，超越广泛使用的高斯策略。为处理如此复杂且多模态的数据集，本文提出了基于最新流策略的Flow Actor-Critic方法，这是一种新的离线强化学习actor-critic方法。所提方法不仅像以往的流策略一样使用actor的流模型，还利用表达流模型进行保守批评获取，以防止Q值在数据外区域爆炸。为此，我们提出了一种基于基于流的演员设计副产品流行为代理模型的新型批评正则化器。通过这种联合方式利用流模型，我们实现了离线强化学习测试数据集（包括D4RL和近期OGBench基准测试）的先进性能。

Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets

异构机器人数据集的跨身体离线强化学习

Authors: Haruki Abe, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.18025
Pdf link: https://arxiv.org/pdf/2602.18025
Abstract Scalable robot policy pre-training has been hindered by the high cost of collecting high-quality demonstrations for each platform. In this study, we address this issue by uniting offline reinforcement learning (offline RL) with cross-embodiment learning. Offline RL leverages both expert and abundant suboptimal data, and cross-embodiment learning aggregates heterogeneous robot trajectories across diverse morphologies to acquire universal control priors. We perform a systematic analysis of this offline RL and cross-embodiment paradigm, providing a principled understanding of its strengths and limitations. To evaluate this offline RL and cross-embodiment paradigm, we construct a suite of locomotion datasets spanning 16 distinct robot platforms. Our experiments confirm that this combined approach excels at pre-training with datasets rich in suboptimal trajectories, outperforming pure behavior cloning. However, as the proportion of suboptimal data and the number of robot types increase, we observe that conflicting gradients across morphologies begin to impede learning. To mitigate this, we introduce an embodiment-based grouping strategy in which robots are clustered by morphological similarity and the model is updated with a group gradient. This simple, static grouping substantially reduces inter-robot conflicts and outperforms existing conflict-resolution methods.
中文摘要 可扩展机器人政策的预训练因为每个平台收集高质量演示的成本而受阻。本研究通过结合离线强化学习（离线强化学习）与跨身体学习来解决这一问题。离线强化学习利用专家和大量次优数据，交叉身体学习汇总了不同形态的机器人轨迹，以获得通用控制先验。我们对这一离线强化学习和交叉身体范式进行了系统分析，提供了其优势与局限性的原则性理解。为评估这一离线强化学习和跨身体范式，我们构建了一套涵盖16个不同机器人平台的运动数据集。我们的实验证实，这种结合方法在预训练中表现优异，且数据集中存在的次优轨迹，优于纯行为克隆。然而，随着次优数据比例和机器人类型数量的增加，我们观察到形态学间的梯度冲突开始阻碍学习。为缓解这一问题，我们引入了一种基于身体的分组策略，将机器人按形态相似性聚类，并更新模型并添加群梯度。这种简单静态的分组大大减少了机器人间的冲突，并优于现有的冲突解决方法。

Mean-Field Reinforcement Learning without Synchrony

无同步的均值场强化学习

Authors: Shan Yang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18026
Pdf link: https://arxiv.org/pdf/2602.18026
Abstract Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic -- the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic -- one that remains defined regardless of which agents act. The population distribution $\mu \in \Delta(\mathcal{O})$ -- the fraction of agents at each observation -- satisfies this requirement: its dimension is independent of $N$, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to $\mu$. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution $\mu$ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an $O(1/\sqrt{N})$ finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all $N$ act per step, with approximation error decaying at the predicted $O(1/\sqrt{N})$ rate.
中文摘要 平均场强化学习（MF-RL）通过将每个智能体对其他智能体的依赖归约为单个汇总统计量（平均动作），将多智能体强化学习扩展到大规模群体。然而，这种归约要求每个智能体在每个时间步都行动；当某些智能体处于空闲状态时，平均动作就没有定义。因此，处理异步需要另一种汇总统计量，即无论哪些智能体行动都保持有定义的统计量。群体分布 $\mu \in \Delta(\mathcal{O})$（处于每个观测值的智能体比例）满足这一要求：其维度与 $N$ 无关，并且在可交换性下完全决定每个智能体的奖励和转移。然而，现有的MF-RL理论建立在平均动作之上，无法推广到 $\mu$。因此，我们围绕群体分布 $\mu$ 从零构建了时序平均场（TMF）框架，在单一理论内涵盖从完全同步到纯顺序决策的全部情形。我们证明了TMF均衡的存在性和唯一性，建立了一个 $O(1/\sqrt{N})$ 的有限群体近似界，该界无论每步有多少智能体行动都成立，并证明了策略梯度算法（TMF-PG）收敛到唯一均衡。资源选择博弈和动态排队博弈上的实验证实，无论每步只有一个智能体行动还是全部 $N$ 个智能体行动，TMF-PG的性能几乎相同，近似误差以预测的 $O(1/\sqrt{N})$ 速率衰减。
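The contrast this abstract draws between the two summary statistics can be shown in a few lines: the mean action has no value once any agent is idle, whereas the empirical population distribution over observations is always defined and its size depends only on the observation space. Representing an idle agent as `None`, and the function names below, are assumptions for illustration; this is not the paper's TMF construction.

```python
from collections import Counter


def mean_action(actions):
    """Mean action over all agents; undefined (None) if any agent is idle."""
    if any(a is None for a in actions):
        return None
    return sum(actions) / len(actions)


def population_distribution(observations, observation_space):
    """Fraction of agents at each observation; defined whoever acts."""
    counts = Counter(observations)
    n = len(observations)
    return {o: counts[o] / n for o in observation_space}


actions = [1, None, 0, 1]          # the second agent is idle this step
observations = ["a", "a", "b", "c"]

m = mean_action(actions)
mu = population_distribution(observations, ["a", "b", "c", "d"])
```

The dictionary `mu` has one entry per observation regardless of the number of agents, which mirrors the abstract's remark that the dimension of the population distribution is independent of $N$.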

Decision Support under Prediction-Induced Censoring

预测诱导审查下的决策支持

Authors: Yan Chen, Ruyi Huang, Cheng Liu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.18031
Pdf link: https://arxiv.org/pdf/2602.18031
Abstract In many data-driven online decision systems, actions determine not only operational costs but also the data availability for future learning -- a phenomenon termed Prediction-Induced Censoring (PIC). This challenge is particularly acute in large-scale resource allocation for generative AI (GenAI) serving: insufficient capacity triggers shortages but hides the true demand, leaving the system with only a "greater-than" constraint. Standard decision-making approaches that rely on uncensored data suffer from selection bias, often locking the system into a self-reinforcing low-provisioning trap. To break this loop, this paper proposes an adaptive approach named PIC-Reinforcement Learning (PIC-RL), a closed-loop framework that transforms censoring from a data quality problem into a decision signal. PIC-RL integrates (1) Uncertainty-Aware Demand Prediction to manage the information-cost trade-off, (2) Pessimistic Surrogate Inference to construct decision-aligned conservative feedback from shortage events, and (3) Dual-Timescale Adaptation to stabilize online learning against distribution drift. The analysis provides theoretical guarantees that the feedback design corrects the selection bias inherent in naive learning. Experiments on production Alibaba GenAI traces demonstrate that PIC-RL consistently outperforms state-of-the-art baselines, reducing service degradation by up to 50% while maintaining cost efficiency.
中文摘要 在许多数据驱动的在线决策系统中，动作不仅决定运营成本，还决定未来学习的数据可用性——这一现象被称为预测诱导审查（PIC）。这一挑战在生成式人工智能（GenAI）服务的大规模资源分配中尤为严峻：容量不足会引发短缺，但掩盖了真实需求，系统仅面临“大于”的限制。依赖未审查数据的标准决策方法存在选择偏差，常常将系统锁定在自我强化的低配置陷阱中。为打破这一循环，本文提出了一种名为PIC-Reinforcement Learning（PIC-RL）的自适应方法，这是一种闭环框架，将审查从数据质量问题转变为决策信号。PIC-RL集成了（1）不确定性感知需求预测以管理信息-成本权衡，（2）悲观替代推断以构建基于短缺事件的决策一致的保守反馈，以及（3）双时间尺度适应以稳定在线学习免受分布漂移的影响。分析提供了理论保证反馈设计纠正了天真学习中固有的选择偏差。在阿里巴巴GenAI量产线上的实验显示，PIC-RL始终优于最先进的基线，服务降级可降低多达50%，同时保持成本效益。
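The censoring mechanism described in this abstract can be illustrated with a toy example: when provisioned capacity is below demand, only the capacity is observed, so a naive average of observations underestimates true demand. The numbers and function names are invented for illustration, and this sketch shows only the selection bias, not the PIC-RL method.

```python
def observe_demand(true_demand: float, capacity: float) -> tuple[float, bool]:
    """Return (observed load, censored flag).

    When demand exceeds capacity, the system only learns that
    demand was greater than capacity; the true value stays hidden.
    """
    served = min(true_demand, capacity)
    censored = true_demand > capacity
    return served, censored


true_demands = [5.0, 12.0, 8.0, 20.0]  # assumed ground truth, unknown to the system
capacity = 10.0

observations = [observe_demand(d, capacity) for d in true_demands]
naive_mean = sum(served for served, _ in observations) / len(observations)
true_mean = sum(true_demands) / len(true_demands)
num_censored = sum(1 for _, censored in observations if censored)
```

If next-step capacity is set from `naive_mean`, it falls below the current capacity, more observations become censored, and the estimate drops again, which is the self-reinforcing low-provisioning trap the abstract describes.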

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

梯度正则化防止了基于人类反馈和可验证奖励的强化学习中的奖励黑客行为

Authors: Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.18037
Pdf link: https://arxiv.org/pdf/2602.18037
Abstract Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
中文摘要 基于人类反馈的强化学习（RLHF）和基于可验证奖励的强化学习（RLVR）是现代语言模型（LM）后训练中的两个关键步骤。一个常见问题是奖励黑客（reward hacking），即策略可能利用奖励的不准确之处，学到非预期的行为。以往的工作大多通过施加指向参考模型的Kullback-Leibler（KL）惩罚来限制策略更新，以解决此问题。我们提出一种不同的思路：以使策略更新偏向奖励更准确区域的方式训练LM。首先，我们推导出奖励模型的准确性与收敛时最优点平坦性之间的理论联系。随后可以利用梯度正则化（GR）使训练偏向更平坦的区域，从而保持奖励模型的准确性。我们通过实验表明梯度范数与奖励准确率在RLHF中存在相关性，从而证实了这些结果。我们进一步表明，KL惩罚的参考重置（Reference Resets）隐式地利用GR来寻找奖励准确率更高的平坦区域。在此基础上，我们提出使用带有高效有限差分估计的显式GR。实验表明，在多种LM强化学习实验中，GR的表现优于KL惩罚。GR在RLHF中获得了更高的GPT评判胜率，避免了在基于规则的数学奖励中过度关注格式，并防止了在LLM-as-a-Judge数学任务中对评判模型的攻击。
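
The abstract above mentions explicit gradient regularization with a finite-difference estimate but does not spell out the estimator. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the regularized objective L(theta) + (lam / 2) * ||grad L(theta)||^2 and approximates the Hessian-vector product with one extra gradient evaluation. The toy quadratic loss and the names `grad_loss`, `gr_gradient`, `lam`, and `eps` are hypothetical.

```python
# Toy sketch of gradient regularization (GR) with a finite-difference estimate.
# Assumption (not taken from the paper): the regularized objective is
#   L(theta) + (lam / 2) * ||grad L(theta)||^2,
# whose gradient is g + lam * H g. The Hessian-vector product H g is
# approximated by (grad L(theta + eps * g) - grad L(theta)) / eps.

def grad_loss(theta, curvatures):
    """Gradient of the toy loss 0.5 * sum(a_i * theta_i^2)."""
    return [a * t for a, t in zip(curvatures, theta)]

def gr_gradient(theta, curvatures, lam=0.1, eps=1e-3):
    """Gradient of the GR objective using one extra gradient evaluation."""
    g = grad_loss(theta, curvatures)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    g_shifted = grad_loss(shifted, curvatures)
    hvp = [(gs - gi) / eps for gs, gi in zip(g_shifted, g)]  # approximates H g
    return [gi + lam * hi for gi, hi in zip(g, hvp)]

curvatures = [1.0, 10.0]  # one flat direction and one sharp direction
theta0 = [1.0, 1.0]
update = gr_gradient(theta0, curvatures, lam=0.1)
print(update)
```

In this toy setting the sharp direction (curvature 10) receives a much larger extra push than the flat one, which is the sense in which the penalty biases training toward flatter regions.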

Interacting safely with cyclists using Hamilton-Jacobi reachability and reinforcement learning

利用Hamilton-Jacobi可达性和强化学习与骑行者安全互动

Authors: Aarati Andrea Noronha, Jean Oh
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18097
Pdf link: https://arxiv.org/pdf/2602.18097
Abstract In this paper, we present a framework for enabling autonomous vehicles to interact with cyclists in a manner that balances safety and optimality. The approach integrates Hamilton-Jacobi reachability analysis with deep Q-learning to jointly address safety guarantees and time-efficient navigation. A value function is computed as the solution to a time-dependent Hamilton-Jacobi-Bellman inequality, providing a quantitative measure of safety for each system state. This safety metric is incorporated as a structured reward signal within a reinforcement learning framework. The method further models the cyclist's latent response to the vehicle, allowing disturbance inputs to reflect human comfort and behavioral adaptation. The proposed framework is evaluated through simulation and comparison with human driving behavior and an existing state-of-the-art method.
中文摘要 本文提出了一个框架，使自动驾驶车辆能够以兼顾安全性与最优性的方式与骑行者互动。该方法将Hamilton-Jacobi可达性分析与深度Q学习相结合，同时解决安全保证和时间高效的导航问题。价值函数通过求解含时的Hamilton-Jacobi-Bellman不等式得到，为每个系统状态提供定量的安全性度量。该安全度量作为结构化的奖励信号被纳入强化学习框架。该方法还对骑行者对车辆的潜在反应进行建模，使扰动输入能够反映人的舒适度和行为适应。我们通过仿真评估了该框架，并与人类驾驶行为及一种现有的先进方法进行了比较。

TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs

TempoNet：松弛量化、Transformer引导的强化学习调度器，用于以截止期限为中心的自适应实时调度

Authors: Rong Fu, Yibo Meng, Guangzhen Yao, Jiaxuan Lu, Zeyu Zhang, Zhaolu Kang, Ziming Guo, Jia Yee Tan, Xiaojing Du, Simon James Fong
Subjects: Machine Learning (cs.LG); Operating Systems (cs.OS); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.18109
Pdf link: https://arxiv.org/pdf/2602.18109
Abstract Real-time schedulers must reason about tight deadlines under strict compute budgets. We present TempoNet, a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. An Urgency Tokenizer discretizes temporal slack into learnable embeddings, stabilizing value learning and capturing deadline proximity. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets with near-linear scaling and sub-millisecond inference. A multicore mapping layer converts contextualized Q-scores into processor assignments through masked-greedy selection or differentiable matching. Extensive evaluations on industrial mixed-criticality traces and large multiprocessor settings show consistent gains in deadline fulfillment over analytic schedulers and neural baselines, together with improved optimization stability. Diagnostics include sensitivity analyses for slack quantization, attention-driven policy interpretation, hardware-in-the-loop and kernel micro-benchmarks, and robustness under stress with simple runtime mitigations; we also report sample-efficiency benefits from behavioral-cloning pretraining and compatibility with an actor-critic variant without altering the inference pipeline. These results establish a practical framework for Transformer-based decision making in high-throughput real-time scheduling.
中文摘要 实时调度器必须在严格的计算预算下处理紧迫的截止期限。我们提出TempoNet，一种强化学习调度器，将置换不变的Transformer与深度Q近似相结合。Urgency Tokenizer将时间松弛量（slack）离散化为可学习的嵌入，从而稳定价值学习并刻画截止期限的临近程度。一个延迟感知的稀疏注意力堆栈采用分块top-k选择和局部敏感分块，能够对无序任务集进行全局推理，具有近线性的扩展性和亚毫秒级的推理延迟。多核映射层通过掩码贪心选择或可微匹配，将上下文化的Q分数转换为处理器分配。在工业混合关键性轨迹（traces）和大规模多处理器场景上的大量评估表明，与解析调度器和神经网络基线相比，截止期限满足率持续提升，优化稳定性也有所改善。诊断分析包括松弛量化的敏感性分析、基于注意力的策略解释、硬件在环和内核微基准测试，以及配合简单运行时缓解措施的压力鲁棒性测试；我们还报告了行为克隆预训练带来的样本效率提升，以及在不改变推理流程的情况下与actor-critic变体的兼容性。这些结果为高吞吐量实时调度中基于Transformer的决策建立了实用框架。
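
The abstract above says temporal slack is discretized into tokens that index learnable embeddings. A minimal sketch of what such a quantization step could look like follows. The bin edges, the slack definition, and the names `SLACK_EDGES_MS`, `slack_ms`, and `urgency_token` are assumptions for illustration; the paper's actual scheme is not given in the abstract.

```python
from bisect import bisect_right

# Hypothetical bin edges in milliseconds; the paper's actual quantization
# scheme is not described in the abstract.
SLACK_EDGES_MS = [0.0, 1.0, 5.0, 20.0, 100.0]

def slack_ms(now_ms, deadline_ms, remaining_exec_ms):
    """Temporal slack: time left before the deadline once remaining work is done."""
    return deadline_ms - now_ms - remaining_exec_ms

def urgency_token(slack):
    """Map a slack value to a token id in [0, len(SLACK_EDGES_MS)].

    Token 0 means negative slack (the deadline can no longer be met);
    larger ids mean more slack. The id would index an embedding table.
    """
    return bisect_right(SLACK_EDGES_MS, slack)

tasks = [
    {"deadline_ms": 12.0, "remaining_exec_ms": 4.0},
    {"deadline_ms": 10.0, "remaining_exec_ms": 11.0},
    {"deadline_ms": 500.0, "remaining_exec_ms": 30.0},
]
tokens = [
    urgency_token(slack_ms(0.0, t["deadline_ms"], t["remaining_exec_ms"]))
    for t in tasks
]
print(tokens)  # [3, 0, 5]
```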

Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

带注入噪声的流匹配用于离线到在线强化学习

Authors: Yongjae Shin, Jongseong Chae, Jongeui Park, Youngchul Sung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18117
Pdf link: https://arxiv.org/pdf/2602.18117
Abstract Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. In addition to exploration-enhanced flow policy training, we combine an entropy-guided sampling mechanism to balance exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.
中文摘要 生成模型近年来在多个领域取得了显著成功，促使人们将其用作强化学习（RL）中富有表达力的策略。虽然它们在离线RL中表现出色，尤其是在目标分布定义明确的情况下，但其向在线微调的扩展大多被视为离线预训练的直接延续，关键挑战仍未得到解决。本文提出用于离线到在线RL的注入噪声流匹配方法（Flow Matching with Injected Noise for Offline-to-Online RL，FINO），这是一种利用基于流匹配的策略来提升离线到在线RL样本效率的新方法。FINO通过在策略训练中注入噪声来促进有效探索，从而鼓励策略尝试离线数据集中未出现的更广泛的动作。除了增强探索的流策略训练之外，我们还结合了熵引导的采样机制来平衡探索与利用，使策略能够在整个在线微调过程中调整其行为。在多种具有挑战性的任务上的实验表明，FINO在有限的在线预算下始终取得更优的性能。

BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

BLM-Guard：结合思维链与政策对齐奖励的可解释多模态广告审核

Authors: Yiran Yang, Zhaowei Liu, Yuan Yuan, Yukun Song, Xiong Ma, Yinghao Song, Xiangji Zeng, Lu Sun, Yulu Wang, Hai Zhou, Shuai Cui, Zhaohan Gong, Jiefei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.18193
Pdf link: https://arxiv.org/pdf/2602.18193
Abstract Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.
中文摘要 短视频平台如今承载着海量多模态广告，其具有欺骗性的画面、语音和字幕需要比社区安全过滤器更细粒度、由政策驱动的审核。我们提出BLM-Guard，这是一个面向商业广告的内容审核框架，融合了思维链推理、基于规则的政策原则以及critic引导的奖励。规则驱动的ICoT数据合成流水线通过生成结构化的场景描述、推理链和标签来启动训练，并降低标注成本。随后，强化学习利用兼顾因果连贯性与政策遵循度的复合奖励进一步优化模型。多任务架构对模态内操纵（如夸大的画面）和跨模态不匹配（如字幕与语音的偏离）进行建模，提升了鲁棒性。在真实短视频广告上的实验表明，BLM-Guard在准确性、一致性和泛化性方面均超过了强基线。

PRISM: Parallel Reward Integration with Symmetry for MORL

PRISM：面向MORL的对称并行奖励集成

Authors: Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.18277
Pdf link: https://arxiv.org/pdf/2602.18277
Abstract This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at this https URL.
中文摘要 本文研究异构多目标强化学习（MORL），其中各目标在时间频率上可能差异很大。这种异构性使稠密目标主导学习，而稀疏的长时域奖励只能获得较弱的信用分配，导致样本效率低下。我们提出对称并行奖励集成（PRISM）算法，将反射对称性作为归纳偏置来对齐各奖励通道。PRISM引入了ReSymNet，这是一个由理论驱动的模型，用于调和不同目标之间的时间频率失配，它利用残差块学习缩放的机会价值（scaled opportunity value），在保持最优策略不变的同时加速探索。我们还提出了SymReg，一种反射等变正则化器，它强制智能体镜像，并将策略搜索限制在反射等变子空间内。可以证明，这一限制降低了假设复杂度并改善了泛化能力。在MuJoCo基准上，PRISM始终优于稀疏奖励基线和使用完整稠密奖励训练的oracle，提升了帕累托覆盖率和分布均衡性：其超体积（hypervolume）增益相对基线超过100%，相对oracle最高达32%。代码见 this https URL。

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

扩散以协调：高效的在线多智能体扩散策略

Authors: Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18291
Pdf link: https://arxiv.org/pdf/2602.18291
Abstract Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.
中文摘要 在线多智能体强化学习（MARL）是实现高效智能体协调的重要框架。提升策略的表达能力对于取得更优性能至关重要。基于扩散的生成模型在图像生成和离线场景中展现了出色的表达能力和多模态表示能力，非常适合满足这一需求。然而，它们在在线MARL中的潜力在很大程度上仍未得到充分探索。一个主要障碍是扩散模型的似然难以处理，阻碍了基于熵的探索与协调。为应对这一挑战，我们提出了最早的使用扩散策略的在线离策略（off-policy）MARL框架之一OMAD，用于实现协调。我们的核心创新是一个松弛的策略目标，它最大化缩放后的联合熵，从而在不依赖可处理似然的情况下实现有效探索。与之相配合，在集中式训练、分散式执行（CTDE）范式下，我们采用联合分布式价值函数来优化分散的扩散策略。它利用可处理的熵增强目标来指导各扩散策略的同步更新，从而确保稳定的协调。在MPE和MAMuJoCo上的大量评估表明，我们的方法在10个不同任务上达到了新的最优水平，样本效率提升了2.5倍至5倍。

Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

利用动作雅可比惩罚学习平滑的时变线性策略

Authors: Zhaoming Xie, Kevin Karol, Jessica Hodgins
Subjects: Robotics (cs.RO); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2602.18312
Pdf link: https://arxiv.org/pdf/2602.18312
Abstract Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes a large change in actions over time. This term often requires substantial tuning efforts. We propose to use the action Jacobian penalty, which penalizes changes in action with respect to the changes in simulated state directly through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden for calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, exhibits faster learning convergence compared to baseline methods, and can be more efficiently queried during inference time compared to a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, is able to learn policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
中文摘要 强化学习为学习控制策略提供了一个框架，这些策略能够为仿真角色再现多样化的动作。然而，这类策略往往会利用人类或实体机器人无法实现的、不自然的高频信号，因而不能很好地代表真实世界的行为。现有工作通过增加一个奖励项来解决这一问题，该项惩罚动作随时间的大幅变化，但通常需要大量的调参工作。我们提出使用动作雅可比惩罚，它通过自动微分直接惩罚动作相对于仿真状态变化的变化。这能够有效消除不切实际的高频控制信号，且无需针对特定任务进行调参。动作雅可比惩罚虽然有效，但与传统的全连接神经网络架构一起使用时会带来显著的计算开销。为缓解这一问题，我们引入了一种名为线性策略网络（Linear Policy Net，LPN）的新架构，它显著降低了训练过程中计算动作雅可比惩罚的开销。此外，LPN无需参数调优，学习收敛速度快于基线方法，并且在推理时比全连接神经网络的查询效率更高。我们展示了LPN与动作雅可比惩罚相结合，能够在完成多种不同特性的动作模仿任务（包括后空翻等动态动作和各种高难度跑酷技能）的同时，学到产生平滑信号的策略。最后，我们将该方法应用于一台配备机械臂的实体四足机器人，为其生成动态动作的策略。
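
The abstract above describes penalizing the Jacobian of the action with respect to the state, and the title refers to linear policies. The sketch below illustrates one reason a policy that is linear in the state could make this penalty cheap: for a = K s + k the Jacobian is K itself, so the penalty is a closed-form function of the weights. This is an illustration under assumptions, not the paper's LPN architecture. Finite differences stand in for automatic differentiation, and the names `jacobian_penalty_fd`, `make_linear_policy`, and `linear_penalty` are hypothetical.

```python
def jacobian_penalty_fd(policy, state, eps=1e-5):
    """Squared Frobenius norm of d(action)/d(state) via central differences.

    The abstract uses automatic differentiation; finite differences are a
    stand-in here so the sketch needs no external library.
    """
    total = 0.0
    for j in range(len(state)):
        plus = list(state)
        minus = list(state)
        plus[j] += eps
        minus[j] -= eps
        a_plus, a_minus = policy(plus), policy(minus)
        for ap, am in zip(a_plus, a_minus):
            total += ((ap - am) / (2 * eps)) ** 2
    return total

def make_linear_policy(K, k):
    """Policy a = K s + k; its Jacobian with respect to s is K."""
    def policy(state):
        return [
            sum(kij * sj for kij, sj in zip(row, state)) + ki
            for row, ki in zip(K, k)
        ]
    return policy

def linear_penalty(K):
    """Closed-form penalty for a linear policy: ||K||_F^2, no differentiation."""
    return sum(kij ** 2 for row in K for kij in row)

K = [[0.5, -1.0], [2.0, 0.0]]
k = [0.1, -0.2]
state = [0.3, -0.7]
fd = jacobian_penalty_fd(make_linear_policy(K, k), state)
exact = linear_penalty(K)
print(fd, exact)
```

For a general nonlinear policy only the first function applies, and it costs two policy evaluations per state dimension, which hints at the overhead the abstract attributes to fully connected networks.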

Keyword: diffusion policy

There is no result