Arxiv Papers of Today

Generated: 2025-12-04 16:32:30 (UTC+8); Arxiv release time: 2025-12-04 20:00 EST (2025-12-05 09:00 UTC+8)

42 related papers today

Keyword: reinforcement learning

Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL

Authors: Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03059
Pdf link: https://arxiv.org/pdf/2512.03059
Abstract The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to promote sustainable and low-carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging - especially under real-world conditions, where uncertainties in PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multi-source uncertainties. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision-making. We develop a novel HDRL algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization Lagrangian (DAC-MAPPO-Lagrangian), which integrates Lagrangian relaxation into the Double Actor-Critic (DAC) framework. At the high level, we adopt a centralized PPO-Lagrangian algorithm to learn safe charger allocation policies. At the low level, we incorporate MAPPO-Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real-world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance, while maintaining fast convergence speed.
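
As a rough illustration of the Lagrangian relaxation mentioned in the abstract, the sketch below shows the generic dual-ascent multiplier update commonly used in PPO-Lagrangian-style methods. It is not the authors' DAC-MAPPO-Lagrangian implementation; the function names, the cost limit, and the learning rate are hypothetical choices made for this example.

```python
def lagrangian_objective(reward_return: float, cost_return: float, lam: float) -> float:
    """Penalized objective: reward return minus the lambda-weighted cost return."""
    return reward_return - lam * cost_return


def update_multiplier(lam: float, cost_return: float, cost_limit: float, lr: float = 0.1) -> float:
    """Projected dual ascent: raise lambda when cost exceeds the limit, keep lambda >= 0."""
    return max(0.0, lam + lr * (cost_return - cost_limit))


# Hypothetical per-episode constraint costs (e.g. battery-depletion violations).
lam = 0.0
for episode_cost in [5.0, 4.0, 2.0, 0.5]:
    lam = update_multiplier(lam, episode_cost, cost_limit=1.0)
```

The multiplier grows while the observed cost stays above the limit and shrinks once the policy becomes safe, which is what lets the penalized objective trade cost minimization against constraint satisfaction.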

Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration

Authors: Yiwei Shi, Hongnan Ma, Mengyue Yang, Cunjia Liu, Weiru Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Arxiv link: https://arxiv.org/abs/2512.03102
Pdf link: https://arxiv.org/pdf/2512.03102
Abstract In emergency response and other high-stakes societal applications, early-stage state estimates critically shape downstream outcomes. Yet these initial state estimates, often based on limited or biased information, can be severely misaligned with reality, constraining subsequent actions and potentially causing catastrophic delays, resource misallocation, and human harm. Under the stationary bootstrap baseline (zero transition and no rejuvenation), bootstrap particle filters exhibit Stationarity-Induced Posterior Support Invariance (S-PSI), wherein regions excluded by the initial prior remain permanently unexplorable, making corrections impossible even when new evidence contradicts current beliefs. While classical perturbations can in principle break this lock-in, they operate in an always-on fashion and may be inefficient. To overcome this, we propose a diffusion-driven Bayesian exploration framework that enables principled, real-time correction of early state estimation errors. Our method expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. A Metropolis-Hastings check validates proposals and keeps inference adaptive to unexpected evidence. Empirical evaluations on realistic hazardous-gas localization tasks show that our approach matches reinforcement learning and planning baselines when priors are correct. It substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and we provide theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
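
The support-invariance effect described in the abstract can be reproduced with a toy one-dimensional bootstrap filter. The sketch below is only an illustration under assumed settings (a hypothetical likelihood peaked at 3.0 and a uniform prior on [0, 1)); the `jitter` option is a classical always-on Gaussian perturbation, not the paper's diffusion-driven proposal or its Metropolis-Hastings check.

```python
import random


def bootstrap_step(particles, likelihood, rng, jitter=0.0):
    """One bootstrap-filter step with a zero (identity) transition.

    Resampling only copies existing particles; `jitter` > 0 adds a classical
    always-on Gaussian perturbation (not the paper's diffusion proposal).
    """
    weights = [likelihood(x) for x in particles]
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    if jitter > 0.0:
        resampled = [x + rng.gauss(0.0, jitter) for x in resampled]
    return resampled


def likelihood(x):
    """Hypothetical evidence model peaked at the true state x = 3.0."""
    return 1.0 / (1.0 + (x - 3.0) ** 2)


rng = random.Random(0)
prior = [rng.uniform(0.0, 1.0) for _ in range(200)]  # prior support excludes 3.0

locked = list(prior)
moved = list(prior)
for _ in range(50):
    locked = bootstrap_step(locked, likelihood, rng)            # no rejuvenation
    moved = bootstrap_step(moved, likelihood, rng, jitter=0.5)  # with jitter
```

Without jitter, every surviving particle is a copy of a prior sample, so the filter can never place mass outside the region the prior covered, however strongly the evidence points elsewhere.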

Hierarchical Process Reward Models are Symbolic Vision Learners

Authors: Shan Zhang, Aotian Chen, Kai Zou, Jindong Gu, Yuan Xue, Anton van den Hengel
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03126
Pdf link: https://arxiv.org/pdf/2512.03126
Abstract Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is Symbolic Hierarchical Process Reward Modeling, which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, and improving by +13% on the MathGlance perception benchmark and by +3% on the MathVerse and GeoQA reasoning benchmarks.

Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments

Authors: Aya Taourirte, Md Sohag Mia
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03166
Pdf link: https://arxiv.org/pdf/2512.03166
Abstract The deployment of multi-agent systems in dynamic, adversarial environments like robotic soccer necessitates real-time decision-making, sophisticated cooperation, and scalable algorithms to avoid the curse of dimensionality. While Reinforcement Learning (RL) offers a promising framework, existing methods often struggle with the multi-granularity of tasks (long-term strategy vs. instant actions) and the complexity of large-scale agent interactions. This paper presents a unified Multi-Agent Reinforcement Learning (MARL) framework that addresses these challenges. First, we establish a baseline using Proximal Policy Optimization (PPO) within a client-server architecture for real-time action scheduling, with PPO demonstrating superior performance (4.32 avg. goals, 82.9% ball control). Second, we introduce a Hierarchical RL (HRL) structure based on the options framework to decompose the problem into a high-level trajectory planning layer (modeled as a Semi-Markov Decision Process) and a low-level action execution layer, improving global strategy (avg. goals increased to 5.26). Finally, to ensure scalability, we integrate mean-field theory into the HRL framework, simplifying many-agent interactions into a single agent vs. the population average. Our mean-field actor-critic method achieves a significant performance boost (5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy) and enhanced training stability. Extensive simulations of 4v4 matches in the Webots environment validate our approach, demonstrating its potential for robust, scalable, and cooperative behavior in complex multi-agent domains.
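
To make the mean-field simplification concrete: instead of conditioning on every teammate's and opponent's action, each agent conditions on its own action plus the average action of the others. The sketch below computes that average for discrete actions; the function name and the three-action example are hypothetical, and this is a generic mean-field construction rather than the paper's actor-critic code.

```python
def mean_action(neighbor_actions, n_actions):
    """Average of one-hot encoded neighbor actions (an empirical action distribution)."""
    if not neighbor_actions:
        raise ValueError("at least one neighbor action is required")
    counts = [0] * n_actions
    for action in neighbor_actions:
        counts[action] += 1
    total = len(neighbor_actions)
    return [count / total for count in counts]


# Three other agents choosing among 3 hypothetical discrete actions.
avg = mean_action([0, 2, 2], n_actions=3)
```

A critic that takes `(state, own_action, avg)` has an input size that does not grow with the number of agents, which is the scalability argument made in the abstract.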

GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding

Authors: Johannes Gaber, Meshal Alharbi, Daniele Gammelli, Gioele Zardini
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.03194
Pdf link: https://arxiv.org/pdf/2512.03194
Abstract Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. On congested warehouse benchmarks from the League of Robot Runners (LRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
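
The last stage of the pipeline described above solves small, local assignment problems. As a minimal sketch of what such a subproblem looks like, the code below assigns agents to tasks by brute force over permutations with a Manhattan-distance cost. The cost function, grid coordinates, and function names are assumptions for this example; the paper does not state which solver it uses, and brute force is only viable for a handful of agents.

```python
from itertools import permutations


def manhattan(a, b):
    """Grid distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign(agents, tasks):
    """Return (total_cost, mapping) where mapping[i] is the task index for agent i.

    Assumes len(agents) == len(tasks); enumerates every one-to-one assignment.
    """
    best_cost, best_perm = None, None
    for perm in permutations(range(len(tasks))):
        cost = sum(manhattan(agents[i], tasks[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


# Two free agents and two pickup locations inside one hypothetical region.
cost, mapping = assign([(0, 0), (5, 5)], [(5, 4), (1, 0)])
```

Keeping each subproblem this small is what makes the per-step latency budget plausible; a larger region would call for a polynomial-time assignment solver instead of enumeration.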

A Multi-Agent, Policy-Gradient approach to Network Routing

Authors: Nigel Tao, Jonathan Baxter, Lex Weaver
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.03211
Pdf link: https://arxiv.org/pdf/2512.03211
Abstract Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
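
For readers unfamiliar with OLPOMDP, its core is an online update that keeps an eligibility trace of log-policy gradients and moves the parameters along that trace in proportion to the reward. The sketch below applies that update to a toy two-action problem with a softmax policy, not to a network simulation; the step size, trace discount, and reward values are hypothetical.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of parameters."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def olpomdp_bandit(rewards, steps=5000, alpha=0.05, beta=0.0, seed=0):
    """OLPOMDP-style update on a stateless problem.

    z <- beta * z + grad log pi(a);  theta <- theta + alpha * r * z.
    beta = 0 is used here because a bandit has no delayed rewards.
    """
    rng = random.Random(seed)
    n = len(rewards)
    theta = [0.0] * n
    z = [0.0] * n
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(n), weights=probs)[0]
        grad = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        z = [beta * zi + gi for zi, gi in zip(z, grad)]
        r = rewards[a]
        theta = [t + alpha * r * zi for t, zi in zip(theta, z)]
    return softmax(theta)


probs = olpomdp_bandit([0.0, 1.0])  # action 1 is the rewarded one
```

In the routing setting each router would run its own copy of this update on a shared reward signal, which is how cooperation can emerge without explicit inter-agent messages.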

SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

Authors: Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.03244
Pdf link: https://arxiv.org/pdf/2512.03244
Abstract Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
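
The step-level aggregation idea can be sketched as a per-step majority vote over several independent verifier outputs. This is an assumed, simplified reading of the parallel-scaling stage: the label strings, the function names, and the convention of returning -1 when no step is flagged are hypothetical, and the meta-critique stage is not modeled.

```python
from collections import Counter


def aggregate_step_labels(verifications):
    """Majority vote per step over a list of per-verifier label lists."""
    n_steps = len(verifications[0])
    labels = []
    for step in range(n_steps):
        votes = Counter(v[step] for v in verifications)
        labels.append(votes.most_common(1)[0][0])
    return labels


def first_error_step(labels):
    """Index of the first step voted incorrect, or -1 if none is."""
    for index, label in enumerate(labels):
        if label == "incorrect":
            return index
    return -1


# Three hypothetical verifier runs over a three-step solution.
runs = [
    ["correct", "correct", "incorrect"],
    ["correct", "incorrect", "incorrect"],
    ["correct", "correct", "incorrect"],
]
labels = aggregate_step_labels(runs)
```

Using an odd number of verifier runs avoids ties; the aggregated labels can then serve as synthetic step-level supervision for a process reward model.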

SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

Authors: Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03284
Pdf link: https://arxiv.org/pdf/2512.03284
Abstract Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.

Better World Models Can Lead to Better Post-Training Performance

Authors: Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03400
Pdf link: https://arxiv.org/pdf/2512.03400
Abstract In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
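
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage is shown below; using the population standard deviation and the small `eps` term are assumptions of this example, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one cube state: 1.0 = solved, 0.0 = not solved.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward the advantages are all zero, so the group contributes no gradient signal.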

World Models for Autonomous Navigation of Terrestrial Robots from LIDAR Observations

Authors: Raul Steinmetz, Fabio Demo Rosa, Victor Augusto Kich, Jair Augusto Bottega, Ricardo Bedin Grando, Daniel Fernando Tello Gamarra
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03429
Pdf link: https://arxiv.org/pdf/2512.03429
Abstract Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model-free approaches. Conventional policy networks struggle to process full-resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model-based RL framework built on top of the DreamerV3 algorithm, integrating a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) within a world model to encode high-dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination-based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and higher success rate compared to model-free baselines such as SAC, DDPG, and TD3. It is worth emphasizing that the DreamerV3-based agent attains a 100% success rate across all evaluated environments when using the full dataset of the Turtlebot3 LIDAR (360 readings), while model-free methods plateaued below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high-dimensional sensory data.

Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

Authors: Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees, Theodore Zhao, Andrea Tupini, Isar Meijier, Qianhui Wu, Yuncong Yang, Lars Liden, Yu Gu, Sheng Zhang, Xiaodong Liu, Lijuan Wang, Marc Pollefeys, Yong Jae Lee, Jianfeng Gao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03438
Pdf link: https://arxiv.org/pdf/2512.03438
Abstract Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed from the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes, since different samples may require different scoring functions and teacher models may also provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination, as well as robotics and embodied AI benchmarks. Critically, we demonstrate that relying only on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward hacking in MMRL. Finally, we provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.

PretrainZero: Reinforcement Active Pretraining

Authors: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.03442
Pdf link: https://arxiv.org/pdf/2512.03442
Abstract Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, which places a significant bottleneck on extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from the pretraining corpus, and reasons to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 on the MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.

Variable-Impedance Muscle Coordination under Slow-Rate Control Frequencies and Limited Observation Conditions Evaluated through Legged Locomotion

Authors: Hidaka Asai, Tomoyuki Noda, Jun Morimoto
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.03459
Pdf link: https://arxiv.org/pdf/2512.03459
Abstract Human motor control remains agile and robust despite limited sensory information for feedback, a property attributed to the body's ability to perform morphological computation through muscle coordination with variable impedance. However, it remains unclear how such low-level mechanical computation reduces the control requirements of the high-level controller. In this study, we implement a hierarchical controller consisting of a high-level neural network trained by reinforcement learning and a low-level variable-impedance muscle coordination model with mono- and biarticular muscles in a monoped locomotion task. We systematically restrict the high-level controller by varying the control frequency and by introducing biologically inspired observation conditions: delayed, partial, and substituted observation. Under these conditions, we evaluate how the low-level variable-impedance muscle coordination contributes to the learning process of the high-level neural network. The results show that variable-impedance muscle coordination enables stable locomotion even under slow-rate control frequencies and limited observation conditions. These findings demonstrate that the morphological computation of muscle coordination effectively offloads high-frequency feedback from the high-level controller and provide a design principle for the controller in motor control.

Adaptive sampling using variational autoencoder and reinforcement learning

Authors: Adil Rasheed, Mikael Aleksander Jansen Shahly, Muhammad Faisal Aftab
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03525
Pdf link: https://arxiv.org/pdf/2512.03525
Abstract Compressed sensing enables sparse sampling but relies on generic bases and random measurements, limiting efficiency and reconstruction quality. Optimal sensor placement uses historical data to design tailored sampling patterns, yet its fixed, linear bases cannot adapt to nonlinear or sample-specific variations. Generative model-based compressed sensing improves reconstruction using deep generative priors but still employs suboptimal random sampling. We propose an adaptive sparse sensing framework that couples a variational autoencoder prior with reinforcement learning to select measurements sequentially. Experiments show that this approach outperforms CS, OSP, and generative model-based reconstruction from sparse measurements.

Multi-Agent Reinforcement Learning with Communication-Constrained Priors

Authors: Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.03528
Pdf link: https://arxiv.org/pdf/2512.03528
Abstract Communication is one of the effective means to improve the learning of cooperative policies in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning methods with communication, due to their limited scalability and robustness, struggle to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. We then use this model as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework that folds the quantified impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.

A Learning-based Control Methodology for Transitioning VTOL UAVs

Authors: Zexin Lin, Yebin Zhong, Hanwen Wan, Jiu Cheng, Zhenglong Sun, Xiaoqiang Ji
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03548
Pdf link: https://arxiv.org/pdf/2512.03548
Abstract Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. Current control methods' decoupled control of altitude and position leads to significant vibration, and limits interaction consideration and adaptability. In this study, we propose a novel coupled transition control methodology based on a reinforcement learning (RL)-driven controller. In contrast to the conventional phase-transition approach, the ST3M method offers a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, exhibiting outstanding trajectory tracking and reduced vibrations during the transition process.
中文摘要 由于倾转旋翼机构在过渡过程中会改变重心和推力方向，过渡控制是垂直起降无人机（VTOL UAV）开发中的关键挑战。现有控制方法对高度和位置进行解耦控制，导致显著振动，并限制了对相互作用的考虑和适应性。本研究提出一种基于强化学习（RL）驱动控制器的新型耦合过渡控制方法。此外，与传统的分阶段过渡方法不同，ST3M方法将巡航模式视为悬停的特例，提供了新的视角。我们在仿真和真实环境中验证了该方法的可行性，展示了高效的控制器开发和迁移，同时能够精确控制无人机的位置和姿态，在过渡过程中表现出出色的轨迹跟踪能力并减少了振动。

RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

RoboScape-R：通过强化学习实现通用机器人训练的统一奖励-观察世界模型

Authors: Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang, Shu'ang Yu, Liangzhi Shi, Chao Yu, Chen Gao, Wei Wu, Yong Li
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03556
Pdf link: https://arxiv.org/pdf/2512.03556
Abstract Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
中文摘要 实现可泛化的具身策略仍是关键挑战。传统的策略学习范式，包括模仿学习（IL）和强化学习（RL），都难以在多样场景中培养泛化能力。IL策略常常过拟合于特定专家轨迹，而强化学习则天然缺乏有效多场景泛化所需的统一且通用的奖励信号。我们认为，世界模型具有独特的能力，能够作为通用的环境代理来解决这一限制。然而，当前的世界模型主要关注预测观测的能力，仍依赖任务特定的手工设计奖励函数，因此未能提供真正通用的训练环境。针对这一问题，我们提出了RoboScape-R框架，利用世界模型作为强化学习范式中具身环境的多功能通用代理。我们引入了一种基于世界模型的新型通用奖励机制，它根据模型对现实世界状态转移动态的内在理解生成“内生”奖励。大量实验表明，RoboScape-R通过提供高效且通用的训练环境，有效解决了传统强化学习方法的局限性，显著提升了具身策略的泛化能力。我们的方法为将世界模型用作在线训练策略提供了关键见解，并在域外场景下相比基线平均取得37.5%的性能提升。

Accelerating Detailed Routing Convergence through Offline Reinforcement Learning

通过离线强化学习加速详细布线收敛

Authors: Afsara Khan, Austin Rovinski
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2512.03594
Pdf link: https://arxiv.org/pdf/2512.03594
Abstract Detailed routing remains one of the most complex and time-consuming steps in modern physical design due to the challenges posed by shrinking feature sizes and stricter design rules. Prior detailed routers achieve state-of-the-art results by leveraging iterative pathfinding algorithms to route each net. However, runtimes are a major issue in detailed routers, as converging to a solution with zero design rule violations (DRVs) can be prohibitively expensive. In this paper, we propose leveraging reinforcement learning (RL) to enable rapid convergence in detailed routing by learning from previous designs. We make the key observation that prior detailed routers statically schedule the cost weights used in their routing algorithms, meaning they do not change in response to the design or technology. By training a conservative Q-learning (CQL) model to dynamically select the routing cost weights which minimize the number of algorithm iterations, we find that our work completes the ISPD19 benchmarks with 1.56x average and up to 3.01x faster runtime than the baseline router while maintaining or improving the DRV count in all cases. We also find that this learning shows signs of generalization across technologies, meaning that learning designs in one technology can translate to improved outcomes in other technologies.
中文摘要 由于特征尺寸缩小和设计规则日趋严格带来的挑战，详细布线仍是现代物理设计中最复杂且最耗时的步骤之一。以往的详细布线器利用迭代寻路算法为每个线网布线，取得了最先进的结果。然而，运行时间是详细布线器的一个主要问题，因为收敛到零设计规则违规（DRV）的解的代价可能高得令人望而却步。本文提出利用强化学习（RL），通过从以往设计中学习，实现详细布线的快速收敛。我们观察到一个关键点：以往的详细布线器对其布线算法中使用的代价权重进行静态调度，这意味着这些权重不会随设计或工艺而变化。通过训练保守Q学习（CQL）模型来动态选择能使算法迭代次数最少的布线代价权重，我们的工作在ISPD19基准测试上的运行时间平均比基线布线器快1.56倍，最高快3.01倍，同时在所有情况下都保持或改善了DRV数量。我们还发现，这种学习表现出跨工艺泛化的迹象，即在一种工艺的设计上学习可以改善其他工艺上的结果。

A Descriptive Model for Modelling Attacker Decision-Making in Cyber-Deception

网络欺骗中攻击者决策建模的描述模型

Authors: B.R. Turner, O. Guidetti, N.M. Karie, R. Ryan, Y. Yan
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2512.03641
Pdf link: https://arxiv.org/pdf/2512.03641
Abstract Cyber-deception is an increasingly important defensive strategy, shaping adversarial decision making through controlled misinformation, uncertainty, and misdirection. Although game-theoretic, Bayesian, Markov decision process, and reinforcement learning models offer insight into deceptive interactions, they typically assume an attacker has already chosen to engage. Such approaches overlook cognitive and perceptual factors that influence an attacker's initial decision to engage or withdraw. This paper presents a descriptive model that incorporates the psychological and strategic elements shaping this decision. The model defines five components, belief (B), scepticism (S), deception fidelity (D), reconnaissance (R), and experience (E), which interact to capture how adversaries interpret deceptive cues and assess whether continued engagement is worthwhile. The framework provides a structured method for analysing engagement decisions in cyber-deception scenarios. A series of experiments has been designed to evaluate this model through Capture the Flag activities incorporating varying levels of deception, supported by behavioural and biometric observations. These experiments have not yet been conducted, and no experimental findings are presented in this paper. These experiments will combine behavioural observations with biometric indicators to produce a multidimensional view of adversarial responses. Findings will improve understanding of the factors influencing engagement decisions and refine the model's relevance to real-world cyber-deception settings. By addressing the gap in existing models that presume engagement, this work supports more cognitively realistic and strategically effective cyber-deception practices.
中文摘要 网络欺骗是一种日益重要的防御策略，通过受控的虚假信息、不确定性和误导来影响对手的决策。尽管博弈论、贝叶斯、马尔可夫决策过程和强化学习模型能为欺骗性交互提供洞见，但它们通常假设攻击者已经选择介入。这类方法忽视了影响攻击者最初决定介入还是撤退的认知和感知因素。本文提出了一个描述性模型，融合了塑造这一决策的心理和战略因素。该模型定义了五个组成部分：信念（B）、怀疑（S）、欺骗保真度（D）、侦察（R）和经验（E），它们相互作用，刻画对手如何解读欺骗线索并评估继续介入是否值得。该框架为分析网络欺骗场景中的介入决策提供了结构化方法。我们设计了一系列实验，通过包含不同欺骗程度的夺旗（Capture the Flag）活动，并辅以行为和生物特征观察来评估该模型。这些实验尚未进行，本文未给出任何实验结果。这些实验将结合行为观察与生物特征指标，形成对对手反应的多维视图。研究结果将加深对影响介入决策的因素的理解，并提升模型与现实网络欺骗环境的相关性。通过弥补现有模型预设攻击者已介入这一空白，本研究支持更符合认知现实、战略上更有效的网络欺骗实践。

ContactRL: Safe Reinforcement Learning based Motion Planning for Contact based Human Robot Collaboration

ContactRL：基于安全强化学习的运动规划，用于基于接触的人机协作

Authors: Sundas Rafat Mulkana, Ronyu Yu, Tanaya Guha, Emma Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.03707
Pdf link: https://arxiv.org/pdf/2512.03707
Abstract In collaborative human-robot tasks, safety requires not only avoiding collisions but also ensuring safe, intentional physical contact. We present ContactRL, a reinforcement learning (RL) based framework that directly incorporates contact safety into the reward function through force feedback. This enables a robot to learn adaptive motion profiles that minimize human-robot contact forces while maintaining task efficiency. In simulation, ContactRL achieves a low safety violation rate of 0.2% with a high task success rate of 87.7%, outperforming state-of-the-art constrained RL baselines. In order to guarantee deployment safety, we augment the learned policy with a kinetic energy based Control Barrier Function (eCBF) shield. Real-world experiments on a UR3e robotic platform performing small object handovers from a human hand across 360 trials confirm safe contact, with measured normal forces consistently below 10N. These results demonstrate that ContactRL enables safe and efficient physical collaboration, thereby advancing the deployment of collaborative robots in contact-rich tasks.
中文摘要 在人机协作任务中，安全不仅要求避免碰撞，还要求确保安全且有意的物理接触。我们提出了ContactRL，一种基于强化学习（RL）的框架，通过力反馈直接将接触安全纳入奖励函数。这使机器人能够学习自适应的运动曲线，在保持任务效率的同时最小化人机接触力。在仿真中，ContactRL实现了0.2%的低安全违规率和87.7%的高任务成功率，优于最先进的约束强化学习基线。为了保证部署安全，我们用基于动能的控制屏障函数（eCBF）防护层来增强所学策略。在UR3e机器人平台上进行的真实实验中，机器人在360次试验中从人手中接过小型物体，结果确认了接触的安全性，测得的法向力始终低于10N。这些结果表明，ContactRL能够实现安全高效的物理协作，从而推动协作机器人在接触密集型任务中的部署。

Tutorial on Large Language Model-Enhanced Reinforcement Learning for Wireless Networks

无线网络大型语言模型增强强化学习教程

Authors: Lingyi Cai, Wenjie Fu, Yuxi Huang, Ruichen Zhang, Yinqiu Liu, Jiawen Kang, Zehui Xiong, Tao Jiang, Dusit Niyato, Xianbin Wang, Shiwen Mao, Xuemin Shen
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.03722
Pdf link: https://arxiv.org/pdf/2512.03722
Abstract Reinforcement Learning (RL) has shown remarkable success in enabling adaptive and data-driven optimization for various applications in wireless networks. However, classical RL suffers from limitations in generalization, learning feedback, interpretability, and sample efficiency in dynamic wireless environments. Large Language Models (LLMs) have emerged as a transformative Artificial Intelligence (AI) paradigm with exceptional capabilities in knowledge generalization, contextual reasoning, and interactive generation, which have demonstrated strong potential to enhance classical RL. This paper serves as a comprehensive tutorial on LLM-enhanced RL for wireless networks. We propose a taxonomy to categorize the roles of LLMs into four critical functions: state perceiver, reward designer, decision-maker, and generator. Then, we review existing studies exploring how each role of LLMs enhances different stages of the RL pipeline. Moreover, we provide a series of case studies to illustrate how to design and apply LLM-enhanced RL in low-altitude economy networking, vehicular networks, and space-air-ground integrated networks. Finally, we conclude with a discussion on potential future directions for LLM-enhanced RL and offer insights into its future development in wireless networks.
中文摘要 强化学习（RL）在为无线网络中的各种应用实现自适应和数据驱动的优化方面取得了显著成功。然而，在动态无线环境中，经典强化学习在泛化、学习反馈、可解释性和样本效率方面存在局限。大型语言模型（LLM）作为一种具有变革性的人工智能（AI）范式，在知识泛化、上下文推理和交互生成方面表现出卓越的能力，展现出增强经典强化学习的强大潜力。本文是一份关于面向无线网络的LLM增强强化学习的全面教程。我们提出了一种分类法，将LLM的角色分为四个关键功能：状态感知器、奖励设计器、决策器和生成器。随后，我们回顾了现有研究，探讨LLM的各个角色如何增强强化学习流程的不同阶段。此外，我们还提供了一系列案例研究，展示如何在低空经济网络、车联网以及空天地一体化网络中设计和应用LLM增强强化学习。最后，我们讨论了LLM增强强化学习未来可能的研究方向，并就其在无线网络中的未来发展提出见解。

Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing

自主规划在轨组装强化学习自由飞行器（APIARY）国际空间站Astrobee测试

Authors: Samantha Chapin, Kenneth Stewart, Roxana Leontie, Carl Glen Henshaw
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.03729
Pdf link: https://arxiv.org/pdf/2512.03729
Abstract The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th 2025 the APIARY team conducted the first ever, to our knowledge, RL control of a free-flyer in space using the NASA Astrobee robot on-board the International Space Station (ISS). A robust 6-degrees of freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
中文摘要 美国海军研究实验室（NRL）的自主规划在轨组装强化学习自由飞行器（Autonomous Planning In-space Assembly Reinforcement-learning free-flYer，APIARY）实验开创性地将强化学习（RL）用于在太空零重力环境下控制自由飞行机器人。2025年5月27日星期二，APIARY团队利用国际空间站（ISS）上的NASA Astrobee机器人，实现了据我们所知首次在太空中对自由飞行器的强化学习控制。我们在NVIDIA Isaac Lab仿真环境中使用actor-critic近端策略优化（PPO）网络训练了一个鲁棒的6自由度（DOF）控制策略，并对目标位姿和质量分布进行随机化以增强鲁棒性。本文详细介绍了该实验的仿真测试、地面测试和飞行验证。此次在轨演示验证了强化学习在提升机器人自主性方面的变革潜力，使得能够快速（在数分钟到数小时内）开发和部署面向太空探索、物流和实时任务需求的定制行为。

Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control

跨越模拟与地面测试到太空部署自主自由飞行器控制之间的Sim2Real鸿沟

Authors: Kenneth Stewart, Samantha Chapin, Roxana Leontie, Carl Glen Henshaw
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.03736
Pdf link: https://arxiv.org/pdf/2512.03736
Abstract Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.
中文摘要 强化学习（RL）为太空机器人控制带来了变革性的潜力。我们展示了首次在轨演示：基于强化学习对国际空间站（ISS）上的自由飞行机器人NASA Astrobee进行自主控制。利用NVIDIA的Omniverse物理模拟器和课程学习，我们训练了一个深度神经网络来取代Astrobee的标准姿态和平移控制，使其能够在微重力环境下导航。我们的结果验证了一种新型训练流程，它利用GPU加速的科学级仿真环境进行高效的蒙特卡洛强化学习训练，弥合了仿真到现实（Sim2Real）的差距。此次成功部署证明了在地面训练强化学习策略并将其迁移到太空应用的可行性。这为未来在轨服务、组装和制造（ISAM）领域的工作铺平了道路，使系统能够在轨快速适应动态的任务需求。

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

用编程视觉思考：迈向图像思考的统一视角

Authors: Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.03746
Pdf link: https://arxiv.org/pdf/2512.03746
Abstract Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at this https URL.
中文摘要 能够借助图像思考的多模态大型语言模型（MLLM）可以交互式地使用工具对视觉输入进行推理，但目前的方法通常依赖一组范围狭窄的工具，其现实必要性和可扩展性有限。在这项工作中，我们首先揭示了一个关键且此前被忽视的弱点：即使是最先进的MLLM也出奇地脆弱，在仅有简单方向变化或自然损坏的图像上性能显著下降，这凸显了对更鲁棒的基于工具的推理的需求。为此，我们提出了CodeVision，一个灵活且可扩展的代码即工具框架，模型生成代码作为通用接口来调用任意图像操作，不再局限于固定的工具注册表。我们采用两阶段方法训练模型：首先在为复杂多轮工具组合和错误恢复而精心构建的高质量数据集上进行监督微调（SFT），随后使用新颖且稠密的过程奖励函数进行强化学习（RL），以鼓励有策略且高效的工具使用。为促进这项研究，我们构建了新的SFT和RL数据集，并引入了一套具有挑战性的新基准测试，旨在严格评估对方向变化的鲁棒性和多工具推理能力。在Qwen2.5-VL和Qwen3-VL系列上的实验表明，我们的方法显著提升了模型性能，并促成了灵活的工具组合、高效的链式执行以及基于运行时反馈的鲁棒错误恢复等涌现能力。代码可在此 https URL 访问。

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

扩散大型语言模型的原则性强化学习源于序列级视角

Authors: Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, Chongxuan Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03759
Pdf link: https://arxiv.org/pdf/2512.03759
Abstract Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at this https URL.
中文摘要 强化学习（RL）已被证明对自回归语言模型非常有效，但将这些方法应用于扩散大型语言模型（dLLM）面临根本性挑战。核心难点在于似然近似：自回归模型天然提供词元（token）级条件概率，这对词元级强化学习目标（如GRPO）至关重要，而dLLM通过迭代的非自回归去噪步骤生成序列，缺乏这种因式分解。为解决这一根本性不匹配，我们提出了基于ELBO的序列级策略优化（ESPO），这是一个原则性的强化学习框架，它将整个序列的生成视为单一动作，并以ELBO作为易于处理的序列级似然代理。我们的方法结合了重要性比率的逐词元归一化和鲁棒的KL散度估计，以确保大规模训练的稳定性。在数学推理、编码和规划任务上的大量实验表明，ESPO显著优于词元级基线，在Countdown任务上取得了20-40分的大幅提升，同时在数学和编码基准上保持稳定的增益。我们的方法确立了序列级优化作为dLLM强化学习中一种有原则且经验上有效的范式。我们的代码可在此 https URL 访问。
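
Illustrative sketch (not from the paper): the abstract describes treating a whole generated sequence as one action, using the ELBO as a stand-in for the sequence log-likelihood, and normalizing the importance ratio per token. The snippet below shows one plausible reading of that idea. The function names, the PPO-style clipped surrogate, the clipping range, and all numbers are assumptions made for illustration; the paper's actual objective and KL estimator may differ.

```python
import math

def sequence_ratio(elbo_new: float, elbo_old: float, num_tokens: int) -> float:
    """Sequence-level importance ratio with per-token normalization.

    The ELBO values play the role of log-likelihoods of the whole sequence
    under the new and old policies (an assumption of this sketch). Dividing
    the log-ratio by the sequence length keeps the ratio close to 1 for
    long sequences.
    """
    return math.exp((elbo_new - elbo_old) / num_tokens)

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective applied once per sequence (assumed form)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical ELBO estimates for one 60-token response.
elbo_old, elbo_new, length = -120.0, -114.0, 60
ratio = sequence_ratio(elbo_new, elbo_old, length)   # exp(6 / 60)
unnormalized = math.exp(elbo_new - elbo_old)          # exp(6), far outside any clip range
objective = clipped_surrogate(ratio, advantage=1.0)
```

Without the division by `num_tokens`, a modest per-token change compounds into a ratio in the hundreds, which a clipped objective would almost always saturate; that is the motivation for normalization suggested by the abstract.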

Sample-Efficient Model-Free Policy Gradient Methods for Stochastic LQR via Robust Linear Regression

通过鲁棒线性回归实现随机LQR的样本高效无模型策略梯度方法

Authors: Bowen Song, Sebastien Gros, Andrea Iannelli
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.03764
Pdf link: https://arxiv.org/pdf/2512.03764
Abstract Policy gradient algorithms are widely used in reinforcement learning and belong to the class of approximate dynamic programming methods. This paper studies two key policy gradient algorithms - the Natural Policy Gradient and the Gauss-Newton Method - for solving the Linear Quadratic Regulator (LQR) problem in unknown stochastic linear systems. The main challenge lies in obtaining an unbiased gradient estimate from noisy data due to errors-in-variables in linear regression. This issue is addressed by employing a primal-dual estimation procedure. Using this novel gradient estimation scheme, the paper establishes convergence guarantees with a sample complexity of order O(1/epsilon). Theoretical results are further supported by numerical experiments, which demonstrate the effectiveness of the proposed algorithms.
中文摘要 策略梯度算法广泛应用于强化学习，属于近似动态规划方法的一类。本文研究了两种关键的策略梯度算法——自然策略梯度和高斯-牛顿法——用于在未知随机线性系统中解决线性二次调节器（LQR）问题。主要挑战在于如何从因线性回归中变量误差导致的噪声数据中获得无偏的梯度估计。这个问题通过采用原始-对偶估计方法来解决。利用这一新颖的梯度估计方案，论文建立了样本复杂度为O（1/ε）阶的收敛保证。理论结果还得到了数值实验的支持，这些实验展示了所提出算法的有效性。
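
Illustrative sketch (not from the paper): the two update rules named in the abstract can be written down for a scalar LQR problem when the model is known. This is a deliberate simplification: the paper studies the model-free, stochastic setting and estimates the gradient from noisy data, which this sketch does not attempt. The update forms follow the standard exact-gradient formulation of policy gradient for LQR; the system constants and step sizes are arbitrary choices for illustration.

```python
def value_coefficient(k, a, b, q, r):
    """P for the policy u = -k*x on x' = a*x + b*u with stage cost q*x^2 + r*u^2."""
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("gain k does not stabilize the system")
    return (q + r * k * k) / (1.0 - closed_loop ** 2)

def e_term(k, a, b, q, r):
    """E_K = (r + b^2 P) k - b P a; the cost gradient is proportional to E_K."""
    p = value_coefficient(k, a, b, q, r)
    return (r + b * b * p) * k - b * p * a, p

def natural_pg_step(k, a, b, q, r, eta):
    """Natural policy gradient update: k <- k - 2*eta*E_K."""
    e, _ = e_term(k, a, b, q, r)
    return k - 2.0 * eta * e

def gauss_newton_step(k, a, b, q, r, eta):
    """Gauss-Newton update: k <- k - 2*eta*E_K / (r + b^2 P)."""
    e, p = e_term(k, a, b, q, r)
    return k - 2.0 * eta * e / (r + b * b * p)

# Arbitrary unstable open-loop system with a stabilizing initial gain.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
k_npg = k_gn = 1.0
for _ in range(200):
    k_npg = natural_pg_step(k_npg, a, b, q, r, eta=0.05)
for _ in range(20):
    k_gn = gauss_newton_step(k_gn, a, b, q, r, eta=0.5)
```

With `eta=0.5` the Gauss-Newton step coincides with classical policy iteration, which is why it needs far fewer iterations here than the natural policy gradient step.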

Safety Reinforced Model Predictive Control (SRMPC): Improving MPC with Reinforcement Learning for Motion Planning in Autonomous Driving

安全强化模型预测控制（SRMPC）：通过强化学习改进MPC以实现自动驾驶运动规划

Authors: Johannes Fischer, Marlon Steiner, Ömer Sahin Tas, Christoph Stiller
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.03774
Pdf link: https://arxiv.org/pdf/2512.03774
Abstract Model predictive control (MPC) is widely used for motion planning, particularly in autonomous driving. To run in real time, the planner must use convex approximations of optimal control problems (OCPs). However, such approximations confine the solution to a subspace, which might not contain the global optimum. To address this, we propose using safe reinforcement learning (SRL) to obtain a new and safe reference trajectory within MPC. By employing a learning-based approach, the MPC can explore solutions beyond the close neighborhood of the previous one, potentially finding global optima. We incorporate constrained reinforcement learning (CRL) to ensure safety in automated driving, using a handcrafted energy function-based safety index as the constraint objective to model safe and unsafe regions. Our approach utilizes a state-dependent Lagrangian multiplier, learned concurrently with the safe policy, to solve the CRL problem. Through experimentation in a highway scenario, we demonstrate the superiority of our approach over both MPC and SRL in terms of safety and performance measures.
中文摘要 模型预测控制（MPC）广泛应用于运动规划，尤其是在自动驾驶领域。为使规划器具备实时能力，需要采用最优控制问题（OCP）的凸近似。然而，这种近似将解限制在一个子空间内，而该子空间可能不包含全局最优解。为此，我们提出使用安全强化学习（SRL）在MPC中获得新的安全参考轨迹。通过采用基于学习的方法，MPC可以探索前一个解的近邻域之外的解，从而有可能找到全局最优解。我们引入约束强化学习（CRL）以确保自动驾驶的安全，使用手工设计的基于能量函数的安全指数作为约束目标，对安全与不安全区域进行建模。我们的方法利用与安全策略同时学习的状态依赖拉格朗日乘子来求解CRL问题。通过在高速公路场景中的实验，我们证明了该方法在安全和性能指标上均优于MPC和SRL。

Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning

Omni-AutoThink：通过强化学习实现的自适应多模态推理

Authors: Dongchao Yang, Songxiang Liu, Disong Wang, Yuanyuan Wang, Guanglu Wan, Helen Meng
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2512.03783
Pdf link: https://arxiv.org/pdf/2512.03783
Abstract Recent advances in Omni models have enabled unified multimodal perception and generation. However, most existing systems still exhibit rigid reasoning behaviors, either overthinking simple problems or failing to reason when necessary. To address this limitation, we propose Omni-AutoThink, a novel adaptive reasoning framework that dynamically adjusts the model's reasoning depth according to task difficulty. Our framework comprises two stages: (1) an Adaptive Supervised Fine-Tuning (Adaptive SFT) stage, which endows the Omni model with fundamental reasoning capability using large-scale reasoning-augmented data, and (2) an Adaptive Reinforcement Learning (Adaptive GRPO) stage, which optimizes reasoning behaviors based on task complexity and reward feedback. We further construct a comprehensive adaptive reasoning benchmark that spans text-only, text-audio, text-visual, and text-audio-visual modalities, providing both training and evaluation splits for multimodal reasoning assessment. Experimental results demonstrate that our proposed framework significantly improves adaptive reasoning performance compared to previous baselines. All benchmark data and code will be publicly released.
中文摘要 Omni模型的最新进展实现了统一的多模态感知和生成。然而，大多数现有系统仍然表现出僵化的推理行为，要么对简单问题过度思考，要么在必要时未能进行推理。为解决这一局限性，我们提出了Omni-AutoThink，一种新的自适应推理框架，能够根据任务难度动态调整模型的推理深度。我们的框架包含两个阶段：（1）自适应监督微调（Adaptive SFT）阶段，利用大规模推理增强数据赋予Omni模型基础推理能力；（2）自适应强化学习（Adaptive GRPO）阶段，基于任务复杂度和奖励反馈优化推理行为。我们还构建了一个全面的自适应推理基准，涵盖纯文本、文本-音频、文本-视觉和文本-音频-视觉模态，并为多模态推理评估提供训练和评估划分。实验结果表明，与以往基线相比，我们提出的框架显著提升了自适应推理性能。所有基准数据和代码将公开发布。

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision：通过自适应视觉获取实现高效的视觉语言模型

Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03794
Pdf link: https://arxiv.org/pdf/2512.03794
Abstract Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
中文摘要 视觉语言模型（VLM）在视觉问答任务中取得了显著成功，但其对大量视觉token的依赖带来了显著的计算开销。现有的高效VLM方法通过固定比率压缩来减少视觉token，但它们是被动运作的，缺乏适应不同任务需求的能力。这引出了一个根本性问题：VLM能否自主确定每个样本所需的最少视觉token数量？受人类主动视觉机制启发，我们提出了AdaptVision，一种高效的VLM范式，通过由粗到细的方式实现自适应的视觉token获取。我们的模型首先处理来自低分辨率图像的压缩视觉token，并在必要时调用边界框工具裁剪关键区域，有选择地获取额外的视觉信息。我们使用强化学习框架训练AdaptVision，仔细平衡准确性与效率。我们方法的核心是解耦轮次策略优化（DTPO），它将学习目标解耦为两个部分：（1）工具学习，优化工具的正确使用；（2）准确性提升，改进生成的回答以提高答案正确率。基于这一形式化，我们为与每个目标相关的token分别计算优势，从而进一步解耦优势估计。与普通GRPO相比，这一形式化使AdaptVision的优化更为有效。在多个VQA基准上的综合实验表明，AdaptVision在消耗的视觉token远少于最先进的高效VLM方法的同时，取得了更优的性能。
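
Illustrative sketch (not from the paper): the abstract contrasts vanilla GRPO with computing separate advantages for the tool-use objective and the accuracy objective. The snippet below shows the group-relative normalization commonly associated with GRPO and applies it once to a summed reward (the vanilla case) and once per objective (one possible reading of "decoupled"). The reward values, the use of the population standard deviation, and the idea of keeping two advantage lists are assumptions; the abstract does not specify how advantages are assigned to tokens.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of the same prompt.
tool_rewards = [1.0, 0.0, 1.0, 0.0]       # was the crop tool used correctly?
accuracy_rewards = [1.0, 1.0, 0.0, 1.0]   # was the final answer correct?

# Vanilla: one advantage per rollout from the summed reward.
combined = [t + a for t, a in zip(tool_rewards, accuracy_rewards)]
vanilla_adv = group_relative_advantages(combined)

# Decoupled (assumed form): one advantage per objective, to be applied
# to the tokens associated with that objective.
tool_adv = group_relative_advantages(tool_rewards)
accuracy_adv = group_relative_advantages(accuracy_rewards)
```

In this toy group, rollouts 1 and 4 receive the same combined reward and therefore the same vanilla advantage, even though one used the tool correctly and the other did not; the per-objective advantages keep that distinction.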

MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving

MPCFormer：一种物理信息与数据驱动相结合的方法，用于可解释的社会感知自动驾驶

Authors: Jia Hu, Zhexi Lian, Xuerun Yan, Ruiang Bi, Dou Shen, Yu Ruan, Haoran Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03795
Pdf link: https://arxiv.org/pdf/2512.03795
Abstract Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding of the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete state-space representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-loop evaluation on the NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approaches. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Closed-loop experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
中文摘要 自动驾驶（AD）车辆在高度动态和强交互的交通场景中仍难以展现出类人行为。关键挑战在于AD与周围车辆交互的能力有限，主要原因是对社会交互的底层机制缺乏理解。为解决这一问题，我们提出了MPCFormer，一种可解释的社会感知自动驾驶方法，其社会交互动力学由物理信息与数据驱动耦合而成。在该模型中，动力学被表述为离散状态空间表示，其中嵌入了物理先验以增强建模的可解释性。动力学系数通过基于Transformer的编码器-解码器架构从自然驾驶数据中学习得到。据我们所知，MPCFormer是首个显式建模多车社会交互动力学的方法。所学到的社会交互动力学使规划器在与周围交通交互时能够生成多样的类人行为。通过利用MPC框架，该方法降低了纯学习方法通常伴随的潜在安全风险。在NGSIM数据集上的开环评估表明，MPCFormer实现了更优的社会交互感知，与其他最先进方法相比轨迹预测误差最低。在5秒的长预测时域内，预测的ADE低至0.86米。在需要连续变道才能从匝道驶出的高强度交互场景中进行的闭环实验进一步验证了MPCFormer的有效性。结果显示，MPCFormer取得了94.67%的最高规划成功率，驾驶效率提升15.75%，并将碰撞率从21.25%降至0.5%，优于基于强化学习（RL）的前沿规划器。
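
Illustrative sketch (not from the paper): the abstract reports prediction quality as ADE. Under its usual definition, the average displacement error is the mean Euclidean distance between predicted and ground-truth positions over all time steps of the prediction horizon. The function name and the toy trajectory below are made up for illustration; how the paper samples time steps and aggregates over vehicles is not stated in the abstract.

```python
import math

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching points of two trajectories.

    Both arguments are equal-length sequences of (x, y) positions in metres,
    one entry per predicted time step.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)

# Toy example: the prediction keeps going straight while the real vehicle drifts sideways.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(predicted, ground_truth)
```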

Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA

动态算法配置的深度强化学习：关于用（1+（$λ$，$λ$））-GA 优化 OneMax 的案例研究

Authors: Tai Nguyen, Phong Le, André Biedenkapp, Carola Doerr, Nguyen Dang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03805
Pdf link: https://arxiv.org/pdf/2512.03805
Abstract Dynamic Algorithm Configuration (DAC) studies the efficient identification of control policies for parameterized optimization algorithms. Numerous studies have leveraged the robustness of decision-making in Reinforcement Learning (RL) to address the optimization challenges in algorithm configuration. However, applying RL to DAC is challenging and often requires extensive domain expertise. We conduct a comprehensive study of deep-RL algorithms in DAC through a systematic analysis of controlling the population size parameter of the (1+($\lambda$,$\lambda$))-GA on OneMax instances. Our investigation of DDQN and PPO reveals two fundamental challenges that limit their effectiveness in DAC: scalability degradation and learning instability. We trace these issues to two primary causes: under-exploration and planning horizon coverage, each of which can be effectively addressed through targeted solutions. To address under-exploration, we introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN agent exploration, eliminating the need for instance-specific hyperparameter tuning and ensuring consistent effectiveness across different problem scales. In dealing with the planning horizon coverage problem, we demonstrate that undiscounted learning effectively resolves it in DDQN, while PPO faces fundamental variance issues that necessitate alternative algorithmic designs. We further analyze the hyperparameter dependencies of PPO, showing that while hyperparameter optimization enhances learning stability, it consistently falls short in identifying effective policies across various configurations. Finally, we demonstrate that DDQN equipped with our adaptive reward shifting strategy achieves performance comparable to theoretically derived policies with vastly improved sample efficiency, outperforming prior DAC approaches by several orders of magnitude.
中文摘要 动态算法配置（DAC）研究如何为参数化优化算法高效地找到控制策略。大量研究利用强化学习（RL）决策的稳健性来应对算法配置中的优化挑战。然而，将RL应用于DAC并不容易，通常需要大量领域专业知识。我们以控制 OneMax 实例上 (1+($\lambda$,$\lambda$))-GA 的种群规模参数为对象进行系统分析，对DAC中的深度强化学习算法展开全面研究。我们对DDQN和PPO的研究揭示了限制其在DAC中有效性的两个根本挑战：可扩展性下降和学习不稳定。我们将这些问题归因于两个主要原因：探索不足和规划视野覆盖，二者都可以通过有针对性的方案有效解决。针对探索不足，我们引入一种自适应奖励平移（reward shifting）机制，利用奖励分布的统计量增强DDQN智能体的探索，无需针对具体实例调节超参数，并在不同问题规模下保持一致的效果。针对规划视野覆盖问题，我们证明无折扣学习能在DDQN中有效解决该问题，而PPO则面临根本性的方差问题，需要另行设计算法。我们进一步分析了PPO对超参数的依赖，结果表明超参数优化虽然提升了学习稳定性，但在各种配置下始终无法找到有效策略。最后，我们证明配备自适应奖励平移策略的DDQN能取得与理论推导策略相当的性能，同时样本效率大幅提升，比以往的DAC方法高出数个数量级。

Multi-Agent Deep Reinforcement Learning for UAV-Assisted 5G Network Slicing: A Comparative Study of MAPPO, MADDPG, and MADQN

无人机辅助5G网络切片的多智能体深度强化学习：MAPPO、MADDPG和MADQN的比较研究

Authors: Ghoshana Bista, Abbas Bradai, Emmanuel Moulay, Abdulhalim Dandoush
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.03835
Pdf link: https://arxiv.org/pdf/2512.03835
Abstract The growing demand for robust, scalable wireless networks in the 5G-and-beyond era has led to the deployment of Unmanned Aerial Vehicles (UAVs) as mobile base stations to enhance coverage in dense urban and underserved rural areas. This paper presents a Multi-Agent Deep Reinforcement Learning (MADRL) framework that integrates Proximal Policy Optimization (MAPPO), Multi-Agent Deep Deterministic Policy Gradient (MADDPG), and Multi-Agent Deep Q-Networks (MADQN) to jointly optimize UAV positioning, resource allocation, Quality of Service (QoS), and energy efficiency through 5G network slicing. The framework adopts Centralized Training with Decentralized Execution (CTDE), enabling autonomous real-time decision-making while preserving global coordination. Users are prioritized into Premium (A), Silver (B), and Bronze (C) slices with distinct QoS requirements. Experiments in realistic urban and rural scenarios show that MAPPO achieves the best overall QoS-energy tradeoff, especially in interference-rich environments; MADDPG offers more precise continuous control and can attain slightly higher SINR in open rural settings at the cost of increased energy usage; and MADQN provides a computationally efficient baseline for discretized action spaces. These findings demonstrate that no single MARL algorithm is universally dominant; instead, algorithm suitability depends on environmental topology, user density, and service requirements. The proposed framework highlights the potential of MARL-driven UAV systems to enhance scalability, reliability, and differentiated QoS delivery in next-generation wireless networks.
中文摘要 在5G及后5G时代，对稳健且可扩展的无线网络的需求不断增长，促使无人机（UAV）被部署为移动基站，以增强密集城区和服务不足的农村地区的覆盖。本文提出了一个多智能体深度强化学习（MADRL）框架，整合了近端策略优化（MAPPO）、多智能体深度确定性策略梯度（MADDPG）和多智能体深度Q网络（MADQN），通过5G网络切片联合优化无人机定位、资源分配、服务质量（QoS）和能效。该框架采用集中训练、分散执行（CTDE）范式，在保持全局协调的同时实现自主的实时决策。用户按优先级划分为高级（A）、白银（B）和青铜（C）三类切片，各有不同的QoS要求。在逼真的城市和农村场景中进行的实验表明，MAPPO实现了最佳的整体QoS-能耗权衡，在干扰严重的环境中尤为明显；MADDPG提供更精确的连续控制，在开阔的农村环境中可获得略高的SINR，但代价是能耗增加；MADQN则为离散化动作空间提供了计算高效的基线。这些发现表明，没有哪一种MARL算法在所有情况下都占优；算法的适用性取决于环境拓扑、用户密度和服务需求。该框架凸显了MARL驱动的无人机系统在提升下一代无线网络的可扩展性、可靠性和差异化QoS交付方面的潜力。

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

DVPO：面向LLM后训练的基于分布价值建模的策略优化

Authors: Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Demei Yan, Yuran Wang, Tao Gui
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03847
Pdf link: https://arxiv.org/pdf/2512.03847
Abstract Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real-world.
中文摘要 强化学习（RL）在LLM后训练中表现出色，但实际部署往往伴随含噪声或不完整的监督。在这种情况下，复杂且不可靠的监督信号会破坏训练稳定性并损害泛化能力。现有方法如最坏情况优化（如RFQI、CQL）和基于均值的方法（如PPO、GRPO）虽然可以提升稳定性，但往往忽视泛化，并可能产生过于保守的策略，导致在多样的真实场景中表现不均。为此，我们提出DVPO（分布价值建模与风险感知策略优化），这是一个新的RL框架，将条件风险理论与分布价值建模相结合，以更好地平衡鲁棒性和泛化性。DVPO学习词元（token）级的价值分布以提供细粒度监督，并施加非对称风险正则化来塑造分布的尾部：收缩下尾以抑制含噪声的负向偏差，同时扩展上尾以保留探索多样性。在多轮对话、数学推理和科学问答上的大量实验与分析中，DVPO在含噪声监督下始终优于PPO、GRPO以及基于鲁棒Bellman的PPO，展现了其在现实世界LLM后训练中的潜力。
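
The DVPO abstract speaks of shaping the lower and upper tails of a value distribution but does not give the risk measure or the regularizer. As a rough illustration only, the sketch below assumes the distribution is represented by equally weighted quantile samples and uses a CVaR-style tail mean to measure how far each tail sits from the mean. The function names, the sample values, and the coefficients `lam_lower` and `lam_upper` are hypothetical and are not taken from the paper.

```python
import math


def tail_mean(samples, alpha, upper=False):
    """Average of the lowest (or, with upper=True, highest) alpha-fraction of samples."""
    ordered = sorted(samples, reverse=upper)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def asymmetric_tail_penalty(samples, alpha=0.25, lam_lower=1.0, lam_upper=0.5):
    """Hypothetical regularizer: penalize a wide lower tail, reward a wide upper tail."""
    mean_value = sum(samples) / len(samples)
    lower_gap = mean_value - tail_mean(samples, alpha)              # spread below the mean
    upper_gap = tail_mean(samples, alpha, upper=True) - mean_value  # spread above the mean
    return lam_lower * lower_gap - lam_upper * upper_gap


# Hypothetical quantile estimates of one token's return distribution.
value_quantiles = [-4.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
lower_tail = tail_mean(value_quantiles, alpha=0.25)              # mean of the worst 25%
upper_tail = tail_mean(value_quantiles, alpha=0.25, upper=True)  # mean of the best 25%
penalty = asymmetric_tail_penalty(value_quantiles)
```

Minimizing such a penalty would pull the worst quantiles toward the mean while leaving room for the best ones to spread, which is one way to read "contract the lower tail, expand the upper tail".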

Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models

基于大型语言模型的小样本类增量学习自动攻击发现

Authors: Haidong Kang, Wei Wu, Hanling Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03882
Pdf link: https://arxiv.org/pdf/2512.03882
Abstract Few-shot class incremental learning (FSCIL) is a more realistic and challenging paradigm in continual learning to incrementally learn unseen classes and overcome catastrophic forgetting on base classes with only a few training examples. Previous efforts have primarily centered around studying more effective FSCIL approaches. By contrast, less attention was devoted to thinking the security issues in contributing to FSCIL. This paper aims to provide a holistic study of the impact of attacks on FSCIL. We first derive insights by systematically exploring how human expert-designed attack methods (i.e., PGD, FGSM) affect FSCIL. We find that those methods either fail to attack base classes, or suffer from huge labor costs due to relying on huge expert knowledge. This highlights the need to craft a specialized attack method for FSCIL. Grounded in these insights, in this paper, we propose a simple yet effective ACraft method to automatically steer and discover optimal attack methods targeted at FSCIL by leveraging Large Language Models (LLMs) without human experts. Moreover, to improve the reasoning between LLMs and FSCIL, we introduce a novel Proximal Policy Optimization (PPO) based reinforcement learning to optimize learning, making LLMs generate better attack methods in the next generation by establishing positive feedback. Experiments on mainstream benchmarks show that our ACraft significantly degrades the performance of state-of-the-art FSCIL methods and dramatically beyond human expert-designed attack methods while maintaining the lowest costs of attack.
中文摘要 小样本类增量学习（FSCIL）是持续学习中一种更贴近现实、更具挑战性的范式，要求仅凭少量训练样本逐步学习未见类别，并克服对基类的灾难性遗忘。此前的工作主要集中于研究更有效的FSCIL方法，相比之下，FSCIL中的安全问题较少受到关注。本文旨在全面研究攻击对FSCIL的影响。我们首先系统地考察人类专家设计的攻击方法（即PGD、FGSM）如何影响FSCIL，并从中获得洞见。我们发现这些方法要么无法攻击基类，要么因依赖大量专家知识而带来高昂的人力成本。这凸显了为FSCIL设计专门攻击方法的必要性。基于这些洞见，本文提出一种简单而有效的ACraft方法，利用大型语言模型（LLM）在无需人类专家的情况下自动引导并发现针对FSCIL的最优攻击方法。此外，为了改进LLM与FSCIL之间的推理，我们引入一种基于近端策略优化（PPO）的新型强化学习来优化学习过程，通过建立正反馈使LLM在下一代中生成更好的攻击方法。在主流基准上的实验表明，ACraft显著降低了最先进FSCIL方法的性能，效果大幅超过人类专家设计的攻击方法，同时保持最低的攻击成本。

Digital Twin-based Control Co-Design of Full Vehicle Active Suspensions via Deep Reinforcement Learning

基于数字孪生的控制协同设计，通过深度强化学习实现全车主动悬挂

Authors: Ying-Kuan Tsai, Yi-Ping Chen, Vispi Karkaria, Wei Chen
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03891
Pdf link: https://arxiv.org/pdf/2512.03891
Abstract Active suspension systems are critical for enhancing vehicle comfort, safety, and stability, yet their performance is often limited by fixed hardware designs and control strategies that cannot adapt to uncertain and dynamic operating conditions. Recent advances in digital twins (DTs) and deep reinforcement learning (DRL) offer new opportunities for real-time, data-driven optimization across a vehicle's lifecycle. However, integrating these technologies into a unified framework remains an open challenge. This work presents a DT-based control co-design (CCD) framework for full-vehicle active suspensions using multi-generation design concepts. By integrating automatic differentiation into DRL, we jointly optimize physical suspension components and control policies under varying driver behaviors and environmental uncertainties. DRL also addresses the challenge of partial observability, where only limited states can be sensed and fed back to the controller, by learning optimal control actions directly from available sensor information. The framework incorporates model updating with quantile learning to capture data uncertainty, enabling real-time decision-making and adaptive learning from digital-physical interactions. The approach demonstrates personalized optimization of suspension systems under two distinct driving settings (mild and aggressive). Results show that the optimized systems achieve smoother trajectories and reduce control efforts by approximately 43% and 52% for mild and aggressive, respectively, while maintaining ride comfort and stability. Contributions include: developing a DT-enabled CCD framework integrating DRL and uncertainty-aware model updating for full-vehicle active suspensions, introducing a multi-generation design strategy for self-improving systems, and demonstrating personalized optimization of active suspension systems for distinct driver types.
中文摘要 主动悬挂系统对提升车辆舒适性、安全性和稳定性至关重要，但其性能常受限于固定的硬件设计和控制策略，无法适应不确定且动态变化的运行条件。数字孪生（DT）和深度强化学习（DRL）的最新进展为贯穿车辆全生命周期的实时、数据驱动优化提供了新机遇。然而，将这些技术整合到统一框架中仍是一个开放的挑战。本研究采用多代设计理念，提出了面向全车主动悬挂的基于DT的控制协同设计（CCD）框架。通过将自动微分融入DRL，我们在不同的驾驶员行为和环境不确定性下联合优化悬挂的物理部件和控制策略。DRL还直接从可用的传感器信息中学习最优控制动作，从而应对部分可观测性的挑战，即只有有限的状态可以被感知并反馈给控制器。该框架结合了基于分位数学习的模型更新以刻画数据不确定性，从而支持实时决策以及从数字-物理交互中进行自适应学习。该方法展示了在两种不同驾驶风格（温和与激进）下悬挂系统的个性化优化。结果表明，优化后的系统轨迹更平滑，在温和与激进两种设置下控制量分别减少约43%和52%，同时保持乘坐舒适性和稳定性。本文的贡献包括：构建了一个集成DRL与不确定性感知模型更新的DT驱动CCD框架，用于全车主动悬挂；提出了面向自我改进系统的多代设计策略；展示了针对不同驾驶员类型的主动悬挂系统个性化优化。

Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware

采用英特尔Loihi 2神经形态硬件实现自主强化学习机器人控制

Authors: Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha, Carl Glen Henshaw
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.03911
Pdf link: https://arxiv.org/pdf/2512.03911
Abstract We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
中文摘要 我们提出了一条端到端流程，将通过强化学习（RL）训练的人工神经网络（ANN）转换为脉冲Sigma-Delta神经网络（SDNN），从而部署到神经形态硬件上。我们证明，完全在仿真中训练的ANN策略可以转换为与英特尔Loihi 2架构兼容的SDNN，实现低延迟、高能效的推理。作为测试用例，我们使用一个用于控制Astrobee自由飞行机器人的RL策略，它与此前经过在轨硬件验证的控制器类似。该策略使用整流线性单元（ReLU）训练，转换为SDNN后部署在英特尔Loihi 2上，随后在NVIDIA的Omniverse Isaac Lab仿真环境中评估其对Astrobee运动的闭环控制。我们比较了GPU与Loihi 2的执行性能。结果凸显了利用神经形态平台进行机器人控制的可行性，并为未来空间和地面机器人应用中高能效、实时的神经形态计算开辟了路径。

Hierarchical Vision Language Action Model Using Success and Failure Demonstrations

采用成功与失败演示的层级视觉语言行动模型

Authors: Jeongeun Park, Jihwan Yoon, Byungwoo Jeon, Juhan Park, Jinwoo Shin, Namhoon Cho, Kyungjae Lee, Sangdoo Yun, Sungjoon Choi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03913
Pdf link: https://arxiv.org/pdf/2512.03913
Abstract Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent's core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
中文摘要 以往的视觉-语言-行动（VLA）模型通常基于遥操作的成功演示进行训练，而丢弃了数据收集过程中自然产生的大量失败尝试。然而，这些失败记录了策略在何处以及如何变得脆弱，这些信息可以用来提升鲁棒性。我们利用质量参差的混合数据集，在规划阶段学习具备失败意识的推理，以解决这一问题。我们提出VINE，一种分层的视觉-语言-行动模型，在分层强化学习的形式化框架下将高层推理（系统2）与低层控制（系统1）分离，使失败成为结构化的学习信号，而非含噪声的监督。系统2在二维场景图抽象上执行可行性引导的树搜索：它提出子目标转移，利用成功和失败数据预测成功概率，并在执行前剪除脆弱的分支，相当于把计划评估转化为可行性评分。选定的子目标序列随后传递给系统1，由其执行低层动作，而不改变智能体的核心技能。VINE完全基于离线遥操作数据训练，将负面经验直接纳入决策回路。在多项具有挑战性的操作任务中，该方法持续提升了成功率和鲁棒性，表明失败数据是将VLA的广泛能力转化为稳健执行的关键资源。

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

TempR1：通过时间感知多任务强化学习提升MLLM的时间理解能力

Authors: Tao Wu, Li Yang, Gen Zhan, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03963
Pdf link: https://arxiv.org/pdf/2512.03963
Abstract Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
中文摘要 增强多模态大型语言模型（MLLM）的时间理解能力对推进长视频分析至关重要，可支持时间定位、动作检测和时间敏感问答等任务。虽然近期已有工作探索用强化学习（RL）提升时间推理能力，但现有方法通常局限于有限的任务类型和数据，限制了其在多样时间理解场景中的泛化。为应对这一挑战，我们提出TempR1，一个时间感知的多任务强化学习框架，系统性地增强MLLM的时间理解能力。我们构建了一个多任务语料库，使模型接触多样的时间结构和语义，并在组相对策略优化（GRPO）算法的基础上实现稳定而有效的跨任务优化。具体而言，我们根据预测区间与真实实例之间的对应关系将时间任务分为三类，并为每一类设计专门的定位奖励，使TempR1能够捕捉细粒度的时间依赖关系并适应不同的时间模式。大量实验表明，TempR1在多个基准上达到了最先进的性能。此外，对互补任务的联合优化产生了显著的协同效应，同时提升了泛化能力和单任务性能，为MLLM中的时间推理建立了可扩展且有原则的范式。
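
The TempR1 abstract mentions localization rewards on predicted intervals and GRPO-based optimization without giving formulas. The sketch below is only an illustration under stated assumptions: it uses temporal IoU as a stand-in localization reward for the simplest case of one predicted interval against one ground-truth interval, and it standardizes rewards within a sampled group, which is the usual GRPO-style advantage. The function names, the example intervals, and the choice of population standard deviation are assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled predictions for one ground-truth segment.
ground_truth = (10.0, 20.0)
predictions = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0), (12.0, 18.0)]
rewards = [temporal_iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Predictions that overlap the ground truth more than the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form the baseline.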

Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

引导流策略：离线强化学习中的高价值行动学习

Authors: Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.03973
Pdf link: https://arxiv.org/pdf/2512.03973
Abstract Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.
中文摘要 离线强化学习通常依赖行为正则化，迫使策略保持接近数据集分布。然而，这类方法的正则化项无法区分高价值动作和低价值动作。我们提出引导流策略（GFP），将多步流匹配（flow-matching）策略与蒸馏得到的单步actor耦合起来。actor通过加权行为克隆引导流策略，使其专注于克隆数据集中的高价值动作，而不是不加区分地模仿所有状态-动作对。反过来，流策略约束actor在最大化critic的同时与数据集中的最佳转移保持一致。这种相互引导使GFP在OGBench、Minari和D4RL基准的144个基于状态和像素的任务上达到最先进的性能，并在次优数据集和高难度任务上取得显著提升。
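
The GFP abstract describes weighted behavior cloning that favors high-value dataset actions, but not how the weights are computed. The sketch below shows one common weighting scheme, exponentiated advantages with clipping as used in advantage-weighted regression, purely as an assumed stand-in. The temperature, the clip value, and the example numbers are hypothetical, and the paper's own weighting may differ.

```python
import math


def bc_weights(advantages, temperature=1.0, max_weight=5.0):
    """Exponentiated-advantage weights, clipped so a few samples cannot dominate."""
    return [min(math.exp(a / temperature), max_weight) for a in advantages]


def weighted_bc_loss(per_sample_losses, weights):
    """Weighted average of per-sample imitation losses."""
    total = sum(w * l for w, l in zip(weights, per_sample_losses))
    return total / sum(weights)


# Hypothetical critic advantages and imitation losses for three dataset actions.
advantages = [-1.0, 0.0, 2.0]
imitation_losses = [4.0, 1.0, 0.5]
weights = bc_weights(advantages)
loss = weighted_bc_loss(imitation_losses, weights)
uniform_loss = sum(imitation_losses) / len(imitation_losses)
```

With uniform weights every state-action pair counts equally. With these weights the low-advantage action contributes little, so the cloning objective concentrates on the actions the critic rates highly.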

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

SpaceTools：通过双重交互式强化学习实现工具增强的空间推理

Authors: Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.04069
Pdf link: https://arxiv.org/pdf/2512.04069
Abstract Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.
中文摘要 视觉语言模型（VLM）展现出很强的定性视觉理解能力，但在具身应用所需的度量级精确空间推理上仍有困难。智能体范式的愿景是，VLM可以使用深度估计器、分割模型和姿态估计器等多种工具来增强这些能力。然而，如何实现这一愿景仍是一个开放的挑战：既不能仅依赖手工设计的提示策略，也不能强制采用固定的预定义工具流水线，因为后者会限制VLM发现最优工具使用模式的能力。强化学习有望弥补这一差距，但由于多工具推理的搜索空间很大，目前仅限于使用单一视觉工具进行推理。我们提出双重交互式强化学习（DIRL），这是一个两阶段训练框架，让VLM通过交互式探索和反馈学会协调多种工具。在教学阶段，我们把经交互式RL训练的单工具专家的演示与使用全部工具的前沿模型的轨迹结合起来。在探索阶段，模型通过继续进行RL进一步完善多工具协调。我们的模型SpaceTools具备工具增强的空间推理能力，在空间理解基准（RoboSpatial-Home、BLINK、BOP-ASK）上取得了最先进的性能，并以7自由度机器人作为工具展示了可靠的真实世界操作能力。与原始SFT基线（RoboSpatial上+12%）和RL基线（RoboSpatial上+16%）相比，DIRL带来了显著提升。

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

SkillFactory：通过自蒸馏学习认知行为

Authors: Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.04072
Pdf link: https://arxiv.org/pdf/2512.04072
Abstract Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
中文摘要 利用长思维链的推理模型会运用多种认知技能，例如验证答案、回溯、换用其他方法重试等。以往研究表明，当基础语言模型已经表现出这些技能时，用强化学习（RL）进一步训练该模型可以让它学会利用这些技能。那么，如何让模型利用基础模型未曾表现出的技能？我们的工作SkillFactory是一种微调方法，使模型在RL之前的监督微调（SFT）阶段粗略地学会这些技能。该方法不依赖从更强模型进行蒸馏，而是使用模型自身的样本，将其重新排列成符合这些技能格式的训练数据。这些“银标”SFT轨迹可能并不完美，但仍能有效地引导模型在RL阶段习得技能。我们的评估表明：（1）从SkillFactory SFT初始化出发，有助于模型在RL之后泛化到任务的更难变体，尽管其在RL之前的性能较低；（2）模型确实使用了认知技能；（3）与经过RL的基础模型相比，经过RL的SkillFactory模型在域外任务上更不容易出现性能退化。我们的工作表明，在RL之前学到的归纳偏置有助于模型学会稳健地运用认知技能。

PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

PosterCopilot：迈向专业平面设计的布局推理与可控编辑

Authors: Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.04082
Pdf link: https://arxiv.org/pdf/2512.04082
Abstract Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
中文摘要 平面设计是现代视觉传播的基石，也是推广文化和商业活动的重要媒介。近期工作尝试利用大型多模态模型（LMM）将这一过程自动化，但现有方法常常生成几何上不准确的布局，并且缺乏专业工作流程所需的迭代式、针对图层的编辑能力。为了解决这些局限，我们提出PosterCopilot，一个面向专业平面设计、推进布局推理和可控编辑的框架。具体而言，我们引入一种渐进式三阶段训练策略，使LMM具备布局设计所需的几何理解和美学推理能力，三个阶段分别是扰动监督微调、面向视觉-现实对齐的强化学习以及基于美学反馈的强化学习。此外，我们开发了一套完整的工作流程，将训练好的基于LMM的设计模型与生成模型相结合，支持图层可控的迭代编辑，在保持全局视觉一致性的同时对元素进行精确细化。大量实验表明，PosterCopilot能生成几何准确、美学更优的布局，为专业的迭代式设计提供了前所未有的可控性。

Keyword: diffusion policy

There is no result

Keyword: reinforcement learning

Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL

基于约束分层深度强化学习的安全可持续电动公交充电调度

Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration

通过扩散贝叶斯探索动态校正错误状态估计

Hierarchical Process Reward Models are Symbolic Vision Learners

分层过程奖励模型是符号视觉学习器

Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments

多智能体强化学习与虚拟环境中机器人足球的实时决策

GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding

GRAND：多智能体路径寻找中的网络调度的指导、再平衡与分配

A Multi-Agent, Policy-Gradient approach to Network Routing

多智能体、策略梯度网络路由方法

SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning

SPARK：面向无参考强化学习的逐步过程感知奖励

SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

SpatialReasoner：面向大规模三维场景理解的主动感知

Better World Models Can Lead to Better Post-Training Performance

更好的世界模型可以带来更好的训练后表现

World Models for Autonomous Navigation of Terrestrial Robots from LIDAR Observations

来自LIDAR观测的地面机器人自主导航世界模型

Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

多模态强化学习与智能体验证器（Agentic Verifier）用于人工智能代理

PretrainZero: Reinforcement Active Pretraining

PretrainZero：强化主动预训练

Variable-Impedance Muscle Coordination under Slow-Rate Control Frequencies and Limited Observation Conditions Evaluated through Legged Locomotion

通过腿式行走评估低速控制频率和有限观察条件下的可变阻抗肌肉协调

Adaptive sampling using variational autoencoder and reinforcement learning

利用变分自编码器和强化学习的自适应采样

Multi-Agent Reinforcement Learning with Communication-Constrained Priors

多智能体强化学习，带有通信受限先验

A Learning-based Control Methodology for Transitioning VTOL UAVs

一种基于学习的垂直起降无人机转换控制方法

RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

RoboScape-R：通过强化学习实现通用机器人训练的统一奖励-观察世界模型

Accelerating Detailed Routing Convergence through Offline Reinforcement Learning

通过离线强化学习加速详细布线收敛

A Descriptive Model for Modelling Attacker Decision-Making in Cyber-Deception

网络欺骗中攻击者决策建模的描述模型

ContactRL: Safe Reinforcement Learning based Motion Planning for Contact based Human Robot Collaboration

ContactRL：基于安全强化学习的运动规划，用于基于接触的人机协作

Tutorial on Large Language Model-Enhanced Reinforcement Learning for Wireless Networks

无线网络大型语言模型增强强化学习教程

Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing

自主规划空间组装强化学习自由飞行器（APIARY）国际空间站Astrobee测试

Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control

跨越模拟与地面测试到太空部署自主自由飞行器控制之间的Sim2Real鸿沟

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

以编程视觉思考：迈向以图像思考的统一视角

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

扩散大型语言模型的原则性强化学习从序列层面出现

Sample-Efficient Model-Free Policy Gradient Methods for Stochastic LQR via Robust Linear Regression

通过鲁棒线性回归实现随机LQR的样本高效无模型策略梯度方法

Safety Reinforced Model Predictive Control (SRMPC): Improving MPC with Reinforcement Learning for Motion Planning in Autonomous Driving

安全强化模型预测控制（SRMPC）：通过强化学习改进MPC以实现自动驾驶运动规划

Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning

Omni-AutoThink：通过强化学习实现自适应多模态推理

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision：通过自适应视觉获取实现高效的视觉语言模型

MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving

MPCFormer：一种融合物理信息的数据驱动方法，用于可解释、具备社会意识的自动驾驶

Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA

动态算法配置的深度强化学习：关于用 （1+（$λ$，$λ$））-GA 优化 OneMax 的案例研究

Multi-Agent Deep Reinforcement Learning for UAV-Assisted 5G Network Slicing: A Comparative Study of MAPPO, MADDPG, and MADQN

无人机辅助5G网络切片的多智能体深度强化学习：MAPPO、MADDPG和MADQN的比较研究

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

DVPO：面向LLM后训练的基于分布价值建模的策略优化

Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models

基于大型语言模型的小样本类增量学习自动攻击发现

Digital Twin-based Control Co-Design of Full Vehicle Active Suspensions via Deep Reinforcement Learning

基于数字孪生的控制协同设计，通过深度强化学习实现全车主动悬挂

Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware

采用英特尔Loihi 2神经形态硬件实现自主强化学习机器人控制

Hierarchical Vision Language Action Model Using Success and Failure Demonstrations

采用成功与失败演示的层级视觉语言行动模型

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

TempR1：通过时间感知多任务强化学习提升MLLM的时间理解能力

Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

引导流策略：离线强化学习中的高价值行动学习

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

SpaceTools：通过双重交互式强化学习实现工具增强的空间推理