Arxiv Papers of Today

Generated: 2025-12-08 16:33:42 (UTC+8); arXiv release time: 2025-12-08 20:00 EST (2025-12-09 09:00 UTC+8)

17 relevant papers today

Keyword: reinforcement learning

Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

Authors: Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.05172
Pdf link: https://arxiv.org/pdf/2512.05172
Abstract The growing exploration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on guiding the control policy and are limited by the representations of their backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL that simultaneously extracts semantic and motion representations from RGB flows through a dual-path backbone. Semore uses a VLM with common-sense knowledge to retrieve key information from observations and uses pre-trained CLIP for text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach that guides the extraction of semantics and motion simultaneously while allowing them to interact spontaneously. Extensive experiments demonstrate that, under VLM guidance at the feature level, our method exhibits efficient and adaptive behavior compared with state-of-the-art methods. All code is released.

Hierarchical Reinforcement Learning for the Dynamic VNE with Alternatives Problem

Authors: Ali Al Housseini, Cristina Rottondi, Omran Ayoub
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.05207
Pdf link: https://arxiv.org/pdf/2512.05207
Abstract Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduced to capture malleable VNRs, where each request can be instantiated using one of several functionally equivalent topologies that trade resources differently. While this flexibility enlarges the feasible space, it also introduces an additional decision layer, making dynamic embedding more challenging. This paper proposes HRL-VNEAP, a hierarchical reinforcement learning approach for VNEAP under dynamic arrivals. A high-level policy selects the most suitable alternative topology (or rejects the request), and a low-level policy embeds the chosen topology onto the substrate network. Experiments on realistic substrate topologies under multiple traffic loads show that naive exploitation strategies provide only modest gains, whereas HRL-VNEAP consistently achieves the best performance across all metrics. Compared to the strongest tested baselines, HRL-VNEAP improves acceptance ratio by up to 20.7%, total revenue by up to 36.2%, and revenue-over-cost by up to 22.1%. Finally, we benchmark against an MILP formulation on tractable instances to quantify the remaining gap to optimality and motivate future work on learning- and optimization-based VNEAP solutions.

Bridging Interpretability and Optimization: Provably Attribution-Weighted Actor-Critic in Reproducing Kernel Hilbert Spaces

Authors: Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.05291
Pdf link: https://arxiv.org/pdf/2512.05291
Abstract Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose RKHS-SHAP-based Advanced Actor-Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm comprising an Actor, a Value Critic, and an Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold expectations and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. Theoretically, we derive a global, non-asymptotic convergence bound under state perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three standard continuous-control environments show that our algorithm achieves efficiency, stability, and interpretability.
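
Editor's sketch: the abstract describes a Mahalanobis-weighted kernel whose weights come from state attributions. The snippet below is only a scalar illustration of that idea under stated assumptions: a Gaussian kernel with a diagonal Mahalanobis matrix, and weights obtained by normalizing absolute attribution scores. The names `attribution_weights` and `mahalanobis_gaussian_kernel`, the `floor` and `bandwidth` parameters, and the normalization rule are hypothetical and are not taken from the paper.

```python
import numpy as np

def attribution_weights(attributions, floor=1e-3):
    """Turn per-dimension attribution scores into positive weights that sum to 1.

    Assumption: weights are proportional to absolute attribution plus a small floor.
    """
    a = np.abs(np.asarray(attributions, dtype=float)) + floor
    return a / a.sum()

def mahalanobis_gaussian_kernel(x, y, weights, bandwidth=1.0):
    """Gaussian kernel with a diagonal Mahalanobis metric diag(weights)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_dist = np.sum(weights * d * d)  # (x - y)^T diag(weights) (x - y)
    return float(np.exp(-0.5 * sq_dist / bandwidth**2))
```

With such weights, a displacement along a high-attribution state dimension lowers the kernel similarity more than the same displacement along a low-attribution dimension.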

Enhancing Deep Deterministic Policy Gradients on Continuous Control Tasks with Decoupled Prioritized Experience Replay

Authors: Mehmet Efe Lorasdagi, Dogan Can Cicek, Furkan Burak Mutlu, Suleyman Serdar Kozat
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.05320
Pdf link: https://arxiv.org/pdf/2512.05320
Abstract Background: Deep Deterministic Policy Gradient-based reinforcement learning algorithms utilize Actor-Critic architectures, where both networks are typically trained using identical batches of replayed transitions. However, the learning objectives and update dynamics of the Actor and Critic differ, raising concerns about whether uniform transition usage is optimal. Objectives: We aim to improve the performance of deep deterministic policy gradient algorithms by decoupling the transition batches used to train the Actor and the Critic. Our goal is to design an experience replay mechanism that provides appropriate learning signals to each component by using separate, tailored batches. Methods: We introduce Decoupled Prioritized Experience Replay (DPER), a novel approach that allows independent sampling of transition batches for the Actor and the Critic. DPER can be integrated into any off-policy deep reinforcement learning algorithm that operates in continuous control domains. We combine DPER with the state-of-the-art Twin Delayed DDPG algorithm and evaluate its performance across standard continuous control benchmarks. Results: DPER outperforms conventional experience replay strategies such as vanilla experience replay and prioritized experience replay in multiple MuJoCo tasks from the OpenAI Gym suite. Conclusions: Our findings show that decoupling experience replay for Actor and Critic networks can enhance training dynamics and final policy quality. DPER offers a generalizable mechanism that enhances performance for a wide class of actor-critic off-policy reinforcement learning algorithms.
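
Editor's sketch: the abstract says DPER samples the Actor and Critic batches independently but does not state how each batch is drawn. The toy buffer below illustrates decoupling only. It assumes priority-proportional sampling for the Critic and uniform sampling for the Actor; both choices, and the class name `DecoupledReplayBuffer`, are hypothetical.

```python
import numpy as np

class DecoupledReplayBuffer:
    """Toy replay buffer that draws separate batches for the critic and the actor."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.transitions = []   # stored transition tuples
        self.priorities = []    # one positive priority per stored transition
        self.rng = np.random.default_rng(seed)

    def add(self, transition, priority=1.0):
        # Drop the oldest transition once the buffer is full.
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(float(priority))

    def sample_critic(self, batch_size):
        # Assumption: the critic batch is sampled proportionally to priority.
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = self.rng.choice(len(self.transitions), size=batch_size, p=p)
        return idx, [self.transitions[i] for i in idx]

    def sample_actor(self, batch_size):
        # Assumption: the actor batch is sampled uniformly, independently of the critic batch.
        idx = self.rng.integers(0, len(self.transitions), size=batch_size)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Standard prioritized-replay rule: priority = |TD error| + eps.
        for i, err in zip(idx, td_errors):
            self.priorities[int(i)] = abs(float(err)) + eps
```

In a TD3-style training loop, the critic update would call `sample_critic` and the delayed actor update would call `sample_actor`, so the two networks no longer share a batch.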

ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

Authors: Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian, Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.05422
Pdf link: https://arxiv.org/pdf/2512.05422
Abstract Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction with flexible implementation because of the vast difference between the two representations. Considering the abundant, hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a parallel (Para) way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a unified (Uni) multimodal model. Concretely, visual features from all VLM layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that uses RL to facilitate improvements on multiple rewards in alignment with the hierarchical properties of these layers. Extensive experiments show that ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at this https URL.

Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning

Authors: Pengcheng Dai, Dongming Wang, Wenwu Yu, Wei Ren
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.05447
Pdf link: https://arxiv.org/pdf/2512.05447
Abstract This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state-action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its $\kappa_{p}$-hop neighbors, with $\kappa_{p}\geq 1$ denoting the coupled radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors' averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, where each agent relies only on the state-action pairs of its $\kappa_{p}$-hop neighbors and the rewards of its $(\kappa_{p}+1)$-hop neighbors. Specifically, in the DSCP algorithm, we employ a geometric 2-horizon sampling method that does not require storing a full $Q$-table to obtain an unbiased estimate of the coupled policy gradient. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents' parameters to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.
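
Editor's sketch: the abstract relies on a push-sum protocol to coordinate parameter estimates over the network. The snippet below shows the generic push-sum averaging scheme on a directed graph, not the DSCP update itself. Each node splits its value and its weight equally among itself and its out-neighbors, and the ratio of the two converges to the network average when the graph is strongly connected. The function name `push_sum` and the `out_neighbors` mapping are hypothetical.

```python
import numpy as np

def push_sum(values, out_neighbors, num_iters=100):
    """Estimate the average of `values` at every node of a directed graph.

    out_neighbors[i] lists the nodes that node i sends to (excluding itself).
    """
    n = len(values)
    x = np.asarray(values, dtype=float).copy()  # running value mass
    w = np.ones(n)                              # running weight mass
    for _ in range(num_iters):
        new_x = np.zeros(n)
        new_w = np.zeros(n)
        for i in range(n):
            targets = [i] + list(out_neighbors[i])  # each node keeps one share for itself
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return x / w  # per-node estimate of the average

# Usage: a small strongly connected directed graph with unequal out-degrees.
estimates = push_sum([1.0, 2.0, 6.0], {0: [1, 2], 1: [2], 2: [0]})
```

The weight variable is what lets the scheme work on directed graphs where the mixing matrix is column-stochastic but not row-stochastic.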

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.05591
Pdf link: https://arxiv.org/pdf/2512.05591
Abstract Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
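
Editor's sketch: the abstract defines the entropy ratio between the current and previous policies and constrains it from both sides, but it does not give the exact formula. The snippet below assumes a per-position ratio of Shannon entropies and a keep-mask for positions whose ratio stays inside `[1 - low, 1 + high]`. The function names, the bounds, and the masking behavior are hypothetical.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a (positions, vocabulary) probability array."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def entropy_ratio_mask(new_probs, old_probs, low=0.05, high=0.05, eps=1e-12):
    """Return the per-position entropy ratio and a mask of positions inside the bounds.

    Assumption: positions outside [1 - low, 1 + high] are excluded from the policy loss.
    """
    ratio = entropy(new_probs) / (entropy(old_probs) + eps)
    keep = (ratio >= 1.0 - low) & (ratio <= 1.0 + high)
    return ratio, keep
```

Because the entropy is computed over the full action distribution, the ratio also reacts to probability shifts on actions that were never sampled, which is the gap the abstract attributes to PPO-Clip.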

MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

Authors: Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.05671
Pdf link: https://arxiv.org/pdf/2512.05671
Abstract The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.

LA-RL: Language Action-guided Reinforcement Learning with Safety Guarantees for Autonomous Highway Driving

Authors: Yiming Shu, Jiahui Xu, Jiwei Tang, Ruiyang Gao, Chen Sun
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.05686
Pdf link: https://arxiv.org/pdf/2512.05686
Abstract Autonomous highway driving demands a critical balance between proactive, efficiency-seeking behavior and robust safety guarantees. This paper proposes Language Action-guided Reinforcement Learning (LA-RL) with Safety Guarantees, a novel framework that integrates the semantic reasoning of large language models (LLMs) into the actor-critic architecture with an improved safety layer. Within this framework, task-specific reward shaping harmonizes the dual objectives of maximizing driving efficiency and ensuring safety, guiding decision-making based on both environmental insights and clearly defined goals. To enhance safety, LA-RL incorporates a safety-critical planner that combines model predictive control (MPC) with discrete control barrier functions (DCBFs). This layer formally constrains the LLM-informed policy to a safe action set, employs a slack mechanism that enhances solution feasibility, prevents overly conservative behavior, and allows for greater policy exploration without compromising safety. Extensive experiments demonstrate that it significantly outperforms several current state-of-the-art methods, offering a more adaptive, reliable, and robust solution for autonomous highway driving. Compared to existing SOTA, it achieves an approximately 20% higher success rate than the knowledge graph (KG) based baseline and about 30% higher than the retrieval augmented generation (RAG) based baseline. In low-density environments, LA-RL achieves a 100% success rate. These results confirm its enhanced exploration of the state-action space and its ability to autonomously adopt more efficient, proactive strategies in complex, mixed-traffic highway environments.
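
Editor's sketch: the safety layer in the abstract constrains the policy to a safe action set using discrete control barrier functions with a slack mechanism. The snippet below filters candidate accelerations with the common discrete-time condition h(x_next) >= (1 - gamma) * h(x) - slack in a one-lane car-following toy model. The barrier function, the dynamics, the additive form of the slack, and all names and default values are assumptions for illustration, not the paper's planner.

```python
def barrier(gap, v_ego, d_min=5.0, headway=1.0):
    """h(x) >= 0 means the gap exceeds a minimum distance plus a time-headway margin."""
    return gap - d_min - headway * v_ego

def step(gap, v_ego, v_lead, accel, dt=0.1):
    """One Euler step of a toy car-following model (lead vehicle at constant speed)."""
    next_gap = gap + (v_lead - v_ego) * dt
    next_v = v_ego + accel * dt
    return next_gap, next_v

def dcbf_safe_actions(gap, v_ego, v_lead, candidates, gamma=0.2, slack=0.0):
    """Keep accelerations satisfying h(x_next) >= (1 - gamma) * h(x) - slack."""
    h_now = barrier(gap, v_ego)
    safe = []
    for accel in candidates:
        next_gap, next_v = step(gap, v_ego, v_lead, accel)
        if barrier(next_gap, next_v) >= (1.0 - gamma) * h_now - slack:
            safe.append(accel)
    return safe
```

A larger `slack` enlarges the admissible set, which mirrors the abstract's point that the slack mechanism improves feasibility and avoids overly conservative behavior. In an MPC setting the slack would typically be a penalized decision variable rather than a fixed constant.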

Bayesian Active Inference for Intelligent UAV Anti-Jamming and Adaptive Trajectory Planning

Authors: Ali Krayani, Seyedeh Fatemeh Sadati, Lucio Marcenaro, Carlo Regazzoni
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.05711
Pdf link: https://arxiv.org/pdf/2512.05711
Abstract This paper proposes a hierarchical trajectory planning framework for UAVs operating under adversarial jamming conditions. Leveraging Bayesian Active Inference, the approach combines expert-generated demonstrations with probabilistic generative modeling to encode high-level symbolic planning, low-level motion policies, and wireless signal feedback. During deployment, the UAV performs online inference to anticipate interference, localize jammers, and adapt its trajectory accordingly, without prior knowledge of jammer locations. Simulation results demonstrate that the proposed method achieves near-expert performance, significantly reducing communication interference and mission cost compared to model-free reinforcement learning baselines, while maintaining robust generalization in dynamic environments.

A Fast Anti-Jamming Cognitive Radar Deployment Algorithm Based on Reinforcement Learning

Authors: Wencheng Cai, Xuchao Gao, Congying Han, Mingqiang Li, Tiande Guo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.05753
Pdf link: https://arxiv.org/pdf/2512.05753
Abstract The fast deployment of cognitive radar to counter jamming remains a critical challenge in modern warfare, where more efficient deployment leads to quicker detection of targets. Existing methods are primarily based on evolutionary algorithms, which are time-consuming and prone to falling into local optima. We tackle these drawbacks via the efficient inference of neural networks and propose a brand new framework: Fast Anti-Jamming Radar Deployment Algorithm (FARDA). We first model the radar deployment problem as an end-to-end task and design deep reinforcement learning algorithms to solve it, where we develop integrated neural modules to perceive heatmap information and a brand new reward format. Empirical results demonstrate that our method achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster. Further ablation experiments confirm the necessity of each component of FARDA.

Real-time Remote Tracking and Autonomous Planning for Whale Rendezvous using Robots

Authors: Sushmita Bhattacharya, Ninad Jadhav, Hammad Izhar, Karen Li, Kevin George, Robert Wood, Stephanie Gil
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.05808
Pdf link: https://arxiv.org/pdf/2512.05808
Abstract We introduce a system for real-time sperm whale rendezvous at sea using an autonomous uncrewed aerial vehicle. Our system employs model-based reinforcement learning that combines in situ sensor data with an empirical whale dive model to guide navigation decisions. Key challenges include (i) real-time acoustic tracking in the presence of multiple whales, (ii) distributed communication and decision-making for robot deployments, and (iii) on-board signal processing and long-range detection from fish-trackers. We evaluate our system by conducting rendezvous with sperm whales at sea in Dominica, performing hardware experiments on land, and running simulations using whale trajectories interpolated from marine biologists' surface observations.

Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.05812
Pdf link: https://arxiv.org/pdf/2512.05812
Abstract Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
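
Editor's sketch: the abstract models every traffic participant and map element in its own local frame and encodes interactions through relative positional encodings between frames. The snippet below computes one plausible raw input to such an encoding, the pose of frame j expressed in frame i, assuming planar poses of the form (x, y, heading). The function name `relative_pose` and the pose layout are hypothetical.

```python
import numpy as np

def relative_pose(pose_i, pose_j):
    """Express pose_j = (x, y, heading) in the local frame of pose_i."""
    xi, yi, hi = pose_i
    xj, yj, hj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = np.cos(hi), np.sin(hi)
    rel_x = c * dx + s * dy     # offset along frame i's forward axis
    rel_y = -s * dx + c * dy    # offset along frame i's left axis
    rel_h = (hj - hi + np.pi) % (2.0 * np.pi) - np.pi  # heading difference wrapped to [-pi, pi)
    return np.array([rel_x, rel_y, rel_h])
```

Because only relative quantities enter the encoder, shifting or rotating the whole scene leaves them unchanged, which is the viewpoint invariance the abstract refers to.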

Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

Authors: Truong Thanh Hung Nguyen, Truong Thinh Nguyen, Hung Cao
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.05946
Pdf link: https://arxiv.org/pdf/2512.05946
Abstract Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: this https URL.

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Authors: Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.05962
Pdf link: https://arxiv.org/pdf/2512.05962
Abstract Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in such a way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
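
Editor's sketch: the abstract builds a target distribution by filtering out incorrect answers while keeping the relative probabilities of correct ones, then fits it with an $\alpha$-divergence. The snippet below illustrates both pieces on a small discrete answer set. It uses the Amari parameterization D_alpha(p||q) = (1 - sum_y p(y)^alpha q(y)^(1 - alpha)) / (alpha (1 - alpha)), which tends to KL(p||q) (mass-covering) as alpha approaches 1 and to KL(q||p) (mode-seeking) as alpha approaches 0. Whether the paper uses this exact parameterization is an assumption, and the function names are hypothetical.

```python
import numpy as np

def filtered_target(base_probs, is_correct):
    """Zero out incorrect answers and renormalize, keeping ratios among correct ones."""
    masked = np.asarray(base_probs, dtype=float) * np.asarray(is_correct, dtype=float)
    return masked / masked.sum()

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between discrete distributions, for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    overlap = np.sum(p**alpha * q**(1.0 - alpha))
    return (1.0 - overlap) / (alpha * (1.0 - alpha))

# Usage: three candidate answers, the second one is incorrect.
target = filtered_target([0.5, 0.3, 0.2], [1, 0, 1])
loss = alpha_divergence(target, [0.4, 0.2, 0.4], alpha=0.5)
```

Intermediate values of `alpha` stay finite even though the filtered target assigns zero probability to incorrect answers, a case in which the reverse KL itself would be infinite for any model with full support.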

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Authors: Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.05965
Pdf link: https://arxiv.org/pdf/2512.05965
Abstract Instruction-based image editing has emerged as a prominent research area. Benefiting from image generation foundation models, it has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets editors 'think' while they edit, simulating the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until the result is satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework; it jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.
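
Editor's sketch: the Think-while-Edit cycle in the abstract (critique the result, refine the instruction, repeat until satisfactory) can be written as a short control loop. The snippet below is a minimal sketch with hypothetical callables: `edit_fn` stands in for any image editor and `think_fn` stands in for the reasoning model, returning a critique score, a reasoning trace, and a refined instruction. The stopping threshold and turn limit are assumptions.

```python
def think_while_edit(image, instruction, edit_fn, think_fn, threshold=0.8, max_turns=4):
    """Iteratively edit, critique, and refine the instruction until the score is high enough."""
    current = instruction
    history = []      # (instruction used, critique score, reasoning) for each turn
    edited = None
    for _ in range(max_turns):
        edited = edit_fn(image, current)                        # Edit with the current instruction
        score, reasoning, refined = think_fn(image, edited, current)  # Critique and Refine
        history.append((current, score, reasoning))
        if score >= threshold:                                  # satisfactory, stop repeating
            break
        current = refined                                       # Repeat with the refined instruction
    return edited, history
```

The editor is treated as a black box here, which is consistent with the abstract's claim that the framework applies to any image editing model.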

Keyword: diffusion policy

XR-DT: Extended Reality-Enhanced Digital Twin for Agentic Mobile Robots

Authors: Tianyi Wang, Jiseop Byeon, Ahmad Yehia, Huihai Wang, Yiming Xu, Tianyi Zeng, Ziran Wang, Junfeng Jiao, Christian Claudel
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.05270
Pdf link: https://arxiv.org/pdf/2512.05270
Abstract As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial effort has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences, impeding deployment in safety-critical and socially embedded environments. This paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for agentic mobile robots that bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates virtual-, augmented-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable AR devices. Within this framework, we design an agentic mobile robot system with a unified diffusion policy for context-aware task adaptation. We further propose a chain-of-thought prompting mechanism that allows multimodal large language models to reason over human instructions and environmental context, while leveraging an AutoGen-based multi-agent coordination layer to enhance robustness and collaboration in dynamic tasks. Initial experimental results demonstrate accurate human and robot trajectory prediction, validating the XR-DT framework's effectiveness in HRI tasks. By embedding human intention, environmental dynamics, and robot cognition into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.