Arxiv Papers of Today

Generated: 2025-12-02 16:35:20 (UTC+8); arXiv release time: 2025-12-02 20:00 EST (2025-12-03 09:00 UTC+8)

76 relevant papers today

Keyword: reinforcement learning

DREAMer-VXS: A Latent World Model for Sample-Efficient AGV Exploration in Stochastic, Unobserved Environments

Authors: Agniprabha Chakraborty
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.00005
Pdf link: https://arxiv.org/pdf/2512.00005
Abstract The paradigm of learning-based robotics holds immense promise, yet its translation to real-world applications is critically hindered by the sample inefficiency and brittleness of conventional model-free reinforcement learning algorithms. In this work, we address these challenges by introducing DREAMer-VXS, a model-based framework for Autonomous Ground Vehicle (AGV) exploration that learns to plan from imagined latent trajectories. Our approach centers on learning a comprehensive world model from partial and high-dimensional LiDAR observations. This world model is composed of a Convolutional Variational Autoencoder (VAE), which learns a compact representation of the environment's structure, and a Recurrent State-Space Model (RSSM), which models complex temporal dynamics. By leveraging this learned model as a high-speed simulator, the agent can train its navigation policy almost entirely in imagination. This methodology decouples policy learning from real-world interaction, culminating in a 90% reduction in required environmental interactions to achieve expert-level performance when compared to state-of-the-art model-free SAC baselines. The agent's behavior is guided by an actor-critic policy optimized with a composite reward function that balances task objectives with an intrinsic curiosity bonus, promoting systematic exploration of unknown spaces. We demonstrate through extensive simulated experiments that DREAMer-VXS not only learns orders of magnitude faster but also develops more generalizable and robust policies, achieving a 45% increase in exploration efficiency in unseen environments and superior resilience to dynamic obstacles.
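
The composite reward mentioned in the abstract can be pictured with a minimal sketch. The abstract gives no formula, so the additive form, the weight `beta`, and the use of a latent prediction error as the curiosity signal are all assumptions made for illustration.

```python
import math

def curiosity_bonus(predicted_latent, observed_latent):
    """Intrinsic bonus: Euclidean error between predicted and observed latents (assumed form)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted_latent, observed_latent)))

def composite_reward(task_reward, predicted_latent, observed_latent, beta=0.1):
    """Task reward plus a weighted curiosity bonus; beta is a hypothetical weight."""
    return task_reward + beta * curiosity_bonus(predicted_latent, observed_latent)
```

A larger `beta` would push the agent toward states its world model predicts poorly, at the cost of slower progress on the task reward.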

Perturbation-mitigated USV Navigation with Distributionally Robust Reinforcement Learning

Authors: Zhaofan Zhang, Minghao Yang, Sihong Xie, Hui Xiong
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00030
Pdf link: https://arxiv.org/pdf/2512.00030
Abstract The robustness of Unmanned Surface Vehicles (USVs) is crucial when facing unknown and complex marine environments, especially when heteroscedastic observational noise poses significant challenges to sensor-based navigation tasks. Recently, Distributional Reinforcement Learning (DistRL) has shown promising results in some challenging autonomous navigation tasks without prior environmental information. However, these methods overlook situations where noise patterns vary across different environmental conditions, hindering safe navigation and disrupting the learning of value functions. To address the problem, we propose DRIQN to integrate Distributionally Robust Optimization (DRO) with implicit quantile networks to optimize worst-case performance under natural environmental conditions. Leveraging explicit subgroup modeling in the replay buffer, DRIQN incorporates heterogeneous noise sources and targets robustness-critical scenarios. Experimental results based on the risk-sensitive environment demonstrate that DRIQN significantly outperforms state-of-the-art methods, achieving +13.51% success rate, -12.28% collision rate, +35.46% time saving, and +27.99% energy saving compared with the runner-up.
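
The idea of optimizing worst-case performance over explicit subgroups can be sketched as a group-wise worst-case loss. This is a generic group-DRO style illustration, not DRIQN itself; the group labels and the choice of the mean loss per group are assumptions.

```python
from collections import defaultdict

def worst_group_loss(samples):
    """samples: iterable of (group_label, loss). Returns (worst_group, mean_loss)."""
    totals = defaultdict(lambda: [0.0, 0])
    for group, loss in samples:
        totals[group][0] += loss
        totals[group][1] += 1
    # Mean loss per subgroup; the training signal would focus on the largest one.
    means = {g: s / n for g, (s, n) in totals.items()}
    worst = max(means, key=means.get)
    return worst, means[worst]
```

For example, if transitions were tagged by a hypothetical noise condition such as "calm" or "storm", the function would return whichever condition currently has the highest average loss. It raises `ValueError` on an empty batch.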

Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions

Authors: Egemen Sert, Şeyda Ertekin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2512.00042
Pdf link: https://arxiv.org/pdf/2512.00042
Abstract Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data-centric foundations of vision language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4-million-token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854 multimodal exam questions across 309 curriculum topics. Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning. This work establishes a data-centric framework for advancing open-weight vision language models, demonstrating that carefully curated and curriculum-grounded multimodal data can elevate supervised fine-tuning to near state-of-the-art performance.

Causal Reinforcement Learning based Agent-Patient Interaction with Clinical Domain Knowledge

Authors: Wenzheng Zhao, Ran Zhang, Ruth Palan Lopez, Shu-Fen Wung, Fengpei Yuan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00048
Pdf link: https://arxiv.org/pdf/2512.00048
Abstract Reinforcement Learning (RL) faces significant challenges in adaptive healthcare interventions, such as dementia care, where data is scarce, decisions require interpretability, and underlying patient-state dynamics are complex and causal in nature. In this work, we present a novel framework called Causal structure-aware Reinforcement Learning (CRL) that explicitly integrates causal discovery and reasoning into policy optimization. This method enables an agent to learn and exploit a directed acyclic graph (DAG) that describes the causal dependencies between human behavioral states and robot actions, facilitating more efficient, interpretable, and robust decision-making. We validate our approach in a simulated robot-assisted cognitive care scenario, where the agent interacts with a virtual patient exhibiting dynamic emotional, cognitive, and engagement states. The experimental results show that CRL agents outperform conventional model-free RL baselines by achieving higher cumulative rewards, maintaining desirable patient states more consistently, and exhibiting interpretable, clinically-aligned behavior. We further demonstrate that CRL's performance advantage remains robust across different weighting strategies and hyperparameter settings. In addition, we demonstrate a lightweight LLM-based deployment: a fixed policy is embedded into a system prompt that maps inferred states to actions, producing consistent, supportive dialogue without LLM finetuning. Our work illustrates the promise of causal reinforcement learning for human-robot interaction applications, where interpretability, adaptiveness, and data efficiency are paramount.
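
The LLM-based deployment described in the abstract, a fixed policy embedded in a system prompt, could look roughly like the sketch below. The state names, action names, and prompt wording are hypothetical; the abstract does not list them.

```python
def build_system_prompt(policy):
    """Render a fixed state-to-action table as instructions for a dialogue model."""
    lines = ["You are a supportive care assistant. Follow this fixed policy:"]
    for state, action in sorted(policy.items()):
        lines.append(f"- If the patient state is '{state}', respond with action '{action}'.")
    return "\n".join(lines)

# Hypothetical states and actions, for illustration only.
example_policy = {"disengaged": "offer_encouragement", "confused": "simplify_task"}
```

Because the table is rendered as plain text, the policy can be changed by regenerating the prompt, with no model fine-tuning involved.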

Socially aware navigation for mobile robots: a survey on deep reinforcement learning approaches

Authors: Ibrahim Khalil Kabir, Muhammad Faizan Mysorewala
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00049
Pdf link: https://arxiv.org/pdf/2512.00049
Abstract Socially aware navigation is a fast-evolving research area in robotics that enables robots to move within human environments while adhering to the implicit human social norms. The advent of Deep Reinforcement Learning (DRL) has accelerated the development of navigation policies that enable robots to incorporate these social conventions while effectively reaching their objectives. This survey offers a comprehensive overview of DRL-based approaches to socially aware navigation, highlighting key aspects such as proxemics, human comfort, naturalness, trajectory and intention prediction, which enhance robot interaction in human environments. This work critically analyzes the integration of value-based, policy-based, and actor-critic reinforcement learning algorithms alongside neural network architectures, such as feedforward, recurrent, convolutional, graph, and transformer networks, for enhancing agent learning and representation in socially aware navigation. Furthermore, we examine crucial evaluation mechanisms, including metrics, benchmark datasets, simulation environments, and the persistent challenges of sim-to-real transfer. Our comparative analysis of the literature reveals that while DRL significantly improves safety and human acceptance over traditional approaches, the field still faces setbacks due to non-uniform evaluation mechanisms, absence of standardized social metrics, computational burdens that limit scalability, and difficulty in transferring simulation to real robotic hardware applications. We assert that future progress will depend on hybrid approaches that leverage the strengths of multiple approaches and on producing benchmarks that balance technical efficiency with human-centered evaluation.

Reinforcement Learning from Implicit Neural Feedback for Human-Aligned Robot Control

Authors: Suzie Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00050
Pdf link: https://arxiv.org/pdf/2512.00050
Abstract Conventional reinforcement learning (RL) approaches often struggle to learn effective policies under sparse reward conditions, necessitating the manual design of complex, task-specific reward functions. To address this limitation, reinforcement learning from human feedback (RLHF) has emerged as a promising strategy that complements hand-crafted rewards with human-derived evaluation signals. However, most existing RLHF methods depend on explicit feedback mechanisms such as button presses or preference labels, which disrupt the natural interaction process and impose a substantial cognitive load on the user. We propose a novel reinforcement learning from implicit human feedback (RLIHF) framework that utilizes non-invasive electroencephalography (EEG) signals, specifically error-related potentials (ErrPs), to provide continuous, implicit feedback without requiring explicit user intervention. The proposed method adopts a pre-trained decoder to transform raw EEG signals into probabilistic reward components, enabling effective policy learning even in the presence of sparse external rewards. We evaluate our approach in a simulation environment built on the MuJoCo physics engine, using a Kinova Gen2 robotic arm to perform a complex pick-and-place task that requires avoiding obstacles while manipulating target objects. The results show that agents trained with decoded EEG feedback achieve performance comparable to those trained with dense, manually designed rewards. These findings validate the potential of using implicit neural feedback for scalable and human-aligned reinforcement learning in interactive robotics.
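
One way to picture how a decoded ErrP probability could become a reward component is sketched below. The abstract only says the decoder outputs "probabilistic reward components"; the linear mapping to [-1, 1] and the `weight` parameter are assumptions.

```python
def shaped_reward(sparse_reward, p_error, weight=1.0):
    """Combine a sparse task reward with an implicit-feedback term.

    p_error is the decoder's estimated probability that the user perceived
    an error; the mapping (1 - 2 * p_error) onto [-1, 1] is an assumed form.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability")
    return sparse_reward + weight * (1.0 - 2.0 * p_error)
```

With this form, an uncertain decoder output of 0.5 contributes nothing, so the agent falls back on the sparse external reward alone.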

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

Authors: Taewook Nam, Sung Ju Hwang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00062
Pdf link: https://arxiv.org/pdf/2512.00062
Abstract Recent advances in robotic policy learning have enabled complex manipulation in real-world environments, yet the execution speed of these policies often lags behind hardware capabilities due to the cost of collecting faster demonstrations. Existing works on policy acceleration reinterpret action sequences for unseen execution speeds, thereby encountering distributional shifts from the original demonstrations. Reinforcement learning is a promising approach that adapts policies for faster execution without additional demonstrations, but its unguided exploration is sample inefficient. We propose SpeedAug, an RL-based policy acceleration framework that efficiently adapts pre-trained policies for faster task execution. SpeedAug constructs a behavior prior that encompasses diverse tempos of task execution by pre-training a policy on speed-augmented demonstrations. Empirical results on robotic manipulation benchmarks show that RL fine-tuning initialized from this tempo-enriched policy significantly improves the sample efficiency of existing RL and policy acceleration methods while maintaining a high success rate.
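
Speed-augmented demonstrations can be illustrated by resampling a recorded trajectory at a different tempo. The sketch below uses linear interpolation over a 1-D waypoint sequence; the abstract does not say how SpeedAug performs the augmentation, so this is an assumed stand-in.

```python
def speed_augment(trajectory, speed):
    """Resample a 1-D trajectory to play `speed` times faster.

    Uses linear interpolation between consecutive waypoints; speed > 1 shortens
    the sequence, speed < 1 lengthens it. The endpoints are preserved.
    """
    if speed <= 0 or len(trajectory) < 2:
        raise ValueError("need speed > 0 and at least two waypoints")
    last = len(trajectory) - 1
    n_out = max(2, round(last / speed) + 1)
    out = []
    for i in range(n_out):
        t = i * last / (n_out - 1)        # fractional index into the original
        lo = min(int(t), last - 1)
        frac = t - lo
        out.append(trajectory[lo] * (1 - frac) + trajectory[lo + 1] * frac)
    return out
```

Applying the function with several speed factors to each demonstration would yield a dataset covering multiple tempos of the same task.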

InF-ATPG: Intelligent FFR-Driven ATPG with Advanced Circuit Representation Guided Reinforcement Learning

Authors: Bin Sun, Rengang Zhang, Zhiteng Chao, Zizhen Liu, Jianan Mu, Jing Ye, Huawei Li
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00079
Pdf link: https://arxiv.org/pdf/2512.00079
Abstract Automatic test pattern generation (ATPG) is a crucial process in integrated circuit (IC) design and testing, responsible for efficiently generating test patterns. As semiconductor technology progresses, traditional ATPG struggles with long execution times to achieve the expected fault coverage, which impacts the time-to-market of chips. Recent machine learning techniques, like reinforcement learning (RL) and graph neural networks (GNNs), show promise but face issues such as reward delay in RL models and inadequate circuit representation in GNN-based methods. In this paper, we propose InF-ATPG, an intelligent FFR-driven ATPG framework that overcomes these challenges by using advanced circuit representation to guide RL. By partitioning circuits into fanout-free regions (FFRs) and incorporating ATPG-specific features into a novel QGNN architecture, InF-ATPG enhances test pattern generation efficiency. Experimental results show InF-ATPG reduces backtracks by 55.06% on average compared to traditional methods and 38.31% compared to the machine learning approach, while also improving fault coverage.
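
Partitioning a circuit into fanout-free regions can be sketched on a toy netlist. The sketch uses the textbook definition (an FFR root is a node whose fanout count is not exactly one) and a hypothetical dictionary netlist format; it assumes an acyclic combinational circuit and is not taken from the paper.

```python
def partition_ffrs(fanins):
    """Map each node to the root of its fanout-free region (FFR).

    fanins: dict gate -> list of driver names (primary inputs may appear
    only as drivers). A node is an FFR root when its fanout count is not 1.
    """
    successors = {}
    for gate, drivers in fanins.items():
        for d in drivers:
            successors.setdefault(d, []).append(gate)

    def root_of(node):
        # Follow the unique fanout edge until a stem or an output is reached.
        while len(successors.get(node, [])) == 1:
            node = successors[node][0]
        return node

    nodes = set(fanins) | set(successors)
    return {n: root_of(n) for n in nodes}
```

In a netlist where `g2` drives both `g3` and `g4`, everything feeding `g2` through single-fanout paths lands in the region rooted at `g2`, while `g3` and `g4` each root their own region.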

NetDeTox: Adversarial and Efficient Evasion of Hardware-Security GNNs via RL-LLM Orchestration

Authors: Zeng Wang, Minghao Shao, Akashdeep Saha, Ramesh Karri, Johann Knechtel, Muhammad Shafique, Ozgur Sinanoglu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00119
Pdf link: https://arxiv.org/pdf/2512.00119
Abstract Graph neural networks (GNNs) have shown promise in hardware security by learning structural motifs from netlist graphs. However, this reliance on motifs makes GNNs vulnerable to adversarial netlist rewrites; even small-scale edits can mislead GNN predictions. Existing adversarial approaches, ranging from synthesis-recipe perturbations to gate transformations, come with high design overheads. We present NetDeTox, an automated end-to-end framework that orchestrates large language models (LLMs) with reinforcement learning (RL) in a systematic manner, enabling focused local rewriting. The RL agent identifies netlist components critical for GNN-based reasoning, while the LLM devises rewriting plans to diversify motifs that preserve functionality. Iterative feedback between the RL and LLM stages refines adversarial rewritings to limit overheads. Compared to the SOTA work AttackGNN, NetDeTox successfully degrades the effectiveness of all security schemes with fewer rewrites and substantially lower area overheads (reductions of 54.50% for GNN-RE, 25.44% for GNN4IP, and 41.04% for OMLA, respectively). For GNN4IP, ours can even optimize/reduce the original benchmarks' area, in particular for larger circuits, demonstrating the practicality and scalability of NetDeTox.

A Hierarchical Hybrid AI Approach: Integrating Deep Reinforcement Learning and Scripted Agents in Combat Simulations

Authors: Scotty Black, Christian Darken
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.00249
Pdf link: https://arxiv.org/pdf/2512.00249
Abstract In the domain of combat simulations in support of wargaming, the development of intelligent agents has predominantly been characterized by rule-based, scripted methodologies with deep reinforcement learning (RL) approaches only recently being introduced. While scripted agents offer predictability and consistency in controlled environments, they fall short in dynamic, complex scenarios due to their inherent inflexibility. Conversely, RL agents excel in adaptability and learning, offering potential improvements in handling unforeseen situations, but suffer from significant challenges such as black-box decision-making processes and scalability issues in larger simulation environments. This paper introduces a novel hierarchical hybrid artificial intelligence (AI) approach that synergizes the reliability and predictability of scripted agents with the dynamic, adaptive learning capabilities of RL. By structuring the AI system hierarchically, the proposed approach aims to utilize scripted agents for routine, tactical-level decisions and RL agents for higher-level, strategic decision-making, thus addressing the limitations of each method while leveraging their individual strengths. This integration is shown to significantly improve overall performance, providing a robust, adaptable, and effective solution for developing and training intelligent agents in complex simulation environments.
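
The division of labor described in the abstract, a learned policy for strategic choices and scripted rules for tactical execution, can be sketched as two layers. The value table stands in for a trained RL policy, and the objectives, tactics, and health threshold are hypothetical.

```python
def strategic_choice(q_values):
    """High-level decision: pick the objective with the highest learned value."""
    return max(q_values, key=q_values.get)

def scripted_tactic(objective, unit_health):
    """Low-level scripted rule that executes the chosen objective (hypothetical rules)."""
    if unit_health < 0.3:
        return "retreat"                      # safety rule overrides the objective
    tactics = {"attack": "advance_and_fire", "defend": "hold_position"}
    return tactics.get(objective, "hold_position")

def hybrid_step(q_values, unit_health):
    """One decision step: learned strategy on top, scripted tactics underneath."""
    return scripted_tactic(strategic_choice(q_values), unit_health)
```

Keeping the tactical layer scripted means its behavior stays predictable and inspectable even when the strategic layer is a black-box learner.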

Gradient Inversion in Federated Reinforcement Learning

Authors: Shenghong He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00303
Pdf link: https://arxiv.org/pdf/2512.00303
Abstract Federated reinforcement learning (FRL) enables distributed learning of optimal policies while preserving local data privacy through gradient [sharing]. [...], FRL faces the risk of data privacy leaks, where attackers exploit shared gradients to reconstruct local training [data]. [...] to traditional supervised federated learning, successful reconstruction in FRL requires the generated data not only to match the shared gradients but also to align with real transition dynamics of the environment (i.e., aligning with the real data transition distribution). To address this issue, we propose a novel attack method called Regularization Gradient Inversion Attack (RGIA), which enforces prior-knowledge-based regularization on states, rewards, and transition dynamics during the optimization process to ensure that the reconstructed data remain close to the true transition [...]. [...], we prove that the prior-knowledge-based regularization term narrows the solution space from a broad set containing spurious solutions to a constrained subset that satisfies both gradient matching and true transition [...]. [...] experiments on control tasks and autonomous driving tasks demonstrate that RGIA can effectively constrain reconstructed data transition distributions and thus successfully reconstruct local private data.
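
Gradient matching with a prior-based regularizer can be illustrated on a toy problem. The sketch below is not RGIA: it uses a linear model with squared loss instead of an RL agent, assumes the attacker knows the weights `w` and the target `y`, and uses a squared distance to a hypothetical prior as the regularizer.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def shared_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    r = dot(w, x) - y
    return [r * xi for xi in x]

def objective(x, w, y, g_shared, prior, lam):
    """Gradient-matching error plus lam times squared distance to the prior."""
    g = shared_gradient(w, x, y)
    match = sum((a - b) ** 2 for a, b in zip(g, g_shared))
    reg = sum((a - b) ** 2 for a, b in zip(x, prior))
    return match + lam * reg

def invert(w, y, g_shared, prior, lam=0.1, lr=0.1, steps=300, eps=1e-6):
    """Gradient descent on the objective using forward-difference gradients."""
    x = list(prior)
    for _ in range(steps):
        base = objective(x, w, y, g_shared, prior, lam)
        grad = []
        for i in range(len(x)):
            bumped = x[:i] + [x[i] + eps] + x[i + 1:]
            grad.append((objective(bumped, w, y, g_shared, prior, lam) - base) / eps)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```

The regularizer keeps the reconstruction near values the attacker considers plausible, which is the role the abstract assigns to prior knowledge about states, rewards, and transitions.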

RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs

Authors: Ruike Hu, Shulei Wu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00319
Pdf link: https://arxiv.org/pdf/2512.00319
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language generation and reasoning. However, their integration into automated software ecosystems is often hindered by the "Structure Gap" - the inherent tension between the probabilistic nature of token generation and the deterministic requirements of structured data formats (e.g., JSON, XML). Traditional Supervised Fine-Tuning (SFT) often fails to enforce strict syntactic constraints, leading to "hallucinated" keys or malformed structures, while constrained decoding methods impose significant inference latency. In this paper, we propose a lightweight, efficient Reinforcement Learning (RL) framework to bridge this gap. We introduce a novel Multi-dimensional Reward Function that decomposes the structured output task into a hierarchy of constraints: structural integrity, format correctness, content accuracy, and validity. Leveraging Gradient Regularized Policy Optimization (GRPO), we enable the model to internalize these constraints without the need for a separate critic network, reducing peak VRAM usage by 40% compared to PPO. We validate our approach on multiple tasks, including complex recipe generation and structured math reasoning (GSM8K-JSON). Experimental results demonstrate that our method achieves 89.7% structural accuracy and 92.1% JSON validity, significantly outperforming both zero-shot baselines (e.g., GPT-3.5) and SFT on larger models like LLaMA-3-8B. Furthermore, we provide a detailed analysis of training dynamics, revealing a distinct self-paced curriculum where the model sequentially acquires syntactic proficiency before semantic accuracy. Our model is publicly available at this https URL.
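
A multi-part reward for structured output can be sketched with the standard `json` module. The abstract names four dimensions (structural integrity, format correctness, content accuracy, validity) without formulas; the three-part split, the weights, and the exact-match content check below are simplifying assumptions.

```python
import json

def structured_reward(output_text, required_keys, reference=None,
                      weights=(0.4, 0.3, 0.3)):
    """Score a model output on validity, required structure, and content.

    The three-part split and the weights are illustrative assumptions.
    """
    w_valid, w_struct, w_content = weights
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0                      # unparseable output earns nothing
    if not isinstance(data, dict):
        return w_valid                  # valid JSON but not an object
    present = sum(1 for k in required_keys if k in data)
    struct = present / len(required_keys) if required_keys else 1.0
    content = 0.0
    if reference:
        hits = sum(1 for k, v in reference.items() if data.get(k) == v)
        content = hits / len(reference)
    return w_valid + w_struct * struct + w_content * content
```

Because the validity term is earned before any content term can be, a policy optimized against this score would tend to learn well-formed syntax first, which is consistent with the curriculum effect the abstract reports.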

Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Authors: Na Li, Yuchen Jiao, Hangguan Shan, Shefeng Yan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00351
Pdf link: https://arxiv.org/pdf/2512.00351
Abstract The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, the heavy dependence of sample complexity on the long horizon and the large state space, the high computational complexity, non-Markov policy, non-Nash policy, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, which is a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $\{A, B\}$ are the numbers of actions for the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
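
To get a feel for how the stated bounds scale, the leading terms can be evaluated for concrete problem sizes. The helper below drops all constants and logarithmic factors, so its outputs are only orders of magnitude, not actual memory or sample requirements.

```python
def me_nash_ql_bounds(S, A, B, H, eps):
    """Leading-order terms from the abstract, with constants and log factors dropped."""
    space = S * A * B * H                       # O(SABH)
    samples = H ** 4 * S * A * B / eps ** 2     # O~(H^4 SAB / eps^2)
    return {"space": space, "samples": samples}
```

Doubling the horizon `H` doubles the space term but multiplies the sample term by sixteen, which shows why the horizon dependence dominates for long-horizon games.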

Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

Authors: Na Li, Zewu Zheng, Wei Ni, Hangguan Shan, Wenjie Zhang, Xinyu Li
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00352
Pdf link: https://arxiv.org/pdf/2512.00352
Abstract Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. We focus on robust two-player zero-sum Markov games (TZMGs) in offline settings, specifically on tabular robust TZMGs (RTZMGs). We propose a model-based algorithm (RTZ-VI-LCB) for offline RTZMGs, which combines optimistic robust value iteration with a data-driven Bernstein-style penalty term for robust value estimation. By accounting for distribution shifts in the historical dataset, the proposed algorithm establishes near-optimal sample complexity guarantees under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to confirm the tightness of our algorithm's sample complexity, which is optimal regarding both state and action spaces. To the best of our knowledge, RTZ-VI-LCB is the first to attain this optimality, sets a new benchmark for offline RTZMGs, and is validated experimentally.
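
A Bernstein-style penalty typically combines a variance-dependent term with a lower-order term that shrinks faster in the sample count. The sketch below shows that generic shape; the constants `c1`, `c2`, the confidence level `delta`, and the clipping are assumptions, since the abstract does not give the paper's exact expression or say how the penalty is applied to each player.

```python
import math

def bernstein_penalty(variance, n_samples, horizon, delta=0.05, c1=1.0, c2=1.0):
    """Generic Bernstein-style term: sqrt(var * log / n) + H * log / n.

    c1, c2 and delta are hypothetical; the result is capped at the horizon.
    """
    if n_samples <= 0:
        return float(horizon)          # no data: fall back to the trivial bound
    log_term = math.log(1.0 / delta)
    return min(float(horizon),
               c1 * math.sqrt(variance * log_term / n_samples)
               + c2 * horizon * log_term / n_samples)

def penalized_value(estimate, penalty):
    """Subtract the penalty from a value estimate and clip at zero."""
    return max(0.0, estimate - penalty)
```

State-action pairs that appear rarely in the offline dataset receive a large penalty, which discourages the learned policy from relying on poorly covered parts of the game.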

Learning Causal States Under Partial Observability and Perturbation

Authors: Na Li, Hangguan Shan, Wei Ni, Wenjie Zhang, Xinyu Li, Yamin Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00357
Pdf link: https://arxiv.org/pdf/2512.00357
Abstract A critical challenge for reinforcement learning (RL) is making decisions based on incomplete and noisy observations, especially in perturbed and partially observable Markov decision processes (P$^2$OMDPs). Existing methods fail to mitigate perturbations while addressing partial observability. We propose Causal State Representation under Asynchronous Diffusion Model (CaDiff), a framework that enhances any RL algorithm by uncovering the underlying causal structure of P$^2$OMDPs. This is achieved by incorporating a novel asynchronous diffusion model (ADM) and a new bisimulation metric. ADM enables forward and reverse processes with different numbers of steps, thus interpreting the perturbation of P$^2$OMDP as part of the noise suppressed through diffusion. The bisimulation metric quantifies the similarity between partially observable environments and their causal counterparts. Moreover, we establish the theoretical guarantee of CaDiff by deriving an upper bound for the value function approximation errors between perturbed observations and denoised causal states, reflecting a principled trade-off between the approximation errors of the reward and transition models. Experiments on Roboschool tasks show that CaDiff enhances returns by at least 14.18% compared to baselines. CaDiff is the first framework that approximates causal states using diffusion models with both theoretical rigor and practicality.
中文摘要 强化学习（RL）面临的一个关键挑战是基于不完整且带噪声的观测做出决策，尤其是在受扰动且部分可观测的马尔可夫决策过程（P$^2$OMDPs）中。现有方法无法在处理部分可观测性的同时减轻扰动。我们提出了\textit{异步扩散模型下的因果状态表示（CaDiff）}，该框架通过揭示 P$^2$OMDP 的底层因果结构来增强任意强化学习算法。这通过引入一种新颖的异步扩散模型（ADM）和一种新的互模拟度量来实现。ADM允许正向过程与反向过程采用不同的步数，从而将P$^2$OMDP的扰动解释为可通过扩散抑制的噪声的一部分。互模拟度量量化了部分可观测环境与其因果对应物之间的相似性。此外，我们通过推导扰动观测与去噪因果状态之间价值函数近似误差的上界，建立了CaDiff的理论保证，该上界反映了奖励近似误差与转移模型近似误差之间的原则性权衡。在Roboschool任务上的实验表明，与基线相比，CaDiff的回报至少提升14.18%。CaDiff是第一个利用扩散模型近似因果状态、兼具理论严谨性和实用性的框架。

Hardware-Software Collaborative Computing of Photonic Spiking Reinforcement Learning for Robotic Continuous Control

面向机器人连续控制的光子脉冲强化学习软硬件协同计算

Authors: Mengting Yu, Shuiying Xiang, Changjian Xie, Yonghang Chen, Haowen Zhao, Xingxing Guo, Yahui Zhang, Yanan Han, Yue Hao
Subjects: Robotics (cs.RO); Optics (physics.optics)
Arxiv link: https://arxiv.org/abs/2512.00427
Pdf link: https://arxiv.org/pdf/2512.00427
Abstract Robotic continuous control tasks impose stringent demands on the energy efficiency and latency of computing architectures due to their high-dimensional state spaces and real-time interaction requirements. Conventional electronic computing platforms face computational bottlenecks, whereas the fusion of photonic computing and spiking reinforcement learning (RL) offers a promising alternative. Here, we propose a novel computing architecture based on photonic spiking RL, which integrates the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm with spiking neural network (SNN). The proposed architecture employs an optical-electronic hybrid computing paradigm wherein a silicon photonic Mach-Zehnder interferometer (MZI) chip executes linear matrix computations, while nonlinear spiking activations are performed in the electronic domain. Experimental validation on the Pendulum-v1 and HalfCheetah-v2 benchmarks demonstrates the system capability for software-hardware co-inference, achieving a control policy reward of 5831 on HalfCheetah-v2, a 23.33% reduction in convergence steps, and an action deviation below 2.2%. Notably, this work represents the first application of a programmable MZI photonic computing chip to robotic continuous control tasks, attaining an energy efficiency of 1.39 TOPS/W and an ultralow computational latency of 120 ps. Such performance underscores the promise of photonic spiking RL for real-time decision-making in autonomous and industrial robotic systems.
中文摘要 机器人连续控制任务因其高维状态空间和实时交互需求，对计算架构的能效和延迟提出了严格要求。传统电子计算平台面临计算瓶颈，而光子计算与脉冲强化学习（RL）的融合提供了一种有前景的替代方案。本研究提出一种基于光子脉冲强化学习的新型计算架构，将双延迟深度确定性策略梯度（TD3）算法与脉冲神经网络（SNN）相结合。所提架构采用光电混合计算范式，其中硅光子马赫-曾德尔干涉仪（MZI）芯片执行线性矩阵计算，而非线性脉冲激活在电域中完成。在Pendulum-v1和HalfCheetah-v2基准上的实验验证展示了系统的软硬件协同推理能力：在HalfCheetah-v2上控制策略奖励达到5831，收敛步数减少23.33%，动作偏差低于2.2%。值得注意的是，这项工作首次将可编程MZI光子计算芯片应用于机器人连续控制任务，实现了1.39 TOPS/W的能效和120 ps的超低计算延迟。这一性能凸显了光子脉冲强化学习在自主和工业机器人系统实时决策中的潜力。

Learning What Helps: Task-Aligned Context Selection for Vision Tasks

学习什么有帮助：面向视觉任务的任务对齐上下文选择

Authors: Jingyu Guo, Emir Konuk, Fredrik Strand, Christos Matsoukas, Kevin Smith
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.00489
Pdf link: https://arxiv.org/pdf/2512.00489
Abstract Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.
中文摘要 人类通常通过将图像与相关示例进行比较来解决视觉不确定性，但ViT无法识别哪些示例能提升其预测效果。我们提出了任务对齐上下文选择（TACS）框架，该框架学习选择真正提升任务表现的配对实例，而非仅仅是外观相似的例子。TACS通过结合梯度监督和强化学习的混合优化方案，联合训练选择器网络与任务模型，使检索成为学习目标的一部分。通过将选择与任务奖励对齐，TACS使判别模型能够发现哪些情境实例真正有帮助。在涵盖细粒度识别、医学图像分类和医学图像分割的18个数据集中，TACS在具有挑战性或数据有限的环境中，始终优于基于相似度的检索。

ESPO: Entropy Importance Sampling Policy Optimization

ESPO：熵重要性抽样策略优化

Authors: Yuepeng Sheng, Yuwei Huang, Shuman Liu, Haibo Zhang, Anxiang Zeng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00499
Pdf link: https://arxiv.org/pdf/2512.00499
Abstract Large language model (LLM) reinforcement learning has increasingly relied on group-based policy optimization frameworks, such as GRPO and GSPO, to achieve stable fine-tuning at scale. However, a fundamental trade-off persists between optimization granularity and training stability. While GSPO improves robustness via sequence-level optimization, its monolithic treatment of sequences introduces severe inefficiencies: its conservative clipping mechanism indiscriminately discards valid training samples-a phenomenon we term gradient underutilization-and its uniform credit assignment fails to capture the heterogeneous contributions of critical reasoning steps. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that reconciles fine-grained control with training stability. ESPO decomposes sequences into groups based on predictive entropy, enabling (1) Entropy-driven Importance Sampling to capture intra-sequence heterogeneity, and (2) Entropy-adaptive Clipping to dynamically allocate trust regions based on model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, notably improving accuracy on the challenging HMMT benchmark from 4.4% to 13.13%.
中文摘要 大型语言模型（LLM）强化学习越来越依赖基于群体的策略优化框架，如GRPO和GSPO，以实现大规模的稳定微调。然而，优化粒度与训练稳定性之间存在根本权衡。虽然GSPO通过序列层级优化提升了鲁棒性，但其对序列的单一处理带来了严重的低效：其保守裁剪机制无差别地丢弃有效的训练样本——我们称之为梯度不足利用——且其统一的功劳分配未能捕捉关键推理步骤的异质贡献。在本研究中，我们提出了熵重要性抽样策略优化（ESPO），这是一种新颖框架，将细粒度控制与训练稳定性调和。ESPO根据预测熵将序列分解为组，使（1）基于熵的重要性抽样以捕捉序列内异质性成为可能，以及（2）熵适应剪裁，基于模型不确定性动态分配信任区域。数学推理基准测试的大量实验表明，ESPO不仅加速收敛，还实现了最先进的性能，尤其是在具有挑战性的HMMT基准测试中，准确率从4.4%提升到13.13%。

G-KV: Decoding-Time KV Cache Eviction with Global Attention

G-KV：带全局注意力的解码时间KV缓存驱逐

Authors: Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00504
Pdf link: https://arxiv.org/pdf/2512.00504
Abstract Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: this https URL.
中文摘要 近期的推理型大型语言模型（LLMs）在复杂任务中表现出色，但由于序列长度较长，面临显著的计算和内存挑战。KV缓存压缩已成为大幅提升推理效率的有效方法。然而，现有方法往往侧重于提示压缩或基于局部注意力分数的token驱逐，忽视了token的长期重要性。我们提出了G-KV，这是一种采用全局评分机制的KV缓存驱逐方法，它结合局部和历史注意力分数，以更准确地评估token的重要性。此外，我们还引入了后训练技术，包括强化学习和蒸馏，以针对压缩KV缓存的设置优化模型。本文代码可在以下地址获取：this https URL。
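
The G-KV abstract describes scoring tokens by combining local and historical attention scores and then evicting low-scoring entries. The abstract does not give the combination rule, so the following is only a minimal sketch under an assumed rule: a decayed running maximum (`decay`, `update_global_scores`, and `tokens_to_keep` are hypothetical names, not taken from the paper or its code).

```python
from typing import List, Sequence


def update_global_scores(history: Sequence[float],
                         local: Sequence[float],
                         decay: float = 0.9) -> List[float]:
    """Combine historical and local attention scores per cached token.

    ASSUMPTION: the global score is max(decay * history, local).
    The actual G-KV rule is not stated in the abstract.
    """
    if len(history) != len(local):
        raise ValueError("history and local must cover the same tokens")
    return [max(decay * h, l) for h, l in zip(history, local)]


def tokens_to_keep(scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Token 0 was important earlier but receives no attention at this step.
history = [0.9, 0.1, 0.05]
local = [0.0, 0.2, 0.3]
global_scores = update_global_scores(history, local, decay=0.5)
print(tokens_to_keep(local, budget=2))          # local-only ranking drops token 0
print(tokens_to_keep(global_scores, budget=2))  # global ranking retains token 0
```

The example illustrates the motivation stated in the abstract: a purely local score can evict a token whose importance is long-term, while a score that remembers history keeps it.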

Truthful Double Auctions under Approximate VCG: Immediate-Penalty Enforcement in P2P Energy Trading

近似VCG下的诚实双向拍卖：P2P能源交易中的即时惩罚执行

Authors: Xun Shao, Ryuuto Shimizu
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.00513
Pdf link: https://arxiv.org/pdf/2512.00513
Abstract This paper examines truthful double auctions when exact VCG allocation is computationally infeasible and repeated-game punishments are impractical. We analyze an $\alpha$-approximate VCG mechanism and show that truthful reporting becomes a subgame-perfect equilibrium when the immediate penalty exceeds the incentive gap created by approximation, scaled by monitoring accuracy. To validate this result, we construct a PPO-based multi-agent reinforcement learning environment for P2P smart-grid trading, where prosumers incur penalties for bidding far from their true valuations. Across systematic experiments varying approximation accuracy, tolerance, penalty magnitude, and discounting, the learned behavior closely matches theoretical predictions. The findings demonstrate that immediate-penalty approximate VCG mechanisms provide a practical and transparent approach to sustaining truthful behavior in distributed market settings.
中文摘要 本文研究在精确VCG分配计算上不可行、且重复博弈惩罚不切实际的情况下的诚实双向拍卖。我们分析了一种$\alpha$近似VCG机制，并表明当即时惩罚超过由近似造成的激励缺口（按监控准确度缩放）时，如实报告成为子博弈完美均衡。为验证这一结果，我们构建了一个基于PPO的多智能体强化学习环境，用于P2P智能电网交易，其中产消者因出价远离其真实估值而受到惩罚。在系统性地改变近似精度、容忍度、惩罚幅度和折现的实验中，学到的行为与理论预测高度吻合。研究结果表明，即时惩罚的近似VCG机制为在分布式市场环境中维持诚实行为提供了一种实用且透明的方法。

DQ4FairIM: Fairness-aware Influence Maximization using Deep Reinforcement Learning

DQ4FairIM：利用深度强化学习实现公平感知影响力最大化

Authors: Akrati Saxena, Harshith Kumar Yadav, Bart Rutten, Shashi Shekhar Jha
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00545
Pdf link: https://arxiv.org/pdf/2512.00545
Abstract The Influence Maximization (IM) problem aims to select a set of seed nodes within a given budget to maximize the spread of influence in a social network. However, real-world social networks have several structural inequalities, such as dominant majority groups and underrepresented minority groups. If these inequalities are not considered while designing IM algorithms, the outcomes might be biased, disproportionately benefiting majority groups while marginalizing minorities. In this work, we address this gap by designing a fairness-aware IM method using Reinforcement Learning (RL) that ensures equitable influence outreach across all communities, regardless of protected attributes. Fairness is incorporated using a maximin fairness objective, which prioritizes improving the outreach of the least-influenced group, pushing the solution toward an equitable influence distribution. We propose a novel fairness-aware deep RL method, called DQ4FairIM, that maximizes the expected number of influenced nodes by learning an RL policy. We formulate the IM problem as a Markov Decision Process (MDP) and use deep Q-learning, combined with the Structure2Vec network embedding, to solve the MDP; in line with the maximin objective, the learnt policy is meant to secure outreach to minority groups. We perform extensive experiments on synthetic benchmarks and real-world networks to compare our method with fairness-agnostic and fairness-aware baselines. The results show that our method achieves a higher level of fairness while maintaining a better fairness-performance trade-off than baselines. Additionally, our approach learns effective seeding policies that generalize across problem instances without retraining, such as varying the network size or the number of seed nodes.
中文摘要 影响力最大化（IM）问题旨在在给定预算内选择一组种子节点，以最大化社交网络中的影响力传播。然而，现实世界的社交网络存在若干结构性不平等，例如占主导地位的多数群体和代表性不足的少数群体。如果在设计IM算法时不考虑这些不平等，结果可能存在偏差，使多数群体不成比例地受益，而少数群体被边缘化。在本研究中，我们利用强化学习（RL）设计了一种公平感知的IM方法来弥补这一空白，确保影响力公平地覆盖所有社区，而不论其受保护属性如何。公平性通过最大最小（maximin）公平目标引入，该目标优先提升受影响最少群体的覆盖率，推动解趋向公平的影响力分布。我们提出了一种新颖的公平感知深度强化学习方法DQ4FairIM，通过学习强化学习策略来最大化受影响节点的期望数量。我们将IM问题表述为马尔可夫决策过程（MDP），并使用深度Q学习结合Structure2Vec网络嵌入来求解该MDP；按照最大最小目标，所学策略旨在保障对少数群体的覆盖。我们在合成基准和真实网络上进行了大量实验，将我们的方法与不考虑公平性的基线和公平感知的基线进行比较。结果显示，我们的方法实现了更高的公平性水平，同时保持了优于基线的公平性-性能权衡。此外，我们的方法学到了有效的种子选择策略，这些策略无需重新训练即可泛化到不同的问题实例，例如不同的网络规模或种子节点数量。
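
The maximin fairness objective named in the DQ4FairIM abstract can be illustrated with a small sketch. It assumes that "outreach" of a group means the fraction of that group's nodes that end up influenced; the function names and the toy data are hypothetical, and the paper's exact reward definition is not given in the abstract.

```python
from typing import Dict, Hashable, Set


def group_outreach(influenced: Set[Hashable],
                   group_of: Dict[Hashable, Hashable]) -> Dict[Hashable, float]:
    """Fraction of each group's nodes that were influenced (assumed definition)."""
    totals: Dict[Hashable, int] = {}
    hits: Dict[Hashable, int] = {}
    for node, group in group_of.items():
        totals[group] = totals.get(group, 0) + 1
        if node in influenced:
            hits[group] = hits.get(group, 0) + 1
    return {g: hits.get(g, 0) / totals[g] for g in totals}


def maximin_fairness(influenced: Set[Hashable],
                     group_of: Dict[Hashable, Hashable]) -> float:
    """Maximin objective: the outreach of the least-influenced group."""
    return min(group_outreach(influenced, group_of).values())


# Toy network: nodes 1-4 form the majority, nodes 5-6 the minority.
group_of = {1: "majority", 2: "majority", 3: "majority", 4: "majority",
            5: "minority", 6: "minority"}
spread_a = {1, 2, 3, 4}   # more nodes reached, minority untouched
spread_b = {1, 2, 5}      # fewer nodes reached, both groups at 50%
print(maximin_fairness(spread_a, group_of), maximin_fairness(spread_b, group_of))
```

Under this objective the smaller but balanced spread scores higher, which is the behavior the abstract attributes to prioritizing the least-influenced group.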

List Replicable Reinforcement Learning

列表可复现强化学习

Authors: Bohan Zhang, Michael Chen, A. Pavan, N. V. Vinodchandran, Lin F. Yang, Ruosong Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.00553
Pdf link: https://arxiv.org/pdf/2512.00553
Abstract Replicability is a fundamental challenge in reinforcement learning (RL), as RL algorithms are empirically observed to be unstable and sensitive to variations in training conditions. To formally address this issue, we study \emph{list replicability} in the Probably Approximately Correct (PAC) RL framework, where an algorithm must return a near-optimal policy that lies in a \emph{small list} of policies across different runs, with high probability. The size of this list defines the \emph{list complexity}. We introduce both weak and strong forms of list replicability: the weak form ensures that the final learned policy belongs to a small list, while the strong form further requires that the entire sequence of executed policies remains constrained. These objectives are challenging, as existing RL algorithms exhibit exponential list complexity due to their instability. Our main theoretical contribution is a provably efficient tabular RL algorithm that guarantees list replicability by ensuring the list complexity remains polynomial in the number of states, actions, and the horizon length. We further extend our techniques to achieve strong list replicability, bounding the number of possible policy execution traces polynomially with high probability. Our theoretical result is made possible by key innovations including (i) a novel planning strategy that selects actions based on lexicographic order among near-optimal choices within a randomly chosen tolerance threshold, and (ii) a mechanism for testing state reachability in stochastic environments while preserving replicability. Finally, we demonstrate that our theoretical investigation sheds light on resolving the \emph{instability} issue of RL algorithms used in practice. In particular, we show that empirically, our new planning strategy can be incorporated into practical RL frameworks to enhance their stability.
中文摘要 可复现性是强化学习（RL）中的一个根本挑战，因为经验上观察到强化学习算法不稳定，且对训练条件的变化敏感。为了形式化地解决这个问题，我们在概率近似正确（Probably Approximately Correct，PAC）RL框架中研究\emph{列表可复现性}：算法必须以高概率返回一个近似最优策略，且该策略在不同运行中都落在一个\emph{小列表}的策略之内。该列表的大小定义了\emph{列表复杂度}。我们引入了弱、强两种形式的列表可复现性：弱形式确保最终学到的策略属于一个小列表，而强形式进一步要求整个已执行策略序列都受到约束。这些目标具有挑战性，因为现有的强化学习算法由于不稳定而表现出指数级的列表复杂度。我们的主要理论贡献是一种可证明高效的表格型强化学习算法，它通过确保列表复杂度关于状态数、动作数和时域长度保持多项式，来保证列表可复现性。我们进一步扩展技术以实现强列表可复现性，以高概率将可能的策略执行轨迹数量限制在多项式范围内。我们的理论成果得益于若干关键创新，包括：（i）一种新颖的规划策略，在随机选取的容忍阈值内的近似最优选择中按字典序选择动作；（ii）一种在随机环境中测试状态可达性、同时保持可复现性的机制。最后，我们表明这一理论研究有助于解决实践中所用强化学习算法的\emph{不稳定性}问题。特别是，我们通过实验表明，新的规划策略可以并入实用的强化学习框架以增强其稳定性。
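
Innovation (i) in the abstract above, choosing actions by lexicographic order among near-optimal choices within a randomly chosen tolerance threshold, can be sketched as follows. This is a simplified single-state illustration, not the paper's algorithm: the tolerance range and the helper names are assumptions, and "lexicographic order" is taken here to mean the smallest action index.

```python
import random
from typing import Sequence


def draw_tolerance(rng: random.Random, low: float, high: float) -> float:
    """Draw the tolerance threshold at random (the range is an assumption)."""
    return rng.uniform(low, high)


def lexicographic_near_optimal_action(q_values: Sequence[float],
                                      tolerance: float) -> int:
    """Return the smallest-index action whose value is within `tolerance` of the best."""
    if not q_values:
        raise ValueError("q_values must be non-empty")
    best = max(q_values)
    for action, q in enumerate(q_values):
        if q >= best - tolerance:
            return action
    return 0  # unreachable: the best action always satisfies the condition


# Two runs with slightly different value estimates for the same state.
run_1 = [1.00, 1.02, 0.50]
run_2 = [1.01, 0.99, 0.50]
tau = 0.05
print(lexicographic_near_optimal_action(run_1, tau),
      lexicographic_near_optimal_action(run_2, tau))
```

A plain argmax would pick action 1 in the first run and action 0 in the second; the tolerance-based rule picks action 0 in both, which is the kind of run-to-run stability the abstract is after.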

SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

SAGE：基于大型语言模型的语义感知灰盒游戏回归测试

Authors: Jinyu Cai, Jialong Li, Nianyu Li, Zhenyu Mao, Mingyue Zhang, Kenji Tei
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.00560
Pdf link: https://arxiv.org/pdf/2512.00560
Abstract The rapid iteration cycles of modern live-service games make regression testing indispensable for maintaining quality and stability. However, existing regression testing approaches face critical limitations, especially in common gray-box settings where full source code access is unavailable: they heavily rely on manual effort for test case construction, struggle to maintain growing suites plagued by redundancy, and lack efficient mechanisms for prioritizing relevant tests. These challenges result in excessive testing costs, limited automation, and insufficient bug detection. To address these issues, we propose SAGE, a semantic-aware regression testing framework for gray-box game environments. SAGE systematically addresses the core challenges of test generation, maintenance, and selection. It employs LLM-guided reinforcement learning for efficient, goal-oriented exploration to automatically generate a diverse foundational test suite. Subsequently, it applies a semantic-based multi-objective optimization to refine this suite into a compact, high-value subset by balancing cost, coverage, and rarity. Finally, it leverages LLM-based semantic analysis of update logs to prioritize test cases most relevant to version changes, enabling efficient adaptation across iterations. We evaluate SAGE on two representative environments, Overcooked Plus and Minecraft, comparing against both automated baselines and human-recorded test cases. Across all environments, SAGE achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.
中文摘要 现代在线服务游戏的快速迭代周期使回归测试成为保持质量和稳定性的必备手段。然而，现有的回归测试方法面临严重局限，尤其是在常见的灰盒环境中，无法获得完整源代码访问：它们严重依赖手动构建测试用例，难以维护冗余困扰的不断增长的套件，且缺乏高效优先排序相关测试的机制。这些挑战导致测试成本过高、自动化有限以及缺陷检测不足。为解决这些问题，我们提出了SAGE，一个针对灰盒游戏环境的语义感知回归测试框架。SAGE系统地解决测试生成、维护和选择等核心挑战。它采用LLM引导的强化学习，实现高效、目标导向的探索，自动生成多样化的基础测试套件。随后，它采用基于语义的多目标优化，通过平衡成本、覆盖率和稀有性，将该套件精炼成一个紧凑且高价值的子集。最后，它利用基于LLM的更新日志语义分析，优先考虑与版本变更最相关的测试案例，实现跨迭代的高效适应。我们在两个代表性环境——Overcooked Plus和Minecraft上评估SAGE，并与自动化基线和人工录制的测试用例进行比较。在所有环境中，SAGE以显著降低的执行成本实现了卓越的错误检测，同时展现出对版本更新的强烈适应能力。

HAVEN: Hierarchical Adversary-aware Visibility-Enabled Navigation with Cover Utilization using Deep Transformer Q-Networks

HAVEN：基于深度Transformer Q网络、利用掩体的分层式对手感知可见性导航

Authors: Mihir Chauhan, Damon Conover, Aniket Bera
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.00592
Pdf link: https://arxiv.org/pdf/2512.00592
Abstract Autonomous navigation in partially observable environments requires agents to reason beyond immediate sensor input, exploit occlusion, and ensure safety while progressing toward a goal. These challenges arise in many robotics domains, from urban driving and warehouse automation to defense and surveillance. Classical path planning approaches and memoryless reinforcement learning often fail under limited fields of view (FoVs) and occlusions, committing to unsafe or inefficient maneuvers. We propose a hierarchical navigation framework that integrates a Deep Transformer Q-Network (DTQN) as a high-level subgoal selector with a modular low-level controller for waypoint execution. The DTQN consumes short histories of task-aware features, encoding odometry, goal direction, obstacle proximity, and visibility cues, and outputs Q-values to rank candidate subgoals. Visibility-aware candidate generation introduces masking and exposure penalties, rewarding the use of cover and anticipatory safety. A low-level potential field controller then tracks the selected subgoal, ensuring smooth short-horizon obstacle avoidance. We validate our approach in 2D simulation and extend it directly to a 3D Unity-ROS environment by projecting point-cloud perception into the same feature schema, enabling transfer without architectural changes. Results show consistent improvements over classical planners and RL baselines in success rate, safety margins, and time to goal, with ablations confirming the value of temporal memory and visibility-aware candidate design. These findings highlight a generalizable framework for safe navigation under uncertainty, with broad relevance across robotic platforms.
中文摘要 在部分可观测环境中实现自主导航，要求智能体的推理超越即时传感器输入，利用遮挡，并在朝目标前进的同时确保安全。这些挑战出现在许多机器人领域，从城市驾驶、仓库自动化到国防和监控。经典路径规划方法和无记忆强化学习在有限视场（FoV）和遮挡条件下常常失效，做出不安全或低效的机动。我们提出了一个分层导航框架，将深度Transformer Q网络（DTQN）作为高层子目标选择器，与用于航点执行的模块化低层控制器相结合。DTQN以任务感知特征的短历史为输入，这些特征编码了里程计、目标方向、障碍物接近度和可见性线索，并输出Q值以对候选子目标进行排序。可见性感知的候选生成引入了掩蔽和暴露惩罚，奖励对掩体的利用和预见性的安全行为。随后，低层势场控制器跟踪选定的子目标，确保平滑的短时域避障。我们在二维仿真中验证了该方法，并通过将点云感知投影到同一特征模式中，直接将其扩展到三维Unity-ROS环境，实现无需更改架构的迁移。结果显示，在成功率、安全裕度和到达目标时间方面，该方法相较经典规划器和强化学习基线均有一致的提升，消融实验证实了时间记忆和可见性感知候选设计的价值。这些发现凸显了一个在不确定性下实现安全导航的可泛化框架，对各类机器人平台具有广泛意义。

Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization

Clinical-R1：通过临床目标相对策略优化，赋能大型语言模型实现忠实且全面的推理

Authors: Boyang Gu, Hongjian Zhou, Bradley Max Segal, Jinge Wu, Zeyu Cao, Hantao Zhong, Lei Clifton, Fenglin Liu, David A. Clifton
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00601
Pdf link: https://arxiv.org/pdf/2512.00601
Abstract Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pretraining and post-training reinforcement learning, demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. The experiments on three benchmarks demonstrate that our CRPO substantially improves reasoning on truthfulness and completeness over standard GRPO while maintaining comfortable accuracy enhancements. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains.
中文摘要 大型语言模型（LLMs）的最新进展通过大规模预训练和后训练强化学习展现了强大的推理能力，DeepSeek-R1即为例证。然而，当前的后训练方法，如分组相对策略优化（GRPO），主要奖励正确性，这与医学等高风险领域所需的多维目标并不一致，在这些领域中推理还必须忠实且全面。我们引入了临床目标相对策略优化（CRPO），这是一种可扩展、多目标、可验证的强化学习方法，旨在使LLM的后训练与临床推理原则对齐。CRPO集成了基于规则且可验证的奖励信号，在不依赖人工标注的情况下联合优化准确性、忠实性和全面性。为证明其有效性，我们训练了Clinical-R1-3B，一个3B参数的临床推理模型。在三个基准上的实验表明，与标准GRPO相比，CRPO显著提升了推理的真实性和完整性，同时保持了令人满意的准确率提升。该框架为使LLM推理与临床目标对齐提供了一条可扩展的路径，使医疗AI系统更安全、更具协作性，同时也凸显了多目标、可验证的强化学习方法在医学领域LLM后训练扩展中的潜力。

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

当人类偏好翻转：面向RLHF的实例依赖鲁棒损失

Authors: Yifan Xu, Xichen Ye, Yifan Chen, Qiaosheng Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00709
Pdf link: https://arxiv.org/pdf/2512.00709
Abstract Quality of datasets plays an important role in large language model (LLM) alignment. In collecting human feedback, however, preference flipping is ubiquitous and causes corruption in data annotation; the issue necessitates the alignment algorithms with improved robustness against potential flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning with human feedback (RLHF) perspective. We dissect the inherent human intention model and the preference flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO algorithms. In our experiments, we investigate the instance-dependent preference flipping model under multiple circumstances for evaluation of our proposed method, as well as other baseline methods.
中文摘要 数据集质量在大型语言模型（LLM）对齐中起着重要作用。然而，在收集人类反馈时，偏好翻转无处不在，并会导致数据标注受损；这一问题要求对齐算法对潜在的翻转样本对具有更强的鲁棒性。为此，本文从基于人类反馈的强化学习（RLHF）视角出发，提出了一种针对偏好翻转的翻转感知直接偏好优化（FA-DPO）算法。我们将内在的人类意图模型与外部因素引入的偏好翻转机制拆分为两个不同的阶段；在后一阶段，我们在Bradley-Terry（BT）模型的基础上引入了实例依赖的翻转概率。此外，通过利用与偏好标注相关的特征，我们刻画了判断中的不确定性，并对偏好翻转的模式进行建模。在实践中，我们设计了一个简单而高效的迭代优化算法，可与原始的RLHF和DPO算法兼容。在实验中，我们在多种情形下考察了实例依赖的偏好翻转模型，以评估我们提出的方法以及其他基线方法。
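
To make the two-stage view in the FA-DPO abstract concrete, the sketch below layers a flipping probability on top of the Bradley-Terry (BT) model. It uses the standard label-noise mixture: if the clean BT probability of the annotated order is sigmoid(margin), a pair flipped with probability `flip_prob` is observed in that order with probability (1 - flip_prob) * sigmoid(margin) + flip_prob * (1 - sigmoid(margin)). The abstract does not state the paper's loss or how the instance-dependent probability is predicted, so `flip_prob` is simply passed in per example and all names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def flip_aware_nll(margin: float, flip_prob: float) -> float:
    """Negative log-likelihood of an annotated preference under BT plus flipping.

    margin:    reward (or DPO implicit-reward) difference, chosen minus rejected.
    flip_prob: per-example probability that the annotation was flipped
               (ASSUMPTION: supplied externally; must lie in [0, 0.5)).
    """
    if not 0.0 <= flip_prob < 0.5:
        raise ValueError("flip_prob must be in [0, 0.5)")
    p_clean = sigmoid(margin)
    p_observed = (1.0 - flip_prob) * p_clean + flip_prob * (1.0 - p_clean)
    return -math.log(max(p_observed, 1e-12))  # clamp guards against log(0)


# A pair the model strongly disagrees with (margin = -5):
print(flip_aware_nll(-5.0, 0.0))  # plain BT loss, large
print(flip_aware_nll(-5.0, 0.2))  # flip-aware loss, bounded by -log(0.2)
```

With `flip_prob = 0` the expression reduces to the usual BT negative log-likelihood; with a positive flipping probability the loss on a strongly contradicted pair is capped near `-log(flip_prob)`, which limits how hard a possibly mislabeled pair can pull on the model.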

Upcycled and Merged MoE Reward Model for Mitigating Reward Hacking

用于缓解奖励欺骗（reward hacking）的升级改造与合并式MoE奖励模型

Authors: Lingling Fu
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2512.00724
Pdf link: https://arxiv.org/pdf/2512.00724
Abstract Reward models play a critical role in Reinforcement Learning from Human Feedback (RLHF) by assessing the consistency between generated outputs and human preferences. However, conventional reward models are prone to reward hacking or over-optimization, where the policy exploits shortcut patterns to obtain high reward scores that do not reflect true human preference. Although Mixture-of-Experts (MoE)-based reward models can enhance discriminative capability, they typically introduce substantial computational overhead. To address these challenges, we propose an upcycle and merge MoE reward modeling approach. We first upcycle a dense reward model into a MoE architecture, where a shared expert captures general knowledge, while normal experts specialize in instruction-specific patterns. We then apply routing-weight normalization and merge experts back into a dense model through a learnable weight-averaging mechanism, preserving performance gains while significantly reducing inference cost. Experimental results demonstrate that our method effectively mitigates reward hacking across various model scales. Our work highlights the potential of upcycle and merge MoE structures for improving both robustness and efficiency of RLHF reward models.
中文摘要 奖励模型通过评估生成输出与人类偏好之间的一致性，在基于人类反馈的强化学习（RLHF）中发挥关键作用。然而，传统奖励模型容易出现奖励欺骗（reward hacking）或过度优化，即策略利用捷径模式获得高奖励分数，而这些分数并不反映真实的人类偏好。虽然基于专家混合（MoE）的奖励模型可以增强判别能力，但通常会带来较大的计算开销。为应对这些挑战，我们提出了一种升级改造（upcycle）并合并的MoE奖励建模方法。我们首先将稠密奖励模型升级改造为MoE架构，其中共享专家负责捕获通用知识，而普通专家则专注于特定指令的模式。随后，我们应用路由权重归一化，并通过可学习的权重平均机制将专家合并回稠密模型，在保留性能增益的同时显著降低推理成本。实验结果表明，我们的方法在不同模型规模上都能有效缓解奖励欺骗。我们的工作凸显了升级改造与合并MoE结构在提升RLHF奖励模型鲁棒性和效率方面的潜力。
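
The merge step described above (experts folded back into a dense model through learnable weight averaging) can be sketched in a few lines. The sketch assumes the averaging coefficients come from a softmax over learnable logits, one per expert, and that all experts share the same weight shape; both are illustrative assumptions, since the abstract does not specify the parameterization.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn logits into non-negative coefficients that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights: Sequence[Matrix],
                  merge_logits: Sequence[float]) -> Matrix:
    """Average expert weight matrices into a single dense matrix.

    ASSUMPTION: coefficients = softmax(merge_logits); in training the logits
    would be learned, here they are plain numbers.
    """
    if len(expert_weights) != len(merge_logits):
        raise ValueError("need exactly one logit per expert")
    coeffs = softmax(merge_logits)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for coeff, weights in zip(coeffs, expert_weights):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += coeff * weights[i][j]
    return merged


shared_expert = [[1.0, 2.0], [0.0, 1.0]]
normal_expert = [[3.0, 4.0], [2.0, 1.0]]
dense = merge_experts([shared_expert, normal_expert], merge_logits=[0.0, 0.0])
print(dense)  # equal logits give a plain average of the two experts
```

After merging, inference runs a single dense layer, so the router and the extra experts no longer add cost at inference time, which matches the efficiency argument in the abstract.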

MS-PPO: Morphological-Symmetry-Equivariant Policy for Legged Robot Locomotion

MS-PPO：腿式机器人运动的形态-对称-等变策略

Authors: Sizhe Wei, Xulin Chen, Fengze Xie, Garrett Ethan Katz, Zhenyu Gan, Lu Gan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.00727
Pdf link: https://arxiv.org/pdf/2512.00727
Abstract Reinforcement learning has recently enabled impressive locomotion capabilities on legged robots; however, most policy architectures remain morphology- and symmetry-agnostic, leading to inefficient training and limited generalization. This work introduces MS-PPO, a morphological-symmetry-equivariant policy learning framework that encodes robot kinematic structure and morphological symmetries directly into the policy network. We construct a morphology-informed graph neural architecture that is provably equivariant with respect to the robot's morphological symmetry group actions, ensuring consistent policy responses under symmetric states while maintaining invariance in value estimation. This design eliminates the need for tedious reward shaping or costly data augmentation, which are typically required to enforce symmetry. We evaluate MS-PPO in simulation on Unitree Go2 and Xiaomi CyberDog2 robots across diverse locomotion tasks, including trotting, pronking, slope walking, and bipedal turning, and further deploy the learned policies on hardware. Extensive experiments show that MS-PPO achieves superior training stability, symmetry generalization ability, and sample efficiency in challenging locomotion tasks, compared to state-of-the-art baselines. These findings demonstrate that embedding both kinematic structure and morphological symmetry into policy learning provides a powerful inductive bias for legged robot locomotion control. Our code will be made publicly available at this https URL.
中文摘要 强化学习最近使足式机器人获得了令人印象深刻的运动能力；然而，大多数策略架构仍然不考虑形态和对称性，导致训练效率低下且泛化有限。本研究提出了MS-PPO，一种形态对称等变的策略学习框架，将机器人的运动学结构和形态对称性直接编码到策略网络中。我们构建了一个融入形态信息的图神经网络架构，该架构可证明对机器人形态对称群的作用具有等变性，确保在对称状态下策略响应一致，同时保持价值估计的不变性。这种设计免去了繁琐的奖励塑形或昂贵的数据增强，而这些通常是强制对称性所必需的。我们在仿真中于Unitree Go2和小米CyberDog2机器人上评估了MS-PPO，涵盖多种运动任务，包括小跑（trotting）、四足弹跳（pronking）、斜坡行走和双足转向，并进一步将学到的策略部署到硬件上。大量实验表明，与最先进的基线相比，MS-PPO在具有挑战性的运动任务中实现了更优的训练稳定性、对称泛化能力和样本效率。这些发现表明，将运动学结构和形态对称性嵌入策略学习，可为足式机器人运动控制提供强大的归纳偏置。我们的代码将在 this https URL 公开发布。

AI Agent for Source Finding by SoFiA-2 for SKA-SDC2

用于SKA-SDC2的SoFiA-2源查找AI智能体

Authors: Xingchen Zhou, Nan Li, Peng Jia, Yingfeng Liu, Furen Deng, Shuanghao Shu, Ying Li, Liang Cao, Huanyuan Shan, Ayodeji Ibitoye
Subjects: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA)
Arxiv link: https://arxiv.org/abs/2512.00769
Pdf link: https://arxiv.org/pdf/2512.00769
Abstract Source extraction is crucial in analyzing data from next-generation, large-scale sky surveys in radio bands, such as the Square Kilometre Array (SKA). Several source extraction programs, including SoFiA and Aegean, have been developed to address this challenge. However, finding optimal parameter configurations when applying these programs to real observations is non-trivial. For example, the outcomes of SoFiA intensely depend on several key parameters across its preconditioning, source-finding, and reliability-filtering modules. To address this issue, we propose a framework to automatically optimize these parameters using an AI agent based on a state-of-the-art reinforcement learning (RL) algorithm, i.e., Soft Actor-Critic (SAC). The SKA Science Data Challenge 2 (SDC2) dataset is utilized to assess the feasibility and reliability of this framework. The AI agent interacts with the environment by adjusting parameters based on the feedback from the SDC2 score defined by the SDC2 Team, progressively learning to select parameter sets that yield improved performance. After sufficient training, the AI agent can automatically identify an optimal parameter configuration that outperforms the benchmark set by Team SoFiA within only 100 evaluation steps and with reduced time consumption. Our approach could address similar problems requiring complex parameter tuning, beyond radio band surveys and source extraction. Yet, high-quality training sets containing representative observations and catalogs of ground truth are essential.
中文摘要 源提取对于分析下一代大规模射电波段巡天（如平方公里阵列，SKA）的数据至关重要。为应对这一挑战，已经开发了包括SoFiA和Aegean在内的多个源提取程序。然而，在将这些程序应用于实际观测时，找到最优参数配置并非易事。例如，SoFiA的结果高度依赖于其预处理、源查找和可靠性过滤模块中的若干关键参数。为解决这一问题，我们提出了一个框架，利用基于最先进强化学习（RL）算法，即软演员-评论家（Soft Actor-Critic，SAC）的AI智能体来自动优化这些参数。我们使用SKA科学数据挑战2（SDC2）数据集评估该框架的可行性和可靠性。AI智能体根据SDC2团队定义的SDC2分数所给出的反馈调整参数，以此与环境交互，逐步学会选择能提升性能的参数集。经过充分训练后，AI智能体仅需100个评估步骤即可自动找到优于SoFiA团队所设基准的最优参数配置，且耗时更少。我们的方法可用于解决射电波段巡天和源提取之外、同样需要复杂参数调优的类似问题。不过，包含代表性观测和真值星表的高质量训练集必不可少。

What Is Preference Optimization Doing, How and Why?

偏好优化在做什么，如何以及为什么？

Authors: Yue Wang, Qizhou Wang, Zizhuo Zhang, Ang Li, Gang Niu, Bo Han, Masashi Sugiyama
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00778
Pdf link: https://arxiv.org/pdf/2512.00778
Abstract Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation, thus validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key components in PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the learning targets meanwhile mutually offset each other. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to absolute values of token-level advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
中文摘要 偏好优化（PO）对于大型语言模型（LLM）不可或缺，直接偏好优化（DPO）和近端策略优化（PPO）等方法取得了巨大成功。普遍认为DPO属于监督学习，而PPO属于强化学习，但对这些差异背后原因的深入分析仍然缺乏。为填补这一空白，我们分析了它们的优化动态，揭示了不同的算法行为并阐明其背后的原因。首先，我们考察基于梯度的更新的目标方向，发现DPO遵循稳定的目标，而PPO遵循在探索与利用之间取得平衡的动态目标，从而从新视角验证了这一普遍观点。其次，我们考察了正向学习、负向学习和损失重加权的作用，这三者是PO方法的关键组成部分。我们的分析显示，这些组成部分扮演着相当不同的角色。在DPO中，正向学习和负向学习共同塑造学习目标，同时相互抵消；而DPO中的损失重加权与其说是奖励信号，不如说是用于缓解过拟合的正则项。在PPO中，负向学习主要支持探索，而非决定目标；同时，与词元级优势绝对值相关的损失重加权表明了不同词元组在更新目标中的不同作用。基于这些发现，我们进行了精心设计的消融研究，进一步考察控制这些动态如何影响优化效率和实际性能。分析所得的见解不仅加深了对PO方法的理解，也有助于启发更符合偏好的大型语言模型的开发。
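The three components named in the abstract (positive learning, negative learning, loss reweighting) can be seen in the standard DPO loss. The sketch below is not taken from the paper; it uses the widely known DPO objective, and the function and argument names are hypothetical. The gradient of this loss raises the log-probability of the chosen response (positive learning), lowers that of the rejected response (negative learning), and scales both by `weight` (loss reweighting).

```python
import math

def dpo_loss_and_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return the standard DPO loss and its implicit per-pair gradient weight.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    # Implicit reward margin between the chosen (w) and rejected (l) responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), written in a numerically friendlier form.
    loss = math.log1p(math.exp(-margin))
    # The gradient is scaled by sigmoid(-margin): large when the pair is mis-ranked.
    weight = 1.0 / (1.0 + math.exp(margin))
    return loss, weight
```

With this form, a pair the policy already ranks correctly by a wide margin gets a small weight, which is consistent with reading the weight as a damping term rather than a reward.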

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

用草稿思考：高效长视频理解的推测性时间推理

Authors: Pengfei Hu, Meng Cao, Yingyao Wang, Yi Wang, Jiahua Dong, Jun Song, Yu Cheng, Bo Zheng, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.00805
Pdf link: https://arxiv.org/pdf/2512.00805
Abstract Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
中文摘要 长视频理解对于类人智能至关重要，它使模型能够在较长的时间上下文中进行连贯的感知和推理。新兴的“帧思维”（thinking-with-frames）范式在全局时间推理和局部帧检查之间交替进行，提升了视频多模态大型语言模型（MLLM）的推理能力，但由于多模态上下文不断增长且冗余，它面临显著的效率瓶颈。为此，我们提出了SpecTemp，一种基于强化学习的推测性时间推理框架，通过协同的双模型设计将时间感知与推理解耦。在SpecTemp中，轻量级的草稿MLLM从密集采样的时间区域中快速探索并提出显著帧，而强大的目标MLLM则专注于时间推理并验证草稿模型的提议，迭代地细化其注意力直至收敛。这种设计类似于人脑的协作通路，兼顾效率与准确性。为支持训练，我们构建了SpecTemp-80K数据集，提供同步的双层标注，包括粗粒度的证据片段和细粒度的帧级证据。在多个视频理解基准上的实验表明，SpecTemp不仅保持了有竞争力的准确率，而且相比现有的“帧思维”方法显著加快了推理速度。

ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

ReJump：一种用于分析和提升大型语言模型推理的树跳表示法

Authors: Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00831
Pdf link: https://arxiv.org/pdf/2512.00831
Abstract Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at this https URL.
中文摘要 大型推理模型（LRM）是经过专门训练以生成长篇思维链（CoT）的大型语言模型（LLM），在数学和编程等具有挑战性的任务上取得了显著成功。然而，人们对其背后的推理“算法”仍然知之甚少。为研究这一点，我们提出了ReJump，它将推理轨迹表示为对中间解题步骤树中节点的访问顺序。节点之间的转移称为跳跃，包括刻画计算等行为的相邻移动，以及刻画回溯和验证等行为的非相邻移动。ReJump支持用多种指标分析LLM推理，量化探索、利用、过度思考、遗忘和验证。我们利用所提出的LLM智能体将推理轨迹提取为ReJump格式，在两个任务上评估了最先进的LRM，发现准确率相近的模型可能表现出不同的推理行为，而不同任务偏好不同的推理风格（例如探索与利用之间的平衡不同）。为了进一步理解学习策略如何塑造推理，我们利用ReJump比较了蒸馏得到的LRM与其教师模型、CoT提示的LLM与LRM，并考察推理示例数量和强化学习如何影响推理行为。最后，我们展示了ReJump可以通过ReJump引导的Best-of-N选择和提示选择等策略，在测试时提升推理质量。我们的代码在此 https URL 公开。
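The tree-jump idea in the abstract can be illustrated with a small sketch. It assumes, as an interpretation rather than the paper's definition, that a jump is "adjacent" when it follows a parent-child edge of the step tree and "non-adjacent" otherwise; the function name and the dictionary encoding of the tree are hypothetical.

```python
def classify_jumps(parent, visit_order):
    """Label each transition in a visitation order as adjacent or non-adjacent.

    parent: dict mapping each non-root node to its parent node (assumed encoding).
    visit_order: list of nodes in the order the reasoning trace visits them.
    """
    labels = []
    for a, b in zip(visit_order, visit_order[1:]):
        # Assumption: "adjacent" means the two nodes share a tree edge.
        adjacent = parent.get(a) == b or parent.get(b) == a
        labels.append("adjacent" if adjacent else "non-adjacent")
    return labels
```

For a tree where `B` and `C` are children of `A` and `D` is a child of `B`, the order `A, B, D, C, B` yields two adjacent moves followed by two non-adjacent ones, the latter resembling backtracking.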

Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs

超越高熵探索：面向推理大型语言模型的正确性感知低熵片段优势塑造

Authors: Xinzhu Chen, Xuesheng Li, Zhongxiang Sun, Weijie Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00908
Pdf link: https://arxiv.org/pdf/2512.00908
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a central approach for improving the reasoning ability of large language models. Recent work studies RLVR through token entropy, arguing that high-entropy tokens drive exploration and should receive stronger updates. However, this work overlooks the fact that most of a reasoning trajectory consists of low-entropy segments that encode stable and reusable structural patterns. Through qualitative and quantitative analyses, we find that the overlap of low-entropy segments across correct responses strongly correlates with model accuracy, while overlaps involving incorrect responses exhibit stable but unproductive patterns. Motivated by these findings, we propose LESS, a correctness-aware reinforcement framework that performs fine-grained advantage modulation over low-entropy segments. LESS amplifies segments unique to correct responses, suppresses those unique to incorrect ones, and neutralizes segments shared by both, while preserving high-entropy exploration in the underlying RL algorithm. Instantiated on top of the popular GRPO, LESS consistently improves accuracy over strong RL baselines across three backbones and six math benchmarks, and achieves stronger robustness of the performance floor.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的核心方法。近期工作从词元熵的角度研究RLVR，认为高熵词元推动探索，应获得更强的更新。然而，这些工作忽视了一个事实：推理轨迹的大部分由低熵片段组成，这些片段编码了稳定且可复用的结构模式。通过定性和定量分析，我们发现正确响应之间低熵片段的重叠与模型准确率高度相关，而涉及错误响应的重叠则表现出稳定但无益的模式。基于这些发现，我们提出了LESS，一种正确性感知的强化学习框架，对低熵片段进行细粒度的优势调制。LESS放大正确响应独有的片段，抑制错误响应独有的片段，并中和两者共有的片段，同时保留底层强化学习算法中的高熵探索。在流行的GRPO之上实例化后，LESS在三个骨干模型和六个数学基准上相对于强大的强化学习基线持续提升准确率，并使性能下限更加稳健。

Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments

对称破缺环境中的部分等变强化学习

Authors: Junwoo Chang, Minwoo Park, Joohwan Seo, Roberto Horowitz, Jongmin Lee, Jongeun Choi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.00915
Pdf link: https://arxiv.org/pdf/2512.00915
Abstract Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state-action space, resulting in global value estimation errors. To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability. Building on this framework, we present practical RL algorithms -- Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control -- that combine the benefits of equivariance with robustness to symmetry-breaking. Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL.
中文摘要 群对称性为强化学习（RL）提供了强大的归纳偏置，可通过群不变的马尔可夫决策过程（MDP）在对称的状态和动作之间实现高效泛化。然而，现实环境几乎从不满足完全群不变的MDP；动力学、执行器限制和奖励设计通常会破坏对称性，而且往往只是局部破坏。在这种情况下使用群不变的贝尔曼备份时，局部对称破缺引入的误差会在整个状态-动作空间中传播，导致全局的值估计误差。为此，我们引入了部分群不变MDP（PI-MDP），它根据对称性成立的区域选择性地应用群不变或标准的贝尔曼备份。该框架在保持等变性优势的同时，减轻了局部对称破缺带来的误差传播，从而提升样本效率和泛化能力。基于该框架，我们提出了实用的强化学习算法：用于离散控制的部分等变（PE）-DQN和用于连续控制的PE-SAC，它们兼具等变性的优势和对对称破缺的鲁棒性。在Grid-World、运动和操作基准上的实验表明，PE-DQN和PE-SAC显著优于基线，凸显了选择性利用对称性对于稳健且样本高效的强化学习的重要性。

Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

视频扩散模型的目标驱动奖励用于强化学习

Authors: Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.00961
Pdf link: https://arxiv.org/pdf/2512.00961
Abstract Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
中文摘要 强化学习（RL）在多个领域取得了显著成功，但它通常依赖精心设计的程序化奖励函数来引导智能体行为。设计此类奖励函数可能颇具挑战，且未必能在不同任务间很好地泛化。为解决这一局限，我们利用预训练视频扩散模型中蕴含的丰富世界知识，为强化学习智能体提供目标驱动的奖励信号，而无需临时设计奖励。我们的核心想法是利用在大规模视频数据集上预训练的现成视频扩散模型，将其作为基于视频级和帧级目标的信息丰富的奖励函数。对于视频级奖励，我们首先在特定领域数据集上微调预训练的视频扩散模型，然后利用其视频编码器评估智能体轨迹的潜在表征与生成的目标视频之间的对齐程度。为了实现更细粒度的目标达成，我们使用CLIP从生成的视频中找出最相关的一帧，由此得到帧级目标，并将该帧作为目标状态。随后，我们采用学习得到的前向-后向表示，它刻画从给定状态-动作对出发访问目标状态的概率，并将其作为帧级奖励，以促进更连贯、更具目标导向的轨迹。在多个Meta-World任务上的实验证明了我们方法的有效性。

Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search

通过强化学习优化小红书搜索中的生成排名相关性

Authors: Ziyang Zeng, Heming Jing, Jindong Chen, Xiangli Li, Hongyu Liu, Yixuan He, Zhengyu Li, Yige Sun, Zheyong Xie, Yuqing Yang, Shaosheng Cao, Jun Fan, Yi Wu, Yao Hu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.00968
Pdf link: https://arxiv.org/pdf/2512.00968
Abstract Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model to a lightweight version suitable for real-world search systems. Extensive experiments on industrial datasets, along with online A/B tests, demonstrate the effectiveness of our approach.
中文摘要 相关性排序是搜索引擎中的一项基本任务，旨在识别与给定用户查询最相关的条目。传统相关性模型通常输出标量分数或直接预测相关性标签，这限制了可解释性以及对复杂相关性信号的建模。受思维链（Chain-of-Thought，CoT）推理在复杂任务上最新进展的启发，我们探讨显式推理能否同时提升相关性建模的可解释性和性能。然而，现有基于推理的生成式相关性模型（GRM）主要依赖在大量人工标注或合成的CoT数据上进行监督微调，这往往导致泛化能力有限。此外，领域无关的自由形式推理往往过于泛化且缺乏充分依据，限制了其处理开放域搜索中普遍存在的多样且模糊情形的潜力。在本工作中，我们将小红书搜索中的相关性建模表述为推理任务，并引入基于强化学习（RL）的训练框架，以增强GRM有依据的推理能力。具体来说，我们将实用的业务相关性标准融入多步推理提示的设计，并提出逐步优势掩码（SAM），这是一种轻量级的过程监督策略，通过改进信用分配促进对这些标准的有效学习。为了实现工业部署，我们进一步将大规模RL调优模型蒸馏为适用于实际搜索系统的轻量级版本。在工业数据集上的大量实验以及在线A/B测试证明了我们方法的有效性。

AltNet: Addressing the Plasticity-Stability Dilemma in Reinforcement Learning

AltNet：解决强化学习中的可塑性与稳定性困境

Authors: Mansi Maheshwari, John C. Raisbeck, Bruno Castro da Silva
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01034
Pdf link: https://arxiv.org/pdf/2512.01034
Abstract Neural networks have shown remarkable success in supervised learning when trained on a single task using a fixed dataset. However, when neural networks are trained on a reinforcement learning task, their ability to continue learning from new experiences declines over time. This decline in learning ability is known as plasticity loss. To restore plasticity, prior work has explored periodically resetting the parameters of the learning network, a strategy that often improves overall performance. However, such resets come at the cost of a temporary drop in performance, which can be dangerous in real-world settings. To overcome this instability, we introduce AltNet, a reset-based approach that restores plasticity without performance degradation by leveraging twin networks. The use of twin networks anchors performance during resets through a mechanism that allows networks to periodically alternate roles: one network learns as it acts in the environment, while the other learns off-policy from the active network's interactions and a replay buffer. At fixed intervals, the active network is reset and the passive network, having learned from prior experiences, becomes the new active network. AltNet restores plasticity, improving sample efficiency and achieving higher performance, while avoiding performance drops that pose risks in safety-critical settings. We demonstrate these advantages in several high-dimensional control tasks from the DeepMind Control Suite, where AltNet outperforms various relevant baseline methods, as well as state-of-the-art reset-based techniques.
中文摘要 神经网络在使用固定数据集训练单一任务时，在监督学习中取得了显著成功。然而，当神经网络在强化学习任务上训练时，它们从新经验中持续学习的能力会随时间下降。这种学习能力的下降被称为可塑性丧失。为恢复可塑性，先前工作探索了周期性重置学习网络的参数，这一策略通常能提升整体性能。然而，这种重置会带来暂时的性能下降，这在现实环境中可能很危险。为克服这种不稳定性，我们引入了AltNet，一种基于重置的方法，它利用孪生网络在不降低性能的前提下恢复可塑性。孪生网络通过一种让两个网络周期性交替角色的机制，在重置期间稳住性能：一个网络在环境中行动的同时学习，另一个网络则从活动网络的交互和回放缓冲区中进行离策略学习。每隔固定间隔，活动网络被重置，而已从先前经验中学习的被动网络成为新的活动网络。AltNet恢复了可塑性，提高了样本效率并取得更高的性能，同时避免了在安全关键场景中带来风险的性能下降。我们在DeepMind Control Suite的多个高维控制任务中展示了这些优势，AltNet在这些任务中优于多种相关基线方法以及最先进的基于重置的技术。
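The role alternation described in the abstract can be sketched as a schedule. This is a minimal illustration of the reset-and-swap pattern only, under the assumption that the reset and the promotion happen at the same step; the network names and the function are hypothetical, and no learning is performed here.

```python
def altnet_schedule(total_steps, reset_interval):
    """Simulate which of two networks is active and when resets occur.

    Returns the network that is active after the last step and a list of
    (step, reset_description, promotion_description) events.
    """
    active, passive = "net_a", "net_b"
    events = []
    for step in range(1, total_steps + 1):
        # In a real agent, the active network would act and learn here, while
        # the passive one learns off-policy from the shared replay buffer.
        if step % reset_interval == 0:
            # Reset the active network, then promote the passive one.
            events.append((step, f"reset {active}", f"promote {passive}"))
            active, passive = passive, active
    return active, events
```

Because the freshly reset network becomes passive, it never has to act in the environment until it has trained for a full interval, which is the property the abstract credits for avoiding performance drops.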

Shielded Controller Units for RL with Operational Constraints Applied to Remote Microgrids

面向带运行约束的强化学习的屏蔽控制单元及其在远程微电网中的应用

Authors: Hadi Nekoei, Alexandre Blondin Massé, Rachid Hassani, Sarath Chandar, Vincent Mai
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01046
Pdf link: https://arxiv.org/pdf/2512.01046
Abstract Reinforcement learning (RL) is a powerful framework for optimizing decision-making in complex systems under uncertainty, an essential challenge in real-world settings, particularly in the context of the energy transition. A representative example is remote microgrids that supply power to communities disconnected from the main grid. Enabling the energy transition in such systems requires coordinated control of renewable sources like wind turbines, alongside fuel generators and batteries, to meet demand while minimizing fuel consumption and battery degradation under exogenous and intermittent load and wind conditions. These systems must often conform to extensive regulations and complex operational constraints. To ensure that RL agents respect these constraints, it is crucial to provide interpretable guarantees. In this paper, we introduce Shielded Controller Units (SCUs), a systematic and interpretable approach that leverages prior knowledge of system dynamics to ensure constraint satisfaction. Our shield synthesis methodology, designed for real-world deployment, decomposes the environment into a hierarchical structure where each SCU explicitly manages a subset of constraints. We demonstrate the effectiveness of SCUs on a remote microgrid optimization task with strict operational requirements. The RL agent, equipped with SCUs, achieves a 24% reduction in fuel consumption without increasing battery degradation, outperforming other baselines while satisfying all constraints. We hope SCUs contribute to the safe application of RL to the many decision-making challenges linked to the energy transition.
中文摘要 强化学习（RL）是在不确定性下优化复杂系统决策的强大框架，而这类决策是现实场景中的一项核心挑战，在能源转型背景下尤其如此。一个有代表性的例子是远程微电网，它为未接入主电网的社区供电。在此类系统中推进能源转型，需要协调控制风力涡轮机等可再生能源以及燃油发电机和电池，在外生且间歇性的负载和风力条件下满足需求，同时尽量减少燃料消耗和电池劣化。这些系统通常必须遵守大量法规和复杂的运行约束。为了确保强化学习智能体遵守这些约束，提供可解释的保证至关重要。本文介绍屏蔽控制单元（SCU），这是一种系统化且可解释的方法，利用系统动力学的先验知识确保约束得到满足。我们的屏蔽合成方法面向实际部署而设计，将环境分解为层级结构，其中每个SCU显式管理一部分约束。我们在一个具有严格运行要求的远程微电网优化任务上展示了SCU的有效性。配备SCU的RL智能体在不增加电池劣化的情况下将燃料消耗降低了24%，优于其他基线，同时满足所有约束。我们希望SCU有助于将强化学习安全地应用于与能源转型相关的众多决策挑战。

Automating the Refinement of Reinforcement Learning Specifications

自动化强化学习规范的细化

Authors: Tanmay Ambadkar, Đorđe Žikelić, Abhinav Verma
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01047
Pdf link: https://arxiv.org/pdf/2512.01047
Abstract Logical specifications have been shown to help reinforcement learning algorithms in achieving complex tasks. However, when a task is under-specified, agents might fail to learn useful policies. In this work, we explore the possibility of improving coarse-grained logical specifications via an exploration-guided strategy. We propose AutoSpec, a framework that searches for a logical specification refinement whose satisfaction implies satisfaction of the original specification, but which provides additional guidance, thereby making it easier for reinforcement learning algorithms to learn useful policies. AutoSpec is applicable to reinforcement learning tasks specified via the SpectRL specification logic. We exploit the compositional nature of specifications written in SpectRL, and design four refinement procedures that modify the abstract graph of the specification by either refining its existing edge specifications or by introducing new edge specifications. We prove that all four procedures maintain specification soundness, i.e. any trajectory satisfying the refined specification also satisfies the original. We then show how AutoSpec can be integrated with existing reinforcement learning algorithms for learning policies from logical specifications. Our experiments demonstrate that AutoSpec yields promising improvements in terms of the complexity of control tasks that can be solved, when refined logical specifications produced by AutoSpec are utilized.
中文摘要 逻辑规范已被证明有助于强化学习算法完成复杂任务。然而，当任务的规范不够充分时，智能体可能无法学到有用的策略。本工作探讨通过探索引导的策略改进粗粒度逻辑规范的可能性。我们提出了AutoSpec，这一框架搜索逻辑规范的一个细化：满足该细化即意味着满足原始规范，但它提供了额外的指导，从而使强化学习算法更容易学到有用的策略。AutoSpec适用于通过SpectRL规范逻辑指定的强化学习任务。我们利用SpectRL规范的组合特性，设计了四种细化过程，通过细化已有的边规范或引入新的边规范来修改规范的抽象图。我们证明了这四种过程都保持规范的可靠性，即任何满足细化后规范的轨迹也满足原始规范。随后，我们展示了如何将AutoSpec与现有的从逻辑规范中学习策略的强化学习算法集成。我们的实验表明，使用AutoSpec生成的细化逻辑规范后，可求解的控制任务的复杂度有了可观的提升。

Adaptive-lambda Subtracted Importance Sampled Scores in Machine Unlearning for DDPMs and VAEs

面向DDPM和VAE的机器遗忘中的自适应λ相减重要性采样分数

Authors: MohammadParsa Dini, Human Jafari, Sajjad Amini, MohammadMahdi Mojahedian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01054
Pdf link: https://arxiv.org/pdf/2512.01054
Abstract Machine Unlearning is essential for large generative models (VAEs, DDPMs) to comply with the right to be forgotten and prevent undesired content generation without costly retraining. Existing approaches, such as Static-lambda SISS for diffusion models, rely on a fixed mixing weight lambda, which is suboptimal because the required unlearning strength varies across samples and training stages. We propose Adaptive-lambda SISS, a principled extension that turns lambda into a latent variable dynamically inferred at each training step. A lightweight inference network parameterizes an adaptive posterior over lambda, conditioned on contextual features derived from the instantaneous SISS loss terms (retain/forget losses and their gradients). This enables joint optimization of the diffusion model and the lambda-inference mechanism via a variational objective, yielding significantly better trade-offs. We further extend the adaptive-lambda principle to score-based unlearning and introduce a multi-class variant of Score Forgetting Distillation. In addition, we present two new directions: (i) a hybrid objective combining the data-free efficiency of Score Forgetting Distillation with the direct gradient control of SISS, and (ii) a Reinforcement Learning formulation that treats unlearning as a sequential decision process, learning an optimal policy over a state space defined by the model's current memory of the forget set. Experiments on an augmented MNIST benchmark show that Adaptive-lambda SISS substantially outperforms the original static-lambda SISS, achieving stronger removal of forgotten classes while better preserving generation quality on the retain set.
中文摘要 机器遗忘（Machine Unlearning）对于大型生成模型（VAE、DDPM）至关重要，它使模型能够遵守被遗忘权并防止生成不良内容，而无需代价高昂的重新训练。现有方法，如面向扩散模型的静态λ SISS，依赖固定的混合权重λ，但由于所需的遗忘强度随样本和训练阶段而变化，这并非最优。我们提出了自适应λ SISS，这是一种有原则的扩展，将λ变为在每个训练步骤动态推断的潜变量。一个轻量级推断网络对λ的自适应后验进行参数化，其条件是由即时SISS损失项（保留/遗忘损失及其梯度）得出的上下文特征。这使得扩散模型和λ推断机制可以通过变分目标联合优化，从而获得显著更好的权衡。我们进一步将自适应λ原理扩展到基于分数的遗忘，并引入了分数遗忘蒸馏（Score Forgetting Distillation）的多类变体。此外，我们提出了两个新方向：（i）一种混合目标，结合了分数遗忘蒸馏无需数据的高效性与SISS的直接梯度控制；（ii）一种强化学习表述，将遗忘视为序贯决策过程，在由模型当前对遗忘集的记忆所定义的状态空间上学习最优策略。在增强的MNIST基准上的实验表明，自适应λ SISS显著优于原始的静态λ SISS，能更彻底地移除被遗忘的类别，同时更好地保持保留集上的生成质量。

Reinforcement Learning for Gliding Projectile Guidance and Control

滑翔弹丸制导与控制的强化学习

Authors: Joel Cahn, Antonin Thomas, Philippe Pastor
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.01066
Pdf link: https://arxiv.org/pdf/2512.01066
Abstract This paper presents the development of a control law intended to be implemented on an optically guided glider. The guidance law follows an innovative approach: reinforcement learning. The control law is used to make navigation more flexible and autonomous in a dynamic environment. The final objective is to track a target detected with the camera and then guide the glider to this point with high precision. Reinforcement learning has already been applied to quadcopter drones; with this study, we wish to demonstrate its applicability to fixed-wing aircraft on all of their axes.
中文摘要 本文介绍了一种控制律的开发，该控制律拟用于光学制导滑翔机。这一制导律采用了一种创新的方法，即强化学习。该控制律用于使导航在动态环境中更加灵活和自主。最终目标是跟踪相机检测到的目标，并以高精度将滑翔机引导至该点。强化学习已应用于四旋翼无人机，我们希望通过本研究展示其在固定翼飞行器所有轴上的适用性。

Accelerating Inference of Masked Image Generators via Reinforcement Learning

通过强化学习加速掩码图像生成器的推理

Authors: Pranav Subbaraman, Shufan Li, Siyan Zhao, Aditya Grover
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.01094
Pdf link: https://arxiv.org/pdf/2512.01094
Abstract Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose Speed-RL, a novel paradigm for accelerating pretrained MGMs to generate high-quality images in fewer steps. Unlike conventional distillation methods, which formulate the acceleration problem as a distribution matching problem where a few-step student model is trained to match the distribution generated by a many-step teacher model, we consider this problem as a reinforcement learning problem. Since the goal of acceleration is to generate high-quality images in fewer steps, we can combine a quality reward with a speed reward and finetune the base model using reinforcement learning with the combined reward as the optimization target. Through extensive experiments, we show that the proposed method is able to accelerate the base model by a factor of 3x while maintaining comparable image quality.
中文摘要 掩码生成模型（MGM）在生成高保真图像方面展现了强大的能力。然而，它们需要大量采样步骤才能得到高质量的生成结果，导致推理速度较慢。在本工作中，我们提出了Speed-RL，一种新颖的范式，用于加速预训练的MGM，使其在更少的步数内生成高质量图像。传统的蒸馏方法将加速问题表述为分布匹配问题，即训练一个少步的学生模型去匹配多步教师模型生成的分布；与之不同，我们将该问题视为强化学习问题。由于加速的目标是在更少的步数内生成高质量图像，我们可以将质量奖励与速度奖励结合起来，并以组合奖励为优化目标，利用强化学习微调基础模型。通过大量实验，我们表明所提方法能够将基础模型加速3倍，同时保持相当的图像质量。
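The abstract says a quality reward is combined with a speed reward but does not give the formula. The sketch below shows one possible combination purely as an assumption: the linear speed term, the `speed_weight` parameter, and the function name are all hypothetical and are not taken from the paper.

```python
def combined_reward(quality, steps_used, max_steps, speed_weight=0.5):
    """Combine a quality score with a bonus for using fewer sampling steps.

    quality: score from some image-quality reward model (assumed to exist).
    steps_used / max_steps: sampling steps taken versus the allowed budget.
    """
    if not 0 < steps_used <= max_steps:
        raise ValueError("steps_used must be in (0, max_steps]")
    # Hypothetical speed reward in [0, 1): larger when fewer steps are used.
    speed = 1.0 - steps_used / max_steps
    return quality + speed_weight * speed
```

Under this form, two generations of equal quality are ranked by step count, so the policy is pushed toward shorter sampling schedules only when quality does not suffer more than the weighted speed gain.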

World Model Robustness via Surprise Recognition

通过意外识别实现世界模型稳健性

Authors: Geigh Zollicoffer, Tanush Chopra, Mingkuan Yan, Xiaoxu Ma, Kenneth Eaton, Mark Riedl
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01119
Pdf link: https://arxiv.org/pdf/2512.01119
Abstract AI systems deployed in the real world must contend with distractions and out-of-distribution (OOD) noise that can destabilize their policies and lead to unsafe behavior. While robust training can reduce sensitivity to some forms of noise, it is infeasible to anticipate all possible OOD conditions. To mitigate this issue, we develop an algorithm that leverages a world model's inherent measure of surprise to reduce the impact of noise in world model--based reinforcement learning agents. We introduce both multi-representation and single-representation rejection sampling, enabling robustness to settings with multiple faulty sensors or a single faulty sensor. While the introduction of noise typically degrades agent performance, we show that our techniques preserve performance relative to baselines under varying types and levels of noise across multiple environments within self-driving simulation domains (CARLA and Safety Gymnasium). Furthermore, we demonstrate that our methods enhance the stability of two state-of-the-art world models with markedly different underlying architectures: Cosmos and DreamerV3. Together, these results highlight the robustness of our approach across world modeling domains. We release our code at this https URL .
中文摘要 部署在现实世界中的人工智能系统必须应对干扰和分布外（OOD）噪声，这些噪声可能破坏其策略的稳定性并导致不安全行为。虽然鲁棒训练可以降低对某些形式噪声的敏感性，但预先考虑所有可能的OOD条件并不可行。为缓解这一问题，我们开发了一种算法，利用世界模型固有的惊讶度量，减少噪声对基于世界模型的强化学习智能体的影响。我们引入了多表示和单表示两种拒绝采样，使智能体在多个传感器故障或单个传感器故障的情形下都具有鲁棒性。虽然引入噪声通常会降低智能体性能，但我们表明，在自动驾驶仿真领域（CARLA和Safety Gymnasium）的多个环境中，在不同类型和强度的噪声下，我们的技术相对于基线保持了性能。此外，我们证明了我们的方法提升了两种底层架构截然不同的最先进世界模型（Cosmos和DreamerV3）的稳定性。这些结果共同凸显了我们方法在不同世界建模领域中的鲁棒性。我们在此 https URL 发布代码。

Mode-Conditioning Unlocks Superior Test-Time Scaling

模式条件化解锁更优的测试时扩展

Authors: Chen Henry Wu, Sachin Goyal, Aditi Raghunathan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.01127
Pdf link: https://arxiv.org/pdf/2512.01127
Abstract Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
中文摘要 并行采样有望在测试时扩展（test-time scaling）中带来显著收益，但其有效性受到多样性坍缩的严重限制：模型集中于少数模式，重复采样会产生相同的错误。我们提出了模式条件化（ModC）框架，它使用专家模型或模式特定的前缀，在不同推理模式之间显式分配测试时计算量。ModC在受控的图搜索任务和大规模推理基准上持续改善扩展效果，涵盖多个模型家族以及从0.5B到7B的模型规模。在OpenThoughts上，使用ModC微调Qwen2.5-7B相比标准训练实现了4倍的效率提升，同时提高了可达到的最大Pass@k。我们进一步表明，梯度聚类使ModC无需显式的模式标签即可使用，在NuminaMath等数据集上带来最高10%的提升。最后，我们表明ModC能改进强化学习（RL），并能进一步增强以提升多样性为目标的强化学习方法。这些结果表明，标准训练未能充分利用数据中的多样性，而ModC为在测试时扩展中释放多样性的全部优势提供了一种简单有效的补救办法。

A TinyML Reinforcement Learning Approach for Energy-Efficient Light Control in Low-Cost Greenhouse Systems

一种用于低成本温室系统中节能光照控制的TinyML强化学习方法

Authors: Mohamed Abdallah Salem (1), Manuel Cuevas Perez (1), Ahmed Harb Rabia (1) ((1) North Dakota State University)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.01167
Pdf link: https://arxiv.org/pdf/2512.01167
Abstract This study presents a reinforcement learning (RL)-based control strategy for adaptive lighting regulation in controlled environments using a low-power microcontroller. A model-free Q-learning algorithm was implemented to dynamically adjust the brightness of a Light-Emitting Diode (LED) based on real-time feedback from a light-dependent resistor (LDR) sensor. The system was trained to stabilize at 13 distinct light intensity levels (L1 to L13), with each target corresponding to a specific range within the 64-state space derived from LDR readings. A total of 130 trials were conducted, covering all target levels with 10 episodes each. Performance was evaluated in terms of convergence speed, steps taken, and time required to reach target states. Box plots and histograms were generated to analyze the distribution of training time and learning efficiency across targets. Experimental validation demonstrated that the agent could effectively learn to stabilize at varying light levels with minimal overshooting and smooth convergence, even in the presence of environmental perturbations. This work highlights the feasibility of lightweight, on-device RL for energy-efficient lighting control and sets the groundwork for multi-modal environmental control applications in resource-constrained agricultural systems.
中文摘要 本研究提出了一种基于强化学习（RL）的控制策略，利用低功耗微控制器在受控环境中实现自适应照明调节。研究实现了一种无模型Q学习算法，根据光敏电阻（LDR）传感器的实时反馈动态调节发光二极管（LED）的亮度。该系统被训练为稳定在13个不同的光强等级（L1到L13），每个目标对应于由LDR读数导出的64状态空间内的特定范围。共进行了130次试验，覆盖所有目标等级，每个等级10个回合。性能评估指标包括收敛速度、所需步数以及达到目标状态所需的时间。研究生成了箱形图和直方图，用于分析训练时间和学习效率在各目标间的分布。实验验证表明，即使存在环境扰动，该智能体也能有效学会稳定在不同的光照水平，过冲极小且收敛平滑。这项工作凸显了轻量级设备端强化学习用于节能照明控制的可行性，并为资源受限农业系统中的多模态环境控制应用奠定了基础。
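
The control loop described in this abstract can be illustrated with a minimal tabular Q-learning sketch. Everything beyond the 64-state space is an assumption made for illustration: the three-action set (dim, hold, brighten), the toy plant in which the LDR bin follows the LED by one bin per step, the distance-based reward, the single target bin (the paper uses ranges), and the uniformly random behaviour policy (valid because Q-learning is off-policy). It is not the authors' firmware.

```python
import random

N_STATES = 64            # discretised LDR readings, as stated in the abstract
ACTIONS = (-1, 0, 1)     # dim, hold, brighten (assumed action set)


def step(state, action, target):
    """Toy plant (assumption): the LDR bin moves one bin per LED adjustment."""
    next_state = min(N_STATES - 1, max(0, state + action))
    reward = -abs(next_state - target)   # closer to the target bin is better
    return next_state, reward


def train(target, episodes=5000, steps=64, alpha=0.5, gamma=0.8, seed=0):
    """Tabular Q-learning driven by a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES)          # random initial brightness
        for _ in range(steps):
            a = rng.randrange(len(ACTIONS))      # explore uniformly
            nxt, r = step(state, ACTIONS[a], target)
            # Standard Q-learning update toward the bootstrapped target.
            q[state][a] += alpha * (r + gamma * max(q[nxt]) - q[state][a])
            state = nxt
    return q


def greedy_rollout(q, start, steps=80):
    """Follow the learned greedy policy and return the final LDR bin."""
    state = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state = min(N_STATES - 1, max(0, state + ACTIONS[a]))
    return state
```

Calling `train(target=40)` and then `greedy_rollout(q, start=0)` should drive the simulated reading to bin 40 and hold it there. The whole Q-table is 64 x 3 floats, which is the kind of footprint that makes on-device learning plausible on a microcontroller.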

Sum Rate Maximization in STAR-RIS-UAV-Assisted Networks: A CA-DDPG Approach for Joint Optimization

STAR-RIS-UAV 辅助网络中的和速率最大化：一种用于联合优化的 CA-DDPG 方法

Authors: Yujie Huang, Haibin Wan, Xiangcheng Li, Tuanfa Qin, Yun Li, Jun Li, Wen Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01202
Pdf link: https://arxiv.org/pdf/2512.01202
Abstract With the rapid advances in programmable materials, reconfigurable intelligent surfaces (RIS) have become a pivotal technology for future wireless communications. The simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) can both transmit and reflect signals, enabling comprehensive signal control and expanding application scenarios. This paper introduces an unmanned aerial vehicle (UAV) to further enhance system flexibility and proposes an optimization design for the spectrum efficiency of the STAR-RIS-UAV-assisted wireless communication system. We present a deep reinforcement learning (DRL) algorithm capable of iteratively optimizing beamforming, phase shifts, and UAV positioning to maximize the system's sum rate through continuous interactions with the environment. To improve exploration in deterministic policies, we introduce a stochastic perturbation factor, which enhances exploration capabilities. As exploration is strengthened, the algorithm's ability to accurately evaluate the state-action value function becomes critical. Thus, based on the deep deterministic policy gradient (DDPG) algorithm, we propose a convolution-augmented deep deterministic policy gradient (CA-DDPG) algorithm that balances exploration and evaluation to improve the system's sum rate. The simulation results demonstrate that the CA-DDPG algorithm effectively interacts with the environment, optimizing the beamforming matrix, phase shift matrix, and UAV location, thereby improving system capacity and achieving better performance than other algorithms.
中文摘要 随着可编程材料的快速发展，可重构智能表面（RIS）已成为未来无线通信的关键技术。同时透射与反射可重构智能表面（STAR-RIS）既能透射又能反射信号，可实现全面的信号控制并拓展应用场景。本文引入无人机（UAV）以进一步提升系统灵活性，并针对STAR-RIS-UAV辅助无线通信系统的频谱效率提出了一种优化设计。我们提出了一种深度强化学习（DRL）算法，能够通过与环境的持续交互，迭代优化波束成形、相移和无人机位置，以最大化系统的和速率。为改善确定性策略的探索，我们引入了随机扰动因子来增强探索能力。随着探索的加强，算法准确评估状态-动作值函数的能力变得至关重要。因此，基于深度确定性策略梯度（DDPG）算法，我们提出了一种卷积增强的深度确定性策略梯度（CA-DDPG）算法，在探索与评估之间取得平衡，以提升系统的和速率。仿真结果表明，CA-DDPG算法能够有效地与环境交互，优化波束成形矩阵、相移矩阵和无人机位置，从而提升系统容量，并取得优于其他算法的性能。

CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions

CoSineVerifier：用于计算导向科学问题的工具增强答案验证

Authors: Ruixiang Feng, Zhenwei An, Yuntao Wen, Ran Le, Yiming Jia, Chen Yang, Zongchao Chen, Lisi Chen, Shen Gao, Shuo Shang, Yang Song, Tao Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01224
Pdf link: https://arxiv.org/pdf/2512.01224
Abstract Answer verification methods are widely employed in language model training pipelines spanning data curation, evaluation, and reinforcement learning with verifiable rewards (RLVR). While prior work focuses on developing unified verifiers applicable across multiple reasoning scenarios, significant challenges remain in computation-oriented scientific domains, such as algebraic equivalence checking and physical constant substitution. In this paper, we introduce CoSineVerifier, a tool-augmented verifier that leverages external executors to perform precise computations and symbolic simplifications. CoSineVerifier enables robust verification that goes beyond simple semantic matching. We propose a novel two-stage pipeline, which begins with cold-start fine-tuning, followed by multi-turn reinforcement learning with tool integration. Extensive experiments conducted on STEM subjects, general QA, and long-form reasoning tasks demonstrate the strong generalization of CoSineVerifier. The results show that CoSineVerifier achieves state-of-the-art performance on VerifyBench-Hard and SCI-Bench. We also employ CoSineVerifier as a reward model in RLVR; the results show that it consistently outperforms both rubric-based and model-based verifiers on AIME'24 and AIME'25, demonstrating strong potential to enhance the reasoning capabilities of LLMs. Our model is released at this https URL.
中文摘要 答案验证方法广泛应用于语言模型训练流程，涵盖数据整理、评估和可验证奖励强化学习（RLVR）。尽管以往工作主要致力于开发适用于多种推理场景的统一验证器，但在以计算为导向的科学领域，例如代数等价性检查和物理常数代入，仍存在重大挑战。本文介绍了CoSineVerifier，一种工具增强的验证器，利用外部执行器进行精确计算和符号化简。CoSineVerifier实现了超越简单语义匹配的稳健验证。我们提出了一种新的两阶段流程，先进行冷启动微调，再进行结合工具集成的多轮强化学习。在STEM学科、通用问答和长篇推理任务上的大量实验表明，CoSineVerifier具有很强的泛化能力。结果显示，CoSineVerifier在VerifyBench-Hard和SCI-Bench上达到了最先进的性能。我们还在RLVR中将CoSineVerifier用作奖励模型，结果显示它在AIME'24和AIME'25上始终优于基于评分标准和基于模型的验证器，显示出增强LLM推理能力的巨大潜力。我们的模型发布于 this https URL。

On the Tension Between Optimality and Adversarial Robustness in Policy Optimization

关于策略优化中最优性与对抗性鲁棒性之间的张力

Authors: Haoran Li, Jiayu Lv, Congying Han, Zicheng Zhang, Anqi Li, Yan Liu, Tiande Guo, Nan Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01228
Pdf link: https://arxiv.org/pdf/2512.01228
Abstract Optimality and adversarial robustness in deep reinforcement learning have long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, a fundamental tension between robustness and optimality arises in practical policy gradient methods. SPO tends to converge to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs. This improves robustness but makes navigation more challenging. To alleviate this, we develop BARPO, a bilevel framework unifying SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.
中文摘要 在深度强化学习中，实现最优性与对抗鲁棒性长期以来被视为相互冲突的目标。尽管如此，CAR中近期提出的理论见解表明二者可能是一致的，这引出了一个重要问题：如何在实践中实现这种一致。本文首先通过比较标准策略优化（SPO）和对抗鲁棒策略优化（ARPO），指出了理论与实践之间的一个关键差距。尽管二者在理论上是一致的，但在实际的策略梯度方法中，鲁棒性与最优性之间存在根本性的张力。SPO倾向于收敛到自然性能强但脆弱的一阶驻点策略（FOSP），而ARPO通常偏向更鲁棒的FOSP，代价是回报降低。此外，我们将这一权衡归因于ARPO中最强对手的重塑效应：它会诱导出具有欺骗性的粘性FOSP，使全局优化地形显著复杂化。这提升了鲁棒性，但也使优化地形更难遍历。为缓解这一问题，我们提出了BARPO，这是一个双层框架，通过调节对手强度来统一SPO和ARPO，从而在保持全局最优的同时改善可遍历性。大量实证结果表明，BARPO始终优于原始ARPO，为协调理论性能与实证性能提供了一种实用方法。

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

PSR：利用成对主体一致性奖励扩展多主体个性化图像生成

Authors: Shulei Wang, Longhui Wei, Xin He, Jianbo Ouyang, Hui Lu, Zhou Zhao, Qi Tian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.01236
Pdf link: https://arxiv.org/pdf/2512.01236
Abstract Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: this https URL
中文摘要 面向单一主体的个性化生成模型已展现出显著效果，凸显了其巨大潜力。然而，当扩展到多个主体时，现有模型的性能往往下降，尤其是在保持主体一致性和遵循文本提示方面。我们将这些局限归因于缺乏高质量的多主体数据集和精细的后训练策略。为应对这些挑战，我们提出了一个可扩展的多主体数据生成流程，利用强大的单主体生成模型构建多样且高质量的多主体训练数据。借助该数据集，我们首先使单主体个性化模型获得合成多图像、多主体场景的能力。此外，为了同时增强主体一致性和文本可控性，我们设计了一组成对主体一致性奖励和通用奖励，并将其纳入精细的强化学习阶段。为了全面评估多主体个性化，我们引入了一个新的基准，使用七个子集从三个维度评估模型表现。大量实验证明了我们的方法在推进多主体个性化图像生成方面的有效性。GitHub 链接：this https URL

Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning

Kardia-R1：通过“评分标准即评判”强化学习，释放大型语言模型面向情感支持的理解与共情推理能力

Authors: Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, Usman Naseem
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01282
Pdf link: https://arxiv.org/pdf/2512.01282
Abstract As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at this https URL.
中文摘要 随着网络平台朝着更强的个性化和更高的情感复杂度发展，对话智能体必须超越表面的共情，展现出具备身份意识的情感推理能力。然而，现有系统面临两个局限：（1）依赖缺乏持久用户身份的情境中心数据集，这妨碍了对个性化情感细微差别的捕捉；（2）依赖不透明、粗粒度的奖励信号，这阻碍了可验证共情推理的发展。为弥补这些不足，我们推出了KardiaBench，这是一个大规模的、以用户为基础的基准，包含178,080个问答对，涵盖22,080段多轮对话，并锚定于671个真实世界用户档案。该数据集通过模型在环流程构建，并辅以评分标准引导的迭代细化，以确保心理合理性和人设一致性。这一渐进式共情流程将用户理解、情境推理和情感感知整合进对话，随后进行迭代批评和基于评分标准的细化，以确保心理合理性、情感忠实度和人设一致性。在此基础上，我们提出了Kardia-R1，一个训练模型进行可解释、逐步共情认知的框架。Kardia-R1利用“评分标准即评判”的共情强化学习（Rubric-ERL），这是一种基于GRPO的方法，使用可解释、与人类对齐的评分标准奖励，将用户理解、情感推断和支持性回复生成紧密耦合。在四种大型语言模型骨干上的大量实验表明，Kardia-R1在情感准确性、共情、相关性、人设一致性和安全性方面始终优于其他方法。我们的数据集和模型将发布于 this https URL。

CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL

CuES：面向智能体强化学习的好奇心驱动、环境扎根的任务合成框架

Authors: Shinji Mai, Yunpeng Zhai, Ziqian Chen, Cheng Chen, Anni Zou, Shuchang Tao, Zhaoyang Liu, Bolin Ding
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01311
Pdf link: https://arxiv.org/pdf/2512.01311
Abstract Large language model-based agents are increasingly deployed in complex, tool-augmented environments. While reinforcement learning provides a principled mechanism for such agents to improve through interaction, its effectiveness critically depends on the availability of structured training tasks. In many realistic settings, however, no such tasks exist, a challenge we term task scarcity, which has become a key bottleneck for scaling agentic RL. Existing approaches typically assume predefined task collections, an assumption that fails in novel environments where tool semantics and affordances are initially unknown. To address this limitation, we formalize the problem of Task Generation for Agentic RL, where an agent must learn within a given environment that lacks predefined tasks. We propose CuES, a Curiosity-driven and Environment-grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks directly from the environment structure and affordances, without relying on handcrafted seeds or external corpora. CuES drives exploration through intrinsic curiosity, abstracts interaction patterns into reusable task schemas, and refines them through lightweight top-down guidance and memory-based quality control. Across three representative environments, AppWorld, BFCL, and WebShop, CuES produces task distributions that match or surpass manually curated datasets in both diversity and executability, yielding substantial downstream policy improvements. These results demonstrate that curiosity-driven, environment-grounded task generation provides a scalable foundation for agents that not only learn how to act, but also learn what to learn. The code is available at this https URL.
中文摘要 基于大型语言模型的智能体越来越多地部署在复杂的、工具增强的环境中。虽然强化学习为此类智能体通过交互不断改进提供了有原则的机制，但其有效性关键取决于是否有结构化的训练任务可用。然而，在许多现实场景中并不存在这样的任务，我们将这一挑战称为任务稀缺，它已成为扩展智能体强化学习的关键瓶颈。现有方法通常假设存在预定义的任务集合，但在工具语义和可供性最初未知的新环境中，这一假设并不成立。为解决这一限制，我们形式化了智能体强化学习的任务生成问题，即智能体必须在缺乏预定义任务的给定环境中学习。我们提出了CuES，一个好奇心驱动、环境扎根的合成框架，能够直接从环境结构和可供性中自主生成多样、可执行且有意义的任务，无需依赖人工设计的种子或外部语料库。CuES通过内在好奇心驱动探索，将交互模式抽象为可复用的任务模式，并通过轻量级的自上而下引导和基于记忆的质量控制对其进行细化。在AppWorld、BFCL和WebShop三个代表性环境中，CuES生成的任务分布在多样性和可执行性方面均达到或超过人工整理的数据集，并带来了显著的下游策略提升。这些结果表明，好奇心驱动、环境扎根的任务生成为智能体提供了可扩展的基础，使其不仅学会如何行动，也学会该学什么。代码可在 this https URL 获取。

Extending NGU to Multi-Agent RL: A Preliminary Study

将NGU扩展到多智能体强化学习：初步研究

Authors: Juan Hernandez, Diego Fernández, Manuel Cifuentes, Denis Parra, Rodrigo Toro Icarte
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01321
Pdf link: https://arxiv.org/pdf/2512.01321
Abstract The Never Give Up (NGU) algorithm has proven effective in reinforcement learning tasks with sparse rewards by combining episodic novelty and intrinsic motivation. In this work, we extend NGU to multi-agent environments and evaluate its performance in the simple_tag environment from the PettingZoo suite. Compared to a multi-agent DQN baseline, NGU achieves moderately higher returns and more stable learning dynamics. We investigate three design choices: (1) shared replay buffer versus individual replay buffers, (2) sharing episodic novelty among agents using different k thresholds, and (3) using heterogeneous values of the beta parameter. Our results show that NGU with a shared replay buffer yields the best performance and stability, highlighting that the gains come from combining NGU intrinsic exploration with experience sharing. Novelty sharing performs comparably when k = 1 but degrades learning for larger values. Finally, heterogeneous beta values do not improve over a small common value. These findings suggest that NGU can be effectively applied in multi-agent settings when experiences are shared and intrinsic exploration signals are carefully tuned.
中文摘要 永不放弃（NGU）算法通过结合回合内新颖性和内在动机，已被证明在奖励稀疏的强化学习任务中行之有效。在本研究中，我们将NGU扩展到多智能体环境，并在PettingZoo套件的simple_tag环境中评估其性能。与多智能体DQN基线相比，NGU获得了略高的回报和更稳定的学习动态。我们研究了三种设计选择：（1）共享回放缓冲区与各自独立的回放缓冲区；（2）在不同k阈值下于智能体之间共享回合内新颖性；（3）使用异质的β参数值。结果表明，采用共享回放缓冲区的NGU性能和稳定性最佳，说明收益来自NGU内在探索与经验共享的结合。当k=1时，新颖性共享的表现相当，但k值更大时会损害学习。最后，异质的β值并不优于一个较小的共同值。这些发现表明，当经验得到共享且内在探索信号经过仔细调节时，NGU可以有效应用于多智能体环境。
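
To make the roles of episodic novelty and the beta parameter concrete, here is a deliberately simplified sketch of an episodic-memory bonus and the beta-weighted reward. It is in the spirit of NGU but is not the exact NGU formula (the distance normalisation and the life-long novelty module are omitted), and the function names, the inverse kernel, and the constants `eps` and `c` are assumptions chosen for illustration. Here `k` is the number of nearest neighbours consulted, which may not be the same quantity as the k threshold studied in the abstract.

```python
import math


def episodic_bonus(memory, embedding, k=2, eps=1e-3, c=1e-3):
    """Novelty of `embedding` relative to embeddings seen earlier this episode.

    memory:    list of embedding tuples stored so far (hypothetical format)
    embedding: tuple of floats for the current observation
    """
    # Squared Euclidean distances to the k nearest stored embeddings.
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(embedding, m)) for m in memory
    )[:k]
    # Inverse kernel: close to 1 for a near-duplicate, close to 0 if far away.
    similarity = sum(eps / (d + eps) for d in dists)
    return 1.0 / math.sqrt(similarity + c)


def shaped_reward(extrinsic, bonus, beta=0.3):
    """Reward used for learning: task reward plus beta-weighted novelty."""
    return extrinsic + beta * bonus
```

With a memory holding a few points near the origin, a revisited embedding receives a bonus close to 1, while a distant embedding receives a much larger one. One possible reading of novelty sharing, offered here only as an assumption, is that several agents append to and query the same `memory` list.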

Discovering Self-Protective Falling Policy for Humanoid Robot via Deep Reinforcement Learning

通过深度强化学习发现人形机器人的自我保护跌倒策略

Authors: Diyuan Shi, Shangke Lyu, Donglin Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.01336
Pdf link: https://arxiv.org/pdf/2512.01336
Abstract Humanoid robots have attracted significant research interest and seen substantial advances in recent years. Despite many successes, their morphology, dynamics, and the limitations of their control policies make humanoid robots more prone to falling than other embodiments such as quadruped or wheeled robots. Their large weight, high center of mass, and high number of degrees of freedom can cause serious hardware damage, both to the robot and to surrounding objects, when a fall is uncontrolled. Existing research in this field mostly focuses on control-based methods, which struggle to cater to diverse falling scenarios and may introduce unsuitable human priors. On the other hand, large-scale deep reinforcement learning and curriculum learning can be employed to incentivize a humanoid agent to discover a falling-protection policy that fits its own nature and properties. In this work, with carefully designed reward functions and a domain diversification curriculum, we successfully train a humanoid agent to explore falling-protection behaviors and discover that, by forming a 'triangle' structure, the damage from falling can be significantly reduced given its rigid-material body. With comprehensive metrics and experiments, we quantify its performance in comparison with other methods, visualize its falling behaviors, and successfully transfer it to a real-world platform.
中文摘要 近年来，人形机器人受到了广泛的研究关注并取得了显著进展。尽管取得了许多成功，但受形态、动力学和控制策略的限制，人形机器人相比四足或轮式机器人等其他形态更容易跌倒。而且，由于重量大、质心高、自由度多，失控跌倒会对机器人自身和周围物体造成严重的硬件损坏。该领域的现有研究大多采用基于控制的方法，这些方法难以应对多样化的跌倒场景，并且可能引入不合适的人类先验。另一方面，可以利用大规模深度强化学习和课程学习，激励人形智能体发现符合自身特性的跌倒保护策略。在这项工作中，我们借助精心设计的奖励函数和领域多样化课程，成功训练人形智能体探索跌倒保护行为，并发现通过形成“三角形”结构，其刚性材料机身所受的跌倒损伤可显著降低。我们通过全面的指标和实验，与其他方法对比量化其性能，可视化其跌倒行为，并成功将其迁移到真实世界平台。

Directed evolution algorithm drives neural prediction

定向进化算法推动神经预测

Authors: Yanlin Wang, Nancy M Young, Patrick C M Wong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01362
Pdf link: https://arxiv.org/pdf/2512.01362
Abstract Neural prediction offers a promising approach to forecasting the individual variability of neurocognitive functions and disorders and providing prognostic indicators for personalized intervention. However, it is challenging to translate neural predictive models into medical artificial intelligence applications due to the limitations of domain shift and label scarcity. Here, we propose the directed evolution model (DEM), a novel computational model that mimics the trial-and-error processes of biological directed evolution to approximate optimal solutions for predictive modeling tasks. We demonstrate that the directed evolution algorithm is an effective strategy for uncertainty exploration, enhancing generalization in reinforcement learning. Furthermore, by incorporating replay buffer and continual backpropagation methods into DEM, we provide evidence of achieving a better trade-off between exploitation and exploration in continual learning settings. We conducted experiments on four different datasets for children with cochlear implants whose spoken language developmental outcomes vary considerably at the individual-child level. Preoperative neural MRI data has been shown to accurately predict the post-operative outcome of these children within but not across datasets. Our results show that DEM can efficiently improve the performance of cross-domain pre-implantation neural predictions while addressing the challenge of label scarcity in the target domain.
中文摘要 神经预测为预测神经认知功能和障碍的个体差异、并为个性化干预提供预后指标提供了一种有前景的方法。然而，受领域偏移和标签稀缺的限制，将神经预测模型转化为医疗人工智能应用颇具挑战。本文提出定向进化模型（DEM），这是一种新型计算模型，模拟生物定向进化的试错过程，以逼近预测建模任务的最优解。我们证明了定向进化算法是一种有效的不确定性探索策略，能够增强强化学习中的泛化能力。此外，通过将回放缓冲区和持续反向传播方法纳入DEM，我们提供证据表明，在持续学习场景中可以在利用与探索之间取得更好的权衡。我们在四个不同的人工耳蜗植入儿童数据集上进行了实验，这些儿童的口语发展结果在个体层面差异很大。已有研究表明，术前神经MRI数据能够在数据集内部准确预测这些儿童的术后结果，但无法跨数据集预测。我们的结果表明，DEM能够有效提升跨域植入前神经预测的性能，同时应对目标域中标签稀缺的挑战。

BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud

BlinkBud：在单只耳机上通过采样式单目三维检测发现来自后方的危险

Authors: Yunzhe Li, Jiajun Yan, Yuzhou Wei, Kechen Liu, Yize Zhao, Chong Zhang, Hongzi Zhu, Li Lu, Shan Chang, Minyi Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01366
Pdf link: https://arxiv.org/pdf/2512.01366
Abstract Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is largely eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
中文摘要 未能察觉从后方高速驶来的车辆，对行人和骑行者的道路安全构成巨大威胁。本文提出了BlinkBud，它利用单只耳机和一部配对的手机，在线检测从用户后方接近的危险物体。其核心思想是利用耳机拍摄的少量采样相机图像，准确跟踪经视觉识别的物体。为了在保证最佳跟踪精度的同时最小化耳机和手机的功耗，我们设计了一种新颖的三维物体跟踪算法，将基于卡尔曼滤波的轨迹估计方案与基于强化学习的最优图像采样策略相结合。此外，我们利用估计的俯仰角修正物体深度估计，并利用估计的偏航角将相机坐标系对齐到用户身体坐标系，从而在很大程度上消除了用户持续头部运动对跟踪精度的影响。我们实现了BlinkBud原型系统，并进行了大量真实场景实验。结果显示，BlinkBud十分轻量，在耳机和智能手机上的平均功耗分别低至29.8 mW和702.6 mW，并能准确检测危险，平均假阳性率（FPR）和假阴性率（FNR）分别低至4.90%和1.47%。
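
The idea of tracking an object between sparsely sampled camera frames can be sketched with a one-dimensional constant-velocity Kalman filter that predicts at every time step and corrects only when a frame is sampled. This is a generic textbook filter under stated assumptions, not BlinkBud's 3D tracker: the fixed `sample_every` interval stands in for the learned sampling policy, and the noise values `q` and `r`, the initial covariance, and the function names are all hypothetical.

```python
def predict(x, P, dt=1.0, q=0.01):
    """Propagate state (distance, velocity) and covariance one step ahead."""
    pos, vel = x
    (p00, p01), (p10, p11) = P
    pos = pos + vel * dt
    n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    return (pos, vel), ((n00, n01), (n10, n11))


def update(x, P, z, r=0.1):
    """Correct the state with a distance measurement z from a sampled frame."""
    pos, vel = x
    (p00, p01), (p10, p11) = P
    s = p00 + r                      # innovation variance
    k0, k1 = p00 / s, p10 / s        # Kalman gain for distance and velocity
    y = z - pos                      # innovation
    x_new = (pos + k0 * y, vel + k1 * y)
    P_new = (((1 - k0) * p00, (1 - k0) * p01),
             (p10 - k1 * p00, p11 - k1 * p01))
    return x_new, P_new


def track(measure, n_steps, sample_every):
    """Predict every step; update only on every `sample_every`-th step."""
    x = (measure(0), 0.0)                      # start from the first frame
    P = ((1.0, 0.0), (0.0, 10.0))              # velocity initially unknown
    for t in range(1, n_steps + 1):
        x, P = predict(x, P)
        if t % sample_every == 0:              # a camera frame is available
            x, P = update(x, P, measure(t))
    return x
```

For an object closing at a constant 2 units per step, `track(lambda t: 100.0 - 2.0 * t, 30, 3)` recovers both distance and closing speed while using only every third frame. The fewer frames the filter needs, the less often the camera has to wake up, which is the power trade-off the abstract describes.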

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

稳定大型语言模型的强化学习：理论表述与实践

Authors: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.01374
Pdf link: https://arxiv.org/pdf/2512.01374
Abstract This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
中文摘要 本文为大型语言模型的强化学习（RL）提出了一种新的理论表述，解释了在REINFORCE等策略梯度方法中，为何以及在何种条件下可以通过词元级（token-level）替代目标来优化真正的序列级奖励。具体来说，通过一阶近似，我们证明只有当训练-推理差异和策略陈旧性都被最小化时，该替代目标才会越来越有效。这一见解为若干被广泛采用的技术在稳定强化学习训练中的关键作用提供了有原则的解释，包括重要性采样校正、裁剪，尤其是用于混合专家（MoE）模型的路由重放（Routing Replay）。通过对一个30B MoE模型进行总计数十万GPU小时的大量实验，我们表明，对于同策略（on-policy）训练，带重要性采样校正的基本策略梯度算法能达到最高的训练稳定性。当引入异策略（off-policy）更新以加速收敛时，必须结合裁剪和路由重放，才能缓解策略陈旧性带来的不稳定。值得注意的是，一旦训练稳定下来，无论冷启动初始化如何，长时间的优化都能稳定地得到相当的最终性能。我们希望所分享的见解和所总结的稳定强化学习训练方案能促进未来的研究。
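
The interplay of importance sampling correction and clipping mentioned in this abstract can be illustrated numerically. The sketch below uses the familiar PPO-style clipped surrogate as a stand-in; the paper's exact objective may differ, and the function names, the clip range of 0.2, and the choice to broadcast one sequence-level advantage to every token are assumptions. It computes objective values only, not gradients.

```python
import math


def token_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped importance-weighted objective for a single token.

    logp_new: log-probability of the token under the policy being trained
    logp_old: log-probability under the policy that generated the sample
    """
    ratio = math.exp(logp_new - logp_old)          # importance weight
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the pessimistic value so that large ratios cannot be exploited.
    return min(ratio * advantage, clipped * advantage)


def sequence_surrogate(logps_new, logps_old, advantage, clip_eps=0.2):
    """Token-level surrogate for a sequence that shares one advantage."""
    terms = [
        token_surrogate(n, o, advantage, clip_eps)
        for n, o in zip(logps_new, logps_old)
    ]
    return sum(terms) / len(terms)
```

When the two policies agree the ratio is 1 and the surrogate reduces to the plain advantage. As the sampling policy becomes stale, or the inference engine assigns different probabilities than the trainer, the ratio drifts away from 1 and clipping bounds each token's contribution.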

Multi-Path Collaborative Reasoning via Reinforcement Learning

通过强化学习实现多路径协同推理

Authors: Jindi Lv, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Guan Huang, Jiancheng Lv
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01485
Pdf link: https://arxiv.org/pdf/2512.01485
Abstract Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi-step reasoning patterns. Empirical results show that M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi-path collaborative learning for robust reasoning.
中文摘要 思维链（CoT）推理显著提升了大型语言模型（LLM）的问题求解能力，但传统CoT在解码过程中常表现出内部确定性，限制了对合理替代方案的探索。近期方法试图通过生成软抽象词元来解决这一问题，使推理能够在连续语义空间中进行。然而，我们发现这些方法仍受制于自回归解码的贪婪性质，这从根本上使模型无法接触其他推理可能性。本研究提出多路径感知策略优化（M3PO），这是一种新型强化学习框架，显式地将集体洞见注入推理过程。M3PO将并行的策略采样（rollout）作为天然多样的推理来源，并通过轻量级协作机制将跨路径交互整合进策略更新。这种设计使每条轨迹都能借助同伴反馈完善自身推理，从而培养出更可靠的多步推理模式。实证结果表明，M3PO在知识密集型和推理密集型基准上均达到了最先进的性能。使用M3PO训练的模型保持了可解释性和推理效率，凸显了多路径协作学习在稳健推理方面的潜力。

Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems

学习可解性边界：对齐大型语言模型以检测无解问题

Authors: Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01661
Pdf link: https://arxiv.org/pdf/2512.01661
Abstract Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at this https URL.
中文摘要 确保LLM的可靠性不仅需要解决复杂问题，还需要识别问题何时无法解决。当前模型常常难以区分客观不可解性（问题内在矛盾）与主观能力限制（超出模型能力范围的问题），这导致幻觉和过度自信。为此，我们提出了UnsolvableQA和UnsolvableRL来解决可行问题，发现内在矛盾，并谨慎拒绝超出能力范围的任务。具体来说，我们构建了UnsolvableQA，这是一个通过双轨方法推导出的配对可解和不可解实例数据集：逻辑谜题的程序生成和一种新颖的“逆向构造”方法，将矛盾注入数学的有效推理链中。基于该数据集，我们引入UnsolvableRL，这是一个强化学习框架，包含三个奖励成分共同考虑准确性、不可解性和难度。实证结果表明，我们的方法实现了近乎完美的不可解性检测，同时提高了可解任务的准确性。关键是，我们识别了能力崩溃，表明明确暴露于无解数据对于防止模型系统性过度自信至关重要。我们的代码和数据可在此 https URL 访问。

How Does RL Post-training Induce Skill Composition? A Case Study on Countdown

强化学习后训练如何诱导技能组合？以 Countdown 任务为例

Authors: Simon Park, Simran Kaur, Sanjeev Arora
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01775
Pdf link: https://arxiv.org/pdf/2512.01775
Abstract While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability: models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
中文摘要 虽然强化学习（RL）成功提升了大型语言模型的推理能力，但它在促进组合泛化（即由已知组件合成新技能的能力）方面的作用，常常与单纯的长度泛化混为一谈。为此，我们研究强化学习后训练在技能组合方面教会了模型什么，以及组合的结构如何影响技能迁移。我们聚焦于Countdown任务（给定n个数字和一个目标值，构造一个求值结果等于目标值的表达式），并将模型的解分析为表达式树，其中每棵子树对应一个可复用的子任务，因此可以被视为一种“技能”。通过跟踪训练过程中的树形状及其成功率，我们发现：（i）模型能够分布外（OOD）泛化到更大的n和未见过的树形状，表明子任务得到了组合式复用；（ii）可学习性存在依赖于结构的层级：模型先掌握浅而平衡的树（工作量在子任务间均衡分配），然后才掌握深而不平衡的树，并且在右重结构上持续表现脆弱（即使其组合深度与某些左重结构相同）。我们的诊断揭示了模型学到了什么、按什么顺序学到以及泛化在哪里失败，阐明了仅用强化学习的后训练如何诱导出OOD泛化，而这超出了pass@k等标准指标所能揭示的范围。
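
The expression-tree view of Countdown solutions can be made concrete with a small sketch. The nested-tuple encoding, the function names, and the rule that every given number is used exactly once are assumptions made for illustration; the paper's own tooling and task variant may differ.

```python
from fractions import Fraction


def evaluate(tree):
    """Evaluate a tree given as an int leaf or an (op, left, right) tuple."""
    if isinstance(tree, int):
        return Fraction(tree)
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a / b          # raises ZeroDivisionError on division by zero
    raise ValueError(f"unknown operator: {op}")


def leaves(tree):
    """Numbers used by the expression, in left-to-right order."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def depth(tree):
    """Composition depth: 0 for a bare number, 1 + deepest child otherwise."""
    if isinstance(tree, int):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def shape(tree):
    """Tree shape with numbers and operators erased, e.g. '((xx)(xx))'."""
    if isinstance(tree, int):
        return "x"
    return "(" + shape(tree[1]) + shape(tree[2]) + ")"


def is_valid(tree, numbers, target):
    """True if the tree uses each number exactly once and hits the target."""
    return sorted(leaves(tree)) == sorted(numbers) and evaluate(tree) == target


# Two solutions for numbers [3, 5, 7, 2] and target 24 with different shapes.
balanced = ("+", ("*", 3, 7), ("-", 5, 2))        # (3 * 7) + (5 - 2)
right_heavy = ("+", 3, ("*", 7, ("-", 5, 2)))     # 3 + (7 * (5 - 2))
```

Both trees are valid solutions, but `balanced` has depth 2 and shape `((xx)(xx))` while `right_heavy` has depth 3 and shape `(x(x(xx)))`. Grouping success rates by `shape` is one way to run the kind of structure-dependent analysis the abstract describes.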

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

GR-RL：面向长时程机器人操作的灵巧与精准

Authors: Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01801
Pdf link: https://arxiv.org/pdf/2512.01801
Abstract We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies rest on the assumption that human demonstrations are optimal. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
中文摘要 我们提出 GR-RL，一种机器人学习框架，可将通用的视觉-语言-动作（VLA）策略转变为擅长长时程灵巧操作的高能力专用策略。现有 VLA 策略的核心前提是人类演示是最优的。然而，我们认为在高度灵巧且精准的操作任务中，人类演示带有噪声且并非最优。GR-RL 提出了一个多阶段训练流程，通过强化学习对演示进行过滤、增强和强化。首先，GR-RL 学习一个以视觉和语言为条件的任务进度函数，对演示轨迹进行过滤，只保留对进度有正向贡献的状态转移。具体来说，我们表明，直接应用带稀疏奖励的离线强化学习，所得的 $Q$ 值可以被视为稳健的进度函数。接下来，我们引入形态对称性增强，大幅提升了 GR-RL 的泛化能力和性能。最后，为了使 VLA 策略与其部署时的行为更好地对齐以实现高精度控制，我们通过学习一个潜空间噪声预测器来进行在线强化学习。借助这一流程，据我们所知，GR-RL 是首个能够自主系鞋带的基于学习的策略：它将鞋带穿过多个鞋眼，成功率达 83.3%，而这项任务需要长时程推理、毫米级精度和柔顺的软体交互。我们希望 GR-RL 为通用机器人基础模型专精为可靠的现实世界专家迈出一步。

CauSight: Learning to Supersense for Visual Causal Discovery

CauSight：学习超感官以发现视觉因果

Authors: Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang, Tianyu Li, Chaochao Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.01827
Pdf link: https://arxiv.org/pdf/2512.01827
Abstract Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: this https URL.
中文摘要 因果思维使人类不仅理解所见，更理解为什么会发生。为了在现代人工智能系统中复制这一能力，我们引入了视觉因果发现的任务。它要求模型推断不同场景中视觉实体之间的因果关系，而不仅仅是感知它们的存在。为此，我们首先构建了视觉因果图数据集（VCG-32K），这是一个包含32,000多张图像的大规模集合，并用实体级因果图注释，并进一步开发了CauSight，一种通过因果意识推理进行视觉因果发现的新型视觉语言模型。我们的训练方案整合了三个部分：（1）来自VCG-32K的训练数据整理，（2）因果思维树（ToCT）用于综合推理轨迹，以及（3）带有设计因果奖励的强化学习，以优化推理策略。实验显示，CauSight在视觉因果发现方面优于GPT-4.1，实现了超过三倍的性能提升（绝对提升21%）。我们的代码、模型和数据集均完全开源于项目页面：https URL。

Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

超越SFT：强化学习，打造更安全的大型推理模型，提升推理能力

Authors: Jinghan Jia, Nathalie Baracaldo, Sijia Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.01848
Pdf link: https://arxiv.org/pdf/2512.01848
Abstract Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
中文摘要 大型推理模型（LRM）通过生成显式的思维链（CoT）推理来扩展大型语言模型，显著提升了数学和逻辑问题的求解能力。然而，这种显式推理过程也带来了新的安全风险：即使最终答案看似无害，不安全的行为也常常出现在中间推理轨迹中。现有的安全对齐方法主要依赖于在面向安全的长 CoT 数据集上进行监督微调（SFT）。这种做法虽然直观，但我们发现 SFT 带来的安全性提升并不一致，还会损害推理能力，并且在不同模型家族之间泛化较差。这些局限表明，纯监督方法不足以实现 LRM 的稳健安全对齐。为此，我们研究将强化学习（RL）作为 LRM 安全训练的补充优化框架。与 SFT 不同，强化学习直接利用奖励反馈优化模型策略，实现更具适应性、更稳定的对齐。在多个模型家族和基准上的大量实验表明，强化学习在保持推理能力的同时，能够取得更强且更一致的安全收益。对反思动态和词元级熵的进一步分析显示，强化学习抑制了不安全的探索性推理，同时保留了反思深度，从而带来更安全、更可靠的推理过程。

Graph Distance as Surprise: Free Energy Minimization in Knowledge Graph Reasoning

图距离即惊讶：知识图谱推理中的自由能最小化

Authors: Gaganpreet Jhajj, Fuhua Lin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01878
Pdf link: https://arxiv.org/pdf/2512.01878
Abstract In this work, we propose that reasoning in knowledge graph (KG) networks can be guided by surprise minimization. Entities that are close in graph distance will have lower surprise than those farther apart. This connects the Free Energy Principle (FEP) from neuroscience to KG systems, where the KG serves as the agent's generative model. We formalize surprise using the shortest-path distance in directed graphs and provide a framework for KG-based agents. Graph distance appears in graph neural networks as message passing depth and in model-based reinforcement learning as world model trajectories. This work-in-progress study explores whether distance-based surprise can extend recent work showing that syntax minimizes surprise and free energy via tree structures.
中文摘要 本研究提出，知识图谱（KG）网络中的推理可以由惊讶最小化来引导。图距离较近的实体，其惊讶度低于距离较远的实体。这将神经科学中的自由能原理（FEP）与 KG 系统联系起来，其中 KG 充当智能体的生成模型。我们利用有向图中的最短路径距离对惊讶进行形式化，并为基于 KG 的智能体提供了一个框架。图距离在图神经网络中体现为消息传递深度，在基于模型的强化学习中体现为世界模型轨迹。近期研究表明，句法通过树结构最小化惊讶和自由能；这项尚在进行中的研究探讨基于距离的惊讶能否对该工作加以扩展。
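
To make the idea of distance-based surprise concrete, here is a minimal Python sketch that computes the shortest-path distance between two entities in a directed graph with breadth-first search. Using the raw hop count as the surprise value, returning infinity for unreachable entities, and the toy triples below are all assumptions of this sketch; the paper's exact formalization may differ.

```python
import math
from collections import deque


def graph_distance(edges, source, target):
    """Shortest directed path length (in hops) from source to target.

    edges is an iterable of (head, tail) pairs; returns math.inf when the
    target cannot be reached by following edge directions.
    """
    adjacency = {}
    for head, tail in edges:
        adjacency.setdefault(head, []).append(tail)

    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, hops + 1))
    return math.inf


# Hypothetical knowledge-graph edges, used only for illustration.
kg_edges = [
    ("cat", "mammal"),
    ("mammal", "animal"),
    ("animal", "organism"),
    ("dog", "mammal"),
]

# Assumed surprise: closer entities are less surprising than distant ones.
surprise_near = graph_distance(kg_edges, "cat", "mammal")
surprise_far = graph_distance(kg_edges, "cat", "organism")
surprise_unreachable = graph_distance(kg_edges, "animal", "cat")
print(surprise_near, surprise_far, surprise_unreachable)
```

Because the graph is directed, the distance is asymmetric: an entity can be reachable in one direction and unreachable in the other, which this sketch reports as infinite surprise.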

New Spiking Architecture for Multi-Modal Decision-Making in Autonomous Vehicles

面向自动驾驶车辆多模态决策的新型脉冲架构

Authors: Aref Ghoreishee, Abhishek Mishra, Lifeng Zhou, John Walsh, Nagarajan Kandasamy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01882
Pdf link: https://arxiv.org/pdf/2512.01882
Abstract This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.
中文摘要 本研究提出了一个端到端多模态强化学习框架，用于自动驾驶车辆的高层决策。该框架通过基于交叉注意力 Transformer 的感知模块，融合异构的传感输入，包括摄像头图像、激光雷达点云和车辆航向信息。尽管 Transformer 已成为现代多模态架构的骨干，但其高昂的计算成本限制了其在资源受限的边缘环境中的部署。为克服这一挑战，我们提出了一种具有时间感知能力的类 Transformer 脉冲架构，利用三值脉冲神经元实现计算高效的多模态融合。在 Highway Environment 中对多项任务的全面评估表明，该方法在实时自主决策中兼具有效性和效率。

Rectifying LLM Thought from Lens of Optimization

从优化视角纠正LLM思维

Authors: Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01925
Pdf link: https://arxiv.org/pdf/2512.01925
Abstract Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
中文摘要 大型语言模型（LLM）的最新进展主要得益于其涌现的推理能力，尤其是借助长思维链（CoT）提示，模型得以进行充分的探索和深思。尽管取得了这些进展，长 CoT 的 LLM 仍常表现出次优的推理行为，例如过度思考和过于冗长的推理链，这会损害性能。本文从优化的视角分析推理过程，将 CoT 视为一个梯度下降过程，其中每一步推理都是朝着问题求解方向的一次更新。基于这一视角，我们提出 RePro（Rectifying Process-level Reward，过程级奖励矫正），一种在后训练阶段改进 LLM 推理的新方法。RePro 定义了一个代理目标函数，用于评估 CoT 背后的优化过程，并利用双重评分机制量化其强度和稳定性。这些分数被聚合为一个复合的过程级奖励，并无缝集成到带可验证奖励的强化学习（RLVR）流程中，以优化 LLM。在多种强化学习算法和多种 LLM 上、覆盖数学、科学和编程基准的大量实验表明，RePro 能够稳定提升推理性能并缓解次优推理行为。

Agentic Policy Optimization via Instruction-Policy Co-Evolution

通过指令-策略协同进化实现智能体策略优化

Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.01945
Pdf link: https://arxiv.org/pdf/2512.01945
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
中文摘要 带可验证奖励的强化学习（RLVR）提升了大型语言模型（LLM）的推理能力，使自主智能体能够有效地进行多轮、工具集成的推理。虽然指令是定义智能体的主要协议，但 RLVR 通常依赖静态的、人工设计的指令。然而，这些指令对基础模型而言可能并非最优，而且随着智能体策略的改进以及对环境交互的探索，最优指令也可能发生变化。为弥合这一差距，我们提出 INSPO，一种新的指令-策略协同进化框架，将指令优化作为强化学习（RL）循环中的动态组成部分。INSPO 维护一个动态的候选指令种群，这些指令与问题一同被采样，强化学习循环中的奖励信号会自动归因到每条指令，表现较差的指令则被定期剪除。新指令通过一种同策略（on-policy）反思机制生成和验证：基于 LLM 的优化器分析回放缓冲区中的过往经验，并针对当前策略演化出更有效的策略。我们在多轮检索和推理任务上进行了大量实验，结果表明 INSPO 显著优于依赖静态指令的强基线。INSPO 能够发现引导智能体走向更具策略性的推理路径的创新指令，在计算开销仅略有增加的情况下取得了显著的性能提升。

GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

GrndCtrl：通过自监督奖励对齐实现世界模型的落地

Authors: Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, Sebastian Scherer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.01952
Pdf link: https://arxiv.org/pdf/2512.01952
Abstract Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
中文摘要 视频世界建模的最新进展使大规模生成模型能够以很高的视觉保真度模拟具身环境，为预测、规划和控制提供了强有力的先验。然而，尽管画面逼真，这些模型往往缺乏几何上的落地（grounding），限制了它们在需要空间一致性和长时程稳定性的导航任务中的应用。我们提出带世界落地的强化学习（RLWG），这是一种自监督的后训练框架，通过几何奖励和感知奖励使预训练的世界模型与物理上可验证的结构对齐。与语言模型中基于可验证反馈的强化学习（RLVR）类似，RLWG 可以使用多种奖励来衡量位姿循环一致性、深度重投影和时间一致性。我们以 GrndCtrl 实例化这一框架，它是一种基于组相对策略优化（GRPO）的奖励对齐适配方法，所得的世界模型能够保持稳定的轨迹、一致的几何结构，并为具身导航提供可靠的推演（rollout）。如同大型语言模型中的后训练对齐，GrndCtrl 利用可验证的奖励来连接生成式预训练与落地的行为，在户外环境中取得了优于监督微调的空间一致性和导航稳定性。

Learned-Rule-Augmented Large Language Model Evaluators

学习规则增强大型语言模型评估器

Authors: Jie Meng, Jin Mao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.01958
Pdf link: https://arxiv.org/pdf/2512.01958
Abstract Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks, but their application to broader evaluation scenarios remains limited. In this work, we explore the potential of LLMs as general evaluators across diverse tasks. Although LLM-based evaluators have made progress in different areas, existing methods struggle to generalize due to their reliance on costly, human-designed evaluation principles, which are often misaligned with both annotated data and LLMs' [...]. To address these challenges, we propose a rule-augmented evaluation paradigm. First, we introduce a rule distillation method that automatically extracts scoring rules from data using an LLM-assisted Monte Carlo Tree Search (MCTS), alleviating scalability issues and improving alignment with data. Second, to enable LLMs to effectively apply the learned rules, we propose two strategies: (1) Chain-of-Rule (CoR), which guides the LLM to follow distilled rules, and (2) training a rule-augmented LLM evaluator (RuAE) via reinforcement learning, further bridging the gap between rules and LLMs' reasoning. Extensive experiments on diverse tasks demonstrate the effectiveness and generalizability of our approach across various evaluation scenarios.
中文摘要 大型语言模型（LLM）主要被用作自然语言生成（NLG）任务的评估器，但其在更广泛评估场景中的应用仍然有限。本文探讨 LLM 作为跨多种任务的通用评估器的潜力。尽管基于 LLM 的评估器在不同领域取得了进展，但现有方法依赖成本高昂、人工设计的评估原则，而这些原则往往与标注数据和 LLM 都不一致，因此难以泛化。为应对这些挑战，我们提出了一种规则增强的评估范式。首先，我们提出一种规则蒸馏方法，利用 LLM 辅助的蒙特卡洛树搜索（MCTS）自动从数据中提取评分规则，缓解了可扩展性问题并改善了与数据的一致性。其次，为了使 LLM 能够有效应用所学规则，我们提出了两种策略：（1）规则链（CoR），引导 LLM 遵循蒸馏出的规则；（2）通过强化学习训练规则增强的 LLM 评估器（RuAE），进一步弥合规则与 LLM 推理之间的差距。在多种任务上的大量实验证明了我们的方法在各类评估场景下的有效性和泛化能力。

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

非平稳环境下离线强化学习中的预测

Authors: Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.01987
Pdf link: https://arxiv.org/pdf/2512.01987
Abstract Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
中文摘要 当无法收集额外的交互数据时，离线强化学习（RL）为利用预先收集的数据集训练策略提供了一条有前景的途径。然而，现有的离线强化学习方法通常假设环境是平稳的，或者只考虑测试时的合成扰动；在以突发、时变的偏移为特征的现实场景中，这些假设往往不成立。这些偏移可能导致部分可观测性，使智能体误判自身的真实状态并导致性能下降。为克服这一挑战，我们提出非平稳离线强化学习中的预测（FORL），该框架统一了（i）基于条件扩散的候选状态生成，其训练不预设未来非平稳性的任何特定模式，以及（ii）零样本时间序列基础模型。FORL 面向容易出现意外的、可能非马尔可夫的偏移的环境，这类环境要求智能体从每个回合一开始就具备稳健的性能。在离线强化学习基准上的实证评估（这些基准用真实世界的时间序列数据加以增强，以模拟逼真的非平稳性）表明，与有竞争力的基线相比，FORL 能够稳定地提升性能。通过将零样本预测与智能体的经验相结合，我们旨在弥合离线强化学习与现实世界非平稳环境的复杂性之间的差距。

RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies

RoaD：以推演（rollout）作为演示，用于自动驾驶策略的闭环监督微调

Authors: Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson, Yuxiao Chen, Boris Ivanovic, Marco Pavone
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01993
Pdf link: https://arxiv.org/pdf/2512.01993
Abstract Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader application domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similarly to or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41% and reduces collisions by 54%.
中文摘要 自动驾驶策略通常通过对人类演示进行开环行为克隆来训练。然而，此类策略在闭环部署时会遭遇协变量偏移，导致误差不断累积。我们提出 Rollouts as Demonstrations（RoaD），这是一种简单高效的方法，利用策略自身的闭环推演（rollout）作为额外的训练数据来缓解协变量偏移。在生成推演的过程中，RoaD 引入专家引导，使轨迹偏向高质量行为，从而产生信息丰富且贴近真实的演示用于微调。该方法能够以比强化学习少几个数量级的数据实现稳健的闭环适应，并且避免了先前闭环监督微调（CL-SFT）方法的限制性假设，从而适用于包括端到端驾驶在内的更广泛的应用领域。我们在两个平台上展示了 RoaD 的有效性：在大规模交通仿真基准 WOSAC 上，其表现与先前的 CL-SFT 方法相当或更好；在用于端到端驾驶的、基于神经重建的高保真仿真器 AlpaSim 上，它将驾驶评分提升了 41%，并将碰撞减少了 54%。

Learning Sim-to-Real Humanoid Locomotion in 15 Minutes

15 分钟学会仿真到现实的人形机器人运动

Authors: Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan, Pieter Abbeel
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.01996
Pdf link: https://arxiv.org/pdf/2512.01996
Abstract Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: this https URL.
中文摘要 大规模并行仿真已将机器人强化学习（RL）的训练时间从数天缩短到数分钟。然而，由于高维度和域随机化等因素带来的挑战，为人形机器人控制实现快速且可靠的仿真到现实（sim-to-real）强化学习仍然困难。在本研究中，我们提出一套基于离策略强化学习算法（即 FastSAC 和 FastTD3）的简单实用的方案，仅用一块 RTX 4090 GPU 就能在 15 分钟内快速训练出人形机器人运动策略。这套方案通过精心调优的设计选择和极简的奖励函数，使离策略强化学习算法在数千个并行环境的大规模设置下保持稳定。我们展示了在强域随机化（例如随机化的动力学、崎岖地形和推力扰动）下，在 Unitree G1 和 Booster T1 机器人上对人形运动控制器进行快速的端到端学习，以及对全身人体动作跟踪策略的快速训练。我们在以下网址提供视频和开源实现：this https URL。

Learning Dexterous Manipulation Skills from Imperfect Simulations

从不完美的仿真中学习灵巧操作技能

Authors: Elvis Hsieh, Wen-Han Hsieh, Yen-Jen Wang, Toru Lin, Jitendra Malik, Koushil Sreenath, Haozhi Qi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.02011
Pdf link: https://arxiv.org/pdf/2512.02011
Abstract Reinforcement learning and sim-to-real transfer have made significant progress in dexterous manipulation. However, progress remains limited by the difficulty of simulating complex contact dynamics and multisensory signals, especially tactile feedback. In this work, we propose a sim-to-real framework that addresses these limitations and demonstrates its effectiveness on nut-bolt fastening and screwdriving with multi-fingered hands. The framework has three stages. First, we train reinforcement learning policies in simulation using simplified object models that lead to the emergence of correct finger gaits. We then use the learned policy as a skill primitive within a teleoperation system to collect real-world demonstrations that contain tactile and proprioceptive information. Finally, we train a behavior cloning policy that incorporates tactile sensing and show that it generalizes to nuts and screwdrivers with diverse geometries. Experiments across both tasks show high task progress ratios compared to direct sim-to-real transfer and robust performance even on unseen object shapes and under external perturbations. Videos and code are available on this https URL.
中文摘要 强化学习和仿真到现实（sim-to-real）迁移在灵巧操作方面取得了显著进展。然而，由于复杂的接触动力学和多感官信号（尤其是触觉反馈）难以仿真，进展仍然受限。在本研究中，我们提出了一个仿真到现实的框架来应对这些局限，并在多指手的螺母螺栓紧固和拧螺丝任务上展示了其有效性。该框架分为三个阶段。首先，我们在仿真中使用简化的物体模型训练强化学习策略，使正确的手指步态得以涌现。然后，我们将学到的策略作为遥操作系统中的技能原语，收集包含触觉和本体感觉信息的真实世界演示。最后，我们训练一个融合触觉感知的行为克隆策略，并表明它能够泛化到几何形状各异的螺母和螺丝刀。两项任务上的实验表明，与直接的仿真到现实迁移相比，该方法取得了较高的任务进度比，并且即使面对未见过的物体形状和外部扰动也能保持稳健的性能。视频和代码见 this https URL。

A Diffusion Model Framework for Maximum Entropy Reinforcement Learning

用于最大熵强化学习的扩散模型框架

Authors: Sebastian Sanokowski, Kaustubh Patil, Alois Knoll
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.02019
Pdf link: https://arxiv.org/pdf/2512.02019
Abstract Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
中文摘要 扩散模型在数据驱动的学习以及从复杂的未归一化目标分布中采样方面取得了显著成功。基于这一进展，我们将最大熵强化学习（MaxEntRL）重新解释为一个基于扩散模型的采样问题。我们利用一个易于处理的上界，最小化扩散策略与最优策略分布之间的反向 Kullback-Leibler（KL）散度来求解该问题。通过将策略梯度定理应用于这一目标，我们为 MaxEntRL 推导出一个修改后的代理目标，以有原则的方式纳入了扩散动力学。由此得到了 Soft Actor-Critic（SAC）、近端策略优化（PPO）和 Wasserstein 策略优化（WPO）的简单扩散变体，分别称为 DiffSAC、DiffPPO 和 DiffWPO。这些方法都只需对各自的基础算法做少量实现上的改动。我们发现，在标准的连续控制基准上，DiffSAC、DiffPPO 和 DiffWPO 取得了比 SAC 和 PPO 更高的回报和更高的样本效率。

Keyword: diffusion policy

Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

面向可扩展机器人学习的自举式动态感知三维视觉表征

Authors: Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin, Yan Qin, Yixuan Ye, Jiaming Liang, Renjing Xu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.00074
Pdf link: https://arxiv.org/pdf/2512.00074
Abstract Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: this https URL
中文摘要 尽管在识别和分割上取得了很好的结果，当前的三维视觉预训练方法在机器人操作任务上往往表现不佳。我们将这一差距归因于两个因素：缺乏对“状态-动作-状态”动态的建模，以及显式几何重建带来的不必要冗余。我们提出 AFRO，一种自监督框架，无需动作监督或重建监督即可学习动态感知的三维表征。AFRO 将状态预测建模为一个生成式扩散过程，并在共享的潜空间中联合建模正向和逆向动力学，以捕捉因果转移结构。为防止动作学习中的特征泄漏，我们采用特征差分和逆一致性监督，提升了视觉特征的质量和稳定性。与 Diffusion Policy 结合后，AFRO 在 16 个仿真任务和 4 个真实世界任务上大幅提高了操作成功率，优于现有的预训练方法。该框架还能随数据量和任务复杂度的增加而良好扩展。定性可视化表明，AFRO 学到了语义丰富、判别性强的特征，为机器人领域的三维表征学习提供了有效的预训练方案。项目页面：this https URL

PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications

PointNet4D：一款用于机器人应用中线上线下感知的轻量级4D点云视频骨干

Authors: Yunze Liu, Zifan Wang, Peiran Wu, Jiayang Ao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.01383
Pdf link: https://arxiv.org/pdf/2512.01383
Abstract Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real-time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.
中文摘要 理解动态的四维环境（即随时间演变的三维空间）对于机器人和交互系统至关重要。这些应用要求系统能够实时处理流式点云视频，且往往受到资源限制，同时在条件允许时还能利用过去和当前的观测。然而，当前的 4D 骨干网络严重依赖时空卷积和 Transformer，它们通常计算开销大，不适合实时应用。我们提出 PointNet4D，一种同时针对在线和离线场景优化的轻量级 4D 骨干网络。其核心是一个混合 Mamba-Transformer 时间融合模块，结合了 Mamba 高效的状态空间建模和 Transformer 的双向建模能力。这使得 PointNet4D 能够在不同部署场景下高效处理可变长度的在线序列。为了增强时间理解，我们提出 4DMAP，一种逐帧掩码自回归预训练策略，用于捕捉跨帧的运动线索。在 7 个数据集、9 项任务上的大量评估显示，该方法在多个领域都取得了一致的提升。我们还通过构建两个机器人应用系统（4D Diffusion Policy 和 4D Imitation Learning）进一步展示了 PointNet4D 的实用性，在 RoboTwin 和 HandoverSim 基准上取得了显著提升。

A Diffusion Model Framework for Maximum Entropy Reinforcement Learning

用于最大熵强化学习的扩散模型框架

Authors: Sebastian Sanokowski, Kaustubh Patil, Alois Knoll
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.02019
Pdf link: https://arxiv.org/pdf/2512.02019
Abstract Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
中文摘要 扩散模型在数据驱动的学习以及从复杂的未归一化目标分布中采样方面取得了显著成功。基于这一进展，我们将最大熵强化学习（MaxEntRL）重新解释为一个基于扩散模型的采样问题。我们利用一个易于处理的上界，最小化扩散策略与最优策略分布之间的反向 Kullback-Leibler（KL）散度来求解该问题。通过将策略梯度定理应用于这一目标，我们为 MaxEntRL 推导出一个修改后的代理目标，以有原则的方式纳入了扩散动力学。由此得到了 Soft Actor-Critic（SAC）、近端策略优化（PPO）和 Wasserstein 策略优化（WPO）的简单扩散变体，分别称为 DiffSAC、DiffPPO 和 DiffWPO。这些方法都只需对各自的基础算法做少量实现上的改动。我们发现，在标准的连续控制基准上，DiffSAC、DiffPPO 和 DiffWPO 取得了比 SAC 和 PPO 更高的回报和更高的样本效率。