Arxiv Papers of Today

Generated: 2025-12-12 16:33:45 (UTC+8); arXiv release time: 2025-12-12 20:00 EST (2025-12-13 09:00 UTC+8)

32 relevant papers today

Keyword: reinforcement learning

TDC-Cache: A Trustworthy Decentralized Cooperative Caching Framework for Web3.0

Authors: Jinyu Chen, Long Shi, Taotao Wang, Jiaheng Wang, Wei Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.09961
Pdf link: https://arxiv.org/pdf/2512.09961
Abstract The rapid growth of Web3.0 is transforming the Internet from a centralized structure to decentralized, which empowers users with unprecedented self-sovereignty over their own data. However, in the context of decentralized data access within Web3.0, it is imperative to cope with efficiency concerns caused by the replication of redundant data, as well as security vulnerabilities caused by data inconsistency. To address these challenges, we develop a Trustworthy Decentralized Cooperative Caching (TDC-Cache) framework for Web3.0 to ensure efficient caching and enhance system resilience against adversarial threats. This framework features a two-layer architecture, wherein the Decentralized Oracle Network (DON) layer serves as a trusted intermediary platform for decentralized caching, bridging the contents from decentralized storage and the content requests from users. In light of the complexity of Web3.0 network topologies and data flows, we propose a Deep Reinforcement Learning-Based Decentralized Caching (DRL-DC) for TDC-Cache to dynamically optimize caching strategies of distributed oracles. Furthermore, we develop a Proof of Cooperative Learning (PoCL) consensus to maintain the consistency of decentralized caching decisions within DON. Experimental results show that, compared with existing approaches, the proposed framework reduces average access latency by 20%, increases the cache hit rate by at most 18%, and improves the average success consensus rate by 10%. Overall, this paper serves as a first foray into the investigation of decentralized caching framework and strategy for Web3.0.

Latent Action World Models for Control with Unlabeled Trajectories

Authors: Marvin Alles, Xingyuan Zhang, Patrick van der Smagt, Philip Becker-Ehmck
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10016
Pdf link: https://arxiv.org/pdf/2512.10016
Abstract Inspired by how humans combine direct interaction with action-free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action-conditioned trajectories, which limits effectiveness when action labels are scarce. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. We use the latent-action world model to learn a latent-action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action-conditioned data, and action-free training, which is rarely used with subsequent RL. On the DeepMind Control Suite, our approach achieves strong performance while using about an order of magnitude fewer action-labeled samples than purely action-conditioned baselines. These results show that latent actions enable training on both passive and interactive data, which makes world models learn more efficiently.

Diffusion Is Your Friend in Show, Suggest and Tell

Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.10038
Pdf link: https://arxiv.org/pdf/2512.10038
Abstract Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: this https URL_suggest_tell.

SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

Authors: Jongmin Lee, Meiqi Sun, Pieter Abbeel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10042
Pdf link: https://arxiv.org/pdf/2512.10042
Abstract In the unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes an SEM policy from an arbitrary off-policy dataset, which optimizes the policy directly within the space of stationary distributions. SEMDICE computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.
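
To make the SEM objective concrete, the following minimal sketch computes the quantity being maximized: the Shannon entropy of a state visitation distribution. It estimates that distribution by counting states in a finite rollout, which is a simplifying assumption; SEMDICE itself optimizes over stationary distributions from off-policy data and is not reproduced here. The names `state_entropy`, `uniform_visits`, and `skewed_visits` are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def state_entropy(states):
    """Shannon entropy (in nats) of the empirical distribution over visited states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical rollouts over four discrete states.
uniform_visits = ["s0", "s1", "s2", "s3"] * 25      # every state visited equally often
skewed_visits = ["s0"] * 97 + ["s1", "s2", "s3"]    # policy mostly stuck in s0

print(state_entropy(uniform_visits))  # equals log(4), the maximum for four states
print(state_entropy(skewed_visits))   # much lower: poor state coverage
```

A policy with higher state entropy covers the state space more evenly, which is the property an SEM pre-training method aims for.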

Push Smarter, Not Harder: Hierarchical RL-Diffusion Policy for Efficient Nonprehensile Manipulation

Authors: Steven Caro, Stephen L. Smith
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10099
Pdf link: https://arxiv.org/pdf/2512.10099
Abstract Nonprehensile manipulation, such as pushing objects across cluttered environments, presents a challenging control problem due to complex contact dynamics and long-horizon planning requirements. In this work, we propose HeRD, a hierarchical reinforcement learning-diffusion policy that decomposes pushing tasks into two levels: high-level goal selection and low-level trajectory generation. We employ a high-level reinforcement learning (RL) agent to select intermediate spatial goals, and a low-level goal-conditioned diffusion model to generate feasible, efficient trajectories to reach them. This architecture combines the long-term reward maximizing behaviour of RL with the generative capabilities of diffusion models. We evaluate our method in a 2D simulation environment and show that it outperforms the state-of-the-art baseline in success rate, path efficiency, and generalization across multiple environment configurations. Our results suggest that hierarchical control with generative low-level planning is a promising direction for scalable, goal-directed nonprehensile manipulation. Code, documentation, and trained models are available: this https URL.

Explicit Control Barrier Function-based Safety Filters and their Resource-Aware Computation

Authors: Pol Mestres, Shima Sadat Mousavi, Pio Ong, Lizhi Yang, Ersin Das, Joel W. Burdick, Aaron D. Ames
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.10118
Pdf link: https://arxiv.org/pdf/2512.10118
Abstract This paper studies the efficient implementation of safety filters that are designed using control barrier functions (CBFs), which minimally modify a nominal controller to render it safe with respect to a prescribed set of states. Although CBF-based safety filters are often implemented by solving a quadratic program (QP) in real time, the use of off-the-shelf solvers for such optimization problems poses a challenge in applications where control actions need to be computed efficiently at very high frequencies. In this paper, we introduce a closed-form expression for controllers obtained through CBF-based safety filters. This expression is obtained by partitioning the state-space into different regions, with a different closed-form solution in each region. We leverage this formula to introduce a resource-aware implementation of CBF-based safety filters that detects changes in the partition region and uses the closed-form expression between changes. We showcase the applicability of our approach in examples ranging from aerospace control to safe reinforcement learning.
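
As an illustration of what a closed-form safety filter looks like, the sketch below covers only the simplest case: a single affine CBF constraint `a . u + b >= 0`. For that case the QP `min ||u - u_nom||^2` has a well-known projection solution with two regions (constraint inactive or active). This is an assumption-laden example, not the paper's general expression, which partitions the state space into more regions. The names `cbf_filter`, `u_nom`, `a`, and `b` are hypothetical.

```python
def cbf_filter(u_nom, a, b):
    """Closest control to u_nom (Euclidean norm) satisfying a . u + b >= 0.

    Region 1: the nominal control already satisfies the constraint; return it unchanged.
    Region 2: project u_nom onto the constraint boundary a . u + b = 0.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible constraint: a = 0 and b < 0")
    scale = slack / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


# Hypothetical numbers: the nominal control pushes too hard along the first axis.
safe_u = cbf_filter(u_nom=[1.0, 0.0], a=[-1.0, 0.0], b=0.5)
print(safe_u)
```

Because each region has an explicit formula, no solver call is needed at run time; only a check of which region the current state falls in.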

An exploration for higher efficiency in multi objective optimisation with reinforcement learning

Authors: Mehmet Emin Aydin
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2512.10208
Pdf link: https://arxiv.org/pdf/2512.10208
Abstract Efficiency in optimisation and search processes remains a challenge that affects the performance and use of optimisation algorithms. Utilising a pool of operators instead of a single operator to handle move operations within a neighbourhood remains promising, but finding an optimum or near-optimum sequence of operators necessitates further investigation. One promising idea is to generalise experiences and seek ways to utilise them. Although numerous works address this issue for single-objective optimisation, multi-objective cases have received little attention in this regard. A generalised approach based on multi-objective reinforcement learning appears to remedy this issue and offer good solutions. This paper gives an overview of a proposed generalisation approach, with certain stages completed and some phases outstanding, that aims to help demonstrate the efficiency of using multi-objective reinforcement learning.

Latent Chain-of-Thought World Modeling for End-to-End Driving

Authors: Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.10226
Pdf link: https://arxiv.org/pdf/2512.10226
Abstract Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

Task-Oriented Grasping Using Reinforcement Learning with a Contextual Reward Machine

Authors: Hui Li, Akhlak Uz Zaman, Fujian Yan, Hongsheng He
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.10235
Pdf link: https://arxiv.org/pdf/2512.10235
Abstract This paper presents a reinforcement learning framework that incorporates a Contextual Reward Machine for task-oriented grasping. The Contextual Reward Machine reduces task complexity by decomposing grasping tasks into manageable sub-tasks. Each sub-task is associated with a stage-specific context, including a reward function, an action space, and a state abstraction function. This contextual information enables efficient intra-stage guidance and improves learning efficiency by reducing the state-action space and guiding exploration within clearly defined boundaries. In addition, transition rewards are introduced to encourage or penalize transitions between stages which guides the model toward desirable stage sequences and further accelerates convergence. When integrated with the Proximal Policy Optimization algorithm, the proposed method achieved a 95% success rate across 1,000 simulated grasping tasks encompassing diverse objects, affordances, and grasp topologies. It outperformed the state-of-the-art methods in both learning speed and success rate. The approach was transferred to a real robot, where it achieved a success rate of 83.3% in 60 grasping tasks over six affordances. These experimental results demonstrate superior accuracy, data efficiency, and learning efficiency. They underscore the model's potential to advance task-oriented grasping in both simulated and real-world settings.
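
The stage decomposition described above can be pictured as a small state machine. The sketch below is a hypothetical illustration, not the authors' implementation: each stage carries its own reward function and action set, and a transition reward is added when the stage changes. The stage names (`approach`, `grasp`), thresholds, and reward values are all assumptions, and the state abstraction function mentioned in the abstract is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class StageContext:
    """Stage-specific context: a reward function and the allowed action set."""
    reward: Callable[[dict], float]
    actions: Tuple[str, ...]


class ContextualRewardMachine:
    def __init__(self, contexts: Dict[str, StageContext],
                 transition_rewards: Dict[Tuple[str, str], float],
                 next_stage: Callable[[str, dict], str], start: str):
        self.contexts = contexts
        self.transition_rewards = transition_rewards
        self.next_stage = next_stage
        self.stage = start

    def step(self, state: dict) -> Tuple[float, str]:
        """Return the reward for `state` and the stage that is active afterwards."""
        reward = self.contexts[self.stage].reward(state)
        new_stage = self.next_stage(self.stage, state)
        if new_stage != self.stage:
            # Transition reward: bonus for progressing, penalty for regressing.
            reward += self.transition_rewards.get((self.stage, new_stage), 0.0)
            self.stage = new_stage
        return reward, self.stage


def next_stage(stage: str, state: dict) -> str:
    """Hypothetical stage-transition rule based on gripper-object distance."""
    if stage == "approach" and state["distance"] < 0.05:
        return "grasp"
    if stage == "grasp" and state["distance"] > 0.2:
        return "approach"  # object lost: fall back to the previous stage
    return stage


crm = ContextualRewardMachine(
    contexts={
        "approach": StageContext(lambda s: -s["distance"], ("move",)),
        "grasp": StageContext(lambda s: 1.0 if s["contact"] else 0.0, ("close", "open")),
    },
    transition_rewards={("approach", "grasp"): 1.0, ("grasp", "approach"): -1.0},
    next_stage=next_stage,
    start="approach",
)

first = crm.step({"distance": 0.5, "contact": False})    # still approaching
second = crm.step({"distance": 0.01, "contact": False})  # close enough: switch to grasp
print(first, second)
```

Restricting each stage to its own action set is what shrinks the state-action space the agent has to explore at any one time.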

Multi-dimensional Preference Alignment by Conditioning Reward Itself

Authors: Jiho Jang, Jinyoung Kim, Kyungjune Baek, Nojun Kwak
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.10237
Pdf link: https://arxiv.org/pdf/2512.10237
Abstract Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.
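
The reward conflict described above can be shown with a toy example. The sketch below is illustrative only and does not reproduce the MCDPO objective or how the condition is injected into the network. It defines the standard Bradley-Terry negative log-likelihood, then compares a scalar preference label against a per-axis preference outcome vector. The per-axis scores and the helper names `bt_nll` and `outcome_vector` are assumptions.

```python
import math


def bt_nll(margin: float) -> float:
    """Bradley-Terry negative log-likelihood of the preferred item, -log(sigmoid(margin))."""
    return math.log1p(math.exp(-margin))


def outcome_vector(scores_a, scores_b):
    """+1 on axes where sample A scores at least as high as B, -1 where B wins."""
    return [1 if a >= b else -1 for a, b in zip(scores_a, scores_b)]


# Hypothetical per-axis scores: (aesthetic quality, semantic alignment).
scores_a = [2.0, 0.5]
scores_b = [0.5, 1.0]

# Scalar aggregation collapses both axes into one label: A is preferred overall.
a_preferred = sum(scores_a) > sum(scores_b)

# The per-axis outcome vector keeps the information that B wins on alignment.
outcome = outcome_vector(scores_a, scores_b)

# Axes where the scalar label disagrees with the per-axis outcome are the conflicts.
conflict_axes = [k for k, o in enumerate(outcome) if (o > 0) != a_preferred]

print(a_preferred, outcome, conflict_axes)
print(bt_nll(0.0))  # loss when the model is indifferent between the two samples
```

With a single scalar label, training pushes the model away from sample B on every axis, including the alignment axis where B is the better sample; conditioning on the outcome vector is the paper's way of avoiding that.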

Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Authors: Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita, Ali R. Butt
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10271
Pdf link: https://arxiv.org/pdf/2512.10271
Abstract Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70 percent. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.

A Privacy-Preserving Cloud Architecture for Distributed Machine Learning at Scale

Authors: Vinoth Punniyamoorthy, Ashok Gadi Parthi, Mayilsamy Palanigounder, Ravi Kiran Kodali, Bikesh Kumar, Kabilan Kannan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10341
Pdf link: https://arxiv.org/pdf/2512.10341
Abstract Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero-knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership-inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Experimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The proposed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.

Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Authors: Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10414
Pdf link: https://arxiv.org/pdf/2512.10414
Abstract Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. Then, we propose token-selective entropy computation (TsEC) to maximize the effectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.

HypeR Adaptivity: Joint $hr$-Adaptive Meshing via Hypergraph Multi-Agent Deep Reinforcement Learning

Authors: Niccolò Grillo, James Rowbottom, Pietro Liò, Carola Bibiane Schönlieb, Stefania Fresca
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2512.10439
Pdf link: https://arxiv.org/pdf/2512.10439
Abstract Adaptive mesh refinement is central to the efficient solution of partial differential equations (PDEs) via the finite element method (FEM). Classical $r$-adaptivity optimizes vertex positions but requires solving expensive auxiliary PDEs such as the Monge-Ampère equation, while classical $h$-adaptivity modifies topology through element subdivision but suffers from expensive error indicator computation and is constrained by isotropic refinement patterns that impose accuracy ceilings. Combined $hr$-adaptive techniques naturally outperform single-modality approaches, yet inherit both computational bottlenecks and the restricted cost-accuracy trade-off. Emerging machine learning methods for adaptive mesh refinement seek to overcome these limitations, but existing approaches address $h$-adaptivity or $r$-adaptivity in isolation. We present HypeR, a deep reinforcement learning framework that jointly optimizes mesh relocation and refinement. HypeR casts the joint adaptation problem using tools from hypergraph neural networks and multi-agent reinforcement learning. Refinement is formulated as a heterogeneous multi-agent Markov decision process (MDP) where element agents decide discrete refinement actions, while relocation follows an anisotropic diffusion-based policy on vertex agents with provable prevention of mesh tangling. The reward function combines local and global error reduction to promote general accuracy. Across benchmark PDEs, HypeR reduces approximation error by up to 6--10$\times$ versus state-of-art $h$-adaptive baselines at comparable element counts, breaking through the uniform refinement accuracy ceiling that constrains subdivision-only methods. The framework produces meshes with improved shape metrics and alignment to solution anisotropy, demonstrating that jointly learned $hr$-adaptivity strategies can substantially enhance the capabilities of automated mesh generation.

UACER: An Uncertainty-Aware Critic Ensemble Framework for Robust Adversarial Reinforcement Learning

Authors: Jiaxi Wu, Tiantian Zhang, Yuxing Wang, Yongzhe Chang, Xueqian Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10492
Pdf link: https://arxiv.org/pdf/2512.10492
Abstract Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is exploited in parallel to stabilize Q-value estimation rather than conventional single-critic architectures for both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.
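
One way to picture a variance-derived aggregation over K critics is sketched below. This is a guess at the general shape, not the paper's TDU formula: it combines the ensemble mean with a standard-deviation term whose coefficient decays over training. The pessimistic sign of the uncertainty term, the exponential decay schedule, and the names `aggregate_q`, `lam0`, and `decay` are all assumptions.

```python
import statistics


def aggregate_q(q_values, step, lam0=1.0, decay=0.999):
    """Ensemble mean minus a decaying multiple of the ensemble standard deviation.

    q_values: Q-value estimates of the K critics for one state-action pair.
    step:     current training step; the uncertainty weight shrinks as it grows.
    """
    lam = lam0 * decay ** step
    return statistics.fmean(q_values) - lam * statistics.pstdev(q_values)


critics = [0.0, 2.0]                       # two critics that disagree strongly
early = aggregate_q(critics, step=0)       # full uncertainty penalty
late = aggregate_q(critics, step=10_000)   # penalty has almost vanished
print(early, late)
```

Critics that agree produce a small standard deviation, so their aggregate stays close to the plain mean at every step; disagreement is what triggers the adjustment.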

Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Authors: Chihyeon Song, Jaewoo Lee, Jinkyoo Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10510
Pdf link: https://arxiv.org/pdf/2512.10510
Abstract Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.
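
The weighting idea can be sketched as follows. The abstract does not define the 'on-policyness' metric, so this example substitutes a simple stand-in: for a deterministic policy with scalar actions, a trajectory's score is `exp(-MSE)` between the logged actions and the actions the current policy would take. Every transition in a trajectory then inherits that trajectory's score as its sampling weight. The metric and all names (`on_policyness`, `sampling_weights`, `policy`) are assumptions.

```python
import math
import random


def on_policyness(trajectory, policy):
    """Score in (0, 1]: 1.0 means the logged actions match the current policy exactly."""
    mse = sum((policy(s) - a) ** 2 for s, a in trajectory) / len(trajectory)
    return math.exp(-mse)


def sampling_weights(trajectories, policy):
    """Normalized per-transition weights; each transition gets its trajectory's score."""
    weights = []
    for traj in trajectories:
        score = on_policyness(traj, policy)
        weights.extend([score] * len(traj))
    total = sum(weights)
    return [w / total for w in weights]


def policy(state):
    """Hypothetical current policy: action is twice the state."""
    return 2.0 * state


offline_traj = [(1.0, 0.0), (2.0, 0.0)]  # (state, action) pairs far from the policy
online_traj = [(1.0, 2.0), (2.0, 4.0)]   # pairs generated by the current policy

weights = sampling_weights([offline_traj, online_traj], policy)

# Draw a mini-batch of transition indices according to the weights.
random.seed(0)
batch = random.choices(range(len(weights)), weights=weights, k=4)
print(weights, batch)
```

No parameters are trained to obtain the weights, which matches the learning-free design goal stated in the abstract.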
中文摘要 离线到在线强化学习（O2O RL）面临一个关键难题：如何在固定的离线数据集与新收集的在线经验之间取得平衡。标准方法通常依赖固定的数据混合比例，难以兼顾早期学习稳定性与渐近性能。为克服这一问题，我们引入了自适应重放缓冲区（Adaptive Replay Buffer，ARB），这是一种基于轻量级指标“同策略性”（on-policyness）动态确定数据采样优先级的新方法。与以往依赖复杂学习过程或固定比例的方法不同，ARB无需学习且易于实现，能够无缝集成到现有的O2O强化学习算法中。它评估已收集轨迹与当前策略行为的契合程度，并为该轨迹中的每个转移分配成比例的采样权重。该策略有效利用离线数据实现初始稳定性，同时逐步将学习重点放在最相关、回报最高的在线经验上。我们在D4RL基准上的大量实验表明，ARB能够持续缓解早期性能下降，并显著提升各种O2O强化学习算法的最终性能，凸显了自适应、行为感知的重放缓冲区设计的重要性。
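
The weighting idea in this abstract can be sketched in a few lines. This is an illustration only, not the authors' implementation: the abstract does not define 'on-policyness', so the score used here (the geometric-mean likelihood of a trajectory's actions under the current policy) and the function names are assumptions.

```python
import math
import random


def on_policyness(log_probs):
    """Hypothetical trajectory score: geometric-mean likelihood of the
    trajectory's actions under the current policy (an assumption)."""
    return math.exp(sum(log_probs) / len(log_probs))


def transition_weights(trajectories):
    """Give every transition the score of its trajectory, normalized to sum to 1.

    `trajectories` is a list of per-trajectory lists of action log-probabilities
    evaluated under the current policy.
    """
    raw = []
    for log_probs in trajectories:
        score = on_policyness(log_probs)
        raw.extend([score] * len(log_probs))
    total = sum(raw)
    return [w / total for w in raw]


def sample_batch(transitions, weights, batch_size, rng=random):
    """Draw a training batch with probability proportional to the weights."""
    return rng.choices(transitions, weights=weights, k=batch_size)
```

Because the score needs only log-probabilities from the current policy, nothing extra has to be trained, which is in the spirit of the "learning-free" claim in the abstract.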

Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

通过复杂度提升强化学习实现奥林匹亚级几何大型语言模型代理

Authors: Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10534
Pdf link: https://arxiv.org/pdf/2512.10534
Abstract Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.
中文摘要 大型语言模型（LLM）智能体展现出强大的数学问题求解能力，甚至能借助形式化证明系统解决国际数学奥林匹克（IMO）级别的问题。然而，由于辅助构造的启发式方法较弱，面向几何问题求解的人工智能仍由AlphaGeometry 2等专家模型主导，这些模型在训练和评估中都高度依赖大规模数据合成与搜索。在本研究中，我们首次尝试构建奖牌级的几何LLM智能体，并提出InternGeometry。InternGeometry通过迭代地提出命题和辅助构造、用符号引擎加以验证，并根据引擎的反馈进行反思以指导后续提案，克服了几何中的启发式局限。动态记忆机制使InternGeometry在每个问题上能与符号引擎进行两百多次交互。为进一步加速学习，我们引入了复杂度提升强化学习（CBRL），在各训练阶段逐步提高合成问题的复杂度。InternGeometry基于InternThinker-32B构建，解决了50道IMO几何题（2000-2024年）中的44道，超过金牌得主的平均得分（40.9），且仅使用1.3万个训练样本，仅为AlphaGeometry 2所用数据的0.004%，展示了LLM智能体在专家级几何任务上的潜力。InternGeometry还能为IMO问题提出人类解法中未曾出现的新颖辅助构造。我们将发布模型、数据和符号引擎，以支持未来的研究。

Grounding Everything in Tokens for Multimodal Large Language Models

让多模态大型语言模型以词元定位一切

Authors: Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.10554
Pdf link: https://arxiv.org/pdf/2512.10554
Abstract Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
中文摘要 多模态大型语言模型（MLLM）在视觉理解和推理方面取得了显著进展。然而，MLLM所采用的自回归Transformer架构需要对输入图像进行词元化，这限制了它们在二维图像空间中准确定位对象的能力。这引出了一个重要问题：如何改进序列化的语言词元，使MLLM更好地在二维空间中定位对象？为此，我们提出了一种用于对象定位的空间表示方法GETok，它将由可学习词元构成的专用词表集成到MLLM中。GETok首先使用网格词元将图像平面划分为结构化的空间锚点，然后利用偏移词元对定位预测进行精确的迭代细化。通过将空间关系直接嵌入词元，GETok在不修改自回归架构的情况下，显著提升了MLLM的原生二维空间推理能力。大量实验表明，在监督微调和强化学习两种设置下，GETok在各类指代任务上均优于最先进方法。
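
To make the grid-plus-offset idea concrete, here is a toy coarse-to-fine quantizer for a single 2D point. It is a sketch under stated assumptions, not GETok's actual vocabulary: the token spellings (`<grid_r_c>`, `<off_r_c>`), the 8x8 grid, and the 4x4 offset bins are all hypothetical, and GETok's learnable tokens and iterative refinement are not modeled.

```python
def encode_point(x, y, grid=8, offset_bins=4):
    """Map a normalized point in [0, 1) x [0, 1) to a (grid token, offset token) pair."""
    col = min(int(x * grid), grid - 1)
    row = min(int(y * grid), grid - 1)
    # Position of the point inside its grid cell, rescaled to [0, 1).
    fx = x * grid - col
    fy = y * grid - row
    ox = min(int(fx * offset_bins), offset_bins - 1)
    oy = min(int(fy * offset_bins), offset_bins - 1)
    return f"<grid_{row}_{col}>", f"<off_{oy}_{ox}>"


def decode_point(grid_token, offset_token, grid=8, offset_bins=4):
    """Recover the center of the sub-cell addressed by the two tokens."""
    row, col = (int(v) for v in grid_token.strip("<>").split("_")[1:])
    oy, ox = (int(v) for v in offset_token.strip("<>").split("_")[1:])
    cell = 1.0 / grid
    sub = cell / offset_bins
    return col * cell + (ox + 0.5) * sub, row * cell + (oy + 0.5) * sub
```

The grid token alone localizes the point to one cell; the offset token shrinks the worst-case error per axis by a factor of `offset_bins` without enlarging the grid vocabulary.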

Multi-Objective Reward and Preference Optimization: Theory and Algorithms

多目标奖励与偏好优化：理论与算法

Authors: Akhil Agnihotri
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10601
Pdf link: https://arxiv.org/pdf/2512.10601
Abstract This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences. warmPref-PS introduces a posterior sampling strategy for linear bandits that integrates offline preference data from heterogeneous raters into online learning. Explicit modeling of rater competence yields substantial regret reduction and more efficient data collection for RLHF. The PSPL algorithm further advances preference-based RL by jointly sampling reward models and transition dynamics from pairwise trajectory comparisons, providing Bayesian simple-regret guarantees and robust empirical identification of optimal policies. The final contribution applies these methods to large-scale model alignment. A multi-objective constrained optimization view yields MOPO, an iterative algorithm with closed-form updates that scales to multi-billion-parameter language models and remains robust across alignment settings. Collectively, the thesis unifies constrained RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.
中文摘要 本论文提出了一系列理论框架和算法，推动约束强化学习（RL）在控制、偏好学习和大型语言模型对齐方面的发展。第一项贡献通过平均约束策略优化（ACPO）算法，处理平均成本准则下的约束马尔可夫决策过程（CMDP）。ACPO将灵敏度分析与信赖域更新相结合，确保约束处理的稳定性，在具备理论保证的同时取得了最先进的实证性能。随后，论文通过e-COP将约束强化学习扩展到有限时域设置，这是首个面向回合式CMDP的策略优化方法。e-COP建立在回合式策略差分引理之上，在安全关键环境中兼具可证明的性能、简洁性和可扩展性。接着，论文研究了基于人类偏好的强化学习。warmPref-PS提出了一种面向线性赌博机的后验采样策略，将来自异质评分者的离线偏好数据整合到在线学习中。对评分者能力的显式建模显著降低了遗憾，并使RLHF的数据收集更加高效。PSPL算法根据成对轨迹比较联合采样奖励模型和转移动态，进一步推进了基于偏好的强化学习，提供贝叶斯简单遗憾保证，并能在实证中稳健地识别最优策略。最后一项贡献将这些方法应用于大规模模型对齐。多目标约束优化的视角导出了MOPO，这是一种具有闭式更新的迭代算法，可扩展至数十亿参数的语言模型，并在各种对齐设置下保持鲁棒。总体而言，该论文统一了平均成本、回合式和偏好驱动三种范式下的约束强化学习，为安全且对齐的决策提供了理论进展和实用工具。

AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence

AgriGPT-Omni：多语言农业智能的统一语音-视觉-文本框架

Authors: Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Shijian Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.10624
Pdf link: https://arxiv.org/pdf/2512.10624
Abstract Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
中文摘要 尽管多模态大型语言模型取得了快速进展，农业应用仍受限于多语言语音数据、统一多模态架构和全面评估基准的缺乏。为应对这些挑战，我们提出了AgriGPT-Omni，一个将语音、视觉和文本整合于统一框架的农业全模态框架。首先，我们构建了一个可扩展的数据合成与收集流程，将农业文本和图像转化为训练数据，得到迄今为止最大的农业语音数据集，包含覆盖六种语言的49.2万个合成语音样本和1400个真实语音样本。其次，在此基础上，我们通过三阶段范式训练了首个农业全模态模型：文本知识注入、渐进式多模态对齐和基于GRPO的强化学习，实现跨语言、跨模态的统一推理。第三，我们提出AgriBench-Omni-2K，这是农业领域首个三模态基准，涵盖多样化的语音-视觉-文本任务和多语言切片，并配有标准化协议和可复现工具。实验表明，AgriGPT-Omni在多语言和多模态推理以及真实场景语音理解方面显著优于通用基线。所有模型、数据、基准和代码都将发布，以促进可复现的研究、包容性的农业智能以及面向低资源地区的可持续人工智能发展。

Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning

利用强化学习提升放射学报告生成和视觉定位

Authors: Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt, Michael Moor, Farhad Nooralahzadeh, Michael Krauthammer
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.10691
Pdf link: https://arxiv.org/pdf/2512.10691
Abstract Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
中文摘要 视觉语言模型（VLM）的最新进展在多个方面提升了胸部X光（CXR）的解读能力。然而，许多医疗VLM仅依赖监督微调（SFT），而SFT只优化下一词元预测，并不评估答案质量。相比之下，强化学习（RL）可以纳入任务特定的反馈，其与显式中间推理（“思考”）的结合已在可验证的数学和编程任务上带来了显著提升。为研究强化学习与思考在CXR VLM中的作用，我们在CXR数据上进行大规模SFT，构建了基于Qwen3-VL的新版RadVLM，随后通过冷启动SFT阶段赋予模型基本的思考能力。接着，我们针对报告生成和视觉定位，采用带有临床依据的任务特定奖励的组相对策略优化（GRPO），并在领域专用和通用领域的Qwen3-VL变体上、在有思考和无思考两种条件下进行了匹配的强化学习实验。在这些设置中，我们发现，虽然强大的SFT对于较高的基础性能仍然至关重要，但强化学习在两项任务上都带来了额外收益，而显式思考似乎并未进一步改善结果。在统一的评估流程下，经强化学习优化的RadVLM模型优于对应的基线模型，并在报告生成和视觉定位上均达到最先进性能，凸显了与临床对齐的强化学习是医疗VLM中SFT的有力补充。
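
GRPO, as commonly described, scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. The sketch below shows only that group-relative normalization step. The reward values in the usage note are made up, the paper's clinically grounded rewards are not reproduced, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get a positive training signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

For four sampled reports with hypothetical rewards `[1.0, 0.0, 0.0, 1.0]`, the two better reports receive advantages close to +1 and the other two close to -1.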

How to Brake? Ethical Emergency Braking with Deep Reinforcement Learning

怎么刹车？深度强化学习的伦理紧急制动

Authors: Jianbo Wang, Galina Sidorenko, Johan Thunberg
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10698
Pdf link: https://arxiv.org/pdf/2512.10698
Abstract Connected and automated vehicles (CAVs) have the potential to enhance driving safety, for example by enabling safe vehicle following and more efficient traffic scheduling. For such future deployments, safety requirements must be addressed, chief among them the avoidance of vehicle collisions and the substantial mitigation of harm when collisions are unavoidable. However, conservative worst-case-based control strategies come at the price of reduced flexibility and may compromise overall performance. In light of this, we investigate how Deep Reinforcement Learning (DRL) can be leveraged to improve safety in multi-vehicle-following scenarios involving emergency braking. Specifically, we investigate how DRL with vehicle-to-vehicle communication can be used to ethically select an emergency braking profile in scenarios where overall, or collective, three-vehicle harm reduction or collision avoidance is sought rather than that of a single vehicle. As an algorithm, we provide a hybrid approach that combines DRL with a previously published method based on analytical expressions for selecting optimal constant deceleration. By combining DRL with the previous method, the proposed hybrid approach increases the reliability compared to standalone DRL, while achieving superior performance in terms of overall harm reduction and collision avoidance.
中文摘要 网联自动驾驶车辆（CAV）有潜力提升驾驶安全，例如实现安全的车辆跟驰和更高效的交通调度。对于此类未来部署，必须满足安全要求，其中最主要的是避免车辆碰撞，以及在碰撞不可避免时大幅减轻伤害。然而，基于最坏情况的保守控制策略会以降低灵活性为代价，并可能影响整体性能。有鉴于此，我们研究如何利用深度强化学习（DRL）提升涉及紧急制动的多车跟驰场景中的安全性。具体而言，我们研究在追求整体（即集体的）三车减害或避撞、而非单车减害或避撞的场景中，如何利用结合车车通信的DRL以合乎伦理的方式选择紧急制动方案。在算法上，我们提出一种混合方法，将DRL与此前发表的、基于解析表达式选择最优恒定减速度的方法相结合。通过将DRL与该方法结合，所提混合方法相比单独使用DRL提高了可靠性，同时在整体减害和避撞方面取得了更优的性能。

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

奥林匹克级数学问题解决的长视野推理代理

Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.10739
Pdf link: https://arxiv.org/pdf/2512.10739
Abstract Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
中文摘要 大型语言模型（LLM）借助可验证奖励强化学习（RLVR）在解决复杂推理任务方面取得了显著进展。这一进步也离不开由可靠验证器自动完成的监督。然而，当前基于结果的验证器（OV）无法检查长思维链（CoT）中不可靠的中间步骤。与此同时，由于人工标注成本过高、高质量标注稀缺，当前基于过程的验证器（PV）难以可靠地检测复杂长CoT中的错误。因此，我们提出基于结果的过程验证器（Outcome-based Process Verifier，OPV），它对从长CoT总结出的结果的推理过程进行验证，从而实现准确且高效的验证，并支持大规模标注。为增强所提验证器的能力，我们采用带有专家标注的迭代式主动学习框架，以更低的标注成本逐步提升OPV的验证能力。具体而言，在每轮迭代中，对当前最佳OPV最不确定的样例进行标注，随后通过拒绝微调（RFT）和RLVR用这些样例训练新的OPV，供下一轮使用。大量实验证明了OPV的优越性能和广泛适用性。它在我们留出的OPV-Bench上取得了新的最先进结果，F1分数为83.1，优于Qwen3-Max-Preview等规模大得多的开源模型（76.3）。此外，OPV能有效检测合成数据集中的假阳性，与专家评估高度一致。与策略模型协作时，OPV持续带来性能提升，例如随着计算预算的增加，DeepSeek-R1-Distill-Qwen-32B在AIME2025上的准确率从55.2%提升至73.3%。

Learning to Split: A Reinforcement-Learning-Guided Splitting Heuristic for Neural Network Verification

学习分裂：一种强化学习引导的神经网络验证分裂启发式方法

Authors: Maya Swisa, Guy Katz
Subjects: Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2512.10747
Pdf link: https://arxiv.org/pdf/2512.10747
Abstract State-of-the-art neural network verifiers operate by encoding neural network verification as constraint satisfaction problems. When dealing with standard piecewise-linear activation functions, such as ReLUs, verifiers typically employ branching heuristics that break a complex constraint satisfaction problem into multiple, simpler problems. The verifier's performance depends heavily on the order in which this branching is performed: a poor selection may give rise to exponentially many sub-problems, hampering scalability. Here, we focus on the setting where multiple verification queries must be solved for the same neural network. The core idea is to use past experience to make good branching decisions, expediting verification. We present a reinforcement-learning-based branching heuristic that achieves this by applying a learning-from-demonstrations technique (DQfD). Our experimental evaluation demonstrates a substantial reduction in average verification time and in the average number of iterations required, compared to modern splitting heuristics. These results highlight the great potential of reinforcement learning in the context of neural network verification.
中文摘要 最先进的神经网络验证器通过将神经网络验证编码为约束满足问题来工作。在处理标准分段线性激活函数（如ReLUs）时，验证者通常采用分支启发式方法，将复杂的约束满足问题拆解为多个更简单的问题。验证器的表现很大程度上取决于分支执行的顺序：选择不当可能导致子问题呈指数级增长，影响可扩展性。这里，我们重点关注同一神经网络中必须解决多个验证查询的情境。核心理念是利用以往经验做出良好的分支决策，加快验证速度。我们提出了一种基于强化学习的分支启发式方法，通过应用演示学习（DQfD）技术实现这一目标。我们的实验评估显示，与现代分裂启发式相比，平均验证时间和平均迭代次数显著减少。这些结果凸显了强化学习在神经网络验证背景下的巨大潜力。

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

OPV：基于结果的过程验证器，用于高效的长思维链验证

Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10756
Pdf link: https://arxiv.org/pdf/2512.10756
Abstract Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
中文摘要 大型语言模型（LLM）借助可验证奖励强化学习（RLVR）在解决复杂推理任务方面取得了显著进展。这一进步也离不开由可靠验证器自动完成的监督。然而，当前基于结果的验证器（OV）无法检查长思维链（CoT）中不可靠的中间步骤。与此同时，由于人工标注成本过高、高质量标注稀缺，当前基于过程的验证器（PV）难以可靠地检测复杂长CoT中的错误。因此，我们提出了基于结果的过程验证器（OPV），它对从长CoT总结出的结果的推理过程进行验证，从而实现准确且高效的验证，并支持大规模标注。为增强所提验证器的能力，我们采用带有专家标注的迭代式主动学习框架，以更低的标注成本逐步提升OPV的验证能力。具体而言，在每轮迭代中，对当前最佳OPV最不确定的样例进行标注，随后通过拒绝微调（RFT）和RLVR用这些样例训练新的OPV，供下一轮使用。大量实验证明了OPV的优越性能和广泛适用性。它在我们留出的OPV-Bench上取得了新的最先进结果，F1分数为83.1，优于Qwen3-Max-Preview等规模大得多的开源模型（76.3）。此外，OPV能有效检测合成数据集中的假阳性，与专家评估高度一致。在与策略模型协作时，OPV持续带来性能提升，例如随着计算预算的增加，DeepSeek-R1-Distill-Qwen-32B在AIME2025上的准确率从55.2%提升至73.3%。
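
The selection step of the active-learning loop ("the most uncertain cases of the current best OPV are annotated") can be illustrated as follows. The abstract does not say how uncertainty is measured, so this sketch assumes the verifier emits a probability that a solution is correct and treats values nearest 0.5 as most uncertain; the function name and data layout are hypothetical.

```python
def most_uncertain(cases, k):
    """Pick the k cases whose predicted correctness is closest to 0.5.

    `cases` is a list of (case_id, p_correct) pairs, where p_correct is the
    verifier's estimated probability that the solution is correct.
    """
    ranked = sorted(cases, key=lambda case: abs(case[1] - 0.5))
    return [case_id for case_id, _ in ranked[:k]]
```

The selected cases would then go to expert annotators, and the newly labeled data would feed the next round of RFT and RLVR training described in the abstract.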

Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

在多智能体环境中学习可控且多样化的玩家行为

Authors: Atahan Cilan, Atay Özgövde
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10835
Pdf link: https://arxiv.org/pdf/2512.10835
Abstract This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
中文摘要 本文提出了一个强化学习框架，能够在不依赖人类游戏数据的情况下实现可控且多样化的玩家行为。现有方法通常需要大规模的玩家轨迹，或为不同玩家类型训练独立模型，或无法在可解释的行为参数与所学策略之间提供直接映射，从而限制了其可扩展性和可控性。我们在N维连续空间中定义玩家行为，并从一个包含真实人类风格子集的区域中均匀采样目标行为向量。在训练过程中，每个智能体同时接收其当前行为向量和目标行为向量作为输入，奖励则基于二者之间距离的归一化减小量。这使得策略能够学习动作如何影响行为统计量，从而实现对攻击性、机动性和合作性等属性的平滑控制。单个基于PPO的多智能体策略无需重新训练即可复现新的或未见过的游戏风格。在一个自定义多人Unity游戏中进行的实验表明，所提框架产生的行为多样性显著高于仅以获胜为目标的基线，并且能在多种目标下可靠地匹配指定的行为向量。该方法为自动化游戏测试、游戏平衡、类人行为模拟以及在线游戏中替换掉线玩家提供了可扩展的解决方案。
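
The reward described here ("the normalized reduction in distance" between the current and target behavior vectors) might look like the sketch below. The abstract does not state the normalizer, so dividing by the previous distance is an assumption, as are the function name and the use of Euclidean distance.

```python
import math


def behavior_reward(prev_behavior, curr_behavior, target, eps=1e-8):
    """Reward the fraction of the remaining distance to the target closed this step.

    Positive when the behavior vector moved toward the target, negative when
    it moved away. Normalizing by the previous distance is an assumption.
    """
    d_prev = math.dist(prev_behavior, target)
    d_curr = math.dist(curr_behavior, target)
    return (d_prev - d_curr) / (d_prev + eps)
```

With this normalization, closing half of the remaining gap yields a reward near 0.5 whether the agent is far from or close to its target style, which keeps the signal on a comparable scale across targets.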

Iterative Compositional Data Generation for Robot Control

面向机器人控制的迭代式组合数据生成

Authors: Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10891
Pdf link: https://arxiv.org/pdf/2512.10891
Abstract Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
中文摘要 收集机器人操作数据成本高昂，因此对于多物体、多机器人、多环境设置中出现的组合爆炸式的任务空间，逐一获取演示并不现实。虽然近期的生成模型可以为单个任务合成有用的数据，但它们未能利用机器人领域的组合结构，难以泛化到未见过的任务组合。我们提出了一种语义组合扩散Transformer，将状态转移分解为机器人、物体、障碍物和目标各自对应的组件，并通过注意力学习它们之间的交互。在有限的任务子集上训练后，我们证明该模型能够零样本生成高质量的状态转移，并可据此为未见过的任务组合学习控制策略。随后，我们引入一种迭代式自我改进流程，通过离线强化学习验证合成数据，并将其纳入后续的训练轮次。我们的方法相比整体式基线和硬编码的组合基线显著提升了零样本性能，最终解决了几乎所有留出任务，并表明所学表征中涌现出有意义的组合结构。

Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation

数字孪生监督强化学习框架，用于自主水下导航

Authors: Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.10925
Pdf link: https://arxiv.org/pdf/2512.10925
Abstract Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.
中文摘要 由于缺乏GPS、能见度下降以及水下障碍物的存在，水下环境中的自主导航仍是一项重大挑战。本文以广泛用于科学实验的开放平台BlueROV2为例探讨这些问题。我们提出了一种基于近端策略优化（PPO）算法的深度强化学习方法，其观测空间结合了面向目标的导航信息、虚拟占据栅格以及沿作业区域边界的射线投射。所学策略与一种确定性运动学参考规划器，即动态窗口法（Dynamic Window Approach，DWA）进行比较，后者常被用作避障的稳健基线。评估在逼真的仿真环境中进行，并辅以在实体BlueROV2上的验证，验证过程由测试场地的三维数字孪生进行监督，有助于降低真实环境实验的风险。结果表明，PPO策略在高度杂乱的环境中始终优于DWA，这主要得益于更好的局部适应性和更少的碰撞。最后，实验展示了所学行为从仿真到真实世界的可迁移性，证实了深度强化学习对水下机器人自主导航的适用性。

Curriculum-Based Reinforcement Learning for Autonomous UAV Navigation in Unknown Curved Tubular Conduit

基于课程的强化学习，用于未知弯曲管状管道中的自主无人机导航

Authors: Zamirddine Mari, Jérôme Pasquet, Julien Seinturier
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10934
Pdf link: https://arxiv.org/pdf/2512.10934
Abstract Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to continuous physical dynamics. The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow, perceptually poor conduits represents a central challenge.
中文摘要 由于管道几何结构的约束、壁面的贴近以及此类场景固有的感知限制，在受限管状环境中的无人机自主导航仍是一项重大挑战。我们提出一种强化学习方法，使无人机能够在对管道几何没有任何先验知识的情况下穿行未知的三维管道，仅依靠激光雷达的局部观测和对管道中心的条件性视觉检测。相比之下，用作确定性基线的纯追踪（Pure Pursuit）算法可以显式获取中心线，由此形成的信息不对称旨在评估强化学习弥补几何模型缺失的能力。智能体通过渐进式课程学习策略进行训练，逐步接触曲率越来越大的几何形状，其中管道中心经常从视野中消失。一种结合直接可见性、方向记忆和激光雷达对称性线索的转弯通过机制，对于在这类部分可观测条件下保证稳定导航至关重要。实验表明，PPO策略习得了稳健且可泛化的行为，尽管可获取的几何信息有限，仍始终优于确定性控制器。在高保真三维环境中的验证进一步证实了所学行为可迁移到连续物理动力学中。因此，该方法为未知管状环境中的自主导航提供了完整框架，并为工业、地下或医疗应用开辟了前景，在这些应用中，穿行狭窄且感知条件差的管道是核心挑战。
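
A progressive curriculum of the kind described (gradually exposing the agent to more strongly curved tubes) is often driven by a simple promotion rule. The scheduler below is a generic sketch, not the paper's procedure: the success-rate criterion, the 0.8 threshold, the window size, and the idea of indexing stages by a curvature value are all assumptions.

```python
from collections import deque


class CurriculumScheduler:
    """Advance to a harder stage once the recent success rate is high enough."""

    def __init__(self, stages, promote_at=0.8, window=20):
        self.stages = list(stages)    # e.g. maximum tube curvature per stage
        self.promote_at = promote_at  # hypothetical promotion threshold
        self.window = window          # number of recent episodes considered
        self.index = 0
        self.recent = deque(maxlen=window)

    @property
    def current(self):
        return self.stages[self.index]

    def record(self, success):
        """Log one episode outcome; promote when the window is full and good."""
        self.recent.append(bool(success))
        full = len(self.recent) == self.window
        if full and sum(self.recent) / self.window >= self.promote_at:
            if self.index < len(self.stages) - 1:
                self.index += 1
                self.recent.clear()  # judge the new stage from scratch
        return self.current
```

In a training loop, `record` would be called once per episode and its return value used to configure the next environment, so harder geometries appear only after the easier ones are handled reliably.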

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

我们准备好在文本到三维生成中使用强化学习了吗？一项渐进式研究

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.10949
Pdf link: https://arxiv.org/pdf/2512.10949
Abstract Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at this https URL.
中文摘要 强化学习（RL）此前已被证明在大型语言模型和多模态模型中有效，最近又被成功扩展用于增强二维图像生成。然而，由于三维物体的空间复杂度更高，需要全局一致的几何结构和细粒度的局部纹理，将强化学习应用于三维生成在很大程度上仍未得到探索。这使得三维生成对奖励设计和强化学习算法极为敏感。为应对这些挑战，我们首次从多个维度系统地研究了用于文本到三维自回归生成的强化学习。（1）奖励设计：我们评估了奖励维度和模型选择，表明与人类偏好对齐至关重要，且通用多模态模型能为三维属性提供稳健的信号。（2）强化学习算法：我们研究了GRPO的多种变体，强调token级优化的有效性，并进一步研究了训练数据和迭代次数的扩展规律。（3）文本到三维基准：由于现有基准无法衡量三维生成模型的隐式推理能力，我们引入了MME-3DR。（4）高级强化学习范式：受三维生成天然层级结构的启发，我们提出了Hi-GRPO，通过专门的奖励集成来优化从全局到局部的分层三维生成。基于这些见解，我们开发了AR3D-R1，这是首个经强化学习增强的文本到三维模型，擅长从粗略形状到纹理细化的全过程。我们希望这项研究能为强化学习驱动的三维生成推理提供见解。代码发布于此 https URL。
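Note on the GRPO reference above: the abstract does not spell out the objective, so the following is only a minimal sketch of the commonly used group-relative advantage, plus one possible reading of "token-level" optimization (averaging over all tokens in a group instead of per sequence). The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    Assumption: population std and a small eps for numerical stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_level_mean(per_token_values):
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_values]
    return sum(per_seq) / len(per_seq)


def token_level_mean(per_token_values):
    """Average over every token in the group, so each token has equal weight."""
    flat = [v for seq in per_token_values for v in seq]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    # Hypothetical rewards for four generations sampled from one prompt.
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    print(advantages)
    # A long sequence and a short one: the two averaging schemes disagree.
    values = [[1.0, 1.0, 1.0, 1.0], [-1.0]]
    print(sequence_level_mean(values), token_level_mean(values))
```

With sequence-level averaging a long and a short output contribute equally; with token-level averaging the long output dominates in proportion to its length, which is the practical difference between the two schemes.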

Keyword: diffusion policy

Push Smarter, Not Harder: Hierarchical RL-Diffusion Policy for Efficient Nonprehensile Manipulation

推得更巧，而非更用力：用于高效非抓握操作的分层强化学习-扩散策略

Authors: Steven Caro, Stephen L. Smith
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10099
Pdf link: https://arxiv.org/pdf/2512.10099
Abstract Nonprehensile manipulation, such as pushing objects across cluttered environments, presents a challenging control problem due to complex contact dynamics and long-horizon planning requirements. In this work, we propose HeRD, a hierarchical reinforcement learning-diffusion policy that decomposes pushing tasks into two levels: high-level goal selection and low-level trajectory generation. We employ a high-level reinforcement learning (RL) agent to select intermediate spatial goals, and a low-level goal-conditioned diffusion model to generate feasible, efficient trajectories to reach them. This architecture combines the long-term reward maximizing behaviour of RL with the generative capabilities of diffusion models. We evaluate our method in a 2D simulation environment and show that it outperforms the state-of-the-art baseline in success rate, path efficiency, and generalization across multiple environment configurations. Our results suggest that hierarchical control with generative low-level planning is a promising direction for scalable, goal-directed nonprehensile manipulation. Code, documentation, and trained models are available: this https URL.
中文摘要 非抓握操作（例如在杂乱环境中推动物体）由于接触动力学复杂且需要长时程规划，是一个具有挑战性的控制问题。在本研究中，我们提出了HeRD，一种分层强化学习-扩散策略，将推动任务分解为两个层级：高层目标选择和低层轨迹生成。我们使用高层强化学习（RL）智能体选择中间空间目标，并使用低层目标条件扩散模型生成可行且高效的轨迹以到达这些目标。该架构结合了强化学习的长期奖励最大化行为与扩散模型的生成能力。我们在二维仿真环境中评估该方法，结果表明其在成功率、路径效率以及跨多种环境配置的泛化能力方面均优于最先进的基线。我们的结果表明，将分层控制与生成式低层规划相结合，是实现可扩展、目标导向的非抓握操作的一个有前景的方向。代码、文档和训练好的模型可在此获取：此 https URL。
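Note on the two-level decomposition described above: the sketch below only illustrates the control flow (a high-level policy picks an intermediate goal, a low-level generator produces a trajectory toward it). Both components are placeholders: `select_subgoal` is a hand-written rule standing in for the RL agent, and `generate_trajectory` is plain linear interpolation standing in for the goal-conditioned diffusion model. All names and parameters are hypothetical and are not taken from the HeRD code.

```python
import math


def select_subgoal(position, goal, max_step=1.0):
    """Placeholder for the high-level RL agent: move at most max_step toward the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return goal
    scale = max_step / dist
    return (position[0] + dx * scale, position[1] + dy * scale)


def generate_trajectory(position, subgoal, n_points=5):
    """Placeholder for the low-level diffusion model: straight-line waypoints."""
    return [
        (
            position[0] + (subgoal[0] - position[0]) * k / n_points,
            position[1] + (subgoal[1] - position[1]) * k / n_points,
        )
        for k in range(1, n_points + 1)
    ]


def run_episode(start, goal, tol=1e-6, max_subgoals=100):
    """Alternate subgoal selection and trajectory generation until the goal is reached."""
    position = start
    path = [start]
    for _ in range(max_subgoals):
        if math.dist(position, goal) <= tol:
            break
        subgoal = select_subgoal(position, goal)
        trajectory = generate_trajectory(position, subgoal)
        path.extend(trajectory)
        position = trajectory[-1]
    return path


if __name__ == "__main__":
    path = run_episode((0.0, 0.0), (3.0, 4.0))
    print(len(path), path[-1])
```

In a learned system the two placeholders would be replaced by trained models, while the outer loop keeps the same shape.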

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

ImplicitRDP：一种采用结构化慢-快学习的端到端视觉-力扩散策略

Authors: Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.10946
Pdf link: https://arxiv.org/pdf/2512.10946
Abstract Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at this https URL.
中文摘要 人类水平的接触丰富操作依赖于两种关键模态各自不同的作用：视觉提供空间信息丰富但时间上较慢的全局上下文，而力觉传感捕捉快速、高频的局部接触动态。由于这些信号在频率和信息量上存在根本差异，将它们整合起来具有挑战性。在本研究中，我们提出了ImplicitRDP，一种统一的端到端视觉-力扩散策略，将视觉规划与反应式力控制整合在单一网络中。我们引入了结构化慢-快学习（Structural Slow-Fast Learning），这是一种利用因果注意力同时处理异步视觉和力token的机制，使策略能够以力信号的频率进行闭环调整，同时保持动作块的时间一致性。此外，为了缓解端到端模型无法调整不同模态间权重所导致的模态崩溃，我们提出了基于虚拟目标的表征正则化（Virtual-target-based Representation Regularization）。该辅助目标将力反馈映射到与动作相同的空间，提供比原始力预测更强、更具物理依据的学习信号。在接触丰富任务上的大量实验表明，ImplicitRDP显著优于仅视觉基线和分层基线，以精简的训练流程实现了更优的反应性和成功率。代码和视频将在此 https URL 公开发布。
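Note on the causal attention mentioned above: the abstract does not describe how visual and force tokens are ordered or masked, so this is only a generic sketch of a causal mask over an asynchronous token stream. It assumes each token carries a timestamp and may attend to any token that is not from its future. The token layout and the function name are hypothetical.

```python
def build_causal_mask(timestamps):
    """Return mask[i][j] = True when token i may attend to token j (t_j <= t_i)."""
    return [[t_j <= t_i for t_j in timestamps] for t_i in timestamps]


if __name__ == "__main__":
    # Hypothetical stream: one slow visual token, then force tokens at a higher rate.
    tokens = [("vision", 0), ("force", 0), ("force", 1), ("force", 2), ("force", 3)]
    mask = build_causal_mask([t for _, t in tokens])
    for (name, t), row in zip(tokens, mask):
        print(f"{name}@{t}:", "".join("x" if allowed else "." for allowed in row))
```

Under this masking, every force token can see the earlier visual token, while no token sees force readings that arrive later, which is one way a policy could react at the force rate without leaking future information.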