Arxiv Papers of Today

Generated: 2025-10-28 16:31:57 (UTC+8); Arxiv release time: 2025-10-28 20:00 EDT (2025-10-29 08:00 UTC+8)

70 relevant papers today

Keyword: reinforcement learning

Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review

Authors: Kumater Ter, RexCharles Donatus, Ore-Ofe Ajayi, Daniel Udekwe
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21758
Pdf link: https://arxiv.org/pdf/2510.21758
Abstract Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.

Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

Authors: Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21807
Pdf link: https://arxiv.org/pdf/2510.21807
Abstract Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real world multimodal scenarios, most notably, vision language tasks, due to a heavy focus on single modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate the model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross task scenarios.

Embodied Navigation with Auxiliary Task of Action Description Prediction

Authors: Haru Kondoh, Asako Kanezaki
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.21809
Pdf link: https://arxiv.org/pdf/2510.21809
Abstract The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.

GAPO: Group Adaptive Policy Optimization for Real-World Code Edit

Authors: Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21830
Pdf link: https://arxiv.org/pdf/2510.21830
Abstract Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
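The advantage computation described in the GAPO abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the highest-density interval is the narrowest window holding a fixed fraction (`mass`, a hypothetical parameter) of the group's sorted rewards, and that advantages are still scaled by the group standard deviation as in GRPO. The function names are made up for this example.

```python
import math
import statistics


def highest_density_interval(rewards, mass=0.5):
    """Return the narrowest run of sorted rewards holding `mass` of the group.

    Assumption: the HDI is approximated by a sliding window over sorted samples.
    """
    ordered = sorted(rewards)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of samples the window must hold
    # Slide a window of k samples and keep the one with the smallest width.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start:start + k]


def adaptive_advantages(rewards, mass=0.5, eps=1e-8):
    """Centre advantages on the HDI median (the adaptive Q), not the group mean."""
    q = statistics.median(highest_density_interval(rewards, mass))
    scale = statistics.pstdev(rewards) + eps  # assumed GRPO-style normalisation
    return [(r - q) / scale for r in rewards]
```

For a group such as `[0.1, 0.2, 0.2, 0.3, 10.0]`, the group mean (2.16) would give every non-outlier sample a negative advantage, whereas centring on the HDI median (0.2) keeps the 0.3 sample positive.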

SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization

Authors: Kaiyi Xu, Junchao Gong, Wenlong Zhang, Ben Fei, Lei Bai, Wanli Ouyang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21847
Pdf link: https://arxiv.org/pdf/2510.21847
Abstract Precipitation nowcasting based on radar echoes plays a crucial role in monitoring extreme weather and supporting disaster prevention. Although deep learning approaches have achieved significant progress, they still face notable limitations. For example, deterministic models tend to produce over-smoothed predictions, which struggle to capture extreme events and fine-scale precipitation patterns. Probabilistic generative models, due to their inherent randomness, often show fluctuating performance across different metrics and rarely achieve consistently optimal results. Furthermore, precipitation nowcasting is typically evaluated using multiple metrics, some of which are inherently conflicting. For instance, there is often a trade-off between the Critical Success Index (CSI) and the False Alarm Ratio (FAR), making it challenging for existing models to deliver forecasts that perform well on both metrics simultaneously. To address these challenges, we introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. Specifically, we propose SynCast, a method that employs the two-stage post-training framework of Diffusion Sequential Preference Optimization (Diffusion-SPO), to progressively align conflicting metrics and consistently achieve superior performance. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms. Building on this foundation, the second stage further optimizes CSI with constraints that preserve FAR alignment, thereby achieving synergistic improvements across these conflicting metrics.
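To make the two metrics named in the SynCast abstract concrete, here is a small sketch using the standard contingency-table definitions of CSI and FAR. The abstract does not say how radar echoes are thresholded into events, so the binary per-pixel inputs are an assumption, and the function name is hypothetical.

```python
def csi_and_far(predicted, observed):
    """Compute (CSI, FAR) from two equal-length sequences of 0/1 event flags."""
    pairs = list(zip(predicted, observed))
    hits = sum(1 for p, o in pairs if p and o)
    false_alarms = sum(1 for p, o in pairs if p and not o)
    misses = sum(1 for p, o in pairs if not p and o)

    # CSI = hits / (hits + misses + false alarms)
    csi_den = hits + misses + false_alarms
    csi = hits / csi_den if csi_den else float("nan")

    # FAR = false alarms / (hits + false alarms)
    far_den = hits + false_alarms
    far = false_alarms / far_den if far_den else float("nan")
    return csi, far
```

Predicting rain more often tends to add hits, which raises CSI, but it also tends to add false alarms, which raises FAR. That is the tension the abstract refers to.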

SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Authors: Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.21850
Pdf link: https://arxiv.org/pdf/2510.21850
Abstract Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.

Computational Hardness of Reinforcement Learning with Partial $q^{\pi}$-Realizability

Authors: Shayan Karimi, Xiaoqi Tan
Subjects: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21888
Pdf link: https://arxiv.org/pdf/2510.21888
Abstract This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that all value functions for policies in $\Pi$ are linearly realizable. The assumptions of this framework are weaker than those in $q^{\pi}$-realizability but stronger than those in $q^*$-realizability, providing a practical model where function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy policy set (argmax) and show that - unless NP = RP - an exponential lower bound (in feature vector dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^*$-realizability and suggest computational difficulty persists even when $\Pi$ is expanded beyond the optimal policy. To establish this, we reduce from two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT(b), to instances of GLinear-$\kappa$-RL (greedy policy) and SLinear-$\kappa$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable in partial $q^{\pi}$-realizability, in contrast to $q^{\pi}$-realizability under a generative access model.

Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

Authors: Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21978
Pdf link: https://arxiv.org/pdf/2510.21978
Abstract Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, where models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are calculated on the current task, thus they do not guarantee broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts in an online manner using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks based on Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.

Is Temporal Difference Learning the Gold Standard for Stitching in RL?

Authors: Michał Bortkiewicz, Władysław Pałucki, Mateusz Ostaszewski, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.21995
Pdf link: https://arxiv.org/pdf/2510.21995
Abstract Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed as the purview of temporal difference (TD) methods. However, outside of small tabular settings, trajectories never intersect, calling into question this conventional wisdom. Moreover, the common belief is that Monte Carlo (MC) methods should not be able to recombine experience, yet it remains unclear whether function approximation could result in a form of implicit stitching. The goal of this paper is to empirically study whether the conventional wisdom about stitching actually holds in settings where function approximation is used. We empirically demonstrate that Monte Carlo (MC) methods can also achieve experience stitching. While TD methods do achieve slightly stronger capabilities than MC methods (in line with conventional wisdom), that gap is significantly smaller than the gap between small and large neural networks (even on quite simple tasks). We find that increasing critic capacity effectively reduces the generalization gap for both the MC and TD methods. These results suggest that the traditional TD inductive bias for stitching may be less necessary in the era of large models for RL and, in some cases, may offer diminishing returns. Additionally, our results suggest that stitching, a form of generalization unique to the RL setting, might be achieved not through specialized algorithms (temporal difference learning) but rather through the same recipe that has provided generalization in other machine learning settings (via scale). Project website: this https URL
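The difference between the two families of methods compared in this abstract comes down to the regression target used for the value function. The sketch below shows the textbook Monte Carlo return and one-step TD target for a single trajectory. The function names and the `gamma` default are illustrative assumptions, and nothing here reproduces the paper's experimental setup.

```python
def monte_carlo_targets(rewards, gamma=0.99):
    """Discounted return-to-go for each step, using only this trajectory."""
    targets = []
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
        targets.append(ret)
    return targets[::-1]


def td0_targets(rewards, next_values, gamma=0.99):
    """One-step TD targets: reward plus the discounted value estimate of the next state.

    `next_values[t]` is the current estimate V(s_{t+1}); use 0.0 for terminal states.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]
```

Because `next_values` comes from a learned estimate, it can carry information gathered on other trajectories that passed through the same or similar states, which is the usual argument for why TD methods stitch. The Monte Carlo target never looks outside its own trajectory, so any stitching there would have to come from the function approximator.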

Do You Trust the Process?: Modeling Institutional Trust for Community Adoption of Reinforcement Learning Policies

Authors: Naina Balepur, Xingrui Pei, Hari Sundaram
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2510.22017
Pdf link: https://arxiv.org/pdf/2510.22017
Abstract Many governmental bodies are adopting AI policies for decision-making. In particular, Reinforcement Learning has been used to design policies that citizens would be expected to follow if implemented. Much RL work assumes that citizens follow these policies, and evaluates them with this in mind. However, we know from prior work that without institutional trust, citizens will not follow policies put in place by governments. In this work, we develop a trust-aware RL algorithm for resource allocation in communities. We consider the case of humanitarian engineering, where the organization is aiming to distribute some technology or resource to community members. We use a Deep Deterministic Policy Gradient approach to learn a resource allocation that fits the needs of the organization. Then, we simulate resource allocation according to the learned policy, and model the changes in institutional trust of community members. We investigate how this incorporation of institutional trust affects outcomes, and ask how effectively an organization can learn policies if trust values are private. We find that incorporating trust into RL algorithms can lead to more successful policies, specifically when the organization's goals are less certain. We find more conservative trust estimates lead to increased fairness and average community trust, though organization success suffers. Finally, we explore a strategy to prevent unfair outcomes to communities. We implement a quota system by an external entity which decreases the organization's utility when it does not serve enough community members. We find this intervention can improve fairness and trust among communities in some cases, while decreasing the success of the organization. This work underscores the importance of institutional trust in algorithm design and implementation, and identifies a tension between organization success and community well-being.

Online Optimization for Offline Safe Reinforcement Learning

Authors: Yassine Chemingui, Aryan Deshwal, Alan Fern, Thanh Nguyen-Tang, Janardhan Rao Doppa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.22027
Pdf link: https://arxiv.org/pdf/2510.22027
Abstract We study the problem of Offline Safe Reinforcement Learning (OSRL), where the goal is to learn a reward-maximizing policy from fixed data under a cumulative cost constraint. We propose a novel OSRL approach that frames the problem as a minimax objective and solves it by combining offline RL with online optimization algorithms. We prove the approximate optimality of this approach when integrated with an approximate offline RL oracle and no-regret online optimization. We also present a practical approximation that can be combined with any offline RL algorithm, eliminating the need for offline policy evaluation. Empirical results on the DSRL benchmark demonstrate that our method reliably enforces safety constraints under stringent cost budgets, while achieving high rewards. The code is available at this https URL.
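One common way to read "a minimax objective solved by combining offline RL with online optimization" is a Lagrangian primal-dual loop: an oracle best-responds to a cost-penalised reward, and an online learner adjusts the penalty weight. The abstract does not spell out the paper's exact objective, so the sketch below is a generic illustration under that assumption. The "oracle" is a toy lookup over candidate policies with known expected reward and cost, and all names and parameters are hypothetical.

```python
def lagrangian_loop(candidates, budget, steps=200, lr=0.1, lam_max=10.0):
    """Toy primal-dual loop for max reward subject to expected cost <= budget.

    candidates: dict mapping a policy name to (expected_reward, expected_cost).
    Returns the sequence of chosen policies and the final multiplier.
    """
    lam = 0.0
    picks = []
    for _ in range(steps):
        # Stand-in for an offline RL oracle: best response to reward - lam * cost.
        name = max(candidates, key=lambda k: candidates[k][0] - lam * candidates[k][1])
        picks.append(name)
        # Online projected gradient step on the multiplier:
        # increase lam when the chosen policy overspends, decrease otherwise.
        cost = candidates[name][1]
        lam = min(lam_max, max(0.0, lam + lr * (cost - budget)))
    return picks, lam
```

With candidates `{"risky": (10.0, 5.0), "safe": (6.0, 1.0)}` and a budget of 2.0, neither policy alone is both feasible and reward-maximising. The loop alternates between them so that the average cost of the chosen policies settles near the budget, which is why such schemes typically return a mixture of iterates.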

Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Authors: Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22028
Pdf link: https://arxiv.org/pdf/2510.22028
Abstract Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.

Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability

Authors: Po-Chen Kuo, Han Hou, Will Dabney, Edgar Y. Walker
Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2510.22039
Pdf link: https://arxiv.org/pdf/2510.22039
Abstract Learning a compact representation of history is critical for planning and generalization in partially observable environments. While meta-reinforcement learning (RL) agents can attain near Bayes-optimal policies, they often fail to learn the compact, interpretable Bayes-optimal belief states. This representational inefficiency potentially limits the agent's adaptability and generalization capacity. Inspired by predictive coding in neuroscience--which suggests that the brain predicts sensory inputs as a neural implementation of Bayesian inference--and by auxiliary predictive objectives in deep RL, we investigate whether integrating self-supervised predictive coding modules into meta-RL can facilitate learning of Bayes-optimal representations. Through state machine simulation, we show that meta-RL with predictive modules consistently generates more interpretable representations that better approximate Bayes-optimal belief states compared to conventional meta-RL across a wide variety of tasks, even when both achieve optimal policies. In challenging tasks requiring active information seeking, only meta-RL with predictive modules successfully learns optimal representations and policies, whereas conventional meta-RL struggles with inadequate representation learning. Finally, we demonstrate that better representation learning leads to improved generalization. Our results strongly suggest the role of predictive learning as a guiding principle for effective representation learning in agents navigating partial observability.
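For readers unfamiliar with the term, a Bayes-optimal belief state is the posterior over hidden states given the observation history. The sketch below shows a single Bayes-rule update for a discrete hidden state with no state transitions. It only illustrates the quantity the learned representations are compared against, not the paper's tasks or architecture, and the function name is hypothetical.

```python
def update_belief(belief, likelihoods):
    """One Bayes-rule step: posterior is proportional to prior times likelihood.

    belief: prior probabilities over hidden states (sums to 1).
    likelihoods: P(observation | state) for the observation just received.
    """
    unnormalised = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the prior")
    return [u / total for u in unnormalised]
```

Applying the update repeatedly compresses an arbitrarily long observation history into one fixed-size vector, which is the sense in which the belief state is a compact representation.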

Agentic Reinforcement Learning for Real-World Code Repair

Authors: Siyu Zhu, Anastasiya Karpovich, Albert Chen, Jessica Koscheka, Shailesh Jannu, Di Wen, Yuqing Zhu, Rohit Jain, Alborz Geramifard
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22075
Pdf link: https://arxiv.org/pdf/2510.22075
Abstract We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments for building reliable real-world code-fixing agents.

STAR-RIS-assisted Collaborative Beamforming for Low-altitude Wireless Networks

Authors: Xinyue Liang, Hui Kang, Junwei Che, Jiahui Li, Geng Sun, Qingqing Wu, Jiacheng Wang, Dusit Niyato
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.22108
Pdf link: https://arxiv.org/pdf/2510.22108
Abstract While low-altitude wireless networks (LAWNs) based on uncrewed aerial vehicles (UAVs) offer high mobility, flexibility, and coverage for urban communications, they face severe signal attenuation in dense environments due to obstructions. To address this critical issue, we consider introducing collaborative beamforming (CB) of UAVs and omnidirectional reconfigurable beamforming (ORB) of simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) to enhance the signal quality and directionality. On this basis, we formulate a joint rate and energy optimization problem (JREOP) to maximize the transmission rate of the overall system, while minimizing the energy consumption of the UAV swarm. Due to the non-convex and NP-hard nature of JREOP, we propose a heterogeneous multi-agent collaborative dynamic (HMCD) optimization framework, which has two core components. The first component is a simulated annealing (SA)-based STAR-RIS control method, which dynamically optimizes reflection and transmission coefficients to enhance signal propagation. The second component is an improved multi-agent deep reinforcement learning (MADRL) control method, which incorporates a self-attention evaluation mechanism to capture interactions between UAVs and an adaptive velocity transition mechanism to enhance training stability. Simulation results demonstrate that HMCD outperforms various baselines in terms of convergence speed, average transmission rate, and energy consumption. Further analysis reveals that the average transmission rate of the overall system scales positively with both UAV count and STAR-RIS element numbers.
中文摘要 虽然基于无人机（UAV）的低空无线网络（LAWN）为城市通信提供了高移动性、灵活性和覆盖范围，但它们在密集环境中因障碍物而面临严重的信号衰减。为了解决这一关键问题，我们考虑引入无人机的协同波束成形（CB）和同时发射和反射可重构智能表面的全向可重构波束成形（ORB），以提高信号质量和方向性。在此基础上，我们提出了联合速率和能量优化问题（JREOP），以最大限度地提高整个系统的传输速率，同时最小化无人机群的能耗。由于JREOP的非凸性和NP硬性，我们提出了一个异构多智能体协同动态（HMCD）优化框架，该框架具有两个核心组件。第一个组件是基于模拟退火（SA）的STAR-RIS控制方法，该方法动态优化反射和传输系数以增强信号传播。第二个组成部分是改进的多智能体深度强化学习（MADRL）控制方法，该方法结合了捕捉无人机之间交互的自注意力评估机制和增强训练稳定性的自适应速度转换机制。仿真结果表明，HMCD在收敛速度、平均传输速率和能耗方面优于各种基线。进一步分析表明，整个系统的平均传输速率随无人机数量和STAR-RIS元件数量呈正向比例。
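
Illustrative sketch (not from the paper): the abstract states only that a simulated annealing (SA) method tunes the STAR-RIS coefficients. The toy below shows what an SA loop of that kind could look like under strong assumptions: each surface element has one discrete phase shift, the channel is a fixed list of complex gains, and the objective is the power of the coherently combined signal. The names `received_power` and `anneal_phases`, the cooling schedule, and the objective are all hypothetical and do not reproduce the HMCD framework.

```python
import cmath
import math
import random


def received_power(channel, phases):
    """Toy objective: power of the signal after per-element phase shifts."""
    total = sum(h * cmath.exp(1j * p) for h, p in zip(channel, phases))
    return abs(total) ** 2


def anneal_phases(channel, levels=8, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over discrete phase shifts (hypothetical setup)."""
    rng = random.Random(seed)
    choices = [2 * math.pi * k / levels for k in range(levels)]
    current = [rng.choice(choices) for _ in channel]
    current_val = received_power(channel, current)
    initial_val = current_val
    best, best_val = list(current), current_val
    temp = t0
    for _ in range(steps):
        # Propose a new phase for one randomly chosen element.
        i = rng.randrange(len(channel))
        candidate = list(current)
        candidate[i] = rng.choice(choices)
        cand_val = received_power(channel, candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_val = candidate, cand_val
            if current_val > best_val:
                best, best_val = list(current), current_val
        temp *= cooling  # geometric cooling
    return best, best_val, initial_val
```

Calling `anneal_phases(channel)` with a list of complex gains returns the best phase vector found, its objective value, and the value of the random starting point. Because the best-so-far solution is tracked separately, the returned value is never worse than the start.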

EasyUUV: An LLM-Enhanced Universal and Lightweight Sim-to-Real Reinforcement Learning Framework for UUV Attitude Control

EasyUUV：用于 UUV 姿态控制的 LLM 增强型通用轻量级模拟到现实强化学习框架

Authors: Guanwen Xie, Jingzehua Xu, Jiwei Tang, Yubo Huang, Shuai Zhang, Xiaofan Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.22126
Pdf link: https://arxiv.org/pdf/2510.22126
Abstract Despite recent advances in Unmanned Underwater Vehicle (UUV) attitude control, existing methods still struggle with generalizability, robustness to real-world disturbances, and efficient deployment. To address the above challenges, this paper presents EasyUUV, a Large Language Model (LLM)-enhanced, universal, and lightweight simulation-to-reality reinforcement learning (RL) framework for robust attitude control of UUVs. EasyUUV combines parallelized RL training with a hybrid control architecture, where a learned policy outputs high-level attitude corrections executed by an adaptive S-Surface controller. A multimodal LLM is further integrated to adaptively tune controller parameters at runtime using visual and textual feedback, enabling training-free adaptation to unmodeled dynamics. Also, we have developed a low-cost 6-DoF UUV platform and applied an RL policy trained through efficient parallelized simulation. Extensive simulation and real-world experiments validate the effectiveness and outstanding performance of EasyUUV in achieving robust and adaptive UUV attitude control across diverse underwater conditions. The source code is available from the following website: this https URL
中文摘要 尽管无人水下航行器（UUV）姿态控制最近取得了进展，但现有方法在通用性、对现实世界干扰的鲁棒性和高效部署方面仍然存在困难。为了应对上述挑战，本文提出了EasyUUV，这是一个大型语言模型（LLM）增强的、通用的、轻量级的模拟到现实的强化学习（RL）框架，用于UUV的鲁棒姿态控制。EasyUUV 将并行 RL 训练与混合控制架构相结合，其中学习的策略输出由自适应 S-Surface 控制器执行的高级姿态校正。进一步集成了多模态 LLM，以使用视觉和文本反馈在运行时自适应调整控制器参数，从而实现对未建模动态的免训练适应。此外，我们还开发了一个低成本的6-DoF UUV平台，并应用了通过高效并行仿真训练的RL策略。广泛的模拟和真实实验验证了 EasyUUV 在各种水下条件下实现稳健且自适应的 UUV 姿态控制方面的有效性和卓越性能。源代码可从以下网站获得：此 https URL

OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue

OlaMind：迈向类似人类和幻觉安全的客户服务，用于检索增强对话

Authors: Tianhong Gao, Jundong Shen, Bei Shi, Jiapeng Wang, Ying Ju, Junfeng Yao, Jiao Ran, Yong Zhang, Lin Dong, Huiyu Yu, Tingting Ye
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22143
Pdf link: https://arxiv.org/pdf/2510.22143
Abstract Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkable improvements in automation and efficiency. However, notable limitations still remain: these systems are prone to hallucinations and often generate rigid, mechanical responses, which can introduce business risks and undermine user experience, especially in Web-based customer service interactions under the RAG scenarios. In this paper, we introduce OlaMind, a human-like and hallucination-safe customer service framework for retrieval-augmented dialogue. Specifically, it first leverages a Learn-to-Think stage to learn the reasoning processes and response strategies from human experts, and then employs a Learn-to-Respond stage to perform cold-start supervised fine-tuning (SFT) combined with reinforcement learning (RL) for basic-to-hard self-refinement. Our method significantly enhances human-likeness and naturalness while effectively mitigating hallucinations and critical business risks. We have conducted large-scale online A/B experiments in an industry-level social customer service setting, and extensive experimental results show that OlaMind achieves significant cumulative relative improvements with intelligent resolution rates +28.92%/+18.42% and human takeover rate -6.08%/-7.12% in community-support/livestream-interaction scenarios, respectively, which highlights its consistent effectiveness across diverse real-world applications. The code and data will be publicly available.
中文摘要 通过检索增强生成（RAG）的智能客户服务（ICS）系统已在社交平台和电子商务等基于Web的领域得到广泛应用，实现了自动化和效率的显著提升。然而，仍然存在显着的局限性：这些系统容易出现幻觉，并且经常产生僵化的机械响应，这可能会引入业务风险并破坏用户体验，尤其是在 RAG 场景下基于 Web 的客户服务交互中。在本文中，我们介绍了 OlaMind，这是一种类似人类且幻觉安全的客户服务框架，用于检索增强对话。具体来说，它首先利用 Learn-to-Think 阶段从人类专家那里学习推理过程和响应策略，然后采用 Learn-to-Respond 阶段进行冷启动监督微调（SFT）结合强化学习（RL），实现从基础到困难的自我细化。我们的方法显着增强了人类的相似性和自然性，同时有效减轻了幻觉和关键的业务风险。我们在行业级社交客服环境中进行了大规模的在线 A/B 实验，广泛的实验结果表明，OlaMind 在社区支持/直播互动场景下，智能解决率分别为 +28.92%/+18.42% 和人工接管率 -6.08%/-7.12%，实现了显着的累积相对提升，这凸显了其在多种现实世界应用中的一致有效性。代码和数据将公开。

Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics

求解连续均值场博弈：非平稳动力学的深度强化学习

Authors: Lorenzo Magnino, Kai Shao, Zida Wu, Jiacheng Shen, Mathieu Laurière
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.22158
Pdf link: https://arxiv.org/pdf/2510.22158
Abstract Mean field games (MFGs) have emerged as a powerful framework for modeling interactions in large-scale multi-agent systems. Despite recent advancements in reinforcement learning (RL) for MFGs, existing methods are typically limited to finite spaces or stationary models, hindering their applicability to real-world problems. This paper introduces a novel deep reinforcement learning (DRL) algorithm specifically designed for non-stationary continuous MFGs. The proposed approach builds upon a Fictitious Play (FP) methodology, leveraging DRL for best-response computation and supervised learning for average policy representation. Furthermore, it learns a representation of the time-dependent population distribution using a Conditional Normalizing Flow. To validate the effectiveness of our method, we evaluate it on three different examples of increasing complexity. By addressing critical limitations in scalability and density approximation, this work represents a significant advancement in applying DRL techniques to complex MFG problems, bringing the field closer to real-world multi-agent systems.
中文摘要 平均场博弈（MFG）已成为大规模多智能体系统中交互建模的强大框架。尽管最近在 MFG 的强化学习（RL）方面取得了进展，但现有方法通常仅限于有限空间或平稳模型，阻碍了它们在现实世界问题中的适用性。本文介绍了一种专门为非平稳连续MFG设计的新型深度强化学习（DRL）算法。所提出的方法建立在虚构游戏（FP）方法的基础上，利用 DRL 进行最佳响应计算，并利用监督学习进行平均策略表示。此外，它还使用条件归一化流学习与时间相关的人口分布的表示。为了验证我们方法的有效性，我们通过三个复杂性增加的不同示例对其进行了评估。通过解决可扩展性和密度近似方面的关键限制，这项工作代表了将 DRL 技术应用于复杂 MFG 问题的重大进步，使该领域更接近现实世界的多智能体系统。

Dopamine-driven synaptic credit assignment in neural networks

神经网络中多巴胺驱动的突触信用分配

Authors: Saranraj Nambusubramaniyan, Shervin Safavi, Raja Guru, Andreas Knoblauch
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22178
Pdf link: https://arxiv.org/pdf/2510.22178
Abstract Solving the synaptic Credit Assignment Problem (CAP) is central to learning in both biological and artificial neural systems. Finding an optimal solution for synaptic CAP means setting the synaptic weights that assign credit to each neuron for influencing the final output and behavior of neural networks or animals. Gradient-based methods solve this problem in artificial neural networks using back-propagation, but not in the most efficient way. For instance, back-propagation requires a chain of top-down gradient computations. This leads to an expensive optimization process in terms of computing power and memory linked with well-known weight transport and update locking problems. To address these shortcomings, we take a NeuroAI approach and draw inspiration from neural Reinforcement Learning to develop a derivative-free optimizer for training neural networks, Dopamine. Dopamine is developed for Weight Perturbation (WP) learning that exploits stochastic updating of weights towards optima. It achieves this by minimizing the regret, a form of Reward Prediction Error (RPE) between the expected outcome from the perturbed model and the actual outcome from the unperturbed model. We use this RPE to adjust the learning rate in the network (i.e., creating an adaptive learning rate strategy, similar to the role of dopamine in the brain). We tested the Dopamine optimizer for training multi-layered perceptrons for XOR tasks, and recurrent neural networks for chaotic time series forecasting. Dopamine-trained models demonstrate accelerated convergence and outperform standard WP, and give comparable performance to gradient-based algorithms, while consuming significantly less computation and memory. Overall, the Dopamine optimizer not only finds robust solutions and achieves performance comparable to state-of-the-art Machine Learning optimizers but is also neurobiologically more plausible.
中文摘要 解决突触信用分配问题（CAP）是生物和人工神经系统学习的核心。寻找突触 CAP 的最佳解决方案意味着设置突触权重，为每个神经元分配影响神经网络或动物的最终输出和行为的功劳。基于梯度的方法通过反向传播在人工神经网络中解决了这个问题，但并不是以最有效的方式。例如，反向传播需要一系列自上而下的梯度计算。这导致了计算能力和内存方面的昂贵优化过程，与众所周知的权重传输和更新锁定问题相关。为了解决这些缺点，我们采用了 NeuroAI 方法，并从神经强化学习中汲取灵感，开发了一种用于训练神经网络的无导数优化器 Dopamine。Dopamine 是为权重扰动（WP）学习而开发的，该学习利用权重向最优值的随机更新。它通过最大限度地减少后悔来实现这一点，后悔是受扰动模型的预期结果与未受扰动模型的实际结果之间的一种奖励预测误差（RPE）形式。我们使用这个 RPE 来调整网络中的学习率（即创建自适应学习率策略，类似于多巴胺在大脑中的作用）。我们测试了 Dopamine 优化器，用于训练多层感知器完成异或任务，以及训练循环神经网络进行混沌时间序列预测。Dopamine 训练的模型表现出加速的收敛并优于标准 WP，并提供与基于梯度的算法相当的性能，同时消耗的计算和内存显着减少。总体而言，Dopamine 优化器不仅找到了稳健的解决方案并取得了与最先进的机器学习优化器相当的性能，而且在神经生物学上也更合理。
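
Illustrative sketch (not from the paper): the abstract describes derivative-free Weight Perturbation (WP) learning in which an error signal, computed by comparing the perturbed and unperturbed models, scales the update. The toy below applies plain WP to a two-parameter quadratic loss. The loss, `TARGET`, and all hyperparameters are hypothetical, and the way the error signal scales the step here is ordinary WP, not the Dopamine optimizer's actual adaptive learning-rate rule, which the abstract does not specify.

```python
import random

TARGET = [1.5, -0.5]  # hypothetical optimum of the toy loss


def loss(weights):
    """Toy quadratic loss; stands in for a network's training loss."""
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


def weight_perturbation(weights, steps=500, base_lr=0.05, sigma=0.01, seed=0):
    """Derivative-free training: only loss evaluations, no gradients."""
    rng = random.Random(seed)
    w = list(weights)
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in w]
        perturbed = [wi + sigma * ni for wi, ni in zip(w, noise)]
        # Prediction-error-like signal: positive when the perturbation helped.
        rpe = loss(w) - loss(perturbed)
        # The error signal scales the step taken along the perturbation direction.
        step = base_lr * rpe / sigma
        w = [wi + step * ni for wi, ni in zip(w, noise)]
    return w
```

Starting from `[0.0, 0.0]`, `weight_perturbation` moves the weights toward `TARGET` using two loss evaluations per step. In expectation the update follows the negative gradient, which is why WP can train a model without back-propagation.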

PACR: Progressively Ascending Confidence Reward for LLM Reasoning

PACR：LLM 推理的信心逐步提升奖励

Authors: Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, Mark A. Hasegawa-Johnson, Chang D. Yoo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22255
Pdf link: https://arxiv.org/pdf/2510.22255
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
中文摘要 具有可验证奖励的强化学习（RLVR）显着改进了 LLM 推理，但其稀疏的、基于结果的奖励没有为中间步骤提供指导，从而减慢了探索速度。我们提出了渐进式递增置信度奖励（PACR），这是一种密集的、模型固有的奖励，直接根据模型对正确答案的不断变化的信念计算得出。PACR 编码归纳偏差，沿着格式良好的推理轨迹，真实答案的概率应该具有总体上升趋势。我们提供了实证和理论分析，验证了这种归纳偏差将探索搜索空间限制在逻辑合理推理更丰富的区域。我们证明，PACR 加速了探索，以更少的轨迹达到奖励饱和，并在多个基准上产生了改进。我们的结果表明，密集的、模型固有的整形信号可以使RLVR训练更加有效和可靠。
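
Illustrative sketch (not from the paper): the abstract says PACR is a dense reward derived from the model's evolving belief in the correct answer, which should generally ascend along a sound reasoning trajectory. One way to read that, assuming we can already measure the probability of the ground-truth answer after each reasoning step, is to reward every step where that probability rises. The function names and the 0/1 reward shape below are assumptions; the paper's exact reward definition is not given in the abstract.

```python
import math


def confidence_gains(answer_probs):
    """Change in log-probability of the ground-truth answer between steps."""
    logs = [math.log(p) for p in answer_probs]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


def ascending_step_rewards(answer_probs):
    """Hypothetical dense reward: 1.0 for each step that raises confidence."""
    return [1.0 if gain > 0 else 0.0 for gain in confidence_gains(answer_probs)]
```

For a trajectory with probabilities `[0.1, 0.2, 0.15, 0.6]`, the second step lowers confidence and receives no reward, while the first and third steps are rewarded. The gains telescope, so their sum equals the total change in log-probability from the first step to the last.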

CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

CityRiSE：通过强化学习推理视觉语言模型中的城市社会经济地位

Authors: Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22282
Pdf link: https://arxiv.org/pdf/2510.22282
Abstract Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
中文摘要 利用街景和卫星图像等公开的大规模网络数据，城市社会经济传感对于实现全球可持续发展目标至关重要。随着大型视觉语言模型（LVLM）的出现，出现了新的机会，通过将其视为多模态感知和理解问题来解决这一任务。然而，最近的研究表明，LVLM 仍然难以从视觉数据中做出准确且可解释的社会经济预测。为了解决这些局限性并最大限度地发挥 LVLM 的潜力，我们引入了 CityRiSE，这是一个通过纯强化学习（RL）在 LVLM 中推理城市社会经济地位（Reasoning urban Socio-Economic status）的新框架。通过精心策划的多模态数据和可验证的奖励设计，我们的方法指导 LVLM 专注于语义上有意义的视觉线索，从而为通才社会经济地位预测提供结构化和目标导向的推理。实验表明，具有涌现推理过程的CityRiSE显著优于现有基线，提高了预测准确性和在不同城市环境中的泛化性，特别是对于未见过的城市和未见过的指标的预测。这项工作强调了将 RL 和 LVLM 相结合以实现可解释和通才城市社会经济感知的前景。

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

GRPO-Guard：通过受调节的裁剪缓解流匹配中的隐式过度优化

Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22319
Pdf link: https://arxiv.org/pdf/2510.22319
Abstract Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
中文摘要 最近，基于GRPO的强化学习在优化流匹配模型方面取得了显著进展，有效提高了其与特定任务奖励的一致性。在这些框架内，政策更新依赖于重要性比率裁剪来限制过度自信的正负梯度。然而，在实践中，我们观察到重要性-比率分布的系统性变化——其均值低于 1，并且其方差在不同时间步长之间存在很大差异。这种左移和不一致的分布阻止了正优势样本进入裁剪区域，导致该机制无法限制过度自信的正更新。结果，策略模型不可避免地进入了隐式的过度优化阶段——在代理奖励不断增加的同时，图像质量和文本提示对齐等基本指标急剧恶化，最终使学习到的策略在实际使用中变得不切实际。为了解决这个问题，我们引入了 GRPO-Guard，这是对现有 GRPO 框架的简单而有效的增强。我们的方法结合了比率归一化，可恢复平衡且步长一致的重要性比率，确保 PPO 裁剪正确限制去噪时间步长中的有害更新。此外，梯度重新加权策略可平衡噪声条件下的策略梯度，从而防止来自特定时间步长区域的过度更新。这些设计共同充当调节削波机制，稳定优化并大大减轻隐式过度优化，而无需依赖大量的 KL 正则化。对多个扩散主干（例如 SD3.5M、Flux.1-dev）和各种代理任务的广泛实验表明，GRPO-Guard 在保持甚至提高生成质量的同时显着减少了过度优化。
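
Illustrative sketch (not from the paper): the abstract attributes the failure to importance ratios whose mean sits below 1 and whose spread varies by timestep, and says ratio normalization restores a balanced, step-consistent ratio before PPO clipping. A minimal way to picture this, assuming the log-ratios of one denoising timestep are available as a list, is to standardize them to zero mean and a shared scale before exponentiating. `normalize_ratios`, `target_std`, and the standardization rule are assumptions, not the paper's formula.

```python
import math
import statistics


def normalize_ratios(log_ratios, target_std=0.05):
    """Recentre one timestep's log-ratios to mean 0 and a shared spread."""
    mean = statistics.fmean(log_ratios)
    std = statistics.pstdev(log_ratios)
    if std == 0.0:
        return [1.0 for _ in log_ratios]
    return [math.exp(target_std * (lr - mean) / std) for lr in log_ratios]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

With raw log-ratios such as `[-0.30, -0.10, -0.20, -0.40]`, every ratio is below 1, so a positive-advantage sample can never reach the upper clip bound `1 + eps`. After `normalize_ratios`, the ratios are centred on 1 (their geometric mean is 1), so the clip bounds apply symmetrically again.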

BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles

BLIP-FusePPO：用于自动驾驶汽车车道保持的视觉语言深度强化学习框架

Authors: Seyed Ahmad Hosseini Miangoleh, Amin Jalal Aghdasian, Farzaneh Abdollahi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.22370
Pdf link: https://arxiv.org/pdf/2510.22370
Abstract In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available.
中文摘要 在本文中，我们提出了近端策略优化中的引导语言-图像预训练驱动的融合状态表示（BLIP-FusePPO），这是一种用于自主车道保持（LK）的新型多模态强化学习（RL）框架，其中视觉语言模型（VLM）生成的语义嵌入直接与智能体观察空间内的几何状态、LiDAR观测和基于比例积分微分（PID）的控制反馈融合。所提出的方法通过将VLM的高级场景理解与低级控制和空间信号相结合，让智能体学习感知周围环境且易于理解的驾驶规则。我们的架构汇集了语义、几何和控制感知表示，使策略学习更加健壮。包括语义对齐、LK 准确性、避障和速度调节在内的混合奖励函数有助于学习更加高效和普遍。我们的方法不同于仅使用语义模型来塑造奖励的方法。相反，它直接将语义特征嵌入到状态表示中。这减少了昂贵的运行时推理，并确保语义指导始终可用。仿真结果表明，在各种困难驾驶情况下，所提模型在LK稳定性和适应性方面优于最佳视觉和多模态RL基线。我们公开我们的代码。

Teaching Machine Learning Through Cricket: A Practical Engineering Education Approach

通过板球教授机器学习：实用的工程教育方法

Authors: Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, M. Saifuzzaman Rafat, Jungpil Shin
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2510.22392
Pdf link: https://arxiv.org/pdf/2510.22392
Abstract Teaching complex machine learning concepts such as reinforcement learning and Markov Decision Processes remains challenging in engineering education. Students often struggle to connect abstract mathematics to real-world applications. We present LearnML@Cricket, a 12-week curriculum that uses cricket analytics to teach these concepts through practical, hands-on examples. By mapping game scenarios directly to ML algorithms, students learn through doing rather than memorizing. Our curriculum includes coding laboratories, real datasets, and immediate application to engineering problems. We propose an empirical study to measure whether this approach improves both understanding and practical implementation skills compared to traditional teaching methods.
中文摘要 在工程教育中，教授强化学习和马尔可夫决策过程等复杂的机器学习概念仍然具有挑战性。学生经常难以将抽象数学与现实世界的应用联系起来。我们推出 LearnML@Cricket，这是一个为期 12 周的课程，使用板球分析通过实际的实践示例教授这些概念。通过将游戏场景直接映射到 ML 算法，学生可以通过实践而不是记忆来学习。我们的课程包括编码实验室、真实数据集以及立即应用于工程问题。我们提出了一项实证研究，以衡量与传统教学方法相比，这种方法是否提高了理解和实际实施技能。

A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems

一种用于高维动力系统自适应控制的新型多时间尺度稳定性保持分层强化学习控制器框架

Authors: Mohammad Ali Labbaf Khaniki, Fateme Taroodi, Benyamin Safizadeh
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.22420
Pdf link: https://arxiv.org/pdf/2510.22420
Abstract Controlling high-dimensional stochastic systems, critical in robotics, autonomous vehicles, and hyperchaotic systems, faces the curse of dimensionality, lacks temporal abstraction, and often fails to ensure stochastic stability. To overcome these limitations, this study introduces the Multi-Timescale Lyapunov-Constrained Hierarchical Reinforcement Learning (MTLHRL) framework. MTLHRL integrates a hierarchical policy within a semi-Markov Decision Process (SMDP), featuring a high-level policy for strategic planning and a low-level policy for reactive control, which effectively manages complex, multi-timescale decision-making and reduces dimensionality overhead. Stability is rigorously enforced using a neural Lyapunov function optimized via Lagrangian relaxation and multi-timescale actor-critic updates, ensuring mean-square boundedness or asymptotic stability in the face of stochastic dynamics. The framework promotes efficient and reliable learning through trust-region constraints and decoupled optimization. Extensive simulations on an 8D hyperchaotic system and a 5-DOF robotic manipulator demonstrate MTLHRL's empirical superiority. It significantly outperforms baseline methods in both stability and performance, recording the lowest error indices (e.g., Integral Absolute Error (IAE): 3.912 in hyperchaotic control and IAE: 1.623 in robotics), achieving faster convergence, and exhibiting superior disturbance rejection. MTLHRL offers a theoretically grounded and practically viable solution for robust control of complex stochastic systems.
中文摘要 高维随机系统的控制在机器人、自动驾驶汽车和超混沌系统中至关重要，但面临维度灾难，缺乏时间抽象，并且往往无法确保随机稳定性。为了克服这些限制，本研究引入了多时间尺度李雅普诺夫约束分层强化学习（MTLHRL）框架。MTLHRL 在半马尔可夫决策过程（SMDP）中集成了分层策略，具有用于战略规划的高级策略和用于反应控制的低级策略，可有效管理复杂的多时间尺度决策并减少维度开销。使用通过拉格朗日松弛和多时间尺度 actor-critic 更新优化的神经李雅普诺夫函数严格执行稳定性，确保面对随机动力学时的均方有界性或渐近稳定性。该框架通过信任区域约束和解耦优化促进高效可靠的学习。在 8D 超混沌系统和 5 自由度机器人机械手上进行的广泛仿真证明了 MTLHRL 的经验优势。它在稳定性和性能方面都明显优于基线方法，记录了最低的误差指数（例如，超混沌控制中的积分绝对误差（IAE）：3.912 和机器人技术中的 IAE：1.623），实现了更快的收敛，并表现出卓越的干扰抑制能力。MTLHRL 为复杂随机系统的稳健控制提供了理论上有根据且实际可行的解决方案。

Agent-GSPO: Communication-Efficient Multi-Agent Systems via Group Sequence Policy Optimization

Agent-GSPO：通过组序列策略优化实现通信高效的多智能体系统

Authors: Yijia Fan, Jusheng Zhang, Jing Yang, Keze Wang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.22477
Pdf link: https://arxiv.org/pdf/2510.22477
Abstract To combat the prohibitive communication costs of "free-for-all" multi-agent systems (MAS), we introduce Agent-GSPO, a framework that directly optimizes for token economy using sequence-level reinforcement learning. Agent-GSPO leverages the stable and memory-efficient Group Sequence Policy Optimization (GSPO) algorithm to train agents on a communication-aware reward that explicitly penalizes verbosity. Across seven reasoning benchmarks, Agent-GSPO not only achieves new state-of-the-art performance but does so with a fraction of the token consumption of existing methods. By fostering emergent strategies like "strategic silence," our approach provides a practical blueprint for developing scalable and economically viable multi-agent systems.
中文摘要 为了应对“混战”多智能体系统（MAS）的高昂通信成本，我们引入了 Agent-GSPO，这是一个使用序列级强化学习直接优化词元（token）经济性的框架。Agent-GSPO 利用稳定且内存高效的组序列策略优化（GSPO）算法，基于通信感知奖励来训练智能体，该奖励明确惩罚冗长性。在七个推理基准测试中，Agent-GSPO 不仅实现了新的最先进的性能，而且以现有方法的一小部分词元消耗实现了这一点。通过培养“战略沉默”等涌现策略，我们的方法为开发可扩展且经济上可行的多智能体系统提供了实用的蓝图。
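
Illustrative sketch (not from the paper): the abstract mentions a communication-aware reward that penalizes verbosity and a group-based policy optimization algorithm. The toy below assumes a reward equal to task success minus a penalty proportional to the tokens an agent spent, followed by the group-relative standardization commonly used in GRPO-style methods. The penalty weight, token budget, and function names are hypothetical; the paper's actual reward is not given in the abstract.

```python
import statistics


def communication_aware_reward(correct, tokens_used, token_budget=512, penalty=0.5):
    """Task reward minus a verbosity penalty (hypothetical coefficients)."""
    task_reward = 1.0 if correct else 0.0
    verbosity = min(tokens_used / token_budget, 1.0)
    return task_reward - penalty * verbosity


def group_advantages(rewards):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Under this reward, two correct responses are no longer tied: the one that used fewer tokens gets the larger advantage, which is the pressure that could make staying silent a learned strategy when a message adds nothing.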

Resource Allocation for XR with Edge Offloading: A Reinforcement Learning Approach

XR的资源分配与边缘卸载：一种强化学习方法

Authors: Alperen Duru, Mohammad Mozaffari, Ticao Zhang, Mehrnaz Afshang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2510.22505
Pdf link: https://arxiv.org/pdf/2510.22505
Abstract Future immersive XR applications will require energy-efficient, high data rate, and low-latency wireless communications in uplink and downlink. One of the key considerations for supporting such XR applications is intelligent and adaptive resource allocation with edge offloading. To address these demands, this paper proposes a reinforcement learning-based resource allocation framework that dynamically allocates uplink and downlink slots while making offloading decisions based on the XR headset's capabilities and network conditions. The paper presents a numerical analysis of the tradeoff between frame loss rate (FLR) and energy efficiency, identifying decision regions for partial offloading to optimize performance. Results show that for the used set of system parameters, partial offloading can extend the coverage area by 55% and reduce energy consumption by up to 34%, compared to always or never offloading. The results demonstrate that the headset's local computing capability plays a crucial role in offloading decisions. Higher computing abilities enable more efficient local processing, reduce the need for offloading, and enhance energy savings.
中文摘要 未来的沉浸式XR应用将需要在上行链路和下行链路中实现节能、高数据速率和低延迟的无线通信。支持此类 XR 应用程序的关键考虑因素之一是通过边缘卸载进行智能和自适应资源分配。针对这些需求，本文提出了一种基于强化学习的资源分配框架，该框架可以动态分配上行和下行时隙，同时根据XR头显的功能和网络条件做出卸载决策。本文对帧丢失率（FLR）和能效之间的权衡进行了数值分析，确定了部分卸载的决策区域以优化性能。结果表明，对于所使用的系统参数集，与始终卸载或从不卸载相比，部分卸载可扩大覆盖面积55%，降低能耗高达34%。结果表明，头显的本地计算能力在卸载决策中起着至关重要的作用。更高的计算能力可以实现更高效的本地处理，减少卸载需求，并增强节能效果。

Transitive RL: Value Learning via Divide and Conquer

传递 RL：通过分而治之的价值学习

Authors: Seohong Park, Aditya Oberai, Pranav Atreya, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.22512
Pdf link: https://arxiv.org/pdf/2510.22512
Abstract In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
中文摘要 在这项工作中，我们提出了传递强化学习（TRL），这是一种基于分而治之范式的新型价值学习算法。TRL 专为离线目标条件强化学习（GCRL）问题而设计，其目的是找到一个可以在最少的步骤中从任何其他状态到达任何状态的策略。TRL 将 GCRL 中存在的三角形不等式结构转换为实用的分治值更新规则。与替代价值学习范式相比，这有几个优势。与时差（TD）方法相比，TRL受到偏差累积的影响较小，因为原则上它只需要$O（\log T）$递归（与TD学习中的$O（T）$相反）来处理长度-$T$的轨迹。与蒙特卡洛方法不同，TRL 在执行动态规划时受到高方差的影响较小。通过实验，我们表明，与以前的离线GCRL算法相比，TRL在极具挑战性的长期基准测试任务中取得了最佳性能。
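
Illustrative sketch (not from the paper): the abstract says TRL turns the triangle inequality of goal-conditioned RL into a divide-and-conquer value update that needs only O(log T) recursions for a length-T trajectory. The tabular toy below shows that doubling effect on shortest-path distances: each sweep applies d(s, g) = min over w of d(s, w) + d(w, g), so the path length covered doubles per sweep. This exact, tabular update is an assumption made for illustration; TRL itself is a value learning algorithm for offline data, and its update rule is not spelled out in the abstract.

```python
import math

INF = math.inf


def init_distances(num_states, edges):
    """Distances known from single transitions only (0 to self, 1 per edge)."""
    d = [[INF] * num_states for _ in range(num_states)]
    for s in range(num_states):
        d[s][s] = 0
    for s, t in edges:
        d[s][t] = min(d[s][t], 1)
    return d


def transitive_sweep(d):
    """One divide-and-conquer sweep: combine two sub-paths via a midpoint w."""
    n = len(d)
    return [
        [min(d[s][w] + d[w][g] for w in range(n)) for g in range(n)]
        for s in range(n)
    ]


def sweeps_until_converged(d):
    """Repeat sweeps until nothing changes; return distances and sweep count."""
    sweeps = 0
    while True:
        new_d = transitive_sweep(d)
        if new_d == d:
            return d, sweeps
        d, sweeps = new_d, sweeps + 1
```

On a chain of 17 states whose longest path has 16 steps, the distances become exact after 4 sweeps, since 2^4 = 16. A one-step backup, which extends known paths by a single transition per sweep, would need on the order of 16 sweeps for the same chain.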

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

FAPO：用于高效可靠推理的缺陷感知策略优化

Authors: Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22543
Pdf link: https://arxiv.org/pdf/2510.22543
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
中文摘要 具有可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的一种有前途的范式。在这种情况下，模型探索推理轨迹，并将正确答案的推出作为政策优化的积极信号。然而，这些推出可能涉及有缺陷的模式，例如答案猜测和跳入推理。这种有缺陷的积极推出与完全正确的推出一样得到奖励，导致政策模型将这些不可靠的推理模式内化。在这项工作中，我们首先对RL中的有缺陷的正向推出进行了系统研究，发现它们在早期优化阶段能够快速获得能力，同时通过强化不可靠的模式来限制后期的推理能力。基于这些见解，我们提出了缺陷感知策略优化（FAPO），它为有缺陷的积极推出提供了无参数的奖励惩罚，使策略能够在预热阶段将其作为有用的捷径，确保稳定的早期收益，同时在后期细化阶段逐渐将优化转向可靠的推理。为了准确、全面地检测有缺陷的正向推出，我们引入了一种生成奖励模型（GenRM），该模型具有流程级奖励，可精确定位推理错误。实验表明，FAPO 在广泛的领域是有效的，可以在不增加代币预算的情况下提高结果的正确性、过程可靠性和训练稳定性。

SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions

SPIRAL：用于多无人机比赛学习的自玩增量赛车算法

Authors: Onur Akgün
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.22568
Pdf link: https://arxiv.org/pdf/2510.22568
Abstract This paper introduces SPIRAL (Self-Play Incremental Racing Algorithm for Learning), a novel approach for training autonomous drones in multi-agent racing competitions. SPIRAL distinctively employs a self-play mechanism to incrementally cultivate complex racing behaviors within a challenging, dynamic environment. Through this self-play core, drones continuously compete against increasingly proficient versions of themselves, naturally escalating the difficulty of competitive interactions. This progressive learning journey guides agents from mastering fundamental flight control to executing sophisticated cooperative multi-drone racing strategies. Our method is designed for versatility, allowing integration with any state-of-the-art Deep Reinforcement Learning (DRL) algorithms within its self-play framework. Simulations demonstrate the significant advantages of SPIRAL and benchmark the performance of various DRL algorithms operating within it. Consequently, we contribute a versatile, scalable, and self-improving learning framework to the field of autonomous drone racing. SPIRAL's capacity to autonomously generate appropriate and escalating challenges through its self-play dynamic offers a promising direction for developing robust and adaptive racing strategies in multi-agent environments. This research opens new avenues for enhancing the performance and reliability of autonomous racing drones in increasingly complex and competitive scenarios.
中文摘要 本文介绍了SPIRAL（Self-Play Incremental Racing Algorithm for Learning），这是一种在多智能体赛车比赛中训练自主无人机的新方法。SPIRAL 独特地采用了一种自我对弈机制，在充满挑战的动态环境中逐步培养复杂的赛车行为。通过这个自我对弈核心，无人机不断与越来越熟练的自己竞争，自然而然地提升了竞技互动的难度。这种渐进式学习过程引导智能体从掌握基本的飞行控制到执行复杂的合作多无人机赛车策略。我们的方法专为多功能性而设计，允许在其自我对弈框架内与任何最先进的深度强化学习（DRL）算法集成。仿真展示了 SPIRAL 的显着优势，并对其中运行的各种 DRL 算法的性能进行了基准测试。因此，我们为自主无人机赛车领域贡献了一个多功能、可扩展和自我完善的学习框架。SPIRAL 能够通过其自我对弈动态自主生成适当和不断升级的挑战，这为在多智能体环境中开发强大且适应性强的赛车策略提供了一个有希望的方向。这项研究为在日益复杂和竞争激烈的场景中提高自主赛车无人机的性能和可靠性开辟了新途径。

Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing

基于课程的迭代自我博弈，实现可扩展的多无人机竞速

Authors: Onur Akgün
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.22570
Pdf link: https://arxiv.org/pdf/2510.22570
Abstract The coordination of multiple autonomous agents in high-speed, competitive environments represents a significant engineering challenge. This paper presents CRUISE (Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing), a reinforcement learning framework designed to solve this challenge in the demanding domain of multi-drone racing. CRUISE overcomes key scalability limitations by synergistically combining a progressive difficulty curriculum with an efficient self-play mechanism to foster robust competitive behaviors. Validated in high-fidelity simulation with realistic quadrotor dynamics, the resulting policies significantly outperform both a standard reinforcement learning baseline and a state-of-the-art game-theoretic planner. CRUISE achieves nearly double the planner's mean racing speed, maintains high success rates, and demonstrates robust scalability as agent density increases. Ablation studies confirm that the curriculum structure is the critical component for this performance leap. By providing a scalable and effective training methodology, CRUISE advances the development of autonomous systems for dynamic, competitive tasks and serves as a blueprint for future real-world deployment.
中文摘要 在高速、竞争激烈的环境中协调多个自主代理是一项重大的工程挑战。本文提出了 CRUISE（Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing），这是一种强化学习框架，旨在解决多无人机赛车要求苛刻的领域中的这一挑战。CRUISE 通过将渐进难度课程与高效的自我游戏机制协同相结合，以培养强大的竞争行为，克服了关键的可扩展性限制。在具有真实四旋翼动力学的高保真仿真中得到验证，由此产生的策略明显优于标准强化学习基线和最先进的博弈论规划器。CRUISE 实现了几乎是规划者平均赛车速度的两倍，保持了高成功率，并随着代理密度的增加表现出强大的可扩展性。消融研究证实，课程结构是这一绩效飞跃的关键组成部分。通过提供可扩展且有效的培训方法，CRUISE 推进了动态、竞争性任务的自主系统的开发，并作为未来实际部署的蓝图。

UCB-type Algorithm for Budget-Constrained Expert Learning

用于预算约束专家学习的UCB型算法

Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.22654
Pdf link: https://arxiv.org/pdf/2510.22654
Abstract In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the stochastic setting and introduce M-LCB, a computationally efficient UCB-style meta-algorithm that provides anytime regret guarantees. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then M-LCB ensures overall regret bounded by $\tilde O\bigl(\sqrt{KT/M} + (K/M)^{1-\alpha} T^\alpha\bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how M-LCB extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
中文摘要 在许多现代应用中，系统必须在多个在线训练的自适应学习算法之间动态选择。示例包括流式环境中的模型选择、金融交易策略之间的切换，以及编排多个上下文赌博机或强化学习智能体。在每一轮中，学习者必须从 $K$ 个自适应专家中选择一个预测器进行预测，同时在固定的训练预算下最多只能更新其中的 $M \le K$ 个。我们在随机设定下研究这一问题，并提出 M-LCB，这是一种计算高效的 UCB 风格元算法，可提供随时（anytime）遗憾保证。其置信区间直接根据已实现的损失构建，无需额外优化，并能自然反映底层专家的收敛性质。如果每个专家的内部遗憾为 $\tilde O(T^\alpha)$，那么 M-LCB 可确保总体遗憾以 $\tilde O\bigl(\sqrt{KT/M} + (K/M)^{1-\alpha} T^\alpha\bigr)$ 为界。据我们所知，这是在每轮预算约束下同时训练多个自适应专家时首个建立遗憾保证的结果。我们用两个代表性案例来说明该框架：（i）使用随机损失在线训练的参数模型，以及（ii）本身就是多臂赌博机算法的专家。这些示例表明 M-LCB 如何将经典赌博机范式扩展到在有限资源下协调有状态、自学习专家这一更现实的场景。
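
The selection rule described in the abstract can be illustrated with a generic lower-confidence-bound step. This is only a minimal sketch under stated assumptions, not the paper's M-LCB: the Hoeffding-style radius, the assumption that losses lie in [0, 1], and the rule "predict with the lowest bound, update the M lowest bounds" are all illustrative choices, and `lcb_select` is a hypothetical name.

```python
import numpy as np

def lcb_select(loss_sums, counts, t, budget_m):
    """Return (expert used for prediction, experts to update this round).

    Assumptions (not from the paper): losses lie in [0, 1] and the
    confidence radius is a plain Hoeffding-style term.
    """
    counts = np.asarray(counts, dtype=float)
    safe_counts = np.maximum(counts, 1.0)
    means = np.asarray(loss_sums, dtype=float) / safe_counts
    # Experts that were never updated get an infinite radius,
    # so their lower bound is -inf and they are tried first.
    radius = np.where(counts > 0,
                      np.sqrt(2.0 * np.log(max(t, 2)) / safe_counts),
                      np.inf)
    lcb = means - radius
    order = np.argsort(lcb, kind="stable")  # most optimistic first
    return int(order[0]), order[:budget_m].tolist()

# Three experts, budget of two updates per round.
predictor, to_update = lcb_select([5.0, 1.0, 0.0], [10, 10, 0], t=20, budget_m=2)
```

With these inputs the untrained third expert has the lowest bound, so it is chosen for prediction and is updated together with the second expert.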

FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning

FlowCritic：在强化学习中将价值估计与流匹配联系起来

Authors: Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, Bei Peng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.22686
Pdf link: https://arxiv.org/pdf/2510.22686
Abstract Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing the convergence speed and final performance. Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL, yet the former merely combines multiple point estimates without capturing distributional information, whereas the latter relies on discretization or quantile regression, limiting the expressiveness of complex value distributions. Inspired by flow matching's success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic. Departing from conventional regression for deterministic value prediction, FlowCritic leverages flow matching to model value distributions and generate samples for value estimation.
中文摘要 可靠的价值估计是强化学习（RL）的基石，它通过评估长期回报和指导策略改进，显著影响收敛速度和最终性能。现有工作通过多评论家（multi-critic）集成和分布强化学习（distributional RL）提高了价值函数估计的可靠性，但前者仅组合多个点估计而不捕获分布信息，而后者则依赖离散化或分位数回归，限制了对复杂价值分布的表达能力。受流匹配在生成建模方面成功的启发，我们提出了一种用于价值估计的生成范式，名为 FlowCritic。与用于确定性价值预测的传统回归不同，FlowCritic 利用流匹配来对价值分布进行建模，并生成用于价值估计的样本。

RL-AVIST: Reinforcement Learning for Autonomous Visual Inspection of Space Targets

RL-AVIST：用于空间目标自主视觉检查的强化学习

Authors: Matteo El-Hariry, Andrej Orsula, Matthieu Geist, Miguel Olivares-Mendez
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.22699
Pdf link: https://arxiv.org/pdf/2510.22699
Abstract The growing need for autonomous on-orbit services such as inspection, maintenance, and situational awareness calls for intelligent spacecraft capable of complex maneuvers around large orbital targets. Traditional control systems often fall short in adaptability, especially under model uncertainties, multi-spacecraft configurations, or dynamically evolving mission contexts. This paper introduces RL-AVIST, a Reinforcement Learning framework for Autonomous Visual Inspection of Space Targets. Leveraging the Space Robotics Bench (SRB), we simulate high-fidelity 6-DOF spacecraft dynamics and train agents using DreamerV3, a state-of-the-art model-based RL algorithm, with PPO and TD3 as model-free baselines. Our investigation focuses on 3D proximity maneuvering tasks around targets such as the Lunar Gateway and other space assets. We evaluate task performance under two complementary regimes: generalized agents trained on randomized velocity vectors, and specialized agents trained to follow fixed trajectories emulating known inspection orbits. Furthermore, we assess the robustness and generalization of policies across multiple spacecraft morphologies and mission domains. Results demonstrate that model-based RL offers promising capabilities in trajectory fidelity and sample efficiency, paving the way for scalable, retrainable control solutions for future space operations.
中文摘要 对检查、维护和态势感知等自主在轨服务的需求不断增长，需要能够围绕大型轨道目标进行复杂机动的智能航天器。传统的控制系统往往在适应性方面存在不足，特别是在模型不确定性、多航天器配置或动态演变的任务环境中。本文介绍了RL-AVIST，这是一种用于空间目标自主视觉检查的强化学习框架。利用空间机器人工作台（SRB），我们模拟高保真 6 自由度航天器动力学，并使用 DreamerV3（一种最先进的基于模型的 RL 算法）训练代理，以 PPO 和 TD3 作为无模型基线。我们的研究重点是围绕月球门户和其他太空资产等目标的 3D 近距离机动任务。我们在两种互补的制度下评估任务性能：在随机速度矢量上训练的广义代理，以及训练以遵循模拟已知检查轨道的固定轨迹的专用代理。此外，我们评估了跨多个航天器形态和任务领域的策略的稳健性和泛化性。结果表明，基于模型的 RL 在轨迹保真度和样本效率方面具有很有前途的能力，为未来的太空运营提供了可扩展、可重新训练的控制解决方案铺平了道路

Policies over Poses: Reinforcement Learning based Distributed Pose-Graph Optimization for Multi-Robot SLAM

策略高于姿态：基于强化学习的分布式姿态图优化，用于多机器人SLAM

Authors: Sai Krishna Ghanta, Ramviyas Parasuraman
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.22740
Pdf link: https://arxiv.org/pdf/2510.22740
Abstract We consider the distributed pose-graph optimization (PGO) problem, which is fundamental in accurate trajectory estimation in multi-robot simultaneous localization and mapping (SLAM). Conventional iterative approaches linearize a highly non-convex optimization objective, requiring repeated solving of normal equations, which often converge to local minima and thus produce suboptimal estimates. We propose a scalable, outlier-robust distributed planar PGO framework using Multi-Agent Reinforcement Learning (MARL). We cast distributed PGO as a partially observable Markov game defined on local pose-graphs, where each action refines a single edge's pose estimate. A graph partitioner decomposes the global pose graph, and each robot runs a recurrent edge-conditioned Graph Neural Network (GNN) encoder with adaptive edge-gating to denoise noisy edges. Robots sequentially refine poses through a hybrid policy that utilizes prior action memory and graph embeddings. After local graph correction, a consensus scheme reconciles inter-robot disagreements to produce a globally consistent estimate. Our extensive evaluations on a comprehensive suite of synthetic and real-world datasets demonstrate that our learned MARL-based actors reduce the global objective by an average of 37.5% more than the state-of-the-art distributed PGO framework, while enhancing inference efficiency by at least 6X. We also demonstrate that actor replication allows a single learned policy to scale effortlessly to substantially larger robot teams without any retraining. Code is publicly available at this https URL.
中文摘要 我们考虑了分布式位姿图优化（PGO）问题，该问题是多机器人同步定位和映射（SLAM）中精确轨迹估计的基础。传统的迭代方法将高度非凸的优化目标线性化，需要重复求解正态方程，这些方程通常收敛到局部最小值，从而产生次优估计。我们提出了一个使用多智能体强化学习（MARL）的可扩展、异常值鲁棒的分布式平面PGO框架。我们将分布式 PGO 转换为在局部姿势图上定义的部分可观察的马尔可夫博弈，其中每个动作都会细化单个边缘的姿态估计。图分区器分解全局位姿图，每个机器人运行一个递归边缘条件图神经网络（GNN）编码器，该编码器具有自适应边缘门控功能，以对噪声边缘进行降噪。机器人通过利用先验动作记忆和图形嵌入的混合策略按顺序细化姿势。局部图校正后，共识方案会协调机器人之间的分歧，以产生全局一致的估计值。我们对一套全面的合成和真实世界数据集的广泛评估表明，我们学习到的基于 MARL 的 Actor 比最先进的分布式 PGO 框架平均将全局目标降低 37.5%，同时将推理效率提高至少 6 倍。我们还证明，参与者复制允许单个学习策略毫不费力地扩展到更大的机器人团队，而无需任何重新训练。代码在此 https URL 中公开可用。
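
For readers unfamiliar with planar PGO, the objective that the abstract says is being reduced can be sketched as a sum of squared edge residuals over SE(2) poses. This is a minimal sketch under assumptions that are not from the paper: identity information matrices (unweighted residuals), poses stored as `(x, y, theta)` tuples, and hypothetical function names.

```python
import math

def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def edge_residual(pose_i, pose_j, meas):
    """Residual between the relative pose implied by two estimates and a measurement.

    pose_i, pose_j: (x, y, theta) in the world frame.
    meas: measured pose of j expressed in the frame of i, as (dx, dy, dtheta).
    """
    xi, yi, ti = pose_i
    xj, yj, tj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the world-frame displacement into frame i: R(theta_i)^T [dx, dy].
    rel_x = c * dx + s * dy
    rel_y = -s * dx + c * dy
    rel_t = wrap_angle(tj - ti)
    return (rel_x - meas[0], rel_y - meas[1], wrap_angle(rel_t - meas[2]))

def pgo_objective(poses, edges):
    """Sum of squared residuals over all edges (i, j, meas); unweighted by assumption."""
    total = 0.0
    for i, j, meas in edges:
        total += sum(v * v for v in edge_residual(poses[i], poses[j], meas))
    return total

poses = [(0.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)]
edges = [(0, 1, (1.0, 0.0, math.pi / 2))]
cost = pgo_objective(poses, edges)  # consistent estimates give a cost near zero
```

In a learned refinement scheme of the kind the abstract describes, each action would adjust a pose estimate and the change in such a cost could serve as a feedback signal; how the paper actually defines its reward is not stated in the abstract.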

Scalable Supervising Software Agents with Patch Reasoner

具有补丁推理器的可扩展监督软件代理

Authors: Junjielong Xu, Boyin Tan, Xiaoyuan Liu, Chao Peng, Pengfei Gao, Pinjia He
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.22775
Pdf link: https://arxiv.org/pdf/2510.22775
Abstract While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement from data scaling. The reason is twofold: (1) building and running a test sandbox is heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that provides scalable rewards for training and testing SWE agents via reasoning. We consider patch verification to be fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lightweight scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy, along with high efficiency, reflect R4P's practicality.
中文摘要 虽然大型语言模型智能体推动了软件工程任务的发展，但现有基于测试的监督难以扩展，限制了数据扩展带来的潜在改进。原因有两个：（1）构建和运行测试沙箱相当繁重和脆弱，（2）具有高覆盖率测试的数据天然稀少，并且容易受到利用边缘情况进行测试作弊（test hacking）的威胁。在本文中，我们提出了 R4P，这是一种补丁验证器模型，通过推理为训练和测试 SWE 智能体提供可扩展的奖励。我们认为补丁验证从根本上说是一项推理任务，正如人类仓库维护者在不编写和运行新的复现测试的情况下审查补丁一样。为了获得足够的参考并降低奖励作弊的风险，R4P 使用分组目标进行 RL 训练，使其能够根据彼此的修改来验证多个补丁，并获得密集奖励以稳定训练。R4P 在验证来自 SWE-bench-verified 的补丁时达到了 72.2% 的准确率，超过了 OpenAI o3。为了证明 R4P 的实用性，我们设计并训练了一个轻量级脚手架 Mini-SE，采用纯强化学习，所有奖励均来自 R4P。结果，Mini-SE 在 SWE-bench-verified 上实现了 26.2% 的 Pass@1，比原始 Qwen3-32B 提高了 10.0%。借助 R4P 进行测试时扩展，这一结果可以进一步提高到 32.8%。此外，R4P 在一秒钟内即可验证补丁，平均比运行测试快 50 倍。奖励和准确率的稳定扩展曲线以及高效率体现了 R4P 的实用性。

VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions

VEHME：用于评估手写数学表达式的视觉语言模型

Authors: Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, Hyunwook Lee, Hyunmin Song, Sungahn Ko, Taehwan Kim
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22798
Pdf link: https://arxiv.org/pdf/2510.22798
Abstract Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME (a Vision-Language Model for Evaluating Handwritten Mathematics Expressions), designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
中文摘要 自动评估手写数学解决方案是具有实际应用的教育技术中的一个重要问题，但由于学生作业的格式多样、布局非结构化和符号复杂性，它仍然是一个重大挑战。为了应对这一挑战，我们推出了 VEHME——一种用于评估手写数学表达式的视觉语言模型——旨在以高精度和可解释的推理轨迹评估开放形式的手写数学响应。VEHME 集成了两阶段训练管道：（i）使用结构化推理数据进行监督微调，以及（ii）强化学习，使模型输出与多维分级目标保持一致，包括正确性、推理深度和错误定位。为了增强空间理解，我们提出了一个表达式感知视觉提示模块，该模块在我们合成的多行数学表达式数据集上进行训练，以稳健地引导视觉异构输入中的注意力。在 AIHub 和 FERMAT 数据集上进行评估，VEHME 在开源模型中实现了最先进的性能，并接近专有系统的准确性，展示了其作为自动数学评估的可扩展且可访问的工具的潜力。我们的训练和实验代码在我们的 GitHub 存储库中公开提供。

HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning

HRM-Agent：使用强化学习在动态环境中训练循环推理模型

Authors: Long H Dang, David Rawlinson
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.22832
Pdf link: https://arxiv.org/pdf/2510.22832
Abstract The Hierarchical Reasoning Model (HRM) has impressive reasoning abilities given its small size, but has only been applied to supervised, static, fully-observable problems. One of HRM's strengths is its ability to adapt its computational effort to the difficulty of the problem. However, in its current form it cannot integrate and reuse computation from previous time-steps if the problem is dynamic, uncertain or partially observable, or be applied where the correct action is undefined, characteristics of many real-world problems. This paper presents HRM-Agent, a variant of HRM trained using only reinforcement learning. We show that HRM can learn to navigate to goals in dynamic and uncertain maze environments. Recent work suggests that HRM's reasoning abilities stem from its recurrent inference process. We explore the dynamics of the recurrent inference process and find evidence that it is successfully reusing computation from earlier environment time-steps.
中文摘要 分层推理模型（HRM）虽然规模很小，却具有令人印象深刻的推理能力，但目前仅应用于有监督、静态、完全可观察的问题。HRM 的优势之一是它能够根据问题的难度调整其计算量。然而，在目前的形式下，如果问题是动态的、不确定的或部分可观察的，它就无法整合和重用先前时间步的计算，也无法应用于正确动作未定义的情形，而这些正是许多现实世界问题的特征。本文介绍了 HRM-Agent，它是仅使用强化学习训练的 HRM 变体。我们表明，HRM 可以学会在动态和不确定的迷宫环境中导航到目标。最近的研究表明，HRM 的推理能力源于其循环推理过程。我们探索了循环推理过程的动态，并发现证据表明它成功地重用了早期环境时间步的计算。

Toward Agents That Reason About Their Computation

面向对其计算进行推理的代理

Authors: Adrian Orenstein, Jessica Chen, Gwyneth Anne Delos Santos, Bayley Sapara, Michael Bowling
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22833
Pdf link: https://arxiv.org/pdf/2510.22833
Abstract While reinforcement learning agents can achieve superhuman performance in many complex tasks, they typically do not become more computationally efficient as they improve. In contrast, humans gradually require less cognitive effort as they become more proficient at a task. If agents could reason about their compute as they learn, could they similarly reduce their computation footprint? If they could, we could have more energy efficient agents or free up compute cycles for other processes like planning. In this paper, we experiment with showing agents the cost of their computation and giving them the ability to control when they use compute. We conduct our experiments on the Arcade Learning Environment, and our results demonstrate that with the same training compute budget, agents that reason about their compute perform better on 75% of games. Furthermore, these agents use three times less compute on average. We analyze individual games and show where agents gain these efficiencies.
中文摘要 虽然强化学习代理可以在许多复杂的任务中实现超人的性能，但它们通常不会随着改进而变得更加计算效率。相比之下，随着人类对一项任务的熟练程度越来越高，他们逐渐需要更少的认知努力。如果代理可以在学习时推理他们的计算，他们是否可以同样减少他们的计算足迹？如果可以的话，我们可以拥有更节能的代理，或者为规划等其他流程腾出计算周期。在本文中，我们尝试向代理展示其计算成本，并赋予他们控制何时使用计算的能力。我们在 Arcade 学习环境中进行了实验，结果表明，在相同的训练计算预算下，推理计算的代理在 75% 的游戏中表现更好。此外，这些代理平均使用的计算量减少了三倍。我们分析各个游戏，并展示代理在哪些方面获得了这些效率。

Guardian: Decoupling Exploration from Safety in Reinforcement Learning

Guardian：强化学习中探索与安全的解耦

Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Keze Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.22859
Pdf link: https://arxiv.org/pdf/2510.22859
Abstract Hybrid offline-online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability due to distribution shift between offline and online data. We introduce RLPD-GX, a framework that decouples policy optimization from safety enforcement: a reward-seeking learner explores freely, while a projection-based guardian guarantees rule-consistent execution and safe value backups. This design preserves the exploratory value of online interactions without collapsing to conservative policies. To further stabilize training, we propose dynamic curricula that gradually extend temporal horizons and anneal offline-online data mixing. We prove convergence via a contraction property of the guarded Bellman operator, and empirically show state-of-the-art performance on Atari-100k, achieving a normalized mean score of 3.02 (+45% over prior hybrid methods) with stronger safety and stability. Beyond Atari, ablations demonstrate consistent gains across safety-critical and long-horizon tasks, underscoring the generality of our design. Extensive results highlight decoupled safety enforcement as a simple yet principled route to robust O2O RL, suggesting a broader paradigm for reconciling exploration and safety in reinforcement learning.
中文摘要 混合离线-在线强化学习（O2O RL）有望兼顾样本效率和稳健探索，但由于离线和在线数据之间的分布偏移而存在不稳定性。我们引入了 RLPD-GX，这是一个将策略优化与安全执行解耦的框架：寻求奖励的学习者可以自由探索，而基于投影的守护者（guardian）则保证规则一致的执行和安全的价值备份。这种设计保留了在线交互的探索价值，同时不会退化为保守策略。为了进一步稳定训练，我们提出了动态课程，逐步延长时间视野并对离线-在线数据混合比例进行退火。我们通过受保护贝尔曼算子的压缩性质证明了收敛性，并通过实验展示了在 Atari-100k 上最先进的性能，实现了 3.02 的归一化平均分数（比以前的混合方法高 45%），同时具有更强的安全性和稳定性。除了 Atari 之外，消融实验在安全关键型和长时域任务中也表现出一致的收益，这凸显了我们设计的通用性。大量结果表明，解耦的安全执行是通往稳健 O2O RL 的简单而有原则的途径，并为在强化学习中协调探索与安全提出了更广泛的范式。
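
The idea of a guardian that filters execution and value backups can be illustrated for a discrete action space. This is a minimal sketch, not RLPD-GX itself: it assumes the rule set is available as a boolean mask over actions, uses masking as a stand-in for the paper's projection step, and all function names are hypothetical.

```python
import numpy as np

def guard_action(q_values, safe_mask):
    """Pick the highest-valued action among those the rule set allows."""
    guarded_q = np.where(safe_mask, q_values, -np.inf)
    return int(np.argmax(guarded_q))

def guarded_target(reward, next_q, next_safe_mask, gamma=0.99, done=False):
    """One-step target that bootstraps only from allowed next actions."""
    if done or not np.any(next_safe_mask):
        # Terminal state, or no allowed action to bootstrap from.
        return float(reward)
    best_safe = np.max(np.where(next_safe_mask, next_q, -np.inf))
    return float(reward + gamma * best_safe)

q = [1.0, 5.0, 2.0]
mask = [True, False, True]          # action 1 violates a rule
action = guard_action(q, mask)       # the learner prefers 1, the guard executes 2
target = guarded_target(1.0, q, mask, gamma=0.5)
```

Because the maximum is taken over a fixed allowed subset, the backup remains a max-type operator discounted by `gamma`, which is the usual reason such operators are contractions; the paper's actual proof and operator definition are not given in the abstract.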

Offline Preference Optimization via Maximum Marginal Likelihood Estimation

通过最大边际似然估计进行离线偏好优化

Authors: Saeed Najafi, Alona Fyshe
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.22881
Pdf link: https://arxiv.org/pdf/2510.22881
Abstract Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter $\beta$ compared to alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model's general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO's implicit preference optimization within the gradient updates.
中文摘要 使大型语言模型（LLM）与人类偏好保持一致至关重要，但人类反馈强化学习（RLHF）等标准方法通常复杂且不稳定。在这项工作中，我们提出了一种新的、更简单的方法，通过最大边际似然（MML）估计的视角重新塑造对齐方式。我们新的基于 MML 的偏好优化（MMPO）使用偏好对作为近似样本，最大化了首选文本输出的边际对数似然，并且放弃了对显式奖励模型和熵最大化的需求。我们从理论上证明，MMPO 隐式执行偏好优化，产生加权梯度，自然地将所选响应的权重提高到被拒绝的响应上。在从 135M 到 8B 参数的模型中，我们根据经验表明，MMPO：1）与替代基线相比，相对于超参数 $\beta$ 更稳定，以及 2）实现竞争性或优越的偏好对齐，同时更好地保留基础模型的通用语言能力。通过一系列消融实验，我们表明这种性能的提高确实归因于MMPO在梯度更新中的隐式偏好优化。

Never Too Rigid to Reach: Adaptive Virtual Model Control with LLM- and Lyapunov-Based Reinforcement Learning

永远不会太僵化而无法实现：基于 LLM 和 Lyapunov 的强化学习的自适应虚拟模型控制

Authors: Jingzehua Xu, Yangyang Li, Yangfei Chen, Guanwen Xie, Shuai Zhang
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.22892
Pdf link: https://arxiv.org/pdf/2510.22892
Abstract Robotic arms are increasingly deployed in uncertain environments, yet conventional control pipelines often become rigid and brittle when exposed to perturbations or incomplete information. Virtual Model Control (VMC) enables compliant behaviors by embedding virtual forces and mapping them into joint torques, but its reliance on fixed parameters and limited coordination among virtual components constrains adaptability and may undermine stability as task objectives evolve. To address these limitations, we propose Adaptive VMC with Large Language Model (LLM)- and Lyapunov-Based Reinforcement Learning (RL), which preserves the physical interpretability of VMC while supporting stability-guaranteed online adaptation. The LLM provides structured priors and high-level reasoning that enhance coordination among virtual components, improve sample efficiency, and facilitate flexible adjustment to varying task requirements. Complementarily, Lyapunov-based RL enforces theoretical stability constraints, ensuring safe and reliable adaptation under uncertainty. Extensive simulations on a 7-DoF Panda arm demonstrate that our approach effectively balances competing objectives in dynamic tasks, achieving superior performance while highlighting the synergistic benefits of LLM guidance and Lyapunov-constrained adaptation.
中文摘要 机械臂越来越多地部署在不确定的环境中，但传统的控制管道在受到扰动或不完整信息时往往会变得僵硬和脆弱。虚拟模型控制（VMC）通过嵌入虚拟力并将其映射到关节扭矩中来实现顺应行为，但它对固定参数的依赖和虚拟组件之间的有限协调限制了适应性，并可能随着任务目标的发展破坏稳定性。为了解决这些限制，我们提出了基于大型语言模型（LLM）和李雅普诺夫的自适应VMC强化学习（RL），它保留了VMC的物理可解释性，同时支持稳定性保证的在线适配。LLM 提供结构化先验和高级推理，可增强虚拟组件之间的协调，提高样本效率，并有助于灵活调整以适应不同的任务要求。作为补充，基于李雅普诺夫的强化学习强制执行理论稳定性约束，确保在不确定性下安全可靠地适应。对 7-DoF Panda 手臂的广泛模拟表明，我们的方法有效地平衡了动态任务中的竞争目标，实现了卓越的性能，同时突出了 LLM 指导和李雅普诺夫约束适应的协同优势。

Hazard-Responsive Digital Twin for Climate-Driven Urban Resilience and Equity

气候驱动的城市复原力和公平的灾害响应数字孪生

Authors: Zhenglai Shen, Hongyu Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.22941
Pdf link: https://arxiv.org/pdf/2510.22941
Abstract Compounding climate hazards, such as wildfire-induced outages and urban heatwaves, challenge the stability and equity of cities. We present a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural network modeling, multimodal data fusion, and equity-aware risk analytics for urban-scale response. In a synthetic district with diverse building archetypes and populations, a simulated wildfire-outage-heatwave cascade shows that H-RDT maintains stable indoor temperature predictions (approximately 31 to 33 C) under partial sensor loss, reproducing outage-driven surges and recovery. The reinforcement learning based fusion module adaptively reweights IoT, UAV, and satellite inputs to sustain spatiotemporal coverage, while the equity-adjusted mapping isolates high-vulnerability clusters (schools, clinics, low-income housing). Prospective interventions, such as preemptive cooling-center activation and microgrid sharing, reduce population-weighted thermal risk by 11 to 13 percent, shrink the 95th-percentile (tail) risk by 7 to 17 percent, and cut overheating hours by up to 9 percent. Beyond the synthetic demonstration, the framework establishes a transferable foundation for real-city implementation, linking physical hazard modeling with social equity and decision intelligence. The H-RDT advances digital urban resilience toward adaptive, learning-based, and equity-centered decision support for climate adaptation.
中文摘要 野火引发的停电和城市热浪等气候灾害日益复杂，对城市的稳定性和公平性提出了挑战。我们提出了一种灾害响应数字孪生（H-RDT），它结合了物理信息神经网络建模、多模态数据融合和公平感知风险分析，用于城市规模的响应。在具有不同建筑原型和人口的合成区中，模拟的野火-停电-热浪级联表明，H-RDT 在部分传感器丢失的情况下保持稳定的室内温度预测（约 31 至 33 C），再现了停电驱动的激增和恢复。基于强化学习的融合模块自适应地重新加权物联网、无人机和卫星输入以维持时空覆盖，而公平调整的映射则隔离高脆弱性集群（学校、诊所、低收入住房）。前瞻性干预措施，如先发制人的冷却中心激活和微电网共享，可将人口加权热风险降低 11% 至 13%，将第 95 个百分位（尾部）风险降低 7% 至 17%，并将过热时间减少多达 9%。除了综合演示之外，该框架还为现实城市的实施奠定了可转移的基础，将物理危害建模与社会公平和决策智能联系起来。H-RDT 推动数字城市复原力，为气候适应提供适应性、基于学习和以公平为中心的决策支持。

Multi-Agent Conditional Diffusion Model with Mean Field Communication as Wireless Resource Allocation Planner

以平均场通信为无线资源分配规划器的多智能体条件扩散模型

Authors: Kechen Meng, Sinuo Zhang, Rongpeng Li, Xiangming Meng, Chan Wang, Ming Lei, Zhifeng Zhao
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.22969
Pdf link: https://arxiv.org/pdf/2510.22969
Abstract In wireless communication systems, efficient and adaptive resource allocation plays a crucial role in enhancing overall Quality of Service (QoS). While centralized Multi-Agent Reinforcement Learning (MARL) frameworks rely on a central coordinator for policy training and resource scheduling, they suffer from scalability issues and privacy risks. In contrast, the Distributed Training with Decentralized Execution (DTDE) paradigm enables distributed learning and decision-making, but it struggles with non-stationarity and limited inter-agent cooperation, which can severely degrade system performance. To overcome these challenges, we propose the Multi-Agent Conditional Diffusion Model Planner (MA-CDMP) for decentralized communication resource management. Built upon the Model-Based Reinforcement Learning (MBRL) paradigm, MA-CDMP employs Diffusion Models (DMs) to capture environment dynamics and plan future trajectories, while an inverse dynamics model guides action generation, thereby alleviating the sample inefficiency and slow convergence of conventional DTDE methods. Moreover, to approximate large-scale agent interactions, a Mean-Field (MF) mechanism is introduced as an assistance to the classifier in DMs. This design mitigates inter-agent non-stationarity and enhances cooperation with minimal communication overhead in distributed settings. We further theoretically establish an upper bound on the distributional approximation error introduced by the MF-based diffusion generation, guaranteeing convergence stability and reliable modeling of multi-agent stochastic dynamics. Extensive experiments demonstrate that MA-CDMP consistently outperforms existing MARL baselines in terms of average reward and QoS metrics, showcasing its scalability and practicality for real-world wireless network optimization.
中文摘要 在无线通信系统中，高效和自适应的资源分配在提高整体服务质量（QoS）方面发挥着至关重要的作用。虽然集中式多智能体强化学习（MARL）框架依赖于中央协调器进行策略训练和资源调度，但它们存在可扩展性问题和隐私风险。相比之下，具有去中心化执行的分布式训练（DTDE）范式支持分布式学习和决策，但它在非平稳性和有限的代理间合作方面存在困难，这会严重降低系统性能。为了克服这些挑战，我们提出了用于去中心化通信资源管理的多智能体条件扩散模型规划器（MA-CDMP）。MA-CDMP基于基于模型的强化学习（MBRL）范式，采用扩散模型（DM）来捕捉环境动态并规划未来轨迹，而逆动力学模型则指导动作生成，从而缓解了传统DTDE方法的样本效率低下和收敛缓慢的问题。此外，为了近似大规模的智能体交互，引入了均值场（MF）机制作为DM中分类器的辅助。这种设计减轻了代理间的非平稳性，并以分布式环境中最小的通信开销增强了协作。我们进一步从理论上建立了基于MF的扩散生成引入的分布近似误差的上限，保证了收敛稳定性和多智能体随机动力学的可靠建模。大量实验表明，MA-CDMP 在平均奖励和 QoS 指标方面始终优于现有的 MARL 基线，展示了其在实际无线网络优化中的可扩展性和实用性。

Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

Softmax 是 $1/2$-Lipschitz 的：对所有 $\ell_p$ 范数成立的紧界

Authors: Pravin Nair
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23012
Pdf link: https://arxiv.org/pdf/2510.23012
Abstract The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
中文摘要 softmax 函数是机器学习和优化中的基本算子，用于分类、注意力机制、强化学习、博弈论以及涉及 log-sum-exp 项的问题。学习模型的现有鲁棒性保证和优化算法的收敛分析通常认为 softmax 算子关于 $\ell_2$ 范数的 Lipschitz 常数为 $1$。在这项工作中，我们证明了 softmax 函数是压缩的，其 Lipschitz 常数为 $1/2$，且对所有 $p \ge 1$ 的 $\ell_p$ 范数一致成立。我们还表明，当 $p = 1$ 和 $p = \infty$ 时，softmax 的局部 Lipschitz 常数可以达到 $1/2$；而对于 $p \in (1,\infty)$，该常数严格小于 $1/2$，上确界 $1/2$ 仅在极限意义下达到。据我们所知，这是首个对 softmax Lipschitz 连续性的全面的、对范数一致的分析。我们展示了这一更紧的常数如何直接改进一系列关于鲁棒性和收敛性的现有理论结果。我们通过对基于注意力的架构（ViT、GPT-2、Qwen3-8B）和强化学习中随机策略的实证研究，进一步验证了 softmax 算子 $1/2$ Lipschitz 常数的紧性。
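
The $1/2$ bound can be sanity-checked numerically through the softmax Jacobian $J(x) = \mathrm{diag}(s) - s s^\top$ with $s = \mathrm{softmax}(x)$, whose induced norm bounds the local Lipschitz constant. The sketch below is an illustrative check on random inputs, not the paper's experiment; the sampling scale, dimension, and helper names are assumptions. It relies on `numpy.linalg.norm`, where `ord=1` is the maximum absolute column sum, `ord=2` the largest singular value, and `ord=np.inf` the maximum absolute row sum.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_jacobian(x):
    """Jacobian of softmax at x: diag(s) - s s^T."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

def max_jacobian_norm(dim, trials, norm_ord, seed=0):
    """Largest induced norm of the Jacobian seen over random Gaussian logits."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.normal(scale=3.0, size=dim)
        worst = max(worst, np.linalg.norm(softmax_jacobian(x), norm_ord))
    return worst

for norm_ord in (1, 2, np.inf):
    assert max_jacobian_norm(dim=5, trials=200, norm_ord=norm_ord) <= 0.5 + 1e-12

# Two equal logits give s = (1/2, 1/2), where the l1-induced norm equals 1/2.
tight = np.linalg.norm(softmax_jacobian(np.zeros(2)), 1)
```

For the $\ell_1$ and $\ell_\infty$ cases the bound also follows by hand: column (or row) $i$ of $J$ has absolute sum $2 s_i (1 - s_i)$, which is at most $1/2$ and equals $1/2$ when $s_i = 1/2$.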

Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

面向混合专家模型的稳定有效强化学习

Authors: Di Zhang, Xun Wu, Shaohan Huang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.23027
Pdf link: https://arxiv.org/pdf/2510.23027
Abstract Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
中文摘要 强化学习（RL）的最新进展极大地改进了大规模语言模型的训练，从而在生成质量和推理能力方面取得了显著的提升。然而，大多数现有研究都集中在稠密模型上，而混合专家（MoE）架构的 RL 训练仍然没有得到充分探索。为了解决 MoE 训练中常见的不稳定性，我们提出了一种新的路由器感知方法来优化离策略 RL 中的重要性采样（IS）权重。具体来说，我们设计了一种以路由器 logits 为指导的重缩放策略，可以有效降低梯度方差并缓解训练发散。实验结果表明，该方法显著提高了 MoE 模型的收敛稳定性和最终性能，凸显了针对 MoE 架构量身定制的 RL 算法创新的潜力，为大规模专家模型的高效训练提供了有希望的方向。

Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

通过工具集成强化学习激励 LLM 评判器的智能体式推理

Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23038
Pdf link: https://arxiv.org/pdf/2510.23038
Abstract Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
中文摘要 大型语言模型（LLM）被广泛用作评估响应质量的评判器，为人工评估提供了一种可扩展的替代方案。然而，大多数 LLM 评判器仅依赖基于文本的内在推理，限制了它们验证复杂约束或执行准确计算的能力。受工具集成推理（TIR）在众多任务中取得成功的启发，我们提出了 TIR-Judge，这是一个用于训练 LLM 评判器的端到端 RL 框架，它集成了用于精确评估的代码执行器。TIR-Judge 建立在三个原则之上：（i）跨可验证和不可验证领域的多样化训练，（ii）灵活的评判格式（逐点、成对、列表式），以及（iii）无需蒸馏、直接从初始模型自举的迭代 RL。在七个公开基准测试中，TIR-Judge 比强大的基于推理的评判器最多高出 6.4%（逐点）和 7.7%（成对），并且尽管只有 8B 参数，却实现了与 Claude-Opus-4 相当的列表式性能。值得注意的是，完全不使用蒸馏评判轨迹训练的 TIR-Judge-Zero 与蒸馏变体的性能相当，证明工具增强的评判器可以通过迭代强化学习实现自我进化。

Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

作为替代奖励最大化的优势塑造：统一 Pass@K 策略梯度

Authors: Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23049
Pdf link: https://arxiv.org/pdf/2510.23049
Abstract This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
中文摘要 本文调和了在具有可验证奖励的强化学习中针对 Pass@K 目标进行策略梯度优化的两种看似不同的方法：（1）直接的 REINFORCE 式方法，以及（2）直接修改 GRPO 的优势塑造技术。我们表明，这两者是同一枚硬币的两面。通过对现有的优势塑造算法进行逆向工程，我们发现它们隐式地优化了替代奖励。我们特别将对 GRPO 的实用“困难样本加权”修改解释为奖励层面的正则化。反过来，从替代奖励目标出发，我们提供了一个简单的方法来推导现有的和新的优势塑造方法。这一视角为 RLVR 策略梯度优化提供了一个观察角度，其适用范围超出了我们最初针对 Pass@K 的动机。
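
For readers unfamiliar with the Pass@K objective named above, the sketch below shows the widely used combinatorial estimator of Pass@K from `n` sampled responses of which `c` are correct. It is background only and is not the advantage-shaping recipe of the note; the function name `pass_at_k` is an illustrative choice.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples, c of which pass the verifier.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly chosen size-k subset of the n samples contains no correct one.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the plain success rate c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```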

AirFed: Federated Graph-Enhanced Multi-Agent Reinforcement Learning for Multi-UAV Cooperative Mobile Edge Computing

AirFed：面向多无人机协作移动边缘计算的联邦图增强多智能体强化学习

Authors: Zhiyu Wang, Suman Raj, Rajkumar Buyya
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.23053
Pdf link: https://arxiv.org/pdf/2510.23053
Abstract Multiple Unmanned Aerial Vehicles (UAVs) cooperative Mobile Edge Computing (MEC) systems face critical challenges in coordinating trajectory planning, task offloading, and resource allocation while ensuring Quality of Service (QoS) under dynamic and uncertain environments. Existing approaches suffer from limited scalability, slow convergence, and inefficient knowledge sharing among UAVs, particularly when handling large-scale IoT device deployments with stringent deadline constraints. This paper proposes AirFed, a novel federated graph-enhanced multi-agent reinforcement learning framework that addresses these challenges through three key innovations. First, we design dual-layer dynamic Graph Attention Networks (GATs) that explicitly model spatial-temporal dependencies among UAVs and IoT devices, capturing both service relationships and collaborative interactions within the network topology. Second, we develop a dual-Actor single-Critic architecture that jointly optimizes continuous trajectory control and discrete task offloading decisions. Third, we propose a reputation-based decentralized federated learning mechanism with gradient-sensitive adaptive quantization, enabling efficient and robust knowledge sharing across heterogeneous UAVs. Extensive experiments demonstrate that AirFed achieves 42.9% reduction in weighted cost compared to state-of-the-art baselines, attains over 99% deadline satisfaction and 94.2% IoT device coverage rate, and reduces communication overhead by 54.5%. Scalability analysis confirms robust performance across varying UAV numbers, IoT device densities, and system scales, validating AirFed's practical applicability for large-scale UAV-MEC deployments.
中文摘要 多无人机（UAV）协作移动边缘计算（MEC）系统在协调轨迹规划、任务卸载和资源分配，同时确保动态和不确定环境下的服务质量（QoS）方面面临着关键挑战。现有方法存在可扩展性有限、收敛缓慢以及无人机之间知识共享效率低下的问题，特别是在处理具有严格期限限制的大规模物联网设备部署时。本文提出了AirFed，这是一种新型的联邦图增强多智能体强化学习框架，它通过三项关键创新来应对这些挑战。首先，我们设计了双层动态图注意力网络（GAT），该网络对无人机和物联网设备之间的时空依赖关系进行显式建模，捕获网络拓扑中的服务关系和协作交互。其次，我们开发了一种双 Actor 单批评架构，共同优化连续轨迹控制和离散任务卸载决策。第三，我们提出了一种基于信誉的去中心化联邦学习机制，具有梯度敏感的自适应量化，从而实现跨异构无人机的高效、稳健的知识共享。广泛的实验表明，与最先进的基线相比，AirFed 的加权成本降低了 42.9%，实现了超过 99% 的截止日期满意度和 94.2% 的物联网设备覆盖率，并将通信开销降低了 54.5%。可扩展性分析证实了在不同无人机数量、物联网设备密度和系统规模下具有强大的性能，验证了 AirFed 在大规模 UAV-MEC 部署中的实际适用性。

Think before Recommendation: Autonomous Reasoning-enhanced Recommender

推荐前三思：自主推理增强推荐器

Authors: Xiaoyu Kong, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Jiancan Wu, Xiang Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23077
Pdf link: https://arxiv.org/pdf/2510.23077
Abstract The core task of recommender systems is to learn user preferences from historical user-item interactions. With the rapid development of large language models (LLMs), recent research has explored leveraging the reasoning capabilities of LLMs to enhance rating prediction tasks. However, existing distillation-based methods suffer from limitations such as the teacher model's insufficient recommendation capability, costly and static supervision, and superficial transfer of reasoning ability. To address these issues, this paper proposes RecZero, a reinforcement learning (RL)-based recommendation paradigm that abandons the traditional multi-model and multi-stage distillation approach. Instead, RecZero trains a single LLM through pure RL to autonomously develop reasoning capabilities for rating prediction. RecZero consists of two key components: (1) "Think-before-Recommendation" prompt construction, which employs a structured reasoning template to guide the model in step-wise analysis of user interests, item features, and user-item compatibility; and (2) rule-based reward modeling, which adopts group relative policy optimization (GRPO) to compute rewards for reasoning trajectories and optimize the LLM. Additionally, the paper explores a hybrid paradigm, RecOne, which combines supervised fine-tuning with RL, initializing the model with cold-start reasoning samples and further optimizing it with RL. Experimental results demonstrate that RecZero and RecOne significantly outperform existing baseline methods on multiple benchmark datasets, validating the superiority of the RL paradigm in achieving autonomous reasoning-enhanced recommender systems.
中文摘要 推荐系统的核心任务是从历史用户与项目的交互中学习用户偏好。随着大型语言模型（LLMs）的快速发展，最近的研究探索了利用LLMs的推理能力来增强评级预测任务。然而，现有的基于蒸馏的方法存在教师模型推荐能力不足、监督成本高、静态化、推理能力转移肤浅等局限性。针对这些问题，本文提出了RecZero，这是一种基于强化学习（RL）的推荐范式，摒弃了传统的多模型和多阶段蒸馏方法。相反，RecZero 通过纯 RL 训练单个 LLM，以自主开发用于评级预测的推理能力。RecZero由两个关键组件组成：（1）“先思后推荐”提示构建，采用结构化推理模板指导模型对用户兴趣、项目特征和用户-项目兼容性进行逐步分析;（2）基于规则的奖励建模，采用群体相对策略优化（GRPO）来计算推理轨迹的奖励并优化LLM。此外，本文还探索了一种混合范式 RecOne，它将监督微调与 RL 相结合，用冷启动推理样本初始化模型，并使用 RL 进一步优化模型。实验结果表明，RecZero和RecOne在多个基准数据集上明显优于现有的基线方法，验证了RL范式在实现自主推理增强推荐系统方面的优越性。
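
The abstract pairs a rule-based reward with GRPO. As a rough illustration only, the sketch below computes group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation) for several sampled rating predictions of one user-item pair. The reward rule `rating_reward` and its `max_error` parameter are hypothetical and are not claimed to be RecZero's actual reward.

```python
import numpy as np

def rating_reward(predicted: float, true: float, max_error: float = 4.0) -> float:
    # Hypothetical rule: 1.0 for an exact rating, decreasing linearly
    # with the absolute error, clipped at 0.0.
    return 1.0 - min(abs(predicted - true), max_error) / max_error

def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    # GRPO-style normalization within one group of sampled responses.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled predictions for one item whose true rating is 4.
predictions = [5, 4, 2, 4]
rewards = [rating_reward(p, true=4) for p in predictions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses that beat the group average receive positive advantages, so no separate value network is needed for this step.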

Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI

在 BabyAI 中采用 PPO 调整交错编码器以进行语言引导强化学习

Authors: Aryan Mathur, Asaduddin Ahmed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2510.23148
Pdf link: https://arxiv.org/pdf/2510.23148
Abstract Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and find that the approach achieves more stable rewards and stronger alignment compared to a standard PPO baseline. The results suggest that interleaved transformer encoders are a promising direction for developing more integrated autonomous agents.
中文摘要 当任务需要同时理解视觉和语言时，深度强化学习智能体经常会遇到困难。传统架构通常将感知（例如，基于 CNN 的视觉编码器）与决策（策略网络）隔离开来。这种分离可能效率低下，因为策略的失败不会直接帮助感知模块了解什么是重要的。为了解决这个问题，我们实现了 Mao 等人（2023）提出的感知-决策交错 Transformer（PDiT）架构，该模型在单个 Transformer 中交替使用感知层和决策层。这种交错使来自决策的反馈能够动态地细化感知特征。此外，我们还集成了受 CLIP 启发的对比损失，以使文本任务嵌入与视觉场景特征对齐。我们在 BabyAI GoToLocal 环境中评估了 PDiT 编码器，发现与标准 PPO 基线相比，该方法实现了更稳定的奖励和更强的对齐。结果表明，交错 Transformer 编码器是开发更集成的自主智能体的一个有前途的方向。
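
The "contrastive loss inspired by CLIP" mentioned above is typically a symmetric cross-entropy over cosine similarities between paired embeddings. The sketch below is a generic NumPy version under that assumption; the function names, the temperature value, and the toy embeddings are illustrative and are not taken from the paper.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb: np.ndarray, image_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    # L2-normalize so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature
    idx = np.arange(len(t))  # row i of text is paired with row i of image
    loss_text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_text_to_image + loss_image_to_text))

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
emb = np.eye(4)
aligned = clip_style_loss(emb, emb)
shuffled = clip_style_loss(emb, np.roll(emb, 1, axis=0))
print(aligned, shuffled)
```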

Guiding Skill Discovery with Foundation Models

使用基础模型指导技能发现

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Vincent François-Lavet, Edward S. Hu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23167
Pdf link: https://arxiv.org/pdf/2510.23167
Abstract Learning diverse skills without hand-crafted reward functions could accelerate reinforcement learning in downstream tasks. However, existing skill discovery methods focus solely on maximizing the diversity of skills without considering human preferences, which leads to undesirable behaviors and possibly dangerous skills. For instance, a cheetah robot trained using previous methods learns to roll in all directions to maximize skill diversity, whereas we would prefer it to run without flipping or entering hazardous areas. In this work, we propose a Foundation model Guided (FoG) skill discovery method, which incorporates human intentions into skill discovery through foundation models. Specifically, FoG extracts a score function from foundation models to evaluate states based on human intentions, assigning higher values to desirable states and lower to undesirable ones. These scores are then used to re-weight the rewards of skill discovery algorithms. By optimizing the re-weighted skill discovery rewards, FoG successfully learns to eliminate undesirable behaviors, such as flipping or rolling, and to avoid hazardous areas in both state-based and pixel-based tasks. Interestingly, we show that FoG can discover skills involving behaviors that are difficult to define. Interactive visualisations are available from this https URL.
中文摘要 在没有手工制作的奖励函数的情况下学习多样化的技能可以加速下游任务的强化学习。然而，现有的技能发现方法只关注技能的多样性最大化，而不考虑人类的偏好，这会导致不良行为和可能的危险技能。例如，使用以前方法训练的猎豹机器人学会向各个方向滚动，以最大限度地提高技能多样性，而我们更希望它在不翻转或进入危险区域的情况下运行。在这项工作中，我们提出了一种基础模型引导（FoG）技能发现方法，该方法通过基础模型将人类意图纳入技能发现中。具体来说，FoG 从基础模型中提取一个评分函数，以根据人类意图评估状态，为理想的状态分配较高的值，为不良状态分配较低的值。然后，这些分数用于重新加权技能发现算法的奖励。通过优化重新加权的技能发现奖励，FoG 成功地学会了消除不良行为，例如翻转或滚动，并在基于状态和基于像素的任务中避开危险区域。有趣的是，我们表明 FoG 可以发现涉及难以定义的行为的技能。交互式可视化可从此 https URL 获得。

TARC: Time-Adaptive Robotic Control

TARC：时间自适应机器人控制

Authors: Arnav Sukhija, Lenart Treven, Jin Cheng, Florian Dörfler, Stelian Coros, Andreas Krause
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23176
Pdf link: https://arxiv.org/pdf/2510.23176
Abstract Fixed-frequency control in robotics imposes a trade-off between the efficiency of low-frequency control and the robustness of high-frequency control, a limitation not seen in adaptable biological systems. We address this with a reinforcement learning approach in which policies jointly select control actions and their application durations, enabling robots to autonomously modulate their control frequency in response to situational demands. We validate our method with zero-shot sim-to-real experiments on two distinct hardware platforms: a high-speed RC car and a quadrupedal robot. Our method matches or outperforms fixed-frequency baselines in terms of rewards while significantly reducing the control frequency and exhibiting adaptive frequency control under real-world conditions.
中文摘要 机器人技术中的固定频率控制在低频控制的效率和高频控制的鲁棒性之间进行了权衡，这在适应性生物系统中是看不到的。我们通过强化学习方法解决这个问题，在该方法中，策略共同选择控制动作及其应用持续时间，使机器人能够根据情境需求自主调节其控制频率。我们在两个不同的硬件平台上通过零样本模拟到真实实验来验证我们的方法：高速遥控车和四足机器人。我们的方法在奖励方面与固定频率基线相匹配或优于固定频率基线，同时显着降低控制频率并在现实条件下表现出自适应频率控制。
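
The idea of a policy that returns both an action and how long to hold it can be sketched with a toy rollout loop. Everything below is a hypothetical illustration (a 1-D point driven toward the origin, hand-written policies, made-up durations) and not the authors' training setup. It only shows how holding actions for variable durations reduces the number of control decisions.

```python
def rollout(policy, step_fn, state, horizon_steps, dt):
    # Run the simulator for horizon_steps ticks of length dt. The policy is
    # queried only when the previously chosen hold duration has elapsed.
    steps_done, total_reward, decisions = 0, 0.0, 0
    while steps_done < horizon_steps:
        action, duration = policy(state)
        hold = max(1, round(duration / dt))          # ticks to hold the action
        hold = min(hold, horizon_steps - steps_done)  # do not overrun horizon
        for _ in range(hold):
            state, reward = step_fn(state, action, dt)
            total_reward += reward
        steps_done += hold
        decisions += 1
    return state, total_reward, decisions

def step_fn(x, velocity, dt):
    # Toy dynamics: integrate velocity; reward penalizes distance from 0.
    x_next = x + velocity * dt
    return x_next, -abs(x_next) * dt

def adaptive_policy(x):
    # Hypothetical rule: hold longer when far from the goal.
    duration = 0.2 if abs(x) > 0.5 else 0.05
    return -2.0 * x, duration

def fixed_policy(x):
    # Fixed-frequency baseline: one decision per simulator tick.
    return -2.0 * x, 0.01

_, _, adaptive_decisions = rollout(adaptive_policy, step_fn, 1.0, 100, 0.01)
_, _, fixed_decisions = rollout(fixed_policy, step_fn, 1.0, 100, 0.01)
print(adaptive_decisions, fixed_decisions)
```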

Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach

真实足球模拟中的类人守门员：一种样本高效的强化学习方法

Authors: Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Micheal Jones, Linus Gisslén
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23216
Pdf link: https://arxiv.org/pdf/2510.23216
Abstract While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game's built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testimony of the impact of the approach, the method is intended to replace the hand-crafted counterpart in next iterations of the series.
中文摘要 虽然几款备受瞩目的视频游戏已成为深度强化学习（DRL）的测试平台，但游戏行业很少采用这种技术来制作真实的 AI 行为。此前的研究侧重于用大模型训练超人智能体，这对于资源有限、以类人智能体为目标的游戏工作室来说是不切实际的。本文提出了一种样本高效的 DRL 方法，专为在视频游戏行业等工业环境中训练和微调智能体而设计。我们的方法通过利用预先收集的数据和提高网络可塑性来提升基于价值的 DRL 的样本效率。我们在当今最畅销的足球模拟游戏之一 EA SPORTS FC 25 中训练守门员智能体，以此评估我们的方法。我们的智能体在扑救率方面比游戏内置的 AI 高出 10%。消融研究表明，与标准 DRL 方法相比，我们的方法训练智能体的速度提高了 50%。最后，领域专家的定性评估表明，与手工制作的智能体相比，我们的方法带来了更像人类的游戏表现。作为该方法影响力的佐证，该方法计划在该系列的后续版本中取代手工制作的对应方案。

Code Aesthetics with Agentic Reward Feedback

带有代理奖励反馈的代码美学

Authors: Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.23272
Pdf link: https://arxiv.org/pdf/2510.23272
Abstract Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.
中文摘要 大型语言模型（LLM）已成为开发人员在代码相关任务中的宝贵助手。虽然 LLM 擅长代码生成和错误修复等传统编程任务，但它们在面向视觉的编码任务中表现不佳，往往产生美感欠佳的结果。在本文中，我们引入了一种新的流程来提升 LLM 生成代码的美学质量。我们首先构建了 AesCode-358K，这是一个专注于代码美学的大规模指令微调数据集。接下来，我们提出智能体式奖励反馈（agentic reward feedback），这是一个评估可执行性、静态美学和交互美学的多智能体系统。在此基础上，我们开发了 GRPO-AR，它将这些信号集成到 GRPO 算法中，以联合优化功能和代码美学。最后，我们开发了 OpenDesign，这是一个评估代码美学的基准。实验结果表明，将 AesCode-358K 上的监督微调与使用智能体式奖励反馈的强化学习相结合，可以显著提高在 OpenDesign 上的性能，并改善在 PandasPlotBench 等现有基准上的结果。值得注意的是，我们的 AesCoder-4B 超越了 GPT-4o 和 GPT-4.1，并实现了与具有 480B-685B 参数的大型开源模型相当的性能，凸显了我们方法的有效性。

CNOT Minimal Circuit Synthesis: A Reinforcement Learning Approach

CNOT 最小电路合成：一种强化学习方法

Authors: Riccardo Romanello, Daniele Lizzio Bosco, Jacopo Cossio, Dusan Sutulovic, Giuseppe Serra, Carla Piazza, Paolo Burelli
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23304
Pdf link: https://arxiv.org/pdf/2510.23304
Abstract CNOT gates are fundamental to quantum computing, as they facilitate entanglement, a crucial resource for quantum algorithms. Certain classes of quantum circuits are constructed exclusively from CNOT gates. Given their widespread use, it is imperative to minimise the number of CNOT gates employed. This problem, known as CNOT minimisation, remains an open challenge, with its computational complexity yet to be fully characterised. In this work, we introduce a novel reinforcement learning approach to address this task. Instead of training multiple reinforcement learning agents for different circuit sizes, we use a single agent up to a fixed size $m$. Matrices of sizes different from $m$ are preprocessed using either embedding or Gaussian striping. To assess the efficacy of our approach, we trained an agent with $m = 8$, and evaluated it on matrices of size $n$ that range from 3 to 15. The results we obtained show that our method outperforms the state-of-the-art algorithm as the value of $n$ increases.
中文摘要 CNOT 门是量子计算的基础，因为它们促成纠缠，而纠缠是量子算法的重要资源。某些类别的量子电路完全由 CNOT 门构建。鉴于其广泛使用，必须尽量减少所用 CNOT 门的数量。这个问题被称为 CNOT 最小化，仍然是一个悬而未决的挑战，其计算复杂性尚未得到充分刻画。在这项工作中，我们引入了一种新的强化学习方法来解决这一任务。我们不是针对不同的电路规模训练多个强化学习智能体，而是使用单个智能体处理不超过固定规模 $m$ 的情形。规模不同于 $m$ 的矩阵使用嵌入或 Gaussian striping 进行预处理。为了评估我们方法的有效性，我们训练了一个 $m = 8$ 的智能体，并在规模 $n$ 从 3 到 15 的矩阵上对其进行了评估。我们获得的结果表明，随着 $n$ 值的增加，我们的方法优于最先进的算法。
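
As background for the problem setting, a CNOT-only circuit on $n$ qubits corresponds to an invertible $n \times n$ matrix over GF(2), and each CNOT acts as a row addition. The sketch below is the textbook Gauss-Jordan baseline, not the paper's reinforcement learning agent. The `embed` helper is only one plausible reading of "embedding" (padding with an identity block) and is an assumption, as are all function names.

```python
import numpy as np

def synthesize_cnot(matrix):
    # Reduce the matrix to the identity with GF(2) row additions; each
    # addition "row[target] ^= row[control]" is one CNOT(control, target).
    a = np.array(matrix, dtype=np.uint8) % 2
    n = a.shape[0]
    ops = []
    for col in range(n):
        if a[col, col] == 0:
            pivots = [r for r in range(col + 1, n) if a[r, col]]
            if not pivots:
                raise ValueError("matrix is not invertible over GF(2)")
            a[col] ^= a[pivots[0]]
            ops.append((pivots[0], col))
        for row in range(n):
            if row != col and a[row, col]:
                a[row] ^= a[col]
                ops.append((col, row))
    # Row additions are self-inverse, so the reversed list rebuilds `matrix`.
    return ops[::-1]

def apply_circuit(circuit, n):
    # Replay the CNOT list on the identity to recover its GF(2) matrix.
    m = np.eye(n, dtype=np.uint8)
    for control, target in circuit:
        m[target] ^= m[control]
    return m

def embed(matrix, m):
    # Assumed embedding: place the n x n matrix in the top-left corner of
    # an m x m identity so a size-m solver can process it.
    a = np.array(matrix, dtype=np.uint8) % 2
    out = np.eye(m, dtype=np.uint8)
    out[: a.shape[0], : a.shape[0]] = a
    return out

target = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
circuit = synthesize_cnot(target)
print(circuit, len(circuit))
```

Gaussian elimination gives a valid circuit but generally not a minimal one, which is the gap that learned or heuristic synthesis methods try to close.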

Transferable Deep Reinforcement Learning for Cross-Domain Navigation: from Farmland to the Moon

跨域导航的可转移深度强化学习：从农田到月球

Authors: Shreya Santra, Thomas Robbins, Kazuya Yoshida
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.23329
Pdf link: https://arxiv.org/pdf/2510.23329
Abstract Autonomous navigation in unstructured environments is essential for field and planetary robotics, where robots must efficiently reach goals while avoiding obstacles under uncertain conditions. Conventional algorithmic approaches often require extensive environment-specific tuning, limiting scalability to new domains. Deep Reinforcement Learning (DRL) provides a data-driven alternative, allowing robots to acquire navigation strategies through direct interactions with their environment. This work investigates the feasibility of DRL policy generalization across visually and topographically distinct simulated domains, where policies are trained in terrestrial settings and validated in a zero-shot manner in extraterrestrial environments. A 3D simulation of an agricultural rover is developed and trained using Proximal Policy Optimization (PPO) to achieve goal-directed navigation and obstacle avoidance in farmland settings. The learned policy is then evaluated in a lunar-like simulated environment to assess transfer performance. The results indicate that policies trained under terrestrial conditions retain a high level of effectiveness, achieving close to 50% success in lunar simulations without the need for additional training and fine-tuning. This underscores the potential of cross-domain DRL-based policy transfer as a promising approach to developing adaptable and efficient autonomous navigation for future planetary exploration missions, with the added benefit of minimizing retraining costs.
中文摘要 非结构化环境中的自主导航对于野外和行星机器人技术至关重要，机器人必须在不确定条件下有效地达到目标，同时避开障碍物。传统的算法方法通常需要针对特定环境进行广泛的调整，从而限制了对新领域的可扩展性。深度强化学习（DRL）提供了一种数据驱动的替代方案，允许机器人通过与环境的直接交互来获取导航策略。这项工作调查了在视觉和地形上不同的模拟域中进行 DRL 策略泛化的可行性，其中策略在陆地环境中进行训练，并在外星环境中以零样本方式进行验证。使用近端策略优化（PPO）开发和训练农业漫游车的 3D 模拟，以实现农田环境中的目标导向导航和避障。然后在类似月球的模拟环境中评估学习到的策略，以评估传输性能。结果表明，在地面条件下训练的策略保持了高水平的有效性，在月球模拟中取得了接近 50% 的成功率，而无需额外的训练和微调。这凸显了基于 DRL 的跨域策略转移的潜力，它是一种有前途的方法，可以为未来的行星探索任务开发适应性强且高效的自主导航，并具有最大限度地降低再培训成本的额外好处。

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

集 N 个世界之长：通过 max@k 优化使强化学习与 Best-of-N 采样对齐

Authors: Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23393
Pdf link: https://arxiv.org/pdf/2510.23393
Abstract The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to the off-policy updates, a common element in modern RLVR algorithms, that allows better sample efficiency. Empirically, we show that our objective effectively optimizes max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
中文摘要 具有可验证奖励的强化学习（RLVR）在数学和编码领域的应用表明，大型语言模型的推理和解决问题的能力得到了显著提高。尽管在单次生成的问题求解上取得了成功，强化学习微调过程可能会损害模型的探索能力，这体现为生成结果多样性的下降，以及由此导致的在 N 值较大时 Best-of-N 采样性能的退化。在这项工作中，我们专注于优化 max@k 指标，即 pass@k 的连续推广。我们推导出一个无偏的在策略（on-policy）梯度估计，用于直接优化该指标。此外，我们将推导扩展到离策略（off-policy）更新，这是现代 RLVR 算法中的常见组成部分，可以带来更高的样本效率。在实验上，我们表明我们的目标能在离策略场景中有效优化 max@k 指标，使模型与 Best-of-N 推理策略对齐。
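
To make "max@k as a continuous generalization of pass@k" concrete, the sketch below estimates the expected maximum reward over `k` draws from `n` sampled rewards using order statistics, and cross-checks it against brute-force enumeration. This covers only the metric. The paper's gradient estimators are not reproduced here, and the function names are illustrative.

```python
from itertools import combinations
from math import comb

def max_at_k(rewards, k: int) -> float:
    # Average of max(subset) over all size-k subsets of the n rewards.
    # After sorting ascending, the element at 0-based position i is the
    # maximum of exactly C(i, k - 1) subsets.
    n = len(rewards)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    ordered = sorted(rewards)
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)

def max_at_k_bruteforce(rewards, k: int) -> float:
    # Direct enumeration, usable only for small n; kept as a sanity check.
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)

continuous = [0.1, 0.7, 0.4, 1.0]
binary = [0, 0, 1, 1]
print(max_at_k(continuous, 2), max_at_k(binary, 2))
```

With binary rewards the value coincides with the combinatorial pass@k estimate, which is the sense in which max@k extends pass@k to real-valued rewards.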

VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations

VideoTG-R1：通过反射边界注释的课程强化学习增强视频时间基础

Authors: Lu Dong, Haiyu Zhang, Han Lin, Ziang Yan, Xiangyu Zeng, Hongjie Zhang, Yifei Huang, Yi Wang, Zhen-Hua Ling, Limin Wang, Yali Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.23397
Pdf link: https://arxiv.org/pdf/2510.23397
Abstract Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at this https URL.
中文摘要 视频时序定位（VTG）旨在根据语言查询定位视频中的精确片段，这是视频理解中的一个基本挑战。虽然最近的多模态大型语言模型（MLLM）在通过强化学习（RL）解决 VTG 方面显示出希望，但它们忽视了训练样本的质量和难度带来的挑战。（1）部分标注的样本。许多样本在标注区间之外还包含相关片段，引入了模糊的监督。（2）难以定位的样本。零样本性能较差的样本在 RL 训练过程中产生始终较低且难以区分的奖励，在多个输出之间没有表现出明确的偏好，从而阻碍了学习效率。为了应对这些挑战，我们提出了 VideoTG-R1，这是一种带有反思式（reflected）边界标注的新型课程 RL 框架，可实现数据高效的训练。具体来说，我们提出了一种边界反思智能体（Boundary Reflection Agent），它利用 MLLM 预测标注区间之外与查询相关的时间戳，使我们能够识别并过滤掉部分标注的样本，从而减少歧义。此外，我们引入了难度估计智能体（Difficulty Estimation Agent）来评估每个样本的训练难度，并设计了一种课程 RL 策略，根据训练步数动态遮蔽难以定位样本的视频，从而降低训练难度并提供更清晰的偏好。在 VTG 和 grounded VideoQA 任务上的实验证明了我们方法的有效性。值得注意的是，VideoTG-R1 仅使用 10% 的训练样本和 21% 的计算预算，就在组相对策略优化（GRPO）和监督微调（SFT）两种设置下均优于使用全部数据的对应模型。该代码可在此 https URL 中找到。

Causal Deep Q Network

因果深Q网络

Authors: Elouanes Khelifi, Amir Saki, Usef Faghihi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23424
Pdf link: https://arxiv.org/pdf/2510.23424
Abstract Deep Q Networks (DQN) have shown remarkable success in various reinforcement learning tasks. However, their reliance on associative learning often leads to the acquisition of spurious correlations, hindering their problem-solving capabilities. In this paper, we introduce a novel approach to integrate causal principles into DQNs, leveraging the PEACE (Probabilistic Easy vAriational Causal Effect) formula for estimating causal effects. By incorporating causal reasoning during training, our proposed framework enhances the DQN's understanding of the underlying causal structure of the environment, thereby mitigating the influence of confounding factors and spurious correlations. We demonstrate that integrating DQNs with causal capabilities significantly enhances their problem-solving capabilities without compromising performance. Experimental results on standard benchmark environments showcase that our approach outperforms conventional DQNs, highlighting the effectiveness of causal reasoning in reinforcement learning. Overall, our work presents a promising avenue for advancing the capabilities of deep reinforcement learning agents through principled causal inference.
中文摘要 Deep Q Networks （DQN）在各种强化学习任务中取得了显着的成功。然而，他们对联想学习的依赖往往会导致获得虚假相关性，从而阻碍他们解决问题的能力。在本文中，我们介绍了一种将因果原理整合到 DQN 中的新方法，利用 PEACE（概率简易因果效应）公式来估计因果效应。通过在训练过程中结合因果推理，我们提出的框架增强了DQN对环境潜在因果结构的理解，从而减轻了混杂因素和虚假相关性的影响。我们证明，将 DQN 与因果能力集成可以显着增强其解决问题的能力，而不会影响性能。在标准基准测试环境中的实验结果表明，我们的方法优于传统的 DQN，凸显了因果推理在强化学习中的有效性。总体而言，我们的工作为通过有原则的因果推理提高深度强化学习智能体的能力提供了一条有前途的途径。

An Information-Theoretic Analysis of Out-of-Distribution Generalization in Meta-Learning with Applications to Meta-RL

元学习中分布外泛化的信息论分析及其在元强化学习中的应用

Authors: Xingtu Liu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.23448
Pdf link: https://arxiv.org/pdf/2510.23448
Abstract In this work, we study out-of-distribution generalization in meta-learning from an information-theoretic perspective. We focus on two scenarios: (i) when the testing environment mismatches the training environment, and (ii) when the training environment is broader than the testing environment. The first corresponds to the standard distribution mismatch setting, while the second reflects a broad-to-narrow training scenario. We further formalize the generalization problem in meta-reinforcement learning and establish corresponding generalization bounds. Finally, we analyze the generalization performance of a gradient-based meta-reinforcement learning algorithm.
中文摘要 在这项工作中，我们从信息论的角度研究了元学习中的分布外泛化。我们重点关注两种情况：（i）当测试环境与训练环境不匹配时，以及（ii）当训练环境比测试环境更广泛时。第一个对应于标准分布不匹配设置，而第二个则反映从广义到狭义的训练场景。我们进一步形式化了元强化学习中的泛化问题，并建立了相应的泛化边界。最后，分析了基于梯度的元强化学习算法的泛化性能。

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

MergeMix：用于视觉和多模态理解的统一增强范式

Authors: Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.23479
Pdf link: https://arxiv.org/pdf/2510.23479
Abstract Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL brings in a reward signal for training, but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies an attention-aware image mixing via token merge with more cluster representation and spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs with mixed images and raw images, and optimizing via SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
中文摘要 多模态大型语言模型（MLLM）中的视觉语言对齐通常依赖于监督微调（SFT）或强化学习（RL）。SFT 稳定高效，但需要大规模的人工标注，无法捕捉细微的偏好，而 RL 为训练带来奖励信号，但存在开销和不稳定性的问题。这些限制凸显了可扩展性、稳健性和对齐质量之间的权衡。为了解决这个问题，我们提出了 MergeMix，这是一种连接 SFT 和 RL 的训练时增强范式。它首先通过 token 合并进行注意力感知的图像混合，并带来更多的聚类表示和空间上下文，然后使用混合图像和原始图像构建偏好对，并通过 SimPO 损失进行优化，为 MLLM 提供偏好驱动的训练范式。作为一种 mixup 增强方法，MergeMix 增强了注意力的一致性和效率，在分类方面超越了其他基于启发式的方法。大量实验表明，MergeMix 以更高的效率实现了具有竞争力的准确性，为分类和 MLLM 中的偏好对齐提供了一种可扩展的方法。
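
The abstract names the SimPO loss as the preference objective. As a rough illustration only, the sketch below computes the commonly cited SimPO form, a length-normalized log-probability margin passed through a logistic loss. The function name, the `beta` and `gamma` defaults, and the choice of which response counts as "chosen" (for example, the answer conditioned on the raw image versus the mixed image) are assumptions for illustration, not details taken from the MergeMix paper.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO-style loss for one preference pair (illustrative sketch).

    logp_* are summed token log-probabilities of each response under the
    policy; len_* are the response lengths in tokens. beta and gamma are
    hypothetical default values.
    """
    # Length-normalized implicit rewards.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the chosen response is clearly more likely
# per token, and large when the pair is ranked the wrong way round.
good = simpo_loss(-10.0, 10, -30.0, 10)
bad = simpo_loss(-30.0, 10, -10.0, 10)
```

Normalizing by length keeps the objective from favoring longer responses merely because they accumulate more log-probability mass.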

Learning to Reason Efficiently with Discounted Reinforcement Learning

通过折扣强化学习学习高效推理

Authors: Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23486
Pdf link: https://arxiv.org/pdf/2510.23486
Abstract Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
中文摘要 大型推理模型（LRM）通常会消耗过多的 token，从而增加计算成本和延迟。我们对“更长的响应可以提高准确性”这一假设提出质疑。通过使用折扣强化学习设置（可解释为较小的 token 成本）惩罚推理 token，并分析受限策略类中的 Blackwell 最优性，我们鼓励简洁而准确的推理。实验证实了我们的理论结果，即这种方法在保持准确性的同时缩短了思维链。
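
To see why discounting can be read as a small token cost, consider a minimal sketch in which a single reward of 1 arrives after the last token of a correct answer. The reward placement, the function names, and the value `gamma=0.999` are assumptions for illustration and are not taken from the paper. For a discount factor close to 1, `gamma ** T` is approximately `1 - (1 - gamma) * T`, so each extra token costs roughly `1 - gamma`.

```python
def discounted_return(num_tokens, correct, gamma=0.999):
    """Return seen from the first token when a terminal reward of 1.0
    (correct) or 0.0 (incorrect) is discounted once per generated token."""
    reward = 1.0 if correct else 0.0
    return (gamma ** num_tokens) * reward

def linear_token_cost_return(num_tokens, correct, gamma=0.999):
    """First-order approximation: reward minus a per-token cost of (1 - gamma)."""
    reward = 1.0 if correct else 0.0
    return reward * (1.0 - (1.0 - gamma) * num_tokens)

# A correct 100-token answer is worth more than a correct 1000-token answer.
short = discounted_return(100, True)
long = discounted_return(1000, True)
```

Under this reading, two equally correct responses are ranked by length, which is the pressure toward concise reasoning that the abstract describes.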

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

VOLD：通过在线策略（on-policy）蒸馏将推理能力从 LLM 迁移到视觉语言模型

Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.23497
Pdf link: https://arxiv.org/pdf/2510.23497
Abstract Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
中文摘要 训练视觉语言模型（VLM）进行复杂推理仍然是一项具有挑战性的任务，原因之一是高质量的图像-文本推理数据稀缺。相反，基于文本的推理资源丰富且可扩展，但如何利用它们进行 VLM 推理仍然是一个悬而未决的问题。为了解决这个问题，我们提出了 VOLD，这是一个将推理能力从纯文本教师模型迁移到 VLM 学生模型的框架。为此，VOLD 将基于群体相对策略优化（GRPO）的强化学习与在线策略（on-policy）蒸馏相结合，这使得学生的推理轨迹能够由教师模型指导，从而比单独使用 GRPO 取得显著提升。我们进一步表明，在这种情况下，冷启动对齐对于在线训练阶段的有效迁移至关重要，并且如果教师和学生之间没有足够的分布对齐，在线策略蒸馏无法提供有意义的指导。我们在 MMMU-Pro、MathVision、MathVista 和 LogicVista 等多种基准测试中评估了 VOLD，表明 VOLD 的性能明显优于基线模型，并且比最先进的方法有一定幅度的提升。我们的消融实验表明，在使用纯文本教师进行在线策略蒸馏时，通过 SFT 进行冷启动对齐十分重要。
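
For readers unfamiliar with GRPO, the sketch below shows only its group-relative advantage step: several responses are sampled for the same prompt and each reward is standardized against the group. It does not attempt to reproduce VOLD's on-policy distillation term. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of responses sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are scored relative to their siblings rather than by a
    learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one question, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no policy-gradient signal.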

Sequential Multi-Agent Dynamic Algorithm Configuration

顺序多智能体动态算法配置

Authors: Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2510.23535
Pdf link: https://arxiv.org/pdf/2510.23535
Abstract Dynamic algorithm configuration (DAC) is a recent trend in automated machine learning, which can dynamically adjust the algorithm's configuration during the execution process and relieve users from tedious trial-and-error tuning tasks. Recently, multi-agent reinforcement learning (MARL) approaches have improved the configuration of multiple heterogeneous hyperparameters, making various parameter configurations for complex algorithms possible. However, many complex algorithms have inherent inter-dependencies among multiple parameters (e.g., determining the operator type first and then the operator's parameter), which previous approaches do not consider, leading to sub-optimal results. In this paper, we propose the sequential multi-agent DAC (Seq-MADAC) framework to address this issue by considering the inherent inter-dependencies of multiple parameters. Specifically, we propose a sequential advantage decomposition network, which can leverage action-order information through sequential advantage decomposition. Experiments from synthetic functions to the configuration of multi-objective optimization algorithms demonstrate Seq-MADAC's superior performance over state-of-the-art MARL methods and show strong generalization across problem classes. Seq-MADAC establishes a new paradigm for the widespread dependency-aware automated algorithm configuration. Our code is available at this https URL.
中文摘要 动态算法配置（DAC）是自动化机器学习的最新趋势，它可以在执行过程中动态调整算法的配置，将用户从繁琐的试错调优任务中解脱出来。最近，多智能体强化学习（MARL）方法改进了多个异构超参数的配置，使复杂算法的各种参数配置成为可能。然而，许多复杂的算法在多个参数之间具有固有的相互依赖关系（例如，首先确定算子类型，然后确定算子的参数），然而，在以前的方法中没有考虑到这些因素，因此导致次优结果。在本文中，我们提出了顺序多智能体DAC（Seq-MADAC）框架，通过考虑多个参数固有的相互依赖性来解决这个问题。具体而言，我们提出了一种顺序优势分解网络，该网络可以通过顺序优势分解来利用动作顺序信息。从合成函数到多目标优化算法配置的实验证明了 Seq-MADAC 优于最先进的 MARL 方法的性能，并显示出跨问题类的很强的泛化性。Seq-MADAC 为广泛的依赖感知自动化算法配置建立了新的范式。我们的代码可在此 https URL 中找到。

Multi-Agent Evolve: LLM Self-Improve through Co-evolution

多智能体进化：LLM 通过共同进化实现自我改进

Authors: Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, Jiaxuan You
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.23595
Pdf link: https://arxiv.org/pdf/2510.23595
Abstract Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
中文摘要 强化学习（RL）在增强大型语言模型（LLM）的推理能力方面显示出巨大的潜力。然而，RL 在 LLM 中的成功在很大程度上依赖于人工整理的数据集和可验证的奖励，这限制了其可扩展性和通用性。最近的自我博弈（Self-Play）RL 方法受到该范式在游戏和围棋中取得成功的启发，旨在无需人工标注数据即可增强 LLM 的推理能力。然而，这些方法主要依赖于有依据的环境来提供反馈（例如，Python 解释器或游戏引擎）；将它们扩展到一般领域仍然具有挑战性。为了应对这些挑战，我们提出了 Multi-Agent Evolve（MAE），这是一个使 LLM 能够在解决各种任务（包括数学、推理和常识问答）的过程中自我进化的框架。MAE 的核心设计基于从单个 LLM 实例化的三个交互智能体（Proposer、Solver、Judge），并应用强化学习来优化它们的行为。Proposer 生成问题，Solver 尝试求解，Judge 在共同进化的同时评估两者。在 Qwen2.5-3B-Instruct 上的实验表明，MAE 在多个基准测试中平均提高了 4.54%。这些结果凸显了 MAE 是一种可扩展、数据高效的方法，可以增强 LLM 的一般推理能力，同时最大限度地减少对人工整理监督的依赖。

Think Twice: Branch-and-Rethink Reasoning Reward Model

三思而后行：分支和重新思考推理奖励模型

Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.23596
Pdf link: https://arxiv.org/pdf/2510.23596
Abstract Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
中文摘要 大型语言模型（LLM）越来越依赖思维模型，这些模型将中间步骤外部化并分配额外的测试时计算，而“三思而后行”（think-twice）策略表明，有意识的第二遍推理可以引发更强的推理。相比之下，大多数奖励模型（RM）仍然一次性将许多质量维度压缩为单个标量，这种设计会导致判断扩散：注意力分散到各个评估标准上，造成焦点稀释和分析浅显。我们引入了分支与重新思考（BR-RM），这是一种两轮 RM，它将三思而后行的原则迁移到奖励建模中。第 1 轮执行自适应分支，选择一小组对当前实例至关重要的维度（例如事实性和安全性）并勾勒出简洁的、寻求证据的假设。第 2 轮执行以分支为条件的重新思考，这是一种有针对性的重读，检验这些假设并只仔细检查最重要的内容。我们使用简单的二元结果奖励和严格的格式检查，在结构化的两轮轨迹上使用 GRPO 风格的强化学习进行训练，使该方法与标准 RLHF 流程兼容。通过将一次性打分转化为聚焦的二次审视推理，BR-RM 减少了判断扩散，提高了对细微但后果严重的错误的敏感性，同时保持实用性和可扩展性。实验结果表明，我们的模型在跨不同领域的三个具有挑战性的奖励建模基准上取得了最先进的性能。代码和模型将很快发布。

Keyword: diffusion policy

Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising

通过遗传去噪实现机器人操作的两步扩散策略

Authors: Mateo Clemente, Leo Brunswic, Rui Heng Yang, Xuan Zhao, Yasser Khalil, Haoyu Lei, Amir Rasouli, Yinchuan Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21991
Pdf link: https://arxiv.org/pdf/2510.21991
Abstract Diffusion models, such as diffusion policy, have achieved state-of-the-art results in robotic manipulation by imitating expert demonstrations. While diffusion models were originally developed for vision tasks like image and video generation, many of their inference strategies have been directly transferred to control domains without adaptation. In this work, we show that by tailoring the denoising process to the specific characteristics of embodied AI tasks -- particularly the structured, low-dimensional nature of action distributions -- diffusion policies can operate effectively with as few as 5 neural function evaluations (NFE). Building on this insight, we propose a population-based sampling strategy, genetic denoising, which enhances both performance and stability by selecting denoising trajectories with low out-of-distribution risk. Our method solves challenging tasks with only 2 NFE while improving or matching performance. We evaluate our approach across 14 robotic manipulation tasks from D4RL and Robomimic, spanning multiple action horizons and inference budgets. In over 2 million evaluations, our method consistently outperforms standard diffusion-based policies, achieving up to 20% performance gains with significantly fewer inference steps.
中文摘要 扩散模型，例如扩散策略，通过模仿专家演示，在机器人操作方面取得了最先进的结果。虽然扩散模型最初是为图像和视频生成等视觉任务而开发的，但它们的许多推理策略已未经适配地直接迁移到控制领域。在这项工作中，我们表明，通过根据具身人工智能任务的特定特征（特别是动作分布的结构化、低维性质）定制去噪过程，扩散策略只需 5 次神经网络函数评估（NFE）即可有效运行。基于这一见解，我们提出了一种基于种群的采样策略，即遗传去噪，它通过选择分布外风险较低的去噪轨迹来提高性能和稳定性。我们的方法只需 2 次 NFE 即可解决具有挑战性的任务，同时提高或保持性能。我们在来自 D4RL 和 Robomimic 的 14 个机器人操作任务上评估了我们的方法，涵盖多种动作时域和推理预算。在超过 200 万次评估中，我们的方法始终优于标准的基于扩散的策略，以显著更少的推理步骤实现高达 20% 的性能提升。

ManiDP: Manipulability-Aware Diffusion Policy for Posture-Dependent Bimanual Manipulation

ManiDP：面向姿势依赖型双手操作的可操作度感知扩散策略

Authors: Zhuo Li, Junjia Liu, Dianxi Li, Tao Teng, Miao Li, Sylvain Calinon, Darwin Caldwell, Fei Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.23016
Pdf link: https://arxiv.org/pdf/2510.23016
Abstract Recent work has demonstrated the potential of diffusion models in robot bimanual skill learning. However, existing methods ignore the learning of posture-dependent task features, which are crucial for adapting dual-arm configurations to meet specific force and velocity requirements in dexterous bimanual manipulation. To address this limitation, we propose Manipulability-Aware Diffusion Policy (ManiDP), a novel imitation learning method that not only generates plausible bimanual trajectories, but also optimizes dual-arm configurations to better satisfy posture-dependent task requirements. ManiDP achieves this by extracting bimanual manipulability from expert demonstrations and encoding the encapsulated posture features using Riemannian-based probabilistic models. These encoded posture features are then incorporated into a conditional diffusion process to guide the generation of task-compatible bimanual motion sequences. We evaluate ManiDP on six real-world bimanual tasks, where the experimental results demonstrate a 39.33% increase in average manipulation success rate and a 0.45 improvement in task compatibility compared to baseline methods. This work highlights the importance of integrating posture-relevant robotic priors into bimanual skill diffusion to enable human-like adaptability and dexterity.
中文摘要 最近的工作证明了扩散模型在机器人双手技能学习中的潜力。然而，现有方法忽略了对姿势相关任务特征的学习，而这些特征对于调整双臂构型以满足灵巧双手操作中的特定力和速度要求至关重要。为了解决这一限制，我们提出了可操作度感知扩散策略（ManiDP），这是一种新颖的模仿学习方法，它不仅可以生成合理的双手轨迹，还可以优化双臂构型以更好地满足姿势相关的任务需求。ManiDP 通过从专家演示中提取双手可操作度，并使用基于黎曼几何的概率模型对其中蕴含的姿势特征进行编码来实现这一目标。然后，这些编码的姿势特征被纳入条件扩散过程，以指导生成与任务兼容的双手运动序列。我们在六个真实世界的双手任务上评估了 ManiDP，实验结果表明，与基线方法相比，平均操作成功率提高了 39.33%，任务兼容性提高了 0.45。这项工作强调了将与姿势相关的机器人先验整合到双手技能扩散中的重要性，以实现类似人类的适应性和灵巧性。

Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation

基于扩散策略和多时间尺度世界模型进行深度主动推理，实现现实世界探索和导航

Authors: Riko Yokozawa, Kentaro Fujii, Yuta Nomura, Shingo Murata
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.23258
Pdf link: https://arxiv.org/pdf/2510.23258
Abstract Autonomous robotic navigation in real-world environments requires exploration to acquire environmental information as well as goal-directed navigation in order to reach specified targets. Active inference (AIF) based on the free-energy principle provides a unified framework for these behaviors by minimizing the expected free energy (EFE), thereby combining epistemic and extrinsic values. To realize this practically, we propose a deep AIF framework that integrates a diffusion policy as the policy model and a multiple timescale recurrent state-space model (MTRSSM) as the world model. The diffusion policy generates diverse candidate actions while the MTRSSM predicts their long-horizon consequences through latent imagination, enabling action selection that minimizes EFE. Real-world navigation experiments demonstrated that our framework achieved higher success rates and fewer collisions compared with the baselines, particularly in exploration-demanding scenarios. These results highlight how AIF based on EFE minimization can unify exploration and goal-directed navigation in real-world robotic settings.
中文摘要 现实环境中的自主机器人导航既需要通过探索获取环境信息，也需要目标导向的导航以到达指定目标。基于自由能原理的主动推理（AIF）通过最小化期望自由能（EFE）为这些行为提供了一个统一的框架，从而将认知价值和外在价值结合起来。为了实际实现这一点，我们提出了一个深度 AIF 框架，该框架将扩散策略作为策略模型，将多时间尺度循环状态空间模型（MTRSSM）作为世界模型集成在一起。扩散策略生成多样化的候选动作，而 MTRSSM 通过潜在想象预测其长时域后果，从而实现最小化 EFE 的动作选择。真实世界的导航实验表明，与基线相比，我们的框架实现了更高的成功率和更少的碰撞，特别是在探索需求较高的场景中。这些结果强调了基于 EFE 最小化的 AIF 如何在现实世界的机器人环境中统一探索和目标导向导航。
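
The selection step described above can be illustrated with a toy sketch. It assumes the common decomposition in which expected free energy is the negative sum of an epistemic (information-gain) value and an extrinsic (goal-preference) value, and it takes both values as given numbers. In the paper these would come from the diffusion policy's candidates and the world model's imagined rollouts, which are not modeled here. All names and numbers below are hypothetical.

```python
def expected_free_energy(epistemic_value, extrinsic_value):
    """EFE under the assumed decomposition: lower is better."""
    return -(epistemic_value + extrinsic_value)

def select_action(candidates):
    """Pick the candidate whose predicted consequences minimize EFE.

    Each candidate is a dict with hypothetical keys 'name',
    'epistemic', and 'extrinsic'.
    """
    return min(
        candidates,
        key=lambda c: expected_free_energy(c["epistemic"], c["extrinsic"]),
    )

candidates = [
    {"name": "stay", "epistemic": 0.1, "extrinsic": 0.2},
    {"name": "explore", "epistemic": 0.9, "extrinsic": 0.1},
    {"name": "goal", "epistemic": 0.0, "extrinsic": 0.8},
]
best = select_action(candidates)
```

Because both terms enter the same score, an action with high information gain can win early on, while goal-directed actions win once little remains to be learned.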