Arxiv Papers of Today

Generated: 2025-10-17 16:29:29 (UTC+8); arXiv publication time: 2025-10-17 20:00 EDT (2025-10-18 08:00 UTC+8)

51 related papers today

Keyword: reinforcement learning

Joint Active RIS Configuration and User Power Control for Localization: A Neuroevolution-Based Approach

用于定位的联合主动RIS配置和用户功率控制：基于神经进化的方法

Authors: George Stamatelis, Hui Chen, Henk Wymeersch, George C. Alexandropoulos
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.13819
Pdf link: https://arxiv.org/pdf/2510.13819
Abstract This paper studies user localization aided by a Reconfigurable Intelligent Surface (RIS). A feedback link from the Base Station (BS) to the user is adopted to enable dynamic power control of the user pilot transmissions in the uplink. A novel multi-agent algorithm for the joint control of the RIS phase configuration and the user transmit power is presented, which is based on a hybrid approach integrating NeuroEvolution (NE) and supervised learning. The proposed scheme requires only single-bit feedback messages for the uplink power control, supports RIS elements with discrete responses, and is numerically shown to outperform fingerprinting, deep reinforcement learning baselines and backpropagation-based position estimators.
中文摘要 本文研究了可重构智能表面（RIS）辅助的用户定位。采用从基站（BS）到用户的反馈链路，实现上行链路中用户导频传输的动态功率控制。提出了一种基于集成神经进化（NE）和监督学习的混合方法的用于联合控制RIS相位配置和用户发射功率的新型多智能体算法。所提出的方案只需要单位反馈消息来控制上行链路功率，支持具有离散响应的RIS单元，并且在数值上表明其性能优于指纹识别、深度强化学习基线和基于反向传播的位置估计器。

Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

弥合语义差距：多语言文本到 SQL 的对比奖励

Authors: Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja, Albert Lie
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.13827
Pdf link: https://arxiv.org/pdf/2510.13827
Abstract Current Text-to-SQL methods are evaluated on, and focus only on, executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments show that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
中文摘要 当前的文本到 SQL 方法被评估，并且只关注可执行查询，而忽略了语义对齐挑战——无论是在查询的语义含义还是执行结果的正确性方面。当从英语转移到其他语言时，甚至执行准确性本身也显示出显着下降，非英语语言的平均下降了 6 个百分点。我们通过提出一个新框架来应对这些挑战，该框架将组相对策略优化（GRPO）结合在多语言对比奖励信号中，以提高跨语言场景中文本到 SQL 系统的任务效率和语义准确性。我们的方法通过结合基于语义相似性的奖励信号，教模型在 SQL 生成和用户意图之间获得更好的对应关系。在七种语言的MultiSpider数据集上，使用GRPO微调LLaMA-3-3B模型，执行准确率提高到87.4%（比零样本+26 pp），语义准确率提高到52.29%（+32.86 pp）。在 GRPO 框架中添加我们的对比奖励信号进一步将平均语义准确率提高到 59.14%（+6.85 pp，越南语高达 +10 pp）。我们的实验表明，使用我们的对比奖励信号微调的更小、参数效率更高的 3B LLaMA 模型优于更大的零样本 8B LLaMA 模型，执行准确率提高了 7.43 个百分点（从 8B 模型的 81.43% 提高到 3B 模型的 88.86%），并且几乎与其语义准确度相当（59.14% 对 68.57%）——所有这些都仅使用 3,000 个强化学习训练示例。这些结果展示了我们如何通过对定向语义对齐的对比奖励来提高文本到 SQL 系统的性能，而无需大规模训练数据集。
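
As a rough illustration of the reward design described in the abstract above, the following sketch combines an execution reward with a semantic-similarity reward and turns the results into group-relative advantages, which is how GRPO normalizes rewards across several completions sampled for the same prompt. The abstract does not give the paper's actual reward formula, so the additive form, the weight `sim_weight`, the function names, and the example scores are all assumptions made for illustration.

```python
import statistics


def combined_reward(executes_correctly: bool, semantic_similarity: float,
                    sim_weight: float = 0.5) -> float:
    """Hypothetical reward: execution correctness plus a weighted similarity term."""
    return float(executes_correctly) + sim_weight * semantic_similarity


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled SQL queries for one question:
# (executes correctly?, similarity to the user intent in [0, 1]) -- made-up values.
samples = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [combined_reward(ok, sim) for ok, sim in samples]
advantages = group_relative_advantages(rewards)
```

Completions that both execute correctly and match the intent closely receive the largest positive advantage, while the advantages within a group sum to approximately zero.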

K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding

K-frames：场景驱动的任意k关键帧选择，用于理解长视频

Authors: Yifeng Yao, Yike Yun, Jing Wang, Huishuai Zhang, Dongyan Zhao, Ke Tian, Zhihao Wang, Minghui Qiu, Tao Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.13891
Pdf link: https://arxiv.org/pdf/2510.13891
Abstract Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in image understanding, but long-video understanding is constrained by context windows and computational cost. Uniform frame sampling often leads to substantial information loss. Meanwhile, existing keyframe selection methods such as text-frame retrieval or RL-based frame optimization typically yield sparse and temporally disjointed frames, overlooking scene continuity and lacking flexibility for multi-scale frame selection. To address these limitations, we introduce K-frames, a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips, which enables any-k keyframe selection to meet diverse user budgets. To realize this approach, we first introduce PeakClips, a dataset of 200K query-conditioned video highlights. Building on this dataset, K-frames learns clip2frame selection using a three-stage progressive curriculum. It involves two Supervised Fine-Tuning stages for temporal grounding and key-clip perception, followed by a Reinforcement Learning stage that directly optimizes the scene-driven prediction policy for downstream tasks without further annotations. Extensive experiments on major long-video understanding benchmarks demonstrate that K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales. Our dataset and model will be available.
中文摘要 多模态大型语言模型（MLLM）在图像理解方面表现出了显著的能力，但长视频理解受到上下文窗口和计算成本的限制。均匀的帧采样通常会导致大量信息丢失。同时，现有的关键帧选择方法，如文本帧检索或基于 RL 的帧优化，通常会产生稀疏和时间脱节的帧，忽略了场景的连续性，并且缺乏多尺度帧选择的灵活性。为了解决这些限制，我们引入了 K-frames，这是一种用于场景驱动关键帧选择的新范式，可保持时间连续性。K-frames 不是选择单个帧，而是预测语义上连贯的、与查询相关的片段，这使得任意 k 个关键帧选择能够满足不同的用户预算。为了实现这种方法，我们首先引入了 PeakClips，这是一个以查询为条件的 200K 视频亮点数据集。基于该数据集，K-frames 使用三阶段渐进式课程学习 clip2frame 选择。它涉及两个用于时间定位和关键片段感知的监督微调阶段，然后是强化学习阶段，直接优化下游任务的场景驱动预测策略，无需进一步注释。对主要长视频理解基准的广泛实验表明，K-frames为各种尺度的关键帧选择提供了一种有效、可解释和即插即用的解决方案。我们的数据集和模型将可用。
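
To make the "any-k" idea above more concrete, here is a minimal sketch of how a frame budget could be spread over a set of predicted clips: gather every frame covered by the clips, then pick k of them at evenly spaced positions. The abstract does not describe the actual clip2frame procedure, so the function `any_k_keyframes`, the `(start, end)` clip format, and the even-spacing rule are assumptions made for illustration.

```python
def any_k_keyframes(clips: list[tuple[int, int]], k: int) -> list[int]:
    """Pick k frame indices spread evenly over the frames covered by the clips.

    `clips` holds inclusive (start_frame, end_frame) pairs; this format is a
    hypothetical stand-in for whatever the model actually predicts.
    """
    frames = sorted({f for start, end in clips for f in range(start, end + 1)})
    if k <= 0 or not frames:
        return []
    if k >= len(frames):
        return frames
    if k == 1:
        return [frames[len(frames) // 2]]
    step = (len(frames) - 1) / (k - 1)  # step > 1, so picked indices are distinct
    return [frames[round(i * step)] for i in range(k)]


# Two predicted clips; the same clips serve different frame budgets.
predicted_clips = [(10, 14), (100, 104)]
four_frames = any_k_keyframes(predicted_clips, 4)
two_frames = any_k_keyframes(predicted_clips, 2)
```

Because selection happens after the clips are predicted, the same clip set can serve any budget k without re-running the predictor.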

A Diffusion-Refined Planner with Reinforcement Learning Priors for Confined-Space Parking

一种具有强化学习先验的密闭空间停车的扩散精细规划器

Authors: Mingyang Jiang, Yueyuan Li, Jiaru Zhang, Songan Zhang, Ming Yang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14000
Pdf link: https://arxiv.org/pdf/2510.14000
Abstract The growing demand for parking has increased the need for automated parking planning methods that can operate reliably in confined spaces. In restricted and complex environments, high-precision maneuvers are required to achieve a high success rate in planning, yet existing approaches often rely on explicit action modeling, which faces challenges when accurately modeling the optimal action distribution. In this paper, we propose DRIP, a diffusion-refined planner anchored in reinforcement learning (RL) prior action distribution, in which an RL-pretrained policy provides prior action distributions to regularize the diffusion training process. During the inference phase the denoising process refines these coarse priors into more precise action distributions. By steering the denoising trajectory through the reinforcement learning prior distribution during training, the diffusion model inherits a well-informed initialization, resulting in more accurate action modeling, a higher planning success rate, and reduced inference steps. We evaluate our approach across parking scenarios with varying degrees of spatial constraints. Experimental results demonstrate that our method significantly improves planning performance in confined-space parking environments while maintaining strong generalization in common scenarios.
中文摘要 停车需求不断增长，增加了对能够在密闭空间内可靠运行的自动停车规划方法的需求。在受限和复杂的环境中，需要高精度的机动才能实现较高的规划成功率，但现有方法往往依赖于显式动作建模，这在准确建模最佳动作分布时面临挑战。在本文中，我们提出了DRIP，这是一种锚定在强化学习（RL）先验动作分布中的扩散细化规划器，其中RL预训练策略提供先验动作分布，以规范扩散训练过程。在推理阶段，去噪过程将这些粗略先验细化为更精确的动作分布。通过在训练过程中通过强化学习先验分布引导去噪轨迹，扩散模型继承了明智的初始化，从而实现更准确的动作建模、更高的规划成功率并减少推理步骤。我们评估了具有不同程度空间限制的停车场景的方法。实验结果表明，该方法在有限空间停车环境中显著提高了规划性能，同时在常见场景下保持了较强的泛化性。

Optimistic Reinforcement Learning-Based Skill Insertions for Task and Motion Planning

基于乐观强化学习的技能插入，用于任务和运动规划

Authors: Gaoyuan Liu, Joris de Winter, Yuri Durodie, Denis Steckelmacher, Ann Nowe, Bram Vanderborght
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14065
Pdf link: https://arxiv.org/pdf/2510.14065
Abstract Task and motion planning (TAMP) for robotic manipulation necessitates long-horizon reasoning involving versatile actions and skills. While deterministic actions can be crafted by sampling or optimizing with certain constraints, planning actions with uncertainty, i.e., probabilistic actions, remains a challenge for TAMP. In contrast, Reinforcement Learning (RL) excels in acquiring versatile, yet short-horizon, manipulation skills that are robust to uncertainties. In this letter, we design a method that integrates RL skills into TAMP pipelines. Besides the policy, an RL skill is defined with data-driven logical components that enable the skill to be deployed by symbolic planning. A plan refinement sub-routine is designed to further tackle the inevitable effect uncertainties. In the experiments, we compare our method with baseline hierarchical planning from both TAMP and RL fields and illustrate the strength of the method. The results show that by embedding RL skills, we extend the capability of TAMP to domains with probabilistic skills, and improve the planning efficiency compared to the previous methods.
中文摘要 用于机器人操纵的任务和运动规划（TAMP）需要涉及多种动作和技能的长期推理。虽然确定性动作可以通过在一定约束下采样或优化来制定，但规划具有不确定性的动作，即概率动作，仍然是 TAMP 面临的挑战。相比之下，强化学习（RL）擅长获得对不确定性具有鲁棒性的通用但视野较短的操作技能。在这封信中，我们设计了一种将 RL 技能集成到 TAMP 管道中的方法。除了策略之外，RL 技能还使用数据驱动的逻辑组件进行定义，这些组件使技能能够通过符号规划来部署。计划细化子例程旨在进一步解决不可避免的效果不确定性。在实验中，我们将我们的方法与 TAMP 和 RL 领域的基线分层规划进行了比较，并说明了该方法的优势。结果表明，通过嵌入RL技能，将TAMP的能力扩展到具有概率技能的域，与以往的方法相比，提高了规划效率。

STEMS: Spatial-Temporal Enhanced Safe Multi-Agent Coordination for Building Energy Management

STEMS：用于建筑能源管理的时空增强安全多智能体协调

Authors: Huiliang Zhang, Di Wu, Arnaud Zinflou, Benoit Boulet
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14112
Pdf link: https://arxiv.org/pdf/2510.14112
Abstract Building energy management is essential for achieving carbon reduction goals, improving occupant comfort, and reducing energy costs. Coordinated building energy management faces critical challenges in exploiting spatial-temporal dependencies while ensuring operational safety across multi-building systems. Current multi-building energy systems face three key challenges: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity. This paper proposes Spatial-Temporal Enhanced Safe Multi-Agent Coordination (STEMS), a novel safety-constrained multi-agent reinforcement learning framework for coordinated building energy management. STEMS integrates two core components: (1) a spatial-temporal graph representation learning framework using a GCN-Transformer fusion architecture to capture inter-building relationships and temporal patterns, and (2) a safety-constrained multi-agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees. Extensive experiments on real-world building datasets demonstrate STEMS's superior performance over existing methods, showing that STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework also demonstrates strong robustness during extreme weather conditions and maintains effectiveness across different building types.
中文摘要 建筑能源管理对于实现碳减排目标、提高居住者舒适度和降低能源成本至关重要。协调的建筑能源管理在利用时空依赖性同时确保多建筑系统的运行安全方面面临着关键挑战。当前多建筑能源系统面临时空信息开发不足、缺乏严谨的安全保障和系统复杂性三大关键挑战。本文提出了时空增强安全多智能体协调（STEMS），这是一种用于协调建筑能源管理的新型安全约束多智能体强化学习框架。STEMS集成了两个核心组件：（1）使用GCN-Transformer融合架构的时空图表示学习框架，以捕获建筑间关系和时间模式，以及（2）包含控制屏障函数的安全约束多智能体RL算法，以提供数学安全保障。在真实世界建筑数据集上的大量实验表明，STEMS 优于现有方法，表明 STEMS 实现了 21% 的成本降低、18% 的排放，并将安全违规行为从 35.1% 大幅降低到 5.6%，同时保持最佳舒适度，不适比例仅为 0.13。该框架在极端天气条件下还表现出强大的稳健性，并在不同建筑类型中保持有效性。

ViTacGen: Robotic Pushing with Vision-to-Touch Generation

ViTacGen：具有视觉到触摸生成的机器人推动

Authors: Zhiyuan Wu, Yijiong Lin, Yongqiang Zhao, Xuyang Zhang, Zhuo Chen, Nathan Lepora, Shan Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14117
Pdf link: https://arxiv.org/pdf/2510.14117
Abstract Robotic pushing is a fundamental manipulation task that requires tactile feedback to capture subtle contact forces and dynamics between the end-effector and the object. However, real tactile sensors often face hardware limitations such as high costs and fragility, and deployment challenges involving calibration and variations between different sensors, while vision-only policies struggle with satisfactory performance. Inspired by humans' ability to infer tactile states from vision, we propose ViTacGen, a novel robot manipulation framework designed for visual robotic pushing with vision-to-touch generation in reinforcement learning to eliminate the reliance on high-resolution real tactile sensors, enabling effective zero-shot deployment on visual-only robotic systems. Specifically, ViTacGen consists of an encoder-decoder vision-to-touch generation network that generates contact depth images, a standardized tactile representation, directly from visual image sequences, followed by a reinforcement learning policy that fuses visual-tactile data with contrastive learning based on visual and generated tactile observations. We validate the effectiveness of our approach in both simulation and real-world experiments, demonstrating its superior performance and achieving a success rate of up to 86%.
中文摘要 机器人推动是一项基本的操作任务，需要触觉反馈来捕捉末端执行器和物体之间的细微接触力和动力学。然而，真正的触觉传感器往往面临硬件限制，如成本高、易碎性，以及涉及不同传感器之间校准和变化的部署挑战，而纯视觉策略则难以获得令人满意的性能。受人类从视觉推断触觉状态的能力的启发，我们提出了 ViTacGen，这是一种新型机器人操纵框架，专为视觉机器人推动而设计，在强化学习中生成视觉到触摸，以消除对高分辨率真实触觉传感器的依赖，从而在纯视觉机器人系统上实现有效的零样本部署。具体来说，ViTacGen 由一个编码器-解码器视觉到触摸生成网络组成，该网络直接从视觉图像序列生成接触深度图像（一种标准化触觉表示），然后是强化学习策略，将视觉触觉数据与基于视觉和生成的触觉观察的对比学习融合在一起。我们在模拟和真实世界实验中验证了我们方法的有效性，展示了其卓越的性能并实现了高达 86% 的成功率。

Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL

揭秘目标条件RL中涌现探索背后的机制

Authors: Mahsa Bastankhah, Grace Liu, Dilip Arumugam, Thomas L. Griffiths, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14129
Pdf link: https://arxiv.org/pdf/2510.14129
Abstract In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm's objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
中文摘要 在这项工作中，我们迈出了阐明无监督强化学习中涌现探索背后的机制的第一步。我们研究单目标对比强化学习（SGCRL），这是一种自监督算法，能够在没有外部奖励或课程的情况下解决具有挑战性的长期目标实现任务。我们将算法目标函数的理论分析与对照实验相结合，以了解是什么推动了它的探索。我们表明，SGCRL 最大化了由其学习到的表征塑造的隐性奖励。这些表示会自动修改奖励格局，以促进在达到目标之前进行探索，然后进行开发。我们的实验还表明，这些探索动力学源于学习状态空间的低秩表示，而不是神经网络函数近似。我们改进的理解使我们能够调整 SGCRL 来执行安全感知勘探。

Combining Reinforcement Learning and Behavior Trees for NPCs in Video Games with AMD Schola

将视频游戏中 NPC 的强化学习和行为树与 AMD Schola 相结合

Authors: Tian Liu, Alex Cann, Ian Colbert, Mehdi Saeedi
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14154
Pdf link: https://arxiv.org/pdf/2510.14154
Abstract While the rapid advancements in the reinforcement learning (RL) research community have been remarkable, the adoption in commercial video games remains slow. In this paper, we outline common challenges the Game AI community faces when using RL-driven NPCs in practice, and highlight the intersection of RL with traditional behavior trees (BTs) as a crucial juncture to be explored further. Although the BT+RL intersection has been suggested in several research papers, its adoption is rare. We demonstrate the viability of this approach using AMD Schola -- a plugin for training RL agents in Unreal Engine -- by creating multi-task NPCs in a complex 3D environment inspired by the commercial video game "The Last of Us". We provide detailed methodologies for jointly training RL models with BTs while showcasing various skills.
中文摘要 虽然强化学习（RL）研究界的快速进步令人瞩目，但商业视频游戏的采用仍然缓慢。在本文中，我们概述了游戏 AI 社区在实践中使用 RL 驱动的 NPC 时面临的常见挑战，并强调了 RL 与传统行为树（BT）的交叉点，作为需要进一步探索的关键节点。尽管 BT+RL 交叉点已在多篇研究论文中提出，但其采用很少见。我们使用 AMD Schola（一个用于在虚幻引擎中训练 RL 代理的插件）来演示这种方法的可行性，方法是在受商业视频游戏“最后生还者”启发的复杂 3D 环境中创建多任务 NPC。我们提供了与 BT 联合训练 RL 模型的详细方法，同时展示了各种技能。

ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

ARM-FM：通过基础模型进行组合强化学习的自动奖励机

Authors: Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14176
Pdf link: https://arxiv.org/pdf/2510.14176
Abstract Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
中文摘要 强化学习（RL）算法对奖励函数规范高度敏感，这仍然是限制其广泛适用性的核心挑战。我们提出了 ARM-FM：通过基础模型的自动奖励机器，这是一个在 RL 中进行自动化组合奖励设计的框架，它利用了基础模型（FM）的高级推理能力。奖励机（RM）——一种基于自动机的奖励规范形式——被用作RL目标规范的机制，并通过使用FM自动构建。RM 的结构化形式可以产生有效的任务分解，而 FM 的使用可以实现自然语言中的客观规范。具体来说，我们（i）使用 FM 从自然语言规范自动生成 RM;（ii）将语言嵌入与每个 RM 自动机状态相关联，以实现跨任务的泛化;（iii）提供 ARM-FM 在各种具有挑战性的环境中有效性的经验证据，包括零样本泛化的证据。
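
For readers unfamiliar with reward machines, the sketch below shows the general idea as a small finite-state machine whose transitions fire on high-level events and emit rewards. It is a hand-written toy and not output from ARM-FM: the class `RewardMachine`, the state names, the event labels, and the reward values are all assumptions made for illustration.

```python
class RewardMachine:
    """Toy reward machine: (state, event) -> (next_state, reward)."""

    def __init__(self, initial, transitions, terminal):
        self.initial = initial
        self.transitions = transitions  # dict keyed by (state, event)
        self.terminal = set(terminal)
        self.state = initial

    def reset(self):
        self.state = self.initial

    def step(self, event):
        """Advance on an event; unknown events leave the state unchanged with zero reward."""
        next_state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

    def done(self):
        return self.state in self.terminal


# Hypothetical task "pick up the key, then open the door", split into two sub-goals.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "key"): ("u1", 0.1),   # sub-goal 1 reached
        ("u1", "door"): ("u2", 1.0),  # task completed
    },
    terminal=["u2"],
)

# Reaching the door before holding the key earns nothing.
rewards = [rm.step(event) for event in ["door", "key", "door"]]
```

The machine state records which sub-goal the agent is working on, which is what makes the reward depend on history rather than only on the current observation.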

Incentive-Based Federated Learning

基于激励的联邦学习

Authors: Chanuka A.S. Hewa Kaluannakkage, Rajkumar Buyya
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.14208
Pdf link: https://arxiv.org/pdf/2510.14208
Abstract Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.
中文摘要 联邦学习有望通过在不损害数据隐私的情况下实现协作模型训练来彻底改变机器学习。然而，实际适应性可能会受到关键因素的限制，例如参与困境。参与实体通常不愿意为学习系统做出贡献，除非他们获得一些好处，或者他们可能会假装参与并搭便车。本章确定了为联邦学习系统设计激励机制的基本挑战。它研究了如何将经济学和博弈论的基本概念应用于联邦学习，以及区块链和深度强化学习等技术驱动的解决方案。这项工作提出了一个全面的分类法，基于上述理论概念，全面涵盖了集中式和分散式架构。此外，所描述的概念是从应用角度提出的，涵盖新兴工业应用，包括医疗保健、智能基础设施、车载网络和基于区块链的去中心化系统。通过这一探索，本章表明，精心设计的激励机制不仅仅是可选功能，而且是联邦学习实际成功的重要组成部分。该分析揭示了已经出现的有前途的解决方案，以及在构建真正可持续、公平和强大的联邦学习生态系统方面仍然存在的重大挑战。

Spatial Computing Communications for Multi-User Virtual Reality in Distributed Mobile Edge Computing Network

分布式移动边缘计算网络中面向多用户虚拟现实的空间计算通信

Authors: Caolu Xu, Zhiyong Chen, Meixia Tao, Li Song, Wenjun Zhang
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14243
Pdf link: https://arxiv.org/pdf/2510.14243
Abstract Immersive virtual reality (VR) applications impose stringent requirements on latency, energy efficiency, and computational resources, particularly in multi-user interactive scenarios. To address these challenges, we introduce the concept of spatial computing communications (SCC), a framework designed to meet the latency and energy demands of multi-user VR over distributed mobile edge computing (MEC) networks. SCC jointly represents the physical space, defined by users and base stations, and the virtual space, representing shared immersive environments, using a probabilistic model of user dynamics and resource requirements. The resource deployment task is then formulated as a multi-objective combinatorial optimization (MOCO) problem that simultaneously minimizes system latency and energy consumption across distributed MEC resources. To solve this problem, we propose MO-CMPO, a multi-objective consistency model with policy optimization that integrates supervised learning and reinforcement learning (RL) fine-tuning guided by preference weights. Leveraging a sparse graph neural network (GNN), MO-CMPO efficiently generates Pareto-optimal solutions. Simulations with real-world New Radio base station datasets demonstrate that MO-CMPO achieves superior hypervolume performance and significantly lower inference latency than baseline methods. Furthermore, the analysis reveals practical deployment patterns: latency-oriented solutions favor local MEC execution to reduce transmission delay, while energy-oriented solutions minimize redundant placements to save energy.
中文摘要 沉浸式虚拟现实（VR）应用对延迟、能源效率和计算资源提出了严格的要求，特别是在多用户交互场景中。为了应对这些挑战，我们引入了空间计算通信（SCC）的概念，该框架旨在满足分布式移动边缘计算（MEC）网络上多用户 VR 的延迟和能源需求。SCC 使用用户动态和资源需求的概率模型，共同表示由用户和基站定义的物理空间和代表共享沉浸式环境的虚拟空间。然后，将资源部署任务表述为多目标组合优化（MOCO）问题，同时最大限度地减少分布式MEC资源的系统延迟和能耗。为了解决这个问题，我们提出了MO-CMPO，这是一种具有策略优化的多目标一致性模型，它集成了监督学习和强化学习（RL）微调，以偏好权重为指导。利用稀疏图神经网络（GNN），MO-CMPO 可以有效地生成帕累托最优解。使用真实世界新无线电基站数据集进行的仿真表明，与基线方法相比，MO-CMPO 实现了卓越的超容量性能和显着更低的推理延迟。此外，分析还揭示了实际的部署模式：面向延迟的解决方案倾向于本地 MEC 执行以减少传输延迟，而面向能源的解决方案则最大限度地减少冗余放置以节省能源。
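
The notion of Pareto-optimal solutions mentioned above can be illustrated with a small filter over candidate deployments scored by latency and energy, both to be minimized. This is a generic sketch of Pareto dominance and not the MO-CMPO method; the function name `pareto_front` and the candidate values are made up for illustration.

```python
def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Return the points not dominated by any other point (all objectives minimized)."""

    def dominates(a, b):
        # a dominates b if it is no worse in every objective and strictly better in one.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidate deployments as (latency in ms, energy in J).
candidates = [(10.0, 5.0), (12.0, 3.0), (11.0, 6.0), (15.0, 2.5), (12.0, 3.5)]
front = pareto_front(candidates)
```

Each surviving point is a different latency/energy trade-off. A preference weight, as described in the abstract, would then select one point from this set.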

Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation

具有线性函数近似的策略正则化分布鲁棒马尔可夫决策过程

Authors: Jingwen Gu, Yiting He, Zhishuai Liu, Pan Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.14246
Pdf link: https://arxiv.org/pdf/2510.14246
Abstract Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose the Distributionally Robust Regularized Policy Optimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the d-rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.
中文摘要 分布转移下的决策是强化学习（RL）中的核心挑战，其中训练和部署环境不同。我们通过鲁棒马尔可夫决策过程（RMDP）的视角研究这个问题，该过程针对对抗性转换动态优化性能。我们的重点是在线环境，代理与环境的交互有限，因此样本效率和探索尤为重要。尽管策略优化在标准 RL 中取得了成功，但在鲁棒 RL 中在理论和实证上仍然没有得到充分探索。为了弥合这一差距，我们提出了 Distributionally Robust Regularized Policy Optimization 算法（DR-RPO），这是一种无模型的在线策略优化方法，用于学习具有亚线性遗憾的鲁棒策略。为了在 softmax 策略类中实现易于处理的优化，DR-RPO 结合了参考策略正则化，从而产生在转换和策略中受到双重约束的 RMDP 变体。为了扩展到大的状态-动作空间，我们采用了 d-矩形线性 MDP 公式，并将线性函数近似与上置信度奖励相结合，以进行乐观探索。我们提供的理论保证表明，策略优化可以在鲁棒RL中实现多项式次优边界和样本效率，与基于价值的方法的性能相匹配。最后，跨不同领域的经验结果证实了我们的理论，并证明了DR-RPO的鲁棒性。
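
For context on the "upper confidence bonus" combined with linear function approximation, the block below gives the form such a bonus commonly takes in (non-robust) linear MDP analyses such as LSVI-UCB. It is shown only as background: the abstract does not state DR-RPO's bonus, which may differ, for example by being defined per feature coordinate under d-rectangularity. The symbols $\phi$ (feature map), $\lambda$ (ridge parameter), $\beta$ (confidence width), $h$ (step within an episode), and $k$ (episode index) are assumed notation.

```latex
\Lambda_h = \lambda I + \sum_{\tau=1}^{k-1} \phi(s_h^\tau, a_h^\tau)\,\phi(s_h^\tau, a_h^\tau)^\top,
\qquad
b_h(s,a) = \beta \sqrt{\phi(s,a)^\top \Lambda_h^{-1}\, \phi(s,a)}
```

The bonus is large in feature directions that have rarely been visited and shrinks as data accumulates, which is what drives optimistic exploration.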

Towards Agentic Self-Learning LLMs in Search Environment

在搜索环境中走向代理自学习 LLM

Authors: Wangtao Sun, Xiang Cheng, Jialin Fan, Yao Xu, Xing Yu, Shizhu He, Jun Zhao, Kang Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14253
Pdf link: https://arxiv.org/pdf/2510.14253
Abstract We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data, even when synthetically generated, substantially enhances agentic capabilities. Building on these insights, we propose Agentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at this https URL
中文摘要 我们研究自学习是否可以在不依赖人类策划的数据集或预定义的基于规则的奖励的情况下扩展基于 LLM 的代理。通过搜索代理设置中的对照实验，我们确定了可扩展代理训练的两个关键决定因素：奖励信号的来源和代理任务数据的规模。我们发现，对于开放领域学习，生成奖励模型（GRM）的奖励优于基于严格规则的信号，并且将 GRM 与策略共同演进可以进一步提高性能。增加代理任务数据量（即使是合成生成的）可以大大增强代理能力。基于这些见解，我们提出了 Agentic Self-Learning（ASL），这是一个完全闭环、多角色的强化学习框架，可在共享工具环境和 LLM 主干网中统一任务生成、策略执行和评估。ASL 协调提示生成器、策略模型和生成奖励模型，形成更难的任务设置、更敏锐的验证和更强的解决的良性循环。根据经验，ASL 提供稳定的、逐轮的增益，超过趋于稳定或退化的强 RLVR 基线（例如 Search-R1），并在零标记数据条件下继续改进，表明具有卓越的样本效率和鲁棒性。我们进一步表明，GRM 验证能力是主要瓶颈：如果冻结，它会诱发奖励黑客攻击并停滞进度;对不断发展的数据分布进行持续的 GRM 训练可以缓解这种情况，而真实验证数据的少量后期注入会提高性能上限。这项工作将奖励源和数据规模确立为开放领域代理学习的关键杠杆，并证明了多角色协同进化对可扩展、自我改进的代理的有效性。本文的数据和代码发布在此 https URL

Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

通过奖励引导优化生成身份图像到视频

Authors: Liao Shen, Wentao Jiang, Yiran Zhu, Tiezheng Ge, Zhiguo Cao, Bo Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.14255
Pdf link: https://arxiv.org/pdf/2510.14255
Abstract Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at this https URL.
中文摘要 图像到视频（I2V）生成的最新进展在从静态图像合成高质量、时间相干的视频方面取得了显着进展。在I2V的所有应用中，以人为本的视频生成占了很大一部分。然而，现有的I2V模型在保持输入的人体图像和生成的视频之间的身份一致性方面遇到了困难，特别是当视频中的人物表现出明显的表情变化和动作时。当人脸只占据图像的一小部分时，这个问题就变得至关重要。由于人类对身份变异高度敏感，这在 I2V 生成中提出了一个关键但未被充分探索的挑战。在本文中，我们提出了身份保留奖励引导优化（IPRO），这是一种基于强化学习的新型视频扩散框架，以增强身份保存。我们的方法没有引入辅助模块或改变模型架构，而是引入了一种直接有效的调整算法，使用人脸身份评分器优化扩散模型。为了提高性能并加速收敛，我们的方法通过采样链的最后几步反向传播奖励信号，从而实现更丰富的梯度反馈。我们还提出了一种新的面部评分机制，将真实视频中的面部视为面部特征池，提供多角度的面部信息以增强泛化。进一步结合 KL 散度正则化以稳定训练并防止对奖励信号过度拟合。在 Wan 2.2 I2V 模型和我们内部的 I2V 模型上的大量实验证明了我们方法的有效性。我们的项目和代码可在此 https URL 中找到。

Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Identity-GRPO：通过强化学习优化多人身份保留视频生成

Authors: Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.14256
Pdf link: https://arxiv.org/pdf/2510.14256
Abstract While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
中文摘要 虽然 VACE 和 Phantom 等先进方法已经推进了不同场景中特定主体的视频生成，但它们在动态交互中难以保持多人身份，而在这类场景中多个角色的身份一致性至关重要。为了解决这个问题，我们提出了 Identity-GRPO，这是一种人类反馈驱动的优化流程，用于改进多人身份保持的视频生成。首先，我们构建了一个视频奖励模型，该模型在包含人工标注和合成失真数据的大规模偏好数据集上训练，成对标注侧重于在整个视频中保持人物的一致性。然后，我们采用了为多人一致性量身定制的 GRPO 变体，这极大地增强了 VACE 和 Phantom。通过广泛的消融研究，我们评估了标注质量和设计选择对策略优化的影响。实验表明，与基线方法相比，Identity-GRPO 在人物一致性指标方面实现了高达 18.9% 的改进，为将强化学习与个性化视频生成相结合提供了可操作的见解。

AlphaQuanter: An End-to-End Tool-Orchestrated Agentic Reinforcement Learning Framework for Stock Trading

AlphaQuanter：用于股票交易的端到端工具编排代理强化学习框架

Authors: Zheye Deng, Jiashu Wang
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2510.14264
Pdf link: https://arxiv.org/pdf/2510.14264
Abstract While Large Language Model (LLM) agents show promise in automated trading, they still face critical limitations. Prominent multi-agent frameworks often suffer from inefficiency, produce inconsistent signals, and lack the end-to-end optimization required to learn a coherent strategy from market feedback. To address this, we introduce AlphaQuanter, a single-agent framework that uses reinforcement learning (RL) to learn a dynamic policy over a transparent, tool-augmented decision workflow, which empowers a single agent to autonomously orchestrate tools and proactively acquire information on demand, establishing a transparent and auditable reasoning process. Extensive experiments demonstrate that AlphaQuanter achieves state-of-the-art performance on key financial metrics. Moreover, its interpretable reasoning reveals sophisticated strategies, offering novel and valuable insights for human traders. Our code for data acquisition and agent training is publicly available at: this https URL
中文摘要 虽然大型语言模型（LLM）代理在自动交易方面显示出前景，但它们仍然面临严重的局限性。著名的多智能体框架经常效率低下，产生不一致的信号，并且缺乏从市场反馈中学习连贯策略所需的端到端优化。为了解决这个问题，我们引入了 AlphaQuanter，这是一个单代理框架，它使用强化学习（RL）通过透明、工具增强的决策工作流程学习动态策略，它使单个代理能够自主编排工具并主动按需获取信息，建立透明且可审计的推理过程。广泛的实验表明，AlphaQuanter 在关键财务指标上取得了最先进的性能。此外，其可解释的推理揭示了复杂的策略，为人类交易者提供了新颖而有价值的见解。我们的数据采集和代理培训代码可在以下位置公开获取：此 https URL

Learning Human-Humanoid Coordination for Collaborative Object Carrying

学习人与人形机器人协调以协作搬运物体

Authors: Yushi Du, Yixuan Li, Baoxiong Jia, Yutang Lin, Pei Zhou, Wei Liang, Yanchao Yang, Siyuan Huang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14293
Pdf link: https://arxiv.org/pdf/2510.14293
Abstract Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids' complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7% compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
中文摘要 人与人形机器人协作在医疗保健、家政援助和制造业的应用中显示出巨大的前景。虽然面向机械臂的柔顺人机协作已得到广泛研究，但由于人形机器人复杂的全身动力学，实现柔顺的人与人形机器人协作在很大程度上仍未被探索。在本文中，我们提出了一种仅依赖本体感觉的强化学习方法 COLA，该方法将领导者和跟随者行为结合在单一策略中。该模型在具有动态物体交互的闭环环境中进行训练，以隐式预测物体运动模式和人类意图，从而实现柔顺协作，通过协调的轨迹规划来保持负载平衡。我们通过全面的仿真和真实世界的协作搬运实验来评估我们的方法，证明我们的模型在各种地形和物体上的有效性、泛化性和鲁棒性。仿真实验表明，与基线方法相比，我们的模型在保持物体稳定性的同时将人的用力减少了 24.7%。真实世界实验验证了跨不同物体类型（箱子、桌子、担架等）和运动模式（直线、转弯、爬坡）的稳健协作搬运能力。对 23 名参与者进行的用户研究证实，与基线模型相比，平均改善了 27.4%。我们的方法无需外部传感器或复杂的交互模型即可实现柔顺的人与人形机器人协作搬运，为实际部署提供了实用的解决方案。

Active Measuring in Reinforcement Learning With Delayed Negative Effects

具有延迟负面影响的强化学习中的主动测量

Authors: Daiqi Gao, Ziping Xu, Aseel Rawashdeh, Predrag Klasnja, Susan A. Murphy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.14315
Pdf link: https://arxiv.org/pdf/2510.14315
Abstract Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users' health status through surveys.
中文摘要 在现实环境中测量强化学习（RL）中的状态可能成本高昂，并且可能对未来的结果产生负面影响。我们引入了主动可观察马尔可夫决策过程（AOMDP），其中智能体不仅选择控制动作，还决定是否测量潜在状态。测量动作揭示了真实的潜在状态，但可能对环境产生负面的延迟影响。我们表明，尽管存在这些成本，但这种不确定性的降低可以证明可以提高样本效率并增加最优策略的价值。我们将AOMDP表述为周期性部分可观测的MDP，并提出了一种基于信念状态的在线RL算法。为了近似信念状态，我们进一步提出了一种顺序蒙特卡洛方法，以联合近似未知静态环境参数和未观察到的潜在状态的后验。我们在数字健康应用程序中评估所提出的算法，其中代理决定何时提供数字干预以及何时通过调查评估用户的健康状况。

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

使用多轮 RL 评估和减少来自语言模型的欺骗性对话

Authors: Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14318
Pdf link: https://arxiv.org/pdf/2510.14318
Abstract Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
中文摘要 大型语言模型（LLM）在客户支持、教育和医疗保健等应用中与全球数百万人进行交互。然而，它们产生欺骗性输出的能力，无论是有意还是无意，都会带来严重的安全问题。LLM 行为的不可预测性，加上针对幻觉、错误信息和用户操纵的保护措施不足，使其滥用成为一种严重的现实风险。在本文中，我们调查了 LLM 在对话中参与欺骗的程度，并提出了信念错位指标来量化欺骗。我们使用五个既定的欺骗检测指标和我们提出的指标，评估四种不同对话场景中的欺骗行为。我们的研究结果表明，这种新颖的欺骗度量与人类判断的相关性比我们测试的任何现有指标都更密切。此外，我们对八个最先进模型的基准测试表明，LLM 在大约 26% 的对话回合中自然地表现出欺骗行为，即使提示的是看似良性的目标。当被提示进行欺骗时，LLM 能够将欺骗性相对于基线增加多达 31%。出乎意料的是，使用 RLHF（确保广泛部署的 LLM 安全的主要方法）训练的模型平均仍以 43% 的比率表现出欺骗行为。鉴于对话中的欺骗是一种在互动历史中发展起来的行为，其有效评估和缓解需要超越单一话语分析。我们引入了一种多轮强化学习方法来微调 LLM 以减少欺骗行为，与其他指令调整模型相比，减少了 77.6%。

Large Reasoning Embedding Models: Towards Next-Generation Dense Retrieval Paradigm

大型推理嵌入模型：迈向下一代密集检索范式

Authors: Jianting Tang, Dongshuai Li, Tao Wen, Fuyu Lv, Dan Ou, Linli Xu
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.14321
Pdf link: https://arxiv.org/pdf/2510.14321
Abstract In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China's largest e-commerce platform since August 2025.
中文摘要 在现代电商搜索系统中，密集检索已成为不可或缺的组成部分。通过计算查询和项目（产品）嵌入之间的相似性，它可以有效地从大规模存储库中选择候选产品。随着大型语言模型（LLM）的突破，主流嵌入模型逐渐从 BERT 转向 LLM，以实现更准确的文本建模。然而，这些模型仍然采用直接嵌入方法，嵌入的语义准确性仍然不足。因此，大量采用对比学习来实现正对之间的紧密语义对齐。因此，此类模型倾向于捕获训练数据中的统计共现模式，使其偏向浅层词汇和语义匹配。对于与目标项表现出明显词法差异的困难查询，性能会显着下降。在这项工作中，我们提出了大型推理嵌入模型（LREM），该模型新颖地将推理过程集成到表征学习中。对于高难度查询，LREM首先进行推理，实现对原始查询的深入理解，然后产生推理增强的查询嵌入进行检索。这种推理过程有效地弥合了原始查询和目标项目之间的语义差距，显着提高了检索准确性。具体来说，我们采用两阶段的训练过程：第一阶段在精心策划的具有SFT和InfoNCE损失的Query-CoT-Item三元组上优化LLM，以建立初步的推理和嵌入能力，第二阶段通过强化学习（RL）进一步细化推理轨迹。广泛的线下和线上实验验证了 LREM 的有效性，并于 2025 年 8 月起在中国最大的电子商务平台上部署。
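The first training stage above mentions an InfoNCE loss for pulling query embeddings toward their matching items. The following is a minimal sketch of that loss for a single query, not the authors' implementation: the function name `info_nce`, the use of cosine similarity, and the `temperature` default are all assumptions made for illustration.

```python
import math


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's score.

    `query`, `positive` and each entry of `negatives` are equal-length lists of
    floats (hypothetical embeddings). Cosine similarity and the temperature
    value are illustrative assumptions, not taken from the abstract.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # The positive item always sits at index 0 of the logits.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

The loss is near zero when the query is much closer to the positive item than to any negative, and grows as a negative overtakes the positive.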

Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion

面向四足运动的风险感知强化学习与基于多臂老虎机的自适应

Authors: Yuanhong Zeng, Anushri Dixit
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14338
Pdf link: https://arxiv.org/pdf/2510.14338
Abstract In this work, we study risk-aware reinforcement learning for quadrupedal locomotion. Our approach trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained policy optimization technique that provides improved stability and sample efficiency. At deployment, we adaptively select the best performing policy from the family of policies using a multi-armed bandit framework that uses only observed episodic returns, without any privileged environment information, and adapts to unknown conditions on the fly. Hence, we train quadrupedal locomotion policies at various levels of robustness using CVaR and adaptively select the desired level of robustness online to ensure performance in unknown environments. We evaluate our method in simulation across eight unseen settings (by changing dynamics, contacts, sensing noise, and terrain) and on a Unitree Go2 robot in previously unseen terrains. Our risk-aware policy attains nearly twice the mean and tail performance in unseen environments compared to other baselines and our bandit-based adaptation selects the best-performing risk-aware policy in unknown terrain within two minutes of operation.
中文摘要 在这项工作中，我们研究了四足运动的风险感知强化学习。我们的方法使用条件风险价值（CVaR）约束策略优化技术训练一系列风险条件策略，该技术可提供更高的稳定性和样本效率。在部署时，我们使用多臂老虎机框架从策略族中自适应地选择性能最佳的策略，该框架仅使用观察到的回合回报，不需要任何特权环境信息，并能即时适应未知条件。因此，我们使用 CVaR 在各种鲁棒性级别训练四足运动策略，并自适应地在线选择所需的鲁棒性级别，以确保在未知环境中的性能。我们在仿真中的八种未见过的设置（通过改变动力学、接触、传感噪声和地形）以及在先前未见过地形中的 Unitree Go2 机器人上评估了我们的方法。与其他基线相比，我们的风险感知策略在未见过的环境中实现了近两倍的平均性能和尾部性能，我们基于老虎机的自适应在运行两分钟内即可在未知地形中选出性能最佳的风险感知策略。
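The two ingredients named in this abstract, a CVaR tail statistic and a bandit that picks among pre-trained policies using only episodic returns, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the abstract does not say which bandit algorithm is used, so UCB1 is swapped in here, and the names `cvar` and `UCB1Selector` are hypothetical.

```python
import math


def cvar(returns, alpha=0.2):
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of returns."""
    k = max(1, math.ceil(alpha * len(returns)))
    return sum(sorted(returns)[:k]) / k


class UCB1Selector:
    """Chooses among a fixed family of policies from observed returns only.

    UCB1 is an assumed stand-in for the unspecified bandit algorithm.
    """

    def __init__(self, n_policies, c=2.0):
        self.counts = [0] * n_policies
        self.means = [0.0] * n_policies
        self.c = c  # exploration constant (illustrative default)

    def select(self):
        # Try every policy once before using confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        t = sum(self.counts)
        scores = [
            m + math.sqrt(self.c * math.log(t) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, episodic_return):
        # Incremental running mean of the returns seen for policy i.
        self.counts[i] += 1
        self.means[i] += (episodic_return - self.means[i]) / self.counts[i]
```

In use, each deployment episode would call `select()`, run the chosen risk-conditioned policy, and feed the observed return back through `update()`. UCB1's standard guarantees assume returns scaled to a bounded range such as [0, 1].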

Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Hi-Agent：用于移动设备控制的分层视觉语言代理

Authors: Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14388
Pdf link: https://arxiv.org/pdf/2510.14388
Abstract Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
中文摘要 构建能够自主操作移动设备的智能体越来越受到关注。虽然视觉语言模型（VLM）显示出前景，但大多数现有方法依赖于直接的状态到动作映射，缺乏结构化的推理和规划，因此很难泛化到新任务或未见过的 UI 布局。我们介绍了Hi-Agent，这是一种用于移动控制的可训练分层视觉语言智能体，具有联合优化的高层推理模型和低层动作模型。为了实现高效训练，我们将多步决策重新表述为一系列单步子目标，并提出了一个前瞻优势函数，该函数利用来自低层模型的执行反馈来指导高层优化。这种设计缓解了群体相对策略优化（GRPO）在长时域任务中遇到的路径爆炸问题，并实现了稳定、无需评论家（critic）的联合训练。Hi-Agent 在 Android-in-the-Wild（AitW）基准测试中实现了 87.9% 的新最先进（SOTA）任务成功率，在三种范式中显著优于之前的方法：基于提示（AppAgent：17.7%）、监督（Filtered BC：54.5%）和基于强化学习（DigiRL：71.9%）。它还在 ScreenSpot-v2 基准测试上展示了具有竞争力的零样本泛化。在更具挑战性的 AndroidWorld 基准测试中，Hi-Agent 还能随更大的骨干模型有效扩展，在高复杂度的移动控制场景中表现出强大的适应性。
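For background on the "critic-free" claim: GRPO, as commonly described, scores each sampled response against the other responses drawn for the same prompt instead of against a learned value function. The sketch below shows that group normalization only. It is not Hi-Agent's foresight advantage function, whose form the abstract does not give, and the function name, the `eps` term and the choice of population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own sample group.

    `rewards` holds one scalar reward per response sampled for a single prompt.
    The population standard deviation and `eps` are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline comes from the group itself, no separate critic network has to be trained. A group whose rewards are all identical yields zero advantages and therefore no learning signal.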

Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

指令就是你所需要的一切：用于指令遵循的自监督强化学习

Authors: Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14420
Pdf link: https://arxiv.org/pdf/2510.14420
Abstract Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at this https URL
中文摘要 语言模型通常难以遵循对实际应用至关重要的多约束指令。现有的强化学习（RL）方法依赖于外部监督和来自多约束任务的稀疏奖励信号。我们提出了一种无标签的自监督RL框架，通过直接从指令中获取奖励信号并生成用于奖励模型训练的伪标签，消除了对外部监督的依赖。我们的方法引入了约束分解策略和高效的约束二元分类，以解决稀疏奖励挑战，同时保持计算效率。实验表明，我们的方法具有很好的推广性，在 3 个域内和 5 个域外数据集中取得了强劲的改进，包括具有挑战性的代理和多轮指令跟踪。数据和代码在此 https URL 上公开可用

Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

自然语言工具：在大型语言代理中调用工具的自然语言方法

Authors: Reid T. Johnson, Michelle D. Pain, Jordan D. West
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.14453
Pdf link: https://arxiv.org/pdf/2510.14453
Abstract We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
中文摘要 我们提出了自然语言工具（NLT），这是一个框架，它用自然语言输出取代了大型语言模型（LLM）中的编程 JSON 工具调用。通过将工具选择与响应生成解耦，NLT 消除了降低工具调用性能的任务干扰和格式约束。在对客户服务和心理健康领域的 10 个模型和 6,400 个试验进行评估时，NLT 将工具调用准确性提高了 18.4 个百分点，同时将输出方差降低了 70%。开放权重模型的收益最大，超过了旗舰封闭权重替代方案，这对强化学习和监督微调阶段的模型训练都有影响。这些改进在提示扰动下持续存在，并将工具调用功能扩展到缺乏本机支持的模型。

Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals

学习撤消：使用可逆性信号的回滚增强强化学习

Authors: Andrejs Sorstkins, Omer Tariq, Muhammad Bilal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14503
Pdf link: https://arxiv.org/pdf/2510.14503
Abstract This paper proposes a reversible learning framework to improve the robustness and efficiency of value based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure called Phi of s and a, and a selective state rollback operation. We introduce an online per state action estimator called Phi that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to adjust the penalty term during temporal difference updates dynamically, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator. When an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returns to the preceding state rather than progressing. This interrupts sub optimal high risk trajectories and avoids catastrophic steps. By combining reversibility aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking v0 domain, the framework reduced catastrophic falls by over 99.8 percent and yielded a 55 percent increase in mean episode return. In the Taxi v3 domain, it suppressed illegal actions by greater than or equal to 99.9 percent and achieved a 65.7 percent improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision making.
中文摘要 本文提出了一种可逆学习框架，以提高基于价值的强化学习智能体的鲁棒性和效率，解决其在部分不可逆环境中容易出现价值高估和不稳定的问题。该框架有两个互补的核心机制：一种是经验推导的转移可逆性度量，称为 s 和 a 的 Phi，另一种是选择性状态回滚操作。我们引入了一个名为 Phi 的在线逐状态动作估计器，它量化了在固定时域 K 内返回到先前状态的可能性。该度量用于在时序差分更新期间动态调整惩罚项，将可逆性意识直接整合到价值函数中。该系统还包括一个选择性回滚算子。当某个动作产生的预期回报明显低于其瞬时估计值并违反预定义阈值时，智能体会受到惩罚并返回到之前的状态而不是继续前进。这会中断次优的高风险轨迹并避免灾难性的步骤。通过将可逆性感知评估与有针对性的回滚相结合，该方法提高了安全性、性能和稳定性。在 CliffWalking v0 环境中，该框架将灾难性跌落减少了 99.8% 以上，平均回合回报提高了 55%。在 Taxi v3 环境中，它抑制了大于或等于 99.9% 的非法动作，并实现了 65.7% 的累积奖励提升，同时还大幅降低了两种环境中的奖励方差。消融研究证实，回滚机制是这些安全性和性能提升的关键组成部分，标志着朝着安全可靠的序贯决策迈出了坚实的一步。
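One way to read the reversibility measure described above is as an empirical frequency: of the times action a was taken in state s, how often did the agent find itself back in s within K steps? The sketch below implements that reading with simple counters. The class name, the `default` value for unseen pairs and the bookkeeping are assumptions, and the abstract does not confirm that the authors compute Phi this way.

```python
from collections import defaultdict, deque


class ReversibilityEstimator:
    """Counts how often the agent is back in state s within `horizon` steps
    of taking action a in s (a hypothetical reading of Phi(s, a))."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.returned = defaultdict(int)  # (s, a) -> times s was revisited in time
        self.total = defaultdict(int)     # (s, a) -> resolved attempts
        self.pending = deque()            # unresolved [state, action, steps_left]

    def observe(self, state, action, next_state):
        """Record one environment transition."""
        self.pending.append([state, action, self.horizon])
        still_open = deque()
        for s, a, steps_left in self.pending:
            if next_state == s:
                # The agent is back in s: count a successful return.
                self.returned[(s, a)] += 1
                self.total[(s, a)] += 1
            elif steps_left == 1:
                # Horizon exhausted without returning: count a failure.
                self.total[(s, a)] += 1
            else:
                still_open.append([s, a, steps_left - 1])
        self.pending = still_open

    def phi(self, state, action, default=0.5):
        """Estimated return probability; `default` covers unseen pairs."""
        n = self.total.get((state, action), 0)
        if n == 0:
            return default
        return self.returned.get((state, action), 0) / n
```

A real training loop would also need to clear `pending` at episode boundaries, which is left out here for brevity. The resulting estimate could then scale a penalty in the temporal-difference update, as the abstract describes.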

Agentic Entropy-Balanced Policy Optimization

代理熵平衡策略优化

Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.14545
Pdf link: https://arxiv.org/pdf/2510.14545
Abstract Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
中文摘要 最近，智能体强化学习（Agentic RL）在激励Web智能体的多轮、长时域工具使用能力方面取得了重大进展。虽然主流的智能体RL算法在熵的指导下自主探索高不确定性的工具调用步骤，但过度依赖熵信号可能会施加进一步的约束，导致训练崩溃。在本文中，我们深入研究了熵带来的挑战，并提出了智能体熵平衡策略优化（AEPO），这是一种智能体RL算法，旨在平衡 rollout 和策略更新阶段的熵。AEPO 由两个核心部分组成：（1）动态熵平衡 rollout 机制，通过熵预监测自适应地分配全局和分支采样预算，同时对连续的高熵工具调用步骤施加分支惩罚，以防止过度分支问题；（2）熵平衡策略优化，在高熵裁剪项中插入停止梯度操作，以保留并正确重新缩放高熵 token 上的梯度，同时结合熵感知优势估计以优先学习高不确定性 token。14 个具有挑战性的数据集上的结果表明，AEPO 始终优于 7 种主流 RL 算法。仅用 1K RL 样本，采用 AEPO 的 Qwen3-14B 就取得了令人印象深刻的结果：Pass@1 在 GAIA 上为 47.6%，在 Humanity's Last Exam 上为 11.2%，在 WebWalker 上为 43.0%；Pass@5 在 GAIA 上为 65.0%，在 Humanity's Last Exam 上为 26.0%，在 WebWalker 上为 70.0%。进一步分析表明，AEPO 提高了 rollout 采样多样性，同时保持稳定的策略熵，促进了可扩展的 Web 智能体训练。

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

基于知识的视觉问答，具有多模态处理、检索和过滤功能

Authors: Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14605
Pdf link: https://arxiv.org/pdf/2510.14605
Abstract Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at this https URL
中文摘要 基于知识的视觉问答（KB-VQA）需要视觉语言模型（VLM）将视觉理解与外部知识检索相结合。尽管检索增强生成（RAG）通过结合知识库查询在这项任务上取得了重大进展，但它仍然在多模态查询的质量和检索结果的相关性方面遇到困难。为了克服这些挑战，我们提出了一种新的三阶段方法，称为 Wiki-PRF，包括处理、检索和过滤阶段。处理阶段动态调用视觉工具，提取精确的多模态信息进行检索。检索阶段融合视觉和文本特征，实现多模态知识检索。过滤阶段对检索结果执行相关性过滤和集中。为此，我们引入了一种视觉语言模型，该模型以答案准确性和格式一致性作为奖励信号，通过强化学习的方式进行训练。这增强了模型的推理、面向准确查询的工具调用以及不相关内容的过滤。在基准数据集（E-VQA 和 InfoSeek）上的实验显示，答案质量显著提高（36.0 和 42.8），达到了最先进的性能。代码可在此 https URL 中找到

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

代码驱动的数序计算：增强大型语言模型的归纳推理能力

Authors: Kedi Chen, Zhikai Lei, Xu Guo, Xuecheng Wu, Siyuan Zeng, Jianghao Yin, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14620
Pdf link: https://arxiv.org/pdf/2510.14620
Abstract Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models' OOD performance.
中文摘要 大型语言模型（LLM）在推理任务方面取得了显著进展。在不同的推理模式中，归纳推理由于其与人类学习的一致性更好，引起了越来越多的兴趣。然而，归纳推理的研究面临一定的挑战。首先，现有的归纳数据大多关注表面规律，而缺乏更复杂的内部模式。其次，目前的工作只是提示 LLM 或对简单的提示-响应对进行微调，但没有提供精确的思维过程，也没有实现难度控制。与之前的工作不同，我们通过引入 CodeSeq 来应对这些挑战，这是一个由数字序列构建的合成后训练数据集。我们将数字序列包装成算法问题以发现它们的通项，并相应地定义通项生成（GTG）任务。我们的流程通过反思失败的测试用例并结合迭代更正来生成监督微调数据，从而教会 LLM 自主生成用例和自我检查。此外，它还利用强化学习，并采用一种新颖的案例协同可解性缩放奖励，该奖励同时基于由问题通过率估计的可解性和自主用例生成的成功率，使模型能够更有效地从成功和失败中学习。实验结果表明，用 CodeSeq 训练的模型在各种推理任务上都有改进，并且能够保持模型的 OOD 性能。

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

RLAIF-SPA：通过 RLAIF 优化基于 LLM 的情感语音合成

Authors: Qing Yang, Zhenghao Liu, Junxin Wang, Yangfan Du, Pengcheng Huang, Tong Xiao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14628
Pdf link: https://arxiv.org/pdf/2510.14628
Abstract Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
中文摘要 文本转语音合成在中性语音中已经达到了接近人类的质量，但情感表达仍然是一个挑战。现有方法通常依赖于昂贵的情感注释或优化间接目标，无法捕捉语音的情感表达和感知自然性，从而导致生成的语音准确但情感平淡。为了应对这些挑战，我们提出了 RLAIF-SPA 框架，结合人工智能反馈强化学习（RLAIF）机制，采用自动语音识别（ASR）和大型语言模型（LLM）技术分别判断语义准确性和韵律-情感标签对齐，作为情感表达和可理解性优化的直接奖励。具体来说，它利用韵律标签对齐，通过围绕四个细粒度维度（结构、情感、速度和语气）共同考虑语义准确性和韵律情感对齐来提高表达质量。此外，它还结合了语义准确性反馈，以确保生成清晰准确的语音。在Libri Speech数据集上的实验表明，RLAIF-SPA优于Chat-TTS，WER降低了26.1%，SIM-O提高了9.1%，人类评估提高了10%以上。
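
The WER figure quoted above is a word error rate. As a rough illustration only, the sketch below computes WER in its usual form, word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions for illustration, not the authors' evaluation code, and the abstract does not say whether the 26.1% reduction is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ASR transcript of synthesized speech versus the input text:
# one substitution ("sat" -> "sit") and one deletion ("the"), so 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In a reward setting such as the one described, a lower WER from the ASR transcript would correspond to a higher semantic-accuracy reward; how the paper maps one to the other is not stated in the abstract.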

MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLMs

MR.Rec：协同记忆与推理的基于 LLM 的个性化推荐助手

Authors: Jiani Huang, Xingchen Zou, Lianghao Xia, Qing Li
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.14629
Pdf link: https://arxiv.org/pdf/2510.14629
Abstract The application of Large Language Models (LLMs) in recommender systems faces key challenges in delivering deep personalization and intelligent reasoning, especially for interactive scenarios. Current methods are often constrained by limited context windows and single-turn reasoning, hindering their ability to capture dynamic user preferences and proactively reason over recommendation contexts. To address these limitations, we propose MR.Rec, a novel framework that synergizes memory and reasoning for LLM-based recommendations. To achieve personalization, we develop a comprehensive Retrieval-Augmented Generation (RAG) system that efficiently indexes and retrieves relevant external memory to enhance LLM personalization capabilities. Furthermore, to enable the synergy between memory and reasoning, our RAG system goes beyond conventional query-based retrieval by integrating reasoning-enhanced memory retrieval. Finally, we design a reinforcement learning framework that trains the LLM to autonomously learn effective strategies for both memory utilization and reasoning refinement. By combining dynamic memory retrieval with adaptive reasoning, this approach ensures more accurate, context-aware, and highly personalized recommendations. Extensive experiments demonstrate that MR.Rec significantly outperforms state-of-the-art baselines across multiple metrics, validating its efficacy in delivering intelligent and personalized recommendations. We will release code and data upon paper notification.
中文摘要 大型语言模型（LLMs）在推荐系统中的应用在提供深度个性化和智能推理方面面临着关键挑战，特别是在交互场景中。当前的方法通常受到有限的上下文窗口和单轮推理的限制，阻碍了它们捕获动态用户偏好和主动推理推荐上下文的能力。为了解决这些限制，我们提出了 MR.Rec，这是一个新颖的框架，可以为基于 LLM 的推荐协同记忆和推理。为了实现个性化，我们开发了一个全面的检索增强生成（RAG）系统，可以有效地索引和检索相关的外部记忆，以增强 LLM 的个性化能力。此外，为了实现记忆和推理之间的协同作用，我们的 RAG 系统通过集成推理增强的记忆检索，超越了传统的基于查询的检索。最后，我们设计了一个强化学习框架，训练 LLM 自主学习有效的记忆利用和推理细化策略。通过将动态记忆检索与自适应推理相结合，这种方法可确保更准确、上下文感知和高度个性化的推荐。广泛的实验表明，MR.Rec 在多个指标上明显优于最先进的基线，验证了其在提供智能和个性化推荐方面的功效。我们将在收到论文录用通知后发布代码和数据。

ATGen: Adversarial Reinforcement Learning for Test Case Generation

ATGen：用于测试用例生成的对抗性强化学习

Authors: Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.14635
Pdf link: https://arxiv.org/pdf/2510.14635
Abstract Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a "fixed-difficulty ceiling", fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning. ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop creates a curriculum of increasing difficulty that challenges the current policy. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize "Output Accuracy" and "Attack Success", enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.
中文摘要 大型语言模型（LLM）擅长代码生成，但其输出通常包含细微的错误，有效的测试用例是一个关键瓶颈。现有的测试生成方法，无论是基于提示还是监督微调，都依赖于静态数据集。这施加了“固定难度上限”，从根本上限制了他们发现超出训练范围的新颖或更复杂错误的能力。为了克服这个问题，我们引入了 ATGen，这是一个通过对抗强化学习训练测试用例生成器的框架。ATGen 将测试生成器与对抗性代码生成器进行对抗，后者不断制造更难的错误以规避当前政策。这种动态循环创造了一个挑战当前政策的难度越来越大的课程。测试生成器通过强化学习（RL）进行优化，共同最大化“输出准确性”和“攻击成功率”，使其能够学习一个逐渐更强的策略，打破静态训练的固定难度上限。大量实验表明，ATGen 的性能明显优于最先进的基线。我们进一步验证了它的实际效用，表明它既可以作为 Best-of-N 推理的更有效过滤器，也可以作为训练代码生成模型的更高质量的奖励来源。我们的工作为提高 LLM 生成代码的可靠性建立了一种新的动态范式。

The Bidding Games: Reinforcement Learning for MEV Extraction on Polygon Blockchain

竞价游戏：Polygon 区块链上 MEV 提取的强化学习

Authors: Andrei Seoev, Leonid Gremyachikh, Anastasiia Smirnova, Yash Madhwal, Alisa Kalacheva, Dmitry Belousov, Ilia Zubov, Aleksei Smirnov, Denis Fedyanin, Vladimir Gorgadze, Yury Yanovich
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.14642
Pdf link: https://arxiv.org/pdf/2510.14642
Abstract In blockchain networks, the strategic ordering of transactions within blocks has emerged as a significant source of profit extraction, known as Maximal Extractable Value (MEV). The transition from spam-based Priority Gas Auctions to structured auction mechanisms like Polygon Atlas has transformed MEV extraction from public bidding wars into sealed-bid competitions under extreme time constraints. While this shift reduces network congestion, it introduces complex strategic challenges where searchers must make optimal bidding decisions within a sub-second window without knowledge of competitor behavior or presence. Traditional game-theoretic approaches struggle in this high-frequency, partially observable environment due to their reliance on complete information and static equilibrium assumptions. We present a reinforcement learning framework for MEV extraction on Polygon Atlas and make three contributions: (1) A novel simulation environment that accurately models the stochastic arrival of arbitrage opportunities and probabilistic competition in Atlas auctions; (2) A PPO-based bidding agent optimized for real-time constraints, capable of adaptive strategy formulation in continuous action spaces while maintaining production-ready inference speeds; (3) Empirical validation demonstrating our history-conditioned agent captures 49% of available profits when deployed alongside existing searchers and 81% when replacing the market leader, significantly outperforming static bidding strategies. Our work establishes that reinforcement learning provides a critical advantage in high-frequency MEV environments where traditional optimization methods fail, offering immediate value for industrial participants and protocol designers alike.
中文摘要 在区块链网络中，区块内交易的战略排序已成为利润提取的重要来源，称为最大可提取价值（MEV）。从基于垃圾邮件的优先 Gas 拍卖到像 Polygon Atlas 这样的结构化拍卖机制的过渡，将 MEV 提取从公开竞价战转变为极端时间限制下的密封竞标竞争。虽然这种转变减少了网络拥塞，但它带来了复杂的战略挑战，搜索者必须在亚秒级窗口内做出最佳竞价决策，而无需了解竞争对手的行为或存在。传统的博弈论方法由于依赖于完整信息和静态平衡假设，因此在这种高频、部分可观测的环境中举步维艰。我们提出了一个用于 Polygon Atlas 上 MEV 提取的强化学习框架，并做出了三项贡献：（1）一种新型的模拟环境，可以准确模拟 Atlas 拍卖中套利机会的随机到来和概率竞争;（2）基于PPO的竞价代理，针对实时约束进行了优化，能够在连续行动空间中自适应制定策略，同时保持生产就绪的推理速度;（3）实证验证表明，我们的历史条件代理在与现有搜索者一起部署时获得了 49% 的可用利润，在取代市场领导者时获得了 81% 的可用利润，显著优于静态竞价策略。我们的工作表明，强化学习在传统优化方法失败的高频 MEV 环境中提供了关键优势，为工业参与者和协议设计者等提供了直接的价值。

An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

用于搜索增强 LLM 的基于评分标准的高效生成式验证器

Authors: Linyue Ma, Yilong Xu, Xiang Long, Zhi Zheng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.14660
Pdf link: https://arxiv.org/pdf/2510.14660
Abstract Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning with tailored reward signals has emerged as a viable technique for improving LLMs on tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
中文摘要 搜索增强使大型语言模型具有检索能力，以克服静态参数带来的限制。最近，强化学习利用定制的奖励信号作为一种可行的技术来增强 LLM 执行涉及搜索的任务。然而，搜索增强 LLM 的现有奖励建模面临着一些局限性。基于规则的奖励（例如完全匹配）是可验证的，但易受表达式变化的影响，并且不能应用于长格式工作负载。相比之下，生成式奖励提高了鲁棒性，但为动态语料库中的长篇工作负载设计可验证且稳定的奖励仍然具有挑战性，并且还会产生高昂的计算成本。在本文中，我们提出了一种统一的、可验证的范式，即“nugget-as-rubric”，它将原子信息点视为不同搜索增强工作负载的结构化评估标准。短格式任务对应于单个评分量规，而长格式任务扩展到与问题信息需求一致的多个评分量规。为了支持长格式设置，我们设计了一个基于查询重写的自动评分标准构建管道，它可以自动检索与每个问题相关的段落，并从静态语料库和动态在线 Web 内容中提取评分标准。此外，我们引入了 Search-Gen-V，这是我们提出的可验证范式下的 4B 参数高效生成验证器，它通过蒸馏的思想和两阶段策略进行训练。实验结果表明，Search-Gen-V 在不同工作负载中实现了强大的验证准确性，使其成为搜索增强 LLM 的可扩展、健壮且高效的可验证奖励构造器。

Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction

用于下一个兴趣点预测的认知对齐时空大语言模型

Authors: Penglong Zhai, Jie Li, Fanyi Di, Yue Liu, Yifang Yuan, Jie Huang, Peng Wu, Sicong Wang, Mingyang Yin, Tingting Hu, Yao Xu, Xin Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14702
Pdf link: https://arxiv.org/pdf/2510.14702
Abstract The next point-of-interest (POI) recommendation task aims to predict the users' immediate next destinations based on their preferences and historical check-ins, holding significant value in location-based services. Recently, large language models (LLMs) have shown great potential in recommender systems, which treat the next POI prediction in a generative manner. However, these LLMs, pretrained primarily on vast corpora of unstructured text, lack the native understanding of structured geographical entities and sequential mobility patterns required for next POI prediction tasks. Moreover, in industrial-scale POI prediction applications, incorporating world knowledge and alignment with human cognition, such as seasons, weather conditions, holidays, and users' profiles (such as habits, occupation, and preferences), can enhance the user experience while improving recommendation performance. To address these issues, we propose CoAST (Cognitive-Aligned Spatial-Temporal LLMs), a framework employing natural language as an interface, allowing for the incorporation of world knowledge, spatio-temporal trajectory patterns, profiles, and situational information. Specifically, CoAST comprises two main stages: (1) Recommendation Knowledge Acquisition through continued pretraining on the enriched spatial-temporal trajectory data of the desensitized users; (2) Cognitive Alignment to align cognitive judgments with human preferences using enriched training data through Supervised Fine-Tuning (SFT) and a subsequent Reinforcement Learning (RL) phase. Extensive offline experiments on various real-world datasets and online experiments deployed in "Guess Where You Go" of AMAP App homepage demonstrate the effectiveness of CoAST.
中文摘要 下一个兴趣点（POI）推荐任务旨在根据用户的偏好和历史签到来预测用户的下一个目的地，这在基于位置的服务中具有重要价值。最近，大型语言模型（LLM）在推荐系统中显示出巨大的潜力，它以生成的方式处理下一个POI预测。然而，这些 LLM 主要在大量非结构化文本语料库上进行预训练，缺乏对下一个 POI 预测任务所需的结构化地理实体和顺序移动模式的原生理解。此外，在工业规模的POI预测应用中，结合世界知识和与人类认知的对齐，如季节、天气状况、节假日和用户画像（如习惯、职业和偏好），可以增强用户体验，同时提高推荐性能。为了解决这些问题，我们提出了 CoAST（认知对齐时空 LLM），这是一个采用自然语言作为接口的框架，允许整合世界知识、时空轨迹模式、用户画像和情境信息。具体来说，CoAST主要包括2个阶段：（1）通过对脱敏用户的丰富时空轨迹数据进行持续预训练来获取推荐知识;（2）认知对齐，通过监督微调（SFT）和随后的强化学习（RL）阶段使用丰富的训练数据，使认知判断与人类偏好保持一致。对各种真实世界数据集的大量离线实验和部署在 AMAP App 主页“猜猜你去哪里”中的在线实验证明了 CoAST 的有效性。

The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents

追求多样性：深度强化学习智能体的多目标测试

Authors: Antony Bartlett, Cynthia Liem, Annibale Panichella
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14727
Pdf link: https://arxiv.org/pdf/2510.14727
Abstract Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.
中文摘要 在安全关键领域测试深度强化学习（DRL）代理需要发现各种故障场景。INDAGO 等现有工具依赖于单一目标优化，仅专注于最大化故障计数，但这并不能确保发现的场景是多样化的或揭示不同的错误类型。我们介绍了 INDAGO-Nexus，这是一种多目标搜索方法，它使用具有多个多样性指标的多目标演化算法和帕累托前沿选择策略，共同优化故障概率和测试场景多样性。我们对 INDAGO-Nexus 在三种 DRL 代理上进行了评估：人形助行器、自动驾驶汽车和停车代理。平均而言，INDAGO-Nexus 在 SDC 和停车场景中分别比 INDAGO 多发现 83% 和 40% 的独特故障（测试有效性），同时将所有代理的故障时间缩短多达 67%。
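
Pareto front selection, mentioned in the abstract above, keeps only candidates that no other candidate beats on every objective. The sketch below is a generic two-objective version, assuming both objectives are maximized. The function name `pareto_front` and the `(failure_likelihood, diversity)` scores are hypothetical and are not taken from INDAGO-Nexus.

```python
from typing import List, Sequence, Tuple


def pareto_front(points: Sequence[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points when both objectives are maximized."""
    front = []
    for i, (a1, a2) in enumerate(points):
        # A point is dominated if another point is at least as good on both
        # objectives and strictly better on at least one of them.
        dominated = any(
            (b1 >= a1 and b2 >= a2) and (b1 > a1 or b2 > a2)
            for j, (b1, b2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (failure_likelihood, diversity) scores for five test scenarios.
scores = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9), (0.9, 0.05)]
front = pareto_front(scores)  # indices 0, 1 and 3 survive
```

This quadratic-time check is adequate for small populations; multi-objective evolutionary algorithms typically use faster non-dominated sorting, which the abstract does not detail.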

AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

AutoRubric-R1V：基于评分标准的生成奖励，用于忠实的多模态推理

Authors: Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.14738
Pdf link: https://arxiv.org/pdf/2510.14738
Abstract Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
中文摘要 多模态大型语言模型（MLLM）已经从感知任务迅速发展到复杂的多步推理，但具有可验证奖励的强化学习（RLVR）往往会导致虚假推理，因为只有最终答案的正确性才会得到奖励。为了解决这一限制，我们提出了 AutoRubric-R1V，这是一个框架，它通过自动收集的基于评分标准的生成奖励，将 RLVR 与流程级监督相结合。我们的关键创新在于一种可扩展的自聚合方法，该方法从成功的轨迹中提炼出一致的推理检查点，从而无需人工注释或更强大的教师模型即可构建特定于问题的评分标准。通过共同利用基于评分标准和结果奖励，AutoRubric-R1V 在六个多模态推理基准上实现了最先进的性能，并显着提高了专门评估中的推理忠实度。

Leveraging Neural Descriptor Fields for Learning Contact-Aware Dynamic Recovery

利用神经描述符场学习接触感知动态恢复

Authors: Fan Yang, Zixuan Huang, Abhinav Kumar, Sergio Aguilera Marinovic, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14768
Pdf link: https://arxiv.org/pdf/2510.14768
Abstract Real-world dexterous manipulation often encounters unexpected errors and disturbances, which can lead to catastrophic failures, such as dropping the manipulated object. To address this challenge, we focus on the problem of catching a falling object while it remains within grasping range and, importantly, resetting the system to a configuration favorable for resuming the primary manipulation task. We propose Contact-Aware Dynamic Recovery (CADRE), a reinforcement learning framework that incorporates a Neural Descriptor Field (NDF)-inspired module to extract implicit contact features. Compared to methods that rely solely on object pose or point cloud input, NDFs can directly reason about finger-object correspondence and adapt to different object geometries. Our experiments show that incorporating contact features improves training efficiency, enhances convergence performance for RL training, and ultimately leads to more successful recoveries. Additionally, we demonstrate that CADRE can generalize zero-shot to unseen objects with different geometries.
中文摘要 现实世界的灵巧操作经常会遇到意想不到的错误和干扰，这可能导致灾难性的失败，例如掉落被操作的物体。为了应对这一挑战，我们专注于在坠落物体保持在抓取范围内时抓住它的问题，重要的是，将系统重置为有利于恢复主要操作任务的配置。我们提出了接触感知动态恢复（CADRE），这是一个强化学习框架，它结合了受神经描述符场（NDF）启发的模块来提取隐式接触特征。与仅依赖物体位姿或点云输入的方法相比，NDF可以直接推理手指-物体的对应关系，并适应不同的物体几何形状。我们的实验表明，结合接触特征可以提高训练效率，增强 RL 训练的收敛性能，并最终导致更成功的恢复。此外，我们还证明了 CADRE 可以零样本泛化到具有不同几何形状的未见过的物体。

SkyDreamer: Interpretable End-to-End Vision-Based Drone Racing with Model-Based Reinforcement Learning

SkyDreamer：基于模型的强化学习的可解释的端到端基于视觉的无人机赛车

Authors: Aderik Verraest, Stavrow Bahnam, Robin Ferede, Guido de Croon, Christophe De Wagter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.14783
Pdf link: https://arxiv.org/pdf/2510.14783
Abstract Autonomous drone racing (ADR) systems have recently achieved champion-level performance, yet remain highly specific to drone racing. While end-to-end vision-based methods promise broader applicability, no system to date simultaneously achieves full sim-to-real transfer, onboard execution, and champion-level performance. In this work, we present SkyDreamer, to the best of our knowledge, the first end-to-end vision-based ADR policy that maps directly from pixel-level representations to motor commands. SkyDreamer builds on informed Dreamer, a model-based reinforcement learning approach where the world model decodes to privileged information only available during training. By extending this concept to end-to-end vision-based ADR, the world model effectively functions as an implicit state and parameter estimator, greatly improving interpretability. SkyDreamer runs fully onboard without external aid, resolves visual ambiguities by tracking progress using the state decoded from the world model's hidden state, and requires no extrinsic camera calibration, enabling rapid deployment across different drones without retraining. Real-world experiments show that SkyDreamer achieves robust, high-speed flight, executing tight maneuvers such as an inverted loop, a split-S and a ladder, reaching speeds of up to 21 m/s and accelerations of up to 6 g. It further demonstrates a non-trivial visual sim-to-real transfer by operating on poor-quality segmentation masks, and exhibits robustness to battery depletion by accurately estimating the maximum attainable motor RPM and adjusting its flight path in real-time. These results highlight SkyDreamer's adaptability to important aspects of the reality gap, bringing robustness while still achieving extremely high-speed, agile flight.
中文摘要 自主无人机赛车（ADR）系统最近取得了冠军级的性能，但仍高度针对无人机赛车。虽然基于视觉的端到端方法有望实现更广泛的适用性，但迄今为止，还没有任何系统能够同时实现完整的模拟到真实传输、板载执行和冠军级性能。在这项工作中，我们向SkyDreamer展示了第一个基于视觉的端到端ADR策略，该策略直接从像素级表示映射到电机命令。SkyDreamer 建立在知情的 Dreamer 之上，这是一种基于模型的强化学习方法，其中世界模型解码为仅在训练期间可用的特权信息。通过将这一概念扩展到基于视觉的端到端 ADR，世界模型有效地充当隐式状态和参数估计器，大大提高了可解释性。SkyDreamer 完全在机上运行，无需外部帮助，通过使用从世界模型的隐藏状态解码的状态来跟踪进度来解决视觉模糊性，并且不需要外部相机校准，无需重新训练即可在不同的无人机上快速部署。实际实验表明，SkyDreamer 实现了稳健的高速飞行，执行倒环、分体 S 和梯子等紧凑机动，速度高达 21 m/s，加速度高达 6 g。它通过在低质量的分割掩码上运行，进一步展示了一种重要的视觉模拟到真实的传输，并通过准确估计最大可达到的电机转速并实时调整其飞行路径，表现出对电池耗尽的鲁棒性。这些结果凸显了 SkyDreamer 对现实差距重要方面的适应性，带来了稳健性，同时仍能实现极高速、敏捷的飞行。

SimKO: Simple Pass@K Policy Optimization

SimKO：简单的Pass@K策略优化

Authors: Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.14807
Pdf link: https://arxiv.org/pdf/2510.14807
Abstract Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.
中文摘要 具有可验证奖励的强化学习（RLVR）提高了大型语言模型（LLM）的推理能力。然而，流行的 RLVR 方法表现出系统性的偏向于开发而不是勘探，这从改进的pass@1但降低的pass@K （K>1）性能中可以看出。为了理解这个问题，我们通过跟踪词汇候选者的标记级概率分布来分析 RLVR 方法的训练动态。我们的分析揭示了一个一致的概率集中效应，即前 1 名候选者越来越多地积累概率质量并抑制其他候选者的概率质量。更重要的是，更强的过度集中与更差的pass@K性能相关。受这一发现的启发，我们提出了简单Pass@K优化（SimKO），这是一种旨在缓解过度集中问题的方法，从而鼓励探索。SimKO 以不对称的方式运行。对于经过验证的正确回答，它提高了前 K 名候选人的概率。对于经过验证的错误回答，它会对前 1 名的候选人施加更严厉的处罚。我们观察到，当应用于高熵的标记时，这种不对称设计在减轻过度集中方面特别有效。在各种数学和逻辑推理基准测试中，SimKO 始终能为宽范围的 K 产生更高的pass@K，从而提供了一种改进 RLVR 探索的简单方法。
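
For readers unfamiliar with the metric, pass@K is the probability that at least one of K sampled responses is correct. The sketch below uses the widely used unbiased estimator from n samples with c correct ones, 1 - C(n-c, K) / C(n, K), which was introduced alongside the HumanEval benchmark. Whether SimKO's evaluation uses exactly this estimator is an assumption, since the abstract does not say. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem with 16 sampled responses, 2 of them verified correct.
estimates = {k: pass_at_k(16, 2, k) for k in (1, 4, 16)}
```

The gap the abstract describes shows up here directly: a policy that concentrates probability on one answer can raise the k=1 estimate while leaving fewer distinct correct samples to lift the estimates at larger k.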

RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning

RL-100：使用真实世界强化学习进行高性能机器人操作

Authors: Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, Huazhe Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14830
Pdf link: https://arxiv.org/pdf/2510.14830
Abstract Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
中文摘要 家庭和工厂中的真实机器人操作需要接近或超越熟练的人类操作员的可靠性、效率和稳健性。我们提出了 RL-100，这是一个真实世界强化学习训练框架，建立在通过监督学习训练的扩散视觉运动策略之上。RL-100 引入了三阶段流水线。首先，模仿学习利用了人类的先验。其次，迭代离线强化学习使用离线策略评估程序（缩写为 OPE）来控制 PPO 式更新，这些更新应用于去噪过程，以实现保守和可靠的改进。第三，在线强化学习消除了残余失效模式。额外的轻量级一致性蒸馏头将扩散中的多步采样过程压缩为单步策略，从而实现高频控制，延迟减少一个数量级，同时保持任务性能。该框架与任务、具身形态和表示无关，支持 3D 点云和 2D RGB 输入、各种机器人平台以及单步和动作块策略。我们在七项真实机器人任务上评估了 RL-100，涵盖动态刚体控制，例如 Push-T 和敏捷保龄球、流体和颗粒倾倒、可变形的布折叠、精确灵巧的拧松和多阶段榨橙汁。RL-100 在评估试验中取得了 100% 的成功率，900 个回合全部成功，其中一项任务最多连续 250 次试验全部成功。该方法实现了接近或优于人类远程操作的时间效率，并表现出数小时的鲁棒性，不间断运行持续长达两个小时。

Reinforcement Learning with Stochastic Reward Machines

使用随机奖励机进行强化学习

Authors: Jan Corazza, Ivan Gavran, Daniel Neider
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14837
Pdf link: https://arxiv.org/pdf/2510.14837
Abstract Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
中文摘要 奖励机是处理强化学习问题的成熟工具，其中奖励稀疏且依赖于复杂的动作序列。然而，现有的学习奖励机算法假设了一个过于理想化的环境，即奖励必须没有噪音。为了克服这一实际限制，我们引入了一种新型的奖励机器，称为随机奖励机器，以及一种学习它们的算法。我们的算法基于约束求解，从强化学习代理的探索中学习最小随机奖励机。该算法可以很容易地与奖励机器的现有强化学习算法配对，并保证在极限内收敛到最优策略。我们在两个案例研究中证明了我们的算法的有效性，并表明它优于现有方法和处理嘈杂奖励函数的朴素方法。
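
As a rough picture of the object being learned, a reward machine is a finite-state machine that reads high-level event labels and emits rewards on its transitions. The toy sketch below lets each transition draw its reward from a sampler instead of emitting a fixed value. This is only one plausible reading of "stochastic reward machine"; the class name, the coffee-delivery task, and the uniform noise range are all assumptions, and the paper's formal definition and learning algorithm are not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

RewardSampler = Callable[[random.Random], float]


@dataclass
class StochasticRewardMachine:
    initial_state: str
    # Maps (state, label) to (next_state, sampler that draws a noisy reward).
    transitions: Dict[Tuple[str, str], Tuple[str, RewardSampler]]

    def run(self, labels: Sequence[str], rng: random.Random) -> Tuple[str, List[float]]:
        """Feed a label sequence through the machine and collect sampled rewards."""
        state = self.initial_state
        rewards = []
        for label in labels:
            # Unlisted (state, label) pairs keep the state and pay zero reward.
            next_state, sampler = self.transitions.get(
                (state, label), (state, lambda _rng: 0.0)
            )
            rewards.append(sampler(rng))
            state = next_state
        return state, rewards


# Hypothetical task: fetch coffee, then deliver it to the office.
machine = StochasticRewardMachine(
    initial_state="start",
    transitions={
        ("start", "coffee"): ("has_coffee", lambda rng: 0.0),
        ("has_coffee", "office"): ("done", lambda rng: rng.uniform(0.9, 1.1)),
    },
)
final_state, rewards = machine.run(["coffee", "hallway", "office"], random.Random(0))
```

Because the delivery reward varies from run to run, a learner that assumes noise-free rewards would see inconsistent observations for the same transition, which is the practical limitation the abstract describes.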

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates

更聪明而非更费力地映射：无需标签或模型更新即可自我改进的测试时强化学习智能体

Authors: Wen-Kwang Tsao, Yao-Ching Yu, Chien-Ming Huang
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.14900
Pdf link: https://arxiv.org/pdf/2510.14900
Abstract The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors in order to perform various downstream tasks. However, vendor documentation is often unavailable at test time. It is either misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent: 1) Identifies ambiguous field-mapping attempts. 2) Generates targeted web-search queries to gather external evidence. 3) Applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 56.4% (LLM-only) to 72.73% (RAG) to 93.94% over 100 iterations using GPT-4o. At the same time, it reduced the number of low-confidence mappings requiring expert review by 85%. This new approach provides an evidence-driven, transparent method for solving future industry problems, paving the way for more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.
中文摘要 企业智能平台必须集成来自众多第三方供应商的日志，以便执行各种下游任务。但是，供应商文档在测试时通常不可用。它要么放错位置，要么不匹配，要么格式不当，要么不完整，这使得模式映射具有挑战性。我们引入了一种强化学习代理，它可以在没有标记示例或模型权重更新的情况下进行自我改进。在推理过程中，代理：1）识别模棱两可的字段映射尝试。2）生成有针对性的网络搜索查询以收集外部证据。3）应用基于置信度的奖励来迭代细化其映射。为了演示此概念，我们将 Microsoft Defender for Endpoint 日志转换为通用架构。我们的方法使用 GPT-4o 在 100 次迭代中将映射精度从 56.4%（仅 LLM）提高到 72.73%（RAG）再到 93.94%。同时，它将需要专家审查的低置信度映射数量减少了 85%。这种新方法为解决未来的行业问题提供了一种证据驱动、透明的方法，为更强大、更负责任、可扩展、高效、灵活、适应性和协作的解决方案铺平了道路。

Reasoning with Sampling: Your Base Model is Smarter Than You Think

抽样推理：您的基本模型比您想象的更智能

Authors: Aayush Karan, Yilun Du
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.14901
Pdf link: https://arxiv.org/pdf/2510.14901
Abstract Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
中文摘要 前沿推理模型在强化学习（RL）后训练大型语言模型（LLM）的驱动下，在广泛的学科中表现出了令人难以置信的能力。然而，尽管这种范式取得了广泛的成功，但许多文献都致力于理清 RL 期间出现但不存在于基本模型中的真正新颖的行为。在我们的工作中，我们从不同的角度处理这个问题，而是询问是否可以通过纯采样在推理时从基础模型中引出可比的推理能力，而无需任何额外的训练。受马尔可夫链蒙特卡洛（MCMC）技术的启发，我们提出了一种利用基本模型自身似然的简单迭代抽样算法。在不同的基础模型上，我们表明我们的算法在推理方面提供了显着的提升，在各种单次任务（包括 MATH500、HumanEval 和 GPQA）上几乎与 RL 的算法相当，甚至优于RL。此外，我们的采样器避免了RL后训练特征的多个样本的多样性崩溃。至关重要的是，我们的方法不需要训练、精选数据集或验证器，这表明其广泛适用性超出了易于验证的领域。
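
To make "sampling from a sharpened distribution" concrete, the sketch below runs a generic independence Metropolis-Hastings sampler on a toy categorical distribution p, targeting p(x)^alpha with alpha > 1. Proposals are drawn from p itself, so the acceptance ratio simplifies to (p(x') / p(x))^(alpha - 1). This is a textbook MCMC construction, not the paper's algorithm, which operates on token sequences and is not specified in the abstract. The function name, the three-item distribution, and alpha = 4 are assumptions.

```python
import random
from typing import Dict, List


def sharpened_mh(
    probs: Dict[str, float], alpha: float, steps: int, rng: random.Random
) -> List[str]:
    """Independence Metropolis-Hastings targeting probs[x] ** alpha (unnormalized)."""
    items = list(probs)
    weights = [probs[x] for x in items]
    current = rng.choices(items, weights=weights)[0]
    samples = []
    for _ in range(steps):
        # Propose from the base distribution itself.
        proposal = rng.choices(items, weights=weights)[0]
        # Target ratio (p'/p)**alpha times proposal ratio p/p' = (p'/p)**(alpha - 1).
        accept = min(1.0, (probs[proposal] / probs[current]) ** (alpha - 1))
        if rng.random() < accept:
            current = proposal
        samples.append(current)
    return samples


# Toy base distribution; with alpha = 4 the target puts roughly 0.87 of its mass on "a".
base = {"a": 0.5, "b": 0.3, "c": 0.2}
chain = sharpened_mh(base, alpha=4.0, steps=20000, rng=random.Random(0))
share_a = chain.count("a") / len(chain)
```

Only likelihoods under the base distribution are needed to run the chain, which matches the abstract's point that no training, curated data, or verifier is required.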

A Hard-Label Black-Box Evasion Attack against ML-based Malicious Traffic Detection Systems

针对基于机器学习的恶意流量检测系统的硬标签黑盒规避攻击

Authors: Zixuan Liu, Yi Zhao, Zhuotao Liu, Qi Li, Chuanpu Fu, Guangmeng Zhou, Ke Xu
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.14906
Pdf link: https://arxiv.org/pdf/2510.14906
Abstract Machine Learning (ML)-based malicious traffic detection is a promising security paradigm. It outperforms rule-based traditional detection by identifying various advanced attacks. However, the robustness of these ML models is largely unexplored, thereby allowing attackers to craft adversarial traffic examples that evade detection. Existing evasion attacks typically rely on overly restrictive conditions (e.g., encrypted protocols, Tor, or specialized setups), or require detailed prior knowledge of the target (e.g., training data and model parameters), which is impractical in realistic black-box scenarios. The feasibility of a hard-label black-box evasion attack (i.e., applicable across diverse tasks and protocols without internal target insights) thus remains an open challenge. To this end, we develop NetMasquerade, which leverages reinforcement learning (RL) to manipulate attack flows to mimic benign traffic and evade detection. Specifically, we establish a tailored pre-trained model called Traffic-BERT, utilizing a network-specialized tokenizer and an attention mechanism to extract diverse benign traffic patterns. Subsequently, we integrate Traffic-BERT into the RL framework, allowing NetMasquerade to effectively manipulate malicious packet sequences based on benign traffic patterns with minimal modifications. Experimental results demonstrate that NetMasquerade enables both brute-force and stealthy attacks to evade 6 existing detection methods under 80 attack scenarios, achieving over 96.65% attack success rate. Notably, it can evade the methods that are either empirically or certifiably robust against existing evasion attacks. Finally, NetMasquerade achieves low-latency adversarial traffic generation, demonstrating its practicality in real-world scenarios.
中文摘要 基于机器学习（ML）的恶意流量检测是一种很有前途的安全范例。它通过识别各种高级攻击优于基于规则的传统检测。然而，这些 ML 模型的稳健性在很大程度上尚未被探索，从而允许攻击者制作逃避检测的对抗性流量示例。现有的规避攻击通常依赖于过度限制的条件（例如，加密协议、Tor 或专门的设置），或者需要对目标有详细的先验知识（例如，训练数据和模型参数），这在现实的黑盒场景中是不切实际的。因此，硬标签黑盒规避攻击（即，在没有内部目标洞察的情况下适用于不同的任务和协议）的可行性仍然是一个悬而未决的挑战。为此，我们开发了 NetMasquerade，它利用强化学习（RL）来操纵攻击流以模拟良性流量并逃避检测。具体来说，我们建立了一个名为 Traffic-BERT 的定制预训练模型，利用网络专用分词器和注意力机制来提取多样化的良性流量模式。随后，我们将 Traffic-BERT 集成到 RL 框架中，使 NetMasquerade 能够以最少的修改有效地操纵基于良性流量模式的恶意数据包序列。实验结果表明，在80种攻击场景下，NetMasquerade能够同时实现暴力攻击和隐身攻击，规避现有的6种检测方法，攻击成功率超过96.65%。值得注意的是，它可以规避经验上或可证明对现有规避攻击具有稳健性的方法。最后，NetMasquerade 实现了低延迟的对抗性流量生成，展示了其在现实场景中的实用性。

VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

VT-Refine：通过仿真微调学习具有视觉触觉反馈的双手装配

Authors: Binghao Huang, Jie Xu, Iretiayo Akinola, Wei Yang, Balakumar Sundaralingam, Rowland O'Flaherty, Dieter Fox, Xiaolong Wang, Arsalan Mousavian, Yu-Wei Chao, Yunzhu Li
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14930
Pdf link: https://arxiv.org/pdf/2510.14930
Abstract Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at this https URL.
中文摘要 人类通过适应丰富的触觉反馈来擅长双手组装任务——由于人类演示的次优性和有限的多样性，这种能力仍然很难仅通过行为克隆在机器人中复制。在这项工作中，我们提出了 VT-Refine，这是一个视觉触觉策略学习框架，它结合了真实世界的演示、高保真触觉模拟和强化学习，以解决精确、接触丰富的双手组装。我们首先使用同步的视觉和触觉输入在一小组演示上训练扩散策略。然后，该策略被转移到配备模拟触觉传感器的模拟数字孪生中，并通过大规模强化学习进一步细化，以增强鲁棒性和泛化性。为了实现准确的模拟到真实传输，我们利用高分辨率压阻式触觉传感器提供法向力信号，并且可以使用 GPU 加速仿真并行进行真实建模。实验结果表明，VT-Refine通过增加数据多样性和实现更有效的策略微调，提高了仿真和现实世界中的装配性能。我们的项目页面可在此 https URL 中找到。

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

LaSeR：具有最后一个令牌自我奖励的强化学习

Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14943
Pdf link: https://arxiv.org/pdf/2510.14943
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
中文摘要 具有可验证奖励的强化学习（RLVR）最近成为增强大型语言模型（LLM）推理能力的核心范式。为了解决测试时缺乏验证信号的问题，先前的研究将模型的自我验证能力训练纳入标准 RLVR 流程，从而将推理和验证能力统一到单个 LLM 中。然而，以前的做法要求 LLM 使用两个独立的提示模板按顺序生成解决方案和自我验证，这大大降低了效率。在这项工作中，我们从理论上揭示了自我验证的 RL 目标的封闭形式解决方案可以简化为一种非常简单的形式：解决方案的真实推理奖励等于其最后一个标记的自我奖励分数，该分数计算为在解决方案的最后一个标记处分配给任何预先指定标记的策略模型的下一个标记对数概率与预先计算的常数之间的差值，按 KL 系数缩放。基于这一见解，我们提出了 LaSeR（Reinforcement Learning with Last-Token Self-Rewarding），这是一种简单地用 MSE 损失来增强原始 RLVR 损失的算法，该算法将最后一个代币自我奖励分数与基于验证者的推理奖励保持一致，共同优化 LLM 的推理和自我奖励能力。优化后的自我奖励分数可用于训练和测试，以提高模型性能。值得注意的是，我们的算法从生成后立即对最后一个代币的预测下一个代币概率分布得出这些分数，仅产生一个额外代币推理的最小额外成本。实验表明，该方法不仅提高了模型的推理性能，还具备了显著的自我奖励能力，从而提高了其推理时间缩放性能。
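
Editor's sketch (not the authors' code): the last-token self-rewarding score described in the abstract can be illustrated in a few lines of Python. The sketch assumes that "scaled by the KL coefficient" means multiplying the difference by the coefficient; the function names, the numeric log-probabilities, the constant, and the coefficient are all hypothetical values chosen for illustration.

```python
def last_token_self_reward(logprob_token, ref_constant, kl_coef):
    """Self-rewarding score: kl_coef * (log pi(z | x, y) - c).

    logprob_token: policy log-probability of the pre-specified token z,
                   read at the solution's last token (hypothetical input).
    ref_constant:  the pre-calculated constant c (hypothetical input).
    kl_coef:       the KL coefficient used as the scale (assumed multiplicative).
    """
    return kl_coef * (logprob_token - ref_constant)


def mse_alignment_loss(self_rewards, verifier_rewards):
    """Mean squared error between self-rewarding scores and verifier rewards."""
    if len(self_rewards) != len(verifier_rewards):
        raise ValueError("both lists must have the same length")
    squared = [(s - v) ** 2 for s, v in zip(self_rewards, verifier_rewards)]
    return sum(squared) / len(squared)


# Illustrative numbers only: two sampled solutions, one judged correct (1.0)
# and one judged incorrect (0.0) by an external verifier.
logprobs = [-2.0, -12.0]
scores = [last_token_self_reward(lp, ref_constant=-12.0, kl_coef=0.1) for lp in logprobs]
aux_loss = mse_alignment_loss(scores, verifier_rewards=[1.0, 0.0])
```

In a training loop this auxiliary term would be added to the RLVR loss; how the two are weighted is not stated in the abstract and is left open here.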

CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

CBF-RL：利用控制屏障函数在训练中进行安全过滤的强化学习

Authors: Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.14959
Pdf link: https://arxiv.org/pdf/2510.14959
Abstract Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
中文摘要 强化学习（RL）虽然功能强大且富有表现力，但往往会以牺牲安全为代价来优先考虑性能。然而，违反安全规定可能会导致实际部署中的灾难性后果。控制屏障函数（CBF）提供了一种强制执行动态安全的原则性方法——传统上通过安全过滤器在线部署。虽然结果是安全的行为，但 RL 策略不了解 CBF 这一事实可能会导致保守行为。本文提出了 CBF-RL，这是一个通过在训练中强制执行 CBF 来使用 RL 生成安全行为的框架。CBF-RL 有两个关键属性：（1）最小地修改名义 RL 策略以通过 CBF 项对安全约束进行编码，以及（2）对训练中的策略推出进行安全过滤。从理论上讲，我们证明了连续时间安全滤波器可以通过离散时间推出的封闭式表达式来部署。实际上，我们证明 CBF-RL 将学习策略中的安全约束内化——既强制执行更安全的行动，又偏向于更安全的奖励——无需在线安全过滤器即可实现安全部署。我们通过对导航任务和Unitree G1人形机器人的消融研究来验证我们的框架，其中CBF-RL在不确定性下能够实现更安全的探索、更快的收敛和稳健的性能，使人形机器人能够在没有运行时安全过滤器的情况下在现实环境中避开障碍物并安全地爬楼梯。
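
Editor's sketch (not the authors' code): a closed-form safety filter of the general kind the abstract mentions can be illustrated with the textbook case of a single control barrier constraint. The sketch assumes single-integrator dynamics (state derivative equals the action), a circular obstacle, and the standard condition grad_h · u >= -alpha * h; the paper's own expressions for discrete-time roll-outs may differ, and every name and number below is hypothetical.

```python
def obstacle_barrier(x, center, radius):
    """Barrier h(x) = ||x - center||^2 - radius^2 and its gradient.

    h >= 0 means the state is outside the (hypothetical) circular obstacle.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h_value = sum(d * d for d in diff) - radius ** 2
    grad_h = [2.0 * d for d in diff]
    return h_value, grad_h


def cbf_filter(u_nom, grad_h, h_value, alpha):
    """Minimally modify u_nom so that grad_h . u >= -alpha * h_value.

    This is the closed-form solution of the one-constraint quadratic program
    min ||u - u_nom||^2, assuming single-integrator dynamics x_dot = u.
    """
    margin = sum(g * u for g, u in zip(grad_h, u_nom)) + alpha * h_value
    if margin >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    norm_sq = sum(g * g for g in grad_h)
    if norm_sq == 0.0:
        return list(u_nom)  # gradient vanishes; the action cannot change h
    scale = -margin / norm_sq
    return [u + scale * g for u, g in zip(u_nom, grad_h)]


# Illustrative numbers only: the nominal action drives straight at the obstacle.
h_value, grad_h = obstacle_barrier(x=[2.0, 0.0], center=[0.0, 0.0], radius=1.0)
u_safe = cbf_filter(u_nom=[-5.0, 0.0], grad_h=grad_h, h_value=h_value, alpha=1.0)
```

With these numbers the filter slows the approach just enough to meet the constraint with equality, and leaves any action that already satisfies it untouched.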

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

基于信息增益的策略优化：一种简单有效的多轮LLM代理方法

Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14967
Pdf link: https://arxiv.org/pdf/2510.14967
Abstract Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
中文摘要 基于大型语言模型（LLM）的代理越来越多地接受强化学习（RL）训练，以增强它们通过工具使用与外部环境交互的能力，特别是在需要多轮推理和知识获取的基于搜索的环境中。然而，现有方法通常依赖于仅在最终答案时提供的基于结果的奖励。这种奖励稀疏性在多回合设置中变得特别成问题，其中长轨迹加剧了两个关键问题：（i）优势崩溃，所有推出都获得相同的奖励并且没有提供有用的学习信号，以及（ii）缺乏细粒度的学分分配，其中回合之间的依赖关系被掩盖，尤其是在长期任务中。在本文中，我们提出了基于信息增益的策略优化（IGPO），这是一个简单而有效的RL框架，为多轮次智能体训练提供了密集和内在的监督。IGPO 将每个交互回合建模为获取有关基本事实信息的增量过程，并将回合级奖励定义为策略产生正确答案的概率的边际增加。与之前依赖于外部奖励模型或昂贵的蒙特卡洛估计的过程级奖励方法不同，IGPO 直接从模型自身的信念更新中获得内在奖励。这些内在的回合级奖励与结果级监督相结合，形成密集的奖励轨迹。在域内和域外基准测试上的广泛实验表明，IGPO 在多轮场景中始终优于强基线，实现更高的准确性和更高的样本效率。
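
Editor's sketch (not the authors' code): the two ideas in the abstract, turn-level rewards as the marginal increase in the probability of the correct answer and advantage collapse under identical outcome rewards, can be shown with plain Python. The group normalisation below is a GRPO-style assumption that the abstract does not specify; the function names and all probabilities are hypothetical.

```python
def information_gain_rewards(answer_probs):
    """Turn-level rewards as marginal increases in P(correct answer).

    answer_probs[0] is the probability before any turn and answer_probs[t]
    the probability after turn t (hypothetical values supplied by the caller).
    """
    return [after - before for before, after in zip(answer_probs, answer_probs[1:])]


def group_normalized_advantages(rewards):
    """Mean/std-normalised advantages within a group of rollouts (assumed form)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    if std == 0.0:
        # Identical rewards: every advantage is zero, so no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Advantage collapse: four rollouts that all fail receive identical outcome rewards.
collapsed = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])

# Dense signal: one rollout whose belief in the correct answer changes per turn.
turn_rewards = information_gain_rewards([0.1, 0.3, 0.25, 0.8])
```

The turn rewards telescope: their sum equals the total change in probability from the first to the last entry, so an unhelpful turn (here the second) receives a negative reward even when the rollout ends well.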

Agentic Design of Compositional Machines

组合机器的代理设计

Authors: Wenqian Zhang, Weiyang Liu, Zhen Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14980
Pdf link: https://arxiv.org/pdf/2510.14980
Abstract The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
中文摘要 复杂机器的设计既是人类智能的标志，也是工程实践的基础。鉴于大型语言模型（LLM）的最新进展，我们询问它们是否也可以学习创造。我们通过组合机器设计的视角来处理这个问题：在这项任务中，机器由标准化组件组装而成，以满足在模拟物理环境中的运动或操纵等功能需求。为了支持这项研究，我们引入了 BesiegeField，这是一个基于机器制造游戏 Besiege 构建的测试平台，它支持基于零件的构建、物理模拟和奖励驱动的评估。使用 BesiegeField，我们使用代理工作流程对最先进的 LLM 进行基准测试，并确定成功所需的关键功能，包括空间推理、战略组装和指令遵循。由于当前的开源模型存在不足，我们探索强化学习（RL）作为改进的途径：我们策划了一个冷启动数据集，进行 RL 微调实验，并强调语言、机器设计和物理推理交叉点的开放挑战。

Keyword: diffusion policy

VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning

VT-Refine：通过仿真微调学习具有视觉触觉反馈的双手装配

Authors: Binghao Huang, Jie Xu, Iretiayo Akinola, Wei Yang, Balakumar Sundaralingam, Rowland O'Flaherty, Dieter Fox, Xiaolong Wang, Arsalan Mousavian, Yu-Wei Chao, Yunzhu Li
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.14930
Pdf link: https://arxiv.org/pdf/2510.14930
Abstract Humans excel at bimanual assembly tasks by adapting to rich tactile feedback -- a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at this https URL.
中文摘要 人类通过适应丰富的触觉反馈来擅长双手组装任务——由于人类演示的次优性和有限的多样性，这种能力仍然很难仅通过行为克隆在机器人中复制。在这项工作中，我们提出了 VT-Refine，这是一个视觉触觉策略学习框架，它结合了真实世界的演示、高保真触觉模拟和强化学习，以解决精确、接触丰富的双手组装。我们首先使用同步的视觉和触觉输入在一小组演示上训练扩散策略。然后，该策略被转移到配备模拟触觉传感器的模拟数字孪生中，并通过大规模强化学习进一步细化，以增强鲁棒性和泛化性。为了实现准确的模拟到真实传输，我们利用高分辨率压阻式触觉传感器提供法向力信号，并且可以使用 GPU 加速仿真并行进行真实建模。实验结果表明，VT-Refine通过增加数据多样性和实现更有效的策略微调，提高了仿真和现实世界中的装配性能。我们的项目页面可在此 https URL 中找到。