Arxiv Papers of Today

Generated: 2025-10-08 16:29:04 (UTC+8); arXiv release time: 2025-10-08 20:00 EDT (2025-10-09 08:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Authors: Jie Zhu, Yuanchen Zhou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05122
Pdf link: https://arxiv.org/pdf/2510.05122
Abstract Emotional Support Conversation (ESC) plays a vital role in alleviating psychological stress and providing emotional value through dialogue. While recent studies have largely focused on data augmentation and synthetic corpus construction, they often overlook the deeper cognitive reasoning processes that underpin effective emotional support. To address this gap, we propose CARE, a novel framework that strengthens reasoning in ESC without relying on large-scale synthetic data. CARE leverages the original ESC training set to guide models in generating logically coherent and supportive responses, thereby explicitly enhancing cognitive reasoning. Building on this foundation, we further employ reinforcement learning to refine and reinforce the reasoning process. Experimental results demonstrate that CARE significantly improves both the logical soundness and supportive quality of responses, advancing the development of empathetic, cognitively robust, and human-like emotional support systems.

Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing

Authors: Yu Zhu
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.05147
Pdf link: https://arxiv.org/pdf/2510.05147
Abstract Ensuring reliability in modern software systems requires rigorous pre-production testing across highly heterogeneous and evolving environments. Because exhaustive evaluation is infeasible, practitioners must decide how to allocate limited testing resources across configurations where failure probabilities may drift over time. Existing combinatorial optimization approaches are static, ad hoc, and poorly suited to such non-stationary settings. We introduce a novel reinforcement learning (RL) framework that recasts configuration allocation as a sequential decision-making problem. Our method is the first to integrate Q-learning with a hybrid reward design that fuses simulated outcomes and real-time feedback, enabling both sample efficiency and robustness. In addition, we develop an adaptive online-offline training scheme that allows the agent to quickly track abrupt probability shifts while maintaining long-run stability. Extensive simulation studies demonstrate that our approach consistently outperforms static and optimization-based baselines, approaching oracle performance. This work establishes RL as a powerful new paradigm for adaptive configuration allocation, advancing beyond traditional methods and offering broad applicability to dynamic testing and resource scheduling domains.
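
Illustrative sketch (not taken from the paper): the abstract describes Q-learning with a reward that fuses simulated outcomes and real-time feedback. A minimal way to picture this is a stateless, bandit-style Q-update over configurations with a constant step size, which lets estimates follow drifting failure probabilities. The names `hybrid_reward`, `run_allocation`, and the `weight` knob are hypothetical, and the "simulator" here is simply the true failure probability.

```python
import random


def hybrid_reward(simulated, observed, weight=0.5):
    """Blend a simulated outcome with real-time feedback.

    `weight` is a hypothetical mixing knob, not a value from the paper.
    """
    return weight * simulated + (1.0 - weight) * observed


def run_allocation(fail_probs, episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Stateless Q-learning over which configuration to test next."""
    rng = random.Random(seed)
    q = [0.0] * len(fail_probs)
    for _ in range(episodes):
        # Epsilon-greedy choice of the configuration to test.
        if rng.random() < epsilon:
            a = rng.randrange(len(q))
        else:
            a = max(range(len(q)), key=q.__getitem__)
        # Real-time feedback: 1.0 if the test exposed a failure.
        observed = 1.0 if rng.random() < fail_probs[a] else 0.0
        # Stand-in for a simulator's estimate of the failure rate.
        simulated = fail_probs[a]
        r = hybrid_reward(simulated, observed)
        # A constant step size keeps tracking non-stationary targets.
        q[a] += alpha * (r - q[a])
    return q


if __name__ == "__main__":
    print(run_allocation([0.05, 0.8, 0.2]))
```

With these toy probabilities the learned values should rank the second configuration highest, so most of the testing budget goes to the configuration most likely to fail.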

Percepta: High Performance Stream Processing at the Edge

Authors: Clarisse Sousa, Tiago Fonseca, Luis Lino Ferreira, Ricardo Venâncio, Ricardo Severino
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05149
Pdf link: https://arxiv.org/pdf/2510.05149
Abstract The rise of real-time data and the proliferation of Internet of Things (IoT) devices have highlighted the limitations of cloud-centric solutions, particularly regarding latency, bandwidth, and privacy. These challenges have driven the growth of Edge Computing. IoT also brings a set of further problems, such as data rate harmonization between multiple sources, protocol conversion, handling data loss, and integration with Artificial Intelligence (AI) models. This paper presents Percepta, a lightweight Data Stream Processing (DSP) system tailored to support AI workloads at the edge, with a particular focus on workloads such as Reinforcement Learning (RL). It introduces specialized features such as reward function computation, data storage for model retraining, and real-time data preparation to support continuous decision-making. Additional functionalities include data normalization, harmonization across heterogeneous protocols and sampling rates, and robust handling of missing or incomplete data, making it well suited for the challenges of edge-based AI deployment.

Adversarial Reinforcement Learning for Offensive and Defensive Agents in a Simulated Zero-Sum Network Environment

Authors: Abrar Shahid, Ibteeker Mahir Ishum, AKM Tahmidul Haque, M Sohel Rahman, A. B. M. Alim Al Islam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.05157
Pdf link: https://arxiv.org/pdf/2510.05157
Abstract This paper presents a controlled study of adversarial reinforcement learning in network security through a custom OpenAI Gym environment that models brute-force attacks and reactive defenses on multi-port services. The environment captures realistic security trade-offs including background traffic noise, progressive exploitation mechanics, IP-based evasion tactics, honeypot traps, and multi-level rate-limiting defenses. Competing attacker and defender agents are trained using Deep Q-Networks (DQN) within a zero-sum reward framework, where successful exploits yield large terminal rewards while incremental actions incur small costs. Through systematic evaluation across multiple configurations (varying trap detection probabilities, exploitation difficulty thresholds, and training regimens), the results demonstrate that defender observability and trap effectiveness create substantial barriers to successful attacks. The experiments reveal that reward shaping and careful training scheduling are critical for learning stability in this adversarial setting. The defender consistently maintains strategic advantage across 50,000+ training episodes, with performance gains amplifying when exposed to complex defensive strategies including adaptive IP blocking and port-specific controls. Complete implementation details, reproducible hyperparameter configurations, and architectural guidelines are provided to support future research in adversarial RL for cybersecurity. The zero-sum formulation and realistic operational constraints make this environment suitable for studying autonomous defense systems, attacker-defender co-evolution, and transfer learning to real-world network security scenarios.

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Authors: Guangyu Shen, Siyuan Cheng, Xiangzhe Xu, Yuan Zhou, Hanxi Guo, Zhuo Zhang, Xiangyu Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05169
Pdf link: https://arxiv.org/pdf/2510.05169
Abstract Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks, where the model executes prohibited actions whenever secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers implanted in the model. Motivated by recent findings on LLMs' situational awareness, we propose a novel post-training framework that cultivates self-awareness of backdoor risks and enables models to articulate implanted triggers even when they are absent from the prompt. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and reverse-engineer the triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.

Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Authors: Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.05251
Pdf link: https://arxiv.org/pdf/2510.05251
Abstract Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive explore-at-the-beginning, exploit-at-the-end strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
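
Illustrative sketch (not taken from the paper): the explore-at-the-beginning, exploit-at-the-end idea can be pictured as a per-position temperature that decays during generation. The linear schedule, the endpoint values `t_start` and `t_end`, and the `logits_fn` callback are assumptions made for this example; the abstract does not state the schedule actually used.

```python
import math
import random


def annealed_temperature(step, total_steps, t_start=1.5, t_end=0.5):
    """Linearly anneal from t_start (explore) down to t_end (exploit)."""
    if total_steps <= 1:
        return t_end
    frac = min(step, total_steps - 1) / (total_steps - 1)
    return t_start + (t_end - t_start) * frac


def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_fn, total_steps, seed=0):
    """Decode total_steps tokens, lowering the temperature as we go."""
    rng = random.Random(seed)
    tokens = []
    for step in range(total_steps):
        temp = annealed_temperature(step, total_steps)
        tokens.append(sample_token(logits_fn(tokens), temp, rng))
    return tokens


if __name__ == "__main__":
    # Toy "model": fixed logits regardless of the prefix.
    print(generate(lambda prefix: [0.0, 1.0, 2.0], total_steps=10))
```

Early positions are sampled from a flatter distribution, while later positions stay closer to the greedy choice.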

Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

Authors: Radha Gulhane, Sathish Reddy Indurthi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.05283
Pdf link: https://arxiv.org/pdf/2510.05283
Abstract Aligning multimodal large language models (MLLMs) with human preferences often relies on single-signal, model-based reward methods. Such monolithic rewards often lack confidence calibration across domain-specific tasks, fail to capture diverse aspects of human preferences, and require extensive data annotation and reward model training. In this work, we propose a hybrid reward modeling framework that integrates complementary reward paradigms: (i) model-based rewards, where a learned reward model predicts scalar or vector scores from synthetic and human feedback, and (ii) rule-based rewards, where domain-specific heuristics provide explicit correctness signals with confidence. Beyond accuracy, we further incorporate multi-aspect rewards to enforce instruction adherence and introduce a generalized length-penalty reward to stabilize training and improve performance. The proposed framework provides a flexible and effective approach to aligning MLLMs through reinforcement learning policy optimization. Our experiments show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best performing model in the 3B family achieves an overall average improvement of ~9.5% across general and math reasoning tasks. Focusing specifically on mathematical benchmarks, the model achieves a significant average improvement of ~16%, highlighting its effectiveness in mathematical reasoning and problem solving.
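
Illustrative sketch (not taken from the paper): one way to combine the reward sources named in the abstract is to use a rule-based check when a reference answer exists, fall back to a learned reward model's score otherwise, and then add an instruction-adherence term and a length penalty. All function names, weights, and thresholds below are hypothetical.

```python
def rule_based_reward(answer, reference):
    """Explicit correctness signal: exact match against a reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def length_penalty(num_tokens, target=256, scale=0.001):
    """Penalize only the tokens beyond a target length (assumed form)."""
    return -scale * max(0, num_tokens - target)


def hybrid_reward(answer, reference, model_score, num_tokens,
                  follows_format, w_format=0.2):
    """Combine rule-based, model-based, aspect, and length terms."""
    if reference is not None:
        base = rule_based_reward(answer, reference)  # verifiable task
    else:
        base = model_score  # open-ended task: learned reward model
    aspect = w_format if follows_format else 0.0
    return base + aspect + length_penalty(num_tokens)


if __name__ == "__main__":
    print(hybrid_reward("42", "42", 0.3, 100, True))
    print(hybrid_reward("free text", None, 0.7, 10, True))
```

The additive combination is only one design choice; a weighted or gated combination would fit the abstract's description equally well.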

Adjusting the Output of Decision Transformer with Action Gradient

Authors: Rui Lin, Yiwen Zhang, Zhicheng Peng, Minghao Lyu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05285
Pdf link: https://arxiv.org/pdf/2510.05285
Abstract Decision Transformer (DT), which integrates reinforcement learning (RL) with the transformer model, introduces a novel approach to offline RL. Unlike classical algorithms that take maximizing cumulative discounted rewards as objective, DT instead maximizes the likelihood of actions. This paradigm shift, however, presents two key challenges: stitching trajectories and extrapolation of action. Existing methods, such as substituting specific tokens with predictive values and integrating the Policy Gradient (PG) method, address these challenges individually but fail to improve performance stably when combined due to inherent instability. To address this, we propose Action Gradient (AG), an innovative methodology that directly adjusts actions to fulfill a function analogous to that of PG, while also facilitating efficient integration with token prediction techniques. AG utilizes the gradient of the Q-value with respect to the action to optimize the action. The empirical results demonstrate that our method can significantly enhance the performance of DT-based algorithms, with some results achieving state-of-the-art levels.
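
Illustrative sketch (not taken from the paper): adjusting an action along the gradient of the Q-value with respect to that action amounts to gradient ascent, a <- a + eta * dQ/da. The example below uses a finite-difference gradient and a toy quadratic critic so that it runs with the standard library alone; a real system would presumably differentiate a learned critic with automatic differentiation. `toy_q`, `step_size`, and `num_steps` are hypothetical.

```python
def q_gradient(q_fn, state, action, eps=1e-5):
    """Central finite-difference estimate of dQ/da for each action dim."""
    grad = []
    for i in range(len(action)):
        up = list(action)
        up[i] += eps
        down = list(action)
        down[i] -= eps
        grad.append((q_fn(state, up) - q_fn(state, down)) / (2 * eps))
    return grad


def refine_action(q_fn, state, action, step_size=0.1, num_steps=50):
    """Move the proposed action uphill on Q by gradient ascent."""
    a = list(action)
    for _ in range(num_steps):
        g = q_gradient(q_fn, state, a)
        a = [ai + step_size * gi for ai, gi in zip(a, g)]
    return a


def toy_q(state, action):
    """Toy critic whose maximum is at action == state."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


if __name__ == "__main__":
    state = [1.0, -2.0]
    proposed = [0.0, 0.0]  # stands in for the transformer's output
    print(refine_action(toy_q, state, proposed))
```

On the toy critic the refined action converges towards the maximizer `[1.0, -2.0]`.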

Adaptive Dynamics Planning for Robot Navigation

Authors: Lu Yuanjie, Mao Mingyang, Xu Tong, Wang Linji, Lin Xiaomin, Xiao Xuesu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.05330
Pdf link: https://arxiv.org/pdf/2510.05330
Abstract Autonomous robot navigation systems often rely on hierarchical planning, where global planners compute collision-free paths without considering dynamics, and local planners enforce dynamics constraints to produce executable commands. This discontinuity in dynamics often leads to trajectory tracking failure in highly constrained environments. Recent approaches integrate dynamics within the entire planning process by gradually decreasing its fidelity, e.g., increasing integration steps and reducing collision checking resolution, for real-time planning efficiency. However, they assume that the fidelity of the dynamics should decrease according to a manually designed scheme. Such static settings fail to adapt to environmental complexity variations, resulting in computational overhead in simple environments or insufficient dynamics consideration in obstacle-rich scenarios. To overcome this limitation, we propose Adaptive Dynamics Planning (ADP), a learning-augmented paradigm that uses reinforcement learning to dynamically adjust robot dynamics properties, enabling planners to adapt across diverse environments. We integrate ADP into three different planners and further design a standalone ADP-based navigation system, benchmarking them against other baselines. Experiments in both simulation and real-world tests show that ADP consistently improves navigation success, safety, and efficiency.

Teacher-Student Guided Inverse Modeling for Steel Final Hardness Estimation

Authors: Ahmad Alsheikh, Andreas Fischer
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05402
Pdf link: https://arxiv.org/pdf/2510.05402
Abstract Predicting the final hardness of steel after heat treatment is a challenging regression task due to the many-to-one nature of the process -- different combinations of input parameters (such as temperature, duration, and chemical composition) can result in the same hardness value. This ambiguity makes the inverse problem, estimating input parameters from a desired hardness, particularly difficult. In this work, we propose a novel solution using a Teacher-Student learning framework. First, a forward model (Teacher) is trained to predict final hardness from 13 metallurgical input features. Then, a backward model (Student) is trained to infer plausible input configurations from a target hardness value. The Student is optimized by leveraging feedback from the Teacher in an iterative, supervised loop. We evaluate our method on a publicly available tempered steel dataset and compare it against baseline regression and reinforcement learning models. Results show that our Teacher-Student framework not only achieves higher inverse prediction accuracy but also requires significantly less computational time, demonstrating its effectiveness and efficiency for inverse process modeling in materials science.

Adversarial Reinforcement Learning for Large Language Model Agent Safety

Authors: Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05442
Pdf link: https://arxiv.org/pdf/2510.05442
Abstract Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can manipulate the agent, posing security risks like data leakage. Current defense strategies typically rely on fine-tuning LLM agents on datasets of known attacks. However, the generation of these datasets relies on manually crafted attack patterns, which limits their diversity and leaves agents vulnerable to novel prompt injections. To address this limitation, we propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game. ARLAS co-trains two LLMs: an attacker that learns to autonomously generate diverse prompt injections and an agent that learns to defend against them while completing its assigned tasks. To ensure robustness against a wide range of attacks and to prevent cyclic learning, we employ a population-based learning framework that trains the agent to defend against all previous attacker checkpoints. Evaluated on BrowserGym and AgentDojo, agents fine-tuned with ARLAS achieve a significantly lower attack success rate than the original model while also improving their task success rate. Our analysis further confirms that the adversarial process generates a diverse and challenging set of attacks, leading to a more robust agent compared to the base model.
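
Illustrative sketch (not taken from the paper): training against all previous attacker checkpoints can be pictured as keeping a growing pool of attacker snapshots and drawing each episode's opponent from that pool rather than only from the latest attacker. The class and function names, the uniform sampling rule, and the string stand-ins for model checkpoints are assumptions for this example.

```python
import random


class AttackerPool:
    """Growing population of attacker snapshots."""

    def __init__(self, seed=0):
        self._checkpoints = []
        self._rng = random.Random(seed)

    def add(self, checkpoint):
        self._checkpoints.append(checkpoint)

    def sample(self):
        # Uniform sampling over every snapshot seen so far (assumed rule).
        return self._rng.choice(self._checkpoints)


def co_train(num_iterations, episodes_per_iteration=4, seed=0):
    """Return, per iteration, the opponents the agent would train against."""
    pool = AttackerPool(seed)
    schedule = []
    for it in range(num_iterations):
        # Stand-in for "train the attacker, then snapshot it".
        pool.add(f"attacker_v{it}")
        opponents = [pool.sample() for _ in range(episodes_per_iteration)]
        schedule.append(opponents)  # the agent update would happen here
    return schedule


if __name__ == "__main__":
    for i, opponents in enumerate(co_train(3)):
        print(i, opponents)
```

Because older snapshots stay in the pool, the agent keeps being tested on earlier attacks, which is the mechanism the abstract credits with preventing cyclic learning.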

Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs

Authors: Runlin Zhou, Chixiang Chen, Elynn Chen
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.05446
Pdf link: https://arxiv.org/pdf/2510.05446
Abstract We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structures in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $\mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; both recover a better behavior than prior-independent after $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim \tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL/$\text{MTSRL}^{+}$ track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
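
Illustrative sketch (not taken from the paper): posterior sampling under a Gaussian prior can be shown in one dimension with the standard conjugate update for a linear-Gaussian model. Here a scalar parameter has a prior with mean `prior_mean` and variance `prior_var`, and observations are `y = f * theta + noise`. The multiplicative `widen` factor is a hypothetical stand-in for prior widening; the paper's exact construction is not given in the abstract.

```python
import random


def posterior(prior_mean, prior_var, features, targets, noise_var=1.0):
    """Conjugate Gaussian update for y = f * theta + N(0, noise_var)."""
    precision = 1.0 / prior_var + sum(f * f for f in features) / noise_var
    weighted = sum(f * y for f, y in zip(features, targets)) / noise_var
    mean = (prior_mean / prior_var + weighted) / precision
    return mean, 1.0 / precision


def thompson_sample(prior_mean, prior_var, features, targets, rng, widen=1.0):
    """Draw theta from the posterior; widen > 1 inflates the prior variance."""
    mean, var = posterior(prior_mean, prior_var * widen, features, targets)
    return rng.gauss(mean, var ** 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    print(posterior(0.0, 1.0, [1.0], [2.0]))
    print(thompson_sample(0.0, 1.0, [1.0], [2.0], rng, widen=4.0))
```

A wider prior pulls the posterior mean closer to the data and leaves more posterior variance, which is the intuition behind widening as protection against a misestimated prior.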

Vul-R2: A Reasoning LLM for Automated Vulnerability Repair

Authors: Xin-Cheng Wen, Zirui Lin, Yijun Yang, Cuiyun Gao, Deheng Ye
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.05480
Pdf link: https://arxiv.org/pdf/2510.05480
Abstract The exponential increase in software vulnerabilities has created an urgent need for automatic vulnerability repair (AVR) solutions. Recent research has formulated AVR as a sequence generation problem and has leveraged large language models (LLMs) to address this problem. Typically, these approaches prompt or fine-tune LLMs to generate repairs for vulnerabilities directly. Although these methods show state-of-the-art performance, they face the following challenges: (1) Lack of high-quality, vulnerability-related reasoning data. Current approaches primarily rely on foundation models that mainly encode general programming knowledge. Without vulnerability-related reasoning data, they tend to fail to capture the diverse vulnerability repair patterns. (2) Hard to verify the intermediate vulnerability repair process during LLM training. Existing reinforcement learning methods often leverage intermediate execution feedback from the environment (e.g., sandbox-based execution results) to guide reinforcement learning training. In contrast, the vulnerability repair process generally lacks such intermediate, verifiable feedback, which poses additional challenges for model training.

TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Authors: Adam Filipek
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.05485
Pdf link: https://arxiv.org/pdf/2510.05485
Abstract Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using \texttt{this http URL}, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
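
Illustrative sketch (not the paper's implementation): the memory argument in the abstract is that n-grams should be indexed by a compact dictionary built from the n-grams actually present in the batch, not by a table sized to every possible n-gram over the vocabulary. The pure-Python, CPU-only version below shows that idea together with the clipped n-gram precision that BLEU is built on; the function names are hypothetical, and a GPU version would replace the loops with tensor operations.

```python
from collections import Counter


def ngrams(token_ids, n):
    """All contiguous n-grams of a token-ID sequence, as tuples."""
    return [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]


def batch_ngram_ids(batch, n):
    """Assign compact integer IDs to the n-grams present in this batch."""
    vocab = {}
    encoded = []
    for sentence in batch:
        ids = [vocab.setdefault(g, len(vocab)) for g in ngrams(sentence, n)]
        encoded.append(ids)
    return vocab, encoded


def clipped_precision(candidates, references, n):
    """Per-sentence modified n-gram precision (one reference each)."""
    _, enc = batch_ngram_ids(candidates + references, n)
    cand_enc, ref_enc = enc[:len(candidates)], enc[len(candidates):]
    scores = []
    for cand, ref in zip(cand_enc, ref_enc):
        cand_counts, ref_counts = Counter(cand), Counter(ref)
        # Clip each candidate count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        scores.append(matched / len(cand) if cand else 0.0)
    return scores


if __name__ == "__main__":
    cands = [[1, 2, 3, 4], [5, 5, 5]]
    refs = [[1, 2, 3, 9], [5, 6, 7]]
    print(clipped_precision(cands, refs, 2))
```

The dictionary size grows with the number of distinct n-grams in the batch, not with the vocabulary size raised to the power n.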

Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Authors: Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05526
Pdf link: https://arxiv.org/pdf/2510.05526
Abstract Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques to align large language models (LLM) with human preference. However, the quality of RLHF and DPO training is seriously compromised by Corrupted preference, reward Overoptimization, and bias towards Verbosity. To our knowledge, most existing works tackle only one of these important issues, and the few other works require much computation to estimate multiple reward models and lack theoretical guarantee of generalization ability. In this work, we propose RLHF-COV and DPO-COV algorithms that can simultaneously mitigate these three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.
中文摘要 人类反馈强化学习（RLHF）和直接偏好优化（DPO）是使大型语言模型（LLM）与人类偏好保持一致的重要技术。然而，RLHF 和 DPO 训练的质量受到偏好损坏（Corrupted）、奖励过度优化（Overoptimization）和偏向冗长（Verbosity）的严重影响。据我们所知，现有的大多数工作只解决了这些重要问题中的一个，其他少数工作需要大量的计算来估计多个奖励模型，且缺乏泛化能力的理论保证。在这项工作中，我们提出了 RLHF-COV 和 DPO-COV 算法，可以在离线和在线设置中同时缓解这三个问题。从理论上讲，这种能力是通过获得我们在损坏数据上训练的 DPO-COV 算法的长度正则化泛化错误率来证明的，该错误率与具有干净数据且没有长度正则化的简单情况下的已知最优速率相匹配。此外，我们的 DPO-COV 算法无需奖励估计，易于实现，并被证明与我们的 RLHF-COV 算法等效，这直接意味着普通 RLHF 算法和 DPO 算法之间的等效性。实验证明了我们的 DPO-COV 算法在离线和在线设置下的有效性。
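
To make the length-regularization idea in this abstract concrete, here is a minimal sketch of the standard per-pair DPO loss with an optional length penalty. The penalty term (`alpha`, `len_w`, `len_l`) is an illustrative assumption and is not the DPO-COV objective, which the abstract does not spell out; all function and parameter names are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1,
             len_w=0, len_l=0, alpha=0.0):
    """Per-pair DPO loss; w = chosen response, l = rejected response.

    logp_* are sequence log-probabilities under the trained policy,
    ref_logp_* under the frozen reference policy.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # ASSUMPTION: illustrative length penalty, shrinking the margin
    # when the chosen response is longer than the rejected one.
    margin -= alpha * (len_w - len_l)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

With `alpha=0.0` this reduces to vanilla DPO; a positive `alpha` makes a longer chosen response earn less credit, which is one simple way to discourage verbosity.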

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

发表论文是一门艺术：学术演讲的自我提升美学代理

Authors: Chengzhi Liu, Yuzhe Yang, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yannan Xie, Peng Qi, Xin Eric Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05571
Pdf link: https://arxiv.org/pdf/2510.05571
Abstract The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: "there is no way to improve it when you cannot evaluate it right". To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.
中文摘要 学术论文的推广已成为提高研究知名度的重要手段。然而，现有的自动化方法存在叙事能力有限、审美质量不足、自我调整受限等问题，难以实现高效、引人入胜的传播。这些挑战的核心是一个简单的原则：“当你无法正确评估它时，就没有办法改进它”。为了解决这个问题，我们引入了 EvoPresent，这是一个自我完善代理框架，它通过虚拟角色统一了连贯的叙述、具有美学意识的设计和逼真的演示交付。EvoPresent 的核心是 PresAesth，这是一种多任务强化学习（RL）美学模型，可提供可靠的美学评分、缺陷调整和比较反馈，即使在有限的美学训练数据下也能实现迭代自我完善。为了系统地评估这些方法，我们引入了 EvoPresent Benchmark，这是一个综合基准，包括：演示生成质量（Presentation Generation Quality），建立在 650 篇顶级 AI 会议论文之上，具有多模态资源（幻灯片、视频和脚本），用于评估内容和设计；以及审美意识（Aesthetic Awareness），由 2000 对不同审美水平的幻灯片组成，支持评分、缺陷调整和比较的联合训练和评估。我们的研究结果强调，（i）高质量的反馈对于智能体的自我完善至关重要，而仅靠初始能力并不能保证有效的自我纠正。（ii）自动生成管道在视觉设计和内容构建之间表现出权衡。（iii）多任务 RL 训练在审美意识任务中表现出更强的泛化性。

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

流程中代理系统优化，实现有效规划和工具使用

Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.05592
Pdf link: https://arxiv.org/pdf/2510.05592
Abstract Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
中文摘要 结果驱动的强化学习在大型语言模型（LLM）中具有先进的推理能力，但流行的工具增强方法训练了一个单一的、单一的策略，该策略在完整的上下文下交错思想和工具调用;这在长视野和多样化工具下扩展性很差，并且对新场景的推广性很弱。代理系统通过跨专业模块分解工作提供了一种有前途的替代方案，但大多数系统仍然无需训练或依赖于与多轮交互的实时动态解耦的离线训练。我们介绍了 AgentFlow，这是一个可训练的、流中的代理框架，它通过不断发展的内存协调四个模块（规划器、执行器、验证者、生成器），并在多轮循环中直接优化其规划器。为了在实时环境中进行策略训练，我们提出了基于流程的组细化策略优化（Flow-GRPO），它通过将多轮优化转换为一系列可处理的单轮策略更新来解决长期、稀疏奖励的信用分配。它向每个转折点广播单一的、可验证的轨迹级结果，使当地规划者的决策与全球成功保持一致，并利用小组标准化优势稳定学习。在十个基准测试中，具有 7B 级主干网的 AgentFlow 优于表现最佳的基线，搜索平均准确率提高 14.9%，代理平均准确率提高 14.0%，数学准确率提高 14.5%，科学任务平均准确率提高 4.1%，甚至超过了 GPT-4o 等更大的专有模型。进一步的分析证实了流程优化的好处，显示出改进的规划、增强的工具调用可靠性以及模型大小和推理轮次的正向扩展。
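
The credit-assignment step described above (one trajectory-level outcome broadcast to every turn, with group-normalized advantages) can be sketched as follows. This is a simplified illustration, not the authors' Flow-GRPO implementation; the function name, the `eps` constant, and the use of population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def broadcast_group_advantages(outcomes, turns_per_rollout, eps=1e-8):
    """Turn one outcome reward per rollout into per-turn advantages.

    outcomes: final scalar reward of each rollout in a sampled group.
    turns_per_rollout: number of planner turns in each rollout.
    Returns one list per rollout with the same advantage for every turn.
    """
    mu = mean(outcomes)
    sigma = pstdev(outcomes)  # population std over the group (assumption)
    # Group normalization: center and scale rewards within the group.
    advantages = [(r - mu) / (sigma + eps) for r in outcomes]
    # Broadcast: every turn of a rollout inherits that rollout's advantage.
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]
```

Each turn can then be treated as its own single-turn policy update weighted by the broadcast advantage, which is the sense in which multi-turn optimization becomes a sequence of single-turn updates.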

Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

提高自回归图像生成的思维链效率

Authors: Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05593
Pdf link: https://arxiv.org/pdf/2510.05593
Abstract Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy -- a phenomenon we call visual overthinking -- which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
中文摘要 在基础模型进步的推动下，自回归多模态大型语言模型最近在图像生成方面越来越受欢迎。为了增强对齐性和细节，较新的方法采用思维链（CoT）推理，在图像合成之前将用户输入扩展到精心设计的提示中。然而，这种策略可能会引入不必要的冗余——我们称之为视觉过度思考的现象——这会增加计算成本，并可能引入与原始提示相矛盾的细节。在这项工作中，我们探索了如何生成更简洁的CoT序列，以实现更高效的图像生成。我们介绍了 ShortCoTI，这是一个轻量级优化框架，它鼓励更简洁的 CoT，同时保持输出图像质量。ShortCoTI 通过自适应函数奖励更简洁的提示，该函数根据每个任务的估计难度进行缩放。将此奖励纳入强化学习范式可将提示推理长度减少 54%，同时在多个基准测试（T2I-CompBench、GenEval）中保持或略微提高质量指标。定性分析表明，我们的方法消除了冗长的解释和重复的细化，产生了既简洁又语义丰富的推理提示。因此，ShortCoTI 提高了计算效率，而不会影响生成图像的保真度或视觉吸引力。

A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

没有计划的目标只是一个愿望：面向长程智能体任务的高效且有效的全局规划器训练

Authors: Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05608
Pdf link: https://arxiv.org/pdf/2510.05608
Abstract Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent's planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.
中文摘要 由于在长期任务中缺乏全局规划，基于大型语言模型（LLM）的代理在无脑试错和产生幻觉动作方面苦苦挣扎。本文介绍了计划与执行框架，并提出了一种高效且有效的计划者培训方法EAGLET，无需人工即可增强执行者代理人的计划能力。具体来说，我们通过两步过程训练一个即插即用的全局规划器：我们首先使用我们提出的同源共识过滤策略从高级 LLM 中合成高质量的计划，然后应用微调作为冷启动。此外，我们还使用一种新的执行器能力获得奖励，通过基于规则的强化学习阶段进一步改进了规划器，确保它能够处理不同难度的任务指令。对三个远视野代理任务的实验表明，配备我们的规划器的执行器代理的性能优于现有方法，实现了新的最先进的性能。同时，与基于 RL 的基线相比，EAGLET 将训练成本降低了 8 倍，并且不需要人工工作或额外的训练数据，提供了高效且有效的解决方案。

HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

HOI-R1：探索多模态大语言模型在人物交互检测中的潜力

Authors: Junwen Chen, Peilin Xiong, Keiji Yanai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05609
Pdf link: https://arxiv.org/pdf/2510.05609
Abstract Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from VLMs to enhance their interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of MLLMs on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and first explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task purely through text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at this https URL.
中文摘要 最近的人-物体交互检测（HOID）方法高度需要VLM的先验知识来增强交互识别能力。将VLM的知识连接到目标检测器的HOI实例表示的训练策略和模型架构具有挑战性，整个框架对于进一步的开发或应用来说很复杂。另一方面，MLLM在人-物体交互检测方面的内在推理能力尚未得到充分探索。受到最近使用强化学习（RL）方法训练MLLM的成功启发，我们提出了HOI-R1，并首先探索了语言模型在HOID任务上的潜力，而无需任何额外的检测模块。我们引入了 HOI 推理过程和 HOID 奖励函数，以纯文本解决 HOID 任务。HICO-DET数据集上的结果表明，HOI-R1的准确率是基线的2倍，具有很强的泛化能力。源代码可在此 https URL 中找到。

DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

DecEx-RAG：通过流程监督通过决策和执行优化来促进代理检索增强生成

Authors: Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05691
Pdf link: https://arxiv.org/pdf/2510.05691
Abstract Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrates strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly 6x, providing an efficient solution for process-supervised RAG training. The code is available at this https URL.
中文摘要 代理检索增强生成（Agentic RAG）通过动态检索和自适应工作流程增强复杂任务的处理能力。最近的进展（例如 Search-R1）表明，结果监督强化学习表现出强大的性能。然而，这种方法仍然存在探索效率低下、奖励信号稀疏和全局奖励反馈模糊的问题。为了应对这些挑战，我们提出了 DecEx-RAG，它将 RAG 建模为包含决策和执行的马尔可夫决策过程（MDP），同时引入有效的剪枝策略来优化数据扩展。通过全面的流程级策略优化，DecEx-RAG 显著增强了大语言模型（LLMs）的自主任务分解、动态检索和高质量答案生成能力。实验表明，DecEx-RAG 在六个数据集上实现了 6.2% 的平均绝对性能提升，显著优于现有基线。此外，剪枝策略将数据构建效率提高了近 6 倍，为过程监督 RAG 训练提供了高效的解决方案。该代码可在此 https URL 中找到。

Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies

用于视觉运动策略的 Oracle 引导掩码对比强化学习

Authors: Yuhang Zhang, Jiaping Xiao, Chao Yan, Mir Feroskhan
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.05692
Pdf link: https://arxiv.org/pdf/2510.05692
Abstract A prevailing approach for learning visuomotor policies is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. However, the combination of high-dimensional visual inputs and agile maneuver outputs leads to long-standing challenges, including low sample efficiency and significant sim-to-real gaps. To address these issues, we propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and asymptotic performance of visuomotor policy learning. OMC-RL explicitly decouples the learning process into two stages: an upstream representation learning stage and a downstream policy learning stage. In the upstream stage, a masked Transformer module is trained with temporal modeling and contrastive learning to extract temporally-aware and task-relevant representations from sequential visual inputs. After training, the learned encoder is frozen and used to extract visual representations from consecutive frames, while the Transformer module is discarded. In the downstream stage, an oracle teacher policy with privileged access to global state information supervises the agent during early training to provide informative guidance and accelerate early policy learning. This guidance is gradually reduced to allow independent exploration as training progresses. Extensive experiments in simulated and real-world environments demonstrate that OMC-RL achieves superior sample efficiency and asymptotic policy performance, while also improving generalization across diverse and perceptually complex scenarios.
中文摘要 学习视觉运动策略的一种流行方法是采用强化学习将高维视觉观察直接映射到动作命令。然而，高维视觉输入和敏捷机动输出的结合带来了长期存在的挑战，包括样本效率低和模拟到真实存在显着差距。为了解决这些问题，我们提出了Oracle引导的掩蔽对比强化学习（OMC-RL），这是一种旨在提高视觉运动策略学习的样本效率和渐近性能的新框架。OMC-RL明确地将学习过程解耦为两个阶段：上游表示学习阶段和下游策略学习阶段。在上游阶段，通过时间建模和对比学习训练掩码 Transformer 模块，以从顺序视觉输入中提取时间感知和任务相关的表示。训练后，学习到的编码器被冻结并用于从连续帧中提取视觉表示，而 Transformer 模块被丢弃。在下游阶段，具有全局状态信息特权访问权限的 Oracle 教师策略在早期训练期间对智能体进行监督，提供信息指导并加速早期策略学习。随着培训的进行，该指南逐渐减少，以允许独立探索。在模拟和真实环境中的大量实验表明，OMC-RL实现了卓越的样本效率和渐近策略性能，同时还提高了在各种感知复杂场景中的泛化能力。

Joint Communication Scheduling and Velocity Control for Multi-UAV-Assisted Post-Disaster Monitoring: An Attention-Based In-Context Learning Approach

多无人机辅助灾后监测的联合通信调度和速度控制：一种基于注意力的情境学习方法

Authors: Yousef Emami, Seyedsina Nabavirazavi, Jingjing Zheng, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05698
Pdf link: https://arxiv.org/pdf/2510.05698
Abstract Recently, Unmanned Aerial Vehicles (UAVs) are increasingly being investigated to collect sensory data in post-disaster monitoring scenarios, such as tsunamis, where early actions are critical to limit coastal damage. A major challenge is to design the data collection schedules and flight velocities, as unfavorable schedules and velocities can lead to transmission errors and buffer overflows of the ground sensors, ultimately resulting in significant packet loss. Meanwhile, online Deep Reinforcement Learning (DRL) solutions have a complex training process and a mismatch between simulation and reality that does not meet the urgent requirements of tsunami monitoring. Recent advances in Large Language Models (LLMs) offer a compelling alternative. With their strong reasoning and generalization capabilities, LLMs can adapt to new tasks through In-Context Learning (ICL), which enables task adaptation through natural language prompts and example-based guidance without retraining. However, LLMs have input data limitations and thus require customized approaches. In this paper, a joint optimization of data collection schedules and velocity control for multiple UAVs is proposed to minimize data loss. The battery level of the ground sensors, the length of the queues, and the channel conditions, as well as the trajectories of the UAVs, are taken into account. Attention-Based In-Context Learning for Velocity Control and Data Collection Schedule (AIC-VDS) is proposed as an alternative to DRL in emergencies. The simulation results show that the proposed AIC-VDS outperforms both the Deep-Q-Network (DQN) and maximum channel gain baselines.
中文摘要 最近，越来越多的人研究无人机（UAV）在灾后监测场景（例如海啸）中收集传感数据，在这些场景中，早期行动对于限制沿海破坏至关重要。一个主要挑战是设计数据收集时间表和飞行速度，因为不利的时间表和速度会导致地面传感器的传输错误和缓冲区溢出，最终导致严重的数据包丢失。同时，在线深度强化学习（DRL）解决方案训练过程复杂，模拟与现实不匹配，无法满足海啸监测的迫切需求。大型语言模型（LLM）的最新进展提供了一个引人注目的替代方案。凭借其强大的推理和泛化能力，LLM可以通过上下文学习（ICL）适应新任务，无需重新训练，即可通过自然语言提示和基于示例的指导进行任务适应。然而，LLM 模型具有输入数据限制，因此需要定制方法。该文提出了多架无人机的数据采集计划和速度控制的联合优化，以尽量减少数据丢失。地面传感器的电池电量、队列长度、通道条件以及无人机的轨迹都被考虑在内。提出基于注意力的速度控制和数据收集计划上下文学习（AIC-VDS）作为紧急情况下 DRL 的替代方案。仿真结果表明，所提出的AIC-VDS优于深度Q网络（DQN）和最大信道增益基线。

EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

EMORL-TTS：基于LLM的TTS中细粒度情绪控制的强化学习

Authors: Haoxun Li, Yu Liu, Yuqing Sun, Hanlei Shi, Leyuan Qu, Taihao Li
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2510.05758
Pdf link: https://arxiv.org/pdf/2510.05758
Abstract Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. Moreover, we further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
中文摘要 最近基于LLM的TTS系统实现了强大的质量和零样本能力，但由于依赖离散语音token，缺乏细粒度的情绪控制。现有方法要么将情绪限制在分类标签上，要么不能推广到基于 LLM 的架构。我们提出了 EMORL-TTS（具有强化学习的细粒度情绪可控 TTS），这是一个将 VAD 空间中的全局强度控制与局部重点调节统一起来的框架。我们的方法将监督微调与强化学习相结合，以特定于任务的情绪类别、强度和重点奖励为指导。此外，我们进一步研究了重点放置如何调节细粒度情绪强度。实验表明，EMORL-TTS 提高了情绪准确性、强度区分和强调清晰度，同时保持了与基于 LLM 的强基线相当的合成质量。

Risk level dependent Minimax Quantile lower bounds for Interactive Statistical Decision Making

用于交互式统计决策的风险级别相关最小最大分位数下限

Authors: Raghav Bongole, Amirreza Zamani, Tobias J. Oechtering, Mikael Skoglund
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.05808
Pdf link: https://arxiv.org/pdf/2510.05808
Abstract Minimax risk and regret focus on expectation, missing rare failures critical in safety-critical bandits and reinforcement learning. Minimax quantiles capture these tails. Three strands of prior work motivate this study: minimax-quantile bounds restricted to non-interactive estimation; unified interactive analyses that focus on expected risk rather than risk level specific quantile bounds; and high-probability bandit bounds that still lack a quantile-specific toolkit for general interactive protocols. To close this gap, within the interactive statistical decision making framework, we develop high-probability Fano and Le Cam tools and derive risk level explicit minimax-quantile bounds, including a quantile-to-expectation conversion and a tight link between strict and lower minimax quantiles. Instantiating these results for the two-armed Gaussian bandit immediately recovers optimal-rate bounds.
中文摘要 极小最大风险和遗憾侧重于期望，错过了在安全关键型强盗和强化学习中至关重要的罕见故障。最小最大分位数捕获这些尾部。先前工作的三条线推动了这项研究：仅限于非交互式估计的最小最大分位数边界;统一的交互式分析，关注预期风险而不是风险水平特定的分位数边界;以及仍然缺乏用于通用交互协议的特定分位数工具包的高概率强盗边界。为了缩小这一差距，在交互式统计决策框架内，我们开发了高概率的 Fano 和 Le Cam 工具，并推导出风险水平显式最小最大分位数边界，包括分位数到期望值的转换以及严格最小最大分位数和下限极小最大分位数之间的紧密联系。为双臂高斯强盗实例化这些结果会立即恢复最佳速率边界。

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

EEPO：通过采样然后忘记进行探索增强的策略优化

Authors: Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.05837
Pdf link: https://arxiv.org/pdf/2510.05837
Abstract Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant modes) that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
中文摘要 在大型语言模型（LLM）的强化学习与可验证奖励（RLVR）中，平衡探索和开发仍然是核心挑战。当前的 RLVR 方法往往过分强调开发，导致熵崩溃、探索能力下降，并最终限制性能提升。尽管增加政策随机性的技术可以促进探索，但它们往往无法摆脱主导行为模式。这创造了一个自我强化的循环——重复采样和奖励主导模式——进一步侵蚀了探索。我们引入了探索增强策略优化（EEPO），这是一个框架，通过具有自适应取消学习的两阶段推出来促进探索。在第一阶段，模型生成一半的轨迹;然后，它会经历一个轻量级的取消学习步骤，以暂时抑制这些采样响应，迫使第二阶段探索输出空间的不同区域。这种先采样后忘记的机制破坏了自我强化循环，并在推出过程中促进了更广泛的探索。在五个推理基准中，EEPO 的表现优于 GRPO，在 Qwen2.5-3B 上实现了 24.3% 的平均相对收益，在 Llama3.2-3B-Instruct 上实现了 33.0% 的平均相对收益，在 Qwen3-8B-Base 上实现了 10.4% 的相对收益。
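
The sample-then-forget mechanism can be illustrated on a toy categorical policy over a handful of candidate responses. This is only a sketch: the paper applies a lightweight unlearning update to an LLM, whereas the fixed logit `penalty` below is a stand-in assumption, and all names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_then_forget(logits, n_rollouts, penalty=2.0, seed=0):
    """Two-stage rollout over a toy categorical policy."""
    rng = random.Random(seed)
    ids = range(len(logits))
    half = n_rollouts // 2
    # Stage 1: sample half of the rollouts from the current policy.
    stage1 = rng.choices(ids, weights=softmax(logits), k=half)
    # "Forget" (ASSUMPTION): temporarily lower the logits of responses
    # already sampled; the original logits are left untouched.
    suppressed = list(logits)
    for i in set(stage1):
        suppressed[i] -= penalty
    # Stage 2: sample the rest from the suppressed policy.
    stage2 = rng.choices(ids, weights=softmax(suppressed), k=n_rollouts - half)
    return stage1, stage2, suppressed
```

Because the suppression acts on a copy, the policy used for the actual gradient update is unchanged; only the second-stage sampling distribution is pushed away from the modes seen in stage one.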

Prompt reinforcing for long-term planning of large language models

大语言模型长期规划的提示强化

Authors: Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.05921
Pdf link: https://arxiv.org/pdf/2510.05921
Abstract Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
中文摘要 大型语言模型（LLM）在广泛的自然语言处理任务中取得了显著的成功，并且可以通过提示进行调整。然而，它们在多回合交互中仍然不理想，通常依赖于不正确的早期假设，并且无法随着时间的推移跟踪用户目标，这使得此类任务特别具有挑战性。对话系统方面的先前研究表明，长期规划对于处理交互式任务至关重要。在这项工作中，我们提出了一个受强化学习启发的提示优化框架，该框架只需修改基于LLM的代理的任务指令提示即可实现这种规划。通过生成逐轮反馈并利用经验回放进行提示重写，我们提出的方法显示出文本转 SQL 和面向任务的对话等多轮任务的显着改进。此外，它泛化了不同的基于 LLM 的代理，并且可以利用不同的 LLM 作为元提示代理。这保证了未来对强化学习启发的无参数优化方法的研究。

EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models

EARL：用于大型语言模型的高效智能体强化学习系统

Authors: Zheyue Tan, Mustapha Abdullahi, Tuo Shi, Huining Yuan, Zelai Xu, Chao Yu, Boxun Li, Bo Zhao
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.05943
Pdf link: https://arxiv.org/pdf/2510.05943
Abstract Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm to operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency, and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck. We present EARL, a scalable system for efficient agentic RL. EARL designs a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without relying on hard limits or penalties of context length.
中文摘要 强化学习（RL）已成为大型语言模型（LLM）后训练的关键组成部分，代理RL将这一范式扩展为通过多轮交互和工具使用作为代理进行操作。扩展此类系统暴露了两个实际瓶颈：（1）上下文长度在训练过程中迅速增长，增加了内存使用和延迟，并触发内存不足（OOM）故障；（2）中间张量随着上下文长度的增加而累积，使跨设备数据移动成为主要的系统瓶颈。我们提出了 EARL，这是一个用于高效代理 RL 的可扩展系统。EARL 设计了一个并行性选择器，该选择器可根据序列长度和系统负载动态调整跨 RL 阶段的模型和训练并行性，以及一个数据调度器，该调度器可执行中间数据批次的布局感知、分散交换。这些组件共同提高了吞吐量，减少了长上下文故障，并实现了代理 LLM 的稳定大规模训练，而无需依赖上下文长度的硬性限制或惩罚。

Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion

学习爬行：基于潜在模型的软机器人自适应运动强化学习

Authors: Vaughn Gzenda, Robin Chhabra
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.05957
Pdf link: https://arxiv.org/pdf/2510.05957
Abstract Soft robotic crawlers are mobile robots that utilize soft body deformability and compliance to achieve locomotion through surface contact. Designing control strategies for such systems is challenging due to model inaccuracies, sensor noise, and the need to discover locomotor gaits. In this work, we present a model-based reinforcement learning (MB-RL) framework in which latent dynamics inferred from onboard sensors serve as a predictive model that guides an actor-critic algorithm to optimize locomotor policies. We evaluate the framework on a minimal crawler model in simulation using inertial measurement units and time-of-flight sensors as observations. The learned latent dynamics enable short-horizon motion prediction while the actor-critic discovers effective locomotor policies. This approach highlights the potential of latent-dynamics MB-RL for enabling embodied soft robotic adaptive locomotion based solely on noisy sensor feedback.
中文摘要 软体机器人履带是利用软体变形性和顺应性通过表面接触实现运动的移动机器人。由于模型不准确、传感器噪声以及发现运动步态的需要，为此类系统设计控制策略具有挑战性。在这项工作中，我们提出了一个基于模型的强化学习（MB-RL）框架，其中从机载传感器推断的潜在动力学作为预测模型，指导演员-批评算法优化运动策略。我们在仿真中使用惯性测量单元和飞行时间传感器作为观测值评估了最小履带模型的框架。学习到的潜在动力学可以实现短视距运动预测，而演员-批评者则发现有效的运动策略。这种方法突出了潜在动力学 MB-RL 在仅基于噪声传感器反馈的情况下实现具身软机器人自适应运动的潜力。

Information-Theoretic Policy Pre-Training with Empowerment

信息论政策预训练与赋权

Authors: Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Michael Volpp, Joschka Boedecker
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.05996
Pdf link: https://arxiv.org/pdf/2510.05996
Abstract Empowerment, an information-theoretic measure of an agent's potential influence on its environment, has emerged as a powerful intrinsic motivation and exploration framework for reinforcement learning (RL). Beyond unsupervised RL and skill learning algorithms, the specific use of empowerment as a pre-training signal has received limited attention in the literature. We show that empowerment can be used as a pre-training signal for data-efficient downstream task adaptation. To this end, we extend the traditional notion of empowerment by introducing discounted empowerment, which balances the agent's control over the environment across short- and long-term horizons. Leveraging this formulation, we propose a novel pre-training paradigm that initializes policies to maximize discounted empowerment, enabling agents to acquire a robust understanding of environmental dynamics. We analyze empowerment-based pre-training for various existing RL algorithms and empirically demonstrate its potential as a general-purpose initialization strategy: empowerment-maximizing policies with long horizons are data-efficient and effective, leading to improved adaptability in downstream tasks. Our findings pave the way for future research to scale this framework to high-dimensional and complex tasks, further advancing the field of RL.
中文摘要 赋权是一种衡量智能体对其环境潜在影响的信息论衡量标准，已成为强化学习（RL）的强大内在动机和探索框架。除了对于无监督RL和技能学习算法外，授权作为预训练信号的具体用途在文献中受到的关注有限。我们表明，授权可以用作数据高效的下游任务适应的预训练信号。为此，我们通过引入折扣授权来扩展传统的授权概念，这平衡了代理在短期和长期范围内对环境的控制。利用这种表述，我们提出了一种新颖的预训练范式，该范式可以初始化政策以最大限度地提高折扣授权，使代理能够对环境动态有深入的了解。我们分析了各种现有 RL 算法的基于授权的预训练，并实证证明了其作为通用初始化策略的潜力：具有长视野的授权最大化策略具有数据效率和有效性，从而提高了下游任务的适应性。我们的发现为未来的研究铺平了道路，将该框架扩展到高维和复杂的任务，进一步推动了RL领域的发展。

Optimal Batched Scheduling of Stochastic Processing Networks Using Atomic Action Decomposition

基于原子作用分解的随机处理网络的最优批量调度

Authors: Jim Dai, Manxi Wu, Zhanhao Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.06033
Pdf link: https://arxiv.org/pdf/2510.06033
Abstract Stochastic processing networks (SPNs) have broad applications in healthcare, transportation, and communication networks. The control of SPN is to dynamically assign servers in batches under uncertainty to optimize long-run performance. This problem is challenging as the policy dimension grows exponentially with the number of servers, making standard reinforcement learning and policy optimization methods intractable at scale. We propose an atomic action decomposition framework that addresses this scalability challenge by breaking joint assignments into sequential single-server assignments. This yields policies with constant dimension, independent of the number of servers. We study two classes of atomic policies, the step-dependent and step-independent atomic policies, and prove that both achieve the same optimal long-run average reward as the original joint policies. These results establish that computing the optimal SPN control can be made scalable without loss of optimality using the atomic framework. Our results offer theoretical justification for the strong empirical success of the atomic framework in large-scale applications reported in previous articles.
中文摘要 随机处理网络（SPN）在医疗保健、交通和通信网络中有着广泛的应用。SPN的控制是在不确定性下批量动态分配服务器，以优化长期运行性能。这个问题具有挑战性，因为策略维度随着服务器数量的增加呈指数级增长，使得标准强化学习和策略优化方法在大规模上变得棘手。我们提出了一个原子动作分解框架，通过将联合分配分解为顺序的单服务器分配来解决这一可扩展性挑战。这会产生具有恒定维度的策略，与服务器数量无关。我们研究了两类原子策略，即阶梯依赖和步进独立原子策略，并证明两者都实现了与原始联合策略相同的最佳长期平均奖励。这些结果表明，使用原子框架可以在不损失最优性的情况下使计算最优 SPN 控制变得可扩展。我们的结果为原子框架在之前文章中报道的大规模应用中的强大经验成功提供了理论依据。
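
The decomposition described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `round_robin` placeholder policy, and the assumption that every server chooses among the same `num_activities` options are all hypothetical. The sketch only contrasts the size of a joint action space with the constant-size choice an atomic policy makes at each step.

```python
def joint_action_count(num_servers: int, num_activities: int) -> int:
    """Number of joint assignments when every server is assigned at once."""
    return num_activities ** num_servers


def atomic_rollout(num_servers, num_activities, atomic_policy):
    """Build one joint assignment through sequential single-server choices.

    atomic_policy(partial_assignment, server_index) must return an activity
    index in range(num_activities); its output size never depends on
    num_servers.
    """
    assignment = []
    for server in range(num_servers):
        activity = atomic_policy(tuple(assignment), server)
        if not 0 <= activity < num_activities:
            raise ValueError("atomic policy returned an invalid activity")
        assignment.append(activity)
    return tuple(assignment)


def round_robin(partial_assignment, server_index):
    """Hypothetical placeholder policy: cycle through three activities."""
    return server_index % 3


if __name__ == "__main__":
    print(joint_action_count(10, 3))            # joint space for 10 servers
    print(atomic_rollout(4, 3, round_robin))    # one assignment, built step by step
```

A step-dependent atomic policy in the abstract's sense would use `server_index` (the step), while a step-independent one would ignore it; the placeholder above happens to be step-dependent.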

From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning

从学习到精通：通过人机交互强化学习实现安全高效的真实世界自动驾驶

Authors: Li Zeqiao, Wang Yijing, Wang Haoyu, Li Zheng, Li Peng, Liu Wenfei, Zuo Zhiqiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.06038
Pdf link: https://arxiv.org/pdf/2510.06038
Abstract Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving.
中文摘要 强化学习（RL）的自动驾驶具有巨大的潜力。然而，由于需要安全、高效和稳健的学习，在现实环境中应用 RL 仍然具有挑战性。将人类专业知识融入学习过程可以通过减少风险勘探和提高样本效率来帮助克服这些挑战。在这项工作中，我们提出了一种无奖励、主动的人机循环学习方法，称为人类引导的分布软行为者批评者（H-DSAC）。我们的方法结合了代理值传播（PVP）和分布软 Actor-Critic （DSAC），以在现实环境中实现高效、安全的训练。关键创新是在DSAC框架内构建分布式代理价值函数。该函数通过为专家演示分配更高的预期回报并惩罚需要人工干预的行为来编码人类意图。通过将这些标签外推到未标记的状态，该政策有效地引导到类似专家的行为。凭借精心设计的状态空间，我们的方法在实践培训时间内实现了现实世界的驾驶政策学习。仿真和真实实验的结果表明，我们的框架能够为自动驾驶提供安全、稳健和高效的样本学习。
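
As a rough illustration of the labeling idea in the abstract above (higher values for expert actions, penalties for actions that triggered intervention), the following Python sketch builds proxy value targets from logged transitions. It is an assumption-laden simplification: the dictionary keys, the `+1`/`-1` target values, and the function name are hypothetical, and the distributional critic and the propagation to unlabeled states used by H-DSAC are not modeled here.

```python
def proxy_value_targets(transitions, high=1.0, low=-1.0):
    """Return (state, action, target) triples for steps where a human took over.

    Each transition is a dict with the hypothetical keys
    'state', 'agent_action', 'human_action', and 'intervened'.
    """
    targets = []
    for step in transitions:
        if not step["intervened"]:
            # No human label at this step; a full method would rely on
            # value propagation (e.g. TD learning) to cover such states.
            continue
        # The human's action is treated as desirable ...
        targets.append((step["state"], step["human_action"], high))
        # ... and the agent action that prompted the takeover as undesirable.
        targets.append((step["state"], step["agent_action"], low))
    return targets


if __name__ == "__main__":
    log = [
        {"state": "s0", "agent_action": "left", "human_action": None, "intervened": False},
        {"state": "s1", "agent_action": "left", "human_action": "brake", "intervened": True},
    ]
    for triple in proxy_value_targets(log):
        print(triple)
```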

VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

VideoMiner：通过基于树的组相对策略优化，迭代定位长达一小时视频的关键帧

Authors: Xinye Cao, Hongcan Guo, Jiawen Qian, Guoshun Nan, Chao Wang, Yuqi Pan, Tianhao Hou, Xiaojuan Wang, Yutong Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.06040
Pdf link: https://arxiv.org/pdf/2510.06040
Abstract Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization method for reinforcement learning that guides the exploration of VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at this https URL.
中文摘要 使用多模态大型语言模型（MM-LLM）理解长达一小时的视频可以丰富以人为本的人工智能应用的前景。然而，对于使用 LLM 进行端到端视频理解，随着视频长度的增加，统一采样视频帧会导致 LLM 被大量不相关信息淹没。现有的分层关键帧提取方法提高了视频理解的准确性，但仍面临两个关键挑战。1）如何减轻长视频中大量冗余信息的干扰？2）模型如何在准确识别关键帧的同时动态适应复杂的层次结构？为了解决这些问题，我们提出了 VideoMiner，它迭代地对长视频进行分割、字幕和聚类，形成分层树结构。所提出的 VideoMiner 从长视频发展到事件再到帧，同时保持时间连贯性，有效解决了第一个挑战。为了精确定位关键帧，我们引入了 T-GRPO，这是一种基于树的强化学习组相对策略优化方法，用于指导 VideoMiner 的探索。所提出的T-GRPO是专门针对树结构设计的，在问题引导的同时，在事件层面整合时空信息，从而解决了第二个挑战。我们在所有长视频理解任务中都取得了卓越的性能，并发现了一些有趣的见解。我们提出的 T-GRPO 令人惊讶地激励了模型自发生成推理链。此外，设计的树木生长生长素可动态调节膨胀深度，从而获得精度和效率增益。该代码在此 https URL 中公开可用。

ASPO: Asymmetric Importance Sampling Policy Optimization

ASPO：非对称重要性抽样策略优化

Authors: Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.06062
Pdf link: https://arxiv.org/pdf/2510.06062
Abstract Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. AIS further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at this https URL.
中文摘要 最近的大型语言模型（LLM）后训练方法依赖于强化学习（RL）期间的标记级裁剪机制。然而，我们发现了这种结果监督 RL （OSRL）范式中的一个根本缺陷：正优势代币的重要性抽样（IS）比率不匹配，导致正负代币的代币权重不平衡。这种不匹配抑制了低概率令牌的更新，同时过度放大了已经高概率的令牌。为了解决这个问题，我们提出了非对称重要性抽样策略优化（ASPO），它使用一种简单而有效的策略来翻转正优势代币的 IS 比率，使其更新方向与负优势代币的学习动态保持一致。AIS 进一步融入了软双剪辑机制，以稳定极限更新，同时保持渐变流。对编码和数学推理基准的综合实验表明，ASPO 显着减轻了过早收敛，提高了训练稳定性，并增强了基于 GRPO 的强基线的最终性能。我们的分析为代币级权重在 OSRL 中的作用提供了新的见解，并强调了纠正 LLM RL 中 IS 的至关重要性。ASPO 的代码和模型可在此 https URL 中找到。
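
The asymmetric weighting described in the abstract above can be sketched in a few lines of Python. This is a hedged illustration, not the ASPO implementation: "flipping" is modeled here as taking the reciprocal of the importance sampling ratio for positive-advantage tokens, and the paper's soft dual-clipping is replaced by a plain upper cap. The function name and the `cap` value are assumptions.

```python
def asymmetric_token_weight(ratio: float, advantage: float, cap: float = 5.0) -> float:
    """Weight applied to one token's advantage in the policy-gradient update.

    ratio: importance sampling ratio pi_new(token) / pi_old(token), must be > 0.
    advantage: the token's advantage estimate.
    cap: hypothetical upper bound standing in for a clipping mechanism.
    """
    if ratio <= 0:
        raise ValueError("importance sampling ratio must be positive")
    if advantage > 0:
        # Flipped ratio: tokens whose probability is still low get a larger
        # weight instead of a smaller one.
        weight = 1.0 / ratio
    else:
        # Negative-advantage tokens keep the ordinary ratio.
        weight = ratio
    return min(weight, cap)


if __name__ == "__main__":
    for r in (0.1, 0.5, 1.0, 2.0):
        print(r, asymmetric_token_weight(r, +1.0), asymmetric_token_weight(r, -1.0))
```

Under this sketch, a positive-advantage token with a ratio below 1 receives a weight above 1, which is one way to counter the suppression of low-probability tokens that the abstract describes.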

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

当思维漂移时：稳健视频推理的证据基础

Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.06077
Pdf link: https://arxiv.org/pdf/2510.06077
Abstract Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions, a phenomenon we term "visual thinking drift". We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence, moving toward large multimodal models that not only "think before answering", but also "see while thinking".
中文摘要 视频推理是使机器能够通过多步骤逻辑从动态视觉内容中推断的任务，对于高级人工智能至关重要。虽然思维链（CoT）机制增强了基于文本的任务中的推理能力，但其在视频理解中的应用仍未得到充分探索。本文进行了系统分析，揭示了 CoT 经常降低视频推理的性能，产生冗长但具有误导性的内部独白，并导致幻觉视觉细节和覆盖正确的直觉——这种现象我们称之为“视觉思维漂移”。我们通过贝叶斯视角解释这种漂移，假设 CoT 轨迹通常与实际视觉证据不同，而是放大了内部偏见或语言先验，导致模型讲故事而不是进行扎根推理。为了解决这个问题，我们引入了视觉证据奖励（VER），这是一种新颖的强化学习框架，它明确奖励基于视觉证据的可验证推理痕迹的生成。对 10 个不同视频理解基准的综合评估表明，我们的 Video-VER 始终保持最佳性能。我们的工作揭示了以视频为中心的推理的独特挑战，并鼓励人工智能的发展，将其推理稳健地建立在视觉证据的基础上——对于大型多模态模型，这些模型不仅“先思考后回答”，而且“边思考边看”。

Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL

从失败中学习：通过失败感知逆强化学习理解 LLM 对齐

Authors: Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.06092
Pdf link: https://arxiv.org/pdf/2510.06092
Abstract Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety. Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL), but treat all preference pairs equally, often overlooking the most informative signals: those examples the extracted reward model misclassifies or assigns nearly equal scores, which we term "failures". We introduce a novel failure-aware IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors. By learning from these failures, our failure-aware IRL extracts reward functions that better reflect the true objectives behind RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines across multiple metrics when applied to LLM detoxification, without requiring external classifiers or supervision. Crucially, failure-aware IRL yields rewards that better capture the true incentives learned during RLHF, enabling more effective re-RLHF training than standard IRL. This establishes failure-aware IRL as a robust, scalable method for auditing model alignment and reducing ambiguity in the IRL process.
中文摘要 人类反馈强化学习（RLHF）使大型语言模型（LLM）与人类偏好保持一致，但它们内化的潜在奖励信号仍然隐藏，对可解释性和安全性提出了严峻挑战。现有方法试图使用逆强化学习（IRL）提取这些潜在激励，但平等对待所有偏好对，往往忽略了信息量最大的信号：提取的奖励模型错误分类或分配几乎相等的分数的示例，我们称之为“失败”（failures）。我们引入了一种新颖的失败感知（failure-aware）IRL 算法，该算法专注于错误分类或困难的示例，以恢复定义模型行为的潜在奖励。通过从这些失败中学习，我们的失败感知 IRL 提取了奖励函数，以更好地反映 RLHF 背后的真正目标。我们证明，当应用于 LLM 去毒时，失败感知 IRL 在多个指标上优于现有的 IRL 基线，而无需外部分类器或监督。至关重要的是，失败感知 IRL 产生的奖励可以更好地捕捉 RLHF 期间学到的真正激励，从而实现比标准 IRL 更有效的重新 RLHF 训练。这将失败感知 IRL 确立为一种强大、可扩展的方法，用于审核模型对齐并减少 IRL 流程中的歧义。
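
The notion of a "failure" in the abstract above (a preference pair that the extracted reward model misclassifies or scores almost equally) can be made concrete with a small Python sketch. The function name, the pair layout, and the `margin` threshold are hypothetical; the abstract does not specify how near-ties are detected or how failures are then weighted during training.

```python
def find_failures(scored_pairs, margin=0.1):
    """Return indices of preference pairs that count as failures.

    scored_pairs: list of (score_preferred, score_rejected) tuples, i.e. the
    reward model's scores for the preferred and the rejected response.
    margin: hypothetical threshold below which two scores count as nearly equal.
    """
    failures = []
    for index, (score_preferred, score_rejected) in enumerate(scored_pairs):
        misclassified = score_preferred < score_rejected
        nearly_equal = abs(score_preferred - score_rejected) < margin
        if misclassified or nearly_equal:
            failures.append(index)
    return failures


if __name__ == "__main__":
    pairs = [(2.0, 0.5), (0.3, 0.9), (1.0, 0.95)]
    print(find_failures(pairs))
```

A failure-aware training loop could then upweight or revisit the returned pairs; that step is left out because the abstract gives no detail on it.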

The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

对齐审计器：用于验证和完善 LLM 目标的贝叶斯框架

Authors: Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.06096
Pdf link: https://arxiv.org/pdf/2510.06096
Abstract The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
中文摘要 大型语言模型（LLM）隐式优化的目标仍然危险地不透明，这使得可信的对齐和审计成为一项巨大的挑战。虽然逆强化学习（IRL）可以从行为中推断出奖励函数，但现有方法要么产生单一的、过度自信的奖励估计，要么无法解决任务的基本模糊性（不可识别性）。本文引入了一个有原则的审计框架，该框架将奖励推理从简单的估计任务重新构建为一个全面的验证过程。我们的框架利用贝叶斯 IRL，不仅可以恢复目标的分布，还可以实现三个关键的审计能力：（i）通过证明连续几轮证据的后验收缩来量化和系统地减少不可识别性；（ii）提供可操作的、感知不确定性的诊断，以揭露虚假的捷径，并识别无法信任推断的目标的分布外提示；（iii）通过证明精细的低不确定性奖励可以直接用于 RLHF 来验证策略层面的效用，以实现与真实（ground-truth）对齐过程相当的训练动态和毒性降低。根据经验，我们的框架成功地审核了经过去毒的 LLM，产生了一个经过良好校准和可解释的目标，从而加强了对齐保证。总体而言，这项工作为审计员、安全团队和监管机构提供了一个实用的工具包，以验证 LLM 真正想要实现的目标，从而推动我们走向更值得信赖和负责任的 AI。

Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks

使用语言编码的门控策略网络的多任务强化学习

Authors: Rushiv Arora
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.06138
Pdf link: https://arxiv.org/pdf/2510.06138
Abstract Multi-task reinforcement learning often relies on task metadata -- such as brief natural-language descriptions -- to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.
中文摘要 多任务强化学习通常依赖于任务元数据（例如简短的自然语言描述）来指导不同目标的行为。我们提出了词法策略网络（LEXPOL），这是一种用于多任务 RL 的语言条件混合策略架构。LEXPOL 使用文本编码器对任务元数据进行编码，并使用学习的门控模块在多个子策略之间进行选择或混合，从而实现跨任务的端到端训练。在 MetaWorld 基准测试中，LEXPOL 在成功率和样本效率方面达到或超过了强大的多任务基线，而无需特定于任务的重新训练。为了分析该机制，我们进一步研究了独立于门的固定专家策略的设置，并表明学习的语言门组成了这些专家，以产生适合新任务描述和看不见的任务组合的行为。这些结果表明，自然语言元数据可以在单个策略中有效地索引和重新组合可重用技能。
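
The gating mechanism described in the abstract above (a learned gate that selects or blends among sub-policies) can be sketched as a softmax-weighted mixture. This is an illustrative simplification rather than the LEXPOL architecture: the gate logits are passed in directly instead of being produced from a text encoder, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def blend_actions(gate_logits, sub_policy_actions):
    """Blend the action vectors proposed by several sub-policies.

    gate_logits: one score per sub-policy (in a language-conditioned setup
    these would be computed from an embedding of the task description).
    sub_policy_actions: one action vector per sub-policy, all the same length.
    """
    if len(gate_logits) != len(sub_policy_actions):
        raise ValueError("need exactly one gate logit per sub-policy")
    weights = softmax(gate_logits)
    action_dim = len(sub_policy_actions[0])
    return [
        sum(weight * action[d] for weight, action in zip(weights, sub_policy_actions))
        for d in range(action_dim)
    ]


if __name__ == "__main__":
    proposals = [[1.0, 0.0], [0.0, 1.0]]
    print(blend_actions([0.0, 0.0], proposals))   # even blend
    print(blend_actions([10.0, 0.0], proposals))  # close to selecting the first sub-policy
```

A large gap between gate logits approximates hard selection of one sub-policy, while similar logits give a blend, which mirrors the "select or blend" wording of the abstract.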

Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

窥视黑匣子内部：用于可解释和准确的关系提取的强化学习

Authors: Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.06198
Pdf link: https://arxiv.org/pdf/2510.06198
Abstract This paper introduces a framework for relation extraction (RE) that enhances both accuracy and explainability. The framework has two key components: (i) a reasoning mechanism that formulates relation extraction as a series of text-processing steps inspired by cognitive science, and (ii) an optimization process driven by reinforcement learning (RL) with a novel reward function designed to improve both task accuracy and explanation quality. We call our approach CogRE. Our framework addresses the lack of supervision for language-based explanations in traditional RE by promoting outputs that include important relation keywords. These keywords are drawn from a high-quality dictionary that is automatically constructed using an LLM. We evaluate our approach for the task of one-shot RE using two LLMs and two RE datasets. Our experiments show that CogRE improves explanation quality by addressing two common failure patterns in one-shot RE: poor attention focus and limited one-shot learning capability. For example, our cognitive-structured reasoning with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using our reward further improves performance by +23.46% (absolute). Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
中文摘要 本文介绍了一个关系提取（RE）框架，该框架提高了准确性和可解释性。该框架有两个关键组成部分：（i）一种推理机制，将关系提取表述为一系列受认知科学启发的文本处理步骤，以及（ii）由强化学习（RL）驱动的优化过程，具有旨在提高任务准确性和解释质量的新颖奖励函数。我们将我们的方法称为 CogRE。我们的框架通过促进包含重要关系关键字的输出，解决了传统 RE 中缺乏对基于语言的解释的监督问题。这些关键字是从使用 LLM 自动构建的高质量词典中提取的。我们使用两个 LLM 和两个 RE 数据集评估了我们的方法在单样本（one-shot）RE 任务上的表现。我们的实验表明，CogRE 通过解决单样本 RE 中的两种常见失败模式来提高解释质量：注意力集中度差和单样本学习能力有限。例如，我们在 One-shot NYT29 上使用 Qwen2.5-15B-Instruct 进行认知结构推理，达到了 24.65% 的 F1，超过了之前基于推理的设计。使用我们的奖励通过 RL 优化此方法，可进一步将性能提高 +23.46%（绝对值）。最后，人类评估表明，我们最好的模型生成了与黄金标签紧密结合的关系关键字，将人类解释质量评级提高了 54%（相对）。
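
The reward idea in the abstract above (improving task accuracy while promoting explanations that contain important relation keywords) can be illustrated with a toy Python function. Everything in it is an assumption for illustration: the additive form, the `keyword_weight` value, the whitespace tokenization, and the function name are not taken from the paper.

```python
def explanation_reward(is_correct, explanation, relation_keywords, keyword_weight=0.5):
    """Toy reward: task correctness plus a bonus for keyword coverage.

    is_correct: whether the predicted relation matches the gold label.
    explanation: the model's textual explanation.
    relation_keywords: keywords associated with the gold relation
        (in the paper these come from an LLM-built dictionary).
    keyword_weight: hypothetical trade-off between accuracy and explanation quality.
    """
    tokens = set(explanation.lower().split())
    wanted = {keyword.lower() for keyword in relation_keywords}
    # Fraction of the relation's keywords that appear in the explanation.
    coverage = len(tokens & wanted) / len(wanted) if wanted else 0.0
    return (1.0 if is_correct else 0.0) + keyword_weight * coverage


if __name__ == "__main__":
    print(explanation_reward(True, "Jobs founded Apple in 1976", {"founded", "founder"}))
```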

Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

分层GRPO：处理LLM搜索代理强化学习中的结构异质性

Authors: Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.06214
Pdf link: https://arxiv.org/pdf/2510.06214
Abstract Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias: an "apples-to-oranges" comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates inside each stratum, and retains the global unbiasedness and unit-variance properties enjoyed by standard normalization, resulting in a purer and more scale-stable learning signal. To improve practical stability under finite-sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
中文摘要 大型语言模型（LLM）代理越来越依赖搜索引擎等外部工具来解决复杂的多步骤问题，而强化学习（RL）已成为训练它们的关键范式。然而，搜索代理的轨迹在结构上是异质的，搜索调用的数量、位置和结果的变化导致了根本不同的回答方向和奖励分配。使用单一全球基线的标准政策梯度方法存在我们识别并形式化为跨层偏差的问题——异质轨迹的“苹果与橙子”比较。这种跨层偏差扭曲了学分分配，阻碍了对复杂、多步骤检索策略的探索。为了解决这个问题，我们提出了分层GRPO，其核心组件分层优势归一化（SAN）根据轨迹的结构特性划分为同质地层，并在每个层内本地计算优势。这确保了仅针对其真实对等方评估轨迹。我们的分析证明，SAN消除了跨层偏差，在每个层内产生有条件的无偏单位方差估计，并保留了标准归一化所享有的全局无偏和单位方差特性，从而产生更纯净和尺度稳定的学习信号。为了提高有限样本制度下的实际稳定性，我们进一步将SAN与全局估计器线性混合。对不同单跳和多跳问答基准的广泛实验表明，分层GRPO始终如一地大幅优于GRPO高达11.3分，实现了更高的训练奖励、更高的训练稳定性和更有效的搜索策略。这些结果确立了分层作为 LLM 搜索代理 RL 结构异质性的原则补救措施。
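
Stratified Advantage Normalization as described in the abstract above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: trajectories are stratified by a single hypothetical key (for example the number of search calls), the population standard deviation is used, and `eps` and the blending weight `alpha` are illustrative parameters.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize a list of rewards: subtract the mean, divide by the std."""
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / (spread + eps) for value in values]


def stratified_advantages(rewards, strata, eps=1e-8):
    """Normalize each reward only against trajectories in the same stratum.

    strata: one hashable label per trajectory, e.g. its number of search calls
    (a hypothetical choice of structural property).
    """
    members = defaultdict(list)
    for index, label in enumerate(strata):
        members[label].append(index)
    advantages = [0.0] * len(rewards)
    for indices in members.values():
        local = normalize([rewards[i] for i in indices], eps)
        for i, advantage in zip(indices, local):
            advantages[i] = advantage
    return advantages


def blended_advantages(rewards, strata, alpha=0.5, eps=1e-8):
    """Linear blend of the stratified and the global estimator."""
    local = stratified_advantages(rewards, strata, eps)
    overall = normalize(rewards, eps)
    return [alpha * a + (1.0 - alpha) * g for a, g in zip(local, overall)]


if __name__ == "__main__":
    rewards = [0.1, 0.3, 0.8, 1.0]
    search_calls = [0, 0, 2, 2]
    print(normalize(rewards))                           # global baseline
    print(stratified_advantages(rewards, search_calls))  # per-stratum baseline
```

In this toy batch the global baseline marks both zero-search trajectories as below average, whereas the stratified version ranks each trajectory only against others with the same number of search calls.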

TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

TaTToo：基于工具的思维 PRM，用于表格推理中测试时间缩放

Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.06217
Pdf link: https://arxiv.org/pdf/2510.06217
Abstract Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
中文摘要 过程奖励模型（PRM）最近成为增强大型推理模型（LRM）推理能力的强大框架，特别是在测试时间缩放（TTS）的背景下。然而，它们在表格推理领域监督 LRM 的潜力仍未得到充分探索。通过详细的实证分析，我们发现现有的 PRM 虽然被广泛用于监督纯文本推理步骤，但在表格特有的操作（例如子表检索和模式交互）方面存在困难，从而导致了关键的性能瓶颈。为了解决这一限制，我们提出了 TaTToo，这是一种新颖的基于表格的 PRM 框架，它（i）对表格推理步骤进行明确推理，以及（ii）集成基于工具的验证以提供精确的奖励监督。具体来说，我们首先设计了一个可扩展的数据管理管道，通过将表验证基本原理与基于工具的执行相结合，构建超过 60k 个高质量的步骤级标注。在收集到的数据基础上，我们使用双阶段范式训练 TaTToo：冷启动监督微调以捕获工具使用推理模式，然后通过基于工具的奖励塑造进行强化学习，使我们的模型与基于表的验证保持一致。我们对新设计的 PRM 所引起的策略改进进行全面评估。在涵盖数值推理、事实核查和数据分析的 5 个具有挑战性的表格推理基准中，TaTToo 在推理时将下游策略 LRM 提高了 30.9%，仅用 8B 参数就超过了 Qwen-2.5-Math-PRM-72B 等强大的 PRM 基线，并在各种 TTS 策略中表现出很强的通用性。

Keyword: diffusion policy

There is no result